173 Million Taxi Records Lost in Massive Location Data Heist

New York City officials inadvertently exposed the details of 173 million taxi trips within the city in a public data release that was later analyzed by a software developer named Vijay Pandurangan.

Photo: Ivakoleva / Shutterstock

City officials released the information in response to a public records request and specifically obscured the drivers’ hack license numbers and the cabs’ medallion numbers. Rather than including those numbers in plaintext, the 20-gigabyte file contained one-way cryptographic hashes of them, computed with the MD5 algorithm.

Since they’re one-way hashes, they supposedly can’t be converted back into their original values. Presumably, officials used the hashes to preserve the privacy of individual drivers since the records provide a detailed view of their locations and work performance over an extended period of time.

Unbeknownst to city officials, if the 173 million records were exploited correctly, they could easily reveal the exact locations and medallion numbers of registered drivers. This would give whoever cracks the code unfettered access to the full map of routes and schedules for every taxi service in New York, potentially putting the privacy and security of millions of passengers and their drivers at significant risk.

The public records request seemed innocent enough at first, which is why the city was confident enough to release the information as long as the identifying numbers were hidden behind MD5 hashes.

However, it turns out there’s a significant flaw in the approach. Because both the medallion and hack license numbers are structured in predictable patterns, it was trivial to run every possible value through the same MD5 algorithm and then compare the output to the hashes contained in the nearly 20GB file.
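The attack described above can be sketched in a few lines of Python. This is only an illustration, not Pandurangan’s actual code: it assumes the identifiers are zero-padded numeric strings of a fixed width and that each was hashed as plain ASCII text with nothing added, and the names `md5_hex` and `build_lookup` are made up for this example.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the MD5 digest of an ASCII string as lowercase hex."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup(widths=(6, 7)):
    """Hash every numeric code of the given widths and map hash -> original.

    Assumption: codes are digits only, zero-padded to the given width.
    """
    table = {}
    for width in widths:
        for n in range(10 ** width):
            candidate = str(n).zfill(width)
            table[md5_hex(candidate)] = candidate
    return table


if __name__ == "__main__":
    # Build the table for six-digit codes only, to keep the demo quick.
    table = build_lookup(widths=(6,))
    published_hash = md5_hex("012345")  # stands in for a value from the file
    print(table.get(published_hash))    # recovers the original code
```

Once the table exists, reversing any hash in the released file is a single dictionary lookup, which is why a one-way function offers no protection when the set of possible inputs is this small.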

“Security researchers have been warning for a while that simply using hash functions is an ineffective way to anonymize data,” Pandurangan wrote in a post published over the weekend on Medium. “In this case, it’s substantially worse because of the structured format of the input data. This anonymization is so poor that anyone could, with less than two hours work, figure which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers’ gross income or infer where they live.”

In this case, the main weakness wasn’t MD5 itself, but rather the format of the taxi license numbers, which made the hashes so simple to reverse. Because the numbers always follow one of a few fixed layouts, such as a six- or seven-digit code, all an attacker had to do was hash every possible number, match the results against the hashed values in the file, and cross-check the recovered numbers against publicly available records online.
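To see why the input format matters more than the choice of hash function, it helps to count the candidates. The sketch below is a rough estimate that assumes the identifiers are purely numeric codes of six or seven digits; real medallion numbers may mix letters and digits, which would change the totals but not the conclusion. The function name `keyspace` is hypothetical.

```python
def keyspace(widths=(6, 7), alphabet_size=10):
    """Count every possible code for the given widths and alphabet size."""
    return sum(alphabet_size ** width for width in widths)


candidates = keyspace()   # 10**6 + 10**7
md5_outputs = 2 ** 128    # number of distinct 128-bit MD5 digests

print(candidates)         # 11000000
print(md5_outputs)        # 340282366920938463463374607431768211456
```

Under these assumptions an attacker needs only about 11 million hash computations to cover every possible input, a vanishingly small fraction of the space of possible MD5 outputs.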

Through this process Pandurangan was able to de-anonymize the entire file in just under two hours. Thankfully, he only went through the process as a proof of concept, to show just how simple it was to extract the information from data obtained through standard bureaucratic channels.

“The cat is already out of the bag in this case,” Pandurangan wrote, “but hopefully in the future, agencies will think carefully about the method they use to anonymize data before releasing it to the public.”