Retail analytics is a fraught field. The premise is straightforward: enable brick-and-mortar stores to track their customers. The technology is straightforward, too: monitor broadcasts from shoppers’ smartphones. Privacy concerns have, however, put a damper on the nascent industry. Regulators, legislators, and advocacy groups have questioned the legitimacy of surreptitiously monitoring shoppers’ gadgets.
Last fall, Senator Schumer announced a grand bargain with retail analytics firms. They will be bound by a “Mobile Location Analytics Code of Conduct,” a set of voluntary practices intended to assuage privacy fears. The document has already been widely panned, both as a product of backroom dealing, and for providing little substantive protection to consumers.
One particular point of contention is how the industry proposes to preserve privacy through cryptography. This post explains the Code of Conduct’s crypto, and demonstrates how it can trivially be undone.
A brief explanation of retail analytics sets the stage. Your smartphone includes WiFi and Bluetooth chips, and those chips each have a unique serial number, called a MAC address. Periodically your phone will announce itself, including those MAC addresses. The most common approach to retail analytics simply logs these broadcasts and compiles a shopper’s activity. A retail analytics firm might, for example, build a database like this.
MAC Address | Locations |
aa:bb:cc:dd:ee:ff |
|
11:22:33:44:55:66 |
|
As a concession to privacy concerns, the Code of Conduct calls for a math fix. Before a MAC address is saved, it gets passed through a cryptographic hash function. The idea, without going into detail, is to produce a random-looking number from each MAC address. The result is a database like this.[1]
Hashed MAC Address | Locations |
317060aa70a5a9e846… |
|
5c6a981c81b9fcb030… |
|
It’s far from clear that hashing actually solves privacy problems here. If someone wants to learn the shopping history associated with a particular MAC address, they can simply apply the hash function, then look up the hash in the database.
$ echo -n "aa:bb:cc:dd:ee:ff" | openssl sha1
317060aa70a5a9e846...
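The same lookup can be sketched in Python. This is only an illustration: the database contents and the visit labels are made up, and the colon-separated lowercase format simply mirrors the shell command above. A real deployment might format addresses differently.

```python
import hashlib

def hash_mac(mac: str) -> str:
    """Return the hex SHA-1 digest of a MAC address string."""
    return hashlib.sha1(mac.encode("ascii")).hexdigest()

# Hypothetical database keyed by hashed MAC address; the visits are invented.
database = {
    hash_mac("aa:bb:cc:dd:ee:ff"): ["store A, Tuesday", "store B, Thursday"],
    hash_mac("11:22:33:44:55:66"): ["store C, Friday"],
}

def lookup(mac: str):
    """Hash a known MAC address and fetch its history, or None if absent."""
    return database.get(hash_mac(mac))

print(lookup("aa:bb:cc:dd:ee:ff"))
```

Anyone who knows the hash function and the address format can run this query; no secret is involved.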
Hashing is also no defense against re-identification. For example, if you know Alice went to Gap in Chicago on Tuesday morning, and the Apple Store in San Francisco on Thursday afternoon, it’s trivial to identify Alice’s smartphone in the database.
Hashed MAC Address | Locations |
317060aa70a5a9e846… |
|
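A minimal sketch of that re-identification step, assuming a hypothetical record layout in which each hashed identifier maps to a set of sightings. The identifiers, stores, and time slots below are invented for illustration.

```python
# Hypothetical records: hashed identifier -> set of (store, city, time slot).
records = {
    "hash_1": {("Gap", "Chicago", "Tue AM"),
               ("Apple Store", "San Francisco", "Thu PM")},
    "hash_2": {("Gap", "Chicago", "Tue AM"),
               ("Gap", "Chicago", "Wed PM")},
    "hash_3": {("Apple Store", "San Francisco", "Thu PM")},
}

def reidentify(records, known_sightings):
    """Return every hashed identifier whose history contains all known sightings."""
    known = set(known_sightings)
    return [h for h, sightings in records.items() if known <= sightings]

# What an observer already knows about Alice's week.
alice = [("Gap", "Chicago", "Tue AM"),
         ("Apple Store", "San Francisco", "Thu PM")]

print(reidentify(records, alice))
```

Two sightings are enough to single out one record here, and the hash never has to be reversed.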
There’s yet another problem here. The very purpose of this crypto is to prevent reversing a hash back into an unknown MAC address. Euclid, one of the better-known retail analytics firms, claims:
Hashed data cannot be reverse-engineered by a third party to reveal a device’s MAC address. This means that anyone who gains access to the database . . . would see only long strings of numbers and letters. They would be unable to get any information that could be linked to a back to a particular mobile device owner.
Challenge accepted. In under an hour, and for less than a dollar, I built a cloud system that reverses hashed MAC addresses.[2]
Some back-of-the-envelope math suggested the task was doable. There are 6 bytes in a MAC address; the first 3 bytes are allocated to the network device vendor, and the last 3 bytes are chosen by the vendor. In total, then, there are 2^48 possible MAC addresses. Since only 19,130 vendor prefixes have actually been allocated for use, however, there are at most 19,130 × 2^24 ≈ 2^38.22 validly assigned MAC addresses. That number might sound big, but modern consumer hardware can calculate roughly 2^30 hashes per second. In other words, it should be possible to check every validly assigned MAC address in just a few minutes (2^38.22 / 2^30 ≈ 300 seconds).
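The estimate is easy to check. The short script below just redoes the arithmetic from the paragraph above; the constant names are mine, and the hash rate is the rough figure quoted in the text rather than a measurement.

```python
import math

VENDOR_PREFIXES = 19_130        # allocated 3-byte vendor prefixes (from the text)
SUFFIXES_PER_PREFIX = 2 ** 24   # 3 vendor-chosen bytes
HASHES_PER_SECOND = 2 ** 30     # rough consumer-hardware rate (from the text)

valid_macs = VENDOR_PREFIXES * SUFFIXES_PER_PREFIX
bits = math.log2(valid_macs)
seconds = valid_macs / HASHES_PER_SECOND

print(f"{valid_macs:,} candidate addresses (about 2^{bits:.2f})")
print(f"about {seconds / 60:.1f} minutes at 2^30 hashes per second")
```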
Since I had just a puny notebook on hand, I rented a server with a graphics card in Amazon’s cloud. (Hashing involves parallelized math, so a graphics card gives a substantial performance boost.) Next, I installed oclHashcat, a fast hash-checking program. Writing format files for hashcat took just a few minutes.[3] (They’re available on GitHub for the curious.) Then, with no effort at optimization, I tossed in the hash of my smartphone’s MAC address. Reversing the hash took just 12 minutes. Total cost: $0.65, plus tax.
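For readers who want to see the idea without a GPU, here is a slow, CPU-only sketch of the same exhaustive search. It is not the oclHashcat setup described above. The function names, the prefixes, and the reduced 16-bit suffix space in the demo are all assumptions chosen so the example finishes quickly; addresses are formatted in lowercase hex without byte separators.

```python
import hashlib

def format_mac(value: int) -> str:
    """Format a 48-bit integer as lowercase hex without byte separators."""
    return f"{value:012x}"

def reverse_hash(target_hex, prefixes, suffix_bits=24):
    """Hash every address under the given 3-byte prefixes; return the match or None."""
    for prefix in prefixes:
        base = prefix << 24
        for suffix in range(1 << suffix_bits):
            candidate = format_mac(base | suffix)
            digest = hashlib.sha1(candidate.encode("ascii")).hexdigest()
            if digest == target_hex:
                return candidate
    return None

# Demo: a made-up address under a made-up prefix, with only 2^16 suffixes searched.
secret = format_mac((0xAABBCC << 24) | 0x00BEEF)
target = hashlib.sha1(secret.encode("ascii")).hexdigest()

print(reverse_hash(target, prefixes=[0x112233, 0xAABBCC], suffix_bits=16))
```

With the full 24-bit suffix space and the complete list of allocated prefixes, the loop is the same; only the running time changes, which is where the graphics card comes in.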
There’s plenty of room for improvement. A more sophisticated approach would be to prioritize MAC prefixes associated with smartphone vendors. It’s also trivial to compute hashes for all the valid MAC addresses and save them for quick lookup; a few consumer hard drives would provide sufficient storage.[4]
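A rough sketch of the precomputation idea, assuming the naive layout of one 6-byte address plus one 20-byte SHA-1 digest per entry. The `build_table` helper and the prefix are hypothetical, and the demo covers only 256 addresses. The direct product for the full table comes to roughly 8.3 × 10^12 bytes, in the same range as the figure in footnote 4.

```python
import hashlib

MAC_BYTES = 6
SHA1_BYTES = 20
VALID_MACS = 19_130 * 2 ** 24   # allocated prefixes times 3-byte suffixes

table_bytes = (MAC_BYTES + SHA1_BYTES) * VALID_MACS
print(f"full table: about {table_bytes / 1e12:.1f} TB")

def build_table(prefix: int, suffix_count: int) -> dict:
    """Map raw SHA-1 digests to raw MAC bytes for the first suffixes of one prefix."""
    table = {}
    for suffix in range(suffix_count):
        mac = ((prefix << 24) | suffix).to_bytes(MAC_BYTES, "big")
        text = mac.hex()  # lowercase hex, no byte separators
        table[hashlib.sha1(text.encode("ascii")).digest()] = mac
    return table

# Tiny in-memory stand-in for the on-disk table.
table = build_table(0xAABBCC, 256)
wanted = hashlib.sha1(b"aabbcc0000ff").digest()
print(table[wanted].hex())
```

Once the table exists, reversing a hash is a single lookup rather than a search.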
Before closing, let me add a quick note about salted hashing and hash-based message authentication codes (HMAC). Those techniques integrate extra information in the course of hashing, frustrating attacks that rely on precomputed hash values. They do not, however, protect against attacks that involve actually computing hashes. If an employee or intruder has access to hashed MAC addresses, they presumptively also have access to the extra information. Salting and HMAC are no solution here.
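To make the point concrete, here is a sketch using HMAC-SHA1 from Python's standard library. The key, the prefix, and the reduced 16-bit suffix space are all hypothetical; the assumption being illustrated is that whoever can read the hashed database can also read the key.

```python
import hashlib
import hmac

def keyed_hash(key: bytes, mac_text: str) -> str:
    """HMAC-SHA1 of a MAC address string, as a hex digest."""
    return hmac.new(key, mac_text.encode("ascii"), hashlib.sha1).hexdigest()

def reverse_keyed_hash(target_hex, key, prefix, suffix_bits=24):
    """With the key in hand, brute force works exactly as it does for a plain hash."""
    for suffix in range(1 << suffix_bits):
        candidate = f"{(prefix << 24) | suffix:012x}"
        if hmac.compare_digest(keyed_hash(key, candidate), target_hex):
            return candidate
    return None

key = b"hypothetical-secret-key"
target = keyed_hash(key, "aabbcc001234")

print(reverse_keyed_hash(target, key, prefix=0xAABBCC, suffix_bits=16))
```

The key defeats a precomputed table built without it, but it adds only a constant factor to a fresh search by someone who holds it.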
The problem underlying all this is a flawed assumption within the Code of Conduct. In cringe-inducing legalese, that document presumes retail analytics data can be simultaneously:
- “associated with a particular . . . device,” and
- not “reasonably . . . linked to an individual,” including their MAC address.
There is no such class of data, so long as retail analytics free rides on smartphone WiFi and Bluetooth.[5] Hashing is not a silver bullet for electronic privacy. As we have seen, it is possible to test retail analytics data against every possible device. If data is associated with a particular device, it is always linkable back to an individual.
1. Throughout this post, I use the SHA-1 cryptographic hash function. My understanding is that several retail analytics firms have deployed it in their products. The same analysis would apply for a different hash function, just with different output values, performance, and memory requirements.
2. After drafting this post, I came across a master’s degree research paper at INRIA on WiFi tracking and privacy. The author also concludes that reversing hashed MAC addresses is easy, and “hashing a MAC address is not a satisfactory solution” for location privacy.
3. I formatted MAC addresses in lowercase, without byte separators. It would be trivial to use a different format, or even multiple formats.
4. A naïve approach would be to store each valid MAC address and its SHA-1 hash in a database. Since each MAC address is 6 bytes, and each SHA-1 hash is 20 bytes, the pair is 26 bytes. There are 2^38.22 valid MAC addresses, as discussed above. The entire set is therefore 26 · 2^38.22 bytes ≈ 8.3 TB.
5. If shoppers ran store software on their smartphones, there would be viable privacy-preserving approaches to retail analytics. The appeal of WiFi and Bluetooth, of course, is that shoppers are automatically included.