Joel Reardon, Ken Bamberger, and Serge Egelman

While legal scholars have cited decades of computer science research demonstrating why anonymity is a hard problem (and why datasets should not be cavalierly labelled “anonymous”), industry and legal practitioners have not heeded those warnings: many organizations trafficking in consumer data continue to assert to customers, courts, and regulators that their data is anonymous or “deidentified.” We acquired datasets from multiple data brokers to demonstrate empirically why these assertions are false. Using publicly available email addresses found in data breaches posted on the Internet, we show that one can trivially reidentify 88% of the hashed email addresses that we obtained.
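The matching step described above can be illustrated with a minimal sketch. It assumes that the broker data contains unsalted MD5, SHA-1, or SHA-256 digests of trimmed, lowercased addresses; the hash functions and normalization are not specified here, so these are assumptions, as are the function names and the example addresses. The sketch is a plain dictionary lookup against a list of known addresses, not a true rainbow table (which trades storage for computation using hash chains).

```python
import hashlib


def normalize(email: str) -> str:
    # Assumption: addresses are trimmed and lowercased before hashing.
    return email.strip().lower()


def hash_email(email: str, algorithm: str = "sha256") -> str:
    # Unsalted hex digest of the normalized address.
    return hashlib.new(algorithm, normalize(email).encode("utf-8")).hexdigest()


def reidentify(hashed_emails, candidate_emails,
               algorithms=("md5", "sha1", "sha256")):
    """Map each hashed address to a plaintext candidate that produces it.

    hashed_emails: hex digests taken from a (hypothetical) broker dataset.
    candidate_emails: known plaintext addresses, e.g. from a public list.
    """
    targets = {h.lower() for h in hashed_emails}
    recovered = {}
    for email in candidate_emails:
        for algorithm in algorithms:
            digest = hash_email(email, algorithm)
            if digest in targets:
                recovered[digest] = normalize(email)
    return recovered


if __name__ == "__main__":
    # Hypothetical data: two hashes with known preimages, one without.
    broker_hashes = [
        hash_email("alice@example.com", "sha256"),
        hash_email("bob@example.org", "md5"),
        hash_email("unknown@example.net", "sha256"),
    ]
    candidates = ["Alice@Example.com ", "bob@example.org", "carol@example.com"]
    matches = reidentify(broker_hashes, candidates)
    print(f"reidentified {len(matches)} of {len(broker_hashes)} hashes")
```

Because the hash is deterministic and unsalted under these assumptions, anyone holding a list of candidate addresses can run the same lookup; the cost is one hash per candidate per algorithm.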

Reidentifying hashed email addresses need not rely on illicit data: by constructing rainbow tables, we reidentified a majority of the hashed email addresses. In all cases, the hashed email addresses were linked to other device-based identifiers (e.g., mobile advertising IDs and IP addresses), demonstrating why device-based identifiers have long been considered personally identifiable information. Relatedly, organizations trafficking in this data make another assertion: that the data was collected from consumers with their consent. To evaluate this claim, we emailed the reidentified individuals in our datasets to recruit them for a survey (n = 369). The survey asked participants whether they recalled having provided consent (99.1% had no such recollection) and whether they would prefer that the data brokers delete their data (94.2% said they would prefer that their email address not be sold, while 76.4% said they planned to submit deletion requests). Overall, our study shows that hashed email addresses and device identifiers do not come close to meeting commonly understood definitions of “anonymous” or “deidentified,” and that any notion of “consent” must also involve a similarly tortured definition. We argue that this industry and its defenders are not simply misinformed or indifferent to the veracity of their statements, but that this is an example of Plato’s “noble lie”: their entire social order relies on these demonstrably untrue statements being true.