Last week the Federal Trade Commission published a blog post about how hashing data does not, in fact, make the data anonymous. This was news to many people who don't deal with security or privacy law regularly, and that's understandable: few people take the time to really develop a strong understanding of cryptographic concepts1. It's made more complex by the fact that the legal definition of anonymous and what most people consider anonymous are not the same.
The Legal Definition
Laws vary – but here in the United States most companies attempt to align with the California Consumer Privacy Act. That law doesn't refer to anonymous data, but it does define 'deidentified' data, which reads as follows:
(m) “Deidentified” means information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer provided that the business that possesses the information:
(1) Takes reasonable measures to ensure that the information cannot be associated with a consumer or household.
(2) Publicly commits to maintain and use the information in deidentified form and not to attempt to reidentify the information, except that the business may attempt to reidentify the information solely for the purpose of determining whether its deidentification processes satisfy the requirements of this subdivision.
(3) Contractually obligates any recipients of the information to comply with all provisions of this subdivision.
(Source: https://cppa.ca.gov/regulations/pdf/cppa_act.pdf)
This is the type of definition the Federal Trade Commission is talking about. The law is clear that the information can't reasonably be used to infer information about a specific consumer. It's in this way that the law conflicts with the realities of hashing. Hold on to this understanding, because we'll loop back around to it.
The Technical Reality of Hashing
At a high level, a hashing algorithm is a one-way function that, for a specific input, returns a specific message digest. So provided you have the same input and know the specific hashing algorithm used, you can recreate the same digest. The process is deterministic in this regard, and while it does obscure the source data (the hash is one way), it's not anonymous. When a specific algorithm is found to have collisions (in that two different inputs can return the same digest), the algorithm typically gets retired in favor of a more secure replacement.
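A minimal sketch of that determinism in Python, using the standard library's hashlib (the email addresses here are just illustrations):

```python
import hashlib

# Hashing the same input always yields the same digest.
digest_a = hashlib.sha256(b"alice@example.com").hexdigest()
digest_b = hashlib.sha256(b"alice@example.com").hexdigest()
print(digest_a == digest_b)  # True: deterministic, so digests match

# A different input yields a completely different digest.
digest_c = hashlib.sha256(b"bob@example.com").hexdigest()
print(digest_a == digest_c)  # False
```

Anyone with the same input and algorithm produces the same digest, which is exactly the property that makes hashed data linkable rather than anonymous.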
Some examples of common use cases:
Validating Files
You can use a specific hashing algorithm, such as SHA-2, on a specific piece of data (such as a software executable file) to create a hash. That hash can then be compared to the hash published on the software publisher's website to determine whether the software is genuine or has been modified since it was published (and is therefore untrusted).
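A short sketch of that workflow in Python; the file name and "published" digest here are made up to keep the example self-contained:

```python
import hashlib

def file_sha256(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a downloaded file and the digest its publisher announced.
with open("installer.bin", "wb") as f:
    f.write(b"example executable bytes")
published_digest = hashlib.sha256(b"example executable bytes").hexdigest()

# The check passes only if the bytes are unmodified since publication.
print(file_sha256("installer.bin") == published_digest)  # True
```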
Digital Signatures
Digital signatures are used to verify the authenticity of digital messages and documents. They rely on public-key (asymmetric) cryptography to provide assurance both that the message is genuine and that it is non-repudiable (the signer can't claim they didn't see it or sign it). If you've ever signed a contract online, this process is how that becomes legally binding.
Marketing Attribution
Lately, many vendors such as Facebook have been asking companies to provide hashed information for attribution purposes (in Facebook's case, for the Conversions API). Facebook requests that the information be hashed with SHA-256 before transmission, but as you can see from the documentation, each specific input, once run through SHA-256, yields an expected output. In this way the data is obscured, but not anonymous, because it is linkable back to the specific input.
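A sketch of how that matching works, assuming (per Meta's published guidance) that email addresses are trimmed and lowercased before hashing; the addresses themselves are illustrative:

```python
import hashlib

def hash_email(email: str) -> str:
    # Normalize first (trim whitespace, lowercase), then hash.
    # Normalization ensures both parties derive the identical digest.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The advertiser and the ad platform hash independently, yet the
# digests still match, so the records are linkable, not anonymous.
print(hash_email("  Alice@Example.COM ") == hash_email("alice@example.com"))  # True
```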
Problems with Hashing
Hashing has a few specific problems in the context of anonymization – but let’s look at one of the most common.
Rainbow Tables
A rainbow table is a table of precomputed hash values that acts as a lookup index. Since the hash of a given input is deterministic, you can precompute the entire key space and then run a basic SQL query with a WHERE clause on the digest to return the original value that produced it. The cryptography isn't broken, but because the hashes match, identification of the original source is possible. If that source matches a user, then the user is identifiable. With on-demand cloud services, building a rainbow table and storing it (they are rather large) is simply a matter of being willing to pay for the compute and storage.
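A toy illustration of the idea in Python, with an in-memory dictionary standing in for the SQL lookup and a tiny key space of four-digit PINs instead of, say, phone numbers or email addresses:

```python
import hashlib

# Precompute digests for the whole (tiny) key space: 10,000 PINs.
# A real attacker covers phone numbers, emails, etc. at cloud scale.
table = {
    hashlib.sha256(f"{pin:04d}".encode()).hexdigest(): f"{pin:04d}"
    for pin in range(10000)
}

# Given only a digest, a simple lookup (standing in for the SQL
# WHERE clause) recovers the original input.
digest = hashlib.sha256(b"4831").hexdigest()
print(table[digest])  # 4831
```

Nothing cryptographic is defeated here; the digest simply matches a precomputed entry, which is all an attacker needs to reidentify the source value.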
Defense against such attacks is possible via the use of cryptographic salts and peppers, but as you can see from the Facebook documentation referenced earlier, these are not used for the Conversions API, because Facebook wants to match the information against its own records to ensure attribution for Facebook Ads. Google does the same thing for its advertising platform with its Enhanced Conversions feature.
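A brief sketch of why salting helps, using Python's standard library; note that salting is also exactly what would break the cross-party matching the ad platforms rely on:

```python
import hashlib
import secrets

value = b"alice@example.com"  # illustrative input

# A random salt mixed into the input defeats precomputed tables:
# an attacker would need a separate rainbow table per salt value.
salt = secrets.token_bytes(16)
salted = hashlib.sha256(salt + value).hexdigest()
unsalted = hashlib.sha256(value).hexdigest()

# The salted digest no longer matches any unsalted precomputation,
# but it also no longer matches another party's hash of the same value.
print(salted == unsalted)  # False
```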
Issues with True Anonymization
Legally, true anonymity is a high bar to reach and may be impractical for specific use cases.
You anonymize a dataset by transforming it in key ways, often with statistical techniques such as k-anonymity, l-diversity, and t-closeness, or by injecting noise as in differential privacy. This process is very much like the way the Census Bureau in the United States seeks to protect those who take part in the census. Typically, the more anonymized a dataset becomes, the harder it becomes to obtain insight or derive value from it.
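To make one of those techniques concrete, here is a toy Python sketch of k-anonymity: quasi-identifiers (here, made-up ages and ZIP codes) are generalized until every record shares its generalized values with at least k-1 others:

```python
from collections import Counter

# Toy records of (age, ZIP code): both are quasi-identifiers that
# could be combined with outside data to single someone out.
records = [(34, "94107"), (36, "94109"), (35, "94103"),
           (52, "30301"), (57, "30305"), (51, "30309")]

def generalize(age, zip_code):
    # Coarsen each quasi-identifier: age to a decade, ZIP to 3 digits.
    return (f"{age // 10 * 10}s", zip_code[:3] + "**")

# The dataset is k-anonymous for the smallest resulting group size.
groups = Counter(generalize(a, z) for a, z in records)
k = min(groups.values())
print(k)  # 3: every generalized record is shared by at least 3 people
```

Coarser generalization raises k (stronger anonymity) but throws away detail, which is the utility trade-off described above.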
Various attacks do exist to reconstruct a dataset, in whole or in part, from information left in the set, so ensuring a dataset remains anonymous is a moving target. It's important to remember that just because you can't reverse engineer a dataset doesn't mean no one can. Also noteworthy: if a dataset can be reversed, then it's not anonymous and becomes subject to different legal requirements.
The Legal Issue
Now that we have a grounding in hashing, knowing that it's deterministic and understanding how a hashed digest's original value can be determined, we can see that the marketing attribution use case can't meet the threshold laid out in the CCPA for deidentified data. Attribution works on the premise of being able to reidentify the user through a lookup table on the ad publisher's platform in order to determine return on ad spend and related metrics.
This does mean, however, that the data may instead meet the definition of pseudonymization used by various laws. Pseudonymized data is typically still required to be disclosed to the user upon request and is still subject to technical and organizational measures to protect it. Under Europe's General Data Protection Regulation, you are still required to have a lawful basis, such as consent, prior to collection. This also may impact the specific contractual elements of a data processing agreement or privacy notice if you were referring to hashed data as anonymous.
1: If you want to develop an in-depth understanding of cryptography, I recommend this book.