Data Anonymization: Perspectives from a Former Skeptic

When I first started working with privacy enhancing technologies, I lived and breathed homomorphic encryption. I had tremendous mistrust of anonymization and thought anyone who believed otherwise simply did not know the facts. Although anonymization is defined as “[t]he process in which identifiable data is altered so that it is no longer related back to a given individual”, many headlines call out public “anonymized” datasets for how easily they can be re-identified. One example from DarkDaily reads: “Researchers Easily Reidentify Patient Records with 95% Accuracy; Privacy Protection of Patient Test Records a Concern for Clinical Laboratories.”

Without more information, it’s only natural to believe Dr. Cynthia Dwork, known for her breakthrough research in differential privacy, when she says “anonymized data isn’t”.

So anonymization doesn’t work, right? 

When I saw headlines like this and heard quotes like Cynthia Dwork’s “anonymized data isn’t”, I genuinely thought so too. As part of my PhD, I took the opportunity to dive deeper into the details. Three questions immediately came to mind:

  1. What were the causes of these re-identification attacks?
  2. What type of data had been re-identified?
  3. How many re-identified records were impacted within the datasets?

They Were Never Truly Anonymous in The First Place

“We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth.” 

Digging deeper, it turns out that the public datasets mentioned by most of these headlines were never truly anonymous in the first place. Quasi-identifiers such as year of birth, age, sex, and approximate location were published without regard for population and dataset statistics.

For example, a study conducted in 2016 found a 42.8% risk of matching anesthesia records to the Texas Inpatient Public Use Data File. In a separate, widely cited case, Sweeney was able to uniquely identify the medical records of Governor Weld by linking a released medical dataset to the Cambridge voter list.

All it took was noting that only 6 voters in Cambridge had the exact same birthdate as the governor, 3 of whom were male, and that only 1 of them resided in the same ZIP code as him.
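
The narrowing described above can be sketched as a chain of filters over a voter list. This is a toy illustration, not Sweeney's actual method or data: the `Voter` record, the `narrow` helper, the placeholder birthdate, and every value in `voters` are made up, and they are arranged only so that the counts follow the 6, 3, 1 sequence from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    """Hypothetical voter-list row holding only quasi-identifiers."""
    voter_id: int
    birthdate: str
    sex: str
    zip_code: str


def narrow(voters, **known):
    """Keep only the voters whose fields match every known attribute."""
    return [
        v for v in voters
        if all(getattr(v, field) == value for field, value in known.items())
    ]


# Made-up data: the birthdate and ZIP codes are placeholders, not real values.
TARGET_DOB = "1950-01-01"
voters = [
    Voter(1, TARGET_DOB, "M", "00001"),
    Voter(2, TARGET_DOB, "M", "00002"),
    Voter(3, TARGET_DOB, "M", "00003"),
    Voter(4, TARGET_DOB, "F", "00001"),
    Voter(5, TARGET_DOB, "F", "00002"),
    Voter(6, TARGET_DOB, "F", "00003"),
    Voter(7, "1962-03-14", "M", "00001"),
    Voter(8, "1971-11-30", "F", "00002"),
]

# Each additional quasi-identifier shrinks the set of candidates.
same_dob = narrow(voters, birthdate=TARGET_DOB)
same_dob_sex = narrow(same_dob, sex="M")
same_all = narrow(same_dob_sex, zip_code="00001")

counts = [len(same_dob), len(same_dob_sex), len(same_all)]
print(counts)
```

No single field here identifies anyone. It is the combination of birthdate, sex, and ZIP code that leaves exactly one candidate.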

The Datasets Were Mislabeled as Anonymous

The truth is not that the datasets were fully anonymized and that it was easy to re-identify them. 

The truth is that they never deserved that designation in the first place.

This is partly due to the learning curve involved in understanding what qualifies a dataset as truly anonymized. At first, it was believed that simply removing direct identifiers such as full names and social security numbers would do the trick.

However, it took some stumbling to discover exactly how quasi-identifiers affect the likelihood of re-identifying an individual within a dataset. 

Judged against existing standards, the severity and validity of many re-identification attacks have been misinterpreted.

As confirmed in “A Systematic Review of Re-Identification Attacks on Health Data” by El Emam et al., only one re-identification attack has been successfully performed on truly anonymized data. Even then, that attack had a re-identification risk of 2 in 15,000 (about 0.013%), well below HIPAA Safe Harbor’s acceptable risk threshold for re-identification of 0.04%.

See for yourself in entry K below, which outlines a list of re-identification attacks over de-identified datasets. 

Image: Only two of the 14 attacks were on datasets that were properly anonymized, and only one of those attacks (K) had its re-identification verified: 2 out of 15,000 records were re-identified (Sources 57 and 58). Thank you to Professor Khaled El Emam for allowing me to reproduce this table.

Moving Forward…the Likelihood of Re-identification

Thanks to the re-identification attacks carried out by researchers and journalists, and to concepts like k-anonymity, l-diversity, and t-closeness, the privacy community finally began to understand anonymization through a different lens.

These concepts take into account the likelihood of re-identification based on an individual’s quasi-identifiers, and how those quasi-identifiers can be optimally aggregated to minimize re-identification risk.
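
As a rough illustration of the first of these concepts, the sketch below computes k (the size of the smallest group of records sharing the same quasi-identifier values) before and after aggregating the quasi-identifiers. Everything in it is an assumption made for the example: the records are invented, the helper names (`equivalence_classes`, `k_anonymity`, `generalize`) are hypothetical, and the 10-year age bands and 3-digit ZIP prefixes are just one possible generalization, not a recommended or optimal one. It also reads 1/k as a worst-case risk for an attacker who already knows the target is in the dataset, which is only one of several ways to model risk.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return k, the size of the smallest equivalence class."""
    return min(equivalence_classes(records, quasi_identifiers).values())


def generalize(record):
    """Coarsen quasi-identifiers: 10-year age bands and a 3-digit ZIP prefix."""
    low = (record["age"] // 10) * 10
    return {
        **record,
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "**",
    }


# Invented records; "diagnosis" stands in for the sensitive attribute.
records = [
    {"age": 34, "sex": "F", "zip": "00011", "diagnosis": "A"},
    {"age": 36, "sex": "F", "zip": "00012", "diagnosis": "B"},
    {"age": 38, "sex": "F", "zip": "00019", "diagnosis": "A"},
    {"age": 41, "sex": "M", "zip": "00021", "diagnosis": "C"},
    {"age": 47, "sex": "M", "zip": "00025", "diagnosis": "B"},
    {"age": 49, "sex": "M", "zip": "00027", "diagnosis": "C"},
]
QUASI_IDENTIFIERS = ["age", "sex", "zip"]

k_raw = k_anonymity(records, QUASI_IDENTIFIERS)
k_generalized = k_anonymity([generalize(r) for r in records], QUASI_IDENTIFIERS)

# Worst-case chance of singling out one record within its equivalence class.
max_risk_raw = 1 / k_raw
max_risk_generalized = 1 / k_generalized

print(k_raw, k_generalized)
```

With the raw values every record is unique on its quasi-identifiers. After aggregation each record is indistinguishable from two others, at the cost of coarser data. This sketch checks k-anonymity only. l-diversity and t-closeness additionally examine how the sensitive values are distributed within each group.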

Although the term anonymization is often misused by journalists and companies alike, anonymization methods have been used successfully on a number of clinical trial datasets for the purpose of sharing data for research.

All in all, anonymization is far from easy. It takes years of expertise and experience to even properly conceptualize how quasi-identifiers can affect re-identification risk.
