Data Anonymization: Perspectives from a Former Skeptic


When I first started working with privacy-enhancing technologies, I lived and breathed homomorphic encryption. I had tremendous mistrust of anonymization and thought anyone who believed otherwise simply did not know the facts. Anonymization is “[t]he process in which identifiable data is altered so that it is no longer related back to a given individual,” yet headlines regularly call out public “anonymized” datasets for how easily they can be re-identified – headlines like the one from DarkDaily stating that “Researchers Easily Reidentify Patient Records with 95% Accuracy; Privacy Protection of Patient Test Records a Concern for Clinical Laboratories.”

Without more information, it’s only natural to believe Dr. Cynthia Dwork, known for her breakthrough research in differential privacy, when she says, “anonymized data isn’t.”

So anonymization doesn’t work, right? 

When I saw headlines like this and heard quotes like Cynthia Dwork’s “anonymized data isn’t,” I genuinely thought so too. As part of my PhD, I took this as an opportunity to dive deeper into the details. Three questions immediately came to mind:

  1. What were the causes of these re-identification attacks?
  2. What type of data had been re-identified?
  3. How many re-identified records were impacted within the datasets?

They Were Never Truly Anonymous in The First Place

“We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth.” 

Digging deeper, it turns out that the public datasets behind most of these headlines were never truly anonymous in the first place. Quasi-identifiers such as year of birth, age, sex, and approximate location were published without regard for population and dataset statistics.

For example, a 2016 study by Sweeney found a 42.8% risk of matching anesthesia records to the Texas Inpatient Public Use Data File. In an earlier and now-famous case, Sweeney re-identified the medical records of Massachusetts Governor William Weld by linking a supposedly de-identified hospital dataset with the public Cambridge voter list.

All it took was noting that only 6 voters in Cambridge shared the governor’s exact birthdate – 3 of whom were male – and that only 1 of them resided in his ZIP code.
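A linkage attack of this kind can be sketched in a few lines of Python. Everything below – the records, the names, the field values – is made up for illustration; a real attack joins on whichever quasi-identifiers the two datasets happen to share.

```python
# Hypothetical linkage attack: join a "de-identified" medical dataset to a
# public voter list on shared quasi-identifiers. All records are fabricated.

medical = [  # direct identifiers removed, quasi-identifiers retained
    {"birthdate": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "..."},
    {"birthdate": "1962-03-14", "sex": "F", "zip": "02139", "diagnosis": "..."},
]

voters = [  # public record: a name plus the same quasi-identifiers
    {"name": "W. Weld", "birthdate": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "J. Doe",  "birthdate": "1962-03-14", "sex": "F", "zip": "02139"},
]

QIS = ("birthdate", "sex", "zip")

def link(medical, voters, qis=QIS):
    """Return (name, medical record) pairs that match uniquely on the QIs."""
    matches = []
    for rec in medical:
        key = tuple(rec[q] for q in qis)
        candidates = [v for v in voters if tuple(v[q] for q in qis) == key]
        if len(candidates) == 1:  # a unique match is a re-identification
            matches.append((candidates[0]["name"], rec))
    return matches

for name, rec in link(medical, voters):
    print(name, "->", rec["diagnosis"])
```

Note that nothing here is “decrypted” or “de-anonymized” in any cryptographic sense; the join succeeds purely because the quasi-identifiers were published as-is.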

The Datasets Were Mislabeled as Anonymous

The truth is not that the datasets were fully anonymized and still easy to re-identify.

The truth is that they never deserved that designation in the first place.

This is partly a learning-curve problem: the field needed time to develop a firm grasp of what qualifies a dataset as truly anonymized. At first, it was believed that simply removing direct identifiers such as full names and social security numbers would do the trick.

However, it took some stumbling to discover exactly how quasi-identifiers affect the likelihood of re-identifying an individual within a dataset. 

Measured against existing standards, many reported re-identification attacks have had their severity and validity misinterpreted.

As confirmed in “A Systematic Review of Re-Identification Attacks on Health Data” by El Emam et al., only one re-identification attack has been successfully performed on truly anonymized data. Even then, that attack had a re-identification risk of 2/15,000 — well below HIPAA Safe Harbor’s acceptable re-identification risk threshold of 0.04%.
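The arithmetic is easy to verify: 2 re-identified records out of 15,000 is a risk of roughly 0.013%, about a third of the 0.04% threshold cited above.

```python
# Re-identification risk of the single successful attack on properly
# anonymized data, compared against the HIPAA Safe Harbor threshold.
reidentified = 2
total_records = 15_000
risk = reidentified / total_records   # ~0.000133, i.e. about 0.013%
threshold = 0.04 / 100                # 0.04% expressed as a fraction

print(f"risk = {risk:.4%}")
assert risk < threshold
```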

See for yourself in entry K below, which outlines a list of re-identification attacks over de-identified datasets. 


Image: Only two out of the 14 attacks were on datasets that were properly anonymized. Only one of those attacks (K) has the re-identification verified: 2 out of 15,000 records (Sources 57 and 58) were re-identified. Thank you to Professor Khaled El Emam for allowing me to reproduce this table.

Moving Forward: The Likelihood of Re-identification

Thanks to re-identification attacks by researchers and journalists, and to concepts like k-anonymity, l-diversity, and t-closeness, the privacy community finally began to understand anonymization under a different lens.

These concepts quantify the likelihood of re-identification based on an individual’s quasi-identifiers, and describe how those identifiers can be optimally aggregated to minimize re-identification risk.
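As a rough sketch of the k-anonymity idea (toy data and a hypothetical helper, not any particular library’s API): a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records, so the dataset’s k is the size of its smallest equivalence class.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy dataset: ages generalized to ranges, ZIP codes truncated to 3 digits --
# two common aggregations used to push k upward.
records = [
    {"age_range": "40-49", "zip3": "021", "sex": "M"},
    {"age_range": "40-49", "zip3": "021", "sex": "M"},
    {"age_range": "50-59", "zip3": "021", "sex": "F"},
    {"age_range": "50-59", "zip3": "021", "sex": "F"},
]

print(k_anonymity(records, ("age_range", "zip3", "sex")))  # prints 2
```

Any record in a class of size 1 (k = 1) is unique on its quasi-identifiers and therefore exposed to exactly the kind of linkage attack described earlier; l-diversity and t-closeness refine this by also constraining the sensitive values within each class.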

Although the term “anonymization” is often misused by journalists and companies alike, anonymization methods have been applied successfully to a number of clinical trial datasets for the purpose of sharing data for research.

All-in-all, anonymization is far from easy. It takes years of expertise and experience to even properly conceptualize how quasi-identifiers can affect re-identification risk.

Join Private AI for more discussions on data anonymization on LinkedIn, Twitter, and YouTube.
