Anonymized Data is Useless: Fact or Fiction


“When is anonymization useful?” is a tricky question, because the answer is highly data-type- and task-dependent. Anonymized datasets are being used for academic research, industrial research, and real-world products in numerous areas, with clinical research often at the vanguard due to the high level of sensitivity and utility of the data. A 2016 NIST presentation mentions several other use cases in which anonymized data are useful, including:

– Improving driving solutions for directions and traffic data.

– Pothole alerts.

– Releasing educational records.

– Voluntary safety reports submitted to the Federal Aviation Administration.

While there have been years of research on proper methods for structured data anonymization (especially in the medical domain), research on unstructured data anonymization is just starting to ramp up. In this post we’ll dive into the research happening in the speech, image/video, and text anonymization spaces.

For speech, anonymization means:

  1. Making a speaker’s voice unrecognizable (e.g., using the methodology proposed in Speaker Anonymization Using X-vector and Neural Waveform Models) and
  2. Removing direct and quasi-identifiers from speech by either bleeping them out or replacing them (i.e., pseudonymization).

A quick reminder, in case you haven’t read “Demystifying De-identification” or “Data Anonymization: Perspectives from a Former Skeptic”: direct identifiers are entities that identify an individual on their own (full name, exact location, social security number, etc.), while quasi-identifiers are entities that can identify an individual with high likelihood when combined with one another (age, approximate location, spoken languages, etc.).
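To make the pseudonymization step concrete, here is a minimal Python sketch that replaces detected identifiers in a transcript with placeholder tags. It leans on spaCy’s off-the-shelf named entity recognizer as a stand-in detector; the model, entity labels, and example transcript are illustrative assumptions, not a production de-identification pipeline.

    # Minimal pseudonymization sketch: swap detected entities for placeholder tags.
    # Assumes spaCy and the "en_core_web_sm" model are installed; the label set is
    # an illustrative assumption, not an exhaustive list of identifiers.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    IDENTIFIER_LABELS = {"PERSON", "GPE", "LOC", "ORG", "DATE", "TIME"}

    def pseudonymize(transcript: str) -> str:
        doc = nlp(transcript)
        redacted = transcript
        # Replace from the end of the string so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            if ent.label_ in IDENTIFIER_LABELS:
                redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
        return redacted

    print(pseudonymize("Hi, this is John Smith calling from Toronto about Friday's appointment."))
    # e.g. "Hi, this is [PERSON] calling from [GPE] about [DATE]'s appointment."

A real pipeline would also need to catch quasi-identifiers and cope with transcription errors, which is where most of the difficulty lies.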

If speech technology and privacy is your thing, take a look at the VoicePrivacy initiative and at the ISCA Special Interest Group on Security and Privacy in Speech Communication, which brings together professionals from a variety of backgrounds (from signal processing to law) to discuss privacy in speech technologies.

Dealing with images & video

Anonymization in images and video is a complicated task, given the variety of identifiable information they can contain. While fully and properly blurring out whole human bodies might do the trick for certain constrained use cases, re-identification risk can still come from name tags on backpacks, distinctive lunchboxes, a house in the background, etc. Nevertheless, anonymization for these media has often just meant removing or replacing faces (see, for instance, CIAGAN: Conditional Identity Anonymization Generative Adversarial Networks), which limits the protection to one part of the body rather than reducing re-identification risk to almost zero. This is a start, but considering that companies like Palantir Technologies can recognize people by their tattoos, removing or replacing a part of the body can often only really be called redaction, not anonymization.
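As a concrete illustration of why face-level processing is redaction rather than anonymization, here is a minimal OpenCV sketch that blurs detected faces and leaves everything else (tattoos, name tags, backgrounds) untouched. The input file name and detector parameters are assumptions for illustration; approaches like CIAGAN instead synthesize replacement faces rather than blurring.

    # Minimal face-redaction sketch using OpenCV's bundled Haar cascade.
    # "street_scene.jpg" is a hypothetical input image.
    import cv2

    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("street_scene.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        # Blur each detected face region in place; everything outside these boxes
        # is untouched, which is exactly why this is redaction, not anonymization.
        image[y:y+h, x:x+w] = cv2.GaussianBlur(image[y:y+h, x:x+w], (51, 51), 0)

    cv2.imwrite("street_scene_redacted.jpg", image)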

That said, there are numerous machine learning tasks that make use of images and video without personal data, or in which personal data is superfluous and could be removed or replaced without detriment to the task.

Just take the example provided in this vehicle counting GitHub repo.

Image source via GitHub (MIT license)

It’s clear that neither license plates nor people’s faces play a role in the task. And if we’re concerned about unique car colours being too telling, even a black-and-white video would do nicely, as would counting the vehicles at the edge (e.g., directly on the camera, before the data hits any servers).
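As a rough sketch of that idea, both the grayscale conversion and the counting can happen on the device itself, so only an aggregate number ever leaves the camera. The video file name and the counting function below are hypothetical placeholders rather than code from the repo linked above.

    # Edge-processing sketch: drop colour information and count on-device,
    # so no raw footage reaches a server. "traffic_camera.mp4" and
    # count_vehicles() are hypothetical placeholders.
    import cv2

    capture = cv2.VideoCapture("traffic_camera.mp4")  # or a live camera index, e.g. 0

    total = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # total += count_vehicles(gray)  # stand-in for whatever detector runs on-device

    capture.release()
    print("Vehicles counted:", total)  # only this aggregate would be shared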

Take this other image as an example:

Image source via Unsplash

What can you detect in the image?

  • Number of houses
  • Type of farmland
  • Relief
  • Weather

There is a lot of information available about the terrain and there are plenty of similar images that can be used for determining ecosystem health, whether there are weeds growing in a crop, etc. Not to mention that lots of anonymous video feeds can be used as a partial training set for self-driving cars.

Considering text anonymization

There has been some initial research on re-identification risk scores for text, including our work in the Journal of Data Protection and Privacy titled Reasoning about unstructured data de-identification (contact us if you have a hard time accessing the paper). While proper anonymization for the purposes of data publication still requires an expert to go over the data and calculate the risk of re-identification, automatically redacting text has a huge role to play in improving data security through data minimization (i.e., reducing the amount of personal data you collect to just the essentials). Note that there have been tests of the effectiveness of statistical and rule-based systems at automatically de-identifying medical text corpora (three of these studies are summarized here). These tests need to be redone to account for the vast improvements in statistical natural language processing systems over the past three years.
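As a toy illustration of the data-minimization point (and emphatically not of proper anonymization), even a handful of regular expressions can strip the most obvious direct identifiers before text is ever stored. The patterns below are illustrative assumptions and would miss far more than they catch in real data.

    # Toy data-minimization sketch: remove a few obvious direct identifiers
    # with regular expressions before storing the text. The patterns are
    # illustrative assumptions, not a complete or production-grade set.
    import re

    PATTERNS = {
        "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "[PHONE]": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
        "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def minimize(text: str) -> str:
        for placeholder, pattern in PATTERNS.items():
            text = pattern.sub(placeholder, text)
        return text

    print(minimize("Reach me at jane.doe@example.com or 416-555-0199."))
    # -> "Reach me at [EMAIL] or [PHONE]."

Automatic redaction like this reduces how much personal data is held in the first place, but publishing the result as “anonymized” still calls for the expert risk assessment described above.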

Anecdotally, let me give you a quick example of how much information a single anonymized email can carry:

“Hi [NAME],

Apologies, it had ended up in my spam!

I’m booked at [TIME] tomorrow, but [TIME] would work. I’ll send an updated invite for that time. Please let me know if that doesn’t work for you.

Thank you,

[NAME]”

Any idea who wrote that? It’s impossible to know unless you were the recipient or author.

But what useful information can you gather from this email?

– A call is being rescheduled to tomorrow

– The sender is polite (says please and thank you)

– The recipient’s previous email ended up in a spam folder when it shouldn’t have!

What can, say, an email service provider use this information for? Well, it would be great if they could make sure that this recipient’s emails never end up in a spam folder again.
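As a toy sketch of how such a signal could be pulled out of already-redacted text, even a simple phrase match is enough to flag this kind of complaint. The phrase list is an illustrative assumption; a real provider would use a trained classifier over far more signals.

    # Toy sketch: flag redacted emails that complain about legitimate mail
    # landing in spam. The phrase list is an illustrative assumption.
    SPAM_COMPLAINT_PHRASES = (
        "ended up in my spam",
        "went to my spam folder",
        "found it in spam",
    )

    def flags_spam_misclassification(redacted_email: str) -> bool:
        text = redacted_email.lower()
        return any(phrase in text for phrase in SPAM_COMPLAINT_PHRASES)

    email = "Hi [NAME],\n\nApologies, it had ended up in my spam!"
    print(flags_spam_misclassification(email))  # True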

There are many more examples like this, from identifying how a person feels about a particular product, to determining which topics were covered in a conversation, to gauging a consumer’s sentiment over a chat or phone call.

Anonymized data is useful

It has taken time and ample research for the community to gain a greater understanding of what it means for data to be anonymous and useful. Cryptography went through the same process of iterating on a technology and learning its limitations, and differential privacy and anonymization are going through it now. We no longer use DES to encrypt our data, but rather AES, and chances are that in the next decade we will rely more on lattice-based cryptography than on RSA. As we find limitations in a technology, we do not throw the baby out with the bathwater; rather, we look to gain a deeper understanding of what went wrong, innovate upon it, and make it stronger, more useful, and more accessible.

Join us for more discussions on anonymized data on LinkedIn, Twitter, and YouTube.
