Why is Privacy-Preserving NLP Important?


I have had a number of people ask me why we should bother creating natural language processing (NLP) tools that preserve privacy. Apparently not everyone spends hours upon hours thinking about data breaches and data privacy infringements. Shocking!

The main argument for privacy-preserving NLP comes down to one fact: text and speech are our primary methods of communication. When interacting with web-based service providers and with their other users, we often allow companies to store, use, and even sell the messages we have sent and received on their platforms.

So why do we even share this information with third parties in the first place? Frequently, the answer lies in our desire to get “free” services.

After the Cambridge Analytica/Facebook scandal, one can only hope that the general public has been made aware of how the personal data they have unknowingly been giving away in exchange for “free” services can be sold to third parties for any purpose at any time, without their knowledge or consent. 

Sure, the thought of targeted ad campaigns siloing us into a political echo chamber is scary, but isn’t it even worse to think that the data being used to target us might not even be de-identified when sold to third parties? Heck, even if the data are said to be de-identified, companies often do a poor job of it, without taking into account that cross-referencing their data with other databases for the purpose of re-identification is far from unheard of.

Your name and number can be removed, but what about the locations you’ve visited, your preference for certain restaurants, and even your favourite tea flavours? Innocuous-seeming information becomes part of your digital fingerprint and can give your identity away, so we should be incredibly wary of what written and spoken data we allow onto the marketplace.


We readily give away our data in exchange for convenience 

Biometric authentication is an example of this. Simply put, it means using some unique part of your body/physiology as a password — think of using your fingerprint or face to unlock your computer. In the context of NLP, think of your voice being used by your bank to verify your identity during a customer service call (i.e., speaker authentication). There are two types of speaker authentication: text-dependent and text-independent. In the former, a spoken password is used to identify you, in addition to comparing features extracted from your voice to those of previous recordings. In the latter, the system only uses features extracted from your voice while you speak to the automated or human representative, without you having to say any particular keywords.
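To make the text-independent case concrete, here is a minimal sketch of how verification might work once an utterance has been reduced to a fixed-length feature vector. This is not how any particular vendor does it: the embeddings are assumed to come from some acoustic model (e.g., an x-vector extractor), which is not shown, and the similarity threshold is made up for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two voice embeddings, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify_speaker(enrolled: np.ndarray, attempt: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the identity claim if the new utterance's embedding is
    close enough to the embedding stored at enrolment time."""
    return cosine_similarity(enrolled, attempt) >= threshold


# Toy usage: in a real system both vectors would come from an acoustic
# model run over the user's speech, not from random numbers.
rng = np.random.default_rng(0)
enrolled_embedding = rng.normal(size=256)
attempt_embedding = enrolled_embedding + rng.normal(scale=0.1, size=256)
print(verify_speaker(enrolled_embedding, attempt_embedding))  # True: close match
```

Notice that the enrolled embedding is itself exactly the kind of sensitive artifact this section is about: whoever stores it holds a reusable biometric identifier.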

Securing user data within a speaker authentication system is by far the most researched area related to privacy-preserving NLP, though it certainly is not a solved problem.

One common concern is that the feature vectors associated with a user’s voice could become compromised, which is why such data are encrypted at rest. Another concern is replay attacks, where a recording of a user’s voice is played back to the authentication system in order to gain access to particular files or locations. A band-aid solution is to use more than one mode of authentication (speech + fingerprint, speech + face recognition, speech + PIN, etc.), such that speech becomes a security enhancer rather than the system’s central component. Regardless of the system, a user’s privacy must be preserved from malicious outsiders in order for the system to remain reliable.
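As a rough illustration of the “encrypted at rest” point, the sketch below protects a stored voice-feature vector with symmetric encryption using the `cryptography` package’s Fernet recipe. This only shows the idea; in practice, key management (and deciding who may ever decrypt) is the hard part, and nothing here refers to any particular vendor’s system.

```python
import numpy as np
from cryptography.fernet import Fernet  # pip install cryptography

# In a real deployment the key would live in a key-management service,
# never alongside the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

# The enrolled voice embedding we want to protect at rest.
embedding = np.random.default_rng(0).normal(size=256).astype(np.float32)

# Encrypt the raw bytes before writing them to disk or a database.
ciphertext = fernet.encrypt(embedding.tobytes())

# Decrypt only at the moment the authentication service needs to compare.
restored = np.frombuffer(fernet.decrypt(ciphertext), dtype=np.float32)
assert np.allclose(embedding, restored)
```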

Speaker authentication is one of the few examples where service providers have a strong inherent incentive to put a lot of time and money into making sure our spoken data remain private from third parties. They are, however, still storing a biometric identifier associated with your personal information, one that can be used to determine whether you are speaking in other audio or video recordings. Civil liberties groups argue against voice identifiers being stored without explicit consent (such as at Her Majesty’s Revenue and Customs in the UK) and without more information about how they are stored, shared, and erased.

What about exchanging data privacy for physical security? 

A concern that is often raised regarding data being encrypted without any backdoors is whether that makes it easier for nefarious communications (e.g., between terrorists and criminals) to go undetected. And, hey! If you’ve got nothing to hide, you’ve got nothing to fear, right? That wonderfully nonsensical line is reserved for the fraction of us who have the good fortune of living in democratic societies where we won’t get thrown in jail and/or tortured for speaking our minds about the political party in power. Okay, let’s humour the people who think that line flies and that it’s a-okay for (some) governments and police to have quasi-unrestricted access to citizens’ private data via backdoors or requests for information from companies (predominantly without a warrant).

Data breaches

Suppose that for whatever reason there is some government or company that you sincerely trust with storing your written or spoken content. Splendid!

Well, I hate to break it to you, but over 2.6 billion records were breached in 2017 alone (76% due to accidental loss, 23% due to malicious outsiders). Here’s the crux: you might trust an organization’s intended use of your data without having a clue about how they protect it from being leaked.

Fine, then let’s not even bother sharing our data in the first place. 

Why not just prevent people from accessing the speech and text we produce altogether? Because we want to be provided with free or cheap services (Facebook, Twitter, Instagram, …). We also want AI systems that adapt to our specific traits and preferences (speech recognition systems, personal assistants, search engines, …), and those systems need our data for training.

So how do we get what we want, give companies the data they need to profit and improve their services, AND maintain our privacy?

Research in privacy-preserving NLP is in its infancy, but it is likely to revolutionize the way companies and governments collect, store, process, and sell user data. With regulations like the GDPR coming into effect, public outcry over the Cambridge Analytica scandal, and the massive number of hacks that cost companies millions of dollars in reparations every year, the number of privacy-preserving data processing algorithms will snowball. In future blog posts I will go into some detail on various existing solutions, some practical and others promising but still computationally intractable, including privacy-preserving surveillance, federated learning, the application of differential privacy to neural networks in order to prevent reverse-engineering, homomorphically encrypted NLP, and so on.
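As a small teaser of what those posts will cover, here is a minimal sketch of the Laplace mechanism, one of the basic building blocks of differential privacy: release a noisy count rather than the true one, so that any single person’s presence in the data has a provably bounded effect on the output. Applying the same principle to neural network training (e.g., via noisy gradients) is what the neural-network material will discuss; the query in the comment below is purely hypothetical.

```python
import numpy as np


def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon.

    Adding or removing any one person changes the count by at most
    `sensitivity`, so the released value is epsilon-differentially private."""
    noise = np.random.default_rng().laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# Hypothetical query: how many users mentioned their bank in support chats?
print(laplace_count(true_count=1403, epsilon=0.5))
```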


Acknowledgements. My deepest gratitude to Kelly Langlais, Dr. Siavash Kazemian, and Simon Emond for their invaluable feedback on this post.
