Why is Privacy-Preserving NLP Important?

I have had a number of people ask me why we should bother creating natural language processing (NLP) tools that preserve privacy. Apparently not everyone spends hours upon hours thinking about data breaches and data privacy infringements. Shocking!

The main argument for privacy-preserving NLP comes down to one fact: text and speech are our primary methods of communication. When interacting with web-based service providers and other users thereof, we often allow companies to store, use, and even sell the messages we have sent and received on their platforms.

So why do we even share this information with third parties in the first place? Frequently, the answer lies in our desire to get “free” services.

After the Cambridge Analytica/Facebook scandal, one can only hope that the general public has been made aware of how the personal data they have unknowingly been giving away in exchange for “free” services can be sold to third parties for any purpose at any time, without their knowledge or consent.

Sure, the thought of targeted ad campaigns siloing us into a political echo chamber is scary, but isn’t it even worse to think that the data being used to target us might not even be de-identified when sold to third parties? Heck, even when the data are said to be de-identified, companies often do a poor job of it and fail to take into account that cross-referencing them with other databases for the purpose of re-identification is far from unheard of.

Your name and number can be removed, but how about the locations you’ve visited, your preference for certain restaurants, and even your favourite tea flavours? Innocuous-seeming information becomes part of your digital fingerprint and can give your identity away. We should therefore be incredibly wary of what written and spoken data we allow onto the marketplace.
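To make the re-identification risk concrete, here is a minimal sketch of a linkage attack. Everything in it is made up for illustration: the records, the restaurant names, the field names, and the helper functions `quasi_key`, `smallest_group_size`, and `reidentify` are all hypothetical. It assumes an attacker holds some auxiliary dataset (say, public profiles) that shares the "innocuous" attributes with the released data.

```python
from collections import Counter

# Hypothetical "de-identified" release: names and numbers are gone,
# but city, favourite restaurant, and favourite tea remain.
released = [
    {"city": "Toronto", "restaurant": "Pho House", "tea": "oolong", "text": "message A"},
    {"city": "Toronto", "restaurant": "Pho House", "tea": "earl grey", "text": "message B"},
    {"city": "Montreal", "restaurant": "Chez Nous", "tea": "earl grey", "text": "message C"},
]

# Hypothetical auxiliary data the attacker already has.
public_profiles = [
    {"name": "Alice", "city": "Toronto", "restaurant": "Pho House", "tea": "oolong"},
    {"name": "Bob", "city": "Toronto", "restaurant": "Pho House", "tea": "earl grey"},
    {"name": "Carol", "city": "Montreal", "restaurant": "Chez Nous", "tea": "earl grey"},
    {"name": "Dan", "city": "Montreal", "restaurant": "Chez Nous", "tea": "earl grey"},
]

QUASI_IDENTIFIERS = ("city", "restaurant", "tea")


def quasi_key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def smallest_group_size(records):
    """Size of the smallest group of records sharing the same quasi-identifiers.

    A value of 1 means at least one record is unique within the release.
    """
    return min(Counter(quasi_key(r) for r in records).values())


def reidentify(records, profiles):
    """Map a released record's index to a name when exactly one profile matches."""
    names_by_key = {}
    for profile in profiles:
        names_by_key.setdefault(quasi_key(profile), []).append(profile["name"])

    matches = {}
    for index, record in enumerate(records):
        candidates = names_by_key.get(quasi_key(record), [])
        if len(candidates) == 1:  # unambiguous match: identity recovered
            matches[index] = candidates[0]
    return matches


print(smallest_group_size(released))
print(reidentify(released, public_profiles))
```

In this toy data, the first two released records each match a single profile, so their authors are re-identified even though no name was ever released. The third record matches two profiles and stays ambiguous, which is the intuition behind grouping-based protections such as k-anonymity.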

We readily give away our data in exchange for convenience

Biometric authentication is an example of this. Simply put, it means using some unique part of your body or physiology as a password: think of using your fingerprint or face to unlock your computer. In the context of NLP, think of your voice being used by your bank to verify your identity during a customer service call (i.e., speaker authentication). There are two types of speaker authentication: text-dependent and text-independent. In the former, a spoken password is used to identify you, in addition to comparing features extracted from your voice to those from previous recordings. In the latter, the system only uses features extracted from your voice while you speak to the automated or human representative, without you having to say any particular keywords.
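The difference between the two modes can be sketched in a few lines. This is a toy illustration, not how any particular bank or vendor does it: the three-dimensional "voice features", the cosine-similarity comparison, the 0.8 threshold, and the function names are all assumptions chosen for readability. Real systems extract much richer features from audio and transcribe the spoken phrase with a speech recognizer.

```python
import hmac
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify_text_independent(enrolled, sample, threshold=0.8):
    """Accept if the voice features alone are close enough to the enrolled ones."""
    return cosine_similarity(enrolled, sample) >= threshold


def verify_text_dependent(enrolled, sample, expected_phrase, spoken_phrase, threshold=0.8):
    """Accept only if the spoken password matches AND the voice features match."""
    phrase_ok = hmac.compare_digest(
        expected_phrase.lower().encode("utf-8"),
        spoken_phrase.lower().encode("utf-8"),
    )
    return phrase_ok and verify_text_independent(enrolled, sample, threshold)


# Hypothetical feature vectors (assumed already extracted from audio).
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.85, 0.15, 0.38]
other_speaker = [0.1, 0.9, 0.2]

print(verify_text_independent(enrolled, same_speaker))
print(verify_text_independent(enrolled, other_speaker))
print(verify_text_dependent(enrolled, same_speaker, "my voice is my password", "open sesame"))
```

Note that in both modes the enrolled feature vector has to be stored somewhere, which is exactly the data whose protection the rest of this post is concerned with.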

Securing user data within a speaker authentication system is by far the most researched area related to privacy-preserving NLP, though it certainly is not a solved problem

One common concern is that the feature vectors associated with a user’s voice could become compromised. Such data is therefore typically encrypted at rest. Another concern is that of replay attacks, where a user’s voice is recorded and played back to the authentication system with the purpose of gaining access to particular files or locations. A band-aid solution to this is to use more than one mode of authentication (speech + fingerprint, speech + face recognition, speech + PIN, etc.), such that speech becomes a security enhancer rather than a system’s central component. Regardless of the system, a user’s privacy must be protected from malicious outsiders in order for the system to remain reliable.
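As a rough sketch of the multi-factor idea, the policy below treats a voice match as necessary but never sufficient. The `FactorResults` class and `grant_access` function are hypothetical names for illustration, and the sketch assumes each factor has already been checked by its own subsystem and reduced to a pass/fail result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorResults:
    """Pass/fail outcome of each authentication factor (hypothetical structure)."""
    voice: bool
    fingerprint: bool = False
    face: bool = False
    pin: bool = False


def grant_access(results: FactorResults) -> bool:
    """Require a voice match plus at least one additional, non-voice factor."""
    other_factor_ok = results.fingerprint or results.face or results.pin
    return results.voice and other_factor_ok


# A replayed recording may pass the voice check, but on its own it is not enough.
print(grant_access(FactorResults(voice=True)))
# Voice plus a correct PIN is accepted.
print(grant_access(FactorResults(voice=True, pin=True)))
```

Under this policy an attacker with only a recording of your voice is turned away, although the approach does nothing to stop the replay itself, which is why it is a band-aid rather than a fix.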

Speech authentication is one of the few examples where service providers have a strong inherent incentive to put a lot of time and money into making sure our spoken data remains private from third parties. They are, however, still storing a biometric identifier associated with your personal information, one that can be used to determine whether you are speaking in other audio or video recordings. Civil liberties groups argue against voice identifiers being stored without explicit consent (such as at Her Majesty’s Revenue and Customs in the UK) and without more information about how they are stored, shared, and erased.

What about exchanging data privacy for physical security?

A concern that is often raised regarding data being encrypted without any backdoors is whether that makes it easier for nefarious communications (e.g., between terrorists and criminals) to go undetected. And, hey! If you’ve got nothing to hide you’ve got nothing to fear, right? That wonderfully nonsensical line is reserved for the fraction of us who have the good fortune of living in democratic societies where we won’t get thrown in jail and/or tortured for speaking our minds about the political party in power. Okay, let’s humour the people who think that line flies and that it’s a-okay for (some) governments and police to have quasi-unrestricted access to citizens’ private data via backdoors or requests for information from companies (predominantly without a warrant).

Data breaches

Suppose that for whatever reason there is some government or company that you sincerely trust with storing your written or spoken content. Splendid!

Well, I hate to break it to you, but over 2.6 billion records were breached in 2017 alone (76% due to accidental loss, 23% due to malicious outsiders). Here’s the crux: you might trust an organization’s intended use of your data without having a clue about how they protect it from being leaked.

Fine, then let’s not even bother sharing our data in the first place.

Why not just prevent people from accessing speech and text we produce altogether? Because we want to be provided with free or cheap services (Facebook, Twitter, Instagram, …). We also want training data for AI algorithms that are adaptable to our specific traits and preferences (speech recognition systems, personal assistants, search engines, …).

So how do we get what we want, give companies the data they need to profit/improve services AND maintain our privacy?

Research in privacy-preserving NLP is in its infancy, but it is likely to revolutionize the way companies and governments collect, store, process, and sell user data. With regulations like the GDPR coming into effect, public outcry over the Cambridge Analytica scandal, and the massive number of hacks that cost companies millions of dollars in reparations every year, the number of privacy-preserving data processing algorithms will snowball. I will be going into some detail on various (practical as well as still computationally intractable but promising) existing solutions in future blog posts, including privacy-preserving surveillance, federated learning, the application of differential privacy to neural networks in order to prevent reverse-engineering, homomorphically encrypted NLP, and so on.

Acknowledgements. My deepest gratitude to Kelly Langlais, Dr. Siavash Kazemian, and Simon Emond for their invaluable feedback on this post.
