Introduction

Private AI’s solution relies on the latest advances in Machine Learning (ML) research and engineering, including R&D performed by Private AI’s ML team, to identify sensitive information with far higher accuracy than legacy rule- or regex-based systems, competing cloud APIs, and open-source systems.

The solution is built on top of the latest Transformer neural networks, with many custom improvements for entity detection engineered by Private AI’s ML team. The system relies on context to identify sensitive information, much like how humans read and understand text. Until now, however, the main drawback of Transformer models has been their enormous compute requirements. To solve this problem, the Private AI team has invested heavily in Transformer optimization, reaching a runtime 25X faster than a reference Transformer model and enabling low-cost processing on commodity hardware.

Cloud APIs usually reserve the right for the provider to retain sensitive customer data for service improvement and ML model development. Private AI’s solution, in comparison, is designed to be self-hosted by the customer, meaning that Private AI never sees or handles customer data.

This documentation describes how to use the Private AI privacy solution to de-identify or redact PII, PHI (protected health information) & PCI (payment card information), and how to generate synthetic PII.
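As a quick taste of what this looks like in practice, the sketch below sends a piece of text to a self-hosted container for de-identification. It is purely illustrative: the port, endpoint path, payload fields and response shape are assumptions made for this sketch, not the documented API contract; see the API reference in this documentation for the real interface.

```python
import requests

# Illustrative sketch only: the port, endpoint path, payload fields and
# response shape are assumptions, not the documented API contract.
response = requests.post(
    "http://localhost:8080/deidentify_text",  # hypothetical self-hosted container
    json={"text": "Hi, I'm John Smith. Call me at 416-555-0192."},
)
print(response.json())
# Conceptually, detected PII is replaced with entity markers, e.g.:
# "Hi, I'm [NAME_1]. Call me at [PHONE_NUMBER_1]."
```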

For more detail on any of the subjects below, please visit: https://www.private-ai.com/blog/

What is PII?

PII stands for “Personally Identifiable Information”. PII encompasses any form of information that could be used to identify someone. Common examples of PII include names, phone numbers and credit card numbers. These directly identify someone and are hence called ‘direct identifiers’.

Data de-identification is simple, right? Just remove names, phone numbers, credit card numbers and a few other things like social security numbers, and the data is redacted! Unfortunately, real-world data contains edge case after edge case that needs to be considered. For example, what about a person named Paris or June? What about an internal office extension of x324? The toy sketch below shows why simple lookups fall short.
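In the following sketch (names and sentences invented for illustration), a simple name list produces identical matches for both sentences, even though only one of them actually mentions people:

```python
# Toy illustration: a lookup list cannot tell people from places or months.
NAMES = {"Paris", "June"}

sentences = [
    "Paris and June joined the call.",  # both are people's names here
    "I'm flying to Paris in June.",     # a city and a month, not people
]
for sentence in sentences:
    tokens = [word.strip(".,") for word in sentence.split()]
    hits = [token for token in tokens if token in NAMES]
    print(sentence, "->", hits)  # ['Paris', 'June'] both times

# Only the surrounding context distinguishes a person from a place or a
# month, which is exactly what simple rules cannot capture.
```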

In addition to direct identifiers, PII also includes ‘quasi-identifiers’, which on their own cannot uniquely identify a person, but when combined can dramatically increase the likelihood of re-identifying an individual. Examples of quasi-identifiers include nationality, religion and prescribed medications. For example, consider a company with 10,000 customers. Knowing that a particular customer lives in Delaware isn’t likely to allow for re-identification, but knowing that they live in Delaware, are Buddhist, are male, have Dutch nationality and are taking heart medication probably is!
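A back-of-the-envelope calculation makes this concrete. The frequencies below are invented for illustration, and attribute independence is assumed; even so, they show how quickly a combination of quasi-identifiers becomes effectively unique:

```python
# Illustrative (invented) attribute frequencies among 10,000 customers.
customers = 10_000
frequencies = {
    "lives in Delaware": 0.01,
    "Buddhist": 0.01,
    "male": 0.50,
    "Dutch nationality": 0.005,
    "takes heart medication": 0.05,
}

expected_matches = customers
for frequency in frequencies.values():
    expected_matches *= frequency  # assumes the attributes are independent

print(f"Expected matching customers: {expected_matches}")  # 0.000125
# Far fewer than one expected match: anyone who fits all five attributes
# is almost certainly unique in the dataset.
```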

To cut a long story short, detecting PII is hard! What is considered PII also depends on the relevant legislation, such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA). Pieter Luitjens, our CTO, wrote an article that touches on this subject here: https://www.private-ai.com/2021/02/17/1342/

For this reason, the Private AI team includes linguists and privacy experts who decide what is and isn’t considered PII, in line with the relevant legislation.

The GDPR, for instance, provides the following definition of personal data: “‘Personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” (source: https://gdpr.eu/eu-gdpr-personal-data/)

The CCPA, meanwhile, defines ‘personal information’ as “information that identifies, relates to, or could reasonably be linked with you or your household. For example, it could include your name, social security number, email address, records of products purchased, internet browsing history, geolocation data, fingerprints, and inferences from other personal information that could create a profile about your preferences and characteristics.” (source: https://oag.ca.gov/privacy/ccpa)

Even whom the information relates to, identifies, or could be linked with differs between the two laws (‘data subject’ in the GDPR vs. ‘you or your household’ in the CCPA).

Why Redaction?

First of all, redaction plays a key role in data minimization, which means collecting only the personal data that is absolutely necessary. Not only does this protect individuals’ privacy from the data collector (e.g., a corporation or government), it also limits the harm to both individuals and data collectors in the event of a data breach.

Note that the same technology used to redact unstructured data can also be used for personal data identification, another central requirement of modern data governance. The following article describes how these concepts are core to data governance: https://www.lastwatchdog.com/guest-essay-how-stricter-data-privacy-laws-have-redefined-the-filing-of-our-personal-data/

Warning: redaction, anonymization & de-identification are often used interchangeably. It is incorrect, and even dangerous, to do so. You can find an article on the differences between these terms here: https://www.private-ai.com/2021/04/06/1751/

There’s a large body of thought in the privacy community that redaction, anonymization & de-identification don’t work. This is largely due to a number of high-profile datasets that companies claimed were anonymized but were in fact improperly de-identified (see https://www.private-ai.com/2021/06/04/data-anonymization-perspectives-from-a-former-skeptic/). Another key reason is that legacy de-identification systems rely on rule-based PII detectors, usually made up of regular expressions (regexes). It is tough to develop rule-based systems that handle even a fraction of real-world edge cases. For example, consider driver’s licenses: they have a different format in every US state, let alone every country! Or phone numbers like x324 or 1-800-FREE-PIZZA, as the sketch below shows.
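Here is a small demonstration using a typical North American phone-number regex (the pattern is a common textbook-style example, not taken from any particular product):

```python
import re

# A typical rule-based detector for North American phone numbers.
PHONE = re.compile(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

samples = [
    "Call me at 416-555-0192",    # detected
    "my extension is x324",       # missed: internal office extension
    "order at 1-800-FREE-PIZZA",  # missed: vanity number with letters
]
for sample in samples:
    print(sample, "->", bool(PHONE.search(sample)))

# The regex catches only the conventional format; a human reader
# immediately recognizes all three as phone identifiers.
```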

As described in the Introduction above, Private AI’s solution instead relies on context-aware Transformer models, with custom improvements for entity detection, to handle exactly these kinds of edge cases.

Redaction in Machine Learning

Redaction is very useful when training ML models because of their ability to memorize training data. Transformer networks are particularly good at memorization and should never be trained on sensitive information without mitigation steps. A good example of things going wrong is the ScatterLab Lee-Luda chatbot scandal, where a chatbot trained on intimate conversations started using memorized PII (such as home addresses) in conversations with other people. Even classification models, such as those for sentiment analysis, have been shown to retain sensitive data in their input embeddings, allowing PII to be extracted.

Approaches like Differential Privacy require expert tuning, and the complexity and limited explainability of ML models make their real-world privacy guarantees unclear. Differential Privacy also typically results in a large accuracy drop, forcing a trade-off between privacy and utility. Redaction is the perfect solution to this problem: all sensitive data is removed before the ML model is trained, providing a strong and easy-to-understand privacy guarantee. Anyone, including non-ML engineers, can inspect the training data to see that personal data has been removed. Additionally, ML training is decoupled from privacy, resulting in a simpler system.
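A minimal sketch of this workflow is shown below, with a hypothetical redact() function standing in for a call to the de-identification service (the marker format is also illustrative):

```python
# Sketch: redaction as a preprocessing step, decoupled from ML training.

def redact(text: str) -> str:
    # Stand-in for a call to the self-hosted de-identification service;
    # a fixed toy substitution keeps this sketch runnable.
    return text.replace("Jane Doe", "[NAME_1]").replace("123 Main St", "[LOCATION_1]")

raw_corpus = ["My name is Jane Doe and I live at 123 Main St."]
clean_corpus = [redact(text) for text in raw_corpus]

# Anyone, ML engineer or not, can inspect exactly what the model will see:
print(clean_corpus)  # ['My name is [NAME_1] and I live at [LOCATION_1].']

# train(model, clean_corpus)  # training consumes only redacted text
```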

Finally, removing all identifying information also helps improve fairness. A model can’t discriminate based on age & gender if they have been removed from the input data!

Why Synthetic PII?

Generating synthetic PII has two key advantages. Firstly, any PII identification errors become much harder to exploit: PII the system failed to detect is hidden among realistic synthetic PII, so an attacker must first work out which PII is real before they can use it to re-identify target subjects.

Secondly, synthetic PII eliminates data shift between pre-training and fine-tuning. Transformer models are typically pre-trained on large corpora of natural text, so redaction placeholders look nothing like anything the model saw during pre-training and introduce a distribution shift. Replacing PII with realistic synthetic values keeps the text natural, preventing accuracy loss in production.

Private AI’s synthetic PII generation system relies on ML to generate PII that is more realistic and better fits the surrounding context than the output of legacy systems relying on lookup tables.
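The sketch below (table entry and sentence invented for illustration) shows why lookup tables fall short: a replacement drawn blindly from a table ignores the grammatical and cultural cues of the surrounding text.

```python
# A legacy lookup-table swap picks a replacement with no regard for context.
LOOKUP = {"Dubois": "Emily"}  # illustrative table entry

text = "Mr. Dubois said he would call back tomorrow."
fake = text
for real_name, replacement in LOOKUP.items():
    fake = fake.replace(real_name, replacement)

print(fake)  # "Mr. Emily said he would call back tomorrow."
# A given name is paired with an honorific and masculine pronouns:
# the result no longer reads as natural text.
```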

While the synthetic PII generation system is still in beta, it was successfully used to eliminate the accuracy loss caused by redaction in the CoLA (Corpus of Linguistic Acceptability) subtask of the GLUE benchmark. A copy of the benchmark report is available upon request.