
In recent years, AI models have become more accessible and easier to use, leading to an explosion of data-driven innovations across various industries. However, this increased reliance on AI models raises concerns about privacy management and personally identifiable information (PII) detection. As users become more aware of the value and sensitivity of their personal data, gaining their trust and ensuring the security of their information has become more critical than ever.
The Weakness of Regex-Based Approaches
Regex-based code is a popular approach for detecting and extracting PII from text data. However, this approach generalizes poorly when PII has to be identified across different contexts. In areas like automated speech recognition (ASR), where transcription errors are common, regex-based code may miss PII such as emails, addresses, credit card numbers, and phone numbers. This is because regex relies on a fixed set of patterns, which may not be comprehensive enough to capture all possible variations and errors.
Consider this real-world example from a French call transcript:

As you can see, the credit card number is spoken in natural language across multiple segments of conversation. Traditional regex patterns would completely miss this sensitive information.
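To make the limitation concrete, here is a minimal Python sketch. The pattern, the helper name `find_card_numbers`, and both sample inputs are invented for illustration (the spoken segments are in English and are not taken from any real transcript). It assumes a typical digit-based card pattern and is not meant as a production-grade detector.

```python
import re
from typing import List

# Illustrative pattern (an assumption, not a standard): 13 to 16 digits,
# optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def find_card_numbers(text: str) -> List[str]:
    """Return substrings that look like card numbers written in digits."""
    return [match.group(0) for match in CARD_PATTERN.finditer(text)]


# Invented inputs: the same kind of information, typed versus spoken.
typed = "My card number is 4111 1111 1111 1111, thanks."
spoken_segments = [
    "sure, it's four one one one",
    "um, one one one one, then one one one one",
    "and the last part is one one one one",
]

# The typed form is caught by the digit pattern.
typed_hits = find_card_numbers(typed)

# The spoken form contains no digit characters, so the pattern finds
# nothing, even after the segments are joined into one string.
spoken_hits = find_card_numbers(" ".join(spoken_segments))

print(typed_hits)
print(spoken_hits)
```

Extending the pattern to cover number words, filler words, and segment boundaries is possible in principle, but each new variation needs its own hand-written rule, which is the generalization problem described above.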
The Open-Source Alternative Problem
Common characteristics of open-source solutions include:
- Built primarily around a single use case, so they do not generalize well
- Built to cover one or only a few languages
- Limited to a small set of entities, often selected for a single use case
- Built primarily for simple Named Entity Recognition (NER) tasks
- Built for limited data types
- Often not optimized for speed or scalability
Due to these characteristics, applying open-source solutions directly to real-world use cases requires a tremendous amount of effort that often goes unplanned. Additionally, as data drift occurs, new data types and contexts appear, and regulations change, the maintenance effort quickly adds to the total full-time equivalent (FTE) cost of building, maintaining, and managing a solution.
A common misconception is that a simple ML model can readily become a fully scalable product. In practice, companies frequently underestimate what it takes to bring models into production.

The Stakes Are Higher Than Ever
Detecting PII in specific contexts can be challenging and often requires specialized knowledge. For example, medical data may contain sensitive information such as patient names, social security numbers, and medical conditions, and detecting it requires advanced knowledge of medical terminology and concepts (we will explore these healthcare challenges in depth in an upcoming post). Similarly, financial data may contain sensitive information such as bank account numbers, credit card information, and transaction details, and understanding the sensitivity of each requires knowledge of financial regulations and practices. Furthermore, the way sensitive information enters enterprise applications varies significantly and introduces additional problems, such as typos and inaccurate transcriptions of phone calls.
For our customers, missed PII is a significant event that can lead to severe consequences, such as data breaches, identity theft, or legal liability. This is why specialized AI-based models, which can learn and adapt to different contexts and variations, detect PII more accurately and efficiently than traditional approaches.