Healthcare and Medical Data: The Ultimate PII Detection Challenge

By its nature, medical data contains vast amounts of personally identifiable information (PII) and protected health information (PHI). Unlike alternative services, Private AI covers all of these entity types in a single API. Our comprehensive performance benchmarks, detailed in "The Specialization Gap: Purpose-Built vs. General Market PII Detection Solutions (Benchmark Results)", show stark differences in healthcare settings. It's worth noting that many privacy laws, including GDPR, CPRA, Quebec's Law 25, and Japan's APPI, cover sensitive data contained in medical records.
The Unique Challenges of Medical Data
Medical data presents some of the most complex PII detection challenges:
- Extensive PHI Coverage Required: Blood Type, Dosage, Injury, Medical Condition, Medical Process, Medical Statistics, Medication
- Free-form Text Fields: Patient records and EMRs often contain unstructured notes
- Multiple Entity Types: Names, family members, medical professionals, and clinical details intermixed
- Regulatory Complexity: HIPAA, GDPR, and other regulations with strict requirements
Performance Results: Medical Data¹
Using the methodology detailed in "How to Properly Benchmark PII Detection Solutions," here are the medical data results:

The Entity Coverage Problem
Private AI identifies and redacts the entity types covered by the regulations mentioned above. Azure and Presidio do not support any medical entity types. Google has a single broad MEDICAL_TERM entity type, which lumps together information such as conditions and injuries. AWS offers a separate service for PHI (Amazon Comprehend Medical), which means that a user who wants to redact both PII and PHI must make two calls for the same text, significantly increasing their service charges. Even with this additional service, AWS does not detect BLOOD_TYPE, INJURY, or medical STATISTICS, all of which are covered by Private AI.
Real-World Example: Free-Form Medical Notes
To further complicate matters, one of the most common problems in healthcare is handling free-form text fields in patient records and EMRs. For example, a field called "attending notes" will often include PII, such as "Mr Johnson had a fall last night."
Here's a comparison showing how different approaches handle this challenge:

Without a commercial solution, de-identification of these free-form fields is commonly attempted by searching the notes for the strings stored in the patient record's name fields and removing any matches. Equally common are reports of this approach failing because of typos, nicknames, and the names of family members mentioned in the notes.
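The failure modes of the string-search approach can be illustrated with a minimal sketch. The function name `naive_redact`, the sample record, and the sample notes are all hypothetical and chosen only for illustration; this is not how any of the benchmarked products work.

```python
import re


def naive_redact(note: str, record_names: list[str]) -> str:
    """Replace whole-word, case-insensitive matches of known name fields."""
    for name in record_names:
        pattern = rf"\b{re.escape(name)}\b"
        note = re.sub(pattern, "[NAME]", note, flags=re.IGNORECASE)
    return note


# Hypothetical patient record: the only names the string search knows about.
record_names = ["Robert", "Johnson"]

# Hypothetical free-form notes covering the failure modes described above.
notes = [
    "Mr Johnson had a fall last night.",                         # exact match
    "Mr Jonhson had a fall last night.",                         # typo
    "Bob says the pain is worse.",                               # nickname
    "His daughter Emily Carter called to ask about discharge.",  # family member
]

for note in notes:
    print(naive_redact(note, record_names))
```

Only the first note is redacted. The typo, the nickname, and the family member's name all pass through unchanged, because none of them matches a string stored in the record.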
HIPAA Compliance and Expert Determination
Private AI helps organizations comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA) by accurately de-identifying healthcare datasets that then go through the process of expert determination. Expert determination assesses a statistically representative sample of a dataset and ensures that the re-identification risk is below a predetermined threshold.
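The sample-and-threshold idea can be sketched in a few lines. This is a deliberately simplified illustration, not the method an expert would use: it treats the share of sampled records that a reviewer flags as still containing an identifier as a stand-in for re-identification risk, and it compares a conservative upper bound on that share (the Wilson score interval) to the threshold. The function names, the sample size, the 1% threshold, and the reviewer flags are all assumptions made for the example.

```python
import math
import random


def wilson_upper_bound(failures: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("sample must not be empty")
    p = failures / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre + margin) / denom


def draw_sample(record_ids: list[int], size: int, seed: int = 0) -> list[int]:
    """Draw a simple random sample of record IDs for manual review."""
    return random.Random(seed).sample(record_ids, size)


def passes_threshold(flags: list[bool], threshold: float) -> bool:
    """True if the upper bound on the flagged share is below the threshold.

    flags[i] is True when a reviewer found a residual identifier in record i.
    """
    return wilson_upper_bound(sum(flags), len(flags)) < threshold


# Hypothetical review: 1,000 sampled records, none flagged by the reviewer.
sample_ids = draw_sample(list(range(50_000)), size=1_000)
flags = [False] * len(sample_ids)
print(passes_threshold(flags, threshold=0.01))
```

Using an upper bound rather than the raw share means a small sample cannot pass merely because it happened to contain no flagged records; the sample has to be large enough for the bound itself to drop below the threshold.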
As a Senior Software Development Manager at Providence Health explains:

Why Medical Data Demands Specialized AI
The combination of complex medical terminology, free-form clinical notes, multiple stakeholder types (patients, family members, medical professionals), and strict regulatory requirements makes medical data the ultimate test for PII detection systems. Competing commercial approaches consistently fail to handle this complexity, missing 15% to 33% of sensitive information. In healthcare, where patient privacy is both a legal requirement and an ethical imperative, this level of data leakage is simply unacceptable.
Medical data is just one domain where the performance gaps shown in "The Specialization Gap: Purpose-Built vs. General Market PII Detection Solutions (Benchmark Results)" have major, life-changing implications for patient privacy.
¹ Last compared October 2024