Natural Language v. Regex: The Context Wars

The past decade has seen AI advance in leaps and bounds, with Deep Learning (DL) enabling many new applications and redefining industries. The impact on Computer Vision has been particularly significant: without DL, computers could not process real-world images and videos accurately enough to enable real-world applications. Natural Language Processing (NLP) is different in that, prior to Deep Learning, relatively effective techniques did exist, enabling applications such as spam filtering.

Enter the humble regular expression (regex), a method used to look for patterns within text. Regexes have formed an integral part of computer software for decades, especially in Named Entity Recognition (NER) systems. NER applications include gathering keywords for search indexing, product intelligence, and identifying sensitive information.

Say you’re looking for credit card numbers. It’s quite easy to set up a regex that looks for 16-digit numbers, or for four groups of four digits separated by a ‘-’. A regex like this is highly effective in the perfect world of computer data, but unfortunately the real world is much more complicated. Let’s take phone numbers. They come in different lengths and arrangements, and the way they are written changes based on where the caller is located.

The following numbers are all equivalent, for instance:

00049 153 43437 800 – Extra 0 to dial out of an office phone system

+49 153 434 37 800 – ‘+’ international dialing code format

01534 3437800 – ‘Standard’ intra-country format

15 343 437 800 – People often drop the leading digit from their locale
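As a rough sketch of what rule-based handling of these spellings involves, the snippet below reduces the first two variants above to a single canonical form. The function name `normalize_german` and its prefix rules are assumptions made for illustration; they are not a complete treatment of German numbering and are not how any particular library works.

```python
import re


def normalize_german(raw: str) -> str:
    """Reduce a few German-style spellings to '+49...' (illustrative rules only)."""
    digits = re.sub(r"[^\d+]", "", raw)  # keep digits and a leading '+'
    if digits.startswith("+"):
        return digits
    if digits.startswith("00049"):       # office dial-out '0' + international '00'
        return "+" + digits[3:]
    if digits.startswith("0049"):        # international '00' prefix
        return "+" + digits[2:]
    if digits.startswith("0"):           # national trunk prefix
        return "+49" + digits[1:]
    return "+49" + digits                # assume the leading digit was dropped


print(normalize_german("00049 153 43437 800"))
print(normalize_german("+49 153 434 37 800"))
```

Every branch encodes a guess about how the number was written. The final fallback is the most dangerous: it will happily turn any run of digits, such as a part number, into a "phone number".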

Phone numbers can also contain letters. For example, people frequently write their internal office number using just their extension, typically starting with ‘x’, e.g. ‘x51399’. Hotlines are another good example, often using the letters associated with each digit on the keypad to make the number easier to remember. For example, a taxi hotline in Australia writes ‘132227’ as ‘13cabs’.
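The keypad-letter convention can be sketched as a small lookup table. The mapping below is the standard telephone keypad layout; the helper name `vanity_to_digits` is made up for this example.

```python
# Standard telephone keypad: each group of letters shares one digit.
KEYPAD = {
    "abc": "2", "def": "3", "ghi": "4", "jkl": "5",
    "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9",
}
LETTER_TO_DIGIT = {
    letter: digit for letters, digit in KEYPAD.items() for letter in letters
}


def vanity_to_digits(text: str) -> str:
    """Replace keypad letters with their digits; leave other characters alone."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in text.lower())


print(vanity_to_digits("13cabs"))  # 132227
```

A digits-only pattern never sees the letters, so the lookup has to run before matching. Running it on all text, however, would turn ordinary words into digit strings, which is exactly the kind of false match discussed later in this post.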

Certain numbers also have different meanings. For example, ‘911’ could be an emergency phone number, but it could also refer to the Porsche 911 or the 9/11 attacks. Replacement part numbers are another good example, since they can look just like phone numbers.

Google actually built a system for finding phone numbers using regular expressions 

The system was manually programmed to find numbers and check whether they’re valid for each country in the world. It works well for regular numbers and is even able to catch the above example with an extra 0 for office systems. It is not, however, able to find alphanumeric numbers such as ‘x51399’ or ‘13cabs’.

It isn’t feasible to program a regex for every single possible pattern 

In addition to the ludicrous amount of developer time it would take to program, optimize, and maintain this fictional ‘perfect regex’, the unfortunate truth is that it would also return so many false matches that the output would no longer be useful. Regex-based solutions also require a lot of work to maintain, as the expressions constantly need to be tweaked to account for new patterns and false matches. This task coincidentally happens to be one of the least favourite developer chores out there.

Today’s highly connected world requires international solutions dealing with regional differences in standardization 

For example, a German address could be ‘Eugen-Schoenhaar-strasse 21, 10423 Berlin’. The house number is written after the street name, whilst ‘strasse’ (street in German) is joined to the street name in one long compound word, because Germans love those. Postal codes are another good example. Dutch and Canadian postal codes are both made up of 6 characters. A Dutch postal code, however, is 4 digits followed by 2 letters (e.g. 1234AY), whereas a Canadian postal code alternates letters and digits (e.g. K1A 0B1). An Australian postal code, on the other hand, is only 4 digits.
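To make the regional differences concrete, here is a minimal sketch with one pattern per country. The patterns are deliberately simplified assumptions (real Canadian codes, for instance, exclude certain letters), and `matching_countries` is a hypothetical helper rather than part of any existing system.

```python
import re

# Simplified per-country patterns; real-world rules are stricter.
POSTAL_CODE_PATTERNS = {
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),            # 4 digits + 2 letters
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),    # letter-digit alternation
    "AU": re.compile(r"\d{4}"),                      # 4 digits
}


def matching_countries(text: str) -> list:
    """Return the country codes whose pattern matches the whole string."""
    return [
        country
        for country, pattern in POSTAL_CODE_PATTERNS.items()
        if pattern.fullmatch(text)
    ]


print(matching_countries("1234AY"))   # ['NL']
print(matching_countries("K1A 0B1"))  # ['CA']
print(matching_countries("2021"))     # ['AU'], although it may just be a year
```

The last line shows the limit of this approach: a bare four-digit string satisfies the Australian pattern whether it is a postal code, a year, or a PIN. The pattern alone cannot tell them apart.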

Building a system that performs reliably and delivers a level of accuracy high enough for production applications requires handling each of these variants and edge cases, such as phone numbers containing letters. And that credit card number example? Well, it turns out not all credit cards are 16 digits long!
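The 16-digit pattern described earlier can be written as below, as a sketch only. The sample numbers are widely published test numbers rather than real cards, and `find_card_candidates` is a name chosen for this example.

```python
import re

# 16 contiguous digits, or four groups of four digits separated by hyphens.
CARD_16 = re.compile(r"\b(?:\d{16}|\d{4}(?:-\d{4}){3})\b")


def find_card_candidates(text: str) -> list:
    """Return every substring that looks like a 16-digit card number."""
    return CARD_16.findall(text)


print(find_card_candidates("Visa: 4111-1111-1111-1111"))  # found
print(find_card_candidates("Amex: 378282246310005"))      # 15 digits: missed
print(find_card_candidates("Order 1234567890123456"))     # false match
```

The pattern misses the 15-digit card entirely and flags a 16-digit order number that is not a card at all. Loosening it to catch the former tends to produce more of the latter.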

AI models understand context

Humans can easily distinguish between the above examples to determine what is or is not a phone number, an address, etc. We do this by looking not only at the number but also at its context (e.g. “call 911!” versus “gosh that’s a nice 911”). Unlike regexes, AI models take context into account and are therefore able to interpret text more like humans do.
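A short sketch of the problem from the regex side: a pattern for ‘911’ gives the same answer for both sentences, because it only ever inspects the characters it matches. The variable names here are illustrative.

```python
import re

EMERGENCY = re.compile(r"\b911\b")

sentences = ["call 911!", "gosh that's a nice 911"]
hits = [bool(EMERGENCY.search(sentence)) for sentence in sentences]

print(hits)  # [True, True]: the pattern cannot tell the two apart
```

Telling the two apart requires looking at the surrounding words, which is the information a context-aware model uses and a pattern match discards.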

State-of-the-art AI systems (like our models at Private AI) are trained on large amounts of carefully annotated data and meticulously revised to account for all these edge cases and locale-specific differences. In Private AI’s case, we reach >97% in-domain accuracy, which we have found to be higher than human performance in most settings.

Regex systems require careful analysis of existing expressions

AI systems also scale better and are easier to maintain. Integrating a change into a regex-based system requires careful analysis of the existing expressions, which become harder and harder to comprehend as they grow. The developer has to make sure that the change doesn’t affect existing expressions and that it adequately captures all the possible permutations of a term. It is common to ‘fix one thing, break another’, and a change often needs a few iterations in production before it is successfully made. AI systems, on the other hand, typically only require some extra training data.

Minimal training data is needed

In the past, one drawback of AI systems was the need for a large amount of data to train them. However, modern techniques allow AI systems to be developed with just a fraction of the training data that was previously required. For example, Private AI’s solution can learn to generalize from as few as 10 examples.

AI-based systems were also held back by the tremendous amount of computing power they required – far more than regex-based systems. This resulted in large cloud bills and difficulty integrating models into edge applications such as mobile apps or desktop applications. AI is also rapidly advancing in this area and the latest techniques enable large reductions in compute resources. At Private AI, we have spent a large amount of time optimizing our solution to the point where it is now 25x faster than BERT large, a popular NLP architecture, whilst also surpassing its performance.

New levels of performance 

Regexes have served as an integral part of computing systems for decades, and will continue to do so. AI-based systems, however, offer new levels of performance by understanding context similarly to humans. For unstructured real-world applications like detecting phone numbers in text, these new techniques enable a wide range of new, high-quality production applications. 

Interested in seeing an AI system at work? Connect with us to try out a free live demo.
