The past decade has seen AI advance in leaps and bounds, with Deep Learning (DL) enabling many new applications and redefining industries. Computer Vision has felt a particularly significant impact: without DL, computers could not reliably process real-world images and videos with accuracy high enough for practical applications. Natural Language Processing (NLP) is different in that, prior to Deep Learning, relatively effective techniques did exist, enabling applications such as spam filtering.
Enter the humble regular expression (regex), a method used to look for patterns within text. Regexes have formed an integral part of computer software for decades, especially in Named Entity Recognition (NER) systems. NER applications include gathering keywords for search indexing, product intelligence, and identifying sensitive information.
Say you’re looking for credit card numbers. It’s quite easy to set up a regex that looks for 16-digit numbers or four groups of four digits separated by a ‘-’. A regex like this is highly effective in the perfect world of computer data, but unfortunately the real world is much more complicated. Let’s take phone numbers. They come in different lengths and arrangements, and change based on where the caller is located. The following numbers are all equivalent, for instance:
00049 153 43437 800 – Extra 0 to dial out of an office phone system
+49 153 434 37 800 – ‘+’ international dialing code format
01534 3437800 – ‘Standard’ intra-country format
15 343 437 800 – People often drop the leading digit from their locale
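To make the mismatch concrete, here is a minimal sketch of what happens when a single pattern meets these variants. The pattern `INTERNATIONAL` and the helper `digits_only` are hypothetical names introduced for illustration, and the variants are written here with the same subscriber digits throughout.

```python
import re

# The four spellings of one number (same subscriber digits in each).
VARIANTS = [
    "00049 153 43437 800",
    "+49 153 434 37 800",
    "01534 3437800",
    "15 343 437 800",
]

# Hypothetical naive pattern: '+', a two-digit country code,
# then one or more space-separated groups of digits.
INTERNATIONAL = re.compile(r"\+\d{2}(?: \d+)+")

def digits_only(text):
    """Strip everything that is not a digit."""
    return re.sub(r"\D", "", text)

# Only the '+' format is recognized by the naive pattern.
matched = [v for v in VARIANTS if INTERNATIONAL.fullmatch(v)]

# Yet all four spellings end in the same subscriber digits.
SUBSCRIBER = "15343437800"
all_share_suffix = all(digits_only(v).endswith(SUBSCRIBER) for v in VARIANTS)

print(matched)
print(all_share_suffix)
```

Comparing trailing digits is itself a crude heuristic, used here only to show that the spellings agree. A production system would still need per-country rules for prefixes and lengths.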
Phone numbers can also contain letters. For example, people frequently write their internal office number using just their extension, typically starting with ‘x’; e.g., ‘x51399’. Hotlines are another good example, often using the letters associated with each digit on the keypad to make the number easier to remember. For example, a taxi hotline in Australia writes ‘132227’ as ‘13cabs’.
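The letter-to-digit mapping behind such hotline numbers can be sketched in a few lines. This assumes the standard telephone keypad layout (ITU-T E.161); `vanity_to_digits` is a hypothetical helper name, not part of any library.

```python
# Standard telephone keypad letter groups (ITU-T E.161 layout).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the table: each letter points at its digit.
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}

def vanity_to_digits(number: str) -> str:
    """Replace keypad letters with digits; leave other characters untouched."""
    return "".join(LETTER_TO_DIGIT.get(ch.lower(), ch) for ch in number)

print(vanity_to_digits("13cabs"))  # 132227
print(vanity_to_digits("x51399"))  # 951399
```

The second call shows why the mapping cannot be applied blindly: the ‘x’ in an extension is a marker rather than a keypad letter, yet it is turned into a 9. Telling the two uses apart requires context that the mapping does not have.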
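Certain numbers also have different meanings. For example, ‘911’ could refer to the emergency number, the Porsche 911, or the 9/11 attacks. Replacement part numbers are another good example: they are strings of digits that can look like phone numbers without being phone numbers.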
Google actually built a system for finding phone numbers using regular expressions. The system was manually programmed to find numbers and check whether they’re valid for each country in the world. It works well for regular numbers and is even able to catch the above example with an extra 0 for office systems. It is not, however, able to find alphanumeric numbers such as ‘x5177’ or ‘13taxi’.
As you can see, it simply isn’t feasible to program a regex for every single possible pattern. In addition to the ludicrous amount of developer time it would take to program, optimize, and maintain this fictional ‘perfect regex’, the unfortunate truth is that it would also return so many false matches that the output would no longer be useful. Regex-based solutions also require a lot of work to maintain, as the expressions constantly need to be tweaked to account for new patterns and false matches. This task coincidentally happens to be one of the least favourite developer chores out there.
Today’s highly connected world requires international solutions dealing with regional differences in standardization. For example, a German address could be ‘Eugen-Schoenhaar-strasse 21, 10423 Berlin’. The house number is written after the street name, whilst ‘strasse’ (street in German) is joined to the street name in one long compound word, because Germans love those. Postal codes are another good example. Dutch and Canadian postal codes are both made up of 6 characters. A Dutch postal code, however, is 4 digits followed by 2 letters (e.g. 1234AY), whereas a Canadian postal code alternates letters and digits (e.g. K1A 0B1). An Australian postal code, on the other hand, is only 4 digits.
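A rough sketch of what locale-specific patterns look like in practice follows. The three expressions are simplified assumptions for illustration (a Dutch code as 4 digits plus 2 letters, a Canadian code as alternating letter and digit, an Australian code as 4 digits); the real rules in each country are stricter. `POSTAL_PATTERNS` and `matching_countries` are hypothetical names.

```python
import re

# Simplified, illustrative patterns; each allows an optional space
# where the format is commonly written with one.
POSTAL_PATTERNS = {
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
    "AU": re.compile(r"\d{4}"),
}

def matching_countries(code):
    """Return the country keys whose pattern matches the whole code."""
    candidate = code.strip().upper()
    return [
        country
        for country, pattern in POSTAL_PATTERNS.items()
        if pattern.fullmatch(candidate)
    ]

print(matching_countries("1234AY"))   # ['NL']
print(matching_countries("K1A 0B1"))  # ['CA']
print(matching_countries("2000"))     # ['AU']
print(matching_countries("10423"))    # []
```

Three countries already need three expressions, and the 5-digit German code from the address above matches none of them. Each new locale means another pattern to write, test, and maintain.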
In summary, the real world is a very complicated place, with even simple problems containing many different variations and edge cases. Building a system that performs reliably and delivers a level of accuracy high enough for production applications requires handling each of these variants and edge cases, such as phone numbers containing letters. And that credit card number example? Well, it turns out not all credit cards are 16 digits long (https://en.wikipedia.org/wiki/Payment_card_number)!
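The credit card case can be reproduced in a short sketch. The pattern below is one possible reading of the regex described earlier (16 contiguous digits, or four hyphen-separated groups of four), and the sample values are dummy numbers chosen for their length, not real cards. The name `CARD_16` is hypothetical.

```python
import re

# Naive pattern: 16 contiguous digits, or four groups of four digits
# separated by hyphens.
CARD_16 = re.compile(r"\b(?:\d{16}|\d{4}(?:-\d{4}){3})\b")

samples = [
    "4111111111111111",     # 16 digits, contiguous
    "4111-1111-1111-1111",  # 16 digits, hyphenated
    "378282246310005",      # 15 digits, the length American Express uses
]

found = [s for s in samples if CARD_16.search(s)]
missed = [s for s in samples if not CARD_16.search(s)]

print(found)
print(missed)
```

The 15-digit value slips through untouched. Widening the pattern to accept more lengths would catch it, at the cost of matching many more digit strings that are not card numbers at all.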
Humans can easily distinguish between the above examples to determine what is or is not a phone number, an address, etc. We do this by looking at the number, but also the context (e.g. “call 911!” and “gosh that’s a nice 911”). Unlike regexes, AI models understand context and are therefore able to understand text more like humans do.
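A pattern, by contrast, sees only the characters. In this small sketch, a word-bounded search for ‘911’ (the name `NINE_ONE_ONE` is hypothetical) treats both sentences identically:

```python
import re

# Match '911' as a standalone token.
NINE_ONE_ONE = re.compile(r"\b911\b")

sentences = [
    "call 911!",               # emergency number
    "gosh that's a nice 911",  # a car
]

hits = [s for s in sentences if NINE_ONE_ONE.search(s)]
print(hits)  # both sentences match
```

Nothing in the match itself says which sentence refers to a phone number. That information lives in the surrounding words.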
State-of-the-art AI systems (like our models at Private AI) are trained on large amounts of carefully annotated data and meticulously revised to account for all these edge cases and locale-specific differences. In Private AI’s case, we reach >97% in-domain accuracy, which we have found to be higher than human performance in most settings.
AI systems also scale better and are easier to maintain. Integrating a change into a regex-based system requires careful analysis of the existing expressions, which become more and more difficult to comprehend as one adds to them. Every change must avoid affecting existing expressions while still capturing all the possible permutations of a term. It is common to ‘fix one thing, break another’, and a change often takes a few iterations in production before it is successfully made. AI systems, on the other hand, only require some extra training data.
In the past, one drawback of AI systems was the need for a large amount of data to train them. However, modern techniques allow AI systems to be developed with just a fraction of the training data that was previously required. For example, Private AI’s solution can learn to generalize well from as few as 10 examples.
AI-based systems were also held back by the tremendous amount of computing power they required – far more than regex-based systems. This resulted in large cloud bills and difficulty integrating models into edge applications such as mobile apps or desktop applications. AI is also rapidly advancing in this area and the latest techniques enable large reductions in compute resources. At Private AI, we have spent a large amount of time optimizing our solution to the point where it is now 25x faster than BERT large, a popular NLP architecture, whilst also surpassing its performance.
Regexes have served as an integral part of computing systems for decades, and will continue to do so. AI-based systems, however, offer new levels of performance by understanding context similarly to humans. For unstructured real-world applications like detecting phone numbers in text, these new techniques enable a wide range of new, high-quality production applications.