What It Really Takes to Build An AI System: It’s more complicated than many think


Today we live in a world of unprecedented open-source code, with companies such as Google and Facebook open-sourcing their internal AI tooling. This was previously unheard of, and there are now plenty of resources promising to show you how to quickly and easily build an AI system.

There’s a saying that ‘the last 20% of the work takes 80% of the time’, and nowhere is that more true than when building an AI system.

A massive amount of work is still required to develop real-world AI applications. The quality and reliability demanded by production deployments, and the effort needed to get there, are frequently underestimated…even by experienced developers and managers.

I once worked on a Traffic Sign Recognition (TSR) system for an automaker, the kind of project that comes up frequently in the classic ‘Build vs. Buy’ debate. I’ll draw on that experience throughout this article, as it illustrates both why it takes so much effort to build an AI system and why purchasing an existing solution is often the better choice.

Let’s dive in!

Edge cases…edge cases are everywhere

Data is the number one consumer of time and money. You may not have known this, but it stems from a consistent underestimation of the complexity of the real world, and of how many edge cases there are for even the simplest tasks.

During my TSR project in Europe, we encountered all sorts of surprises, such as LED highway signs. These signs, in addition to looking completely different from normal signs, are difficult to capture with a camera (try filming a computer screen).

Traffic signs under varied lighting conditions, with stickers and signs of age. Images by Pieter Luitjens

Even when signs are perfectly visible in good conditions, they can be tricky to identify amongst all the noise. For example, trucks in Europe have speed limit stickers on the back that are identical to roadside signs but that indicate how fast they are allowed to drive. Things also get tricky at highway intersections, as exit speed limit signs can be perfectly visible from the highway itself. And what if the sign is covered in snow, which just so happens to be the same colour as most traffic signs?

The complexity of the real world isn’t limited to Computer Vision; I recently wrote a complementary article on regexes in the real world.

A good dataset is hard to find

Lots of models are published and open-sourced, but the datasets they are trained on for production applications are usually kept under lock & key. Some data (like credit card numbers) are especially hard to obtain. In fact, a ‘data moat’ is the main competitive advantage of many AI companies.

But what about all of those juicy datasets researchers use, you might wonder? Unfortunately, production applications don’t match up neatly with research tasks. And even if they did, research datasets usually don’t allow for commercial use (e.g. ImageNet). It’s also common to have a lot of labelling errors in research datasets, preventing the development of high quality models. A good example is Google’s OpenImages object detection dataset. Consisting of 1.7 million images with 600 different classes labelled, it could be useful for training object detection models. Unfortunately, the training split has less than half the labels per image that the validation split does, which would imply that a significant number of examples aren’t labelled.
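That kind of split imbalance is cheap to catch before you train anything: just compare label density across splits. Here’s a minimal sketch in plain Python; the annotation dicts, class names, and threshold are made up for illustration.

```python
from statistics import mean

def labels_per_image(annotations):
    """Average number of labels per image in a dataset split."""
    return mean(len(objects) for objects in annotations.values())

# Hypothetical annotation dicts: image_id -> list of labelled objects
train = {"img1": ["car"], "img2": ["sign"], "img3": []}
val = {"img4": ["car", "person", "sign"], "img5": ["sign", "truck"]}

ratio = labels_per_image(train) / labels_per_image(val)
if ratio < 0.5:
    print(f"Warning: train has only {ratio:.0%} of val's label density")
```

A lopsided ratio like this doesn’t prove the training split is under-labelled, but it’s exactly the kind of red flag worth investigating before spending GPU hours.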

Datasets for TSR also fall prey to these issues. Freely available TSR datasets don’t allow for commercial use, contain too few examples to be of any real use, and are marred by significant labelling errors. Additionally, they only use examples captured in good lighting conditions in one country. And cars have a pesky habit of travelling into new jurisdictions with different traffic laws and different traffic sign designs.

Creating a custom dataset for an AI system is expensive and time-consuming

Why not create your own dataset for your AI system, you say? Well, let’s have a look at that. The first step is to decide on labels/outputs and collect data, making sure every single edge case is captured. Then it’s important to make sure you have good validation and test sets that provide a reliable, balanced snapshot of your performance.
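As a rough illustration of that splitting step, here’s a minimal stratified split in plain Python, so each class keeps the same proportions in every split. The class names and fractions are hypothetical; in practice you’d also want to stratify across conditions like lighting and country.

```python
import random
from collections import defaultdict

def stratified_split(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (sample, label) pairs so each class appears in the same
    proportions in the train, validation, and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in samples:
        by_class[label].append((sample, label))
    train, val, test = [], [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        val += items[:n_val]
        test += items[n_val:n_val + n_test]
        train += items[n_val + n_test:]
    return train, val, test

# Hypothetical sign images, heavily skewed toward one class
data = [(f"img{i}", "speed_limit") for i in range(80)] + \
       [(f"img{i}", "stop") for i in range(80, 100)]
train, val, test = stratified_split(data)
```

Without stratification, a random 10% split of data this skewed can easily end up with almost no examples of the rare class in your validation set, which makes your metrics on it meaningless.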

Next comes the data hygiene and formatting, which can take a lot of time. It’s very important to get this step right. Transformer models, for example, suffer a surprisingly large drop in performance when this step isn’t done correctly.
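To give a flavour of what ‘hygiene’ means here, a minimal text-cleanup sketch might look like the following. The exact steps are entirely task-dependent; this is an illustration, not a recipe.

```python
import re
import unicodedata

def clean_text(text):
    """Basic hygiene: normalize Unicode, drop control/format characters
    (e.g. zero-width spaces), and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C") or ch in "\t\n"
    )
    return re.sub(r"\s+", " ", text).strip()

# Non-breaking space, tab, and zero-width space all normalized away
print(clean_text("Speed\u00a0limit:\t 60\u200bkm/h"))  # → Speed limit: 60km/h
```

Invisible characters like these are a classic source of subword-tokenizer weirdness, which is one reason transformer models are so sensitive to this step.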

For most tasks, the data then needs to be labelled. For the projects I’ve worked on, we’ve always built our own labelling tool or modified open source tools, as existing out-of-the-box tools never quite suit the task at hand. You’ll also need data infrastructure to manage, version and serve your new dataset.

Next, you’ll need to involve some humans to annotate your dataset

If you’re lucky, your data can be shared outside your organization and your task doesn’t require too much domain knowledge, so you might outsource annotation. If not, it takes a ton of work to hire and manage your new team of annotators. In either case, annotator training can also be substantial work, as most tasks require some domain knowledge and are typically more complicated than clicking on objects in an image. And since turnover in this type of role is high, you can expect to find yourself on that hamster wheel more than you’d expect. One of the best ways to support your annotators is an annotation guide they can read before you jump into the annotation-and-feedback training cycle. Creating the guide itself is a lot of work: many labels are ambiguous unless defined carefully, an exhaustive list of examples often has to be included, and a living FAQ section must be kept up to date as you discover that more and more clarifications are needed to account for the variety of ways humans can understand a single concept.

Finally, it’s important to verify your process to ensure it maintains a high quality of output

Annotators also need to label edge cases consistently for the model to work well. For example, at Private AI we’re frequently confronted with thousands of tiny questions on what constitutes sensitive information. For example, “I like Game of Thrones” probably isn’t going to identify someone, but “I like David Lynch’s 1984 rendition of Dune” narrows things down a bit.

In summary, whilst data annotators can be found quite cheaply, a large amount of valuable dev/management time is required to construct a dataset. As an alternative, you can go to services like Amazon’s Mechanical Turk to outsource part of the process. In my experience however, these services are quite expensive and don’t deliver high quality labels. On top of this, in real projects, the requirements/specifications usually change. This means going over the data multiple times as internal and external requirements (like data protection regulations) change.

The process of building a dataset for your AI system has also gotten harder over the last 5 years. The TSR project I worked on was pre-GDPR, and nowadays privacy is a must when collecting data.

Model Stuff

You’ve got your data. Now what?

Now we’ve arrived at the most visible part of the process: building the model. We can use the plethora of open-source solutions out there, but there’s typically a lot of work to be done fixing small bugs that impact accuracy, accounting for the large variety of possible real-world input types, ensuring the code works as well as it can given the new data and labels you’ve added, etc. A while back I wrote my own MobileNet V3 implementation, as none of the implementations I could find matched the paper — not even the keras-applications implementation. Similarly at Private AI, getting state-of-the-art models to run at 100% of their capacity has been a lot of work. You also need to make sure that the code allows for commercial use — this typically knocks out a lot of research paper implementations.

A production system frequently relies on a combination of domain-specific techniques to improve performance, which requires integrating a bunch of different codebases together. Finally everything should be tested, something that open-source code is usually light on. 

After all, who likes writing tests?

Deploying your AI system

So you’ve gotten the data and you’ve built your model — now it’s time to put it into production. This is another area open-source code is usually light on, even though things have gotten significantly better in the past few years. If your application is to run in the cloud, this can be quite simple (just put your Pytorch model into a Docker container), but that comes with a caveat: running ML in the cloud can get really expensive. Just a few GPU-equipped instances easily cost tens of thousands per year to run. And you’ll typically run in a few different zones to reduce latency.
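A quick back-of-envelope shows how those numbers add up. The hourly rate and instance counts below are illustrative assumptions, not quotes from any cloud provider.

```python
# Back-of-envelope GPU serving cost. The $3/hour rate and the instance
# counts are assumed for illustration only.
hourly_rate = 3.0          # USD per GPU instance-hour (assumed)
instances_per_zone = 2     # assumed, for redundancy
zones = 3                  # run in multiple zones to reduce latency
hours_per_year = 24 * 365

annual_cost = hourly_rate * instances_per_zone * zones * hours_per_year
print(f"${annual_cost:,.0f} per year")  # → $157,680 per year
```

Even with modest assumptions, always-on GPU serving lands in the six figures quickly, which is why inference cost optimization matters so much.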

Things get significantly more complicated when integrating into mobile apps or embedded systems. In these situations you’re usually forced to run on CPU due to hardware fragmentation (I’m looking at you, Android) or compatibility issues. That TSR project I worked on required all code to be written according to a 30-year-old C standard and had to fit in just a few megabytes! The use of external libraries was also precluded due to issues surrounding safety certification.

In any case, model optimization is usually necessary. The trouble is that Deep Learning inference packages are at a much lower state of readiness, and much harder to use, than training tools such as Tensorflow or Pytorch. Recently I converted a transformer model to Intel’s OpenVINO package, only to find that Intel’s demo example no longer worked with the latest version of Pytorch, so I had to go into OpenVINO’s source code and make some fixes myself.
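To give a flavour of what these inference toolchains do under the hood, here’s a toy sketch of symmetric int8 post-training quantization in plain Python. Real packages like OpenVINO or TensorRT do this with calibration data and far more care; this just shows the core idea of trading a little precision for smaller, faster weights.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]
    using a single per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]

weights = [0.51, -1.27, 0.003, 0.89]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Note how the tiny weight 0.003 rounds to zero entirely: quantization error is not uniform, which is why models always need re-validation after conversion.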

Real-world applications also involve more than just running an AI model. There’s normally a lot of pre- and post-processing required, all of which also needs to be productionized. In particular, integration in an application may require porting to the application language (like C++ or Java). On that TSR project, a large amount of code was required to match the detected signs together with the navigation map.

Finally it’s worth noting that people with expertise in this area are REALLY hard to find.

Ongoing Tasks

So, we’re at the finish line! Your application is now in production, doing its thing sorting/identifying/talking with widgets. Now comes the ongoing maintenance.

Like any piece of software, there will be bugs and model prediction failures. In particular (and despite your best efforts), there will be plenty of work to do in collecting the data needed to fill in the edge cases that were missed during the initial data collection phase. The world we live in isn’t static, so data needs to be continually collected and put through the system. A good example is Covid-19. Try asking any pre-2019 chatbot what that is.
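One simple drift signal is the fraction of live-traffic tokens the model never saw at training time; a word like “covid-19” would have tripped exactly such a check on any pre-2019 chatbot. Here’s a minimal sketch with made-up utterances (a real monitor would track this over rolling windows and alert on a threshold):

```python
def new_token_rate(reference_texts, live_texts):
    """Fraction of tokens in live traffic never seen in the training-time
    reference corpus -- a crude but useful drift signal."""
    known = set()
    for text in reference_texts:
        known.update(text.lower().split())
    live_tokens = [t for text in live_texts for t in text.lower().split()]
    unseen = sum(1 for t in live_tokens if t not in known)
    return unseen / len(live_tokens)

# Hypothetical training-time utterances vs. post-2019 live traffic
reference = ["book me a flight", "what is the weather"]
live = ["what is covid-19", "book me a flight"]
rate = new_token_rate(reference, live)
```

A rising rate doesn’t tell you *what* changed, but it tells you *that* something changed, which is your cue to collect and label fresh data.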

Finally, whilst not strictly necessary, it’s good practice to periodically evaluate and integrate the latest research advancements.

So, that’s what it really takes to build an AI system

As you can see, building a complete system typically takes a team with diverse specialities: data science, model deployment, and application domain expertise. There remains enormous demand for these skills in 2021, meaning that building up a team can be a very costly exercise. Complicating the matter further is staff turnover, which could leave the system your company just spent a large amount of time and money building suddenly unmaintainable, presenting a very real business risk.

So hopefully this helps you approach your ‘buy vs build’ decision armed with more info. It’s considerably more complicated than ‘oh lets get model X and switch it on’. I’ve seen firsthand and heard many accounts of companies not batting an eyelid at giving hundreds of thousands per year to Amazon/Microsoft/Google for cloud computing, despite 3rd party solutions offering a fraction of the total cost of ownership. If you decide to build yourself, make sure you have a lot of contingency! And consider all the costs like cloud compute, hiring & management.

And that TSR application? I can say I was quite proud of how well our system worked, but it required many, many decades of developer time to achieve.

Join Private AI for more discussions on building vs. buying an AI system on LinkedIn, Twitter, and YouTube.

Or, book a call to schedule a live demo.

