MLOps & Machine Learning Deployment at Scale

In the latest episode of Private AI’s ML Speaker Series, Patricia Thaine (CEO of Private AI) sits down to chat about MLOps and Machine Learning Deployment at Scale with Luke de Oliveira from Twilio.

Luke de Oliveira is the Director of Machine Learning at Twilio, working at the intersection of technology and product. He is building Twilio Intelligence, a brand new product that helps Twilio's customers act on and automate structured intelligence from their conversations. He was previously CEO & Founder of Vai Technologies, which was acquired by Twilio. He also angel invests in early-stage startups and enjoys helping founders wherever he can.

In a past life, he was active in the academic community, working at the intersection of Deep Learning and High Energy Particle Physics. He held appointments at Lawrence Berkeley National Laboratory (LBNL), SLAC, and CERN.

If you missed the last session, scroll down to find a recap of Patricia and Luke’s chat or watch the full session below.

Watch the full session:

PAI: What made you move from working on particle physics at places like CERN to working on AI and product development?

Luke: Great question. I'm super excited to be here, Patricia. Thank you so much for having me. It's actually funny: in my mind, it's not too much of a shift. When I was working at CERN and at SLAC on, ultimately, particle physics applications, everything I was doing was AI centric. Whether it was using GANs to accelerate simulation processes or doing any sort of forecasting, everything I was doing had machine learning at its core. That's ultimately what my specialty has always been. So, after getting bitten by the entrepreneurial bug, the extension was actually fairly natural: going from working on AI for science to using scientific rigor to approach AI product development. That's kind of where I find myself today.

PAI: Tell us about Twilio Intelligence – what excites you about it?

Luke: Yeah, absolutely. You can think about Twilio Intelligence as the intelligence layer on top of Twilio. If you're not familiar, Twilio provides API-level, programmable access to communication channels across the world. Anytime you get a notification from your ride-share service, or you have to call your delivery driver to coordinate where to pick up your takeout, that is usually mediated through Twilio via programmatic communications. And as we move closer and closer to how communications directly impact the business, our customers are pulling us toward making sure we have what we like to think of as intelligence building blocks, which not only enable our customers to have these communications with their customers, but to act on what their customers are saying.

So when someone requests to be called back, when someone complains, or when someone mentions something negative about a past experience they had when they called in, for example, we want our customers to be able to act on that semantic meaning automatically and drive their business directly off of it. Twilio Intelligence is providing the programmatic building blocks that enable what I'll call the semantic customer layer that sits on top of communications.

PAI: Being at Twilio, you have to deal with deploying at a massive scale. Can you give us an idea of your largest deployment thus far?

Luke: There's no such thing as the largest deployment. I'll break this apart along two axes, and we'll probably get into this more later. There are two ways you can end up in challenging deployment scenarios with machine learning. Scenario number one is massively multi-tenanted: when you have hundreds of thousands of customers like we do at Twilio, potentially each with their own model, you need to be able to serve any of them at any point in time.

There's also the other end of this, where you have one model, or some countably small number of models, that needs to be served across all customers. These are two very different types of scaling characteristics. Just to give you a sense of the scale, Twilio has 200 to 300 something thousand active developer accounts, and there are machine learning models running for every single account on the platform. Then, in Twilio Intelligence alone, we've done on the order of billions of transactions per year through state-of-the-art models like BERT and T5 to provide that structured information for customers.

So without getting into the specifics, we have both of these vectors, if you will: massively multi-tenanted ML deployments, as well as every single customer and every single interaction that happens on Twilio (a very large number) needing to go through a small set of machine learning models that are integral to the business. So we really have both of these competing axes of deployment requirements.
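To make those two axes concrete, here is a deliberately simplified, hypothetical sketch of a routing layer that resolves a request either to a per-tenant model or to a single shared model. All names, paths, and the in-memory caching are illustrative assumptions, not a description of Twilio's serving stack.

```python
# Hypothetical sketch of the two serving axes described above; not Twilio's code.
from functools import lru_cache


class ModelRouter:
    """Resolves which model artifact should serve a given account."""

    def __init__(self, shared_model_uri: str):
        # Axis 2: one (or a handful of) shared models serving every account.
        self.shared_model_uri = shared_model_uri

    @lru_cache(maxsize=10_000)
    def _tenant_model_uri(self, account_sid: str) -> str:
        # Axis 1: massively multi-tenanted, potentially one model per account.
        # Here we only pretend to look the artifact up in object storage.
        return f"s3://models/tenants/{account_sid}/latest"

    def resolve(self, account_sid: str, per_tenant: bool) -> str:
        return self._tenant_model_uri(account_sid) if per_tenant else self.shared_model_uri


router = ModelRouter(shared_model_uri="s3://models/shared/intelligence/latest")
print(router.resolve("AC123", per_tenant=True))   # tenant-specific artifact
print(router.resolve("AC123", per_tenant=False))  # one shared model for all traffic
```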

PAI: Can you walk us through a typical model deployment process for you and your team? How much is automated vs. manual?

Luke: Yeah. I can give you an example from Twilio Intelligence, specifically for speech models. As part of Twilio Intelligence, one of the things we offer customers is the ability to transcribe phone calls, to turn voice calls into text. To be able to train these models, we run a very large data acquisition pipeline that collects ground truth from customers who have opted in to have their data transcribed to improve our models. Then, as we get into model development, there are what I'll call three phases.

There's recipe development: figuring out what modeling approaches we might need to take for a specific problem we want to solve. Let's say we need to improve our models' robustness to background noise; that looks like recipe development. Then there's a middle step, which is training and monitoring. This is launching at scale (like scaling on Kubernetes), actually running our models, training them over the large corpuses of data that we have, evaluating them, and seeing how they do. And then there's a final stage, which is deployment, and it's very hard to fully automate that last step.

A lot of what we do there is our own smoke testing, but we also do a lot of manual spot checking to make sure that the models we're producing at the end of the day retain our customers' accuracy expectations and trust. We're asking them to run their business-critical applications on us, so we owe it to them to spot check and make sure things are up to the quality bar we expect from Twilio.

So really there are those three phases. And in general, for a model release, we are always actively trying to drive down the time it takes to go from an idea or customer need to something a customer can directly use, deploy, or integrate with. We've generally seen that go down; for some types of models it's now on the order of minutes. For other types of models, it's still on the order of days, because we train on hundreds of GPUs and it takes a while. But in general, we're always trying to drive down that time from "hey, this is a need that we need to meet" to "hey, here's a model that's available to our customers".
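As a rough illustration of those three phases, the sketch below lays out a minimal Kubeflow Pipelines (kfp v2) style release pipeline with placeholder component bodies. The component names, metrics, and file paths are assumptions for illustration; the real recipe development, training at scale, and manual spot checking described here are far more involved.

```python
# Minimal kfp v2 sketch: training/evaluation, then staging for manual spot checks.
# All component bodies are placeholders, not Twilio's pipeline.
from kfp import compiler, dsl


@dsl.component
def train_model(config_uri: str) -> str:
    # Placeholder: launch training over the prepared corpus
    # and return the URI of the resulting model artifact.
    return f"{config_uri}/model"


@dsl.component
def evaluate_model(model_uri: str) -> float:
    # Placeholder: run the evaluation suite (e.g. WER for speech models).
    return 0.12


@dsl.component
def publish_candidate(model_uri: str, metric: float):
    # Placeholder: stage the model for manual spot checking before release.
    print(f"candidate {model_uri} staged with metric {metric}")


@dsl.pipeline(name="speech-model-release")
def release_pipeline(config_uri: str):
    trained = train_model(config_uri=config_uri)
    evaluated = evaluate_model(model_uri=trained.output)
    publish_candidate(model_uri=trained.output, metric=evaluated.output)


if __name__ == "__main__":
    compiler.Compiler().compile(release_pipeline, "release_pipeline.yaml")
```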

PAI: What does CI/CD look like for ML models? How is it different from CI/CD for normal software?

Luke: Yeah, that's a great question. I think there are some interesting similarities and differences to traditional software development when it comes to CI/CD. On the continuous integration side, it's no longer just code and unit tests. It branches from the world of MLOps, where you care about your models, into the world of DataOps, where you care about schemas, you care about migrations, and you care about the downstream things that depend on you. Are you breaking schemas for someone who needs to consume your model output via an API? So CI expands beyond unit tests to cover almost the whole surface area, the exoskeleton, if you will, of the model you're deploying.

In terms of continuous deployment, you're not producing just a system. You're producing a system which itself produces systems. When you deploy a new version of a machine learning pipeline, that's not itself a self-contained software artifact; it's a software artifact that produces software artifacts, if you will. So there's an additional layer of complexity that comes into it, which makes it even more fun. Models are inherently stochastic; the process by which we reach a trained model is inherently stochastic. So a lot of the deterministic testing procedures that we've all gotten used to in basic unit testing and functional testing don't tend to have a natural analog in ML.

There are some things you can do to get around that, such as fitting to batches, or having golden sets that you treat as a quality bar. But ultimately, there's no single right answer for what some of these analogs look like. One interesting addition to CI/CD is what I've heard called CT, continuous training, where you're integrating with new schemas and new data production procedures.
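Two of those workarounds can be sketched as pytest-style checks: a "can the model fit a single batch?" sanity test, and a golden-set quality bar. The PyTorch toy model, data, and thresholds below are stand-ins, not an actual production test suite.

```python
# Hedged sketch of two ML-specific CI checks, as pytest-style tests with PyTorch.
import torch
from torch import nn


def test_model_can_fit_one_batch():
    """Stochastic but reliable sanity check: loss on a single batch should drop."""
    model = nn.Linear(16, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

    first_loss = loss_fn(model(x), y).item()
    for _ in range(200):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # A tiny batch should be easy to overfit; if not, something is broken.
    assert loss.item() < first_loss * 0.5


def test_golden_set_quality_bar():
    """Block the release if accuracy on a fixed, hand-labelled set regresses."""
    golden_predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # placeholder model output
    golden_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]       # fixed reference labels
    accuracy = sum(p == t for p, t in zip(golden_predictions, golden_labels)) / len(golden_labels)
    assert accuracy >= 0.85  # assumed quality bar
```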

Yes, you're deploying new systems that are themselves deploying new models. But how do you actually get the new model out into the real world? That's this concept of continual learning, or live training, which a lot of people talk about but not a lot of people do. We ourselves at Twilio don't do it; we have enough quality checks that we do manually that we like to have a human in that process. But there are places that do fully continuous training, especially in the recommender system domain, triggered with some sort of regularity as part of the CI/CD process, to reach the final end state of what the fully deployed system looks like.

PAI: What sort of challenges do you encounter when deploying models at such a massive scale?

Luke: I think one of the things that becomes a lot more challenging at large scale, and this is true of any company that operates at internet scale, is that things that are very rare happen a lot. When you're dealing with millions and millions of things per hour, things that happen with way less than 1% probability still happen maybe 100 times, and that's not insignificant. So having your testing and the verifiability of your systems be robust is certainly a challenge. I'm not saying we have a solution for it; we certainly don't. But that's something that inherently comes up. The other interesting thing that comes with scale is rollback, especially when you have artifacts that are dispersed, potentially over a very large Kubernetes cluster.

Do you rely on Kubernetes-native infrastructure for rollback? Do you have other systems that you impose on top of that? Do you do customer-specific rollback? There are a number of different things you might want to consider once you're at what I'll call planet scale. But in general, I think the hardest thing is that rare mistakes happen all the time once you're at planet scale, and having a deep understanding of when those mistakes are okay and why they're okay is honestly sometimes the hardest part of being on an ML team, as I'm sure you're highly aware coming from your space.
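Customer-specific rollback in particular can be thought of as keeping a per-account pointer to a model version, separate from whatever fleet-wide rollback Kubernetes gives you. The sketch below is purely illustrative (hypothetical names, in-memory state), not a description of Twilio's rollback machinery.

```python
# Illustrative per-customer rollback via version pinning; not a real system.
from collections import defaultdict


class TenantModelVersions:
    def __init__(self, default_version: str):
        self.default_version = default_version   # fleet-wide current version
        self.history = defaultdict(list)         # per-account version history
        self.pinned = {}                         # accounts pinned to an older version

    def promote(self, account_sid: str, version: str):
        self.history[account_sid].append(version)

    def active_version(self, account_sid: str) -> str:
        if account_sid in self.pinned:
            return self.pinned[account_sid]
        versions = self.history[account_sid]
        return versions[-1] if versions else self.default_version

    def rollback(self, account_sid: str):
        """Pin one affected customer to their previous version, leaving the fleet alone."""
        versions = self.history[account_sid]
        if len(versions) >= 2:
            self.pinned[account_sid] = versions[-2]


registry = TenantModelVersions(default_version="v41")
registry.promote("AC123", "v42")
registry.promote("AC123", "v43")
registry.rollback("AC123")
print(registry.active_version("AC123"))  # -> v42, while other accounts stay put
```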

PAI: Are there any mistakes that happened at Twilio that you could give as general examples of what ML teams should feel comfortable or not comfortable making?

Luke: Honestly, I think it's hard to give a blanket answer, at least for Twilio. Twilio is a very customer-centric company. For us, it really comes down to being ridiculously honest with our customers when things aren't working. We encourage our customers to benchmark things themselves and verify everything we're saying, and we generally don't purport to have perfect models, or to have some threshold we deem perfect after which you're all done. We're more interested in treating it as something to continuously improve on. So I think, especially at the beginning, it's always okay to be okay with mistakes.

The most important thing is closing the loop on mistakes and having data be a central part of what you do, and having humans in the loop pretty much everywhere at the early stages, to make sure you can act on those mistakes. And yeah, there will be some mistakes that are just unavoidable. Unfortunately, that's just a matter of, in our case, customer education, or, in the case of maybe a consumer application, being more resilient in your UX so that customers don't feel the ramifications of a model mistake as much. But ultimately, in my mind, it's less about a target or a specific type of mistake, and more about having the processes and mechanisms in place to be able to improve.

PAI: What kind of tools do you rely on the most for this? 

Luke: Yeah, I think this is an interesting question. We've had a number of evolutions ourselves internally as the field has evolved. The latest incarnation of how a lot of our infrastructure runs is very much on top of Kubeflow, so we're all in on Kubernetes for how we run machine learning at Twilio. This matters because we have very spiky workloads. One day there might be an ad campaign that goes out somewhere, and all of a sudden we see a 100x increase in traffic; we need to be cost efficient for that, and these models obviously require expensive hardware.

So having an abstraction on top of Kubernetes that gives us extensible experiment containers, which let us build our own business logic in as well, has proven pretty critical for us; we're fairly all in on Kubeflow there. Then, in terms of ML primitives, we run a lot on PyTorch. We've found, especially for NLP, that this lets us take advantage of great advances from companies like Hugging Face, so we don't have to build everything ourselves and can leverage the best of what's out there to move quickly for our customers. Ultimately, the tools we use are just in service of us moving quickly for our customers.
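As a small example of what leaning on that ecosystem looks like in practice, the snippet below runs inference with the Hugging Face transformers pipeline API on top of PyTorch. The model is a small, publicly hosted sentiment model, not one of Twilio's production models.

```python
# Minimal Hugging Face + PyTorch inference example (public model, illustrative only).
from transformers import pipeline

# Downloads a small, publicly hosted sentiment model on first use.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

utterance = "I waited forty minutes and nobody called me back."
print(classifier(utterance))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```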

PAI: Do you monitor models post-deployment and, if so, for what?

Luke: That is an extremely good question. I think the correct answer is that I always wish I was monitoring more; I think that's always the right answer here. Because of the inherently qualitative nature of language, it's hard to quantify when something feels off. We do a lot of manual spot checking. We actually have our own team, trained directly by our machine learning team, that is able to directly synthesize the mistakes our speech models and NLP models are making and tie together trends that are actionable for our ML team to go and act on. So that is one of the most critical ways in which we monitor our models.

Of course, we do have small ways of trying to detect drift. I don't think they're perfect; I think drift detection is easier said than done in many applications, especially in NLP. But one of the ways we've tried to get around something like concept drift, which is inherent in conversation, is simply to retrain with some regularity. In terms of other monitoring, not so much on the accuracy side but on the throughput side, we make sure that new versions of algorithms and new architectures that our team is trying out still satisfy the throughput requirements we have for our system. We are partially cost sensitive.

We're not only cost sensitive, but when someone comes up with a new approach that may yield, let's say, a 5% increase in accuracy, we want to make sure it's within some compute tolerance of where we are today, from how we measure our costs. So we do have some smoke testing around that, to make sure we're not running models that are suddenly 100x more expensive, which obviously wouldn't be great for us or for our customers. So, yeah, hopefully that gives you a sense of the spectrum across those two sides.
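A throughput smoke test of that kind can be as simple as timing a candidate model against the current baseline on a fixed batch and enforcing a cost tolerance. Everything in the sketch below (the models, batch size, and the 3x budget) is an assumption for illustration.

```python
# Toy throughput smoke test: a candidate model must stay within an assumed
# cost tolerance of the current baseline.
import time

import torch
from torch import nn


def latency_per_batch(model: nn.Module, batch: torch.Tensor, runs: int = 50) -> float:
    """Average forward-pass time in seconds over `runs` repetitions."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up call, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
    return (time.perf_counter() - start) / runs


baseline = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2))
candidate = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2)
)
batch = torch.randn(32, 256)

baseline_s = latency_per_batch(baseline, batch)
candidate_s = latency_per_batch(candidate, batch)

COST_TOLERANCE = 3.0  # assumed budget: the candidate may be at most 3x slower
assert candidate_s <= COST_TOLERANCE * baseline_s, (
    f"candidate is {candidate_s / baseline_s:.1f}x slower than the baseline"
)
print(f"candidate/baseline latency ratio: {candidate_s / baseline_s:.2f}x")
```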

PAI: I guess one drift that happened with NLP models is language change. Is that something that you measure? 

Luke: Yeah, we are definitely subject to that. I don't have any tried and true methods of detecting it, to say the least. But the way we've caught some of that concept drift is with certain customers who have been very generous in having us work very closely with their data, where they're launching new campaigns or new business initiatives that we've never seen before in the data. Our team will pick up on that through fairly sporadic spot checking, alert the customer, and work with them on a retraining plan. But yeah, ultimately vocabulary drift and concept drift are extremely, extremely hard to detect with any sort of surefire statistical regularity.

COVID-19 is the perfect example. One of the funny ones for us: we have a formatter that runs after our speech model and basically turns ungrammatical, unpunctuated text into something that's human readable. That was actually out of sync with our speech model for a little bit. We did fix it at the end of the day, but we didn't update it in lockstep with our speech model, so there was a period where our speech model knew about COVID-19 but our formatting model did not. Those things do happen. It's mostly about making sure they don't happen too often, and having things in place so you can at least try to detect some of that.
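One very lightweight signal for vocabulary drift, in the spirit of the COVID-19 example, is the rate of tokens the current vocabulary has never seen. The sketch below is deliberately naive (a hand-rolled tokenizer and a made-up alert threshold) and is not a description of Twilio's monitoring.

```python
# Naive vocabulary-drift signal: flag a spike in out-of-vocabulary tokens.
def oov_rate(transcripts, known_vocab):
    tokens = [tok.lower() for text in transcripts for tok in text.split()]
    if not tokens:
        return 0.0
    unseen = [tok for tok in tokens if tok not in known_vocab]
    return len(unseen) / len(tokens)


known_vocab = {"my", "flight", "was", "delayed", "please", "call", "me", "back"}
recent = [
    "my flight was delayed because of covid-19 restrictions",
    "please call me back about the quarantine policy",
]

rate = oov_rate(recent, known_vocab)
ALERT_THRESHOLD = 0.15  # assumed: alert if >15% of tokens are out of vocabulary
if rate > ALERT_THRESHOLD:
    print(f"possible vocabulary drift: OOV rate {rate:.0%}, flag for spot checking")
```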

PAI: I would love to learn a little bit more about your work at research organizations like CERN and how that differs from the industry work you’re doing at Twilio. 

Luke: Yeah, it's a good question. I think it's less about the models per se, and more about the environment. When you're at a place like CERN, or any lab that is ultimately in service of science, you care more about the why than the what. So a lot of what you care about when you're developing new architectures, new approaches, new datasets, or basically doing anything, is understanding whether you see a performance improvement and why you see it. Because ultimately what you care about isn't the performance improvement; it's understanding the science behind why you can now get that performance improvement. I think that's a pretty fundamental difference from most of what I'll call product or business applications of machine learning.

Obviously, in industry you do care about the why, but it's not goal number zero, whereas in science it's ultimately the only goal; everything is in service of the question why. That also drives a lot of the model selection you end up doing and a lot of the studies you end up running on your model. You care a lot more about, for example, reweighting specific parts of your input space to understand sensitivity effects, or really focusing on understanding robustness to outliers and to certain features that are much more sensitive to modeling issues in the downstream applications the model is going to end up being used for.

So you have a lot more system intricacies that really affect the pursuit of knowledge you care about at a very, very deep level. Not to say that doesn't happen in industry as well, but it's definitely a pretty big difference between the two.

PAI: What do you see as being the biggest potential to drive business value with users of Twilio Intelligence?

Luke: The biggest potential for me is that it fits into a broader trend we're seeing across the industry. If I tie this together with what we're seeing with Apple cracking down on companies like Meta that do third-party tracking, the way I view Twilio Intelligence, and Twilio more broadly, is that it sits in this position where customers own their first-party data and we enable them to act on the content of that first-party data. I can't think of a better source of business-driving meaning for a business to be able to act on.

Today, your customers are complaining about something, your customers are excited about something, and you can act on that; you now have the tools you need to tap directly into it. For me, that's the biggest mountain mover when it comes to Twilio Intelligence. It's part of this bigger shift away from third-party tracking toward first-party data, and toward holding customer trust as a core tenet of how we conduct ourselves in the world of business. I think that's a particularly unique slice, and that's why I'm so excited to go to work every day. So I think it's quite impactful.

PAI: What other tools did you consider when you chose Kubeflow?

Luke: The Kubeflow decision was an interesting one for us. We were evaluating this on two axes. The main thing Kubeflow gets you, on top of obviously managing a lot of the complexity of Kubernetes, is some very nice experiment management that comes built in for free. There are a number of other tools that do that; for us, it really came down to team expertise. We had a number of folks on the team who were very familiar with Kubeflow, and that helped us get off the ground very quickly. On the other side, for other parts of the product we've been evaluating, and actually do use, other orchestration tools as well, such as Temporal, which helps us with general data management on Kubernetes; it's not ML specific.

So Kubeflow helps us very much on the ML side, in terms of tracking and launching experiments where you track hyperparameters and results, and it serves our needs: we have expertise around it, and it enables us to be more efficient with our compute. That was ultimately enough for us to move the needle and invest, and it has a great community as well. The community aspect is very important when it comes to choosing open source.

PAI: Any recommendations, materials, blogs, etc., that help you keep pace with novel developments in the space, especially for those of us who aren't currently in an engineering seat?

Luke: This is a good question. Unfortunately, I stay very up to speed on ML through Twitter. In all seriousness, I think Twitter in general is a great resource to stay up to speed. I'd also say, especially if you are a bit more research or algorithmically inclined: don't read the entirety of arXiv every day, but do peruse a high-level summary of roughly what's going on.

I usually do that a couple of times a month. I'm not religious about it in any way, shape, or form, but I just try to get a sense of how things are moving. And then, in general, the blogs from each of the major labs produce great content. I suppose it's not FAANG anymore, it's MANGA, but MANGA plus OpenAI plus Hugging Face generally produce great content. Of course, it's important to keep the obvious agenda in mind when reading blogs like this and to understand that each is written within the context of one company, so it's important to read several and build your own perspective. But in general I've found these good for staying apprised, coupled with a healthy dose of academic Twitter to see where the field is moving.

PAI: Do you use anything such as W & B, DVC, etc. for training workflows?

Luke: Great question. On our team we don't, mostly because we have a lot of internal tooling that was built before a lot of these tools were mature. Funny note on W&B: I've heard multiple people pronounce it "wandy-b" or "wand-b" and I still don't know the canonical way of saying it, but a lot of the best researchers I know use W&B in their personal projects, as do people on our team. I think it is, in general, a great way to do model monitoring during development; we just don't use it ourselves. We also don't use DVC, though I have heard great things about it. So, unfortunately, I can't speak from firsthand, at-scale production experience with either.
