
AMA with Ben Wilson: planning ML projects, AutoML, and deploying at scale

Last updated:
November 27, 2024

We invite ML practitioners to share their experiences with the Evidently Community during the Ask-Me-Anything series.

This time, we talked with Ben Wilson, Principal Architect at Databricks. He is also the creator of the Databricks Labs AutoML Toolkit and the author of the Machine Learning Engineering in Action book.

We discussed the questions one should ask before implementing any ML project, AutoML use cases, deploying models into the wild, and how one can learn about ML deployment. 

Sounds interesting? Read on for the recap of the AMA with Ben.

AutoML

How would you define AutoML? What is it for you? What is it not?

Well, that's an incredibly loaded question. 

As a former practitioner DS, I'd define AutoML as a tool that helps me select an appropriate model family, provides some means of evaluating each of the selections against one another, does some hyperparameter tuning of each (within reason), and, most importantly, gives me a nice shiny output of all of the nitty-gritty details so that I can evaluate the benefits and drawbacks of each model type that was tested on my feature set. 

Fast forward a few years to now... 

AutoML is a vague, abstracted term for a framework of disparate, autonomous or semi-autonomous capabilities spanning data preprocessing, statistical validation, data cleansing, feature engineering, model selection, tuning, evaluation (cross-validation and final), tracking/logging, visualization generation, summary report generation, and a pipelined model artifact that can be integrated with CI/CD tools for either batch or real-time prediction or inference. A given tool can do all of those things, some of them, or many of them. Every tool that purports to be "AutoML" strives to do all of those (and more).

What should be the objective of AutoML? To provide working baselines? To "automate the boring tasks"? To enable non-ML experts to work with machine learning models? Something else?

The objective? Ask a dozen vendors, and you'll get a dozen different answers. Ask a dozen DSs, and you might get two. I'll give an attempt based on personal bias: 

1. I freaking hate tuning models by hand. Loathe it, actually. I find it among the most mind-numbingly boring work I've ever done professionally. Automating this process with a clever algorithm (I'm a huge fan of Optuna) makes me happy; there's a sketch of that pattern after these two points. 

2. If I'm doing evaluations for a project and I know that I want to test, say, six different model families for a regression task, I'd rather not write out the scripts to generate all of those with manual hyperparameter partial functions. I'm lazy, the work is repetitive and boring, and I've got better things to do with my time. So, AutoML is great for this. 
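
To make points 1 and 2 concrete, here is a minimal sketch of that pattern with Optuna and scikit-learn on a synthetic regression set: let the search algorithm choose both the model family and its hyperparameters, then compare every candidate on the same cross-validated metric. The model families, search ranges, and data are illustrative assumptions, not output from any particular AutoML product.

```python
# A rough sketch of "pick a model family and tune it" with Optuna + scikit-learn.
# Model families, search ranges, and the synthetic data are illustrative assumptions.
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

def objective(trial: optuna.Trial) -> float:
    # Let the search algorithm choose the model family...
    family = trial.suggest_categorical("family", ["ridge", "random_forest", "gbt"])
    if family == "ridge":
        model = Ridge(alpha=trial.suggest_float("alpha", 1e-3, 10.0, log=True))
    elif family == "random_forest":
        model = RandomForestRegressor(
            n_estimators=trial.suggest_int("rf_n_estimators", 50, 300),
            max_depth=trial.suggest_int("rf_max_depth", 2, 12),
            random_state=42,
        )
    else:
        model = GradientBoostingRegressor(
            learning_rate=trial.suggest_float("gbt_learning_rate", 1e-3, 0.3, log=True),
            n_estimators=trial.suggest_int("gbt_n_estimators", 50, 300),
            random_state=42,
        )
    # ...and compare every candidate on the same cross-validated metric.
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # the winning family and its tuned hyperparameters
```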

To answer the question that I think you're trying to ask: "Yo, Ben, have you ever used AutoML outputs in prod?" No. 

Maybe I'm a Luddite, maybe I have trust issues, or maybe I like having extensibility in my project code base, but I've never taken the output of AutoML and deployed it. This isn't to say you can't. I just like to design my code for extensibility and testability and be able to control for nuances in the business problem within the code. AutoML doesn't let me do this for the sorts of problems I've worked on in my career. 

Don't let me discourage you from using it. At the end of the day, we're in this profession to solve business problems, not use a particular set of tools or methodologies to solve them. Use what works for you and enables you to monitor the health of your deployed solution :)

Adopting machine learning

I need to provide predictions for my customer-facing application. It is classic tabular data. How would you proceed? XGB with a FastAPI wrapper and some caching using Redis or similar? What would be the AutoML-way to do it?

Not to be flippant, but... 

1. I don't care about the tools. You shouldn't either. 
2. What's the business problem that you're trying to solve? 

Some other questions that I'd ask before getting into implementation details (which is what your question addressed): 

3. What is the nature of the tabular data? 
4. How often is it updated? 
5. What is the statistical distribution of each feature in your data set? 
6. How stable are the features over time? 
7. How many users do you have? 

The reason for asking all these questions is that they inform the implementation: understanding the details of the problem you're trying to solve and the nature of the data you collect will let you know: 

  • Which model frameworks should I look at? What families of models should I test? 
  • How often do I need to retrain this sucker? 
  • What does my serving infra need to look like to meet SLAs, request volumes, etc.? 
  • How do I want to deploy this? What do I do when I need to retrain and deploy another model? Is this a bandit approach? Canary? Shadow? 

Once you understand your project requirements and have gone through all the aspects of reviewing these details, you'll have a good feel for what you need to research, test, and implement. 

Sorry for the non-answer ;)
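
That said, for readers who do want a picture of the specific stack named in the question once those requirement questions have been answered, here is a minimal sketch of XGBoost behind FastAPI with Redis as a prediction cache. The model path, request schema, and cache TTL are illustrative assumptions, not a recommendation.

```python
# A minimal sketch of the stack named in the question: XGBoost behind FastAPI,
# with Redis as a prediction cache. The model path, cache TTL, and request
# schema are illustrative assumptions.
import hashlib
import json

import numpy as np
import redis
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

booster = xgb.Booster()
booster.load_model("model.json")  # assumes a previously trained and saved model

class Features(BaseModel):
    values: list[float]  # flat feature vector, in training column order

@app.post("/predict")
def predict(features: Features):
    # Cache on a hash of the raw feature vector; pick a TTL that matches how fast your data moves.
    key = hashlib.sha256(json.dumps(features.values).encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return {"prediction": float(cached), "cached": True}
    prediction = float(booster.predict(xgb.DMatrix(np.array([features.values])))[0])
    cache.set(key, prediction, ex=3600)
    return {"prediction": prediction, "cached": False}
```

Run it with, for example, uvicorn app:app and POST a JSON body with a "values" list to /predict. Whether the cache earns its keep depends on how often identical feature vectors actually repeat, which is exactly the kind of thing the questions above are meant to surface.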

A couple of years ago, many enterprise companies adopting ML were stuck in "pilot hell," struggling to put models in production after spending months on model building. Do you believe big companies, in general, have become better at it? What are the most pressing issues that enterprise companies face in adopting machine learning?

Good Lord, that's a good question. 

I wouldn't say that we've moved out of "pilot hell." I think many companies are struggling to match effective solvable business problems to the relatively (to their company or industry, that is) nascent field of ML. 

If you select a business problem that's impossible or stupid to solve with ML, you'll always stay in the pilot phase. Regardless of how fancy the implementation is, how well it does against validation data, or how stable it is, if you don't have business buy-in that it's worthwhile, no one's going to care about it running (or ever getting to production). 

From my highly biased position in working with companies that are going down this path (I haven't talked or worked with all of them), the ones that have business buy-in and pursue maintainable ML practices (stuff is monitored, retrained, the project is handled like an Agile software process, etc.) are the ones that are deploying ML solutions regularly. They're also making A LOT OF MONEY from them—shocking amounts. 

The ones that don't trust ML or know what it's for are the ones that aren't getting much done and are retasking their DS teams to work as highly paid, partially competent DEs or Analysts. Even if they get a model into production, it usually doesn't stay there long (due to trust, degradation, lack of caring about the solution's value, or they're just trying to solve an impossible or dumb problem).

I see a lot of companies adopting SageMaker and similar cloud tools. Many acknowledge that the predictions could be more accurate. Still, they prefer to stay in a single ecosystem and get a single invoice, plus the benefits of tools that are made for developers and not for machine learners. Will we end up in a future where the market trades convenience and simplicity over accuracy and speed?

SaaS companies (full disclosure, I work for one of them) are striving to gain market share by offering one or both of two paths: 

  1. Catering to advanced ML practitioners by building framework infrastructure that allows ad-hoc selection of managed services covering the full MLOps stack. They're either building it themselves and selling it, building it and giving it away as OSS, or managing OSS solutions and selling the integration. 
  2. Simplifying the process dramatically by offering a "lower-code" version of integration to the above more advanced style solutions.

Most vendors start with #1 (although some just do #2, and that's their jam) and migrate to #2. From the perspective of full integration of services (Data + ML + Serving + monitoring), I don't think anyone is really "there yet" with a fully realized vision for #2. 

We're working on it. And we have some super smart people. Who are also really cool and nice. Every other major SaaS company aspiring to be the first truly "both buckets" company is working furiously towards both elements.

Learning ML engineering

How can one learn more about Model deployment? Do you think videos, books, and other educational content are useful, or learning by doing is better? :)

Great question. The jerk response is, "I wrote a book on this topic." 

The real answer is: YES. 

There's no "one way" to learn complex and broad topics. But if I were to give a "path" that's worked well for me when learning any sufficiently advanced concept when working with data, ML, software... 

1. Get familiar with the broad strokes. Some people like reading (that's me!), some like tutorial videos, and some like taking classes. Whatever works for you, at least have some fundamental understanding of the concepts you're trying to learn. That prepares you for...

2. Test it out. If you're trying to learn about deployment architecture, build a few use cases. Get familiar with those DEVOPS-Y tools like Terraform, Docker, Kubernetes, ES, EC, etc. Build some pet projects that just serve "hello world" responses. Get familiar with looking things up in the docs AND THE SOURCE CODE when you break something and don't quite grok what you did wrong. That prepares you for... 

3. A mentor. Find someone familiar with these topics. See what they've done in the past. Ask them questions. Buy them a coffee. Make a friend. Learn all that you can about that tribal knowledge that they've learned the hard way. Which prepares you for... 

4. Build something for real while being mentored. Attach an SWE to a project that you're working on. Get them to help when you are stuck and review every line of code and config you're writing. Let them know that you're learning and want help from their expertise. Switch entirely into student mode while building it out and set your ego aside. 

5. Continue to get feedback on the next projects until you're the one people seek as a mentor.

What would be the things that those who have done deployment courses miss when they apply their knowledge and skills in real-world deployment?

I actually can't answer this, as I took the "screw it up and fix it until you get it right" approach to learning how to deploy models or model outputs. 

If I were advising myself now, I'd probably tell myself to take some courses on how a "happy path" deployment would look. Then, when I screw things up (inevitably), I wouldn't have to spend so much time and effort on reverse engineering things that I didn't understand at the time. 

The real world is messy. 

There are decisions made about infrastructure at a company that can make deployments super easy (a cloud-native startup founded by ex-FAANG engineers who learned how to do all the 'puters stuff over decades of collective work) or a complete nightmare (a Fortune-100 company that is not in tech and is terrified of data leaks and of investor perception of their capabilities if a news article reveals they had some issue with data, so they lock everything down like they're handling nuclear launch codes). 

You might find that the startup situation will make the course you took seem extremely relevant (heck, the founders might have been the authors!). Conversely, you might find the F100 company's deployment environment so foreign to known best practices that you spend months working through something that should take hours. YMMV.

Are you working on another book now?

NOPE. NOPE. NOPE. :) I do write docs and examples, though!

How to deploy ML models

I'd like a detailed explanation of the process for deploying heavy machine learning models in production.

A heavy model? Do models have mass? 

In all seriousness, it depends on what you're trying to do. 

I assume by "heavy," you mean "this object, when serialized, consumes a lot of storage space on a hard drive somewhere." 

If the model is enormous, the first thing to ask is: Does it need to be this big? 

Are you working with a massive language model or an ensemble of language models? Do you have some enormous matrix that you're storing that needs to be used for lookups to "serve predictions"? Is this a gigantic computer vision model capable of classifying detected objects across the broad spectrum of general photographed objects? 

If your use case requires a massive model, you'll have to factor that into your deployment architecture. You'll need large instances to serve that model. If you have a high request volume for that model, you'll need high levels of concurrency to support that. 

So, the TLDR is:

  • Big models are big money in deployment. Do an ROI analysis on your infra costs and make sure the model is at least net-zero before going with that architecture (see the sketch after this list). 
  • There are a LOT of considerations to think about (from logging to monitoring, SLAs to VM allocation) that dictate how complex the stack needs to be from a hardware perspective to justify its existence.
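
A back-of-the-envelope version of that ROI check might look like the sketch below. Every number is an illustrative placeholder; swap in your own infra quotes and attribution figures.

```python
# A back-of-the-envelope version of the ROI check above. All numbers are
# illustrative placeholders, not real figures.
monthly_instance_cost = 2500.0       # one large serving instance per month
instances_for_sla = 4                # concurrency needed to hit latency and volume targets
monthly_monitoring_and_ops = 1500.0  # logging, monitoring, on-call overhead

monthly_serving_cost = monthly_instance_cost * instances_for_sla + monthly_monitoring_and_ops
monthly_attributed_value = 9000.0    # what the model measurably earns or saves per month

net = monthly_attributed_value - monthly_serving_cost
print(f"Monthly serving cost: ${monthly_serving_cost:,.0f}")
print(f"Net monthly value: ${net:,.0f} -> " + ("worth deploying" if net >= 0 else "rethink the size or the architecture"))
```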


What is your best take on deploying ML at scale when you need a response from a model in an ms range and high availability (e.g., reply to ad ranking) when not relying on batch prediction?

Millisecond scale? Ad serving determination? Real-time inference? 

Depending on your request volume, you're probably not using a non-compiled language to start with here. I've worked with financial services institutions that make extreme scale predictions at truly mind-boggling volumes where each request's response either means they make money or lose money. 

When the stakes are that high, you're not using OSS tools (for the most part). You're not using some cloud-managed service. You're not using a typical DS development language. 

Most of them are writing C++ to implement the decision logic or regression and are running a fault-tolerant deployment setup with automated and extreme-scale load balancing. Most of these workloads don't even run in the cloud. They run on a server farm that the company owns that has direct trunk-line access to an internet backbone hub (in the US, likely somewhere in TN, where real estate is cheap, power is cheap, and there's a fat trunk line running through prairie). 

Can you please explain the benefits model servers like Triton and TorchServe offer compared to a FastAPI approach? Do you recommend using them always?

The benefits? No. 
Do I recommend any? No. 

As I answered previously, I don't ever look at tech stacks as "oooh, this one is the best; I must use this." Rather, it's important to test them out for your use case. 

Does it work for the use case? 
Does it work for your team? 
Can you maintain, monitor, upgrade, and extend it for other business use cases? 
How locked-in will it make you (how much do you need to change your approach and architecture to go with one vs. the other)? 

All of these factors should be handled scientifically. That is, develop some testing criteria relevant to your company and use cases and then test them out. Evaluate it without bias and have others review the design proposal. Argue about it. Build prototypes. Break it. Fix it. Share the results. 

Then, when you're done with your evaluation, make the right decision based on effective testing and evaluation. Ignore things you hear online.

Monitoring ML in production

Did you ever deploy a machine learning model that was so effective that the performance declined due to its success?

I'm assuming that you're referring to the performance lift/gain reduced due to how well the model impacted the business. If you're getting at something else, please correct me. 

If that's what you're asking, then yes, most definitely. Many times. I think that's inevitable for any useful business-affecting model. Unless you're optimizing a limitless resource (let me know if you can think of one, I can't), there's going to be a point at which any improvement will see less impact over time. 

This is a political consideration for things that "are important to a company," and one of the ways that I have handled it is to negotiate at a project's start a holdout group that won't get the treatment. That way, you always have a comparison to give for performance analytics.

"Yeah, business, I get it. We're not seeing a 4.8% lift anymore. It's now 0.4% over the last quarter. But check it out; against our control group, we're at a 93.8% lift. Let's not shut this off, mmmmkay?"

Recommender systems are notorious for this. You see INSANE gains if you've built something useful. But, at a certain point, if you're not working with behavioral psychologists in designing the system and getting clever with feathering in "potentially might be interesting" elements, you'll end up with a self-sustaining, limited exposure of content that dictates how users interact with your business. Bad news: people get bored, leave, and think your brand sucks. 

On the other hand, if you're concerned with a fraud detection model, you might get something out there that's SO GOOD that fraud effectively drops to zero. Now you've lost your positive classification signal on retraining. OOPS. When the fraudsters regroup and figure out a new angle of attack, your model's detection ability will fall apart until the activity has been flagged manually. 

Been there in both examples. Learned that the hard way on both accounts :) 
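
As for the holdout math in that conversation above, it is simple and worth wiring into your reporting from day one. Here is a tiny sketch, with placeholder revenue-per-user numbers picked only to roughly reproduce the figures in the quote:

```python
# The arithmetic behind the holdout conversation. The revenue-per-user numbers
# are placeholders chosen to roughly reproduce the figures quoted above.
def lift(treated: float, baseline: float) -> float:
    """Relative lift of the treated value over a baseline, as a percentage."""
    return (treated - baseline) / baseline * 100

last_quarter_with_model = 10.40   # revenue per user, previous quarter (model live)
this_quarter_with_model = 10.44   # revenue per user, current quarter (model live)
holdout_group = 5.387             # revenue per user in the holdout that never got the model

print(f"Quarter-over-quarter lift: {lift(this_quarter_with_model, last_quarter_with_model):.1f}%")
print(f"Lift vs. the holdout group: {lift(this_quarter_with_model, holdout_group):.1f}%")
```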

What monitoring advice would you give to a team that has deployed a model in production for the first time? What are the metrics they absolutely need to keep tabs on?

Excellent question! 

The most important ones: 

1. Attribution. For a DS group's first push to prod, you're not fighting for model accuracy or model stability. You're fighting for relevance in the business. The business and management are DEFINITELY GOING TO ASK how much money this thing is making them, and they'll be annoyed if you can't tell them. 

2. Predictions over time. The output needs to be monitored to know if it's going into "bozo-land" so that you can rapidly scramble to get a fix out as soon as possible. If the business notices it before you do, you're done for. I've seen entire departments get fired at companies for this exact situation.  

3. Input raw features. If you're not catching drift on incoming features, you're absolutely done for in trying to determine WHAT went wrong, WHEN, and how to fix it. 

For #2 and #3, I hear there's this cool OSS tool that offers functionality for that ;)
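
For #2 and #3, a minimal, framework-free starting point is a two-sample Kolmogorov-Smirnov test per column, comparing a reference window against recent production traffic, as sketched below. Dedicated monitoring tools (the OSS one hinted at above included) package this kind of check up with many more tests, metrics, and reports. The 0.05 threshold and the idea of adding a "prediction" column alongside the raw features are assumptions.

```python
# A minimal, framework-free sketch of checks #2 and #3: compare a reference
# window (e.g., training/validation data) against current production traffic,
# column by column, using a two-sample Kolmogorov-Smirnov test as the signal.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Flag numeric columns whose current distribution differs from the reference."""
    rows = []
    for column in reference.select_dtypes("number").columns:
        result = ks_2samp(reference[column].dropna(), current[column].dropna())
        rows.append({
            "column": column,
            "ks_statistic": result.statistic,
            "p_value": result.pvalue,
            "drift_detected": result.pvalue < alpha,
        })
    return pd.DataFrame(rows)

# Usage: both frames hold the raw input features plus a "prediction" column,
# so drifting inputs and drifting outputs show up in the same report.
# print(drift_report(reference_window, last_week_of_traffic))
```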

ML consulting

What do you like about working as a consultant? What do you dislike?

I actually don't do that anymore. I work on MLflow with a truly fantastic group of brilliant software engineers :)

When I was doing it, I liked the "guts" of building creative solutions with motivated people. I've always been a builder and collaborator, so if there's a challenging problem to solve with a group of people, it's always fun. 

What did I not like? Politics. Working with a poor team of suffering ML Engineers/DSs who were forced by management to work on really stupid or impossible problems that I knew (and they knew) would never get solved. That's frustrating and stressful for them. I hated seeing stuff like that (and it's far more common than most people think).
