AMA with Alexey Grigorev: MLOps tools, best practices for ML projects, and tips for community builders

May 10, 2022
In the Ask-Me-Anything series in the Evidently Community, we invite ML experts and practitioners to share their experiences and answer questions from our community members. Our first guest was Alexey Grigorev, a Principal Data Scientist at OLX Group, a founder of the DataTalks.Club community, and an author of "Machine Learning Bootcamp".

The AMA covered all things production machine learning, from tools to workflow, and even a bit on community building!

Scroll down for the recap of the AMA with Alexey.

Machine learning workflow

Do you think a single individual can handle all the parts of an ML workflow: experimentation, data gathering, training, deployment, monitoring, etc.?
If not, do you expect the tooling landscape will catch up?


If you work at a startup, it's okay if one person is doing all that. In the end, it's not rocket science. Of course, you can't be an expert in everything. That's why you should have a team of people where everyone has different interests and strengths. A data team — data analyst, data engineer, data scientist, and ML engineer — can cover this end-to-end.

The tooling landscape will catch up for sure. But I'm pretty overwhelmed with the number of tools right now. It isn't easy to keep track of what's happening in the tooling space.
Which best practices do you follow regarding software engineering for machine learning and data science? Are there popular practices for general Python development that you consider unsuited for DS?

  • Having at least integration tests for smaller projects and unit tests for bigger projects is a must
  • CI/CD for running these tests
  • Not a fan of linting/automatic code formatting, but we also use that in some of our pipelines
  • Scaffolding (cookie-cutter)
  • Makefiles. I love makefiles. I use them everywhere
  • Pipenv. But still, dependency management in Python sucks compared to Java.
As for the second part, I don't know, I never did proper backend engineering in Python. Maybe Scrum? Not a big fan of story points and all this stuff.
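The integration tests mentioned above can start very small. Below is a minimal pytest-style smoke test for a hypothetical `predict()` entry point; the function, its interface, and the expected value range are all illustrative, not from the interview:

```python
# Hypothetical smoke test for a model's predict() entry point.
# predict() is a stand-in for a real pipeline step; a real project
# would load a trained model artifact instead.

def predict(features: dict) -> float:
    # Placeholder scoring logic for illustration only.
    return 0.1 * features["age"] + 0.5


def test_predict_returns_probability_like_value():
    # Integration-style check: the entry point runs end to end
    # and returns a value in a sane range.
    score = predict({"age": 30})
    assert isinstance(score, float)
    assert 0.0 <= score <= 10.0
```

A test like this would run under `pytest` in the CI/CD pipeline mentioned above, catching broken imports or interface changes before deployment.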


Machine learning code is historically very function- and script-oriented. Still, many of the strengths of modern Python are object-oriented features such as data classes or typing. When do you decide whether to convert the model into a set of classes and objects or keep it as a script file with a set of functions in Python?

When it becomes a mess :)


What does the process in DS projects look like at OLX? Do you stick to vanilla CRISP-DM or have some custom modifications?

I don't think anyone officially admitted that we use CRISP-DM (a variant of it), but in my opinion, we pretty much follow it end-to-end. CRISP-DM is slightly outdated (no wonder), so we do evaluation and deployment simultaneously via experiments, A/B tests, shadow deployment, etc.

ML model monitoring

How do you monitor the ML models you or your team members work on?

Right now, we do simple stuff. I'll divide it into batch and online.

Batch
  • If an Airflow task exits with a non-zero code, we get an alert in Slack.
  • We have a SQL-based home-grown data monitoring system for doing data validations.
  • A few projects use Tableau for alerting (yes, I was also surprised).
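The Airflow-to-Slack alerting mentioned above is commonly wired up through a failure callback. Here is a hedged sketch using only the standard library; the webhook URL is a placeholder, and the context keys (`"dag"`, `"task"`, `"exception"`) are assumptions about what the scheduler passes in, not details from the interview:

```python
import json
import urllib.request

# Hypothetical Slack incoming-webhook URL -- replace with your own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_failure_message(context: dict) -> dict:
    # `context` mimics the dict a scheduler like Airflow passes to
    # callbacks; the keys used here are assumptions for illustration.
    return {
        "text": (
            f":red_circle: Task `{context['task']}` in DAG "
            f"`{context['dag']}` failed: {context['exception']}"
        )
    }


def notify_slack_on_failure(context: dict) -> None:
    # The kind of function you would register as an on_failure_callback
    # on an operator, so a non-zero exit triggers a Slack post.
    payload = json.dumps(build_failure_message(context)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fires the Slack webhook
```

Separating message-building from the network call keeps the alert format easy to test without hitting Slack.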

Online

  • The standard SRE/DevOps metrics (RPS, CPU, Network, etc.), number of containers, number of messages in the queues, etc. All the dev stuff.
  • A few projects output the prediction distribution, so we can see p01, p05, p50, p95, and p99 values for it and see if something changes.
  • We don't monitor for feature drift currently.
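Tracking the prediction distribution described above can be as simple as computing a handful of percentiles over each batch of model outputs. A minimal sketch with NumPy (the function name and the toy batch are illustrative):

```python
import numpy as np


def prediction_percentiles(predictions):
    """Summarize a batch of model outputs as the percentiles we watch."""
    points = [1, 5, 50, 95, 99]
    values = np.percentile(predictions, points)
    return {f"p{p:02d}": float(v) for p, v in zip(points, values)}


# Toy batch of predictions; comparing these summaries across
# time windows hints at shifts in the model's output distribution.
scores = np.linspace(0.0, 1.0, 101)
print(prediction_percentiles(scores))
```

Logging these five numbers per time window is cheap, and a sudden move in p01 or p99 is often the first visible sign that something upstream changed.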
A lot is happening in model monitoring. Do you have any guidelines for your team on choosing model monitoring tools or techniques?

I'm pretty pragmatic, so I'd try starting with the tools you already have and iterating. You'll see what's missing, and it'll become clearer what kind of tool you need. I also liked a talk by Lina Weichbrodt from the recent PyData Berlin. Here are the slides.


Do you see any trends or anything that has caught your attention?

I see a trend that there are a lot of data and ML monitoring tools. It's challenging to keep track of them, let alone decide which one to use. But there's also a trend I love: many of these tools are going open-source, and that's amazing. It's easy to try them and see whether they work for our problems or not.


When does it make sense to retrain the model continuously? And when is it better to opt for retraining the model from time to time, when it's not good enough or outdated? How to choose between these two approaches?

Continuous retraining makes sense for projects where data drifts quickly, for example, in recommender systems. How do you know that? By setting up monitoring, of course. I'd go with monitoring business metrics first. If they start going down after some time, you probably need automatic retraining. Then you can add all other metrics, drift, and so on.

For us, we need that for recommenders and moderation. Many other projects don't need constant retraining, and doing it every once in a while is enough. For example, when precision falls below a predefined threshold.
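The "retrain when a metric degrades" policy described above boils down to a tiny check. A minimal sketch, where the metric choice (precision) follows the answer but the 0.8 threshold is an arbitrary example value:

```python
def should_retrain(current_precision: float, threshold: float = 0.8) -> bool:
    # Trigger retraining once the monitored metric falls below
    # a predefined threshold; 0.8 is an illustrative default.
    return current_precision < threshold


# Example: a nightly job could compare yesterday's precision
# against the threshold and kick off a training pipeline.
if should_retrain(0.72):
    print("precision below threshold -> schedule retraining")
```

The same pattern extends naturally: start with one business metric, and add drift or data-quality checks as extra conditions once monitoring shows they matter.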

Machine learning tools

What ML tools do you use in your daily routine?

In a nutshell:
  • The standard PyData stack: NumPy, pandas, SciPy, etc.,
  • Scikit-Learn, XGBoost, and Keras (TensorFlow),
  • SageMaker, MLflow,
  • We have our internal ML platform at OLX Group.
Other tools I use are not strictly ML-related: Kubernetes, Airflow, Spark, and many AWS services such as Batch, Lambda, Kinesis, etc. There are also many smaller ones that I don't use day-to-day, but they are pretty helpful. For example, Numba is a great tool. I've probably forgotten about a bunch of other super helpful things.


Do you use transformer-based metric tools like BERT for machine translation (MT) tasks? Which MT metrics do you think will use such models more in the future?

Nope, we don't. Even though OLX is available in many markets, we haven't had a use case for that. And I also personally don't have any experience with that.


What do you think about feature-store tools like Tecton and Feast? When is it a good practice to use them, and when not?

I haven't used Tecton, and setting up Feast seems like mission impossible. We usually go with simpler stuff based on DynamoDB, and it has worked well so far.


There is a lot of new ML research and techniques coming out all the time. How do you stay updated without getting overwhelmed? Or do you only look up new things after stumbling on a particular problem?

First, let me say how I stay updated with the tools. I mainly look at open-source tools. Because there are so many, I can be selective. And I invite the authors to demo the tools. This is how I see what's out there. Shameless plug: the interviews go here.

As for other things, I don't try to stay up-to-date. If enough people talk about something — both in communities/social media and at work — then I look at it.

Community

What was the most surprising for you in building the DataTalks.Club community? Did you expect it to grow so popular? Any tips for other community builders?

Of course I did expect that. Why would I start it if I wasn't expecting it? :D

But honestly, it all happened spontaneously, and it grew organically. So I didn't know what would happen on that day when I created that Slack workspace.

As for the tips, try to figure out why people joined the community. What do they want to get out of it? In the beginning, I was trying to talk to everyone who joined Slack and learn a bit about them. Now I don't do that as it takes too much time. But it's something you should do as a community builder.


What is your favorite question to ask your guests at DataTalks.Club events?

What's the best way to reach out to you? :) I think all the guests get that question from me. I don't think I have any other favorite questions that I always ask.


What's the best way to reach out if someone wants to ask you more questions or follow up?

Join DataTalks.Club of course!
* The discussion was lightly edited for better readability.

Want to join the next Ask-Me-Anything session?

Join our Discord community! Our next guest is Hamza Tahir, Co-founder of ZenML, an open-source MLOps framework for creating reproducible pipelines.

Register here to get a friendly reminder and an invite straight to your calendar.
At Evidently AI, we create open-source tools to analyze and monitor machine learning models. Check our project on GitHub and give it a star if you like it!
