Which tool do you think we as an MLOps community need but do not have?

Rather than a tool, one thing I'd like to see is a set of good bare-bones examples. That is, a Git repo such that if I clone it and follow the steps, I end up with a model training pipeline that trains and deploys "hello world", with monitoring and data versioning. Using open source, of course :)

It is not just documentation: these examples would be working code representing actual infrastructure, pipelines, and so on.
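To make that concrete, here is a minimal sketch of what the training script in such a repo could look like, using scikit-learn and MLflow as stand-ins; the tool choices, file name, and layout are my own illustration, not a reference implementation:

```python
# train.py - a hypothetical "hello world" training pipeline entry point.
# scikit-learn and MLflow are illustrative stand-ins; a real repo would
# swap in whatever open-source stack it standardises on.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def main() -> None:
    # Load a toy dataset; in the real repo this step would read versioned data.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        model = LogisticRegression(max_iter=200)
        model.fit(X_train, y_train)

        # Log the metric and the model so the run is reproducible and the
        # artifact can be picked up by a deployment step later.
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
        mlflow.sklearn.log_model(model, "model")


if __name__ == "__main__":
    main()
```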
What do you think is a reasonable number of tools to stitch together when making an internal MLOps platform? Is it 3-4 (e.g., one for workflow management, one for experiment management, one for deployment, one for monitoring), or could it be 10+?

If I'm honest, this is something we're still experimenting with, but my intuition is that it's not a big number. I think the role of MLOps engineer includes making everybody else better and more efficient by sharing the principles and patterns with the wider tech team, that is, with data scientists and other engineers. In other words, bring everybody else into the MLOps fold instead of building a big, separate MLOps team.
I imagine there is some ideal "ELK" stack the industry would converge to. How many categories will it have? What's your bet?

The MLOps ELK probably still includes the K :) I reckon the ingredients need to include the following (one possible mapping to concrete tools is sketched after the list):
- Pipelines / workflows / orchestration
- Deployment / serving
- Logging / observability / monitoring
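Purely as an illustration, those three categories could map to open-source tools like this; the specific picks are my own examples, not a prediction of what the industry will converge on:

```python
# A hypothetical mapping of the three stack categories to open-source tools.
# The picks are illustrative examples only, all assumed to run on
# Kubernetes (the "K" that stays).
MLOPS_STACK_CANDIDATES = {
    "pipelines / workflows / orchestration": ["Airflow", "Argo Workflows", "Kubeflow Pipelines"],
    "deployment / serving": ["KServe", "Seldon Core", "BentoML"],
    "logging / observability / monitoring": ["Prometheus", "Grafana", "Evidently"],
}
```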
What is your take on notebooks within the MLOps landscape? Many claim they do not fit into production and foster bad practices. At the same time, we see more and more tools trying to wrap notebooks into microservices so they can back production services.

My own opinion is that notebooks are a net negative for MLOps purposes. Turning them into microservices just seems to hide the problems. I'm largely in agreement with Joel Grus on this question here!
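To illustrate the alternative I have in mind: rather than wrapping a notebook in a service, the logic gets moved into a plain, importable module that can be unit-tested and served like any other code. All the names below are hypothetical:

```python
# churn/predict.py - a hypothetical module extracted from a notebook.
# Unlike a wrapped notebook, this can be imported, unit-tested, and
# code-reviewed like any other production code.
from dataclasses import dataclass


@dataclass
class Features:
    tenure_months: int
    monthly_spend: float


def churn_score(features: Features) -> float:
    """Toy scoring logic standing in for a real model call."""
    score = 0.5 - 0.01 * features.tenure_months + 0.002 * features.monthly_spend
    return min(max(score, 0.0), 1.0)


# A service (Flask, FastAPI, whatever the stack uses) simply imports
# churn_score; there is no notebook runtime hidden behind the endpoint.
```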
For some MLOps tasks (like annotation), there are feature-rich closed-source tools and not many open-source alternatives (if any). What would you recommend for those trying to get their models into production in this case? Use sub-par open-source tools, or connect an open-source stack with proprietary platforms that solve one piece of the workflow really well?

Well, I suppose there are feature-rich closed-source tools in all sorts of categories besides annotation, too. But I get the same impression that annotation is an area that's lacking a bit.
I think it's important not to be locked into a vendor. So, using something proprietary might sometimes be the best choice, but it's best done in a way that avoids lock-in. For example, use SageMaker, but through ZenML, because then you're not tied to SageMaker.
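A minimal sketch of what that looks like, assuming a recent ZenML release where `step` and `pipeline` are exposed at the top level; the pipeline itself is a trivial placeholder:

```python
from zenml import pipeline, step


@step
def train() -> float:
    # Placeholder for real training; returns a dummy metric.
    return 0.99


@pipeline
def training_pipeline():
    train()


if __name__ == "__main__":
    # The same pipeline code runs unchanged whether the active ZenML stack
    # points at a local orchestrator or at SageMaker; the backend is chosen
    # by stack configuration (e.g. `zenml stack set <name>` on the CLI),
    # not by the code, which is what avoids vendor lock-in.
    training_pipeline()
```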
On labeling in particular, I've got my eye on Label Studio but I haven't had a chance to try it yet!
I have used a few data version control tools in the past. Although it is a hyped topic on social media, I find the tools lagging, and practitioners don't seem to use them. I wonder if it is because we are in the early days or, potentially, because the niche is small and the added value is not yet significant. What's your take on the maturity of data version control as a segment?

I think one issue with data version control is that the most suitable tool depends a lot on the nature of the data. Is it big or small? Tabular, images, text? I suppose that could be a barrier to entry for many people.
There are really good, mature, open-source tools for data version control. Adoption is a different matter.
On our projects, we always recommend data version control. It's as fundamental as source code version control and has implications for everything else in the MLOps solution. But it's really important to work closely with the people who will actually use these tools, so that we understand what they really need, and they understand how to use the tool effectively.
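For what it's worth, here is a minimal sketch of the kind of workflow I mean, using DVC's Python API; it assumes a repo already tracked with DVC and a Git tag `v1.0` marking an earlier data revision (both the path and the tag are hypothetical placeholders):

```python
import dvc.api

# Read the exact revision of the training data that a past model was built
# from, regardless of what currently sits in the working tree.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    print(f.readline())  # e.g. the header row of the v1.0 dataset
```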