We invite ML practitioners to share their experiences with the Evidently Community during the Ask-Me-Anything series.
Our recent guest is Jacopo Tagliabue, co-leader of RecList, an open-source project for behavioral testing of recommender systems. Jacopo is a proponent of “reasonable scale ML” and has explored that theme in a series of articles, papers, and open-source repositories. He is also an Adjunct Professor of ML at NYU, teaching NLP and ML Systems to Master’s students.
We chatted about ML at a reasonable scale, RecSys, MLOps anti-patterns, what’s hot in DataOps, and more.
Sound interesting? Read on for the recap of the AMA with Jacopo.
For those who are not familiar with the concept, could you please define “reasonable scale ML”? What reaction to the concept are you getting? Does it resonate with the companies?
If you don't work at Amazon, Google, or Uber and you're doing ML, you are probably a reasonable scale practitioner! Most of us work with terabytes of data, in teams of 10 or fewer people and without infinite cloud computing available - that's the reasonable scale: advanced teams in small tech companies (like my team at Coveo), or small teams in large traditional companies, just starting with ML.
The general message is this: it's false that only large teams can do great ML, thanks to a thriving ecosystem of tools that make very few people VERY productive :) You just need to know where to look!
Some good places to read more and get some working code are our TDS series and, of course, my GitHub.
I’m interested in the intersection of “reasonable scale” ML with the large models people increasingly build on (often via fine-tuning, etc.). To what extent does “reasonable scale” ML rely on the unreasonable scale of the organizations/companies that create those large models (and their data, their ability to annotate it, etc.)?
It is indeed a great point. The ImageNet moment for Computer Vision and the Hugging Face moment for NLP opened up the possibility to do great things through fine-tuning. For example, my team recently fine-tuned CLIP, and it was great! In the current big-data, big-model paradigm, we depend on larger organizations for many of these things, in the same sense that we depend on three providers for the cloud.
Is there a different paradigm for ML? I advocated small data learning in the past, but the truth is, it just doesn't work as well in practice as throwing data and compute at the problem.
Is there any way to break free of this dependency? Do you see any hope for that? Or maybe a better question: are there people/organizations working on these approaches?
I don't believe scale is all you need, and I'm pretty disappointed most researchers are optimizing over a narrow set of ideas. Still, it's very hard to find funding for radically different approaches when the "big data guys" produce GPT-3, which is really, really impressive.
As a cognitive scientist, my view on LLMs has changed significantly recently, and I have started to like them, even as a skeptic.
One of my favorite researchers is Joshua Tenenbaum (MIT), who always has incredible ideas that, however, are not what people at Amazon would then use :) I think we need some more courage to explore new stuff, a point Gary Marcus, for example, often makes (correctly).
What do you believe is the optimal composition of a “reasonable scale” ML team? Is there some golden ratio between data engineers, data scientists, and machine learning engineers? Or is there a specific role to have, maybe ML product manager?
Good question. I come from a startup mentality, even as part of big organizations. The general principles for me are two:

1. Keep the team as small as possible.
2. My job as a leader is to ensure nobody is "wasting time" on no-value-added activities, like scaling GPUs, tracking experiments, monitoring data quality, etc. So I buy everything I can in this sense, to free everybody up.
Please bear in mind that my experience is in B2B companies where ML IS the product. Those are, in some sense, very technical organizations, where ML is not an afterthought but literally what they do. That said, some luxuries I’ve had are not strategies that would work in more traditional companies. But on point (1), I'm pretty sure: a great team always looks understaffed when viewed from the outside :)
RecList is based on a great idea of behavioral testing of recommendation systems. How generalizable is it? E.g., can you formulate tests that apply to different recommendation systems out of the box, or is it a forcing function for the creators of a specific ML recommendation system to think through the custom tests they need (following a template)?
As with the inspiring CheckList paper in NLP, the answer is both. Some general tests depend on the use case (item-item vs. user-item recs, for example) and are somewhat independent of the dataset, while others are dataset-specific, so some custom code is required. The point of RecList (or CheckList!) is not to fully automate testing but to a) provide good abstractions so you avoid re-writing flaky, boring code, and b) provide a standard way to frame the problem of well-rounded evaluation, and therefore address potential issues in fairness, robustness, etc.
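To make that concrete, here is a hypothetical, minimal sketch of the pattern in plain Python (not RecList's actual API; the item names and brand metadata are made up): a general test that works on any catalog, next to a dataset-specific test that needs domain knowledge.

```python
# Hypothetical illustration of behavioral tests for a recommender,
# in the spirit of RecList (not its actual API).
from statistics import mean

def coverage_test(recommendations, catalog):
    """General test: what share of the catalog ever gets recommended?"""
    recommended = {item for recs in recommendations for item in recs}
    return len(recommended) / len(catalog)

def brand_affinity_test(recommendations, queries, brand_of):
    """Dataset-specific test: do recs for an item share its brand?"""
    scores = []
    for query, recs in zip(queries, recommendations):
        same_brand = [r for r in recs if brand_of[r] == brand_of[query]]
        scores.append(len(same_brand) / len(recs))
    return mean(scores)

# Toy run: two queries, top-2 recs each, a 4-item catalog.
catalog = ["a", "b", "c", "d"]
brand_of = {"a": "nike", "b": "nike", "c": "adidas", "d": "adidas"}
queries = ["a", "c"]
recs = [["b", "c"], ["d", "a"]]
print(coverage_test(recs, catalog))                   # 1.0
print(brand_affinity_test(recs, queries, brand_of))   # 0.5
```

The first test ships with the abstraction; the second is the custom code a team writes once, following the same template.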
In recsys, what do you consider the current baseline for learning to rank problems for an engineering system? In other words, what should be my go-to reference for an algorithm or software library if I am building a ranking for my e-commerce catalog?
Fantastic question! My recent exploration in the space brought me to an ongoing collaboration between Metaflow as an MLOps tool and NVIDIA Merlin as the Open Source rec library. In particular, I'm a HUGE fan of the Merlin team, what they have accomplished so far, and their vision. Coupling the infra abstraction provided by Metaflow with the power of Merlin gives you a fantastic starting point for most use cases!
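As a hedged sketch of what the Metaflow side of that pairing looks like (step names are illustrative, and the training step is a stub standing in for a library like NVIDIA Merlin rather than real Merlin code), a rec pipeline becomes a versioned, resumable flow:

```python
# Minimal Metaflow flow; the training step is a stub where a rec
# library would plug in.
from metaflow import FlowSpec, step

class RecTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Load interaction data (stubbed as an in-memory list here).
        self.interactions = [("user_1", "item_9"), ("user_2", "item_3")]
        self.next(self.train)

    @step
    def train(self):
        # In a real flow, call into your rec library here and persist
        # the model; Metaflow versions self.* attributes as artifacts.
        self.model = {"trained_on": len(self.interactions)}
        self.next(self.end)

    @step
    def end(self):
        print(f"Done: {self.model}")

if __name__ == "__main__":
    RecTrainingFlow()
```

You run it with `python rec_flow.py run`, and the same code can later move to the cloud without rewrites, which is the infra abstraction mentioned above.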
In your opinion, what are the three hottest topics in DataOps right now?
There are a few startups I super like in transformation, orchestration, and warehouse-native apps, so maybe those are three cool topics. But of course, there are many more: data observability was super hot last year, and one niche topic I love is everybody trying to replace SQL :)
What idea about ML/MLOps do very few people agree with you on?
That you need to be Big Tech to do meaningful ML work. It is false, and there's a fantastic opportunity for ML AND MLOps work at a reasonable scale, if you know where to look and how to work on it.
Second idea: ML is getting easier and easier, but data is still hard. The biggest marginal gains right now are in DataOps, not MLOps :) Especially if, as I believe, large language models become more and more pervasive.
Do you think many ML teams are building more complex infra than they need? For example, they get busy adopting Kubernetes before figuring out what ML models they need to build. Or maybe there are some other MLOps anti-patterns that you observe?
Yes. Everybody thinks they are Google, which prompted a lot of our evangelization in the ML space. Start with the problem and the stack you have, not the one you wish you had.
The general idea is always the same: buy/hack your way into production, end to end, from raw data to predictions. As fast as possible, as basic as possible, as long as the main boxes are there.
ONCE it's live, you can assess ROI and make all the changes you want, knowing that improving one piece will immediately translate into something better. Unless you are very experienced/talented, you don't even know if ML will be useful AT ALL, and that's true for most people doing ML projects these days.
Start simple, buy stuff. If you're happy, come back and rebuild the parts you think need to be better. But first, establish that the model is useful: it may take a WHILE in traditional companies to figure out even a good use case, and that's ok! But don't build a whole data team just to ship a churn model in your app.
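To illustrate "as basic as possible" (a hedged sketch; the CSV path and column names are invented for this example), ship a trivial baseline next to your first model, end to end, and let the gap between them tell you whether ML is worth it:

```python
# Minimal end-to-end check: does a simple model beat a dumb baseline?
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")  # hypothetical dataset
X, y = df[["tenure", "monthly_spend"]], df["churned"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

# If the model barely beats the dummy, ML may not be useful here AT ALL.
print("baseline:", accuracy_score(y_te, baseline.predict(X_te)))
print("model:   ", accuracy_score(y_te, model.predict(X_te)))
```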
There are many MLOps tools, and many problems are still not solved. What are your thoughts on this? Will each category have a winner, and will they eventually consolidate? Or is there indeed enough space for all these tools as the problems are complex and diverse?
There is an imbalance right now between the size of the market and funding/startups, that is for sure. I don't think it's bad per se; it is actually producing a fantastic ecosystem. That said, the situation cannot continue indefinitely without showing meaningful progress: revenues in the space seem very low for most players, and the market is obviously super early.
People also don't like having seven vendors, five of them covering small parts of the pipeline. As the market grows, less sophisticated buyers won't be able to pick, choose, and orchestrate seven tools, so platforms will become an attractive choice for many.
You know, as they say in Batman: either you die as an open source project, or you live long enough to become a platform :)
Are Transformers all we need in machine learning?
Good question. The answer is, as always, it depends :) There are undoubtedly huge benefits in transformer architectures for many practical applications, especially in NLP and sequence modeling, and the community (e.g., Hugging Face) has made great progress in providing good abstractions that make people productive! Beware of the hype, though: there are still places where good old tools work just as well (e.g., Information Retrieval!).
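For instance, a strong "good old tools" retrieval baseline is a few lines of TF-IDF with scikit-learn (a sketch with a toy corpus, not a production setup):

```python
# TF-IDF retrieval: often a surprisingly strong IR baseline before
# reaching for a transformer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "red running shoes for men",
    "blue tennis shoes",
    "kitchen blender with glass jar",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

query_vec = vectorizer.transform(["running shoes"])
scores = cosine_similarity(query_vec, doc_vectors)[0]
print(docs[scores.argmax()])  # the running-shoe document wins
```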
From a theoretical perspective, that's a question about, for example, the science of language. I recently talked about meaning in large language models if you're interested in more philosophical points.
I work with NLP-for-fun projects, but I am not an NLP person and want to create labels/topics out of a corpus. In the past, I would use Latent Dirichlet Allocation. Is this still what you would use as a baseline? What are the cool kids using these days?
A few options: I like all the work done by the Snorkel team on probabilistic/weak labeling to bootstrap your classifier. If you have a little money, I would rely more and more on GPT-3 and LLMs to label things for you. It’s an unpopular opinion, but generative models will be super useful for generating/augmenting data in ML, MORE than as models directly solving a problem.
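Here is a minimal sketch of the weak-labeling idea, following the pattern from Snorkel's own tutorials (the labels, rules, and tiny dataset are made up for illustration):

```python
# Weak labeling: noisy rules vote, a label model denoises the votes.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, SPORTS = -1, 0, 1

@labeling_function()
def lf_mentions_ball(x):
    return SPORTS if "ball" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_recipe(x):
    return OTHER if "recipe" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": [
    "a great basketball game", "an easy soup recipe", "football highlights"
]})
L = PandasLFApplier(lfs=[lf_mentions_ball, lf_mentions_recipe]).apply(df=df)

label_model = LabelModel(cardinality=2)
label_model.fit(L_train=L, n_epochs=100, seed=0)
df["label"] = label_model.predict(L)  # bootstrapped training labels
```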
That said, topic modeling is fantastic. You should check out the neural version of it, open-sourced by my friend Federico Bianchi (topic modeling meets BERT!).
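And for reference, the classic LDA baseline from the question is still a few lines with scikit-learn (toy corpus for illustration), which makes it a cheap yardstick to compare any neural topic model against:

```python
# Classic LDA topic modeling baseline.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the striker scored a late goal",
    "parliament passed the new budget",
    "the keeper saved a penalty kick",
    "senators debated the tax bill",
]
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))  # per-document topic mixtures
```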
What do you consider to be fundamental skills + knowledge in the MLOps space? (i.e., for practitioners and NOT researchers). Are there fundamentals yet, or are we still figuring all that out?
I would say ML system design, basics of ML (train/test split, drift, etc.), and solid software engineering are the most common.
If you don't know anything about ML, it’s hard to make principled decisions, since you may not understand why the predictions coming from an API differ from what you expect. If you don't know ML systems, it’s hard to play nicely with other parts of the toolchain. Finally, if you don't know software, you can't write good abstractions for yourself and your team.
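As one concrete example of those "basics" (a minimal sketch; the synthetic data and threshold are illustrative, and a tool like Evidently packages this kind of check for you), here is a univariate drift test between a training-time feature and its production counterpart:

```python
# Two-sample Kolmogorov-Smirnov test as a simple drift check.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)   # training data
production = rng.normal(loc=0.5, scale=1.0, size=1_000)  # shifted live data

stat, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    print(f"possible drift: KS={stat:.3f}, p={p_value:.1e}")
```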
"Nobody said it was easy." As I mentioned above, my standard/MO may be very skewed towards ML-first companies, so the above is an ideal list for somebody on my team, but it's not a beginner list for a junior position.
I saw that you are also teaching ML Systems at NYU, and in the online repository, there is material on serverless architectures and AWS SageMaker. With this in mind, what do you consider the ideal syllabus for someone teaching or studying ML systems?
In my course, I mostly teach the "training" part: how to properly organize a project for reading and cleaning data and training a model, with artifacts ready for downstream applications. However, there are MUCH better resources than my course :) Two books from friends that I recommend are Chip Huyen's and Ville Tuulos's, both fantastic. As an open-source resource, Made with ML is a terrific repo.
Is there such a thing as software engineering for ML? For example, Python is object-oriented, but ML libraries, for historical reasons, are somewhat functional. The same applies to ideas such as unit tests, polymorphism, and what one learns in Software Engineering 101. They do not translate well to ML programming. Thus, what are the best programming practices for software engineers doing ML systems?
Omg, such a fantastic question, even if I suck at coding :) My disappointing TL;DR from software is: iterate fast (refactor as you go along), but make sure everything is reproducible. The hardest thing for me in ML is that the statistical nature of ML pipelines can make debugging super hard if you don't plan for it!
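One cheap way to "plan for it" is to pin every source of randomness up front so a failing run can be replayed exactly (a sketch; the helper name and seed value are made up):

```python
# Seed everything once, at the top of every pipeline entry point.
import os
import random

import numpy as np

SEED = 42

def seed_everything(seed: int = SEED) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # If you use a DL framework, seed it here too, e.g.
    # torch.manual_seed(seed).

seed_everything()
print(random.random(), np.random.rand())  # identical on every run
```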
What are the typical mistakes you see data scientists make? Be it related to flawed coding practices, communication, anything?
Taking too long to ship things. If your goal is always to ship things and iterate as fast as possible, end-to-end, that will force you to communicate better, code better, understand the business, etc.
That's why in my team, everybody is end-to-end: you need to understand the entire thing, and it's YOUR responsibility to make it work (no handoff, no excuses). And everybody is encouraged to experiment: you learn more by deploying than by having a MEETING on whether you should deploy.
Since it is YOUR responsibility, I trust you to deploy something good enough. All the broken dynamics I have seen are because of conflicting incentives in teams that keep fighting.
Cut the middlemen, and own the entire feature like your small startup.
As an experienced start-up guy, what do you find to be the most challenging bit in building a product?
Finding something at the intersection of what I think is cool and what others find valuable. There's also a subtle trade-off in there, as the things I think are cool may not necessarily be what OTHERS agree on :)
You do ML research but are also an ML engineer/startup/OSS contributor. As a researcher, one wants to develop at the frontier, but this is hardly possible. If one is building a company, one must build reliable software; often, a heuristic is enough. Where do you draw the line? Is it challenging to cater to these two sides of your professional career?
It's a pendulum. Sometimes I'm 100% builder, sometimes 100% scientist, and sometimes (more often) a mix of the two. There are some secrets to it, if you will. The obvious one is to have a great team, peers, and friends who help in different ways. Much of our research is co-authored with friends from academia, who account for a significant part of our productivity :)
I will write a blog post soon on our experience running Coveo Labs if you're interested in how Applied R&D can live together with pragmatic product decisions. The TL;DR is that I've been very lucky to have the freedom and ability to shape roles and companies where it is important to be practical AND innovative!
What problems in RecSys, MLOps, or DataOps are you now interested in and want to solve? And in general, what are your plans for your professional future?
That's the hardest question of all! Right now, I'm enjoying (f)unemployment after Tooso and Coveo and focusing on open source with friends and peers. RecList beta and the CIKM data challenge for sure, but also some things I will unveil soon as new papers/open-source projects.
As a (possible) founder, I'm exploring themes around data transformation and operations: everybody loves ML, but the bottleneck is getting good data pipelines. Anybody working on data tools (Snowflake, dbt, etc.) who wants to chat, please reach out!
Finally, an opinionated prediction: the next big problem in MLOps is experimentation platforms, as we all finally realize that the only real tests for ML models are live ones.
* The discussion was lightly edited for better readability.
Jump to our Discord community to get support, contribute, and chat about AI products.