August 27, 2022 · Last Updated: April 27, 2023

AMA with Doris Xin: AutoML, modern data stack, and reunifying the tools

Community

We invite ML practitioners to share their experiences with the Evidently Community as a part of the Ask-Me-Anything series.

This time, we talked with Doris Xin, the Founder and CEO of Linea, the company behind the open-source tool LineaPy, which aims to simplify repetitive and mundane data engineering tasks.

We chatted about the roles of Data Scientists and Data Engineers in the ML lifecycle, automation, MLOps tooling, bridging the gap between development and production, and more.

Sound interesting? Read on for a recap of the AMA with Doris.

DS and DE roles in the ML lifecycle

Do you believe data scientists should be end-to-end and be able to do everything from data prep and problem definition to deployment and monitoring?

Yes — from an organizational standpoint, because the handoff between different personas taking on different parts of the end-to-end workflow is high-friction. At the same time, it's a tall order to expect data scientists to split their attention across so many tasks. So no — from an individual productivity standpoint.


There is no established definition of "data engineering," and the first book on data engineering is only now about to come to market. How would you define data engineering as a discipline? Where does it start and end, especially in relation to business intelligence, ML engineering, or data science?

This is a question I am still seeking the answer to. We've seen two schools of thought on this:

  1. data engineering starts with raw data and ends before data analytics,
  2. data engineering encompasses all engineering efforts in the end-to-end ML lifecycle.

The data engineers in the first bucket handle ETL/ELT pipelines and frequently work with tools like Snowflake and SQL queries. Data engineers in the second bucket handle all types of data pipelines and provide support for ETL, data analytics/modeling, and deployment. We're seeing the second option becoming an increasingly popular definition, and the boundary between data engineering and ML engineering is blurred.

The state of data and ML tooling

A famous (and often contested) statement is that 87% of data science projects never make it into production, with many staying as a "Proof of Concept" only. What do you believe is the actual rate? How many projects get to production in a typical company these days?

The mythical 87%! From our conversations with data practitioners, it largely depends on the stage of the company. Many organizations are still in the exploratory phase because they are trying to tackle very high-impact business problems with ML/DS. There, it's 0%.

On the other hand, companies that are mature in their data science journey are constantly pushing models into production. I think the answer comes down to the number of companies in these two buckets.


What is the most challenging problem to solve in Data/MLOps?

Closing the loop from production back to development. The reason for MLOps' ascension to popularity is that we're beginning to recognize that data science has a natural CI/CD cycle. I think tools like Evidently are making progress in this direction in a very meaningful and significant way.
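
To make "closing the loop" a bit more concrete, here is a minimal sketch of what a production-to-development feedback check could look like with Evidently's Report API: comparing recent production data against a reference dataset to catch drift before deciding whether to go back to development. The file names are hypothetical.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical datasets: the data the model was trained on vs. recent production data
reference = pd.read_csv("reference_data.csv")
current = pd.read_csv("production_last_week.csv")

# Compare the two samples for distribution drift across all columns
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Share the result with the team, or inspect it programmatically via report.as_dict()
report.save_html("drift_report.html")
```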


Which data science tasks do you think have the most potential for automation?

We did a study with many existing AutoML users, and the conclusion is that the engineering tasks, rather than the modeling/analytics tasks, have the most potential for automation. The reason is twofold:

  • There are concerns about fairness and explainability, and removing the human from the loop via automation only adds to the problem.
  • We are still early in our effort to represent domain knowledge and rely heavily on human intuition to represent data in a format the machine can consume for analytics.

Therefore, it is more productive to have the machine handle the more mechanical aspects of the workflow and partner with the human, whose intuition drives the analytics.


What was the most unexpected finding you came across during your PhD research?

Data practitioners have a love-hate relationship with automation. On the one hand, there's the desire to automate tedious chores to focus their bandwidth on more high-leverage and interesting analytics tasks. On the other hand, there are real concerns about automation taking away their jobs.


Some discuss the impending death of the modern data stack. But when I talk to companies, it seems that the modern data stack has not arrived yet. They still have to deploy Snowflake and work with dbt and the like. What is your feeling about it? Are we in the early stages of the modern data stack? Is it something we already tried and want to re-bundle again? Or will it remain a niche for well-funded Internet companies?

Oh, this is a funny one! It feels like we are quite early in the modern data stack journey. We have a lot of great raw ingredients floating around in the market, especially great OSS tools. What's different about big internet companies is that there are teams dedicated to evaluating the options and figuring out integrations. For the rest of the world, limited resources often mean working with a small sample of the available tools and fitting together lego pieces that might not be compatible with each other. The "stack" is more like a Jenga tower precariously balanced on a few key choices.

MLOps tools and LineaPy

What big holes do you see in the tooling for a relatively straightforward ML use case? Do you think we have all the tools covered, and it's just a question of smoothing over the edges, or are there more fundamental reshapings yet to come?

If anything, we have too many tools! The unbundling of the data science tooling landscape has been driven by the desire to adopt best-in-breed point solutions. However, this trend is not sustainable. Data scientists need to glue together many different tools in the end-to-end lifecycle of ML/DS. The overhead is arguably taking away from the benefits of best-in-breed solutions. It feels that we're undergoing a time of "reunification" of the data science ecosystem.


What does LineaPy bring to the tooling ecosystem?

LineaPy is leading the "reunification" effort. We have found a balance between creating general-purpose tools that support many things (suboptimally) and forcing data scientists to manage too many tools simultaneously: a "glue layer" that allows many tools to easily integrate with the same workflow abstraction at different points of the end-to-end lifecycle.


I see similarities between LineaPy and tools for experiment tracking, such as CometML or MLFlow, aiming to increase productivity and document work. In LineaPy, there is also the pipelining component, but otherwise, how would you set LineaPy apart from tools in the data science space aiming to make DS teams more effective in going to production?

We're similar to MLFlow and CometML in that we understand that the journey to production starts in development, and it's crucial to provide functionality that gives us visibility into the development side of things.

We're different in two important ways:

  1. We're framework agnostic. These frameworks require data scientists to explicitly extract and define the logical workflow that is only implicitly expressed in the development code.
  2. We're eager to capture everything rather than relying on data scientists to deliberately identify what's important to capture during experimentation, which often only becomes clear in hindsight.

These two factors combined allow LineaPy to automatically extract data pipelines from messy development code to be run in many different environments, such as Airflow, Ray, and someday MLFlow and CometML.
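
As an illustration of what that looks like in practice, here is a minimal sketch built around LineaPy's save/to_pipeline workflow. The data, variable names, and artifact names are hypothetical, and the exact parameters may differ between LineaPy versions.

```python
import lineapy
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Messy, exploratory development code (assumes the session is traced,
# e.g. a notebook with %load_ext lineapy)
df = pd.read_csv("training_data.csv")
df = df.dropna()
model = LogisticRegression().fit(df[["feature_a", "feature_b"]], df["label"])

# Mark the results we care about; LineaPy traces the code needed to reproduce them
lineapy.save(df, "cleaned_data")
lineapy.save(model, "trained_model")

# Generate an Airflow pipeline from only the code slices behind those artifacts
lineapy.to_pipeline(
    artifacts=["cleaned_data", "trained_model"],
    framework="AIRFLOW",
    pipeline_name="training_pipeline",
    output_dir="./airflow_dags",
)
```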


When during the DS process do you advise running LineaPy? Should we run it each time we commit to a repository? Or only once we know the code is stable and want to refactor it for production? Where is the sweet spot, especially thinking more broadly about software engineering for data science?

LineaPy should be the first thing data scientists import into their development environment! We anticipate and support stream-of-thought data analytics workflows because that is what a data scientist needs to arrive at high-quality insights quickly.

LineaPy can capture everything during the development process, allowing users to extract any relevant parts of the development workflow post hoc in a reproducible fashion. We believe it's unreasonable to expect data scientists to figure out exactly what parts to save during development. LineaPy combats the messiness by automatically analyzing the program to extract only the parts necessary to reproduce the results of interest.
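
The post-hoc extraction works through the same artifact abstraction. Again, a sketch under the same assumptions about LineaPy's artifact API, with a hypothetical artifact name:

```python
import lineapy

# Retrieve an artifact saved earlier in a traced session (name is hypothetical)
artifact = lineapy.get("trained_model")

# LineaPy slices the session code down to only what is needed to reproduce the artifact
print(artifact.get_code())

# The underlying value itself is also available for reuse
model = artifact.get_value()
```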


What inspired you to start Linea?

In short, my passion for building ML tools and my desire to create something phenomenal from scratch (and believing I can find like-minded folks to join me on this mission).

I figured out early in my career that the ML tooling space is interesting to me because I'm a jack of all trades. Building ML tools requires a breadth of knowledge about many disciplines of CS, including ML, systems, databases, and, recently, compilers!

I spent the summer before grad school as (the first) intern at Databricks. There, I got to experience a startup's very early stage. This experience served as the inspiration for the startup journey. The energy of an early-stage startup was unlike anything I had experienced before. The 0 to 1 stage presents many challenges that stretch me in directions I've never imagined I could grow into!


What are your tips for those launching open-source libraries? When should one release the library? How much time should one invest in making the code elegant vs. writing documentation? How do we go about finding a community of contributors?

If it's "perfect," you're releasing it too late. Invite your users to be part of the chaos in the early days. Create a partnership rather than a "dictatorship" from the tool developers to the users. Documentation is extremely important too. If a tree falls in a forest and nobody is there to document it, all the development effort will have a tough path to real user impact.

As for finding community contributors, what we've found effective is to let contributions happen organically and celebrate them when they happen. This ties back to the previous point of forming a partnership with the users and looping them in early to co-develop the solution.

* The discussion was lightly edited for better readability.

Want to join the next AMA session?
Join our Discord community! Connect with maintainers, ask questions, and join AMAs with ML experts.

Join community ⟶
Dasha Maliugina
Community Manager, Evidently AI
https://www.linkedin.com/in/dmaliugina/
