SAI Notes #05: Building efficient Experimentation Environments for ML Projects.
Let's look into what it takes for an Experimentation Environment to be efficient in Machine Learning Projects.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and overall Data space.
This week in the weekly SAI Notes:
Building efficient Experimentation Environments for ML Projects.
The Great Divide of MLOps Tooling Landscape.
Building efficient Experimentation Environments for ML Projects.
Experimentation Environments are one of the most important pieces of the MLOps flow. MLOps practices exist to improve Machine Learning Product development velocity, and the biggest bottlenecks appear when experimentation environments and other infrastructure elements are poorly integrated.
Experimentation Environments themselves are where Data Scientists iterate on Machine Learning models that are meant to solve a specific business problem. Once the ML System is pushed to higher environments for deployment, Data Scientists continue to iterate on new, improved versions of ML Models to be tested in production against the current champion model.
Let’s look into the properties that an efficient Experimentation Environment should have. As an MLOps engineer you should strive to provide these to your users, and as a Data Scientist you should know what to ask for.
The environment needs to have access to the raw data. While handling raw data is the responsibility of the Data Engineering function, Data Scientists need the ability to explore and analyse the available raw data and decide which parts of it need to be moved upstream in the Data Value Chain (2.1) and curated by the Data Engineers.
The environment needs to have access to the curated data. Curated data might be available in the Data Warehouse but not exposed via a Feature Store. Such Data should not be exposed for model training in production environments. Data Scientists need the ability to explore curated data and see what needs to be pushed downstream (3.1) to the Feature Store.
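To make the last two points concrete, here is a minimal sketch of how a Data Scientist might explore both raw and curated data from a Notebook using Spark. The bucket path, table name and the assumption that curated tables are registered in a catalog visible to Spark are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-exploration").getOrCreate()

# Explore raw data straight from object storage
# (bucket and path are hypothetical; the environment is assumed to have
# the relevant object storage connectors configured).
raw_events = spark.read.json("s3a://raw-zone/clickstream/2023/05/")
raw_events.printSchema()
raw_events.show(5)

# Explore curated data exposed through the Data Warehouse / metastore,
# assuming curated tables are registered in a catalog visible to Spark.
curated_users = spark.sql("SELECT * FROM curated.user_profiles LIMIT 100")
curated_users.describe().show()
```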
Data used for training Machine Learning models should be sourced from a Feature Store if the ML Training pipeline is ready to be moved to the production stage.
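As an illustration, below is a hedged sketch of sourcing a training set from a Feature Store, assuming Feast is the chosen tool; the feature view, feature names and entity keys are hypothetical.

```python
import pandas as pd
from feast import FeatureStore

# Assumes a Feast feature repository is configured for the project;
# feature view, feature names and entity keys below are hypothetical.
store = FeatureStore(repo_path=".")

# Entity dataframe: which entities and points in time to fetch features for.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-02"]),
})

# Point-in-time correct training set assembled from the Feature Store.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_stats:avg_session_length",
        "user_stats:purchases_last_30d",
    ],
).to_df()
print(training_df.head())
```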
Data Scientists should be able to easily spin up different types of compute clusters - be it Spark, Dask or any other technology - to allow effective Raw and Curated Data exploration.
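One possible way to provide this, sketched below, is Dask Gateway, which lets a Notebook request a remote cluster on demand; the gateway address and cluster size are assumptions, and a Spark-on-Kubernetes or similar setup could play the same role.

```python
from dask_gateway import Gateway

# The gateway address is a hypothetical placeholder for the endpoint
# exposed by the platform team.
gateway = Gateway("https://dask-gateway.dev.example.com")

cluster = gateway.new_cluster()   # provision a remote cluster on demand
cluster.scale(10)                 # request 10 workers
client = cluster.get_client()     # attach the Notebook session to the cluster

# ... run exploration workloads through `client` ...

cluster.shutdown()                # release the resources when done
```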
Data Scientists should be able to spin up a production-like remote Machine Learning Training pipeline in the development environment ad-hoc from the Notebook. This increases iteration speed significantly as it removes the need to create Pull Requests just to test whether the pipeline is composed correctly.
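For example, if the platform uses Kubeflow Pipelines, an ad-hoc run could be submitted straight from a Notebook roughly like this; the endpoint, pipeline and component below are hypothetical placeholders.

```python
from kfp import dsl, Client

@dsl.component(base_image="python:3.10")
def train_model(training_data_path: str):
    # Placeholder training step; a real component would load data,
    # train a model and push the resulting artifacts to storage.
    print(f"Training on {training_data_path}")

@dsl.pipeline(name="ad-hoc-training-pipeline")
def training_pipeline(training_data_path: str):
    train_model(training_data_path=training_data_path)

# Submit an ad-hoc run to the remote development cluster directly
# from the Notebook (the endpoint is a hypothetical placeholder).
client = Client(host="https://kubeflow.dev.example.com/pipeline")
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={"training_data_path": "s3://curated-zone/training/"},
    experiment_name="notebook-ad-hoc-runs",
)
```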
Once the Pipelines have been tested ad-hoc and are ready to be moved to a higher environment, there should be an automated setup in place that performs the testing and promotion when specific Pull Requests are created. For example, a PR from a /feature-* to a /release-* branch could trigger a CI/CD process that tests and deploys the ML Pipeline to a pre-prod environment.
Notebooks and any additional boilerplate code for CI/CD should be part of your Git integration. It is best for the notebooks to live next to the production project code in template repositories. Make it crystal clear where each type of code should live - a popular way to do this is to provide repository templates with clear documentation.
Data Scientists should be able to run local ML Pipelines and output Experiment-related metadata to the Experiment/Model Tracking System directly from Notebooks. From the tracking system's perspective, this integration should look no different from the one used by the automated pipelines.
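A minimal sketch of such notebook-side logging, assuming MLflow is used as the tracking system (the tracking URI, experiment name and logged values are hypothetical):

```python
import mlflow

# Tracking URI, experiment name and logged values are hypothetical.
mlflow.set_tracking_uri("https://mlflow.dev.example.com")
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="notebook-baseline"):
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_auc", 0.87)
    # The automated pipeline would call the same logging API, so metadata
    # from Notebooks and pipelines ends up in the same place.
```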
Notebooks have to run in the same environment that your production code will run in, so that incompatible dependencies do not cause problems when porting applications to production. There might be several Production Environments - it should be easy to swap between them for a given Notebook. This can be achieved by running Notebooks in containers.
The Great Divide of MLOps Tooling Landscape.