SAI Notes #05: Building efficient Experimentation Environments for ML Projects.
Let's look into what it takes for an Experimentation Environment to be efficient in Machine Learning Projects.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and overall Data space.
This week in the weekly SAI Notes:
Building efficient Experimentation Environments for ML Projects.
The Great Divide of MLOps Tooling Landscape.
Building efficient Experimentation Environments for ML Projects.
Experimentation Environments are one of the most important pieces of the MLOps flow. MLOps practices exist to improve Machine Learning Product development velocity, and the biggest bottlenecks appear when Experimentation Environments and other infrastructure elements are integrated poorly.
Experimentation Environments themselves are where Data Scientists iterate on Machine Learning models that are meant to solve a specific business problem. Once the ML System is pushed to higher environments for deployment, Data Scientists continue to iterate on new, improved versions of ML Models to be tested in production against the current champion model.
Let’s look into the properties that an efficient Experimentation Environment should have. As an MLOps engineer, you should strive to provide these to your users; as a Data Scientist, you should know what to ask for.
The environment needs to have access to the raw data. While handling raw data is the responsibility of the Data Engineering function, Data Scientists need the ability to explore and analyse the available raw data and decide which of it needs to move up the Data Value Chain (2.1) to be curated by the Data Engineers.
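As an illustration, here is a minimal sketch of such exploration using PySpark - the bucket path and the event_type column are hypothetical:

```python
from pyspark.sql import SparkSession

# Spin up (or attach to) a Spark session for exploration.
spark = SparkSession.builder.appName("raw-data-exploration").getOrCreate()

# Raw event data sitting in the data lake (bucket and path are placeholders).
raw_events = spark.read.json("s3a://data-lake/raw/events/2023/*/")

# Inspect the schema and the distribution of event types.
raw_events.printSchema()
raw_events.groupBy("event_type").count().show()
```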
The environment needs to have access to the curated data. Curated data might be available in the Data Warehouse but not yet exposed via a Feature Store. Such data should not be used for model training in production environments. Data Scientists need the ability to explore curated data and see what needs to be pushed downstream (3.1) to the Feature Store.
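A minimal sketch of curated data exploration, assuming the warehouse tables are reachable through Spark SQL - the curated.orders table and its columns are hypothetical:

```python
from pyspark.sql import SparkSession

# Attach to the metastore that backs the Data Warehouse.
spark = (
    SparkSession.builder
    .appName("curated-data-exploration")
    .enableHiveSupport()
    .getOrCreate()
)

# A curated table that is not yet exposed through the Feature Store.
orders = spark.sql("""
    SELECT customer_id, order_value, order_ts
    FROM curated.orders
    WHERE order_ts >= '2023-01-01'
""")

# Quick profiling before deciding whether to push it to the Feature Store.
orders.describe("order_value").show()
```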
Data used for training Machine Learning models should be sourced from a Feature Store once the ML Training pipeline is ready to be moved to the production stage.
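A minimal sketch of what this looks like with Feast as the Feature Store - the feature view and feature names are hypothetical:

```python
import pandas as pd
from feast import FeatureStore

# Entity dataframe: which entities and points in time we want features for.
entity_df = pd.DataFrame(
    {
        "customer_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    }
)

# Points at the feature repository configuration.
store = FeatureStore(repo_path=".")

# Point-in-time correct training dataframe built from the offline store.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:total_orders",
        "customer_features:avg_order_value",
    ],
).to_df()
```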
Data Scientists should be able to easily spin up different types of compute clusters - be it Spark, Dask or any other technology - to allow effective Raw and Curated Data exploration.
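A minimal sketch of spinning up a remote cluster, assuming a Dask Gateway deployment is available in the environment - the gateway address is a placeholder:

```python
from dask_gateway import Gateway

# Connect to a centrally managed Dask Gateway (address is a placeholder).
gateway = Gateway("https://dask-gateway.example.com")

# Spin up a new cluster and scale it to 10 workers for the exploration session.
cluster = gateway.new_cluster()
cluster.scale(10)

# Get a client and run computations against the remote cluster.
client = cluster.get_client()
print(client.dashboard_link)

# Tear the cluster down when the exploration is done.
cluster.shutdown()
```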
Data Scientists should be able to spin up a production-like remote Machine Learning Training pipeline in the development environment ad-hoc from the Notebook. This increases iteration speed significantly, as it removes the need to create Pull Requests just to test whether the pipeline is composed correctly.
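A minimal sketch of this workflow, assuming a Kubeflow Pipelines (v2 SDK) deployment in the development environment - the host URL and the training component are placeholders:

```python
from kfp import dsl, Client

@dsl.component(base_image="python:3.10")
def train_model(learning_rate: float) -> float:
    # Placeholder training step; a real component would pull features,
    # train the model and log results to the experiment tracker.
    return 0.0

@dsl.pipeline(name="adhoc-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Submit the pipeline to the remote Kubeflow Pipelines deployment
# directly from the Notebook.
client = Client(host="https://kubeflow.example.com/pipeline")
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={"learning_rate": 0.01},
)
```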
Once the Pipelines have been tested ad-hoc and are ready to be moved to a higher environment, there should be an automated setup in place that performs the testing and promotion when a specific type of Pull Request is created. E.g. a PR from a /feature-* branch to a /release-* branch could trigger a CI/CD process that tests and deploys the ML Pipeline to a pre-prod environment.
Notebooks and any additional boilerplate code for CI/CD should be part of your Git integration. It is best for the Notebooks to live next to the production project code in template repositories. Make it crystal clear where each type of code should live - a popular way to do this is to provide repository templates with clear documentation.
Data Scientists should be able to run local ML Pipelines and output Experiment-related metadata to the Experiment/Model Tracking System directly from Notebooks. The integration with the Experiment/Model Tracking System should be no different from the one used by automated pipelines.
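A minimal sketch of logging experiment metadata from a Notebook, using MLflow as an example of such a system - the tracking URI, experiment name and logged values are placeholders:

```python
import mlflow

# Point at the shared tracking server (URI is a placeholder).
mlflow.set_tracking_uri("https://mlflow.example.com")
mlflow.set_experiment("churn-model")

# Log parameters and metrics of a local run to the central tracker.
with mlflow.start_run(run_name="notebook-baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("roc_auc", 0.87)
```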
Notebooks have to run in the same environment that your production code will run in, so that incompatible dependencies do not cause problems when porting applications to production. There might be several Production Environments - it should be easy to swap between them for a given Notebook. This can be achieved by running Notebooks in containers.
The Great Divide of MLOps Tooling Landscape.
In the last few years, MLOps tooling has matured to the level where it is no longer complicated to stitch together an end-to-end MLOps workflow from the different available pieces, be it open source (OS) software or SaaS. Companies of every size can now benefit from what MLOps has to offer.
In my opinion, the near-term future will continue to be focused on point solutions rather than end-to-end ML platforms, as it has been for quite a while. While there are platforms out there that try to provide everything needed for an MLOps implementation in a single package, they continue to be of poorer quality compared to the tools that focus on a specific area of MLOps.
I have covered a high-maturity MLOps pipeline setup in one of my articles here. The landscape has settled so that OS projects and SaaS vendors have split into four broad categories that cover the end-to-end MLOps workflow. The four categories are:
Feature Platforms.
Platforms dealing with ingesting, processing and exposing features for Machine Learning training and inference purposes.
A full-featured platform will expose the following set of APIs:
Real Time Feature Ingest API - this is where you ingest curated data in real time from streaming systems like Kafka.
Batch Feature Ingest API - this is where you ingest curated data in batch from Data Warehouses and Data Lakes.
Real Time Feature Serving API - this is where you will be retrieving Features for low-latency inference (see the sketch after this list).
Batch Feature Serving API - this is where you fetch Features for Batch inference and Model Training.
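As an example of the serving side, here is a minimal sketch of a low-latency lookup using Feast - the feature view, feature names and entity are hypothetical:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Low-latency lookup from the online store for a single entity,
# as it would happen in the real time inference path.
feature_vector = store.get_online_features(
    features=[
        "customer_features:total_orders",
        "customer_features:avg_order_value",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```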
A compute layer that allows feature processing inside of the platform is optional for Feature Platforms, but I see this becoming standard practice in the industry.
The Platforms also foster collaboration and discoverability of already existing Feature Sets throughout the organisation.
Major players in the field include: Tecton, Feast, Hopsworks, Featureform.
Pipeline Frameworks.
These frameworks are similar to what we are used to seeing in Data Engineering workflows (think Airflow). ML-specific frameworks target the needs of Machine Learning model training and include dedicated operators for different flavours of ML model architectures, hyperparameter optimisation tasks, experiment tracking integrations, reproducibility etc.
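For reference, a minimal sketch of a training pipeline defined in Airflow - the DAG id, schedule and training step are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model(**context):
    # Placeholder training step; in a real pipeline this would pull
    # features, train the model and log the run to the experiment tracker.
    pass

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="train_model", python_callable=train_model)
```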
Major players in the field include: Kubeflow Pipelines, Dagster, ZenML, Airflow.
ML Metadata Stores.
Products built specifically to store metadata about Machine Learning experiments and expose it to the builders and operators of Machine Learning Products. These platforms help in debugging, comparing and collaborating on experiments.
A good Metadata Store will consist of two integrated parts - an Experiment Tracker and a Model Registry.
The Experiment Tracker is used to hold metadata about experiments, such as Training Datasets, model architecture, model performance metrics etc.
The Model Registry is used to store the model artifacts produced by these experiments. It also acts as an intermediary system for ML model handover between the development and deployment stages.
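A minimal sketch of this handover, using MLflow as an example of an integrated Experiment Tracker and Model Registry - the run id, model name and target stage are placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("https://mlflow.example.com")  # placeholder URI

# Register the artifact logged by a tracked run (run id is a placeholder).
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="churn-classifier",
)

# Promote the version to signal the handover from development to deployment.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version=result.version,
    stage="Staging",
)
```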
Major players in the field include: MLflow, neptune.ai, Weights & Biases.
Deployment and Observability.
This category could be split into two separate ones, but I believe they will move towards consolidation, as Deployment/Serving and Real Time monitoring are closely integrated.
The frameworks and tooling in this space serve two purposes:
Serving models for Real Time inference. Deployment Frameworks expose APIs to deploy specific types of models, integrate with Model Registries etc. (see the serving sketch after this list).
Monitoring model quality once it is deployed to production (Feature Drift, A/B testing etc.). The term observability actually includes not only monitoring but also tracking the health of the entire ML system end-to-end.
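A minimal sketch of the serving side, loading a registered model and exposing it over HTTP with FastAPI (a generic illustration, not one of the dedicated frameworks listed below) - the model name, stage and input schema are hypothetical:

```python
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the promoted model version from the Model Registry at startup
# (model name and stage are placeholders).
model = mlflow.pyfunc.load_model("models:/churn-classifier/Staging")

class PredictionRequest(BaseModel):
    total_orders: float
    avg_order_value: float

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # Turn the request payload into the feature frame the model expects.
    features = pd.DataFrame([request.dict()])
    prediction = model.predict(features)
    return {"prediction": float(prediction[0])}
```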
Major players in the field include: KServe, Seldon Core, SageMaker Endpoints, Arize, WhyLabs.
Join SwirlAI Data Talent Collective
If you are looking to fill your Hiring Pipeline with Data Talent, or you are looking for a new job opportunity in the Data Space, check out the SwirlAI Data Talent Collective! Find out how it works by following the link below.