SAI #13: Lambda vs. Kappa Architecture.
Lambda vs. Kappa Architecture, MLOps Maturity Model: Level 1.
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple, easy-to-digest way and help You upskill in Data Engineering, MLOps, Machine Learning and Data Science.
This week I cover two topics:
What are Lambda and Kappa Data Pipeline Architectures.
How do we move from MLOps Maturity Level 0 to 1 (by GCP).
Lambda vs. Kappa Architecture
What are Lambda and Kappa Architectures? Why are they important now?
Lambda and Kappa are both Data architectures proposed to solve the movement of large amounts of data for reliable Online access.
The most popular architecture has been, and continues to be, Lambda. However, with Stream Processing becoming more accessible to organizations of every size, you will be hearing a lot more about Kappa in the near future. Let’s see how they differ.
Lambda.
➡️ Ingestion layer is responsible for collecting the raw data and duplicating it for further Real Time and Batch processing separately.
➡️ Consists of 3 additional main layers:
👉 Speed or Stream - Raw Data arrives in Real Time and is processed by a Stream Processing Framework (e.g. Flink), then passed to the Serving Layer to create Real Time Views for low-latency, near Real Time Data access.
👉 Batch - Batch ETL Jobs with batch processing Frameworks (e.g. Spark) are run against raw Data to create reliable Batch Views for Offline Historical Data access.
👉 Serving - this is where the processed Data is exposed to the end user. Latest Real Time Data can be accessed from Real Time Views or combined with Batch Views for full history. Historical Data can be accessed from Batch Views.
❗️ Processing code is duplicated across different technologies in the Batch and Speed Layers, causing logic divergence (illustrated in the sketch after this list).
❗️ Compute resources are duplicated.
❗️ Need to manage two Infrastructures.
✅ Distributed Batch Storage is reliable and scalable; even if the System crashes, it is easily recoverable without errors.
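To make the duplication concrete, here is a minimal sketch (not a canonical implementation) of the same per-user event count written twice: once as a PySpark Batch job and once as a hand-rolled streaming consumer standing in for a Flink job. The S3 path, event schema and field names are assumptions for illustration.

```python
from collections import defaultdict

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Batch Layer: recompute per-user event counts over the full history (assumed path).
def build_batch_view(raw_path: str = "s3://my-bucket/raw-events/"):
    spark = SparkSession.builder.appName("batch-view").getOrCreate()
    events = spark.read.json(raw_path)
    return events.groupBy("user_id").agg(F.count("*").alias("event_count"))

# Speed Layer: the *same* logic re-implemented for a live stream of events.
# In a real setup this would be Flink/Kafka Streams code in a separate codebase.
def update_real_time_view(view: dict, event: dict) -> dict:
    view[event["user_id"]] += 1
    return view

real_time_view = defaultdict(int)
for event in [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]:
    update_real_time_view(real_time_view, event)
```

In a real Lambda setup these two code paths live in different repositories and run on different clusters, which is exactly where the logic divergence and duplicated compute come from.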
Kappa.
➡️ Treats both Batch and Real Time Workloads as a Stream Processing problem.
➡️ Uses Speed Layer only to prepare data for Real Time and Batch Access.
➡️ Consists of only 2 main layers:
👉 Speed or Stream - similar to Lambda, but it often (optionally) also contains Tiered Storage, meaning all of the Data coming into the system is stored indefinitely across different Storage Layers - e.g. S3 or GCS for historical data and an on-disk log for hot data.
👉 Serving - same as in Lambda, but the transformations performed in the Speed Layer are never duplicated in a separate Batch Layer.
❗️ Some transformations are hard to perform in Speed Layer (e.g. complex joins) and are eventually pushed to Batch storage for implementation.
❗️ Requires strong skills in Stream Processing.
✅ Data is processed once, with a single Stream Processing Engine (sketched below).
✅ Only need to manage a single set of Infrastructure.
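A minimal Kappa-style sketch, assuming a replayable log with Tiered Storage: a single processing function is applied to live events for the Real Time view and to a replay of the retained log to rebuild the historical view. The helper functions and event schema are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable

def process(view: Dict[str, int], event: dict) -> Dict[str, int]:
    # the only place the transformation logic lives
    view[event["user_id"]] += 1
    return view

def consume_live(events: Iterable[dict], view: Dict[str, int]) -> Dict[str, int]:
    # hot path: events arriving in real time feed the Real Time view
    for event in events:
        process(view, event)
    return view

def rebuild_from_log(replayed_events: Iterable[dict]) -> Dict[str, int]:
    # the "batch" view is just a replay of older log segments (e.g. tiered to S3/GCS)
    view: Dict[str, int] = defaultdict(int)
    for event in replayed_events:
        process(view, event)
    return view
```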
MLOps Maturity Level 1 by GCP
What is MLOps Maturity Level 1 and how do we move to it from Level 0?
We already covered MLOps Maturity Level 0 (by GCP) in one of the previous posts. Next is Level 1 - what are the main differences?
Level 1 focuses on introducing Continuous Training (CT), and it achieves this by:
1️⃣ Automation of ML Pipelines.
👉 Pipelines need to be orchestrated.
👉 Each pipeline step should be developed independently and be able to run on different technologies.
👉 Pipelines are treated as a code artifact (see the sketch after this list).
✅ You deploy Pipelines instead of Model Artifacts, allowing Continuous Training in production.
✅ Reuse of components allows for rapid experimentation.
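A rough sketch of what "Pipelines as a code artifact" can look like, assuming plain Python functions as steps: each step is developed and tested independently, and the ordered pipeline definition is what gets versioned and deployed. Step names and bodies are placeholders; a real setup would hand the same structure to an orchestrator such as Airflow, Kubeflow Pipelines or Vertex AI Pipelines.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], Context]

def ingest(ctx: Context) -> Context:
    # placeholder data source (assumption)
    ctx["raw"] = [{"x": 0.1, "y": 0}, {"x": 0.9, "y": 1}]
    return ctx

def preprocess(ctx: Context) -> Context:
    ctx["features"] = [[row["x"]] for row in ctx["raw"]]
    ctx["labels"] = [row["y"] for row in ctx["raw"]]
    return ctx

def train(ctx: Context) -> Context:
    ctx["model"] = {"threshold": 0.5}  # stand-in for a real training routine
    return ctx

# the deployed artifact is this pipeline definition, not a trained model
PIPELINE: List[Step] = [ingest, preprocess, train]

def run(pipeline: List[Step]) -> Context:
    ctx: Context = {}
    for step in pipeline:
        ctx = step(ctx)  # an orchestrator would schedule each step independently
    return ctx
```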
2️⃣ Introduction of strict Data and Model Validation steps in the ML Pipeline.
👉 Data is validated before training the Model. If inconsistencies are found - the Pipeline is aborted.
👉 The Model is validated after training. Only after it passes validation is it promoted for deployment.
✅ Short-circuiting the Pipeline on failed validation allows for safe CT in production (sketched below).
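A minimal sketch of the two validation gates, with assumed checks and thresholds: the Pipeline is aborted when the incoming data looks inconsistent, and a newly trained Model is only promoted when it is at least as good as the currently deployed baseline.

```python
def validate_data(rows: list) -> bool:
    # assumed checks: non-empty batch, expected field present, values in range
    return len(rows) > 0 and all("x" in row and 0.0 <= row["x"] <= 1.0 for row in rows)

def validate_model(candidate_accuracy: float, baseline_accuracy: float) -> bool:
    # assumed promotion rule: the new model must not be worse than the baseline
    return candidate_accuracy >= baseline_accuracy

def continuous_training_run(rows: list, baseline_accuracy: float = 0.80) -> dict:
    if not validate_data(rows):
        raise RuntimeError("Data validation failed - pipeline aborted")
    candidate_accuracy = 0.83  # placeholder for an actual train + evaluate step
    if not validate_model(candidate_accuracy, baseline_accuracy):
        raise RuntimeError("Model validation failed - keeping the current model")
    return {"accuracy": candidate_accuracy}  # only now is the model promoted
```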
3️⃣ Introduction of ML Metadata Store.
👉 Any Metadata related to ML artifact creation is tracked here.
👉 We also track performance of the ML Model.
✅ Experiments become reproducible and comparable with each other (see the example below).
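One common way to back an ML Metadata Store is MLflow Tracking (Vertex ML Metadata or a home-grown store work just as well); the experiment name, parameters and metric values below are illustrative assumptions.

```python
import mlflow

mlflow.set_experiment("churn-model")  # assumed experiment name

with mlflow.start_run(run_name="pipeline-run-2024-01-01"):
    # metadata about how the artifact was created
    mlflow.log_param("training_data", "s3://my-bucket/churn/2024-01-01/")  # assumed path
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("git_commit", "abc123")
    # model performance, so runs can be reproduced and compared
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_metric("val_auc", 0.91)
```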
4️⃣ Different Pipeline triggers in production.
👉 Ad-hoc.
👉 Cron.
👉 Reactive to Metrics produced by the Model Monitoring System (sketched below).
👉 Arrival of New Data.
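A small sketch of the reactive trigger, with an assumed metric name and threshold: when the Model Monitoring System reports degraded online accuracy, a new training pipeline run is submitted. The ad-hoc and Cron triggers are essentially the same call, wired to a button or a schedule instead.

```python
def trigger_training_pipeline() -> None:
    # placeholder for a call to the orchestrator's API (hypothetical)
    print("Submitting a new training pipeline run...")

def should_retrain(online_accuracy: float, threshold: float = 0.75) -> bool:
    # assumed rule: retrain when monitored accuracy drops below the threshold
    return online_accuracy < threshold

def on_monitoring_event(online_accuracy: float) -> None:
    if should_retrain(online_accuracy):
        trigger_training_pipeline()

on_monitoring_event(online_accuracy=0.71)  # example: degraded metric triggers retraining
```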
5️⃣ Introduction of Feature Store (Optional).
👉 Helps avoid duplication of work when defining Features.
👉 Reduces the risk of Training/Serving Skew (see the sketch below).
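A toy, in-memory sketch of the Feature Store idea, with hypothetical class and feature names: the feature logic is defined once and reused for both Offline (training) and Online (serving) access, which is what reduces Training/Serving Skew.

```python
from typing import Dict, List

def days_since_last_purchase(raw: Dict) -> float:
    # single source of truth for the feature logic (assumed feature)
    return float(raw["today"] - raw["last_purchase_day"])

class InMemoryFeatureStore:
    def __init__(self) -> None:
        self._online: Dict[str, Dict[str, float]] = {}   # low-latency serving lookups
        self._offline: List[Dict[str, float]] = []        # rows for training datasets

    def ingest(self, entity_id: str, raw: Dict) -> None:
        features = {"days_since_last_purchase": days_since_last_purchase(raw)}
        self._online[entity_id] = features
        self._offline.append({"entity_id": entity_id, **features})

    def get_online(self, entity_id: str) -> Dict[str, float]:
        return self._online[entity_id]

    def get_offline(self) -> List[Dict[str, float]]:
        return self._offline
```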
My thoughts on Level 1:
➡️ I would suggest approaching the migration from Level 0 step by step instead of trying to work on all parts simultaneously. The following could be good Quarterly Goals:
👉 Experiment Tracking is extremely important even for Level 0 - I would start with ML Metadata Store introduction.
👉 Orchestration of ML Pipelines is always a good idea, and there are many tools supporting it. If you are not doing it yet - grab this next, and make the validation steps part of this goal.
👉 The need for a Feature Store will vary with the types of Models you are deploying. I would only suggest prioritizing it if you have Models that perform Online predictions, as it will help you avoid Training/Serving Skew.
👉 Don’t rush into Automated retraining. Ad-hoc and on-schedule triggers will take you a long way.
❗️Pipeline Deployment in Level 1 is still dependent on Ops and is not automated. That comes with Level 2, and we will look into it next time.
Hi!
Any advice on a good tool to use for orchestrating ML pipelines? I've tried some (e.g. Kedro), but none fully satisfied me. We use Hydra for experiment iteration and configuration because of its extreme flexibility. However, we would like to place the experimentation phase inside a wider pipeline.