SAI #12: CAP Theorem.
CAP Theorem, MLOps Maturity Model: Level 0.
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple, easy-to-digest way. The goal is to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.
Hope you are having a great start to the year 2023!
Today we cover 2 new topics:
What is the CAP Theorem.
MLOps Maturity Levels by GCP: Level 0.
Thanks for reading SwirlAI Newsletter!
What is the CAP Theorem and why does it matter to you as a Data Engineer?
As a modern Data Engineer you will most likely work in the Cloud, utilising all kinds of Distributed technologies and Databases. Understanding the CAP theorem will help you choose a Database that fits your application's use case.
Let’s take a closer look.
First, let’s define what we mean by distributed:
➡️ A Cluster is involved.
➡️ Cluster is composed of multiple Nodes that communicate with each other through the Network.
➡️ Multiple Nodes could be deployed on a single Server, but in most production setups you will see one Node per Server.
➡️ Data is distributed between multiple Nodes.
Now, let’s define what CAP stands for:
➡️ Consistency - All Nodes in the cluster should see the same data. If we perform a read operation it should return data of the last write regardless of which node we are on.
➡️ Availability - The Distributed System remains operational regardless of the state of individual Nodes: every request receives a response, and the system continues to operate even with multiple Nodes down.
➡️ Partition-Tolerance - If a Network Partition occurs (communication between some nodes in the cluster breaks) the Distributed System does not fail even if some messages are dropped or delayed.
The CAP Theorem states that a distributed system can only provide two of the three guarantees at the same time, so your system is one of three:
👉 CA - Consistent and Available.
👉 AP - Available and Partition-Tolerant.
👉 CP - Consistent and Partition-Tolerant.
❗️ In the real world you will not find many CA Systems: Network Partitions are unavoidable in Distributed Databases, so they must be Partition-Tolerant. This means that in practice we are usually balancing between AP and CP Systems.
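The AP vs. CP trade-off during a Network Partition can be sketched in a few lines of Python. This is a toy illustration, not a real database: all the names and the two-node setup are made up for demonstration only.

```python
# Toy 2-node "cluster" showing how CP and AP strategies react to a
# Network Partition. Hypothetical code, purely illustrative.

class Node:
    def __init__(self):
        self.data = {}

def write(nodes, partitioned, key, value, mode):
    """Write to a 2-node cluster. 'mode' is 'CP' or 'AP'."""
    if partitioned:
        if mode == "CP":
            # Consistency first: refuse the write rather than let
            # replicas diverge -> the system becomes unavailable.
            raise RuntimeError("write rejected: cannot replicate")
        # Availability first: accept the write on the reachable node;
        # replicas now disagree until the partition heals.
        nodes[0].data[key] = value
    else:
        for n in nodes:  # normal case: replicate to every node
            n.data[key] = value

nodes = [Node(), Node()]
write(nodes, partitioned=False, key="x", value=1, mode="CP")  # replicated

# A partition occurs:
write(nodes, partitioned=True, key="x", value=2, mode="AP")
print(nodes[0].data["x"], nodes[1].data["x"])  # 2 1 - nodes diverged

try:
    write(nodes, partitioned=True, key="x", value=3, mode="CP")
except RuntimeError as e:
    print(e)  # CP system stays consistent by sacrificing availability
```

Notice that neither choice is "wrong" - the AP run keeps serving writes but the two nodes disagree, while the CP run stays consistent by turning writes away.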
Different databases are designed with different CAP guarantees. For example:
👉 MongoDB is a CP Database.
👉 Cassandra is an AP Database.
✅ Having said this, you can also tune consistency at the application level in Cassandra, but more on that in future posts.
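The idea behind Cassandra's tunable consistency is quorums: with N replicas, if the number of replicas acknowledging a write (W) plus the number consulted on a read (R) satisfies W + R > N, the read set always overlaps the write set, so a read sees the latest write. Below is a minimal, hypothetical sketch of that rule - not Cassandra's actual implementation:

```python
# Quorum sketch: W + R > N guarantees read/write overlap.
# All names and structure here are illustrative assumptions.

N = 3
replicas = [{} for _ in range(N)]  # each replica: key -> (timestamp, value)

def quorum_write(key, value, ts, W):
    # A real system accepts acks from any W replicas; we use the last W
    # so the overlap with the read set below is easy to see.
    for r in replicas[-W:]:
        r[key] = (ts, value)

def quorum_read(key, R):
    # Read from R replicas, keep the value with the newest timestamp.
    answers = [r[key] for r in replicas[:R] if key in r]
    return max(answers)[1] if answers else None

quorum_write("x", "v1", ts=1, W=2)  # written to replicas 1 and 2
print(quorum_read("x", R=2))        # reads replicas 0,1 -> 'v1' (overlap at 1)
print(quorum_read("x", R=1))        # reads replica 0 only -> None (W + R = 3, not > N)
```

In the real driver you would express this per query, e.g. Cassandra's QUORUM consistency level, rather than hand-rolling replica loops.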
MLOps Maturity Levels
What are the MLOps Maturity Levels? How can you make use of the concept?
Both GCP and Azure defined what they call MLOps Maturity Models around the same time, at the beginning of 2020.
It is useful to understand the concept, as it can help guide you while evolving your ML Platform automation and, more importantly, help you understand what level is enough for your use case.
You will hear the term being thrown around in conversations so it will help you navigate them.
We will analyze GCP's definition, which consists of 3 levels, and look into how to best approach moving between them.
Today we look at Level 0:
The System is very simple and lacks any kind of automation. This is what you will see in most organizations that are just starting to run ML in Production.
1: The entire Machine Learning Pipeline is executed manually and on demand in the Experimentation Environment. Retraining does not happen often, and the pipelines are not treated as artifacts or tested according to Software Engineering best practices.
2: After the Model Artifact is created it is saved into a Model Registry - it could be just a simple Object Storage or a more advanced Model Registry solution.
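The "simple Object Storage" flavour of a Model Registry from step 2 can be as small as a versioned folder layout holding the artifact plus a metadata file. Here is a hedged sketch using local disk as a stand-in for S3/GCS; the paths and metadata fields are my own assumptions, not a standard:

```python
# Minimal "model registry" sketch: artifact + metadata in a versioned
# folder. Layout and field names are hypothetical, for illustration.
import datetime
import json
import pathlib
import pickle

REGISTRY = pathlib.Path("model-registry")

def register_model(name, version, model, metrics):
    """Persist a Model Artifact together with metadata describing it."""
    path = REGISTRY / name / version
    path.mkdir(parents=True, exist_ok=True)
    (path / "model.pkl").write_bytes(pickle.dumps(model))
    (path / "metadata.json").write_text(json.dumps({
        "name": name,
        "version": version,
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "metrics": metrics,
    }, indent=2))
    return path

path = register_model("churn-model", "v1",
                      {"weights": [0.1, 0.2]},   # toy stand-in for a model
                      metrics={"auc": 0.83})
print(path)  # e.g. model-registry/churn-model/v1
```

Even this tiny layout gives the Software Engineers in step 3 a stable, versioned location to pick the artifact up from.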
3: Software Engineers pick up the Model Artifact and use the already existing Deployment procedures to expose a Model Service for the Product Application to use.
✅ Extremely easy to set up:
👉 You only need a Notebook setup with access to the Data on different curation levels.
👉 ML is usually not the first use case to be deployed into Production which means that there is most likely a mature Software Deployment System already present in the organization.
✅ Extremely Fast for developing initial MVP Models for your business.
❗️ There is no Experiment Tracking System, which means that comparing several experiments against each other becomes difficult. Also, reproducing a historical Model Training run is nearly impossible.
❗️ The ML lifecycle lives in Notebooks, which are complicated to track in code version control systems.
❗️ The System is extremely likely to suffer from Training/Serving Skew, as there is no connection between Offline and Online Data.
My thoughts on Level 0: it is OK to start here as long as you deliver value from your ML Models. What I would try to add as soon as possible is an Experiment Tracking System of any complexity - it could simply be metadata stored in Object Storage or a Database. In future posts we will look at Levels 1 and 2 and what steps you need to take to move there.
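To make the "any complexity" point concrete, here is a hedged sketch of the simplest Experiment Tracking System I can imagine: append each training run's parameters and metrics to a JSON-lines file (a stand-in for Object Storage or a Database). The file name and fields are assumptions, purely illustrative:

```python
# Minimal experiment tracking: one JSON line per training run.
# Hypothetical layout, for illustration only.
import json
import pathlib
import uuid

LOG = pathlib.Path("experiments.jsonl")
LOG.unlink(missing_ok=True)  # start fresh for this demo

def log_run(params, metrics):
    """Record one training run's parameters and resulting metrics."""
    run_id = uuid.uuid4().hex[:8]
    with LOG.open("a") as f:
        f.write(json.dumps({"run_id": run_id,
                            "params": params,
                            "metrics": metrics}) + "\n")
    return run_id

def best_run(metric):
    """Compare all logged runs and return the one with the best metric."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    return max(runs, key=lambda r: r["metrics"][metric])

log_run({"lr": 0.1}, {"auc": 0.81})
log_run({"lr": 0.01}, {"auc": 0.85})
print(best_run("auc")["params"])  # {'lr': 0.01}
```

Even this removes the two Level 0 pains above: experiments become comparable, and the parameters needed to reproduce a run are written down somewhere outside the Notebook.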