SAI #15: What's in Kubernetes for MLOps?
What's in Kubernetes for MLOps, Placing Data roles of 2023 on Data Value Chain.
👋 This is Aurimas. I write the weekly SAI Newsletter where my goal is to present complicated Data related concepts in a simple and easy to digest way. The goal is to help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science areas.
This week in the Newsletter I cover:
What's in Kubernetes for MLOps?
Placing Data roles of 2023 on Data Value Chain.
Why you should learn Data Engineering and Machine Learning Pipelines.
What's in Kubernetes for MLOps?
What is Kubernetes and why should you learn it as MLOps/ML/Data Engineer?
This is the first post in an upcoming series on Kubernetes. Today we look at the system from a bird's eye view.
So, what is Kubernetes (K8s)?
1: It is a container orchestrator that schedules, runs and heals your containerised applications in a horizontally scalable way.
Kubernetes architecture consists of two main logical groups:
2: Control plane - this is where the K8s system processes live. They are responsible for scheduling the workloads you define and for keeping the system healthy.
3: Worker nodes - this is where containers are scheduled and run.
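To make the split between the two groups more tangible, here is a minimal toy sketch in Python. It is not how Kubernetes is implemented: `Node`, `ToyControlPlane` and the "most free CPU" placement rule are hypothetical names and simplifications chosen for illustration only. The sketch shows a control plane placing workloads onto worker nodes and re-placing them when a node disappears.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A worker node with a fixed CPU capacity (in millicores)."""
    name: str
    cpu_capacity: int
    pods: dict = field(default_factory=dict)  # pod name -> requested CPU

    def free_cpu(self) -> int:
        return self.cpu_capacity - sum(self.pods.values())


class ToyControlPlane:
    """Hypothetical stand-in for the control plane (illustration only).

    It places pods on nodes and re-places them after a node is lost.
    """

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.pending = {}  # pod name -> requested CPU, not yet placed

    def submit(self, pod_name, cpu):
        """The user declares a workload; placement is decided here."""
        self.pending[pod_name] = cpu
        self.schedule()

    def schedule(self):
        for pod_name, cpu in list(self.pending.items()):
            # Assumed rule: pick the fitting node with the most free CPU.
            candidates = [n for n in self.nodes.values() if n.free_cpu() >= cpu]
            if not candidates:
                continue  # the pod stays pending until capacity appears
            target = max(candidates, key=lambda n: n.free_cpu())
            target.pods[pod_name] = cpu
            del self.pending[pod_name]

    def remove_node(self, name):
        """Simulate a node failure: its pods go back to pending."""
        lost = self.nodes.pop(name)
        self.pending.update(lost.pods)
        self.schedule()

    def placement(self):
        return {pod: n.name for n in self.nodes.values() for pod in n.pods}


control_plane = ToyControlPlane([Node("node-a", 2000), Node("node-b", 2000)])
control_plane.submit("model-api-1", 500)
control_plane.submit("model-api-2", 500)
control_plane.submit("batch-job", 1000)
print(control_plane.placement())

control_plane.remove_node("node-a")
print(control_plane.placement())
```

Note that the caller never names a node when submitting a workload. The real Kubernetes scheduler weighs many more factors, but the division of responsibilities is the same.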
How does Kubernetes help you?
4: You can have thousands of Nodes (usually you only need tens of them) in your K8s cluster, and each of them can host multiple containers. Nodes can be added to or removed from the cluster as needed, which lets the cluster scale horizontally with demand.
5: Kubernetes provides a declarative interface for deploying applications that is easy to use and understand. You describe your application deployment in YAML and submit it to the cluster, and the system continuously works to keep the actual state of the application in line with the desired state you declared.
6: Users are empowered to create and own their application architecture within boundaries pre-defined by Cluster Administrators.
✅ In most cases you can deploy multiple types of ML Applications into a single cluster and you don’t need to care about which server to deploy to - K8s will take care of it.
✅ You can request different amounts of dedicated machine resources per application.
✅ If your application goes down - K8s will make sure that the desired number of replicas is always alive.
✅ You can roll out new versions of the running application using multiple strategies - K8s will safely do it for you.
✅ You can expose your ML Services for other Product Apps to use with a few intuitive resource definitions.
✅ …
❗️Having said this, while Kubernetes is a joy to use, operating the clusters is what is usually feared. It is a complex system.
❗️The Control Plane is an overhead - you need it even if you want to deploy a single small application.
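To make the declarative interface more concrete, here is a minimal sketch of the resource definitions discussed above. Manifests are usually written in YAML, but `kubectl` accepts JSON as well, so the sketch builds them as Python dictionaries and prints JSON using only the standard library. The helper names (`build_deployment`, `build_service`), the application name `churn-model`, the image URL and the resource values are all hypothetical placeholders.

```python
import json


def build_deployment(name, image, replicas, cpu_request, memory_request,
                     container_port=8080):
    """Return a Deployment manifest as a plain dictionary (illustrative helper)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Desired number of replicas that K8s keeps alive.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Rollout strategy: replace pods one at a time.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": container_port}],
                        # Machine resources requested per container.
                        "resources": {
                            "requests": {"cpu": cpu_request,
                                         "memory": memory_request},
                        },
                    }],
                },
            },
        },
    }


def build_service(name, port=80, target_port=8080):
    """Return a Service manifest exposing the pods labelled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


# Hypothetical application name, image and resource values.
deployment = build_deployment(
    name="churn-model",
    image="registry.example.com/churn-model:1.0.0",
    replicas=3,
    cpu_request="500m",
    memory_request="512Mi",
)
service = build_service("churn-model")

print(json.dumps(deployment, indent=2))
print(json.dumps(service, indent=2))
```

Nowhere does the definition say which server to run on or how to restart a failed replica. You only state the desired end result, and the cluster is responsible for reaching and maintaining it.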
Placing Data roles of 2023 on Data Value Chain.
Here is a short description of the most popular ones.
Data Platform.
➡️ Data Platform Engineer:
👉 Builds the Infrastructure of Data Platform for other involved Data Professionals to use.
➡️ Data Engineer:
👉 Leverages Data Platform Infrastructure to build and deploy applications that extract Data from external sources, clean it, ensure quality and hand it over for further Modeling.
👉 If more complicated processing is required - e.g. stream processing - the ownership of Data Engineers extends further downstream as well.
➡️ Analytics Engineer:
👉 Communicates with business and builds out the Data Model for correct and intuitive Data use in downstream applications.
👉 Usually limited to the scope of Data Warehouse or Data Lake.
➡️ Data Analyst:
👉 Leverages the Modeled Data to answer business related questions with the help of Dashboards and Reports.
👉 This kind of work is also known as Descriptive Analytics.
Machine Learning Platform.
➡️ MLOps Engineer:
👉 Builds the Infrastructure of ML Platform for Data Scientists and Machine Learning Engineers.
➡️ Data Scientist:
👉 Leverages Datasets present in Data Platform to explore their predictive qualities and build Machine Learning Models.
👉 Leverages either DWH or Data Lake. In more mature organizations you will find Feature Stores as well.
➡️ Machine Learning Engineer:
👉 Leverages ML Platform capabilities to deploy ML Models built by Data Scientists ensuring MLOps best practices.
👉 Communicates back with MLOps engineers to align on and help improve ML Platform capabilities.
General Note:
❗️Being further upstream in the Data Value Chain does not mean that your value is less. Quite the opposite - the impact of any mistake made upstream is multiplied across the downstream applications, so the upstream roles have the most impact on the final value.
Why you should learn Data Engineering and Machine Learning Pipelines.
If you want to succeed in the Data Field you should strive to understand Data Engineering and Machine Learning Pipelines deeply.
They are probably the most important logical task groups of the MLOps Lifecycle. Here is how understanding them will help you:
➡️ It will help You grasp the full Lifecycle of Data and how business value is generated by Data Driven ML Products end-to-end.
➡️ You will get to know the full spectrum of Stakeholders involved in delivering Machine Learning to Production.
➡️ You will improve Your Systems Thinking.
➡️ With this knowledge, it will become easier to participate in important organisation-shaping conversations, helping You to advance in Your career faster.
➡️ Once You get to know all of the functions involved in The Data Flow - You can choose any part of it to dig deeper into and choose Your short term career path accordingly.
➡️ It will help You shape Your future career plans as well.
➡️ You will understand bottlenecks and where critical failures and Data Quality issues occur.
➡️ The Technology Zoo involved in delivering end-to-end Data Products will become easy to conceptually glue together.
➡️ Pipelines are what really bridge the gap from development to production - they are the key to automation.
Hi Aurimas!
I am a newbie in MLOps, and I really like your posts. I want to use some of your images with proper citation and acknowledgement. What is the best way to use them? Kindly let me know! Also, it would be great if you could share which tools you use to make them. Thanks in advance.
Hello Aurimas!
Thanks for this read. The orchestrator role of Kubernetes is interesting (it has similarities with Airflow in helping us monitor our workflows). Moreover, I believe that we should work on understanding these data engineering concepts every day in order to create ML pipelines adequately.