SAI #03: Machine Learning Deployment Types, Spark - Architecture and more...
Machine Learning Deployment Types, ML Experiment/Model Tracking, Spark - Architecture, Kafka - Reading Data (Basics)
👋 This is Aurimas. I write the weekly SAI Newsletter where my goal is to present complicated Data related concepts in a simple and easy to digest way. The goal is to help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science areas.
In this episode we cover:
Machine Learning Deployment Types.
Experiment/Model Tracking.
Spark - Architecture.
Kafka - Reading Data (Basics).
MLOps Fundamentals or What Every Machine Learning Engineer Should Know
Machine Learning Deployment Types.
There are many ways you could deploy a Machine Learning Model to serve production use cases. Even if you will not be working with them day to day, the following are the four ways you should know and understand as a Machine Learning Engineer.

➡️ Batch:

👉 You apply your trained models as part of an ETL/ELT Process on a given schedule.
👉 You load the required Features from a batch storage, apply inference and save inference results to a batch storage.
👉 It is sometimes falsely thought that you can't use this method for Real Time Predictions.
👉 Inference results can be loaded into a Real Time Storage and used for real time applications.
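The batch flow above can be sketched in a few lines. This is a minimal illustration assuming pandas-style feature tables and any model object exposing a `predict` method; the storage paths in the comments are hypothetical:

```python
import pandas as pd

def run_batch_inference(model, features: pd.DataFrame, id_column: str) -> pd.DataFrame:
    """Apply a trained model to a batch of features and return results
    that can be written back to batch (or real time) storage."""
    predictions = model.predict(features.drop(columns=[id_column]))
    return pd.DataFrame({id_column: features[id_column], "prediction": predictions})

# A scheduled ETL/ELT step would look roughly like:
# features = pd.read_parquet("s3://feature-store/daily/")    # hypothetical path
# results = run_batch_inference(model, features, "user_id")
# results.to_parquet("s3://inference-results/daily/")        # or load into a real time store
```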

➡️ Embedded in a Stream Application:

👉 You apply your trained models as part of a Stream Processing Pipeline.
👉 While Data is continuously piped through your Streaming Data Pipelines, an application with a loaded model continuously applies inference on the data and returns it to the system - most likely another Streaming Storage.
👉 This deployment type will most likely involve a real time Feature Store API to retrieve additional Static Features for inference purposes.
👉 Predictions can be consumed by multiple applications subscribing to the Inference Results Stream.
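A stream-embedded model can be sketched with a plain Python generator standing in for the streaming pipeline. The feature-store lookup and model here are hypothetical stand-ins, not any specific streaming framework's API:

```python
def enrich_and_score(record_stream, model, feature_store: dict):
    """Consume records from a stream, enrich each with Static Features
    from a (hypothetical) feature store, score it, and yield the result
    downstream (e.g. to the Inference Results Stream)."""
    for record in record_stream:
        static = feature_store.get(record["entity_id"], {})  # real time Feature Store lookup
        features = {**record["features"], **static}
        yield {"entity_id": record["entity_id"], "prediction": model(features)}

# events = ({"entity_id": i, "features": {"clicks": i}} for i in range(3))
# for scored in enrich_and_score(events, model=my_model, feature_store=my_store):
#     publish(scored)   # hypothetical: write to another Streaming Storage
```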

➡️ Request - Response:

👉 You expose your model as a backend Service.
👉 It will most likely be a REST or gRPC Service.
👉 The API service retrieves Features needed for inference from a Real Time Feature Store API.
👉 Inference can be requested by any application in real time as long as it is able to form a correct request that conforms to the API Contract.
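Framework aside, a request-response service boils down to a handler that validates the API Contract, fetches features, and returns a prediction. A framework-free sketch with illustrative field names (in production this function would sit behind a REST or gRPC endpoint):

```python
def handle_predict_request(payload: dict, model, feature_store: dict) -> dict:
    """Validate an incoming request against a simple API Contract,
    pull features from a (hypothetical) Real Time Feature Store,
    and return the inference result."""
    if "entity_id" not in payload:
        return {"error": "payload must contain 'entity_id'"}  # contract violation
    features = feature_store.get(payload["entity_id"])
    if features is None:
        return {"error": f"no features for entity {payload['entity_id']}"}
    return {"entity_id": payload["entity_id"], "prediction": model(features)}
```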

➡️ Edge:

👉 You embed your trained model directly into the application that runs on a user device.
👉 This method provides the lowest latency and improves privacy.
👉 Data most likely has to be generated and live on the device.

Tune in for more about each type of deployment in future episodes!
Experiment/Model Tracking.
A good Model Tracking System should be composed of two integrated parts: an Experiment Tracking System and a Model Registry.
Where you track ML Pipeline metadata from will depend on the MLOps maturity in your company.
If you are at the beginning of the ML journey you might be:
1️⃣ Training and Serving your Models from the experimentation environment - you run ML Pipelines inside of your Notebook and do that manually at each retraining.
If you are beyond Notebooks you will be running ML Pipelines from CI/CD Pipelines and on Orchestrator Triggers.
In any case, the ML Pipeline will not be too different and a well designed System should track at least:
2️⃣ Datasets used for Training Machine Learning Models in Experimentation or Production ML Pipelines. Here you should also track your Train/Test Splits. At this stage you should also save all important metrics that relate to Datasets - Feature Distributions etc.
3️⃣ Model Parameters (e.g. model type, hyperparameters) together with Model Performance Metrics.
4️⃣ Model Artifact Location.
5️⃣ The Machine Learning Pipeline is an Artifact itself - track information about who triggered it and when, the Pipeline ID etc.
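Points 2️⃣ through 5️⃣ can be captured in a single run record. A minimal sketch of such a schema follows; the field names are illustrative and not tied to any specific tracking tool's API:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ExperimentRun:
    """One tracked execution of an ML Pipeline."""
    run_id: str
    dataset_version: str                 # 2: dataset identifier
    train_test_split: dict               # 2: split ratios / seeds
    model_params: dict                   # 3: model type, hyperparameters
    metrics: dict = field(default_factory=dict)   # 3: performance metrics
    artifact_location: str = ""          # 4: where the trained model is stored
    triggered_by: str = ""               # 5: pipeline provenance
    pipeline_id: str = ""

    def to_json(self) -> str:
        """Serialize the run so the Tracking System can persist it."""
        return json.dumps(asdict(self), sort_keys=True)
```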
✅ Code: Everything is code - you should version and track it.
When a Trained Model Artifact is saved to a Model Registry there should always be a 1 to 1 mapping of the previously saved Model Metadata to the Artifact that was outputted to the Model Registry:
➡️ The Model Registry should have a convenient user interface in which you can compare metrics of different Experiment versions.
➡️ The Model Registry should have a capability that allows changing the Model State with a single click of a button. Usually it would be a change of state between Staging and Production.

Finally:

6️⃣ The Model Tracking System should be integrated with the Model Deployment System. Once a model state is changed to Production, the System triggers a Deployment Pipeline that deploys the new model version and decommissions the old one. This will vary depending on the type of deployment. More about this in future episodes!
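The registry-to-deployment integration in 6️⃣ can be sketched as a state change that fires a deployment hook. This is an in-memory toy, not a real registry API:

```python
class ModelRegistry:
    """Toy Model Registry: tracks model versions and their State, and
    triggers a deployment pipeline on promotion to Production."""
    def __init__(self, deploy_hook):
        self._stages = {}            # (name, version) -> "Staging" | "Production" | "Archived"
        self._deploy_hook = deploy_hook

    def register(self, name: str, version: int) -> None:
        self._stages[(name, version)] = "Staging"

    def promote(self, name: str, version: int) -> None:
        # Decommission the currently Productionized version, if any.
        for (n, v), stage in self._stages.items():
            if n == name and stage == "Production":
                self._stages[(n, v)] = "Archived"
        self._stages[(name, version)] = "Production"
        self._deploy_hook(name, version)   # kicks off the Deployment Pipeline
```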

A Model Tracking System containing these properties helps in the following ways:

➡️ You will be able to understand how a Model was built and repeat the experiment.
➡️ You will be able to share experiments with other experts involved.
➡️ You will be able to perform rapid and controlled experiments.
➡️ The system will allow safe rollbacks to any Model Version.
➡️ Such a Self-Service System removes friction between ML and Operations experts.
Data Engineering Fundamentals + or What Every Data Engineer Should Know
Spark - Architecture.
Apache Spark is an extremely popular distributed processing framework utilizing in-memory processing to speed up task execution. Most of its libraries are contained in the Spark Core layer.
As a warm up exercise for later deeper dives and tips, today we focus on some architecture basics.
Spark has several high level APIs built on top of Spark Core to support different use cases:
➡️ Spark SQL - Batch Processing.
➡️ Spark Streaming - Near Real-Time Processing.
➡️ Spark MLlib - Machine Learning.
➡️ GraphX - Graph Structures and Algorithms.
Supported programming languages:
➡️ Scala
➡️ Java
➡️ Python
➡️ R
General Architecture:
1️⃣ Once you submit a Spark Application - a SparkContext Object is created in the Driver Program. This Object is responsible for communicating with the Cluster Manager.
2️⃣ The SparkContext negotiates with the Cluster Manager for the resources required to run the Spark Application. The Cluster Manager allocates the resources inside of a respective Cluster and creates the requested number of Spark Executors.
3️⃣ After starting - the Spark Executors connect with the SparkContext to notify it about joining the Cluster. Executors regularly send heartbeats to notify the Driver Program that they are healthy and don't need rescheduling.
4️⃣ Spark Executors are responsible for executing the tasks of the Computation DAG (Directed Acyclic Graph). This could include reading or writing data, or performing a certain operation on a partition of RDDs.
Supported Cluster Managers:
➡️ Standalone - simple cluster manager shipped together with Spark.
➡️ Hadoop YARN - resource manager of the Hadoop ecosystem.
➡️ Apache Mesos - general cluster manager (⚠️ deprecated).
➡️ Kubernetes - popular open-source container orchestrator.
Spark Job Internals:
👉 The Spark Driver is responsible for constructing an optimized physical execution plan for a given application submitted for execution.
👉 This plan materializes into a Job which is a DAG of Stages.
👉 Some of the Stages can be executed in parallel if they have no sequential dependencies.
👉 Each Stage is composed of Tasks.
👉 All Tasks of a single Stage contain the same type of work, which is the smallest piece of work that can be executed in parallel and is performed by the Spark Executors.
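The "Job is a DAG of Stages" idea can be illustrated with the standard library's topological sorter. This is a conceptual sketch of stage scheduling, not Spark's actual scheduler, and the stage names are made up:

```python
from graphlib import TopologicalSorter

# A toy Job: stage dependencies mirroring "Stages run in parallel
# unless they have sequential dependencies".
job = {
    "stage_read_a": set(),
    "stage_read_b": set(),                        # independent of stage_read_a -> parallel
    "stage_join": {"stage_read_a", "stage_read_b"},
    "stage_write": {"stage_join"},
}

ts = TopologicalSorter(job)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())   # all Stages whose dependencies are satisfied
    print("parallel batch:", ready)
    for stage in ready:
        ts.done(stage)               # Tasks of the Stage finished on the Executors
```

Both read stages come out in the first batch, then the join, then the write.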
Kafka - Reading Data (Basics).
Kafka is an extremely important Distributed Messaging System to understand. Last time we covered Writing Data.
Some refreshers:
➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the Data are called Consumers.
➡️ Data is written into Topics that can be compared to tables in Databases.
➡️ Messages sent to Topics are called Records.
➡️ Topics are composed of Partitions.
➡️ Each Partition behaves as a write-ahead log.
➡️ Data is written to the end of the Partition.
➡️ Each Record has an Offset assigned to it which denotes its order in the Partition.
➡️ Offsets start at 0 and increment by 1 sequentially.
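The append-only Partition with sequential Offsets can be modeled in a few lines. This is an in-memory toy to make the offset semantics concrete, not the Kafka protocol:

```python
class Partition:
    """Toy Kafka Partition: an append-only log with sequential offsets."""
    def __init__(self):
        self._log = []

    def append(self, record) -> int:
        """Write to the end of the Partition; return the assigned Offset."""
        self._log.append(record)
        return len(self._log) - 1      # offsets start at 0, increment by 1

    def read_from(self, offset: int) -> list:
        """Read sequentially starting at the given Offset."""
        return self._log[offset:]
```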
Reading Data:
➡️ Data is read sequentially per partition.
➡️ The Initial Read Position can be set either to earliest or latest.
➡️ The earliest position initiates the consumer at offset 0, or at the earliest offset available due to the retention rules of the Topic (more about this in later episodes).
➡️ The latest position initiates the consumer at the end of a Partition - no Records will be read initially and the Consumer will wait for new data to be written.
➡️ You could write your consumers independently, but almost always the preferred way is to use Consumer Groups.
Consumer Groups:
➡️ A Consumer Group is a logical collection of clients that read a Kafka Topic and share the state.
➡️ Groups of consumers are identified by the group_id parameter.
➡️ The State is defined by the offsets at which every Partition in the Topic is being consumed.
➡️ The State of Consumer Groups is written by the Broker (more about this in later episodes) to an internal Kafka Topic named __consumer_offsets.
➡️ There can be multiple Consumer Groups reading the same Kafka Topic, each having their own independent State.
➡️ Only one Consumer per Consumer Group can be reading a Partition at a single point in time.
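Consumer Group State, i.e. the per-partition offsets kept in __consumer_offsets, can be mimicked with a dict keyed the way that internal topic is. Again a toy sketch, with the sentinel value for "latest" being my own convention:

```python
# Toy stand-in for the __consumer_offsets topic: the Broker persists,
# per (group_id, topic, partition), the next offset the group will read.
committed_offsets: dict = {}

def commit(group_id: str, topic: str, partition: int, offset: int) -> None:
    committed_offsets[(group_id, topic, partition)] = offset

def fetch_position(group_id: str, topic: str, partition: int, initial: str = "earliest") -> int:
    """Resume from the committed State; fall back to the Initial Read Position."""
    default = 0 if initial == "earliest" else -1   # -1 here means "end of partition"
    return committed_offsets.get((group_id, topic, partition), default)
```

Because group_id is part of the key, two groups reading the same Topic keep fully independent States.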
Consumer Group Tips:
❗️ If you have a prime number of Partitions in the Topic - you will always have at least one Consumer per Consumer Group consuming fewer Partitions than the others, unless the number of Consumers equals the number of Partitions.
✅ If you want an odd number of Partitions - set it to a multiple of a Prime Number.
❗️ If you have more Consumers in the Consumer Group than there are Partitions in the Topic - some of the Consumers will be Idle.
✅ Make your Topics large enough or have fewer Consumers per Consumer Group.
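Both warnings are easy to verify with plain counting arithmetic. A quick sketch of spreading partitions as evenly as possible over a group (simple counting, not Kafka's actual partition assignor):

```python
def partitions_per_consumer(n_partitions: int, n_consumers: int) -> list:
    """How many partitions each consumer in a group ends up with when
    partitions are spread as evenly as possible."""
    base, extra = divmod(n_partitions, n_consumers)
    return [base + 1] * extra + [base] * (n_consumers - extra)

print(partitions_per_consumer(7, 3))   # prime partition count -> uneven: [3, 2, 2]
print(partitions_per_consumer(4, 6))   # more consumers than partitions -> idle: [1, 1, 1, 1, 0, 0]
print(partitions_per_consumer(6, 3))   # divisible -> balanced: [2, 2, 2]
```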