SwirlAI Newsletter

Share this post

SAI #03: Machine Learning Deployment Types, Spark - Architecture and more...

www.newsletter.swirlai.com

SAI #03: Machine Learning Deployment Types, Spark - Architecture and more...

Machine Learning Deployment Types, ML Experiment/Model Tracking, Spark - Architecture, Kafka - Reading Data (Basics)

Aurimas Griciลซnas
Oct 29, 2022
12
Share this post

SAI #03: Machine Learning Deployment Types, Spark - Architecture and more...

www.newsletter.swirlai.com

๐Ÿ‘‹ This is Aurimas. I write the weekly SAI Newsletter where my goal is to present complicated Data related concepts in a simple and easy to digest way. The goal is to help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science areas.

In this episode we cover:

  • Machine Learning Deployment Types.

  • Experiment/Model Tracking.

  • Spark - Architecture.

  • Kafka -Reading Data (Basics).

Thanks for reading SwirlAI Newsletter! Subscribe for free to receive new posts and support my work.


MLOps Fundamentals or What Every Machine Learning Engineer Should Know


๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐——๐—ฒ๐—ฝ๐—น๐—ผ๐˜†๐—บ๐—ฒ๐—ป๐˜ ๐—ง๐˜†๐—ฝ๐—ฒ๐˜€.ย 


There are many ways you could deploy a ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐— ๐—ผ๐—ฑ๐—ฒ๐—น to serve production use cases. Even if you will not be working with them day to day,ย  the following are the four ways you should know and understand as a ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ.
ย 
โžก๏ธ ๐—•๐—ฎ๐˜๐—ฐ๐—ต:ย 
ย 
๐Ÿ‘‰ You apply your trained models as a part of ๐—˜๐—ง๐—Ÿ/๐—˜๐—Ÿ๐—ง ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ on a given schedule.
๐Ÿ‘‰ You load the required Features from a batch storage, apply inference and save inference results to a batch storage.
๐Ÿ‘‰ It is sometimes falsely thought that you canโ€™t use this method for ๐—ฅ๐—ฒ๐—ฎ๐—น ๐—ง๐—ถ๐—บ๐—ฒ ๐—ฃ๐—ฟ๐—ฒ๐—ฑ๐—ถ๐—ฐ๐˜๐—ถ๐—ผ๐—ป๐˜€.
๐Ÿ‘‰ Inference results ๐—ฐ๐—ฎ๐—ป ๐—ฏ๐—ฒ ๐—น๐—ผ๐—ฎ๐—ฑ๐—ฒ๐—ฑ ๐—ถ๐—ป๐˜๐—ผ ๐—ฎ ๐—ฟ๐—ฒ๐—ฎ๐—น ๐˜๐—ถ๐—บ๐—ฒ ๐˜€๐˜๐—ผ๐—ฟ๐—ฎ๐—ด๐—ฒ and used for real time applications.
ย 
โžก๏ธ ๐—˜๐—บ๐—ฏ๐—ฒ๐—ฑ๐—ฑ๐—ฒ๐—ฑ ๐—ถ๐—ป ๐—ฎ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ ๐—”๐—ฝ๐—ฝ๐—น๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป:ย 
ย 
๐Ÿ‘‰ You apply your trained models as a part of ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€๐—ถ๐—ป๐—ด ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ.
๐Ÿ‘‰ While Data is continuously piped through your ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€, an application with a loaded model continuously applies inference on the data and returns it to the system - most likely another Streaming Storage.
๐Ÿ‘‰ This deployment type will most likely involve a real time ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ ๐—”๐—ฃ๐—œ to retrieve additional ๐—ฆ๐˜๐—ฎ๐˜๐—ถ๐—ฐ ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ๐˜€ for inference purposes.
๐Ÿ‘‰ Predictions can be consumed by multiple applications subscribing to the ๐—œ๐—ป๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ.
ย 
โžก๏ธ ๐—ฅ๐—ฒ๐—พ๐˜‚๐—ฒ๐˜€๐˜ - ๐—ฅ๐—ฒ๐˜€๐—ฝ๐—ผ๐—ป๐˜€๐—ฒ:
ย 
๐Ÿ‘‰ You expose your model as a backend Service.
๐Ÿ‘‰ It will most likely be a ๐—ฅ๐—˜๐—ฆ๐—ง ๐—ผ๐—ฟ ๐—ด๐—ฅ๐—ฃ๐—– ๐—ฆ๐—ฒ๐—ฟ๐˜ƒ๐—ถ๐—ฐ๐—ฒ.
๐Ÿ‘‰ The API service retrieves Features needed for inference from a ๐—ฅ๐—ฒ๐—ฎ๐—น ๐—ง๐—ถ๐—บ๐—ฒ ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ ๐—”๐—ฃ๐—œ.
๐Ÿ‘‰ Inference can be requested by any application in real time as long as it is able to form a correct request that conforms ๐—”๐—ฃ๐—œ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜.
ย 
โžก๏ธ ๐—˜๐—ฑ๐—ด๐—ฒ:ย 
ย 
๐Ÿ‘‰ You embed your trained model directly into the application that runs on a user device.
๐Ÿ‘‰ This method provides the lowest latency and improves privacy.
๐Ÿ‘‰ Data most likely has to be generated and live inside of device.
ย 
๐—ง๐˜‚๐—ป๐—ฒ ๐—ถ๐—ป ๐—ณ๐—ผ๐—ฟ ๐—บ๐—ผ๐—ฟ๐—ฒ ๐—ฎ๐—ฏ๐—ผ๐˜‚๐˜ ๐—ฒ๐—ฎ๐—ฐ๐—ต ๐˜๐˜†๐—ฝ๐—ฒ ๐—ผ๐—ณ ๐—ฑ๐—ฒ๐—ฝ๐—น๐—ผ๐˜†๐—บ๐—ฒ๐—ป๐˜ ๐—ถ๐—ป ๐—ณ๐˜‚๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฒ๐—ฝ๐—ถ๐˜€๐—ผ๐—ฑ๐—ฒ๐˜€!


๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜/๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด.

A good ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ should be composed of two integrated parts: ๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜ ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ and a ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฅ๐—ฒ๐—ด๐—ถ๐˜€๐˜๐—ฟ๐˜†.

From where you track ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ metadata will depend on ๐— ๐—Ÿ๐—ข๐—ฝ๐˜€ maturity in your company.ย 

If you are at the beginning of the ML journey you might be:

1๏ธโƒฃ Training and Serving your Models from experimentation environment - you run ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€ inside of your ๐—ก๐—ผ๐˜๐—ฒ๐—ฏ๐—ผ๐—ผ๐—ธ and do that manually at each retraining.

If you are beyond Notebooks you will be running ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€ from ๐—–๐—œ/๐—–๐—— ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€ and on ๐—ข๐—ฟ๐—ฐ๐—ต๐—ฒ๐˜€๐˜๐—ฟ๐—ฎ๐˜๐—ผ๐—ฟ ๐—ง๐—ฟ๐—ถ๐—ด๐—ด๐—ฒ๐—ฟ๐˜€.

In any case, the ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ will not be too different and a well designed System should track at least:

2๏ธโƒฃ ๐——๐—ฎ๐˜๐—ฎ๐˜€๐—ฒ๐˜๐˜€ used for ๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ถ๐—ป๐—ด ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ in ๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ผ๐—ฟ ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€. Here you should also track your ๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป/๐—ง๐—ฒ๐˜€๐˜ ๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜€. At this stage you should also save all important metrics that relate to ๐——๐—ฎ๐˜๐—ฎ๐˜€๐—ฒ๐˜๐˜€ - ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ถ๐—ผ๐—ป etc.
3๏ธโƒฃ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฃ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ๐˜€ (e.g. model type, hyperparameters) together with ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฃ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ ๐— ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฐ๐˜€.
4๏ธโƒฃ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—”๐—ฟ๐˜๐—ถ๐—ณ๐—ฎ๐—ฐ๐˜ ๐—Ÿ๐—ผ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป.
5๏ธโƒฃ ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ is an ๐—”๐—ฟ๐˜๐—ถ๐—ณ๐—ฎ๐—ฐ๐˜ itself - track information about who and when triggered it. Pipeline ID etc.
โœ… ๐—–๐—ผ๐—ฑ๐—ฒ: Everything is code - you should version and track it.

When a ๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ฒ๐—ฑ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—”๐—ฟ๐˜๐—ถ๐—ณ๐—ฎ๐—ฐ๐˜ is saved to a ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฅ๐—ฒ๐—ด๐—ถ๐˜€๐˜๐—ฟ๐˜† there should always be a 1 to 1 mapping of previously saved ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐— ๐—ฒ๐˜๐—ฎ๐—ฑ๐—ฎ๐˜๐—ฎ ๐˜๐—ผ ๐—ง๐—ต๐—ฒ ๐—”๐—ฟ๐˜๐—ถ๐—ณ๐—ฎ๐—ฐ๐˜ which was outputted to ๐—ง๐—ต๐—ฒ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฅ๐—ฒ๐—ด๐—ถ๐˜€๐˜๐—ฟ๐˜†:

โžก๏ธย  ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฅ๐—ฒ๐—ด๐—ถ๐˜€๐˜๐—ฟ๐˜† should have a convenient user interface in which you can compare metrics of different ๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜ versions.
โžก๏ธย  ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฅ๐—ฒ๐—ด๐—ถ๐˜€๐˜๐—ฟ๐˜† should have a capability that allows change of ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ฆ๐˜๐—ฎ๐˜๐—ฒ with a single click of a button. Usually it would be a change of state between ๐—ฆ๐˜๐—ฎ๐—ด๐—ถ๐—ป๐—ด ๐—ฎ๐—ป๐—ฑ ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป.
ย 
๐—™๐—ถ๐—ป๐—ฎ๐—น๐—น๐˜†:
ย 
6๏ธโƒฃย  ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ should be integrated with the ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐——๐—ฒ๐—ฝ๐—น๐—ผ๐˜†๐—บ๐—ฒ๐—ป๐˜ ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ. Once a model state is changed to ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป, ๐—ง๐—ต๐—ฒ ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ ๐˜๐—ฟ๐—ถ๐—ด๐—ด๐—ฒ๐—ฟ๐˜€ ๐—ฎ ๐——๐—ฒ๐—ฝ๐—น๐—ผ๐˜†๐—บ๐—ฒ๐—ป๐˜ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ that will deploy a new model version and perform a decommission of the old one. This will vary depending on the type of deployment. ๐— ๐—ผ๐—ฟ๐—ฒ ๐—ฎ๐—ฏ๐—ผ๐˜‚๐˜ ๐˜๐—ต๐—ถ๐˜€ ๐—ถ๐—ป ๐˜๐—ต๐—ฒ ๐—ณ๐˜‚๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฒ๐—ฝ๐—ถ๐˜€๐—ผ๐—ฑ๐—ฒ๐˜€!
ย 
๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ containing these properties helps in the following way:
ย 
โžก๏ธย  You will be able to understand how a Model was built and repeat the experiment.
โžก๏ธย  You will be able to share experiments with other experts involved.
โžก๏ธย  You will be able to perform rapid and controlled experiments.
โžก๏ธย  The system will allow safe rollbacks to any Model Version.
โžก๏ธย  Such a Self-Service System would remove friction between ML and Operations experts.


Data Engineering Fundamentals + or What Every Data Engineer Should Know

Thanks for reading SwirlAI Newsletter! Subscribe for free to receive new posts and support my work.


๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ - ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ.

๐—”๐—ฝ๐—ฎ๐—ฐ๐—ต๐—ฒ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ is an extremely popular distributed processing framework utilizing in-memory processing to speed up task execution. Most of its libraries are contained in the Spark Core layer.

As a warm up exercise for later deeper dives and tips, today we focus on some architecture basics.

๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ต๐—ฎ๐˜€ ๐˜€๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐—ฎ๐—น ๐—ต๐—ถ๐—ด๐—ต ๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐—”๐—ฃ๐—œ๐˜€ ๐—ฏ๐˜‚๐—ถ๐—น๐˜ ๐—ผ๐—ป ๐˜๐—ผ๐—ฝ ๐—ผ๐—ณ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—–๐—ผ๐—ฟ๐—ฒ ๐˜๐—ผ ๐˜€๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜ ๐—ฑ๐—ถ๐—ณ๐—ณ๐—ฒ๐—ฟ๐—ฒ๐—ป๐˜ ๐˜‚๐˜€๐—ฒ ๐—ฐ๐—ฎ๐˜€๐—ฒ๐˜€:

โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—ฆ๐—ค๐—Ÿ - Batch Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ๐—ถ๐—ป๐—ด - Near to Real-Time Processing.
โžก๏ธ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐— ๐—Ÿ๐—น๐—ถ๐—ฏ - Machine Learning.
โžก๏ธ ๐—š๐—ฟ๐—ฎ๐—ฝ๐—ต๐—ซ - Graph Structures and Algorithms.

๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—ฝ๐—ฟ๐—ผ๐—ด๐—ฟ๐—ฎ๐—บ๐—บ๐—ถ๐—ป๐—ด ๐—Ÿ๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ๐˜€:

โžก๏ธ Scala
โžก๏ธ Java
โžก๏ธ Python
โžก๏ธ R

๐—š๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐—น ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ:

1๏ธโƒฃ Once you submit a ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—”๐—ฝ๐—ฝ๐—น๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป - ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ ๐—ข๐—ฏ๐—ท๐—ฒ๐—ฐ๐˜ is created in the ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ๐—ฟ ๐—ฃ๐—ฟ๐—ผ๐—ด๐—ฟ๐—ฎ๐—บ. This Object is responsible for communicating with the ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ.
2๏ธโƒฃ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ negotiates with ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ for required resources to run ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—”๐—ฝ๐—ฝ๐—น๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป. ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ allocates the resources inside of a respective Cluster and creates a requested number of ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—˜๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ๐˜€.
3๏ธโƒฃ After starting - ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—˜๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ๐˜€ will connect with ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ to notify about joining the Cluster. ๐—˜๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ๐˜€ will be sending heartbeats regularly to notify the ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ๐—ฟ ๐—ฃ๐—ฟ๐—ผ๐—ด๐—ฟ๐—ฎ๐—บ that they are healthy and donโ€™t need rescheduling.
4๏ธโƒฃ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—˜๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ๐˜€ are responsible for executing tasks of the ๐—–๐—ผ๐—บ๐—ฝ๐˜‚๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐——๐—”๐—š (๐——๐—ถ๐—ฟ๐—ฒ๐—ฐ๐˜๐—ฒ๐—ฑ ๐—”๐—ฐ๐˜†๐—ฐ๐—น๐—ถ๐—ฐ ๐—š๐—ฟ๐—ฎ๐—ฝ๐—ต). This could include reading, writing data or performing a certain operation on a partition of RDDs.

๐—ฆ๐˜‚๐—ฝ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฒ๐—ฑ ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ ๐— ๐—ฎ๐—ป๐—ฎ๐—ด๐—ฒ๐—ฟ๐˜€:

โžก๏ธ ๐—ฆ๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—น๐—ผ๐—ป๐—ฒ - simple cluster manager shipped together with Spark.
โžก๏ธ ๐—›๐—ฎ๐—ฑ๐—ผ๐—ผ๐—ฝ ๐—ฌ๐—”๐—ฅ๐—ก - resource manager of Hadoop ecosystem.ย 
โžก๏ธ ๐—”๐—ฝ๐—ฎ๐—ฐ๐—ต๐—ฒ ๐— ๐—ฒ๐˜€๐—ผ๐˜€ - general cluster manager (โ—๏ธ deprecated).
โžก๏ธ ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ - popular open-source container orchestrator.

๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—๐—ผ๐—ฏ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น๐˜€:

๐Ÿ‘‰ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐——๐—ฟ๐—ถ๐˜ƒ๐—ฒ๐—ฟ is responsible for constructing an optimized physical execution plan for a given application submitted for execution.ย 
๐Ÿ‘‰ This plan materializes into a Job which is a ๐——๐—”๐—š ๐—ผ๐—ณ ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ๐˜€.ย 
๐Ÿ‘‰ Some of the ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ๐˜€ can be executed in parallel if they have no sequential dependencies.ย 
๐Ÿ‘‰ Each ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ is composed of ๐—ง๐—ฎ๐˜€๐—ธ๐˜€.ย 
๐Ÿ‘‰ All ๐—ง๐—ฎ๐˜€๐—ธ๐˜€ of a single ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ contain the same type of work which is the smallest piece of work that can be executed in parallel and is performed by ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—˜๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ๐˜€.


๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ - ๐—ฅ๐—ฒ๐—ฎ๐—ฑ๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ (๐—•๐—ฎ๐˜€๐—ถ๐—ฐ๐˜€).

Kafka is an extremely important ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ฒ๐—ฑ ๐— ๐—ฒ๐˜€๐˜€๐—ฎ๐—ด๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ to understand, last time we covered Writing Data.

๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ฟ๐—ฒ๐—ณ๐—ฟ๐—ฒ๐˜€๐—ต๐—ฒ๐—ฟ๐˜€:

โžก๏ธ Clients writing to Kafka are called ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ฒ๐—ฟ๐˜€.
โžก๏ธ Clients reading the Data are called ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€.
โžก๏ธ Data is written into ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ๐˜€ that can be compared to tables in Databases.
โžก๏ธ Messages sent to ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ๐˜€ are called ๐—ฅ๐—ฒ๐—ฐ๐—ผ๐—ฟ๐—ฑ๐˜€.
โžก๏ธ ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ๐˜€ are composed of ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€.
โžก๏ธ Each ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป is a combination of and behaves as a write ahead log.
โžก๏ธ Data is written to the end of the ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป.
โžก๏ธ Each ๐—ฅ๐—ฒ๐—ฐ๐—ผ๐—ฟ๐—ฑ has an ๐—ข๐—ณ๐—ณ๐˜€๐—ฒ๐˜ assigned to it which denotes its order in the ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป.
โžก๏ธ ๐—ข๐—ณ๐—ณ๐˜€๐—ฒ๐˜๐˜€ start at 0 and increment by 1 sequentially.

๐—ฅ๐—ฒ๐—ฎ๐—ฑ๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ:

โžก๏ธ Data is read sequentially per partition.
โžก๏ธ ๐—œ๐—ป๐—ถ๐˜๐—ถ๐—ฎ๐—น ๐—ฅ๐—ฒ๐—ฎ๐—ฑ ๐—ฃ๐—ผ๐˜€๐—ถ๐˜๐—ถ๐—ผ๐—ป can be set either to earliest or latest.
โžก๏ธ Earliest position initiates the consumer at offset 0 or the earliest available due to retention rules of the ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ (more about this in later episodes).
โžก๏ธ Latest position initiates the consumer at the end of a ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป - no ๐—ฅ๐—ฒ๐—ฐ๐—ผ๐—ฟ๐—ฑ๐˜€ will be read initially and the ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ will wait for new data to be written.
โžก๏ธ You could codify your consumers independently, but almost always the preferred way is to use ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ๐˜€.

๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ๐˜€:

โžก๏ธ ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ is a logical collection of clients that read a ๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ and share the state.
โžก๏ธ Groups of consumers are identified by the ๐—ด๐—ฟ๐—ผ๐˜‚๐—ฝ_๐—ถ๐—ฑ parameter.
โžก๏ธ ๐—ฆ๐˜๐—ฎ๐˜๐—ฒ is defined by the offsets that every ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐—ถ๐—ป ๐˜๐—ต๐—ฒ ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ is being consumed at.
โžก๏ธ ๐—ฆ๐˜๐—ฎ๐˜๐—ฒ of ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ๐˜€ is written by the ๐—•๐—ฟ๐—ผ๐—ธ๐—ฒ๐—ฟ (more about this in later episodes) to an internal ๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ named __๐—ฐ๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ_๐—ผ๐—ณ๐—ณ๐˜€๐—ฒ๐˜๐˜€.
โžก๏ธ There can be multiple ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ๐˜€ reading the same ๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ having their own independent ๐—ฆ๐˜๐—ฎ๐˜๐—ฒ๐˜€.
โžก๏ธ Only one ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ per ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ can be reading a ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป at a single point in time.

๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ ๐—ง๐—ถ๐—ฝ๐˜€:

โ—๏ธ If you have a prime number of ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ in the ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐย  - you will always have at least one ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ per ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ consuming less ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ than others unless number of ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€ equals number of ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€.

โœ… If you want an odd number of ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ - set it to a ๐—บ๐˜‚๐—น๐˜๐—ถ๐—ฝ๐—น๐—ฒ ๐—ผ๐—ณ ๐—ฃ๐—ฟ๐—ถ๐—บ๐—ฒ ๐—ก๐˜‚๐—บ๐—ฏ๐—ฒ๐—ฟ.ย 

โ—๏ธ If you have more ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€ in the ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ then there are ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ in the ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ - some of the ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€ will be ๐—œ๐—ฑ๐—น๐—ฒ.

โœ… Make your ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ๐˜€ large enough or have less ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€ per ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—š๐—ฟ๐—ผ๐˜‚๐—ฝ.


Thanks for reading SwirlAI Newsletter! Subscribe for free to receive new posts and support my work.


Share this post

SAI #03: Machine Learning Deployment Types, Spark - Architecture and more...

www.newsletter.swirlai.com
Comments
TopNewCommunity

No posts

Ready for more?

ยฉ 2023 Aurimas Griciลซnas
Privacy โˆ™ Terms โˆ™ Collection notice
Start WritingGet the app
Substackย is the home for great writing