SAI #06: No Excuses Data Engineering Project Template, Batch Model Deployment - The MLOps Way and more...
Spark - Parallelism, No Excuses Data Engineering Project Template, Model Release Strategies (Real Time), Batch Model Deployment - The MLOps Way.
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data related concepts in a simple and easy to digest way and help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science.
In this episode we cover:
No Excuses Data Engineering Project Template.
Spark - Parallelism.
Batch Model Deployment - The MLOps Way.
Model Release Strategies (Real Time).
No Excuses Data Engineering Project Template
I will be bringing the Template back to life and making it part of my Newsletter.
I will use it to explain some of the fundamentals we talk about here and eventually bring them to life in a tutorial series. I will also extend the Template with the missing MLOps parts, so tune in!
𝟭. Data Producers - Python Applications that extract data from chosen Data Sources and push it to the Collector via REST or gRPC API calls.
𝟮. Collector - REST or gRPC server written in Python that takes a payload (json or protobuf), validates top level field existence and correctness, adds additional metadata and pushes the data into either the Raw Events Topic if the validation passes or a Dead Letter Queue if top level fields are invalid (a minimal Collector sketch follows this component list).
𝟯. Enricher/Validator - Python or Spark Application that validates the schema of events in the Raw Events Topic, does some optional data enrichment and pushes results into either the Enriched Events Topic if the validation passes and enrichment was successful, or a Dead Letter Queue if any of the previous steps have failed.
𝟰. Enrichment API - API of any flavour implemented with Python that can be called for enrichment purposes by Enricher/Validator. This could be a Machine Learning Model deployed as an API as well.
𝟱. Real Time Loader - Python or Spark Application that reads data from Enriched Events and Enriched Events Dead Letter Topics and writes them in real time to ElasticSearch Indexes for Analysis and alerting.
𝟲. Batch Loader - Python or Spark Application that reads data from Enriched Events Topic, batches it in memory and writes to MinIO Object Storage.
𝟳. Scripts scheduled via Airflow that read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and any additional Enrichments. Here you also construct your Data Model to be used later for reporting purposes (a minimal DAG sketch follows the infrastructure list below).
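To make the Collector (component 2) more concrete, here is a minimal sketch assuming FastAPI and kafka-python. The endpoint path, topic names and the required field list are illustrative assumptions, not part of the Template itself.

```python
# Minimal Collector sketch: FastAPI endpoint + Kafka producer.
# Topic names, port and required fields are illustrative assumptions.
import json
from datetime import datetime, timezone

from fastapi import FastAPI, Request
from kafka import KafkaProducer

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

REQUIRED_TOP_LEVEL_FIELDS = {"event_type", "payload"}  # assumed schema


@app.post("/collect")
async def collect(request: Request):
    event = await request.json()
    # Validate only top level field existence here; deeper schema checks
    # are left to the Enricher/Validator downstream.
    missing = REQUIRED_TOP_LEVEL_FIELDS - event.keys()
    # Add collection metadata before pushing the event downstream.
    event["collected_at"] = datetime.now(timezone.utc).isoformat()
    topic = "raw-events" if not missing else "raw-events-dlq"
    producer.send(topic, value=event)
    return {"status": "accepted" if not missing else "rejected",
            "missing_fields": sorted(missing)}
```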
𝗧𝟭. A single Kafka instance that will hold all of the Topics for the Project.
𝗧𝟮. A single MinIO instance that will hold all of the Buckets for the Project.
𝗧𝟯. Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.
𝗧𝟰. Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.
𝗧𝟱: ElasticSearch instance to hold Real Time Data.
𝗧𝟲. Superset Instance that you mount on top of Trino Querying Engine for Batch Analytics and Elasticsearch for Real Time Analytics.
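And a minimal sketch of the Airflow side of component 7, assuming Airflow 2.x. The DAG id, schedule and task bodies are illustrative placeholders, not part of the Template.

```python
# Sketch of a daily curation DAG over the Enriched Events bucket in MinIO.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def validate_quality(**context):
    # e.g. read s3a://enriched-events/{ds}/ with Spark and run quality checks
    ...


def deduplicate_and_model(**context):
    # e.g. drop duplicate events and write the curated Data Model back to MinIO
    ...


with DAG(
    dag_id="curate_enriched_events",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    quality = PythonOperator(task_id="validate_quality",
                             python_callable=validate_quality)
    curate = PythonOperator(task_id="deduplicate_and_model",
                            python_callable=deduplicate_and_model)
    quality >> curate
```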
Data Engineering Fundamentals or What Every Data Engineer Should Know
𝗦𝗽𝗮𝗿𝗸 - 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺.
While optimizing Spark Applications you will usually be tuning two things - performance and resource utilization.
Understanding parallelism in Spark and tuning it according to the situation will help you in both.
➡️ Spark Executor can have multiple CPU Cores assigned to it.
➡️ The number of CPU Cores per Spark Executor is defined by the 𝘀𝗽𝗮𝗿𝗸.𝗲𝘅𝗲𝗰𝘂𝘁𝗼𝗿.𝗰𝗼𝗿𝗲𝘀 configuration (see the configuration sketch after this list).
➡️ A single CPU Core can read one file (or one partition of a splittable file) at a time.
➡️ Once read, a file is transformed into one or more partitions in memory.
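A minimal PySpark sketch of the configuration above. The resource numbers are illustrative, and on a real cluster these settings usually have to be supplied at application launch (e.g. via spark-submit); they are shown on the builder for brevity.

```python
# Sketch: configuring cores per executor when building a SparkSession.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.executor.instances", "4")   # 4 executors (illustrative)
    .config("spark.executor.cores", "4")       # 4 CPU Cores per executor
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
# Total read/processing slots = executors * cores = 16 tasks running in parallel.
```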
𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗥𝗲𝗮𝗱 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:
❗️ If the number of Cores is equal to the number of files, the files are not splittable and some of them are larger than others - the larger files become a bottleneck, and the Cores responsible for reading the smaller files will idle while they finish.
❗️ If there are more Cores than files - the Cores that have no file assigned to them will idle. If we do not repartition after reading the files - these Cores will remain idle during the processing stages as well.
✅ Rule of thumb: set the number of Cores to roughly half the number of files being read. Adjust according to your situation.
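A quick way to see read parallelism in action, assuming the SparkSession from the previous sketch and an illustrative bucket path:

```python
# Sketch: inspecting how many partitions Spark created on read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# e.g. a directory of gzip-compressed JSON files - gzip is not splittable,
# so one file maps to one partition and keeps one Core busy while it is read.
df = spark.read.json("s3a://enriched-events/2023/01/01/")  # illustrative path
print("Partitions after read:", df.rdd.getNumPartitions())

# Rule of thumb from above: with 64 such files, size the job at ~32 total Cores
# (e.g. 8 executors x 4 cores), so each Core reads two files back to back.
```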
𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:
➡️ Use 𝘀𝗽𝗮𝗿𝗸.𝗱𝗲𝗳𝗮𝘂𝗹𝘁.𝗽𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺 and 𝘀𝗽𝗮𝗿𝗸.𝘀𝗾𝗹.𝘀𝗵𝘂𝗳𝗳𝗹𝗲.𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀 configurations to set the number of partitions created after performing wide transformations.
➡️ After reading the files there will be as many partitions in memory as there were files (or partitions of splittable files) read.
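A minimal sketch tying the two configurations together; partition counts, the path and column names are illustrative assumptions:

```python
# Sketch: controlling partition counts for processing stages.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("processing-parallelism-demo")
    .config("spark.default.parallelism", "32")      # RDD wide transformations
    .config("spark.sql.shuffle.partitions", "32")   # DataFrame shuffles (joins, groupBy)
    .getOrCreate()
)

df = spark.read.json("s3a://enriched-events/2023/01/01/")  # illustrative path

# If the read produced fewer partitions than available Cores,
# repartition so that all Cores take part in processing.
df = df.repartition(32)

# A wide transformation: its output has spark.sql.shuffle.partitions partitions.
daily_counts = df.groupBy("event_type").agg(F.count("*").alias("events"))
print("Partitions after shuffle:", daily_counts.rdd.getNumPartitions())
```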