SAI #06: No Excuses Data Engineering Project Template, Batch Model Deployment - The MLOps Way and more...
Spark - Parallelism, No Excuses Data Engineering Project Template, Model Release Strategies (Real Time), Batch Model Deployment - The MLOps Way.
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data related concepts in a simple and easy to digest way, helping you upskill in Data Engineering, MLOps, Machine Learning and Data Science.
In this episode we cover:
No Excuses Data Engineering Project Template.
Spark - Parallelism.
Batch Model Deployment - The MLOps Way.
Model Release Strategies (Real Time).
No Excuses Data Engineering Project Template
I will be bringing the Template back to life and making it a regular part of my Newsletter.
I will use it to explain some of the fundamentals that we talk about here and eventually bring them to life in a tutorial series. I will also extend the Template with the missing MLOps parts, so tune in!
𝗥𝗲𝗰𝗮𝗽:
𝟭. Data Producers - Python Applications that extract data from chosen Data Sources and push it to the Collector via REST or gRPC API calls.
𝟮. Collector - a REST or gRPC server written in Python that takes a payload (JSON or Protobuf), validates top level field existence and correctness, adds additional metadata and pushes the data into the Raw Events Topic if validation passes or into a Dead Letter Queue if the top level fields are invalid (a minimal sketch follows this list).
𝟯. Enricher/Validator - a Python or Spark Application that validates the schema of events in the Raw Events Topic, performs some optional data enrichment and pushes the results into the Enriched Events Topic if validation passes and enrichment succeeds, or into a Dead Letter Queue if either of them fails.
𝟰. Enrichment API - an API of any flavour implemented in Python that the Enricher/Validator can call for enrichment purposes. This could also be a Machine Learning Model deployed as an API.
𝟱. Real Time Loader - a Python or Spark Application that reads data from the Enriched Events and Enriched Events Dead Letter Topics and writes it in real time to ElasticSearch Indexes for analysis and alerting.
𝟲. Batch Loader - a Python or Spark Application that reads data from the Enriched Events Topic, batches it in memory and writes it to MinIO Object Storage.
𝟳. Scripts scheduled via Airflow that read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and any additional enrichments. Here you also construct your Data Model to be used later for reporting purposes.
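To make the Collector (𝟮.) concrete, here is a minimal sketch of its routing logic. It assumes FastAPI and kafka-python; the endpoint, topic names and required fields are illustrative placeholders, not part of the Template.

```python
# Minimal Collector sketch: validate top level fields, add metadata, then
# route the event to the Raw Events Topic or the Dead Letter Queue.
import json
from datetime import datetime, timezone

from fastapi import FastAPI, Request
from kafka import KafkaProducer

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

REQUIRED_TOP_LEVEL_FIELDS = {"event_type", "payload"}  # illustrative schema


@app.post("/collect")
async def collect(request: Request) -> dict:
    event = await request.json()
    event["collected_at"] = datetime.now(timezone.utc).isoformat()  # metadata
    if REQUIRED_TOP_LEVEL_FIELDS.issubset(event):
        producer.send("raw-events", event)   # validation passed
        return {"status": "accepted"}
    producer.send("raw-events-dlq", event)   # invalid top level fields
    return {"status": "rejected"}
```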
𝗧𝟭. A single Kafka instance that will hold all of the Topics for the Project.
𝗧𝟮. A single MinIO instance that will hold all of the Buckets for the Project.
𝗧𝟯. An Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO (a sketch of such a job follows this list).
𝗧𝟰. A Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.
𝗧𝟱. An ElasticSearch instance to hold Real Time Data.
𝗧𝟲. A Superset instance that you mount on top of the Trino Querying Engine for Batch Analytics and on top of ElasticSearch for Real Time Analytics.
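And a hedged sketch of how a scheduled job (𝟳.) against 𝗧𝟮 and 𝗧𝟯 might look, assuming a recent Airflow 2.x and boto3 (MinIO exposes an S3 compatible API); the bucket name, credentials and DAG name are hypothetical placeholders.

```python
# Hypothetical Airflow DAG: a daily batch job over the Enriched Events
# bucket in MinIO, where quality checks and deduplication would live.
from datetime import datetime

import boto3
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def curate_enriched_events():

    @task
    def validate_and_deduplicate() -> list[str]:
        s3 = boto3.client(
            "s3",
            endpoint_url="http://minio:9000",    # MinIO endpoint
            aws_access_key_id="minioadmin",      # placeholder credentials
            aws_secret_access_key="minioadmin",
        )
        objects = s3.list_objects_v2(Bucket="enriched-events")
        # Data quality checks, deduplication and extra enrichments go here.
        return [obj["Key"] for obj in objects.get("Contents", [])]

    validate_and_deduplicate()


curate_enriched_events()
```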
Data Engineering Fundamentals, or What Every Data Engineer Should Know
𝗦𝗽𝗮𝗿𝗸 - 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺.
While optimizing Spark Applications you will usually tune for two things - performance and resource utilization.
Understanding parallelism in Spark and tuning it according to the situation will help you with both.
𝗦𝗼𝗺𝗲 𝗙𝗮𝗰𝘁𝘀:
➡️ A Spark Executor can have multiple CPU Cores assigned to it.
➡️ The number of CPU Cores per Spark Executor is defined by the 𝘀𝗽𝗮𝗿𝗸.𝗲𝘅𝗲𝗰𝘂𝘁𝗼𝗿.𝗰𝗼𝗿𝗲𝘀 configuration (see the sketch after this list).
➡️ A single CPU Core can read one file, or one partition of a splittable file, at a single point in time.
➡️ Once read, a file is transformed into one or multiple partitions in memory.
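A small PySpark sketch of the facts above; the executor counts and the input path are illustrative and assume a cluster deployment.

```python
# Sketch: 2 Executors x 4 Cores = up to 8 files (or splits) read in parallel.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.executor.instances", "2")  # two Executors...
    .config("spark.executor.cores", "4")      # ...with four CPU Cores each
    .getOrCreate()
)

df = spark.read.parquet("s3a://raw/events/")  # illustrative input path
print(df.rdd.getNumPartitions())              # partitions created by the read
```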
𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗥𝗲𝗮𝗱 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:
❗️ If the number of Cores is equal to the number of files, the files are not splittable and some of them are larger in size - the larger files become a bottleneck, while the Cores responsible for reading the smaller files idle for some time.
❗️ If there are more Cores than files - the Cores that have no files assigned to them will idle. If we do not repartition after reading the files, those Cores will remain idle during the processing stages as well (see the sketch after this list).
✅ Rule of thumb: set the number of Cores to half the number of files being read, then adjust according to your situation.
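A sketch of the second point, reusing the `spark` session from above; the file count and the target of 64 partitions are illustrative.

```python
# Fewer files than Cores: repartition after the read so that otherwise
# idle Cores take part in the downstream processing stages.
df = spark.read.json("s3a://raw/small-dataset/")  # e.g. only 3 large files
print(df.rdd.getNumPartitions())                  # 3 -> most Cores would idle

df = df.repartition(64)  # full shuffle spreads the data across all Cores
```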
𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:
➡️ Use the 𝘀𝗽𝗮𝗿𝗸.𝗱𝗲𝗳𝗮𝘂𝗹𝘁.𝗽𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺 and 𝘀𝗽𝗮𝗿𝗸.𝘀𝗾𝗹.𝘀𝗵𝘂𝗳𝗳𝗹𝗲.𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀 configurations to set the number of partitions created after performing wide transformations (a short sketch follows).
➡️ After reading the files, there will be as many partitions in memory as there were files (or partitions in splittable files).
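A short sketch of the shuffle settings; the value of 200 and the `event_id` column are illustrative. Note that `spark.default.parallelism` applies to RDD operations and must be set before the session starts, while `spark.sql.shuffle.partitions` can be changed at runtime.

```python
# Control how many partitions exist after a wide transformation.
spark.conf.set("spark.sql.shuffle.partitions", "200")  # DataFrame/SQL shuffles

deduped = df.groupBy("event_id").count()  # wide transformation -> shuffle
print(deduped.rdd.getNumPartitions())     # up to 200 (AQE may coalesce them)
```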