SAI #04: Data Contracts, Experimentation Environments and more ...
Data Contracts, Spark - Shuffle, Experimentation Environments, Model Observability - Real Time.
👋 This is Aurimas. I write the weekly SAI Newsletter where my goal is to present complicated Data related concepts in a simple and easy to digest way. The goal is to help you upskill in the Data Engineering, MLOps, Machine Learning and Data Science areas.
In this episode we cover:
Data Contracts.
Spark - Shuffle.
Experimentation Environments.
Model Observability - Real Time.
Data Engineering Fundamentals or What Every Data Engineer Should Know
Data Contracts.
In its simplest form, a Data Contract is an agreement between Data Producers and Data Consumers on what the Data being produced should look like, what SLAs it should meet and what its semantics are.
A Data Contract should hold the following non-exhaustive list of metadata (a sketch of such a contract follows the list):
👉 Schema of the Data being Produced.
👉 Schema Version - Data Sources evolve, Producers have to ensure that it is possible to detect and react to schema changes. Consumers should be able to process Data with the old Schema.
👉 SLA metadata - Quality: is it meant for Production use? How late can the data arrive? How many missing values can be expected for certain fields in a given time period?
👉 Semantics - what entity does a given Data Point represent? Semantics, similar to schema, can evolve over time.
👉 Lineage - Data Owners, Intended Consumers.
👉 …
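To make this more tangible, here is a minimal sketch of what such a contract could look like as a machine-readable definition. The stream name, fields and thresholds are illustrative assumptions, not a standard:

```python
# Hypothetical Data Contract for an "order_events" stream - structure and field
# names are illustrative only; real contracts are usually versioned YAML/JSON/Avro
# files stored in a git repository next to the producing application.
order_events_contract = {
    "name": "order_events",
    "schema_version": "1.2.0",
    "schema": {
        "order_id": "string",       # unique identifier of the order
        "user_id": "string",        # the entity that placed the order
        "amount_eur": "double",     # order value in EUR
        "created_at": "timestamp",  # event time, UTC
    },
    "sla": {
        "production_ready": True,
        "max_lateness_minutes": 30,             # how late data may arrive
        "max_null_rate": {"amount_eur": 0.01},  # tolerated missing values
    },
    "semantics": "One record per order placed in the checkout service.",
    "lineage": {
        "owner": "checkout-team",
        "intended_consumers": ["analytics", "fraud-detection"],
    },
}
```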
Some Purposes of Data Contracts:
➡️ Ensure Quality of Data in the Downstream Systems.
➡️ Prevent Data Processing Pipelines from unexpected outages.
➡️ Enforce Ownership of produced data closer to where it was generated.
➡️ Improve Scalability of your Data Systems.
➡️ Reduce the intermediate Data Handover Layer.
➡️ …
Example implementation for Data Contract Enforcement:
1️⃣ Schema changes are implemented in a git repository; once approved, they are pushed to the Applications generating the Data and to a central Schema Registry.
2️⃣ Applications push generated Data to Kafka Topics. There are separate Raw Data Topics for CDC streams and Direct emission.
3️⃣ One or more Flink Applications consume Data from the Raw Data streams and validate it against schemas in the Schema Registry (a simplified sketch of this validation step follows the list).
4️⃣ Data that does not meet the contract is pushed to a Dead Letter Topic.
5️⃣ Data that meets the contract is pushed to a Validated Data Topic.
6️⃣ Applications that need Real Time Data consume it directly from the Validated Data Topic or its derivatives.
7️⃣ Data from the Validated Data Topic is pushed to object storage for additional Validation.
8️⃣ On a schedule, Data in the Object Storage is validated against additional SLAs and pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.
9️⃣ Consumers and Producers are alerted to any SLA breaches.
🔟 Data that was Invalidated in Real Time is consumed by Flink Applications that alert on invalid schemas. There could also be a recovery Flink Application with logic on how to fix the invalidated Data.
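The validation and routing in steps 3️⃣-5️⃣ boils down to logic like the sketch below. It is a deliberately simplified stand-in - in the architecture above it would run inside Flink and fetch schemas from the Schema Registry - and the topic names, schema and `produce` callback are assumptions:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Assumed topic names - purely illustrative.
VALIDATED_TOPIC = "order_events.validated"
DEAD_LETTER_TOPIC = "order_events.dead_letter"

# In a real setup this JSON Schema would be fetched from the Schema Registry.
ORDER_EVENTS_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "user_id": {"type": "string"},
        "amount_eur": {"type": "number"},
        "created_at": {"type": "string"},
    },
    "required": ["order_id", "user_id", "amount_eur", "created_at"],
}


def route_record(raw_value: bytes, produce) -> None:
    """Validate one raw record and route it to the proper topic.

    `produce(topic, value)` is a stand-in for whatever producer client is used.
    """
    try:
        record = json.loads(raw_value)
        validate(instance=record, schema=ORDER_EVENTS_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as err:
        # Contract breach: keep the original payload together with the reason.
        produce(DEAD_LETTER_TOPIC, json.dumps({
            "error": str(err),
            "payload": raw_value.decode(errors="replace"),
        }).encode())
        return
    produce(VALIDATED_TOPIC, raw_value)
```

A consumer loop would then call `route_record` for every message read from the Raw Data Topic.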
Spark - Shuffle.
There are many things to take into consideration when improving Spark Application Performance. We will cover all of them - today is about Shuffle.
Some definitions:
➡️ Spark Jobs are executed against Data Partitions.
➡️ Each Spark Task is connected to a single Partition.
➡️ Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
➡️ After each Transformation, a number of Child Partitions will be created.
➡️ Each Child Partition will be derived from one or more Parent Partitions and will act as a Parent Partition for future Transformations.
➡️ Shuffle is the procedure in which creating Child Partitions involves moving Data between Data Containers and Spark Executors over the network.
➡️ There are two types of Transformations:
Narrow Transformations
➡️ These are simple transformations that can be applied locally, without moving Data between Data Containers.
➡️ Locality is possible because no cross-record context is needed for the Transformation logic.
➡️ E.g. a simple filter function only needs to know the value of a given record (see the sketch after the function list below).
Functions that trigger Narrow Transformations
👉 map()
👉 mapPartitions()
👉 flatMap()
👉 filter()
👉 union()
👉 contains()
👉 …
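A minimal PySpark sketch of a purely narrow pipeline - each output partition depends on exactly one input partition, so nothing moves over the network (the data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-demo").getOrCreate()

orders = spark.createDataFrame(
    [("o1", "u1", 10.0), ("o2", "u2", 25.5), ("o3", "u1", 7.0)],
    ["order_id", "user_id", "amount_eur"],
)

# filter() and withColumn() are narrow: every row is handled locally,
# so partition lineage is preserved and no shuffle is needed.
large_orders = (
    orders
    .filter(F.col("amount_eur") > 9.0)
    .withColumn("amount_usd", F.col("amount_eur") * 1.1)
)

large_orders.explain()  # the physical plan contains no Exchange (shuffle) step
```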
Wide Transformations
➡️ These are more involved transformations that trigger Data movement between Data Containers.
➡️ This movement of Data is necessary due to cross-record dependencies in a given Transformation type.
➡️ E.g. a groupBy followed by a sum needs all records that contain a specific key in the groupBy column to be located in one place. This triggers the shuffle (see the sketch after the function list below).
Functions that trigger Wide Transformations
👉 groupByKey()
👉 aggregateByKey()
👉 groupBy()
👉 aggregate()
👉 join()
👉 repartition()
👉 …
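The same made-up DataFrame with a groupBy aggregation turns into a wide transformation - all records for a given user_id have to end up in the same place, which shows up as an Exchange (shuffle) step in the physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide-demo").getOrCreate()

orders = spark.createDataFrame(
    [("o1", "u1", 10.0), ("o2", "u2", 25.5), ("o3", "u1", 7.0)],
    ["order_id", "user_id", "amount_eur"],
)

# groupBy + sum needs every record with the same user_id on the same Executor,
# so Spark shuffles the data between Data Containers.
spend_per_user = (
    orders
    .groupBy("user_id")
    .agg(F.sum("amount_eur").alias("total_eur"))
)

spend_per_user.explain()  # the physical plan now contains an Exchange step
```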
[Important]
❗️ Shuffle is an expensive operation as it requires movement of Data over the Network.
❗️ The Shuffle procedure also impacts disk I/O, since shuffled Data is saved to Disk.
❗️ Tune your applications to have as few Shuffle Operations as possible.
๐๐ฝ๐ฎ๐ฟ๐ธ.๐๐พ๐น.๐๐ต๐๐ณ๐ณ๐น๐ฒ.๐ฝ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐
โ
If Shuffle is necessary - use ๐ฝ๐ฎ๐ฟ๐ธ.๐๐พ๐น.๐๐ต๐๐ณ๐ณ๐น๐ฒ.๐ฝ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐ configuration to tune the number of partitions created after shuffle (defaults to 200).
โ
It is good idea to consider the number of cores your cluster will be working with. Rule of thumb could be having partition numbers set to one or two times more than available cores.ย
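For illustration, the setting can be supplied when the SparkSession is built or adjusted at runtime; the value of 8 below is just an example for a hypothetical 4-core cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-demo")
    # ~1-2x the number of available cores; 8 assumes a small 4-core cluster.
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

# The same setting can also be changed per job at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "16")
```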
MLOps Fundamentals or What Every Machine Learning Engineer Should Know
Experimentation Environment.
Effective ML Experimentation Environments are key to your MLOps transformation success.
At the center of this environment sits a Notebook. In the MLOps context, the most important parts are the integration points between the Notebook and the remaining MLOps ecosystem. The most critical ones are:
1️⃣ Access to Raw Data: Data Scientists need access to raw data to uncover new Features that could be moved downstream into a Curated Data Layer.
2️⃣ Access to Curated Data: Here the data is curated, meets SLAs and is of the highest quality. The only remaining step is to move it to the Feature Store. If it is not there yet, Data Scientists need to know which Data Sets are available.
3️⃣ Access to Feature Store: The last step of data quality and preparation. Features are ready to be used for Model training as is - always start here and only move upstream when you need additional data.
4️⃣ Experiment/Model Tracking System: Data Scientists should be able to run local ML Pipelines and output Experiment related metadata to the Experiment/Model Tracking System directly from Notebooks (see the sketch after this list).
5️⃣ Remote ML Pipelines: Machine Learning Pipeline development should be extremely easy for the Data Scientist - that means being able to spin up a full fledged Pipeline from their Notebook.
6️⃣ Version Control: Notebooks should be part of your Git integration. There are challenges with Notebook versioning, but that does not change the fact. Additionally, it is best for the notebooks to live next to the production project code in template repositories. Having said that, you should make it crystal clear where a certain type of code should live - a popular way to do this is providing repository templates with clear documentation.
7️⃣ Runtime Environment: Notebooks have to run in the same environment that your production code will run in, so that incompatible dependencies do not cause problems when porting applications to production. There might be several Production Environments, so it should be easy to swap them for a Notebook. This can be achieved by running Notebooks in containers.
8️⃣ Compute Clusters: Data Scientists should be able to easily spin up different types of compute clusters - be it Spark, Dask or any other technology - to allow effective Raw Data exploration.
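As an example of point 4️⃣, a tracking call made directly from a Notebook could look like the sketch below. It assumes MLflow as the Experiment/Model Tracking System; the tracking URI, experiment name, parameters and metric values are all made up:

```python
import mlflow

# Assumed address of the central tracking server - replace with your own.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-model-exploration")

with mlflow.start_run(run_name="notebook-baseline"):
    # Parameters and metrics would come from the local ML Pipeline run.
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("roc_auc", 0.87)
    mlflow.log_artifact("feature_importance.png")  # any file produced locally
```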
Model Observability - Real Time.
Your Machine Learning Systems can fail in more ways than regular software. A Machine Learning Observability System is where you continuously monitor ML Application Health and Performance.
There are end-to-end platforms that try to provide all of the capabilities described below; here we look at how an in-house system could be built, as the concepts are fundamentally similar.
To provide a holistic view of your Machine Learning Model Health in Production, the Observability System needs at least the following.
Operational Monitoring:
➡️ Here you monitor regular software operational metrics.
➡️ These include response latency, inference latency of your ML Services, how many requests per second are made etc. (a metrics sketch follows below).
Example stack:
👉 Prometheus for storing Metrics and alerting.
👉 Grafana mounted on Prometheus for exploring metrics.
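A sketch of how such operational metrics could be exposed from a Python-based ML Service using the Prometheus client library; the metric names and the scrape port are assumptions:

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("ml_service_requests_total", "Total prediction requests served")
INFERENCE_LATENCY = Histogram(
    "ml_service_inference_latency_seconds", "Time spent inside model.predict()"
)


def handle_request(features, model):
    """Wrap a prediction call with operational metrics."""
    REQUESTS.inc()
    with INFERENCE_LATENCY.time():  # records the elapsed time automatically
        return model.predict(features)


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        time.sleep(1)
```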
Application Logging:
➡️ As in any regular software system, you will need to collect your application logs centrally (a structured logging sketch follows below).
➡️ These include errors, response codes and debug logs if there is a need to dig deeper.
Example stack:
👉 fluentD Sidecars forwarding logs to ElasticSearch clusters.
👉 Kibana mounted on ElasticSearch for log search and exploration.
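One common way to make such logs easy for a fluentD Sidecar to forward is to write them as one JSON object per line to stdout; the logger name and fields below are illustrative:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for the log forwarder."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ml-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served")       # regular application log
logger.error("feature store timeout")  # errors are collected centrally too
```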
Analytical Logs:
This is the key area of Machine Learning Observability. You track Analytical Logs of your Applications - both the Product Applications and your Machine Learning Service. These Logs include application specific data like:
➡️ What prediction was made by the ML Service.
➡️ What action was performed by the Product Application after receiving the Inference result.
➡️ What the feature input to the ML Service was.
➡️ …
➡️ Some of the features for inference will still be calculated on the fly due to latency requirements. Here you can identify any remaining Training/Serving Skew.
➡️ You will be able to monitor business related metrics: Click Through Rate, Revenue Per User etc.
➡️ You will catch any Data or Concept Drift happening within your online systems.
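For illustration, a single analytical log event emitted by the ML Service could look like the sketch below; the field names and the choice to write it to stdout (for a fluentD Sidecar to forward to Kafka) are assumptions:

```python
import json
import sys
import time
import uuid


def emit_analytical_log(features: dict, prediction: float, model_version: str) -> None:
    """Write one analytical log event as a JSON line for the log forwarder."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_time": time.time(),
        "model_version": model_version,
        "features": features,      # the exact input the model saw
        "prediction": prediction,  # what the ML Service returned
    }
    sys.stdout.write(json.dumps(event) + "\n")


# Example call from the inference path:
emit_analytical_log(
    {"user_id": "u1", "basket_value": 42.0},
    prediction=0.83,
    model_version="churn-model:1.4.0",
)
```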
Example stack:
👉 fluentD Sidecars forwarding logs to a Kafka Topic.
Batch:
👉 Data is written to object storage and later loaded into the Data Warehouse.
👉 A/B Testing Systems and BI Tooling are mounted on the Data Warehouse for Analytical purposes.
Real Time:
👉 A Flink Cluster reads Data from Kafka, performs windowed Transformations and Aggregations, and either alerts on thresholds or forwards the data to Low Latency Read capable storage.