SwirlAI Newsletter

SAI #04: Data Contracts, Experimentation Environments and more ...


Data Contracts, Spark - Shuffle, Experimentation Environments, Model Observability - Real Time.

Aurimas Griciūnas
Nov 5, 2022

👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple and easy-to-digest way, and to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.

In this episode we cover:

  • Data Contracts.

  • Spark - Shuffle.

  • Experimentation Environments.

  • Model Observability - Real Time.

Thanks for reading SwirlAI Newsletter! Subscribe for free to receive new posts and support my work.


Data Engineering Fundamentals or What Every Data Engineer Should Know


๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€.


In its simplest form, a Data Contract is an agreement between Data Producers and Data Consumers on what the produced Data should look like, what SLAs it should meet, and what its semantics are.

๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜ ๐˜€๐—ต๐—ผ๐˜‚๐—น๐—ฑ ๐—ต๐—ผ๐—น๐—ฑ ๐˜๐—ต๐—ฒ ๐—ณ๐—ผ๐—น๐—น๐—ผ๐˜„๐—ถ๐—ป๐—ด ๐—ป๐—ผ๐—ป-๐—ฒ๐˜…๐—ต๐—ฎ๐˜‚๐˜€๐˜๐—ถ๐˜ƒ๐—ฒ ๐—น๐—ถ๐˜€๐˜ ๐—ผ๐—ณ ๐—บ๐—ฒ๐˜๐—ฎ๐—ฑ๐—ฎ๐˜๐—ฎ:

👉 Schema of the Data being produced.
👉 Schema Version - Data Sources evolve, and Producers have to ensure that schema changes can be detected and reacted to. Consumers should still be able to process Data with the old Schema.
👉 SLA metadata - Quality: is it meant for Production use? How late can the Data arrive? How many missing values can be expected for certain fields in a given time period?
👉 Semantics - which entity a given Data Point represents. Semantics, like the Schema, can evolve over time.
👉 Lineage - Data Owners, intended Consumers.
👉 …
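Pulled together, such a contract can be captured as a simple machine-readable definition. The sketch below expresses a hypothetical contract for a "user_signup" event as a plain Python dict; all field names, SLA keys and team names are illustrative, not a prescribed format:

```python
# A hypothetical Data Contract for a "user_signup" event, expressed as a
# plain Python dict. Every key below is illustrative, not a standard.
user_signup_contract = {
    "name": "user_signup",
    "schema_version": 2,
    "schema": {
        "user_id": "string",       # required, non-null
        "email": "string",         # required, non-null
        "signup_ts": "timestamp",  # event time, UTC
        "referrer": "string",      # optional, may be null
    },
    "sla": {
        "production_ready": True,
        "max_latency_minutes": 15,             # how late Data may arrive
        "max_null_ratio": {"referrer": 0.30},  # tolerated nulls per field
    },
    "semantics": {
        "entity": "a single completed signup of one user",
    },
    "lineage": {
        "owner": "signup-service-team",
        "intended_consumers": ["growth-analytics", "feature-store-ingest"],
    },
}
```

In practice such a definition would live in the git repository described below and be the single source of truth for both Producers and the Schema Registry.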

Some Purposes of Data Contracts:

➡️ Ensure the Quality of Data in Downstream Systems.
➡️ Protect Data Processing Pipelines from unexpected outages.
➡️ Enforce Ownership of produced Data closer to where it was generated.
➡️ Improve the Scalability of your Data Systems.
➡️ Reduce the need for an intermediate Data Handover Layer.
➡️ …

๐—˜๐˜…๐—ฎ๐—บ๐—ฝ๐—น๐—ฒ ๐—ถ๐—บ๐—ฝ๐—น๐—ฒ๐—บ๐—ฒ๐—ป๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ณ๐—ผ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜ ๐—˜๐—ป๐—ณ๐—ผ๐—ฟ๐—ฐ๐—ฒ๐—บ๐—ฒ๐—ป๐˜:

1๏ธโƒฃ Schema changes are implemented in a git repository, once approved - they are pushed to the Applications generating the Data and a central Schema Registry.
2๏ธโƒฃ Applications push generated Data to Kafka Topics. Separate Raw Data Topics for CDC streams and Direct emission.
3๏ธโƒฃ A Flink Application(s) consumes Data from Raw Data streams and validates it against schemas in the Schema Registry.
4๏ธโƒฃ Data that does not meet the contract is pushed to Dead Letter Topic.
5๏ธโƒฃ Data that meets the contract is pushed to Validated Data Topic.
6๏ธโƒฃ Applications that need Real Time Data consume it directly from Validated Data Topic or its derivatives.
7๏ธโƒฃ Data from the Validated Data Topic is pushed to object storage for additional Validation.
8๏ธโƒฃ On a schedule Data in the Object Storage is validated against additional SLAs and is pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.ย ย 
9๏ธโƒฃ Consumers and Producers are alerted to any SLA breaches.
๐Ÿ”Ÿ Data that was Invalidated in Real Time is consumed by Flink Applications that alert on invalid schemas. There could be a recovery Flink Application with logic on how to fix invalidated Data.
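The validation-and-routing core of steps 3️⃣-5️⃣ can be sketched in a few lines. The plain-Python stand-in below validates records against a registered schema and routes them to a validated or dead-letter destination; the schema format, subject names and in-memory "topics" are illustrative stand-ins for a real Schema Registry client and Kafka producers:

```python
# Minimal sketch of contract enforcement: check each raw record against
# the registered schema, then route it to the validated topic or the
# dead letter topic. Dicts and lists stand in for the real infrastructure.
SCHEMA_REGISTRY = {
    ("user_signup", 2): {"user_id": str, "email": str, "signup_ts": str},
}

def validate(record, subject, version):
    schema = SCHEMA_REGISTRY[(subject, version)]
    if set(record) != set(schema):      # missing or unexpected fields
        return False
    return all(isinstance(record[f], t) for f, t in schema.items())

validated_topic, dead_letter_topic = [], []

def route(record, subject="user_signup", version=2):
    if validate(record, subject, version):
        validated_topic.append(record)
    else:
        dead_letter_topic.append(record)

route({"user_id": "u1", "email": "a@b.com", "signup_ts": "2022-11-05T10:00:00Z"})
route({"user_id": "u2"})  # missing fields -> dead letter
```

A production Flink job would apply the same check per event, with the schema fetched (and cached) from the Schema Registry by subject and version.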


๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ - ๐—ฆ๐—ต๐˜‚๐—ณ๐—ณ๐—น๐—ฒ.

There are many things to take into consideration to improve Spark Application Performance. We will talk about all of them - today is about Shuffle.

๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ฑ๐—ฒ๐—ณ๐—ถ๐—ป๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€:

โžก๏ธ Spark Jobs are executed against Data Partitions.
โžก๏ธ Each Spark Task is connected to a single Partition.
โžก๏ธ Data Partitions are immutable - this helps with disaster recovery in Big Data applications.
โžก๏ธ After each Transformation - number of Child Partitions will be created.
โžก๏ธ Each Child Partition will be derived from one or more Parent Partitions and will act as Parent Partition for future Transformations.
โžก๏ธ Shuffle is a procedure when creation of Child Partitions involves data movement between Data Containers and Spark Executors over the network.ย 
โžก๏ธ There are two types of Transformations:

๐—ก๐—ฎ๐—ฟ๐—ฟ๐—ผ๐˜„ ๐—ง๐—ฟ๐—ฎ๐—ป๐˜€๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€

โžก๏ธ These are simple transformations that can be applied locally without moving Data between Data Containers.
โžก๏ธ Locality is made possible due to cross-record context not being needed for the Transformation logic.ย 
โžก๏ธ E.g. a simple filter function only needs to know the value of a given record.

Functions that trigger Narrow Transformations:

👉 map()
👉 mapPartitions()
👉 flatMap()
👉 filter()
👉 union()
👉 …
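As a concept sketch (plain Python, not Spark), this is why narrow transformations need no data movement: each partition can be processed independently, one record at a time, producing exactly one child per parent:

```python
# Concept sketch: narrow transformations (filter, map) only look at one
# record at a time, so every partition is processed in place and no data
# crosses partition boundaries.
parent_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# filter(even) then map(*10), applied independently per partition
child_partitions = [
    [x * 10 for x in part if x % 2 == 0]
    for part in parent_partitions
]
# child_partitions == [[20], [40, 60], [80]]
```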

๐—ช๐—ถ๐—ฑ๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ป๐˜€๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€

โžก๏ธ These are complicated transformations that trigger Data movement between Data Containers.
โžก๏ธ This movement of Data is necessary due to cross-record dependencies for a given Transformation type.
โžก๏ธ E.g. a groupBy function followed by sum needs to have all records that contain a specific key in the groupBy column locally. This triggers the shuffle.

๐˜๐˜ถ๐˜ฏ๐˜ค๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด ๐˜ต๐˜ฉ๐˜ข๐˜ต ๐˜ต๐˜ณ๐˜ช๐˜จ๐˜จ๐˜ฆ๐˜ณ ๐˜ž๐˜ช๐˜ฅ๐˜ฆ ๐˜›๐˜ณ๐˜ข๐˜ฏ๐˜ด๐˜ง๐˜ฐ๐˜ณ๐˜ฎ๐˜ข๐˜ต๐˜ช๐˜ฐ๐˜ฏ๐˜ด

๐Ÿ‘‰ groupByKey()
๐Ÿ‘‰ aggregateByKey()
๐Ÿ‘‰ groupBy()
๐Ÿ‘‰ aggregate()
๐Ÿ‘‰ join()
๐Ÿ‘‰ repartition()
๐Ÿ‘‰ โ€ฆ
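A wide transformation can be sketched the same way. Records are re-assigned to child partitions by hashing the key, so that all records sharing a key end up together; this cross-partition re-assignment is what Spark performs as the Shuffle:

```python
# Concept sketch: a groupBy-style sum needs all records with the same key
# in one place. Records are routed to child partitions by hashing the key,
# regardless of which parent partition they came from - that routing is
# the data movement Spark calls a shuffle.
parent_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]
num_child = 2
child_partitions = [[] for _ in range(num_child)]

# "shuffle write": route every record to the child partition owning its key
for part in parent_partitions:
    for key, value in part:
        child_partitions[hash(key) % num_child].append((key, value))

# after the shuffle, summing per key is a purely local operation
sums = {}
for part in child_partitions:
    for key, value in part:
        sums[key] = sums.get(key, 0) + value
# sums == {"a": 4, "b": 2, "c": 4}
```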

[Important]

โ—๏ธShuffle is an expensive operation as it requires movement of Data through the Network.
โ—๏ธShuffle procedure also impacts disk I/O since shuffled Data is saved to Disk.
โ—๏ธTune your applications to have as little Shuffle Operations as possible.

spark.sql.shuffle.partitions

✅ If Shuffle is necessary, use the spark.sql.shuffle.partitions configuration to tune the number of partitions created after the Shuffle (it defaults to 200).
✅ It is a good idea to consider the number of cores your cluster will be working with. A rule of thumb is to set the number of partitions to one or two times the number of available cores.
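As a configuration sketch (assuming PySpark and, purely for illustration, a cluster with 16 executor cores), the rule of thumb above could look like:

```python
# Config sketch: size spark.sql.shuffle.partitions from the core count.
# The core count here is an assumption - read it from your cluster setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-tuning").getOrCreate()

available_cores = 16  # illustrative; depends on your cluster
spark.conf.set("spark.sql.shuffle.partitions", str(available_cores * 2))
```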


MLOps Fundamentals or What Every Machine Learning Engineer Should Know


๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—˜๐—ป๐˜ƒ๐—ถ๐—ฟ๐—ผ๐—ป๐—บ๐—ฒ๐—ป๐˜.
ย 

Effective ML Experimentation Environments are key to your MLOps transformation success.

At the center of this environment sits the Notebook. In the MLOps context, the most important pieces are the integration points between the Notebook and the rest of the MLOps ecosystem. The most critical are:

1๏ธโƒฃ ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐˜๐—ผ ๐—ฅ๐—ฎ๐˜„ ๐——๐—ฎ๐˜๐—ฎ: Data Scientists need access to raw data to uncover new Features that could be moved downstream into a Curated Data Layer.
2๏ธโƒฃ ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐˜๐—ผ ๐—–๐˜‚๐—ฟ๐—ฎ๐˜๐—ฒ๐—ฑ ๐——๐—ฎ๐˜๐—ฎ: Here the data is curated, meeting SLAs and of the highest quality. The only remaining step is to move it to the Feature Store. If it is not there yet - Data Scientists need to know of available Data Sets.
3๏ธโƒฃ ๐—”๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐˜๐—ผ ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ: Last step for data quality and preparation. Features are ready to be used for Model training as is - always start here, only move upstream when you need additional data.
4๏ธโƒฃ ๐—˜๐˜…๐—ฝ๐—ฒ๐—ฟ๐—ถ๐—บ๐—ฒ๐—ป๐˜/๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ: Data Scientists should be able to run local ML Pipelines and output Experiment related metadata to Experiment/Model Tracking System directly from Notebooks.
5๏ธโƒฃ ๐—ฅ๐—ฒ๐—บ๐—ผ๐˜๐—ฒ ๐— ๐—Ÿ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€: Machine Learning Pipeline development should be extremely easy for the Data Scientist, that means being able to spin up a full fledged Pipeline from their Notebook.
6๏ธโƒฃ ๐—ฉ๐—ฒ๐—ฟ๐˜€๐—ถ๐—ผ๐—ป ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ผ๐—น: Notebooks should be part of your Git integration. There are challenges with Notebook versioning but it does not change the fact. Additionally, it is best for the notebooks to live next to the production project code in template repositories. Having said that - you should make it crystal clear where certain type of code should live - a popular way to do this is providing repository ๐˜๐—ฒ๐—บ๐—ฝ๐—น๐—ฎ๐˜๐—ฒ๐˜€ with clear documentation.
7๏ธโƒฃ ๐—ฅ๐˜‚๐—ป๐˜๐—ถ๐—บ๐—ฒ ๐—˜๐—ป๐˜ƒ๐—ถ๐—ฟ๐—ผ๐—ป๐—บ๐—ฒ๐—ป๐˜: Notebooks have to be running in the same environment that your production code will run in. Incompatible dependencies should not cause any problems when porting applications to production. There might be several Production Environments so it should be easy to swap it for Notebook. It can be achieved by running Notebooks in containers.
8๏ธโƒฃ ๐—–๐—ผ๐—บ๐—ฝ๐˜‚๐˜๐—ฒ ๐—–๐—น๐˜‚๐˜€๐˜๐—ฒ๐—ฟ๐˜€: Data Scientists should be able to easily spin up different types of compute clusters - might it be Spark, Dask or any other technology - to allow effective Raw Data exploration.


Model Observability - Real Time.

Your Machine Learning Systems can fail in more ways than regular Software. A Machine Learning Observability System is where you continuously monitor ML Application Health and Performance.

There are end-to-end platforms that try to provide all of the capabilities described below; we will look at what an in-house system could look like, as the concepts are fundamentally similar.

To provide a holistic view of your Machine Learning Model's Health in Production, an Observability System needs at least:

Operational Monitoring:

➡️ Here you monitor regular Software operational metrics.
➡️ This includes response latency, inference latency of your ML Services, how many requests per second are made, etc.

Example stack:

👉 Prometheus for storing metrics and alerting.
👉 Grafana mounted on Prometheus for exploring metrics.
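A minimal sketch of what such operational metrics look like before export. The real Prometheus Python client provides Counter and Histogram types for this; the classes and metric names below are stdlib stand-ins for illustration:

```python
# Stdlib stand-in for the counters and latency observations an ML service
# would export to Prometheus via its client library.
from collections import defaultdict

class Metrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies_ms = defaultdict(list)

    def inc(self, name):
        self.counters[name] += 1

    def observe(self, name, value_ms):
        self.latencies_ms[name].append(value_ms)

metrics = Metrics()

def handle_request(features):
    metrics.inc("requests_total")
    # ... run model inference here ...
    metrics.observe("inference_latency_ms", 12.5)  # measured in real code
    return {"prediction": 1}

handle_request({"x": 1.0})
handle_request({"x": 2.0})
```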

Application Logging:

➡️ As in any regular Software system, you will need to collect your application logs centrally.
➡️ This includes errors, response codes, and debug logs if there is a need to dig deeper.

Example stack:

👉 Fluentd Sidecars forwarding logs to ElasticSearch clusters.
👉 Kibana mounted on ElasticSearch for log search and exploration.

Analytical Logs:

This is a key area of Machine Learning Observability. Here you track the Analytical Logs of your Applications - both the Product Applications and your Machine Learning Service. These Logs include application-specific data like:

➡️ What prediction was made by the ML Service.
➡️ What action was performed by the Product Application after receiving the inference result.
➡️ What the Feature input to the ML Service was.
➡️ …

➡️ Some Features for inference will still be calculated on the fly due to latency requirements; here you can identify any remaining Training/Serving Skew.
➡️ You will be able to monitor business-related metrics - Click-Through Rate, Revenue Per User, etc.
➡️ You will catch any Data or Concept Drift happening in your online systems.

Example stack:

👉 Fluentd Sidecars forwarding logs to a Kafka Topic.

Batch:

👉 Data is written to object storage and later to a Data Warehouse.
👉 A/B Testing Systems and BI Tooling are mounted on the Data Warehouse for Analytical purposes.

Real Time:

👉 A Flink Cluster reads Data from Kafka, performs windowed Transformations and Aggregations, and either alerts on thresholds or forwards the Data to storage capable of Low Latency reads.
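The Real Time path can be sketched in plain Python: bucket events into tumbling windows, aggregate per window, and alert when a threshold is breached. In production this would be a Flink job reading from Kafka; the window size and threshold below are illustrative:

```python
# Plain-Python sketch of windowed aggregation with threshold alerting.
# Events are (unix_ts, prediction) pairs; windows are tumbling, 60s wide.
from collections import defaultdict

WINDOW_S = 60
ALERT_THRESHOLD = 0.5  # illustrative: max tolerated average prediction

def window_key(ts):
    # start timestamp of the tumbling window containing ts
    return ts - ts % WINDOW_S

def aggregate(events):
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, pred in events:
        w = window_key(ts)
        sums[w] += pred
        counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

events = [(0, 0.2), (30, 0.4), (70, 0.9)]
averages = aggregate(events)  # window 0 averages ~0.3, window 60 -> 0.9
alerts = [w for w, avg in averages.items() if avg > ALERT_THRESHOLD]
```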



© 2023 Aurimas Griciūnas