👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
Today in the Newsletter.
Episodes you might have missed.
Newsletter Discounts.
Data Contracts in the Data Pipeline.
Episodes you might have missed.
I wanted to create a single template that would allow you to learn the technical nuances and fundamentals of everything you might run into as a Data Engineer in a single end-to-end project. I chose technologies that you would run into most often in the day-to-day of most organisations. I also want this to be a map which would allow me to explain some of the fundamental concepts I talk about in the Newsletter and map them to a specific use case implemented in the Template. This Newsletter episode is the introduction to this map.
This is the second part of my guide to optimising Spark Application Performance. In it I cover:
Choosing the right File Format.
Column-Based vs. Row-Based storage
Splittable vs. non-splittable files
Encoding in Parquet.
RLE (Run Length Encoding)
Dictionary Encoding
Combining RLE and Dictionary Encoding
Understanding and tuning Spark Executor Memory (see the sizing sketch after this list).
Maximising the number of executors for a given cluster.
Memory allocation.
CPU allocation.
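To make the executor sizing topic above more concrete, here is a minimal sketch of how the resulting numbers could be passed to Spark. The cluster shape (3 worker nodes with 16 cores and 64 GB RAM each) and the derived values are illustrative assumptions for this sketch, not figures from the original episode.

```python
from pyspark.sql import SparkSession

# Hypothetical cluster: 3 worker nodes, 16 cores and 64 GB RAM each.
# Reserve 1 core and ~1 GB per node for the OS and daemons, then:
#   15 usable cores / 5 cores per executor  -> 3 executors per node, 9 in total
#   63 GB / 3 executors ~= 21 GB per executor, ~2 GB of which goes to memoryOverhead
spark = (
    SparkSession.builder
    .appName("executor-sizing-sketch")
    .config("spark.executor.instances", "9")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "19g")
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)
```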
Most of the hands-on tutorials that I will do in my Newsletter will end up as applications deployed on top of Kubernetes. In this first part of the Guide to Kubernetes series I cover most of the concepts that will be enough for you to follow and implement my tutorials.
I also hold all of the CNCF certifications when it comes to Kubernetes - CKA, CKAD and CKS. My goal is to also teach you the nuances needed to pass at least the CKAD exam as the series progresses into more complicated concepts around K8s.
In this Newsletter episode I start implementing The Collector piece of The SwirlAI Data Engineering Project Master Template defined in the first article mentioned in this post, and we also deploy it to a local Kubernetes cluster. The hands-on tutorial contains all of the code you will need to implement the project, specifically (a minimal sketch of both applications follows the list):
The Collector.
Python application that uses FastAPI Framework to expose a REST API endpoint.
Deploying the application on Kubernetes.
Horizontally scaling the application for High Availability.
Data Producer.
Python application to download data from the internet.
Running the application on Kubernetes.
Sending the downloaded data to a previously deployed REST API endpoint.
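As a teaser of what the tutorial builds, here is a minimal sketch of both applications. The endpoint path, payload shape and URLs are illustrative assumptions; the full implementation and the Kubernetes manifests are in the episode itself.

```python
# collector.py - a stripped-down Collector: FastAPI exposing a single REST endpoint.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Event(BaseModel):
    source: str
    payload: dict

@app.post("/collect")
def collect(event: Event) -> dict:
    # The real Collector would forward the event downstream (e.g. to a buffer);
    # here we only acknowledge receipt.
    return {"status": "accepted", "source": event.source}
```

Run it locally with `uvicorn collector:app`. The Data Producer then downloads some data and pushes it to that endpoint:

```python
# producer.py - downloads data from the internet and sends it to the Collector.
import requests

COLLECTOR_URL = "http://localhost:8000/collect"  # assumed address of the Collector Service

def run_once(source_url: str) -> None:
    data = requests.get(source_url, timeout=10).json()
    requests.post(COLLECTOR_URL, json={"source": source_url, "payload": data}, timeout=10)

if __name__ == "__main__":
    run_once("https://api.github.com/repos/apache/spark")  # any public JSON API will do
```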
Newsletter Discounts.
I now offer Newsletter discounts for specific groups:
Purchasing power parity: if you are based in a country with lower income levels compared to the US and Western Europe, email me at aurimas@swirlai.com with your country of residence to get a more affordable option.
Students: Student discount for academic email addresses. Don’t have an .edu email address? Email me at aurimas@swirlai.com.
Data Contracts in the Data Pipeline.
In its simplest form, a Data Contract is an agreement between Data Producers and Data Consumers on what the Data being produced should look like, what SLAs it should meet and what its semantics are.
A Data Contract should hold the following non-exhaustive list of metadata (a minimal sketch of such a contract follows the list):
Schema of the Data being Produced.
Schema Version - Data Sources evolve, so Producers have to ensure that it is possible to detect and react to schema changes, while Consumers should still be able to process Data with the old Schema.
SLA metadata - Quality: is it meant for Production use? How late can the data arrive? How many missing values could be expected for certain fields in a given time period?
Semantics - what entity a given Data Point represents. Semantics, similar to the Schema, can evolve over time.
Lineage - Data Owners, Intended Consumers.
…
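To make the list above more tangible, here is a minimal sketch of how such contract metadata could be represented in code. The field names and values are illustrative assumptions rather than a fixed standard.

```python
# A minimal, illustrative representation of Data Contract metadata.
from dataclasses import dataclass, field

@dataclass
class SLA:
    meant_for_production: bool
    max_arrival_delay_minutes: int        # how late the Data may arrive
    max_missing_ratio_per_field: dict     # e.g. {"user_id": 0.0, "referrer": 0.05}

@dataclass
class DataContract:
    name: str
    schema: dict                          # e.g. a JSON Schema or Avro definition
    schema_version: str                   # evolves together with the Data Source
    semantics: dict                       # what entity each field represents
    sla: SLA
    owners: list = field(default_factory=list)              # lineage: Data Owners
    intended_consumers: list = field(default_factory=list)  # lineage: Intended Consumers

contract = DataContract(
    name="user_click_events",
    schema={"type": "object", "properties": {"user_id": {"type": "string"}}},
    schema_version="1.2.0",
    semantics={"user_id": "unique identifier of the user who emitted the click"},
    sla=SLA(meant_for_production=True,
            max_arrival_delay_minutes=30,
            max_missing_ratio_per_field={"user_id": 0.0}),
    owners=["web-platform-team"],
    intended_consumers=["analytics", "recommendations"],
)
```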
Some Purposes of Data Contracts:
Ensure Quality of Data in the Downstream Systems.
Prevent unexpected outages in Data Processing Pipelines.
Enforce Ownership of produced data closer to where it was generated.
Improve scalability of your Data Systems.
Reduce intermediate Data Handover Layer.
…
Example implementation for Data Contract Enforcement:
Schema changes are implemented in a git repository; once approved, they are pushed to the Applications generating the Data and to a central Schema Registry.
Applications push generated Data to Kafka Topics, with separate Raw Data Topics for CDC streams and Direct emission.
One or more Flink Applications consume Data from the Raw Data Topics and validate it against the schemas in the Schema Registry (a simplified sketch of this validate-and-route step follows the list).
Data that does not meet the contract is pushed to Dead Letter Topic.
Data that meets the contract is pushed to Validated Data Topic.
Applications that need Real Time Data consume it directly from Validated Data Topic or its derivatives.
Data from the Validated Data Topic is pushed to object storage for additional Validation.
On a schedule, Data in the Object Storage is validated against additional SLAs and pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.
Consumers and Producers are alerted to any SLA breaches.
Data that was Invalidated in Real Time is consumed by Flink Applications that alert on invalid schemas. There could be a recovery Flink App with logic on how to fix invalidated Data.
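To illustrate the core of this flow, here is a simplified, single-process sketch of the validate-and-route step. The design above uses Flink; this version uses plain Kafka clients and jsonschema instead, and the topic names, broker address and hardcoded schema lookup are all assumptions made for the example.

```python
# Simplified validate-and-route step: read Raw Data, check it against the contract's
# schema, forward valid records to the Validated Data Topic and the rest to the
# Dead Letter Topic. Topic names, broker address and schema are illustrative.
import json

from confluent_kafka import Consumer, Producer
from jsonschema import ValidationError, validate

RAW_TOPIC = "raw-data"
VALIDATED_TOPIC = "validated-data"
DEAD_LETTER_TOPIC = "dead-letter"

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-enforcer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe([RAW_TOPIC])

def fetch_schema(topic: str) -> dict:
    # A real implementation would look the schema up in the Schema Registry;
    # hardcoded here to keep the sketch self-contained.
    return {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    }

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        record = json.loads(msg.value())
        validate(instance=record, schema=fetch_schema(RAW_TOPIC))
        producer.produce(VALIDATED_TOPIC, json.dumps(record).encode("utf-8"))
    except (ValueError, ValidationError):
        # Data that does not meet the contract goes to the Dead Letter Topic.
        producer.produce(DEAD_LETTER_TOPIC, msg.value())
    producer.flush()
```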
I will be doing a deep dive on Data Contracts and their implementation in one of the future Newsletter episodes. You might have noticed that they fit very well into the architecture of The SwirlAI Data Engineering Project Master Template as well.