SwirlAI Newsletter

SAI #09: Kafka - Use Cases, Pre Machine Learning Data Quality and more...

Pre Machine Learning Data Quality, Kafka - Use Cases, Data Contracts and Outbox Tables, The Collector.

Aurimas Griciūnas
Dec 10, 2022

👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple and easy-to-digest way, and to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.

In this episode we cover:

  • Pre Machine Learning Data Quality.

  • Kafka - Use Cases.

  • Data Contracts and Outbox Tables.

  • No Excuses Data Engineering Template: The Collector.



Data Engineering Meets MLOps


๐—ฃ๐—ฟ๐—ฒ ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ ๐—ค๐˜‚๐—ฎ๐—น๐—ถ๐˜๐˜†.

ย We have already seen that the Data Lifecycle before it even reaches the Machine Learning Stage is highly complex. How can we ensure the quality of Data that is fed into ML Models using Data Contract enforcement? Letโ€™s take a closer look:
ย 
๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€.
ย 
Data Contract is an agreement between Data Producers and Data Consumers on what the Data being produced should look like, what SLAs it should meet and the semantics of it.
ย 
A Data Contract should hold the following non-exhaustive list of metadata (a minimal sketch of such a contract follows the list):

👉 Schema Definition.
👉 Schema Version.
👉 SLA metadata.
👉 Semantics.
👉 Lineage.
👉 …
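To make the list concrete, here is a minimal sketch of how one entry in a Contract Registry could look if kept as a plain Python dictionary. The field names and values are illustrative assumptions, not a standard format.

```python
# A minimal, illustrative Data Contract entry as it might be stored
# in a central Contract Registry. All field names are hypothetical.
user_event_contract = {
    "schema_version": "1.2.0",
    "schema": {                      # Schema Definition
        "user_id": "string",
        "event_type": "string",
        "event_timestamp": "timestamp",
        "properties": "map<string, string>",
    },
    "sla": {                         # SLA metadata
        "max_latency_minutes": 15,   # data must land downstream within 15 minutes
        "max_null_fraction": 0.01,   # at most 1% of user_id values may be null
    },
    "semantics": {
        "event_timestamp": "UTC time at which the event occurred on the client",
    },
    "lineage": {
        "producer": "checkout-service",
        "source_topic": "raw.user-events",
    },
}
```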
๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ฃ๐˜‚๐—ฟ๐—ฝ๐—ผ๐˜€๐—ฒ๐˜€ ๐—ผ๐—ณ ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€:
ย 
โžก๏ธ Ensure Quality of Data in the Downstream Systems.
โžก๏ธ Prevent Data Processing Pipelines from unexpected outages.
โžก๏ธ Enforce Ownership of produced data closer to where it was generated.
โžก๏ธ Improve scalability of your Data Systems.
โžก๏ธ โ€ฆ
ย 
๐—˜๐˜…๐—ฎ๐—บ๐—ฝ๐—น๐—ฒ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ ๐—˜๐—ป๐—ณ๐—ผ๐—ฟ๐—ฐ๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€:
ย 
๐Ÿญ: Schema changes are implemented using version control, once approved - they are pushed to the Applications generating the Data, Databases holding the Data and a central Data Contract Registry.
ย 
Applications push generated Data to Kafka Topics.
ย 
๐Ÿฎ: Events emitted directly by the Application Services.
ย 
๐Ÿ‘‰ This also includes IoT Fleets and Website Activity Tracking.
ย 
๐Ÿฎ.๐Ÿญ: Raw Data Topics for CDC streams.
ย 
๐Ÿฏ: A Flink Application(s) consumes Data from Raw Data streams and validates it against schemas in the Contract Registry.
๐Ÿฐ: Data that does not meet the contract is pushed to Dead Letter Topic.
๐Ÿฑ: Data that meets the contract is pushed to Validated Data Topic.
๐Ÿฒ: Data from the Validated Data Topic is pushed to object storage for additional Validation.
๐Ÿณ: On a schedule Data in the Object Storage is validated against additional SLAs contained in Data Contract Metadata and is pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.
๐Ÿด: Modeled and Curated data is pushed to the Feature Store System for further Feature Engineering.
๐Ÿด.๐Ÿญ: Real Time Features are ingested into the Feature Store directly from Validated Data Topic (5).
ย 
๐Ÿ‘‰ Ensuring Data Quality here is complicated since checks against SLAs is hard to perform.
ย 
๐Ÿต: Data of High Quality is used in Machine Learning Training Pipelines.
๐Ÿญ๐Ÿฌ: The same Data is used for Feature Serving in Inference.


Data Engineering Fundamentals+ or What Every Data Engineer Should Know


๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ - ๐—จ๐˜€๐—ฒ ๐—–๐—ฎ๐˜€๐—ฒ๐˜€.

We have covered lots of concepts around Kafka already. But what are the most common use cases for The System that you are very likely to run into as a Data Engineer?

๐—Ÿ๐—ฒ๐˜โ€™๐˜€ ๐˜๐—ฎ๐—ธ๐—ฒ ๐—ฎ ๐—ฐ๐—น๐—ผ๐˜€๐—ฒ๐—ฟ ๐—น๐—ผ๐—ผ๐—ธ:

๐—ช๐—ฒ๐—ฏ๐˜€๐—ถ๐˜๐—ฒ ๐—”๐—ฐ๐˜๐—ถ๐˜ƒ๐—ถ๐˜๐˜† ๐—ง๐—ฟ๐—ฎ๐—ฐ๐—ธ๐—ถ๐—ป๐—ด.

โžก๏ธ The Original use case for Kafka by LinkedIn.
โžก๏ธ Events happening in the website like page views, conversions etc. are sent via a Gateway and piped to Kafka Topics.
โžก๏ธ These events are forwarded to the downstream Analytical systems or processed in Real Time.
โžก๏ธ Kafka is used as an initial buffer as the Data amounts are usually big and Kafka guarantees no message loss due to its replication mechanisms.

Database Replication.

➡️ The Database Commit Log is piped to a Kafka Topic.
➡️ The committed messages are executed against a new Database in the same order.
➡️ A Database replica is created.

๐—Ÿ๐—ผ๐—ด/๐— ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฐ๐˜€ ๐—”๐—ด๐—ด๐—ฟ๐—ฒ๐—ด๐—ฎ๐˜๐—ถ๐—ผ๐—ป.

โžก๏ธ Kafka is used for centralized Log and Metrics collection.
โžก๏ธ Daemons like FluentD are deployed in servers or containers together with the Applications to be monitored.
โžก๏ธ Applications send their Logs/Metrics to the Daemons.
โžก๏ธ The Daemons pipe Logs/Metrics to a Kafka Topic.
โžก๏ธ Logs/Metrics are delivered downstream to storages like ElasticSearch or InfluxDB for Log/Metrics discovery respectively.
โžก๏ธ This is also how you would track your IoT Fleets.

๐—ฆ๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€๐—ถ๐—ป๐—ด.

โžก๏ธ This is usually coupled with ingestion mechanisms already covered.
โžก๏ธ Instead of piping Data to a certain storage downstream we mount a Stream Processing Framework on top of Kafka Topics.
โžก๏ธ The Data is filtered, enriched and then piped to the downstream systems to be further used according to the use case.
โžก๏ธ This is also where one would be running Machine Learning Models embedded into a Stream Processing Application.

๐— ๐—ฒ๐˜€๐˜€๐—ฎ๐—ด๐—ถ๐—ป๐—ด.

โžก๏ธ Kafka can be used as a replacement for more traditional messaging brokers like RabbitMQ.
โžก๏ธ Kafka has better durability guarantees and is easier to configure for several separate Consumer Groups to consume from the same Topic.

โ—๏ธHaving said this - always consider the complexity you are bringing with introduction of a Distributed System. Sometimes it is better to just use traditional frameworks.


๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€ ๐—ฎ๐—ป๐—ฑ ๐—ข๐˜‚๐˜๐—ฏ๐—ผ๐˜… ๐—ง๐—ฎ๐—ฏ๐—น๐—ฒ๐˜€.

We already established that Data Contracts are to become one of the key topics for the upcoming years.

Today letโ€™s look into what Outbox Tables in Operational Service Level Databases are and how we can leverage them in implementing Data Contracts.

๐—ง๐—ต๐—ฒ ๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ.

โžก๏ธ Operational Product Databases are modeled around Services that are coupled to them.
โžก๏ธ Usually Data Warehouses are fed with Raw events coming through a CDC stream that mirrors Operational Data Model.
โžก๏ธ This Data Model Does not necessarily represent intended Semantic Meaning for Analytical Data Model.
โžก๏ธ Data is additionally Transformed and modeled in the Data Warehouse to make it fit Analytical purposes.
ย 
โ—๏ธAny Change in the Service Operational Data Model might and most likely will break downstream Analytical processes depending on the data.
ย 
๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป.
ย 
โžก๏ธ Introduction of Outbox Tables for Analytical Purposes.
โžก๏ธ Any updates to Operational Database are also made to the Outbox Table in the same Transaction.
ย 
โ—๏ธIt is extremely important to do this in a single transaction to retain State Consistency of the Data.
ย 
โžก๏ธ Entities in the Outbox Table Payload field are modeled separately to fit Semantic Meaning for Analytical Data Model.
โžก๏ธ CDC is performed against the Outbox Table instead of Operational Tables.
โžก๏ธ There can be multiple or a single Outbox Table that contains all of the Data intended for Analytical Purposes.
โžก๏ธ Outbox Table will also include additional metadata like Schema Version so we can validate the correctness of the payload in the downstream systems or a Kafka Topic it should be routed to.
ย 
โœ… In this way we decouple Operational Model From Analytical model and any Changes done to the Operational Model do not propagate to the Downstream Analytical Systems.
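A minimal sketch of the single-transaction rule using sqlite3 from the Python standard library; table and column names are illustrative, and in production this would be your operational database (e.g. Postgres) with CDC pointed at the outbox table.

```python
import json
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, amount REAL, status TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "aggregate_id TEXT, event_type TEXT, schema_version TEXT, payload TEXT)")

def place_order(order_id: str, amount: float) -> None:
    # The operational write and the outbox write happen in one transaction:
    # either both are committed or neither is, so the outbox never drifts
    # from the operational state.
    with conn:  # sqlite3 context manager commits on success, rolls back on error
        conn.execute("INSERT INTO orders (id, amount, status) VALUES (?, ?, ?)",
                     (order_id, amount, "PLACED"))
        conn.execute("INSERT INTO outbox (aggregate_id, event_type, schema_version, payload) "
                     "VALUES (?, ?, ?, ?)",
                     (order_id, "OrderPlaced", "1.0.0",
                      json.dumps({"order_id": order_id, "amount": amount, "status": "PLACED"})))

place_order("order-42", 99.90)
```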


No Excuses Data Engineering Project Template: The Collector


What is the function of The Collector?

Why and when does it even make sense to have a separate Collector microservice in your Data Engineering Pipeline?

Let's zoom in:
โžก๏ธ Data Collectors act as a Gateway to validate the correctness of Data entering your System. E.g. we could check for top level field existence that would allow us to validate ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ผ๐—ป๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐˜€ later on in the Downstream Processes.
โžก๏ธ Collectors add additional metadata to the events. E.g. collector timestamp, collector process id etc.
โžก๏ธ By making The Collector a separate service we decouple Data Producers from Collectors. By doing this weย  allow implementation of the Producers using different Technologies enabling wider collaboration.
โžก๏ธ Consequently, Data Producers can be deployed in different parts of the ๐—ก๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ ๐—ง๐—ผ๐—ฝ๐—ผ๐—น๐—ผ๐—ด๐˜†. E.g. They can be a JS app running in a browser or a fleet of ๐—œ๐—ข๐—ง devices emitting events to a public Load Balancer. Or it could be a Python backend server delivering ML Predictions in a private network.
โžก๏ธ Collectors are the most sensitive part of your pipeline - you want to do as little processing and data manipulation here as possible. Usually, any errors here would result in permanent ๐——๐—ฎ๐˜๐—ฎ ๐—Ÿ๐—ผ๐˜€๐˜€. Once the data is already in the System you can always retry additional computation.ย 
โžก๏ธ Hence, we decouple Collectors from any additional Schema Validation or Enrichment of the ๐—ฃ๐—ฎ๐˜†๐—น๐—ผ๐—ฎ๐—ฑ and move that down the stream.
โžก๏ธ You can scale collectors independently so they are able to take any Incoming Data Amount.
โžก๏ธ Collectors should also implement a buffer. If Kafka starts misbehaving - Data would be buffered in memory for some time until Kafka Cluster recovers.
โžก๏ธ Data Output by the Collectors is your ๐—ฅ๐—ฎ๐˜„ ๐—ฆ๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐—ผ๐—ณ ๐—ง๐—ฟ๐˜‚๐˜๐—ต ๐——๐—ฎ๐˜๐—ฎ. You will not be able to acquire this Data again - you want Raw data to be treated differently, i.e. have a longer lifecycle and most likely archive it in a low cost object storage. This Data should not be manipulated in any way, that is a task for Downstream Systems.
ย 
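A minimal, hypothetical sketch of such a Collector endpoint using Flask and kafka-python; the route, topic name and required fields are assumptions, and a production Collector would add the buffering and independent scaling described above.

```python
import json
import time
import uuid

from flask import Flask, jsonify, request     # pip install flask
from kafka import KafkaProducer                # pip install kafka-python

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

REQUIRED_TOP_LEVEL_FIELDS = {"event_type", "schema_version", "payload"}

@app.route("/collect", methods=["POST"])
def collect():
    event = request.get_json(silent=True)
    # Only a cheap top-level check here; full schema validation against the
    # Data Contract happens downstream, after the data is safely in Kafka.
    if not isinstance(event, dict) or not REQUIRED_TOP_LEVEL_FIELDS.issubset(event):
        return jsonify({"error": "missing top level fields"}), 400

    # Add collector metadata, but do not touch the payload itself.
    event["collector_id"] = str(uuid.uuid4())
    event["collected_at"] = time.time()

    producer.send("raw-events", event)  # Raw Source of Truth topic
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```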
When does it make sense to have a Collector microservice?

👉 Your Data is in event/log form.
👉 Your Data is composed of many different event types.
👉 Your Data is not saved in backend databases. If it is - use Change Data Capture instead.

❗️You will only be able to use gRPC for events delivered from your Private Network. For the Public Network you would use REST.


