SAI #09: Kafka - Use Cases, Pre Machine Learning Data Quality and more...
Pre Machine Learning Data Quality, Kafka - Use Cases, Data Contracts and Outbox Tables, The Collector.
👋 This is Aurimas. I write the weekly SAI Newsletter where my goal is to present complicated Data related concepts in a simple and easy to digest way. The aim is to help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science areas.
In this episode we cover:
Pre Machine Learning Data Quality.
Kafka - Use Cases.
Data Contracts and Outbox Tables.
No Excuses Data Engineering Project Template: The Collector.
Data Engineering Meets MLOps
Pre Machine Learning Data Quality.

We have already seen that the Data Lifecycle is highly complex before Data even reaches the Machine Learning Stage. How can we ensure the quality of Data that is fed into ML Models using Data Contract enforcement? Let's take a closer look:
Data Contracts.

A Data Contract is an agreement between Data Producers and Data Consumers on what the Data being produced should look like, what SLAs it should meet, and what its semantics are.
A Data Contract should hold the following non-exhaustive list of metadata:

👉 Schema Definition.
👉 Schema Version.
👉 SLA metadata.
👉 Semantics.
👉 Lineage.
👉 …
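To make the list concrete, here is a minimal sketch of how such a contract could be captured in code. The event name, fields, owner and SLA thresholds are all hypothetical and only illustrate the kinds of metadata listed above.

```python
# Hypothetical Data Contract definition - names and values are illustrative.
purchase_completed_contract = {
    "event": "orders.purchase_completed",
    "owner": "checkout-team",
    "schema_version": "1.2.0",
    "schema": {            # Schema Definition: required fields and their types
        "order_id": "string",
        "user_id": "string",
        "amount_eur": "double",
        "occurred_at": "timestamp",
    },
    "sla": {               # SLA metadata downstream consumers can rely on
        "max_latency_minutes": 15,
        "max_null_ratio": {"user_id": 0.0},
    },
    "semantics": "One event per completed purchase; amount includes VAT.",
    "lineage": {"source": "checkout-service", "sinks": ["dwh.fact_orders"]},
}
```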
Some Purposes of Data Contracts:

➡️ Ensure Quality of Data in the Downstream Systems.
➡️ Protect Data Processing Pipelines from unexpected outages.
➡️ Enforce Ownership of produced data closer to where it was generated.
➡️ Improve scalability of your Data Systems.
➡️ …
Example Architecture Enforcing Data Contracts:
1: Schema changes are implemented using version control; once approved, they are pushed to the Applications generating the Data, the Databases holding the Data and a central Data Contract Registry.

Applications push generated Data to Kafka Topics.

2: Events emitted directly by the Application Services.

👉 This also includes IoT Fleets and Website Activity Tracking.

2.1: Raw Data Topics for CDC streams.

3: A Flink Application (or several) consumes Data from the Raw Data streams and validates it against the schemas in the Contract Registry (see the sketch after these steps).
4: Data that does not meet the contract is pushed to a Dead Letter Topic.
5: Data that meets the contract is pushed to a Validated Data Topic.
6: Data from the Validated Data Topic is pushed to object storage for additional Validation.
7: On a schedule, Data in the Object Storage is validated against additional SLAs contained in the Data Contract Metadata and is pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes.
8: Modeled and Curated data is pushed to the Feature Store System for further Feature Engineering.
8.1: Real Time Features are ingested into the Feature Store directly from the Validated Data Topic (5).

👉 Ensuring Data Quality here is complicated since checks against SLAs are hard to perform.

9: Data of High Quality is used in Machine Learning Training Pipelines.
10: The same Data is used for Feature Serving in Inference.
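A minimal sketch of the validation and routing in steps 3-5, assuming JSON events. It uses a plain Python consumer in place of Flink, a hard-coded schema in place of a Contract Registry lookup, and hypothetical topic names - it illustrates the pattern, not the exact setup.

```python
import json

from confluent_kafka import Consumer, Producer
from jsonschema import ValidationError, validate

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-validator",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])

# In the real setup this would be fetched from the Data Contract Registry.
contract_schema = {
    "type": "object",
    "required": ["order_id", "user_id", "amount_eur", "occurred_at"],
}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        validate(instance=event, schema=contract_schema)     # step 3
        producer.produce("validated-events", msg.value())    # step 5
    except (json.JSONDecodeError, ValidationError):
        producer.produce("dead-letter-events", msg.value())  # step 4
    producer.poll(0)  # serve delivery callbacks
```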
Data Engineering Fundamentals or What Every Data Engineer Should Know
Kafka - Use Cases.

We have covered lots of concepts around Kafka already. But what are the most common use cases for The System that you are very likely to run into as a Data Engineer?

Let's take a closer look:
Website Activity Tracking.

➡️ The Original use case for Kafka at LinkedIn.
➡️ Events happening on the website, like page views, conversions etc., are sent via a Gateway and piped into Kafka Topics.
➡️ These events are forwarded to the downstream Analytical systems or processed in Real Time.
➡️ Kafka is used as an initial buffer since the Data volumes are usually large and Kafka guarantees no message loss thanks to its replication mechanisms.
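A minimal sketch of the producing side, assuming a hypothetical website-activity Topic and the confluent-kafka client; the event fields are illustrative.

```python
# Minimal sketch: a gateway piping website activity events into Kafka.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def track(event_type: str, user_id: str, page: str) -> None:
    event = {
        "event_type": event_type,   # e.g. page_view, conversion
        "user_id": user_id,
        "page": page,
        "ts": time.time(),
    }
    # Keying by user_id keeps one user's events ordered within a partition.
    producer.produce("website-activity", key=user_id, value=json.dumps(event))
    producer.poll(0)

track("page_view", "user-42", "/pricing")
producer.flush()
```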
Database Replication.

➡️ The Database Commit Log is piped to a Kafka Topic.
➡️ The committed changes are replayed against a new Database in the same order.
➡️ This way a Database replica is created.
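A minimal sketch of the replay side, with a hypothetical topic, table and message format. In practice you would more likely use a CDC tool such as Debezium plus a sink connector rather than hand-rolled code; this only illustrates the ordered, idempotent apply.

```python
import json

import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replica-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders-commit-log"])
replica = psycopg2.connect("dbname=replica user=replicator")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assumed message shape: {"op": "upsert"|"delete", "id": ..., "row": {...}}
    change = json.loads(msg.value())
    with replica, replica.cursor() as cur:  # one transaction per change
        if change["op"] == "upsert":
            cur.execute(
                "INSERT INTO orders (id, payload) VALUES (%s, %s) "
                "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                (change["id"], json.dumps(change["row"])),
            )
        elif change["op"] == "delete":
            cur.execute("DELETE FROM orders WHERE id = %s", (change["id"],))
```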
Log/Metrics Aggregation.

➡️ Kafka is used for centralized Log and Metrics collection.
➡️ Daemons like FluentD are deployed on servers or in containers together with the Applications to be monitored.
➡️ Applications send their Logs/Metrics to the Daemons.
➡️ The Daemons pipe Logs/Metrics to a Kafka Topic.
➡️ Logs/Metrics are delivered downstream to storage systems like Elasticsearch or InfluxDB for Log and Metric discovery respectively.
➡️ This is also how you would track your IoT Fleets.
Stream Processing.

➡️ This is usually coupled with the ingestion mechanisms already covered.
➡️ Instead of piping Data to a certain storage downstream, we mount a Stream Processing Framework on top of the Kafka Topics.
➡️ The Data is filtered, enriched and then piped to the downstream systems to be further used according to the use case.
➡️ This is also where one would run Machine Learning Models embedded into a Stream Processing Application.
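A minimal sketch of such a Stream Processing Application, with hypothetical topic names and a stubbed scoring function standing in for an embedded ML Model.

```python
# Filter, enrich and re-publish events, with a model embedded in the app.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["website-activity"])

def score(event: dict) -> float:
    """Stand-in for a real model loaded into the stream processor."""
    return 0.9 if event.get("event_type") == "conversion" else 0.1

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("event_type") not in {"page_view", "conversion"}:
        continue                                     # filter
    event["purchase_propensity"] = score(event)      # enrich with model output
    producer.produce("enriched-activity", json.dumps(event))
    producer.poll(0)
```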
Messaging.

➡️ Kafka can be used as a replacement for more traditional message brokers like RabbitMQ.
➡️ Kafka has better durability guarantees and makes it easy for several separate Consumer Groups to consume from the same Topic.

❗️ Having said this - always consider the complexity you are bringing in with the introduction of a Distributed System. Sometimes it is better to just use the traditional frameworks.
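A minimal sketch of the fan-out property mentioned above: two Consumer Groups with different group ids each independently receive every message from the same Topic. Group and topic names are illustrative.

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["order-events"])
    return consumer

billing = make_consumer("billing-service")     # gets every order event
analytics = make_consumer("analytics-loader")  # also gets every order event
```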
Data Contracts and Outbox Tables.

We have already established that Data Contracts will become one of the key topics of the upcoming years.

Today let's look into what Outbox Tables in Operational Service Level Databases are and how we can leverage them when implementing Data Contracts.
The Problem.

➡️ Operational Product Databases are modeled around the Services that are coupled to them.
➡️ Usually Data Warehouses are fed with Raw events coming through a CDC stream that mirrors the Operational Data Model.
➡️ This Data Model does not necessarily represent the intended Semantic Meaning of the Analytical Data Model.
➡️ Data is additionally Transformed and Modeled in the Data Warehouse to make it fit Analytical purposes.

❗️ Any Change in the Service Operational Data Model might - and most likely will - break downstream Analytical processes that depend on the data.
Solution.

➡️ Introduce Outbox Tables for Analytical Purposes.
➡️ Any updates to the Operational Database are also made to the Outbox Table in the same Transaction.

❗️ It is extremely important to do this in a single transaction to retain State Consistency of the Data (see the sketch after this list).

➡️ Entities in the Outbox Table Payload field are modeled separately to fit the Semantic Meaning of the Analytical Data Model.
➡️ CDC is performed against the Outbox Table instead of the Operational Tables.
➡️ There can be a single Outbox Table that contains all of the Data intended for Analytical Purposes, or multiple ones.
➡️ The Outbox Table also includes additional metadata like the Schema Version, so that we can validate the correctness of the payload in the downstream systems, or the Kafka Topic the record should be routed to.
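A minimal sketch of the single-transaction write, assuming PostgreSQL and psycopg2; the table and column names are hypothetical.

```python
import json
import uuid

import psycopg2

conn = psycopg2.connect("dbname=orders user=orders_service")

def complete_purchase(order_id: str, user_id: str, amount_eur: float) -> None:
    # The connection context manager commits both writes as one transaction
    # (or rolls both back), so operational and outbox state stay consistent.
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE orders SET status = 'completed' WHERE id = %s",
            (order_id,),
        )
        payload = {  # modeled for the Analytical Data Model, not the service
            "order_id": order_id,
            "user_id": user_id,
            "amount_eur": amount_eur,
        }
        cur.execute(
            "INSERT INTO orders_outbox (id, aggregate_id, schema_version, payload) "
            "VALUES (%s, %s, %s, %s)",
            (str(uuid.uuid4()), order_id, "1.2.0", json.dumps(payload)),
        )
```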
✅ In this way we decouple the Operational Model from the Analytical Model, and any Changes done to the Operational Model do not propagate to the Downstream Analytical Systems.
No Excuses Data Engineering Project Template: The Collector
What is the function of The Collector?

Why and when does it even make sense to have a separate Collector microservice in your Data Engineering Pipeline?

Let's zoom in:
➡️ Data Collectors act as a Gateway that validates the correctness of Data entering your System. E.g. we could check for top-level field existence, which would allow us to validate Data Contracts later on in the Downstream Processes.
➡️ Collectors add additional metadata to the events. E.g. collector timestamp, collector process id etc.
➡️ By making The Collector a separate service we decouple Data Producers from Collectors. This allows the Producers to be implemented using different Technologies, enabling wider collaboration.
➡️ Consequently, Data Producers can be deployed in different parts of the Network Topology. E.g. they can be a JS app running in a browser or a fleet of IoT devices emitting events to a public Load Balancer, or a Python backend server delivering ML Predictions in a private network.
➡️ Collectors are the most sensitive part of your pipeline - you want to do as little processing and data manipulation here as possible. Usually, any errors here result in permanent Data Loss. Once the data is already in the System you can always retry additional computation.
➡️ Hence, we decouple Collectors from any additional Schema Validation or Enrichment of the Payload and move that further down the stream.
➡️ You can scale Collectors independently so that they are able to handle any Incoming Data Volume.
➡️ Collectors should also implement a buffer. If Kafka starts misbehaving, Data is buffered in memory for some time until the Kafka Cluster recovers.
➡️ Data output by the Collectors is your Raw Source of Truth Data. You will not be able to acquire this Data again - you want to treat Raw Data differently, i.e. give it a longer lifecycle and most likely archive it in low-cost object storage. This Data should not be manipulated in any way; that is a task for the Downstream Systems.
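A minimal sketch of such a Collector, assuming an HTTP (REST) producer-facing endpoint; the framework choice (Flask), topic and field names are hypothetical.

```python
# Light top-level validation, collector metadata, and a small in-memory
# buffer in front of Kafka.
import json
import time
import uuid
from collections import deque

from confluent_kafka import Producer
from flask import Flask, jsonify, request

app = Flask(__name__)
producer = Producer({"bootstrap.servers": "localhost:9092"})
buffer = deque(maxlen=100_000)     # holds events while Kafka is unavailable
COLLECTOR_ID = str(uuid.uuid4())   # identifies this collector process

REQUIRED_TOP_LEVEL_FIELDS = {"event_type", "payload"}

@app.post("/collect")
def collect():
    event = request.get_json(silent=True) or {}
    # Only top-level checks here; full contract validation happens downstream.
    if not REQUIRED_TOP_LEVEL_FIELDS.issubset(event):
        return jsonify({"error": "missing top level fields"}), 400
    event["collector_ts"] = time.time()   # collector metadata
    event["collector_id"] = COLLECTOR_ID
    buffer.append(json.dumps(event))
    flush_buffer()
    return jsonify({"status": "accepted"}), 202

def flush_buffer() -> None:
    # Drain the in-memory buffer; stop and retry later if the local
    # producer queue is full (e.g. the Kafka Cluster is misbehaving).
    while buffer:
        try:
            producer.produce("raw-events", buffer[0])
            producer.poll(0)
            buffer.popleft()
        except BufferError:
            break
```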
When does it make sense to have a Collector microservice?

👉 Your Data is in event/log form.
👉 Your Data is composed of many different event types.
👉 Your Data is not saved in backend databases. If it is - use Change Data Capture instead.
❗️ You will only be able to use gRPC for events delivered from your Private Network. For Public Network you would use REST.