SAI #02: Feature Store, Splittable vs. Non-Splittable Files and more...
Splittable vs Non-Splittable Files, CDC (Change Data Capture), Machine Learning Pipeline, Feature Store.
Data Engineering Fundamentals, or What Every Data Engineer Should Know

👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple and easy-to-digest way. The goal is to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.
**Splittable vs. Non-Splittable Files.**

You are very likely to run into a **Distributed Compute System or Framework** in your career. It could be **Spark**, **Hive**, **Presto** or any other.

Also, it is very likely that these Frameworks will be reading data from distributed storage. It could be **HDFS**, **S3** etc.

These Frameworks utilize **multiple CPU Cores for loading Data** and performing **Distributed Compute** in parallel.

How files are stored in your **Storage System is key** to utilizing distributed **Read and Compute efficiently**.

**Some definitions:**

➡️ **Splittable Files** are files that can be partially read by several processes at the same time.
➡️ In distributed file or block storages, files are stored in chunks called blocks.
➡️ Block sizes vary between different storage systems.

**Things to know:**

➡️ If your file is **Non-Splittable** and is bigger than a block in storage, it will be split between blocks but will only be read by a **Single CPU Core**, which might cause **Idle CPU** time.
➡️ If your file is **Splittable**, multiple cores can read it at the same time (one core per block).

**Some guidance:**

➡️ If possible, prefer **Splittable File** types.
➡️ If you are forced to use **Non-Splittable** files, manually partition them into sizes that fit into a single FS Block to utilize more CPU Cores.
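The manual-partitioning advice above can be sketched in a few lines of Python. This is an illustrative helper, not any framework's API (the function name, file names and the newline-delimited input are my assumptions); it splits a file into part files no larger than a target block size:

```python
import os
import tempfile

def partition_file(path, block_size, out_dir):
    """Split a newline-delimited file into part files of at most block_size
    bytes, so each part fits in a single FS block and gets its own CPU core.
    Note: a single record longer than block_size still becomes its own
    (oversized) part -- records are never broken mid-line."""
    parts, buf, size = [], [], 0
    with open(path, "rb") as f:
        for line in f:
            if size + len(line) > block_size and buf:
                parts.append(b"".join(buf))
                buf, size = [], 0
            buf.append(line)
            size += len(line)
    if buf:
        parts.append(b"".join(buf))

    out_paths = []
    for i, chunk in enumerate(parts):
        out_path = os.path.join(out_dir, f"part-{i:05d}")
        with open(out_path, "wb") as out:
            out.write(chunk)
        out_paths.append(out_path)
    return out_paths

# Demo: write 100 small JSON records, then partition into <=128-byte parts.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "data.ndjson")
with open(src, "wb") as f:
    for i in range(100):
        f.write(('{"id": %d}\n' % i).encode())
parts = partition_file(src, 128, work_dir)
```

Each resulting part fits in one block, so a distributed reader can assign one core per part even when the file format itself is not splittable.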
**Splittable file formats:**

👉 Avro.
👉 CSV.
👉 ORC.
👉 ndJSON.
👉 Parquet.

**Non-Splittable file formats:**

👉 Protocol Buffers.
👉 JSON.
👉 XML.

**[IMPORTANT]** Compression might break splittability - more on that next time.
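To see why line-delimited formats are splittable at all, here is a toy version of the record-reader logic such frameworks use, assuming uncompressed newline-delimited data (real readers, e.g. Hadoop's LineRecordReader, follow the same idea): every reader is handed a byte range, skips the partial record at the start of its range, and finishes the record that straddles its end.

```python
import os
import tempfile

def read_split(path, start, end):
    """Read the records of a newline-delimited file that belong to the
    byte range [start, end). Rule: a reader that does not start at byte 0
    skips the partial first record (the previous split owns it) and keeps
    reading past `end` until the straddling record is finished."""
    records = []
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            f.readline()              # partial record -> owned by previous split
        while f.tell() <= end:
            line = f.readline()
            if not line:              # end of file
                break
            records.append(line.rstrip(b"\n"))
    return records

# Demo: three records, read as two independent byte-range splits --
# each split could run on its own CPU core.
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "records.ndjson")
with open(path, "wb") as f:
    f.write(b"aaa\nbbbb\ncc\n")       # records start at offsets 0, 4, 9

first_half = read_split(path, 0, 6)   # bytes [0, 6)
second_half = read_split(path, 6, 12) # bytes [6, 12)
```

Together the two splits return every record exactly once, with no coordination between readers - that is what makes the format splittable. A gzip-compressed file breaks this, because a reader cannot start decompressing from an arbitrary byte offset.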
**CDC (Change Data Capture).**

**Change Data Capture** is a software process used to replicate actions performed against **Operational Databases** for use in downstream applications.

**There are several use cases for CDC. Two of the main ones:**

➡️ **Database Replication** (refer to 3️⃣ in the Diagram).

👉 **CDC** can be used for moving transactions performed against a **Source Database** to a **Target Database**. If each transaction is replicated, it is possible to retain all ACID guarantees when performing replication.
👉 **Real-Time CDC** is extremely valuable here as it enables **Zero Downtime Source Database Replication and Migration**. E.g. it is extensively used when migrating **on-prem Databases** serving **Critical Applications** that cannot be shut down for a moment to the cloud.
➡️ Facilitation of **Data Movement from Operational Databases to Data Lakes** (refer to 1️⃣ in the Diagram) **or Data Warehouses** (refer to 2️⃣ in the Diagram) **for Analytics purposes**.

👉 There are currently two Data movement patterns widely applied in the industry: **ETL and ELT**.
👉 In the case of **ETL**, data extracted by CDC can be transformed on the fly and eventually pushed to the Data Lake or Data Warehouse.
👉 In the case of **ELT**, Data is replicated to the Data Lake or Data Warehouse as is, and Transformations are performed inside of the target System.

**There is more than one way to implement CDC. The methods mainly fall into three groups:**

➡️ **Pull-Based CDC**

👉 A client queries the Source Database and pushes data into the Target Database.

❗️ Downside 1: There is a need to augment all of the source tables to include indicators that a record has changed.
❗️ Downside 2: Usually not real-time CDC - it might be performed hourly, daily etc.
❗️ Downside 3: The Source Database suffers high load while CDC is being performed.
❗️ Downside 4: It is extremely challenging to replicate Delete events.
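A minimal Pull-Based CDC sketch using SQLite and a timestamp watermark (all table and column names here are illustrative). Note how it exhibits Downsides 1 and 4: the source table must carry an `updated_at` change indicator, and the delete never reaches the target:

```python
import sqlite3

# Source table augmented with an `updated_at` change indicator (Downside 1).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
src.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(1, "alice", 100), (2, "bob", 110)])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")

def pull_cdc(src, tgt, watermark):
    """Copy every row changed since `watermark`; return the new watermark."""
    rows = src.execute(
        "SELECT id, name, updated_at FROM users WHERE updated_at > ?",
        (watermark,)).fetchall()
    for row in rows:
        tgt.execute("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", row)
    tgt.commit()
    return max((r[2] for r in rows), default=watermark)

watermark = pull_cdc(src, tgt, 0)                  # initial sync
src.execute("UPDATE users SET name = 'bobby', updated_at = 120 WHERE id = 2")
src.execute("DELETE FROM users WHERE id = 1")      # Downside 4: the target never sees this
watermark = pull_cdc(src, tgt, watermark)          # only picks up the update
```

After the second pull, the target has the updated `bobby` row, but the deleted `alice` row is still there - a deleted record simply stops matching the `updated_at > watermark` query.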
➡️ **Push-Based CDC**

👉 Triggers are set up in the Source Database. Whenever a change event happens in the Database, it is pushed to a target system.

❗️ Downside 1: This approach usually causes the highest database load overhead.
✅ Upside 1: Real-Time CDC.
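A Push-Based sketch, again with SQLite: triggers capture every write the moment it happens. Here a `changelog` table stands in for the push to the target system (in practice this is often an outbox that gets shipped downstream); the schema is illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changelog (op TEXT, id INTEGER, name TEXT);

-- Triggers fire on every write: the change is captured the moment it
-- happens (Real-Time CDC), but every statement now also pays for the
-- extra changelog insert (the load-overhead downside).
CREATE TRIGGER users_ins AFTER INSERT ON users
BEGIN INSERT INTO changelog VALUES ('INSERT', NEW.id, NEW.name); END;
CREATE TRIGGER users_upd AFTER UPDATE ON users
BEGIN INSERT INTO changelog VALUES ('UPDATE', NEW.id, NEW.name); END;
CREATE TRIGGER users_del AFTER DELETE ON users
BEGIN INSERT INTO changelog VALUES ('DELETE', OLD.id, OLD.name); END;
""")

db.execute("INSERT INTO users VALUES (1, 'alice')")
db.execute("UPDATE users SET name = 'alicia' WHERE id = 1")
db.execute("DELETE FROM users WHERE id = 1")

events = db.execute("SELECT op, id, name FROM changelog").fetchall()
```

Unlike the pull-based approach, the delete is captured too, and no watermark bookkeeping is needed.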

➡️ **Log-Based CDC**

👉 Transactional Databases log all events performed against the Database in the transaction log for recovery purposes.
👉 A Transaction Miner is mounted on top of the logs and pushes selected events into a Downstream System. A popular implementation is Debezium.

❗️ Downside 1: More complicated to set up.
❗️ Downside 2: Not all Databases will have open source connectors.
✅ Upside 1: Least load on the Database.
✅ Upside 2: Real-Time CDC.
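Debezium tails real transaction logs (e.g. the MySQL binlog or the Postgres WAL). As a stand-in, here is a toy miner over hand-written log entries whose shape only loosely resembles change events - the field names are illustrative. Note that deletes are visible in the log, unlike in pull-based CDC:

```python
# Simulated transaction-log entries, in commit order.
log = [
    {"op": "c", "table": "users", "key": 1, "row": {"name": "alice"}},
    {"op": "c", "table": "users", "key": 2, "row": {"name": "bob"}},
    {"op": "u", "table": "users", "key": 2, "row": {"name": "bobby"}},
    {"op": "d", "table": "users", "key": 1, "row": None},
    {"op": "c", "table": "orders", "key": 9, "row": {"total": 42}},
]

def mine(log, tables):
    """Replay selected events from the transaction log into a downstream replica."""
    replica = {t: {} for t in tables}
    for event in log:
        if event["table"] not in replica:
            continue                      # the miner only pushes *selected* events
        rows = replica[event["table"]]
        if event["op"] in ("c", "u"):     # create / update
            rows[event["key"]] = event["row"]
        elif event["op"] == "d":          # deletes ARE visible in the log
            rows.pop(event["key"], None)
    return replica

replica = mine(log, tables={"users"})
```

The database itself does no extra work here - the miner reads a log the database was writing anyway, which is why this approach puts the least load on the source.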
MLOps Fundamentals or What Every Machine Learning Engineer Should Know

**Machine Learning Pipeline.**

**Machine Learning Pipelines** are extremely important because they bring the automation aspect into the day-to-day work of **Data Scientists** and **Machine Learning Engineers**. Pipelines should be reusable and robust. Working in Pipelines allows you to scale the number of Machine Learning models that can be maintained in production concurrently.

A usual Pipeline consists of the following steps:

1️⃣ **Feature Retrieval**

👉 Ideally, we retrieve Features from a Feature Store here; if not, it could be a different kind of storage.
👉 Features should not require transformations at this stage; if they do, there should be an additional step of Feature Preparation.
👉 We do Train/Validation/Test splits here.
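The split in this step can be as simple as the following deterministic sketch (the fractions and seed are illustrative):

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Deterministically shuffle and split the retrieved feature rows.
    Fixing the seed keeps the split reproducible across pipeline runs."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```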

2️⃣ **Feature Validation**

👉 You would hope to have **Curated Data** at this point, but errors can slip through. There could be a missing timeframe, or **Data** could be coming in late.
👉 You should perform **Profile Comparison** against the data used during the last pipeline run. Any significant change in **Feature Distribution** could signal **Feature Drift**.

❗️ You should act on the Validation results - if any predefined thresholds are breached, either alert the responsible person or short-circuit the Pipeline.

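The Profile Comparison in step 2️⃣ can be sketched as follows - a deliberately tiny profile (count, mean, standard deviation) and a threshold check. Real systems track many more statistics, and the tolerance here is an arbitrary illustration:

```python
import statistics

def profile(values):
    """A tiny feature profile: row count, mean and standard deviation."""
    return {"count": len(values),
            "mean": statistics.fmean(values),
            "stdev": statistics.pstdev(values)}

def check_drift(previous, current, tolerance=0.2):
    """Compare current profiles with the ones from the last pipeline run;
    flag features whose mean moved by more than `tolerance` stdevs."""
    alerts = []
    for name in previous:
        shift = abs(current[name]["mean"] - previous[name]["mean"])
        if shift > tolerance * max(previous[name]["stdev"], 1e-9):
            alerts.append(name)
    return alerts  # non-empty -> alert the owner or short-circuit the pipeline

last_run = {"age": profile([30, 35, 40, 45, 50])}
this_run = {"age": profile([55, 60, 65, 70, 75])}
alerts = check_drift(last_run, this_run)
```

Here the `age` feature's mean jumped by several standard deviations, so it ends up in `alerts` and the pipeline would stop before training on drifted data.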
3️⃣ **Model Training**

👉 We train our ML Model using the **Training Dataset** from the first step, while validating on the **Validation Dataset** for adequate results.
👉 If **Cross-Validation** is used, we do not split out a separate Validation Set in the first step.

4️⃣ **Model Validation**

👉 We calculate **Model Performance Metrics** against the **Test Dataset** that we split off in the first step.

❗️ You should act on the Validation results - if predefined thresholds are breached, either alert the responsible person or short-circuit the Pipeline.

5️⃣ **Model Serving**

👉 You serve the Model for Deployment by placing it into a Model Registry.

**Some general requirements:**

➡️ You should be able to trigger the Pipeline for retraining purposes. It could be done by the Orchestrator on a schedule, from the Experimentation Environment, or by an Alerting System if any faults in Production are detected.
➡️ Pipeline steps should be glued together by an Experiment Tracking System for Pipeline Run Reproducibility purposes.
**Feature Store.**

**Feature Store Systems** are an extremely important concept as they sit between **Data Engineering and Machine Learning Pipelines**.

**The Feature Store System solves the following issues:**

➡️ It eliminates Training/Serving skew by syncing the Batch and Online Serving Storages (5️⃣).
➡️ It enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog, and then serve Feature Sets for training and inference purposes through a unified interface (4️⃣, 3️⃣).

**The ideal Feature Store System should have these properties:**

1️⃣ **It should be mounted on top of the Curated Data Layer**

👉 The Data that is being pushed into the Feature Store System should be of High Quality and meet SLAs; trying to Curate Data inside of the Feature Store System is a recipe for disaster.
👉 Curated Data could be coming in Real Time or Batch. Not all companies need Real Time Data, at least when they are only starting with Machine Learning.

2️⃣ **Feature Store Systems should have a Feature Transformation Layer with its own compute**

👉 In the modern Data Stack this part could be provided by the vendor, or you might need to implement it yourself.
👉 The industry is moving towards a state where it becomes normal for vendors to include the Feature Transformation part in their offering.

3️⃣ **Real-Time Feature Serving API** - this is where you will retrieve your Features for low-latency inference. The System should provide two types of Real-Time serving APIs:

👉 Get - you fetch a single Feature Vector.
👉 Batch Get - you fetch multiple Feature Vectors at the same time with Low Latency.
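A toy version of such a serving layer, with an in-memory dict standing in for a low-latency store like Redis (the class and method names are illustrative, not any particular Feature Store's API):

```python
class OnlineStore:
    """Toy Real-Time serving layer; a production system would back
    this with a low-latency store such as Redis."""

    def __init__(self):
        self._vectors = {}

    def put(self, entity_id, feature_vector):
        self._vectors[entity_id] = feature_vector

    def get(self, entity_id):
        """Get: fetch a single Feature Vector (None if unknown)."""
        return self._vectors.get(entity_id)

    def batch_get(self, entity_ids):
        """Batch Get: fetch multiple Feature Vectors in one round trip."""
        return [self._vectors.get(e) for e in entity_ids]

store = OnlineStore()
store.put("user_1", {"clicks_7d": 12, "avg_basket": 33.5})
store.put("user_2", {"clicks_7d": 3, "avg_basket": 12.0})
```

The point of Batch Get is amortizing the network round trip: one call returns all vectors for a batch of entities instead of paying the latency per entity.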
4️⃣ **Batch Feature Serving API** - this is where you will fetch your Features for Batch inference and Model Training. The API should provide:

👉 Point-in-time Feature Retrieval - you need to be able to time travel. A Feature View fetched for a certain timestamp should always return its state as of that point in time.
👉 Point-in-time Joins - you should be able to easily combine several feature sets as of a specific point in time.
5️⃣ **Feature Sync** - whether the Data was ingested in Real Time or Batch, the Data being Served should always be in sync. The implementation of this part can vary; one example:

👉 Data is ingested in Real Time -> Feature Transformation is applied -> Data is pushed to a Low-Latency Read-capable Storage like Redis -> Data is Change Data Captured into Cold Storage like S3.
👉 Data is ingested in Batch -> Feature Transformation is applied -> Data is pushed to Cold Storage like S3 -> Data is made available for Real-Time Serving by syncing it with a Low-Latency Read-capable Storage like Redis.