SwirlAI Newsletter

SAI #02: Feature Store, Splittable vs. Non-Splittable Files and more...


Splittable vs Non-Splittable Files, CDC (Change Data Capture), Machine Learning Pipeline, Feature Store.

Aurimas Griciūnas
Oct 22, 2022

Data Engineering Fundamentals or What Every Data Engineer Should Know

👋 This is Aurimas. I write the weekly SAI Newsletter, where I present complicated Data related concepts in a simple and easy to digest way. The goal is to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.



๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ ๐˜ƒ๐˜€. ๐—ก๐—ผ๐—ป-๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ ๐—™๐—ถ๐—น๐—ฒ๐˜€.


You are very likely to run into a Distributed Compute System or Framework in your career. It could be Spark, Hive, Presto or any other.

It is also very likely that these Frameworks will be reading data from distributed storage - HDFS, S3 etc.

These Frameworks utilize multiple CPU Cores for Loading Data and performing Distributed Compute in parallel.

How files are stored in your Storage System is key to utilizing distributed Reads and Compute efficiently.

๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ฑ๐—ฒ๐—ณ๐—ถ๐—ป๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€:

➡️ Splittable Files are Files that can be partially read by several processes at the same time.
➡️ In distributed file or block storages, files are stored in chunks called blocks.
➡️ Block sizes will vary between different storage systems.

๐—ง๐—ต๐—ถ๐—ป๐—ด๐˜€ ๐˜๐—ผ ๐—ธ๐—ป๐—ผ๐˜„:

โžก๏ธ If your file is ๐—ก๐—ผ๐—ป-๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ and is bigger than a block in storage - it will be split between blocks but will only be read by a ๐—ฆ๐—ถ๐—ป๐—ด๐—น๐—ฒ ๐—–๐—ฃ๐—จ ๐—–๐—ผ๐—ฟ๐—ฒ which might cause ๐—œ๐—ฑ๐—น๐—ฒ ๐—–๐—ฃ๐—จ time.
โžก๏ธ If your file is ๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ - multiple cores can read it at the same time (one core per block).

๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ด๐˜‚๐—ถ๐—ฑ๐—ฎ๐—ป๐—ฐ๐—ฒ:

โžก๏ธ If possible - prefer ๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ ๐—™๐—ถ๐—น๐—ฒ types.
โžก๏ธ If you are forced to use ๐—ก๐—ผ๐—ป-๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ files - manually partition them into sizes that would fit into a single FS Block to utilize more CPU Cores.

๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ณ๐—ถ๐—น๐—ฒ ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐˜€:

๐Ÿ‘‰ ๐—”๐˜ƒ๐—ฟ๐—ผ.
๐Ÿ‘‰ ๐—–๐—ฆ๐—ฉ.
๐Ÿ‘‰ ๐—ข๐—ฅ๐—–.
๐Ÿ‘‰ ๐—ป๐—ฑ๐—๐—ฆ๐—ข๐—ก.
๐Ÿ‘‰ ๐—ฃ๐—ฎ๐—ฟ๐—พ๐˜‚๐—ฒ๐˜.
ย 
๐—ก๐—ผ๐—ป-๐—ฆ๐—ฝ๐—น๐—ถ๐˜๐˜๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ณ๐—ถ๐—น๐—ฒ ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐˜€:
ย 
๐Ÿ‘‰ ๐—ฃ๐—ฟ๐—ผ๐˜๐—ผ๐—ฐ๐—ผ๐—น ๐—•๐˜‚๐—ณ๐—ณ๐—ฒ๐—ฟ๐˜€.
๐Ÿ‘‰ ๐—๐—ฆ๐—ข๐—ก.
๐Ÿ‘‰ ๐—ซ๐— ๐—Ÿ.
ย 
[๐—œ๐— ๐—ฃ๐—ข๐—ฅ๐—ง๐—”๐—ก๐—ง] Compression might break splitability, more on it next time.
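
To make the difference concrete, here is a minimal PySpark sketch. The paths are hypothetical and the exact partition counts depend on your block size and Spark settings - treat it as an illustration, not a recipe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittable-vs-non-splittable").getOrCreate()

# Parquet is splittable: Spark can assign roughly one task per storage block,
# so many CPU Cores read the dataset at the same time.
parquet_df = spark.read.parquet("s3://my-bucket/events/parquet/")
print(parquet_df.rdd.getNumPartitions())   # ~ total size / block size

# A single large multi-line JSON file is non-splittable: it is read as one task
# by a single CPU Core while the other cores sit idle.
json_df = spark.read.option("multiLine", True).json("s3://my-bucket/events/big_file.json")
print(json_df.rdd.getNumPartitions())      # 1

# Workaround: rewrite the non-splittable input as many block-sized files in a
# splittable format so downstream jobs can read it in parallel.
json_df.repartition(64).write.mode("overwrite").parquet("s3://my-bucket/events/repartitioned/")
```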


๐—–๐——๐—– (๐—–๐—ต๐—ฎ๐—ป๐—ด๐—ฒ ๐——๐—ฎ๐˜๐—ฎ ๐—–๐—ฎ๐—ฝ๐˜๐˜‚๐—ฟ๐—ฒ).


Change Data Capture is a software process used to replicate actions performed against Operational Databases for use in downstream applications.

๐—ง๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ฎ๐—ฟ๐—ฒ ๐˜€๐—ฒ๐˜ƒ๐—ฒ๐—ฟ๐—ฎ๐—น ๐˜‚๐˜€๐—ฒ ๐—ฐ๐—ฎ๐˜€๐—ฒ๐˜€ ๐—ณ๐—ผ๐—ฟ ๐—–๐——๐—–. ๐—ง๐˜„๐—ผ ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—บ๐—ฎ๐—ถ๐—ป ๐—ผ๐—ป๐—ฒ๐˜€:

➡️ Database Replication (refer to 3️⃣ in the Diagram).

👉 CDC can be used for moving transactions performed against a Source Database to a Target Database. If each transaction is replicated - it is possible to retain all ACID guarantees when performing the replication.
👉 Real time CDC is extremely valuable here as it enables Zero Downtime Source Database Replication and Migration. E.g., it is extensively used when migrating on-prem Databases serving Critical Applications that cannot be shut down even for a moment to the cloud.

โžก๏ธ Facilitation of ๐——๐—ฎ๐˜๐—ฎ ๐— ๐—ผ๐˜ƒ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ณ๐—ฟ๐—ผ๐—บ ๐—ข๐—ฝ๐—ฒ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น ๐——๐—ฎ๐˜๐—ฎ๐—ฏ๐—ฎ๐˜€๐—ฒ๐˜€ ๐˜๐—ผ ๐——๐—ฎ๐˜๐—ฎ ๐—Ÿ๐—ฎ๐—ธ๐—ฒ๐˜€ (refer to 1๏ธโƒฃ in the Diagram) ๐—ผ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ ๐—ช๐—ฎ๐—ฟ๐—ฒ๐—ต๐—ผ๐˜‚๐˜€๐—ฒ๐˜€ (refer to 2๏ธโƒฃ in the Diagram) ๐—ณ๐—ผ๐—ฟ ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ ๐—ฝ๐˜‚๐—ฟ๐—ฝ๐—ผ๐˜€๐—ฒ๐˜€.

👉 There are currently two Data movement patterns widely applied in the industry: ETL and ELT.
👉 In the case of ETL - data extracted by CDC can be transformed on the fly and eventually pushed to the Data Lake or Data Warehouse.
👉 In the case of ELT - Data is replicated to the Data Lake or Data Warehouse as is, and Transformations are performed inside of the target System.

There is more than one way to implement CDC; the methods are mainly split into three groups:

➡️ Pull Based CDC

👉 A client queries the Source Database and pushes the data into the Target Database (a minimal sketch follows this list).

❗️ Downside 1: There is a need to augment all of the source tables to include indicators that a record has changed.
❗️ Downside 2: Usually not real time CDC - it might be performed hourly, daily etc.
❗️ Downside 3: The Source Database suffers high load while CDC is being performed.
❗️ Downside 4: It is extremely challenging to replicate Delete events.

➡️ Push Based CDC

👉 Triggers are set up in the Source Database. Whenever a change event happens in the Database - it is pushed to the target system (a trigger-based sketch follows below).

❗️ Downside 1: This approach usually causes the highest database load overhead.
✅ Upside 1: Real Time CDC.
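
One way to realize Push Based CDC is with triggers that write every change into an outbox table, from which a forwarder pushes events downstream. A minimal sketch using SQLite via Python (table names and the outbox layout are hypothetical):

```python
import sqlite3

conn = sqlite3.connect("source.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE IF NOT EXISTS cdc_outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, order_id INTEGER, payload TEXT
);

-- The triggers fire on every change and push an event into the outbox.
CREATE TRIGGER IF NOT EXISTS orders_insert AFTER INSERT ON orders BEGIN
    INSERT INTO cdc_outbox (op, order_id, payload) VALUES ('insert', NEW.id, NEW.payload);
END;
CREATE TRIGGER IF NOT EXISTS orders_update AFTER UPDATE ON orders BEGIN
    INSERT INTO cdc_outbox (op, order_id, payload) VALUES ('update', NEW.id, NEW.payload);
END;
CREATE TRIGGER IF NOT EXISTS orders_delete AFTER DELETE ON orders BEGIN
    INSERT INTO cdc_outbox (op, order_id, payload) VALUES ('delete', OLD.id, OLD.payload);
END;
""")

# A separate forwarder reads the outbox and ships events to the target system in
# near real time (Upside 1) - at the cost of the triggers adding write overhead
# to every transaction (Downside 1).
for seq, op, order_id, payload in conn.execute("SELECT seq, op, order_id, payload FROM cdc_outbox"):
    print(seq, op, order_id, payload)  # replace with a push to the target system
```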

➡️ Log Based CDC

👉 Transactional Databases log all of the events performed against the Database in the transaction log for recovery purposes.
👉 A Transaction Miner is mounted on top of the logs and pushes selected events into a Downstream System. A popular implementation is Debezium (a minimal consumer sketch follows below).

❗️ Downside 1: More complicated to set up.
❗️ Downside 2: Not all Databases will have open source connectors.
✅ Upside 1: Least load on the Database.
✅ Upside 2: Real Time CDC.
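
On the consuming side, a Log Based CDC setup often lands change events in Kafka. A minimal sketch of reading Debezium-style events in Python - the topic name and bootstrap server are hypothetical, and the envelope layout follows Debezium's usual before/after/op format but depends on your connector configuration:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.inventory.orders",              # hypothetical Debezium topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:
        continue                               # tombstone record
    payload = event.get("payload", event)      # envelope differs with/without schemas
    op = payload.get("op")                     # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    before, after = payload.get("before"), payload.get("after")
    # Push the change into the Downstream System (Data Lake, Warehouse, another DB, ...).
    print(op, before, after)
```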


MLOps Fundamentals or What Every Machine Learning Engineer Should Know



Machine Learning Pipeline.


Machine Learning Pipelines are extremely important because they bring the automation aspect into the day to day work of Data Scientists and Machine Learning Engineers. Pipelines should be reusable and robust. Working in Pipelines allows you to scale the number of Machine Learning models that can be maintained in production concurrently.

A usual Pipeline consists of the following steps:

1️⃣ Feature Retrieval

👉 Ideally we retrieve Features from a Feature Store here; if not - it could be a different kind of storage.
👉 Features should not require transformations at this stage; if they do - there should be an additional step of Feature Preparation.
👉 We do the Train/Validation/Test splits here.

2️⃣ Feature Validation

👉 You would hope to have Curated Data at this point, but errors can slip through. There could be a missing timeframe or Data could be coming in late.
👉 You should perform a Profile Comparison against the data used the last time the Pipeline was run. Any significant change in Feature Distribution could signal Feature Drift.

❗️ Act on the Validation results - if any predefined thresholds are breached, either alert the responsible person or short circuit the Pipeline (a drift-check sketch follows below).

3️⃣ Model Training

👉 We train our ML Model using the Training Dataset from the first step while validating on the Validation Dataset for adequate results.
👉 If Cross-Validation is used - we do not split out a separate Validation Set in the first step.

4️⃣ Model Validation

👉 We calculate Model Performance Metrics against the Test Dataset that we split out in the first step.

❗️ Act on the Validation results - if predefined thresholds are breached, either alert the responsible person or short circuit the Pipeline (a minimal gate sketch follows below).

5️⃣ Model Serving

👉 You release the Model for Deployment by placing it into a Model Registry.

Some general requirements:

➡️ You should be able to trigger the Pipeline for retraining purposes. It could be done by the Orchestrator on a schedule, from the Experimentation Environment, or by an Alerting System if any faults in Production are detected (see the orchestration sketch below).
➡️ Pipeline steps should be glued together by an Experiment Tracking System for Pipeline Run Reproducibility purposes.


๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ.


๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€ are an extremely important concept as they sit between ๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ด ๐—ฎ๐—ป๐—ฑ ๐— ๐—ฎ๐—ฐ๐—ต๐—ถ๐—ป๐—ฒ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ๐˜€.

๐—ง๐—ต๐—ฒ ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ ๐˜€๐—ผ๐—น๐˜ƒ๐—ฒ๐˜€ ๐—ณ๐—ผ๐—น๐—น๐—ผ๐˜„๐—ถ๐—ป๐—ด ๐—ถ๐˜€๐˜€๐˜‚๐—ฒ๐˜€:

โžก๏ธ Eliminates Training/Serving skew by syncing Batch and Online Serving Storages (5๏ธโƒฃ)
โžก๏ธ Enables Feature Sharing and Discoverability through the Metadata Layer - you define the Feature Transformations once, enable discoverability through the Feature Catalog and then serve Feature Sets for training and inference purposes trough unified interface (4๏ธโƒฃ, 3๏ธโƒฃ).

๐—ง๐—ต๐—ฒ ๐—ถ๐—ฑ๐—ฒ๐—ฎ๐—น ๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—ฆ๐˜๐—ผ๐—ฟ๐—ฒ ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ ๐˜€๐—ต๐—ผ๐˜‚๐—น๐—ฑ ๐—ต๐—ฎ๐˜ƒ๐—ฒ ๐˜๐—ต๐—ฒ๐˜€๐—ฒ ๐—ฝ๐—ฟ๐—ผ๐—ฝ๐—ฒ๐—ฟ๐˜๐—ถ๐—ฒ๐˜€:

1๏ธโƒฃ ๐—œ๐˜ ๐˜€๐—ต๐—ผ๐˜‚๐—น๐—ฑ ๐—ฏ๐—ฒ ๐—บ๐—ผ๐˜‚๐—ป๐˜๐—ฒ๐—ฑ ๐—ผ๐—ป ๐˜๐—ผ๐—ฝ ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—–๐˜‚๐—ฟ๐—ฎ๐˜๐—ฒ๐—ฑ ๐——๐—ฎ๐˜๐—ฎ ๐—Ÿ๐—ฎ๐˜†๐—ฒ๐—ฟ
ย 
๐Ÿ‘‰ the Data that is being pushed into the Feature Store System should be of High Quality and meet SLAs, trying to Curate Data inside of the Feature Store System is a recipe for disaster.
๐Ÿ‘‰ Curated Data could be coming in Real Time or Batch. Not all companies need Real Time Data at least when they are only starting with Machine Learning.

2️⃣ Feature Store Systems should have a Feature Transformation Layer with its own compute.

👉 In the modern Data Stack this part could be provided by the vendor, or you might need to implement it yourself.
👉 The industry is moving towards a state where it becomes normal for vendors to include the Feature Transformation part in their offering.

3️⃣ Real Time Feature Serving API - this is where you will be retrieving your Features for low latency inference. The System should provide two types of Real Time serving APIs (a minimal sketch follows below):

👉 Get - you fetch a single Feature Vector.
👉 Batch Get - you fetch multiple Feature Vectors at the same time with Low Latency.
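
A minimal sketch of the two Real Time serving calls backed by a Low Latency store (Redis here; the "feature_view:entity_id" key layout and JSON values are assumptions):

```python
import json
import redis  # pip install redis

client = redis.Redis(host="localhost", port=6379)

def get(feature_view: str, entity_id: str) -> dict:
    """Get - fetch a single Feature Vector."""
    raw = client.get(f"{feature_view}:{entity_id}")
    return json.loads(raw) if raw else {}

def batch_get(feature_view: str, entity_ids: list) -> list:
    """Batch Get - fetch multiple Feature Vectors in a single low latency round trip."""
    raws = client.mget([f"{feature_view}:{eid}" for eid in entity_ids])
    return [json.loads(raw) if raw else {} for raw in raws]
```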

4️⃣ Batch Feature Serving API - this is where you will be fetching your Features for Batch inference and Model Training. The API should provide (a point in time join sketch follows below):

👉 Point in time Feature Retrieval - you need to be able to time travel. A Feature View fetched for a certain timestamp should always return its state as of that point in time.
👉 Point in time Joins - you should be able to easily combine several Feature Sets as of a specific point in time.
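
A minimal point in time join sketch using pandas.merge_asof. Column names like entity_id and the two timestamp columns are assumptions.

```python
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For every label row, attach the latest Feature values known at that row's timestamp."""
    return pd.merge_asof(
        labels.sort_values("event_timestamp"),
        features.sort_values("feature_timestamp"),
        left_on="event_timestamp",
        right_on="feature_timestamp",
        by="entity_id",
        direction="backward",   # only look backwards in time - this is what prevents label leakage
    )
```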

5️⃣ Feature Sync - whether the Data was ingested in Real Time or Batch, the Data being Served should always be in sync. The implementation of this part can vary; an example could be:

👉 Data is ingested in Real Time -> Feature Transformation applied -> Data is pushed to a Low Latency Read capable Storage like Redis -> Data is Change Data Captured to Cold Storage like S3.
👉 Data is ingested in Batch -> Feature Transformation applied -> Data is pushed to Cold Storage like S3 -> Data is made available for Real Time Serving by syncing it with a Low Latency Read capable Storage like Redis.


