SwirlAI Newsletter

SAI #01: Column Based vs. Row Based Storage, Kafka - Writing Data

Aurimas Griciūnas
Oct 16, 2022

Data Engineering Fundamentals + or What Every Data Engineer Should Know

👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data related concepts in a simple, easy to digest way and to help You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science.


๐—ฅ๐—ผ๐˜„ ๐—•๐—ฎ๐˜€๐—ฒ๐—ฑ ๐˜ƒ๐˜€ ๐—–๐—ผ๐—น๐˜‚๐—บ๐—ป ๐—•๐—ฎ๐˜€๐—ฒ๐—ฑ ๐—™๐—ถ๐—น๐—ฒ ๐—™๐—ผ๐—ฟ๐—บ๐—ฎ๐˜

๐—ฅ๐—ผ๐˜„ ๐—•๐—ฎ๐˜€๐—ฒ๐—ฑ:
ย 
โžก๏ธ Rows on disk are stored in sequence.
โžก๏ธ New rows are written efficiently since you can write the entire row at once.
โžก๏ธ For select statements that target a subset of columns, reading is slower since you need to scan all sets of rows to retrieve one of the columns.ย 
โžก๏ธ Compression is not efficient if columns have different data types since different data types are scattered all around the files.
ย 
๐Ÿ‘‰ Example File Formats: ๐—”๐˜ƒ๐—ฟ๐—ผ
ย 
โœ… Use for ๐—ข๐—Ÿ๐—ง๐—ฃ purposes.
ย 
๐—–๐—ผ๐—น๐˜‚๐—บ๐—ป ๐—•๐—ฎ๐˜€๐—ฒ๐—ฑ:
ย 
โžก๏ธ Columns on disk are stored in sequence.
โžก๏ธ New rows are written slowly since you need to write fields of a row into different parts of the file.
โžก๏ธ For select statements that target a subset of columns, reads are faster than row based storage since you donโ€™t need to scan the entire file.
โžก๏ธ Compression is efficient since different data types are always grouped together.
ย 
๐Ÿ‘‰ Example File Formats: ๐—ฃ๐—ฎ๐—ฟ๐—พ๐˜‚๐—ฒ๐˜, ๐—ข๐—ฅ๐—–
ย 
โœ… Use for ๐—ข๐—Ÿ๐—”๐—ฃ purposes.
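
To make the difference concrete, here is a minimal sketch in plain Python. It is a toy model, not how Avro, Parquet or ORC actually lay out bytes: the sample table, the flat-list "layouts" and the helper names (`read_column_from_rows`, `read_column_from_columns`, `run_length_encode`) are all assumptions made for illustration. It shows how many values have to be touched to read one column in each layout, and why grouping a column together helps a simple compression scheme such as run-length encoding.

```python
# Toy illustration of row based vs column based layouts (not a real file format).
rows = [
    {"user_id": 1, "country": "LT", "amount": 10.0},
    {"user_id": 2, "country": "LT", "amount": 12.5},
    {"user_id": 3, "country": "LT", "amount": 7.0},
    {"user_id": 4, "country": "US", "amount": 3.0},
]
columns = ["user_id", "country", "amount"]

# Row based layout: all fields of row 1, then all fields of row 2, and so on.
row_layout = [row[col] for row in rows for col in columns]

# Column based layout: all user_ids, then all countries, then all amounts.
column_layout = [row[col] for col in columns for row in rows]


def read_column_from_rows(layout, n_columns, col_index):
    """Return (values, values_scanned); every stored value is visited."""
    values, scanned = [], 0
    for position, value in enumerate(layout):
        scanned += 1
        if position % n_columns == col_index:
            values.append(value)
    return values, scanned


def read_column_from_columns(layout, n_rows, col_index):
    """Return (values, values_scanned); only one contiguous slice is visited."""
    start = col_index * n_rows
    values = layout[start:start + n_rows]
    return values, len(values)


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


amount_idx = columns.index("amount")
from_rows, scanned_rows = read_column_from_rows(row_layout, len(columns), amount_idx)
from_cols, scanned_cols = read_column_from_columns(column_layout, len(rows), amount_idx)

# In the column layout the "country" values sit next to each other,
# so repeated values form runs that compress well.
country_idx = columns.index("country")
countries, _ = read_column_from_columns(column_layout, len(rows), country_idx)
encoded_countries = run_length_encode(countries)

print(scanned_rows, scanned_cols, encoded_countries)
```

Both reads return the same `amount` values, but the row based read has to walk over every stored value while the column based read only touches one slice. Real formats add blocks, row groups, metadata and proper encodings on top of this idea.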


๐—ž๐—ฎ๐—ณ๐—ธ๐—ฎ - ๐—ช๐—ฟ๐—ถ๐˜๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ

Kafka is an extremely important ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ฒ๐—ฑ ๐— ๐—ฒ๐˜€๐˜€๐—ฎ๐—ด๐—ถ๐—ป๐—ด ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ to understand as it was the first of its kind and most of the new products are built on the ideas of Kafka.

๐—ฆ๐—ผ๐—บ๐—ฒ ๐—ด๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐—น ๐—ฑ๐—ฒ๐—ณ๐—ถ๐—ป๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€:

โžก๏ธ Clients writing to Kafka are called ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ฒ๐—ฟ๐˜€,ย 
โžก๏ธ Clients reading the Data are called ๐—–๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ๐˜€.
โžก๏ธ Data is written into ๐—ง๐—ผ๐—ฝ๐—ถ๐—ฐ๐˜€ that can be compared to ๐—ง๐—ฎ๐—ฏ๐—น๐—ฒ๐˜€ ๐—ถ๐—ป ๐——๐—ฎ๐˜๐—ฎ๐—ฏ๐—ฎ๐˜€๐—ฒ๐˜€.
โžก๏ธ Messages sent to Topics are called ๐—ฅ๐—ฒ๐—ฐ๐—ผ๐—ฟ๐—ฑ๐˜€.
โžก๏ธ Topics are composed of ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€.
โžก๏ธ Each Partition behaves like and is a set of ๐—ช๐—ฟ๐—ถ๐˜๐—ฒ ๐—”๐—ต๐—ฒ๐—ฎ๐—ฑ ๐—Ÿ๐—ผ๐—ด๐˜€.

๐—ช๐—ฟ๐—ถ๐˜๐—ถ๐—ป๐—ด ๐——๐—ฎ๐˜๐—ฎ:

โžก๏ธ There are two types of records that can be sent to a Topic - ๐—–๐—ผ๐—ป๐˜๐—ฎ๐—ถ๐—ป๐—ถ๐—ป๐—ด ๐—ฎ ๐—ž๐—ฒ๐˜† ๐—ฎ๐—ป๐—ฑ ๐—ช๐—ถ๐˜๐—ต๐—ผ๐˜‚๐˜ ๐—ฎ ๐—ž๐—ฒ๐˜†.
โžก๏ธ If there is no key, then records are written into Partitions in a ๐—ฅ๐—ผ๐˜‚๐—ป๐—ฑ ๐—ฅ๐—ผ๐—ฏ๐—ถ๐—ป ๐—ณ๐—ฎ๐˜€๐—ต๐—ถ๐—ผ๐—ป.
โžก๏ธ If there is a key, then records with the same keys will always be written to the ๐—ฆ๐—ฎ๐—บ๐—ฒ ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป.
โžก๏ธ Data is always written to the ๐—˜๐—ป๐—ฑ ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป.
โžก๏ธ When written, a record gets an ๐—ข๐—ณ๐—ณ๐˜€๐—ฒ๐˜ assigned to it which denotes its ๐—ข๐—ฟ๐—ฑ๐—ฒ๐—ฟ/๐—ฃ๐—น๐—ฎ๐—ฐ๐—ฒ ๐—ถ๐—ป ๐˜๐—ต๐—ฒ ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป.
โžก๏ธ Partitions have separate sets of Offsets starting from 1.
โžก๏ธ Offsets are incremented sequentially when new records are written.


Data Engineer’s Learning Path

I believe that the following is a sensible order in which to approach Your Data Engineering Path:

➡️ Understand Basic Processes:

👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling

➡️ Learn most widely used tooling by creating a personal project that leverages the technology:

👉 Python
👉 SQL
👉 Airflow - yes Airflow. Many say that focusing only on Airflow as a scheduler is a narrow minded approach. Well, you will find Airflow in 99% of job ads, so start with it and forget what people say.
👉 Spark
👉 dbt

➡️ Learn Fundamentals Deeply:

👉 Data Modeling
👉 Distributed Compute
👉 Stakeholder Management
👉 System Design
👉 …

➡️ Continue Learning/Specializing:

👉 Stream Processing
👉 Feature Stores
👉 Data Governance
👉 DataOps
👉 Different tooling to implement the same Basic Processes
👉 ...
