Data Engineering Fundamentals, or What Every Data Engineer Should Know
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data-related concepts in a simple and easy-to-digest way, helping you upskill in Data Engineering, MLOps, Machine Learning and Data Science.
𝗥𝗼𝘄 𝗕𝗮𝘀𝗲𝗱 𝘃𝘀 𝗖𝗼𝗹𝘂𝗺𝗻 𝗕𝗮𝘀𝗲𝗱 𝗙𝗶𝗹𝗲 𝗙𝗼𝗿𝗺𝗮𝘁
𝗥𝗼𝘄 𝗕𝗮𝘀𝗲𝗱:
➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since you can write the entire row at once.
➡️ For select statements that target a subset of columns, reads are slower since you need to scan entire rows to retrieve the values of a single column.
➡️ Compression is less efficient since values of different data types are interleaved throughout the file.
👉 Example File Formats: 𝗔𝘃𝗿𝗼
✅ Use for 𝗢𝗟𝗧𝗣 purposes.
𝗖𝗼𝗹𝘂𝗺𝗻 𝗕𝗮𝘀𝗲𝗱:
➡️ Columns on disk are stored in sequence.
➡️ New rows are written more slowly since the fields of each row need to be written to different parts of the file.
➡️ For select statements that target a subset of columns, reads are faster than in row-based storage since you don’t need to scan the entire file.
➡️ Compression is efficient since different data types are always grouped together.
👉 Example File Formats: 𝗣𝗮𝗿𝗾𝘂𝗲𝘁, 𝗢𝗥𝗖
✅ Use for 𝗢𝗟𝗔𝗣 purposes.
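The trade-off above can be sketched in a few lines of plain Python. This is a toy illustration, not a real file format: the records and field names are made up, and lists stand in for bytes on disk.

```python
# Hypothetical records; lists below stand in for contiguous bytes on disk.
records = [
    {"id": 1, "name": "Ada", "amount": 9.99},
    {"id": 2, "name": "Bob", "amount": 5.00},
    {"id": 3, "name": "Eve", "amount": 7.25},
]

# Row-based layout: each record is stored contiguously -> cheap appends (OLTP).
row_store = [(r["id"], r["name"], r["amount"]) for r in records]

# Column-based layout: each column is stored contiguously -> cheap column scans (OLAP).
column_store = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "amount": [r["amount"] for r in records],
}

# Reading a single column: the row store must touch every row...
amounts_from_rows = [row[2] for row in row_store]
# ...while the column store reads one contiguous list of same-typed values
# (which is also why it compresses well).
amounts_from_columns = column_store["amount"]
```

Note how a query like `SELECT amount FROM t` only needs `column_store["amount"]`, while the row layout forces a scan over every tuple.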
𝗞𝗮𝗳𝗸𝗮 - 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗗𝗮𝘁𝗮
Kafka is an extremely important 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗠𝗲𝘀𝘀𝗮𝗴𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺 to understand, as it was one of the first of its kind and many newer products build on its ideas.
𝗦𝗼𝗺𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻𝘀:
➡️ Clients writing to Kafka are called 𝗣𝗿𝗼𝗱𝘂𝗰𝗲𝗿𝘀,
➡️ Clients reading the Data are called 𝗖𝗼𝗻𝘀𝘂𝗺𝗲𝗿𝘀.
➡️ Data is written into 𝗧𝗼𝗽𝗶𝗰𝘀 that can be compared to 𝗧𝗮𝗯𝗹𝗲𝘀 𝗶𝗻 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲𝘀.
➡️ Messages sent to Topics are called 𝗥𝗲𝗰𝗼𝗿𝗱𝘀.
➡️ Topics are composed of 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀.
➡️ Each Partition behaves like a 𝗪𝗿𝗶𝘁𝗲 𝗔𝗵𝗲𝗮𝗱 𝗟𝗼𝗴: an append-only sequence of records.
𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗗𝗮𝘁𝗮:
➡️ There are two types of records that can be sent to a Topic - 𝗖𝗼𝗻𝘁𝗮𝗶𝗻𝗶𝗻𝗴 𝗮 𝗞𝗲𝘆 𝗮𝗻𝗱 𝗪𝗶𝘁𝗵𝗼𝘂𝘁 𝗮 𝗞𝗲𝘆.
➡️ If there is no key, then records are written into Partitions in a 𝗥𝗼𝘂𝗻𝗱 𝗥𝗼𝗯𝗶𝗻 𝗳𝗮𝘀𝗵𝗶𝗼𝗻.
➡️ If there is a key, then records with the same keys will always be written to the 𝗦𝗮𝗺𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻.
➡️ Data is always written to the 𝗘𝗻𝗱 𝗼𝗳 𝘁𝗵𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻.
➡️ When written, a record gets an 𝗢𝗳𝗳𝘀𝗲𝘁 assigned to it which denotes its 𝗢𝗿𝗱𝗲𝗿/𝗣𝗹𝗮𝗰𝗲 𝗶𝗻 𝘁𝗵𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻.
➡️ Each Partition has its own separate set of Offsets, starting from 0.
➡️ Offsets are incremented sequentially when new records are written.
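The partitioning rules above can be sketched in pure Python. This is a minimal toy model, not the real Kafka client (which, for example, hashes key bytes with murmur2 rather than Python's `hash`); the names and partition count are made up for illustration.

```python
from itertools import count

NUM_PARTITIONS = 3
# Each partition is modeled as an append-only log (a plain list).
partitions = [[] for _ in range(NUM_PARTITIONS)]
round_robin = count()

def produce(value, key=None):
    """Append a record and return its (partition, offset)."""
    if key is None:
        # No key: spread records across partitions round-robin.
        p = next(round_robin) % NUM_PARTITIONS
    else:
        # Keyed: the same key always maps to the same partition.
        p = hash(key) % NUM_PARTITIONS
    partitions[p].append(value)        # always written to the end of the partition
    return p, len(partitions[p]) - 1   # offsets start at 0 and grow sequentially

produce("click", key="user-42")
produce("view", key="user-42")  # lands in the same partition as the record above
```

Because records for one key always land in one partition, Kafka can guarantee ordering per key but not across the whole topic.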
Data Engineer’s Learning Path
I believe the following is a sensible order for 𝗬𝗼𝘂𝗿 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗣𝗮𝘁𝗵:
➡️ 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝗕𝗮𝘀𝗶𝗰 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗲𝘀:
👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling
➡️ 𝗟𝗲𝗮𝗿𝗻 𝗺𝗼𝘀𝘁 𝘄𝗶𝗱𝗲𝗹𝘆 𝘂𝘀𝗲𝗱 𝘁𝗼𝗼𝗹𝗶𝗻𝗴 𝗯𝘆 𝗰𝗿𝗲𝗮𝘁𝗶𝗻𝗴 𝗮 𝗽𝗲𝗿𝘀𝗼𝗻𝗮𝗹 𝗽𝗿𝗼𝗷𝗲𝗰𝘁 𝘁𝗵𝗮𝘁 𝗹𝗲𝘃𝗲𝗿𝗮𝗴𝗲𝘀 𝘁𝗵𝗲 𝘁𝗲𝗰𝗵𝗻𝗼𝗹𝗼𝗴𝘆:
👉 Python
👉 SQL
👉 Airflow - 𝘆𝗲𝘀 𝗔𝗶𝗿𝗳𝗹𝗼𝘄. Many say that focusing only on Airflow as a scheduler is a narrow-minded approach. Well, you will find Airflow in the vast majority of job ads - start with it, forget what people say.
👉 Spark
👉 dbt
➡️ 𝗟𝗲𝗮𝗿𝗻 𝗙𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹𝘀 𝗗𝗲𝗲𝗽𝗹𝘆:
👉 Data Modeling
👉 Distributed Compute
👉 Stakeholder Management
👉 System Design
👉 …
➡️ 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴/𝗦𝗽𝗲𝗰𝗶𝗮𝗹𝗶𝘇𝗶𝗻𝗴:
👉 Stream Processing
👉 Feature Stores
👉 Data Governance
👉 DataOps
👉 Different tooling to implement the same Basic Processes
👉 ...