Data Engineering Fundamentals, or What Every Data Engineer Should Know
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data related concepts in a simple and easy to digest way and to help you upskill in Data Engineering, MLOps, Machine Learning and Data Science.
Row Based vs. Column Based File Formats

Row Based:

➡️ Rows on disk are stored in sequence.
➡️ New rows are written efficiently since you can write the entire row at once.
➡️ Select statements that target a subset of columns are slower since you need to scan every row to retrieve a single column.
➡️ Compression is not efficient since values of different data types are scattered all around the file.

👉 Example File Formats: Avro

✅ Use for OLTP purposes.
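To make this concrete, here is a minimal sketch of writing and reading a row-oriented Avro file with the fastavro library (the library choice, schema and field names are illustrative assumptions):

```python
# Minimal sketch: writing row-oriented data to an Avro file with fastavro.
# Assumes `pip install fastavro`; schema and field names are illustrative only.
from fastavro import writer, reader

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string"},
    ],
}

records = [
    {"id": 1, "name": "Alice", "country": "LT"},
    {"id": 2, "name": "Bob", "country": "DE"},
]

# Each record is appended as a whole row - cheap writes, good for OLTP-style workloads.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Reading back iterates row by row; even if you only need `country`,
# every full record is deserialized - that is the row-based trade-off.
with open("users.avro", "rb") as f:
    for rec in reader(f):
        print(rec["country"])
```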

Column Based:

➡️ Columns on disk are stored in sequence.
➡️ New rows are written slowly since the fields of a single row have to be written into different parts of the file.
➡️ Select statements that target a subset of columns are faster than with row based storage since you don't need to scan the entire file.
➡️ Compression is efficient since values of the same data type are always grouped together.

👉 Example File Formats: Parquet, ORC

✅ Use for OLAP purposes.
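And the columnar counterpart - a minimal sketch with pyarrow writing Parquet and reading back only one column (library choice, table and column names are illustrative assumptions):

```python
# Minimal sketch: writing the same data to Parquet (columnar) with pyarrow
# and reading back only a subset of columns. Assumes `pip install pyarrow`.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "country": ["LT", "DE", "LT"],
})

# Values are laid out column by column, so same-typed data compresses well.
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: only the `country` column is read from disk,
# which is what makes columnar formats attractive for OLAP queries.
countries = pq.read_table("users.parquet", columns=["country"])
print(countries.to_pydict())
```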
Kafka - Writing Data

Kafka is an extremely important Distributed Messaging System to understand, as it was the first of its kind and most of the newer products are built on the ideas behind Kafka.

Some general definitions:

➡️ Clients writing to Kafka are called Producers.
➡️ Clients reading the Data are called Consumers.
➡️ Data is written into Topics, which can be compared to Tables in Databases.
➡️ Messages sent to Topics are called Records.
➡️ Topics are composed of Partitions.
➡️ Each Partition behaves like, and in fact is, a Write Ahead Log.
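To see these definitions in code, here is a minimal Consumer sketch using the kafka-python client (the client library, broker address, topic and group id are illustrative assumptions):

```python
# Minimal Consumer sketch with kafka-python - any Kafka client exposes the same concepts.
# Broker address, topic name and group id are illustrative only.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                     # Topic ~ a table in a database
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",      # start from the beginning of each Partition
)

# Each message carries the Partition it came from and its Offset within that Partition.
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)
```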
Writing Data:

➡️ Two types of records can be sent to a Topic: records containing a Key and records without a Key.
➡️ If there is no key, records are written to Partitions in a Round Robin fashion.
➡️ If there is a key, records with the same key are always written to the Same Partition.
➡️ Data is always appended to the End of the Partition.
➡️ When written, a record is assigned an Offset which denotes its Order/Place in the Partition.
➡️ Each Partition has its own set of Offsets, starting from 0.
➡️ Offsets are incremented sequentially as new records are written.
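And the Producer side, sending records with and without a key - again a minimal sketch with kafka-python, where the broker address, topic and payloads are illustrative assumptions:

```python
# Minimal Producer sketch with kafka-python.
# Broker address, topic name and payloads are illustrative only.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# No key: records are spread across Partitions (round robin / sticky batching,
# depending on the client version).
producer.send("user-events", value=b'{"event": "click"}')

# With a key: records sharing the same key always land in the Same Partition,
# so ordering is preserved per key.
future = producer.send("user-events", key=b"user-42", value=b'{"event": "purchase"}')

# The broker appends the record to the end of the Partition and returns
# the Partition and the Offset it was assigned (offsets start at 0).
metadata = future.get(timeout=10)
print(metadata.partition, metadata.offset)

producer.flush()
```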
Data Engineer's Learning Path

I believe that the following is the right order in which to start Your Data Engineering Path:

➡️ Understand Basic Processes:

👉 Data Extraction
👉 Data Validation
👉 Data Contracts
👉 Loading Data into a DWH / Data Lake
👉 Transformations in a DWH / Data Lake
👉 Scheduling
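To tie these Basic Processes together, here is a deliberately tiny, hypothetical pipeline - pandas plus SQLite standing in for a real source and DWH; the file name, table names and validation rule are assumptions:

```python
# Hypothetical, tiny pipeline covering extract -> validate -> load -> transform.
import sqlite3
import pandas as pd

# 1. Extraction: pull raw data from a source (here: a CSV file).
raw = pd.read_csv("orders.csv")

# 2. Validation / a very informal "data contract": required columns and no negative amounts.
required_columns = {"order_id", "customer_id", "amount"}
assert required_columns.issubset(raw.columns), "source violated the expected schema"
assert (raw["amount"] >= 0).all(), "negative amounts found"

# 3. Loading into a "warehouse" (SQLite standing in for a real DWH / Data Lake).
con = sqlite3.connect("warehouse.db")
raw.to_sql("staging_orders", con, if_exists="replace", index=False)

# 4. Transformation inside the warehouse: build a small aggregate table.
con.execute("DROP TABLE IF EXISTS orders_per_customer")
con.execute("""
    CREATE TABLE orders_per_customer AS
    SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM staging_orders
    GROUP BY customer_id
""")
con.commit()
con.close()

# 5. Scheduling: in practice this script would be triggered by an orchestrator
#    such as Airflow (see the DAG sketch further below).
```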

➡️ Learn the most widely used tooling by creating a personal project that leverages the technology:

👉 Python
👉 SQL
👉 Airflow - yes, Airflow. Many say that focusing only on Airflow as a scheduler is a narrow-minded approach. Well, you will find Airflow in 99% of job ads - start with it, forget what people say (a minimal DAG sketch follows after this list).
👉 Spark
👉 dbt
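As promised above, a minimal Airflow DAG sketch (Airflow 2.4+ style; the dag_id, schedule and callables are illustrative assumptions):

```python
# Minimal Airflow DAG sketch (Airflow 2.4+ style). Names and schedule are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data...")


def load():
    print("loading data into the warehouse...")


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract before load.
    extract_task >> load_task
```

In a real project the callables would run the extraction, validation and loading steps from the pipeline sketch above.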

➡️ Learn Fundamentals Deeply:

👉 Data Modeling
👉 Distributed Compute
👉 Stakeholder Management
👉 System Design
👉 …

➡️ Continue Learning / Specializing:

👉 Stream Processing
👉 Feature Stores
👉 Data Governance
👉 DataOps
👉 Different tooling to implement the same Basic Processes
👉 …