The SwirlAI Data Engineering Project Master Template.
And why you should consider implementing it as a Data Engineer looking to up-skill in the field.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help you upskill and keep you updated on the latest news in Data Engineering, MLOps, Machine Learning and the Data space overall.
This is a 🔒 Paid Subscriber 🔒 only issue. If you want to read the full article, consider upgrading to a paid subscription.
Some time ago I started putting together a comprehensive template for a Data Engineering project. Today I want to introduce a high-level overview of it and explain why you should consider implementing it (or simply follow the future articles around it in order to learn Data Engineering).
I wanted to create a single template that would allow you to learn the technical nuances of everything you might run into as a Data Engineer in one end-to-end project. I also want it to serve as a map that lets me connect the fundamental concepts I talk about in the Newsletter to a specific use case implemented in the Template.
In this Newsletter episode I will lay out the high-level structure of the template. Every few weeks, as part of the premium Newsletter tier, I will release a hands-on tutorial implementing a piece of the template until it becomes an end-to-end infrastructure project that generalises to any type of data you might want to push through the data pipeline - from data creation all the way to serving it. We will also learn the most crucial components of Kubernetes, as we will be deploying the project on a K8s cluster.
Today on a high level we cover:
Introduction to the template.
Why this template might be a good idea to implement if you want to up-skill as a Data Engineer.
Scaling Infrastructure of the template and making it production ready.
Real Time Pipeline:
The Collector.
Enricher/Validator.
Enrichment/Machine Learning API.
Real Time Loader.
Batch Loader.
Batch Pipeline.
Introduction to the template.
As mentioned before, the template is meant to cover a lot of different data manipulation patterns, hence it is relatively extensive. Here are the pieces:
Real Time Pipeline:
Data Producers - these are independent applications that extract data from chosen Data Sources and push it in real time to the Collector application via REST or possibly gRPC API calls.
Collector - a REST or gRPC API (it is important to note that gRPC will only be an option inside a private network; if the network is public, you will have to go with REST) that takes a payload (JSON, or Protobuf in the case of gRPC), validates the existence and correctness of the top-level fields, adds additional metadata and pushes the data into either the Raw Events Topic if validation passes or a Dead Letter Queue if the top-level fields are invalid (a minimal Collector sketch follows this list).
Enricher/Validator - Stream Processing Application that validates the schema of events in the Raw Events Topic, performs some (optional) data enrichment and pushes the results into either the Enriched Events Topic, if validation passes and enrichment succeeds, or a Dead Letter Queue if either of those steps fails (also sketched after this list).
Enrichment API - API of any flavour that will be called for enrichment purposes by Enricher/Validator. This could be a Machine Learning Model deployed as an API as well.
Real Time Loader - Stream Processing Application that reads data from the Enriched Events and Enriched Events Dead Letter Topics and writes it in real time to Elasticsearch indexes for Real Time Analysis and alerting.
Batch Loader - Stream Processing Application that reads data from the Enriched Events Topic, batches it in memory and writes it to MinIO or another kind of Object Storage.
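To make the Collector concrete, here is a minimal sketch of what it could look like, assuming Python with FastAPI and kafka-python; the topic names, required fields and broker address are illustrative choices of mine, not part of the Template itself.

```python
# Minimal Collector sketch: FastAPI endpoint that validates top-level fields
# and routes the event to the raw topic or a Dead Letter Queue.
import json
import time

from fastapi import FastAPI, Request
from kafka import KafkaProducer

REQUIRED_TOP_LEVEL_FIELDS = {"event_type", "payload"}  # hypothetical contract
RAW_TOPIC = "raw-events"
DLQ_TOPIC = "raw-events-dlq"

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.post("/collect")
async def collect(request: Request):
    event = await request.json()
    # Top-level validation only; full schema checks happen in the Enricher/Validator.
    valid = isinstance(event, dict) and REQUIRED_TOP_LEVEL_FIELDS.issubset(event)
    event_with_meta = {"received_at": time.time(), "event": event}
    producer.send(RAW_TOPIC if valid else DLQ_TOPIC, event_with_meta)
    return {"status": "accepted" if valid else "rejected"}
```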
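Similarly, a bare-bones Enricher/Validator could be a Kafka consumer/producer pair that validates each event and calls the Enrichment API over HTTP. The topic names, the schema check and the API endpoint below are hypothetical placeholders, assuming kafka-python and requests.

```python
# Enricher/Validator sketch: consume raw events, validate the schema,
# call the Enrichment API, and route results to the enriched topic or a DLQ.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

ENRICHED_TOPIC = "enriched-events"
DLQ_TOPIC = "enriched-events-dlq"
ENRICHMENT_API_URL = "http://enrichment-api:8000/predict"  # hypothetical endpoint

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def schema_is_valid(record: dict) -> bool:
    # Placeholder for real schema validation (e.g. against a Schema Registry).
    return isinstance(record.get("event", {}).get("payload"), dict)

for message in consumer:
    record = message.value
    try:
        assert schema_is_valid(record)
        response = requests.post(ENRICHMENT_API_URL, json=record["event"], timeout=5)
        response.raise_for_status()
        record["enrichment"] = response.json()
        producer.send(ENRICHED_TOPIC, record)
    except Exception:
        # Schema violation or failed enrichment: route to the Dead Letter Queue.
        producer.send(DLQ_TOPIC, record)
```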
Batch Pipeline:
This is where scripts scheduled via Airflow (or any other pipeline orchestrator) read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and apply any additional enrichments. This is also where you construct the Data Model that will later be used for reporting purposes (a minimal DAG sketch follows below).
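As a rough illustration of the shape this stage takes, here is a minimal Airflow DAG sketch with placeholder tasks; the DAG name, schedule and task logic are my own assumptions rather than the Template's final design.

```python
# Sketch of the Batch Pipeline as a daily Airflow DAG with two placeholder tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_quality(**context):
    # e.g. read yesterday's partition from the enriched-events bucket in MinIO
    # and fail the task if row-count or null-rate checks do not pass.
    ...

def deduplicate(**context):
    # e.g. drop duplicate event ids and write the curated partition back to MinIO.
    ...

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate_quality", python_callable=validate_quality)
    dedup = PythonOperator(task_id="deduplicate", python_callable=deduplicate)
    validate >> dedup
```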
Some of the Infrastructure elements that will be needed:
A. A single Kafka instance that will hold all of the Topics for the Project.
B. A single MinIO instance that will hold all of the Buckets for the Project.
C. Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.
D. Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.
E. Elasticsearch instance to hold Real Time Data.
F. Superset Instance that you mount on top of Trino Querying Engine for Batch Analytics and Elasticsearch for Real Time Analytics.
Why might this Template be a good idea to implement if you want to up-skill as a Data Engineer?
I have constructed the Template based on my own experience, trying to make a single template that allows you to learn most of the Data manipulation patterns out there.
You are likely to find similar setups in real-life situations. Seeing something like this implemented signals a relatively high level of maturity in an Organization's Data Architecture.
You can choose to go extremely deep or stay light on any of the architecture elements when learning and studying The Template. You can also implement it starting from whichever piece of it you prefer.
We will cover most of the possible Data Transportation/Manipulation Patterns:
Data Producers - Data extraction from external systems. You can use different technologies for each of the Applications.
Collector - API to Collect Events. Here you will acquire skills in building REST/gRPC Servers and learn the differences between the two.
Enrichment API - API to expose data Transformation/Enrichment capability. This is where we will learn how to expose a Machine Learning Model as a REST or gRPC API.
Enricher/Validator - Stream to Stream Processor. Extremely important, as here we will enrich Data with ML Inference results in real time and, more importantly, ensure that the Data Contract is respected between Producers and Consumers of the Data.
Batch Loader - Stream to Batch Storage Sink. Here we will learn about object storage and how to serialize data effectively so that downstream Batch Processing jobs are as efficient and performant as possible (a sketch follows this list).
Real Time Loader - Stream to Real Time Storage Sink. We will learn how to use Kafka Consumers to write data to Real Time Storage and observe data arriving in real time through Superset Dashboards (also sketched after this list).
Batch Pipeline - We will look into concepts such as: the difference between a Data Warehouse, a Data Lake and a Data Lakehouse; what the Bronze, Silver and Golden layers are in Data Lakehouse Architecture; what SLAs are; how you can leverage dbt; the internals and best practices of Airflow; and much more.
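For the Batch Loader, a minimal sketch could look like the following, assuming Python with kafka-python, pyarrow and s3fs; the topic, bucket, batch size and MinIO credentials are illustrative assumptions, and columnar Parquet is just one sensible serialization choice.

```python
# Batch Loader sketch: consume enriched events, buffer in memory,
# and flush each batch to MinIO as a Parquet file.
import json
import uuid

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs
from kafka import KafkaConsumer

BATCH_SIZE = 10_000  # illustrative flush threshold

fs = s3fs.S3FileSystem(
    key="minio", secret="minio123",
    client_kwargs={"endpoint_url": "http://minio:9000"},
)
consumer = KafkaConsumer(
    "enriched-events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

buffer = []
for message in consumer:
    buffer.append(message.value)
    if len(buffer) >= BATCH_SIZE:
        # Columnar Parquet keeps downstream Spark/Trino jobs efficient.
        table = pa.Table.from_pylist(buffer)
        pq.write_table(table, f"enriched-events/{uuid.uuid4()}.parquet", filesystem=fs)
        buffer.clear()
```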
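And the Real Time Loader can be as small as a Kafka consumer pushing each record into Elasticsearch; the topic and index names below are illustrative, and the exact indexing call may vary between Elasticsearch client versions.

```python
# Real Time Loader sketch: one consumer over both enriched topics,
# writing each record as a document into Elasticsearch.
import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

es = Elasticsearch("http://elasticsearch:9200")
consumer = KafkaConsumer(
    "enriched-events", "enriched-events-dlq",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Route each topic to its own index so dashboards and alerts can target them separately.
    es.index(index=message.topic, document=message.value)
```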
Everything will be containerised and deployed on a local Kubernetes cluster, so you will be able to redeploy it to any Cloud without much change.
The Template is dynamic - in the process we will add an entire MLOps Stack to it. As we build the Template out, we might also decide to add some additional elements that have not been mentioned in this Newsletter episode.
How do we scale the infrastructure of the Template and make it ready for production?