The SwirlAI Data Engineering Project Master Template: The Collector (Part 1).
And how to run it on Kubernetes.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This is a 🔒 Paid Subscriber 🔒 only issue. If you want to read the full article, consider upgrading to a paid subscription.
In this Newsletter episode I will start implementing The Collector piece of The SwirlAI Data Engineering Project Master Template and deploying it to a local Kubernetes cluster. Let’s recap the high-level architecture of the template.
Real Time Pipeline:
Data Producers - these are independent applications that extract data from chosen Data Sources and push it in real time to the Collector application via REST or possibly gRPC API calls.
Collector - a REST or gRPC API (it is important to note that you will only be able to use gRPC in a Private Network; if it is Public, you would have to go with REST) that takes a payload (JSON, or Protobuf in the case of gRPC), validates top-level field existence and correctness, adds additional metadata and pushes the data into the Raw Events Topic if validation passes, or into a Dead Letter Queue if the top-level fields are invalid (see the minimal sketch after this list).
Enricher/Validator - a Stream Processing Application that validates the schema of events in the Raw Events Topic, performs some (optional) data enrichment and pushes the results into the Enriched Events Topic if validation passes and enrichment succeeds, or into a Dead Letter Queue if either of the previous steps has failed.
Enrichment API - an API of any flavour that will be called by the Enricher/Validator for enrichment purposes. This could also be a Machine Learning Model deployed as an API.
Real Time Loader - a Stream Processing Application that reads data from the Enriched Events and Enriched Events Dead Letter Topics and writes it in real time to Elasticsearch indexes for Real Time Analysis and alerting.
Batch Loader - a Stream Processing Application that reads data from the Enriched Events Topic, batches it in memory and writes it to MinIO or another kind of Object Storage.
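To make the Collector's role more concrete, here is a minimal sketch of what such an API could look like in FastAPI. The required field names, the metadata added and the produce_* helper functions are illustrative assumptions only; Kafka producers will only be wired in later in the series, so the topic-writing steps are left as placeholders.

```python
# Minimal sketch of the Collector idea, not the final implementation.
# REQUIRED_FIELDS, the metadata keys and the produce_* helpers are
# placeholders for illustration; Kafka is added in Part 2.
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

# Top-level fields every incoming event is expected to carry (assumed).
REQUIRED_FIELDS = {"event_type", "payload"}


def produce_to_raw_events(event: dict) -> None:
    """Placeholder for pushing the event to the Raw Events Topic."""


def produce_to_dead_letter(event: dict) -> None:
    """Placeholder for pushing the event to the Dead Letter Queue."""


@app.post("/collect")
async def collect(request: Request) -> dict:
    event = await request.json()

    # Validate top-level field existence only; full schema validation
    # happens downstream in the Enricher/Validator.
    if not REQUIRED_FIELDS.issubset(event):
        produce_to_dead_letter(event)
        return {"status": "rejected"}

    # Add collection metadata before forwarding the event downstream.
    event["collector_metadata"] = {
        "event_id": str(uuid.uuid4()),
        "collected_at": time.time(),
    }
    produce_to_raw_events(event)
    return {"status": "accepted"}
```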
Batch Pipeline:
This is where scripts scheduled via Airflow or any other pipeline orchestrator read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and apply any additional Enrichments. This is also where you construct the Data Model that will later be used for reporting purposes.
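As a rough illustration of one such batch step, the sketch below reads a daily partition of enriched events from MinIO (via its S3-compatible API), deduplicates it and writes the curated result to another bucket. The bucket names, partitioning scheme, event_id column and credentials are all assumptions made for the example, and it assumes pandas with s3fs installed.

```python
# Rough sketch of a scheduled batch step; bucket names, paths and
# credentials are placeholders. Requires pandas + s3fs.
import pandas as pd

# Connection details for the MinIO instance (placeholder values).
STORAGE_OPTIONS = {
    "key": "minio-access-key",
    "secret": "minio-secret-key",
    "client_kwargs": {"endpoint_url": "http://minio:9000"},
}


def deduplicate_partition(date: str) -> None:
    # Read one daily partition of enriched events from MinIO.
    df = pd.read_parquet(
        f"s3://enriched-events/date={date}/",
        storage_options=STORAGE_OPTIONS,
    )
    # Basic quality check and deduplication on the event identifier.
    df = df.dropna(subset=["event_id"]).drop_duplicates(subset=["event_id"])
    # Write the curated result to a separate bucket for downstream modelling.
    df.to_parquet(
        f"s3://curated-events/date={date}/part.parquet",
        storage_options=STORAGE_OPTIONS,
    )
```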
Some of the Infrastructure elements that will be needed:
A. A single Kafka instance that will hold all of the Topics for the Project.
B. A single MinIO instance that will hold all of the Buckets for the Project.
C. Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.
D. Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.
E. Elasticsearch instance to hold Real Time Data.
F. Superset instance that you mount on top of the Trino Query Engine for Batch Analytics and on top of Elasticsearch for Real Time Analytics.
In this Newsletter episode, which is Part 1 of the Collector-related series, we will work on the first two pieces of the Real Time Pipeline (the Data Producers and the Collector) without any additional infrastructure elements and implement the following:
The Collector.
Write a Python application that uses the FastAPI framework to expose a REST API endpoint.
Deploy the application on Kubernetes.
Horizontally scale the application for High Availability.
Data Producer.
Write a Python application to download data from the internet.
Run the application on Kubernetes.
Send the downloaded data to the previously deployed REST API endpoint (a minimal sketch of such a producer follows this list).
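Below is a minimal sketch of such a Data Producer. The Collector's Kubernetes Service name, the source URL and the event fields are hypothetical placeholders assumed only for illustration.

```python
# Minimal sketch of a Data Producer; COLLECTOR_URL and SOURCE_URL are
# placeholders (the real Service name and data source come later).
import requests

COLLECTOR_URL = "http://collector-service:8000/collect"  # assumed Service name
SOURCE_URL = "https://example.com/some-public-dataset.json"  # placeholder


def main() -> None:
    # Download data from the internet.
    records = requests.get(SOURCE_URL, timeout=30).json()

    # Forward each record to the Collector's REST endpoint.
    for record in records:
        response = requests.post(
            COLLECTOR_URL,
            json={"event_type": "example_event", "payload": record},
            timeout=10,
        )
        response.raise_for_status()


if __name__ == "__main__":
    main()
```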
In Part 2 of the Collector-related series, which will be coming out in a few weeks, we will:
Improve code structure.
Implement unit tests for our application.
Set up a CI/CD pipeline via GitHub Actions.
Add Kafka to the picture.
Prerequisites.
This tutorial assumes that you know the basics of Python and have skimmed through my latest Guide to Kubernetes which you can find here.
If you want to dig deeper into the full Template without any implementation details, you can do so here.
Defining the architecture.