SwirlAI Newsletter


The SwirlAI Data Engineering Project Master Template: The Collector (Part 1).

And how to run it on Kubernetes.

Aurimas Griciūnas
Aug 27, 2023

👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.



In this Newsletter episode I will start implementing the Collector piece of The SwirlAI Data Engineering Project Master Template and deploy it to a local Kubernetes cluster. Let’s first revisit the high-level architecture of the template.

The SwirlAI Data Engineering Project Master Template

Real Time Pipeline:

  1. Data Producers - these are independent applications that extract data from chosen Data Sources and push it in real time to the Collector application via REST or possibly gRPC API calls.

  2. Collector - a REST or gRPC API (it is important to note here that you will be able to use gRPC only in a Private Network; if the network is Public, you would have to go with REST) that takes a payload (JSON, or protobuf in the case of gRPC), validates top-level field existence and correctness, adds additional metadata and pushes the data into either the Raw Events Topic if the validation passes or a Dead Letter Queue if the top-level fields are invalid (a rough sketch of this behaviour follows after this list).

  3. Enricher/Validator - a Stream Processing Application that validates the schema of events in the Raw Events Topic, performs some (optional) data enrichment and pushes the results into either the Enriched Events Topic if the validation passes and the enrichment was successful, or a Dead Letter Queue if any of the previous steps have failed.

  4. Enrichment API - an API of any flavour that will be called for enrichment purposes by the Enricher/Validator. This could also be a Machine Learning Model deployed as an API.

  5. Real Time Loader - a Stream Processing Application that reads data from the Enriched Events and Enriched Events Dead Letter Topics and writes it in real time to Elasticsearch indexes for Real Time Analysis and alerting.

  6. Batch Loader - a Stream Processing Application that reads data from the Enriched Events Topic, batches it in memory and writes it to MinIO or another kind of Object Storage.
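
To make the Collector’s role (point 2) more concrete before we start building it, here is a minimal, hypothetical FastAPI sketch of the validate-and-route behaviour. The endpoint path, the field names (`event_type`, `payload`), the topic names and the `publish()` helper are illustrative assumptions rather than the template’s actual code, and Kafka itself only enters the picture in Part 2.

```python
# Hypothetical sketch of the Collector's validate-and-route logic; names are assumptions.
import time
import uuid

from fastapi import FastAPI, Request
from pydantic import BaseModel, ValidationError

app = FastAPI()

RAW_EVENTS_TOPIC = "raw-events"       # assumed topic name
DEAD_LETTER_TOPIC = "raw-events-dlq"  # assumed topic name


class Event(BaseModel):
    """Top-level fields whose existence and correctness the Collector checks."""
    event_type: str
    payload: dict


def publish(topic: str, message: dict) -> None:
    """Placeholder for a Kafka producer call; Kafka is only added in Part 2."""
    print(f"-> {topic}: {message}")


@app.post("/collect")
async def collect(request: Request) -> dict:
    body = await request.json()
    try:
        event = Event(**body)  # top-level field validation
    except (ValidationError, TypeError) as exc:
        # Invalid top-level fields are routed to the Dead Letter Queue.
        publish(DEAD_LETTER_TOPIC, {"raw": body, "error": str(exc)})
        return {"status": "rejected"}

    # Add metadata before routing the event to the Raw Events Topic.
    enriched = {
        "event_type": event.event_type,
        "payload": event.payload,
        "event_id": str(uuid.uuid4()),
        "collected_at": time.time(),
    }
    publish(RAW_EVENTS_TOPIC, enriched)
    return {"status": "accepted", "event_id": enriched["event_id"]}
```

Locally you could serve a file containing this app with, for example, `uvicorn collector:app --port 8000` before packaging it into a container image for Kubernetes.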

Batch Pipeline:

This is where scripts scheduled via Airflow or any other pipeline orchestrator read data from the Enriched Events MinIO bucket, validate data quality, perform deduplication and apply any additional Enrichments. Here you also construct your Data Model to be used later for reporting purposes.
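
To make the batch side more tangible, below is a rough pandas sketch of what one such scheduled task could look like. The bucket layout, column names and quality checks are all assumptions; the real pipeline could just as well use Spark, and reading `s3://` paths requires s3fs (or similar) configured against the MinIO endpoint.

```python
# Hypothetical shape of one scheduled batch task; paths, columns and checks are assumptions.
import pandas as pd


def run_daily_batch(partition_date: str) -> pd.DataFrame:
    # Read the day's partition from the (assumed) enriched-events bucket in MinIO.
    events = pd.read_parquet(f"s3://enriched-events/date={partition_date}/")

    # Data-quality validation: drop rows missing mandatory fields.
    events = events.dropna(subset=["event_id", "event_type", "collected_at"])

    # Deduplication on the event identifier, keeping the earliest occurrence.
    events = (
        events.sort_values("collected_at")
        .drop_duplicates(subset="event_id", keep="first")
    )

    # A trivial Data Model step: daily event counts per type, written to a curated bucket.
    daily_counts = events.groupby("event_type").size().reset_index(name="event_count")
    daily_counts.to_parquet(f"s3://curated-events/date={partition_date}/counts.parquet")
    return daily_counts
```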

Some of the Infrastructure elements that will be needed.

A. A single Kafka instance that will hold all of the Topics for the Project.

B. A single MinIO instance that will hold all of the Buckets for the Project.

C. Airflow instance that will allow you to schedule Python or Spark Batch jobs against data stored in MinIO.

D. Presto/Trino cluster that you mount on top of Curated Data in MinIO so that you can query it using Superset.

E. Elasticsearch instance to hold Real Time Data.

F. Superset instance that you mount on top of the Trino Query Engine for Batch Analytics and Elasticsearch for Real Time Analytics.

In this Newsletter episode, which is Part 1 of the Collector series, we will work on items 1 and 2 without any additional infrastructure elements and implement the following:

Today we implement: Producer and Collector applications
  • The Collector.

    • Write a Python application that uses FastAPI Framework to expose a REST API endpoint.

    • Deploy the application on Kubernetes.

    • Horizontally scale the application for High Availability.

  • Data Producer.

    • Write a Python application to download data from the internet.

    • Run the application on Kubernetes.

    • Send the downloaded data to the previously deployed REST API endpoint (a rough Producer sketch follows after this list).
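
Here is a minimal, hypothetical sketch of such a Producer. The data source URL, the event shape and the Collector address (a Kubernetes Service name is assumed for in-cluster traffic) are placeholders, not the template’s final implementation.

```python
# Hypothetical Data Producer sketch; the source URL and Collector address are assumptions.
import requests

DATA_SOURCE_URL = "https://example.com/open-dataset.json"  # placeholder data source
COLLECTOR_URL = "http://collector-service:8000/collect"    # assumed Service name in-cluster


def run_once() -> None:
    # Download a batch of records from the chosen data source.
    records = requests.get(DATA_SOURCE_URL, timeout=30).json()

    # Push each record to the Collector's REST endpoint.
    for record in records:
        response = requests.post(
            COLLECTOR_URL,
            json={"event_type": "example_event", "payload": record},
            timeout=10,
        )
        response.raise_for_status()


if __name__ == "__main__":
    run_once()
```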

In Part 2 of the Collector series, which will be coming out in a few weeks, we will:

  • Improve code structure.

  • Implement unit tests for our application.

  • Set up a CI/CD pipeline via GitHub Actions.

  • Add Kafka to the picture.

Prerequisites.

This tutorial assumes that you know the basics of Python and have skimmed through my latest Guide to Kubernetes, which you can find here.

If you want to dig deeper into the full Template without any implementation details, you can do it here.

Defining the architecture.
