SAI Notes #03: Apache Flink - Architecture.
Let's take a closer look at the internals of Apache Flink.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This week in the weekly SAI Notes:
Apache Flink - Architecture.
Mounting a Feature Store on top of Curated Data.
Apache Flink - Architecture.
Stream Processing is becoming more important and more accessible to organisations of all sizes. One of the most popular Stream Processing Frameworks, used by companies like Airbnb and Uber, is Apache Flink.
But what is under the hood of Apache Flink? Let’s take a closer look.
Two main components are responsible for the runtime of Flink Applications - the JobManager and one or more TaskManagers. There are also a few more pieces that are not part of the runtime itself.
1. Flink Program - not part of the Flink runtime; it prepares the optimal Dataflow Graph of the Flink Application (via the Optimiser/Graph Builder), which is then sent to the JobManager for execution. The Client that is part of the Flink Program is not part of the Flink Cluster: once it has sent the required data to the JobManager, it can either disconnect or stay connected to receive status reports (Attached Mode). The Flink Program is also how developers hand work over to Flink via code - see the sketch below.
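Here is a minimal sketch of such a Flink Program using the Java DataStream API. The job itself (mapping words to their lengths) is a hypothetical placeholder; the point is that the Dataflow Graph is only built and shipped to the JobManager when execute() is called.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordLengthJob {
    public static void main(String[] args) throws Exception {
        // Entry point of every Flink Program; on a real cluster this is
        // where the Client connects to the JobManager.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Each operator below becomes a node in the Dataflow Graph.
        env.fromElements("flink", "jobmanager", "taskmanager")
           .map(String::length)
           .print();

        // Nothing has run so far. execute() triggers the Optimiser/Graph
        // Builder and ships the resulting graph to the JobManager.
        env.execute("word-length-demo");
    }
}
```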
2. JobManager - the JobManager receives work from the Client and coordinates its execution on the Flink Cluster. There can be multiple JobManagers in a cluster for High Availability purposes, but only one of them acts as the leader at any time while the others remain on standby. The JobManager itself is composed of multiple separate processes:
2.1. Dispatcher - this process exposes a REST API to which Flink Applications are submitted for execution. For each submitted Application, the Dispatcher spawns a dedicated JobMaster. The Dispatcher also serves the Flink Web UI (a sketch of calling this API follows below).
2.2. Flink Web UI - a web application where users can monitor the execution progress of Flink Applications submitted to the cluster. Think of it as the equivalent of the Spark UI for Spark applications.
2.3. ResourceManager - this process is responsible for allocating resources in the Flink Cluster. It does so by controlling the available Task Slots on TaskManagers, which are the fundamental units of resource scheduling in Flink. Flink provides several types of ResourceManager depending on the underlying technology used to orchestrate Flink Applications - the most popular being Kubernetes, Hadoop YARN and Local. When running on Kubernetes or Hadoop YARN, the ResourceManager is also able to spawn new TaskManagers.
2.4. JobMaster - this process is responsible for managing a single JobGraph. Since multiple Flink Jobs can run on a single cluster, multiple JobMasters can be running at the same time as well.
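Since the Dispatcher's REST API is the same one the Web UI is built on, a running cluster can also be inspected programmatically. A small sketch, assuming a JobManager reachable on localhost:8081 (Flink's default REST port):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ListFlinkJobs {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // The /jobs endpoint lists submitted jobs and their statuses.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs"))
                .GET()
                .build();
        // Prints a JSON document along the lines of:
        // {"jobs":[{"id":"<job-id>","status":"RUNNING"}]}
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```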
3. TaskManager - TaskManagers are responsible for executing the tasks of the JobGraphs managed by the JobMasters.
3.1. TaskSlot - the smallest unit of resource that can be used to schedule work on the Flink Cluster. The total number of TaskSlots defines how many tasks can be executed in parallel (a configuration sketch follows below).
3.2. Network Manager - this process is responsible for exchanging streams of data between TaskManagers.
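To make the TaskSlot/parallelism relationship concrete, here is a sketch assuming a cluster whose flink-conf.yaml sets taskmanager.numberOfTaskSlots: 4, so that two TaskManagers together offer 8 slots:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Request 8 parallel subtasks per operator. The ResourceManager
        // must find 8 free TaskSlots, e.g. two TaskManagers x 4 slots.
        env.setParallelism(8);

        env.fromSequence(0, 999)
           .map(n -> n * 2)   // runs as 8 parallel subtasks, one per slot
           .print();

        env.execute("parallelism-demo");
    }
}
```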
The JobManager is also responsible for checkpointing: it coordinates the TaskManagers to ensure correct Job execution and to restart the Job from a consistent point in time after a failure. TaskManagers continuously send heartbeats and task statuses back to the JobManager to report their health and state, while the JobManager sends commands such as triggering checkpoints and specific actions on Task execution (deploy/stop/cancel).
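A minimal sketch of what enabling that checkpointing looks like on the application side (the interval and mode below are illustrative choices, not recommendations):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The JobManager's checkpoint coordinator injects a checkpoint
        // barrier into the stream every 10 seconds.
        env.enableCheckpointing(10_000);
        env.getCheckpointConfig()
           .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // On failure, the JobManager restores all TaskManagers from the
        // latest completed checkpoint and resumes processing from there.
        env.fromSequence(0, 999).print();
        env.execute("checkpointing-demo");
    }
}
```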
Mounting a Feature Store on top of Curated Data.
What does it mean to mount a Feature Store on top of Curated Data?
When I talk about Feature Stores in my content, I tend to start with:
---
1️⃣ Feature Stores are mounted on top of the Curated Data Layer.
👉 The Data pulled into the Feature Platform should be of High Quality and meet SLAs.
👉 Curated Data could be coming in Real Time or Batch.
❗️ It is super important that only curated data is fed to the Feature Store - trying to curate the data inside of the Store itself is a recipe for disaster.
---
So, what does this actually mean? Let’s zoom in.
If Data is coming in Batch:
➡️ Raw Data lands in the Data Lake.
➡️ It is loaded into the Staging area of the Data Warehouse, where it is cleaned, indexed and moved to Master Data tables.
➡️ From the Master Data Layer, one or multiple Data Marts are built to specifically fit the data model of the Feature Store (a hedged sketch follows after this list).
✅ Data is periodically pulled by the Feature Store via Batch Ingest API.
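As an illustration of that last step, here is a hedged sketch of building such a Data Mart over JDBC. All table, column and connection names (master.users, master.orders, marts.user_features, the warehouse URL) are hypothetical, and a suitable JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BuildFeatureMart {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; swap in your own URL/driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://warehouse:5432/analytics");
             Statement stmt = conn.createStatement()) {
            // One row per entity key, one column per feature - the shape
            // a Batch Ingest API typically expects to pull from.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS marts.user_features AS " +
                "SELECT u.user_id, " +
                "       COUNT(o.order_id) AS order_count, " +
                "       AVG(o.amount)     AS avg_order_amount " +
                "FROM master.users u " +
                "LEFT JOIN master.orders o ON o.user_id = u.user_id " +
                "GROUP BY u.user_id");
        }
    }
}
```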
If Data is coming in Real Time:
➡️ Raw Data lives in a Distributed Queue like Kafka.
➡️ A Streaming Application (e.g. Flink) reads the Raw Data, cleans it and saves it to a curated data stream (sketched after this list).
➡️ Another Flink Application reads one or multiple curated streams, performs joins and aggregations and saves the data to a final Kafka topic that now holds data specifically fitting the data model of the Feature Store.
✅ The data is pushed to the Feature Store in real time via Real Time Ingest API.
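Here is a sketch of that first real-time step: a Flink job that reads the raw topic, applies a cleaning rule and writes the curated stream back to Kafka. Topic names (events.raw, events.curated), the broker address and the cleaning rule itself are all hypothetical placeholders; the KafkaSource/KafkaSink APIs come from the flink-connector-kafka package:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CurateRawStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Raw Data lives in a Kafka topic (hypothetical names throughout).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events.raw")
                .setGroupId("curation-job")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Curated stream is written back to a separate topic.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("events.curated")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "raw-events")
           .filter(event -> !event.isBlank())   // placeholder cleaning rule
           .sinkTo(sink);

        env.execute("curate-raw-stream");
    }
}
```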
There is a lot of work happening before the data is ready to be handed over to the Feature Store, especially when Real Time Processing is involved. This is why Feature Platform providers are also trying to include Feature Compute as part of the Feature Platform itself. This allows users of the Feature Store to only worry about cleaning the data outside of it, leaving the complexities of managing the transformations to the platform.
Join SwirlAI Data Talent Collective
If you are looking to fill your Hiring Pipeline with Data Talent, or you are looking for a new job opportunity in the Data Space, check out the SwirlAI Data Talent Collective! Find out how it works by following the link below.