SAI #27: Event Latency in Data Systems
A look into how event latency builds up as data travels through the Data System
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and overall Data space.
One of the most important aspects of working with Data Pipelines as a Data Engineer is ensuring that the SLAs agreed upon with your stakeholders are being met. Together with data quality, Data Latency is one of the most important parts of an SLA. This means honoring contracts like:
Every day at 4 AM, the Data Marts in the Data Warehouse have to contain all of the data that was generated the previous day, up to midnight (for a selected time zone).
A real-time view of a streaming pipeline should not show data that is delayed by more than 5 minutes from when it was generated.
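To make the streaming contract concrete, here is a minimal sketch of how such a check could look in Python. The function names and the example timestamps are my own illustration, not part of any specific tool; the only value taken from the contract above is the 5 minute threshold.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the streaming SLA example above.
MAX_STREAM_DELAY = timedelta(minutes=5)


def event_latency(generated_at: datetime, observed_at: datetime) -> timedelta:
    """Latency of one event: when it became visible minus when it was generated."""
    return observed_at - generated_at


def meets_streaming_sla(generated_at, observed_at, max_delay=MAX_STREAM_DELAY):
    """True if the event became visible within the agreed delay."""
    return event_latency(generated_at, observed_at) <= max_delay


# Hypothetical timestamps, for illustration only.
generated = datetime(2023, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
on_time = generated + timedelta(minutes=3)
late = generated + timedelta(minutes=7)

on_time_ok = meets_streaming_sla(generated, on_time)
late_ok = meets_streaming_sla(generated, late)
```

The important detail is that latency is measured against the time the event was generated, not the time it entered the pipeline, so every event needs to carry its generation timestamp with it.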
Data Latency can creep in unexpectedly. What do you do when the SLAs are clearly no longer being met and you are still processing yesterday's data in an ETL that runs at 6 AM?
The only way to handle these situations is to decompose the Data System and see which individual elements contribute to the latency. By doing so, you will be able to tune each element separately to reach the latency needed to meet the SLAs. Without it, you will be running blind.
In today's newsletter I will present an example of what the journey of an event generated on a website might look like and how the latency accumulates as it traverses a Batch Data Pipeline. The time at which the event reaches a specific part of the Data Pipeline will be denoted in the diagrams by T.
I covered the general architecture of a Website Activity Tracking data ingestion pipeline in one of my previous Newsletter episodes (you can find it here). This time I will group it into 5 fundamental parts.
1. - Event Generation.
2. and 3. - Event Collection.
4., 5. and 6. - Transport.
7. - in large scale systems this will usually include two elements - Data Lake and Data Warehouse.
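If each of these parts records the time T at which the event reached it, decomposing the latency becomes simple arithmetic. Below is a minimal sketch in Python. The stage names follow the grouping above, while all timestamps are made-up illustrative values, not measurements from a real system.

```python
from datetime import datetime, timedelta


def latency_breakdown(stage_times):
    """Return (from_stage, to_stage, delta) for each consecutive pair of stages."""
    return [
        (src, dst, t_dst - t_src)
        for (src, t_src), (dst, t_dst) in zip(stage_times, stage_times[1:])
    ]


# Hypothetical T values for a single event (illustrative only).
t0 = datetime(2023, 5, 1, 23, 59, 0)
stage_times = [
    ("event_generation", t0),
    ("event_collection", t0 + timedelta(milliseconds=50)),
    ("transport", t0 + timedelta(milliseconds=55)),
    ("data_lake", t0 + timedelta(minutes=30)),
    ("data_warehouse", t0 + timedelta(hours=2)),
]

breakdown = latency_breakdown(stage_times)

# The hop that contributes the most latency is the first tuning candidate.
slowest_hop = max(breakdown, key=lambda hop: hop[2])

# End-to-end latency is the sum of all per-hop latencies.
total_latency = stage_times[-1][1] - stage_times[0][1]

for src, dst, delta in breakdown:
    print(f"{src} -> {dst}: {delta}")
```

With these made-up numbers, the hop from the Data Lake to the Data Warehouse dominates the total, so that is where tuning would pay off first.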
Let’s take a closer look at each group.
Once generated, the event has to be sent over the network to the exposed endpoint of the next step in the Data System, which is usually a Load Balancer facing a fleet of collectors. This adds network latency of some milliseconds, depending on the distance between where the Load Balancers are deployed and where the event was generated. In the diagram ( A. ) we assume it is 50 ms, but that is a really naive assumption. It could be a lot more depending on your location.
The usual culprit of extra latency is network availability. If the device on which the event was generated does not have network connectivity, it will usually buffer ( B. ) the event locally and only send it out once connectivity is restored. This can add anywhere between a few minutes and several hours of latency. A good example is a plane flight: you turn on flight mode on your device, and the added latency depends on the duration of the flight.
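The local buffering behaviour can be sketched in a few lines of Python. This is a simplified illustration, not how any particular tracking SDK is implemented: the `ClientEventBuffer` class, the `send` callable and the `network` flag are all hypothetical stand-ins.

```python
import time
from collections import deque


class ClientEventBuffer:
    """Holds events locally while offline and flushes them once sending works."""

    def __init__(self, send):
        self._send = send  # hypothetical callable(event) -> bool, True if delivered
        self._pending = deque()

    def __len__(self):
        return len(self._pending)

    def track(self, name, now=None):
        # Stamp the event at generation time, not at send time, so that
        # downstream systems can measure the real latency.
        generated_at = now if now is not None else time.time()
        self._pending.append({"name": name, "generated_at": generated_at})
        self.flush()

    def flush(self):
        # Send events in order; stop at the first failure and retry later.
        while self._pending:
            if not self._send(self._pending[0]):
                return
            self._pending.popleft()


# Simulated connectivity (assumption: a real client would detect this itself).
network = {"online": False}
delivered = []


def send(event):
    if network["online"]:
        delivered.append(event)
    return network["online"]


buffer = ClientEventBuffer(send)
buffer.track("page_view", now=0.0)  # flight mode: the event stays on the device
buffer.track("click", now=60.0)
buffered_while_offline = len(buffer)

network["online"] = True  # connectivity restored
buffer.flush()
```

Because each event keeps its original `generated_at` value, the time spent in the buffer shows up as latency downstream instead of being silently hidden.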
Collectors are applications that act as a gateway between the internet and the internal Data System. Their function is to validate incoming events against a top level schema and to buffer them in memory for a certain time in case the downstream systems (the raw event Kafka Topic) stop responding.
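As a rough illustration of top level schema validation, the sketch below checks only that a few required fields exist and have the expected type. The field names (`event_id`, `event_type`, `generated_at`) are assumptions made for this example. A real collector would more likely rely on a schema format such as JSON Schema or Avro.

```python
# Hypothetical top level schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "event_id": str,
    "event_type": str,
    "generated_at": (int, float),  # epoch seconds set by the client
}


def validate_event(event):
    """Return a list of validation errors; an empty list means the event passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors


valid_event = {
    "event_id": "abc-123",
    "event_type": "page_view",
    "generated_at": 1682942340.0,
    "payload": {"url": "/pricing"},  # extra fields are not checked at this level
}
invalid_event = {"event_type": "page_view", "generated_at": "yesterday"}

valid_errors = validate_event(valid_event)
invalid_errors = validate_event(invalid_event)
```

Keeping this check shallow is a deliberate choice in the sketch: the collector only has to decide quickly whether an event is well formed enough to enter the pipeline.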
At this point, the latency of each additional network hop between applications usually drops to single-digit milliseconds ( A. ), because internal applications are deployed close to each other - in cloud terms, in a single Region and very likely a single Availability Zone.
If the downstream Kafka stops responding ( B. ), the collector starts buffering events to avoid data loss. Unfortunately, the volume of data coming in here is huge, so collectors might only be able to buffer data for a few minutes before they start evicting it, which causes permanent loss. If Kafka recovers in time, we might see additional latency of a few minutes.
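The trade-off between added latency and permanent loss can be shown with a small bounded buffer. This is a sketch under simplifying assumptions: the buffer is limited by event count rather than by memory or time, the oldest event is evicted first, and `produce` is a hypothetical stand-in for a real Kafka producer call.

```python
from collections import deque


class CollectorBuffer:
    """Bounded in-memory buffer that evicts the oldest event when it is full."""

    def __init__(self, produce, max_events):
        self._produce = produce  # hypothetical callable(event) -> bool
        self._max_events = max_events
        self._buffer = deque()
        self.evicted = 0  # number of events permanently lost

    def handle(self, event):
        if len(self._buffer) >= self._max_events:
            # No room left: drop the oldest event to make space (permanent loss).
            self._buffer.popleft()
            self.evicted += 1
        self._buffer.append(event)
        self.drain()

    def drain(self):
        # Forward buffered events in order for as long as downstream accepts them.
        while self._buffer and self._produce(self._buffer[0]):
            self._buffer.popleft()


# Simulated downstream availability.
kafka = {"up": False}
produced = []


def produce(event):
    if kafka["up"]:
        produced.append(event)
    return kafka["up"]


collector = CollectorBuffer(produce, max_events=3)
for event_id in range(5):  # five events arrive while Kafka is down
    collector.handle(event_id)

kafka["up"] = True  # Kafka recovers
collector.drain()
```

In this run the buffer holds only three events, so the two oldest are evicted before Kafka recovers. The remaining three are delivered late, which shows up as additional latency rather than loss.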