SAI #20: Decomposing Real Time Machine Learning Service Latency.
Decomposing Real Time Machine Learning Service Latency, Stream Processing: Event vs. Processing Time.
👋 This is Aurimas. I write the weekly SAI Newsletter, where my goal is to present complicated Data related concepts in a simple and easy to digest way, helping You UpSkill in Data Engineering, MLOps, Machine Learning and Data Science.
This week in the Newsletter:
Decomposing Real Time Machine Learning Service Latency.
Stream Processing: Event vs. Processing Time.
Decomposing Real Time Machine Learning Service Latency.
How do we decompose Real Time Machine Learning Service latency, and why should you, as an ML Engineer, care about understanding the pieces?
Usually, what users of your Machine Learning Service care about is the total endpoint latency - the time between when a request is made against the Service (1.) and when the response is received (6.).
Certain SLAs will be established on acceptable latency, and you will need to meet them. To do that efficiently, you need to know the building blocks that make up the total latency. Being able to decompose it is even more important because you can then improve each piece independently. Let's see how.
The total latency of a typical Machine Learning Service is composed of:
2: Feature Lookup.
You can decrease the latency here by:
A: Changing the Storage Technology to better fit the type of data you are looking up and the queries you are running against the Store.
B: Implementing aggressive caching strategies that allow you to keep the data of the most frequently queried entities locally. You could even spin up local Online Feature Stores if the data held in them is not too large (however, managing and syncing these becomes troublesome).
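A minimal sketch of option B - a small in-process TTL cache in front of the Online Feature Store. `fetch_from_feature_store` and the entity id are placeholders for your own lookup code, and eviction of old entries is left out for brevity:

```python
import time

def fetch_from_feature_store(entity_id: str) -> dict:
    # Placeholder for the actual network call to your Online Feature Store.
    return {"avg_order_value": 42.0, "orders_last_7d": 3}

class TTLFeatureCache:
    """Keeps features of recently queried entities in process memory.

    Entries expire after `ttl_seconds` so stale features are not served forever.
    """

    def __init__(self, ttl_seconds: float = 30.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, entity_id: str) -> dict:
        cached = self._entries.get(entity_id)
        if cached is not None:
            stored_at, features = cached
            if time.monotonic() - stored_at < self._ttl:
                return features  # cache hit: no network round trip
        features = fetch_from_feature_store(entity_id)  # cache miss: pay the lookup once
        self._entries[entity_id] = (time.monotonic(), features)
        return features

feature_cache = TTLFeatureCache(ttl_seconds=30.0)
features = feature_cache.get("user_42")
```

The trade-off is staleness: the longer the TTL, the fewer lookups you pay for, but the older the features your model sees.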
3: Feature Preprocessing.
Decrease latency here by:
A: Moving as much of the preprocessing logic as possible to before the data is pushed to the Feature Store. This way the calculation is performed once instead of on every Inference call. Combine this with the improvements from point 2 for the largest latency gains.
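To make the difference concrete, here is a sketch contrasting preprocessing inside the request path with preprocessing at ingestion time. The feature names and the `feature_store` and `model` objects are assumptions for illustration, not a specific library:

```python
import math
from datetime import datetime

def preprocess(raw_event: dict) -> dict:
    """Transformation we want to run once per event, not once per request."""
    return {
        "user_id": raw_event["user_id"],
        "amount_log": math.log1p(raw_event["amount"]),
        "hour_of_day": datetime.fromtimestamp(raw_event["ts"]).hour,
    }

# Slower at inference: store raw data, preprocess inside the request path.
def predict_with_inline_preprocessing(raw_event: dict, model):
    features = preprocess(raw_event)  # paid on every request
    return model.predict([[features["amount_log"], features["hour_of_day"]]])

# Faster at inference: preprocess once, when the data is pushed to the Feature Store.
def ingest(raw_event: dict, feature_store):
    feature_store.write(preprocess(raw_event))  # paid once, at write time

def predict_with_precomputed_features(user_id: str, feature_store, model):
    features = feature_store.read(user_id)  # plain lookup, no transformation
    return model.predict([[features["amount_log"], features["hour_of_day"]]])
```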
4: Machine Learning Inference.
Decrease latency here by:
A: Reducing the Model Size using techniques like Pruning, Knowledge Distillation or Quantisation (more on these in future posts).
B: If your service consists of multiple models - running inference in parallel and combining the results instead of chaining the models in sequence, where possible.
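A sketch of option B, assuming two hypothetical model objects that expose `.predict()`. Threads help here when inference releases the GIL (as most numeric libraries do) or when the models sit behind separate endpoints:

```python
from concurrent.futures import ThreadPoolExecutor

def combine(pred_a, pred_b):
    # Combination logic depends on the service - a simple average as a placeholder.
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

def predict_sequential(features, model_a, model_b):
    # Total latency ~= latency(model_a) + latency(model_b).
    return combine(model_a.predict(features), model_b.predict(features))

def predict_parallel(features, model_a, model_b):
    # Total latency ~= max(latency(model_a), latency(model_b)) plus some overhead.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(model_a.predict, features)
        future_b = pool.submit(model_b.predict, features)
        return combine(future_a.result(), future_b.result())
```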
These three building blocks are what is specific to an ML System. On top of them, there are naturally components like Network latency and machine resources that we need to take into consideration.
Network Latency could be reduced by pulling the ML Model into the backend service so that predictions no longer require a separate network hop.
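In practice that means swapping a remote call for an in-process prediction. Both variants below are sketches - the endpoint URL, model path and loading code are assumptions that depend on your setup:

```python
import joblib    # one common way to load a serialized model
import requests  # only needed for the remote-call variant

# With a separate model service, every prediction pays a network round trip.
def predict_remote(features: list[float]) -> float:
    response = requests.post(
        "http://model-service/predict", json={"features": features}, timeout=1.0
    )
    return response.json()["prediction"]

# With the model loaded into the backend service at startup, the hop disappears.
MODEL = joblib.load("model.joblib")

def predict_in_process(features: list[float]) -> float:
    return float(MODEL.predict([features])[0])
```

The cost is coupling: the backend now has to ship the model's dependencies and scale with its resource needs.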
When it comes to Machine Resources - it is important to consider horizontal scaling strategies and how large your servers need to be to efficiently host the type of model you are using.
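Finally, to know which block to attack first you need to measure them separately. A minimal sketch of per-stage timing inside a request handler, where `lookup_features`, `preprocess` and `model` stand in for your own implementations of points 2, 3 and 4:

```python
import time

def handle_request(entity_id: str, model) -> dict:
    timings = {}

    start = time.perf_counter()
    raw_features = lookup_features(entity_id)   # 2: Feature Lookup
    timings["feature_lookup_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    features = preprocess(raw_features)         # 3: Feature Preprocessing
    timings["preprocessing_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    prediction = model.predict([features])      # 4: Machine Learning Inference
    timings["inference_ms"] = (time.perf_counter() - start) * 1000

    # Emit `timings` to your metrics system to see where the SLA budget goes.
    return {"prediction": prediction, "timings": timings}
```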
Stream Processing: Event vs. Processing Time.