👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This is a 🔒 Paid Subscriber 🔒 only issue. If you want to read the full article, consider upgrading to a paid subscription.
Most online sources advocate for achieving the highest level of data freshness possible, which means minimising the time from data generation to serving it to the end user. This is no surprise, as these articles are usually written by companies that help with Data Freshness. While there is no doubt about the value of fresh data, it does not come without cost. In my career I have been a victim of my own geekiness and ambition multiple times: while leading projects, I would advocate for over-engineered solutions, resulting in suboptimal ROI. In this article I will outline the observations I have made over the years on Data Freshness in Machine Learning Systems, and hopefully it will help some readers make better decisions when architecting their own ML Systems.
I will cover the following topics:
What is Data Freshness and why is it important?
Data Freshness in Machine Learning Systems (Feature Freshness).
Levels of Feature Freshness in Machine Learning Systems.
Moving between the levels.
Complexities introduced with each new level.
Data Freshness.
In the most general terms, Data Freshness is the amount of time it takes for a data point, after being generated by Data Producers, to become available for consumption by Data Consumers in analytical systems. A closely related term is Data Latency.
Data Latency refers to the time between when the data was generated and when it was made available in core data stores like Data Warehouses or Data Lakes.
Once the data is available in the core data store, it is not necessarily exposed to data consumers automatically. You might still need to surface it to BI Reporting tools or Reverse ETL systems. Our focus today is on Machine Learning Systems, which are themselves a Data Consumer of the Data Engineering System.
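To make the two terms concrete, here is a minimal sketch of how you could measure them for a single data point. It assumes you can obtain three timestamps: when the Data Producer generated the event, when it landed in the core data store, and when it became queryable by Data Consumers. The variable names and values are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical timestamps for a single data point, all in UTC.
# In a real system these would come from producer event metadata,
# warehouse/lake load logs and the serving layer respectively.
generated_at = datetime(2023, 5, 1, 10, 0, tzinfo=timezone.utc)    # event created by the Data Producer
loaded_at = datetime(2023, 5, 1, 10, 45, tzinfo=timezone.utc)      # landed in the Data Warehouse / Data Lake
consumable_at = datetime(2023, 5, 1, 11, 30, tzinfo=timezone.utc)  # exposed to Data Consumers (BI, Reverse ETL, ML)

data_latency = loaded_at - generated_at        # generation -> core data store
data_freshness = consumable_at - generated_at  # generation -> consumable by Data Consumers

print(f"Data Latency:   {data_latency}")    # 0:45:00
print(f"Data Freshness: {data_freshness}")  # 1:30:00
```

In practice you would of course track these as distributions (e.g. a p95 per table or per pipeline) rather than for a single data point.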
Why is Data Freshness important?
The value of data decreases over time.
Stale data applied in production might damage your business.
Near Real-Time data unlocks new use cases.
The ability to react quickly to recent data gives you a competitive advantage.
Fresh data builds trust.
Having said this, there are multiple cases where a certain level of data staleness is fully acceptable:
Weekly/daily management reports can be delivered several hours after the day closes with no severe impact.
Sending a promotional email to a user with a high probability of churn, estimated on yesterday's data via a daily batch job and delivered the next afternoon, is usually good enough.
…
Data Freshness in Machine Learning Systems.
When it comes to Machine Learning Systems, we usually plug in on top of Data Engineering Systems where data is already collected, transformed and curated for efficient usage in downstream systems - the ML System is just one of them. This does not mean, however, that no additional data transformations need to happen after the data is handed over. We refer to Data Freshness in Machine Learning Systems as Feature Freshness.
When thinking about how data is composed and served to the end user in ML Systems, there are two mostly independent pieces, and hence two perspectives on Feature Freshness:
Feature Freshness at Model Training time: how much time it takes for a generated data point to be included in the training of a Machine Learning Model which is then deployed to serve the end user. Remember that Machine Learning models are nothing more than statistical models trained to predict certain outcomes on a given feature distribution. We can’t avoid ML Models becoming stale if they are not retrained; the phenomena behind this staleness are called Feature and Concept Drift (you can read more about them here).
Feature Freshness at Inference time: how much time it takes for a generated data point to be available when performing Inference with the previously trained and deployed model. Features used for inference are usually decoupled, in terms of freshness, from the ones used while training the model, and are typically less stale.
Both types of Feature Freshness add on top of the Data Freshness introduced by the Data Engineering System; a small sketch of how the two differ for the same data point follows below.
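As a rough illustration of this decoupling, here is a minimal sketch comparing the two. It assumes hypothetical timestamps for when a data point was generated, when it was next included in a training dataset whose model then got deployed, and when its feature value was read from an online store at inference time.

```python
from datetime import datetime, timezone, timedelta

def feature_freshness(event_time: datetime, used_at: datetime) -> timedelta:
    """Time elapsed between a data point being generated and it being used."""
    return used_at - event_time

# Hypothetical timestamps, all in UTC.
event_time = datetime(2023, 5, 1, 10, 0, tzinfo=timezone.utc)            # data point generated
training_snapshot_at = datetime(2023, 5, 3, 2, 0, tzinfo=timezone.utc)   # next training dataset built and model retrained
online_lookup_at = datetime(2023, 5, 1, 10, 5, tzinfo=timezone.utc)      # feature value read from the online store

# Freshness at Model Training time: the data point only influences predictions
# once a model trained on it is deployed.
print("Training-time freshness:", feature_freshness(event_time, training_snapshot_at))  # 1 day, 16:00:00

# Freshness at Inference time: the same data point can be served as a feature
# value long before the model is retrained.
print("Inference-time freshness:", feature_freshness(event_time, online_lookup_at))     # 0:05:00
```

The two numbers can easily differ by orders of magnitude, which is exactly why they should be reasoned about separately when architecting the system.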
Levels of Feature Freshness in Machine Learning Systems.
In this article I want to outline the different levels of Feature Freshness that I have observed in the industry and that you could achieve in your ML Systems. I will also discuss the investments you would need to make and the complexities each new level introduces into the system.