SAI Notes #06: Machine Learning Model Compression.
Let's review the methods of compressing Machine Learning models and why you might need them.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and overall Data space.
This week in the weekly SAI Notes:
Machine Learning model compression and why you might need it.
Thoughts on Latency of Feedback Loops in Machine Learning applications.
Machine Learning model compression and why you might need it.
When you deploy Machine Learning models to production, you need to take into account several operational metrics that are in general not ML related. Today we talk about two of them:
👉 Inference Latency: how long it takes for your model to compute an inference result and return it.
👉 Size: how much memory your model occupies when it is loaded for serving inference results.
Both of these are important when considering operational performance and feasibility of your model deployment in production.
👉 Large models might not fit on a device if you are considering edge deployments.
👉 Latency of retrieving inference results might make the business case infeasible. E.g. Recommendation Engines require latencies in the tens of milliseconds, as ranking has to be applied in real time while the user browses your website or app.
👉 …
You can influence both latency and model size by applying different Model Compression methods, the most popular of them being:
➡️ Pruning: this method is mostly used with tree-based and Neural Network algorithms. In tree-based models we prune leaves or branches from the decision trees. In Neural Networks we remove nodes and synapses (weights) while trying to retain ML performance metrics (a short code sketch follows below).
✅ In both cases the output is a reduction in the number of Model Parameters and model size.
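As an illustration, here is a minimal pruning sketch in PyTorch using the torch.nn.utils.prune utilities. The toy model and the 30% sparsity target are arbitrary choices for the example, not a recommendation:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small toy network; any nn.Module with Linear or Conv layers works the same way.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Roughly 30% of the Linear weights are now exactly zero.
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
zeros = sum((m.weight == 0).sum().item() for m in linears)
total = sum(m.weight.numel() for m in linears)
print(f"Weight sparsity: {zeros / total:.2%}")
```

Keep in mind that zeroed-out weights only translate into a smaller or faster model if the serving stack can exploit sparsity, or if whole structures (nodes, channels) are removed.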
➡️ Knowledge Distillation: this type of compression is achieved by the following steps (a minimal training sketch follows the list):
👉 Training an original large model which is called the Teacher model.
👉 Training a smaller model to mimic the Teacher model by transferring knowledge from it, this model is called the Student model. Knowledge in this context can be extracted from the outputs, internal hidden state (feature representations) or a combination of both.
👉 We then use the “Student” model in production.
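As a minimal sketch of the output-based variant, assume a classification task where a small student is trained to match the teacher's softened output distribution in addition to the ground-truth labels. The toy models, the temperature T and the loss weighting alpha below are illustrative assumptions, not prescribed values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: regular cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy teacher (large) and student (small) classifiers: 20 features, 5 classes.
teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 5)).eval()
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# One training step on random data, for illustration only.
inputs, labels = torch.randn(32, 20), torch.randint(0, 5, (32,))
with torch.no_grad():
    teacher_logits = teacher(inputs)  # the teacher is frozen and only provides targets
loss = distillation_loss(student(inputs), teacher_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, the teacher would be your trained production model and the loop would run over your real training data.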
➡️ Quantisation: the most commonly used method, and one that does not have much to do with Machine Learning itself. The approach uses fewer bits to represent model parameters (an example follows the points below).
👉 You can apply quantisation techniques both during training (quantisation-aware training) and after the model has already been trained (post-training quantisation).
👉 In regular Neural Networks, what gets quantised are the model weights, biases and activations.
👉 The most common quantisation is from floating point to integer (32 bits to 8 bits).
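As an example, here is a sketch of post-training dynamic quantisation in PyTorch, where Linear layer weights are converted from 32-bit floats to 8-bit integers (the toy model is an assumption for the example, and API details may differ between PyTorch versions):

```python
import torch
import torch.nn as nn

# A toy float32 model; in practice this would be your already trained network.
model_fp32 = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantisation: Linear weights are stored as int8,
# activations are quantised on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

# Same interface for inference, but a roughly 4x smaller weight footprint.
x = torch.randn(1, 784)
print(model_int8(x).shape)
```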
➡️ …
[Important]: while the above methods do reduce the size of the models and by doing so allow them to be deployed in production scenarios, there is almost always some reduction in accuracy, so be careful and evaluate the trade-off accordingly.
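One simple way to quantify that trade-off is to evaluate the original and the compressed model on the same held-out dataset and compare the results. Here is a sketch, assuming hypothetical original_model, compressed_model and a test_loader yielding (inputs, labels) batches:

```python
import torch

def accuracy(model, test_loader):
    # Fraction of correctly classified examples over a held-out dataset.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# original_model, compressed_model and test_loader are assumed to exist.
# drop = accuracy(original_model, test_loader) - accuracy(compressed_model, test_loader)
# print(f"Accuracy drop from compression: {drop:.2%}")
```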
Thoughts on Latency of Feedback Loops in Machine Learning applications.