SAI Notes #07: What is a Vector Database?
Let's look into what Vector Databases are and how they might empower your Machine Learning applications.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This week in the weekly SAI Notes:
What is a Vector Database?
Partitioning vs. Bucketing in Spark.
What is a Vector Database?
With the rise of Foundation Models, Vector Databases have skyrocketed in popularity as well. The truth is that a Vector Database is also useful outside of a Large Language Model context.
When it comes to Machine Learning, what we often deal with are Vector Embeddings. A Vector Embedding is created by projecting the contextual (feature) information of an object into a Latent (Embedding) space, which is done by running the features of the object through a specific Machine Learning model.
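To make the idea of "features in, vector out" concrete, here is a minimal sketch. `toy_embed` is a hypothetical stand-in for a real embedding model: it only hashes words into a fixed number of dimensions, so unlike a trained model it does not capture meaning. The function name and the `dim` parameter are assumptions made for illustration.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model (hashed bag-of-words)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # A stable hash keeps the same token in the same dimension across runs.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    # L2-normalise so every non-empty text maps to a unit-length vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


embedding = toy_embed("vector databases store embeddings")
print(len(embedding))  # always equals dim, regardless of text length
```

The property that matters here is that any input, short or long, ends up as a vector of the same fixed length in the same space, which is what makes vectors comparable later on.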
Vector Databases are built to perform particularly well when working with Vector Embeddings. Work in this context includes storing, updating and retrieving vectors. When we talk about retrieval, we usually mean retrieving the vectors that are most similar to a query that is embedded into the same Latent space and passed to the Vector DB. This retrieval procedure is called Approximate Nearest Neighbour (ANN) search; more on it in point 10.
A query here could be in the form of an object, like an image for which we would like to find similar images. Or it could be a question for which we want to retrieve relevant context that could later be transformed into an answer via e.g. a Large Language Model.
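As a rough illustration of "most similar", the sketch below runs an exact, brute-force nearest neighbour search using cosine similarity. This is the result that ANN search approximates; real Vector Databases avoid scoring every stored vector. The vectors, keys and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Exact search: score every stored vector, keep the k most similar."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical, hand-written embeddings of three images.
stored = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "car_photo": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # the query is embedded into the same space

nearest = top_k(query, stored, k=2)
print(nearest)  # ['cat_photo', 'dog_photo']
```

Scoring every vector costs time proportional to the size of the collection, which is why an approximate search over an index becomes attractive at scale.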
Let’s look into how one would interact with a regular Vector Database:
Writing/Updating Data.
1. Choose a Machine Learning model that will be used to generate Vector Embeddings.
2. You can embed any type of information: text, images, audio, tabular data. The choice of ML model used for embedding depends on the type of data.
3. After running the preprocessed data through the Embedding Model you get a Vector representation of your data.
4. Most modern Vector Databases allow storing additional metadata together with the Vector Embedding. This metadata can later be used to pre-filter or post-filter ANN search results.
5. The Vector Database indexes the received Vector Embedding and the metadata separately. Indexes are created for faster retrieval of data when performing queries. There are multiple methods that can be used for creating vector indexes, some of them being: Random Projection, Product Quantization, Locality-sensitive Hashing, Hierarchical Navigable Small World.
6. Vector data is stored together with indexes for Vector Embeddings and metadata connected to the Embedded objects.
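The write path above can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `ToyVectorStore`, its methods and its parameters are hypothetical and are not the API of any real Vector Database. The vector index is a very simplified take on Locality-sensitive Hashing using random hyperplanes, and the metadata index is kept separately, as in step 5.

```python
import random
from collections import defaultdict


class ToyVectorStore:
    """Hypothetical in-memory store; not the API of any real product."""

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes for a Locality-sensitive Hashing style index.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.records = {}                       # id -> (vector, metadata)
        self.vector_index = defaultdict(set)    # LSH bucket -> ids
        self.metadata_index = defaultdict(set)  # (field, value) -> ids

    def _bucket(self, vector):
        # One bit per hyperplane: which side of the plane the vector is on.
        return tuple(sum(p * v for p, v in zip(plane, vector)) >= 0
                     for plane in self.planes)

    def _unindex(self, item_id):
        vector, metadata = self.records[item_id]
        self.vector_index[self._bucket(vector)].discard(item_id)
        for field, value in metadata.items():
            self.metadata_index[(field, value)].discard(item_id)

    def upsert(self, item_id, vector, metadata):
        """Write or update: store the record and refresh both indexes."""
        if item_id in self.records:
            self._unindex(item_id)
        self.records[item_id] = (vector, metadata)
        self.vector_index[self._bucket(vector)].add(item_id)
        for field, value in metadata.items():  # values must be hashable
            self.metadata_index[(field, value)].add(item_id)

    def candidates(self, query_vector, where=None):
        """Ids in the query's LSH bucket, restricted by metadata filters."""
        ids = set(self.vector_index.get(self._bucket(query_vector), set()))
        for field, value in (where or {}).items():
            ids &= self.metadata_index.get((field, value), set())
        return ids


store = ToyVectorStore(dim=3)
store.upsert("doc-1", [0.9, 0.1, 0.0], {"lang": "en"})
store.upsert("doc-2", [0.9, 0.1, 0.0], {"lang": "lt"})
store.upsert("doc-1", [0.0, 0.2, 0.9], {"lang": "en"})  # update in place

print(store.candidates([0.0, 0.2, 0.9], where={"lang": "en"}))  # {'doc-1'}
```

Vectors that land in the same bucket become candidates without scanning the whole collection. The price is that a true neighbour sitting just across a hyperplane can be missed, which is the "approximate" part of this kind of index.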
Reading Data.