SAI Notes #07: What is a Vector Database?
Let's look into what Vector Databases are and how they might empower your Machine Learning applications.
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help you upskill and keep you updated on the latest news in Data Engineering, MLOps, Machine Learning and the overall Data space.
This week in the weekly SAI Notes:
What is a Vector Database?
Partitioning vs. Bucketing in Spark.
What is a Vector Database?
With the rise of Foundation Models, Vector Databases have skyrocketed in popularity as well. The truth is that a Vector Database is useful outside of the Large Language Model context too.
When it comes to Machine Learning, we often deal with Vector Embeddings. A Vector Embedding is created by casting the contextual (feature) information of an object into a Latent (Embedding) space, by running the object's features through a specific Machine Learning model.
Vector Databases are built to perform particularly well when working with Vector Embeddings. Work in this context includes storing, updating and retrieving vectors. When we talk about retrieval, we usually mean retrieving vectors that are most similar to a query which is embedded into the same Latent space and passed to the Vector DB. This retrieval procedure is called Approximate Nearest Neighbour (ANN, more on it in point 10) search.
A query here could come in the form of an object, e.g. an image for which we would like to find similar images. Or it could be a question for which we want to retrieve relevant context that could later be transformed into an answer via e.g. a Large Language Model.
Let’s look into how one would interact with a regular Vector Database:
1. Choose a Machine Learning model that would be used to generate Vector Embeddings.
2. You can embed any type of information: text, images, audio, tabular data. The choice of the ML model used for embedding will depend on the type of data.
3. After running the preprocessed data through the Embedding Model you get a Vector representation of your data.
4. Most modern vector databases will allow storing additional metadata together with the Vector Embedding. This data would later be used to pre-filter or post-filter ANN search results.
5. The Vector Database indexes the received Vector Embedding and the metadata separately. Indexes are created for faster retrieval of data when performing queries. There are multiple methods that can be used for creating vector indexes, some of them being: Random Projection, Product Quantization, Locality-Sensitive Hashing and Hierarchical Navigable Small World.
6. Vector data is stored together with indexes for Vector Embeddings and metadata connected to the Embedded objects.
7. A query to be executed against a Vector Database will usually consist of two parts:
Data that will be used for ANN search. e.g. an image for which you want to find similar ones.
Metadata query to exclude Vectors that hold specific qualities known beforehand. E.g. given that you are looking for similar images of apartments, you can exclude apartments in a specific location if this information is included in the metadata.
8. The Metadata Query is executed against the metadata index. It can be done before or after the ANN search procedure.
9. You embed the query data into the Latent space with the same model that was used for writing the data to the Vector Database.
10. Depending on what indexing method was used, the Database might need to index the Query vector as well. Once done, an ANN search procedure is applied and a set of Vector Embeddings is retrieved. Popular similarity measures for ANN search include: Cosine Similarity, Euclidean Distance and Dot Product.
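As a minimal sketch of the retrieval step, here is a brute-force cosine-similarity search over a handful of made-up embeddings (a real Vector Database would answer this with an approximate index such as HNSW instead of scanning everything):

```python
import numpy as np

def cosine_scores(query, vectors):
    # Cosine similarity between one query vector and all stored vectors.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return v @ q

# Toy "database" of four stored embeddings (values are made up).
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
])

query = np.array([1.0, 0.0, 0.0])      # the embedded query
scores = cosine_scores(query, embeddings)
top_k = np.argsort(scores)[::-1][:2]   # indices of the 2 nearest neighbours
print(top_k.tolist())
```

A production database replaces the full scan with one of the index structures listed above, trading a little accuracy for a large speed-up.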
Possible Query Paths.
As mentioned before, there are two possible query paths (Figure 2) when Metadata Query is involved:
Path 1: Apply ANN search first and run Metadata Query on retrieved results.
Path 2: Run Metadata Query first and then apply ANN search on filtered results.
Both paths have their own pros and cons. E.g. depending on the Vector Indexing method, you might lose some relevant context for the ANN search when applying the Metadata Query first. On the other hand, applying the Metadata Query first can reduce the search space significantly and thereby improve overall query performance.
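The two paths can be illustrated with a tiny brute-force search standing in for the ANN index (all vectors and metadata below are invented):

```python
import numpy as np

def nearest(query, vectors, ids, k):
    # Brute-force cosine search; a stand-in for the ANN index.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    order = np.argsort(v @ q)[::-1][:k]
    return [ids[i] for i in order]

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
location = ["Vilnius", "Kaunas", "Vilnius", "Kaunas"]  # metadata
query = np.array([1.0, 0.0])

# Path 1: ANN search first, Metadata Query on the retrieved results.
hits = nearest(query, vectors, list(range(4)), k=2)
path_1 = [i for i in hits if location[i] == "Kaunas"]

# Path 2: Metadata Query first, ANN search on the filtered subset.
subset = [i for i in range(4) if location[i] == "Kaunas"]
path_2 = nearest(query, vectors[subset], subset, k=2)

print(path_1, path_2)
```

Note how Path 1 can come back with fewer than k results after filtering, while Path 2 searches only the pre-filtered subset, which is exactly the trade-off described above.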
Example use cases for Vector Databases.
As mentioned in the article, any data can be embedded into a vector embedding, hence there are also a lot of use cases for Vector Databases in real life. Some of them are:
Natural Language Processing.
An example would be a popular technique of providing context to an LLM. Let’s say you want to create a chatbot that would be able to answer questions only about articles in your Substack. You could embed all of the articles into vectors and store them in a Vector Database. When asking a question, you would embed the question into the same Latent space and retrieve most relevant text from the Vector Database by running a query using the embedded question. Then, you send the retrieved pieces of text to the LLM to construct an answer from it.
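A rough sketch of that flow, with a toy word-overlap "embedding" standing in for a real embedding model (the function names and article texts here are invented for illustration):

```python
def embed(text):
    # Toy stand-in for an embedding model: a set of lowercase words.
    # A real model would return a dense vector in the Latent space.
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap as a crude stand-in for cosine similarity.
    return len(a & b) / len(a | b)

# "Vector Database": embedded articles stored next to the raw text.
articles = [
    "spark partitioning splits data into folders on disk",
    "vector databases store and retrieve embeddings efficiently",
]
index = [(embed(text), text) for text in articles]

question = "how do vector databases retrieve embeddings"
context = max(index, key=lambda item: similarity(embed(question), item[0]))[1]

# The retrieved context plus the question would now be sent to an LLM.
prompt = f"Context: {context}\nQuestion: {question}"
print(context)
```

The important part is that the question and the articles are embedded with the same model, so similarity in the Latent space corresponds to relevance.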
Computer Vision.
A good example is identification of the same objects in photos taken from similar or different angles. Let’s say you are trying to identify whether the room listed on Airbnb is the same as the one on Booking.com. You could run an Airbnb photo through an ANN search in the Booking.com Vector Database index and identify the rooms that are the most similar feature-wise.
Recommender Systems.
These systems are usually composed of two consecutive steps:
Candidate retrieval. Given a query, retrieve a smaller number of candidates for ranking, which would be performed using a heavier, computationally expensive model. This is where you would use a Vector Database, as it is able to efficiently retrieve large amounts of vectors similar to the query vector.
Ranking. Rescoring candidates retrieved by the previous step using a heavy model.
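The two stages can be sketched as follows; a brute-force cosine search stands in for the Vector Database, and the "heavy" model is faked with a plain dot product so the example stays self-contained (all numbers are made up):

```python
import numpy as np

def retrieve_candidates(query, vectors, k):
    # Stage 1: cheap retrieval (brute-force cosine here, standing in
    # for the Vector Database's ANN search).
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return [int(i) for i in np.argsort(v @ q)[::-1][:k]]

def heavy_score(query, vec):
    # Stage 2: a computationally expensive model would score here;
    # a plain dot product keeps the example runnable.
    return float(query @ vec)

vectors = np.array([[0.5, 0.0], [2.0, 0.2], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])

candidates = retrieve_candidates(query, vectors, k=3)
ranked = sorted(candidates, key=lambda i: heavy_score(query, vectors[i]),
                reverse=True)
print(candidates, ranked)  # the ranking stage reorders the candidates
```

Notice that the expensive scoring only ever sees the small candidate set, which is the whole point of the two-stage design.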
Some popular Vector Databases: Qdrant, Pinecone, Weaviate, Milvus, Faiss (strictly speaking, a similarity search library rather than a full database).
While not Vector Databases themselves, more and more database providers are starting to include ANN search capabilities; these include databases like Redis, Cassandra etc.
Partitioning vs. Bucketing in Spark.
When working with big data, there are many important concepts to consider about how the data is stored, both on disk and in memory. We should try to answer questions like:
➡️ Can we achieve desired parallelism?
➡️ Can we skip reading parts of the data?
✅ These questions are addressed by the partitioning and bucketing procedures.
➡️ How is the data colocated on disk?
✅ This question is mostly addressed by bucketing.
So what are the procedures of Partitioning and Bucketing? 𝗟𝗲𝘁'𝘀 𝘇𝗼𝗼𝗺 𝗶𝗻.
➡️ Partitioning in Spark API is implemented by .partitionBy() method of the DataFrameWriter class.
➡️ You provide the method one or multiple columns to partition by.
➡️ The dataset is written to disk split by the partitioning column, each of the partitions is saved into a separate folder on disk.
➡️ Each folder can contain multiple files; the number of resulting files depends on the number of in-memory partitions of the DataFrame at write time (after a shuffle, this is controlled by the setting spark.sql.shuffle.partitions).
✅ Partitioning enables Partition Pruning. Given that we filter on a column that we partitioned the DataFrame by, Spark can plan to skip reading the files that do not fall into the filter condition.
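A conceptual sketch of what partition pruning buys you (plain Python standing in for Spark's on-disk layout; the column and values are invented):

```python
# Writing with .partitionBy("country") produces one folder per value.
dataset = [
    {"country": "LT", "user": "a"},
    {"country": "LT", "user": "b"},
    {"country": "DE", "user": "c"},
]

# "Write": group rows into per-partition folders, Hive-style.
folders = {}
for row in dataset:
    folders.setdefault(f"country={row['country']}", []).append(row)

# "Read" with a filter on the partitioning column: the engine only
# opens the matching folder; the country=DE folder is never scanned.
scanned = folders["country=LT"]
print(sorted(folders), len(scanned))
```

The folder names themselves carry the filter information, which is why the pruning can happen at planning time before any data file is opened.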
➡️ Bucketing in Spark API is implemented by .bucketBy() method of the DataFrameWriter class.
𝟭: We have to save the dataset as a table, since the metadata of the buckets has to be saved somewhere. Usually, you will find a Hive metastore leveraged here.
𝟮: You will need to provide the number of buckets you want to create. The bucket number for a given row is assigned by calculating a hash of the bucket column and taking that hash modulo the number of desired buckets.
𝟯: Rows of a dataset being bucketed are assigned to a specific bucket and colocated when saving to disk.
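The bucket-assignment rule from point 𝟮 can be sketched in plain Python (Spark actually uses a Murmur3 hash; the toy character-sum hash below just keeps the example self-contained):

```python
def bucket_for(value, num_buckets):
    # Hash the bucket column value, then take it modulo the number
    # of buckets. A toy character-sum hash stands in for Murmur3.
    h = sum(ord(c) for c in str(value))
    return h % num_buckets

user_ids = ["user_1", "user_2", "user_3", "user_42"]
buckets = {u: bucket_for(u, 4) for u in user_ids}
print(buckets)
# The same value always lands in the same bucket, so equal join keys
# from two tables bucketed the same way end up colocated.
```

That determinism is what lets Spark skip the shuffle: if both sides of a join were bucketed by the same column into the same number of buckets, matching keys are guaranteed to sit in matching buckets.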
✅ If Spark performs a wide transformation (e.g. a join) between two DataFrames bucketed on the same column, it might not need to shuffle the data, as it is already colocated in the executors correctly and Spark is able to plan for that.
❗️There are conditions that need to be met between two datasets in order for bucketing to have desired effect.
𝗪𝗵𝗲𝗻 𝘁𝗼 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝘄𝗵𝗲𝗻 𝘁𝗼 𝗕𝘂𝗰𝗸𝗲𝘁?
✅ If you will often perform filtering on a given column and it is of low cardinality, partition on that column.
✅ If you will be performing complex operations like joins, groupBys and windowing and the column is of high cardinality, consider bucketing on that column.
❗️Bucketing is complicated to nail as there are many caveats and nuances you need to know when it comes to it. More on it in future posts.
Join SwirlAI Data Talent Collective
If you are looking to fill your Hiring Pipeline with Data Talent or you are looking for a new job opportunity in the Data Space check out SwirlAI Data Talent Collective! Find out how it works by following the link below.