This blog on vector databases and storage was originally published on Medium.com. It has been republished with the author’s credit and consent.
At this year’s GTC in San Jose, one slide from an NVIDIA session caught my eye, and when I talked about it with my colleagues, they all got excited. In the session, the speaker claimed that because data expands when embedded, and is then stored uncompressed for optimal RAG, it can increase data storage usage by up to 10x.
A 10x increase is a lot: 100TB becomes 1PB, and 1PB becomes 10PB. No wonder people at storage companies got really excited. But is it true that generative AI and RAG really expand data storage usage that much? And why is that? Let me try to test and confirm this.
In my previous blog, I briefly wrote about how RAG works and its data infrastructure. RAG encodes external data so that the relevant parts can easily be retrieved at query time. The best option for storing and retrieving external data for RAG is a vector database, because it supports the similarity search that enables RAG to quickly retrieve data relevant to a user’s query. To understand its storage usage, we need to dig a little deeper.
How Does a Vector Database Work?
A vector database is designed to efficiently store, index, and query data in the form of vectors, which are arrays of numbers representing data points in a high-dimensional space.
Here’s how a vector database typically indexes data (at a high level):
1. Vector representation: First, the data, whether it’s images, text, or any other form of multimedia, is converted into vectors. Each vector represents the features of the data item in a numerical format. This is often done using neural network models or feature extraction algorithms.
2. Indexing: The vectors are then indexed to facilitate efficient retrieval. Vector databases use specialized indexing algorithms to manage and search through these high-dimensional spaces effectively.
3. Querying: When querying a vector database, the query itself is also converted into a vector using the same method as the indexed data. The database then uses the index structure to quickly retrieve vectors that are similar to the query vector, typically by calculating distances or similarities (see the sketch below).
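To make the querying step concrete, here is a minimal, illustrative sketch of similarity search using brute-force L2 distance over random vectors. This is only a toy stand-in: a real vector database replaces the linear scan with an approximate index such as HNSW.

import numpy as np

# Toy "database": 1,000 random vectors in a 768-dimensional space
rng = np.random.default_rng(0)
db_vectors = rng.standard_normal((1000, 768)).astype(np.float32)

def search(query, k=3):
    # Compute the L2 distance from the query to every stored vector,
    # then return the ids of the k closest ones
    distances = np.linalg.norm(db_vectors - query, axis=1)
    return np.argsort(distances)[:k]

query_vector = rng.standard_normal(768).astype(np.float32)
print(search(query_vector))  # ids of the 3 nearest neighbors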
In real applications, hundreds or even thousands of dimensions are common. To handle large data sets and ensure quick retrieval, vector databases often implement additional optimizations. These include using GPUs for parallel computation, distributing the database across multiple machines, and implementing efficient caching mechanisms.
Vector Database Solutions
There are multiple choices of vector database, including open source and commercial software as well as managed services. Some popular ones include Pinecone, Weaviate, Chroma, Qdrant, Milvus, and Vespa. Because a vector database is a critical building block in RAG, some classical databases, such as Redis, MongoDB, and Elasticsearch, have also started to add vector search capabilities.
I chose to use Milvus to verify RAG data usage expansion. Given our focus on AI at scale, I am particularly interested in the following Milvus features:
- GPU support
- Object storage support
- Deployment on Kubernetes as a distributed system
These are supposed to contribute to Milvus’ performance and scalability.
Vector Database and Storage
To test vector database storage usage, I downloaded a batch of papers (152 PDF files) from arXiv and extracted the text from them. I then embedded the text into 768-dimensional vectors using a sentence-transformers model and stored the vectors in Milvus. Finally, I compared the sizes of the original PDFs, the extracted text, and the Milvus database. Some code snippets are shown below.
Extract text from PDF:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    # Concatenate the text of every page into a single string
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for pages without text
            text += page.extract_text() or ""
    return text
Split the text into chunks, and create an embedding vector for each chunk:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

# Split text into 1,000-character chunks with a small overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
splits = text_splitter.split_text(text)

# Create a 768-dimensional embedding for each chunk on the GPU
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
embeddings = hf.embed_documents(splits)
Insert into Milvus:
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType

# Define Milvus schema and create collection
# (`fields` is a list of FieldSchema objects: an auto-generated INT64 primary
# key `id`, scalar fields matching the inserted columns below, and a
# FLOAT_VECTOR field `chunk_embedding` with dim=768)
schema = CollectionSchema(fields, description="arXiv paper", auto_id=True, primary_field="id")
collection = Collection(name="arxiv", schema=schema, shards_num=1)

# Insert into Milvus, one list per (non-auto-id) field
data = [pub_months, file_seqs, file_sizes, chunk_nums, embeddings]
collection.insert(data)
collection.flush()
Create Milvus index:
index_params = {"index_type": "HNSW", "params": {"M": 8, "efConstruction": 200}, "metric_type": "L2"}
collection.create_index(field_name="chunk_embedding", index_params=index_params)
To check that everything works, I conducted a hybrid search in Milvus (a vector similarity search combined with a scalar filter on the publication month):
# Load the Milvus collection into memory to perform the search
collection.load()
query = "Generative AI and RAG application."
query_embedding = hf.embed_query(query)

# Conduct a hybrid vector search
search_param = {
    "data": [query_embedding],
    "anns_field": "chunk_embedding",
    "param": {"metric_type": "L2"},
    "limit": 3,
    "expr": "pub_month == 2401",
}
res = collection.search(**search_param)

# Check the result
hits = res[0]
print(f"- Total hits: {len(hits)}, hits ids: {hits.ids}")
print(f"- Top1 hit id: {hits[0].id}, distance: {hits[0].distance}, score: {hits[0].score}")
Here is the search result:
- Total hits: 3, hits ids: [449571692599744330, 449571692599740033, 449571692599740035]
- Top1 hit id: 449571692599744330, distance: 0.902103066444397, score: 0.902103066444397
Storage Usage Comparison
The PDF and extracted text files are stored on the local filesystem, while the vectors in Milvus are stored in a FlashBlade S3 bucket, so I measured the total S3 usage of the Milvus bucket.
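One way to do this is with a short boto3 script that totals the size of every object in the bucket. Below is a sketch; the endpoint URL and bucket name are placeholders, not my environment’s actual values.

import boto3

# Point the client at the FlashBlade S3 endpoint (placeholder URL)
s3 = boto3.client("s3", endpoint_url="https://flashblade.example.com")

# Sum the size of every object in the Milvus bucket (placeholder name)
total_bytes = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="milvus-bucket"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]

print(f"Milvus bucket usage: {total_bytes / 1024**2:.1f} MB")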
Now let’s compare the storage usage of the PDFs, the extracted text, and the vector database.
- PDF: 520MB
- Text: 11MB
- Vector: 120MB
Observations from the above:
- The PDF format has the biggest size because images and charts are not converted/stored in the other two formats in this test.
- Compared to text, the vector format uses roughly 10x the storage to store the same content (the text data extracted from the PDFs).
Why Does a Vector Database Use 10x More Storage than Text?
Most common English text, including all lowercase and uppercase letters, digits, and basic punctuation, falls within the ASCII range, which UTF-8 encodes using a single byte per character. That works out to about 1KB of storage for every 1,000 characters.
In this test, we split the text into 1,000-character chunks (chunk_size=1000), then encoded each chunk into a 768-dimensional (dim=768) vector of 32-bit floating point numbers. A 32-bit float takes 4 bytes, so each 1KB of text (1,000 characters) becomes 768 x 4 = 3,072 bytes of vector data. That is about 3x amplification, with the exact factor depending on chunk size and vector dimension (the 20-character chunk overlap adds a little more on top).
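As a quick back-of-the-envelope check, the amplification factor is easy to recompute for other chunk sizes and dimensions, assuming 1 byte per character of text and float32 vectors:

# Text-to-vector expansion (assumes 1 byte/char UTF-8 text, float32 vectors,
# and ignores chunk overlap and metadata)
chunk_size = 1000                     # characters per chunk (~1KB of raw text)
dim = 768                             # embedding dimensions
bytes_per_float = 4                   # 32-bit float

vector_bytes = dim * bytes_per_float  # 3,072 bytes per embedding
print(vector_bytes / chunk_size)      # ~3.07x amplification before indexing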
Another big overhead comes from the indices. Depending on the index parameters, index data can be as big as, or even bigger than, the embedding vectors. For example, in my environment, index files take up almost half of the total size of the Milvus bucket. This adds another 3x or so of amplification.
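This is plausible given how HNSW works: the index typically stores a copy of the raw vectors alongside the graph’s neighbor links. Here is a rough per-vector estimate under that assumption (a simplification of the real on-disk layout):

# Rough HNSW index size per vector, assuming the index holds a float32 copy
# of each vector plus up to 2*M four-byte neighbor ids at the base layer
dim, M = 768, 8
vector_copy_bytes = dim * 4            # float32 copy of the embedding
link_bytes = 2 * M * 4                 # neighbor ids at layer 0
print(vector_copy_bytes + link_bytes)  # ~3,136 bytes per vector in the index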
Given the high-dimensional vectors and the index files, I am not surprised to find a roughly 10x data expansion after ingesting text data into a vector database.
Conclusion
In this simple test, I confirmed that generative AI and RAG can indeed increase data storage usage by up to 10x. This is another reason why fast enterprise storage with built-in data compression is crucial for AI.