In Part 1 of this series, we leveraged LangChain’s built-in capabilities to load data from a FlashBlade® S3 bucket and reviewed various settings to optimize throughput performance. In this article, we’ll walk through the next steps: taking the data residing on FlashBlade, splitting it into chunks, embedding those chunks, loading them into a vectorstore, persisting the vectorstore to storage for easy future retrieval, and demonstrating how our chunking choices affect the accuracy of similarity search.
Step 1: Chunking
We last left off using LangChain’s S3DirectoryLoader to load documents into memory for processing. We now have to take the contents of those documents and split them into chunks for quicker retrieval later on in the chatbot pipeline. This can be done seamlessly by leveraging S3DirectoryLoader’s load_and_split() function.
In order to use this function, we’ll need to define our text splitter along with an appropriate chunk size and chunk overlap. I set the chunk_size to something very small to illustrate what the output will look like below; normally you would tune this for your environment’s compute capabilities, your data set’s contents, and the desired Q&A output accuracy. Play around with the chunk size and chunk overlap values and you’ll see how they affect the accuracy of the similarity search results, something we’ll cover later in this article.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=0
)
documents = loader.load_and_split(text_splitter) |
The code above will result in an output similar to this, showing our data was chunked successfully into more document objects:
[Document(page_content='I have many leather-bound books', metadata={'source': 's3://flashblade-bucket/anchorman.txt'}),
 Document(page_content='and my apartment smells of rich mahogany.', metadata={'source': 's3://flashblade-bucket/anchorman.txt'}),
 Document(page_content='I award you no points, and may', metadata={'source': 's3://flashblade-bucket/billymadison.txt'}),
 Document(page_content='God have mercy on your soul.', metadata={'source': 's3://flashblade-bucket/billymadison.txt'})]
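To get a feel for how these settings interact, here’s a small, hypothetical experiment that reruns the splitter with a few different chunk_size and chunk_overlap values (the values below are only illustrative) and prints how many chunks each combination produces:

# hypothetical experiment: rerun the splitter with different settings
# and compare how many chunks each combination produces
for size, overlap in [(30, 0), (200, 20), (500, 50)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    chunks = loader.load_and_split(splitter)
    print(f"chunk_size={size}, chunk_overlap={overlap} -> {len(chunks)} chunks")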
Step 2: Embedding and Storing to a Vectorstore
The next step is to embed the chunks in preparation for storing them in a vectorstore. Embedding is the process of converting a chunk of text into a vector of numerical values that captures its meaning; this lets us search not just for individual words but for text with similar meaning and surrounding context.
This step varies depending on your model of choice. For example, the code would be different if we’re leveraging OpenAI’s API versus using local models sourced from Hugging Face. Both of these (among many others) are supported in LangChain. Since most enterprises are unable to send their proprietary data to OpenAI for various legal reasons, let’s work with local models from Hugging Face.
We need to import LangChain’s HuggingFaceEmbeddings class and pass in our preferred sentence transformer model, which will handle the embedding logic. In the example below, I picked one of the most popular sentence transformer models, but others are available with different pros and cons. Using a different model from Hugging Face is as simple as changing the model name.
from langchain.embeddings import HuggingFaceEmbeddings

# model names can be found on https://huggingface.co/models
model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)
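As a quick sanity check (an optional step, not part of the original walkthrough), you can embed a single sentence and inspect the resulting vector; the all-mpnet-base-v2 model produces 768-dimensional vectors:

# optional sanity check: embed one sentence and inspect the vector
sample_vector = embeddings.embed_query("I love lamp.")
print(len(sample_vector))  # all-mpnet-base-v2 returns a 768-dimensional vector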
Now that we have our embedding instructions ready, let’s create our vectorstore and finally glue our chunks, embedding logic, and vectorstore pieces together. There are many different vectorstore technologies available. In this tutorial, we’ll leverage FAISS from Meta due to its ease of deployment, scalability, and search performance.
Let’s install the CPU version of FAISS for this tutorial (a GPU version is also available):
pip install faiss-cpu
And now our Python code to create the vectorstore:
from langchain.vectorstores.faiss import FAISS

# create the FAISS vectorstore
vectorstore = FAISS.from_documents(documents, embeddings)
We now have a FAISS vectorstore loaded with embedding representations of our chunked documents and are ready for similarity searching.
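As an optional check (not required by the pipeline), the underlying FAISS index reports how many vectors it holds, which should match the number of chunked documents we passed in:

# optional check: number of vectors stored in the underlying FAISS index
print(vectorstore.index.ntotal)  # should equal len(documents)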
But before we start searching, we’ve got one more important step to do. Right now, that vectorstore is in memory, and we would need to redo the above steps every time we launched the application. Let’s see how we can persist this to storage and recall it back.
Step 3: Persisting the Vectorstore to FlashBlade
In this step, we’re going to leverage pickle in conjunction with boto3 to store the vectorstore object in a FlashBlade S3 bucket. Pickle is a Python module that serializes a Python object into a byte stream, and boto3 handles transmitting that byte stream to FlashBlade.
import boto3
import pickle

# prep boto3 for sending data to FlashBlade S3 bucket
s3_client = boto3.client(
    "s3",
    aws_access_key_id="FB User Access Key",
    aws_secret_access_key="FB User Secret Key",
    endpoint_url="https://FB Data VIP"
)

# use pickle to create vectorstore file and send to FlashBlade via boto3
pickle_byte_obj = pickle.dumps(vectorstore)
bucket = "FlashBlade Bucket Name"
key = "vectorstore.pkl"
s3_client.put_object(Body=pickle_byte_obj, Bucket=bucket, Key=key)
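If you want to confirm the upload succeeded (an optional check, not part of the original walkthrough), boto3’s head_object call returns the object’s metadata:

# optional check: confirm the pickled vectorstore landed in the bucket
head = s3_client.head_object(Bucket=bucket, Key=key)
print(f"{key} written, {head['ContentLength']} bytes")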
Now when we need the vectorstore in production, we can simply load it back with the following code instead of having to reload, chunk, embed, and rebuild the vectorstore every time:
# retrieve the pickled vectorstore from FlashBlade and load it back into memory
response = s3_client.get_object(
    Bucket="FlashBlade Bucket Name",
    Key="vectorstore.pkl"
)
body = response['Body'].read()
vectorstore = pickle.loads(body)
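In an application, it can be convenient to wrap that retrieval logic in a small startup helper. The sketch below is one way to do it, reusing the s3_client we defined earlier (the function name is ours, not LangChain’s):

# hypothetical helper: fetch and unpickle the vectorstore at application startup
def load_vectorstore(s3_client, bucket, key="vectorstore.pkl"):
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response['Body'].read())

vectorstore = load_vectorstore(s3_client, "FlashBlade Bucket Name")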
Step 4: Querying the Vectorstore
We’re finally at an important stage of the chatbot pipeline: testing whether the vectorstore returns accurate document chunks for a given query. A FAISS vectorstore supports several search methods, such as similarity search and max marginal relevance search, each with synchronous and asynchronous versions, as well as the option to return a relevancy score for each chunk. We’ll use the simple similarity search call to demonstrate that our vectorstore is working:
query = "What does my apartment smell like?"

# k specifies the number of documents to return, default is 4
docs = vectorstore.similarity_search(query, k=2)
print(docs)

[Document(page_content='books and my apartment smells', metadata={'source': 's3://flashblade-bucket/anchorman.txt'}),
 Document(page_content='of rich mahogany.', metadata={'source': 's3://flashblade-bucket/anchorman.txt'})]
The similarity search worked and returned enough content to contain our answer. But notice how our chunking and k-value choices affect the results: we chunked to 30 characters earlier (a very small value, for demonstration purposes) and set the similarity search k-value to 2 so it would return two chunks. If we had set k=1, we would not have gotten the full context (the “rich mahogany” text). Alternatively, if we had increased our chunk size and overlap and left k=1, we would have received the correct context in a single chunk. This is why it’s important to find the right balance of chunk size, overlap, and k-value: large enough to capture the full context, but small enough that we don’t have to feed a ton of text into our LLM in the following tutorials.
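The other search methods mentioned above follow the same pattern. For example, a scored similarity search and a max marginal relevance search against the same query look like this (a sketch; output omitted):

# similarity search that also returns a relevancy score for each chunk
docs_and_scores = vectorstore.similarity_search_with_score(query, k=2)

# max marginal relevance search balances relevance with result diversity
mmr_docs = vectorstore.max_marginal_relevance_search(query, k=2)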
Stay Tuned for More Tutorials
Let’s review what we’ve accomplished in Part 1 and Part 2 of this blog series so far. We’ve set up a LangChain environment that pulls documents from a FlashBlade S3 bucket into memory and reviewed various data movement tools for performance considerations. We then chunked and embedded those documents, created a FAISS vectorstore loaded with the chunked embeddings, persisted the vectorstore as a pkl file to a FlashBlade S3 bucket, showed how to retrieve that pkl file from FlashBlade back into memory, queried the vectorstore, and received a document chunk containing the answer to our question.
In our next blog post in the series, we’ll cover:
Logging, tracing, and debugging a chain
Passing the relevant document into an LLM chain for inference where we’ll receive a definitive answer and not just a chunk of documentation