
In Part 1 of this series, we leveraged LangChain’s built-in capabilities to load data from a FlashBlade® S3 bucket and reviewed various settings to optimize throughput performance. In this article, we’ll cover the next steps: taking the data residing on FlashBlade, chunking it up, embedding it, loading a vectorstore, persisting that vectorstore to storage for easy future retrieval, and demonstrating how the choices we make in chunking affect the accuracy of similarity search.

Step 1: Chunking

We last left off using LangChain’s S3DirectoryLoader to load documents into memory for processing. We now have to take the contents of those documents and split them into chunks for quicker retrieval later on in the chatbot pipeline. This can be done seamlessly by leveraging S3DirectoryLoader’s load_and_split() function.

In order to use this function, we’ll need to define our text splitter along with an appropriate chunk size and chunk overlap. I set the chunk_size to something really small (30 characters) to illustrate what the output will look like below; normally you would set this to something optimized for your environment’s compute capabilities, data set contents, and desired Q&A output accuracy. Play around with the chunk size and chunk overlap values and you’ll see how they affect the accuracy of the similarity search results, something we’ll cover later on in this article.
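Here’s a minimal sketch of that step, reusing the S3DirectoryLoader from Part 1 (the bucket name is a placeholder, and the S3 endpoint/credentials are assumed to be configured as in Part 1) together with LangChain’s RecursiveCharacterTextSplitter; the chunk_overlap value is an arbitrary small choice for this demo:

    from langchain.document_loaders import S3DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Loader pointed at the FlashBlade S3 bucket from Part 1
    # (bucket name is a placeholder; endpoint and credentials configured as in Part 1)
    loader = S3DirectoryLoader("my-demo-bucket")

    # Deliberately tiny chunk_size (30 characters) so the chunking is easy to see;
    # chunk_overlap here is an arbitrary small value for demonstration
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=5)

    # Load the documents from FlashBlade and split them into chunks in one call
    docs = loader.load_and_split(text_splitter=text_splitter)
    print(docs)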

Running the code above produces a long list of Document objects, showing that our data was chunked successfully into many more, smaller documents.

Step 2: Embedding and Storing to a Vectorstore

The next step will be to embed the chunks in preparation for storing them in a vectorstore. Embedding is an important process in which a vector of values is created to represent a chunk of text; this lets us search by the meaning of an entire chunk and its surrounding context rather than just matching individual words.

This step varies depending on your LLM of choice. For example, the code would be different if we’re leveraging OpenAI’s API or if we’re using local models sourced from Hugging Face. Both of these (among many others) are supported in LangChain. Since most enterprises are unable to send their proprietary data to OpenAI for various legal reasons, let’s work on using local LLMs from Hugging Face.

We need to import LangChain’s HuggingFaceEmbeddings class and pass in our preferred sentence transformer model, which will handle the embedding logic. In the example below, I picked the most popular sentence transformer model, but there are others available with different pros and cons. Using a different model from Hugging Face is as simple as changing the model name.
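A minimal sketch; I’m assuming the widely used all-MiniLM-L6-v2 sentence transformer here, and the exact import path may vary with your LangChain version:

    from langchain.embeddings import HuggingFaceEmbeddings

    # all-MiniLM-L6-v2 is a popular general-purpose sentence transformer;
    # swap model_name to experiment with other Hugging Face models
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")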

Now that we have our embedding instructions ready, let’s create our vectorstore and finally glue our chunks, embedding logic, and vectorstore pieces together. There are many different vectorstore technologies available. In this tutorial, we’ll leverage FAISS from Meta due to its ease of deployment, scalability, and search performance.

Let’s install the CPU version of FAISS for this tutorial (a GPU version is also available):
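For example, with pip (the GPU build is published as faiss-gpu):

    pip install faiss-cpu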

And now our Python code to create the vectorstore:
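(A minimal sketch, reusing the docs and embeddings objects from the steps above; the import path can vary by LangChain version.)

    from langchain.vectorstores import FAISS

    # Embed every chunk with the HuggingFaceEmbeddings instance and
    # index the resulting vectors in an in-memory FAISS vectorstore
    vectorstore = FAISS.from_documents(docs, embeddings)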

We now have a FAISS vectorstore loaded with embedding representations of our chunked documents and are ready for similarity searching. 

But before we start searching, we’ve got one more important step to do. Right now, that vectorstore is in memory, and we would need to redo the above steps every time we launched the application. Let’s see how we can persist this to storage and recall it back.

Step 3: Persisting the Vectorstore to FlashBlade

In this step, we’re going to leverage pickle in conjunction with boto3 to store the vectorstore variable to a FlashBlade S3 bucket. Pickle is a Python module that converts a Python object into a byte stream, and boto3 takes that byte stream and handles transmitting the data to FlashBlade.
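A sketch of that flow; the endpoint URL, credentials, bucket, and object key below are placeholders for your own FlashBlade environment:

    import pickle
    import boto3

    # Serialize the in-memory FAISS vectorstore to a byte stream
    vectorstore_bytes = pickle.dumps(vectorstore)

    # boto3 client pointed at the FlashBlade S3 data VIP (placeholder values)
    s3 = boto3.client(
        "s3",
        endpoint_url="https://flashblade-data-vip.example.com",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Write the pickled vectorstore to the bucket as a .pkl object
    s3.put_object(Bucket="vectorstore-bucket", Key="vectorstore.pkl", Body=vectorstore_bytes)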

Now when we want to load our vectorstore in production, we can use the following code instead of having to reload, chunk, embed, and rebuild the vectorstore every time:
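(Again a sketch, with the same placeholder connection details as above.)

    import pickle
    import boto3

    # Same FlashBlade S3 endpoint and placeholder credentials as before
    s3 = boto3.client(
        "s3",
        endpoint_url="https://flashblade-data-vip.example.com",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Fetch the pickled vectorstore and deserialize it back into memory
    response = s3.get_object(Bucket="vectorstore-bucket", Key="vectorstore.pkl")
    vectorstore = pickle.loads(response["Body"].read())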

Step 4: Querying the Vectorstore

We’re finally at an important stage of the chatbot pipeline: testing the vectorstore for accuracy on the chunks of documents it returns based on a query. For a FAISS vectorstore, there are several search methodologies such as similarity search and max marginal relevance search, each with synchronous and asynchronous versions, as well as the option to display the relevancy scores of each document chunk returned. We’ll use the simple similarity search call to demonstrate our vectorstore is working:
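(A sketch; the query string is a placeholder, so substitute a question whose answer lives in the documents you loaded.)

    # k=2 returns the two most similar chunks for the query
    query = "your question here"  # replace with a question answerable from your documents
    results = vectorstore.similarity_search(query, k=2)

    for doc in results:
        print(doc.page_content)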

The vectorstore similarity search worked and returned chunks containing our answer. But notice how the chunking and the k-value affect the results: we chunked to 30 characters earlier (a very small value, for demonstration purposes) and set our similarity search k-value to 2 so it would return two chunks. If we had set k=1, we would not have gotten the correct context (the “rich mahogany” text). Alternatively, if we had increased our chunk size and overlap to larger values and left k=1, we would have received the correct context. This is why it’s important to find the right balance of chunk size, overlap, and k-value: large enough to capture the full context, yet small enough that we don’t have to load a ton of text into our LLM in the following tutorials.

Stay Tuned for More Tutorials

Let’s review what we’ve accomplished in Part 1 and Part 2 of this blog series so far. We’ve set up a LangChain environment that pulled documents from a FlashBlade S3 bucket into memory, reviewed various data movement tools for performance considerations, chunked and embedded those documents, created and loaded a FAISS vectorstore with our chunked embeddings, persisted the vectorstore to a pkl file stored in a FlashBlade S3 bucket, showed how to retrieve that pkl file from FlashBlade back into memory for use, queried the vectorstore, and received a document chunk that contained the answer to our question.

In our next blog post in the series, we’ll cover:

  • Passing the relevant document into an LLM chain for inference where we’ll receive a definitive answer and not just a chunk of documentation
  • Logging, tracing, and debugging a chain