With the rising popularity of generative artificial intelligence (AI) and frameworks like LangChain, companies and teams are racing to apply the technology to their repositories of data. LangChain is a powerful open-source framework that simplifies the development of large language model (LLM) applications such as chatbots, generative question answering, summarization, code analysis, and more.
In this first post of a three-part series, we’ll demonstrate how to create a Q&A chatbot with LangChain and a Pure Storage FlashBlade® S3 bucket.
Step 1: FlashBlade Prep
To create a chatbot that can answer questions from a repository of data on a FlashBlade S3 bucket, you’ll need to load data and get it ready for chunking, embedding, and indexing in a vectorstore for similarity searching.
Let’s start with the FlashBlade side before we dive into the coding portion. In most cases, you will already have a FlashBlade S3 bucket with data residing in it, so the following sub-steps will already be complete:
- Create a FlashBlade S3 Account, User, and Bucket
- Securely store a FlashBlade S3 User secret key and access key
- Configure at least one Data VIP on FlashBlade. This will be the endpoint IP for the LangChain configuration later.
The above information can be gathered through the FlashBlade GUI, the CLI, or API calls.
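One way to keep the access key and secret key out of source code is to read them from environment variables at startup. The sketch below assumes hypothetical variable names (`FB_S3_BUCKET`, `FB_S3_ACCESS_KEY`, `FB_S3_SECRET_KEY`, `FB_DATA_VIP`); none of them are defined by FlashBlade or LangChain, so adapt them to whatever secret store your team uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashBladeS3Config:
    bucket: str
    access_key: str
    secret_key: str
    endpoint_url: str


def load_flashblade_config(env=None):
    """Build the S3 settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    required = ("FB_S3_BUCKET", "FB_S3_ACCESS_KEY", "FB_S3_SECRET_KEY", "FB_DATA_VIP")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending empty credentials.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return FlashBladeS3Config(
        bucket=env["FB_S3_BUCKET"],
        access_key=env["FB_S3_ACCESS_KEY"],
        secret_key=env["FB_S3_SECRET_KEY"],
        # The Data VIP becomes the S3 endpoint; HTTPS is assumed here.
        endpoint_url="https://" + env["FB_DATA_VIP"],
    )
```

The resulting fields map onto the bucket name and the `aws_access_key_id`, `aws_secret_access_key`, and `endpoint_url` arguments used in Step 2.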
Step 2: LangChain Configuration and Data Loading
First things first, make sure LangChain, unstructured, and boto3 are installed.
    pip install langchain unstructured boto3
Now we can start our Python application by importing the LangChain S3DirectoryLoader, initializing the loader with all of our FlashBlade information, and loading the bucket data as a list for later use:
    from langchain.document_loaders import S3DirectoryLoader

    # Point the loader at the FlashBlade bucket; replace each placeholder with your own values.
    loader = S3DirectoryLoader(
        "FB Bucket Name",
        aws_access_key_id="FB User Access Key",
        aws_secret_access_key="FB User Secret Key",
        endpoint_url="https://FB Data VIP Address"
    )

    # Load every object in the bucket into a list of Document objects.
    documents = loader.load()
Running this code will return an object similar to this, with a document entry for each document within the bucket:
    [Document(page_content='I have many leather-bound books and my apartment smells of rich mahogany.', metadata={'source': 's3://flashblade-bucket/anchorman.docx'}),
     Document(page_content='I award you no points, and may God have mercy on your soul.', metadata={'source': 's3://flashblade-bucket/billymadison.docx'})]
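Before moving on, it can help to sanity-check what was loaded. The sketch below totals the characters per source object. It uses a small stand-in `Document` dataclass with the same `page_content` and `metadata` attributes shown above so that it runs without LangChain installed, and the helper name `summarize_documents` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document, limited to the two attributes shown above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_documents(documents):
    """Return {source: character count} for a list of loaded documents."""
    summary = {}
    for doc in documents:
        # Objects without a recorded source are grouped under a placeholder key.
        source = doc.metadata.get("source", "<unknown>")
        summary[source] = summary.get(source, 0) + len(doc.page_content)
    return summary


docs = [
    Document(
        "I award you no points, and may God have mercy on your soul.",
        {"source": "s3://flashblade-bucket/billymadison.docx"},
    ),
]
for source, chars in summarize_documents(docs).items():
    print(f"{source}: {chars} characters")
```

With real data, the list returned by `loader.load()` could be passed in place of `docs`, since only those two attributes are read.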
Because S3DirectoryLoader uses boto3 under the hood, there are some transfer parameters we can change to increase throughput from the FlashBlade. Here’s an example of the settings:
    max_concurrent_requests = 1000
    max_queue_size = 10000
    multipart_threshold = 64MB
    multipart_chunksize = 16MB
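The setting names above follow the AWS CLI's S3 configuration style. When calling boto3 directly, the closest analogues are keyword arguments of `boto3.s3.transfer.TransferConfig`, which take sizes in bytes. The sketch below converts the values. The helper names and the mapping of `max_concurrent_requests` to `max_concurrency` are assumptions, `max_queue_size` is left unmapped in this sketch, and whether S3DirectoryLoader honors such settings depends on the LangChain version.

```python
import re

_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(size):
    """Convert a string such as '64MB' into a byte count (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+)\s*(KB|MB|GB)?\s*", size, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    return int(number) * (_UNITS[unit.upper()] if unit else 1)


def transfer_kwargs(settings):
    """Map CLI-style settings onto TransferConfig keyword arguments (assumed mapping)."""
    return {
        "max_concurrency": int(settings["max_concurrent_requests"]),
        "multipart_threshold": to_bytes(settings["multipart_threshold"]),
        "multipart_chunksize": to_bytes(settings["multipart_chunksize"]),
    }


def build_transfer_config(settings):
    """Create a boto3 TransferConfig; boto3 is imported lazily so the helpers above run without it."""
    from boto3.s3.transfer import TransferConfig
    return TransferConfig(**transfer_kwargs(settings))


settings = {
    "max_concurrent_requests": "1000",
    "multipart_threshold": "64MB",
    "multipart_chunksize": "16MB",
}
print(transfer_kwargs(settings))
```

A config built this way can be passed as the `Config` argument of a boto3 client's `download_file` or `upload_file` call.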
For more performance-based analysis, check out this blog written by Joshua Robinson that compares performance across several S3 transfer tools. (Spoiler: boto3 is not the fastest option.)
As LangChain evolves, it would be useful for S3DirectoryLoader to support additional data transfer options, such as s5cmd.
Stay Tuned for More Tutorials
We now have our LangChain code connecting to an on-premises, high-performance FlashBlade without a large lift! In the next installment of this blog series, we’ll cover taking this newly loaded data and chunking it, embedding the chunks, creating the vectorstore, and persisting that index to storage.
By using FlashBlade as the fast, scalable data platform beneath this chatbot framework, we keep our data in a centralized location that can both ingest data from various sources and retrieve large amounts of it quickly. That lets AI practitioners use in-house data sets with business-rich context to train and deploy more accurate models.