With the rising popularity of generative artificial intelligence (AI) and frameworks like LangChain, companies and teams are racing to apply the technology to their repositories of data. LangChain is a powerful open source framework that simplifies the development of large language model (LLM) applications such as chatbots, generative question answering, summarization, code analysis, and more.

In this first post of a three-part series, we’ll demonstrate how to create a Q&A chatbot with LangChain and a Pure Storage FlashBlade® S3 bucket.

Step 1: FlashBlade Prep

To create a chatbot that can answer questions from a repository of data on a FlashBlade S3 bucket, you’ll need to load data and get it ready for chunking, embedding, and indexing in a vectorstore for similarity searching.

Let’s start with the FlashBlade side before we dive into the coding portion. In most cases, you would already have a FlashBlade S3 bucket with data residing in it, so the following sub-steps are complete:

  • Create a FlashBlade S3 account, user, and bucket
  • Securely store the FlashBlade S3 user’s access key and secret key
  • Configure at least one data VIP on FlashBlade; this will be the endpoint IP for the LangChain configuration later

The above information can be gathered through the FlashBlade GUI, CLI, or REST API.

Step 2: LangChain Configuration and Data Loading

First things first, make sure LangChain, unstructured, and boto3 are installed.
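All three are available from PyPI; depending on your LangChain version, the S3 loaders may live in the separate langchain-community package, so it’s worth installing that as well:

```shell
pip install langchain langchain-community unstructured boto3
```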

Now we can start our Python application by importing the LangChain S3DirectoryLoader, initializing the loader with our FlashBlade information, and loading the bucket data as a list of documents:

Running this code returns an object similar to the following, with a Document entry for each object in the bucket:
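The exact contents depend on your files, but the returned list has roughly this shape (the paths shown are illustrative):

```python
[
    Document(page_content="...text extracted from the first file...",
             metadata={"source": "s3://langchain-demo/docs/report-1.pdf"}),
    Document(page_content="...text extracted from the second file...",
             metadata={"source": "s3://langchain-demo/docs/report-2.pdf"}),
]
```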

Since the S3DirectoryLoader uses boto3 under the hood, there are some parameters we can tune to increase read throughput from FlashBlade. Here’s an example of the settings:

For more performance-based analysis, check out this blog written by Joshua Robinson that compares performance across several S3 transfer tools. (Spoiler: boto3 is not the fastest option.)

As LangChain evolves, it would be useful for S3DirectoryLoader to gain additional data transfer options, such as s5cmd.

Stay Tuned for More Tutorials

We now have our LangChain code connecting to an on-premises, high-performance FlashBlade without a heavy lift! In the next installment of this blog series, we’ll take this newly loaded data and chunk it, embed the chunks, create the vectorstore, and persist that index to storage.

By leveraging FlashBlade as the fast, scalable data platform foundation for this chatbot framework, we can keep our data in a centralized location that not only ingests data from various sources but also retrieves large amounts of data quickly, allowing AI practitioners to use in-house data sets rich in business context to train and deploy more accurate models.
