LangChain Decoded: Part 4 - Indexes

An exploration of the LangChain framework and modules in multiple parts; this post covers Indexes.

In this multi-part series, I explore various LangChain modules and use cases, and document my journey via Python notebooks on GitHub. The previous post covered LangChain Prompts; this post explores Indexes. Feel free to follow along and fork the repository, or use individual notebooks on Google Colab. Shoutout to the official LangChain documentation though - much of the code is borrowed or influenced by it, and I'm thankful for the clarity it offers.

Over the course of this series, I'll dive into the following topics:

  1. Models
  2. Embeddings
  3. Prompts
  4. Indexes (this post)
  5. Memory
  6. Chains
  7. Agents
  8. Callbacks

Getting Started

LangChain is available on PyPI, so it can be easily installed with pip. By default, the dependencies (e.g. model providers, data stores) are not installed, and should be installed separately based on your specific needs. LangChain also offers a JavaScript implementation, but we'll only use the Python libraries here.

LangChain supports several model providers, but this tutorial will only focus on OpenAI (unless explicitly stated otherwise). Set the OpenAI API key via the OPENAI_API_KEY environment variable, or directly inside the notebook (or your Python code); if you don't have a key, you can get one here. The first option is preferred in general, but especially in production - do not accidentally commit your API key to GitHub!

Follow along in your own Jupyter Python notebook, or click the link below to open the notebook directly in Google Colab.

Open In Colab

# Install the LangChain package
!pip install langchain

# Install the OpenAI package
!pip install openai

# Configure the API key
import os

openai_api_key = os.environ.get('OPENAI_API_KEY', 'sk-XXX')

LangChain Indexes

The pre-trained corpus of knowledge available to large language models (LLMs) is quite phenomenal, but if you want the model to be more attuned to your specific use case, you can provide additional context as part of the request (aka prompt). This context is often provided in the form of documents or data retrieved from external sources - this is where indexes are useful. Indexes often (but not always) form the bridge between documents and the model, providing a simple interface to structured and unstructured data. In this post, we'll explore the LangChain Indexes module, its components, and ways to create and interface with indexes.

In the following sections, we'll discuss the four main components of an index:

  • Document loaders
  • Text splitters
  • Vector stores
  • Retrievers

Indexes: Document Loaders

LangChain offers three broad categories of document loaders:

  • Transform loaders: these loaders transform data from specific formats like CSV, PDF, SQL etc. to the Document format. Most of these loaders are powered by the unstructured Python package.
  • Public dataset or service loaders: these loaders are built for specific public web services like YouTube, Hacker News etc.
  • Private dataset or service loaders: these loaders are built for non-public datasets and services like Google Drive, AWS S3, Slack, Twitter etc., and require authenticated access to those resources.

Document loaders expose two methods: load and load_and_split. The load method simply loads the documents and returns them as a list of Document objects. The load_and_split method loads the documents, splits them using a TextSplitter, and returns the resulting chunks as a list of Document objects.
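
Whichever loader you use, each Document pairs the extracted text with metadata about its source. Here's a minimal sketch of the structure, with hypothetical field values:

# A Document couples raw text with source metadata (hypothetical values)
from langchain.schema import Document

doc = Document(
    page_content="Attention Is All You Need...",
    metadata={"source": "./data/attention-is-all-you-need.pdf", "page": 0},
)
print(doc.metadata)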

Here is an example of the UnstructuredURLLoader class used to load web URLs. For a simple Streamlit app that implements this class, see this gist.

!pip install unstructured tabulate pdf2image pytesseract

# URL Loader
from langchain.document_loaders import UnstructuredURLLoader

urls = ["https://alphasec.io/summarize-text-with-langchain-and-openai"]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
print(data)

Here's another example, this time using the PyPDFLoader class to load and split a PDF file into smaller chunks.

!pip install pypdf

# PDF Loader
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./data/attention-is-all-you-need.pdf")
pages = loader.load_and_split()
pages[0]

If you have multiple files in a directory, you can use the DirectoryLoader class to load them all. Use the glob parameter to control which files to load: ** specifies recursion, and *.csv filters for files with a .csv extension.

# File Directory Loader
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader('data', glob="**/*.csv")
docs = loader.load()
len(docs)
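
By default, DirectoryLoader parses every matched file with the unstructured package; if you'd rather use a format-specific loader for the matched files, you can pass one via the loader_cls argument. A minimal sketch, assuming CSV files:

# Parse the matched CSV files with a format-specific loader
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders.csv_loader import CSVLoader

loader = DirectoryLoader('data', glob="**/*.csv", loader_cls=CSVLoader)
docs = loader.load()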

An example of a public dataset loader is the YoutubeLoader class, which is used to load YouTube transcripts. For a simple Streamlit app that implements this class, see this gist.

!pip install pytube youtube-transcript-api

# YouTube Transcripts Loader
from langchain.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=yEgHrxvLsz0", add_video_info=True)
data = loader.load()
print(data)

An example of a private dataset loader is the GCSFileLoader class, which is used to load files from Google Cloud Storage (GCS). Because GCS objects are private by default, you need to authenticate with Google Cloud before you can access the resources. For a simple Streamlit app that implements this class, see this gist.

!pip install google-cloud-storage

# Google Cloud Storage File Loader
from langchain.document_loaders import GCSFileLoader

loader = GCSFileLoader(project_name="langchain-gcs", bucket="langchain-gcs", blob="lorem-ipsum.txt")
data = loader.load()
print(data)

LangChain provides several other document loaders to get you started; see the official docs for more examples.

Indexes: Text Splitters

If your text sources are very long, you need to split the text into chunks before it can be used, while keeping semantically related pieces of text together. LangChain provides the TextSplitter base class for this. The text is split into small, semantically meaningful chunks of a maximum size (chunk_size), with some overlap between consecutive chunks (chunk_overlap) to preserve context. The length_function determines how chunk length is measured; it defaults to counting characters, but you can use a token counter instead.
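
For instance, if your chunk budget is expressed in tokens rather than characters, you can construct the splitter from a tiktoken encoder so that chunk_size is measured in tokens. A minimal sketch, assuming the tiktoken package is installed:

# Measure chunk length in tokens instead of characters
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=20
)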

Here is an example of the CharacterTextSplitter class, a simple text splitter implementation that splits text by a single character. For a simple Streamlit app that implements this class, see this gist.

Note that CharacterTextSplitter only splits on the separator, so if you keep the chunk_size too small (e.g. 100 characters) and a separator-delimited piece of text exceeds this length, you'll see warnings about oversized chunks during the text splitting process. In such a scenario, you should use the RecursiveCharacterTextSplitter class instead.

# Character Text Splitter
from langchain.text_splitter import CharacterTextSplitter
from google.colab import files

uploaded = files.upload()
filename = next(iter(uploaded))
text = uploaded[filename].decode("utf-8")

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.create_documents([text])
print(texts[0])
print(texts[1])
print(texts[2])

The recommended text splitter for generic text is the RecursiveCharacterTextSplitter class, which iteratively splits text into smaller chunks using an ordered list of separators; by default, the separators are ["\n\n", "\n", " ", ""].

# Recursive Character Text Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from google.colab import files

uploaded = files.upload()
filename = next(iter(uploaded))
text = uploaded[filename].decode("utf-8")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

texts = text_splitter.create_documents([text])
print(texts[0])
print(texts[1])
print(texts[2])

LangChain provides several other text splitters to get you started; see the official docs for more examples.

Indexes: Vector Stores

Once you've split a large set of documents into smaller, semantically related chunks of text, you need to store this "relatedness" data for reuse in subsequent queries or across different use cases. This measure of relatedness is called an embedding, and it is usually persisted in a vector store or database. I covered LangChain embeddings at length in Part 2 of this blog series; do have a read.
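
As a quick refresher, an embedding is simply a list of floating-point numbers, and the distance between two such vectors measures how related the underlying texts are. A minimal sketch using OpenAI embeddings (requires the OPENAI_API_KEY set earlier):

# Generate an embedding vector for a query string
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("Hello, world!")
print(len(vector))  # 1536 dimensions for OpenAI's text-embedding-ada-002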

Chroma is an open-source, lightweight embedding (or vector) database that can be used to store embeddings locally. In the example below, the source document is split into chunks, OpenAI embeddings are generated for each chunk, stored in a local Chroma database, and retrieved for subsequent similarity searches. In this example, Chroma uses an in-memory DuckDB database, so the stored data is transient. For a detailed tutorial on loading and summarizing documents using LangChain and the Chroma vector store, see this post.

!pip install chromadb tiktoken

# Chroma Vector Store
import os, tiktoken
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

OPENAI_API_KEY = '' # @param {type:"string"}
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from google.colab import files

uploaded = files.upload()
filename = next(iter(uploaded))

loader = TextLoader(filename)
data = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(data)

embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)

query = "What comes after 'Vestibulum congue convallis finibus'?"
docs = db.similarity_search(query)

print(docs[0].page_content)
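
If you also want to see how close each match was, Chroma exposes a scored variant of the same search:

# Retrieve documents along with their similarity scores (lower = closer)
docs_and_scores = db.similarity_search_with_score(query)
for doc, score in docs_and_scores:
    print(score, doc.page_content[:80])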

LangChain provides wrappers for several other vector stores like Pinecone, Milvus, Weaviate, Qdrant, and more; see the official docs for more examples.

Indexes: Retrievers

The Retriever interface is a generic way to fetch documents for use with language models. It exposes a single get_relevant_documents method, which takes the user query and returns a list of relevant documents.
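
Because the interface is so small, it's easy to write your own retriever. Here's a minimal, hypothetical sketch that naively matches on keywords; note that newer LangChain versions expect you to override _get_relevant_documents instead:

# A toy retriever that returns documents containing the query string
from typing import List
from langchain.schema import BaseRetriever, Document

class KeywordRetriever(BaseRetriever):
    def __init__(self, docs: List[Document]):
        self.docs = docs

    def get_relevant_documents(self, query: str) -> List[Document]:
        # Naive keyword match; real retrievers use embeddings or search APIs
        return [d for d in self.docs if query.lower() in d.page_content.lower()]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        return self.get_relevant_documents(query)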

Here's an example of the ArxivRetriever class used to retrieve scientific articles from arXiv, the popular open-access archive. Use the load_max_docs argument to limit the number of documents retrieved.

!pip install arxiv pymupdf

# Arxiv Retriever
from langchain.retrievers import ArxivRetriever

retriever = ArxivRetriever(load_max_docs=2)
docs = retriever.get_relevant_documents(query='2203.15556')

docs[0].metadata

Here's another example, this time using the WikipediaRetriever class to retrieve relevant wiki pages from Wikipedia. Use the load_all_available_meta argument to retrieve all available metadata fields.

!pip install wikipedia

# Wikipedia Retriever
from langchain.retrievers import WikipediaRetriever

retriever = WikipediaRetriever()
docs = retriever.get_relevant_documents(query='large language models')

docs[0].metadata
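
By default, only a few key metadata fields are populated; to pull in everything Wikipedia exposes, enable the flag mentioned above (a sketch; the argument name follows the underlying API wrapper):

# Retrieve the full set of metadata fields for each page
retriever = WikipediaRetriever(load_all_available_meta=True)
docs = retriever.get_relevant_documents(query='large language models')
print(docs[0].metadata.keys())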

Finally, here's the Chroma vector store example from the previous section, rewritten to use a retriever instead of a direct similarity search.

!pip install chromadb tiktoken

# Chroma Vector Store Retriever
import os, tiktoken
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

OPENAI_API_KEY = '' # @param {type:"string"}
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from google.colab import files

uploaded = files.upload()
filename = next(iter(uploaded))

loader = TextLoader(filename)
data = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(data)

embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)

retriever = db.as_retriever()
query = "What comes after 'Vestibulum congue convallis finibus'?"
docs = retriever.get_relevant_documents(query)

print(docs[0].page_content)
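
The as_retriever method also accepts search parameters; for example, you can cap the number of documents returned per query:

# Return only the top 2 matches instead of the default 4
retriever = db.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents(query)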

LangChain provides several other retrievers to get you started; see the official docs for more examples. That concludes this tutorial on indexes.

The next post in this series covers LangChain Memory - do follow along if you liked this post. Finally, check out this handy compendium of all LangChain posts.
