LangChain Decoded: Part 4 - Indexes
An exploration of the LangChain framework and modules in multiple parts; this post covers Indexes.
In this multi-part series, I explore various LangChain modules and use cases, and document my journey via Python notebooks on GitHub. The previous post covered LangChain Prompts; this post explores Indexes. Feel free to follow along and fork the repository, or use individual notebooks on Google Colab. Shoutout to the official LangChain documentation though - much of the code is borrowed or influenced by it, and I'm thankful for the clarity it offers.
Over the course of this series, I'll dive into the following topics:
- Models
- Embeddings
- Prompts
- Indexes (this post)
- Memory
- Chains
- Agents
- Callbacks
Getting Started
LangChain is available on PyPI, so it can be easily installed with pip. By default, the dependencies (e.g. model providers, data stores) are not installed, and should be installed separately based on your specific needs. LangChain also offers an implementation in JavaScript, but we'll only use the Python libraries here.
LangChain supports several model providers, but this tutorial will only focus on OpenAI (unless explicitly stated otherwise). Set the OpenAI API key via the OPENAI_API_KEY environment variable, or directly inside the notebook (or your Python code); if you don't have the key, you can get it here. Obviously, the first option is preferred in general, but especially in production - do not accidentally commit your API key to GitHub!
Follow along in your own Jupyter Python notebook, or click the link below to open the notebook directly in Google Colab.
# Install the LangChain package
!pip install langchain
# Install the OpenAI package
!pip install openai
# Configure the API key
import os
openai_api_key = os.environ.get('OPENAI_API_KEY', 'sk-XXX')
LangChain Indexes
The pre-trained corpus of knowledge available with large language models (LLMs) is quite phenomenal, but if you want the model to be more attuned to your specific use case, you can provide additional context as part of the request (aka prompt). This context is often provided in the form of documents or data retrieved from external sources - this is where indexes come in. Indexes often (but not always) form the bridge between documents and the model, providing a simple interface to structured and unstructured data. In this post, we'll explore the LangChain Indexes module, its components, and ways to create and interface with indexes.
In the following sections, we'll discuss the four main components of an index:
- Document loaders
- Text splitters
- Vector stores
- Retrievers
Indexes: Document Loaders
LangChain offers three broad categories of document loaders:
- Transform loaders: these loaders transform data from specific formats like CSV, PDF, SQL etc. into the Document format. Most of these loaders are powered by the unstructured Python package.
- Public dataset or service loaders: these loaders are built for specific public web services like YouTube, Hacker News etc.
- Private dataset or service loaders: these loaders are built for non-public datasets and services like Google Drive, AWS S3, Slack, Twitter etc., and require authenticated access to those resources.
Document loaders expose two methods: load and load_and_split. The load method simply loads the documents and returns them as a list of Document objects. The load_and_split method loads the documents, splits them using a TextSplitter, and returns the resulting chunks as a list of Document objects.
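To make the Document format concrete, here's a minimal sketch (the example content and metadata values are assumptions for illustration):
# Document objects: page content plus arbitrary metadata
from langchain.schema import Document
doc = Document(
    page_content="LangChain indexes bridge documents and models.",
    metadata={"source": "example.txt"},
)
print(doc.page_content)
print(doc.metadata)
Every loader below ultimately produces objects of this shape, whatever the underlying source.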
Here is an example of the UnstructuredURLLoader class used to load web URLs. For a simple Streamlit app that implements this class, see this gist.
!pip install unstructured tabulate pdf2image pytesseract
# URL Loader
from langchain.document_loaders import UnstructuredURLLoader
urls = ["https://alphasec.io/summarize-text-with-langchain-and-openai"]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
print(data)
Here's another example, this time using the PyPDFLoader class to load a PDF file and split it into smaller chunks.
!pip install pypdf
# PDF Loader
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("./data/attention-is-all-you-need.pdf")
pages = loader.load_and_split()
pages[0]
If you have multiple files in a directory, you can use the DirectoryLoader class to load them. Use the glob parameter to control which files to load: ** specifies recursion, and *.csv filters the files to load (e.g. only .csv files).
# File Directory Loader
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('data', glob="**/*.csv")
docs = loader.load()
len(docs)
An example of a public dataset loader is the YoutubeLoader class, which is used to load YouTube transcripts. For a simple Streamlit app that implements this class, see this gist.
!pip install pytube youtube-transcript-api
# YouTube Transcripts Loader
from langchain.document_loaders import YoutubeLoader
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=yEgHrxvLsz0", add_video_info=True)
data = loader.load()
print(data)
An example of a private dataset loader is the GCSFileLoader class, which is used to load files from Google Cloud Storage (GCS). Because GCS objects are private by default, you need to authenticate with Google Cloud before you can access the resources. For a simple Streamlit app that implements this class, see this gist.
!pip install google-cloud-storage
# Google Cloud Storage File Loader
from langchain.document_loaders import GCSFileLoader
loader = GCSFileLoader(project_name="langchain-gcs", bucket="langchain-gcs", blob="lorem-ipsum.txt")
data = loader.load()
print(data)
LangChain provides several other document loaders to get you started; see the official docs for more examples.
Indexes: Text Splitters
If you have to deal with very long text sources, you need to split the text into chunks before it can be loaded, while keeping semantically related pieces of text together. LangChain provides the TextSplitter base class for this. The text is split into small, semantically meaningful chunks of a maximum size (chunk_size), with some overlap between chunks (chunk_overlap) to preserve context. The length_function determines how chunk length is measured; it defaults to counting characters, but you can also use a token counter instead (see the sketch after the recursive splitter example below).
Here is an example of the CharacterTextSplitter class, a simple text splitter implementation that splits text on a single separator. For a simple Streamlit app that implements this class, see this gist.
If you keep the chunk_size too small (e.g. 100 characters) and a chunk of text between separators exceeds this length, the splitter will warn you and emit an oversized chunk during the splitting process. In such a scenario, you should use the RecursiveCharacterTextSplitter class.
# Character Text Splitter
from langchain.text_splitter import CharacterTextSplitter
from google.colab import files
uploaded = files.upload()
filename = next(iter(uploaded))
text = uploaded[filename].decode("utf-8")
text_splitter = CharacterTextSplitter(
separator = "\n\n",
chunk_size = 1000,
chunk_overlap = 200,
length_function = len,
)
texts = text_splitter.create_documents([text])
print(texts[0])
print(texts[1])
print(texts[2])
The most commonly recommended text splitter is the RecursiveCharacterTextSplitter class, which recursively splits text into smaller chunks using an ordered list of separators; by default, the characters used to split on are ["\n\n", "\n", " ", ""].
# Recursive Character Text Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from google.colab import files
uploaded = files.upload()
filename = next(iter(uploaded))
text = uploaded[filename].decode("utf-8")
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 100,
chunk_overlap = 20,
length_function = len,
)
texts = text_splitter.create_documents([text])
print(texts[0])
print(texts[1])
print(texts[2])
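As noted earlier, the length_function can also count tokens instead of characters. Here's a minimal sketch using the tiktoken package; the cl100k_base encoding and the sample text are assumptions for illustration.
!pip install tiktoken
# Token-based chunk length (sketch)
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Measure chunk length in tokens using the cl100k_base encoding
encoding = tiktoken.get_encoding("cl100k_base")
def tiktoken_len(text):
    return len(encoding.encode(text))
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,  # now measured in tokens, not characters
    chunk_overlap = 20,
    length_function = tiktoken_len,
)
texts = text_splitter.create_documents(["Lorem ipsum dolor sit amet. " * 200])
print(len(texts))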
LangChain provides several other text splitters to get you started; see the official docs for more examples.
Indexes: Vector Stores
Once you've split a large set of documents into smaller, semantically related chunks of text, you need to store this "relatedness" data for reuse in subsequent queries or across different use cases. This measure of relatedness is called an embedding, and is usually persisted in a vector store or database. I've covered LangChain embeddings at length in Part 2 of this blog series; do have a read.
Chroma is an open-source, lightweight embedding (or vector) database that can be used to store embeddings locally. In the example below, the source document is split into chunks, OpenAI embeddings are generated for each chunk, stored in the local Chroma database, and retrieved for subsequent similarity searches. If you don't have an OpenAI API key, you can get it here. In this example, Chroma uses an in-memory DuckDB database, hence the data will be transient. For a detailed tutorial on loading and summarizing documents using LangChain and Chroma vector store, see this post.
!pip install chromadb tiktoken
# Chroma Vector Store
import os, tiktoken
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
OPENAI_API_KEY = '' # @param {type:"string"}
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from google.colab import files
uploaded = files.upload()
filename = next(iter(uploaded))
loader = TextLoader(filename)
data = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(data)
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)
query = "What comes after 'Vestibulum congue convallis finibus'?"
docs = db.similarity_search(query)
print(docs[0].page_content)
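Because the in-memory database above is transient, you may want to persist the embeddings to disk and reload them later. Here's a minimal sketch reusing the docs and embeddings from above; the ./chroma directory is an assumption for illustration.
# Persistent Chroma Vector Store (sketch)
db = Chroma.from_documents(docs, embeddings, persist_directory="./chroma")
db.persist()
# Reload the persisted store with the same embedding function
db = Chroma(persist_directory="./chroma", embedding_function=embeddings)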
LangChain provides wrappers for several other vector stores like Pinecone, Milvus, Weaviate, Qdrant, and more; see the official docs for more examples.
Indexes: Retrievers
The Retriever interface is a generic interface that combines documents with language models. It exposes a single get_relevant_documents method, which takes in the user query and returns a list of relevant documents.
Here's an example of the ArxivRetriever class used to retrieve scientific articles from arXiv, the popular open-access archive. Use the load_max_docs argument to limit the number of documents retrieved.
!pip install arxiv pymupdf
# Arxiv Retriever
from langchain.retrievers import ArxivRetriever
retriever = ArxivRetriever(load_max_docs=2)
docs = retriever.get_relevant_documents(query='2203.15556')
docs[0].metadata
Here's another example, this time using the WikipediaRetriever class to retrieve relevant wiki pages from Wikipedia. Use the load_all_available_metadata argument to retrieve all metadata fields.
!pip install wikipedia
# Wikipedia Retriever
from langchain.retrievers import WikipediaRetriever
retriever = WikipediaRetriever()
docs = retriever.get_relevant_documents(query='large language models')
docs[0].metadata
Finally, here's the Chroma vector store example from the previous section, but re-written to use a retriever instead of a similarity search.
!pip install chromadb tiktoken
# Chroma Vector Store Retriever
import os, tiktoken
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
OPENAI_API_KEY = '' # @param {type:"string"}
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from google.colab import files
uploaded = files.upload()
filename = next(iter(uploaded))
loader = TextLoader(filename)
data = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(data)
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)
retriever = db.as_retriever()
query = "What comes after 'Vestibulum congue convallis finibus'?"
docs = retriever.get_relevant_documents(query)
print(docs[0].page_content)
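To see a retriever actually combining documents with a language model, here's a minimal sketch that plugs the retriever from above into a RetrievalQA chain (chains are covered later in this series; the query is an assumption for illustration):
# Retrieval QA Chain (sketch)
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# The "stuff" chain type inserts the retrieved chunks directly into the prompt
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
print(qa.run("What does the document say about 'Vestibulum congue convallis finibus'?"))
You can also tune retrieval itself, e.g. db.as_retriever(search_kwargs={"k": 2}) limits the results to the two most relevant chunks.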
LangChain provides several other retrievers to get you started; see the official docs for more examples. That concludes this tutorial on indexes.
The next post in this series covers LangChain Memory - do follow along if you liked this post. Finally, check out this handy compendium of all LangChain posts.