By alphasec in AI/ML — Apr 19, 2023

LangChain Decoded: Part 2 - Embeddings

An exploration of the LangChain framework and modules in multiple parts; this post covers Embeddings.

In this multi-part series, I explore various LangChain modules and use cases, and document my journey via Python notebooks on GitHub. The previous post covered LangChain Models; this post explores Embeddings. Feel free to follow along and fork the repository, or use individual notebooks on Google Colab. Shoutout to the official LangChain documentation though - much of the code is borrowed or influenced by it, and I'm thankful for the clarity it offers.

Over the course of this series, I'll dive into the following topics:

Models
Embeddings (this post)
Prompts
Indexes
Memory
Chains
Agents
Callbacks

Getting Started

LangChain is available on PyPI, so it can be easily installed with pip. By default, the dependencies (e.g. model providers, data stores) are not installed, and should be installed separately based on your specific needs. LangChain also offers an implementation in JavaScript, but we'll only use the Python libraries here.

LangChain supports several model providers, but this tutorial will only focus on OpenAI (unless explicitly stated otherwise). Set the OpenAI API key via the OPENAI_API_KEY environment variable, or directly inside the notebook (or your Python code); if you don't have the key, you can get it here. Obviously, the first option is preferred in general, but especially in production - do not commit your API key accidentally to GitHub!

Follow along in your own Jupyter Python notebook, or click the link below to open the notebook directly in Google Colab.

# Install the LangChain package
!pip install langchain

# Install the OpenAI package
!pip install openai

# Configure the API key
import os

openai_api_key = os.environ.get('OPENAI_API_KEY', 'sk-XXX')
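
The snippet above falls back to a placeholder key, which would only fail once the first API call is made. If you'd rather fail early, a minimal sketch could look like this; the helper name load_openai_api_key is my own and not part of LangChain or the OpenAI package.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; not a LangChain or OpenAI function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key


# Usage (commented out so the block runs without a key configured):
# openai_api_key = load_openai_api_key()
```

This keeps the key out of the notebook itself, which matches the preference for the environment variable stated earlier.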

LangChain: Text Embeddings

Embeddings are a measure of the relatedness of text strings, and are represented with a vector (list) of floating point numbers. The distance between two vectors measures their relatedness - the shorter the distance, the higher the relatedness. Embeddings are used for a wide variety of use cases - text classification, search, clustering, recommendations, anomaly detection, diversity measurement etc.
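
To make "distance" concrete, here is a small sketch using cosine distance, one common choice for comparing embedding vectors. The three-dimensional vectors are made up for illustration; real text-embedding-ada-002 vectors are much longer, and the function names are my own.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Shorter distance means higher relatedness."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors (invented values, not real embeddings)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_distance(cat, kitten))   # small: the vectors point the same way
print(cosine_distance(cat, invoice))  # large: the vectors are nearly orthogonal
```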

The LangChain Embeddings class is designed as an interface for embedding providers like OpenAI, Cohere, HuggingFace etc. The base class exposes two methods, embed_query and embed_documents - the former takes a single text (typically a search query), while the latter takes a list of texts.
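
The shape of that interface can be sketched with a toy class. ToyEmbeddings is a hypothetical name of my own; it does not subclass anything from LangChain, and the hash-derived vectors carry no semantic meaning. It only shows that one method returns a single vector and the other returns a list of vectors.

```python
import hashlib
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in that mimics the two-method interface.

    Not a LangChain class; the vectors are derived from a hash and are
    meaningless as a measure of relatedness.
    """

    def __init__(self, size: int = 8):
        self.size = size

    def _embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map digest bytes (cycling if size > 32) to floats in [0, 1]
        return [digest[i % len(digest)] / 255.0 for i in range(self.size)]

    def embed_query(self, text: str) -> List[float]:
        """Embed a single text and return one vector."""
        return self._embed(text)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed several texts and return one vector per text."""
        return [self._embed(t) for t in texts]


toy = ToyEmbeddings(size=8)
print(len(toy.embed_query("This is a sample query.")))
print(len(toy.embed_documents(["first document", "second document"])))
```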

In this notebook, we'll interface with the OpenAI Embeddings wrapper, and carry out a few basic operations. OpenAI offers several first-generation embedding models which are relatively slow and expensive, so we'll use the default text-embedding-ada-002 second-generation model, which is suitable for almost all use cases, along with the cl100k_base encoding scheme.

# Retrieve OpenAI text embeddings for a text input
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

text = "This is a sample query."

query_result = embeddings.embed_query(text)
print(query_result)
print(len(query_result))

The code sample above passes a single string to the embed_query method; the one below passes a list of strings to embed_documents. In practice, embed_documents is more likely to be used to embed text loaded from uploaded documents. However, that requires knowledge of additional LangChain modules like document loaders, so I'll defer that discussion for later posts.

# Retrieve OpenAI text embeddings for multiple text/document inputs
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

text = ["This is a sample query.", "This is another sample query.", "This is yet another sample query."]

doc_result = embeddings.embed_documents(text)
print(doc_result)
print(len(doc_result))

LangChain also offers a FakeEmbeddings class to test your pipeline without making actual calls to the embedding providers.

# Use fake embeddings to test your pipeline
from langchain.embeddings import FakeEmbeddings

embeddings = FakeEmbeddings(size=1481)

text = "This is a sample query."

query_result = embeddings.embed_query(text)
print(query_result)
print(len(query_result))

OpenAI embedding models cannot embed text that exceeds a maximum length. For example, for the text-embedding-ada-002 model with cl100k_base encoding, the maximum context length is 8191 tokens. If the provided text exceeds the stated maximum, you'll get an error. Here is an example.

# Request with context length > 8191 throws an error
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

long_text = 'Hello ' * 10000

query_result = embeddings.embed_query(long_text)
print(query_result)

*** Response ***
openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

To deal with this, you can either truncate the input text length, or chunk the text and embed each chunk individually. A naive example of truncation could be:

max_tokens = 8191

# Naive: this slices characters, not tokens
truncated_text = long_text[:max_tokens]

However, because context length is measured in tokens, the right way to do this is to tokenise the input text with tiktoken before truncating it. Unfortunately, both the embed_query and embed_documents methods only support a string input at the moment, so we need to re-convert the tokens to a string before embedding. Run !pip install tiktoken before executing this code.

# Truncate input text length using tiktoken
import tiktoken
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

max_tokens = 8191
encoding_name = 'cl100k_base'

long_text = 'Hello ' * 10000

# Tokenize the input text before truncating it
encoding = tiktoken.get_encoding(encoding_name)
tokens = encoding.encode(long_text)[:max_tokens]

# Re-convert the tokens to a string before embedding
truncated_text = encoding.decode(tokens)

query_result = embeddings.embed_query(truncated_text)
print(query_result)
print(len(query_result))
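
Chunking, the other option mentioned above, could be sketched along the same lines: split the token list into pieces of at most max_tokens each, then embed every piece separately. The function name chunk_tokens is my own, and a plain list of integers stands in for the output of encoding.encode so the sketch runs without tiktoken or an API key.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")
    for start in range(0, len(tokens), chunk_size):
        yield list(tokens[start:start + chunk_size])


max_tokens = 8191

# Stand-in for encoding.encode(long_text); real token ids would come from tiktoken
tokens = list(range(20000))

chunks = list(chunk_tokens(tokens, chunk_size=max_tokens))
print([len(chunk) for chunk in chunks])  # → [8191, 8191, 3618]
```

With real tokens, each chunk would be converted back to a string with encoding.decode and the resulting list of strings passed to embed_documents. Note that this sketch splits at arbitrary token boundaries, so a chunk may begin or end mid-sentence.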

The OpenAI Cookbook offers sample code for several use cases that you can use in conjunction with LangChain - text classification, question-answering for Wikipedia articles, recommendations, semantic text search, sentiment analysis with zero-shot classification, and more.

In addition to truncation, I mentioned chunking as a method to deal with large text inputs. However, to search over many vectors quickly, repeatedly calling OpenAI embedding models is neither efficient nor cost effective. Instead, you should use vector databases like Chroma, Weaviate, Pinecone, Qdrant, Milvus, and others. We'll cover vector databases in the post on Indexes later.

The next post in this series covers LangChain Prompts - do follow along if you liked this post. Finally, check out this handy compendium of all LangChain posts.
