Sameer Singh

If you have been following this LangChain series, you already know how to work with Models, Prompts, Chains, and Runnables. You understand the LCEL pipe syntax, how chain pipelines are structured, and how the Runnable system ties everything together.
Now it is time to take the next big step: building real, production-ready applications using LangChain.
And the most important category of real-world LLM applications right now is RAG-based applications.
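In this article, we will cover what RAG is and why it matters, the core components of a RAG system, and LangChain's Document Loaders: TextLoader, PyPDFLoader, DirectoryLoader, WebBaseLoader, CSVLoader, and custom loaders, along with the difference between load() and lazy_load().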
By the end of this article, you will have a clear mental model of how data flows through a RAG system and how Document Loaders fit into that flow.
RAG stands for Retrieval-Augmented Generation.
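It is a technique where you retrieve relevant information from an external knowledge base, add it to the prompt, and let the LLM generate an answer grounded in that context.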
In simple terms: instead of relying solely on what the LLM memorized during training, you give it access to fresh, specific, and private data at query time.
Large Language Models like GPT-4 or Claude are impressive. But they have a fundamental limitation: their knowledge is frozen at the time of training.
This means they cannot answer questions about anything that happened after training or about private data they never saw.
For example, if you ask a base LLM "What did our sales team close last quarter?", it has no way to know. It never saw your CRM data, so it cannot answer.
This is where RAG steps in.
RAG connects the LLM to an external knowledge base that you control. The LLM no longer needs to have memorized everything. It can look things up on demand, from sources you provide.
This makes RAG especially powerful for applications that depend on private, specific, or frequently updated data.
Here is the flow of a typical RAG application:
User asks a question
|
v
Search the knowledge base for relevant content
|
v
Retrieve the most relevant document chunks
|
v
Inject those chunks into the LLM prompt
|
v
LLM reads the context and generates a grounded answer
|
v
Return the answer to the user
The key insight is that the LLM is not generating from memory alone. It is reading the relevant context you give it, just like a human would read a document before answering a question about it.
This approach dramatically improves how well answers are grounded in your own data.
Almost every RAG system in production is built from four fundamental components:
1. Document Loader: Loads raw data from sources like PDFs, websites, CSV files, or databases and converts it into a standard format LangChain can work with.
2. Text Splitter: Breaks long documents into smaller chunks because LLMs have limited context windows and you want to retrieve precise, relevant chunks rather than entire documents.
3. Vector Database: Stores the embedded representations (vectors) of your document chunks. When a query comes in, the vector database finds the most semantically similar chunks.
4. Retriever: The interface between the vector database and the LLM chain. It takes a query and returns the most relevant chunks to be injected into the prompt.
The full pipeline looks like this:
Load Data -> Split into Chunks -> Embed and Store -> Retrieve -> Generate
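To make the pipeline above concrete, here is a minimal sketch that uses only the Python standard library. It is not LangChain code: the bag-of-words "embedding", the in-memory index, and the sample sentences are all assumptions chosen so the example runs anywhere. A real system would use a neural embedding model and a vector database, and would send the final prompt to an LLM.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (real systems use a neural model)."""
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Load: hypothetical in-memory "documents"
raw_docs = [
    "Employees receive 20 days of paid vacation per year.",
    "The office is closed on public holidays.",
    "Expense reports must be submitted within 30 days.",
]

# Split: each sentence is already small enough to be one chunk
chunks = raw_docs

# Embed and store: a plain list stands in for the vector database
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query):
    """Generate step, minus the LLM call: inject retrieved context into the prompt."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many vacation days do employees get?")
print(prompt)
```

The structure matters more than the scoring function here: each stage hands a simple, uniform object to the next, which is the same property the LangChain components rely on.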
In this article, we focus on the first component: Document Loaders.
Document Loaders are LangChain components that load data from different sources and convert that data into a standard format called a Document object.
Data can come from almost anywhere: text files, PDFs, web pages, CSV files, databases, and APIs.
No matter the source, every loader converts the data into the same standard structure: a list of Document objects.
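Every Document in LangChain has two main fields: page_content, which holds the text, and metadata, a dictionary of information about that text such as its source.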
Document(
    page_content="This is the actual text content of the chunk.",
    metadata={"source": "report.pdf", "page": 3}
)
This standardized format is what makes LangChain's pipeline so composable. Once data is in Document format, you can pass it through any Text Splitter, any Vector Store, and any Retriever without worrying about the original source format.
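The composability point can be sketched without LangChain at all. In the example below, SimpleDocument is a hypothetical stand-in for LangChain's Document, and from_text_file, from_pages, and count_words are made-up helpers. The idea it tries to show is that a downstream step only needs the shared page_content field, so it does not care which source produced the data.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def from_text_file(name, text):
    """Hypothetical text source: the whole file becomes one document."""
    return [SimpleDocument(text, {"source": name})]

def from_pages(name, pages):
    """Hypothetical paged source: one document per page, with a page number."""
    return [SimpleDocument(p, {"source": name, "page": i}) for i, p in enumerate(pages)]

def count_words(docs):
    """Downstream step that relies only on the shared page_content field."""
    return sum(len(d.page_content.split()) for d in docs)

docs = from_text_file("notes.txt", "Quarterly planning notes") + from_pages(
    "report.pdf", ["Page one text", "Page two text"]
)
print(count_words(docs))
print([d.metadata for d in docs])
```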
Without Document Loaders, you would need to write custom parsing logic for every data source. A PDF parser, an HTML scraper, a CSV reader, a database connector - all custom code just to get data into your pipeline.
Document Loaders handle all of that for you. They abstract away the messy details of reading different file formats so you can focus on building the RAG logic.
TextLoader is the most basic loader in LangChain. It reads a plain text file and returns its content as a list containing a single Document object.
Use TextLoader for plain text sources such as notes, logs, and other .txt files.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("meeting_notes.txt")
docs = loader.load()
print(len(docs)) # 1
print(docs[0].page_content[:200]) # First 200 characters of the file
print(docs[0].metadata) # {'source': 'meeting_notes.txt'}
Because a single .txt file is treated as one document, you will typically pass the result through a Text Splitter next if the file is large.
TextLoader falls back to your platform's default encoding unless you pass one, so it is safer to state the file's encoding explicitly:
loader = TextLoader("data.txt", encoding="utf-8")
docs = loader.load()
PDFs are one of the most common data sources in enterprise RAG applications. LangChain provides multiple PDF loaders, and PyPDFLoader is the most commonly used starting point.
PyPDFLoader splits the PDF by page. Each page becomes its own Document object. This is important to understand because it affects how your data is chunked and retrieved.
A PDF with 50 pages will produce 50 Document objects. Each one will have:
page_content: the extracted text from that page
metadata: the source filename and page number
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("annual_report.pdf")
docs = loader.load()
print(len(docs)) # Number of pages
print(docs[0].page_content) # Text of first page
print(docs[0].metadata) # {'source': 'annual_report.pdf', 'page': 0}
PyPDFLoader is a good choice for standard, text-based PDFs.
PyPDFLoader struggles with tables, complex layouts, and scanned documents that need OCR.
For these cases, LangChain offers alternatives:
| Loader | Best For |
|---|---|
| PyPDFLoader | Standard text PDFs |
| PDFPlumberLoader | Tables and complex layouts |
| PyMuPDFLoader | Fast extraction, rich metadata |
| UnstructuredPDFLoader | Mixed content, images, complex structure |
| AmazonTextractPDFLoader | Scanned documents needing cloud-based OCR |
In real projects, you rarely work with just one file. You might have a folder with hundreds of PDFs, or a directory full of text files. DirectoryLoader handles bulk loading efficiently.
You point it at a folder, tell it which file pattern to match using a glob expression, and specify which loader class to use for each file. DirectoryLoader then iterates through all matching files and loads them using the specified loader.
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
loader = DirectoryLoader(
    "company_docs/", # Folder to search
    glob="*.pdf", # File pattern to match
    loader_cls=PyPDFLoader # Which loader to use per file
)
docs = loader.load()
print(len(docs)) # Total documents across all files
If your folder contains 5 PDF files with 20 pages each, DirectoryLoader will produce 100 Document objects (5 files x 20 pages each).
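The iterate-match-delegate behaviour described above can be sketched with the standard library alone. This is not how DirectoryLoader is implemented: load_pages and load_directory are hypothetical helpers, and plain text files with form-feed page breaks stand in for PDFs so the example needs no PDF library.

```python
import tempfile
from pathlib import Path

def load_pages(path):
    """Hypothetical per-file loader: one dict per 'page', pages separated by form feeds."""
    pages = path.read_text(encoding="utf-8").split("\f")
    return [
        {"page_content": text, "metadata": {"source": path.name, "page": i}}
        for i, text in enumerate(pages)
    ]

def load_directory(folder, pattern, loader):
    """Apply `loader` to every file in `folder` matching the glob `pattern`."""
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        docs.extend(loader(path))
    return docs

with tempfile.TemporaryDirectory() as folder:
    # Create 5 files with 20 "pages" each, plus one file the glob should skip
    for n in range(5):
        pages = [f"file {n} page {p}" for p in range(20)]
        Path(folder, f"doc{n}.txt").write_text("\f".join(pages), encoding="utf-8")
    Path(folder, "ignored.md").write_text("not matched by the glob", encoding="utf-8")

    docs = load_directory(folder, "*.txt", load_pages)

print(len(docs))  # 5 files x 20 pages
```

The document count is decided by the per-file loader, not by the directory walker, which is why swapping loader_cls changes how many Document objects you get back.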
You can use DirectoryLoader with any loader class:
# Load all .txt files
from langchain_community.document_loaders import TextLoader
txt_loader = DirectoryLoader("logs/", glob="*.txt", loader_cls=TextLoader)
txt_docs = txt_loader.load()
# Load all .csv files
from langchain_community.document_loaders import CSVLoader
csv_loader = DirectoryLoader("data/", glob="*.csv", loader_cls=CSVLoader)
csv_docs = csv_loader.load()
For large folders, you can enable a progress bar:
loader = DirectoryLoader(
    "large_folder/",
    glob="*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
By default, if one file fails to load, the entire process stops. You can configure it to skip errors:
loader = DirectoryLoader(
    "mixed_folder/",
    glob="*.pdf",
    loader_cls=PyPDFLoader,
    silent_errors=True # Skip files that fail, continue loading others
)
Every LangChain document loader supports two loading methods: load() and lazy_load(). Understanding the difference is critical for building memory-efficient applications.
docs = loader.load()
# All documents are now in memory
load() reads all documents into memory immediately and returns a list. This is simple and convenient, but it can be a problem when the collection is too large to hold comfortably in memory.
for doc in loader.lazy_load():
    process_document(doc)
    # Only one document is in memory at a time
lazy_load() returns a Python generator. It loads and yields one document at a time. The next document is only fetched when the previous one has been processed. This makes it ideal for large document collections where holding everything in memory at once is impractical.
# load() - Simple, all in memory
docs = loader.load()
for doc in docs:
    embed_and_store(doc)
# lazy_load() - Memory efficient, generator-based
for doc in loader.lazy_load():
    embed_and_store(doc)
For production applications handling large document collections, lazy_load() is almost always the better choice.
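To see the laziness directly, the sketch below uses a plain Python generator as a stand-in for loader.lazy_load() and records when each document is actually read. The lazy_load and batched functions here are illustrative helpers, not LangChain APIs. Batching is included because embedding APIs are often called on small groups of documents rather than one at a time.

```python
from itertools import islice

loaded = []  # records which documents have actually been read

def lazy_load(n):
    """Generator stand-in for loader.lazy_load(): reads one document per step."""
    for i in range(n):
        loaded.append(i)
        yield f"document {i}"

def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch

stream = lazy_load(10)
print(len(loaded))  # nothing has been read yet

first_batch = next(batched(stream, 4))
print(first_batch)
print(len(loaded))  # only the documents in the first batch have been read
```

Calling list(stream) instead would read all ten documents up front, which is effectively what load() does.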
WebBaseLoader lets you load the text content of any publicly accessible web page into a LangChain Document. This opens up a huge range of use cases.
WebBaseLoader uses the requests library to fetch the HTML of the page, then uses BeautifulSoup to parse it and extract the text. This removes the HTML markup, but it does not reliably separate the main content from navigation, ads, and other page chrome, and it does not execute JavaScript, so heavily JavaScript-driven sites are handled poorly.
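The parse-and-extract step can be illustrated offline with Python's built-in html.parser. This is a simplified stand-in, not what WebBaseLoader runs internally: the TextExtractor class and the sample HTML are assumptions made for the example, and no network request is made.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html = """
<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><nav>Home | About</nav><p>RAG grounds answers in retrieved text.</p></body></html>
"""

parser = TextExtractor()
parser.feed(html)
page_text = "\n".join(parser.parts)
print(page_text)
```

Note that a simple extractor like this one keeps the navigation text alongside the paragraph, which is one reason text pulled from web pages often needs some cleanup before it is chunked and embedded.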
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.python.org/3/library/functions.html")
docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata) # {'source': 'https://docs.python.org/3/...'}
WebBaseLoader accepts a list of URLs:
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]
loader = WebBaseLoader(urls)
docs = loader.load()
# Returns one Document per URL
WebBaseLoader is great for static, publicly accessible pages such as documentation and articles.
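WebBaseLoader has limitations you should be aware of. Because it fetches raw HTML with requests, it does not run JavaScript, so content that is rendered client-side will be missing. For JavaScript-heavy sites, consider using Selenium or Playwright-based loaders that can render the page in a real browser before extracting content.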
WebBaseLoader is the foundation for a compelling product: a browser extension that lets you chat with the content of any webpage you are viewing. Load the page, chunk the text, embed it, and set up a conversational interface. This is a genuinely useful project that demonstrates the full RAG pipeline.
CSVLoader loads CSV files where each row becomes its own Document object. This is particularly useful when your data is structured as a table and each row represents a self-contained record.
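The row-per-document idea is easy to sketch with Python's csv module. The example below is not CSVLoader itself: the sample data, the tickets.csv name, and the plain dicts standing in for Document objects are assumptions for illustration. Each row is rendered as "column: value" lines, which is a common way to turn a record into text an LLM can read.

```python
import csv
import io

# Hypothetical CSV content, held in memory so the example needs no file
csv_text = "id,issue,status\n1,Login fails,open\n2,Slow dashboard,closed\n"

docs = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # Render the row as one "column: value" line per column
    content = "\n".join(f"{column}: {value}" for column, value in row.items())
    docs.append({
        "page_content": content,
        "metadata": {"source": "tickets.csv", "row": row_number},
    })

print(len(docs))
print(docs[0]["page_content"])
```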
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("customer_support_tickets.csv")
docs = loader.load()
print(len(docs)) # One document per row
print(docs[0].page_content) # The text representation of the first row
print(docs[0].metadata) # {'source': 'customer_support_tickets.csv', 'row': 0}
By default, each row is represented as a string where column names and values are joined:
column1: value1
column2: value2
column3: value3
You can configure which column to use as the document's source in its metadata:
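loader = CSVLoader(
    "products.csv",
    source_column="product_id" # Use this column value as the source
)
If your CSV has many columns and you only want certain ones to become the document content, you can specify them with content_columns. (csv_args is passed through to Python's csv.DictReader and controls parsing details such as the delimiter; it does not select columns.)
loader = CSVLoader(
    "employees.csv",
    content_columns=["name", "department", "skills"] # Only these columns go into page_content
)
CSVLoader works well for tabular data where each row is a self-contained record, such as support tickets, product catalogs, or employee directories.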
LangChain has an extensive library of community-contributed loaders covering almost every data source you can think of.
File Loaders: TextLoader, PyPDFLoader, PDFPlumberLoader, UnstructuredPDFLoader, CSVLoader, JSONLoader, UnstructuredWordDocumentLoader, UnstructuredExcelLoader, UnstructuredMarkdownLoader, and many more.
Web Loaders: WebBaseLoader, RecursiveUrlLoader (crawls entire sites), SitemapLoader, FireCrawlLoader, and others.
Cloud Storage Loaders: S3FileLoader, GCSFileLoader, AzureBlobStorageFileLoader, GoogleDriveLoader.
Database Loaders: SQLDatabaseLoader for querying relational databases directly.
Communication and Collaboration Tools: SlackDirectoryLoader, NotionDBLoader, ConfluenceLoader, GitHubIssuesLoader.
API and Data Service Loaders: ArxivLoader (research papers), WikipediaLoader, HNLoader (Hacker News), RedditPostsLoader, YoutubeLoader (for transcripts).
The right approach is project-driven learning. When your project needs to load data from a specific source, look up the appropriate loader for that source. Most loaders follow the same pattern you have seen in this article, so switching between them is straightforward.
All community loaders live in the langchain_community package:
from langchain_community.document_loaders import SomeLoader
Sometimes you need to load data from a source that does not have an existing loader. LangChain makes this easy by providing a base class you can extend.
A custom loader must implement the lazy_load() method. The load() method is inherited from the base class and calls lazy_load() internally.
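The relationship between the two methods can be sketched in a few lines of plain Python. SketchBaseLoader below is a simplified stand-in written for this example, not LangChain's actual BaseLoader, and LineLoader is a hypothetical subclass that yields strings rather than Document objects. It shows why implementing only lazy_load() is enough.

```python
from typing import Iterator, List

class SketchBaseLoader:
    """Simplified stand-in for a loader base class (not LangChain's actual code)."""

    def lazy_load(self) -> Iterator[str]:
        raise NotImplementedError("subclasses must implement lazy_load()")

    def load(self) -> List[str]:
        # Eager loading is just the lazy generator collected into a list.
        return list(self.lazy_load())

class LineLoader(SketchBaseLoader):
    """Hypothetical loader that yields one 'document' per non-empty line."""

    def __init__(self, text: str):
        self.text = text

    def lazy_load(self) -> Iterator[str]:
        for line in self.text.splitlines():
            if line.strip():
                yield line.strip()

loader = LineLoader("first record\n\nsecond record\n")
print(loader.load())
```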
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from typing import Iterator
import requests
class MyAPILoader(BaseLoader):
    def __init__(self, api_url: str, api_key: str):
        self.api_url = api_url
        self.api_key = api_key

    def lazy_load(self) -> Iterator[Document]:
        response = requests.get(
            self.api_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,  # Avoid hanging forever on a slow API
        )
        response.raise_for_status()  # Fail loudly on HTTP errors
        data = response.json()
        for item in data["records"]:
            yield Document(
                page_content=item["text_content"],
                metadata={
                    "source": self.api_url,
                    "record_id": item["id"],
                    "created_at": item["created_at"]
                }
            )

# Usage
loader = MyAPILoader(api_url="https://api.myservice.com/data", api_key="your_key")
docs = loader.load()
By implementing lazy_load() as a generator, your custom loader is automatically memory-efficient.
Now that you understand Document Loaders deeply, it is helpful to see how they connect to the rest of the RAG pipeline.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Step 1: Load documents
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
# Step 2: Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Step 3: Embed and store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
# Step 4: Retrieve
retriever = vectorstore.as_retriever()
results = retriever.invoke("What is the vacation policy?")
# Step 5: Generate (with an LLM chain)
The Document Loader is the entry point for all data into your RAG system. Everything downstream depends on getting this step right.
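Step 2 in the pipeline above sets chunk_size=500 and chunk_overlap=50. To show what those two numbers mean, here is a deliberately simplified fixed-window splitter. It is an assumption-laden sketch: RecursiveCharacterTextSplitter is smarter than this and prefers to break on separators such as paragraphs and sentences, whereas split_with_overlap below just slides a character window.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.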
Now that you understand Document Loaders, the next step in the RAG pipeline is Text Splitters: the components that break your documents into smaller, more retrievable chunks. After that, you will learn about Vector Databases and Retrievers, and then put everything together to build a complete, end-to-end RAG application.
This article is part of an ongoing LangChain series covering everything from basic components to production-ready RAG applications.