RAG in LangChain Explained: Document Loaders, Vector Stores, and Building Your First RAG Application

Introduction

If you have been following this LangChain series, you already know how to work with Models, Prompts, Chains, and Runnables. You understand the LCEL pipe syntax, how chain pipelines are structured, and how the Runnable system ties everything together.

Now it is time to take the next big step: building real, production-ready applications using LangChain.

And the most important category of real-world LLM applications right now is RAG-based applications.

In this article, we will cover:

What RAG is and why it exists
How a RAG pipeline works step by step
The four core components every RAG system needs
A deep dive into Document Loaders, including TextLoader, PyPDFLoader, DirectoryLoader, WebBaseLoader, CSVLoader, and more
The difference between load() and lazy_load()
How to create a custom document loader

By the end of this article, you will have a clear mental model of how data flows through a RAG system and how Document Loaders fit into that flow.

What is RAG? A Clear Definition

RAG stands for Retrieval-Augmented Generation.

It is a technique where you:

Retrieve relevant information from an external knowledge source
Augment the LLM prompt with that retrieved information
Let the LLM generate a grounded, accurate answer based on what it retrieved

In simple terms: instead of relying solely on what the LLM memorized during training, you give it access to fresh, specific, and private data at query time.

Why Does RAG Exist? The Core Problem It Solves

Large Language Models like GPT-4 or Claude are impressive. But they have a fundamental limitation: their knowledge is frozen at the time of training.

This means they cannot answer questions about:

News or events that happened after their training cutoff
Your private files, emails, or documents
Internal company data and reports
Real-time database records
PDFs stored on your machine
Proprietary documentation

For example, if you ask a base LLM "What did our sales team close last quarter?" it has no idea. It never saw your CRM data. It cannot answer.

This is where RAG steps in.

RAG connects the LLM to an external knowledge base that you control. The LLM no longer needs to have memorized everything. It can look things up on demand, from sources you provide.

This makes RAG especially powerful for:

Customer support chatbots trained on your product documentation
Legal assistants that reference specific case files
Research tools that search through large collections of papers
Internal knowledge bases for enterprise teams
Personal productivity tools that work with your own notes and emails

How a RAG Pipeline Works: Step by Step

Here is the flow of a typical RAG application:

code

User asks a question
        |
        v
Search the knowledge base for relevant content
        |
        v
Retrieve the most relevant document chunks
        |
        v
Inject those chunks into the LLM prompt
        |
        v
LLM reads the context and generates a grounded answer
        |
        v
Return the answer to the user

The key insight is that the LLM is not generating from memory alone. It is reading the relevant context you give it, just like a human would read a document before answering a question about it.

This approach dramatically improves:

Accuracy on domain-specific questions
Ability to work with private and real-time data
Transparency, because answers can be traced back to source documents
Scalability, because you can update the knowledge base without retraining the model
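
The retrieve-then-inject flow described in this section can be sketched in a few lines of plain Python. This is only a toy illustration: the knowledge base, the word-overlap scoring, and the function names (tokenize, retrieve, build_prompt) are assumptions made for the example, and a real system would use embeddings and an actual LLM call instead of printing the prompt.

```python
import re

# Hypothetical in-memory knowledge base; real systems store many more chunks.
KNOWLEDGE_BASE = [
    "Refunds are accepted within 30 days of purchase.",
    "Standard shipping takes 5 to 7 business days.",
    "Support is available by email from Monday to Friday.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt that would go to the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


question = "How many days does shipping take?"
top_chunks = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The model never sees the whole knowledge base, only the chunks that scored highest for this particular question.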

The Four Core Components of a RAG Application

Almost every RAG system in production is built from four fundamental components:

1. Document Loader: Loads raw data from sources like PDFs, websites, CSV files, or databases and converts it into a standard format LangChain can work with.

2. Text Splitter: Breaks long documents into smaller chunks, because LLMs have limited context windows and you want to retrieve precise, relevant chunks rather than entire documents.

3. Vector Database: Stores the embedded representations (vectors) of your document chunks. When a query comes in, the vector database finds the most semantically similar chunks.

4. Retriever: The interface between the vector database and the LLM chain. It takes a query and returns the most relevant chunks to be injected into the prompt.

The full pipeline looks like this:

Load Data -> Split into Chunks -> Embed and Store -> Retrieve -> Generate
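
To make the split, embed, store, and retrieve stages concrete, here is a minimal standard-library sketch. Everything in it is a stand-in chosen for illustration: word counts play the role of embeddings, a Python list plays the role of the vector database, and the names split_words, embed, cosine, and retrieve are hypothetical, not LangChain APIs.

```python
import math
import re
from collections import Counter


def split_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last chunk already reaches the end of the text.
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


text = (
    "Employees accrue vacation days every month. "
    "Unused vacation days expire in March. "
    "The office kitchen is cleaned on Fridays. "
    "Laptops are replaced every three years."
)

chunks = split_words(text, chunk_size=8, overlap=2)   # Split into chunks
store = [(chunk, embed(chunk)) for chunk in chunks]   # Embed and store
top = retrieve("When are laptops replaced?", store)   # Retrieve
print(top[0])
```

The overlap means a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which is the same reason real text splitters offer an overlap setting.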

In this article, we focus on the first component: Document Loaders.

What Are Document Loaders?

Document Loaders are LangChain components that load data from different sources and convert that data into a standard format called a Document object.

Data can come from almost anywhere:

PDF files
Plain text files
Web pages
CSV files
Databases
Cloud storage (S3, Google Drive)
APIs
Email
Notion, Confluence, and other tools

No matter the source, every loader converts the data into the same standard structure: a list of Document objects.

The Document Object

Every document in LangChain is built around two fields (recent versions also accept an optional id):

code

Document(
    page_content="This is the actual text content of the chunk.",
    metadata={"source": "report.pdf", "page": 3}
)

page_content contains the raw text that will be embedded, searched, and fed to the LLM
metadata contains additional context about the document, such as the source file, page number, URL, author, or any custom fields you want to attach

This standardized format is what makes LangChain's pipeline so composable. Once data is in Document format, you can pass it through any Text Splitter, any Vector Store, and any Retriever without worrying about the original source format.
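
The value of a single shared shape is easy to see with a small stand-in. The Doc dataclass below is a hypothetical imitation written for this example, not LangChain's Document class; it only mirrors the two fields described above to show that downstream code can stay indifferent to where the text came from.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in mirroring the page_content / metadata shape."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe(doc: Doc) -> str:
    """Downstream code only relies on the two standard fields."""
    source = doc.metadata.get("source", "unknown")
    return f"{source}: {len(doc.page_content)} characters"


docs = [
    Doc("Quarterly revenue grew.", {"source": "report.pdf", "page": 3}),  # from a PDF
    Doc("name: Ada\nrole: Engineer", {"source": "staff.csv", "row": 0}),  # from a CSV
    Doc("Welcome to the docs."),                                          # no metadata
]

for doc in docs:
    print(describe(doc))
```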

Why Document Loaders Matter

Without Document Loaders, you would need to write custom parsing logic for every data source. A PDF parser, an HTML scraper, a CSV reader, a database connector - all custom code just to get data into your pipeline.

Document Loaders handle all of that for you. They abstract away the messy details of reading different file formats so you can focus on building the RAG logic.

TextLoader: The Simplest Document Loader

TextLoader is the most basic loader in LangChain. It reads a plain text file and returns its content as a list containing a single Document object.

When to Use TextLoader

Use TextLoader for:

Plain .txt files
Log files
Transcripts
Source code files
Markdown files (when you want raw text, not parsed markdown)
Any unstructured text file

How to Use TextLoader

code

from langchain_community.document_loaders import TextLoader

loader = TextLoader("meeting_notes.txt")
docs = loader.load()

print(len(docs))        # 1
print(docs[0].page_content[:200])  # First 200 characters of the file
print(docs[0].metadata) # {'source': 'meeting_notes.txt'}

Because a single .txt file is treated as one document, you will typically pass the result through a Text Splitter next if the file is large.

Encoding Support

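If your file uses a non-UTF-8 encoding, you can specify it explicitly, or ask the loader to detect it:

code

loader = TextLoader("data.txt", encoding="latin-1")  # File was saved as Latin-1
# Alternatively: TextLoader("data.txt", autodetect_encoding=True)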
docs = loader.load()
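
To see why the encoding matters, the following standard-library sketch writes a Latin-1 file and then tries to read it. The helper name read_text_with_fallback and the file name are made up for the example; the point is only that decoding with the wrong codec fails, and that a fallback order is one simple way to cope.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path: Path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # Wrong codec for this file, try the next one.
    raise ValueError(f"Could not decode {path} with any of {encodings}")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy.txt"
    # Simulate a file saved by an older tool in Latin-1.
    path.write_bytes("caf\u00e9 r\u00e9sum\u00e9".encode("latin-1"))

    text, used = read_text_with_fallback(path)
    # Shape the result like a single-document load.
    docs = [{"page_content": text, "metadata": {"source": path.name}}]

print(used, docs[0]["page_content"])
```

Latin-1 can decode any byte sequence, so it belongs last in the list: it never raises, but it may silently produce wrong characters for files in other encodings.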

PyPDFLoader: Loading PDF Documents

PDFs are one of the most common data sources in enterprise RAG applications. LangChain provides multiple PDF loaders, and PyPDFLoader is the most commonly used starting point.

Key Behavior: One Document Per Page

PyPDFLoader splits the PDF by page. Each page becomes its own Document object. This is important to understand because it affects how your data is chunked and retrieved.

A PDF with 50 pages will produce 50 Document objects. Each one will have:

page_content: the extracted text from that page
metadata: the source file path and the zero-based page number (depending on the version, further PDF properties may be included)

How to Use PyPDFLoader

code

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("annual_report.pdf")
docs = loader.load()

print(len(docs))           # Number of pages
print(docs[0].page_content)  # Text of first page
print(docs[0].metadata)      # {'source': 'annual_report.pdf', 'page': 0}
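
Per-page metadata is what lets an application cite where an answer came from. The sketch below uses plain dictionaries shaped like the output above (with 'page': 0 for the first page); the sample text and the helper names cite and find_pages are invented for illustration.

```python
# Hypothetical per-page documents, shaped like the loader output shown above.
pages = [
    {"page_content": "Introduction and company overview.",
     "metadata": {"source": "annual_report.pdf", "page": 0}},
    {"page_content": "Revenue grew in every region this year.",
     "metadata": {"source": "annual_report.pdf", "page": 1}},
    {"page_content": "Risks: supply chain and currency exposure.",
     "metadata": {"source": "annual_report.pdf", "page": 2}},
]


def cite(doc: dict) -> str:
    """Turn zero-based page metadata into a human-readable citation."""
    meta = doc["metadata"]
    return f'{meta["source"]}, page {meta["page"] + 1}'


def find_pages(docs: list[dict], keyword: str) -> list[dict]:
    """Return the pages whose text mentions the keyword (case-insensitive)."""
    return [d for d in docs if keyword.lower() in d["page_content"].lower()]


hits = find_pages(pages, "revenue")
citations = [cite(doc) for doc in hits]
print(citations)
```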

When PyPDFLoader Works Well

PyPDFLoader is a good choice for:

Standard text-based PDFs
PDFs with simple layouts
Reports, articles, and documentation in PDF format
Any PDF where text is selectable (not scanned)

When to Use a Different PDF Loader

PyPDFLoader struggles with:

Scanned PDFs (images of text) - these require OCR
Complex multi-column layouts
PDFs with tables where layout matters
PDFs with embedded images you want to extract

For these cases, LangChain offers alternatives:

Loader	Best For
PyPDFLoader	Standard text PDFs
PDFPlumberLoader	Tables and complex layouts
PyMuPDFLoader	Fast extraction, rich metadata
UnstructuredPDFLoader	Mixed content, images, complex structure
AmazonTextractPDFLoader	Scanned documents needing cloud-based OCR

DirectoryLoader: Loading Multiple Files at Once

In real projects, you rarely work with just one file. You might have a folder with hundreds of PDFs, or a directory full of text files. DirectoryLoader handles bulk loading efficiently.

How DirectoryLoader Works

You point it at a folder, tell it which file pattern to match using a glob expression, and specify which loader class to use for each file. DirectoryLoader then iterates through all matching files and loads them using the specified loader.
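
The folder, pattern, and per-file loader idea can be imitated with pathlib. This is a rough sketch of the mechanism, not DirectoryLoader's actual implementation; load_directory and load_text_file are hypothetical helpers, and the glob behaviour shown is that of Python's pathlib.

```python
import tempfile
from pathlib import Path


def load_text_file(path: Path) -> list[dict]:
    """Per-file loader: one document per text file."""
    return [{"page_content": path.read_text(encoding="utf-8"),
             "metadata": {"source": str(path)}}]


def load_directory(folder: Path, glob: str, loader) -> list[dict]:
    """Apply `loader` to every file under `folder` that matches `glob`."""
    docs = []
    for path in sorted(Path(folder).glob(glob)):
        if path.is_file():
            docs.extend(loader(path))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "b.txt").write_text("beta", encoding="utf-8")
    (root / "notes.md").write_text("not matched", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "c.txt").write_text("gamma", encoding="utf-8")

    top_level = load_directory(root, "*.txt", load_text_file)     # This folder only
    recursive = load_directory(root, "**/*.txt", load_text_file)  # Includes subfolders

print(len(top_level), len(recursive))
```

With pathlib, a pattern like "*.txt" stays in the top-level folder, while "**/*.txt" also descends into subfolders, so the choice of pattern directly changes how many documents you get.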

Basic Usage

code

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader

loader = DirectoryLoader(
    "company_docs/",      # Folder to search
    glob="*.pdf",         # File pattern to match
    loader_cls=PyPDFLoader  # Which loader to use per file
)

docs = loader.load()
print(len(docs))  # Total documents across all files

Understanding the Document Count

If your folder contains:

5 PDF files
Each PDF has 20 pages

Then DirectoryLoader will produce 100 Document objects (5 files x 20 pages each).

Using Different Loaders with Different File Types

You can use DirectoryLoader with any loader class:

code

# Load all .txt files
from langchain_community.document_loaders import TextLoader

txt_loader = DirectoryLoader("logs/", glob="*.txt", loader_cls=TextLoader)
txt_docs = txt_loader.load()

# Load all .csv files
from langchain_community.document_loaders import CSVLoader

csv_loader = DirectoryLoader("data/", glob="*.csv", loader_cls=CSVLoader)
csv_docs = csv_loader.load()

Show Progress for Large Directories

For large folders, you can enable a progress bar:

code

loader = DirectoryLoader(
    "large_folder/",
    glob="*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)

Handling Errors Gracefully

By default, if one file fails to load, the entire process stops. You can configure it to skip errors:

code

loader = DirectoryLoader(
    "mixed_folder/",
    glob="*.pdf",
    loader_cls=PyPDFLoader,
    silent_errors=True  # Skip files that fail, continue loading others
)
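
The skip-and-continue behaviour amounts to a try/except around each file. The sketch below is an assumption about the general pattern rather than a copy of LangChain's code; load_all and fake_loader are hypothetical, and the "corrupt" file is simulated.

```python
import logging

logger = logging.getLogger("loader_sketch")


def load_all(paths, loader, silent_errors=False):
    """Load every path; optionally log and skip files that fail."""
    docs, failed = [], []
    for path in paths:
        try:
            docs.extend(loader(path))
        except Exception as exc:
            if not silent_errors:
                raise  # Default: one bad file stops the whole run.
            logger.warning("Skipping %s: %s", path, exc)
            failed.append(path)
    return docs, failed


def fake_loader(path: str) -> list[str]:
    """Simulated per-file loader that fails on one 'corrupt' file."""
    if path.endswith("corrupt.pdf"):
        raise ValueError("cannot parse file")
    return [f"text of {path}"]


paths = ["a.pdf", "corrupt.pdf", "b.pdf"]
docs, failed = load_all(paths, fake_loader, silent_errors=True)
print(docs, failed)
```

Keeping a list of the files that failed is worth the extra line: silently skipped documents are otherwise easy to miss, and they leave gaps in what the RAG system can answer.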

load() vs lazy_load(): An Important Distinction

Every LangChain document loader supports two loading methods: load() and lazy_load(). Understanding the difference is critical for building memory-efficient applications.

load(): Load Everything at Once

code

docs = loader.load()
# All documents are now in memory

load() reads all documents into memory immediately and returns a list. This is simple and convenient, but it can be a problem when:

You are loading a large folder of PDFs
The combined size of all documents exceeds available RAM
You want to process documents as they come in rather than waiting for all of them

lazy_load(): Load One Document at a Time

code

for doc in loader.lazy_load():
    process_document(doc)
    # Only one document is in memory at a time

lazy_load() returns a Python generator. It loads and yields one document at a time. The next document is only fetched when the previous one has been processed. This makes it ideal for:

Very large files or folders
Streaming scenarios where you want to start processing before all loading is complete
Memory-constrained environments
Pipelines where early results are more valuable than waiting for everything

Practical Comparison

code

# load() - Simple, all in memory
docs = loader.load()
for doc in docs:
    embed_and_store(doc)

# lazy_load() - Memory efficient, generator-based
for doc in loader.lazy_load():
    embed_and_store(doc)

For production applications handling large document collections, lazy_load() is almost always the better choice.
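
The laziness of a generator, and a common way to consume it in batches, can be demonstrated without any loader at all. In this sketch lazy_docs stands in for lazy_load() and records which items have actually been produced; batched is a small hypothetical helper built on itertools.islice.

```python
from itertools import islice

produced = []  # Records which documents have actually been created.


def lazy_docs(n: int):
    """Stand-in for lazy_load(): creates each document only when asked."""
    for i in range(n):
        produced.append(i)
        yield f"document {i}"


def batched(iterable, size: int):
    """Group a lazy stream into lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


stream = lazy_docs(1000)
first_batch = next(batched(stream, 3))
produced_after_first_batch = len(produced)  # Only 3 of the 1000 exist so far.

batch_sizes = [len(batch) for batch in batched(lazy_docs(7), 3)]
print(first_batch, produced_after_first_batch, batch_sizes)
```

Batching keeps memory bounded while still letting you send several documents per embedding call, which is usually cheaper than embedding them one by one.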

WebBaseLoader: Loading Web Pages

WebBaseLoader lets you load the text content of any publicly accessible web page into a LangChain Document. This opens up a huge range of use cases.

How WebBaseLoader Works Internally

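WebBaseLoader uses the requests library to fetch the HTML of the page, then uses BeautifulSoup to parse it and extract the text. By default it returns the text of the whole page, so navigation menus, footers, and similar boilerplate usually end up in page_content alongside the main content. To keep only the parts you care about, pass BeautifulSoup options through the bs_kwargs argument, for example a SoupStrainer that selects specific elements. Content that is rendered by JavaScript is not captured, because no browser ever runs the page.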
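
The parse-and-extract step can be approximated with Python's built-in html.parser. This sketch is a simplified stand-in for what an HTML-to-text pass does, not WebBaseLoader's code: the TextExtractor class and the sample HTML are invented for the example, and the fetch step is left out so that it runs offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


HTML = """<html><head><title>Docs</title><style>p {color: red}</style></head>
<body><nav>Home | About</nav><script>track();</script>
<p>Built-in functions are always available.</p></body></html>"""

text = html_to_text(HTML)
print(text)
```

In this sketch the script and style contents are dropped, but the navigation text survives, because a plain text dump has no way of knowing which elements are the actual content.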

Basic Usage

code

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.python.org/3/library/functions.html")
docs = loader.load()

print(docs[0].page_content[:500])
print(docs[0].metadata)  # 'source' URL, plus 'title', 'description', and 'language' when available

Loading Multiple URLs at Once

WebBaseLoader accepts a list of URLs:

code

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

loader = WebBaseLoader(urls)
docs = loader.load()
# Returns one Document per URL

When WebBaseLoader Works Well

WebBaseLoader is great for:

Blog posts and articles
Documentation sites
Static HTML pages
News articles
Wikipedia entries

Limitations of WebBaseLoader

WebBaseLoader has limitations you should be aware of:

Pages that require JavaScript to render content may return incomplete or empty text
Pages behind authentication walls cannot be accessed
Rate limiting or bot detection may block requests on some sites
Dynamic single-page applications built with React, Vue, or Angular may not extract correctly

For JavaScript-heavy sites, consider a browser-based loader such as SeleniumURLLoader or PlaywrightURLLoader, which renders the page in a real browser before extracting content.

Project Idea: Chat with Any Webpage

WebBaseLoader is the foundation for a compelling product: a browser extension that lets you chat with the content of any webpage you are viewing. Load the page, chunk the text, embed it, and set up a conversational interface. This is a genuinely useful project that demonstrates the full RAG pipeline.

CSVLoader: Loading Tabular Data

CSVLoader loads CSV files where each row becomes its own Document object. This is particularly useful when your data is structured as a table and each row represents a self-contained record.

Basic Usage

code

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("customer_support_tickets.csv")
docs = loader.load()

print(len(docs))         # One document per row
print(docs[0].page_content)  # The text representation of the first row
print(docs[0].metadata)      # {'source': 'customer_support_tickets.csv', 'row': 0}

How Row Content is Represented

By default, each row is represented as a string where column names and values are joined:

code

column1: value1
column2: value2
column3: value3
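
The row-to-document conversion can be reproduced with the standard csv module. This is a sketch of the idea with made-up ticket data and a hypothetical csv_to_docs helper; it mirrors the "column: value" layout and the source/row metadata shown above.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
CSV_TEXT = (
    "ticket_id,subject,status\n"
    "101,Login fails,open\n"
    "102,Refund request,closed\n"
)


def csv_to_docs(text: str, source: str) -> list[dict]:
    """Turn each CSV row into one document with 'column: value' lines."""
    docs = []
    for row_index, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_index},
        })
    return docs


docs = csv_to_docs(CSV_TEXT, source="customer_support_tickets.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping the column names in the text gives the embedding model context for each value, so a query about a ticket's status can match the "status" line rather than a bare word.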

Specifying the Source Column

You can configure which column to use as the document's source in its metadata:

code

loader = CSVLoader(
    "products.csv",
    source_column="product_id"  # Use this column value as the source
)

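Choosing Specific Columns for Content

If your CSV has many columns and you only want certain ones to become the document content, list them in content_columns. Columns you would rather keep as metadata can be listed in metadata_columns.

code

loader = CSVLoader(
    "employees.csv",
    content_columns=["name", "department", "skills"]  # Only these go into page_content
)

Note that csv_args does something different: it is forwarded to Python's csv.DictReader, so csv_args={"fieldnames": [...]} names the columns of a file that has no header row rather than selecting columns.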
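
Because fieldnames and column selection are easy to confuse, here is what Python's csv.DictReader does with fieldnames, followed by a separate selection step. The sample data and the row_to_content helper are hypothetical; the DictReader behaviour is that of the standard library.

```python
import csv
import io

COLUMNS = ["name", "department", "skills"]

# A file WITHOUT a header row: fieldnames supplies the column names.
HEADERLESS = "Ada,Engineering,Python\nGrace,Research,COBOL\n"
rows = list(csv.DictReader(io.StringIO(HEADERLESS), fieldnames=COLUMNS))

# A file WITH a header row: passing fieldnames makes the header a data row.
WITH_HEADER = "name,department,skills\nAda,Engineering,Python\n"
rows_with_header = list(csv.DictReader(io.StringIO(WITH_HEADER), fieldnames=COLUMNS))


def row_to_content(row: dict, columns: list[str]) -> str:
    """Select only the given columns for the document text."""
    return "\n".join(f"{column}: {row[column]}" for column in columns)


content = row_to_content(rows[0], ["name", "skills"])
print(content)
print(rows_with_header[0])
```

In other words, fieldnames answers "what are the columns called?", while choosing which columns end up in the document text is a separate decision.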

When CSVLoader Is Useful

CSVLoader works well for:

FAQ datasets where each row is a question-answer pair
Product catalogs
Customer records
Support tickets
Any dataset where each row is a semantically complete unit

The LangChain Document Loader Ecosystem

LangChain has an extensive library of community-contributed loaders covering almost every data source you can think of.

Categories of Available Loaders

File Loaders: TextLoader, PyPDFLoader, PDFPlumberLoader, UnstructuredPDFLoader, CSVLoader, JSONLoader, UnstructuredWordDocumentLoader, UnstructuredExcelLoader, UnstructuredMarkdownLoader, and many more.

Web Loaders: WebBaseLoader, RecursiveUrlLoader (crawls entire sites), SitemapLoader, FireCrawlLoader, and others.

Cloud Storage Loaders: S3FileLoader, GCSFileLoader, AzureBlobStorageFileLoader, GoogleDriveLoader.

Database Loaders: SQLDatabaseLoader for querying relational databases directly.

Communication and Collaboration Tools: SlackDirectoryLoader, NotionDBLoader, ConfluenceLoader, GitHubIssuesLoader.

API and Data Service Loaders: ArxivLoader (research papers), WikipediaLoader, HNLoader (Hacker News), RedditPostsLoader, YoutubeLoader (for transcripts).

You Do Not Need to Learn All of Them

The right approach is project-driven learning. When your project needs to load data from a specific source, look up the appropriate loader for that source. Most loaders follow the same pattern you have seen in this article, so switching between them is straightforward.

All community loaders live in the langchain_community package:

code

from langchain_community.document_loaders import SomeLoader

Building a Custom Document Loader

Sometimes you need to load data from a source that does not have an existing loader. LangChain makes this easy by providing a base class you can extend.

When to Build a Custom Loader

Your company has an internal API with proprietary data
You need to load from a niche file format not yet supported
You want special preprocessing logic during the loading step
You need to load from a custom database schema

How to Build One

A custom loader must implement the lazy_load() method. The load() method is inherited from the base class and calls lazy_load() internally.

code

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from typing import Iterator
import requests

class MyAPILoader(BaseLoader):
    def __init__(self, api_url: str, api_key: str):
        self.api_url = api_url
        self.api_key = api_key

    def lazy_load(self) -> Iterator[Document]:
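        response = requests.get(
            self.api_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,  # Avoid hanging forever on a stalled connection
        )
        response.raise_for_status()  # Fail loudly on 4xx/5xx responses
        data = response.json()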

        for item in data["records"]:
            yield Document(
                page_content=item["text_content"],
                metadata={
                    "source": self.api_url,
                    "record_id": item["id"],
                    "created_at": item["created_at"]
                }
            )

# Usage
loader = MyAPILoader(api_url="https://api.myservice.com/data", api_key="your_key")
docs = loader.load()

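By implementing lazy_load() as a generator, your custom loader lets callers process documents one at a time. Note that this example still downloads the full API response in a single request; for very large sources, fetch the data page by page inside lazy_load() so the raw data is streamed as well.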
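
The relationship between lazy_load() and load() can be shown with a dependency-free imitation. SketchLoader below is a hypothetical stand-in for the base class, and JSONLinesLoader is an invented example that reads one JSON record per line, so the file itself is streamed rather than read in one go.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


class SketchLoader:
    """Hypothetical stand-in for a loader base class."""

    def lazy_load(self) -> Iterator[dict]:
        raise NotImplementedError("Subclasses implement lazy_load()")

    def load(self) -> list[dict]:
        # load() is just the lazy stream collected into a list.
        return list(self.lazy_load())


class JSONLinesLoader(SketchLoader):
    """Yield one document per non-empty line of a JSON Lines file."""

    def __init__(self, path):
        self.path = Path(path)

    def lazy_load(self) -> Iterator[dict]:
        with self.path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                if not line.strip():
                    continue  # Skip blank lines.
                record = json.loads(line)
                yield {
                    "page_content": record["text"],
                    "metadata": {"source": str(self.path), "line": line_number},
                }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "records.jsonl"
    path.write_text('{"text": "first"}\n\n{"text": "second"}\n', encoding="utf-8")
    docs = JSONLinesLoader(path).load()

print([doc["page_content"] for doc in docs])
```

Writing only lazy_load() and deriving load() from it keeps the two methods consistent and gives callers the choice between streaming and a plain list.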

Putting It All Together: Where Document Loaders Fit in a Full RAG Pipeline

Now that you understand Document Loaders deeply, it is helpful to see how they connect to the rest of the RAG pipeline.

code

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Step 1: Load documents
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()

# Step 2: Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Step 3: Embed and store
embeddings = OpenAIEmbeddings()  # Reads the OPENAI_API_KEY environment variable
vectorstore = FAISS.from_documents(chunks, embeddings)

# Step 4: Retrieve
retriever = vectorstore.as_retriever()
results = retriever.invoke("What is the vacation policy?")

# Step 5: Generate (with an LLM chain)

The Document Loader is the entry point for all data into your RAG system. Everything downstream depends on getting this step right.

Key Takeaways

RAG solves the fundamental limitation of LLMs by connecting them to external knowledge at query time
Every RAG system is built on four components: Document Loaders, Text Splitters, Vector Databases, and Retrievers
Document Loaders convert raw data from any source into a standardized Document object with page_content and metadata
TextLoader handles plain text, PyPDFLoader handles PDFs (one document per page), DirectoryLoader handles entire folders, WebBaseLoader handles web pages, and CSVLoader handles tabular data (one document per row)
load() loads everything at once; lazy_load() uses a generator for memory-efficient streaming
LangChain has hundreds of community loaders covering almost every data source imaginable
You can build custom loaders by extending BaseLoader and implementing lazy_load()

What Comes Next

Now that you understand Document Loaders, the next step in the RAG pipeline is Text Splitters: the components that break your documents into smaller, more retrievable chunks. After that, you will learn about Vector Databases and Retrievers, and then put everything together to build a complete, end-to-end RAG application.

This article is part of an ongoing LangChain series covering everything from basic components to production-ready RAG applications.
