Text Splitting in RAG: The Complete Guide to Chunking Strategies for LangChain

Introduction

If you have ever tried building a Retrieval-Augmented Generation (RAG) application, you already know that loading documents is only the first step. The real challenge begins when you ask yourself:

How do I prepare this text so that an LLM can actually use it effectively?

That is where text splitting comes in.

Text splitting is not just a preprocessing detail. It is one of the most critical decisions you will make in any RAG pipeline. Get it wrong and your embeddings become noisy, your search results drift off topic, and your LLM is more likely to hallucinate. Get it right and you remove one of the most common obstacles to a fast, accurate, production-ready RAG application.

In this guide, we are going to break down text splitting from the ground up:

What text splitting actually is
Why your RAG system will fail without proper chunking
The context length problem in LLMs and how it affects your design
Why smaller chunks produce better embeddings and search results
All four major text splitting strategies in LangChain explained in depth
Chunk overlap: what it is, why it matters, and how much to use
A practical decision guide for which splitter to pick in different scenarios

Whether you are just starting out with RAG or optimizing an existing pipeline, this guide will give you everything you need to master text splitting.

What Is Text Splitting?

Text splitting is the process of breaking large documents into smaller, more manageable pieces called chunks.

These chunks are then:

Converted into vector embeddings using an embedding model
Stored in a vector database
Retrieved at query time based on semantic similarity
Passed to an LLM as context to generate a response

Think of it like this: if you hand someone a 500-page textbook and ask them to answer a question in 30 seconds, they will struggle. But if you hand them just the two relevant paragraphs, they will nail the answer immediately. Text splitting does exactly that for your LLM.

What counts as a "large" document?

Practically anything can qualify:

PDF files with hundreds of pages
Long-form articles and blog posts
Entire books or legal contracts
Large HTML pages or scraped websites
API documentation and technical manuals
Code repositories and markdown files

All of these need to be chunked before they can be effectively used in a RAG pipeline.

Why Is Text Splitting Important? 5 Key Reasons

Many developers underestimate text splitting because they assume any chunking strategy will do. But there are at least five concrete, technical reasons why the quality of your text splitting directly determines the quality of your RAG output.

1. LLM Context Window Limitations

Every LLM has a maximum number of tokens it can process in a single request. This is called the context window.

Here are some rough numbers for reference:

Model	Approximate Context Window
GPT-3.5	4,096 tokens
GPT-4 (standard)	8,192 tokens
Claude 3	Up to 200,000 tokens
Llama 3 (8B)	8,192 tokens

Even with models offering very large context windows, sending an entire 500-page PDF in one shot is not practical. It is expensive, slow, and often produces worse results than targeted retrieval of relevant chunks.

Text splitting ensures that the content you pass to the LLM fits cleanly within the context window while remaining focused and relevant to the user's query.
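A rough way to reason about this is a token budget. The sketch below is only an illustration: the function names are hypothetical, and the "about 4 characters per token" ratio is a coarse assumption for English text, so swap in your model's real tokenizer before relying on the numbers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumed ratio); use a real tokenizer in production."""
    return max(1, round(len(text) / chars_per_token))


def select_chunks_within_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep chunks in ranked order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


# Three retrieved chunks of 2,000 characters each (about 500 estimated tokens apiece).
ranked_chunks = ["a" * 2000, "b" * 2000, "c" * 2000]
kept = select_chunks_within_budget(ranked_chunks, budget_tokens=1200)
print(len(kept))  # 2
```

Only two of the three chunks fit in the 1,200-token budget, which is the kind of trade-off chunk size controls: smaller chunks let you fit more distinct pieces of evidence into the same window.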

2. Better Embeddings Through Focused Content

When you generate an embedding for a piece of text, you are compressing the semantic meaning of that text into a single vector of fixed dimensions, often 768, 1536, or 3072 numbers.

If you try to embed a 50-page document as a single unit, you are asking the embedding model to compress an enormous, diverse set of ideas into one vector. The result is a blurry, diluted representation that does not capture any single topic very well.

Now consider what happens when you split that document into focused 300-word chunks, each covering a specific topic. Each embedding becomes a precise, clean representation of that topic. When a user asks a question, the similarity search can now pinpoint the most relevant chunk rather than retrieving an entire document based on average relevance.

The principle is simple: smaller, focused chunks produce sharper embeddings, up to the point where a chunk becomes too short to carry a complete idea.
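You can see the dilution effect with a toy experiment. The sketch below uses a bag-of-words count vector as a stand-in for a real embedding model (an assumption made purely so the example runs without any dependencies). Real embeddings are far richer, but the intuition is the same: mixing unrelated topics into one vector pulls it away from any single query.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


focused = "chunk overlap preserves context at chunk boundaries"
other_topics = (
    "pricing plans are billed monthly "
    "support tickets are answered within two days "
    "the office is closed on public holidays"
)
whole_document = focused + " " + other_topics

query = embed("how does chunk overlap preserve context")
score_focused = cosine(query, embed(focused))
score_whole = cosine(query, embed(whole_document))
print(score_focused > score_whole)  # True
```

The focused chunk scores higher against the query than the same text buried inside a multi-topic document, even though both contain the answer.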

3. Improved Semantic Search Accuracy

Semantic search works by comparing the embedding of a user's query against the embeddings of all stored chunks. The closer two embeddings are in vector space, the more relevant the chunk is assumed to be.

If your chunks are too large, a single chunk might contain information about five different topics. Even if the query is relevant to just one of those topics, the chunk will match only partially, reducing its score in the similarity search.

Smaller, topically focused chunks allow semantic search to be far more precise. The result is that your RAG system retrieves the right information more consistently, which leads to better answers and fewer hallucinations from the LLM.

4. Reduced Hallucination in LLM Responses

LLMs are more likely to hallucinate when they are given either too little context or too much irrelevant context. When you pass a massive, unfocused chunk to an LLM as context, you are essentially providing noise along with the signal.

The model then has to figure out which parts of the context are relevant while also generating a response. This increases the likelihood of confabulation.

Well-split chunks act as targeted context. The LLM receives what it needs to answer the question, with far less surrounding noise.

5. Computational and Financial Efficiency

Hosted LLM APIs are typically billed per token. Smaller, relevant chunks mean:

Fewer tokens per request
Lower API costs
Faster response times
Ability to run more queries in parallel

In production systems with high query volumes, the difference in cost between good and poor chunking strategies can be significant.
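To get a feel for the scale, here is a back-of-the-envelope calculation. Every number in it is hypothetical (the request volume, the chunk sizes, and especially the placeholder price), so treat it as a template to fill in with your own provider's pricing rather than as real cost data.

```python
def context_tokens_per_request(chunk_tokens: int, top_k: int) -> int:
    """Tokens of retrieved context sent with each request."""
    return chunk_tokens * top_k


def monthly_context_cost(chunk_tokens: int, top_k: int, requests: int,
                         price_per_million_tokens: float) -> float:
    """Input-token cost of the retrieved context only (prompt and output excluded)."""
    total_tokens = context_tokens_per_request(chunk_tokens, top_k) * requests
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical numbers: 100,000 requests per month, 5 chunks per request,
# and a placeholder price of 1.00 currency unit per million input tokens.
large = monthly_context_cost(chunk_tokens=2000, top_k=5, requests=100_000,
                             price_per_million_tokens=1.0)
small = monthly_context_cost(chunk_tokens=250, top_k=5, requests=100_000,
                             price_per_million_tokens=1.0)
print(large, small)  # 1000.0 125.0
```

Under these assumptions, the context cost scales linearly with chunk size: chunks that are eight times smaller cost one eighth as much to send.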

Understanding Chunk Size and Chunk Overlap

Before diving into the types of splitters, you need to fully understand two configuration parameters that appear in almost every text splitter:

Chunk Size

Chunk size defines the maximum number of characters (or tokens, depending on the splitter) in each chunk.

Common chunk sizes and their use cases:

Chunk Size	Best For
100 to 300 characters	Short factual Q&A, precise retrieval
500 to 1000 characters	General-purpose RAG retrieval
1000 to 2000 characters	Summarization, complex reasoning tasks
2000+ characters	Document-level summarization, long-form generation

Choosing the right chunk size depends on your use case. For simple Q&A over a FAQ document, small chunks work well. For complex technical documentation where context matters, larger chunks may be necessary.

Chunk Overlap

Chunk overlap is the number of characters that are shared between consecutive chunks.

Here is a concrete example:

code

Chunk size = 200 characters
Overlap = 40 characters

Chunk 1: characters 0 to 200
Chunk 2: characters 160 to 360
Chunk 3: characters 320 to 520

Each chunk shares 40 characters with the previous one.
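The arithmetic behind those offsets is that each new chunk starts chunk size minus overlap characters after the previous one. A minimal sketch follows; the function name is hypothetical and it models plain fixed-size windows, not any particular LangChain splitter.

```python
def chunk_spans(text_length: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break
        start += step
    return spans


print(chunk_spans(text_length=520, chunk_size=200, overlap=40))
# [(0, 200), (160, 360), (320, 520)]
```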

Why does overlap matter?

Imagine a sentence that reads: "The transformer architecture, first introduced by Vaswani et al. in 2017, revolutionized natural language processing."

If you split the text right after "Vaswani et al. in 2017," one chunk ends with incomplete information and the next chunk begins with "revolutionized natural language processing" without the subject. This breaks the meaning.

Overlap softens this problem. Text near a boundary appears in both neighboring chunks, so an idea that is cut in one chunk is usually intact in the next. The surrounding context is preserved across chunks, which leads to better retrieval and better LLM responses.

How much overlap should you use?

A commonly recommended rule of thumb is to use 10 to 20 percent of your chunk size as overlap.

code

Chunk size = 1000 characters
Recommended overlap = 100 to 200 characters

More overlap means better context preservation but also more storage, more embeddings to compute, and slightly higher retrieval costs. Less overlap is leaner but risks losing context at boundaries.
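The storage side of that trade-off is easy to quantify for fixed-size windows. The sketch below counts how many chunks a 10,000-character document produces at different overlaps. The function name is hypothetical, and structure-aware splitters will give somewhat different counts because they cut at separators rather than exact offsets.

```python
import math


def chunk_count(text_length: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks needed to cover text_length characters."""
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap  # characters of new text contributed by each chunk
    return math.ceil((text_length - overlap) / step)


for overlap in (0, 100, 200):
    print(overlap, chunk_count(text_length=10_000, chunk_size=1000, overlap=overlap))
# 0 10
# 100 11
# 200 13
```

Going from no overlap to 20 percent overlap raises the chunk count from 10 to 13 here, which means 30 percent more embeddings to compute and store.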

The Four Types of Text Splitters in LangChain

LangChain provides four major categories of text splitters. Each is designed for different document types and use cases.

1. Length-Based Text Splitting

This is the simplest and most primitive form of text splitting. You specify a maximum chunk size and the splitter cuts the text at that character or token count, regardless of what it is splitting.

How it works:

code

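from langchain_text_splitters import CharacterTextSplitter

# An empty separator makes the splitter cut purely by character count.
splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator=""
)

# `your_text` is a placeholder for the string you want to split.
chunks = splitter.split_text(your_text)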

The algorithm simply moves through the text, counts characters, and cuts at the specified size.
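To see what that looks like on real text, here is a dependency-free sketch of the same idea. It is an illustration of fixed-size cutting in general (the function name is hypothetical), not LangChain's internal implementation.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size windows, ignoring word and sentence boundaries."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks


text = "Text splitting breaks large documents into chunks."
for chunk in split_by_length(text, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
# 'Text splitting break'
# 'breaks large documen'
# 'cuments into chunks.'
```

Notice how "breaks" and "documents" are both cut mid-word. The overlap rescues "breaks" in the second chunk, but "documents" never appears whole in any chunk.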

Advantages:

Extremely simple to implement
Very fast, even for large documents
Predictable chunk sizes

Disadvantages:

Can split words in the middle
Frequently cuts sentences without regard for meaning
Produces semantically incoherent chunks

When to use it:

Length-based splitting is generally not recommended for production RAG systems. It is best used for quick experiments or in cases where the downstream process does not depend on coherent text boundaries, such as certain keyword search pipelines.

2. Text Structure-Based Splitting (RecursiveCharacterTextSplitter)

This is the most widely used and recommended text splitter in LangChain. It works by splitting text according to its natural hierarchical structure.

How it works:

The RecursiveCharacterTextSplitter uses a prioritized list of separators:

code

["\n\n", "\n", " ", ""]

It attempts to split first by double newlines (paragraph boundaries). If the resulting chunks are still too large, it tries single newlines. If still too large, it tries spaces. As a final resort, it splits by individual characters.

This recursive approach means the splitter always tries to preserve the highest-level structural unit it can while still meeting your chunk size requirement.
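The recursion is easier to follow in code. What follows is a deliberately simplified sketch of the idea, not LangChain's actual implementation: it ignores chunk overlap and separator-keeping options, and the function name is hypothetical.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator, recurse into oversized pieces, then merge."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    # Break on this separator; any part that is still too large goes one level finer.
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighboring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = ("RAG needs chunks.\n\n"
        "Paragraph two is a little longer than the first one.\n\n"
        "End.")
for chunk in recursive_split(text, chunk_size=40):
    print(repr(chunk))
# 'RAG needs chunks.'
# 'Paragraph two is a little longer than'
# 'the first one.\n\nEnd.'
```

The short first paragraph survives intact, the long second paragraph falls back to splitting on spaces, and no word is ever cut in half.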

code

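from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum chunk length, measured by length_function
    chunk_overlap=150,     # characters shared between neighboring chunks
    length_function=len,   # count characters; swap in a token counter if needed
    separators=["\n\n", "\n", " ", ""]  # tried in order, coarsest first
)

# `documents` is a list of Document objects, for example from a document loader.
chunks = splitter.split_documents(documents)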

Why it is the best general-purpose splitter:

It respects the natural structure of human-written text. Paragraphs are the primary unit of meaning in most documents. Splitting at paragraph boundaries whenever possible ensures that each chunk contains a coherent idea or argument.

Advantages:

Preserves natural text structure
Adapts to chunk size by trying progressively finer splits
Works well on a wide range of document types
Produces semantically meaningful chunks

Disadvantages:

Slightly more complex than a simple character splitter
May still break across ideas if document lacks clear paragraph structure

When to use it:

Use RecursiveCharacterTextSplitter as your default choice for almost all general-purpose RAG applications built on plain text, articles, reports, books, or similar unstructured documents.

3. Document Structure-Based Splitting

When your source documents are not plain prose but have a specific format, you need a splitter that understands the conventions of that format.

LangChain provides specialized splitters for:

Python code
JavaScript code
Markdown documents
HTML pages
JSON files
LaTeX documents

Example: Python Code Splitter

For Python source code, splitting by paragraphs or sentences makes no sense. The logical units of Python code are classes and functions. The Python splitter uses these as its primary separators:

code

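from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# from_language() preloads separators suited to the chosen language.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=200
)

# `python_code` is a placeholder for a string of Python source.
chunks = python_splitter.split_text(python_code)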

The separators for Python are roughly:

code

["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]

This ensures that classes and functions are kept intact wherever possible, which is critical for code search and retrieval.
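Here is a small standalone illustration of why those separators work. It is a sketch using a regular expression rather than LangChain itself, and it only handles top-level definitions. Because methods are indented, the newline before them is followed by spaces rather than "def ", so they stay inside their class.

```python
import re

source = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

# Split at newlines that are immediately followed by a top-level "class " or "def ".
# The zero-width lookahead keeps the keyword attached to the block it starts.
blocks = [b.strip() for b in re.split(r"\n(?=class |def )", source) if b.strip()]
for block in blocks:
    print(block.splitlines()[0])
# import math
# def area(r):
# class Circle:
```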

Example: Markdown Splitter

Markdown documents have a header hierarchy (H1, H2, H3) that defines their structure. The MarkdownHeaderTextSplitter splits the document along these headers and preserves the header information as metadata on each chunk:

code

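from langchain_text_splitters import MarkdownHeaderTextSplitter

# Each tuple maps a markdown header marker to the metadata key it populates.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# `markdown_document` is a placeholder for a markdown string.
chunks = splitter.split_text(markdown_document)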

Each chunk now carries metadata like {"Header 1": "Introduction", "Header 2": "Core Concepts"}, which can be used for filtering and metadata-aware retrieval.
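If you want to see the mechanics, the sketch below reimplements the core idea in plain Python. It is a simplified illustration with a hypothetical function name, not the library's code: it tracks the current header at each level and stamps that path onto every chunk. It does not handle edge cases such as "#" lines inside fenced code blocks.

```python
def split_markdown_by_headers(markdown: str) -> list[dict]:
    """Group lines under their nearest #, ##, ### headers and record them as metadata."""
    names = {1: "Header 1", 2: "Header 2", 3: "Header 3"}
    chunks, buffer = [], []
    headers = {}  # header depth -> current title at that depth

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {names[d]: t for d, t in sorted(headers.items())}
            chunks.append({"content": content, "metadata": metadata})
        buffer.clear()

    for line in markdown.splitlines():
        marker, _, title = line.partition(" ")
        if marker in ("#", "##", "###") and title.strip():
            flush()
            depth = len(marker)
            # A new header replaces its own level and clears any deeper levels.
            for d in [d for d in headers if d >= depth]:
                del headers[d]
            headers[depth] = title.strip()
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Introduction
Welcome.
## Core Concepts
Chunks and overlap.
# Appendix
Extra notes."""

for chunk in split_markdown_by_headers(doc):
    print(chunk["metadata"], "->", chunk["content"])
# {'Header 1': 'Introduction'} -> Welcome.
# {'Header 1': 'Introduction', 'Header 2': 'Core Concepts'} -> Chunks and overlap.
# {'Header 1': 'Appendix'} -> Extra notes.
```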

Example: HTML Splitter

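HTML documents can be split along their header tags. The HTMLHeaderTextSplitter splits on the header tags you specify, such as <h1> and <h2>, and attaches the matching header text to each chunk as metadata: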

code

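from langchain_text_splitters import HTMLHeaderTextSplitter

# Each tuple maps an HTML header tag to the metadata key it populates.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# `html_content` is a placeholder for a string of HTML.
chunks = splitter.split_text(html_content)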

Why document-structure splitters matter:

When your documents have a well-defined format, ignoring that format during splitting is a missed opportunity. Structure-aware splitting produces chunks that are both semantically coherent and enriched with structural metadata, which improves retrieval accuracy.

When to use it:

Use document-structure splitters any time your source material is code, markdown documentation, HTML pages, or other structured formats. Do not use a plain text splitter on these document types.

4. Semantic Meaning-Based Splitting

This is the most advanced and currently experimental approach to text splitting. Instead of using characters, tokens, or document structure as the basis for splitting, semantic splitters split based on changes in meaning within the text itself.

The core idea:

A single paragraph can shift between two entirely different topics. Traditional splitters would keep both topics in the same chunk because there is no structural boundary. A semantic splitter detects that the meaning has changed and splits there.

How it works:

The text is broken into individual sentences
An embedding is generated for each sentence
Consecutive sentence pairs are compared using cosine similarity
When the similarity score drops sharply, it signals a topic change
A split is made at that point

code

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = splitter.split_text(your_text)

The breakpoint_threshold_amount controls sensitivity. A higher percentile means fewer, larger chunks. A lower percentile means more splits and smaller chunks.
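To make the steps and the percentile threshold concrete, here is a toy version you can run without an API key. It is a sketch of the general technique, not SemanticChunker's implementation. It uses bag-of-words vectors as a stand-in for real sentence embeddings, a naive regex sentence splitter, and hypothetical function names.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy sentence 'embedding': word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_split(text: str, threshold_percentile: float = 95) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # sharp drop in similarity: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "Chunk overlap keeps context across chunk boundaries. "
    "Chunk overlap is usually a fraction of the chunk size. "
    "The office cafeteria serves lunch at noon. "
    "Lunch at the cafeteria includes a vegetarian option."
)
print(len(semantic_split(text, threshold_percentile=95)))  # 2
print(len(semantic_split(text, threshold_percentile=25)))  # 3
```

At the 95th percentile the only split falls at the topic change between the overlap sentences and the cafeteria sentences. Lowering the percentile to 25 adds a second split inside the first topic, which matches the behavior described above.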

Advantages:

Can produce more semantically coherent chunks than structure-based methods
Does not depend on structural cues like paragraphs or headers
Can handle documents that mix multiple topics within paragraphs

Disadvantages:

Computationally expensive: requires generating an embedding per sentence
Threshold tuning is required and can vary across document types
Performance is inconsistent across different kinds of text
Still considered experimental in LangChain

When to use it:

Semantic splitting is worth experimenting with when you are working with documents that have inconsistent structure, mix multiple topics densely, or when retrieval quality with other splitters is noticeably poor. It is not yet the default recommendation for production systems due to its cost and inconsistency.

Comparing All Four Text Splitters at a Glance

Splitter Type	Speed	Semantic Quality	Best Use Case	Production Ready?
Length-Based	Very Fast	Low	Quick prototyping	Not recommended
RecursiveCharacter	Fast	Good	General-purpose RAG	Yes
Document-Structure	Fast	Very Good	Code, Markdown, HTML	Yes
Semantic	Slow	Excellent	Dense mixed-topic text	Experimental

How to Choose the Right Text Splitter

Use this decision guide when building your RAG pipeline:

Start with RecursiveCharacterTextSplitter if your documents are general prose: articles, books, reports, or any unstructured plain text. It handles the vast majority of real-world cases reliably.

Switch to a Document-Structure splitter when your source material has a defined format: Python or JavaScript code, markdown documentation, HTML pages, or JSON. Using structure-aware splitting in these cases will significantly improve retrieval quality.

Experiment with Semantic splitting when your retrieval quality is still poor after tuning chunk size and overlap with RecursiveCharacterTextSplitter, especially when your documents blend multiple topics in dense paragraphs without clear structural markers.

Avoid Length-Based splitting in production. It is only useful for quick experimentation or cases where the chunked text does not need to be read or understood coherently.

Practical Tips for Tuning Your Text Splitter in Production

Getting text splitting right in production is an iterative process. Here are some practical guidelines:

Start with a chunk size of 500 to 1000 characters. This range works well for most general-purpose RAG applications. Adjust based on your evaluation results.

Use 10 to 20 percent of your chunk size as overlap. For a 1000-character chunk, that means starting with 100 to 200 characters of overlap.

Evaluate chunk quality visually. Print out 10 to 20 random chunks from your dataset and read them. Do they make sense as standalone units? If sentences are frequently cut mid-way, your chunk size is too small or your separator hierarchy is not working well.

Run retrieval evaluation with a small test set. Build a set of 20 to 30 question-answer pairs from your documents. Retrieve chunks for each question and check whether the correct chunk is appearing in the top results. Adjust chunk size and overlap based on this evaluation.

Use token-based chunk sizes for LLM-facing chunks. When the chunks will be passed directly to an LLM, measure size in tokens (using a tokenizer) rather than characters. Character counts only approximate token counts, so a token-based limit makes it much easier to stay within the context window of the LLM and the input limit of the embedding model.

Add metadata to your chunks. Document title, section headers, page numbers, and source URLs added as chunk metadata dramatically improve retrieval accuracy, especially when combined with metadata filtering in your vector store.
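The retrieval evaluation tip above can be sketched in a few lines. Everything in this example is hypothetical scaffolding: the chunk ids, the tiny test set, and the keyword-overlap retriever standing in for your real vector store. The metric itself, the fraction of questions whose expected chunk appears in the top k results, is the part worth keeping.

```python
def hit_rate_at_k(test_set, retrieve, k: int = 5) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    hits = sum(1 for question, expected_id in test_set if expected_id in retrieve(question)[:k])
    return hits / len(test_set)


# Hypothetical corpus of already-split chunks, keyed by chunk id.
chunks = {
    "c1": "chunk overlap preserves context at chunk boundaries",
    "c2": "the recursive splitter tries paragraph breaks first",
    "c3": "semantic chunking compares sentence embeddings",
}


def retrieve(question: str) -> list[str]:
    """Stand-in retriever: rank chunk ids by number of words shared with the question."""
    words = set(question.lower().split())
    return sorted(chunks, key=lambda cid: len(words & set(chunks[cid].split())), reverse=True)


test_set = [
    ("why use chunk overlap", "c1"),
    ("which splitter tries paragraph breaks", "c2"),
    ("how does semantic chunking work", "c3"),
    ("which method looks at meaning shifts", "c3"),  # paraphrase the keyword retriever misses
]

print(hit_rate_at_k(test_set, retrieve, k=1))  # 0.75
print(hit_rate_at_k(test_set, retrieve, k=3))  # 1.0
```

Rerun the same test set each time you change chunk size, overlap, or splitter, and compare the hit rates rather than relying on intuition.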

A Complete Example: Building a RAG Pipeline with Proper Text Splitting

Here is a minimal but complete example of loading a document and splitting it properly for a RAG pipeline using LangChain:

code

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Step 1: Load the document
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

# Step 2: Split the document into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} pages.")

# Step 3: Generate embeddings and store in a vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Step 4: Retrieve relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("What are the main conclusions of this document?")

for result in results:
    print(result.page_content[:300])
    print("---")

This pipeline loads a PDF, splits it into 1000-character chunks with 150-character overlap, generates embeddings, stores them in Chroma, and retrieves the five most relevant chunks for a query.

Common Mistakes to Avoid in Text Splitting

Using chunk sizes that are too small. Chunks of 50 to 100 characters are often too small to contain meaningful information. They produce shallow embeddings and increase storage and retrieval costs dramatically.

Ignoring chunk overlap entirely. Zero overlap causes context to be lost at chunk boundaries. This is especially problematic for long multi-sentence arguments where the conclusion appears in a different chunk from the premise.

Applying plain text splitting to code. Code split at arbitrary character boundaries becomes syntactically broken and impossible to interpret correctly. Always use language-aware splitters for code.

Not evaluating retrieval quality. Splitting is not a set-and-forget step. Always evaluate how well your chunks are being retrieved before deploying to production.

Forgetting to add metadata. Raw text chunks without metadata lose their source context. Always preserve at least the source document name and page number.
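As a minimal sketch of what preserving source context can look like, the example below tags every chunk with its source file, page number, and start offset. The Chunk class, the function name, and the "guide.pdf" file name are all hypothetical. In LangChain the equivalent information lives in each Document's metadata dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_pages(pages: list[str], source: str, chunk_size: int) -> list[Chunk]:
    """Split each page into fixed-size pieces and tag every piece with its origin."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), chunk_size):
            chunks.append(Chunk(
                text=page_text[start:start + chunk_size],
                metadata={"source": source, "page": page_number, "start_index": start},
            ))
    return chunks


pages = ["First page text about chunking.", "Second page text about overlap."]
chunks = chunk_pages(pages, source="guide.pdf", chunk_size=20)

# Metadata makes it possible to filter or cite by origin later.
page_two = [c for c in chunks if c.metadata["page"] == 2]
print(len(chunks), len(page_two))  # 4 2
```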

Final Thoughts

Text splitting is often treated as a footnote in RAG tutorials, but it is anything but trivial. The quality of your chunks directly determines the quality of your embeddings, the precision of your semantic search, and ultimately the accuracy of your LLM's responses.

Here is a quick summary of everything we covered:

Text splitting breaks large documents into smaller chunks that LLMs can process efficiently
It solves the context window problem, improves embedding quality, and reduces hallucination
Chunk size and chunk overlap are the two most important configuration parameters
RecursiveCharacterTextSplitter is the best general-purpose choice for most RAG applications
Document-structure splitters are essential for code, markdown, and HTML
Semantic splitters can offer the highest chunk quality but are computationally expensive and still experimental
Production text splitting requires evaluation and iteration, not just a one-time setup

Master text splitting and you will have a strong foundation for building RAG applications that actually work in the real world.
