Sameer Singh

If you have ever tried building a Retrieval-Augmented Generation (RAG) application, you already know that loading documents is only the first step. The real challenge begins when you ask yourself:
How do I prepare this text so that an LLM can actually use it effectively?
That is where text splitting comes in.
Text splitting is not just a preprocessing detail. It is one of the most critical decisions you will make in any RAG pipeline. Get it wrong and your embeddings become noisy, your search results become irrelevant, and your LLM is more likely to hallucinate. Get it right and retrieval becomes more precise, answers become more accurate, and the pipeline moves much closer to production-ready.
In this guide, we are going to break down text splitting from the ground up:
Whether you are just starting out with RAG or optimizing an existing pipeline, this guide will give you everything you need to master text splitting.
Text splitting is the process of breaking large documents into smaller, more manageable pieces called chunks.
These chunks are then:
Think of it like this: if you hand someone a 500-page textbook and ask them to answer a question in 30 seconds, they will struggle. But if you hand them just the two relevant paragraphs, they will nail the answer immediately. Text splitting does exactly that for your LLM.
Practically anything can qualify:
All of these need to be chunked before they can be effectively used in a RAG pipeline.
Many developers underestimate text splitting because they assume any chunking strategy will do. But there are at least five concrete, technical reasons why the quality of your text splitting directly determines the quality of your RAG output.
Every LLM has a maximum number of tokens it can process in a single request. This is called the context window.
Here are some rough numbers for reference:
| Model | Approximate Context Window |
|---|---|
| GPT-3.5 | 4,096 tokens |
| GPT-4 (standard) | 8,192 tokens |
| Claude 3 | Up to 200,000 tokens |
| Llama 3 (8B) | 8,192 tokens |
Even with models offering very large context windows, sending an entire 500-page PDF in one shot is not practical. It is expensive, slow, and often produces worse results than targeted retrieval of relevant chunks.
Text splitting ensures that the content you pass to the LLM fits cleanly within the context window while remaining focused and relevant to the user's query.
When you generate an embedding for a piece of text, you are compressing the semantic meaning of that text into a single vector of fixed dimensions, often 768, 1536, or 3072 numbers.
If you try to embed a 50-page document as a single unit, you are asking the embedding model to compress an enormous, diverse set of ideas into one vector. The result is a blurry, diluted representation that does not capture any single topic very well.
Now consider what happens when you split that document into focused 300-word chunks, each covering a specific topic. Each embedding becomes a precise, clean representation of that topic. When a user asks a question, the similarity search can now pinpoint the most relevant chunk rather than retrieving an entire document based on average relevance.
The principle is simple: smaller, topically focused chunks produce sharper embeddings, as long as each chunk still holds enough text to carry a complete idea.
Semantic search works by comparing the embedding of a user's query against the embeddings of all stored chunks. The closer two embeddings are in vector space, the more relevant the chunk is assumed to be.
If your chunks are too large, a single chunk might contain information about five different topics. Even if the query is relevant to just one of those topics, the chunk will match only partially, reducing its score in the similarity search.
Smaller, topically focused chunks allow semantic search to be far more precise. The result is that your RAG system retrieves the right information more consistently, which leads to better answers and fewer hallucinations from the LLM.
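To make the ranking step concrete, here is a toy sketch in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings, and modeling the oversized chunk as the average of three topic vectors is an illustrative assumption, not a description of how an embedding model works internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": one axis per topic.
chunk_vectors = {
    "focused_topic_a": [1.0, 0.0, 0.0],
    "focused_topic_b": [0.0, 1.0, 0.0],
    # An oversized chunk covering three topics, modeled as their average.
    "mixed_three_topics": [1 / 3, 1 / 3, 1 / 3],
}

query_vector = [0.9, 0.1, 0.0]  # a query that is mostly about topic A

# Rank every stored chunk by its similarity to the query, best match first.
ranked = sorted(
    ((name, cosine_similarity(query_vector, vec)) for name, vec in chunk_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy setup the focused chunk about topic A scores well above the mixed chunk, even though the mixed chunk also contains topic A.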
LLMs tend to hallucinate more when they are given either too little context or too much irrelevant context. When you pass a massive, unfocused chunk to an LLM as context, you are essentially providing noise along with the signal.
The model then has to figure out which parts of the context are relevant while also generating a response. This increases the likelihood of confabulation.
Well-split chunks act as targeted context. The LLM receives exactly what it needs to answer the question, nothing more and nothing less.
LLM API calls are billed per token. Smaller, relevant chunks mean:
In production systems with high query volumes, the difference in cost between good and poor chunking strategies can be significant.
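The effect on token volume is simple arithmetic. The sketch below counts only the retrieved context sent with each prompt, and every workload number in it (chunks per query, tokens per chunk, queries per day) is a hypothetical value chosen for illustration.

```python
def context_tokens_per_day(chunks_per_query, tokens_per_chunk, queries_per_day):
    """Tokens of retrieved context sent to the LLM per day (prompt side only)."""
    return chunks_per_query * tokens_per_chunk * queries_per_day

# Hypothetical workload: 5 retrieved chunks per query, 10,000 queries per day.
focused = context_tokens_per_day(5, 250, 10_000)
oversized = context_tokens_per_day(5, 2_000, 10_000)

print(f"focused chunks:   {focused:,} context tokens/day")
print(f"oversized chunks: {oversized:,} context tokens/day")
print(f"ratio: {oversized / focused:.0f}x")
```

Multiply either figure by your provider's per-token price to turn it into a cost estimate.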
Before diving into the types of splitters, you need to fully understand two configuration parameters that appear in almost every text splitter:
Chunk size defines the maximum number of characters (or tokens, depending on the splitter) in each chunk.
Common chunk sizes and their use cases:
| Chunk Size | Best For |
|---|---|
| 100 to 300 characters | Short factual Q&A, precise retrieval |
| 500 to 1000 characters | General-purpose RAG retrieval |
| 1000 to 2000 characters | Summarization, complex reasoning tasks |
| 2000+ characters | Document-level summarization, long-form generation |
Choosing the right chunk size depends on your use case. For simple Q&A over a FAQ document, small chunks work well. For complex technical documentation where context matters, larger chunks may be necessary.
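Because chunk size can be measured in characters or in tokens, the same text has two different "sizes". The sketch below contrasts the two. The regex-based counter is a crude, hypothetical stand-in for a real tokenizer, which will split text differently, so treat its numbers as rough estimates only.

```python
import re

def approx_token_count(text):
    """Crude stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

sample = "Text splitting breaks large documents into chunks."

print(len(sample))                 # size measured in characters
print(approx_token_count(sample))  # size measured in approximate tokens
```

A function like this has the same shape as `len`, so it could in principle be used wherever a splitter accepts a custom length function.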
Chunk overlap is the number of characters that are shared between consecutive chunks.
Here is a concrete example:
Chunk size = 200 characters
Overlap = 40 characters
Chunk 1: characters 0 to 200
Chunk 2: characters 160 to 360
Chunk 3: characters 320 to 520
Each chunk shares 40 characters with the previous one.
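The offsets above follow from a sliding window whose step is the chunk size minus the overlap. A minimal sketch of that arithmetic, with a hypothetical helper name and a 520-character text assumed for the example:

```python
def chunk_spans(text_length, chunk_size, overlap):
    """Return (start, end) character offsets for sliding-window chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the last window reached the end of the text
        start += step
    return spans

print(chunk_spans(520, 200, 40))  # [(0, 200), (160, 360), (320, 520)]
```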
Why does overlap matter?
Imagine a sentence that reads: "The transformer architecture, first introduced by Vaswani et al. in 2017, revolutionized natural language processing."
If you split the text right after "Vaswani et al. in 2017," one chunk ends with incomplete information and the next chunk begins with "revolutionized natural language processing" without the subject. This breaks the meaning.
Overlap ensures that no sentence or idea is cut off abruptly at a chunk boundary. The surrounding context is preserved across chunks, which leads to better retrieval and better LLM responses.
How much overlap should you use?
A commonly recommended rule of thumb is to use 10 to 20 percent of your chunk size as overlap.
Chunk size = 1000 characters
Recommended overlap = 100 to 200 characters
More overlap means better context preservation but also more storage, more embeddings to compute, and slightly higher retrieval costs. Less overlap is leaner but risks losing context at boundaries.
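The storage side of that trade-off can be estimated up front. The sketch below assumes a pure sliding window over a hypothetical 100,000-character document; a splitter that respects separators will produce somewhat different counts.

```python
import math

def estimate_chunk_count(text_length, chunk_size, overlap):
    """Estimate how many sliding-window chunks a text of text_length characters yields."""
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((text_length - overlap) / step)

# Same document and chunk size, increasing overlap.
for overlap in (0, 100, 200):
    print(overlap, estimate_chunk_count(100_000, 1_000, overlap))
```

Going from no overlap to 20 percent overlap raises the chunk count from 100 to 125 in this example, which means a quarter more embeddings to compute and store.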
LangChain provides four major categories of text splitters. Each is designed for different document types and use cases.
This is the simplest and most primitive form of text splitting. You specify a maximum chunk size and the splitter cuts the text at that character or token count, regardless of what it is splitting.
How it works:
from langchain.text_splitter import CharacterTextSplitter
# An empty separator means the text is cut purely by character count.
splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator=""
)
chunks = splitter.split_text(your_text)
The algorithm simply moves through the text, counts characters, and cuts at the specified size.
Advantages:
Disadvantages:
When to use it:
Length-based splitting is generally not recommended for production RAG systems. It is best used for quick experiments or in cases where the downstream process does not depend on coherent text boundaries, such as certain keyword search pipelines.
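To see why the boundaries are a problem, here is a plain-Python sketch of a fixed-size cut. It is a hypothetical helper, not LangChain's implementation, and it leaves out overlap to keep the cut points obvious.

```python
def fixed_size_chunks(text, chunk_size):
    """Cut text every chunk_size characters, ignoring words and sentences."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Text splitting breaks large documents into chunks."
chunks = fixed_size_chunks(text, 20)

for chunk in chunks:
    print(repr(chunk))
```

The first cut lands inside the word "breaks" and the second inside "into", which is exactly the kind of boundary damage a structure-aware splitter tries to avoid.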
This is the most widely used and recommended text splitter in LangChain. It works by splitting text according to its natural hierarchical structure.
How it works:
The RecursiveCharacterTextSplitter uses a prioritized list of separators:
["\n\n", "\n", " ", ""]
It attempts to split first by double newlines (paragraph boundaries). If the resulting chunks are still too large, it tries single newlines. If still too large, it tries spaces. As a final resort, it splits by individual characters.
This recursive approach means the splitter always tries to preserve the highest-level structural unit it can while still meeting your chunk size requirement.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
# documents is a list of Document objects, for example from a document loader.
chunks = splitter.split_documents(documents)
Why it is the best general-purpose splitter:
It respects the natural structure of human-written text. Paragraphs are the primary unit of meaning in most documents. Splitting at paragraph boundaries whenever possible ensures that each chunk contains a coherent idea or argument.
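The separator fallback described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: it ignores overlap and does not re-merge small pieces produced at different separator levels.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting: no overlap, no re-merging across levels."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at the character limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    pieces = text.split(separator)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate      # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)   # the current chunk is full, emit it
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks

text = "Para one is short.\n\nPara two is a bit longer than the first one.\n\nThree."
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, and only the paragraph that exceeds the limit is broken further, at a space rather than in the middle of a word.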
Advantages:
Disadvantages:
When to use it:
Use RecursiveCharacterTextSplitter as your default choice for almost all general-purpose RAG applications built on plain text, articles, reports, books, or similar unstructured documents.
When your source documents are not plain prose but have a specific format, you need a splitter that understands the conventions of that format.
LangChain provides specialized splitters for:
Example: Python Code Splitter
For Python source code, splitting by paragraphs or sentences makes no sense. The logical units of Python code are classes and functions. The Python splitter uses these as its primary separators:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=200
)
chunks = python_splitter.split_text(python_code)
The separators for Python are roughly:
["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]
This ensures that classes and functions are kept intact wherever possible, which is critical for code search and retrieval.
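The effect of the first two separators can be illustrated with a regular expression. This is a hypothetical sketch of the idea only; it applies no chunk size limit and is not how the LangChain splitter is implemented.

```python
import re

source = '''def load(path):
    return path.strip()

class Parser:
    def parse(self, text):
        return text.split()

def main():
    print("ok")
'''

# Split just before a newline that is followed by a top-level "class " or "def ".
pieces = [
    piece.strip("\n")
    for piece in re.split(r"(?=\n(?:class|def) )", source)
    if piece.strip()
]

for piece in pieces:
    print(piece)
    print("---")
```

The indented `parse` method stays inside the `Parser` piece because only definitions at the start of a line count as boundaries.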
Example: Markdown Splitter
Markdown documents have a header hierarchy (H1, H2, H3) that defines their structure. The MarkdownHeaderTextSplitter splits the document along these headers and preserves the header information as metadata on each chunk:
from langchain.text_splitter import MarkdownHeaderTextSplitter
# Each tuple maps a header marker to the metadata key it should populate.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_document)
Each chunk now carries metadata like {"Header 1": "Introduction", "Header 2": "Core Concepts"}, which can be used for filtering and metadata-aware retrieval.
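How header metadata ends up on each chunk can be sketched in plain Python. The function below is a hypothetical simplification, not the library's implementation: it only recognizes `#`-style headers at the start of a line and does not account for `#` characters inside fenced code blocks.

```python
def split_markdown_by_headers(markdown, max_level=3):
    """Split Markdown into sections, attaching the active headers as metadata."""
    sections, active_headers, body = [], {}, []

    def flush():
        # Emit the text collected so far together with a copy of the active headers.
        text = "\n".join(body).strip()
        if text:
            sections.append({"content": text, "metadata": dict(active_headers)})
        body.clear()

    for line in markdown.splitlines():
        level = len(line) - len(line.lstrip("#"))
        if 1 <= level <= max_level and line[level:level + 1] == " ":
            flush()
            # A new header replaces any header at the same or a deeper level.
            for deeper in range(level, max_level + 1):
                active_headers.pop(f"Header {deeper}", None)
            active_headers[f"Header {level}"] = line[level:].strip()
        else:
            body.append(line)
    flush()
    return sections

doc = "# Introduction\nRAG basics.\n## Core Concepts\nChunks and overlap.\n# Appendix\nExtra notes."
sections = split_markdown_by_headers(doc)
for section in sections:
    print(section)
```

Note how the "Appendix" section drops the "Core Concepts" header: metadata reflects only the headers that actually enclose the chunk.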
Example: HTML Splitter
The HTMLHeaderTextSplitter works the same way for HTML: it splits a page at header tags such as <h1> and <h2> and attaches the header text to each chunk as metadata:
from langchain.text_splitter import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_content)
Why document-structure splitters matter:
When your documents have a well-defined format, ignoring that format during splitting is a significant missed opportunity. Structure-aware splitting produces chunks that are both semantically coherent and enriched with structural metadata, improving retrieval accuracy significantly.
When to use it:
Use document-structure splitters any time your source material is code, markdown documentation, HTML pages, or other structured formats. Do not use a plain text splitter on these document types.
This is the most advanced and currently experimental approach to text splitting. Instead of using characters, tokens, or document structure as the basis for splitting, semantic splitters split based on changes in meaning within the text itself.
The core idea:
A single paragraph can shift between two entirely different topics. Traditional splitters would keep both topics in the same chunk because there is no structural boundary. A semantic splitter detects that the meaning has changed and splits there.
How it works:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_text(your_text)
The breakpoint_threshold_amount controls sensitivity. A higher percentile means fewer, larger chunks. A lower percentile means more splits and smaller chunks.
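The percentile idea can be illustrated without calling an embedding API. The sketch below is not SemanticChunker's actual implementation: the two-dimensional sentence vectors are hand-made stand-ins for real embeddings, and the helper names are hypothetical. It measures the distance between each pair of adjacent sentences and starts a new chunk wherever that distance exceeds the chosen percentile of all distances.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; larger means the two vectors differ more."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_groups(sentences, vectors, pct=95):
    """Group consecutive sentences, splitting where the meaning shifts sharply."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:   # breakpoint: start a new chunk
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups

# Hypothetical toy data: three sentences about one topic, then two about another.
sentences = ["s0", "s1", "s2", "s3", "s4"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

groups = semantic_groups(sentences, vectors, pct=95)
print(groups)
```

With the threshold at the 95th percentile, only the sharp topic change between `s2` and `s3` counts as a breakpoint. Raising it to 100 removes that breakpoint and leaves a single chunk.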
Advantages:
Disadvantages:
When to use it:
Semantic splitting is worth experimenting with when you are working with documents that have inconsistent structure, mix multiple topics densely, or when retrieval quality with other splitters is noticeably poor. It is not yet the default recommendation for production systems due to its cost and inconsistency.
| Splitter Type | Speed | Semantic Quality | Best Use Case | Production Ready? |
|---|---|---|---|---|
| Length-Based | Very Fast | Low | Quick prototyping | Not recommended |
| RecursiveCharacter | Fast | Good | General-purpose RAG | Yes |
| Document-Structure | Fast | Very Good | Code, Markdown, HTML | Yes |
| Semantic | Slow | Excellent | Dense mixed-topic text | Experimental |
Use this decision guide when building your RAG pipeline:
Start with RecursiveCharacterTextSplitter if your documents are general prose: articles, books, reports, or any unstructured plain text. It handles the vast majority of real-world cases reliably.
Switch to a Document-Structure splitter when your source material has a defined format: Python or JavaScript code, markdown documentation, HTML pages, or JSON. Using structure-aware splitting in these cases will significantly improve retrieval quality.
Experiment with Semantic splitting when your retrieval quality is still poor after tuning chunk size and overlap with RecursiveCharacterTextSplitter, especially when your documents blend multiple topics in dense paragraphs without clear structural markers.
Avoid Length-Based splitting in production. It is only useful for quick experimentation or cases where the chunked text does not need to be read or understood coherently.
Getting text splitting right in production is an iterative process. Here are some practical guidelines:
Start with a chunk size of 500 to 1000 characters. This range works well for most general-purpose RAG applications. Adjust based on your evaluation results.
Use 10 to 20 percent of your chunk size as overlap. For a 1000-character chunk, that means 100 to 200 characters of overlap.
Evaluate chunk quality visually. Print out 10 to 20 random chunks from your dataset and read them. Do they make sense as standalone units? If sentences are frequently cut mid-way, your chunk size is too small or your separator hierarchy is not working well.
Run retrieval evaluation with a small test set. Build a set of 20 to 30 question-answer pairs from your documents. Retrieve chunks for each question and check whether the correct chunk is appearing in the top results. Adjust chunk size and overlap based on this evaluation.
Use token-based chunk sizes for LLM-facing chunks. When the chunks will be passed directly to an LLM, measure size in tokens (using a tokenizer) rather than characters. This ensures you never accidentally exceed the model's context window.
Add metadata to your chunks. Document title, section headers, page numbers, and source URLs added as chunk metadata dramatically improve retrieval accuracy, especially when combined with metadata filtering in your vector store.
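The retrieval evaluation guideline above can be reduced to one number: the fraction of test questions whose expected chunk shows up in the top results. Here is a minimal sketch. The toy corpus, the chunk ids, and the keyword-overlap retriever are all hypothetical stand-ins; in a real pipeline the `retrieve` callable would wrap your vector store.

```python
def hit_rate_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    hits = 0
    for question, expected_chunk_id in test_set:
        top_ids = retrieve(question, k)
        if expected_chunk_id in top_ids:
            hits += 1
    return hits / len(test_set)

# Hypothetical toy corpus standing in for the chunks in a vector store.
chunks = {
    "c1": "chunk overlap preserves context at chunk boundaries",
    "c2": "the context window limits how many tokens a model can process",
    "c3": "markdown headers can be stored as chunk metadata",
}

def keyword_retrieve(question, k):
    """Rank chunk ids by how many words they share with the question."""
    words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda chunk_id: len(words & set(chunks[chunk_id].split())),
        reverse=True,
    )
    return scored[:k]

test_set = [
    ("why does overlap matter at boundaries", "c1"),
    ("what is the context window", "c2"),
    ("how are markdown headers stored", "c3"),
]

score = hit_rate_at_k(test_set, keyword_retrieve, k=1)
print(score)
```

Re-running the same test set after each change to chunk size or overlap gives you a consistent way to compare configurations.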
Here is a minimal but complete example of loading a document and splitting it properly for a RAG pipeline using LangChain:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Step 1: Load the document
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
# Step 2: Split the document into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} pages.")
# Step 3: Generate embeddings and store in a vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# Step 4: Retrieve relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("What are the main conclusions of this document?")
for result in results:
    print(result.page_content[:300])
    print("---")
This pipeline loads a PDF, splits it into 1000-character chunks with 150-character overlap, generates embeddings, stores them in Chroma, and retrieves the five most relevant chunks for a query.
Using chunk sizes that are too small. Chunks of 50 to 100 characters are often too small to contain meaningful information. They produce shallow embeddings and increase storage and retrieval costs dramatically.
Ignoring chunk overlap entirely. Zero overlap causes context to be lost at chunk boundaries. This is especially problematic for long multi-sentence arguments where the conclusion appears in a different chunk from the premise.
Applying plain text splitting to code. Code split at arbitrary character boundaries becomes syntactically broken and impossible to interpret correctly. Always use language-aware splitters for code.
Not evaluating retrieval quality. Splitting is not a set-and-forget step. Always evaluate how well your chunks are being retrieved before deploying to production.
Forgetting to add metadata. Raw text chunks without metadata lose their source context. Always preserve at least the source document name and page number.
Text splitting is often treated as a footnote in RAG tutorials, but it is anything but trivial. The quality of your chunks directly determines the quality of your embeddings, the precision of your semantic search, and ultimately the accuracy of your LLM's responses.
Here is a quick summary of everything we covered:
Master text splitting and you will have a strong foundation for building RAG applications that actually work in the real world.