Sameer Singh

If you have ever built or studied a Retrieval-Augmented Generation (RAG) system, you already know that the quality of your AI responses depends heavily on one thing: how well your system finds the right information at the right time.
That job belongs to the retriever.
Retrievers are the beating heart of any RAG application. Without them, a large language model (LLM) is essentially guessing from memory. With a well-designed retriever, your AI becomes a precision search engine connected to your private knowledge base, your documents, the web, or any custom data source.
In this comprehensive guide, we will cover everything you need to know about retrievers in RAG systems:
Whether you are a beginner building your first chatbot or an experienced developer optimizing a production RAG pipeline, this guide will give you a clear, actionable understanding of one of the most important components in modern AI applications.
Before diving deep into retrievers, it helps to understand where they fit in the broader RAG architecture.
A standard RAG pipeline consists of four main components:
1. Document Loader This component is responsible for ingesting raw data. It pulls content from sources such as PDFs, Word documents, websites, databases, Notion pages, or any custom source you define. Think of it as the entry point for your knowledge base.
2. Text Splitter Large documents cannot be processed efficiently in a single chunk. The text splitter breaks documents into smaller, manageable pieces called chunks. These chunks are sized to fit within the context window of your embedding model and to preserve semantic coherence.
3. Vector Store Once your documents are split into chunks, each chunk is converted into a numerical representation called an embedding. These embeddings are stored in a vector database (also called a vector store). Popular choices include Chroma, Pinecone, Weaviate, and FAISS.
4. Retriever When a user asks a question, the retriever is responsible for searching the vector store (or another data source) and returning the chunks most relevant to that query. Those chunks are then passed to the LLM as context, enabling grounded, accurate responses.
Understanding this pipeline is essential because everything the LLM says in a RAG system is only as good as what the retriever finds. Garbage in, garbage out.
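To make the text splitter step concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name split_with_overlap is hypothetical, and the tiny sizes are only for illustration. Real splitters usually also try to break on paragraph or sentence boundaries, which this sketch does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny sizes so the overlap is easy to see
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.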
At its most fundamental level, a retriever is a component that accepts a query as input and returns a list of relevant documents as output.
Simple Definition:
Input: User Query (natural language question or search phrase)
Output: List of relevant documents or document chunks
A retriever does not just find documents that contain matching keywords. Modern retrievers go much further. They understand the semantic meaning of a query, search across millions of documents in milliseconds, apply advanced filtering and reranking strategies, and return results that are contextually meaningful, not just textually similar.
You can think of a retriever as a specialized search engine that sits inside your AI system. It does not generate text. It does not understand language the way an LLM does. Its one job is to bridge the gap between a user's question and the relevant knowledge stored in a data source.
User Query
|
v
Retriever
|
v
Data Source (Vector DB, API, Web, Knowledge Base)
|
v
Relevant Documents
|
v
LLM (generates response using retrieved context)
This workflow is what separates RAG systems from pure generative models. The retriever grounds the LLM in facts, dramatically reducing hallucinations and improving answer accuracy.
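To make the flow concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how any particular framework implements retrieval: the Document class, keyword_retrieve, and build_prompt are hypothetical names, and word overlap stands in for real semantic search.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


def keyword_retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Toy retriever: rank documents by how many query words they share."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.page_content.lower().split())), doc)
        for doc in corpus
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def build_prompt(query: str, docs: list[Document]) -> str:
    """Put the retrieved chunks in front of the question for the LLM."""
    context = "\n".join(f"- {doc.page_content}" for doc in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


corpus = [
    Document("chroma is an open source vector database"),
    Document("the grand canyon is in arizona"),
    Document("a vector database stores embeddings"),
]
question = "what is a vector database"
docs = keyword_retrieve(question, corpus)
prompt = build_prompt(question, docs)
print(prompt)
```

The LLM call itself is left out on purpose. The point is that the model only ever sees what the retriever selected.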
To appreciate the importance of retrievers, consider what happens without them.
A vanilla LLM has knowledge frozen at its training cutoff date. It cannot access your company's internal documents. It cannot answer questions about a product launched last month. It cannot search a database of customer records. And when it does not know the answer, it sometimes makes one up.
Retrievers solve all of these problems by connecting the LLM to external, up-to-date, and domain-specific knowledge at inference time.
But not all retrieval is created equal. A poorly designed retriever will:
A well-designed retriever, on the other hand, dramatically improves the quality, accuracy, and trustworthiness of your AI system's responses.
LangChain, one of the most popular frameworks for building LLM applications, has a clear and powerful design philosophy around retrievers.
In LangChain, every retriever is a Runnable object. This is important because it means every retriever exposes a consistent interface with methods like invoke(). You do not need to rewrite your pipeline logic every time you switch from a Wikipedia retriever to a vector store retriever. You just swap the component.
This modularity means retrievers can be easily dropped into:
The Runnable interface is part of LangChain's LCEL (LangChain Expression Language) design, which encourages composable, readable, and maintainable AI pipelines.
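The value of a shared interface can be sketched without LangChain at all. The classes below are hypothetical stand-ins, not LangChain's Runnable; they only show that pipeline code written against invoke() does not change when the retriever does.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything with invoke(query) -> list of texts counts as a retriever here."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Searches a local list of texts by substring match."""

    def __init__(self, texts: list[str]) -> None:
        self.texts = texts

    def invoke(self, query: str) -> list[str]:
        terms = query.lower().split()
        return [t for t in self.texts if any(term in t.lower() for term in terms)]


class FixedRetriever:
    """Stand-in for a remote source: always returns the same canned results."""

    def __init__(self, results: list[str]) -> None:
        self.results = results

    def invoke(self, query: str) -> list[str]:
        return list(self.results)


def build_context(retriever: Retriever, query: str) -> str:
    # Pipeline code depends only on invoke(), not on the concrete class
    return "\n".join(retriever.invoke(query))


local = KeywordRetriever([
    "Reset your password from the login page.",
    "Invoices are emailed monthly.",
])
remote = FixedRetriever(["Password resets expire after one hour."])
context_a = build_context(local, "password")
context_b = build_context(remote, "password")
```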
Retrievers can be broadly categorized along two dimensions.
The first dimension is the data source: where the retriever goes to find information.
The second dimension is the search strategy: the algorithm or technique the retriever uses, regardless of data source.
Now let us explore each of the major retrievers in detail.
The Vector Store Retriever is the most commonly used retriever in production RAG systems. If you have built or seen a chatbot that answers questions from a PDF or a knowledge base, it almost certainly uses a vector store retriever at its core.
The entire process revolves around the concept of embeddings. An embedding is a high-dimensional numerical vector that represents the semantic meaning of a piece of text. Words, sentences, and paragraphs that mean similar things will have embeddings that are close together in vector space.
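"Close together" is usually measured with cosine similarity. Here is a small sketch using made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the numbers below are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated meaning
```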
Here is the step-by-step process:
Step 1: Document Ingestion All your documents are loaded, split into chunks, and converted into embeddings using an embedding model (such as OpenAI's text-embedding-3-small or a local model via HuggingFace).
Step 2: Index Creation These embeddings are stored in a vector database. The database builds an index that enables fast approximate nearest neighbor (ANN) search at query time.
Step 3: Query Processing When a user submits a query, that query is also converted into an embedding using the same model.
Step 4: Similarity Search The vector database uses its index to find the stored document embeddings closest to the query embedding under a distance metric (typically cosine similarity or dot product).
Step 5: Document Return The top-k most similar document chunks are returned to the LLM as context.
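The five steps can be sketched end to end in plain Python. This is a toy under stated assumptions: embed() is a hypothetical bag-of-words stand-in for a real embedding model, and the search is a brute-force scan rather than the approximate nearest neighbor index a real vector database would use.

```python
import heapq
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: L2-normalised word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


# Steps 1 and 2: ingest chunks and build the "index"
chunks = [
    "chroma stores embeddings for similarity search",
    "the grand canyon attracts many tourists",
    "embeddings map text to vectors",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.lower().split()})
index = [(embed(chunk, vocabulary), chunk) for chunk in chunks]


def search(query: str, k: int = 2) -> list[str]:
    # Step 3: embed the query with the same function used at ingestion
    query_vec = embed(query, vocabulary)
    # Step 4: dot product of normalised vectors equals cosine similarity
    scored = (
        (sum(a * b for a, b in zip(query_vec, vec)), chunk)
        for vec, chunk in index
    )
    # Step 5: return the top-k chunks, best first
    return [chunk for _, chunk in heapq.nlargest(k, scored, key=lambda pair: pair[0])]


top = search("how are embeddings stored for search", k=2)
print(top)
```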
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# your_documents is a placeholder for a list of Document chunks prepared earlier
# Create vector store from documents
vectorstore = Chroma.from_documents(
    documents=your_documents,
    embedding=OpenAIEmbeddings()
)
# Create retriever (returns top 3 most similar chunks)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
# Retrieve relevant documents
results = retriever.invoke("What is Chroma used for?")
for doc in results:
    print(doc.page_content)
    print("---")
Vector stores already have a similarity_search() method, so why wrap one in a retriever? It is a fair question.
The answer is abstraction and composability. By wrapping the vector store in a retriever object, you get:
The Wikipedia Retriever is a lightweight but powerful retriever for general-knowledge queries. Instead of searching a local vector database, it calls the Wikipedia API in real time and returns the most relevant encyclopedia articles for your query.
This makes it ideal for applications where you need broad, factual information about topics that are already well-documented in Wikipedia.
from langchain_community.retrievers import WikipediaRetriever
# Initialize the retriever
retriever = WikipediaRetriever(
    top_k_results=3,  # Number of Wikipedia articles to return
    lang="en",  # Language of the articles
    doc_content_chars_max=2000  # Limit content length per article
)
# Execute retrieval
query = "Geopolitical history of India and Pakistan"
docs = retriever.invoke(query)
for doc in docs:
    print(doc.metadata["title"])
    print(doc.page_content[:500])
    print("---")
A document loader blindly fetches all documents from a source without considering relevance. A retriever performs search logic, evaluating which documents are most relevant to a specific query and returning only those.
The Wikipedia Retriever uses Wikipedia's own search ranking to filter results, which is why it qualifies as a retriever and not just a document loader.
The MMR Retriever solves one of the most frustrating problems in search: redundancy.
Imagine you ask your RAG system, "What are the effects of climate change?" A standard similarity search might return five documents that all say the same thing about rising temperatures. You get five nearly identical results, and important aspects of the topic (ocean acidification, economic impacts, biodiversity loss) are completely missed.
Maximum Marginal Relevance (MMR) was designed to solve this exact problem.
MMR strikes a balance between two competing goals: relevance to the query and diversity among the documents already selected.
The algorithm works iteratively. It selects the first document based purely on similarity to the query. For every subsequent document, it penalizes candidates that are too similar to documents already selected. The result is a set of documents that are individually relevant but collectively diverse.
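The iterative selection can be sketched in a few lines of plain Python. This is an illustrative implementation of the usual MMR scoring rule, not LangChain's internal code; the function names and the two-dimensional vectors are made up, and the weight lambda_mult trades relevance (1.0) against diversity (0.0).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            if not selected:
                # First pick: pure similarity to the query
                return relevance
            # Penalise closeness to the most similar already-selected document
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # highly relevant
    [1.0, 0.12],   # near-duplicate of the first
    [0.7, -0.7],   # less relevant, but covers a different direction
]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # balanced
```

With lambda_mult=1.0 the near-duplicate is picked second; with 0.5 the redundancy penalty pushes it below the more distinct third document.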
The behavior of MMR is controlled by a parameter called lambda_mult:
lambda_mult = 1.0 means pure similarity search (no diversity penalty). This is equivalent to standard cosine similarity retrieval.
lambda_mult = 0.0 means maximum diversity (after the first pick, relevance to the query no longer affects selection).
lambda_mult = 0.5 (the default) means an equal balance between relevance and diversity.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# your_documents is a placeholder for a list of Document chunks prepared earlier
vectorstore = Chroma.from_documents(
    documents=your_documents,
    embedding=OpenAIEmbeddings()
)
# Use MMR search strategy
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,  # Number of documents to return
        "fetch_k": 20,  # Candidate pool size (fetch more, then select diverse subset)
        "lambda_mult": 0.5  # Balance between relevance and diversity
    }
)
results = retriever.invoke("What are the effects of climate change?")
for doc in results:
    print(doc.page_content[:300])
    print("---")
Standard vector search works well when the user's query is clear and specific. But real users rarely ask perfectly articulated questions. They ask vague, broad, or multi-faceted questions that a single embedding may not fully capture.
Consider this query: "How can I stay healthy?"
This question could be asking about nutrition, exercise, sleep, stress management, preventive healthcare, mental wellness, or any combination of these. A single embedding of "How can I stay healthy?" will be biased toward whichever aspect the embedding model happens to associate most strongly with that phrase. Important documents about other aspects may be ranked too low to be retrieved.
The Multi Query Retriever solves this by using an LLM to think more broadly about the query before searching.
Original query: "How can I stay healthy?"
LLM-generated variations:
Each of these variations targets a different semantic neighborhood in the embedding space, dramatically increasing the recall of the retrieval step.
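The merge step behind this idea can be sketched without an LLM. Everything below is a stand-in: the variations are hand-written examples rather than real LLM output, toy_retrieve is a hypothetical topic lookup, and the corpus is invented. The sketch only shows how results from several queries are combined into one de-duplicated list.

```python
from typing import Callable


def multi_query_retrieve(
    variations: list[str], retrieve: Callable[[str], list[str]]
) -> list[str]:
    """Run every query variation and return the union of results, first seen first."""
    seen: set[str] = set()
    merged: list[str] = []
    for variation in variations:
        for doc in retrieve(variation):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Invented corpus, grouped by topic keyword
corpus = {
    "diet": ["Eat plenty of vegetables.", "Drink enough water."],
    "exercise": ["Walk 30 minutes a day.", "Drink enough water."],
    "sleep": ["Aim for 7 to 9 hours of sleep."],
}


def toy_retrieve(query: str) -> list[str]:
    """Return documents whose topic keyword appears in the query."""
    return [doc for topic, docs in corpus.items() if topic in query.lower() for doc in docs]


# Hand-written variations standing in for what an LLM might generate
variations = [
    "What diet keeps me healthy?",
    "How much exercise do I need?",
    "How does sleep affect health?",
]
merged = multi_query_retrieve(variations, toy_retrieve)
print(toy_retrieve("How can I stay healthy?"))  # the vague query alone finds nothing
print(merged)
```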
import logging
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Enable logging to see generated queries
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
# Create base vector store retriever
# your_documents is a placeholder for a list of Document chunks prepared earlier
vectorstore = Chroma.from_documents(
    documents=your_documents,
    embedding=OpenAIEmbeddings()
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Wrap with Multi Query Retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=ChatOpenAI(temperature=0)
)
# Invoke with an ambiguous query
results = multi_query_retriever.invoke("How can I stay healthy?")
print(f"Retrieved {len(results)} unique documents")
for doc in results:
    print(doc.page_content[:300])
    print("---")
Because the Multi Query Retriever makes an LLM call to generate query variations and then runs a vector search for each variation, it is slower and more expensive than a standard similarity search. Use it when recall matters more than latency, or for offline batch processing.
Even when your retriever finds the right documents, those documents often contain a lot of noise. Consider a document that covers multiple unrelated topics:
"The Grand Canyon is a famous natural site visited by millions of tourists annually. Photosynthesis is the process by which plants convert sunlight and carbon dioxide into glucose. New York City has over 8 million residents."
If a user asks "What is photosynthesis?", a standard retriever returns this entire paragraph as context. The LLM now has to work harder to extract the relevant sentence while the rest of the text occupies valuable context window space.
The Contextual Compression Retriever fixes this by filtering and compressing retrieved documents down to only the content that is directly relevant to the query.
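The filtering idea can be sketched with a simple sentence-level keyword filter. This is a deliberately crude stand-in for an LLM-based or embedding-based compressor; the function names and the small stopword list are assumptions made for the example.

```python
import re

# Small illustrative stopword list, not a complete one
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to", "what", "which"}


def content_terms(text: str) -> set[str]:
    """Lowercased words in the text, minus stopwords."""
    return {word for word in re.findall(r"[a-z0-9]+", text.lower()) if word not in STOPWORDS}


def compress(document: str, query: str) -> str:
    """Keep only the sentences that share a content word with the query."""
    query_terms = content_terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences if content_terms(s) & query_terms)


document = (
    "The Grand Canyon is a famous natural site visited by millions of tourists annually. "
    "Photosynthesis is the process by which plants convert sunlight and carbon dioxide into glucose. "
    "New York City has over 8 million residents."
)
compressed = compress(document, "What is photosynthesis?")
print(compressed)
```

Only the photosynthesis sentence survives, so the LLM receives one relevant sentence instead of three.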
The Contextual Compression Retriever is a wrapper that sits on top of any base retriever. Here is the processing pipeline:
LangChain provides several compressor options:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Create the base retriever
# your_documents is a placeholder for a list of Document chunks prepared earlier
vectorstore = Chroma.from_documents(
    documents=your_documents,
    embedding=OpenAIEmbeddings()
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Create the compressor
llm = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
# Wrap base retriever with contextual compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
# Query returns only the relevant portions of each document
results = compression_retriever.invoke("What is photosynthesis?")
for doc in results:
    print(doc.page_content)  # Only the text the compressor judged relevant
    print("---")
LangChain offers a rich ecosystem of specialized retrievers beyond the core ones covered above. Here is a brief overview of the most notable:
This retriever addresses a fundamental tension in RAG: small chunks are better for retrieval (more precise semantic matching), but large chunks are better for context (more information for the LLM).
The Parent Document Retriever resolves this by indexing small child chunks for retrieval precision, but returning their larger parent documents as context. You get the best of both worlds.
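The child-to-parent lookup can be sketched with two plain data structures. The documents, ids, and the word-overlap scoring below are all illustrative assumptions; a real setup would embed the child chunks in a vector store and keep the parents in a separate document store.

```python
# Parent documents, keyed by a hypothetical id
parents = {
    "doc-1": "Chroma is a vector database. It stores embeddings. It supports metadata filters.",
    "doc-2": "The Grand Canyon is in Arizona. It is visited by millions of tourists.",
}

# Index small child chunks (here: sentences), each remembering its parent id
children = [
    (sentence.strip(), parent_id)
    for parent_id, text in parents.items()
    for sentence in text.split(".")
    if sentence.strip()
]


def retrieve_parents(query: str, k: int = 1) -> list[str]:
    """Match against child chunks, then return their full parent documents."""
    terms = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda child: len(terms & set(child[0].lower().split())),
        reverse=True,
    )
    parent_ids: list[str] = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


result = retrieve_parents("stores embeddings")
print(result)
```

The match happens on the short sentence "It stores embeddings", but the caller receives the whole parent document as context.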
This retriever combines semantic similarity with document recency. Recent documents receive a boost in ranking, making it useful for applications where freshness matters, such as news summarization or support ticket analysis.
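One way to express a recency boost is to add an exponentially decaying term to the similarity score. The sketch below assumes a score of the form similarity + (1 - decay_rate) ^ hours_passed; the function name, the default decay rate, and the example numbers are illustrative choices, not tuned values.

```python
def time_weighted_score(similarity: float, hours_passed: float, decay_rate: float = 0.01) -> float:
    """Similarity plus a freshness bonus that decays with age in hours."""
    return similarity + (1.0 - decay_rate) ** hours_passed


# An older, slightly more similar document versus a fresh, slightly less similar one
old_score = time_weighted_score(similarity=0.80, hours_passed=720)  # about 30 days old
fresh_score = time_weighted_score(similarity=0.70, hours_passed=2)

print(old_score, fresh_score)
```

A higher decay_rate makes the bonus fade faster, so old documents have to win on similarity alone.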
This retriever uses an LLM to convert a natural language query into a structured query with metadata filters. For example, the query "Show me papers about transformers published after 2022" is automatically converted into a semantic search with a date filter applied.
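The target of that conversion can be sketched as a small data structure. In the sketch below the translation from natural language is written by hand, since the LLM step is out of scope; StructuredQuery, its fields, and the paper records are hypothetical, and a substring match stands in for semantic search.

```python
from dataclasses import dataclass


@dataclass
class StructuredQuery:
    search_text: str  # the part of the request handled by (semantic) search
    year_after: int   # metadata filter: keep documents with year > year_after


# Hand-written translation of:
# "Show me papers about transformers published after 2022"
structured = StructuredQuery(search_text="transformers", year_after=2022)

# Invented records for illustration
papers = [
    {"title": "Paper A", "abstract": "transformers for translation", "year": 2021},
    {"title": "Paper B", "abstract": "efficient transformers", "year": 2023},
    {"title": "Paper C", "abstract": "graph neural networks", "year": 2024},
]


def run_structured_query(query: StructuredQuery, docs: list[dict]) -> list[dict]:
    # Apply the metadata filter first, then search within what is left
    filtered = [doc for doc in docs if doc["year"] > query.year_after]
    return [doc for doc in filtered if query.search_text in doc["abstract"]]


matches = run_structured_query(structured, papers)
print([doc["title"] for doc in matches])
```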
This retriever combines results from multiple base retrievers (for example, a BM25 keyword retriever and a vector similarity retriever) and uses a fusion algorithm like Reciprocal Rank Fusion (RRF) to merge and rerank the results. This hybrid approach often outperforms any single retriever in isolation.
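Reciprocal Rank Fusion itself is short enough to sketch in full. Each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default, and the document ids and rankings below are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of a keyword retriever and a vector retriever
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank well (doc_b, doc_a) rise to the top, while documents found by only one retriever follow.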
Similar to the Parent Document Retriever, but instead of parent/child documents, it stores multiple representations of the same document (summaries, hypothetical questions, dense and sparse embeddings) and retrieves based on the best-matching representation.
With so many retriever options available, how do you decide which one to use? Here is a practical framework:
Start with Vector Store Retriever. It is fast, simple, and works well for the majority of use cases. Use it as your baseline.
Add MMR if you notice your results are too similar or repetitive, or if your knowledge base has many overlapping documents.
Add Multi Query if users frequently ask vague or broad questions, or if you are getting poor recall on complex queries.
Add Contextual Compression if your documents are long, noisy, or cover multiple unrelated topics, or if you are hitting context window limits.
Use Ensemble Retriever if you need maximum accuracy and can tolerate higher latency, especially when combining keyword and semantic search.
Use Self Query if your data has meaningful metadata (dates, categories, authors) that should be used for filtering.
Use Parent Document if you need fine-grained retrieval but rich context for generation.
Here is an example of how these components combine into a complete RAG pipeline:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.chains import RetrievalQA
# Step 1: Load documents
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
# Step 2: Split into chunks (sizes are measured in characters by default)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
# Step 3: Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)
# Step 4: Create retriever with compression
base_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "lambda_mult": 0.5}
)
compressor = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76
)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
# Step 5: Build QA chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
# Query your system (RetrievalQA expects the question under the "query" key)
result = qa_chain.invoke({"query": "What are the main findings of the report?"})
print(result["result"])
Pitfall 1: Chunk size too large or too small Too large and you waste context window space. Too small and individual chunks lose semantic meaning. Start with 500-1000 tokens and tune from there.
Pitfall 2: Wrong embedding model Always use the same embedding model at ingestion time and query time. Mixing models produces meaningless similarity scores.
Pitfall 3: Not enough retrieved documents Setting k=1 or k=2 may miss critical context. Experiment with k=4 to k=8 as a starting point.
Pitfall 4: Ignoring metadata Metadata (document title, source, date, section) is valuable for filtering and attribution. Always preserve it during ingestion.
Pitfall 5: Over-relying on a single retriever Hybrid approaches (Ensemble Retriever, Multi Query + Compression) often outperform any single retriever strategy in complex real-world applications.
Retrievers are not just a supporting component in RAG systems. They are the foundation on which everything else is built. The quality, accuracy, and reliability of your AI application depend directly on how well your retriever finds and delivers relevant information.
In this guide, we explored:
The next step is to combine all of this knowledge into a real application. Take a set of documents you care about, load them into a vector store, experiment with different retrievers, and observe how the quality of your AI responses changes with each strategy.
Building great RAG applications is iterative. The retriever is the best place to start optimizing.