
Every LangChain application - whether it is a chatbot, a document search engine, a coding assistant, or an autonomous agent - has one thing at its absolute core: a model.
Not a chain. Not a prompt. Not an agent. A model.
Everything else in LangChain exists to feed data into a model, structure its inputs, and do something useful with its outputs. If you do not deeply understand the Model component, every other part of LangChain will feel like memorizing syntax without understanding what you are actually building.
This guide fixes that. We will go far beyond "models take text and return text." We will cover the internal mechanics of how language models work, why the industry shifted from traditional LLMs to chat models, how embedding models convert language into mathematics, and how all of this connects to real applications like semantic search and RAG systems.
By the end, you will understand not just what models are, but why they are designed the way they are - and that understanding will make every LangChain concept that follows click into place.
At the simplest level, the Model component in LangChain is a standardized interface for communicating with AI models.
But let us unpack why that standardization matters so much.
The AI provider landscape today looks something like this: OpenAI (GPT models), Anthropic (Claude), Google (Gemini), Cohere, Meta (Llama), Mistral, and a fast-growing catalog of open-source models on HuggingFace.
Each of these providers has its own API format, authentication method, request structure, and response schema. If you build your application directly against OpenAI's API and later need to switch to Anthropic (for cost, compliance, capability, or availability reasons), you face a significant rewrite.
LangChain solves this with a unified abstraction layer. Every model - regardless of provider - is called the same way:
response = llm.invoke("Your question here")
The provider-specific complexity is hidden inside the model object. Your application logic stays clean and portable.
LangChain divides all models into two distinct categories, each designed for a fundamentally different purpose.
Traditional language models follow a simple contract: you give them text, they return text. This is sometimes called the "completion" paradigm - you provide a prompt and the model completes it.
from langchain_openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-instruct")
response = llm.invoke("The capital of France is")
# Response: "Paris, a city known for..."
Understanding what happens inside a language model changes how you think about prompting, temperature, and model selection.
Language models are trained to predict the next token (roughly, the next word or word-piece) in a sequence. During training, the model sees billions of text examples and learns the statistical patterns of language - which words tend to follow which other words, in which contexts.
At inference time (when you call the model), this is what happens:
1. Your prompt is split into tokens.
2. The model computes a probability distribution over every possible next token.
3. One token is sampled from that distribution (how it is sampled depends on the temperature setting).
4. The sampled token is appended to the sequence, and the process repeats until a stop token or length limit is reached.
This is why language model outputs are probabilistic, not deterministic. The same input can produce different outputs on different runs - a feature, not a bug, that enables creativity and diversity.
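A quick way to see this for yourself is to send the same prompt twice at a nonzero temperature. A minimal sketch, assuming an OpenAI API key is configured in your environment:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=1.0)
prompt = "Write a one-line tagline for a coffee shop."

# Two calls with the same input: sampling makes the outputs differ
print(llm.invoke(prompt).content)
print(llm.invoke(prompt).content)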
Temperature is one of the most important parameters you will use, and it is widely misunderstood.
Technically, temperature rescales the logits (raw model outputs) before they are converted to probabilities via the softmax function; a short numeric sketch follows the presets below. Here is the intuition:
Temperature = 0 (or near 0): The model always picks the single most probable next token. Output is effectively deterministic; repeated runs produce the same result (up to minor provider-side nondeterminism).
Best for: code generation, factual Q&A, data extraction, structured output - anywhere correctness matters more than variety.
Temperature = 0.3 - 0.7: The model strongly favors likely tokens but occasionally picks less probable ones. Output feels natural and varied without being incoherent.
Best for: chatbots, explanations, summarization, customer service - the sweet spot for most conversational applications.
Temperature = 1.0: The model samples from its predicted probability distribution unchanged. Balanced between coherence and creativity.
Temperature = 1.5 - 2.0: The model is much more likely to pick unexpected tokens. Output becomes creative, surprising, and sometimes incoherent.
Best for: creative writing, brainstorming, generating diverse options - anywhere novelty is valued over precision.
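Here is the numeric sketch promised above: a minimal temperature-scaled softmax in numpy. The four logit values are invented for illustration:
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax:
    # low temperature sharpens the distribution, high temperature flattens it
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
for t in [0.1, 1.0, 2.0]:
    print(t, softmax_with_temperature(logits, t).round(3))
# At t=0.1 nearly all probability sits on the top token (deterministic feel);
# at t=2.0 the mass spreads out, so unlikely tokens get sampled more often
In practice you never compute this yourself; you just set the parameter: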
from langchain_openai import ChatOpenAI
# For code generation - precise and deterministic
code_llm = ChatOpenAI(model="gpt-4o", temperature=0)
# For customer support - natural and consistent
support_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
# For creative writing - varied and expressive
creative_llm = ChatOpenAI(model="gpt-4o", temperature=0.9)
Tokens are the unit of measurement for LLM inputs and outputs. As a rough guide, one token is about four characters of English text: roughly 75 words per 100 tokens, and on the order of 500 tokens for a full page.
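If you want exact counts rather than rough guides, OpenAI's tiktoken library exposes the same tokenizer the models use. A small sketch, assuming tiktoken is installed:
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "LangChain wraps every provider behind one interface."
tokens = enc.encode(text)
print(len(tokens))             # the token count you are billed for
print(enc.decode(tokens[:3]))  # tokens are word-pieces, not whole words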
max_tokens sets a hard ceiling on the length of the model's response. The model will stop generating when it hits this limit, even mid-sentence.
llm = ChatOpenAI(model="gpt-4o", max_tokens=500)
Why does this matter?
Cost control: API providers charge per token (input + output). Uncapped responses from complex prompts can be expensive at scale; a back-of-the-envelope cost sketch follows this list.
Latency control: Longer responses take more time to generate. For real-time applications, shorter max_tokens means faster responses.
Use case fit: A one-sentence answer needs max_tokens=50. A full essay needs max_tokens=2000. Sizing this appropriately improves both cost and user experience.
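Here is that cost sketch. The per-token rates below are placeholders, not real prices; substitute your provider's current pricing:
# Illustrative rates only - check your provider's actual price sheet
INPUT_COST_PER_TOKEN = 2.50 / 1_000_000    # dollars per input token (placeholder)
OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000  # dollars per output token (placeholder)

def daily_cost(input_tokens, output_tokens, calls_per_day):
    per_call = (input_tokens * INPUT_COST_PER_TOKEN
                + output_tokens * OUTPUT_COST_PER_TOKEN)
    return per_call * calls_per_day

# 1M calls/day, 500-token prompts, 300-token responses
print(f"${daily_cost(500, 300, 1_000_000):,.2f} per day")
Note how the output term dominates here: lowering max_tokens is often the quickest cost lever.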
Traditional language models treat the entire conversation as a single flat string. To give a model conversation history, you had to manually format it:
User: What is gradient descent?
Assistant: Gradient descent is an optimization algorithm...
User: Can you give me a Python example?
Assistant:
This worked, but it was brittle. The model had to infer roles from the text itself. There was no clear separation between system instructions, user messages, and assistant responses.
Chat models solved this with a structured message format. Instead of a flat string, you pass a list of typed messages:
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
messages = [
SystemMessage(content="You are a Python expert. Be concise and practical."),
HumanMessage(content="What is gradient descent?"),
AIMessage(content="Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient."),
HumanMessage(content="Can you give me a Python example?")
]
response = llm.invoke(messages)
This structure gives the model explicit, unambiguous signals about:
- Who is speaking: instructions (system), user input (human), or its own prior output (AI)
- Where each conversational turn begins and ends
- Which rules apply to the whole conversation versus a single turn
SystemMessage
The system message is your most powerful tool for shaping model behavior. It is processed before any user input and sets the model's persona, constraints, expertise, and behavioral rules for the entire conversation.
Effective system messages are specific, not vague:
# Vague - not very effective
SystemMessage(content="You are a helpful assistant.")
# Specific - shapes behavior precisely
SystemMessage(content="""You are a senior data scientist with 10 years of experience
in machine learning and statistics. When explaining concepts:
- Always start with the intuition before the math
- Use concrete, real-world examples
- Flag when a concept has common misconceptions
- Recommend further reading when appropriate
Never use jargon without explaining it first.""")
HumanMessage
Represents what the user typed. In a simple single-turn interaction, this is just the user's question. In a multi-turn conversation, there will be multiple HumanMessage objects in the list, each paired with an AIMessage response.
AIMessage
Represents what the model previously said. Including past AIMessage objects in your conversation history is what gives the model its "memory" of what it already told the user.
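Putting the three message types together, here is a minimal multi-turn sketch: the returned AIMessage is appended to the history so the next turn carries the "memory" (assumes an OpenAI key is configured):
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
history = [SystemMessage(content="You are a concise Python tutor.")]

for question in ["What is gradient descent?", "Show me a Python example."]:
    history.append(HumanMessage(content=question))
    reply = llm.invoke(history)   # the full history is sent on every call
    history.append(reply)         # reply is an AIMessage; appending it is the "memory"
    print(reply.content[:100])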
The shift from traditional LLMs to chat models is nearly complete for production applications. Here is why:
1. Role clarity: The model always knows which content is instructions (system), which is user input (human), and which is its own prior output (AI). This reduces hallucination and off-topic responses.
2. Instruction following: Chat models are fine-tuned specifically for following instructions, not just completing text. They are much better at respecting constraints, maintaining personas, and producing structured output.
3. Safety alignment: Chat models have RLHF (Reinforcement Learning from Human Feedback) and constitutional AI training baked in, making them more reliable for production use.
4. Tool use and function calling: Agents and tool-using applications require structured output formats that chat models support natively.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
# All three share the same interface
openai_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
anthropic_llm = ChatAnthropic(model="claude-opus-4-5", temperature=0.3)
google_llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3)
# Called identically regardless of provider
response = openai_llm.invoke([HumanMessage(content="Explain transformers")])
One of the most important decisions in any LLM application is which model to use. The choice between closed-source (proprietary) and open-source models involves real trade-offs across capability, cost, privacy, and control.
These are models developed and hosted by private companies. You access them via API, paying per token. You never have access to the weights.
Leading examples: GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), and Gemini 1.5 Pro (Google).
Strengths:
Raw capability: The frontier closed-source models (GPT-4o, Claude 3.5 Sonnet) currently outperform open-source alternatives on most benchmarks, especially for complex reasoning, coding, and instruction following.
Zero infrastructure: No GPUs, no deployment, no maintenance. Call the API and you are done. This dramatically accelerates time-to-production for most teams.
Reliability and uptime: Enterprise-grade SLAs with 99.9%+ uptime. You are not responsible for model serving.
Continuous improvement: Providers regularly update and improve their models without you changing any code.
Weaknesses:
Cost at scale: Per-token pricing is manageable at low volume but becomes significant at scale. A high-traffic application making millions of API calls per day can incur substantial costs.
Data privacy: Your prompts and user data are sent to a third-party server. For regulated industries (healthcare, finance, legal) or applications involving sensitive personal data, this may be unacceptable.
No customization: You cannot fine-tune, modify, or inspect the model weights. You work with what the provider gives you.
Vendor lock-in: API changes, pricing changes, or provider outages directly impact your application.
These are models where the weights are publicly released. You can download them, run them locally, fine-tune them, and deploy them however you choose.
Leading examples: Llama 3 (Meta), Mistral and Mixtral (Mistral AI), and CodeLlama, all downloadable from HuggingFace.
Strengths:
Complete data privacy: The model runs on your own infrastructure. No data ever leaves your environment. This is non-negotiable for many enterprise use cases.
No per-token cost: Once deployed, inference is effectively free (you pay only for the infrastructure). For high-volume applications, this can represent massive savings.
Full customization: You can fine-tune the model on your own data to specialize it for your use case - something impossible with closed-source models.
No vendor dependency: The model weights are yours. No API deprecations, pricing changes, or outages from a third party.
Regulatory compliance: For GDPR, HIPAA, SOC 2, and other compliance frameworks, running models locally is often required.
Weaknesses:
Capability gap: While narrowing rapidly, open-source models generally lag behind frontier closed-source models on complex tasks. The gap is smaller for specialized tasks after fine-tuning.
Infrastructure burden: You need to provision, deploy, monitor, and maintain the model serving infrastructure. This requires ML engineering expertise.
Hardware requirements: Llama 3 70B requires multiple high-end GPUs to run at reasonable speed. Even smaller 7B models need at least 8GB of VRAM.
HuggingFace is the central hub for the open-source AI ecosystem. It hosts the weights for hundreds of thousands of open models, public datasets, interactive demos (Spaces), and leaderboards that benchmark models against each other.
When evaluating an open-source model on HuggingFace, look at the license (is commercial use allowed?), the parameter count and hardware needed to run it, benchmark scores, download counts as a proxy for adoption, and the community discussions for known issues.
Approach 1: Hosted API (Easiest)
Several platforms host open-source models and expose them via API, combining the convenience of closed-source access with the cost advantages of open-source models.
from langchain_together import Together
llm = Together(
model="meta-llama/Llama-3-70b-chat-hf",
temperature=0.3,
max_tokens=512
)
Approach 2: Local Deployment with Ollama (Most Private)
Ollama is the easiest way to run open-source models locally. One command downloads and serves any supported model:
# Install Ollama, then:
ollama pull llama3
ollama pull mistral
ollama pull codellama
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke([HumanMessage(content="Explain neural networks")])
Your data never leaves your machine. No API keys, no costs, complete privacy.
Embedding models solve a fundamentally different problem from language models. Instead of generating text, they convert text into a numerical vector that captures its semantic meaning.
This might sound abstract, but it is one of the most powerful ideas in modern AI.
Consider these three sentences:
1. "A dog is chasing a ball in the park."
2. "A puppy runs after a ball on the grass."
3. "The stock market closed higher today."
Sentences 1 and 2 are semantically similar - they describe the same kind of event, just with different words. Sentence 3 is about something completely different.
An embedding model converts each sentence into a vector (a list of numbers). Sentences 1 and 2 will have vectors that are close together in vector space. Sentence 3's vector will be far away from both.
This mathematical representation of meaning is what makes semantic search, document retrieval, and RAG systems possible.
Embedding models are neural networks trained specifically to produce these vector representations. The most common architecture is a bi-encoder: two identical transformer networks that independently encode two pieces of text, with a training objective that pushes similar texts' embeddings closer together and dissimilar texts' embeddings further apart.
The dimensionality of the resulting vector varies by model:
text-embedding-ada-002 (OpenAI): 1536 dimensions
text-embedding-3-large (OpenAI): 3072 dimensions
all-MiniLM-L6-v2 (HuggingFace, local): 384 dimensions
embed-english-v3.0 (Cohere): 1024 dimensions
Higher dimensionality generally means more nuanced semantic representation, but also higher storage costs and slower search.
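The 384-dimensional MiniLM model above can run entirely on your own machine. A small sketch, assuming the langchain-huggingface package (and its sentence-transformers dependency) is installed:
from langchain_huggingface import HuggingFaceEmbeddings

# Downloads the model once, then embeds locally - no API calls
local_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector = local_embeddings.embed_query("What is gradient descent?")
print(len(vector))  # 384
The hosted OpenAI models work the same way through the shared interface: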
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")
# Embed a single text
vector = embeddings_model.embed_query("What is gradient descent?")
print(f"Vector dimensions: {len(vector)}") # 3072
print(f"First 5 values: {vector[:5]}") # [0.022, -0.091, 0.044, ...]
# Embed multiple documents
documents = [
"Gradient descent minimizes a function iteratively.",
"Neural networks are inspired by the human brain.",
"Python is a popular programming language."
]
doc_vectors = embeddings_model.embed_documents(documents)
Once texts are converted to vectors, measuring their semantic similarity becomes a geometry problem. The most common measure is cosine similarity: the cosine of the angle between two vectors.
import numpy as np
def cosine_similarity(vec1, vec2):
dot_product = np.dot(vec1, vec2)
magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
return dot_product / magnitude
# These will have high similarity
vec1 = embeddings_model.embed_query("Indian cricket team captain")
vec2 = embeddings_model.embed_query("Who leads India in cricket?")
print(cosine_similarity(vec1, vec2)) # ~0.93
# These will have low similarity
vec3 = embeddings_model.embed_query("Recipe for chocolate cake")
print(cosine_similarity(vec1, vec3)) # ~0.12
This is why semantic search is so powerful: a query for "Indian cricket team captain" will match a document that says "Rohit Sharma leads the Indian cricket team" even though those exact words do not appear in the query. The meaning is the same, so the embeddings are similar.
Generating embeddings on the fly for every search query against thousands or millions of documents would be computationally prohibitive. The solution is to pre-compute and store all document embeddings in a specialized database designed for fast similarity search: a vector database.
How it works:
1. Embed every document once and store the vectors alongside the original text and metadata.
2. At query time, embed the user's query with the same embedding model.
3. Return the stored vectors closest to the query vector, along with their texts.
Modern vector databases use approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World graphs) to search through millions of vectors in milliseconds.
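To see ANN search in isolation, here is a minimal FAISS sketch using an HNSW index over random stand-in vectors. It assumes the faiss-cpu package is installed; a real system would store embedding-model outputs instead of random data:
import numpy as np
import faiss  # pip install faiss-cpu

d = 384  # dimensionality, matching e.g. all-MiniLM-L6-v2
doc_vectors = np.random.rand(10_000, d).astype("float32")  # stand-ins for document embeddings

index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph with 32 links per node
index.add(doc_vectors)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # approximate top-5 nearest neighbors
print(ids[0])  # row indices of the closest stored vectors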
Choosing a vector database:
| Database | Ideal Use Case | Key Strength |
|---|---|---|
| Chroma | Local development, prototyping | Zero setup, runs in memory |
| FAISS | Research, single-machine scale | Fastest in-memory search |
| Pinecone | Production SaaS | Fully managed, auto-scaling |
| Weaviate | Hybrid search (semantic + keyword) | Flexible filtering |
| Qdrant | High-performance filtering | Rich metadata filtering |
| pgvector | Already using PostgreSQL | No new infrastructure |
| Milvus | Billion-scale production | Extreme scale |
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Create vector store from documents
vectorstore = Chroma.from_texts(
texts=["LangChain is a framework for LLM apps",
"Python is a programming language",
"RAG combines retrieval with generation"],
embedding=embeddings
)
# Semantic search
results = vectorstore.similarity_search("How do I build AI applications?", k=2)
for doc in results:
print(doc.page_content)
# Returns: "LangChain is a framework for LLM apps" (most similar)
# "RAG combines retrieval with generation" (second most similar)
Here is the end-to-end picture of how embedding models power a RAG application:
Indexing Phase (runs once when documents are uploaded):
Raw documents (PDF, web pages, databases)
|
v
Document Loader (standardize format)
|
v
Text Splitter (break into chunks: ~500-1000 tokens each)
|
v
Embedding Model (convert each chunk to vector)
|
v
Vector Database (store vectors + original text + metadata)
Query Phase (runs on every user question):
User question: "What are the benefits of meditation?"
|
v
Embedding Model (convert question to vector)
|
v
Vector Database similarity search (find top-k most similar chunk vectors)
|
v
Retrieved chunks: ["Meditation reduces stress by...", "Regular practice improves..."]
|
v
Chat Model prompt: [System] + [Retrieved context] + [User question]
|
v
Chat Model generates grounded answer
|
v
Response to user (with source citations)
The embedding model appears twice: once at index time (embed all documents) and once at query time (embed the user's question). It is the mathematical bridge between language and search.
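Stitching the pipeline together with the components already shown, here is a minimal RAG sketch over a toy corpus. It assumes an OpenAI key; a production system would add document loading, chunking, and source citations:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Indexing phase (run once)
vectorstore = Chroma.from_texts(
    texts=["Meditation reduces stress by lowering cortisol levels.",
           "Regular meditation practice improves focus and sleep quality."],
    embedding=OpenAIEmbeddings()
)

# Query phase (run per question)
question = "What are the benefits of meditation?"
chunks = vectorstore.similarity_search(question, k=2)
context = "\n".join(doc.page_content for doc in chunks)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
answer = llm.invoke([
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context}\n\nQuestion: {question}")
])
print(answer.content)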
Every LangChain project should follow this structure:
my_langchain_project/
├── .env # API keys (never commit this)
├── .gitignore # Include .env here
├── requirements.txt # Dependencies
├── main.py # Application entry point
└── src/
├── models.py # Model initialization
├── chains.py # Chain definitions
└── utils.py # Helper functions
Never hardcode API keys in your source code. Use environment variables loaded from a .env file:
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
HUGGINGFACEHUB_API_TOKEN=hf_...
LANGCHAIN_API_KEY=ls__... # For LangSmith tracing
LANGCHAIN_TRACING_V2=true
# Load in your Python code
from dotenv import load_dotenv
import os
load_dotenv() # Reads .env file and sets environment variables
openai_key = os.getenv("OPENAI_API_KEY")
This keeps secrets out of your codebase and makes environment-specific configuration straightforward.
One method appears throughout LangChain - invoke(). Understanding it once means you understand how to call models, chains, prompts, retrievers, and agents.
# Calling a model
llm.invoke("What is machine learning?")
llm.invoke([HumanMessage(content="What is machine learning?")])
# Calling a prompt template
prompt.invoke({"topic": "neural networks", "audience": "beginners"})
# Calling a chain
chain.invoke({"question": "How do I use LangChain?"})
# Calling a retriever
retriever.invoke("What is gradient descent?")
The pattern is always the same: pass in the required inputs, get back the output. The consistency across the entire framework is one of LangChain's greatest strengths.
Not every task needs GPT-4o. Matching the model to the task saves cost, reduces latency, and often produces better results.
Task: Simple Q&A, basic summarization, classification. Good fit: a small, fast, inexpensive model at temperature near 0.
Task: Complex reasoning, multi-step analysis, coding. Good fit: a frontier model such as GPT-4o or Claude 3.5 Sonnet at low temperature.
Task: Creative writing, brainstorming. Good fit: a capable chat model at temperature 0.9 or higher.
Task: Data extraction, structured output, JSON generation. Good fit: a model with strong instruction following at temperature 0, with max_tokens sized to the expected output.
Task: Private data, regulated industry, local deployment. Good fit: an open-source model served locally, for example Llama 3 via Ollama.
Task: High-volume production (millions of calls/day). Good fit: a self-hosted open-source model or a hosted open-source API to control per-token cost.
A minimal routing sketch built on these pairings follows.
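The model names and temperatures here are illustrative assumptions, not a prescription; tune them to your own constraints:
from langchain_openai import ChatOpenAI

# Hypothetical task-to-model routing table
MODELS = {
    "simple": ChatOpenAI(model="gpt-4o-mini", temperature=0),  # cheap and fast
    "complex": ChatOpenAI(model="gpt-4o", temperature=0.2),    # strongest reasoning
    "creative": ChatOpenAI(model="gpt-4o", temperature=0.9),   # varied output
}

def route(task_type: str, prompt: str) -> str:
    # Pick the model tier that matches the task, then call it like any other model
    return MODELS[task_type].invoke(prompt).content

print(route("simple", "Classify this ticket as billing/technical/other: 'My card was charged twice.'"))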
1. The Model Component is a unified interface. LangChain wraps every AI provider behind a consistent API. Switching from OpenAI to Anthropic is a one-line change, not a rewrite.
2. Chat models have replaced traditional LLMs. Structured message formats (system, human, AI), role support, and instruction-following capability make chat models the right choice for virtually all modern applications.
3. Temperature controls creativity vs. precision. Low temperature for facts and code. High temperature for creativity and brainstorming. Most production applications live in the 0.2-0.5 range.
4. Embedding models convert meaning to mathematics. They enable semantic search - finding relevant content based on meaning, not keywords. This is the foundation of every RAG application.
5. Vector databases make semantic search scalable. Pre-compute and store embeddings once; search millions of documents in milliseconds at query time.
6. Open-source vs. closed-source is a real trade-off. Closed-source wins on raw capability and simplicity. Open-source wins on privacy, cost at scale, and customization. The right choice depends on your specific constraints.
Models are not just one component among six in LangChain. They are the cognitive core that everything else serves. Prompts exist to structure input for models. Chains exist to move data to and from models. Indexes exist to give models relevant context. Memory exists to give models conversational continuity. Agents exist to let models decide what to do.
When you understand models deeply - how they generate text token by token, how embeddings represent meaning geometrically, why chat models replaced traditional LLMs, and how to pick the right model for the right task - every other LangChain concept becomes significantly easier to grasp.
The model is where language meets intelligence. Everything else is infrastructure.