
Every LangChain application - whether it is a chatbot, a document search engine, a coding assistant, or an autonomous agent - has one thing at its absolute core: a model.
Not a chain. Not a prompt. Not an agent. A model.
Everything else in LangChain exists to feed data into a model, structure its inputs, and do something useful with its outputs. If you do not deeply understand the Model component, every other part of LangChain will feel like memorizing syntax without understanding what you are actually building.
This guide fixes that. We will go far beyond "models take text and return text." We will cover the internal mechanics of how language models work, why the industry shifted from traditional LLMs to chat models, how embedding models convert language into mathematics, and how all of this connects to real applications like semantic search and RAG systems.
By the end, you will understand not just what models are, but why they are designed the way they are - and that understanding will make every LangChain concept that follows click into place.
At the simplest level, the Model component in LangChain is a standardized interface for communicating with AI models.
But let us unpack why that standardization matters so much.
The AI provider landscape today looks something like this: OpenAI (GPT models), Anthropic (Claude), Google (Gemini), Cohere, Meta (Llama), Mistral, and a fast-growing catalog of open-source models on HuggingFace.
Each of these providers has its own API format, authentication method, request structure, and response schema. If you build your application directly against OpenAI's API and later need to switch to Anthropic (for cost, compliance, capability, or availability reasons), you face a significant rewrite.
LangChain solves this with a unified abstraction layer. Every model - regardless of provider - is called the same way:
response = llm.invoke("Your question here")
The provider-specific complexity is hidden inside the model object. Your application logic stays clean and portable.
LangChain divides all models into two distinct categories, each designed for a fundamentally different purpose.
Traditional language models follow a simple contract: you give them text, they return text. This is sometimes called the "completion" paradigm - you provide a prompt and the model completes it.
from langchain_openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-instruct")
response = llm.invoke("The capital of France is")
# Response: "Paris, a city known for..."
Understanding what happens inside a language model changes how you think about prompting, temperature, and model selection.
Language models are trained to predict the next token (roughly, the next word or word-piece) in a sequence. During training, the model sees billions of text examples and learns the statistical patterns of language - which words tend to follow which other words, in which contexts.
At inference time (when you call the model), this is what happens:
1. Your prompt is split into tokens.
2. The model computes a probability distribution over every possible next token.
3. One token is sampled from that distribution (how it is sampled depends on the temperature setting).
4. The sampled token is appended to the sequence, and the process repeats until a stop token or length limit is reached.
This is why language model outputs are probabilistic, not deterministic. The same input can produce different outputs on different runs - a feature, not a bug, that enables creativity and diversity.
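A quick way to see this for yourself is to send the same prompt twice at a nonzero temperature. A minimal sketch, assuming an OpenAI API key is configured in your environment:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=1.0)
prompt = "Write a one-line tagline for a coffee shop."

# Two calls with the same input: sampling makes the outputs differ
print(llm.invoke(prompt).content)
print(llm.invoke(prompt).content)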
Temperature is one of the most important parameters you will use, and it is widely misunderstood.
Technically, temperature rescales the logits (raw model outputs) before they are converted to probabilities via the softmax function; a short numeric sketch follows the presets below. Here is the intuition:
Temperature = 0 (or near 0): The model always picks the single most probable next token. Output is effectively deterministic; repeated runs produce the same result (up to minor provider-side nondeterminism).
Best for: code generation, factual Q&A, data extraction, structured output - anywhere correctness matters more than variety.
Temperature = 0.3 - 0.7: The model strongly favors likely tokens but occasionally picks less probable ones. Output feels natural and varied without being incoherent.
Best for: chatbots, explanations, summarization, customer service - the sweet spot for most conversational applications.
Temperature = 1.0: The model samples from its predicted probability distribution unchanged. Balanced between coherence and creativity.
Temperature = 1.5 - 2.0: The model is much more likely to pick unexpected tokens. Output becomes creative, surprising, and sometimes incoherent.
Best for: creative writing, brainstorming, generating diverse options - anywhere novelty is valued over precision.
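Here is the numeric sketch promised above: a minimal temperature-scaled softmax in numpy. The four logit values are invented for illustration:
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax:
    # low temperature sharpens the distribution, high temperature flattens it
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
for t in [0.1, 1.0, 2.0]:
    print(t, softmax_with_temperature(logits, t).round(3))
# At t=0.1 nearly all probability sits on the top token (deterministic feel);
# at t=2.0 the mass spreads out, so unlikely tokens get sampled more often
In practice you never compute this yourself; you just set the parameter: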
from langchain_openai import ChatOpenAI
# For code generation - precise and deterministic
code_llm = ChatOpenAI(model="gpt-4o", temperature=0)
# For customer support - natural and consistent
support_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
# For creative writing - varied and expressive
creative_llm = ChatOpenAI(model="gpt-4o", temperature=0.9)
Tokens are the unit of measurement for LLM inputs and outputs. As a rough guide, one token is about four characters of English text: roughly 75 words per 100 tokens, and on the order of 500 tokens for a full page.
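If you want exact counts rather than rough guides, OpenAI's tiktoken library exposes the same tokenizer the models use. A small sketch, assuming tiktoken is installed:
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "LangChain wraps every provider behind one interface."
tokens = enc.encode(text)
print(len(tokens))             # the token count you are billed for
print(enc.decode(tokens[:3]))  # tokens are word-pieces, not whole words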
max_tokens sets a hard ceiling on the length of the model's response. The model will stop generating when it hits this limit, even mid-sentence.
llm = ChatOpenAI(model="gpt-4o", max_tokens=500)
Why does this matter?
Cost control: API providers charge per token (input + output). Uncapped responses from complex prompts can be expensive at scale; a back-of-the-envelope cost sketch follows this list.
Latency control: Longer responses take more time to generate. For real-time applications, shorter max_tokens means faster responses.
Use case fit: A one-sentence answer needs max_tokens=50. A full essay needs max_tokens=2000. Sizing this appropriately improves both cost and user experience.
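Here is that cost sketch. The per-token rates below are placeholders, not real prices; substitute your provider's current pricing:
# Illustrative rates only - check your provider's actual price sheet
INPUT_COST_PER_TOKEN = 2.50 / 1_000_000    # dollars per input token (placeholder)
OUTPUT_COST_PER_TOKEN = 10.00 / 1_000_000  # dollars per output token (placeholder)

def daily_cost(input_tokens, output_tokens, calls_per_day):
    per_call = (input_tokens * INPUT_COST_PER_TOKEN
                + output_tokens * OUTPUT_COST_PER_TOKEN)
    return per_call * calls_per_day

# 1M calls/day, 500-token prompts, 300-token responses
print(f"${daily_cost(500, 300, 1_000_000):,.2f} per day")
Note how the output term dominates here: lowering max_tokens is often the quickest cost lever.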
Traditional language models treat the entire conversation as a single flat string. To give a model conversation history, you had to manually format it:
User: What is gradient descent?
Assistant: Gradient descent is an optimization algorithm...
User: Can you give me a Python example?
Assistant:
This worked, but it was brittle. The model had to infer roles from the text itself. There was no clear separation between system instructions, user messages, and assistant responses.
Chat models solved this with a structured message format. Instead of a flat string, you pass a list of typed messages:
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
messages = [
SystemMessage(content="You are a Python expert. Be concise and practical."),
HumanMessage(content="What is gradient descent?"),
AIMessage(content="Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient."),
HumanMessage(content="Can you give me a Python example?")
]
response = llm.invoke(messages)
This structure gives the model explicit, unambiguous signals about:
- Who is speaking: instructions (system), user input (human), or its own prior output (AI)
- Where each conversational turn begins and ends
- Which rules apply to the whole conversation versus a single turn
SystemMessage
The system message is your most powerful tool for shaping model behavior. It is processed before any user input and sets the model's persona, constraints, expertise, and behavioral rules for the entire conversation.
Effective system messages are specific, not vague:
# Vague - not very effective
SystemMessage(content="You are a helpful assistant.")
# Specific - shapes behavior precisely
SystemMessage(content="""You are a senior data scientist with 10 years of experience
in machine learning and statistics. When explaining concepts:
- Always start with the intuition before the math
- Use concrete, real-world examples
- Flag when a concept has common misconceptions
- Recommend further reading when appropriate
Never use jargon without explaining it first.""")
HumanMessage
Represents what the user typed. In a simple single-turn interaction, this is just the user's question. In a multi-turn conversation, there will be multiple HumanMessage objects in the list, each paired with an AIMessage response.
AIMessage
Represents what the model previously said. Including past AIMessage objects in your conversation history is what gives the model its "memory" of what it already told the user.
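Putting the three message types together, here is a minimal multi-turn sketch: the returned AIMessage is appended to the history so the next turn carries the "memory" (assumes an OpenAI key is configured):
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
history = [SystemMessage(content="You are a concise Python tutor.")]

for question in ["What is gradient descent?", "Show me a Python example."]:
    history.append(HumanMessage(content=question))
    reply = llm.invoke(history)   # the full history is sent on every call
    history.append(reply)         # reply is an AIMessage; appending it is the "memory"
    print(reply.content[:100])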
The shift from traditional LLMs to chat models is nearly complete for production applications. Here is why:
1. Role clarity: The model always knows which content is instructions (system), which is user input (human), and which is its own prior output (AI). This reduces hallucination and off-topic responses.
2. Instruction following: Chat models are fine-tuned specifically for following instructions, not just completing text. They are much better at respecting constraints, maintaining personas, and producing structured output.
3. Safety alignment: Chat models have RLHF (Reinforcement Learning from Human Feedback) and constitutional AI training baked in, making them more reliable for production use.
4. Tool use and function calling: Agents and tool-using applications require structured output formats that chat models support natively.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
# All three share the same interface
openai_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
anthropic_llm = ChatAnthropic(model="claude-opus-4-5", temperature=0.3)
google_llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3)
# Called identically regardless of provider
response = openai_llm.invoke([HumanMessage(content="Explain transformers")])
One of the most important decisions in any LLM application is which model to use. The choice between closed-source (proprietary) and open-source models involves real trade-offs across capability, cost, privacy, and control.
These are models developed and hosted by private companies. You access them via API, paying per token. You never have access to the weights.
Leading examples: GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), and Gemini 1.5 Pro (Google).
Strengths:
Raw capability: The frontier closed-source models (GPT-4o, Claude 3.5 Sonnet) currently outperform open-source alternatives on most benchmarks, especially for complex reasoning, coding, and instruction following.
Zero infrastructure: No GPUs, no deployment, no maintenance. Call the API and you are done. This dramatically accelerates time-to-production for most teams.
Reliability and uptime: Enterprise-grade SLAs with 99.9%+ uptime. You are not responsible for model serving.
Continuous improvement: Providers regularly update and improve their models without you changing any code.
Weaknesses:
Cost at scale: Per-token pricing is manageable at low volume but becomes significant at scale. A high-traffic application making millions of API calls per day can incur substantial costs.
Data privacy: Your prompts and user data are sent to a third-party server. For regulated industries (healthcare, finance, legal) or applications involving sensitive personal data, this may be unacceptable.
No customization: You cannot fine-tune, modify, or inspect the model weights. You work with what the provider gives you.
Vendor lock-in: API changes, pricing changes, or provider outages directly impact your application.
These are models where the weights are publicly released. You can download them, run them locally, fine-tune them, and deploy them however you choose.
Leading examples: Llama 3 (Meta), Mistral and Mixtral (Mistral AI), and CodeLlama, all downloadable from HuggingFace.
Strengths:
Complete data privacy: The model runs on your own infrastructure. No data ever leaves your environment. This is non-negotiable for many enterprise use cases.
No per-token cost: Once deployed, inference is effectively free (you pay only for the infrastructure). For high-volume applications, this can represent massive savings.
Full customization: You can fine-tune the model on your own data to specialize it for your use case - something impossible with closed-source models.
No vendor dependency: The model weights are yours. No API deprecations, pricing changes, or outages from a third party.
Regulatory compliance: For GDPR, HIPAA, SOC 2, and other compliance frameworks, running models locally is often required.
Weaknesses:
Capability gap: While narrowing rapidly, open-source models generally lag behind frontier closed-source models on complex tasks. The gap is smaller for specialized tasks after fine-tuning.
Infrastructure burden: You need to provision, deploy, monitor, and maintain the model serving infrastructure. This requires ML engineering expertise.
Hardware requirements: Llama 3 70B requires multiple high-end GPUs to run at reasonable speed. Even smaller 7B models need at least 8GB of VRAM.
HuggingFace is the central hub for the open-source AI ecosystem. It hosts the weights for hundreds of thousands of open models, public datasets, interactive demos (Spaces), and leaderboards that benchmark models against each other.
When evaluating an open-source model on HuggingFace, look at the license (is commercial use allowed?), the parameter count and hardware needed to run it, benchmark scores, download counts as a proxy for adoption, and the community discussions for known issues.
Approach 1: Hosted API (Easiest)
Several platforms host open-source models and expose them via API, combining the convenience of closed-source access with the cost advantages of open-source models.
from langchain_together import Together
llm = Together(
model="meta-llama/Llama-3-70b-chat-hf",
temperature=0.3,
max_tokens=512
)
Approach 2: Local Deployment with Ollama (Most Private)
Ollama is the easiest way to run open-source models locally. One command downloads and serves any supported model:
# Install Ollama, then:
ollama pull llama3
ollama pull mistral
ollama pull codellama
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke([HumanMessage(content="Explain neural networks")])
Your data never leaves your machine. No API keys, no costs, complete privacy.
Embedding models solve a fundamentally different problem from language models. Instead of generating text, they convert text into a numerical vector that captures its semantic meaning.
This might sound abstract, but it is one of the most powerful ideas in modern AI.
Consider these three sentences:
1. "A dog is chasing a ball in the park."
2. "A puppy runs after a ball on the grass."
3. "The stock market closed higher today."
Sentences 1 and 2 are semantically similar - they describe the same kind of event, just with different words. Sentence 3 is about something completely different.
An embedding model converts each sentence into a vector (a list of numbers). Sentences 1 and 2 will have vectors that are close together in vector space. Sentence 3's vector will be far away from both.
This mathematical representation of meaning is what makes semantic search, document retrieval, and RAG systems possible.
Embedding models are neural networks trained specifically to produce these vector representations. The most common architecture is a bi-encoder: two identical transformer networks that independently encode two pieces of text, with a training objective that pushes similar texts' embeddings closer together and dissimilar texts' embeddings further apart.
The dimensionality of the resulting vector varies by model:
text-embedding-ada-002 (OpenAI): 1536 dimensions
text-embedding-3-large (OpenAI): 3072 dimensions
all-MiniLM-L6-v2 (HuggingFace, local): 384 dimensions
embed-english-v3.0 (Cohere): 1024 dimensions
Higher dimensionality generally means more nuanced semantic representation, but also higher storage costs and slower search.
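The 384-dimensional MiniLM model above can run entirely on your own machine. A small sketch, assuming the langchain-huggingface package (and its sentence-transformers dependency) is installed:
from langchain_huggingface import HuggingFaceEmbeddings

# Downloads the model once, then embeds locally - no API calls
local_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector = local_embeddings.embed_query("What is gradient descent?")
print(len(vector))  # 384
The hosted OpenAI models work the same way through the shared interface: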
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")
# Embed a single text
vector = embeddings_model.embed_query("What is gradient descent?")
print(f"Vector dimensions: {len(vector)}") # 3072
print(f"First 5 values: {vector[:5]}") # [0.022, -0.091, 0.044, ...]
# Embed multiple documents
documents = [
"Gradient descent minimizes a function iteratively.",
"Neural networks are inspired by the human brain.",
"Python is a popular programming language."
]
doc_vectors = embeddings_model.embed_documents(documents)
Once texts are converted to vectors, measuring their semantic similarity becomes a geometry problem. The most common measure is cosine similarity: the cosine of the angle between two vectors.
import numpy as np
def cosine_similarity(vec1, vec2):
dot_product = np.dot(vec1, vec2)
magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
return dot_product / magnitude
# These will have high similarity
vec1 = embeddings_model.embed_query("Indian cricket team captain")
vec2 = embeddings_model.embed_query("Who leads India in cricket?")
print(cosine_similarity(vec1, vec2)) # ~0.93
# These will have low similarity
vec3 = embeddings_model.embed_query("Recipe for chocolate cake")
print(cosine_similarity(vec1, vec3)) # ~0.12
This is why semantic search is so powerful: a query for "Indian cricket team captain" will match a document that says "Rohit Sharma leads the Indian cricket team" even though those exact words do not appear in the query. The meaning is the same, so the embeddings are similar.
Generating embeddings on the fly for every search query against thousands or millions of documents would be computationally prohibitive. The solution is to pre-compute and store all document embeddings in a specialized database designed for fast similarity search: a vector database.
How it works:
1. Embed every document once and store the vectors alongside the original text and metadata.
2. At query time, embed the user's query with the same embedding model.
3. Return the stored vectors closest to the query vector, along with their texts.
Modern vector databases use approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World graphs) to search through millions of vectors in milliseconds.
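To see ANN search in isolation, here is a minimal FAISS sketch using an HNSW index over random stand-in vectors. It assumes the faiss-cpu package is installed; a real system would store embedding-model outputs instead of random data:
import numpy as np
import faiss  # pip install faiss-cpu

d = 384  # dimensionality, matching e.g. all-MiniLM-L6-v2
doc_vectors = np.random.rand(10_000, d).astype("float32")  # stand-ins for document embeddings

index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph with 32 links per node
index.add(doc_vectors)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # approximate top-5 nearest neighbors
print(ids[0])  # row indices of the closest stored vectors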
Choosing a vector database:
| Database | Ideal Use Case | Key Strength |
|---|---|---|
| Chroma | Local development, prototyping | Zero setup, runs in memory |
| FAISS | Research, single-machine scale | Fastest in-memory search |
| Pinecone | Production SaaS | Fully managed, auto-scaling |
| Weaviate | Hybrid search (semantic + keyword) | Flexible filtering |
| Qdrant | High-performance filtering | Rich metadata filtering |
| pgvector | Already using PostgreSQL | No new infrastructure |
| Milvus | Billion-scale production | Extreme scale |
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Create vector store from documents
vectorstore = Chroma.from_texts(
texts=["LangChain is a framework for LLM apps",
"Python is a programming language",
"RAG combines retrieval with generation"],
embedding=embeddings
)
# Semantic search
results = vectorstore.similarity_search("How do I build AI applications?", k=2)
for doc in results:
print(doc.page_content)
# Returns: "LangChain is a framework for LLM apps" (most similar)
# "RAG combines retrieval with generation" (second most similar)
Here is the end-to-end picture of how embedding models power a RAG application:
Indexing Phase (runs once when documents are uploaded):
Raw documents (PDF, web pages, databases)
|
v
Document Loader (standardize format)
|
v
Text Splitter (break into chunks: ~500-1000 tokens each)
|
v
Embedding Model (convert each chunk to vector)
|
v
Vector Database (store vectors + original text + metadata)
Query Phase (runs on every user question):
User question: "What are the benefits of meditation?"
|
v
Embedding Model (convert question to vector)
|
v
Vector Database similarity search (find top-k most similar chunk vectors)
|
v
Retrieved chunks: ["Meditation reduces stress by...", "Regular practice improves..."]
|
v
Chat Model prompt: [System] + [Retrieved context] + [User question]
|
v
Chat Model generates grounded answer
|
v
Response to user (with source citations)
The embedding model appears twice: once at index time (embed all documents) and once at query time (embed the user's question). It is the mathematical bridge between language and search.
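Stitching the pipeline together with the components already shown, here is a minimal RAG sketch over a toy corpus. It assumes an OpenAI key; a production system would add document loading, chunking, and source citations:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Indexing phase (run once)
vectorstore = Chroma.from_texts(
    texts=["Meditation reduces stress by lowering cortisol levels.",
           "Regular meditation practice improves focus and sleep quality."],
    embedding=OpenAIEmbeddings()
)

# Query phase (run per question)
question = "What are the benefits of meditation?"
chunks = vectorstore.similarity_search(question, k=2)
context = "\n".join(doc.page_content for doc in chunks)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
answer = llm.invoke([
    SystemMessage(content="Answer using only the provided context."),
    HumanMessage(content=f"Context:\n{context}\n\nQuestion: {question}")
])
print(answer.content)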
Every LangChain project should follow this structure:
my_langchain_project/
├── .env # API keys (never commit this)
├── .gitignore # Include .env here
├── requirements.txt # Dependencies
├── main.py # Application entry point
└── src/
├── models.py # Model initialization
├── chains.py # Chain definitions
└── utils.py # Helper functions
Never hardcode API keys in your source code. Use environment variables loaded from a .env file:
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
HUGGINGFACEHUB_API_TOKEN=hf_...
LANGCHAIN_API_KEY=ls__... # For LangSmith tracing
LANGCHAIN_TRACING_V2=true
# Load in your Python code
from dotenv import load_dotenv
import os
load_dotenv() # Reads .env file and sets environment variables
openai_key = os.getenv("OPENAI_API_KEY")
This keeps secrets out of your codebase and makes environment-specific configuration straightforward.
One method appears throughout LangChain - invoke(). Understanding it once means you understand how to call models, chains, prompts, retrievers, and agents.
# Calling a model
llm.invoke("What is machine learning?")
llm.invoke([HumanMessage(content="What is machine learning?")])
# Calling a prompt template
prompt.invoke({"topic": "neural networks", "audience": "beginners"})
# Calling a chain
chain.invoke({"question": "How do I use LangChain?"})
# Calling a retriever
retriever.invoke("What is gradient descent?")
The pattern is always the same: pass in the required inputs, get back the output. The consistency across the entire framework is one of LangChain's greatest strengths.
Not every task needs GPT-4o. Matching the model to the task saves cost, reduces latency, and often produces better results.
Task: Simple Q&A, basic summarization, classification. Good fit: a small, fast, inexpensive model at temperature near 0.
Task: Complex reasoning, multi-step analysis, coding. Good fit: a frontier model such as GPT-4o or Claude 3.5 Sonnet at low temperature.
Task: Creative writing, brainstorming. Good fit: a capable chat model at temperature 0.9 or higher.
Task: Data extraction, structured output, JSON generation. Good fit: a model with strong instruction following at temperature 0, with max_tokens sized to the expected output.
Task: Private data, regulated industry, local deployment. Good fit: an open-source model served locally, for example Llama 3 via Ollama.
Task: High-volume production (millions of calls/day). Good fit: a self-hosted open-source model or a hosted open-source API to control per-token cost.
A minimal routing sketch built on these pairings follows.
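The model names and temperatures here are illustrative assumptions, not a prescription; tune them to your own constraints:
from langchain_openai import ChatOpenAI

# Hypothetical task-to-model routing table
MODELS = {
    "simple": ChatOpenAI(model="gpt-4o-mini", temperature=0),  # cheap and fast
    "complex": ChatOpenAI(model="gpt-4o", temperature=0.2),    # strongest reasoning
    "creative": ChatOpenAI(model="gpt-4o", temperature=0.9),   # varied output
}

def route(task_type: str, prompt: str) -> str:
    # Pick the model tier that matches the task, then call it like any other model
    return MODELS[task_type].invoke(prompt).content

print(route("simple", "Classify this ticket as billing/technical/other: 'My card was charged twice.'"))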
1. The Model Component is a unified interface. LangChain wraps every AI provider behind a consistent API. Switching from OpenAI to Anthropic is a one-line change, not a rewrite.
2. Chat models have replaced traditional LLMs. Structured message formats (system, human, AI), role support, and instruction-following capability make chat models the right choice for virtually all modern applications.
3. Temperature controls creativity vs. precision. Low temperature for facts and code. High temperature for creativity and brainstorming. Most production applications live in the 0.2-0.5 range.
4. Embedding models convert meaning to mathematics. They enable semantic search - finding relevant content based on meaning, not keywords. This is the foundation of every RAG application.
5. Vector databases make semantic search scalable. Pre-compute and store embeddings once; search millions of documents in milliseconds at query time.
6. Open-source vs. closed-source is a real trade-off. Closed-source wins on raw capability and simplicity. Open-source wins on privacy, cost at scale, and customization. The right choice depends on your specific constraints.
Models are not just one component among six in LangChain. They are the cognitive core that everything else serves. Prompts exist to structure input for models. Chains exist to move data to and from models. Indexes exist to give models relevant context. Memory exists to give models conversational continuity. Agents exist to let models decide what to do.
When you understand models deeply - how they generate text token by token, how embeddings represent meaning geometrically, why chat models replaced traditional LLMs, and how to pick the right model for the right task - every other LangChain concept becomes significantly easier to grasp.
The model is where language meets intelligence. Everything else is infrastructure.