Understanding models

LLMs and the models people actually use

intermediate 15 min read

In 30 seconds

GPT, Claude, Gemini, Llama. How they work at a useful level, what makes them different, and how to choose.

Product isn’t model

This is the number one confusion. It costs time and leads to bad decisions.

ChatGPT is a product. You open it in the browser, type a question, and get an answer. Inside, ChatGPT uses models like GPT-4o or o3. When OpenAI releases a new model, the product stays the same while the “brain” changes. The same logic applies to Claude (Sonnet, Opus, Haiku) and Gemini (Gemini 2.5 Pro).

Simple rule: product is where you interact. Model is what generates the answer.
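The product/model split can be pictured as a thin wrapper around a swappable engine. This is a minimal sketch, not any vendor's real architecture; the class names and the stubbed generate() behavior are illustrative.

```python
# Sketch: a product is a stable wrapper; the model behind it is swappable.
# Names and behavior here are illustrative stubs, not real APIs.

class Model:
    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> str:
        # Stub: a real model would return generated text here.
        return f"[{self.name}] answer to: {prompt}"

class ChatProduct:
    """The product (interface, history, safety layers) stays the same;
    only the model behind it changes between releases."""
    def __init__(self, model: Model):
        self.model = model

    def ask(self, question: str) -> str:
        return self.model.generate(question)

chat = ChatProduct(Model("gpt-4o"))
chat.model = Model("o3")  # new release: swap the brain, keep the product
```

The point of the sketch: nothing the user touches changes when the model is swapped, which is exactly why the two get confused.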

How models work, without becoming a scientist

You don’t need neural-network math to work with LLMs. But four concepts change how you use every AI tool.

Training

Models are trained on huge volumes of text: books, articles, code, web pages. From this, the model learns statistical patterns and predicts which token is likely to come next. That’s why models generate text that sounds right but is sometimes wrong: they optimize for likely, not true.
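"Predict the likely next token" can be shown with a toy model: count which word follows which in a tiny corpus, then always pick the most frequent successor. Real LLMs use neural networks over subword tokens, but the principle, and the failure mode, is the same.

```python
# Toy next-token prediction: a bigram frequency table over a tiny corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def next_token(word: str) -> str:
    # Most frequent follower: "likely", not necessarily "true".
    return successors[word].most_common(1)[0][0]

print(next_token("the"))  # "cat" -- it follows "the" most often here
```

Notice the model has no notion of truth: it would happily continue a false sentence as long as the continuation is statistically typical.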

Tokens and context window

The model splits text into tokens and generates output one token at a time. Each model has a limit on how many tokens it can handle at once: the context window. In 2026, windows range from ~128K to ~1M+ tokens, but a medium software project can contain millions. You still need to choose what goes into context.
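A quick way to reason about fit: estimate token counts and compare them to the window. The 4-characters-per-token ratio below is a common rule of thumb for English text, not an exact tokenizer; real counts come from the model's own tokenizer.

```python
# Rough sketch: will this content fit in the context window?
# ~4 chars per token is a rule of thumb, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(texts: list[str], window: int = 128_000) -> bool:
    total = sum(estimate_tokens(t) for t in texts)
    return total <= window

files = ["def handler(event):\n    ...\n" * 200]  # pretend source file
print(fits_in_context(files))         # one file: fits easily
print(fits_in_context(files * 5000))  # a whole project: does not
```

This is why "just paste the repo" stops working: even a 1M-token window fills up, and you are back to choosing what matters.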

Inference and cost

Every message sent and answer received is an inference, and each inference costs money via APIs. Larger models cost more per token. “Use the best model for everything” can cost 10x more than “use the right model for each task”.
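The 10x claim is easy to verify with arithmetic. The prices below are hypothetical placeholders chosen to make the math clean, not real vendor pricing.

```python
# Sketch of per-token cost math. Prices are hypothetical placeholders.

PRICE_PER_1M_TOKENS = {  # (input, output) USD per million tokens
    "small-fast-model": (0.50, 1.50),
    "large-reasoning-model": (5.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICE_PER_1M_TOKENS[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 1,000 everyday requests, ~2K tokens in / 500 out each
everyday = sum(cost("small-fast-model", 2_000, 500) for _ in range(1000))
premium = sum(cost("large-reasoning-model", 2_000, 500) for _ in range(1000))
print(f"${everyday:.2f} vs ${premium:.2f}")  # $1.75 vs $17.50 -- 10x
```

At these placeholder prices the gap is exactly 10x; with real pricing the ratio varies, but the shape of the argument holds.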

Real limitations

Models don’t know what they don’t know and generate answers even without enough information. That’s hallucination. Models also have a knowledge cutoff: ask about something released yesterday and the model may invent an answer based on patterns, not facts.

Profiles by task, not rankings

There’s no single best model. There’s the right model for the task. Think in profiles:

GPT-4o (OpenAI)

Product: ChatGPT

Profile: Fast and versatile. Good at general tasks, conversation, and multimodal work.

Best for: Exploring ideas quickly, conversational and multimodal tasks.

Limitations: Deep reasoning can be shallow. Medium cost.

o3 (OpenAI)

Product: ChatGPT

Profile: Deep, structured reasoning for complex multi-step problems.

Best for: Planning system architecture, deep code analysis.

Limitations: Slower and more expensive than GPT-4o. Can overthink simple tasks.

Claude Sonnet (Anthropic)

Product: Claude

Profile: Strong balance of speed and quality. Strong at code and long instructions.

Best for: Writing and reviewing code, following detailed specs.

Limitations: Can be conservative. Large context, but not infinite.

Claude Opus (Anthropic)

Product: Claude

Profile: Deep, nuanced reasoning. Strong for long and complex tasks.

Best for: Planning large refactors, reviewing architecture.

Limitations: Slower and more expensive. Too much for simple tasks.

Gemini 2.5 Pro (Google)

Product: Gemini

Profile: Very large context window. Strong multimodal and research profile.

Best for: Analyzing long documents, exploring and synthesizing information.

Limitations: Code quality can vary. Less predictable with complex instructions.

Llama 4 (Meta)

Product: Llama, open-weight

Profile: Open-weight. Can run locally without depending on an external API.

Best for: Privacy-sensitive projects, domain fine-tuning.

Limitations: Requires strong hardware. Lower quality than the best proprietary models.

DeepSeek V3 (DeepSeek)

Product: DeepSeek, open-weight

Profile: Open-weight with competitive quality. Strong at code and reasoning.

Best for: Lower-cost code generation, budget-limited projects.

Limitations: Smaller ecosystem. Availability can vary by region.

Why this matters

Understanding models isn’t about becoming an AI specialist. It’s about making informed decisions: knowing which model is behind the tool, what it costs, and where it fails. Model literacy is the difference between using AI in the dark and using AI on purpose.

Real example

Same project, three tasks, three models:

Task 1: Explore a feature idea. The PM uses GPT-4o in ChatGPT for a quick conversation about webhook trade-offs and complexity. Fast, cheap, good for exploration.

Task 2: Plan the implementation. The tech lead uses Claude Opus to analyze the architecture and propose a plan across affected modules. Slower and pricier, but the deeper reasoning is worth it.

Task 3: Implement the code. The developer uses Claude Sonnet through a coding agent. It’s fast enough for iteration and strong at code, with no need for Opus on every function.

Three different models. None is “the best”. Each is right for the task.
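The three-task example above is a routing decision, and routing can be made explicit. The mapping below mirrors the profiles in this article; it is an illustrative choice, not a vendor recommendation, and the model identifiers are shorthand, not exact API names.

```python
# Sketch: route each task type to a model profile.
# The mapping mirrors the article's example; names are shorthand.

ROUTES = {
    "explore-idea": "gpt-4o",           # fast, cheap, conversational
    "plan-architecture": "claude-opus", # deep reasoning, worth the cost
    "implement-code": "claude-sonnet",  # fast enough, strong at code
}

def pick_model(task_type: str) -> str:
    # Fall back to a fast general model for anything unclassified.
    return ROUTES.get(task_type, "gpt-4o")

print(pick_model("plan-architecture"))  # claude-opus
```

Teams that make this table explicit, even informally, stop defaulting to the priciest model for everything.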

Where this breaks

  • Benchmark worship: Benchmarks measure standardized tasks, not your project. A leaderboard is a signal, not an answer.

  • Bigger isn’t better: The most expensive model isn’t automatically best for your task. Using Opus for a tiny autocomplete is like driving a truck to buy bread.

  • Ignoring limitations: Every model hallucinates. Every model has a knowledge cutoff. Every model can produce code that compiles but does the wrong thing.


Takeaway

  • Product is where you interact. Model is what generates the response
  • Choose the model by the task, not the ranking. Fast models handle most daily work
  • Factor in cost and context window before sending everything to the priciest model
  • Every model hallucinates. Human review is not optional
