The Setup: What Extended Thinking Actually Does
Anthropic rolled out Extended Thinking as a feature built into Claude, letting the model generate internal reasoning before it delivers an answer. Users can toggle visibility to watch the scratchpad unfold, or keep it hidden and just receive the final response. The mechanism isn't new, since chain-of-thought prompting has existed for years, but building it into the model itself rather than leaving it to the prompt represents a different engineering choice. The system now treats thinking as a first-class computational phase rather than a prompt-level hack.
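For readers calling the API rather than using the chat interface, the show-or-hide choice comes down to which parts of the response get surfaced. A minimal sketch, assuming the response arrives as a list of content blocks tagged `thinking` or `text` (the shape Anthropic's Messages API documents for this feature). The helper name `split_response` and the sample data are hypothetical, and no network call is made:

```python
from __future__ import annotations

from typing import Iterable


def split_response(blocks: Iterable[dict]) -> tuple[str, str]:
    """Separate scratchpad text from the final answer.

    Assumes each block is a dict with a "type" key: "thinking" blocks carry
    their text under "thinking", "text" blocks under "text". Any other
    block type is ignored.
    """
    thinking_parts: list[str] = []
    answer_parts: list[str] = []
    for block in blocks:
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            answer_parts.append(block.get("text", ""))
    return "\n".join(thinking_parts), "".join(answer_parts)


# Hypothetical response content, invented for illustration.
sample = [
    {"type": "thinking", "thinking": "17 * 3 = 51, then add 4."},
    {"type": "text", "text": "The result is 55."},
]

scratchpad, answer = split_response(sample)
show_thinking = False  # flip to True to display the scratchpad
if show_thinking:
    print(scratchpad)
print(answer)
```

Keeping the split in one place means the visibility decision stays a display concern: the scratchpad is always available for logging or debugging even when the user never sees it.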
The pricing structure reflects this architecture. Thinking tokens accumulate separately from output tokens, meaning API customers face a fresh cost dimension. A query that triggers extensive reasoning before a terse answer suddenly carries overhead that a traditional query wouldn't. For Anthropic's business model, this opens a new lever. For users, it introduces friction and calculation into every API call.
The Numbers Behind the Pause
Benchmarks show Extended Thinking generating anywhere from hundreds to several thousand tokens of internal monologue before the model commits to a final answer. On standardized reasoning tasks, the performance lift ranges between 5% and 15%, depending on domain. Math problems see the bump more reliably than general Q&A. Code generation benefits. Complex multi-step reasoning shows measurable gains.
The trade-off is latency. Response times roughly double or triple on queries that trigger heavy thinking. For a user waiting on a chatbot response, the delay is noticeable. For an API integration working within a tight latency budget, it can be a real problem.
Token economics compound the issue. While Anthropic priced thinking tokens lower than standard output tokens—roughly a third of the cost—the volume adds up. A query generating 5,000 thinking tokens before 500 output tokens skews the bill toward the reasoning phase. Early adopters running benchmarks report 20% to 40% cost increases on workloads that activate Extended Thinking, depending on the task mix.
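The arithmetic behind that skew is easy to check. A minimal sketch using the figures quoted above (5,000 thinking tokens, 500 output tokens, thinking priced at one third of output). The output-token price is normalized to 1.0 and the function name is hypothetical, so the result is a share of the generated-token cost, not a dollar amount, and input-token cost is deliberately left out:

```python
def thinking_share(thinking_tokens: int, output_tokens: int,
                   thinking_price_ratio: float) -> float:
    """Fraction of generated-token cost attributable to thinking.

    thinking_price_ratio is the price of one thinking token relative to
    one output token (output price normalized to 1.0).
    """
    thinking_cost = thinking_tokens * thinking_price_ratio
    output_cost = output_tokens * 1.0
    total = thinking_cost + output_cost
    if total == 0:
        return 0.0  # nothing was generated, so there is nothing to apportion
    return thinking_cost / total


share = thinking_share(5_000, 500, thinking_price_ratio=1 / 3)
print(f"thinking share of generated-token cost: {share:.0%}")  # prints 77%
```

Under these assumptions, roughly three quarters of the generated-token cost comes from reasoning even at the discounted rate. How much that moves the total bill depends on input tokens, which this sketch ignores.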
What the Visible Thinking Reveals (and Hides)
Users who enable thinking output encounter Claude's reasoning process: false starts, self-corrections, exploratory tangents. It reads like a human scratching out a solution on paper. Verbose. Repetitive. Sometimes contradictory.
That contradiction matters. Researchers examining the thinking output have found instances where the model's internal monologue contains errors or reaches wrong conclusions, yet the final answer comes out correct. How? Either the thinking phase corrects course without explicitly saying so, or the final generation step overrides the flawed reasoning with something better. Anthropic hasn't fully explained which, and the mechanism remains opaque.
"The thinking output satisfies curiosity," says Dr. Priya Sharma, AI research lead at the Institute for Computational Reasoning, "but it doesn't meaningfully explain why improvements occur. We're seeing the scratchpad, not understanding the causality."
The verbosity raises another question: is this genuine reasoning or sophisticated padding designed to improve downstream performance? Chain-of-thought prompting works partly because more tokens allow more opportunity for the model to course-correct. Extended Thinking might be doing the same thing at scale. The visible thinking could be less "here's how I think" and more "here's how I'm buying time to think better."
Why This Matters (and Doesn't)
For high-stakes domains, the accuracy gains justify the cost and latency trade-off. Research workflows, code generation, mathematical problem-solving—these benefit from the extra reasoning cycles. A researcher fact-checking a literature review or a developer debugging a complex algorithm can absorb the latency cost.
For routine tasks, Extended Thinking becomes deadweight. Summarizing an article. Drafting an email. Answering an FAQ. These don't require extended deliberation. Activating the feature wastes API spend and customer time without meaningfully improving results.
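One way to act on this split is to decide per request whether to turn the feature on. A minimal sketch, assuming a hypothetical application that tags each request with a task type. The category names, the token budget, and the function name are illustrative, and the returned dict mirrors the shape of the `thinking` parameter in Anthropic's Messages API as an assumption rather than a verified integration:

```python
from typing import Optional

# Hypothetical task categories, following the examples in the text.
REASONING_HEAVY = {"math", "code_debugging", "research_review"}


def thinking_config(task_type: str, budget_tokens: int = 4_000) -> Optional[dict]:
    """Return a thinking configuration for reasoning-heavy tasks, else None.

    None means the caller should send the request without extended
    thinking. Unknown task types fall through to None, so the feature
    is off by default.
    """
    if task_type in REASONING_HEAVY:
        return {"type": "enabled", "budget_tokens": budget_tokens}
    return None


for task in ("math", "email_draft", "faq"):
    print(task, "->", thinking_config(task))
```

Defaulting to off keeps routine traffic cheap and fast; the allow-list is the only place that needs to change as a team learns which workloads actually benefit.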
The feature edges closer to interpretability by exposing reasoning, but remains fundamentally limited. "We can see the thinking, but interpreting it is another matter," notes Marcus Chen, head of AI governance at Veritas Analytics. "The thinking text itself is written by a model optimized for next-token prediction, not clarity. It's a black box describing another black box."
Competitors aren't sleeping. OpenAI released o1, implementing similar reasoning capabilities. Google is advancing Gemini's reasoning modes. The feature is becoming table stakes rather than differentiation. Within months, most major models will offer variants of this approach.
The Hype-to-Reality Ratio
Anthropic positioned Extended Thinking as a major breakthrough, with early coverage suggesting glimpses of AGI-adjacent reasoning capabilities. The narrative was seductive: finally, AI systems that think before they speak.
The reality is messier. Performance gains are real but incremental. They're also task-specific. Extended Thinking won't solve problems the model fundamentally can't handle; it refines performance on problems the model nearly solves. The visible thinking satisfies curiosity but doesn't resolve the core interpretability problem.
Expect the feature to carve out a niche. Specialized workflows will adopt it. General-purpose applications will keep it off by default. The hype cycle will cool as users realize it's a useful tool rather than a paradigm shift.
Looking Ahead
Extended Thinking represents optimization within existing AI architectures rather than a conceptual leap. It's incremental improvement dressed in philosophical language. That doesn't make it worthless: the accuracy gains on hard problems are real, and the latency and cost trade-offs are legitimate engineering problems to manage. But it does mean the feature is smaller than the headlines suggested. Watch which use cases stick around in six months. That'll tell you what Extended Thinking actually is.