Claude opus 4.8 review: sharper at reasoning and code, softer on creativity and cost

Claude Opus 4.8 Review: Sharper Where It Shines, Duller Where It Struggles

Six weeks after the release of Claude Opus 4.7, Anthropic quietly pushed out an upgraded flagship: Claude Opus 4.8. On paper, the story looks simple-better benchmarks, stronger safety scores, and the same pricing at $5 per million input tokens and $25 per million output tokens. In practice, the picture is more nuanced.

We ran Claude Opus 4.8 through a consistent test suite covering creative writing, coding, math, logic and common sense, non-math reasoning, and long-context “needle-in-a-haystack” recall. We then compared its performance directly against Claude Opus 4.7 and several aggressively priced Chinese frontier models.

The pattern that emerged is strikingly consistent: Claude Opus 4.8 doubles down on the areas where Claude has always excelled-structured reasoning, code, and quantitative work-while taking a small but noticeable step backward in some of the softer, more intuitive tasks like narrative creativity and open-ended reasoning.

To make matters more interesting, 4.8 also appears more “expensive” in practice than its headline price suggests. In one of our longer tests, a single prompt consumed our entire token budget, despite the model keeping its official per-token rates unchanged.

Below is a breakdown by category.

Pricing and Token Usage: Same Sticker Price, Higher Real Cost

At a glance, Claude Opus 4.8 costs exactly what 4.7 did:
– $5 per million input tokens
– $25 per million output tokens

That suggests a drop‑in replacement. But when we started pushing 4.8 hard-especially on coding and long-context tasks-it became clear that the *effective* cost can be higher:

– The model tends to generate longer, more verbose answers by default, especially in technical domains.
– For a complex game-generation task, a single request burned through our full token quota. The final result was technically immaculate, but the prompt‑to‑output ratio was far worse than in 4.7 or rival models at similar temperature settings.
– Long-context reasoning now often triggers more exhaustive chain-of-thought style reasoning internally, which is good for accuracy, but bad for token conservation.

For enterprises operating on strict budgets or metered API access, this has real implications. Claude Opus 4.8 may not be more expensive per token, but it is often more expensive per completed task if you don’t explicitly tune it for brevity.

Creative Writing: More Accurate, Less Inspired

Claude has long been praised for its literary style and tone control. With 4.8, those strengths are still present, but they feel constrained.

What improved:
– Stronger adherence to instructions about form and constraints (e.g., word limits, explicit structural rules).
– Better factual grounding in creative nonfiction; 4.8 is more cautious about fabricating dates, places, and names when asked for realistic pieces.
– More consistent character voices across longer pieces, with fewer jarring shifts in tone.

What regressed:
– The “spark” of surprise-unexpected metaphors, unusual narrative angles, risky stylistic choices-showed up less often.
– When asked to write fiction with high emotional stakes, 4.8 often defaulted to safer, more generic phrasing, where 4.7 would occasionally overshoot but at least swing hard.
– Humor in particular felt flatter: punchlines were more literal, with fewer playful twists or subtle callbacks.

If you want clean, controlled prose that respects constraints, 4.8 is a net upgrade. If you’re hunting for wild, breakthrough-level creativity and stylistic experimentation, 4.7 and some competing models still feel more willing to take risks.

Coding: A Clear Step Forward

In coding tasks, Claude Opus 4.8 is simply better. This is where the model feels closest to an unambiguous improvement.

Observations:
Higher pass rate on multi-file tasks: When we asked it to scaffold small applications (with multiple modules, tests, and configuration), 4.8 maintained consistency across files more reliably than 4.7.
Improved tool awareness: It demonstrates better knowledge of up‑to‑date frameworks and build pipelines, and it’s more likely to produce working snippets that align with current ecosystem norms.
Debugging finesse: When given broken code and error logs, 4.8 typically required fewer back‑and‑forth clarifications to arrive at a working solution. It also explains the root cause of bugs in crisper, more concrete language.

The main tradeoff is verbosity:
– Explanations are longer, with more step‑by‑step commentary.
– Code examples can be padded with boilerplate that isn’t strictly necessary.

For serious developers, that’s a minor annoyance compared to the gains in reliability. For lightweight coding assistance, you may want to explicitly ask for minimal code and terse explanations to keep token usage under control.

Math: A Calculated Upgrade

Math is one of the most obvious success stories for 4.8. Anthropic’s claims about improved quantitative reasoning are borne out in practice.

In our testing:
Multi-step algebra and calculus problems were solved more consistently, with fewer off‑by‑one mistakes and fewer hallucinated simplifications.
– The model followed formal reasoning chains more faithfully when explicitly asked to show its work, and its steps rarely contradicted earlier conclusions.
– It handled word problems better-translating natural language descriptions into equations with fewer misinterpretations of quantities or conditions.

That said, the improvement comes with a familiar caveat:
– When 4.8 is wrong, it is often *confidently* wrong, especially on edge‑case or deliberately adversarial problems.
– The internal chain-of-thought tends to be long and detailed, which can inflate token costs if you routinely ask it to elaborate all steps.

For academic use, technical documentation, or engineering support, 4.8’s math performance is one of the clearest reasons to upgrade from 4.7.

Logic and Common Sense: More Rigid, Mostly Better

Logical puzzles and structured reasoning are areas where Claude historically performed well. Version 4.8 continues that trend, but with a slight shift in personality.

Strengths:
– Improved consistency on classic logic puzzles (truth-tellers/liars, scheduling constraints, set intersections).
– Stronger protection against being “talked into” contradictions; 4.8 resists bad user logic better than 4.7.
– Better at tracking entities and conditions over multiple steps in a conversation without losing track of earlier constraints.

Weaknesses:
– A tendency toward over-explaining and over-formalizing. Even simple problems can trigger multi-paragraph rationales.
– Slightly reduced flexibility when the puzzle is loosely specified or intentionally ambiguous; the model prefers to pin down precise assumptions rather than play with multiple possible interpretations.

In effect, logic and common sense have become more reliable but less conversational. For rigorous analytical work, this is beneficial. For casual brainstorming or puzzle-like storytelling, it can feel stiff.

Non-Math Reasoning: More Guardrails, Less Intuition

Non-math reasoning-interpretation of scenarios, reading between the lines, and making judgment calls in fuzzy situations-is where Claude Opus 4.8 feels less comfortable.

In comparison to 4.7:
– 4.8 is more likely to hedge, qualify, or refuse to extrapolate from incomplete information, even when the user explicitly asks for best‑guess reasoning.
– Its answers often emphasize uncertainty and risk, which is useful in high-stakes domains but can slow down fast, exploratory ideation.
– It sometimes struggles to pick the *most likely* interpretation when several are plausible, opting instead to list them all without clearly ranking them.

This appears to be a side-effect of improved safety and reduced hallucination risk. The model is more conservative in areas that require subjective judgment. Businesses that value cautious, defensible reasoning may appreciate this; creative users looking for decisive speculation may not.

Long-Context and “Needle in a Haystack” Recall

One of the more practical tests we ran involved feeding Claude Opus 4.8 a very long document-tens of thousands of tokens-and asking for specific details buried inside it.

Compared to 4.7:
– 4.8 was slightly better at retrieving precise, low-salience facts (names, small numeric details, subordinate clauses in legal-style text).
– It was more likely to provide citations to the relevant section or paraphrase the surrounding context accurately.
– However, this came with a steep token cost: the model often summarized large swaths of text to “think out loud,” even if we only asked a targeted question.

In shorter or mid‑range contexts, performance is excellent. In very long contexts, 4.8’s improved thoroughness can become a liability from a cost perspective unless you constrain response length and explicitly request concise answers.

Safety and Alignment: Noticeably Stricter

Anthropic has emphasized safety since the beginning, and the 4.8 release leans even further into this identity.

We observed:
– More frequent and more detailed refusals on content deemed risky, even at the edges of policy boundaries.
– Stronger self-correction when prompted with leading or manipulative questions designed to produce biased or harmful outputs.
– More robust contextual understanding of sensitive scenarios; the model recognizes not just explicit requests for disallowed content but also subtler attempts to circumvent guidelines.

For some users, this will be a welcome evolution. For others, it might feel constraining-particularly in gray‑area discussions that require frank descriptions of harmful behavior for educational or analytical purposes. In many of those cases, 4.8 will still engage, but with more caveats and more carefully framed language.

Where Claude Opus 4.8 Fits in the Competitive Landscape

The broader AI market is defined by rapid iteration and aggressive pricing, especially from Chinese labs. Many of these competing models undercut Claude on raw cost while promising comparable performance.

Against that backdrop, Claude Opus 4.8 positions itself as:
– A premium, safety-focused model for enterprises that value reliability, controllability, and regulatory defensibility.
– A top-tier option for coding, math, and formal reasoning, where its gains are most obvious.
– A somewhat more conservative and less risk‑taking partner in creative and open-ended reasoning tasks.

Users choosing between Claude and cheaper alternatives need to weigh whether the incremental gains in precision and safety are worth the increased effective token consumption and more guarded personality.

Practical Tips for Getting the Best Out of 4.8

If you decide to adopt Claude Opus 4.8, a few strategies can help mitigate its downsides and highlight its strengths:

1. Control verbosity explicitly.
Ask for strict word or token limits in technical outputs. Specify “minimal explanations” or “only the final answer” when appropriate.

2. Separate creative and analytical phases.
Use one prompt to generate wild, unconstrained ideas (possibly at a higher temperature), then a second to let 4.8 analyze, refine, and polish. This can recapture some of the creativity it tends to suppress when doing both tasks at once.

3. Leverage it for structured work.
For documentation, technical specs, legal-ish summaries, or complex planning, 4.8’s caution and over-explaining become advantages, not weaknesses.

4. Be explicit about acceptable uncertainty.
If you want best-effort guesses, say so. For example: “It’s okay to speculate; please pick the single most likely explanation, even if you are not certain.”

5. Monitor cost at the prompt design level.
Long prompts, multi-turn reasoning, and “show your work” style instructions compound tokens quickly. Design thinner, more targeted prompts where possible.

Verdict: A Model That Doubles Down on Its Core Identity

Claude Opus 4.8 is not a radical reinvention of Anthropic’s flagship; it is a sharpening of its most distinctive traits and a tightening of its safety net.

Better at what it was already good at:
Coding, math, structured logic, and high-stakes reasoning are all clear winners in 4.8.

Worse-or at least less fun-where it was weaker:
Open-ended creativity, intuitive narrative reasoning, and bold speculation feel more constrained and slightly less inspired.

For teams and individuals who rely on Claude as a dependable, precise, and defensible tool, Opus 4.8 is a meaningful upgrade. For users who loved Claude primarily as a creative collaborator or a freewheeling brainstorm partner, the tradeoffs are less obviously in their favor.

In short, Claude Opus 4.8 is a stronger specialist in its core competencies and a slightly weaker generalist outside them. Whether that’s a net positive depends on what you actually ask your AI to do.