13 min readfrom VentureBeat

Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back

Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back

A growing number of developers and AI power users are taking to social media to accuse Anthropic of degrading the performance of Claude Opus 4.6 and Claude Code — intentionally or as an outcome of compute limits — arguing that the company’s flagship coding model feels less capable, less reliable and more wasteful with tokens than it did just weeks ago.

The complaints have spread quickly on Github, X and Reddit over the past several weeks, with several high-reach posts alleging that Claude has become worse at sustained reasoning, more likely to abandon tasks midway through, and more prone to hallucinations or contradictions.

Some users have framed the issue as “AI shrinkflation” — the idea that customers are paying the same price for a weaker product.

Others have gone further, suggesting Anthropic may be throttling or otherwise tuning Claude downward during periods of heavy demand.

Those claims remain unproven, and Anthropic employees have publicly denied that the company degrades models to manage capacity. At the same time, Anthropic has acknowledged real changes to usage limits and reasoning defaults in recent weeks, which has made the broader debate more combustible.

VentureBeat has reached out to Anthropic for further clarification on the recent accusations, including whether any recent changes to reasoning defaults, context handling, throttling behavior, inference parameters or benchmark methodology could help explain the spike in complaints.

We have also asked how Anthropic explains the recent benchmark-related claims and whether it plans to publish additional data that could reassure customers. As of publication time, we are awaiting a response.

Viral user complaints, including from an AMD Senior Director, argue Claude has become less capable

One of the most detailed public complaints originated as a GitHub issue filed by Stella Laurenzo on April 2, 2026, whose LinkedIn profile identifies her as Senior Director in AMD’s AI group.

In that post, Laurenzo wrote that Claude Code had regressed to the point that it could not be trusted for complex engineering work, then backed that claim with a sprawling analysis of 6,852 Claude Code session files, 17,871 thinking blocks and 234,760 tool calls.

The complaint argued that, starting in February, Claude’s estimated reasoning depth fell sharply while signs of poorer performance rose alongside it, including more premature stopping, more “simplest fix” behavior, more reasoning loops, and a measurable shift from research-first behavior to edit-first behavior.

The post’s broader point was that for advanced engineering workflows, extended reasoning is not a luxury but part of what makes the model usable in the first place.

That GitHub thread then escaped into the broader social media conversation, with X users including @Hesamation, who posted screenshots of Laurenzo's GitHub post to X on April 11, turning it into an even more viral talking point.

That amplification mattered because it gave the wider “Claude is getting worse” narrative something more concrete than anecdotal frustration: a long, data-heavy post from a senior AI leader at a major chip company arguing that the regression was visible in logs, tool-use patterns and user corrections, not just gut feeling.

Anthropic’s public response focused on separating perceived changes from actual model degradation. In a pinned follow-up on the same GitHub issue posted a week ago, Claude Code lead Boris Cherny thanked Laurenzo for the care and depth of the analysis but disputed its main conclusion.

Cherny said the “redact-thinking-2026-02-12” header cited in the complaint is a UI-only change that hides thinking from the interface and reduces latency, but “does not impact thinking itself,” “thinking budgets,” or how extended reasoning works under the hood.

He also said two other product changes likely affected what users were seeing: Opus 4.6’s move to adaptive thinking by default on Feb. 9, and a March 3 shift to medium effort, or effort level 85, as the default for Opus 4.6, which he said Anthropic viewed as the best balance across intelligence, latency and cost for most users.

Cherny added that users who want more extended reasoning can manually switch effort higher by typing /effort high in Claude Code terminal sessions.

That exchange gets at the core of the controversy. Critics like Laurenzo argue that Claude’s behavior in demanding coding workflows has plainly worsened and point to logs and usage patterns as evidence.

Anthropic, by contrast, is not saying nothing changed. It is saying the biggest recent changes were product and interface choices that affect what users see and how much effort the system expends by default, not a secret downgrade of the underlying model. That distinction may be technically important, but for power users who feel the product is delivering worse results, it is not necessarily a satisfying one.

External coverage from TechRadar and PC Gamer further amplified Laurenzo's post and larger wave of agreement from some power users.

Another viral post on X from developer Om Patel on April 7 made the same argument in even more direct terms, claiming that someone had “actually measured” how much “dumber” Claude had gotten and summarizing the result as a 67% drop.

That post helped popularize the “AI shrinkflation” label and pushed the controversy beyond hard-core Claude Code users into the broader AI discourse on X.

These claims have resonated because they map closely onto what many frustrated users say they are seeing in practice: more unfinished tasks, more backtracking, more token burn and a stronger sense that Claude is less willing to reason deeply through complicated coding jobs than it was earlier this year.

Benchmark posts turned anecdotal frustration into a public controversy

The loudest benchmark-based claim came from BridgeMind, which runs the BridgeBench hallucination benchmark. On April 12, the account posted that Claude Opus 4.6 had fallen from 83.3% accuracy and a No. 2 ranking in an earlier result to 68.3% accuracy and No. 10 in a new retest, calling that proof that “Claude Opus 4.6 is nerfed.”

That post spread widely and became one of the main anchors for the broader public case that Anthropic had degraded the model.

Other users also circulated benchmark-related or test-based posts suggesting that Opus 4.6 was underperforming versus Opus 4.5 in practical coding tasks.

Still other posts pointed to TerminalBench-related results as supposed evidence that the model’s behavior had changed in certain harnesses or product contexts.

The effect was cumulative: benchmark screenshots, side-by-side tests and anecdotal frustration all began reinforcing one another in public.

That matters because benchmark claims tend to travel farther than more subjective complaints. A developer saying a model “feels worse” is one thing. A screenshot showing a ranking drop from No. 2 to No. 10, or a dramatic percentage swing in accuracy, gives the appearance of hard proof, even when the underlying comparison may be more complicated.

Critics of the benchmark claims say the evidence is weaker than it looks

The most important rebuttal to the BridgeBench claim did not come from Anthropic. It came from Paul Calcraft, an outside software and AI researcher on X, who argued that the viral comparison was misleading because the earlier Opus 4.6 result was based on only six tasks while the later one was based on 30.

In his words, it was a “DIFFERENT BENCHMARK.” He also said that on the six tasks the two runs shared in common, Claude’s score moved only modestly, from 87.6% previously to 85.4% in the later run, and that the bigger swing appeared to come mostly from a single fabrication result without repeats. He characterized that as something that could easily fall within ordinary statistical noise.

That outside rebuttal matters because it undercuts one of the cleanest and most viral claims in circulation. It does not prove users are wrong to think something has changed. But it does suggest that at least some of the benchmark evidence now driving the story may be overstated, poorly normalized or not directly comparable.

Even the BridgeBench post itself drew a community note to similar effect. The note said the two benchmark runs covered different scopes — six tasks in one case and 30 in the other — and that the common-task subset showed only a minor change. That does not make the later result meaningless, but it weakens the strongest version of the “BridgeBench proved it” argument.

This is now a key feature of the controversy: the claims are not all equally strong. Some are grounded in first-hand user experience. Some point to real product changes. Some rely on benchmark comparisons that may not be apples-to-apples. And some depend on inferences about hidden system behavior that users outside Anthropic cannot directly verify.

Earlier capacity limits gave users a reason to suspect more changes under the hood

The current backlash also lands in the shadow of a real, confirmed Anthropic policy change from late March. On March 26, Anthropic technical staffer Thariq Shihipar posted that, “To manage growing demand for Claude,” the company was adjusting how 5-hour session limits work for Free, Pro and Max subscribers during peak hours, while keeping weekly limits unchanged.

He added that during weekdays from 5 a.m. to 11 a.m. Pacific time, users would move through their 5-hour session limits faster than before. In follow-up posts, he said Anthropic had landed efficiency wins to offset some of the impact, but that roughly 7% of users would hit session limits they would not have hit before, particularly on Pro tiers.

In an email on March 27, 2026, Anthropic told VentureBeat that Team and Enterprise customers were not affected by those changes, and that the shift was not dynamically optimized per user but instead applied to the peak-hour window the company had publicly described. Anthropic also said it was continuing to invest in scaling capacity.

Those comments were about session limits, not model downgrades. But they are important context, because they establish two things that users now keep connecting in public: first, Anthropic has been dealing with surging demand; second, it has already changed how usage is rationed during busy periods. That does not prove Anthropic reduced model quality. It does help explain why so many users are primed to believe something else may also have changed.

Prompt caching and TTL

A separate, more recent GitHub issue broadens the dispute beyond model quality and into pricing and quota behavior. In issue #46829, user seanGSISG argued that Claude Code’s prompt-cache time-to-live, or TTL, appeared to shift from a one-hour setting back to a five-minute setting in early March, based on analysis of nearly 120,000 API calls drawn from Claude Code session logs across two machines.

The complaint argues that this change drove meaningful increases in cache-creation costs and quota burn, especially for long-running coding sessions where cached context expires quickly and must be rebuilt. The author claims that this helps explain why some subscription users began hitting usage limits they had not previously encountered.

What makes this issue notable is that Anthropic did not flatly deny that something changed. In a reply on the thread, Jarred Sumner said the March 6 change was real and intentional, but rejected the framing that it was a regression. He said Claude Code uses different cache durations for different request types, and that one-hour cache is not always cheaper because one-hour writes cost more up front and only save money when the same cached context is reused enough times to justify it.

In his telling, the change was part of ongoing cache optimization work, not a silent downgrade, and the pre–March 6 behavior described in the issue “wasn’t the intended steady state.”

The thread later drew a more detailed response from Anthropic’s Cherny, who described one-hour caching as “nuanced” and said the company has been testing heuristics to improve cache hit rates, token usage and latency for subscribers. Cherny said Anthropic keeps five-minute cache for many queries, including subagents that are rarely resumed, and said turning off telemetry also disables experiment gates, which can cause Claude Code to fall back to a five-minute default in some cases.

He added that Anthropic plans to expose environment variables that let users force one-hour or five-minute cache behavior directly. Together, those replies do not validate the issue author’s claim that Anthropic silently made Claude Code more expensive overall, but they do confirm that Anthropic has been actively experimenting with cache behavior behind the scenes during the same period users began complaining more loudly about quota burn and changing product behavior.

Anthropic says user-facing changes, not secret degradation, explain much of the uproar

Anthropic-affiliated employees have publicly pushed back on the broadest accusations. In one widely circulated reply on X, Cherny responded to claims that Anthropic had secretly nerfed Claude Code by writing, “This is false.”

He said Claude Code had been defaulted to medium effort in response to user feedback that Claude was consuming too many tokens, and that the change had been disclosed both in the changelog and in a dialog shown to users when they opened Claude Code.

That response is notable because it concedes a meaningful product change while rejecting the more conspiratorial interpretation of it. Anthropic is not saying nothing changed. It is saying that what changed was disclosed and was aimed at balancing token use, not secretly reducing model quality.

Public documentation also supports the fact that effort defaults have been in motion. Claude Code’s changelog says that on April 7, Anthropic changed the default effort level from medium to high for API-key users as well as Bedrock, Vertex, Foundry, Team and Enterprise users.

That suggests Anthropic has actively been tuning these settings across different segments, which could plausibly affect user perceptions even if the core model weights are unchanged.

Shihipar has also directly denied the broader demand-management accusation. In a reply on X posted April 11, he said Anthropic does not “degrade” its models to better serve demand. He also said that changes to thinking summaries affected how some users were measuring Claude’s “thinking,” and that the company had not found evidence backing the strongest qualitative claims now spreading online.

The real issue may be trust as much as model quality

What is clear is that a trust gap has opened between Anthropic and some of its most demanding users.

For developers who rely on Claude Code all day, subtle shifts in visible thinking output, effort defaults, token burn, latency tradeoffs or usage caps can feel indistinguishable from a weaker model.

That is true whether the root cause is a product setting, a UI change, an inference-policy tweak, capacity pressure or a genuine quality regression.

It also means both sides of the fight may be talking past each other. Users are describing what they experience: more friction, more failures and less confidence. Anthropic is responding in product terms: effort defaults, hidden thinking summaries, changelog disclosures, and denials that demand pressure is causing secret model degradation.

Those are not necessarily incompatible descriptions. A model can feel worse to users even if the company believes it has not “nerfed” the underlying model in the way critics allege. But coming at a time when Anthropic's chief rival OpenAI has recently pivoted and put more resources behind its competing, enterprise and vibe-coding focused product Codex — even offering a new, more mid-range ChatGPT subscription in an effort to boost usage of the tool — it's certainly not the kind of publicity that stands to benefit Anthropic or its customer retention.

At the same time, the public evidence remains mixed. Some of the most viral claims have come from developers with detailed logs and strong opinions based on repeated use. Some of the benchmark evidence has been challenged by outside observers on methodological grounds. And Anthropic’s own recent changes to limits and settings ensure that this debate is happening against a backdrop of real adjustments, not pure rumor.

Want to read more?

Check out the full article on the original site

View original article

Tagged with

#generative AI for data analysis
#Excel alternatives for data analysis
#no-code spreadsheet solutions
#natural language processing for spreadsheets
#real-time data collaboration
#real-time collaboration
#rows.com
#financial modeling with spreadsheets
#cloud-based spreadsheet applications
#enterprise data management
#enterprise-level spreadsheet solutions
#conversational data analysis
#data analysis tools
#big data performance
#big data management in spreadsheets
#automation in spreadsheet workflows
#intelligent data visualization
#data visualization tools
#spreadsheet API integration
#data cleaning solutions
Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back