Reasoning Drift in AI: Why a Commercial Model Can Silently Get Worse, and What to Do About It
In recent weeks, a heated discussion has unfolded on Reddit around Claude. The complaints vary widely: from sluggishness and declining code quality to loss of coherence, half-baked answers, and odd refusals on perfectly ordinary tasks. On the surface, the service keeps working: responses come through, the subscription is active, and the interface is up.
The feeling is personally familiar to me. Over the past month or two, I’ve had a similar impression even while working with a Max subscription. Of course, a handful of sessions can’t serve as proof of actual regression, but the sense is that the model now gives up more quickly, holds context less reliably, and increasingly produces incomplete solutions.
Naturally, these discussions quickly attract the usual counterarguments. Some believe the quality decline is merely perceived; others point out that the prompts themselves have grown more complex over time. And that’s a fair caveat: even a massive wave of complaints doesn’t necessarily mean they all describe the same technical issue. Real signals, individual circumstances, and general frustration are usually mixed together.
But what interests me in this story is no longer just the question “has Claude actually degraded?” It’s a different one: are there any guarantees protecting users when a commercial AI model continues to function on paper, but the quality of its output quietly shifts for the worse? After all, some companies already rely on models like these in business-critical workflows.
My preliminary conclusion is this: such protection does exist, but it is fairly limited. It is fragmented, covers only certain aspects of the problem, and often demands significant effort from the customer: introducing their own quality control procedures, building out infrastructure, or actively engaging with the provider.
Why I Wanted to Dig Deeper
If it were just my impressions and Reddit threads, I probably wouldn’t have investigated further. But then a detailed GitHub issue appeared for Claude Code, where the author described quality degradation across a large number of work sessions. Several publications linked this issue to Stella Laurenzo, a senior manager in AMD’s AI group. The mention of her name alone doesn’t prove anything, but it looked like a more serious external signal, and it made the problem worth analyzing as an engineering question rather than a one-off complaint.
Why “the Model Got Worse” Is Not a Single Phenomenon
The phrase “the model got worse” captures the emotion well but does little to clarify the situation. In practice, it often conflates at least three distinct scenarios:
- A genuine decline in solution quality - the model actually handles tasks worse than before.
- A change in response format - the model responds more briefly, dryly, or simply differently, even though its effectiveness may have stayed the same.
- Increased unpredictability - the average level seems roughly the same, but results vary more widely from one request to the next.
It’s more useful to think of this as a broader class of changes invisible to the user: the service keeps running, but the usefulness of the output gradually shifts.
In this article, I’m primarily interested in the first scenario - cases where the quality of the actual solution degrades, not just the surface-level style of the response.
Where this manifests most strongly - on long, multi-step, context-intensive tasks - I’ll use the term “reasoning drift” going forward: a situation where the model progressively loses its grip on the solution process. It veers off plan, stops honoring constraints, loses the thread of a long session, and increasingly produces answers that sound convincing but don’t actually complete the task.
I should note that reasoning drift, in this context, is not an established industry term. The expression does appear in research papers, but usually in a narrower sense: as a breakdown within the reasoning chain itself. I’m using it more broadly - as a user-observable decline in multi-step reasoning quality from a commercial model.
Alongside reasoning drift, there’s another problem that feels similar but has a different root cause: safety drift. In this case, the model doesn’t reason worse so much as it over-hedges, refusing even in perfectly acceptable scenarios.
One more caveat: going forward, I’m not including safety drift or simple changes in response style under reasoning drift. They may look similar on the surface, but analytically they are different cases.
For the user, all of this is indeed easy to confuse. A short answer, a collapsed plan, a truncated solution, or a sudden refusal can all look the same: “the model got weaker.” Without systematic testing, it’s hard to tell what you’re actually dealing with - a real quality drop, a format change, increased variance, or a shift toward excessive caution.
Half-Answers: How the Problem Looks in Practice
One of the most frustrating symptoms here is the half-answer: a response that looks coherent on the surface but fundamentally doesn’t solve the task at hand:
- The model proposes an approach but doesn’t follow through to a working result;
- It addresses only one aspect of a complex question;
- It falls back on generalities where substantive analysis is needed;
- It starts writing code but breaks off well before the function is complete.
Here’s what this looks like in real discussions. Not every such case should be considered “pure” reasoning drift: often it’s a mixed picture where a reasoning failure intertwines with general quality instability or excessive caution.
Code cuts off mid-stream. In the Claude Performance Discussion thread, users frequently report cases where the AI stops responding right in the middle of a function or just before a final fix. The response technically exists, but the rest of the work still has to be done by hand.
The model forgets its own plan. In the thread “Claude ignores its own plans, memory, and guardrails,” the author describes this pattern: Claude creates a detailed plan, agrees to follow the project rules, and then a short while later skips key steps and leaves the task half-done.
A simple task stretches out dramatically. In one recent post, a user writes that a routine frontend form fix that used to take a few minutes stretched to twenty. The model closed 2 out of 7 items, started hallucinating, and the text quality was questionable in places.
Refusal instead of an answer. In performance-related threads, users complain about another type of half-answer: instead of making progress on a routine task, Claude suddenly refuses to continue or switches to lengthy warnings. On the surface, it looks the same as other half-answers, but some of these cases belong not to reasoning drift but to a different phenomenon - safety drift, which I’ll cover below.
As a result, standard monitoring metrics like response latency, error rates, or uptime turn out to be insufficient. They’re good at catching service outages and some infrastructure issues, but on their own they do a poor job of detecting the situation where the service is functional yet the quality of output is slipping on critical tasks.
What We Actually Buy When We Pay for AI
The expectation is straightforward: since I’m paying for access to a powerful model, its performance should remain consistent. But this expectation diverges from how AI services actually work.
In essence, a subscription is not access to a fixed version of an algorithm. It’s access to an external service where numerous factors outside the user’s control are constantly changing:
- Request routing: how the system distributes your queries across servers.
- Compute resources: how much processing power is allocated to handle each request.
- Safety settings: what content is deemed suitable for output.
- Hidden context: a pre-set collection of instructions that shape the model’s behavior before the conversation even begins.
- Handling large volumes of text: how the system manages long messages.
- Default generation parameters: such as response variability (“temperature”) and effort level.
- Performance under heavy load: how well the platform handles peak activity spikes.
The practical takeaway: a model with the same name is not necessarily the same system in terms of actual behavior. The user often doesn’t know - and can’t find out - whether they’re getting the same configuration, the same Opus, the same effort level, the same reasoning budget as a week ago.
The Improvement Paradox
There’s a non-obvious wrinkle here. A model provider can ship an update that simultaneously:
- improves the model for the majority of users;
- speeds up responses;
- reduces processing costs;
- makes behavior safer;
- and degrades quality on a narrow but critical class of tasks for a specific customer.
This shouldn’t be surprising: the provider is simultaneously optimizing for speed, cost, safety, and average quality across mass-market use cases. In such a system, improvement on some metrics can easily come with regression on a narrow but important class of tasks. This is precisely why the statement “on average, the model got better” can be useless for a particular team if their specific workflow is the one that suffered.
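A stylized illustration with invented numbers: suppose an update lifts the pass rate from 80% to 85% on the 95% of traffic made up of typical requests, while a narrow segment - say the 5% of requests involving long, multi-step refactoring - drops from 90% to 60%. The aggregate score still improves (roughly 80.5% to 83.8%), the provider’s dashboards look green, and yet the team living in that narrow segment sees a sharp regression.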
The result is a paradox: there may be no global decline in model quality, yet for a specific user the picture looks very different - their workflow is already falling apart, and that’s a very real problem.
Why This Hits B2B Especially Hard
For an individual user, a drop in model performance is an annoying inconvenience: you can switch to another tool. But for a business that has embedded AI into its operations, this can become an operational issue.
Here I’m describing a typical scenario rather than a proven universal case. Suppose a team has an analyst or developer who regularly relies on AI for long, time-intensive tasks. As long as the model performs consistently, they handle their workload fine. But if solution quality starts to slip, what grows is not just the number of retries but also the volume of manual review, rework, and backtracking to previously agreed-upon steps. The problem is that this rarely looks like an obvious breakdown: the slowdown is easy to chalk up to task complexity, unclear requirements, or just a bad day.
In recent discussions, there’s a telling example: the author writes that they spent over 50,000 tokens just trying to get the model back on track with an already-agreed plan. Degradation in reasoning quality can manifest not as an obvious failure but as a hidden increase in costs and project timelines, despite the service technically working fine.
What the Market Promises vs. What It Doesn’t
I reviewed the public documentation from OpenAI, Anthropic, and Google, and I noticed a certain asymmetry.
What providers have reasonably well covered:
- Procedures for deprecating and sunsetting models;
- Clear versioning, snapshots, and alias management;
- Release lifecycle stages and detailed changelogs;
- Service level agreements (SLAs);
- Customer support and migration guidance.
However, there are virtually no rules governing the following:
- Quality stability on customer-critical use cases;
- Stability of reasoning quality;
- Obligation to notify customers about material quality changes that don’t cause API failures;
- Mechanisms for proving that hidden regressions exist;
- Customer compensation in situations where the interface is available but effectiveness drops below expectations.
One of the reasons is the lack of a standardized approach to measuring quality on real-world tasks, especially where long-form reasoning, answer completeness, and reliability on critical scenarios matter. The providers themselves must balance competing demands of quality, throughput, cost, and safety.
But for the user, the practical bottom line is the same: protection against hidden quality regression is weak.
An Indirect but Important Signal: Overall AI Transparency Is Declining
It’s worth mentioning something here that isn’t directly about reasoning quality degradation but goes a long way toward explaining why we shouldn’t expect transparency on this particular issue anytime soon.
In December 2025, Stanford HAI (Transparency in AI is on the Decline) and CRFM (The 2025 Foundation Model Transparency Index) published the latest Foundation Model Transparency Index. This index doesn’t evaluate how well users are informed about fluctuations in model performance. It assesses much more general aspects: availability of information about training data, compute volumes, potential risks, and post-release monitoring. In other words, things far more basic than “did you tell users that reasoning quality dropped?”
Even here, there’s a clear regression: the industry average fell from 58 points in 2024 to just 40 in 2025, returning almost to 2023 levels. Anthropic scored 31 out of 100, OpenAI scored 30, and Google came in at just 24.
The logic is simple: if companies are beginning to withhold basic information like training processes, risk assessments, and monitoring procedures, there’s little reason to expect they’ll voluntarily share details about subtle quality changes on specific types of queries. This is less about accusations and more about context: since providers aren’t meeting this need, the market is beginning to fill the gap on its own - through independent monitoring and evaluation platforms, which I’ll cover below.
When the Model Isn’t “Getting Dumber” but Playing It Too Safe
Some complaints about declining AI quality have nothing to do with a decline in the model’s cognitive capabilities. Sometimes an update to safety mechanisms causes the system to flag ordinary requests as potentially harmful more often. As a result, instead of solving the task, users receive warnings, truncated answers, or excessive hedging.
This is safety drift. Amazon Science’s work on FalseReject describes it systematically: safety mechanisms often become overly cautious and start refusing even safe requests.
To users, this effect feels identical to model degradation: the model becomes less useful. However, the causes are entirely different, and confusing the two makes diagnosis harder. It’s important to distinguish cases where the model has genuinely lost understanding from situations where excessive safety measures are getting in the way of productive work.
Four Layers That Already Exist in the Market
Pulling together the existing methods for protecting against AI model changes, we can identify four layers of safeguards. None of them solves the problem entirely, but together they cover different parts of it. The depth of protection depends on the scale of dependency: for some, pinning a version and running a dozen manual tests is enough; for others, a full eval pipeline with alerts is needed.
1. Version Control on the Provider Side
Pinned model versions, snapshots, release notes, deprecation timelines, migration guides. This is basic hygiene that reduces the chance of an unexpected change, but it doesn’t answer the key question: is the pinned model still performing well on the specific tasks that matter?
An important nuance: pinning the model name alone isn’t enough. If part of the problem lies in hidden default settings, effort level, or adaptive reasoning mode, a stricter approach is needed: you must pin the model version under the exact conditions in which its performance was validated.
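As a minimal sketch of what “pinning the conditions” can mean in practice - the snapshot name, the parameter set, and the fingerprint idea below are illustrative assumptions, not any provider’s actual API:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class PinnedModelConfig:
    """Everything the model's performance was validated under - not just its name."""
    model: str                   # a dated snapshot identifier, never a floating alias like "latest"
    temperature: float           # set explicitly so a change in provider defaults can't surprise you
    max_tokens: int
    reasoning_effort: str        # illustrative knob; whether such a setting exists depends on the provider
    system_prompt_sha256: str    # fingerprint of your own system prompt / hidden context

    def fingerprint(self) -> str:
        """A short, stable hash to log next to every request and every eval run."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

config = PinnedModelConfig(
    model="example-model-2025-05-01",   # hypothetical snapshot name
    temperature=0.2,
    max_tokens=4096,
    reasoning_effort="high",
    system_prompt_sha256=hashlib.sha256(b"...your system prompt...").hexdigest(),
)
print(config.fingerprint())  # store this alongside results so regressions can be traced to a configuration
```

If the fingerprint attached to last month’s good results differs from today’s, at least one of the validated conditions has changed - and that alone narrows the search considerably.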
2. Your Own Checks and Regression Testing
This is the most critical practical element in combating solution quality degradation. Rather than blindly trusting provider claims, you maintain your own set of tasks and regularly verify: has anything gotten worse where it really matters?
A minimal set looks roughly like this:
- Baseline reference tests run regularly to serve as a quality benchmark.
- Canary tests that detect early signs of quality problems.
- Challenging stress tests that probe the model’s edge-case capabilities.
- Safe control scenarios that are expected to stay stable across updates, so that any shift there signals an unintended change.
When assessing the model’s willingness to respond, it’s also important to check how often it refuses valid requests - otherwise a creeping increase in unwarranted refusals becomes a hidden threat.
For accurate analysis, it’s useful to split the test suite results into three groups: stable baseline tasks, tasks vulnerable to quality decline, and tasks that may actually improve after updates. Without this separation, the improvement paradox described above will distort your interpretation of the results.
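Below is a minimal sketch of such a suite, grouped by the four test types above and including the refusal-rate check. Everything in it is hypothetical: the tasks, the call_model stub, and the crude checks stand in for whatever actually matters in your own workflow.

```python
from statistics import mean

def call_model(prompt: str) -> str:
    """Stub - plug in your provider's SDK here."""
    raise NotImplementedError

# Each case: a prompt, a cheap programmatic check, and a group label.
CASES = [
    {"group": "baseline", "prompt": "Refactor this function to remove the global state: ...",
     "check": lambda out: "def " in out and "global " not in out},
    {"group": "canary", "prompt": "Apply all seven steps of the agreed migration plan to this module: ...",
     "check": lambda out: out.lower().count("step") >= 7},        # crude completeness proxy
    {"group": "stress", "prompt": "Long multi-file task with strict constraints: ...",
     "check": lambda out: len(out) > 2000 and "TODO" not in out},
    {"group": "control", "prompt": "Write a unit test for a pure add(a, b) function.",
     "check": lambda out: "assert" in out},
]

# Placeholders - extend with refusal patterns you actually observe.
REFUSAL_MARKERS = ("I can't help with", "I cannot assist")

def run_suite() -> dict:
    results: dict[str, list[float]] = {}
    refusals = 0
    for case in CASES:
        out = call_model(case["prompt"])
        if any(marker in out for marker in REFUSAL_MARKERS):
            refusals += 1
        results.setdefault(case["group"], []).append(1.0 if case["check"](out) else 0.0)
    report = {group: mean(scores) for group, scores in results.items()}
    report["refusal_rate"] = refusals / len(CASES)
    return report   # compare per group against a stored baseline, not just the overall average
```

Run it on a schedule and store the per-group reports, and the half-answers described earlier stop being anecdotes and become a trend line.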
3. LLM Monitoring and Evaluation Platforms
A distinct class of solutions for this problem has already taken shape in the market. Gartner identifies it as a standalone category: AI Evaluation and Observability Platforms. It’s important to note upfront: none of these platforms provide protection from degradation “out of the box” - they are infrastructure for detection, not a ready-made shield. But without such infrastructure, detection often remains more manual, more fragmented, and harder to scale.
Within this category, it’s important to distinguish two disciplines: observability and evaluation.
Observability answers the question “what is happening”: call tracing, latency, cost, errors, token usage. It’s the dashboard for your LLM application. It will show you that the model started responding more slowly or that error counts went up, but on its own it won’t show you that reasoning quality has dropped.
Evaluation answers the question “how well is the model handling the task”: reference datasets, automated scorers, human annotation, A/B experiments, regression testing. This is the component that detects reasoning regression.
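To make the distinction concrete, here is a deliberately simplified sketch - the field names are illustrative, not any platform’s schema. The same request produces two very different kinds of records:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceRecord:
    """Observability: what happened. None of these fields says whether the answer was any good."""
    latency_ms: int
    input_tokens: int
    output_tokens: int
    error: Optional[str]

@dataclass
class EvalRecord:
    """Evaluation: how well the task was handled, judged against a reference or a rubric."""
    task_id: str
    completed: bool        # did the output actually solve the task?
    completeness: float    # e.g. fraction of required items addressed, 0.0 to 1.0
    followed_plan: bool    # were the agreed constraints and the plan honored?

# A slow but correct answer and a fast half-answer can look identical as a TraceRecord
# and completely different as an EvalRecord - which is why both layers are needed.
```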
Among notable platforms, I’d highlight three characteristic types:
- Arize / Phoenix - an example of a stack where tracing, experimentation, and quality evaluation are well covered.
- LangSmith - a solid example for teams that care about tracing agentic workflows, evals, and human annotation.
- Braintrust - an example of an approach where quality evaluation is placed at the center and can directly gate releases through CI/CD.
In practice, the metrics worth tracking through these platforms are the share of successfully completed tasks, answer completeness, the frequency of manual corrections, and time to a usable result. It’s also worth setting up alerts - not just for errors, but for less obvious signals: increased variance, stress-test failures, and more frequent fallback switches.
All of these platforms require the same thing from a team: create reference datasets for your tasks, configure scorers, and define quality thresholds. You can have perfect tracing for every request and still lack a clear rule for “this is good / this is bad.”
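What a “this is good / this is bad” rule can look like in code - the thresholds, group names, and numbers below are invented for illustration, and the groups mirror the test suite from the previous section:

```python
# Illustrative thresholds; tune them to your own tasks and tolerance for risk.
THRESHOLDS = {
    "baseline": 0.90,   # reference tasks should stay essentially flat
    "canary":   0.80,   # early-warning tasks may fluctuate somewhat more
    "stress":   0.60,   # hard tasks are allowed lower absolute scores, but not a silent collapse
    "control":  0.95,   # control scenarios should almost never fail
}
MAX_DROP_VS_PREVIOUS = 0.05   # also flag a relative drop, even when the absolute floor is still met

def gate(current: dict, previous: dict) -> list[str]:
    """Return the reasons a release - or a provider-side change - should be flagged."""
    failures = []
    for group, floor in THRESHOLDS.items():
        score = current.get(group, 0.0)
        if score < floor:
            failures.append(f"{group}: {score:.2f} is below the floor of {floor:.2f}")
        if group in previous and previous[group] - score > MAX_DROP_VS_PREVIOUS:
            failures.append(f"{group}: dropped {previous[group] - score:.2f} versus the last run")
    return failures

# Example: fail the CI pipeline (or page someone) when the list is non-empty.
print(gate(
    {"baseline": 0.91, "canary": 0.71, "stress": 0.62, "control": 0.97},
    {"baseline": 0.92, "canary": 0.84, "stress": 0.63, "control": 0.97},
))
```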
4. Audit and Provider Engagement
This layer is more about trust, compliance, and contractual obligations. It’s less about day-to-day operations and more about the long-term relationship, including audit rights, clear justification for vendor selection, and a claims mechanism. While this protection doesn’t guarantee detection of every individual failure, it provides a legal basis and formal grounds for complaints.
Conclusions
I started with the question: what exactly protects a user when the model hasn’t gone down but has started performing worse? After this analysis, my answer is: the market protects against outright service failure significantly better than against hidden quality changes on critical scenarios.
Looking at this from the perspective of an enterprise user, my takeaway is this: a significant portion of the protection today is built not by the provider but on the team’s own side.
Within this broader problem, reasoning drift remains a particularly important case to me. It hits the complex, long, and expensive tasks where everything can still seem “mostly fine” on the surface while the usefulness of the output has already noticeably declined.
If I translate this into practice, I’d highlight three things: pin model versions, maintain your own test suite for critical scenarios, and monitor production not just for availability but for result consistency. The tools for this already exist. But assembling a working defense still requires combining several different layers.
The most useful insight I took away from this analysis, though, is this: for critical tasks, the problem doesn’t start when “AI in general got worse.” It starts when, without your own eval procedures, the team can no longer see hidden quality regression. Once you frame it precisely, reasoning drift stops being a social media debate and becomes an engineering problem.