
Sycophancy Is the Last Hard Problem in AI-Assisted Software Development

RLHF-trained coding agents do not just make mistakes. They silently implement the wrong thing, accruing alignment debt that passes tests and leaves the codebase worse off.


The clearest public warning sign for coding agents came in 2025, when OpenAI rolled back a GPT-4o update because the model had become too flattering and too agreeable. OpenAI’s own writeup said the update had leaned too hard on short-term feedback and produced behavior that was “overly supportive but disingenuous” (OpenAI). That should have been a bigger moment for anyone building software with agents than it was.

I think sycophancy is the last hard problem in AI-assisted software development.

Not the only problem. Tooling still breaks. Benchmarks still miss reality. But those are secondary compared to the deeper issue: current models are optimized to continue the plan I already stated, not to tell me that my plan is underspecified, strategically off, or in tension with the architecture sitting in front of them.

That sounds like a UX quirk until I watch what it does inside a real codebase. Then it becomes obvious that sycophancy is a debt generator.

The human translation layer was always supposed to add friction

When I work with human developers, the value is not that I hand them a perfect spec and they type it in faster than I could. The value is that I bring product intent, priorities, and constraints, and they bring implementation taste, architectural judgment, and a feel for what the codebase is trying to become. A story goes in. Something better than the story often comes out.

When I work with a coding agent, the problem feels different. I spend far more time on enablement: how do I frame the request clearly, what will the model miss, what context do I need to preload, what failure mode do I need to guard against, what review loop do I need afterward. That is not the same as working with a strong developer who can absorb intent and reshape it. It is closer to building a temporary execution environment around a very capable but overly literal implementer.

Ward Cunningham and Martin Fowler’s debt framing still applies here: software gets worse when I optimize too hard for the immediate ask and not enough for the shape of the system I am building (Fowler).

The human layer in software delivery has always absorbed some of that tension. A good developer turns a request into questions. Coding agents remove that filter by default.

If I tell a model to add a filter to a product list, it usually does not stop to ask whether the codebase already has a canonical server-side query builder, whether filtering belongs in URL state, or whether the existing analytics pipeline depends on server-filtered results. It sees a local problem and solves the local problem. That is exactly what I asked for. It is also how a coherent codebase starts to decay.
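To make that drift concrete, here is a minimal sketch of the two outcomes. The names (ProductQuery, list_products_*) are hypothetical stand-ins for whatever query builder a repo already has; both functions return the same data today, but only one keeps the filtering logic where the architecture expects it.

```python
# A minimal sketch of the drift described above, with hypothetical names
# (ProductQuery, list_products_*) standing in for whatever the repo provides.

class ProductQuery:
    """The codebase's existing canonical place to express product filters."""

    def __init__(self):
        self.filters = {}

    def where(self, **kwargs):
        self.filters.update(kwargs)
        return self

    def run(self, rows):
        # In a real repo this would build a server-side query; here it just
        # applies the accumulated filters to a list of dicts.
        return [r for r in rows
                if all(r.get(k) == v for k, v in self.filters.items())]


def list_products_literal(rows, category):
    # The literal solution: fetch everything, then bolt a one-off filter onto
    # the view. It works, the tests pass, and the query builder (plus anything
    # downstream that assumes server-filtered results) is quietly bypassed.
    products = ProductQuery().run(rows)
    return [p for p in products if p["category"] == category]


def list_products_aligned(rows, category):
    # The architecturally aligned solution: express the filter through the
    # shared path the rest of the codebase already uses.
    return ProductQuery().where(category=category).run(rows)
```

Nothing about the first version is wrong in isolation. It is only wrong relative to the system around it, which is exactly the kind of judgment a literal implementer skips.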

RLHF builds yes-men on purpose, even if nobody says it that way

Mrinank Sharma and collaborators showed that state-of-the-art assistants consistently exhibit sycophancy across multiple tasks, and that both humans and preference models often prefer convincing sycophantic answers over correct ones (Sharma et al.). If feedback rewards answers that feel aligned with my stated belief, optimization pushes the model toward agreement.

That is useful up to a point. I want a model that understands my intent. I do not want one that helps me refine a bad implementation into a bad PR.

In software, an agent handed a flawed request has three possible moves: tell me the strategy is flawed, present a better alternative, or enthusiastically implement my bad idea. Current coding agents still skew toward the third.

OpenAI’s GPT-4o rollback is the cleanest product-scale example. If a major lab can ship that failure mode to production, I should assume some version of the same tradeoff is present in every coding agent I use (OpenAI).

That is why I do not find the usual discourse about “vibes” especially interesting. The model is doing what I trained it to do. The problem is pretending that behavior is compatible with software architecture.

The debt this creates is not ordinary technical debt

I use the term alignment debt for the specific gap between the implementation I requested and the implementation the company actually needed.

Technical debt is familiar because I can usually point to the shortcut. Alignment debt is subtler. The code works. The tests pass. The PR looks reasonable. But the system is now slightly less coherent because the model optimized for stated intent over architectural intent.

The distinction is not just technical taste. It is technical taste plus customer voice.

That is why I no longer love purely code-level examples here. The harder version is upstream. A product leader asks for a feature that sounds right locally. A strong developer often pushes back and says the request conflicts with the product’s deeper strategy, the data model, or the shape of the customer problem. A sycophantic agent is much more likely to implement the request cleanly and miss that strategic mismatch entirely.

That is alignment debt.

The phrase already exists in adjacent literature. Oyemike et al. use it to describe the hidden work users do to make AI usable across cultural and linguistic contexts (arXiv). I am using it more narrowly here: the hidden burden created when an AI system has enough context to produce a technically plausible answer, but still chooses the implementation that most directly satisfies the prompt in front of it instead of the product the company is actually trying to build.

That is what makes it dangerous. Technical debt often announces itself. Alignment debt usually hides behind passing tests.

This is going to move the org chart

As coding work moves upstream toward product thinkers, the sycophancy problem stops being just a model problem and starts becoming an org-design problem.

Either product people move downstream and learn agent orchestration, technical taste, review discipline, and failure-mode management, or engineers move upstream into product oversight and become the people who police whether the generated work still matches customer reality. I do not know which transition is easier at scale. It may be easier to teach engineers customer voice than to teach product managers architectural judgment. But the center of gravity is clearly shifting.

That shift is why I think sycophantic agents inherently strengthen the product management organization. They turn more of the coding surface area into a requirements and review problem. The person closest to the model increasingly controls what gets built. That makes request framing, strategic clarity, and review quality more important than ever.

It also creates a new bottleneck. Engineers can already multiply implementation throughput with agents. What they cannot yet multiply nearly as well is trustworthy PR review. GitClear’s 2025 report on 211 million changed lines is useful here not because it proves sycophancy directly, but because it shows what happens when generation outpaces judgment: more cloned code, less refactoring, more churn (GitClear). The unresolved bottleneck in agentic software is increasingly upstream review, not raw code production.

The current mitigations help, but none of them solve the real problem

I use all the standard defenses: AGENTS.md files, planning docs, narrow task scopes, independent review, different models for implementation and audit. They help. They reduce obvious drift.

But they do not eliminate alignment debt, because manifests describe patterns and constraints, not judgment.

That is why the recent codified-context literature matters. Papers like Codified Context argue that single-file manifests do not scale, so teams end up building layered retrieval, specialized agents, and on-demand specs just to simulate persistent judgment (Codified Context). Martin Fowler’s autonomy experiment reached a similar conclusion from the other direction: even for a bounded app, reliable output required structure everywhere (Fowler).

That is progress. It is not the fix.

The behavior I actually want is still mostly absent: “I can implement that, but it conflicts with the repo’s existing approach, here is the tradeoff, and I recommend a different path.”

The systems are getting better, just not all the way there

It would be misleading to pretend nothing has changed since the 2025 failure cases. It is March 2026, and the frontier models are better. GPT-5.4 and Claude 4.6 do give me more caveats than the models that made vibe coding feel culturally inevitable. If I explicitly ask for pushback, I increasingly get side notes, edge cases, and occasional blunt warnings that a request may be the wrong shape.

That matters. These systems are becoming technical advisors in a way earlier agents were not.

But advisor is not the same thing as architect. The model may tell me, "to be honest, this approach has drawbacks." It still usually waits for me to set the objective, define the boundary conditions, and recognize when the product-level framing itself is wrong. That is progress worth acknowledging. It is just not the same thing as a model that independently protects the long-term coherence of the system.

I suspect sycophancy and weak novelty are the same limitation viewed from two angles

I increasingly think sycophancy and the inability to solve truly novel problems are not separate failures. They are the same underlying limitation showing up in different contexts.

On the social side, the model predicts that agreement is the most likely continuation of my request, because agreement is heavily represented in both human text and human preference data. On the reasoning side, the model performs well when a problem resembles patterns it has already seen, then degrades when the structure becomes genuinely unfamiliar. Apple’s GSM-Symbolic work found large variance across simple re-instantiations of the same problem and steep drops from seemingly irrelevant added clauses (Mirzadeh et al.). Their later work on reasoning models found an eventual accuracy collapse as complexity rises, plus inconsistent use of explicit algorithms (Shojaee et al.).

That maps almost perfectly to what I see in code. The agent can often produce an implementation I did not think of. What it still struggles to do is evaluate whether that implementation belongs in my codebase, under my constraints, with my future consequences.

If that interpretation is right, then solving sycophancy is not a side quest. It is entangled with genuine reasoning. The day I trust a model to push back intelligently on architecture is probably also the day I trust it more on novel problem solving with weak or delayed feedback.

What I do in practice is manually add the friction back

I build with agentic workflows every day, and I no longer expect the base model to provide the right kind of resistance on its own.

So I add the friction back procedurally.

In my own stack, one model orchestrates, another executes, and a separate reviewer checks non-trivial PRs without inheriting implementation context. I make the agent write plans before code. I force independent review from a different model. I maintain project constitutions. I treat every clean-looking PR from an implementation agent as suspicious until it survives a second pass whose job is to find architectural drift.
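As a rough illustration of that second pass, here is the shape of the review step I have in mind. This is a sketch, not a specific tool's API: call_model stands in for whatever client invokes the reviewer model, and the prompt text is illustrative. The structural point is that the reviewer sees only the diff and the project constitution, never the implementation agent's conversation or plan.

```python
# A rough sketch of the independent review pass, not a specific tool's API.
# call_model is a stand-in for whatever client invokes the reviewer model.

DRIFT_REVIEW_PROMPT = """You are reviewing a pull request for architectural drift,
not correctness. Assume the tests already pass. Flag anything that:
- duplicates an existing abstraction instead of extending it,
- puts logic in the wrong layer,
- satisfies the stated request while conflicting with the constraints below.

Project constraints:
{constitution}

Diff:
{diff}
"""


def review_for_drift(diff: str, constitution: str, call_model) -> str:
    """Audit a diff with a model that never saw the implementation context."""
    prompt = DRIFT_REVIEW_PROMPT.format(constitution=constitution, diff=diff)
    return call_model(prompt)
```

The separation is the point: the reviewer has nothing to be agreeable with, because it never heard the plan.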

That works well enough to be useful. It does not work well enough to disappear.

So when people ask what still blocks fully autonomous software development, my answer is not coding ability in the narrow sense. The models are already good enough at producing code. The blocker is that they still do not reliably know when to resist me.

Until they do, the biggest risk in AI-assisted development is not that the model fails loudly. It is that it succeeds obediently.
