The End of Vibe Coding: Why AI Agents Are Growing Up
Something subtle but profound is happening in AI-assisted development. The "vibe coding" era—that euphoric period where we threw prompts at Claude and marveled at the output—is coming to an end. Not because the tools are getting worse, but because we're finally figuring out how to use them properly.
The new pattern emerging across research labs, startups, and major tech companies is surprisingly consistent: reliable AI agents require a strict separation between planning and execution. This isn't just a workflow preference. It's becoming a fundamental architectural principle.
The Planning/Execution Divide
Last week, Boris Tane's detailed Claude Code workflow hit the front page of Hacker News—and the developer response was electric. His approach is methodical: deep research phase, written plan in markdown, multiple rounds of human annotation, then a single implementation command. The magic happens in the annotation cycle, where he adds inline notes correcting assumptions before any code is written.
Stripe revealed something similar with their "Minions" system. Over a thousand PRs merged weekly at Stripe are now completely AI-generated. But here's what matters: they emphasize one-shot, end-to-end execution. The agent gets the complete context upfront—including the entire codebase through compressed git trees—and executes without human checkpoints mid-flow.
Both approaches converge on the same insight: the cognitive work of architecture and the mechanical work of implementation should not be interleaved. When they are, you get the classic agent failure mode—reasonable-but-wrong assumptions that compound for 15 minutes until the whole chain needs unwinding.
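The separation described above can be sketched as a minimal plan-then-execute harness: the model first proposes a plan, a human annotates it, and only then does a single execution pass run over the reviewed steps. This is an illustrative sketch, not Tane's or Stripe's actual tooling; `call_model` is a stub standing in for any LLM API.

```python
# Minimal sketch of a plan-then-execute harness. `call_model` is a placeholder
# for a real LLM call, stubbed here so the control flow is runnable.
from dataclasses import dataclass, field

def call_model(prompt: str) -> str:
    # Stand-in for an LLM API call; returns a canned response for illustration.
    return f"RESPONSE[{prompt[:20]}]"

@dataclass
class Plan:
    steps: list[str]
    annotations: dict[int, str] = field(default_factory=dict)

    def annotate(self, step: int, note: str) -> None:
        # Human review surface: corrections attach to the plan before any execution.
        self.annotations[step] = note

def make_plan(task: str) -> Plan:
    # Planning phase: the model proposes steps; no implementation happens yet.
    return Plan(steps=[f"step {i}: {call_model(task)}" for i in range(3)])

def execute(plan: Plan) -> list[str]:
    # Execution phase: one pass over the reviewed plan, annotations folded in.
    results = []
    for i, step in enumerate(plan.steps):
        note = f" | NOTE: {plan.annotations[i]}" if i in plan.annotations else ""
        results.append(call_model(step + note))
    return results

plan = make_plan("add retry logic to the upload client")
plan.annotate(1, "use exponential backoff, not fixed sleep")
outputs = execute(plan)
```

The key design choice is that `execute` never mutates the plan: architectural corrections live only in the annotation cycle, so there is no mid-flow checkpoint where assumptions can silently drift.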
Why Accuracy Metrics Are Broken
This planning/execution split isn't just about developer ergonomics. New research reveals it's fundamental to how we should evaluate reasoning systems.
A paper from this week introduces two novel metrics for Chain-of-Thought reasoning: reusability and verifiability. The researchers decoupled CoT generation from execution using a Thinker-Executor framework and discovered something startling: models with the highest accuracy don't necessarily produce the most reusable or verifiable reasoning traces.
DeepSeek-R1 and Phi4-reasoning—specialized reasoning models—were sometimes worse at generating reusable CoT than general-purpose LLMs like Llama and Gemma. Accuracy on benchmarks like GSM8K correlates weakly with whether other models can follow the reasoning or verify the steps. This exposes a blind spot in our current evaluation paradigm: we optimize for getting the right answer while ignoring whether the reasoning process is sound, communicable, or trustworthy.
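The decoupling idea can be made concrete with a toy version of the Thinker-Executor split: one model writes the reasoning trace, and a different model must turn that trace alone (without the original question) into a final answer. Both models are stubbed below; the scoring scheme is my assumption in the spirit of the paper, not its actual implementation.

```python
# Toy sketch of decoupled CoT evaluation: reusability is the fraction of
# traces a *different* executor can follow to the correct answer.

def thinker(question: str) -> str:
    # Produces a chain-of-thought trace (stubbed as arithmetic steps).
    return "step1: 12 * 4 = 48; step2: 48 + 2 = 50"

def executor(trace: str) -> str:
    # Sees only the trace, not the question: reads off the last computed value.
    return trace.rsplit("= ", 1)[-1]

def reusability(questions: list[str], answers: list[str]) -> float:
    # Score traces by whether another model reaches the right answer from them.
    correct = sum(executor(thinker(q)) == a for q, a in zip(questions, answers))
    return correct / len(questions)

score = reusability(["What is 12 * 4 + 2?"], ["50"])
```

A thinker that jumps straight to "50" with no intermediate steps might score the same on accuracy, but an executor given a thin or muddled trace would fail, which is exactly the gap the reusability metric is meant to expose.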
In multi-agent systems, this matters enormously. If Agent A's reasoning isn't verifiable by Agent B, the whole pipeline becomes brittle. The paper's conclusion is stark: "relying solely on accuracy to select models for these pipelines poses a significant reliability risk."
Agent Specialization Is Exploding
The planning/execution pattern isn't limited to code. It's appearing across specialized domains as researchers build agentic systems for specific verticals.
FAMOSE, also published this week, applies the ReAct paradigm to automated feature engineering. Instead of generating features in one shot, the agent iteratively proposes, evaluates, and refines features based on validation performance—simulating how a data scientist actually works. The system achieved state-of-the-art results on regression tasks (2% RMSE reduction) and large classification datasets by learning from its mistakes across iterations.
OpenEarthAgent takes this even further into geospatial analysis. The system orchestrates visual, spectral, GIS, and GeoTIFF-aware tools through structured reasoning trajectories. Each query generates multi-step chains of tool calls with explicit intermediate observations. Trained on 14K instances with over 100K reasoning steps, it demonstrates how domain-specific agents benefit enormously from structured execution patterns.
The pattern is clear: agents work best when they operate within well-defined tool schemas, generate explicit reasoning traces, and separate the planning of tool sequences from their execution.
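Those three properties can be shown together in a small harness: tools register a declared parameter schema, the agent submits a pre-planned sequence of calls, and every intermediate observation is recorded explicitly. Tool names and schemas here (e.g. `ndvi`) are illustrative, not OpenEarthAgent's API.

```python
# Sketch of a structured tool harness: declared schemas, pre-planned call
# sequences, and explicit intermediate observations for every step.

TOOLS: dict[str, dict] = {}

def tool(name: str, params: dict):
    # Register a function together with its declared parameter schema.
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "params": params}
        return fn
    return wrap

@tool("ndvi", params={"red": float, "nir": float})
def ndvi(red: float, nir: float) -> float:
    # Normalized difference vegetation index from red/near-infrared bands.
    return (nir - red) / (nir + red)

def run_trajectory(calls: list[tuple[str, dict]]) -> list[dict]:
    # Execute a planned tool sequence, validating arguments against each
    # schema and keeping an explicit observation for every step.
    observations = []
    for name, args in calls:
        spec = TOOLS[name]
        assert set(args) == set(spec["params"]), f"schema mismatch for {name}"
        observations.append({"tool": name, "args": args, "obs": spec["fn"](**args)})
    return observations

trace = run_trajectory([("ndvi", {"red": 0.2, "nir": 0.6})])
```

Because the trajectory is planned as data before anything runs, the sequence of tool calls can itself be reviewed or replayed, the same separation of planning from execution seen in the coding workflows above.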
The RLVR Revolution
Underpinning all of this is a quiet revolution in how we train reasoning models. Researchers posting on X this week highlighted remarkable gains from RLVR (Reinforcement Learning with Verifiable Rewards):
- Qwen3-8B: GPQA-Diamond jumped from 30.3% to 48.9%
- Qwen2.5-32B: MMLU-Pro improved from 55.1% to 74.4%
- Qwen3-14B: GPQA rose from 42.6% to 57.7%
These aren't incremental improvements. They're leaps that suggest we're discovering how to train models to generate more verifiable, more structured reasoning by default. DeepSeek-R1-Distill-Qwen-14B now beats o1-mini on AIME 2024 and MATH-500 benchmarks while being efficient enough for local deployment.
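What makes a reward "verifiable" in this sense is that it comes from a deterministic checker rather than a learned preference model. A minimal sketch, assuming a GSM8K-style `#### <answer>` completion format (my assumption, not any specific trainer's spec):

```python
# Minimal sketch of a verifiable reward: a deterministic checker compares the
# model's extracted final answer against ground truth, returning 1.0 or 0.0.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Extract the final "#### <number>" answer and compare exactly.
    m = re.search(r"####\s*(-?\d+(?:\.\d+)?)", completion)
    return 1.0 if m and m.group(1) == ground_truth else 0.0

r = verifiable_reward("6 * 7 = 42, so the total is 42.\n#### 42", "42")
```

Because the reward is exact and cheap to compute, it can be applied at RL scale with no reward-model drift, which is one plausible reason RLVR-trained models produce more structured, checkable reasoning by default.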
What This Means for Builders
If you're building with AI agents today, the implications are concrete:
1. Never let agents write code without a reviewed plan. The "vibe coding" approach of iterative prompt-and-fix leads to compounding errors. The plan document is your review surface for architectural decisions.
2. Separate research, planning, and implementation into distinct phases. Each phase has different failure modes and requires different types of oversight. Don't interleave them.
3. Treat reasoning traces as first-class artifacts. Whether it's CoT for math problems, tool sequences for geospatial analysis, or feature engineering steps—the intermediate reasoning is as important as the final output.
4. Build or choose tools that enforce structured execution. The trending GitHub repos reflect this: gemini-cli, UI-TARS-desktop, and oh-my-opencode all provide structured agent harnesses rather than open-ended chat interfaces.
The Road Ahead
We're moving from an era of AI assistants that help you code to AI agents that execute engineering workflows. But this transition requires discipline. The tools are capable of remarkable autonomous execution—but only when given proper structure.
The future belongs to developers and teams who treat AI agents as junior engineers that need architectural oversight, not as oracles that generate code from vague prompts. The "separation of planning and execution" that Boris Tane described isn't just a productivity hack. It's the fundamental pattern that makes reliable agent systems possible.
The vibe coding party was fun. But the real work of building reliable, scalable agent systems is just getting started.
Sources
Academic Papers
- Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability — arXiv, Feb 21, 2026 — Reveals accuracy metrics fail to capture reasoning quality; specialized reasoning models aren't necessarily more verifiable than general LLMs
- FAMOSE: A ReAct Approach to Automated Feature Discovery — arXiv, Feb 22, 2026 — First application of ReAct agents to automated feature engineering, achieving SOTA results through iterative refinement
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents — arXiv, Feb 22, 2026 — Demonstrates structured tool orchestration for geospatial reasoning with 14.5K training instances
Hacker News Discussions
- How I use Claude Code: Separation of planning and execution — Hacker News, Feb 18, 2026 — Detailed workflow emphasizing research → plan → annotate → implement phases
- Minions: Stripe's one-shot, end-to-end coding agents — Hacker News, Feb 20, 2026 — Stripe's system generating 1000+ PRs/week with complete context upfront
Reddit Communities
- Claude 3.7 Sonnet discussion — r/MachineLearning, Feb 2026 — Community observations on coding capabilities
- Local LLaMA reasoning benchmarks — r/LocalLLaMA, Feb 2026 — Performance analysis of distilled reasoning models
X/Twitter
- Richard Price on RLVR improvements — @richardprice100, Feb 21, 2026 — Compilation of RLVR training results showing dramatic benchmark improvements
- Grok on DeepSeek-R1 benchmarks — @grok, Feb 21, 2026 — DeepSeek-R1-Distill-Qwen-14B beating o1-mini on math benchmarks
GitHub Projects
- google-gemini/gemini-cli — GitHub, Feb 2026 — Open-source AI agent for terminal with 95K+ stars
- bytedance/UI-TARS-desktop — GitHub, Feb 2026 — Open-source multimodal AI agent stack with 28K+ stars
- x1xhlol/system-prompts-and-models-of-ai-tools — GitHub, Feb 2026 — Comprehensive collection of AI tool system prompts with 116K+ stars
Company Research
- How I Use Claude Code — Boris Tane, Feb 2026 — Detailed breakdown of disciplined agent workflow with annotation cycles