Reasoning and Action: Two Fields Racing Toward the Same Destination
For the past few years, AI research has been quietly split into two camps that barely talked to each other.
On one side: the reasoning crowd. These researchers obsessed over Chain-of-Thought prompting, test-time compute scaling, and getting language models to "think harder" before committing to an answer. Their north star was benchmarks like MATH and AIME — clean, verifiable problems where the model could reason its way to a right answer without touching the outside world.
On the other side: the agentic crowd. These researchers cared about everything reasoning people ignored — tool use, multi-step execution, feedback loops, acting in environments. Their north star was getting models to actually do things: browse the web, write and run code, interact with APIs, navigate databases. The agentic crowd treated reasoning as a means to an end, not the end itself.
Both camps were making real progress. Both camps were publishing impressive papers. And for the most part, they weren't reading each other's work.
That just changed — and it's the most exciting thing happening in AI right now.
When Reasoning Isn't Enough
The reasoning-first approach reached a surprising limit that multiple April 2026 papers independently rediscovered: reasoning in isolation is fundamentally incomplete.
SeLaR, a paper from Peking University, made this precise. The researchers observed something counterintuitive during Chain-of-Thought decoding: most reasoning steps are low-entropy — the model is already confident and doesn't need to deliberate further. Only a small fraction of steps are genuinely uncertain, where the model's top candidates compete and exploration might help. Existing latent reasoning methods applied soft embeddings globally, injecting unnecessary perturbation at steps where the model was already decisive, while failing to sustain exploration at the steps that actually mattered.
SeLaR's fix was elegant: use entropy as a gate. Only activate the more expensive, richer latent reasoning path at high-uncertainty steps. Keep discrete decoding everywhere else. The result is a system that allocates reasoning resources intelligently rather than uniformly.
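The gating idea is simple enough to sketch. The following is a hypothetical illustration, not SeLaR's implementation — the threshold value and the `latent_reason`/`greedy_pick` callables are invented stand-ins for the paper's actual latent-reasoning path and discrete decoder:

```python
import math

# Assumed hyperparameter (in nats); the real method would tune this.
ENTROPY_THRESHOLD = 1.0

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_step(probs, latent_reason, greedy_pick):
    """Route one decoding step: expensive latent reasoning only when uncertain."""
    if entropy(probs) > ENTROPY_THRESHOLD:
        return latent_reason(probs)  # high-entropy step: explore in latent space
    return greedy_pick(probs)        # low-entropy step: the model is already decisive

# A confident distribution stays on the cheap path; a flat one triggers the gate.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.30, 0.30, 0.20, 0.20]
assert entropy(confident) < ENTROPY_THRESHOLD < entropy(uncertain)
```

The point of the sketch is the control flow: one cheap entropy check per step decides where the compute goes, which is what lets the method spend latent reasoning only where the candidates genuinely compete.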
But here's the part that matters for our bigger story: SeLaR was purely about thinking better. It didn't touch actions, tools, or environments. It was the reasoning crowd at its finest — squeezing more performance out of the model's internal process.
Now zoom out to BracketRank, a paper from the University of Innsbruck tackling a very different problem: document retrieval. The BRIGHT benchmark revealed that reasoning-intensive queries — the kind where you need to actually understand why a document is relevant, not just whether it contains the right keywords — stump even state-of-the-art models. GPT-4-based rankers scored an nDCG of 17. Rank-R1-14B hit 20.5. That is dramatic underperformance relative to standard retrieval tasks.
BracketRank's solution was to embed explicit step-by-step reasoning into every stage of the document comparison process, using a tournament-style elimination bracket where documents compete head-to-head. The reasoning wasn't just decoration — it was load-bearing. Forcing the model to articulate why one document beats another before ranking them produced dramatically better outcomes.
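A single-elimination bracket over documents can be sketched in a few lines. This is an illustrative toy, not BracketRank's code — `judge` stands in for a prompted LLM comparison that must articulate why one document beats the other before returning a verdict:

```python
def tournament_rank(docs, judge):
    """Rank documents by elimination round: the champion first, earliest losers last.

    judge(a, b) returns True if document a should beat document b.
    """
    eliminated = []            # losers in elimination order (earliest first)
    remaining = list(docs)
    while len(remaining) > 1:
        next_round = []
        for i in range(0, len(remaining) - 1, 2):
            a, b = remaining[i], remaining[i + 1]
            winner = a if judge(a, b) else b
            next_round.append(winner)
            eliminated.append(a if winner is b else b)
        if len(remaining) % 2:                 # odd document out gets a bye
            next_round.append(remaining[-1])
        remaining = next_round
    # Champion, then losers of later rounds before earlier ones.
    return remaining + eliminated[::-1]

# Usage with a toy judge that prefers longer documents:
docs = ["a", "ccc", "bb", "dddd"]
print(tournament_rank(docs, lambda x, y: len(x) > len(y)))
```

The bracket structure is what makes the reasoning load-bearing: every ranking decision is forced through an explicit head-to-head comparison rather than a single holistic score.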
Again: reasoning in service of action. But still not the full picture.
Where STEM Reasoning Falls Short
Here's the observation that ties the reasoning and agentic crowds together. RLVR — Reinforcement Learning with Verifiable Rewards — has been enormously successful for math and code. Train a model on problems where you know the right answer, give it compute to generate multiple attempts, reward the ones that land on the correct solution. It works beautifully for formal domains.
But does that reasoning transfer to everything else?
A new paper called SuperNova, from UCLA, investigated exactly this question. The answer is a qualified no — and the qualification is important.
Models trained exclusively on STEM reasoning show dramatic gains on math benchmarks (AIME, competition math) while simultaneously degrading on general reasoning tasks like Big-Bench Extra Hard. The STEM-trained models got 50% better at math and 8% worse at causal inference, temporal reasoning, and commonsense logic. The skills simply don't transfer.
SuperNova's response was to build a principled data curation framework for RLVR that goes beyond formal domains. Their key finding: different target reasoning skills require different source tasks. Temporal reasoning benefits from exposure to temporal graph tasks during training. Causal inference needs causal structure. A single averaged-optimal data mixture — what they call "macro mixing" — consistently underperforms per-skill task selection, or "micro mixing."
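The macro/micro distinction is easy to make concrete. The skill names, task names, and weights below are invented for illustration and do not come from the SuperNova paper:

```python
# Hypothetical per-skill blends ("micro mixing"): each target skill weights
# the source tasks differently, concentrating on its best-matched source.
MICRO_MIX = {
    "temporal_reasoning": {"temporal_graphs": 0.6, "math": 0.2, "code": 0.2},
    "causal_inference":   {"causal_graphs": 0.6, "math": 0.2, "code": 0.2},
}

def macro_mix(micro):
    """Collapse per-skill blends into one averaged mixture ("macro mixing")."""
    avg = {}
    for blend in micro.values():
        for task, weight in blend.items():
            avg[task] = avg.get(task, 0.0) + weight / len(micro)
    return avg

print(macro_mix(MICRO_MIX))
```

Averaging dilutes each skill's best-matched source (here, `temporal_graphs` and `causal_graphs` drop from 0.6 to 0.3), which is exactly the failure mode the paper attributes to a single fixed mixture.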
The implication cuts both ways: (1) reasoning is more domain-specific than the reasoning crowd assumed, and (2) the agentic crowd, which has always cared about diverse environments and task-specific behaviors, was right that general capability emerges from diverse experience, not from perfecting one domain.
The Infrastructure Finally Arrives
While the academic papers were mapping the contours of the reasoning-action problem, something practical was happening on GitHub.
Hermes-agent, built by NousResearch, rocketed to over 52,000 stars in a single week. The concept is straightforward but the timing is significant: an open-source agent framework designed around continuous learning, where the agent's capabilities compound over time rather than resetting. It's not a research prototype — it's infrastructure for building real agentic systems.
Multica hit 9,823 stars simultaneously, positioning itself as a managed agent platform that turns coding agents into "real teammates" with tracked progress and compounding skills. DeepTutor, from HKUDS, took the personalized learning assistant concept — long a staple of AI ed-tech PowerPoint decks — and made it agent-native, with the model actively reasoning about learning paths rather than retrieving static recommendations.
The pattern is unmistakable: the tools for building agentic systems have crossed a quality threshold. The research community spent years establishing what agentic AI should look like. The open-source community is now building the means to productionize it.
The Convergence Nobody Predicted
So what's the synthesis here?
The reasoning crowd discovered that reasoning alone hits walls: it doesn't transfer across domains (SuperNova), it doesn't ground in reality (BracketRank), and it misallocates resources when applied uniformly (SeLaR). The agentic crowd discovered that action without reasoning is blind: tool use without calibrated confidence is brittle, multi-step execution without selective reasoning is inefficient, and diverse environments without principled training signals don't generalize.
The convergence point — the place both crowds are now racing toward — is reasoned action. Not reasoning as a prelude to acting. Not action as a test of reasoning. But reasoning and action as a single, interleaved process, trained end-to-end with RL in diverse environments, where the model's confidence calibration, exploration strategy, and environmental feedback all improve together.
"From Reasoning to Agentic," another April arXiv paper, put it directly: the field needs better credit assignment — figuring out which reasoning steps led to good outcomes — when the rewards come from the real world rather than from verifiable answers. This is a fundamentally harder RL problem than math or code, and it's the next frontier.
Walk the Talk, a multimodal agentic policy optimization paper, made the same point from the other direction: bridging the gap between how a model reasons about visual information internally and how it acts on that reasoning externally. The gap isn't just architectural. It's about training signals, feedback loops, and the cost of exploration in real environments.
Why This Matters for the Rest of 2026
The practical implication is that the next wave of capability improvements won't come from scaling reasoning in isolation — more CoT steps, more test-time compute, larger reasoning models. They'll come from the intersection: models that reason and act in the world, trained with RL signals from both formal verification and environmental feedback, with principled approaches to when to explore and when to commit.
This means the infrastructure investments happening right now — Hermes-agent, Multica, agent-native frameworks — are not just useful tooling. They're the substrate on which the next generation of capable AI will be built. The research problems and the engineering problems are converging, and the researchers who can work across both will define the next chapter.
For the past few years, the AI research community ran two parallel marathons toward different finish lines. They just realized they're on the same course.
Sources
Academic Papers
SeLaR: Selective Latent Reasoning in Large Language Models — arXiv, April 14, 2026 — Introduced entropy-gated selective activation for latent reasoning, demonstrating that most CoT steps are low-entropy and don't need expensive soft embedding reasoning
SuperNova: Eliciting General Reasoning in LLMs with RL on Natural Instructions — arXiv, April 14, 2026 — Showed that STEM RLVR training doesn't transfer to general reasoning; established micro-mixing data curation as key for generalist RLVR
BracketRank: LLM Document Ranking via Reasoning-based Competitive Elimination — arXiv, April 14, 2026 — Applied explicit tournament-style reasoning to document retrieval, achieving state-of-the-art on BRIGHT reasoning-intensive retrieval benchmark
From Reasoning to Agentic: Credit Assignment in RL for LLMs — arXiv, April 13, 2026 — Identified credit assignment in RL for language models as the core unsolved problem in moving from reasoning to agentic behavior
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images — arXiv, April 8, 2026 — Tackled the multimodal reasoning-to-action gap via agentic policy optimization
Self-Distilled RL for Co-Evolving Agentic Recommender Systems — arXiv, April 11, 2026 — Demonstrated agentic recommender systems as a domain where reasoning and action must co-evolve through RL
VISOR: Agentic Visual RAG via Iterative Search and Over-horizon Reasoning — arXiv, April 10, 2026 — Combined retrieval, reasoning, and action in a unified agentic visual search framework
GitHub Projects
Hermes-agent — NousResearch, April 2026 — 52,996 stars this week; agent framework with continuous learning and compounding capabilities
Multica — multica-ai, April 2026 — 9,823 stars this week; open managed agents platform turning coding agents into compound-teammates
DeepTutor — HKUDS, April 2026 — 6,401 stars this week; agent-native personalized AI learning assistant
BracketRank — University of Innsbruck, April 2026 — Open-source implementation of tournament-based LLM document ranking
Hacker News
Claude Code Routines Launch — Hacker News, April 14, 2026 — 661 points; Anthropic's release of coding agent routines; the HN discussion reveals strong practitioner interest in agent reliability and reproducibility patterns
Reddit Communities
AI Slop on Local Models — r/LocalLLaMA, April 14, 2026 — Community discussing the quality and coherence challenges of local AI at scale
Xiaomi 12 Pro as Headless AI Server — r/LocalLLaMA, April 14, 2026 — Practical demonstrations of local AI deployment pushing into everyday hardware
State of LocalLLaMA Community — r/LocalLLaMA, April 2026 — Meta discussion of community values around open, local AI development
X/Twitter
Tesla AI5 Chip Tape-out — @Tesla_AI, April 2026 — 20K+ posts in 5 hours; next-generation AI accelerator for autonomous driving signals the scaling of real-world AI inference infrastructure