
The Buildable Future: Why AI's Next Revolution Is Smaller, Local, and Yours

Something strange happened this week. While the major AI providers appeared to quietly degrade their models, with users reporting that everything from Claude to GPT has gotten "grumpy," slower, and shallower, something else emerged from the local AI community that demands attention.

A model wrote its own inference stack. In Zig. And beat a widely used production app by 20%.

That's not a benchmark improvement. That's a category shift.

The Model That Built Its Own Engine

When Kimi K2.6 launched, most coverage focused on benchmark scores—tool-enabled HLE, SWE-Bench Pro, BrowseComp. But buried in the launch documentation was a detail that should have been the headline: the model reportedly installed Qwen3.5-0.8B on a Mac, decided the default inference stack wasn't cutting it, and wrote an entirely new one in Zig. Not Python. Not C++. Zig.

The result was a jump from roughly 15 tokens per second to 193—more than a 10x speedup, and about 20% faster than LM Studio on the same hardware.
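
Numbers like these are easy to sanity-check yourself. Here is a minimal sketch of a throughput measurement, assuming an OpenAI-style completion dict of the kind llama-cpp-python returns; the model path in the comments is a placeholder, not part of the original report:

```python
import time

def tokens_per_second(generate, prompt, max_tokens=256):
    """Time one generation call and return completion throughput."""
    start = time.perf_counter()
    result = generate(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    # llama-cpp-python and most local servers return OpenAI-style usage stats.
    completed = result["usage"]["completion_tokens"]
    return completed / elapsed

# Example with llama-cpp-python (model path is a placeholder):
# from llama_cpp import Llama
# llm = Llama(model_path="qwen-q4.gguf")
# print(tokens_per_second(llm, "Explain KV caching in one paragraph."))
```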

This isn't about a faster inference framework. This is about a model entering an unfamiliar technical regime, identifying bottlenecks, and engineering its way to a materially better system. In software, passing tests is one thing. Rewriting a runtime in a language you weren't handed—unprompted—and beating a widely deployed product is something else entirely.

If this generalizes, the center of gravity shifts: from answering questions to improving the machinery that produces the answers.

The Great Capability Divergence

The intelligence drops across major providers have become impossible to ignore. On LocalLLaMA, a thread titled "Major drop in intelligence across most major models" hit 792 points. Users report models ignoring basic instructions, taking longer to respond, and producing deliberately shortened outputs. This isn't tinfoil-hat territory; the reports are too numerous and too consistent to dismiss.

The likely culprit: resource reallocation. Running large models at scale for millions of users is expensive. When providers need to cut costs, the available levers all reduce inference compute: heavier quantization, smaller serving configurations, tighter reasoning budgets. Each of them shows up to users as the same "vibes" of a duller, shallower model.
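
A back-of-envelope sketch shows why quantization is the first lever anyone reaches for; the 70B parameter count below is illustrative, not a claim about any particular provider:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, ignoring KV cache and activations."""
    # params_billion * 1e9 params * (bits / 8) bytes, expressed in GB.
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(70, 16))  # fp16 -> 140.0 GB of weights
print(weight_memory_gb(70, 8))   # int8 ->  70.0 GB, half the hardware
print(weight_memory_gb(70, 4))   # int4 ->  35.0 GB, a quarter
```

Halving the bits roughly halves the serving hardware, a saving a provider under cost pressure can't ignore, and plausibly the quality cost users are describing.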

Meanwhile, local models aren't just filling the gap—they're actively surpassing their cloud counterparts for specific use cases. Kimi K2.6, Qwen3.6-35B-A3B, and Gemma 4 have created a class of models that run on your own hardware, without usage limits, without throttling, without the "grumpy mode" that seems to afflict centralized services.

The community verdicts tell the story: Kimi K2.6 is described by users as "the first model I'd confidently recommend as an Opus 4.7 replacement." A Reddit post about running Qwen3.6-35B-A3B through OpenCode on an M5 Max declares it "as good as Claude" for coding tasks that previously required frontier-tier cloud models.
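
Part of what makes these swaps painless is that local servers speak the same API as the cloud. A minimal sketch of pointing the standard OpenAI client at an LM Studio-style endpoint; the port, key, and model name are placeholders for whatever your local setup uses:

```python
from openai import OpenAI

# LM Studio and similar local servers expose an OpenAI-compatible endpoint;
# the base_url, api_key, and model name are placeholders for your setup.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="local-model",  # many local servers ignore or remap this field
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
)
print(response.choices[0].message.content)
```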

Minimalism as Moat

The most interesting AI research this week wasn't about scaling up. It was about scaling down—and understanding why less can be more.

StarVLA-α, released this week, deliberately minimizes architectural complexity in Vision-Language-Action systems. Where the field has been adding layers upon layers of engineering to handle robotics tasks, StarVLA-α shows that a strong VLM backbone combined with minimal design is already sufficient. Their single generalist model outperforms π0.5 by 20% on the RoboChallenge benchmark—not by being more complex, but by being less so.
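
In code, the pattern they are arguing for reduces to roughly the following; a conceptual sketch, not the StarVLA implementation, with `backbone` standing in for any pretrained vision-language model:

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """'Strong VLM backbone + minimal design': the only new part is one layer."""

    def __init__(self, backbone: nn.Module, feature_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone  # pretrained VLM returning one feature vector
        self.action_head = nn.Linear(feature_dim, action_dim)  # the minimal part

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        features = self.backbone(image, text)  # (batch, feature_dim)
        return self.action_head(features)      # (batch, action_dim) motor commands
```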

This is the Lego principle: standardized, simple pieces that snap together reliably beat bespoke engineering that requires constant maintenance.

The LARY Benchmark research reinforces this. When evaluating latent action representations—the critical bridge between visual observation and physical action—they found that general visual foundation models, trained without any action supervision, consistently outperform specialized embodied models. The semantic abstraction of "what to do" matters more than pixel-level reconstruction of "how to move."
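
A standard way to make that comparison concrete is a linear probe: freeze each candidate encoder, train only a linear map from its features to recorded actions, and compare the residual error. The sketch below is my framing of that general technique, not LARY's actual protocol; every name is a placeholder:

```python
import torch
import torch.nn as nn

def fit_action_probe(encoder, frames, actions, feature_dim, action_dim, steps=1000):
    """Train a linear probe from frozen visual features to actions."""
    encoder.eval()
    with torch.no_grad():
        feats = encoder(frames)  # (N, feature_dim), encoder stays frozen
    probe = nn.Linear(feature_dim, action_dim)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(probe(feats), actions)
        loss.backward()
        opt.step()
    # Lower final loss means the frozen features carry more action information.
    return probe, loss.item()
```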

The Synthetic Learning Turn

Perhaps the most significant research finding comes from Sim2Reason at CMU, which demonstrates training LLMs inside physics simulators to acquire real-world reasoning. The key insight: today's best AI needs orders of magnitude more data than a human child to achieve visual competence. But when you train on physics simulations—letting models learn like Newton learned, by experiencing consequences rather than reading descriptions—you get zero-shot transfer to real-world physics benchmarks.

Training solely on synthetic simulated data improved performance on International Physics Olympiad problems by 5-10 percentage points across model sizes. The bottleneck isn't data quantity—it's data quality. Structured, physically-grounded, causally-transparent training signals outperform the noise of internet-scale scraping.
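
What "causally transparent" means in practice: every training example carries a label derived from the generating equations themselves, so there is nothing to mislabel. A toy sketch in that spirit; this is illustrative, not Sim2Reason's actual pipeline:

```python
import random

def projectile_example(rng: random.Random) -> dict:
    """One physics Q/A pair whose answer comes from closed-form kinematics."""
    v0 = rng.uniform(5.0, 50.0)  # launch speed in m/s
    g = 9.81                     # gravitational acceleration in m/s^2
    t_flight = 2 * v0 / g        # time for a vertical throw to return to hand
    return {
        "question": f"A ball is thrown straight up at {v0:.1f} m/s. "
                    "How long until it comes back down?",
        "answer": f"{t_flight:.2f} s",  # exact by construction, no label noise
    }

rng = random.Random(0)
for example in (projectile_example(rng) for _ in range(3)):
    print(example["question"], "->", example["answer"])
```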

This aligns with a broader recognition in the field: the next breakthrough won't come from more parameters but from better training regimes. How you train is becoming as important as what you train on.

The Research Integrity Reckoning

The AI research community is having an uncomfortable conversation. A post on r/MachineLearning titled "Failure to Reproduce Modern Paper Claims" hit 185 points: out of 7 checked claims this year, 4 were irreproducible, with 2 having active unresolved issues on GitHub.

This isn't new; reproducibility has been a concern for years. What's new is the volume. Users on the same thread note that ICLR 2025 accepted papers that scored SQL code generation with natural-language similarity metrics instead of execution metrics, producing false-positive rates of around 20%. One of those papers received an oral presentation despite the fundamental evaluation flaw.
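
The gap between the two metric families is easy to state in code: a text-similarity metric can call two queries equivalent when their results differ, while an execution check cannot. A minimal sketch against sqlite, illustrative rather than any paper's actual harness:

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Two queries match only if they return the same rows when executed."""
    conn = sqlite3.connect(db_path)
    try:
        gold = sorted(conn.execute(gold_sql).fetchall())
        try:
            pred = sorted(conn.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # the predicted query does not even run
        return gold == pred
    finally:
        conn.close()
```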

When 100-200 new ML papers land on arXiv daily, the community can't keep pace with verification. And when the same labs release both the models and the benchmarks that score them, there's a structural incentive to report favorable results.

The result is an ecosystem where trust is becoming rationed. Practitioners increasingly rely on community testing, shared benchmarks, and hands-on evaluation rather than published claims.

The Buildable Era

What unites these threads?

We're entering a phase where the most exciting AI isn't something you use; it's something you build with. Local models that you can modify and optimize, and that can, as this week showed, modify and optimize themselves. Open-weight models with Apache 2.0 licenses that enterprises can actually deploy without legal review. Research frameworks like StarVLA that prioritize understandability over impressiveness.

The era of "bring your data to our AI" is giving way to "bring our AI to your data." The era of "trust our benchmarks" is giving way to "verify our claims yourself."

And perhaps most importantly: the era of "wait for the next flagship model" is giving way to "build the system you need from components that actually work."

The model that rewrote its own inference stack in Zig wasn't trying to win a benchmark. It was trying to solve a problem. That's a fundamentally different relationship between AI and the task—and it's the relationship that will define the next chapter.

The question isn't whether AI will transform how we build software. It's whether you'll be building with it, or just using it.


Sources

GitHub Projects

  • StarVLA/starVLA — GitHub, Oct 9, 2025 — Lego-like VLA codebase with 1,980 stars; exemplifies the minimalist design philosophy in embodied AI