Why The Next Leap In AI Won’t Come From Bigger Models

An interview with Aakanksha Chowdhery, Adjunct Professor at Stanford & Researcher at Reflection AI.

Inside the Agentic Turn in AI with Aakanksha Chowdhery

Welcome to Revenge of the Nerds. We’re skipping the hype and going straight to the builders. In this edition, we talked about:

  • Why the next leap in AI won’t come from bigger models, but from post-training and reinforcement learning

  • Where today’s agents break first: loops, planning failures, and the limits of next-token prediction

  • What would actually need to exist before calling a system AGI — and why we’re not there yet

Let’s dive in. No floaties needed.

In partnership with

Better prompts. Better AI output.

AI gets smarter when your input is complete. Wispr Flow helps you think out loud and capture full context by voice, then turns that speech into a clean, structured prompt you can paste into ChatGPT, Claude, or any assistant. No more chopping up thoughts into typed paragraphs. Preserve constraints, examples, edge cases, and tone by speaking them once. The result is faster iteration, more precise outputs, and less time re-prompting. Try Wispr Flow for AI or see a 30-second demo.

*This is sponsored content

Revenge of the Nerds

Aakanksha Chowdhery, Adjunct Professor at Stanford & Researcher at Reflection

Aakanksha Chowdhery is a researcher working at the intersection of large language models, reinforcement learning, and agentic systems. She is currently a researcher at Reflection and an Adjunct Professor at Stanford University, where she focuses on building and teaching systems that can reason, plan, and improve themselves over long horizons. Her work centers on the post-training phase of AI development, where models are aligned, evaluated, and shaped into reliable tools for real-world use.

Before her current role, Chowdhery led and trained some of the largest language models ever built at Google, including serving as the technical lead for the 540B-parameter PaLM model and contributing to Gemini, PaLM-E, MedPaLM, and Pathways. Earlier in her career, she held research leadership roles at Microsoft Research and Princeton University. She holds a PhD in Electrical Engineering from Stanford University, where her research earned the Paul Baran Marconi Young Scholar Award, and has received multiple Outstanding Paper Awards at MLSys for her contributions to machine learning systems research.

What surprised you the most once the largest models actually started working?

When training the largest models, what fascinated me most was that some capabilities emerge at scale that would have been hard to predict. For example, midway through training PaLM, we discovered that the model performed extremely well on a reasoning benchmark called BIG-bench. This was about three years ago, which in AI feels like ages. It represented almost a step change compared to previous models. Over the last two years, we’ve seen tremendous progress in reasoning, but these capabilities often first appear in the largest models before appearing in smaller ones.

Post-training has become a significant factor in recent progress. How do you see its current state, and what’s driving its importance right now?

Historically, pre-training consumed most of the compute. Post-training was initially used to align models with human preferences, especially for chatbots. A pre-trained model produces probabilistic outputs, and post-training helps make it more confident and better aligned with human intent.

More recently, in domains like code generation and math—where outcomes are verifiable—we’ve seen reinforcement learning with verifiable rewards become very important. This increases the probability of correct answers. Post-training has served different purposes over time: alignment, safety, and usefulness.
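
To make that concrete, here is a minimal sketch of reinforcement learning with a verifiable reward on a math-style task. The model API and helper names are hypothetical, and real pipelines use more sophisticated optimizers; treat this as an illustration rather than any lab's actual setup.

    # Minimal sketch of RL with a verifiable reward (hypothetical model API).
    # The verifier checks an exact, machine-checkable outcome, e.g. a math answer.

    def verifiable_reward(completion: str, expected_answer: str) -> float:
        """Return 1.0 if the final line contains the ground-truth answer, else 0.0."""
        final_line = completion.strip().splitlines()[-1] if completion.strip() else ""
        return 1.0 if expected_answer in final_line else 0.0

    def post_training_step(model, prompt: str, expected_answer: str, num_samples: int = 8):
        """Sample several completions and score them with the verifier.

        A policy-gradient update (PPO, GRPO, etc.) would then upweight the
        high-reward samples; that optimizer step is omitted here.
        """
        samples = [model.sample(prompt) for _ in range(num_samples)]  # hypothetical API
        rewards = [verifiable_reward(s, expected_answer) for s in samples]
        return list(zip(samples, rewards))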

As we move toward more agentic systems, scaling reinforcement learning is increasingly important for driving reasoning capabilities. This shift has just started to happen with reasoning-focused models.

Agents are everywhere right now. What’s the least hyped, most accurate definition of a real agent?

The concept of an agent existed long before LLMs. In its purest form, an agent has a goal, interacts with an environment, receives feedback, and takes actions to make progress toward that goal.

In the LLM world, early versions of agents look more like structured workflows: multi-step processes where the model fetches information, combines context from different sources, and produces an output. That’s still a form of agency, but a constrained one.
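
To ground that definition, here is a minimal sketch of the classic agent loop: a goal, an environment, feedback, and actions chosen until the goal is reached or the step budget runs out. The policy and environment objects are hypothetical stand-ins, not a real framework API.

    # Minimal agent loop: observe, act toward a goal, use feedback, repeat.
    # `policy` and `environment` are hypothetical stand-ins for illustration.

    def run_agent(policy, environment, goal, max_steps: int = 20):
        state, feedback = environment.observe(), None
        for _ in range(max_steps):
            action = policy.choose_action(state, goal, feedback)  # decide the next step
            feedback = environment.step(action)                   # act and collect feedback
            state = environment.observe()                         # refresh context
            if environment.goal_reached(goal):
                return state                                      # success within budget
        return state  # budget exhausted; no guarantee of progress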

Long context is easy to market. Why is long-form reasoning still so hard for today’s models?

A big part of this comes down to benchmarks. Long-context benchmarks today mostly focus on retrieval tasks, like ‘needle-in-a-haystack’ problems, which models are very good at.

Long-form reasoning is different. For example, analyzing a character’s personality across multiple chapters of a story requires tracking actions and interactions and drawing abstract conclusions. That’s much harder than retrieval. These reasoning-intensive tasks are challenging for current model architectures.

Where do current agents fail first when asked to reason multiple steps ahead?

There are two main failure modes. First is context understanding: whether the agent can correctly abstract from past states and actions to decide what to do next. For example, a coding agent might look at the wrong set of files and fail to realize it should explore others.

Second, models are trained with next-token prediction objectives rather than incentives to think ahead. They tend to lock into a trajectory rather than evaluate future consequences. Reinforcement learning helps here, but next-token prediction scales easily, which is why it’s been dominant.
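
One rough way to picture the "locking into a trajectory" point: greedy decoding commits to the locally most likely token, while even a small lookahead scores a few candidates several tokens ahead before committing. The model interface below is hypothetical, and the rollout score is a crude stand-in for a learned value or reward signal.

    import math

    # Contrast between greedy next-token selection and a small lookahead.
    # `model.next_token_distribution(context)` is a hypothetical API returning
    # a dict mapping candidate tokens to probabilities.

    def greedy_step(model, context):
        # Pick the locally most likely token; no notion of future consequences.
        dist = model.next_token_distribution(context)
        return max(dist, key=dist.get)

    def rollout_score(model, context, horizon: int):
        # Greedily roll out `horizon` tokens and score the continuation by its
        # average log-probability: a crude proxy for evaluating where a choice leads.
        total, ctx = 0.0, list(context)
        for _ in range(horizon):
            dist = model.next_token_distribution(ctx)
            token = max(dist, key=dist.get)
            total += math.log(dist[token])
            ctx.append(token)
        return total / horizon

    def lookahead_step(model, context, candidates: int = 5, horizon: int = 10):
        # Consider a few top candidates and commit to the one whose rollout looks best.
        dist = model.next_token_distribution(context)
        top = sorted(dist, key=dist.get, reverse=True)[:candidates]
        return max(top, key=lambda tok: rollout_score(model, list(context) + [tok], horizon))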

Agents often get stuck in loops. Why does this happen, and what would a real fix require?

Looping occurs because, given the same context, the model repeatedly selects the same high-probability actions. To fix this, agents need the ability to self-correct.

That requires two things: feedback that signals lack of progress toward the goal, and awareness of alternative actions. Technically, this comes down to better planning and multi-step reasoning, supported by meaningful environmental feedback. Without feedback, there’s no signal telling the agent it’s stuck.
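
As a sketch of what "feedback that signals lack of progress" could look like in practice: track recent states and a goal-progress score, and push the agent toward an alternative action when both stall. The agent and environment interfaces here are hypothetical.

    from collections import deque

    # Sketch of loop detection via progress feedback (hypothetical interfaces).
    # If a recent state repeats and the progress score has stopped improving,
    # ask the agent to explore a lower-probability alternative action.

    def run_with_loop_breaking(agent, environment, goal, max_steps: int = 50, window: int = 5):
        recent_states = deque(maxlen=window)
        best_progress = float("-inf")
        state = environment.observe()
        for _ in range(max_steps):
            stuck = state in recent_states and environment.progress(goal) <= best_progress
            action = agent.act(state, goal, explore=stuck)  # try an alternative when stuck
            environment.step(action)
            recent_states.append(state)
            state = environment.observe()
            best_progress = max(best_progress, environment.progress(goal))
            if environment.goal_reached(goal):
                break
        return state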

You’ve said hallucination isn’t just a bug. When does it become a feature rather than a failure?

In factual or reliability-critical systems, hallucination is clearly a bug. But in creative writing, brainstorming, or idea exploration, hallucination can be a feature.

When models generate unexpected perspectives, simulate different personalities, or help brainstorm ideas, they act as a force multiplier. They can introduce viewpoints or approaches that an individual might not consider on their own. In those contexts, hallucination adds value rather than causing harm.

What capability would need to exist before you’d seriously consider calling a system an AGI?

Using a conservative definition, I’d say systems need to reliably complete long-horizon, multi-step tasks that are economically valuable. That requires strong planning, multi-step reasoning, and reliability.

Today’s systems can support software engineering, but they often miss steps or introduce unnecessary changes. They’re like half-reliable interns. For true trust, systems need verification, feedback, and correction mechanisms—either built into the model or into the surrounding system. We’re not there yet.

What kind of researchers are you looking for at Reflection, and what makes the work different from what they’d do at a closed lab?

We're looking for people who want to work on hard problems around post-training, reinforcement learning, and reasoning systems at scale. The technical work isn't that different from what you'd do at other frontier labs: you're training large models, running RL experiments, and building evals. The difference is that the models are open, meaning anyone can use, modify, and build on them.

We're also small enough that individual researchers shape the direction. If you want to test a new approach to credit assignment or long-horizon reasoning, you can actually do it without waiting for five layers of approval. My work at Stanford has given me perspective on this. Students want their work to push the field forward, but they also want it to matter beyond a single company's API. That's what open models enable.

Outperform the competition.

Business is hard. And sometimes you don’t really have the necessary tools to be great at your job. Well, Open Source CEO is here to change that.

  • Tools & resources, including playbooks, databases, courses, and more.

  • Deep dives on famous visionary leaders.

  • Interviews with entrepreneurs and playbook breakdowns.

Are you ready to see what it’s all about?

*This is sponsored content

Meme Of The Day

Quick Poll

Do you think AI agents today are actually 'intelligent,' or just advanced workflows?

Rate This Edition

What did you think of today's email?
