What If AI Could Actually See?

An interview with Andrew Dai, CEO of Elorian AI, on why the models that write grad-school prose still can’t count bottles on a shelf, and why the Google Brain veteran who co-wrote the playbook for language-first AI is now building visual reasoning from scratch.

How Elorian Is Building Visual AGI With Andrew Dai

Welcome to Revenge of the Nerds. We’re skipping the hype and going straight to the builders. In this edition, we talked about:

- Why frontier models can write a graduate thesis and code like a senior engineer, yet flunk visual tests that a three-year-old can solve.

- The structural reason text-first AI stalled on visual reasoning, and why letting models think in images (not translate them) is the unlock.

- Why building a visual-first model is a fundamentally different bet from anything happening inside Google, OpenAI, or Anthropic right now.

Let’s dive in. No floaties needed…

Revenge of the Nerds

Andrew Dai, Co-Founder & CEO of Elorian AI

Andrew Dai is the co-founder and CEO of Elorian AI, a Palo Alto research and product lab building toward visual AGI. Before Elorian, he spent nearly 14 years at Google, where his career coincided with the rise of modern language models. In 2015, he co-authored 'Semi-supervised Sequence Learning' with Quoc V. Le, the paper that introduced the pretrain-then-fine-tune approach now baked into every major LLM. He went on to co-lead GLaM (Google's first mixture-of-experts language model), co-lead pre-training for PaLM 2, and serve as data lead for Gemini. Elorian came out of stealth in April 2026 with a $55M seed round at a $300M post-money valuation, co-led by Striker Venture Partners, Menlo Ventures, and Altimeter Capital, with participation from NVIDIA, 49 Palms Ventures, and Jeff Dean. Co-founder Yinfei Yang joined from Apple, former Harvard professor Seth Neel rounds out the founding bench, and the team now comprises 12 researchers and engineers drawn from the world's leading labs.

The interesting tension in Dai's story is that he helped write the original recipe for language-first AI, and is now betting against it.

What are you building at Elorian & why now?

Elorian is a research and product lab. The ultimate goal is visual AGI, and we're getting there by advancing the frontier in visual reasoning and visual understanding. Current frontier models are still very far behind their text counterparts when it comes to visual understanding. They are not even at the level of a six-year-old, whereas text models are very advanced now. Adoption of AI is moving really quickly for software engineering, but for things outside pure text problems, it's moving very slowly. We want to solve that.

Where are these models failing at visual tasks?

A pretty obvious one is counting, both abstract things and real physical things. If you take a picture of a bar and ask how many bottles or glasses are on it, they'll usually be off, sometimes by three or four, sometimes by ten. Or if you ask which row a network switch is on, they can also do pretty badly, which is a problem if you're trying to give instructions to people.

Another category is puzzles and board games. I play board games quite a bit, I have a decent collection, and I'll take pictures of the board and ask very easy questions. Even something like 'what is the number under the yellow marker?' These models just can't do it. They can't understand these diagrams.

There's also a benchmark that came out a few months ago called BabyVision, where they compare these models with children. The results show models can't do things a six-year-old can do: finding similar things in an image, finding things that are different, navigating a maze, navigating spatial representations, even in 2D. These models are quite poor at it.

Why is the gap between visual & text reasoning so wide?

One reason, we believe, is that reasoning in these models was developed on text. This came about through papers like the Chain of Thought paper by Jason Wei. In the training data, math homework or English homework for example, people write out the steps they take to solve the problem. That's what a teacher wants to see: that you haven't cheated with a calculator. There's a lot of that data in pre-training, so if you prompt the model the right way, it does reasoning in this text space.

The same isn't happening in the visual space. There's very limited visual reasoning data for these really large models, so they're just not reasoning in the visual space much. Whereas if you think about humans and animals, most people think in the visual space. When you buy a sofa, you visualize how it looks in your living room. If an architect is designing a floor plan, they might need to visualize a wheelchair moving through all the gaps. These things are very hard to represent in text.

What changes when a model reasons directly in images?

The model needs to be able to natively generate and edit images as part of reasoning. That's what we mean by reasoning within the visual space, and it's something we believe humans do when we visualize a problem. Even when you visualize a complex coding problem, you imagine drawing the boxes and drawing lines between them to describe what it's doing. To do that efficiently, you have to do it in the same model. Doing it in one model means the latent representations are preserved, and the meaning is preserved, rather than being lost through these image-to-text mappings.

You co-wrote the recipe for language-first AI. What made you bet against it?

Initially, when I was developing language models, working on Gemini and PaLM 2, I always believed that blind people can do reasoning very well. They are as intelligent as anyone else, even people who were born blind. So it seemed to me there was a lot of hype back then around 'you have to have visual language models, you have to have embodied reasoning to really get to AGI,' but I was seeing text-based reasoning really take off with math and coding.

Then I saw that text-based reasoning was doing really well, but it was having almost no impact on visual reasoning. Visual reasoning is far, far behind text-based reasoning now. It's not even at the GPT-3 level. That made me think we need to try a different approach, a visual-first approach rather than a visual-second one.

If you think about the development of reasoning in the natural world, a lot of my research is inspired by Geoff Hinton, who very much believes we should develop AI following how the human brain works, how natural intelligence developed. Reasoning came before language because animals need to do pretty advanced reasoning. An eagle needs very advanced reasoning to catch a sparrow flying mid-air. It needs to reason about turbulence, drag, lift, where the sparrow is going, the speed, and the intercept point. That's quite advanced, and we're guessing it does most of that without language. It seems like we're missing this really big part of how reasoning developed.

How does building for the physical world change your approach?

What we've seen is that, especially for visual tasks, these current models hallucinate much more than text-based models. With the board game example, when I took a photo of 7 Wonders and asked it about the situation, it just hallucinated points, cards, and resources. That happens much more often during visual tasks.

We have some ideas about why that's happening, so we're going to build our model to really reduce hallucinations. We think that's a really big problem with using these models in practical ways. In the physical world, there are also other things you can do that help make models more reliable. You can use computer vision tools as a baseline. Computer vision has been around for a long time and has segmentation tools, detection tools, recognition tools, and you can use them together to make the actual final prediction more reliable.
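The cross-checking idea in this answer can be illustrated with a minimal sketch. It assumes a hypothetical classical detector has already produced labeled, scored detections for the photo, and it simply compares the detector's count with the count a multimodal model claimed. The `Detection` type, the function names, and the 0.5 score threshold are illustrative assumptions, not Elorian's method or any specific library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One object found by a (hypothetical) classical detector."""
    label: str
    score: float  # detector confidence in [0, 1]


def count_detections(detections, label, min_score=0.5):
    """Count detections of `label` whose confidence is at least `min_score`."""
    return sum(1 for d in detections if d.label == label and d.score >= min_score)


def cross_check_count(model_count, detections, label, tolerance=0, min_score=0.5):
    """Compare a model's claimed count with the detector's count.

    Returns both counts and whether they agree within `tolerance`,
    so a caller can flag the answer for review instead of trusting it.
    """
    detector_count = count_detections(detections, label, min_score)
    return {
        "model_count": model_count,
        "detector_count": detector_count,
        "agrees": abs(model_count - detector_count) <= tolerance,
    }


# Made-up detections for a photo of a bar: three confident bottles,
# one low-confidence bottle, and one glass.
detections = [
    Detection("bottle", 0.92),
    Detection("bottle", 0.81),
    Detection("bottle", 0.64),
    Detection("bottle", 0.30),
    Detection("glass", 0.88),
]

result = cross_check_count(model_count=7, detections=detections, label="bottle")
print(result)
```

Here the model's claim of seven bottles disagrees with the three confident detections, so the answer would be flagged rather than passed along. A real pipeline would also need to decide which source to trust when they disagree, which this sketch leaves open.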

After 14 years at Google, what made you leave?

Looking at the projects I worked on at Google: the first paper with Quoc, where we developed pre-training and fine-tuning, was just the two of us. Then I co-led the first mixture of experts, GLaM. That was a team of about 10 people. Then I co-led pre-training for PaLM 2, around 30 people. After that, I became the data lead for Gemini.

If you look at Gemini now, the decisions made there are much more conservative. Smaller steps are taken. It's more risk-averse, because the fate of a few thousand people and many millions of dollars of compute are on the line. They want to be more conservative with the approaches they try.

If we want to build a visual-first model, that's quite a different direction from the trend of current frontier labs. So it just made more sense to do that independently, where we can be 100% focused on this problem and build a model without distractions from other products or other priorities.

Will visual AI eat more compute than text AI?

It really depends on a few things. One is resolution. You can always do things at a lower resolution to save compute. You can't really do that with text. If you have a coding problem and your code base is a few thousand lines, you can't lower the resolution. You can use RAG and things like that, but it's a bit unreliable.

The other part is reasoning. Once we do reasoning in the visual space, that could actually save computation. Current reasoning chains are very long. Most are hidden from the user, but the reason these models take a long time to reason is that, behind the scenes, they're generating hundreds or thousands of tokens during reasoning. In the visual reasoning space, you don't really need to do so much of that. You're not moving words around; you're not searching your memory for where you've seen this word before. With visual reasoning, you just want to make edits to the image, like the sofa or the wheelchair example. So it's possible to save tokens during the reasoning stage. Ultimately, we're aiming to be similar in cost to text models after accounting for both of these, but we'll see. There are a lot of levers we can use to control costs.
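To make the resolution lever concrete, here is a rough back-of-the-envelope sketch. It assumes a ViT-style encoder that cuts an image into square patches and emits one token per patch; the 16-pixel patch size and the `image_tokens` helper are assumptions for illustration, and nothing in the interview says Elorian's model tokenizes images this way.

```python
import math


def image_tokens(width, height, patch=16):
    """Tokens for an image under an assumed one-token-per-patch scheme.

    Partial patches at the edges are rounded up, as if the image were padded.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


full = image_tokens(1024, 1024)  # 64 * 64 patches
half = image_tokens(512, 512)    # 32 * 32 patches

print(full)         # 4096
print(half)         # 1024
print(full / half)  # 4.0
```

Under that assumption, halving the resolution on each side cuts the token count by a factor of four, which is the kind of saving that has no direct equivalent for a few thousand lines of code.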

If Elorian cracks visual reasoning, what becomes possible?

If we solve visual reasoning, then everyone will be using AI for any kind of visual task. Whether it's engineers designing the next rocket engine or a new battery, or biologists running a new experiment and needing to understand whether it was successful, that's a visual problem. Fundamentally, I believe it'll move us to a world where AI reasoning is much more prevalent, everywhere, and we can advance technology much faster. The dream would be faster engines, cheaper batteries, batteries that hold more power, and just driving technology forward.
