What If AI Could Touch the Real World?

An interview with Jacky Mok, Head of Applied AI at Reka, on why the models that ace internet benchmarks still can’t act in a kitchen or on a factory floor. The lab, built by ex-DeepMind researchers, is now teaching machines to see, reason, and move through physical reality.

How Reka Is Building Intelligence for the Physical World With Jacky Mok

Welcome to Revenge of the Nerds. We’re skipping the hype and going straight to the builders. In this edition, we talked about:

  • Web data is running dry, and the next frontier is teaching models to interact with physical reality, not just text.

  • Reka bets on the full stack (models, inference, video, and training data) because each piece feeds the research lab at its core.

  • If physical AI gets solved, expect reliable on-device robots that keep working even when the connection drops.

Let’s dive in. No floaties needed…

Jacky Mok, Head of Applied AI at Reka

Jacky Mok leads Applied AI at Reka, the research lab founded in 2022 by ex-DeepMind, Google, and Meta researchers under CEO Dani Yogatama. His job is to bridge research and production, getting the lab's multimodal models to run in real products for real clients. Reka builds ultra-efficient models that understand video, image, text, and audio on a fraction of the compute its rivals burn through, and in July 2025, it raised $110M from backers including NVIDIA and Snowflake, pushing its valuation past $1B. Customers like Shutterstock and Turing Video already run on its vision platform.

What makes Reka worth watching is the bet underneath the funding. Snowflake tried to buy the company outright in 2024, but the talks collapsed, and both sides decided Reka was better off independent. Instead of chasing bigger text models, Reka turned toward video understanding, edge deployment, and the expensive problem of collecting real-world data from kitchens, factory floors, and farms.

Jacky's view is that the industry has nearly exhausted the internet's supply of training data, and the companies that teach models to interact with physical reality, not just describe it, will define the next decade.

What's the main difference between building for physical AI & building for the cloud?

Much of the LLM progress over the last two years has come from web data, and we've basically exhausted it. The models are smart and can reason generally, but they reason within the text space, and there's a huge gap in this new paradigm. That gap explains the rush into robotics, into models that can not only understand the world but also interact with it. Interaction is the main thing. You can have a model interact one-shot in the real world, but it doesn't do it very well.

Physical environments have many more parameters, especially in real time, where the model needs to understand the environment's physics. Even if you attach a frontier model to a robot, it might be too slow or too expensive. With physical AI, deployments are smaller, models run on edge devices, and inference looks different from inference on a cloud server optimized for throughput. That's why there's this rush to build embodied and physical AI models.

How does Reka reduce inference costs with Infer?

Infer is a newer initiative. As a research lab, we already lease a lot of GPUs for training, and Infer takes our specialization in inference, which is different from training, and applies it to deploy non-Reka models too. Once you take a raw GPU and raw model weights, there are margins to be had because the licensed API price includes markup. Even out of the box, our inference engine outperforms popular open-source models, though each model needs customization depending on the family.

We'll eventually focus on a few specific models, not every one. When we're not training, we run inference and optimize it, and that optimization also helps training, since RL environments need an optimized loop. We take a hybrid approach, pulling from open source too, to run these as fast as possible, and there's a margin in that.

What's the main benefit of processing video natively?

It's more portable. A full video solution includes orchestration and embeddings, which make it less portable. It's impossible for each model to optimize for every use case, so different models are good for different things. With orchestration, a pipeline of indexing, embedding, and deeper understanding over our API, we set the right parameters for models to perform efficiently.

For a physical AI deployment, we primarily share raw model weights, since the customer often already has the orchestration and only needs the model. For a media and entertainment company, we share our orchestration because they don't have the expertise to build it. It depends on each integration. Processing a large video library or aggregating multiple cameras requires more orchestration and plumbing to get the best outputs.
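To make the "indexing, embedding, and deeper understanding" pipeline concrete, here is a minimal sketch in Python. It is an illustration only, not Reka's API: every name (`index_video`, `embed`, `search`, `understand`, `VOCAB`) is hypothetical, the captions stand in for what a video model would produce, and the bag-of-words embedding is a toy replacement for a learned one. The point it shows is the orchestration idea: cheap steps run over the whole video, and the expensive step runs only on the segments that match.

```python
import math
from dataclasses import dataclass

# Hypothetical vocabulary for the toy embedding (an assumption for illustration).
VOCAB = ["knife", "pan", "soil", "plant", "desk", "keyboard"]


@dataclass
class Segment:
    start_s: float
    end_s: float
    caption: str  # stand-in for a model-generated description of the clip


def index_video(duration_s, segment_s, captions):
    """Indexing: cut the video into fixed-length segments, one caption each."""
    segments = []
    for i, caption in enumerate(captions):
        start = i * segment_s
        segments.append(Segment(start, min(start + segment_s, duration_s), caption))
    return segments


def embed(text):
    """Toy embedding: word counts over VOCAB. A real system would use a learned model."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(segments, query, top_k=1):
    """Embedding search: rank segments by similarity to the query."""
    q = embed(query)
    ranked = sorted(segments, key=lambda s: cosine(embed(s.caption), q), reverse=True)
    return ranked[:top_k]


def understand(segment):
    """Deeper understanding: placeholder for an expensive model call on one segment."""
    return f"{segment.start_s:.0f}-{segment.end_s:.0f}s: {segment.caption}"


segments = index_video(
    duration_s=30,
    segment_s=10,
    captions=[
        "person at desk typing on keyboard",
        "hands put plant in soil",
        "knife chopping next to pan",
    ],
)
best = search(segments, "plant soil")[0]
print(understand(best))
```

A customer who already has this plumbing would only swap in the model behind `embed` and `understand`, which is roughly the raw-weights case described above.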

What is Reka building with Claru & real-world data collection?

Claru is a relatively new engagement. This year, the theme has been physical AI, embodied AI, and world models, and there's a huge gap in getting sufficient data for these spaces. We already use these Claru workflows in-house because behind every big model run, we need a lot of data labeling and raw examples, and these models are hungry. The researchers figure out we need more of a certain type, like egocentric data, which is collected from a human's perspective, like a webcam on your head.

High-quality egocentric data is net new. There hasn't been a corpus from that perspective, and there's a massive need for it. Our cinematic video model is trained on films, so it produces high-fidelity video. However, films contain few examples of tool manipulation, so we need more egocentric data. Claru takes the lab's expertise and builds our own marketplace so others can use it too, and we keep private data separate from the data Reka uses.

Can't these models be trained on existing internet video instead?

You can, and a lot of our datasets include licensed internet data. But you often need very high-quality video in a specific domain. Say you want an example of someone gardening to express it in the model. Where can you find high-quality egocentric data for the model to train on?

If it's a Twitch stream or live-generated data, you need to filter out noise and clean it. Much of the work involves cleaning the data, and quality matters more than tagging, since we can auto-tag. The best way to get high-quality sources is often to pay someone to find the information and format it the way you want.

How do you handle privacy when processing video?

We use licensed datasets for training. For Claru, it's within the labeler's requirements to label a specific item and maybe keep other faces out of frame, and our legal team ensures each workflow and use case is validated.

Each Claru job has specific use cases, sometimes set by an end client specifying the criteria and parameters. It might be to keep certain things out of frame, to focus on a specific area, or just to show someone working at their desk. Some use cases have more concerns, like filming outdoors where other faces come up, but that depends on the use case.

Why is Reka betting on the full stack of models, inference, video, & training data?

Reka started with many ex-DeepMind researchers and a large general foundation model. I went to video because the clients most engaged with our model needed to annotate their images and videos. Inference and training data aren't directly related to the model, but they have good synergy with the lab. We're already using the training data, and for inference, we have in-house expertise running these models over our GPUs.

Most of Reka's resources still go to the lab and researchers defining the next model, something like a world model, and these other initiatives fuel the lab's sustainability. It's not random. It's what happens when researchers build a company. What are they good at? What can we monetize to perpetuate the research? Everything serves the research. The commercial side will focus on the vision platform, since that translates to physical AI down the line.

If Reka gets physical AI right, what does the world look like in five to ten years?

Solving physical AI means ushering in an era of embodied robots and a strong core model that deploys not just in the cloud but also inside a robot, optimized to run on a single machine. At that point, it becomes daily usage, the way we talk to a ChatGPT equivalent daily, except with robots acting as assistants in the physical space. We'll gradually see more autonomous robots, much like self-driving cars today, starting with businesses automating their workforce.

These robots will be a different class, since they won't fully depend on the cloud, though the cloud will still empower them. We'll see hybrid deployment: robots that are on-device and smart enough to operate without connectivity, which is the biggest reliability point. This relates to world models, models that understand physical reality, which is the next gap. That would be the North Star, and it'll be quite different and exciting.
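The hybrid deployment described above can be sketched as a simple fallback rule: use the cloud when it is reachable, but never depend on it. This is a minimal illustration under stated assumptions, not a description of how Reka or any robot stack works. The names `plan`, `local_planner`, `make_cloud_planner`, and `CloudUnavailable` are all hypothetical, and the planners return placeholder strings instead of real actions.

```python
class CloudUnavailable(Exception):
    """Raised by the hypothetical cloud planner when there is no connectivity."""


def local_planner(observation):
    # Stand-in for a small on-device model: always available.
    return f"local:{observation}"


def make_cloud_planner(online):
    """Build a fake cloud planner that fails whenever `online` is False."""
    def cloud_planner(observation):
        if not online:
            raise CloudUnavailable("no connection")
        # Stand-in for a larger model served remotely.
        return f"cloud:{observation}"
    return cloud_planner


def plan(observation, cloud_planner, on_device_planner=local_planner):
    """Prefer the cloud when reachable, fall back to the on-device model otherwise."""
    try:
        return cloud_planner(observation)
    except CloudUnavailable:
        return on_device_planner(observation)


print(plan("cup on table", make_cloud_planner(online=True)))
print(plan("cup on table", make_cloud_planner(online=False)))
```

The design choice worth noting is that the on-device path is the guarantee and the cloud path is the upgrade, which matches the reliability point made in the answer.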
