AI's Data Drought

Plus: AI cracks medieval secrets, Anthropic's $1T sprint, SK Hynix joins the $1T club.

Roko's Basilisk
June 01, 2026

Here’s what’s on our plate today:

🧪 How data shortage is reshaping artificial intelligence.
📰 AI cracks medieval secrets, Anthropic's $1T sprint, SK Hynix joins the $1T club.
💡 Roko's Pro Tip: Your operational data is a moat; treat it like one.
🗳️ Poll: What's the cleanest fix for AI's data shortage?

Let’s dive in. No floaties needed…

The Laboratory

TL;DR

The wall is real: High-quality public training data could be functionally exhausted between 2026 and 2032, and leading labs are already hitting the ceiling.
Synthetic data is a trap: Models trained heavily on AI-generated content degrade over time, becoming generic and error-prone. Researchers call it model collapse, and even small contamination can trigger it.
Meta went into surveillance mode: To train agents on real-world computer use, Meta monitored employees' keystrokes, clicks, and screenshots across Slack, GitHub, and Google Docs. European employees were excluded because GDPR required consent. American ones weren’t so lucky.
Everyone else is scrambling too: Anthropic quietly updated policies to retain user conversations for training. OpenAI and Handshake AI asked contractors to upload internal company documents, outsourcing the legal risk to them.
The stakes: Whoever controls deep, proprietary behavioral data inside hospitals, law firms, and financial institutions will hold the real leverage in AI’s next phase. Compute defined the first wave. The second will be shaped by something harder to replicate: detailed knowledge of how humans actually make decisions, handle exceptions, and get work done.

How data shortage is reshaping artificial intelligence

Every technological boom eventually discovers the thing it cannot outgrow. For the oil industry, it was the easy-to-reach reserves buried beneath the earth. For chipmakers, it was the physical limits of shrinking transistors. And now, artificial intelligence, after years of seemingly limitless expansion, is beginning to confront a bottleneck that feels strangely mundane for such a futuristic technology: it is running out of human-created data to learn from.

The next AI gold rush is not for compute or models, but for proprietary data that reveals how humans actually work. Photo Credit: Scoopanalytics.

Large language models (LLMs, the AI systems behind chatbots and agents) are trained on enormous volumes of human-generated material, including books, articles, forums, code, and research papers. For years, the supply of that data appeared effectively limitless, allowing models to grow larger and more capable with every training cycle. However, the sheer volume of data needed to train a capable AI model is quickly outpacing the rate at which it can be generated.

According to Epoch AI, the stock of high-quality public text available online could be functionally exhausted for frontier AI training sometime between 2026 and 2032. And for the most advanced systems, the constraint may already be here. The models that dominate today were trained on much of the internet’s usable knowledge, and the next generation is being forced to look elsewhere.

To fill this gap, AI labs are being forced to get inventive, and in some cases that inventiveness is becoming uncomfortable.

The wall every lab is running into

As of 2026, the obvious answer to AI’s data problem is to let models generate more data for themselves. Synthetic data (text, code, images, and examples produced by AI systems rather than humans) is cheap, effectively infinite, and increasingly convincing. But the industry’s growing dependence on it has exposed a deeper problem in how the technology works and progresses.

When models are trained too heavily on AI-generated material, they begin to lose contact with the messy, uneven, highly specific patterns that make human knowledge useful in the first place. Researchers call this model collapse, a failure mode in which systems become progressively more generic, repetitive, and error-prone as they recursively learn from their own outputs rather than from reality.

A 2024 paper published in Nature by a team of researchers including Ilia Shumailov, a former Google DeepMind scientist, found that this degradation compounds over time. Later research further suggested that even small amounts of synthetic contamination in training datasets can trigger the effect.
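To make the compounding effect concrete, here is a minimal toy sketch in Python. It is not the experiment from the Nature paper; it assumes a deliberately simple setup in which each "model" is just a Gaussian fitted to a small sample drawn from the previous generation's model. The function name, sample size, and generation count are illustrative choices.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation.
    This is a toy stand-in for recursive training, not a real LLM pipeline.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the original "human" distribution
    history = [std]
    for _ in range(generations):
        # "Train" the next model only on the previous model's outputs.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"spread at generation 0:   {history[0]:.3f}")
print(f"spread at generation 200: {history[-1]:.3g}")
```

Because each generation sees only a finite sample of the one before it, rare values in the tails tend to go missing and the fitted spread drifts toward zero. That is the toy version of models becoming "progressively more generic."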

The result is that clean, human-generated data is becoming scarce. The shortage is compounded by growing pushback against using existing data to train models that may eventually compete economically with the very humans who produced that data in the first place.

As AI models grew more powerful between 2010 and 2023, their appetite for training data exploded, pushing the industry toward a looming shortage of high-quality human-generated information. Photo Credit: MIT FutureTech.

This tension has led millions of websites to block AI crawlers, while publishers, archives, and academic repositories that once served as open libraries for training data are increasingly restricting access or taking legal action against AI firms for unauthorized scraping. The open internet that powered the modern AI boom is slowly closing behind the companies that already used it.
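One common way sites signal that AI crawlers are unwelcome is a robots.txt rule, though it is not necessarily the mechanism behind every block described above. The sketch below uses Python's standard urllib.robotparser to show how a well-behaved crawler would read such a rule. The robots.txt content and the example.com URL are made up for illustration; GPTBot is the user-agent token OpenAI publishes for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/1"
ai_crawler_allowed = parser.can_fetch("GPTBot", url)
other_agent_allowed = parser.can_fetch("SomeBrowser", url)

print("GPTBot allowed:", ai_crawler_allowed)
print("Other agent allowed:", other_agent_allowed)
```

Note that robots.txt is a request, not an enforcement mechanism. It only works when the crawler chooses to honor it, which is part of why some publishers have turned to access restrictions and lawsuits instead.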

The situation gets even trickier because what frontier labs need now is not simply more text, but a different category of information altogether: behavioral data that captures how humans actually perform tasks in the real world. This matters because the next generation of AI systems is being designed less as conversational assistants and more as autonomous agents that can navigate software, coordinate workflows, and complete office tasks across multiple applications. Training those systems requires observing not what people say, but what they do.

Meta’s solution: watch how we work

The predicament facing AI labs is best understood by looking at Meta. The company behind some of the most influential social media platforms is hunting for alternative data sources, and that search is the rationale behind its Model Capability Initiative.

Under the initiative, Meta began installing monitoring software on U.S. employees’ work computers that records keystrokes, mouse movements, clicks, and periodic screenshots across applications such as Google Docs, Slack, LinkedIn, and GitHub. Internally, the purpose was framed as teaching AI agents ‘how people actually use computers,’ including the small procedural behaviors that humans perform instinctively but that current AI systems routinely fail at, such as navigating menus, chaining actions across applications, and relying on shortcuts and workarounds accumulated through experience.

This category of information, known in research circles as computer-use trajectory data, is extraordinarily scarce precisely because it captures the sequential process of human work rather than the final output alone.

A 2025 paper from Shanghai Jiao Tong University identified the lack of high-quality trajectory data as one of the main reasons AI agents still struggle with practical computer tasks, and showed that even a relatively small number of carefully collected human examples could significantly improve model performance.
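For a sense of what computer-use trajectory data might look like, here is a minimal Python sketch of one recorded task. Every class and field name is hypothetical, as is the example task itself. This is not Meta's format or the schema from the Shanghai Jiao Tong paper, just one plausible way to store an ordered sequence of actions rather than only the final output.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Step:
    """One observed user action. All field names are hypothetical."""
    timestamp_ms: int
    app: str
    action: str  # e.g. "click", "type", "shortcut"
    target: str
    text: Optional[str] = None


@dataclass
class Trajectory:
    """An ordered record of how a person completed one task."""
    task: str
    steps: List[Step] = field(default_factory=list)

    def apps_used(self) -> List[str]:
        """Apps in order of first use, showing cross-application chaining."""
        seen: List[str] = []
        for step in self.steps:
            if step.app not in seen:
                seen.append(step.app)
        return seen


trajectory = Trajectory(
    task="Share the latest build status with the team",
    steps=[
        Step(0, "GitHub", "click", "Actions tab"),
        Step(1800, "GitHub", "shortcut", "address bar", text="copy URL"),
        Step(4200, "Slack", "click", "#releases channel"),
        Step(5100, "Slack", "type", "message box", text="Build is green: <link>"),
    ],
)

print(trajectory.apps_used())
print(json.dumps(asdict(trajectory), indent=2))
```

The point of the structure is the ordering: the same four actions shuffled would describe a different (and probably failing) workflow, and that ordering is exactly what a finished document or a sent message does not preserve.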

With this in mind, Meta’s urgency makes a lot more sense. The company has openly acknowledged weaknesses in its flagship AI model, Muse Spark, on long-horizon agentic tasks, which require reliably executing extended sequences of actions without losing context or making compounding mistakes. Solving that problem is central to the industry’s next commercial ambition: AI systems capable of automating meaningful portions of white-collar work.

Currently, there is no large public dataset that teaches models how professionals actually operate across spreadsheets, browsers, messaging apps, and enterprise software, so companies are beginning to build such datasets themselves, often based on their employees’ behavior.

However, while this mode of data collection may work for companies looking to train the next generation of AI models, it does not suit the workers being tracked.

The legal boundaries around this emerging data race are already revealing sharp geographic differences. Meta’s monitoring program reportedly excluded European employees because the EU’s General Data Protection Regulation prohibits workplace surveillance without explicit consent. In the United States, where no equivalent federal privacy framework exists, the same practices face far fewer constraints.

Everyone else is scrambling too

Meta’s attempts to track its employees’ workflows may work for Meta, but they do not solve the problem of data scarcity for the broader AI industry. Other AI labs are responding to the problem with varying degrees of creativity.

In 2025, Anthropic revised its consumer policies so that conversations and coding sessions from many Claude users could be retained and used for model training unless users explicitly opted out. The change was notable because privacy protections around user conversations had previously been one of Anthropic’s clearest differentiators. Around the same time, reports described how OpenAI and data-labeling firm Handshake AI were asking contractors to upload real-world materials, including spreadsheets, presentations, code repositories, and internal documents, to help train systems for white-collar automation tasks.

The legal and ethical risks of that approach were immediately obvious, particularly because determining what counts as confidential information was effectively being outsourced to contractors themselves.

These strategies differ in method and in risk, but they are all responses to the same underlying reality: the open internet no longer contains enough high-quality data to sustain the industry’s ambitions on its own.

The side door that this opens

While AI labs continue to look for alternatives, the scarcity has created a second-order effect that could reshape where AI power accumulates next. If frontier models increasingly depend on rare behavioral, operational, and domain-specific data, then organizations sitting on those datasets, like hospitals with patient interaction histories, manufacturers with operational workflows, law firms with decades of case management records, and financial institutions with transaction behavior, suddenly possess something AI labs cannot easily reproduce.

In the first phase of the AI race, scale alone created advantage. Compute, capital, and access to the public internet determined who pulled ahead. In the next phase, proprietary data may matter just as much.

The companies best positioned for that transition may not necessarily be the largest technology firms, but the organizations that combine deep, highly specific datasets with the operational ability to safely integrate them into AI systems. The value now lies less in raw abundance and more in information that is difficult to imitate, difficult to acquire, and grounded in how real-world work actually happens.

The next scarcity

The first phase of the AI boom was built on abundance: plentiful data, plentiful compute, and an open internet that companies could train on freely. That era is beginning to close, and the next phase of AI will be shaped by scarcity, specifically the scarcity of high-quality human behavioral data that cannot be easily scraped, synthesized, or replicated.

And just as the scarcity of powerful AI chips changed the fortunes of companies like NVIDIA, the next phase of AI could reshape the fortunes of companies that have the deepest understanding of how humans actually work, and the ability to use that data without destroying the trust required to collect it in the first place.

Roko's Pro Tip

💡

If your company sits on years of proprietary operational data, stop treating it like backend overhead. It’s becoming one of the scarcest assets in AI; label it, structure it, and lock down who can train on it before someone else writes that contract for you.
