GPUs As A Utility

Plus: Alphabet's record $85B raise, Microsoft's agent push, EU's sovereignty play.

Roko's Basilisk
June 05, 2026

Here’s what’s on our plate today:

🧪 The companies making GPUs behave like utilities.
📰 Alphabet's record $85B raise, Microsoft's agent push, and the EU's sovereignty play.
🛠️ Weekend To-Do: deploy a serverless GPU job, compare hyperscalers, benchmark inference.
🗳️ Poll: Who wins the serverless GPU layer?

Let’s dive in. No floaties needed…

The Laboratory

TL;DR

Cold starts, hot problem: Spinning up a GPU for AI inference could take more than 2,000 seconds. Modal Labs says it cut that to roughly 50 seconds through four compounding tricks: pre-warmed GPU pools, lazy-loading filesystems, CPU process snapshots, and capturing GPU memory state mid-run.
Inference is where the money went: Inference is estimated to account for roughly two-thirds of all AI compute in 2026, and the inference chip market alone is projected to exceed $50B this year. Training got the attention; inference got the bill.
Hyperscalers are watching and shipping: Google, Microsoft, and AWS all launched serverless GPU products in the past year. The pattern across containers and serverless functions suggests that incumbents can quickly close technical gaps once the approach is documented.
The stakes: The companies that make AI applications actually work at scale may end up owning the most critical layer of the AI economy without their names ever appearing on a single product.

The companies making GPUs behave like utilities

Every time someone types a question into an AI chatbot, a chain of events begins on hardware they will never see. Somewhere in a data center, a GPU must be available, loaded with the correct model, and ready to respond. If that GPU is already running, the interaction feels instantaneous. If it isn’t, things become more complicated.

Starting a GPU for AI workloads is less like flipping a light switch and more like cold-starting a diesel engine in winter: slow, resource-intensive, and prone to delays if the underlying systems are not carefully maintained.

That challenge may sound mundane, but it is becoming one of the most important engineering problems in the AI economy. On May 12, 2026, New York-based infrastructure company Modal Labs published a detailed account of how it reduced GPU startup times from more than 2,000 seconds to roughly 50 seconds. The company describes the approach as “serverless GPU” computing, a model that lets developers run AI applications without managing the underlying hardware and pay only for the seconds they actually use.

Written by Modal CEO Erik Bernhardsson and three engineers, the post brings together five years of infrastructure work into a single narrative. In effect, it is a guide to making GPUs behave less like specialized machines and more like utilities: available on demand, precisely metered, and largely invisible to users.
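
To make "precisely metered" concrete, here is a rough back-of-the-envelope sketch in Python. The hourly rate and the two busy hours per day are invented for illustration; they are not Modal's prices or any customer's real usage. The only point is how per-second billing changes the arithmetic compared with keeping a GPU running around the clock.

```python
SECONDS_PER_HOUR = 3600

def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping one GPU reserved for the whole period, busy or not."""
    return hourly_rate * hours

def per_second_cost(hourly_rate: float, busy_seconds: float) -> float:
    """Cost when billing covers only the seconds the GPU actually works."""
    return hourly_rate / SECONDS_PER_HOUR * busy_seconds

# Hypothetical inputs, chosen only to make the comparison easy to follow.
HOURLY_RATE = 3.60               # assumed USD per GPU-hour
BUSY_SECONDS = 2 * 3600          # assume the GPU does 2 hours of work per day

reserved = always_on_cost(HOURLY_RATE, hours=24)
metered = per_second_cost(HOURLY_RATE, BUSY_SECONDS)

print(f"always-on: ${reserved:.2f} per day")
print(f"metered:   ${metered:.2f} per day")
print(f"ratio:     {reserved / metered:.1f}x")
```

Under these assumptions the gap is simply the ratio of hours in the day to hours of real work. The sketch ignores any price premium a serverless provider might charge per second, which would narrow the difference.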

The startup problem nobody talks about

The disclosure matters because it offers a rare window into a layer of the AI economy that receives relatively little attention. Models attract headlines, and chips drive geopolitical debates, but another part of the stack determines whether AI systems actually work at scale.

That part is the infrastructure connecting models to users, and it decides whether an application responds instantly or leaves people waiting. Only a small group of companies is building it, and Modal’s account shows both how the layer operates and which problems in the AI supply chain remain unresolved.

To understand what Modal solved, it helps to think about what happens when an AI application gets a sudden surge of users. Say a company has deployed a language model to handle customer service inquiries. On a normal day, it handles a few hundred requests per minute. Then a product goes viral, or a news event drives traffic, and suddenly it needs to handle 10,000 requests per minute, and it needs that capacity immediately.

When traffic spikes, the clock starts ticking

Modal helps to manage this surge by maintaining a buffer of pre-checked, idle GPUs shared across customers, so new capacity doesn’t require requesting fresh machines.

The company built a custom system that loads only the software components an application actually needs and saves the state of running processes so new instances can skip most of the startup work. It also developed a way to capture and restore the GPU’s state, including loaded models and memory, allowing AI workloads to resume almost instantly. That final step is particularly difficult because NVIDIA’s software was never designed to be paused and restarted this way.

By doing just that, Modal claims to have cut startup times roughly 40-fold. For the person typing into a chatbot or the enterprise running an AI-powered document processor, this is the difference between a responsive application and one that buckles under load.
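
For readers who think in code, the warm-pool idea and the 40x figure can be sketched in a few lines of Python. This is a toy model, not Modal's implementation: the `WarmPool` class and the GPU names are hypothetical, and it collapses all four optimizations into a single "fast path" number. The two timings are the rounded figures quoted in Modal's post.

```python
from collections import deque
from dataclasses import dataclass, field

COLD_START_S = 2000.0  # "more than 2k seconds" before the optimizations
FAST_START_S = 50.0    # "roughly 50" seconds after them

@dataclass
class WarmPool:
    """Hypothetical buffer of idle, pre-checked GPUs shared across customers."""
    idle: deque = field(default_factory=deque)

    def acquire(self) -> tuple[str, float]:
        """Hand out one GPU and the startup time the caller should expect."""
        if self.idle:
            # A pre-warmed machine is waiting: take the fast path.
            return self.idle.popleft(), FAST_START_S
        # Buffer exhausted: fall back to requesting a fresh machine.
        return "fresh-gpu", COLD_START_S

def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new startup path is."""
    return before_s / after_s

pool = WarmPool(deque(["gpu-0", "gpu-1"]))
startups = [pool.acquire() for _ in range(3)]  # third request drains the buffer

for gpu, seconds in startups:
    print(f"{gpu}: ready in about {seconds:.0f}s")
print(f"speedup: {speedup(COLD_START_S, FAST_START_S):.0f}x")
```

The third request shows why the size of the buffer matters: once the idle GPUs run out, a new replica pays the full cold-start cost again.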

The shift from training to serving

To understand the importance of the layer within which Modal operates, it helps to look at the larger change unfolding in the AI industry. For years, the biggest expense in AI was training models, the process of feeding vast amounts of data into neural networks so they could learn. Training is expensive, but it happens only occasionally. Once a model is built, the real challenge begins: serving it to users.

Inference, the process of actually running that model to generate outputs for users, is different: it happens continuously and at scale. A popular model may handle millions of requests each day, with each interaction consuming compute resources and GPU time. As AI adoption grows, the industry’s attention is shifting from building models to operating them efficiently.

Deloitte estimated that inference workloads would account for roughly two-thirds of all AI compute by 2026, up from about half in 2025 and roughly one-third in 2023. The extent of demand can be gauged by looking at the market for inference-optimized chips, which by itself is expected to exceed $50B in 2026.

The shift from training models to actually running them has triggered a surge in funding for companies building inference infrastructure. Baseten raised $300M in January 2026 at a $5B valuation, with NVIDIA contributing $150M. Fireworks AI secured $250M at a $4B valuation. Inferact, the company commercializing the open-source vLLM inference engine, raised $150M in seed funding.

Modal itself raised $87M in September 2025 at a $1.1B valuation and was reported by TechCrunch to be in talks for a new round at roughly $2.5B, with annualized revenue of approximately $50M.

These investments suggest that capital markets now place a high value on the machinery that lets other people’s models run at scale, even if the companies building it remain largely invisible to end users.

The hyperscalers notice, and they ship

Even the major cloud providers have taken note and begun looking for ways to compete with this emerging layer of companies.

Google Cloud Run reached general availability for GPU-attached serverless containers in June 2025, offering NVIDIA L4 GPUs with scale-to-zero billing and startup times of approximately five seconds. Microsoft Azure launched serverless GPUs in Azure Container Apps, supporting NVIDIA A100 and T4 chips with per-second billing and scale-to-zero capability. AWS introduced Lambda Managed Instances at re:Invent in December 2025, bringing EC2-backed compute, including GPU-capable instance types, under Lambda’s management model for the first time.

That interest benefits end users, but it creates a familiar challenge for startups: after spending years identifying and solving difficult infrastructure problems, they find themselves competing against cloud giants with far greater resources and distribution.

Cloud computing has followed a familiar pattern for years. Startups identify a problem, build a solution, and gain traction, only for hyperscalers such as AWS, Google Cloud, and Azure to eventually introduce competing offerings integrated into their broader platforms. The question for Modal and its peers is whether GPU infrastructure will follow the same path, or whether the technology’s complexity gives independent companies more room to establish themselves.

Modal’s decision to publish its engineering approach in detail suggests it believes the latter. The blog post explicitly states that the company considers secrecy a poor competitive strategy and that broader knowledge of efficient GPU usage expands the overall market. That confidence could be well-placed. It could also underestimate how quickly well-resourced incumbents can close a technical gap once the approach is documented.

Modal’s customer list offers a glimpse of how deeply this infrastructure is already embedded in the AI economy. Customers include Lovable, Ramp, Substack, Harvey AI, Mistral, and Suno, yet most users will never know that infrastructure providers like Modal are helping keep those products responsive behind the scenes.

Invisible by design, essential by necessity

In many ways, that invisibility is the goal. Infrastructure is most successful when users never have to think about it. The electrical grid, internet routing systems, and the software that keeps much of the web running all fade into the background when they work as intended. The companies building AI’s inference layer are pursuing the same outcome: making GPUs available on demand while hiding the complexity required to make that possible.

Whether Modal, Baseten, or any of their competitors become durable independent companies or get absorbed into the hyperscalers’ expanding platforms remains an open question.

Their technical advantages are real, but maintaining them will require constant investment as NVIDIA, cloud providers, and open-source projects develop competing capabilities. At the same time, the rapid flow of capital into the sector raises questions about whether valuations are being driven by current revenue or by expectations of future growth. What seems less uncertain is the demand for the infrastructure itself.

As AI becomes embedded across businesses and consumer applications, the need to make GPUs available on demand and at scale will only grow. The bigger question is who ultimately controls that layer of the AI economy.

Headlines You Actually Need

Alphabet's record $85B raise: Alphabet pulled off a record-breaking $85B raise for Google's AI business, a strong signal that investors are still betting heavily on the AI buildout.
Microsoft's agent push: Microsoft used Build to double down on AI agents, sharpening its competition with OpenAI even as the two remain deeply intertwined.
EU's sovereignty play: The EU unveiled a technological sovereignty package to accelerate the development of homegrown AI chips and reduce its dependence on US and Asian cloud infrastructure.
