The Real AI Cost War
Plus: AI psychosis study, GPT-5.5 superapp, and AI stock call.
Here’s what’s on our plate today:
• 🧪 The messy economics of running AI at scale.
• 📰 Chatbot psychosis gap, OpenAI's superapp, the best AI stock pick.
• 💡 Roko's Pro Tip: cut tokens, not deals, to tame AI costs.
• 🗳️ Poll: Who wins the inference cost war?
Let’s dive in. No floaties needed…

Meet your next hire in as little as five days.
Need high-quality tech talent without the high-cost spiral?
Athyna helps you hire vetted LATAM professionals, matched with AI-assisted precision and reviewed by humans who know what good looks like.
From AI engineers and data experts to marketers and designers, you can meet strong candidates in as little as five days. We’ll also support onboarding and HR logistics, so the process stays simple and the momentum stays yours.
*This is sponsored content

The Laboratory
TL;DR
• Inference now dominates AI spending: According to Deloitte, inference workloads account for roughly two-thirds of all AI compute, up from one-third in 2023, with infrastructure spending on inference expected to hit $20.6B in 2026.
• Price wars mask a paradox: OpenAI slashed O3 pricing by 80% and Anthropic cut Opus costs by 67% between generations, yet total spending keeps climbing because usage outpaces every price cut.
• Google is betting on specialized silicon: Through partnerships with Marvell Technology, Broadcom, and MediaTek, Google is building inference-specific chips, signaling that serving models is a fundamentally different hardware problem than training them.
• Custom chips are outpacing GPUs: The custom AI ASIC market is projected to grow 45% in 2026, compared with 16% for GPU shipments, with every major hyperscaler now building its own silicon.
• Value capture remains unsettled: If inference-optimized hardware dominates, the resulting economy could recreate familiar dependency structures where value accrues unevenly across designers, manufacturers, and the AI labs running workloads.
Understanding the messy economics of running AI at scale
When OpenAI helped ignite the modern artificial intelligence race, the industry’s early obsession was with model capability, focusing on how powerful these systems could become and how quickly they could improve. As more players entered the field, the conversation grew more technical, exploring how models are built, how different companies approach training, and what factors like data quality, parameter count, and fine-tuning mean in practice.
Over time, however, as competing models began to converge in performance and the gaps between them narrowed, capability alone ceased to be the defining differentiator. The focus has since shifted to a more practical question: what it costs to run, deploy, and scale these models for both enterprises and everyday users.
The shift matters because training a model is a concentrated, upfront expense, even if models are periodically retrained and updated; inference, the process of serving that model to users every time they ask a question, generate code, or run an automated workflow, is a cost that recurs with every single query.
One of the clearest signs of this shift is where the money is going. According to Deloitte, inference workloads now account for roughly two-thirds of all AI compute, up sharply from about one-third in 2023 and half in 2025. AI cloud infrastructure spending is expected to reach $37.5B in 2026, with more than half of that, around $20.6B, going toward inference. The data clearly shows that for the first time, running models is set to cost more than training them. These estimates vary across providers and workloads, but the directional shift is consistent across industry forecasts.
When this data is applied to the scale at which AI is growing, the economics become hard to ignore. Some industry estimates suggest that for every $1B spent training a model, organizations could spend $15B to $20B on inference over its lifetime, depending on usage intensity and deployment scale. Take GPT-4, for instance: the model reportedly cost more than $100M to train, but its cumulative inference costs had already been estimated at around $2.3B by the end of 2024.
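To make that ratio concrete, here is a rough back-of-envelope sketch. Every number in it is an illustrative placeholder rather than a real price sheet or usage figure; the point is only to show how per-query costs compound at scale.

```python
# Back-of-envelope: how lifetime inference spend can dwarf a one-off training bill.
# All inputs below are illustrative placeholders, not real prices or volumes.

training_cost = 100e6        # one-off training spend, $100M (illustrative)
price_per_m_tokens = 2.00    # blended $ per 1M tokens served (illustrative)
tokens_per_query = 2_000     # prompt + completion per request (illustrative)
queries_per_day = 500e6      # daily request volume (illustrative)
lifetime_days = 365 * 2      # model served for two years (illustrative)

lifetime_inference = (
    queries_per_day * lifetime_days * tokens_per_query / 1e6 * price_per_m_tokens
)

print(f"Lifetime inference spend: ${lifetime_inference / 1e9:.1f}B")
print(f"Inference-to-training ratio: {lifetime_inference / training_cost:.0f}x")
```

With these made-up inputs the model racks up roughly $1.5B in serving costs, around 15x its training bill, which is the shape of the ratio the estimates above describe.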
And as this trend accelerates, with more complex, agent-driven workflows requiring multiple model calls to complete a single task, inference is already becoming, and is set to remain, the largest and most persistent expense in AI, and its impact is being felt across the entire stack.
However, what the data does not fully capture is that this is not just a cost problem; it points to deeper structural shifts, where the biggest challenge is no longer training models, but delivering that intelligence efficiently at scale.
For now, the shift can be understood by looking at model providers that are responding to the challenge of delivering intelligence at a reasonable price to their customers.
What emerges is not a single battleground but a fragmented one, where pricing models, infrastructure design, and hardware strategy are all contested simultaneously.
How the model providers responded
The model companies have responded to this shift in two ways: by cutting prices and by optimizing their service infrastructure.
OpenAI moved first, and most aggressively, on pricing: in June 2025, the company cut the price of its O3 reasoning model by 80%, bringing it from $10 per million input tokens down to $2. The company attributed the reduction to optimization of the inference stack that serves the model, not to any changes to the model itself.
Independent benchmarks from the ARC Prize confirmed that performance remained identical. The move signaled a broader strategy: aggressively reducing unit costs to expand usage, even if it compresses margins in the short term.
Since then, OpenAI has continued expanding its model lineup downward in price, with GPT-5.4 Nano now priced at $0.20 per million input tokens.
Anthropic, meanwhile, took a different path. Rather than racing to the bottom on per-token pricing, the company cut the cost of its flagship Opus model by 67% between generations, dropping from $15/$75 per million input/output tokens on Opus 4.1 to $5/$25 on Opus 4.5 and subsequent versions.
At the same time, Anthropic restructured its enterprise billing, moving away from flat-rate subscriptions toward usage-based consumption commitments. The shift reflects a recognition that inference volume, not subscription headcounts, drives the real cost of running AI at scale.
Both approaches point to the same underlying reality. The unit cost of AI intelligence is falling, but total spending keeps rising because usage grows faster than prices drop. The companies that manage this paradox best, by building the most efficient inference infrastructure, will have the most sustainable business models.
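The arithmetic behind that paradox is simple, as the toy calculation below shows with made-up numbers: an 80% price cut still roughly doubles the bill if usage grows tenfold over the same period.

```python
# Toy numbers only: unit price falls 80% while usage grows 10x.
baseline_spend = 1.0
new_spend = baseline_spend * (1 - 0.80) * 10  # price effect x usage effect
print(f"Spend relative to baseline: {new_spend:.1f}x")  # -> 2.0x
```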
However, despite changes in their pricing and cost structures, neither company has publicly answered the question of how low prices can go. If inference costs keep falling by 50-80% with each generation, pricing could eventually approach zero, pushing companies toward business models that charge for outcomes rather than tokens.
Google’s hardware bet: bifurcating the chip
Against this backdrop, Google is reportedly looking at a whole new approach to managing costs. The company is in talks with Marvell Technology to develop two new chips. One is a memory-processing unit designed to complement Google’s existing Tensor Processing Units, while the other is a new TPU tailored for inference workloads.
The Marvell discussions are the latest move in a broader strategy. Google already works with Broadcom on the core TPU architecture, and it has a partnership with MediaTek on cost-sensitive TPU variants manufactured by TSMC. Adding Marvell as a fourth design partner, specifically for inference-optimized silicon, suggests that Google sees inference as fundamentally different from training.
This points to a broader shift from the approach that defined AI hardware until recently, when the same chips, largely NVIDIA GPUs, were used for both training and inference workloads. Google has already begun moving in a different direction with its seventh-generation TPU, Ironwood, which was designed with inference as its primary focus, delivering 4,614 TFLOPs of compute alongside 192 GB of high-bandwidth memory and positioned by the company as infrastructure for the “age of inference.”
The reported discussions with Marvell Technology extend this direction further by introducing a more granular level of specialization within the hardware stack. A dedicated memory processing unit working alongside TPUs would directly target memory bandwidth constraints that limit inference throughput. At the same time, a separate inference-focused TPU would deepen the level of optimization applied to serving workloads. Taken together, these developments point toward an AI hardware ecosystem that is increasingly segmented, with distinct silicon classes emerging to handle distinct parts of the model lifecycle, each optimized for its own performance and efficiency requirements.
The counter-argument
However, even as AI labs continue to invest in reducing inference costs through advances in model architecture and hardware, not everyone agrees that a split between training and inference hardware is inevitable.
Deloitte has argued that the chip market will be ‘both-and’ rather than ‘either-or,’ because training costs continue to rise toward the billion-dollar level and the GPUs best suited for that work remain necessary. NVIDIA still holds more than 80% of the training market share, and its CUDA software ecosystem keeps developers locked in. An inference-only chip that cannot train models at all is a bet that the world will always need both types, and that neither will cannibalize the other. An alternative view is that sufficiently flexible architectures could continue to handle both workloads, especially if software-level optimizations narrow the efficiency gap between training and inference.
Besides, there are coordination risks, as managing four silicon design partners and an in-house team is complex. MediaTek’s TPU work has already experienced design changes that complicated production schedules. And some observers argue that the current price war among model providers is a land grab, with companies subsidizing compute costs to lock in developers before the market matures.
There is also an uncomfortable question at the center of this bifurcation thesis: who ultimately captures the value if inference-optimized chips become dominant? Google designs them, Marvell Technology and TSMC manufacture them, and Anthropic runs workloads on them, so the resulting inference economy could end up recreating familiar dependency structures, where value accrues unevenly across the stack despite efforts to vertically integrate.
The new front line
The broader direction is already coming into focus through the numbers. The custom ASIC market for AI is projected to grow by 45% in 2026, far outpacing the 16% growth expected in GPU shipments, as companies increasingly look beyond general-purpose hardware for efficiency gains. Broadcom offers a clear signal of this momentum, with its AI revenue reaching $8.4B in the most recent quarter, up 106% year over year, alongside an ambitious target of $100B in AI chip revenue by 2027. At the same time, every major hyperscaler is investing in its own silicon stack, with Amazon developing Trainium and Microsoft building Maia, both aimed at gaining tighter control over performance and cost in large-scale deployments.
Whether Google and Marvell Technology ultimately move from discussions to fully realized chips matters less than the underlying shift that is already underway. Inference has become the primary center of spending, and it is increasingly where competitive advantage will be determined.
For AI labs competing to serve enterprise demand, success now depends on delivering models that can be run efficiently at scale, where cost and speed matter as much as capability. In that environment, hardware has moved from a supporting role to a central lever in shaping the viability and competitiveness of AI systems.


Roko Pro Tip
💡 If your AI costs are climbing, don’t negotiate the price; cut the tokens. Prompt caching, shorter context windows, and routing simple tasks to cheaper models beat any vendor discount.
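For the routing part of that tip, here is a minimal Python sketch. The model names, prices, and the token heuristic are hypothetical placeholders, not any provider's real SDK or price list; the point is simply cheap-by-default routing with a premium escape hatch.

```python
# Minimal sketch of cost-aware routing: send short, simple requests to a cheap
# model and reserve the expensive one for long or complex prompts.
# Model names and prices are placeholders, not real price sheets.

CHEAP = {"name": "small-model", "usd_per_m_input_tokens": 0.20}
PREMIUM = {"name": "flagship-model", "usd_per_m_input_tokens": 5.00}

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def pick_model(prompt: str, needs_reasoning: bool = False) -> dict:
    # Route to the premium model only when the task is long or flagged as
    # needing deeper reasoning; everything else goes to the cheap tier.
    if needs_reasoning or estimate_tokens(prompt) > 2_000:
        return PREMIUM
    return CHEAP

def estimated_cost(prompt: str, model: dict) -> float:
    # Input-token cost only; output tokens would be priced separately.
    return estimate_tokens(prompt) / 1_000_000 * model["usd_per_m_input_tokens"]

if __name__ == "__main__":
    prompt = "Summarize this ticket in one sentence: printer offline again."
    model = pick_model(prompt)
    print(model["name"], f"${estimated_cost(prompt, model):.8f}")
```

Even a heuristic this crude keeps routine traffic off the expensive tier, which is where most of the savings in the tip come from.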

The context to prepare for tomorrow, today.
Memorandum merges global headlines, expert commentary, and startup innovations into a single, time-saving digest built for forward-thinking professionals.
Rather than sifting through an endless feed, you get curated content that captures the pulse of the tech world—from Silicon Valley to emerging international hubs. Track upcoming trends, significant funding rounds, and high-level shifts across key sectors, all in one place.
Keep your finger on tomorrow’s possibilities with Memorandum’s concise, impactful coverage.
*This is sponsored content

Monday Poll
🗳️ Inference is eating AI's budget. Who wins the cost war?

Bite-Sized Brains
• The chatbot psychosis gap: A new study finds that some chatbots are more likely to trigger AI-induced psychosis than others.
• OpenAI's GPT-5.5 superapp: OpenAI is rolling GPT-5.5 into a broader superapp strategy for ChatGPT.
• AI stock pick of the year: Yahoo Finance makes its call for the best-performing AI stock, a reminder that the real winners may not be the labs themselves.
Meme Of The Day

The Toolkit
• Regie.ai: AI sales agent that researches prospects, writes personalized outbound at scale, and handles follow-ups so reps can focus on closing.
• Replit AI: Browser-based coding environment with an AI agent that builds, debugs, and deploys apps from a prompt, no local setup required.
• Sourcegraph: Code intelligence platform with Cody, an AI assistant that understands your entire codebase so it can answer questions and write code that actually fits.

Rate This Edition
What did you think of today's email?





