Paywalls Strike Back

Plus: UK demands Apple’s encryption key, and Cartken swaps streets for factories.

Here’s what’s on our plate today:

  • 📰 How news sites slammed the gates shut.

  • 🛠️ 3 paywall-smart tools to test tonight.

  • 🔐 UK wants an iPhone backdoor and Cartken’s factory-bot pivot.

  • 💡 Lock crawlers in seconds.

Let’s dive in. No floaties needed…

Win over your customers with Zoho CRM.

Customer experience is the pulse of every successful business. Enhance yours with Zoho CRM, a solution built to create impactful customer journeys. Its innovative features and AI-driven capabilities enrich data and simplify tasks for your sales, marketing, and service teams.

With 20 years at the forefront of the SaaS industry, we've empowered businesses globally, streamlining workflows, boosting engagement, and driving conversions.

Explore Zoho CRM and transform the way you work!

*This is sponsored content

The Laboratory

Why are paywalls taking over online news?

If you are an avid news reader, chances are you have run into interesting articles on the internet hidden behind a paywall. That used to be true of only some articles: historically, most news was free to read, with publishers deploying paywalls selectively to attract subscribers and boost revenue. But over the past few months, there hardly seems to be a piece of news that is not hidden behind a paywall. So what changed?

In November 2022, OpenAI released its chatbot ChatGPT. The chatbot was built on a model from the GPT‑3.5 series, which OpenAI fine-tuned specifically for conversational dialogue. While OpenAI did not publicly disclose the exact composition of GPT‑3.5's training data, official statements and research papers suggested the model was trained on publicly available and licensed text. This sparked a debate over whether AI companies should pay news publishers when publicly available articles are used to train LLMs. And the road has been turbulent ever since.

Publishers put up defences against AI scraping

By 2023, news publishers had realised they needed fences around their content to keep out crawlers deployed by AI companies. These crawlers ingest large swaths of data for training Large Language Models, and publishers argued they were infringing copyright by reproducing material without consent or proper credit. The result was a slew of lawsuits filed by news publishers against AI companies.

One of the most reported cases is that of The New York Times, which sued OpenAI (and also Microsoft) in December 2023 over alleged copyright infringement related to training their language models.

Other publishers followed suit: News Corp’s Dow Jones and the New York Post filed similar lawsuits against Perplexity AI in October 2024. The case drew extra attention because Perplexity was once touted as a potential replacement for Google Search, yet found itself facing plagiarism accusations from multiple sites, including Forbes. An investigation by Wired also alleged that Perplexity AI was freely copying online content from other prominent news sites.

Despite the allegations and lawsuits, online publishers faced a binary choice: leave the front door open, or build walls that keep out both AI crawlers and readers who have yet to subscribe. Most publishers, it would appear, opted for the latter.

AI companies push back: claiming fair use

OpenAI called the NYT’s lawsuit “without merit,” emphasizing that training on publicly available content qualifies as fair use. It pointed out that any single source, the NYT included, does not significantly influence its models. The ChatGPT-maker also argued that the NYT used contrived, cherry-picked prompts to force ChatGPT to regurgitate content verbatim, behavior it says is not representative of typical model use.

In response to allegations that its models were regurgitating verbatim content from news articles, OpenAI acknowledged the issue, describing it as a rare bug it was actively working to resolve rather than a systemic problem.

Nearly a year after launching ChatGPT, OpenAI also announced that website owners could block its web crawler from accessing their data. The company maintained, however, that AI models need access to substantial amounts of data to learn and solve new problems, and that data from the internet falls under fair-use rules, which allow the repurposing of copyrighted works.

Perplexity AI took a similar stand, framing itself as an aggregator, not a plagiarist. Both companies signaled they were willing to engage in business or licensing agreements to address publisher concerns. However, while the lawsuits played out in courts, more and more publishers moved their content behind paywalls to stop crawlers from scraping their content.

Cloudflare’s pay-per-crawl could be a viable alternative

While the tussle between publishers and AI companies continues to simmer, internet infrastructure provider Cloudflare announced it will block known AI web crawlers by default to prevent them from “accessing content without permission or compensation.” The company also said domain owners could choose to allow AI scrapers, and that it would launch a “pay per crawl” scheme letting publishers charge a fee for access instead of imposing blanket bans.
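Cloudflare has described pay per crawl as reviving the long-dormant HTTP 402 "Payment Required" status code. The sketch below is an illustrative Python model of how such an exchange could work; the header names and pricing logic are assumptions made for this example, not Cloudflare's actual API.

```python
# Toy model of a pay-per-crawl exchange over HTTP 402.
# Both header names below are hypothetical, chosen for this sketch only.

PRICE_HEADER = "crawler-price"      # server advertises its per-request fee
OFFER_HEADER = "crawler-max-price"  # crawler states what it is willing to pay

def handle_crawl_request(headers: dict, price_usd: float = 0.01):
    """Origin-side decision: serve the page only if the crawler's offer covers the fee."""
    offer = headers.get(OFFER_HEADER)
    if offer is not None and float(offer) >= price_usd:
        # Offer accepted: return the content and (in a real system) record the charge.
        return 200, {}, "<html>article body</html>"
    # No offer, or offer too low: 402 with the asking price attached.
    return 402, {PRICE_HEADER: f"{price_usd:.2f}"}, None
```

A crawler that receives the 402 can retry with an offer in the payment header, or walk away; either way, the publisher refuses unpaid access instead of silently serving the page.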

Cloudflare began blocking AI crawlers as early as 2023, but that measure only covered crawlers that respected a website's robots.txt file, a voluntary and unenforceable protocol indicating whether bots may access content. According to a Reuters report, however, some AI companies were circumventing that very standard to scrape publisher content for generative AI systems. This could further inflame tensions between publishers and AI companies, potentially stalling licensing deals.

The future of online news hangs in the balance

The rise of generative AI has prompted a fundamental reevaluation of what it means to publish online. Once predicated on open access and advertising revenue, much of the digital news economy is now retreating behind increasingly fortified paywalls. The transformation has been swift and, to some degree, defensive.

For years, publishers tolerated web scraping by aggregators or research tools as a cost of participating in the open web. But the scale and ambition of today’s AI models, trained on trillions of words, with commercial applications across search, enterprise, and creative markets, have changed the calculus.

What used to be background activity is now viewed as appropriation at an industrial scale. At the same time, AI companies have insisted that their actions fall under fair use, especially when trained on publicly available data. This philosophical and legal mismatch has led to a steady increase in litigation, with The New York Times, News Corp, and others challenging the assumption that what is online is free for all to use.

Caught in the crossfire are readers, who increasingly encounter paywalls restricting their access. Though frustrating for casual browsers, this development reflects deeper structural conflicts between publishers and AI companies. Publishers are demanding fair value for their labor and intellectual property. AI companies are hungry for data to keep their models competitive. Without negotiated agreements, walls go up, and users lose access in the process.

And while Cloudflare’s “pay per crawl” model offers one possible path forward, its implementation remains a challenge, especially when some AI firms ignore robots.txt conventions or rely on third-party datasets to sidestep publisher restrictions.

If AI companies continue to treat the open web as an unrestricted training ground, publishers may respond with more aggressive legal and technical countermeasures. But if new licensing standards emerge, ones that honor both innovation and content creator rights, there may be a viable middle ground.

Bite-Sized Brains

Roko Pro Tip

💡 Two clicks to stop AI scrapers:

User-agent: GPTBot
Disallow: /

Add these lines to your robots.txt, then confirm the block in Cloudflare’s crawler logs.
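Because robots.txt is purely voluntary, the rule only bites when the client actually checks it. Python's standard urllib.robotparser performs the same check a compliant crawler would, so you can verify the tip locally (the example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse the two-line rule from the tip; a compliant crawler runs an
# equivalent check against your site's /robots.txt before fetching pages.
rp = RobotFileParser()
rp.parse(["User-agent: GPTBot", "Disallow: /"])

print(rp.can_fetch("GPTBot", "https://example.com/article"))  # False: GPTBot is blocked
print(rp.can_fetch("OtherBot", "https://example.com/"))       # True: no rule matches it
```

Note what this demonstrates: a crawler that never runs this check is simply not constrained by the file, which is why Cloudflare pairs robots.txt with network-level blocking.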

Simplify training with AI-generated video guides.

Are you tired of repeating the same instructions to your team? Guidde revolutionizes how you document and share processes with AI-powered how-to videos.

  • Instant creation: Turn complex tasks into stunning step-by-step video guides in seconds.

  • Fully automated: Capture workflows with a browser extension that generates visuals, voiceovers, and call-to-actions.

  • Seamless sharing: Share or embed guides anywhere effortlessly.

*This is sponsored content

Prompt of the Day

Prompt: “Summarize this paywalled article in 8 bullet points, flag any speculative claims, and suggest 3 follow-up sources.”

Treats to Try

  • Paywall Project — browser-side pay-per-article wallet; skip subscriptions, pay cents.

  • Waymark AI — drops legal citations beside every AI-generated snippet.

  • Wayback Machine — browse archived snapshots of web pages, or save a page as it appears today.

Tuesday Poll

🗳 Which future for online news feels right to you?

Login or Subscribe to participate in polls.

Rate this edition

What did you think of today's email?
