Crawlers At The Gate

Plus: Practical bot controls, licensing options, and today’s quick hits.

Here’s what’s on our plate today:

  • 🕸️ AI scrapers flood the web—and site owners pay the price.

  • 🔒 Should AI bots be blocked, billed, or left alone?

  • 🧃 Three fast ways to filter out the bad bots.

  • 🧵 Quick Bits on Chrome x Perplexity, Nvidia’s Cosmos, and Datumo’s raise.

Let’s dive in. No floaties needed…


The Laboratory

How AI web scraping is breaking the open internet

Within the span of a few years, artificial intelligence has metamorphosed from a futuristic milestone to a household topic of discussion. What we see today in the form of ChatGPT, DALL-E, Gemini, and others is not the creation of a few individuals working in artificially lit rooms, but the product of decades of hard work, innovation, and collaboration between different organizations and individuals.

Let me elaborate. Contemporary AI models work by learning patterns from massive amounts of existing text or data and then using those patterns to predict and generate responses. So where does this data come from? From decades of human-created work.

This data is best represented on the internet, where information has been systematically created, stored, and disseminated. And AI companies have found interesting ways to use this existing data to train their models. But now, the creators and conservators of this data are feeling the pinch, not just from how AI is affecting the creation of new data, but also from how AI data scrapers are affecting their existing data repositories: their websites.

Why AI needs so much data, and where it comes from

The market for AI technologies is vast, amounting to around USD 244 billion in 2025, and is expected to grow to over USD 800 billion by 2030. As the market continues to evolve and grow, AI companies are working on improving existing models and training new ones. All this requires tremendous amounts of data, and companies are relying on web scraping to get it.

But AI-powered bots deployed by companies are different from web scrapers that have been around since the beginning of the internet. AI bots are sophisticated, tireless, and can ingest huge amounts of data at unimaginable speeds.

For instance, Bytespider, operated by TikTok’s parent company ByteDance, leads in request volume and is used to gather training data for large language models. It is followed closely by OpenAI’s GPTBot and Anthropic’s ClaudeBot, which are among the most active AI crawlers on the web.

OpenAI, Google, and Meta each run their own AI web crawlers to gather public data for training their models. GPTBot, OpenAI’s official web crawler, is designed to collect publicly available content to improve ChatGPT and related services. Google, meanwhile, uses CloudVertexBot, part of its Vertex AI platform. Meta has launched an ‘ExternalAgent’ crawler to scrape public web data for training LLaMA and other AI systems.

AI bots now dominate web traffic

One of the fastest-growing problems on the internet, egged on by AI, is the rise of non-human users. According to Imperva’s 2024 Bad Bot Report, nearly half of all internet traffic is already non-human, with ‘bad bots’ the fastest-growing slice.

Bots existed long before AI reached the public domain, but their impact was limited. Now, however, bots created using AI tools reportedly account for around one in every eight visits to websites (about 12.5%), compared to 8% for Google’s bots.

This means that web traffic is spiking, not because more people are using the internet or visiting a website, but because more bots are scraping websites for the data required to train future AI models.

On the surface, this does not look so bad: all these bots are doing is collecting data. But their impact on small websites, their owners, and users can be devastating.

The hidden cost of AI crawlers for websites

For website owners, visits from AI bots can result in huge spikes in traffic, something they might not have been prepared to deal with.

Even if they plan for traffic spikes, visits from AI crawlers can rapidly push up costs and run roughshod over the fair use levels agreed with hosting providers, hurting reliability and inflating the bill that site owners have to foot.

In one striking case, Anthropic’s crawler hammered iFixit’s websites almost a million times in 24 hours, seemingly violating the repair company’s Terms of Use in the process. The company, in its terms of use policy, states that “reproducing, copying or distributing any content from the website is strictly prohibited without the express prior written permission from the company, with specific inclusion of training a machine learning or AI model.”
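To put that figure in perspective, here is a rough back-of-the-envelope sketch of the sustained load it implies. It treats “almost a million” as a full million, so the result is an upper bound, and the function name is purely illustrative.

```python
def average_requests_per_second(total_requests: int, hours: float) -> float:
    """Average request rate if the traffic were spread evenly over the period."""
    return total_requests / (hours * 3600)


# Assumption: "almost a million" is rounded up to 1,000,000 requests.
rate = average_requests_per_second(1_000_000, 24)
print(f"{rate:.1f} requests per second")  # prints "11.6 requests per second"
```

That is roughly a dozen requests every second, around the clock, from a single crawler. Real crawler traffic tends to arrive in bursts, so peak load could be well above this average.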

Companies operating small to mid-size sites also have to face the pressure of deploying tools that can block AI bots altogether. However, this path is not easy.

AI crawlers often ignore robots.txt, rotate IP addresses, or spoof user agents. Users on forums like Hacker News describe blocking them as a nightmare that requires complex rules and network-level defenses to keep bots at bay. These defensive investments, both technical and financial, are adding unexpected overhead to what were once lightweight websites.
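As a minimal sketch of the simplest kind of rule, the snippet below flags requests whose User-Agent header names one of the crawlers mentioned in this piece. The helper name, the token list, and the sample header strings are illustrative assumptions, not an official or complete list.

```python
# Hypothetical token list built from the crawler names mentioned in this article.
AI_CRAWLER_TOKENS = (
    "bytespider",
    "gptbot",
    "claudebot",
    "cloudvertexbot",
    "externalagent",
)


def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header names a known AI crawler.

    Case-insensitive substring match. It only catches bots that identify
    themselves honestly; a spoofed header passes straight through.
    """
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


# Illustrative header values, not copied from real traffic.
print(is_declared_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))        # True
print(is_declared_ai_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))   # False
```

Because the check trusts whatever the client sends, it does nothing against spoofed user agents or rotating IP addresses, which is why site owners end up reaching for rate limits and network-level defenses.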

For hobbyist bloggers or fledgling startups on limited budgets, this leads to poor user experience, higher hosting bills, or even forced migration to paid or more robust hosting, effectively penalizing grassroots web creators.

Open web crossroads: Block or cooperate?

As AI crawlers proliferate, a fundamental question arises: will the internet evolve into gated silos of curated content, or can a cooperative framework emerge that respects creators while fostering innovation? The choices made today will determine whether the web remains a vibrant commons or fractures into controlled enclaves.

Small publishers, news outlets, and open-source communities are already pushing back. News organizations are updating their robots.txt or taking stronger actions to block AI crawlers to protect copyrighted works and maintain editorial control.
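For a sense of what such an update looks like, here is a hedged sketch that uses Python’s standard-library robots.txt parser to check a set of rules before publishing them. The rules and the example.com URL are assumptions for illustration; the user-agent tokens are the crawler names mentioned earlier.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: turn away two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/some-story"  # placeholder URL
for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: it only works on crawlers that choose to honor it.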

Larger platforms like Reddit are also working to stop AI models from scraping user content without permission, prompting a highly public debate on ownership and user consent.

These responses reflect a growing wave of creators demanding agency over their content as AI firms push to harvest freely available web resources for model training.

In the open-source realm, public voices are raising alarms. Software authors like Drew DeVault argue that AI crawlers are problematic for the open internet, causing outages and surging costs for creators who receive no benefit. His call to action suggests that without cooperative norms, open web projects face an ‘extinction event’.

In sum, the web is at a crossroads: unrestricted AI scraping may lead to a privatized, paywalled web, while heavy-handed gatekeeping risks forfeiting relevance in AI-powered discovery.

Quick Bits, No Fluff


Thursday Poll

🗳️ What’s your stance on AI crawlers hitting your site?


3 Things Worth Trying
