
The Plan to Kill Web Scraping Is Coming From Inside the House

By K. Denise Washington, Editor-in-Chief | June 27, 2026 | 5 min read

The IETF, the internet's own standards body, is debating proposals to cryptographically authenticate bots. It sounds like security, but it's a plan to build tollbooths on the open web.

Scraping public data is the internet’s dirty, necessary secret. It’s the engine behind Google Search, the memory of the Internet Archive, and the tool journalists use to find out if you’re being overcharged. It has always been a cat-and-mouse game, but the field of play was the open web itself. Now, the battlefield is moving. The fight is over the protocols, deep in the plumbing, at the Internet Engineering Task Force. The very body that codified the open web is now considering proposals to wall it off, turning a public square into a series of private clubs with bouncers at the door.

The IETF’s job is to write the neutral rulebook, not pick winners. Yet two proposals threaten to do just that. According to the Electronic Frontier Foundation, the first comes from the AI Preferences working group, which wants to upgrade the simple `robots.txt` file, historically a polite suggestion for crawlers, into a machine-readable edict against using a site’s content for AI training. The unstated goal is for this signal to become legally binding, a tripwire for lawsuits. The second proposal is more direct. Another working group, called Web Bot Auth, aims to create a standard for bots to carry cryptographic IDs. On paper, it’s about stopping malicious scrapers. In practice, it gives any website owner the power to instantly block any bot that hasn't been pre-authenticated, effectively requiring a permission slip to access public data.
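To make the difference between the two mechanisms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the text of either proposal: the bot names, the `REGISTERED_BOTS` registry, and the helper functions are all hypothetical, and a shared-secret HMAC stands in for the public-key signatures a real bot-authentication scheme would likely use. The first half shows `robots.txt` as it works today, where the crawler checks the file and decides for itself whether to comply. The second half shows a gate the server enforces, where an unregistered bot is refused no matter how politely it behaves.

```python
import hashlib
import hmac
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is asked to stay out,
# everyone else is welcome.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def robots_allows(user_agent: str, url: str) -> bool:
    """Advisory check. The crawler runs this itself and may ignore the answer."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical registry of bots the site has agreed to admit.
# Assumption: a shared secret plus HMAC is a simplified stand-in for
# the cryptographic IDs described above.
REGISTERED_BOTS = {"licensed-bot": b"demo-shared-secret"}


def sign_request(bot_id: str, path: str) -> str:
    """What a registered bot would attach to its request (illustrative)."""
    key = REGISTERED_BOTS[bot_id]
    return hmac.new(key, f"{bot_id}:{path}".encode(), hashlib.sha256).hexdigest()


def gate_allows(bot_id: str, path: str, signature: str) -> bool:
    """Enforced check. The server refuses any bot it has not pre-registered."""
    key = REGISTERED_BOTS.get(bot_id)
    if key is None:
        return False  # unknown bot: blocked regardless of intent
    expected = hmac.new(key, f"{bot_id}:{path}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

In this sketch a university crawler calling `robots_allows("ResearchCrawler", "https://example.com/story")` is told it may proceed, while the same crawler presenting itself to `gate_allows` is turned away because it never appears in the registry. That gap between a request and a lock is what the rest of this argument turns on.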

The players pushing this are the ones you’d expect: large publishers and tech platforms. They see their public content being scraped to train large language models and want to turn that data into a licensable asset. An authenticated web is a monetized web. If you can cryptographically block any crawler you don't recognize, you can charge the ones you do for access. The winners are incumbents who can afford to pay for data and the platforms that get to sell it. The losers are everyone else. Researchers investigating algorithmic bias, digital archivists saving our culture, and any startup hoping to build a better search tool will find themselves locked out, forced to pay exorbitant fees or shut down. The open commons of the web becomes a pay-to-play data market.

None of this will happen overnight. It will be a gradual closing of the gates. Within a couple of years of adoption, key data sources will start requiring authenticated crawlers. University research projects and non-profit watchdogs will find their tools inexplicably failing. The cost of innovation will skyrocket; building the next great data analysis tool will require a legal team and a licensing budget before you write a single line of code. This entrenches the current AI giants, who are the only ones who can afford the tolls. The argument is framed as protecting sites from the resource drain of AI crawlers. But the mechanism creates a tiered internet, where access to public information is determined by your ability to pay. The real question is no longer about stopping bad bots. It’s whether information, once posted for the world to see, is truly public at all.
