Poisoning scraperbots with iocaine
Web sites are increasingly beset by AI scraperbots — a problem that we have written about before, and one that has slowly ramped up to the point of constituting an occasional de facto DDoS attack. This has not gone uncontested, however: web site operators from around the world have been working on inventive countermeasures. These solutions target the problem posed by scraperbots in different ways; iocaine, an MIT-licensed nonsense generator, is designed to make scraped text less useful by poisoning it with fake data. The hope is to make running scraperbots economically unviable, and thereby address the problem at its root instead of playing an eternal game of Whac-A-Mole.
The problem with scraperbots
There are plenty of good reasons to scrape the web: creating indexes for search engines, backing up old web pages before they go offline, and even scientific research. Scraping can be disruptive, however. It requires resources from the server operator, often more than normal browsing, and is sometimes in support of an effort that the server operator doesn't agree with. The difference between a well-behaved scraperbot and a problematic one is often simply whether it is respectful of the resources of the server.
The Common Crawl project, for example, seeks to crawl the web once, and then make that data available to share between multiple users. That way, server operators only spend the resources to serve their pages for scraping once, instead of once per scraper. Other well-behaved bots respect signals like robots.txt files, site maps, ETags, and Retry-After headers that politely request that robots follow certain rules. For example, LWN's robots.txt asks bots to not scrape the mailing-list archives, because the archives have grown quite large, and serving content from them is more expensive than a typical request to the server.
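The polite-request mechanisms mentioned above are simple, plain-text conventions. A hypothetical robots.txt in that spirit might look like the following; the path and delay here are invented for illustration and are not LWN's actual configuration (Crawl-delay, in particular, is a widely recognized but non-standard extension):

```
# Ask all crawlers to stay out of the expensive archive pages
User-agent: *
Disallow: /archives/

# Non-standard, but honored by some crawlers: seconds between requests
Crawl-delay: 10
```

Nothing enforces these rules; a crawler that ignores the file suffers no technical consequence, which is exactly the gap that the tools discussed below try to fill.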
With the invention of large language models (LLMs), text on the web suddenly has an economic value that it didn't previously, which leads to the temptation to ignore those polite requests. That, in turn, drives server operators to attempt to differentiate scraperbots from humans in order to enforce their chosen limits. Thus begins a game of cat and mouse between server operators coming up with new detection techniques and scraperbots trying to blend in. There is never a winner, but the losers are the independent web sites that cannot keep up with the race, along with their visitors. As with the email spam problem, centralization and scale both make it easier to detect and respond to trends in new attacks, which makes avoiding the scraperbots easier for larger sites.
What to do about it
There are a few possible types of response. For one, server operators could try to make serving all of their pages less expensive. LWN has done some of that in the form of cleaning up unnecessary database queries in our site code. So, that's one potentially good thing to come from the increase in scraperbots: users might see slightly faster page loads when the site isn't being effectively snowed under by bots.
Another possible solution is to differentiate bots from humans with some kind of costly signal that's hard to fake. This is how the increasingly prevalent Anubis tries to protect server resources: it requires first-time visitors to solve a proof-of-work problem in order to access the site. Other approaches in this vein include checking that a user agent implements the meta http-equiv attribute correctly, or checking that it can store and provide cookies.
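The core of the proof-of-work idea is that solving is expensive while verifying is cheap. Here is a minimal sketch, not Anubis's actual implementation: it uses Rust's standard (non-cryptographic) `DefaultHasher` as a stand-in for a real cryptographic hash such as SHA-256, and a difficulty chosen purely for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Required leading zero bits; real deployments tune this so that solving
/// costs noticeable CPU time while staying tolerable for a human visitor.
const DIFFICULTY: u32 = 12;

// Stand-in for a cryptographic hash; a real system would use SHA-256 or similar.
fn digest(challenge: &str, nonce: u64) -> u64 {
    let mut h = DefaultHasher::new();
    challenge.hash(&mut h);
    nonce.hash(&mut h);
    h.finish()
}

/// Client side: brute-force search for a nonce meeting the difficulty target.
fn solve(challenge: &str) -> u64 {
    (0..).find(|&n| digest(challenge, n).leading_zeros() >= DIFFICULTY).unwrap()
}

/// Server side: checking a claimed solution is a single hash, far cheaper
/// than finding one — that asymmetry is the whole point.
fn verify(challenge: &str, nonce: u64) -> bool {
    digest(challenge, nonce).leading_zeros() >= DIFFICULTY
}

fn main() {
    let challenge = "per-session-token";
    let nonce = solve(challenge);
    assert!(verify(challenge, nonce));
    println!("solved with nonce {nonce}");
}
```

The server hands out a fresh challenge per visitor, so a solution cannot be reused; at 12 bits of difficulty, solving takes a few thousand hash attempts on average.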
The problems with those approaches are twofold: failing to deter bots (which can run JavaScript just like everyone else), and putting an additional barrier in the path of users. Modern scraperbots use browsers — sometimes headless browsers, but sometimes actually rendering to a virtual screen — and route their connections through more-or-less legitimately acquired domestic IP addresses, just like humans. In this particular arena, the advantage seems to lie on the side of bots trying to mimic human traffic.
Iocaine
Unlike measures that seek to detect and block bots, iocaine is a last-ditch defense for after they have already made it to the web site. The request looks like it came from a human, and the server is going to have to spend some resources responding to it. There's still one attribute that separates bots from human readers, however: reading comprehension. A human presented with a page of obvious nonsense might click one or two links in confusion, but they're extremely unlikely to try to download an endless torrent of nonsense instead of wandering off to do something better with their time. A bot that is merely scanning for links and archiving text for later processing, on the other hand, will happily continue to download nonsense until it hits some kind of limit.
Iocaine is a Rust program dedicated to generating convincing-enough nonsense with a minimum expenditure of server resources. When it receives a request, it uses a Markov chain to quickly generate a random stream of words, with the occasional embedded link to another iocaine-generated page. That generation process can happen entirely on the CPU, without having to dispatch a request to the disk or database, satisfying the request quickly and removing it from the server's queue.
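The technique is straightforward: record which words follow which in a corpus, then walk that table at random. The sketch below is not iocaine's code (its generator works over bigrams, as the sample output later in the article suggests); it shows the idea with a simpler first-order chain over single words, using a tiny xorshift PRNG to stay dependency-free:

```rust
use std::collections::HashMap;

/// Tiny xorshift64 PRNG so the sketch needs no external crates;
/// seed must be nonzero.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Build a first-order Markov chain: map each word to the list of words
/// observed to follow it in the corpus (duplicates preserve frequency).
fn train(corpus: &str) -> HashMap<&str, Vec<&str>> {
    let words: Vec<&str> = corpus.split_whitespace().collect();
    let mut chain: HashMap<&str, Vec<&str>> = HashMap::new();
    for pair in words.windows(2) {
        chain.entry(pair[0]).or_default().push(pair[1]);
    }
    chain
}

/// Walk the chain, picking a random successor at each step. Each emitted
/// word costs one hash lookup and one RNG call — no disk or database.
fn babble<'a>(
    chain: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
    n: usize,
    rng: &mut Rng,
) -> String {
    let mut out = vec![start];
    let mut cur = start;
    for _ in 1..n {
        match chain.get(cur) {
            Some(next) if !next.is_empty() => {
                cur = next[(rng.next() as usize) % next.len()];
                out.push(cur);
            }
            _ => break, // dead end: word never seen mid-corpus
        }
    }
    out.join(" ")
}

fn main() {
    let corpus = "the cat sat on the mat and the dog sat on the rug";
    let chain = train(corpus);
    let mut rng = Rng(0x2545F4914F6CDD1D);
    println!("{}", babble(&chain, "the", 12, &mut rng));
}
```

Because the chain only ever emits word sequences that occurred in the corpus, the output is locally plausible while being globally meaningless — exactly the property the generated samples below exhibit.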
The program can be hooked into an existing web server in a variety of ways. In its default configuration, it is set up to be easily inserted into an existing reverse proxy's configuration ahead of the actual web site. It uses a set of heuristics to identify bot traffic: the presence of known-bad user agents, a request for an iocaine-generated nonsense URL, traffic that claims to be a mainstream browser but that doesn't set a Sec-Fetch-Site header, and requests coming from a range of autonomous system numbers known to belong to datacenters. Any traffic that doesn't match those heuristics receives HTTP status code 421 Misdirected Request. That causes most reverse proxies to fall through to the next possible handler, which will typically be the actual web site.
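A much-simplified sketch of that kind of dispatch logic is below; the user-agent substrings and the /maze/ path prefix are illustrative placeholders, not iocaine's actual lists or URL scheme:

```rust
/// Decision for an incoming request: feed it nonsense, or return
/// 421 Misdirected Request so the reverse proxy falls through to
/// the real site.
#[derive(Debug, PartialEq)]
enum Verdict {
    Poison,
    PassThrough, // answered with HTTP 421
}

fn classify(user_agent: &str, sec_fetch_site: Option<&str>, path: &str) -> Verdict {
    // Illustrative known-bad agents; a real deployment maintains a longer list.
    const BAD_AGENTS: &[&str] = &["GPTBot", "CCBot", "Bytespider"];
    if BAD_AGENTS.iter().any(|b| user_agent.contains(b)) {
        return Verdict::Poison;
    }
    // Following a link into previously generated nonsense marks the
    // client as a bot; "/maze/" is a hypothetical prefix for those URLs.
    if path.starts_with("/maze/") {
        return Verdict::Poison;
    }
    // Mainstream browsers always send Sec-Fetch-Site on navigation;
    // a self-proclaimed browser without it is suspect.
    let claims_browser = user_agent.contains("Mozilla/");
    if claims_browser && sec_fetch_site.is_none() {
        return Verdict::Poison;
    }
    Verdict::PassThrough
}

fn main() {
    assert_eq!(classify("Mozilla/5.0 GPTBot/1.0", None, "/"), Verdict::Poison);
    assert_eq!(
        classify("Mozilla/5.0 (X11; Linux x86_64)", Some("none"), "/index.html"),
        Verdict::PassThrough
    );
    println!("ok");
}
```

The 421 fall-through design means that iocaine never has to proxy legitimate traffic itself; it only has to recognize what it should not handle.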
That default configuration, while simple, arguably reduces iocaine's core value, since it makes the server dependent on the same bot-identification tricks that other solutions have attempted to wrestle with. To allow for more subtle configuration, iocaine embeds the Lua and Roto scripting languages. These can be used to implement custom handler logic, allowing users to extend iocaine to respond to requests on a particular path, with a particular cookie, or whatever other collection of traits makes sense for their use case. The Lua interpreter can also handle other languages that compile to Lua, such as Fennel. Roto scripts compile to machine code for minimum overhead when processing a request.
Some features of iocaine, for which a user may not want to reach for a full-fledged scripting language, can also be configured using KDL. Those configuration files can be used to specify multiple different scripted handlers to consult, set up log files and sources of data, choose how iocaine should bind to a port or socket, and so on.
On startup, iocaine reads in a word list and a corpus of text in order to set up its Markov model; by default, it uses its own source code for both, which produces nonsense that looks like this:
R, keys: &'a [Bigram], state: Bigram, } impl<'a, R: Rng> Iterator for Words<'a, R> { type Error = anyhow::Error; fn try_from(config: Config) -> Self { Self::Report(r) } } impl UserData for LuaQRJourney { fn new() -> Self { Self(k) } } impl WurstsalatGeneratorPro { #[must_use] pub fn library() -> impl Registerable { library! { impl Val<LabeledIntCounterVec> { fn.
For users who want more human-looking nonsense, using some ebooks from Project Gutenberg (and a word list from the GNU miscfiles package) produces a nicely different flavor:
Cabbage. Winston knelt down beside her. He tore open a window somewhere. 'There is a possible enemy. Yes, even science." Science? The Savage violently started and, uncovering his face, 'that in another hiding-place known to Julia, the belfry of a half-gramme holiday. Bernard was car- rying his baby sister — or perhaps not exactly be called upon to.
It is well known that, like adding sugar to unset concrete, adding a small amount of generated data to the training of LLMs can have large negative impacts on their performance. If iocaine-generated text can sneak into that training corpus, it will make it harder to train LLMs; therefore, the developers of LLMs will be less likely to pay for a dataset that could have iocaine-generated text in it. So, if enough web sites start using iocaine or similar approaches, it will no longer be profitable to scrape web sites and use that text for model training — putting an end to scraperbots once and for all.
That assumes, of course, that the purveyors of AI models don't have a way to detect and remove iocaine-generated text. The project's Markov model is not particularly sophisticated, and it seems entirely possible that AI labs will want to work on ways to detect it. On the other hand, that puts the game of cat-and-mouse firmly in the scraperbots' court, to badly mix a metaphor: now, the problem of distinguishing humans and bots is a problem for them, instead of a problem for server operators. Whether this more speculative aspect of using iocaine turns out to be worth it will be hard to tell without more study.
In either case, the overhead of running the software is high enough to be noticeable, but probably still an improvement over serving an expensive web page, and not likely to be a problem for modern servers. In my tests, iocaine used 101 megabytes of virtual address space, of which only 55 megabytes remained resident in memory after startup. The generated pages are also fairly short and to the point, often only a couple of paragraphs and a handful of kilobytes.
It probably doesn't make as much sense to put iocaine in front of a web site that consists entirely of static files — web servers are good at serving those efficiently already — unless one is particularly committed to the idea of combating scrapers economically. For users who have dynamic web sites, however, where every request can involve trips to the database, queries to backend services, or other expensive operations, iocaine is, like the iocaine in The Princess Bride, "not to be trifled with".

Trading a bit of CPU time to fill a scraper's queue with junk might just save us all some time and expense in the long run.
P.S.: Here's what iocaine had to say about itself when given this article as input.
