Poisoning scraperbots with iocaine
Web sites are increasingly beset by AI scraperbots — a problem that we have written about before, and one that has slowly ramped up to the point of constituting an occasional de facto DDoS attack. This has not gone uncontested, however: web site operators from around the world have been working on inventive countermeasures. These solutions target the problem posed by scraperbots in different ways; iocaine, an MIT-licensed nonsense generator, is designed to make scraped text less useful by poisoning it with fake data. The hope is to make running scraperbots economically unviable, and thereby address the problem at its root instead of playing an eternal game of Whac-A-Mole.
The problem with scraperbots
There are plenty of good reasons to scrape the web: creating indexes for search engines, backing up old web pages before they go offline, and even scientific research. Scraping can be disruptive, however. It requires resources from the server operator, often more than normal browsing, and is sometimes in support of an effort that the server operator doesn't agree with. The difference between a well-behaved scraperbot and a problematic one is often simply whether it is respectful of the resources of the server.
The Common Crawl project, for example, seeks to crawl the web once, and then make that data available to share between multiple users. That way, server operators only spend the resources to serve their pages for scraping once, instead of once per scraper. Other well-behaved bots respect signals like robots.txt files, site maps, ETags, and Retry-After headers that politely request that robots follow certain rules. For example, LWN's robots.txt asks bots to not scrape the mailing-list archives, because the archives have grown quite large, and serving content from them is more expensive than a typical request to the server.
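The polite-request mechanisms mentioned above are simple, plain-text conventions. A hypothetical robots.txt in that spirit might look like the following; the path and delay here are invented for illustration and are not LWN's actual configuration (Crawl-delay, in particular, is a widely recognized but non-standard extension):

```
# Ask all crawlers to stay out of the expensive archive pages
User-agent: *
Disallow: /archives/

# Non-standard, but honored by some crawlers: seconds between requests
Crawl-delay: 10
```

Nothing enforces these rules; a crawler that ignores the file suffers no technical consequence, which is exactly the gap that the tools discussed below try to fill.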
With the invention of large language models (LLMs), text on the web suddenly has an economic value that it didn't previously, which leads to the temptation to ignore those polite requests. That, in turn, drives server operators to attempt to differentiate scraperbots from humans in order to enforce their chosen limits. Thus begins a game of cat and mouse between server operators coming up with new detection techniques and scraperbots trying to blend in. There is never a winner, but the losers are the independent web sites that cannot keep up with the race, along with their visitors. As with the email spam problem, centralization and scale both make it easier to detect and respond to trends in new attacks, which makes avoiding the scraperbots easier for larger sites.
What to do about it
There are a few possible types of response. For one, server operators could try to make serving all of their pages less expensive. LWN has done some of that in the form of cleaning up unnecessary database queries in our site code. So, that's one potentially good thing to come from the increase in scraperbots: users might see slightly faster page loads when the site isn't being effectively snowed under by bots.
Another possible solution is to differentiate bots from humans with some kind of costly signal that's hard to fake. This is how the increasingly prevalent Anubis tries to protect server resources: it requires first-time visitors to solve a proof-of-work problem in order to access the site. Other approaches in this vein include checking that a user agent implements the meta http-equiv attribute correctly, or checking that it can store and provide cookies.
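The core of the proof-of-work idea is that solving is expensive while verifying is cheap. Here is a minimal sketch, not Anubis's actual implementation: it uses Rust's standard (non-cryptographic) `DefaultHasher` as a stand-in for a real cryptographic hash such as SHA-256, and a difficulty chosen purely for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Required leading zero bits; real deployments tune this so that solving
/// costs noticeable CPU time while staying tolerable for a human visitor.
const DIFFICULTY: u32 = 12;

// Stand-in for a cryptographic hash; a real system would use SHA-256 or similar.
fn digest(challenge: &str, nonce: u64) -> u64 {
    let mut h = DefaultHasher::new();
    challenge.hash(&mut h);
    nonce.hash(&mut h);
    h.finish()
}

/// Client side: brute-force search for a nonce meeting the difficulty target.
fn solve(challenge: &str) -> u64 {
    (0..).find(|&n| digest(challenge, n).leading_zeros() >= DIFFICULTY).unwrap()
}

/// Server side: checking a claimed solution is a single hash, far cheaper
/// than finding one — that asymmetry is the whole point.
fn verify(challenge: &str, nonce: u64) -> bool {
    digest(challenge, nonce).leading_zeros() >= DIFFICULTY
}

fn main() {
    let challenge = "per-session-token";
    let nonce = solve(challenge);
    assert!(verify(challenge, nonce));
    println!("solved with nonce {nonce}");
}
```

The server hands out a fresh challenge per visitor, so a solution cannot be reused; at 12 bits of difficulty, solving takes a few thousand hash attempts on average.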
The problems with those approaches are twofold: failing to deter bots (which can run JavaScript just like everyone else), and putting an additional barrier in the path of users. Modern scraperbots use browsers — sometimes headless browsers, but sometimes actually rendering to a virtual screen — and route their connections through more-or-less legitimately acquired domestic IP addresses, just like humans. In this particular arena, the advantage seems to lie on the side of bots trying to mimic human traffic.
Iocaine
Unlike measures that seek to detect and block bots, iocaine is a last-ditch defense for after they have already made it to the web site. The request looks like it came from a human, and the server is going to have to spend some resources responding to it. There's still one attribute that separates bots from human readers, however: reading comprehension. A human presented with a page of obvious nonsense might click one or two links in confusion, but they're extremely unlikely to try to download an endless torrent of nonsense instead of wandering off to do something better with their time. A bot that is merely scanning for links and archiving text for later processing, on the other hand, will happily continue to download nonsense until it hits some kind of limit.
Iocaine is a Rust program dedicated to generating convincing-enough nonsense with a minimum expenditure of server resources. When it receives a request, it uses a Markov chain to quickly generate a random stream of words, with the occasional embedded link to another iocaine-generated page. That generation process can happen entirely on the CPU, without having to dispatch a request to the disk or database, satisfying the request quickly and removing it from the server's queue.
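The technique is straightforward: record which words follow which in a corpus, then walk that table at random. The sketch below is not iocaine's code (its generator works over bigrams, as the sample output later in the article suggests); it shows the idea with a simpler first-order chain over single words, using a tiny xorshift PRNG to stay dependency-free:

```rust
use std::collections::HashMap;

/// Tiny xorshift64 PRNG so the sketch needs no external crates;
/// seed must be nonzero.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Build a first-order Markov chain: map each word to the list of words
/// observed to follow it in the corpus (duplicates preserve frequency).
fn train(corpus: &str) -> HashMap<&str, Vec<&str>> {
    let words: Vec<&str> = corpus.split_whitespace().collect();
    let mut chain: HashMap<&str, Vec<&str>> = HashMap::new();
    for pair in words.windows(2) {
        chain.entry(pair[0]).or_default().push(pair[1]);
    }
    chain
}

/// Walk the chain, picking a random successor at each step. Each emitted
/// word costs one hash lookup and one RNG call — no disk or database.
fn babble<'a>(
    chain: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
    n: usize,
    rng: &mut Rng,
) -> String {
    let mut out = vec![start];
    let mut cur = start;
    for _ in 1..n {
        match chain.get(cur) {
            Some(next) if !next.is_empty() => {
                cur = next[(rng.next() as usize) % next.len()];
                out.push(cur);
            }
            _ => break, // dead end: word never seen mid-corpus
        }
    }
    out.join(" ")
}

fn main() {
    let corpus = "the cat sat on the mat and the dog sat on the rug";
    let chain = train(corpus);
    let mut rng = Rng(0x2545F4914F6CDD1D);
    println!("{}", babble(&chain, "the", 12, &mut rng));
}
```

Because the chain only ever emits word sequences that occurred in the corpus, the output is locally plausible while being globally meaningless — exactly the property the generated samples below exhibit.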
The program can be hooked into an existing web server in a variety of ways. In its default configuration, it is set up to be easily inserted into an existing reverse proxy's configuration ahead of the actual web site. It uses a set of heuristics to identify bot traffic: the presence of known-bad user agents, a request for an iocaine-generated nonsense URL, traffic that claims to be a mainstream browser but that doesn't set a Sec-Fetch-Site header, and requests coming from a range of autonomous system numbers known to belong to datacenters. Any traffic that doesn't match those heuristics receives HTTP status code 421 Misdirected Request. That causes most reverse proxies to fall through to the next possible handler, which will typically be the actual web site.
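A much-simplified sketch of that kind of dispatch logic is below; the user-agent substrings and the /maze/ path prefix are illustrative placeholders, not iocaine's actual lists or URL scheme:

```rust
/// Decision for an incoming request: feed it nonsense, or return
/// 421 Misdirected Request so the reverse proxy falls through to
/// the real site.
#[derive(Debug, PartialEq)]
enum Verdict {
    Poison,
    PassThrough, // answered with HTTP 421
}

fn classify(user_agent: &str, sec_fetch_site: Option<&str>, path: &str) -> Verdict {
    // Illustrative known-bad agents; a real deployment maintains a longer list.
    const BAD_AGENTS: &[&str] = &["GPTBot", "CCBot", "Bytespider"];
    if BAD_AGENTS.iter().any(|b| user_agent.contains(b)) {
        return Verdict::Poison;
    }
    // Following a link into previously generated nonsense marks the
    // client as a bot; "/maze/" is a hypothetical prefix for those URLs.
    if path.starts_with("/maze/") {
        return Verdict::Poison;
    }
    // Mainstream browsers always send Sec-Fetch-Site on navigation;
    // a self-proclaimed browser without it is suspect.
    let claims_browser = user_agent.contains("Mozilla/");
    if claims_browser && sec_fetch_site.is_none() {
        return Verdict::Poison;
    }
    Verdict::PassThrough
}

fn main() {
    assert_eq!(classify("Mozilla/5.0 GPTBot/1.0", None, "/"), Verdict::Poison);
    assert_eq!(
        classify("Mozilla/5.0 (X11; Linux x86_64)", Some("none"), "/index.html"),
        Verdict::PassThrough
    );
    println!("ok");
}
```

The 421 fall-through design means that iocaine never has to proxy legitimate traffic itself; it only has to recognize what it should not handle.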
That default configuration, while simple, arguably reduces iocaine's core value, since it makes the server dependent on the same bot-identification tricks that other solutions have attempted to wrestle with. To allow for more subtle configuration, iocaine embeds the Lua and Roto scripting languages. These can be used to implement custom handler logic, allowing users to extend iocaine to respond to requests on a particular path, with a particular cookie, or whatever other collection of traits makes sense for their use case. The Lua interpreter can also handle other languages that compile to Lua, such as Fennel. Roto scripts compile to machine code for minimum overhead when processing a request.
Some features of iocaine, for which a user may not want to reach for a full-fledged scripting language, can also be configured using KDL. Those configuration files can be used to specify multiple different scripted handlers to consult, set up log files and sources of data, choose how iocaine should bind to a port or socket, and so on.
On startup, iocaine reads in a word list and a corpus of text in order to set up its Markov model; by default, it uses its own source code for both, which produces nonsense that looks like this:
R, keys: &'a [Bigram], state: Bigram, } impl<'a, R: Rng> Iterator for Words<'a, R> { type Error = anyhow::Error; fn try_from(config: Config) -> Self { Self::Report(r) } } impl UserData for LuaQRJourney { fn new() -> Self { Self(k) } } impl WurstsalatGeneratorPro { #[must_use] pub fn library() -> impl Registerable { library! { impl Val<LabeledIntCounterVec> { fn.
For users who want more human-looking nonsense, using some ebooks from Project Gutenberg (and a word list from the GNU miscfiles package) produces a nicely different flavor:
Cabbage. Winston knelt down beside her. He tore open a window somewhere. 'There is a possible enemy. Yes, even science." Science? The Savage violently started and, uncovering his face, 'that in another hiding-place known to Julia, the belfry of a half-gramme holiday. Bernard was car- rying his baby sister — or perhaps not exactly be called upon to.
It is well known that, like adding sugar to unset concrete, adding a small amount of generated data to the training of LLMs can have large negative impacts on their performance. If iocaine-generated text can sneak into that training corpus, it will make it harder to train LLMs; therefore, the developers of LLMs will be less likely to pay for a dataset that could have iocaine-generated text in it. So, if enough web sites start using iocaine or similar approaches, it will no longer be profitable to scrape web sites and use that text for model training — putting an end to scraperbots once and for all.
That assumes, of course, that the purveyors of AI models don't have a way to detect and remove iocaine-generated text. The project's Markov model is not particularly sophisticated, and it seems entirely possible that AI labs will want to work on ways to detect it. On the other hand, that puts the game of cat-and-mouse firmly in the scraperbots' court, to badly mix a metaphor: now, the problem of distinguishing humans and bots is a problem for them, instead of a problem for server operators. Whether this more speculative aspect of using iocaine turns out to be worth it will be hard to tell without more study.
In either case, the overhead of running the software is high enough to be noticeable, but probably still an improvement over serving an expensive web page, and not likely to be a problem for modern servers. In my tests, iocaine used 101 megabytes of virtual address space, of which only 55 megabytes remained resident in memory after startup. The generated pages are also fairly short and to the point, often only a couple of paragraphs and a handful of kilobytes.
It probably doesn't make as much sense to put iocaine in front of a web site that consists entirely of static files — web servers are good at serving those efficiently already — unless one is particularly committed to the idea of combating scrapers economically. For users who have dynamic web sites, however, where every request can involve trips to the database, queries to backend services, or other expensive operations, iocaine is, like the iocaine in The Princess Bride, "not to be trifled with".

Trading a bit of CPU time to fill a scraper's queue with junk might just save us all some time and expense in the long run.
P.S.: Here's what iocaine had to say about itself when given this article as input.
