Alarmed by what companies are building with artificial intelligence models, a handful of industry insiders are calling for those opposed to the current state of affairs to undertake a mass data poisoning effort to undermine the technology.

Their initiative, dubbed Poison Fountain, asks website operators to add links to their websites that feed AI crawlers poisoned training data. It’s been up and running for about a week.

AI crawlers visit websites and scrape data that ends up being used to train AI models, a parasitic relationship that has prompted pushback from publishers. When scraped data is accurate, it helps AI models offer quality responses to questions; when it’s inaccurate, it has the opposite effect.

  • algernon@lemmy.ml · 19 points · 13 hours ago

    I had a short tootstorm about this, because oh my god, this is some terribly ineffective, useless piece of nothing.

    For one, Poison Fountain tells us to join the war effort and cache responses. Okay…

     curl -i https://rnsaffn.com/poison2/ --compressed -s
    HTTP/2 200
    content-disposition: inline
    content-encoding: gzip
    content-type: text/plain; charset=utf-8
    x-content-type-options: nosniff
    content-length: 959
    date: Sun, 11 Jan 2026 21:17:36 GMT
    
    

    Yeaah… how am I supposed to cache this? There’s no Cache-Control, no ETag, no Last-Modified, nothing a cache could key on. Do I cache one response and then keep serving that to the 50+ million crawler requests that hit my sites every day? And you think a single, repetitive thing will poison anything at all? Really?

    Then, the Poison Fountain page goes on to claim that the garbage served to crawlers will end up in the training data. I’m fairly sure whoever set this up has never worked with model training, because that is not what happens. Not even the AI companies are that clueless: they do not train on anything and everything, they filter it down.

    And what this fountain provides is trivial to filter.

    It’s also mighty hard to set up! It’s not just a reverse_proxy https://rnsaffn.com/poison2, because then you leak all the headers you received. No, you have to make a sanitized request that doesn’t leak data. Good luck!
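
    For illustration, here is roughly what that sanitizing would take with Caddy (a sketch, not a recommendation: example.com and the /lure/* path are placeholders, and the header list is my guess at what needs stripping):

     # Caddyfile: proxy a decoy path to the fountain without leaking client data
     example.com {
         handle /lure/* {
             # Caddy upstream addresses cannot carry a path, so rewrite first
             rewrite * /poison2/
             reverse_proxy https://rnsaffn.com {
                 # present the upstream's hostname instead of ours
                 header_up Host {upstream_hostport}
                 # strip everything that identifies the visitor or our site
                 header_up -X-Forwarded-For
                 header_up -X-Forwarded-Host
                 header_up -X-Forwarded-Proto
                 header_up -Cookie
                 header_up -Authorization
                 header_up -Referer
             }
         }
     }

    And even with all that, every hit on the decoy path still turns into an outbound request to their single server.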

    Meanwhile, there are a gazillion self-hostable garbage generators and tarpits that you can literally shove into a docker container and reverse proxy tarpit URLs to, safely, locally. Much more efficient, far more effective. And since this fountain is practically uncacheable, if I were to use it, I’d have to send all the shit that hits my servers their way. As far as I can tell, this is a single Linode server. It probably wouldn’t crumble under my 50 million requests a day, but if ten more people joined the “war effort” without caching, my well-educated guess is that it would fall over and die.
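
    By contrast, the self-hosted version reduces to a one-liner (again a sketch; the generator and the 127.0.0.1:8080 address are assumptions, pick whichever tarpit you like):

     # Caddyfile: same decoy path, served by a local garbage generator
     example.com {
         handle /lure/* {
             # nothing leaves the host, nothing needs sanitizing,
             # and there is no third-party server to fall over
             reverse_proxy 127.0.0.1:8080
         }
     }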

    Besides, we have no idea whether poisoning works. We can’t measure that. What we can measure is the load on our servers, and this helps fuck all in that regard. The bots will still come, they’ll still hit everything, and I’d have additional load from the network traffic between my server and theirs (remember: the returned response provides no sane indicators that would allow caching while keeping the responses useful for poisoning purposes).

    Not only is this ineffective at poisoning, it’s not usable at all in its current state. And they call for joining the war effort. C’mon.

    • sobchak@programming.dev · 2 points · 7 hours ago

      I once saw an old lecture where a guy working on Yahoo’s spam filters noticed that spammers would create accounts to mark their own spam messages as not spam (an attempt to trick the filters; a kind of Sybil attack, I guess), and because of the way the spam-filtering models were built and used, this actually made the filtering more effective. It’s possible that a wider variety of “poisoned” data can actually help improve models.

      • algernon@lemmy.ml · 1 point · 2 hours ago

        I… have my doubts. I do not doubt that a wider variety of poisoned data can improve training, by prompting new ways of filtering out unusable training data. In itself, that would indeed improve the model.

        But in many cases, the point of poisoning is not to poison the training data, but to deny the crawlers access to the real content (and to provide an opportunity to poison their URL queue, which I can demonstrate works). If poison is served instead of the real content, that hurts the model: even if it filters out the junk, it has less new data to train on.

      • algernon@lemmy.ml · 6 points · 11 hours ago

        Yup. All of the things listed there are far better than this.

        (I’m also in that article; look for “iocaine”, though it has evolved into something a whole lot more powerful and a lot easier to deploy since the article was written.)

        • GMac@feddit.org · 6 points · 11 hours ago

          Love the Princess Bride reference. Thank you for acting on behalf of those of us with less technical skills.