My instance is getting pummeled by scrapers crawling nonsense, like issue and pull request searches with every possible combination of labels.

Everything’s coming from a shitload of different residential IPs at a very fast cadence.

There’s just not enough content on my instance to warrant this traffic. At this request rate, a legitimate crawler could scrape all of it in a minute or two.

  • Kissaki
    16 days ago

    Possibly AI company crawlers. When they came up there was a lot of bad publicity and reports of actively malicious and toxic crawling behavior, including ban evasion.

    You could lock some URL paths behind valid login sessions, or use a proof-of-work proxy guard.
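A minimal sketch of the login-gating idea, assuming filtered issue and pull request listings are the expensive pages and that they end in `/issues` or `/pulls` with a query string. The function name `requires_login`, the suffix list, and the example host are all hypothetical; a real setup would put this decision in the reverse proxy or in the forge's own settings.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical rule set: listing pages that can be filtered via the query string.
# Adjust to the routes your forge actually serves.
EXPENSIVE_SUFFIXES = ("/issues", "/pulls")


def requires_login(url: str, has_session: bool) -> bool:
    """Return True if an anonymous request should be sent to the login page."""
    if has_session:
        return False
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = parse_qs(parts.query)
    # Only gate filtered listings; plain pages and file views stay public.
    return path.endswith(EXPENSIVE_SUFFIXES) and bool(query)


if __name__ == "__main__":
    print(requires_login("https://git.example.org/alice/proj/issues?labels=1%2C2", False))
    print(requires_login("https://git.example.org/alice/proj/issues", False))
```

This keeps repositories and plain issue lists browsable while removing the combinatorial search space that anonymous crawlers can walk.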

    Anubis is the popular tool for that. I’ve seen maybe three alternatives, one of which is from Cloudflare.

    See also the related ticket on Codeberg (a Forgejo instance): https://codeberg.org/forgejo/discussions/issues/319

    If you search, you can find various blog posts about these issues, and not just about Forgejo.

    • treadful@lemmy.zipOP
      16 days ago

      > Possibly AI company crawlers. When they came up there was a lot of bad publicity and reports of actively malicious and toxic crawling behavior, including ban evasion.

      That was kind of what I was thinking, but if that’s true, they’re wasting so much bandwidth and compute. Going through every combination of issue labels does not get them any useful code to hoover up. They could’ve just cloned my repos and been done with it.

      > You can think about locking some url paths behind valid login sessions, or use a proof of work proxy guard.
      >
      > Anubis is the popular tool for that. I’ve seen maybe three alternatives, one of which from Cloudflare.

      I really don’t want to use Cloudflare, but Anubis is interesting. If I can’t shake these bots, maybe I’ll consider it. Thanks.

    • Eezyville@sh.itjust.works
      16 days ago

      If you think it’s AI then maybe you can get another AI to write bad code and poison their training data.

  • reluctant_squidd@lemmy.ca
    16 days ago

    I try not to expose mine to the internet for this reason. I have it on a central server that connects over WireGuard to a VPS overseas. When I need access from the net, I tunnel it to my home server through a random port, then block it again afterwards. All my machines sync with the central server this way, either through the VPN tunnel or directly on my LAN, depending on where I am.

    Unless you need to showcase your code, I wouldn’t recommend exposing your instance to the internet at all. If you have to, maybe put it behind a reverse proxy and add some monitoring and blocking software, like fail2ban. Good luck.
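As a rough illustration of what monitoring and blocking tools in the fail2ban family do, the sketch below bans an IP once it exceeds a request count within a sliding time window. It is a simplified model of the idea, not fail2ban's configuration or code; the class name `SlidingWindowBan` and the thresholds are made up for the example.

```python
from collections import defaultdict, deque


class SlidingWindowBan:
    """Ban an IP once it makes more than max_hits requests in window_seconds."""

    def __init__(self, max_hits: int, window_seconds: float) -> None:
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps inside the window
        self.banned = set()

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP is banned afterwards."""
        if ip in self.banned:
            return True
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_hits:
            self.banned.add(ip)
            return True
        return False


if __name__ == "__main__":
    guard = SlidingWindowBan(max_hits=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, guard.record("203.0.113.5", t))
```

A per-IP threshold like this only catches clients that individually exceed the limit. Traffic spread thinly across many addresses can stay under it, which is the limitation raised further down the thread.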

    • treadful@lemmy.zipOP
      16 days ago

      Having a private instance isn’t exactly conducive to open source software, so I don’t think that’s the way I want to take it. I’d probably move to Codeberg or even GitHub before hosting the entire thing on a private net.

      I also don’t think monitoring and blocking are going to help here. This traffic came from so many different IPs that it would be almost impossible to detect and block them all without blocking legitimate traffic. I also really don’t want to hook up a Cloudflare-like centralized challenge system to deal with this if I can avoid it.

      • reluctant_squidd@lemmy.ca
        16 days ago

        It sounds to me like you are at the mercy of the bots then, unfortunately. I have put up literally empty websites just to see what the bots do, and within a few hours the sites were hammered with crazy bot traffic trying everything from MySQL connections, SSH, WordPress sniffing, XSS attacks, you name it. They don’t even seem to care that the site returns 403 Forbidden or is just a blank page.

        In my experience, that’s the World Wide Web we live in nowadays.

        • treadful@lemmy.zipOP
          16 days ago

          > […] crazy bot traffic trying everything from MySQL connections, SSH, WordPress sniffing, XSS attacks, you name it.

          Oh yeah, I see that on everything. I’m less worried about those vuln scanners than about this overwhelming nonsense traffic that I’m seeing now. This is different, and seemingly pointless.

  • dajoho@sh.itjust.works
    16 days ago

    Yes! Exactly as you describe. They were going through certain repos and parsing every commit. I couldn’t block them because they came from loads of different residential IPs with random user-agents. :-(

    • treadful@lemmy.zipOP
      16 days ago

      Well, at least it doesn’t seem targeted, then. Did you do anything to remedy the situation?

      • Marthirial@lemmy.world
        16 days ago

        Why do you need a self-hosted instance open to the world? Mine is behind a Cloudflare rule that allows connections only from a list of IPs, like my self-hosted WireGuard instance.
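The allowlist approach boils down to a membership check like the sketch below. It is not Cloudflare's rule syntax; `ALLOWED_NETWORKS` and `is_allowed` are hypothetical names, and the networks are placeholders from the documentation address ranges.

```python
import ipaddress

# Placeholder entries (documentation ranges); replace with your own,
# e.g. the public address of a WireGuard endpoint.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("198.51.100.7/32"),
    ipaddress.ip_network("2001:db8::/48"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    print(is_allowed("198.51.100.7"))
    print(is_allowed("203.0.113.20"))
```

This trades public visibility for safety: anyone not on the list, including legitimate anonymous visitors, is turned away.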

        • treadful@lemmy.zipOP
          16 days ago

          > Why do you need a self-hosted instance open to the world?

          Because I can and I want to?