It’s impossible. I set up this instance just to browse Lemmy from my own instance, but no, it was slow as hell the whole week. I added new pods, put Postgres on a different pod, pictrs on another, etc.
But it was still slow as hell. I didn’t know what the cause was until a few hours ago: 500 GETs in a MINUTE from ClaudeBot and GPTBot. What the hell is this? Why? I blocked their user agents with a blocking extension on NGINX and now it works.
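For anyone wanting to do the same without a dedicated extension, plain nginx can match the user agents directly. A rough sketch of the idea; the agent list, hostname and backend port below are example values, adjust to your own setup:

    # goes in the http {} context, e.g. a conf.d file
    map $http_user_agent $block_ai_bot {
        default      0;
        ~*GPTBot     1;   # OpenAI's crawler
        ~*ClaudeBot  1;   # Anthropic's crawler
    }

    server {
        listen 80;
        server_name lemmy.example.org;           # hypothetical hostname

        location / {
            # refuse flagged user agents before they reach the backend
            if ($block_ai_bot) {
                return 403;
            }
            proxy_pass http://127.0.0.1:8536;    # assumed Lemmy backend port
        }
    }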
WHY? So Google can say that you should eat glass?
Life is hell now. Before, at least anyone could put up a website; now even that is painful.
Sorry for the rant.
You can enable Private Instance in your admin settings. This means only logged-in users can see content, which will stop AI scrapers from slowing down your instance: all they’ll see is an empty homepage, so no DB calls. As long as you’re on 0.19.11, federation will still work.

Enabled, thanks for the tip!
At some point they’re going to try to evade detection to continue scraping the web. The cat-and-mouse game continues, except now the “pirates” are big tech.
They already do. (“They” meaning AI generally, I don’t know about Claude or ChatGPT’s bots specifically). There are a number of tools server admins can use to help deal with this.
See also:
These solutions have the side effect of making the bots stay on your site longer and generate more traffic, so they’re not for everyone.
Patience, the AI bubble will burst soon.
🤞
An article for whoever was unaware, like me.
Use Anubis. It’s pretty much the only countermeasure the bots have no way of circumventing.
So I just had a look at your robots.txt:
User-Agent: *
Disallow: /login
Disallow: /login_reset
Disallow: /settings
Disallow: /create_community
Disallow: /create_post
Disallow: /create_private_message
Disallow: /inbox
Disallow: /setup
Disallow: /admin
Disallow: /password_change
Disallow: /search/
Disallow: /modlog
Crawl-delay: 60
You explicitly allow bots to crawl your content… That’s likely one of the reasons you’re getting bot traffic.
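If you do want to keep relying on robots.txt, the narrowest addition would be explicit disallow rules for the two crawlers you named. GPTBot and ClaudeBot are the user-agent tokens OpenAI and Anthropic publish for their crawlers; compliant bots honor rules like these, but compliance is entirely voluntary:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /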
AI crawlers ignore robots.txt. The only way to get them to stop is with active countermeasures.
I highly recommend putting Anubis in front of your entire instance as a proxy. It’s a little complicated to get going, but it stops AI scrapers outright by denying them access. A robots.txt works, but only up to a point, because some of these bots simply don’t respect it. And honestly, with the way Sam Altman talks about the people he’s stolen from and scraped, I don’t think anyone should be surprised.
But I have Anubis running on my personal website, and I’ve tested whether ChatGPT can see it: it cannot. Good enough for me.
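In case the setup is the confusing part: the usual pattern is nginx in front, handing every request to Anubis, which serves its proof-of-work challenge to new clients and only forwards verified ones to the real backend. Roughly, on the nginx side (the hostname and the port Anubis listens on are assumptions here; Anubis’s own listen/target configuration is described in its docs):

    server {
        listen 80;                              # in practice, 443 with your TLS certs
        server_name lemmy.example.org;          # hypothetical hostname

        location / {
            # everything goes through Anubis; it proxies verified clients
            # on to whatever backend it is configured to target
            proxy_pass http://127.0.0.1:8923;   # assumed Anubis listen port
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }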
You can use either Cloudflare (proprietary) or Anubis (FOSS).
Don’t do this
Why?
Because it harms marginalized folks’ ability to access content, while also letting an evil corp (and its fascist government) view (and modify) all encrypted communication between your site and its users.
It’s bad.
For clarity, you are referring to Cloudflare and not Anubis?
I am referring to Cloudflare, but I would expect Anubis to be the same if it provides DoS fronting.
Anubis works in a very different way than Cloudflare: it’s a self-hosted proof-of-work challenge running in front of your own server, so no third party sits in the middle of your TLS traffic.