I found that I was getting random bot attacks on progscrape.com with no identifiable bot signature (i.e., a signature matching a valid Chrome desktop client), but at a rate that was only possible via bot. I ended up having to add token buckets by IP/User-Agent to help stem this deluge of traffic.
Agents that trigger the first level of rate-limiting go through a "tarpit" that holds their connection for a bit before serving it, which seems to keep most of the bad actors in check. It's impossible to block them via robots.txt, and I'm trying to avoid using too big of a hammer on my CloudFlare settings.
EDIT: checking the logs, it seems that the only bot getting tarpitted right now is OpenAI, and they _do_ have a GPTBot signature:
2024-10-31T02:30:23.312139Z WARN progscrape::web: User hit soft rate limit: ratelimit=soft ip="20.171.206.77" browser=Some("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)") method=GET uri=/?search=science.org
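For illustration only, here is a minimal sketch of the soft-rate-limit-plus-tarpit idea described above. It is not progscrape's actual code (the real implementation is linked in the reply below, and is built on rolling counting Bloom filters rather than a plain counter); the names and thresholds are made up.

  import time
  from collections import defaultdict

  WINDOW_SECS = 60        # length of the counting window (illustrative)
  SOFT_LIMIT = 30         # requests per window before tarpitting kicks in
  TARPIT_DELAY_SECS = 5   # how long to hold a rate-limited connection

  # (ip, user_agent) -> (window_start, request_count)
  _counters = defaultdict(lambda: (0.0, 0))

  def maybe_tarpit(ip: str, user_agent: str) -> None:
      """Delay the response if this client has exceeded the soft limit."""
      now = time.monotonic()
      window_start, count = _counters[(ip, user_agent)]
      if now - window_start > WINDOW_SECS:
          window_start, count = now, 0          # start a fresh window
      count += 1
      _counters[(ip, user_agent)] = (window_start, count)
      if count > SOFT_LIMIT:
          # Soft limit hit: hold the connection for a while before serving.
          time.sleep(TARPIT_DELAY_SECS)

  # Usage: call maybe_tarpit(client_ip, client_ua) before handling each request.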
atif089 175 days ago [-]
Did you implement this in your web server or within your application? I'd love to see the code if you're willing to share
Here's the code:
https://github.com/progscrape/progscrape/blob/master/web/src...
Here's where we handle the rate limits:
https://github.com/progscrape/progscrape/blob/master/web/src...
I actually misremembered my implementation. It's rolling counting bloom filters, not a token bucket. :)
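Since rolling counting Bloom filters are mentioned, here is a minimal sketch of that data structure, again my own illustration rather than the linked progscrape code: keep two counter arrays, rotate the active one every window so stale traffic ages out, and estimate a key's recent count as the minimum over its hashed slots.

  import hashlib
  import time

  class RollingCountingBloom:
      """Approximate per-key request counts that age out as windows rotate."""

      def __init__(self, size: int = 1 << 16, hashes: int = 4, window_secs: float = 60.0):
          self.size = size
          self.hashes = hashes
          self.window_secs = window_secs
          self.current = [0] * size     # counts for the active window
          self.previous = [0] * size    # counts for the previous window
          self.window_start = time.monotonic()

      def _indexes(self, key: str):
          digest = hashlib.sha256(key.encode()).digest()
          for i in range(self.hashes):
              chunk = digest[i * 4:(i + 1) * 4]
              yield int.from_bytes(chunk, "big") % self.size

      def _maybe_rotate(self) -> None:
          if time.monotonic() - self.window_start >= self.window_secs:
              self.previous = self.current          # age out the old window
              self.current = [0] * self.size
              self.window_start = time.monotonic()

      def add_and_count(self, key: str) -> int:
          """Record one request for `key` and return its approximate recent count."""
          self._maybe_rotate()
          for idx in self._indexes(key):
              self.current[idx] += 1
          # Estimate = minimum over the hashed slots, summed across both windows.
          return min(self.current[i] + self.previous[i] for i in self._indexes(key))

  # Usage (hypothetical): tarpit when add_and_count(f"{ip}|{ua}") exceeds a threshold.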
jhpacker 175 days ago [-]
Cloudflare Radar, which presumably has a much bigger and better sample, reports Bytespider as the #5 AI crawler, behind FB, Amazon, GPTBot, and Google:
https://radar.cloudflare.com/explorer?dataSet=ai.bots
And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc.
Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.
jsheard 175 days ago [-]
The difference is that, AFAIK, those bigger AI crawlers do respect robots.txt. Google even provides a way to opt-out of AI training without opting-out of search indexing.
yazzku 175 days ago [-]
And how much do you trust that shit? Has anyone set up a honeypot as an experiment?
BXlnt2EachOther 175 days ago [-]
Possibly unpopular opinion: I trust the bigger companies more than small ones on stuff like this. It would be so much easier not to offer anything at all than to intentionally create a Potemkin setting and risk the blowback that would occur if it were discovered. Hopefully this comment does not age poorly.
full disclosure: worked there [edit: google] a while ago, not in search, not in AI.
Arnt 175 days ago [-]
You can trust Google to do what it says, and yes I've seen Google obey robots.txt. You can't trust Google to do what you think is right.
I'm a bit in a hurry and don't have time for close reading. Does that article say some Google apps (notably Maps) store locations on your device even if you have configured them not to store locations in your Google account? I may be missing something; I don't have time to read between the lines today.
neilv 175 days ago [-]
Given the high-profile national security scrutiny that ByteDance was already in over TikTok, and now with the AI training competitiveness on national authorities' minds, maybe this behavior by ByteDance is on the radar of someone who's thinking of whether CFAA or other regulation applies.
As someone who's built multiple (respectful) Web crawlers, for academic research and for respectable commerce, I'm wondering whether abusers are going to make it harder for legitimate crawlers to operate.
wtf242 175 days ago [-]
I had the same issue with TikTok/ByteDance. They were using almost 100 GB of my traffic per month.
I now block all AI crawlers at the Cloudflare WAF level. On Monday I noticed a HUGE spike in traffic, and my site was not handling it well. After a lot of troubleshooting and log parsing, I found I was getting millions of requests from China that were getting past Cloudflare's bot protection.
I ended up having to force a CF managed challenge for the entire country of China to get my site back in a normal working state.
In the past 24 hours CF has blocked 1.66M bot requests. Good luck running a site without using CloudFlare or something similar.
AI crawlers are just out of control
PittleyDunkin 175 days ago [-]
How do you differentiate between "ai" (whatever that means) and other crawlers?
yazzku 175 days ago [-]
You don't. In theory they would identify themselves via the user agent, but who can trust that anymore?
Sparkyte 175 days ago [-]
And it is a fine pickle we jarred ourselves into. We thought it would be sweet but it just came out dill.
jcat123 175 days ago [-]
More than user-agent, because user-agent cannot be trusted.
PittleyDunkin 175 days ago [-]
Great! Well then... how?
prophesi 175 days ago [-]
HAProxy Edge is their product, and as with Cloudflare and other competitors, the heuristics used to stifle bad actors are likely the secret sauce. Disclosing them would only give bad actors an advantage in the game of cat and mouse.
superkuh 175 days ago [-]
Their user-agent.
PittleyDunkin 175 days ago [-]
Ah, so this is just marketing.
jcat123 175 days ago [-]
(disclaimer: i wrote that post)
It is not. We rely on more than User Agents because they are too often faked, so it is not just marketing. There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.
PittleyDunkin 175 days ago [-]
> There are other signals we see that confirm whether the request came from a "legitimate" AI scraper, or a different scraper with the same user agent.
Great! What are these signals? That seems to be the meat of the post but it's conspicuously absent. How are we supposed to validate the post?
signatoremo 175 days ago [-]
> how are we supposed to validate the post?
Imagine you were a vendor trying to trick the author into divulging his methods. Can a stranger on the Internet be trusted?
dgfitz 175 days ago [-]
I imagine if that information is disclosed, you won’t be able to verify it, as it will be bypassed… because it was disclosed.
PittleyDunkin 175 days ago [-]
What a wonderful world we live in where serious people are expected to believe press releases based purely on brand prestige.
dgfitz 174 days ago [-]
There’s a lot of assumptions in that comment.
superkuh 175 days ago [-]
So: the user-agent, plus whois to see if it's coming from a plausible netblock, plus Accept: strings and other HTTP header stuff?
odc 175 days ago [-]
Good to know there are other solutions than Cloudflare to block those leeches.
sghiassy 175 days ago [-]
It’s 90% of 1%… title is misleading
richwater 175 days ago [-]
It's completely accurate.
90% of their crawler traffic (which is 1% of their total traffic) is ByteDance.
sghiassy 175 days ago [-]
No. It’s that 90% of their “AI traffic” is ByteDance. Here’s the quote:
“””
Nearly 1% of our total traffic comes from AI crawlers
Close to 90% of that traffic is from Bytespider, by Bytedance (the parent company of TikTok)
“””
manojlds 175 days ago [-]
No it isn't
sghiassy 175 days ago [-]
It’s in big bold bullets at the top of the article
“Nearly 1% of our total traffic comes from AI crawlers
Close to 90% of that traffic is from Bytespider, by Bytedance (the parent company of TikTok)”
yazzku 175 days ago [-]
tl;dr: the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HAProxy offering to stop them through other means.
dartos 175 days ago [-]
Should we webmasters just start blocking user agents wholesale?
I mean except known good actors.
I guess known actors would need a verifiable signature
rty32 175 days ago [-]
Not viable. They are going to use user agents that look like those coming from completely normal human users.
"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (Web Integrity API). Nobody supported them and they backed out.
jsheard 175 days ago [-]
Search engine crawlers do have verifiable signatures; if a client claims to be Googlebot or Bingbot, you don't have to take their word for it:
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
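The verification those docs describe is a reverse DNS lookup followed by a forward-confirm. A minimal sketch of that check (my own illustration, not either vendor's code; the suffix list follows their published guidance):

  import socket

  # Domains the verified crawlers resolve under, per the docs linked above.
  VERIFIED_SUFFIXES = {
      "googlebot": (".googlebot.com", ".google.com"),
      "bingbot": (".search.msn.com",),
  }

  def is_verified_crawler(ip: str, bot: str) -> bool:
      """Reverse-resolve the IP, check the domain, then forward-confirm it."""
      try:
          hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS (PTR)
      except OSError:
          return False
      if not hostname.endswith(VERIFIED_SUFFIXES[bot]):
          return False
      try:
          _, _, addresses = socket.gethostbyname_ex(hostname)  # forward DNS
      except OSError:
          return False
      return ip in addresses                                   # must round-trip

  # Usage: is_verified_crawler(client_ip, "googlebot")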
But the converse is not true? There is no guarantee the crawler is not amassing data for model training, or that a crawler (AI or otherwise) does not disguise itself as a normal user?
jsheard 175 days ago [-]
Yeah, but traffic appearing to come from normal users can be throttled and/or CAPTCHA'd while still allowing Google and Bing to crawl to their hearts' content, so your SEO isn't affected.
SoftTalker 175 days ago [-]
I would think rate-limiting would be good. Crawlers are not patient enough to operate at the speed of a real human user.
readyplayernull 175 days ago [-]
Greedy crawlers will use fake user-agent strings.
Narhem 175 days ago [-]
It’s relatively simple to detect crawlers; writing a detector from scratch could take a few weeks if the infrastructure were in place.
Factoring in salaries, though, an externally managed solution might be cheaper.
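A rough illustration of what such a from-scratch detector might look at; the thresholds and header checks here are invented for the example and would need tuning:

  import time
  from collections import defaultdict, deque

  BOT_UA_HINTS = ("bot", "spider", "crawl", "scrape")  # crude keyword check
  MAX_REQUESTS_PER_MINUTE = 120                        # illustrative threshold

  _history = defaultdict(deque)  # ip -> recent request timestamps

  def looks_like_crawler(ip: str, headers: dict) -> bool:
      """Heuristic: bot-ish user agent, missing browser headers, or inhuman rate."""
      ua = headers.get("User-Agent", "").lower()
      if any(hint in ua for hint in BOT_UA_HINTS):
          return True
      # Real browsers almost always send these; many naive crawlers do not.
      if "Accept-Language" not in headers or not ua:
          return True
      # Sliding one-minute window of request timestamps per IP.
      now = time.monotonic()
      window = _history[ip]
      window.append(now)
      while window and now - window[0] > 60:
          window.popleft()
      return len(window) > MAX_REQUESTS_PER_MINUTE

  # Note: trivially evaded by clients that fake headers and pace requests,
  # which is why the thread keeps circling back to IP reputation and other signals.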
andrethegiant 175 days ago [-]
[Shameless plug] I'm building a platform[1] that abides by robots.txt, the crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default, plus centralized caching, would decongest the network and be better for website owners.
[1] https://crawlspace.dev
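For reference, the behaviors listed above map onto standard HTTP and robots.txt mechanisms. Here is a minimal polite-fetch sketch using only the Python standard library; it is my illustration of those behaviors, not the linked platform's code, and the crawler name is hypothetical:

  import time
  import urllib.error
  import urllib.request
  import urllib.robotparser
  from urllib.parse import urlparse

  USER_AGENT = "ExampleCrawler/1.0"  # hypothetical crawler identity

  def polite_fetch(url: str) -> bytes | None:
      """Fetch a URL while honoring robots.txt, Crawl-delay, and 429/Retry-After."""
      base = "{0.scheme}://{0.netloc}".format(urlparse(url))
      robots = urllib.robotparser.RobotFileParser(base + "/robots.txt")
      robots.read()
      if not robots.can_fetch(USER_AGENT, url):
          return None                                   # disallowed: skip the URL
      delay = robots.crawl_delay(USER_AGENT) or 1.0     # default pacing of 1 second
      time.sleep(delay)
      req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
      try:
          with urllib.request.urlopen(req) as resp:
              return resp.read()
      except urllib.error.HTTPError as err:
          if err.code == 429:                           # back off as instructed
              # Assumes Retry-After carries seconds rather than an HTTP date.
              retry_after = int(err.headers.get("Retry-After", "60"))
              time.sleep(retry_after)
          return None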