Key Takeaways
Fastly’s Q2 2025 Threat Insights Report shows that nearly 80% of AI bot traffic now comes from crawlers, and 90% of that activity is targeting the U.S. and Canada.
That means North American providers have become the succulent meal at an AI crawler’s all-you-can-eat buffet, with stateside clients facing more wasted bandwidth, server strain, and muddied analytics than clients anywhere else in the world.
Why North America?
Fastly suggests that a major draw is that North America is home to most English-language websites.
“A significant observation is the apparent heavy reliance of most AI models on content sourced from North America. This concentration suggests a potential bias towards North American perspectives in their learned understanding,” the report reads.
Broken down, here’s what the data looks like:
- North America: Almost all AI bot traffic comes from crawlers (about 90%)
- Latin America: Still crawler-heavy at 72%
- APAC (Asia Pacific): More balanced, but crawlers still dominate at 58%
- EMEA (Europe, Middle East, Africa): Fetchers, the bots that retrieve pages in real time to answer user queries (such as ChatGPT lookups), make up 59% of AI bot traffic

For some, this will come as no surprise: Training sets are overwhelmingly English-based.
For example, when Common Crawl sweeps the web, almost half of everything it grabs (about 45%) is English-language content. No other language comes close: German, Russian, Japanese, French, Spanish, Chinese, etc. all sit below 6% each.

| Rank | Language | Approx. % of Documents |
|---|---|---|
| 1 | English | 44–46% |
| 2 | German | 5.4–5.8% |
| 3 | Russian | 5–6% |
| 4 | Japanese | 5.1% |
| 5 | French | 4.5% |
| 6 | Spanish | 4.3% |
| 7 | Polish | 1.7–1.8% |
| 8 | Chinese | 1.1–1.5% |

Of course, that’s no accident. Most major LLMs come out of English-speaking institutions. Take a look at OpenAI, Meta, Anthropic, and Google, all of which are U.S.-based and building first for U.S. markets.
Meta’s own LLaMA 2 paper acknowledges that more than 80% of its training data is English and even warns that the model may not perform well in other languages.
For hosts, that bias has very real consequences.
Because English content dominates training sets, U.S. and Canadian websites — and the infrastructure behind them — become the first stop for AI crawlers.
But it’s not just any sites being scraped: Fastly confirmed that eCommerce, technology, and media/entertainment are the most sought-after verticals.

As for why, the report noted: “This likely reflects the high value of these domains in terms of fresh, dynamic, and information rich content such as product listings, news articles, reviews, and technical documentation, which are useful for training or grounding language models.”
What It Means for Hosts
Fastly reported that some crawlers spike at 1,000 requests per minute, while fetchers can hit 39,000 requests per minute. It’s enough to cause DDoS-like effects, such as slowdowns and timeouts.
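To make those per-minute figures concrete, here is a minimal sketch of how a host might flag a client that exceeds a request budget inside a rolling 60-second window. The `BurstDetector` class, its parameter names, and the 1,000-request default are illustrative assumptions borrowed from the numbers above, not anything Fastly prescribes. It also assumes timestamps arrive in order for each client.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Hypothetical helper: flags clients that exceed a request budget
    inside a sliding time window."""

    def __init__(self, limit_per_window: int = 1000, window_seconds: float = 60.0):
        self.limit = limit_per_window
        self.window = window_seconds
        # One queue of recent request timestamps per client.
        self._hits = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request and return True if the client is over the limit.

        Assumes timestamps for a given client are non-decreasing.
        """
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have aged out of the window.
        cutoff = timestamp - self.window
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) > self.limit


if __name__ == "__main__":
    detector = BurstDetector(limit_per_window=1000, window_seconds=60.0)
    # Simulate one bot sending a request every 50 ms for about a minute.
    flagged = [detector.record("ExampleBot", i * 0.05) for i in range(1200)]
    print("requests flagged as over budget:", sum(flagged))
```

In practice the client key would likely combine user agent and IP range, and a flagged client would be handed to whatever throttling or blocking layer the host already runs.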
On top of that, a lack of bot verification is still an issue, making it hard for security teams to distinguish legitimate automation (think search engines and uptime monitors) from bots impersonating humans.

“Whether scraping for training data or delivering real-time responses, these bots create new challenges for visibility, control, and cost,” said Arun Kumar, Senior Security Researcher at Fastly. “You can’t secure what you can’t see, and without clear verification standards, AI-driven automation risks are becoming a blind spot for digital teams.”
HostingAdvice has already reported on this freeloading problem — for lack of a better term — when Cloudflare called out AI scrapers for harvesting content without consent or compensation.
If hosts don’t control crawler traffic, they basically end up subsidizing AI companies while customers pay higher bills with worse performance.
That’s a recipe for churn, but there is an upside.
Hosts can’t stop crawlers, but they can control how much they take, when they take it, and what it costs. Providers who treat AI bot traffic as an infrastructure challenge, not a faraway concept, will protect their clients from unforeseen charges driven by capacity hikes.
Whether that means rolling out llms.txt or partnering with vendors who offer bot mitigation tools, the point is to never pay for someone else’s training data.
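For hosts that want to see what "controlling how much they take" could look like in code, below is a rough sketch of a per-crawler token bucket, a common rate-limiting technique. Everything here is an assumption made for illustration: the `TokenBucket` and `CrawlerThrottle` names, the substring match on the User-Agent header, and the example limits. A commercial bot mitigation product would do considerably more, including verifying that a bot is who it claims to be.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucket:
    """Allows short bursts up to `burst`, refilling at `rate_per_minute`."""

    def __init__(self, rate_per_minute: float, burst: int, now: Optional[float] = None):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CrawlerThrottle:
    """Hypothetical wrapper: one bucket per configured user-agent token."""

    def __init__(self, limits: Dict[str, Tuple[float, int]]):
        # Maps a lowercase user-agent substring to (requests per minute, burst).
        self._limits = {key.lower(): value for key, value in limits.items()}
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, user_agent: str, now: Optional[float] = None) -> bool:
        """Return True to serve the request, False to reject it (e.g. HTTP 429)."""
        agent = user_agent.lower()
        for key, (rate, burst) in self._limits.items():
            if key in agent:
                bucket = self._buckets.get(key)
                if bucket is None:
                    bucket = TokenBucket(rate, burst, now)
                    self._buckets[key] = bucket
                return bucket.allow(now)
        return True  # agents without a configured limit pass through


if __name__ == "__main__":
    # Illustrative limits only; real values depend on the host's capacity.
    throttle = CrawlerThrottle({"gptbot": (120, 20), "claudebot": (120, 20)})
    print(throttle.check("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

A token bucket suits this use because it tolerates brief bursts while capping sustained volume, which lets a crawler keep working without monopolizing a shared server.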




