Every publisher is facing a pressing question right now: “Should we block AI bots from scraping our content? And if so, how do we actually do it?”
Visits to ChatGPT, Claude, Google Gemini, and other AI tools have exploded, now totaling well over 15 billion per month.1
You’ve seen the headlines. Major news publishers are reporting that AI-generated answers are hurting their website traffic, with some seeing drops of 10% to 15% … even up to 50% in extreme cases.
But here is the reality: Not all publishers are impacted equally.
AI threat level by publisher type
National news outlets are shouting the loudest because they are the most highly impacted by generative AI. However, the impact of AI varies wildly depending on your niche.
Here is how I break down the risk level of generative AI across the media industry:
- National News (Very High Risk): These organizations are hit the hardest. They compete with dozens (if not hundreds) of outlets reporting on the same story, they are highly dependent on SEO, and they are heavily reliant on programmatic advertising and page view volume.
- Hobby & Enthusiast (High Risk): Publishers who cover topics like model railroading, knitting, jazz, fishing, etc. are still highly dependent on SEO. They have niche content, but there is a “commodity” risk where AI can easily summarize general knowledge topics.
- City & Regional News (Medium-High Risk): These publishers have an advantage: local content. However, they still face competition / commoditization from other local newspaper, TV and radio websites, are somewhat dependent on SEO, and many of them rely on high page volume for programmatic ad revenue.
- City & Regional Lifestyle (Medium Risk): Magazines like D Magazine (Dallas) or Philadelphia Magazine have unique, local content. They are still somewhat SEO dependent, and some rely on programmatic backfill advertising, but their content is harder for an AI to generically replicate.
- National B2B Trade (Low Risk): This is where we see a shift. B2B / professional association publishers (e.g., auto body, construction, restaurant owners) are much less dependent on SEO and can rely more on email and other channels. Their content is highly specialized and low-commodity, and very few B2B publishers rely on programmatic backfill revenue.
- City & Regional Business (Low Risk): Publications like BizTimes Milwaukee or Ottawa Business Journal are actually seeing their search traffic grow in some cases. Their local content is highly unique, and channels like email, Google News, and Nextdoor News often drive much more traffic than search.

AI visibility vs. website traffic
Before you block AI bots from your site, you have to answer one core question: Is visibility in an AI answer more important to you than website traffic?
If you allow bots to scrape your site, your content might appear in a ChatGPT or Gemini answer. You gain visibility, but users may never click through to your site.
For most of my publisher clients, we have decided that traffic is more important. We also don’t want to feed the AI models for free and are exploring ways to license our content to them instead.
And for those worried that blocking bots will hurt your SEO, the data suggests otherwise. Here is a week-by-week view of a client where we blocked all AI bots in mid-September. You’ll notice that since blocking the bots, there has been no harmful impact on organic search traffic.

Loopholes AI companies use
If you decide to block AI bots from your website, you can’t just rely on a simple robots.txt file or rules at the server / firewall level. As my friend Jez Walters discussed during our What’s New in Publishing webinar, there are loopholes that AI companies can use to get your content anyway.
- The Internet Archive Loophole: AI companies claim they honor robots.txt. However, reports suggest they are bypassing this by grabbing your content from Common Crawl or the Internet Archive (Wayback Machine). They aren’t scraping you directly; they are scraping the archives.
- The Sublicensing Loophole: If you participate in content distribution programs like SmartNews or NewsBreak, check the terms and conditions of your agreement. You may have unwittingly given them permission to sublicense your content to AI bots.
How to block AI bots effectively
If you want to lock down your content from generative AI bots, use a multi-layered approach:
- Update your robots.txt file: The robots.txt file gives instructions to bots that visit your site. You need to specifically list and block the user agents for every AI bot (ChatGPT, Gemini, Claude, etc.), and to close the Internet Archive loophole, you must also block the Common Crawl and Wayback Machine bots (a sample robots.txt is sketched below this list). Note that only the “ethical” AI bots respect robots.txt.
- Use a web application firewall (WAF): This happens at the website hosting level. You want to block AI bots at the firewall so they never even hit your site. Cloudflare and AWS, for example, have specific AI bot filtering settings (a server-level sketch appears below this list).
- Implement geo-blocking: If you are a local or regional publisher in the US or Canada (or elsewhere), consider blocking all traffic from outside your country. Many unscrupulous AI bots originate internationally. If international readership isn’t critical to your business model, this can be another effective tactic (a sample CDN rule appears below this list).
- Advanced blocking tactics: You can implement “AI honeypots” (invisible links that only bots follow, trapping them in a loop). You can also set behavioral rules: for example, if a user requests your sitemap immediately upon arrival, or makes multiple requests extremely fast, that is likely a bot, not a human (a rough behavioral-rule sketch appears below this list).
WARNING: Be very careful not to block legitimate search engine bots. You do not want to accidentally block Google or Bing from indexing your site for standard search.
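
To make the robots.txt step concrete, here is a minimal sketch. The user-agent names below are commonly documented crawler identifiers, but bot names change and new ones appear, so verify the current list against each company’s documentation before relying on it.

```
# Block AI crawlers (the "ethical" bots that honor robots.txt)
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /

# Close the archive loophole: Common Crawl and the Wayback Machine
User-agent: CCBot
User-agent: ia_archiver
Disallow: /

# Leave standard search indexing untouched
User-agent: Googlebot
User-agent: Bingbot
Disallow:
```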
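
Because robots.txt is only advisory, back it up at the edge. Cloudflare and AWS expose their AI bot settings in their dashboards; if you manage your own web server, a minimal nginx sketch (assuming nginx, and reusing the same bot names as the robots.txt example above, which you should verify yourself) looks like this:

```nginx
# Inside the relevant server { } block:
# refuse requests whose User-Agent matches known AI crawlers.
# This only catches bots that identify themselves honestly.
if ($http_user_agent ~* "(GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|Google-Extended|PerplexityBot|CCBot|ia_archiver)") {
    return 403;
}
```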
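
For the geo-blocking step, a CDN rule is usually the simplest route. As one hedged example, a Cloudflare WAF custom rule set to Block, with an expression along these lines, would turn away traffic from outside the US and Canada (swap in your own country codes, and double-check the field names against Cloudflare’s current rules language):

```
(ip.geoip.country ne "US" and ip.geoip.country ne "CA")
```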
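
For the behavioral rules mentioned under advanced tactics, here is a rough Python sketch, with hypothetical thresholds rather than production values, that flags a client whose first-ever request is the sitemap or that exceeds a simple request-rate limit:

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: tune them against your own traffic patterns.
MAX_REQUESTS = 20     # requests allowed...
WINDOW_SECONDS = 5    # ...within this many seconds

request_history = defaultdict(deque)  # client IP -> recent request timestamps
seen_before = set()                   # client IPs we have already served

def looks_like_bot(client_ip: str, path: str) -> bool:
    """Return True if this request matches a simple bot heuristic."""
    now = time.time()

    # Heuristic 1: the very first request goes straight to the sitemap.
    first_visit = client_ip not in seen_before
    seen_before.add(client_ip)
    if first_visit and path.endswith("sitemap.xml"):
        return True

    # Heuristic 2: too many requests inside a short sliding window.
    timestamps = request_history[client_ip]
    timestamps.append(now)
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS
```

In practice you would run checks like these in middleware or at the CDN and respond with a block, a CAPTCHA, or the honeypot loop described above.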
Footnotes
- Similarweb, September 2025 Desktop and Mobile Visits