Updated: October 19, 2025
When your best work lives behind a paywall, the last thing you want is AI models and “reader” apps quietly training on it. In this guide, we’ll keep Google/Bing/Apple crawling your public content while blocking AI training and scraper bots from your premium paths (e.g., /premium/, /members/, /courses/). We’ll do it in layers: robots.txt ➝ WAF rules ➝ server rules ➝ WordPress gate ➝ quick tests.
TL;DR (Copy & Go)
Selective opt-out robots.txt (only blocks premium)
# --- Keep SEO bots working sitewide ---
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Applebot
Disallow:
# --- Opt-out of AI training for premium path only ---
User-agent: Google-Extended
Disallow: /premium/
User-agent: Applebot-Extended
Disallow: /premium/
# --- Block common AI harvesters in premium path ---
User-agent: GPTBot
Disallow: /premium/
User-agent: CCBot
Disallow: /premium/
User-agent: PerplexityBot
Disallow: /premium/
# --- Housekeeping ---
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap_index.xml
Why this setup?
- Public content = SEO friendly. Googlebot/Bingbot/Applebot can crawl everything they need.
- Premium content = opt-out from AI training and blocked for common “AI reader” crawlers.
- Defense in depth: robots.txt is polite control; some bots ignore it, so we add WAF and server rules plus a login gate.
Step-by-Step Guide
1) Decide your structure (2 minutes)
- Use one folder for paid stuff, e.g., /premium/.
- Keep a public preview on /blog/... or /premium/slug/ with an excerpt.
2) Add the selective robots.txt (3 minutes)
WordPress:
- Yoast → Tools → File editor (or RankMath → General Settings → Edit robots.txt)
- Paste the TL;DR block above → Save
- Test in the browser: https://yourdomain.com/robots.txt
Non-WP:
- Upload robots.txt to your webroot (public_html/), then open it in the browser.
robots.txt is public, not secret. Don’t list individual private URLs in it; disallow the folder only.
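Quick check from a terminal (a minimal sketch, assuming your site is at yourdomain.com; grep -B1 prints the User-agent line above each match):
# Confirm the premium opt-outs made it into the live file
curl -s https://yourdomain.com/robots.txt | grep -B1 "Disallow: /premium/"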
3) Cloudflare WAF rule (edge-level blocking)
Target: requests to /premium/ with “AI” UAs → Block (or Managed Challenge if you want a softer gate).
Expression (example):
(http.request.uri.path starts_with "/premium/")
and (
http.user_agent contains "GPTBot" or
http.user_agent contains "CCBot" or
http.user_agent contains "PerplexityBot" or
http.user_agent contains "Perplexity-User" or
http.user_agent contains "Claude-Web" or
http.user_agent contains "AI2Bot" or
http.user_agent contains "DataForSeoBot"
)
Tip: Create a higher-priority Allow/Skip rule for (Googlebot|Bingbot|Applebot) so legit crawlers never get challenged; Cloudflare’s “known bots” field (cf.client.bot) is a harder-to-spoof way to match them.
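If you’d rather script it, here’s a hedged sketch using Cloudflare’s Rulesets API. It assumes $ZONE holds your zone ID and $CF_API_TOKEN a token with firewall-edit rights; note that PUT on the phase entrypoint replaces all existing custom rules in that phase, so stick to the dashboard if you already have rules there:
# Deploy the block rule into the custom-rules phase (replaces existing rules!)
curl -X PUT \
  "https://api.cloudflare.com/client/v4/zones/$ZONE/rulesets/phases/http_request_firewall_custom/entrypoint" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "rules": [{
      "action": "block",
      "description": "Block AI bots on /premium/",
      "expression": "(http.request.uri.path starts_with \"/premium/\") and (http.user_agent contains \"GPTBot\" or http.user_agent contains \"CCBot\" or http.user_agent contains \"PerplexityBot\")"
    }]
  }'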
4) Server rules (Nginx/Apache) — optional but strong
Nginx (inside your server block):
location ^~ /premium/ {
    # Kill known AI/scraper UAs
    if ($http_user_agent ~* "(GPTBot|CCBot|PerplexityBot|Perplexity-User|Claude-Web|AI2Bot|DataForSeoBot)") { return 403; }
    # Gate everyone except major SEO crawlers (previews) and logged-in users.
    # Only set/return/rewrite are safe inside "if", so use a flag instead of try_files.
    set $gate 1;
    if ($http_user_agent ~* "(Googlebot|Bingbot|Applebot)") { set $gate 0; }
    if ($http_cookie ~* "wordpress_logged_in_") { set $gate 0; }
    if ($gate) { return 302 https://yourdomain.com/login/?redirect_to=$scheme://$host$request_uri; }
    add_header X-Robots-Tag "noarchive, nosnippet" always;
    try_files $uri $uri/ /index.php?$args;
}
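After editing, validate and reload (the service name may differ on your distro):
# nginx -t parses the config; reload applies it without dropping connections
sudo nginx -t && sudo systemctl reload nginx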
Apache (.htaccess):
RewriteEngine On
# Block AIs on premium
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|PerplexityBot|Perplexity-User|Claude-Web|AI2Bot|DataForSeoBot) [NC]
RewriteRule ^ - [F,L]
# Gate humans (must be logged in); let major SEO crawlers reach the preview
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Bingbot|Applebot) [NC]
RewriteCond %{HTTP:Cookie} !wordpress_logged_in_ [NC]
RewriteRule ^ https://yourdomain.com/login/?redirect_to=https://%{HTTP_HOST}%{REQUEST_URI} [R=302,L]
# Optional header to avoid snippet/archive leakage (flag premium requests,
# then set the header only when the flag is present)
<IfModule mod_setenvif.c>
  SetEnvIf Request_URI "^/premium/" PREMIUM_PATH
</IfModule>
<IfModule mod_headers.c>
  Header set X-Robots-Tag "noarchive, nosnippet" env=PREMIUM_PATH
</IfModule>
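To confirm the header is actually emitted, fetch a premium URL as Googlebot (which gets a 200 rather than the login redirect) and grep for it; /premium/example/ is a hypothetical path:
curl -sI -A "Googlebot" https://yourdomain.com/premium/example/ | grep -i x-robots-tag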
5) WordPress-only fallback (if you can’t edit server)
Drop into a small mu-plugin or functions.php:
add_action('template_redirect', function () {
    $uri = $_SERVER['REQUEST_URI'] ?? '';
    if (strpos($uri, '/premium/') === 0) {
        // Allow major SEO bots to view previews (UA-only check; the WAF layer handles spoofers)
        $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
        if (preg_match('/Googlebot|Bingbot|Applebot/i', $ua)) {
            return;
        }
        // Everyone else must be logged in
        if (!is_user_logged_in()) {
            wp_redirect(wp_login_url(home_url($uri)));
            exit;
        }
    }
});
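To ship it as a must-use plugin rather than editing functions.php (a minimal sketch, assuming you saved the snippet as premium-gate.php with a standard plugin header; the wp-content path may differ on your host):
# mu-plugins load automatically and can't be deactivated from wp-admin
mkdir -p wp-content/mu-plugins
cp premium-gate.php wp-content/mu-plugins/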
6) Testing (1 minute each)
Run these from a terminal or your monitoring box:
# Fetch your live robots.txt
curl -s https://yourdomain.com/robots.txt
# AI UAs should be blocked on premium (expect 403)
curl -I -A "GPTBot" https://yourdomain.com/premium/example/
curl -I -A "Perplexity-User" https://yourdomain.com/premium/example/
# A human with no login should be redirected to login (expect 302)
curl -I https://yourdomain.com/premium/example/
# Googlebot should be allowed to fetch the preview (expect 200)
curl -I -A "Googlebot" https://yourdomain.com/premium/example/
Short FAQ
Q1. Will this hurt my Google rankings?
No—Googlebot stays allowed. We only block Google-Extended (AI training) and certain AI/scraper UAs on /premium/.
Q2. Why do I need WAF if I already set robots.txt?
Because robots.txt is a polite request. Some bots ignore it. WAF blocks at the edge before they touch PHP/MySQL.
Q3. Do I need both Nginx/Apache rules and the WP snippet?
Pick one strong layer plus WAF. If you can edit server config, use Nginx/Apache. If not, use the WP snippet as a fallback.
Q4. Should I block /premium/ from Google completely?
Usually no. Keep a public preview page indexed for discovery; gate the full content behind login.
Q5. Can I use a different path, like /members/ or /courses/?
Absolutely. Replace /premium/ everywhere (robots, WAF, server, WP code).
Q6. What about new AI bots I haven’t listed?
Review logs monthly. Add new UAs/IPs to your WAF expression. This landscape changes fast.
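A quick way to run that monthly review (a sketch assuming a combined-format access log at the usual Nginx path; adjust for your stack):
# Count user agents hitting /premium/, most frequent first
grep " /premium/" /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -20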
Bonus: Copy Pack (one place for your VA)
robots.txt (selective)
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Applebot
Disallow:
User-agent: Google-Extended
Disallow: /premium/
User-agent: Applebot-Extended
Disallow: /premium/
User-agent: GPTBot
Disallow: /premium/
User-agent: CCBot
Disallow: /premium/
User-agent: PerplexityBot
Disallow: /premium/
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap_index.xml
Cloudflare WAF expression
(http.request.uri.path starts_with "/premium/") and
(http.user_agent contains "GPTBot" or http.user_agent contains "CCBot" or
http.user_agent contains "PerplexityBot" or http.user_agent contains "Perplexity-User" or
http.user_agent contains "Claude-Web" or http.user_agent contains "AI2Bot" or
http.user_agent contains "DataForSeoBot")
Nginx location
location ^~ /premium/ {
    if ($http_user_agent ~* "(GPTBot|CCBot|PerplexityBot|Perplexity-User|Claude-Web|AI2Bot|DataForSeoBot)") { return 403; }
    # Flag pattern: only set/return/rewrite are safe inside "if"
    set $gate 1;
    if ($http_user_agent ~* "(Googlebot|Bingbot|Applebot)") { set $gate 0; }
    if ($http_cookie ~* "wordpress_logged_in_") { set $gate 0; }
    if ($gate) { return 302 https://yourdomain.com/login/?redirect_to=$scheme://$host$request_uri; }
    add_header X-Robots-Tag "noarchive, nosnippet" always;
    try_files $uri $uri/ /index.php?$args;
}
Apache .htaccess
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|PerplexityBot|Perplexity-User|Claude-Web|AI2Bot|DataForSeoBot) [NC]
RewriteRule ^ - [F,L]
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Bingbot|Applebot) [NC]
RewriteCond %{HTTP:Cookie} !wordpress_logged_in_ [NC]
RewriteRule ^ https://yourdomain.com/login/?redirect_to=https://%{HTTP_HOST}%{REQUEST_URI} [R=302,L]
<IfModule mod_setenvif.c>
  SetEnvIf Request_URI "^/premium/" PREMIUM_PATH
</IfModule>
<IfModule mod_headers.c>
  Header set X-Robots-Tag "noarchive, nosnippet" env=PREMIUM_PATH
</IfModule>
Sources & Further Reading
- An update on web publisher controls — what Google-Extended is and why it exists.
- Google’s common crawlers — tokens, behavior & robots.txt notes.
- Grounding with Google Search — honors Google-Extended disallow.
- About Applebot — Apple’s crawler, the features it powers, and control notes.
- GPTBot docs — how to allow/deny via robots.txt.
- CCBot page | FAQ — UA string, robots.txt, crawl-delay.
- PerplexityBot & Perplexity-User — roles, IP JSON, behavior.
- Cloudflare blog | coverage in The Verge.
- User Agent Blocking — create UA rules; actions (Block/Challenge).
- Subscription & Paywalled Content — structured data & SEO best practices.
- Yoast guide — File Editor path + tips.
- Rank Math guide — virtual robots.txt editor.