AI crawlers are aggressively scraping websites to train large language models. Unlike traditional search engine bots that drive traffic to your site, AI crawlers take your content without giving anything back. This guide shows you how to block AI crawlers using NGINX and the GetPageSpeed device detection module — a server-level solution that enforces your access rules.
Why Block AI Crawlers?
AI companies deploy crawlers to harvest web content for training their models. They consume your bandwidth, increase server load, and use your content commercially—often without permission.
Common reasons to deny AI crawler access:
- Bandwidth protection — AI crawlers make thousands of requests per day
- Content protection — Prevent your content from training competing AI models
- Server resources — Reduce load from non-beneficial traffic
- Legal compliance — Some jurisdictions require explicit consent for AI training
- Competitive advantage — Keep your content from powering competitor tools
The robots.txt Limitation
Many site owners add rules to robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
However, robots.txt is merely a suggestion. Polite crawlers respect it, but nothing technically prevents a bot from ignoring these rules. For true protection, you need server-level blocking that actively rejects requests.
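To see just how advisory robots.txt is, here is a small Python sketch using the standard-library urllib.robotparser (the URL is illustrative). A polite crawler consults the rules before fetching; nothing forces an impolite one to ever ask:

```python
from urllib import robotparser

# The same rules a site might publish in robots.txt
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks first and is told to stay out...
print(rp.can_fetch("GPTBot", "https://example.com/article"))  # False

# ...but enforcement is entirely voluntary: a bot that never calls
# can_fetch() can still request the page, which is why server-side
# blocking is needed.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```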
Installing the NGINX Device Detection Module
The GetPageSpeed device detection module identifies AI crawlers by analyzing User-Agent strings against a comprehensive database of known bots.
RHEL, CentOS, AlmaLinux, Rocky Linux
sudo dnf install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf install nginx-module-device-type
Then load the module in your nginx.conf:
load_module modules/ngx_http_device_type_module.so;
Debian and Ubuntu
First, set up the GetPageSpeed APT repository, then install:
sudo apt-get update
sudo apt-get install nginx-module-device-type
On Debian/Ubuntu, the package handles module loading automatically. No load_module directive is needed.
For details, see the RPM module page or APT module page.
Basic Configuration
Once loaded, the module provides the $is_ai_crawler variable. It returns 1 when the request comes from a known AI crawler:
load_module modules/ngx_http_device_type_module.so;
events {
worker_connections 1024;
}
http {
server {
listen 80;
server_name example.com;
if ($is_ai_crawler) {
return 403;
}
location / {
root /var/www/html;
}
}
}
This returns 403 Forbidden to all AI crawlers. The module detects crawlers from OpenAI, Anthropic, Google, Microsoft, Amazon, and more.
Complete List of AI Crawlers Detected
The device detection module maintains an up-to-date database. Here are the major crawlers it identifies:
OpenAI Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| GPTBot | GPTBot | Training data |
| ChatGPT-User | ChatGPT-User | Web browsing |
| OAI-SearchBot | OAI-SearchBot | SearchGPT |
Anthropic Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| ClaudeBot | ClaudeBot | Training data |
| Claude-SearchBot | Claude-SearchBot | Web search |
| anthropic-ai | anthropic-ai | General crawling |
Google AI Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| Google-Extended | Google-Extended | Gemini training |
| Gemini | Gemini | AI model training |
Other Major AI Crawlers
| Bot Name | Company | Purpose |
|---|---|---|
| Amazonbot | Amazon | Alexa/AI training |
| Applebot-Extended | Apple | Apple Intelligence |
| PerplexityBot | Perplexity | AI search engine |
| cohere-ai | Cohere | LLM training |
| DeepseekBot | DeepSeek | AI model training |
| xAI-Bot | xAI | Grok training |
| Ai2Bot | Allen Institute | Research AI |
| Meta-ExternalAgent | Meta | Llama training |
| Bytespider | ByteDance | TikTok AI |
| HuggingFace-Bot | Hugging Face | Model hub |
| MistralAI-User | Mistral AI | Mistral training |
The database includes 50+ AI crawlers and receives regular updates.
Understanding Bot Categories
The module classifies AI bots into categories for granular control:
- ai_crawler — General-purpose AI training bots (GPTBot, ClaudeBot)
- ai_data_scraper — Bots focused on data extraction
- ai_assistant — Browsing agents for AI chat interfaces (ChatGPT-User)
- ai_search_crawler — AI-powered search engines (PerplexityBot)
- ai_agent — Autonomous AI agents performing tasks
This categorization lets you allow AI search engines while blocking training bots.
Advanced Blocking Strategies
Return a Custom Error Page
Use a named location to serve a custom response to blocked crawlers:
location @ai_blocked {
default_type text/html;
return 403 "<html><body><h1>Access Denied</h1><p>AI crawlers are not allowed.</p></body></html>";
}
location / {
error_page 403 = @ai_blocked;
if ($is_ai_crawler) {
return 403;
}
# ... your normal config
}
Return 410 Gone
A 410 response tells crawlers the resource is permanently unavailable:
if ($is_ai_crawler) {
return 410;
}
Log AI Crawler Requests
Monitor AI crawler activity before deciding to block:
map $is_ai_crawler $ai_log_format {
1 "AI_CRAWLER";
default "NORMAL";
}
log_format ai_tracking '$remote_addr - $ai_log_format - $bot_name - "$request"';
server {
access_log /var/log/nginx/ai_crawlers.log ai_tracking;
}
This creates log entries like:
127.0.0.1 - AI_CRAWLER - GPTBot - "GET / HTTP/1.1"
127.0.0.1 - NORMAL - Googlebot - "GET / HTTP/1.1"
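Once the log accumulates, a short Python sketch can summarize which bots hit you hardest. The field layout below is assumed to match the ai_tracking format defined above:

```python
import re
from collections import Counter

# Matches the ai_tracking format:
# $remote_addr - $ai_log_format - $bot_name - "$request"
LINE_RE = re.compile(r'^(\S+) - (\S+) - (\S+) - "([^"]*)"')

def count_ai_hits(lines):
    """Return hit counts per bot name, counting AI crawlers only."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "AI_CRAWLER":
            hits[m.group(3)] += 1
    return hits

sample = [
    '127.0.0.1 - AI_CRAWLER - GPTBot - "GET / HTTP/1.1"',
    '127.0.0.1 - AI_CRAWLER - GPTBot - "GET /blog HTTP/1.1"',
    '127.0.0.1 - NORMAL - Googlebot - "GET / HTTP/1.1"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 2})
```

In production you would iterate over /var/log/nginx/ai_crawlers.log instead of the sample list.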
Selective Blocking by Category
Use $bot_category for granular control:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_data_scraper" 1;
"ai_assistant" 0;
"ai_search_crawler" 0;
"ai_agent" 1;
default 0;
}
if ($block_ai) {
return 403;
}
This blocks training crawlers but allows AI search engines.
Target Specific Companies
Use $bot_name to target specific crawlers:
map $bot_name $block_specific_ai {
"GPTBot" 1;
"ClaudeBot" 1;
"Bytespider" 1;
default 0;
}
if ($block_specific_ai) {
return 403;
}
Protecting Specific Content
Allow AI crawlers on some pages while blocking on others:
location /premium/ {
if ($is_ai_crawler) {
return 403;
}
try_files $uri $uri/ =404;
}
location /about/ {
# No blocking - allow AI indexing
try_files $uri $uri/ =404;
}
Combining with robots.txt
Use both approaches for maximum protection:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
Compliant bots respect robots.txt. The NGINX module catches any that don’t.
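One way to keep both layers in a single place is to let NGINX serve the robots.txt rules itself. This fragment is a sketch; the exact rule list is up to you:

```nginx
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n";
}
```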
Testing Your Configuration
Verify everything works:
nginx -t
systemctl reload nginx
# Test with AI crawler User-Agent
curl -H "User-Agent: GPTBot/1.0" http://localhost/
# Should return 403
curl -H "User-Agent: Mozilla/5.0" http://localhost/
# Should return 200
Performance Impact
The device detection module uses optimized pattern matching:
- Memory: ~2MB for the bot database
- CPU: Negligible string matching overhead
- Latency: Sub-millisecond per request
The module outperforms Lua-based alternatives and external API lookups. See our NGINX performance tuning guide for more tips.
Keeping the Database Updated
Update through package management:
# RHEL/Rocky Linux
sudo dnf update nginx-module-device-type
# Debian/Ubuntu
sudo apt-get update && sudo apt-get upgrade nginx-module-device-type
New AI crawlers appear frequently. Updates add them to the database.
Real-World Use Cases
News and Media Sites
Publishers often want to prevent AI from training on their journalism. Block training crawlers while allowing AI search engines:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_data_scraper" 1;
"ai_search_crawler" 0;
default 0;
}
E-commerce Product Descriptions
Protect unique product descriptions from powering competitor AI tools:
location /products/ {
if ($is_ai_crawler) {
return 403;
}
try_files $uri $uri/ =404;
}
Documentation Sites
Allow AI assistants to browse documentation for users while blocking training crawlers:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_assistant" 0;
default 0;
}
Frequently Asked Questions
Will blocking AI crawlers affect my search rankings?
No. The module only blocks AI training crawlers. Traditional search engines like Googlebot and Bingbot remain unaffected. Your SEO rankings stay intact.
Can AI crawlers bypass this protection?
Sophisticated crawlers could disguise their User-Agent. However, legitimate AI companies use identifiable agents. For more protection, use the JS Challenge module.
Should I block all AI crawlers or just some?
It depends on your goals. Block all for zero AI training. Use category-based blocking for selective control.
How often is the bot database updated?
Updates arrive with each package release. Major new crawlers are added within days.
Does this work with NGINX Plus?
Yes. The module works with both open-source NGINX and NGINX Plus.
Troubleshooting
Crawlers Still Getting Through
- Verify the module is loaded via nginx -T
- Check if crawlers use disguised User-Agents
- Update to the latest module version
False Positives
If legitimate users are blocked, check $bot_name in logs and report issues.
Module Not Loading
Place load_module before the http block:
load_module modules/ngx_http_device_type_module.so;
http {
# ...
}
Alternative: Manual Blocking
Block AI crawlers using map directives without the module:
map $http_user_agent $is_ai_bot {
default 0;
~*GPTBot 1;
~*ClaudeBot 1;
~*PerplexityBot 1;
}
if ($is_ai_bot) {
return 403;
}
This approach requires manual upkeep as new crawlers appear; the module, whose database updates arrive through your package manager, is preferred.
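If you do maintain the list by hand, a small Python sketch can at least generate the map block from a plain list of bot names. The names here are examples from this article, not an exhaustive list, and the function is purely illustrative:

```python
def nginx_ai_map(bot_names, variable="$is_ai_bot"):
    """Render an NGINX map block that flags the given User-Agent substrings."""
    lines = [f"map $http_user_agent {variable} {{", "    default 0;"]
    for name in bot_names:
        # ~* makes the match case-insensitive, as in the manual map above
        lines.append(f"    ~*{name} 1;")
    lines.append("}")
    return "\n".join(lines)

print(nginx_ai_map(["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

Write the output to a file and include it from nginx.conf, then regenerate whenever you add a bot.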
Conclusion
The device detection module provides server-level protection that robots.txt cannot match. The comprehensive bot database, simple $is_ai_crawler variable, and granular controls offer effective protection from AI scraping.
For advanced features including mobile detection and browser identification, explore the NGINX Device Detection Module documentation.
Source code is on GitHub.