AI crawlers are aggressively scraping websites to train large language models. Unlike traditional search engine bots that drive traffic to your site, AI crawlers take your content without giving anything back. This guide shows you how to block AI crawlers using NGINX and the GetPageSpeed device detection module — a server-level solution that enforces your access rules.
Why Block AI Crawlers?
AI companies deploy crawlers to harvest web content for training their models. They consume your bandwidth, increase server load, and use your content commercially—often without permission.
Common reasons to deny AI crawler access:
- Bandwidth protection — AI crawlers make thousands of requests per day
- Content protection — Prevent your content from training competing AI models
- Server resources — Reduce load from non-beneficial traffic
- Legal compliance — Some jurisdictions require explicit consent for AI training
- Competitive advantage — Keep your content from powering competitor tools
The robots.txt Limitation
Many site owners add rules to robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
However, robots.txt is merely a suggestion. Polite crawlers respect it, but nothing technically prevents a bot from ignoring these rules. For true protection, you need server-level blocking that actively rejects requests.
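To see just how advisory robots.txt is, here is a small Python sketch using the standard-library urllib.robotparser (the URL is illustrative). A polite crawler consults the rules before fetching; nothing forces an impolite one to ever ask:

```python
from urllib import robotparser

# The same rules a site might publish in robots.txt
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks first and is told to stay out...
print(rp.can_fetch("GPTBot", "https://example.com/article"))  # False

# ...but enforcement is entirely voluntary: a bot that never calls
# can_fetch() can still request the page, which is why server-side
# blocking is needed.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```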
Installing the NGINX Device Detection Module
The GetPageSpeed device detection module identifies AI crawlers by analyzing User-Agent strings against a comprehensive database of known bots.
RHEL, CentOS, AlmaLinux, Rocky Linux
sudo dnf install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf install nginx-module-device-type
Then load the module in your nginx.conf:
load_module modules/ngx_http_device_type_module.so;
Debian and Ubuntu
First, set up the GetPageSpeed APT repository, then install:
sudo apt-get update
sudo apt-get install nginx-module-device-type
On Debian/Ubuntu, the package handles module loading automatically. No load_module directive is needed.
For details, see the RPM module page or APT module page.
Basic Configuration
Once loaded, the module provides the $is_ai_crawler variable. It returns 1 when the request comes from a known AI crawler:
load_module modules/ngx_http_device_type_module.so;
events {
worker_connections 1024;
}
http {
server {
listen 80;
server_name example.com;
if ($is_ai_crawler) {
return 403;
}
location / {
root /var/www/html;
}
}
}
This returns 403 Forbidden to all AI crawlers. The module detects crawlers from OpenAI, Anthropic, Google, Microsoft, Amazon, and more.
Complete List of AI Crawlers Detected
The device detection module maintains an up-to-date database. Here are the major crawlers it identifies:
OpenAI Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| GPTBot | GPTBot | Training data |
| ChatGPT-User | ChatGPT-User | Web browsing |
| OAI-SearchBot | OAI-SearchBot | SearchGPT |
Anthropic Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| ClaudeBot | ClaudeBot | Training data |
| Claude-SearchBot | Claude-SearchBot | Web search |
| anthropic-ai | anthropic-ai | General crawling |
Google AI Crawlers
| Bot Name | User-Agent | Purpose |
|---|---|---|
| Google-Extended | Google-Extended | Gemini training |
| Gemini | Gemini | AI model training |
Other Major AI Crawlers
| Bot Name | Company | Purpose |
|---|---|---|
| Amazonbot | Amazon | Alexa/AI training |
| Applebot-Extended | Apple | Apple Intelligence |
| PerplexityBot | Perplexity | AI search engine |
| cohere-ai | Cohere | LLM training |
| DeepseekBot | DeepSeek | AI model training |
| xAI-Bot | xAI | Grok training |
| Ai2Bot | Allen Institute | Research AI |
| Meta-ExternalAgent | Meta | Llama training |
| Bytespider | ByteDance | TikTok AI |
| HuggingFace-Bot | Hugging Face | Model hub |
| MistralAI-User | Mistral AI | Mistral training |
The database includes 50+ AI crawlers and receives regular updates.
Understanding Bot Categories
The module classifies AI bots into categories for granular control:
- ai_crawler — General-purpose AI training bots (GPTBot, ClaudeBot)
- ai_data_scraper — Bots focused on data extraction
- ai_assistant — Browsing agents for AI chat interfaces (ChatGPT-User)
- ai_search_crawler — AI-powered search engines (PerplexityBot)
- ai_agent — Autonomous AI agents performing tasks
This categorization lets you allow AI search engines while blocking training bots.
Advanced Blocking Strategies
Return a Custom Error Page
Use a named location to serve a custom response to blocked crawlers:
location @ai_blocked {
default_type text/html;
return 403 "<html><body><h1>Access Denied</h1><p>AI crawlers are not allowed.</p></body></html>";
}
location / {
error_page 403 = @ai_blocked;
if ($is_ai_crawler) {
return 403;
}
# ... your normal config
}
Return 410 Gone
A 410 response tells crawlers the resource is permanently unavailable:
if ($is_ai_crawler) {
return 410;
}
Log AI Crawler Requests
Monitor AI crawler activity before deciding to block:
map $is_ai_crawler $ai_log_format {
1 "AI_CRAWLER";
default "NORMAL";
}
log_format ai_tracking '$remote_addr - $ai_log_format - $bot_name - "$request"';
server {
access_log /var/log/nginx/ai_crawlers.log ai_tracking;
}
This creates log entries like:
127.0.0.1 - AI_CRAWLER - GPTBot - "GET / HTTP/1.1"
127.0.0.1 - NORMAL - Googlebot - "GET / HTTP/1.1"
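Once the log accumulates, a short Python sketch can summarize which bots hit you hardest. The field layout below is assumed to match the ai_tracking format defined above:

```python
import re
from collections import Counter

# Matches the ai_tracking format:
# $remote_addr - $ai_log_format - $bot_name - "$request"
LINE_RE = re.compile(r'^(\S+) - (\S+) - (\S+) - "([^"]*)"')

def count_ai_hits(lines):
    """Return hit counts per bot name, counting AI crawlers only."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "AI_CRAWLER":
            hits[m.group(3)] += 1
    return hits

sample = [
    '127.0.0.1 - AI_CRAWLER - GPTBot - "GET / HTTP/1.1"',
    '127.0.0.1 - AI_CRAWLER - GPTBot - "GET /blog HTTP/1.1"',
    '127.0.0.1 - NORMAL - Googlebot - "GET / HTTP/1.1"',
]
print(count_ai_hits(sample))  # Counter({'GPTBot': 2})
```

In production you would iterate over /var/log/nginx/ai_crawlers.log instead of the sample list.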
Selective Blocking by Category
Use $bot_category for granular control:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_data_scraper" 1;
"ai_assistant" 0;
"ai_search_crawler" 0;
"ai_agent" 1;
default 0;
}
if ($block_ai) {
return 403;
}
This blocks training crawlers but allows AI search engines.
Target Specific Companies
Use $bot_name to target specific crawlers:
map $bot_name $block_specific_ai {
"GPTBot" 1;
"ClaudeBot" 1;
"Bytespider" 1;
default 0;
}
if ($block_specific_ai) {
return 403;
}
Protecting Specific Content
Allow AI crawlers on some pages while blocking on others:
location /premium/ {
if ($is_ai_crawler) {
return 403;
}
try_files $uri $uri/ =404;
}
location /about/ {
# No blocking - allow AI indexing
try_files $uri $uri/ =404;
}
Combining with robots.txt
Use both approaches for maximum protection:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
Compliant bots respect robots.txt. The NGINX module catches any that don’t.
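One way to keep both layers in a single place is to let NGINX serve the robots.txt rules itself. This fragment is a sketch; the exact rule list is up to you:

```nginx
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n";
}
```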
Testing Your Configuration
Verify everything works:
nginx -t
systemctl reload nginx
# Test with AI crawler User-Agent
curl -H "User-Agent: GPTBot/1.0" http://localhost/
# Should return 403
curl -H "User-Agent: Mozilla/5.0" http://localhost/
# Should return 200
Performance Impact
The device detection module uses optimized pattern matching:
- Memory: ~2MB for the bot database
- CPU: Negligible string matching overhead
- Latency: Sub-millisecond per request
The module outperforms Lua-based alternatives and external API lookups. See our NGINX performance tuning guide for more tips.
Keeping the Database Updated
Update through package management:
# RHEL/Rocky Linux
sudo dnf update nginx-module-device-type
# Debian/Ubuntu
sudo apt-get update && sudo apt-get upgrade nginx-module-device-type
New AI crawlers appear frequently. Updates add them to the database.
Real-World Use Cases
News and Media Sites
Publishers often want to prevent AI from training on their journalism. Block training crawlers while allowing AI search engines:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_data_scraper" 1;
"ai_search_crawler" 0;
default 0;
}
E-commerce Product Descriptions
Protect unique product descriptions from powering competitor AI tools:
location /products/ {
if ($is_ai_crawler) {
return 403;
}
try_files $uri $uri/ =404;
}
Documentation Sites
Allow AI assistants to browse documentation for users while blocking training crawlers:
map $bot_category $block_ai {
"ai_crawler" 1;
"ai_assistant" 0;
default 0;
}
Frequently Asked Questions
Will blocking AI crawlers affect my search rankings?
No. The module only blocks AI training crawlers. Traditional search engines like Googlebot and Bingbot remain unaffected. Your SEO rankings stay intact.
Can AI crawlers bypass this protection?
Sophisticated crawlers could disguise their User-Agent. However, legitimate AI companies use identifiable agents. For more protection, use the JS Challenge module.
Should I block all AI crawlers or just some?
It depends on your goals. Block all for zero AI training. Use category-based blocking for selective control.
How often is the bot database updated?
Updates arrive with each package release. Major new crawlers are added within days.
Does this work with NGINX Plus?
Yes. The module works with both open-source NGINX and NGINX Plus.
Troubleshooting
Crawlers Still Getting Through
- Verify the module is loaded via nginx -T
- Check if crawlers use disguised User-Agents
- Update to the latest module version
False Positives
If legitimate users are blocked, check $bot_name in logs and report issues.
Module Not Loading
Place load_module before the http block:
load_module modules/ngx_http_device_type_module.so;
http {
# ...
}
Alternative: Manual Blocking
Block AI crawlers using map directives without the module:
map $http_user_agent $is_ai_bot {
default 0;
~*GPTBot 1;
~*ClaudeBot 1;
~*PerplexityBot 1;
}
if ($is_ai_bot) {
return 403;
}
This approach requires manual upkeep as new crawlers appear; the module, whose database updates arrive through your package manager, is preferred.
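If you do maintain the list by hand, a small Python sketch can at least generate the map block from a plain list of bot names. The names here are examples from this article, not an exhaustive list, and the function is purely illustrative:

```python
def nginx_ai_map(bot_names, variable="$is_ai_bot"):
    """Render an NGINX map block that flags the given User-Agent substrings."""
    lines = [f"map $http_user_agent {variable} {{", "    default 0;"]
    for name in bot_names:
        # ~* makes the match case-insensitive, as in the manual map above
        lines.append(f"    ~*{name} 1;")
    lines.append("}")
    return "\n".join(lines)

print(nginx_ai_map(["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

Write the output to a file and include it from nginx.conf, then regenerate whenever you add a bot.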
Conclusion
The device detection module provides server-level protection that robots.txt cannot match. The comprehensive bot database, simple $is_ai_crawler variable, and granular controls offer effective protection from AI scraping.
For advanced features including mobile detection and browser identification, explore the NGINX Device Detection Module documentation.
Source code is on GitHub.