
How to Block AI Crawlers with NGINX: Complete Guide

by Danila Vershinin



AI crawlers are aggressively scraping websites to train large language models. Unlike traditional search engine bots that drive traffic to your site, AI crawlers take your content without giving anything back. This guide shows you how to block AI crawlers using NGINX and the GetPageSpeed device detection module — a server-level solution that enforces your access rules.

Why Block AI Crawlers?

AI companies deploy crawlers to harvest web content for training their models. They consume your bandwidth, increase server load, and use your content commercially—often without permission.

Common reasons to deny AI crawler access:

  • Bandwidth protection — AI crawlers make thousands of requests per day
  • Content protection — Prevent your content from training competing AI models
  • Server resources — Reduce load from non-beneficial traffic
  • Legal compliance — Some jurisdictions require explicit consent for AI training
  • Competitive advantage — Keep your content from powering competitor tools

The robots.txt Limitation

Many site owners add rules to robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

However, robots.txt is merely a suggestion. Polite crawlers respect it, but nothing technically prevents a bot from ignoring these rules. For true protection, you need server-level blocking that actively rejects requests.

Installing the NGINX Device Detection Module

The GetPageSpeed device detection module identifies AI crawlers by analyzing User-Agent strings against a comprehensive database of known bots.

RHEL, CentOS, AlmaLinux, Rocky Linux

sudo dnf install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf install nginx-module-device-type

Then load the module in your nginx.conf:

load_module modules/ngx_http_device_type_module.so;
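
After adding the directive, a quick configuration test confirms NGINX can locate and load the module:

sudo nginx -t
sudo systemctl reload nginx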

Debian and Ubuntu

First, set up the GetPageSpeed APT repository, then install:

sudo apt-get update
sudo apt-get install nginx-module-device-type

On Debian/Ubuntu, the package handles module loading automatically. No load_module directive is needed.
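
To confirm the module was enabled (assuming the package follows Debian's standard modules-enabled layout; the exact file name may differ), check for its load file and test the configuration:

ls /etc/nginx/modules-enabled/ | grep -i device
sudo nginx -t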

For details, see the RPM module page or APT module page.

Basic Configuration

Once loaded, the module provides the $is_ai_crawler variable. It returns 1 when the request comes from a known AI crawler:

load_module modules/ngx_http_device_type_module.so;

events {
    worker_connections 1024;
}

http {
    server {
        listen 80;
        server_name example.com;

        if ($is_ai_crawler) {
            return 403;
        }

        location / {
            root /var/www/html;
        }
    }
}

This returns 403 Forbidden to all AI crawlers. Because the if block contains nothing but a return, this is one of the safe uses of NGINX's if directive. The module detects crawlers from OpenAI, Anthropic, Google, Microsoft, Amazon, and more.

Complete List of AI Crawlers Detected

The device detection module maintains an up-to-date database. Here are the major crawlers it identifies:

OpenAI Crawlers

Bot Name        User-Agent      Purpose
GPTBot          GPTBot          Training data
ChatGPT-User    ChatGPT-User    Web browsing
OAI-SearchBot   OAI-SearchBot   SearchGPT

Anthropic Crawlers

Bot Name           User-Agent         Purpose
ClaudeBot          ClaudeBot          Training data
Claude-SearchBot   Claude-SearchBot   Web search
anthropic-ai       anthropic-ai       General crawling

Google AI Crawlers

Bot Name          User-Agent        Purpose
Google-Extended   Google-Extended   Gemini training
Gemini            Gemini            AI model training

Other Major AI Crawlers

Bot Name             Company           Purpose
Amazonbot            Amazon            Alexa/AI training
Applebot-Extended    Apple             Apple Intelligence
PerplexityBot        Perplexity        AI search engine
cohere-ai            Cohere            LLM training
DeepseekBot          DeepSeek          AI model training
xAI-Bot              xAI               Grok training
Ai2Bot               Allen Institute   Research AI
Meta-ExternalAgent   Meta              Llama training
Bytespider           ByteDance         TikTok AI
HuggingFace-Bot      Hugging Face      Model hub
MistralAI-User       Mistral AI        Mistral training

The database includes 50+ AI crawlers and receives regular updates.

Understanding Bot Categories

The module classifies AI bots into categories for granular control:

  • ai_crawler — General-purpose AI training bots (GPTBot, ClaudeBot)
  • ai_data_scraper — Bots focused on data extraction
  • ai_assistant — Browsing agents for AI chat interfaces (ChatGPT-User)
  • ai_search_crawler — AI-powered search engines (PerplexityBot)
  • ai_agent — Autonomous AI agents performing tasks

This categorization lets you allow AI search engines while blocking training bots.

Advanced Blocking Strategies

Return a Custom Error Page

Use a named location to serve a custom response to blocked crawlers:

location @ai_blocked {
    default_type text/html;
    return 403 "<html><body><h1>Access Denied</h1><p>AI crawlers are not allowed.</p></body></html>";
}

location / {
    error_page 403 = @ai_blocked;

    if ($is_ai_crawler) {
        return 403;
    }

    # ... your normal config
}

Return 410 Gone

A 410 response tells crawlers the resource is permanently unavailable:

if ($is_ai_crawler) {
    return 410;
}

Log AI Crawler Requests

Monitor AI crawler activity before deciding to block. Define the map and log_format in the http block; the $bot_name variable, also provided by the module, records which bot was identified:

map $is_ai_crawler $ai_log_format {
    1 "AI_CRAWLER";
    default "NORMAL";
}

log_format ai_tracking '$remote_addr - $ai_log_format - $bot_name - "$request"';

server {
    access_log /var/log/nginx/ai_crawlers.log ai_tracking;
}

This creates log entries like:

127.0.0.1 - AI_CRAWLER - GPTBot - "GET / HTTP/1.1"
127.0.0.1 - NORMAL - Googlebot - "GET / HTTP/1.1"
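
To see which crawlers hit your site most often, you can summarize the log; a quick sketch that relies on the ai_tracking format defined above, where the fifth field is $bot_name:

grep ' AI_CRAWLER ' /var/log/nginx/ai_crawlers.log | awk '{print $5}' | sort | uniq -c | sort -rn

The output is a count of requests per bot, sorted from most to least active.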

Selective Blocking by Category

Use $bot_category for granular control. Define the map in the http block and place the check in a server or location block:

map $bot_category $block_ai {
    "ai_crawler"       1;
    "ai_data_scraper"  1;
    "ai_assistant"     0;
    "ai_search_crawler" 0;
    "ai_agent"         1;
    default            0;
}

if ($block_ai) {
    return 403;
}

This blocks training crawlers but allows AI search engines.
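
Assuming GPTBot is classified as ai_crawler and PerplexityBot as ai_search_crawler (per the categories listed earlier), you can verify the split with curl:

curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: GPTBot/1.0" http://localhost/
# Prints 403: training crawler blocked

curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: PerplexityBot/1.0" http://localhost/
# Prints 200: AI search engine allowed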

Target Specific Companies

Use $bot_name to target specific crawlers:

map $bot_name $block_specific_ai {
    "GPTBot"        1;
    "ClaudeBot"     1;
    "Bytespider"    1;
    default         0;
}

if ($block_specific_ai) {
    return 403;
}

Protecting Specific Content

Allow AI crawlers on some pages while blocking on others:

location /premium/ {
    if ($is_ai_crawler) {
        return 403;
    }
    try_files $uri $uri/ =404;
}

location /about/ {
    # No blocking - allow AI indexing
    try_files $uri $uri/ =404;
}

Combining with robots.txt

Use both approaches for maximum protection:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

Compliant bots respect robots.txt. The NGINX module catches any that don’t.
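
One detail worth handling: a server-level block also prevents crawlers from reading robots.txt itself. A sketch that moves the check into the catch-all location so compliant bots can still fetch your rules (paths assume the basic configuration above):

location = /robots.txt {
    root /var/www/html;
}

location / {
    if ($is_ai_crawler) {
        return 403;
    }
    root /var/www/html;
}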

Testing Your Configuration

Verify everything works:

sudo nginx -t
sudo systemctl reload nginx

# Test with an AI crawler User-Agent
curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: GPTBot/1.0" http://localhost/
# Should print 403

curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: Mozilla/5.0" http://localhost/
# Should print 200
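
To exercise several crawlers in one go, a small shell loop works; the bot names are taken from the tables above:

for ua in GPTBot ClaudeBot PerplexityBot Bytespider; do
    printf '%-15s ' "$ua"
    curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: $ua" http://localhost/
done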

Performance Impact

The device detection module uses optimized pattern matching:

  • Memory: ~2MB for the bot database
  • CPU: Negligible string matching overhead
  • Latency: Sub-millisecond per request

The module outperforms Lua-based alternatives and external API lookups. See our NGINX performance tuning guide for more tips.
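
To verify the overhead on your own hardware, compare request latency with and without a crawler User-Agent; a sketch using wrk (assuming it is installed):

# Baseline with a regular browser User-Agent
wrk -t2 -c50 -d15s -H "User-Agent: Mozilla/5.0" http://localhost/

# Same load with an AI crawler User-Agent (every request returns 403)
wrk -t2 -c50 -d15s -H "User-Agent: GPTBot/1.0" http://localhost/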

Keeping the Database Updated

Update through package management:

# RHEL/Rocky Linux
sudo dnf update nginx-module-device-type

# Debian/Ubuntu  
sudo apt-get update && sudo apt-get upgrade nginx-module-device-type

New AI crawlers appear frequently, and each package update adds them to the detection database.
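
If you prefer updates to apply unattended, a weekly cron entry is one option; a sketch for the RHEL family (the file name is hypothetical, and a restart is used to be safe, since a changed module binary may not be picked up by a plain reload):

# /etc/cron.d/update-device-type
0 4 * * 0 root dnf -y update nginx-module-device-type && systemctl restart nginx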

Real-World Use Cases

News and Media Sites

Publishers often want to prevent AI from training on their journalism. Block training crawlers while allowing AI search engines:

map $bot_category $block_ai {
    "ai_crawler"       1;
    "ai_data_scraper"  1;
    "ai_search_crawler" 0;
    default            0;
}

E-commerce Product Descriptions

Protect unique product descriptions from powering competitor AI tools:

location /products/ {
    if ($is_ai_crawler) {
        return 403;
    }
    try_files $uri $uri/ =404;
}

Documentation Sites

Allow AI assistants to browse documentation for users while blocking training crawlers:

map $bot_category $block_ai {
    "ai_crawler"   1;
    "ai_assistant" 0;
    default        0;
}

Frequently Asked Questions

Will blocking AI crawlers affect my search rankings?

No. The module only blocks AI training crawlers. Traditional search engines like Googlebot and Bingbot remain unaffected. Your SEO rankings stay intact.
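
You can confirm this yourself: with blocking enabled, a traditional search engine User-Agent passes straight through:

curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: Googlebot/2.1" http://localhost/
# Should print 200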

Can AI crawlers bypass this protection?

Sophisticated crawlers could disguise their User-Agent. However, legitimate AI companies use identifiable agents. For more protection, use the JS Challenge module.

Should I block all AI crawlers or just some?

It depends on your goals. Block all for zero AI training. Use category-based blocking for selective control.

How often is the bot database updated?

Updates arrive with each package release. Major new crawlers are added within days.

Does this work with NGINX Plus?

Yes. The module works with both open-source NGINX and NGINX Plus.

Troubleshooting

Crawlers Still Getting Through

  1. Verify the module is loaded via nginx -T (see the command below)
  2. Check if crawlers use disguised User-Agents
  3. Update to the latest module version
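
For step 1, dumping the full configuration and filtering for the module is the quickest check:

sudo nginx -T 2>/dev/null | grep -i device_type
# Expect the load_module line (or the modules-enabled include on Debian/Ubuntu)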

False Positives

If legitimate users are blocked, check $bot_name in logs and report issues.

Module Not Loading

Place the load_module directive in the main context, at the top of nginx.conf before the http block:

load_module modules/ngx_http_device_type_module.so;

http {
    # ...
}

Alternative: Manual Blocking

Block AI crawlers using map directives without the module:

# In the http block:
map $http_user_agent $is_ai_bot {
    default         0;
    ~*GPTBot        1;
    ~*ClaudeBot     1;
    ~*PerplexityBot 1;
}

# In a server or location block:
if ($is_ai_bot) {
    return 403;
}

This approach requires manual updates every time a new crawler appears. The module, whose database updates arrive through your package manager, is the preferred option.

Conclusion

The device detection module provides server-level protection that robots.txt cannot match. The comprehensive bot database, simple $is_ai_crawler variable, and granular controls offer effective protection from AI scraping.

For advanced features including mobile detection and browser identification, explore the NGINX Device Detection Module documentation.

Source code is on GitHub.


Danila Vershinin

Founder & Lead Engineer

NGINX configuration and optimization • Linux system administration • Web performance engineering

10+ years NGINX experience • Maintainer of GetPageSpeed RPM repository • Contributor to open-source NGINX modules
