NGINX Bot Verification: Block Fake Crawlers


Many website owners allow search engine bots to bypass security measures. They do this to ensure proper indexing and maintain good SEO rankings. However, this creates a significant vulnerability. Malicious actors frequently impersonate legitimate crawlers like Googlebot to scrape content, launch attacks, or bypass rate limits.

The NGINX bot verification module solves this problem by validating whether visitors claiming to be search engine bots are genuine. It uses the same reverse DNS verification method that Google, Microsoft, and other search engines officially recommend. This guide covers how to install, configure, and test the module.

Why You Need NGINX Bot Verification

Search engine crawlers identify themselves through the User-Agent header. For example, Googlebot sends a header like this:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The problem is simple. Anyone can set this header. A malicious script can easily claim to be Googlebot. Therefore, many websites unknowingly grant special access to attackers who spoof these headers.

Consider these common scenarios where fake bots cause problems:

  • Content scraping: Competitors steal your content by pretending to be search crawlers
  • DDoS attacks: Attackers bypass rate limits by using bot user-agents
  • Vulnerability scanning: Hackers avoid security tools by impersonating crawlers
  • Click fraud: Bots fake their identity to manipulate analytics data

The solution is reverse DNS verification. This technique confirms that the requesting IP address actually belongs to the claimed search engine. Google, Microsoft, and other search providers officially document this verification method.

How NGINX Bot Verification Works

The module operates in the access phase of request processing. When a request arrives, the module runs these checks in order:

  1. User-Agent Detection: The module checks if the User-Agent header contains known bot identifiers (Google, Bing, Yahoo, Baidu, or Yandex)
  2. Reverse DNS Lookup: If a bot is detected, the module performs a reverse DNS lookup on the client IP address
  3. Domain Validation: The resulting hostname must match approved domains for that search engine
  4. Forward DNS Verification: The module confirms the hostname resolves back to the original IP
  5. Result Caching: Valid and invalid results get cached in Redis to prevent repeated lookups

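This NGINX bot verification approach is effective because it does not trust reverse DNS alone. An attacker who controls their own IP range can publish a PTR record that claims a googlebot.com hostname, but they cannot make the search engine's forward DNS resolve that hostname back to their IP. Only the search engine controls the records under its own domains, so the forward check in step 4 exposes the forgery. Moreover, caching the result keeps the performance impact on your server small.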
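Steps 2 through 4 amount to what is commonly called forward-confirmed reverse DNS. The following is a minimal shell sketch of that check, not the module's actual code. The function names host_in_domains and verify_crawler_ip are hypothetical, and the sketch assumes the dig utility is installed.

```shell
#!/bin/sh
# Illustrative sketch of forward-confirmed reverse DNS; not the module's code.

# Succeeds if hostname $1 equals, or is a subdomain of, any domain in $2...
host_in_domains() (
    host=$(printf '%s' "$1" | tr 'A-Z' 'a-z')   # DNS names are case-insensitive
    host=${host%.}                               # drop the trailing dot dig prints
    shift
    for domain in "$@"; do
        case "$host" in
            "$domain" | *."$domain") exit 0 ;;
        esac
    done
    exit 1
)

# Succeeds if IP $1 reverse-resolves into one of the domains in $2...
# and that hostname resolves forward to the same IP.
verify_crawler_ip() (
    ip=$1
    shift
    ptr=$(dig +short -x "$ip" | head -n 1)       # step 2: reverse lookup (PTR)
    [ -n "$ptr" ] || exit 1
    host_in_domains "$ptr" "$@" || exit 1        # step 3: domain validation
    # step 4: forward confirmation, as an exact text match on the address
    { dig +short "$ptr" A; dig +short "$ptr" AAAA; } | grep -qxF "$ip"
)
```

For example, `verify_crawler_ip 66.249.66.1 googlebot.com google.com && echo genuine` would run all three checks against live DNS. The suffix match requires a leading dot, so a hostname such as evilgooglebot.com does not pass as googlebot.com. The forward check compares addresses as text, so an IPv6 address must be written in the same compressed form that dig prints.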

Supported Search Engines

The module validates bots from these major search engines:

Search Engine    Verified Domains
Google           googlebot.com, google.com
Bing             search.msn.com
Yahoo            yahoo.com
Baidu            crawl.baidu.com
Yandex           yandex.com, yandex.net, yandex.ru

When a request fails verification, the module returns a 403 Forbidden response. This blocks fake crawlers while allowing legitimate search engine bots to access your content normally.
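The table above can be read as a two-stage lookup: the User-Agent selects a provider, and the provider selects the domains that a reverse lookup must land in. Here is a rough shell sketch of that mapping. The function names and the exact substrings matched are assumptions for illustration; the module's real matching rules may differ.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the table; not taken from the module source.

# Print the search engine a User-Agent claims to be, or nothing if none.
claimed_provider() (
    ua=$(printf '%s' "$1" | tr 'A-Z' 'a-z')   # match case-insensitively
    case "$ua" in
        *google*) echo google ;;
        *bing*)   echo bing ;;
        *yahoo*)  echo yahoo ;;
        *baidu*)  echo baidu ;;
        *yandex*) echo yandex ;;
    esac
)

# Print the approved domains for a provider; fail for unknown providers.
approved_domains() {
    case "$1" in
        google) echo "googlebot.com google.com" ;;
        bing)   echo "search.msn.com" ;;
        yahoo)  echo "yahoo.com" ;;
        baidu)  echo "crawl.baidu.com" ;;
        yandex) echo "yandex.com yandex.net yandex.ru" ;;
        *)      return 1 ;;
    esac
}
```

A request whose User-Agent yields no provider is not treated as a bot at all, which is why ordinary browser traffic never triggers a DNS lookup.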

Installation on Rocky Linux, AlmaLinux, and RHEL

Installing the NGINX bot verification module requires the GetPageSpeed repository, which needs an active subscription. This repository provides pre-built packages for all major RHEL-based distributions. Follow these steps to install the module.

First, install the GetPageSpeed repository:

dnf -y install https://extras.getpagespeed.com/release-latest.rpm

Next, install the bot verifier module along with Redis (or KeyDB) for caching:

dnf -y install nginx-module-bot-verifier keydb

KeyDB is a high-performance Redis alternative that works identically for this purpose. You can also use the standard Redis server if you prefer.

Start and enable the caching service:

systemctl enable --now keydb

Finally, load the module in your NGINX configuration. Add this line at the very top of /etc/nginx/nginx.conf, before the events block:

load_module modules/ngx_http_bot_verifier_module.so;

Test and reload your configuration:

nginx -t && systemctl reload nginx
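If the reload fails or the module's directives are reported as unknown, the load_module line may be missing or placed after the events block. The following is a small, hypothetical helper that checks a single configuration file for this. It does not follow include directives, so treat it as a quick sanity check rather than a full parser.

```shell
#!/bin/sh
# Succeeds if the bot verifier load_module line appears in config file $1
# before the first "events" block. Commented-out lines are ignored.
module_loaded_before_events() {
    awk '
        /^[ \t]*load_module[ \t].*ngx_http_bot_verifier_module\.so/ { found = 1 }
        /^[ \t]*events([ \t]|[{]|$)/ { exit }   # stop scanning at the events block
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Usage: `module_loaded_before_events /etc/nginx/nginx.conf || echo "load_module line missing or misplaced"`.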

SELinux Configuration

On systems with SELinux enabled, NGINX needs permission to connect to Redis. Run this command to allow network connections:

setsebool -P httpd_can_network_connect 1

Without this setting, the NGINX bot verification module will bypass verification and log connection errors. Therefore, this step is essential for proper functionality.

Configuration Guide

The bot verifier module uses simple directives within location blocks. Here is a complete configuration example:

server {
    listen 80;
    server_name example.com;

    location / {
        bot_verifier on;
        bot_verifier_redis_host 127.0.0.1;
        bot_verifier_redis_port 6379;
        bot_verifier_redis_connection_timeout 10;
        bot_verifier_redis_read_timeout 10;
        bot_verifier_redis_expiry 3600;

        # Your existing location configuration
        try_files $uri $uri/ =404;
    }
}

Configuration Directives Explained

The module provides several directives for fine-tuning its behavior:

bot_verifier (on|off)

This directive enables or disables bot verification for the location. The default value is off. Set it to on to activate the module.

bot_verifier_redis_host

Specifies the Redis or KeyDB server hostname. The default is localhost. Use the IP address 127.0.0.1 or the hostname of your caching server.

bot_verifier_redis_port

Sets the Redis port number. The default is 6379, which is the standard Redis port. Change this only if your Redis server uses a non-standard port.

bot_verifier_redis_connection_timeout

Defines the connection timeout in seconds. The default is 10 seconds. Lower values provide faster failure detection but may cause issues with slow networks.

bot_verifier_redis_read_timeout

Sets the timeout for Redis read operations. The default is 10 seconds. This affects how long the module waits for cached results.

bot_verifier_redis_expiry

Controls how long verification results stay cached. The default is 3600 seconds (one hour). Longer values reduce DNS lookups but delay detection of IP changes.

Production Configuration Example

For production environments, consider this enhanced NGINX bot verification configuration:

# Main context - load the module
load_module modules/ngx_http_bot_verifier_module.so;

http {
    # ... other settings ...

    server {
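        listen 443 ssl http2;   # on NGINX 1.25.1+, prefer "listen 443 ssl;" plus "http2 on;"
        server_name example.com;
        # ssl_certificate and ssl_certificate_key must also be set for this server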

        # Protect all dynamic content
        location / {
            bot_verifier on;
            bot_verifier_redis_host 127.0.0.1;
            bot_verifier_redis_port 6379;
            bot_verifier_redis_expiry 7200;

            proxy_pass http://backend;   # assumes an upstream block named "backend"
        }

        # Static files do not need verification
        location /static/ {
            bot_verifier off;
            alias /var/www/static/;
        }
    }
}

This configuration applies verification only to dynamic content. Static files skip verification because scraping static assets poses less risk. Additionally, the cache expiry is set to 7200 seconds (two hours) to further reduce DNS lookups.

Testing Your NGINX Bot Verification Setup

After installing and configuring the module, you should verify it works correctly. Use these curl commands to test different scenarios.

First, test a normal request without any bot user-agent:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost/

This should return 200, indicating the request passed normally.

Next, test a fake Googlebot request from your local machine:

curl -s -o /dev/null -w '%{http_code}\n' \
    -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
    http://localhost/

This should return 403 because your local IP does not belong to Google. The module correctly identified the fake bot.

You can also test other bot user-agents:

# Test fake Bingbot
curl -s -o /dev/null -w '%{http_code}\n' \
    -A 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)' \
    http://localhost/

# Test fake YandexBot
curl -s -o /dev/null -w '%{http_code}\n' \
    -A 'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)' \
    http://localhost/

Both requests should return 403 as expected.
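To repeat these checks after each configuration change, the curl calls can be wrapped in a small helper that compares the observed status with the expected one. This is only a sketch; status_for_ua and expect_status are hypothetical names, and the URL is whatever host you are testing.

```shell
#!/bin/sh
# Print the HTTP status for URL $1 when requested with User-Agent $2.
status_for_ua() {
    curl -s -o /dev/null -w '%{http_code}' -A "$2" "$1"
}

# Usage: expect_status URL USER_AGENT EXPECTED_STATUS
# Prints PASS or FAIL and returns non-zero on a mismatch.
expect_status() {
    url=$1 ua=$2 expected=$3
    actual=$(status_for_ua "$url" "$ua")
    if [ "$actual" = "$expected" ]; then
        echo "PASS $expected $ua"
    else
        echo "FAIL expected=$expected actual=$actual $ua"
        return 1
    fi
}
```

For instance, `expect_status http://localhost/ 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' 403` reports a failure if a spoofed Googlebot request is let through.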

Checking the Error Log

The module logs detailed information about its decisions. Check the NGINX error log to understand what happens during verification:

tail -f /var/log/nginx/error.log

You will see messages like:

User Agent identified as provider Mozilla/5.0 (compatible; Googlebot/2.1; ...)
HOSTNAME: some-hostname.example.com
Result does not match known domain
Verification failed, blocking request

These logs help you troubleshoot any issues and confirm the module is working.

Inspecting the Cache

You can examine cached verification results using the Redis CLI:

keydb-cli KEYS '*:bvs'

This shows all cached bot verification status entries. KEYS walks the entire keyspace and blocks the server while it runs, so reserve it for testing. To check a specific IP:

keydb-cli GET '192.0.2.1:bvs'

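The result will be either success (verified bot) or failure (fake bot). To clear the cache during testing, you can flush the instance. Note that FLUSHALL deletes every key in every database, not only the verifier's entries, so use it only when the instance holds nothing else you need: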

keydb-cli FLUSHALL
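Because FLUSHALL removes every key in the instance, a narrower cleanup may be preferable when the cache shares an instance with other data. The sketch below deletes only keys ending in :bvs. The function name clear_bot_cache and the REDIS_CLI variable are hypothetical, and the sketch assumes keydb-cli accepts the same --scan and --pattern options as redis-cli.

```shell
#!/bin/sh
# CLI to use; set REDIS_CLI=redis-cli if you run stock Redis.
REDIS_CLI=${REDIS_CLI:-keydb-cli}

# Delete only the bot verifier entries, leaving other keys untouched.
clear_bot_cache() {
    "$REDIS_CLI" --scan --pattern '*:bvs' | while IFS= read -r key; do
        # stdin is redirected so the CLI cannot consume the key list
        "$REDIS_CLI" DEL "$key" < /dev/null > /dev/null
    done
}
```

The --scan option iterates with SCAN rather than KEYS, so it does not block the server while the key list is gathered.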

Performance Considerations

The NGINX bot verification module is designed for minimal performance impact. However, there are several factors to consider for optimal operation.

DNS Resolution Overhead

Reverse and forward DNS lookups add latency to requests from bot user-agents. Without caching, each request would require two DNS queries. The Redis cache eliminates this overhead for repeated visits from the same IP.

For high-traffic sites, consider these optimizations:

  • Increase cache expiry: Longer cache times mean fewer DNS lookups. Set bot_verifier_redis_expiry to 7200 or higher for production.
  • Use local DNS resolver: Configure a local caching DNS resolver like dnsmasq or unbound to speed up lookups.
  • Redis connection pooling: The module maintains persistent Redis connections. Ensure your Redis server has enough connection slots.
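To get a feel for how cache expiry affects DNS load, a back-of-the-envelope estimate can help. The sketch below assumes a worst case in which every unique bot IP returns in every expiry window and each cache miss costs two queries (reverse plus forward). The function name is hypothetical and the figures are illustrative, not measurements.

```shell
#!/bin/sh
# Usage: dns_queries_per_day UNIQUE_BOT_IPS EXPIRY_SECONDS
# Prints an upper bound on DNS queries per day under the assumptions above.
dns_queries_per_day() {
    ips=$1 expiry=$2
    windows=$(( (86400 + expiry - 1) / expiry ))   # ceil(86400 / expiry)
    echo $(( 2 * ips * windows ))
}

dns_queries_per_day 100 3600   # 100 IPs, one-hour expiry
dns_queries_per_day 100 7200   # 100 IPs, two-hour expiry
```

Under these assumptions, doubling the expiry from 3600 to 7200 seconds halves the upper bound from 4800 to 2400 queries per day for 100 unique bot IPs.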

Failsafe Behavior

If the module cannot connect to Redis, it bypasses verification entirely. This failsafe prevents blocking legitimate traffic when the cache is unavailable. However, it also means fake bots can pass through during Redis outages.

Monitor your Redis service health with:

systemctl status keydb
keydb-cli ping

Consider setting up Redis monitoring alerts to detect connectivity issues promptly.
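Since an unreachable cache silently disables verification, even a very simple scheduled check is useful. Here is a minimal sketch that a cron job or monitoring agent could call. The cache_is_up name and the REDIS_CLI variable are hypothetical; the host and port default to the values used in the configuration examples.

```shell
#!/bin/sh
# CLI to use; set REDIS_CLI=redis-cli if you run stock Redis.
REDIS_CLI=${REDIS_CLI:-keydb-cli}

# Succeeds only if the cache at host $1, port $2 answers PING with PONG.
cache_is_up() {
    reply=$("$REDIS_CLI" -h "${1:-127.0.0.1}" -p "${2:-6379}" ping 2>/dev/null) || return 1
    [ "$reply" = "PONG" ]
}
```

A possible use is `cache_is_up || logger -p daemon.warning "bot verifier cache unreachable"`, which writes a syslog warning whenever the check fails.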

Security Best Practices

While NGINX bot verification provides strong protection, follow these additional best practices:

Combine with Other Bot Protection

For comprehensive bot protection, use the verifier module alongside other techniques. Consider the testcookie module for JavaScript-based bot challenges. This combination catches both fake crawlers and automated scripts.

Combine with Rate Limiting

Bot verification works best alongside rate limiting. Even verified bots should respect reasonable limits:

# limit_req_zone belongs in the http context, outside any server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    bot_verifier on;
    bot_verifier_redis_host 127.0.0.1;

    limit_req zone=api burst=20 nodelay;
}

Use with ModSecurity

For comprehensive protection, combine bot verification with a Web Application Firewall. The ModSecurity module provides additional security layers:

location / {
    bot_verifier on;
    bot_verifier_redis_host 127.0.0.1;

    modsecurity on;
    modsecurity_rules_file /etc/nginx/modsec/main.conf;
}

Monitor Blocked Requests

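Track blocked bot requests in a dedicated access log for security analysis. The $is_bot_blocked variable is not built into NGINX, so this example defines it in the http context with a map on $status. As written, it flags every 403 response, not only those issued by the module:

# http context: 1 for any 403 response, 0 otherwise
map $status $is_bot_blocked { 403 1; default 0; }
log_format botlog '$remote_addr - $status - "$http_user_agent"';
access_log /var/log/nginx/bots.log botlog if=$is_bot_blocked;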

IP-Based Access Control

For additional security, combine bot verification with IP whitelisting and blacklisting. This allows you to explicitly allow or deny known IP ranges.

Regular Configuration Audits

Use Gixy to analyze your NGINX configuration for security issues:

gixy /etc/nginx/nginx.conf

This tool detects common misconfigurations that could undermine your security measures.

Troubleshooting Common Issues

Here are solutions to problems you might encounter with NGINX bot verification:

Module Not Blocking Fake Bots

If fake bots are not being blocked, check these items:

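  1. Verify module is loaded: Run nginx -T 2>/dev/null | grep bot_verifier and confirm the load_module line appears (nginx -V lists only compile-time options, not dynamically loaded modules)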
  2. Check directive is enabled: Ensure bot_verifier on; is in the correct location block
  3. Test Redis connectivity: Run keydb-cli ping and verify it returns PONG
  4. Check SELinux: Run getsebool httpd_can_network_connect and verify it is on
  5. Review error log: Look for connection errors in /var/log/nginx/error.log

All Requests Being Blocked

If legitimate traffic is being blocked, verify:

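  1. Cache is not corrupted: Clear the cache with keydb-cli FLUSHALL (this deletes every key in the instance, so use it only on a dedicated cache)
  2. Verification is scoped correctly: Ensure bot_verifier on; appears only in the location blocks you intend to protect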
  3. Check the error log: The log will show why requests are being blocked

High Latency on Bot Requests

If requests from bots are slow:

  1. Verify DNS resolution speed: Test with time host 66.249.66.1
  2. Check cache hit rate: A high miss rate may indicate that the cache expiry is too short
  3. Monitor Redis performance: Use keydb-cli INFO to check memory and connections
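The hit rate mentioned in step 2 can be estimated from the keyspace_hits and keyspace_misses counters in the INFO stats output. The helper below is a sketch with a hypothetical name. The counters cover the whole instance, so the figure is only meaningful when the instance is dedicated to the bot verifier.

```shell
#!/bin/sh
# Read `INFO stats` output on stdin and print the keyspace hit rate in percent.
hit_rate() {
    # INFO output uses CRLF line endings, so strip carriage returns first
    tr -d '\r' | awk -F: '
        $1 == "keyspace_hits"   { hits = $2 }
        $1 == "keyspace_misses" { misses = $2 }
        END {
            total = hits + misses
            if (total == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * hits / total
        }'
}
```

Usage: `keydb-cli INFO stats | hit_rate`.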

Conclusion

NGINX bot verification provides essential protection against fake search engine crawlers. By implementing reverse DNS verification, you can confidently allow real bots while blocking impostors. This protects your content from scraping, your server from attacks, and your analytics from pollution.

The installation process is straightforward on Rocky Linux, AlmaLinux, and other RHEL-based distributions. The module integrates seamlessly with existing NGINX configurations. Furthermore, Redis caching ensures minimal performance impact even under heavy traffic.

Remember to test your configuration thoroughly after installation. Monitor the error logs to understand the module’s behavior. Combine NGINX bot verification with other security measures for comprehensive protection.

For additional information, visit the module’s GitHub repository. You can find the complete source code and report any issues there.

Danila Vershinin

Founder & Lead Engineer

NGINX configuration and optimization • Linux system administration • Web performance engineering

10+ years NGINX experience • Maintainer of GetPageSpeed RPM repository • Contributor to open-source NGINX modules
