Whitelist OpenAI IP Ranges in NGINX and fail2ban

Danila Vershinin

2 months ago

If you want ChatGPT Search, GPTBot, and OpenAI’s on-demand fetcher to actually reach your content, the answer is not just “set Allow: / in robots.txt.” Modern NGINX deployments lean on rate limiting, GeoIP rules, and fail2ban; each of those will silently choke an OpenAI bot if you don’t deliberately whitelist OpenAI IP ranges in every layer. This guide walks through how to whitelist OpenAI IP ranges end-to-end: in NGINX (for rate limit bypass and per-location allow lists) and in fail2ban (so a brief burst from a crawler doesn’t get jailed across every service on the box).

The IP ranges OpenAI publishes are not static. They sit in Microsoft Azure space and rotate as OpenAI scales out. Hard-coding them into config files means stale lists within months. The pattern below pulls them from the upstream JSON feeds and ships them as RPM packages that update via dnf, so the moment OpenAI publishes a new prefix, your next dnf update picks it up.

The three OpenAI bot identities

OpenAI runs three distinct crawlers, each with its own user-agent, its own purpose, and its own published IP list. When you whitelist OpenAI IP ranges, you whitelist all three:

Bot	User-Agent contains	Purpose	Published IP source
GPTBot	`GPTBot`	Training data collection	`openai.com/gptbot.json`
ChatGPT-User	`ChatGPT-User`	On-demand fetch when a user pastes a link or runs an action	`openai.com/chatgpt-user.json`
OAI-SearchBot	`OAI-SearchBot`	Indexing for ChatGPT Search citations	`openai.com/searchbot.json`

All three feeds use the same JSON shape:

{
  "creationTime": "2026-05-27T06:31:14Z",
  "prefixes": [
    { "ipv4Prefix": "4.151.71.176/28" },
    { "ipv4Prefix": "4.151.119.48/28" }
  ]
}

creationTime is what you key cache-busting against. If the creationTime in your local copy is older than the one in the live feed, OpenAI has published a new list and your local copy is stale.
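
As a rough illustration of how such a feed could be consumed, the Python sketch below parses the JSON shape shown above, extracts the IPv4 prefixes, and compares two creationTime values to decide whether a cached copy is stale. The function names (`parse_feed`, `is_stale`) are made up for this example, and it is not how the packages described later are built.

```python
import json
from datetime import datetime
from ipaddress import ip_network

# Sample payload copied from the feed shape shown above.
SAMPLE_FEED = """
{
  "creationTime": "2026-05-27T06:31:14Z",
  "prefixes": [
    { "ipv4Prefix": "4.151.71.176/28" },
    { "ipv4Prefix": "4.151.119.48/28" }
  ]
}
"""


def parse_feed(text):
    """Return (creation_time, [IPv4Network, ...]) for one feed document."""
    doc = json.loads(text)
    created = datetime.strptime(doc["creationTime"], "%Y-%m-%dT%H:%M:%SZ")
    # ip_network() validates each CIDR and rejects malformed entries.
    prefixes = [
        ip_network(entry["ipv4Prefix"])
        for entry in doc["prefixes"]
        if "ipv4Prefix" in entry
    ]
    return created, prefixes


def is_stale(cached_created, live_created):
    """A cached copy is stale when the live feed carries a newer creationTime."""
    return cached_created < live_created


created, prefixes = parse_feed(SAMPLE_FEED)
print(created.isoformat(), [str(p) for p in prefixes])
```

Comparing parsed timestamps rather than raw strings keeps the check independent of how the feed formats the value.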

Why `robots.txt` allow rules are not enough

Robots.txt is honored by the bot’s own logic. It tells the bot what to crawl and tells the rest of your stack nothing. The other layers in your stack still see incoming traffic by IP and user-agent, and they react independently:

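limit_req and limit_conn throttle by source IP. A polite crawler averaging 50 requests per minute can still trip a 10r/s limit when it sends several requests within the same 100 ms window and no burst is configured. It then starts seeing 429s (or 503s, the limit_req_status default), regardless of robots.txt.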
fail2ban watches access and error logs for patterns. A series of 404s while a crawler probes deprecated paths trips nginx-botsearch or recidive. Once jailed, every subsequent request from that IP is dropped at the firewall.
GeoIP or country blocks at the firewall apply before any web layer ever runs.

The result: you publish Allow rules for the OpenAI bots, a bot tries to crawl, gets blocked at layer 4 or layer 7, and your content stops appearing in ChatGPT Search results. Whitelisting OpenAI IP ranges across every one of those layers closes the gap.
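
To see why a low average request rate does not protect a crawler from a per-IP limit, here is a simplified Python model of leaky-bucket accounting in the style of limit_req. It is a sketch under assumptions, not NGINX's actual implementation: the function name `simulate_limit_req` and the arrival patterns are invented for the example, and time is tracked in whole milliseconds.

```python
def simulate_limit_req(arrivals_ms, rate_per_s, burst=0):
    """Return one bool per arrival: True if accepted, False if rejected."""
    accepted = []
    excess = 0.0     # requests currently "over" the steady rate
    last_ms = None   # arrival time of the last accepted request
    for t in arrivals_ms:
        if last_ms is None:
            # First request from this client always starts a fresh bucket.
            last_ms = t
            accepted.append(True)
            continue
        # Drain the bucket for the elapsed time, then add this request.
        new_excess = max(0.0, excess - rate_per_s * (t - last_ms) / 1000.0 + 1.0)
        if new_excess > burst:
            accepted.append(False)  # rejected; bucket state is left unchanged
        else:
            excess, last_ms = new_excess, t
            accepted.append(True)
    return accepted


# 50 requests in one minute, evenly spaced (one every 1200 ms).
even = simulate_limit_req([i * 1200 for i in range(50)], rate_per_s=10)

# Same 50 requests per minute, but sent in groups of five, 10 ms apart.
bunched = simulate_limit_req(
    [group * 6000 + i * 10 for group in range(10) for i in range(5)],
    rate_per_s=10,
)

print(sum(even), "of", len(even), "accepted when evenly spaced")
print(sum(bunched), "of", len(bunched), "accepted when bunched")
```

In this model the evenly spaced run is accepted in full, while the bunched run loses every request that follows another within 100 ms, even though both average 50 requests per minute.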

Whitelist OpenAI IP ranges in NGINX `limit_req`

The cleanest way to exempt OpenAI from NGINX rate limits is the empty-key bypass. limit_req_zone skips any request whose key evaluates to an empty string. Wire up a two-stage geo + map chain: the geo block marks OpenAI IPs with "" and everyone else with 0; the map translates that into either an empty string (bypass) or $binary_remote_addr (the normal per-IP key).

# /etc/nginx/conf.d/openai-whitelist.conf

# Stage 1: tag OpenAI IPs with empty string, all others with "0".
geo $openai_ip {
    default 0;
    include /etc/nginx/iplist/openai.nolimit.conf;
}

# Stage 2: derive the limit_req key. Empty string means "bypass the zone";
# anything else becomes the bucket. The map is needed because geo does
# not expand variables on the value side at use time.
map $openai_ip $limit_key {
    ""      "";
    default $binary_remote_addr;
}

limit_req_zone $limit_key zone=public:10m rate=20r/s;
limit_req_status 429;

server {
    listen 80;
    server_name www.example.com;

    location / {
        limit_req zone=public burst=40 nodelay;
        # ... your existing config
    }
}

openai.nolimit.conf is a list of bare CIDR entries shaped for inclusion inside a geo block:

# /etc/nginx/iplist/openai.nolimit.conf
4.151.71.176/28 "";
4.151.119.48/28 "";
4.197.64.0/27 "";
# ... 250+ more

When a request comes from 4.151.71.180, the geo block sets $openai_ip to "", the map collapses that to $limit_key = "", NGINX skips the zone, and the request is never counted against the bucket. Everyone else falls to the geo default 0, the map’s default branch sets $limit_key = $binary_remote_addr, and they keep paying the rate-limit tax.
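
The two-stage lookup can be modelled in a few lines of Python. This is only an illustration of the decision logic described above, with three sample prefixes hard-coded; the function names are hypothetical and nothing here is executed by NGINX.

```python
from ipaddress import ip_address, ip_network

# Three sample prefixes taken from the openai.nolimit.conf excerpt above.
OPENAI_PREFIXES = [
    ip_network(cidr)
    for cidr in ("4.151.71.176/28", "4.151.119.48/28", "4.197.64.0/27")
]


def geo_openai_ip(remote_addr):
    """Stage 1 (geo): "" for an OpenAI range, "0" for everyone else."""
    addr = ip_address(remote_addr)
    return "" if any(addr in net for net in OPENAI_PREFIXES) else "0"


def map_limit_key(openai_ip, remote_addr):
    """Stage 2 (map): empty stays empty, anything else becomes the per-IP key."""
    if openai_ip == "":
        return b""
    # Stand-in for $binary_remote_addr: the packed 4-byte form of the address.
    return ip_address(remote_addr).packed


def is_rate_limited(remote_addr):
    """A request is counted against the zone only when its key is non-empty."""
    return map_limit_key(geo_openai_ip(remote_addr), remote_addr) != b""


for ip in ("4.151.71.180", "8.8.8.8"):
    print(ip, "rate limited" if is_rate_limited(ip) else "bypasses the zone")
```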

You can verify the wiring with a debug location. $limit_key itself holds binary data for non-OpenAI clients, so the check prints $openai_ip next to the client address instead:

set_real_ip_from 127.0.0.1;
real_ip_header X-Real-IP;

location /show {
    return 200 "openai_ip=[$openai_ip] remote_addr=[$remote_addr]\n";
}

$ curl -sH 'X-Real-IP: 4.151.71.180' http://localhost/show
openai_ip=[] remote_addr=[4.151.71.180]
$ curl -sH 'X-Real-IP: 8.8.8.8' http://localhost/show
openai_ip=[0] remote_addr=[8.8.8.8]

An empty $openai_ip means the request takes the empty-key branch and bypasses the zone; a value of 0 means the request gets a normal per-IP key.

For deeper rate-limit context (burst behavior, nodelay, the difference between 429 and 503), see NGINX Rate Limiting: The Complete Guide.

Whitelist OpenAI IP ranges with `allow`/`deny`

If you have endpoints that are normally locked down (admin areas, staging, dev), you can extend the allow list to include OpenAI ranges with a one-line include:

location /docs-private/ {
    include /etc/nginx/iplist/openai.allow.conf;  # allow CIDR; lines
    allow 10.0.0.0/8;                              # your internal network
    deny  all;

    proxy_pass http://backend;
    # `allow`/`deny` only kicks in when the location passes through the access
    # phase. A bare `return 200` short-circuits earlier and would skip the check.
}

The .allow.conf variant of the same source ships pre-formatted allow <CIDR>; directives:

# /etc/nginx/iplist/openai.allow.conf
allow 4.151.71.176/28;
allow 4.151.119.48/28;
# ...
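
If you ever need to produce these two file formats yourself, the rendering step is small. The Python sketch below turns a list of CIDR strings into the line shapes shown above; `render_nolimit` and `render_allow` are hypothetical helper names, and this is an assumption about one way to do it, not the tooling behind any packaged list.

```python
from ipaddress import ip_network


def render_nolimit(cidrs):
    """Lines for inclusion inside a geo block: each prefix maps to an empty string."""
    # ip_network() raises ValueError on malformed input or stray host bits,
    # so a bad entry fails here instead of breaking an NGINX reload later.
    return "".join(f'{ip_network(cidr)} "";\n' for cidr in cidrs)


def render_allow(cidrs):
    """Lines for inclusion inside a location: one allow directive per prefix."""
    return "".join(f"allow {ip_network(cidr)};\n" for cidr in cidrs)


sample = ["4.151.71.176/28", "4.151.119.48/28"]
print(render_nolimit(sample), end="")
print(render_allow(sample), end="")
```

Validating each prefix before writing the file is the main design point: an include with a typo would otherwise only surface when NGINX next tests its configuration.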

Skip the busy-work: install the package

Maintaining 250+ CIDR entries by hand and refreshing them weekly is the kind of work that eats a Friday afternoon. The GetPageSpeed extras repository ships these as ready-to-include RPMs:

sudo dnf -y install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf -y install nginx-iplist-openai

This drops three formats into /etc/nginx/iplist/:

openai.geo.conf — full geo block declaring $is_openai
openai.allow.conf — allow <CIDR>; directives for use with include
openai.nolimit.conf — bare entries for the empty-key rate-limit bypass shown above

It also drops a plain text CIDR list at /usr/share/trusted-lists/plain/openai.txt. That’s the file the fail2ban integration in the next section reads.

The package is a thin wrapper around the three JSON feeds OpenAI publishes; a dnf update refreshes the lists whenever a new build is published.

For the broader picture of which trusted-list packages exist (Google, Bing, Stripe, PayPal, Cloudflare, etc.), see fds FirewallD Made Easy: Trusted Lists.

Whitelist OpenAI IP ranges in fail2ban

NGINX-level whitelisting solves half the problem. The other half is fail2ban. A crawler that probes /wp-login.php, /.env, or any other URL your bot-search jail watches for will get jailed even though it had no malicious intent. Once jailed, every subsequent request from that IP gets dropped by the host firewall, before NGINX ever sees it. Your geo block never gets a chance to run.

fail2ban supports a [DEFAULT] ignorecommand directive: an external script that runs for every potential ban and skips the action if the script exits 0. That hook is the right place to whitelist OpenAI IP ranges from the same source NGINX uses.

GetPageSpeed ships a small companion package, fail2ban-trusted-lists-helper, that wires this up:

sudo dnf -y install fail2ban-trusted-lists-helper

The package contains two files:

/usr/share/trusted-lists/scripts/fail2ban-ignoreip-check — a bash one-liner that runs grepcidr -f against every /usr/share/trusted-lists/plain/*.txt, taking the candidate IP as its first argument
/etc/fail2ban/jail.d/00-trusted-lists.local — a single [DEFAULT] block hooking that script as ignorecommand

Inspect the installed snippet:

$ cat /etc/fail2ban/jail.d/00-trusted-lists.local
[DEFAULT]
ignorecommand = /usr/share/trusted-lists/scripts/fail2ban-ignoreip-check <ip>

The <ip> token is fail2ban’s substitution syntax: at runtime it gets replaced with the candidate IP and the script receives it as $1.

Because the hook sits in [DEFAULT], every jail inherits it: sshd, recidive, nginx-bothostname, nginx-botsearch, wordpress-soft, the lot. fail2ban runs the check before banning. If the candidate IP is in any installed plain/*.txt, the script exits 0 and fail2ban silently skips the ban. Otherwise normal logic runs.
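
For readers who want to see the decision an ignorecommand hook has to make, here is a minimal Python sketch of the same idea: load CIDR lines, test the candidate IP, and map the result to an exit code. It is not the shipped helper (which the article describes as a bash one-liner around grepcidr); the function names and the inline sample list are assumptions for illustration.

```python
from ipaddress import ip_address, ip_network


def load_cidrs(lines):
    """Parse plain-text CIDR lines, skipping blanks and # comments."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            networks.append(ip_network(entry, strict=False))
    return networks


def should_ignore(candidate_ip, networks):
    """True when the candidate IP falls inside any trusted network."""
    addr = ip_address(candidate_ip)
    return any(addr in net for net in networks)


def exit_code(candidate_ip, networks):
    """fail2ban skips the ban when the ignorecommand exits 0."""
    return 0 if should_ignore(candidate_ip, networks) else 1


# Inline stand-in for the contents of a plain CIDR list file.
SAMPLE_LIST = """
# sample entries only
4.151.71.176/28
4.151.119.48/28
"""

trusted = load_cidrs(SAMPLE_LIST.splitlines())
print("4.151.71.180 ->", exit_code("4.151.71.180", trusted))
print("8.8.8.8 ->", exit_code("8.8.8.8", trusted))
```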

Confirm the wiring after install by querying any active jail (DEFAULT itself is not a jail you can query directly):

$ sudo fail2ban-client get sshd ignorecommand
/usr/share/trusted-lists/scripts/fail2ban-ignoreip-check <ip>

$ sudo /usr/share/trusted-lists/scripts/fail2ban-ignoreip-check 4.151.71.180 && echo OK
OK

The pattern composes. Install more lists, the helper picks them up automatically:

sudo dnf -y install nginx-iplist-stripe nginx-iplist-paypal nginx-iplist-bingbot
# every CIDR from every list is now an automatic fail2ban exemption

Verify with curl

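End-to-end smoke test. The whitelist matches on source IP, not user-agent, so a request with a spoofed GPTBot user-agent from your own host is still treated as ordinary traffic; only the last check exercises the IP match (substitute a real OpenAI IP from /usr/share/trusted-lists/plain/openai.txt):

# Simulate GPTBot's user-agent hitting the home page (confirms no user-agent rule blocks it)
curl -sI -A 'GPTBot/1.0 (+https://openai.com/gptbot)' https://www.example.com/ | head -1
# Expect: HTTP/2 200

# Load test from a non-OpenAI host: the bypass is keyed on IP, so the limit still applies here
hey -n 200 -c 20 -H 'User-Agent: GPTBot/1.0' https://www.example.com/
# Expect: 429s once the run exceeds 20r/s plus the burst of 40; use the /show check above for the bypass itself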

# Confirm the fail2ban hook accepts the IP
sudo /usr/share/trusted-lists/scripts/fail2ban-ignoreip-check 4.151.71.180 && echo "would be ignored"

If would be ignored prints, the next time 4.151.71.180 shows up in any jail’s filter pattern, fail2ban will quietly skip the ban.

Update `robots.txt` while you are here

The whitelist is the load-bearing fix, but robots.txt still controls what well-behaved bots try to fetch. The minimum acceptable block:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

If you publish content you do not want OpenAI to train on, deny those paths explicitly under the GPTBot block while keeping ChatGPT-User and OAI-SearchBot open. That way the on-demand fetch and the search-citation crawler still see the page when a user is actively asking about it.
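
A split policy like this can be checked offline with Python's standard-library robots.txt parser. The sketch below assumes a hypothetical `/premium/` path that should stay out of training data; the path and the `allowed` helper are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep /premium/ away from GPTBot, leave the rest open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium/
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot, path):
    """Ask the parser whether the given bot may fetch the given path."""
    return parser.can_fetch(bot, "https://www.example.com" + path)


for bot in ("GPTBot", "ChatGPT-User", "OAI-SearchBot"):
    print(bot, "/premium/report ->", allowed(bot, "/premium/report"))
```

The Disallow line is listed before Allow because this parser applies the first matching rule; parsers that pick the longest matching path reach the same result for this file.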

Summary

A complete setup for whitelisting OpenAI IP ranges on a modern NGINX server is two packages and a robots.txt edit:

sudo dnf -y install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf -y install nginx-iplist-openai fail2ban-trusted-lists-helper

Then in NGINX:

geo $openai_ip {
    default 0;
    include /etc/nginx/iplist/openai.nolimit.conf;
}
map $openai_ip $limit_key {
    ""      "";
    default $binary_remote_addr;
}
limit_req_zone $limit_key zone=public:10m rate=20r/s;

That covers the rate-limit layer, the access-control layer, and the intrusion-prevention layer with one source of truth that updates through dnf. When OpenAI publishes a new prefix, your server trusts it once your next dnf update installs the refreshed package.
