

NGINX Reverse Proxy Ollama: Host Your Own Free LLM




Why pay per token when you can run a powerful LLM on your own server for free? Tools like Ollama and vLLM make self-hosting open-source models like Gemma, Llama, and Mistral surprisingly easy. The missing piece is an NGINX reverse proxy Ollama setup that actually streams tokens correctly.

NGINX’s default settings silently break LLM token streaming — buffering delays tokens, HTTP/1.0 blocks chunked transfer, and a 60-second timeout kills long completions mid-generation. This guide takes you from a fresh RHEL-based server to a fully working, secured LLM API with a proper NGINX reverse proxy Ollama configuration. Every command has been tested on Rocky Linux 10 with NGINX 1.28 and real inference using the gemma3:1b model.

Installing Ollama on RHEL, Rocky Linux, or AlmaLinux

Start with a fresh server running RHEL 10, Rocky Linux 10, or AlmaLinux 10. You need curl and zstd — the Ollama installer requires zstd to decompress its binary:

sudo dnf install -y curl zstd

Then run the official Ollama installer:

curl -fsSL https://ollama.com/install.sh | sh

The installer creates an ollama system user, installs the binary to /usr/local/bin/ollama, and sets up a systemd service. You should see output ending with:

>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

Verify the service is running:

systemctl status ollama

Pulling Your First Model

Download a model to test with. The gemma3:1b model from Google is an excellent choice — it is small (815 MB), fast even on CPU, and smart enough to give correct answers:

ollama pull gemma3:1b

Verify it works by asking a quick question:

ollama run gemma3:1b "What is 2+2? Reply with just the number."

You should get back 4. If you see a correct response, Ollama is working. You can also try larger models like llama3 (4.7 GB) or mistral (4.1 GB) for production use.

Verify Ollama Listens on Localhost

By default, Ollama binds to 127.0.0.1:11434, which is secure — only local connections are accepted:

ss -tlnp | grep 11434
# Expected: LISTEN 0 2048 127.0.0.1:11434

Do not change this to 0.0.0.0. NGINX will handle external traffic.

Installing NGINX

Install the latest stable NGINX. On RHEL-based systems, the default nginx package from AppStream works; for the latest stable branch (1.28+, the version this guide was tested with) you can use the GetPageSpeed repository:

sudo dnf install -y nginx
sudo systemctl enable --now nginx

Verify NGINX is running:

curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1/
# Expected: 200

Enable SELinux Network Connections

On RHEL-based distributions, SELinux blocks NGINX from connecting to backend services by default. Without this step, your NGINX reverse proxy Ollama setup will return 502 Bad Gateway errors:

sudo setsebool -P httpd_can_network_connect on

The -P flag makes this persistent across reboots. Verify it took effect:

getsebool httpd_can_network_connect
# Expected: httpd_can_network_connect --> on

Why the Default NGINX Config Breaks LLM Streaming

Now that both Ollama and NGINX are running, you might be tempted to add a simple proxy_pass and call it done. Unfortunately, four default NGINX behaviors conspire to break LLM token streaming.

1. Proxy Buffering Is Enabled by Default

NGINX’s proxy_buffering directive defaults to on. This means NGINX collects the upstream response into memory buffers (32 KB by default) before forwarding anything to the client.

For a normal web page, this improves performance. For LLM token streaming, it causes tokens to accumulate in NGINX’s buffer instead of reaching the client immediately. The user sees nothing for seconds, then gets a burst of text all at once.

2. Proxy HTTP Version Defaults to 1.0

The proxy_http_version directive defaults to 1.0. HTTP/1.0 does not support chunked transfer encoding, which is the mechanism both Ollama and vLLM use to stream tokens.

With HTTP/1.0, NGINX must wait for the entire response before it knows the Content-Length. This defeats the purpose of streaming entirely.

3. Read Timeout Is Only 60 Seconds

The proxy_read_timeout directive defaults to 60 seconds. LLM generation regularly takes longer than a minute, especially for large prompts or long completions on CPU-only hardware.

When the timeout fires, NGINX drops the connection mid-generation. The client receives an incomplete response or a 504 Gateway Timeout error.

4. Ollama Rejects Non-Local Host Headers

When you use an upstream block and proxy_pass http://upstream_name, NGINX sends Host: upstream_name to the backend. Ollama validates the Host header and rejects requests that don’t appear to come from localhost. The result is a silent 403 Forbidden with an empty body — no error message, no log entry from Ollama.

You must explicitly set proxy_set_header Host localhost; to fix this.

Here is the naive config that exhibits all these problems:

# DO NOT USE - streaming will break, Ollama returns 403
upstream ollama {
    server 127.0.0.1:11434;
}

location / {
    proxy_pass http://ollama;
}

Understanding the Streaming Formats

Before writing the NGINX config, you should understand how Ollama and vLLM stream responses. They use different formats, but NGINX handles both correctly once buffering is disabled.

Ollama: NDJSON Streaming

Ollama’s native API endpoints (/api/generate and /api/chat) stream responses as Newline-Delimited JSON (NDJSON). Each token arrives as a separate JSON object followed by a newline character:

{"model":"gemma3:1b","created_at":"2026-02-25T12:00:00Z","response":"Hello","done":false}
{"model":"gemma3:1b","created_at":"2026-02-25T12:00:00Z","response":" world","done":false}
{"model":"gemma3:1b","created_at":"2026-02-25T12:00:00Z","response":"","done":true,"eval_count":2}

The Content-Type is application/x-ndjson and the transfer encoding is chunked.
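
Once buffering is off, a client reassembles the text by concatenating the response fields. Here is a minimal sketch with jq; the printf lines stand in for a live stream and are trimmed to the relevant field:

```shell
# Join the "response" fields of an NDJSON token stream into plain text.
# The printf lines stand in for a live /api/generate response.
printf '%s\n' \
  '{"response":"Hello","done":false}' \
  '{"response":" world","done":false}' \
  '{"response":"","done":true}' \
| jq -rj '.response'
echo    # jq -j suppresses newlines; add one at the end
```

The same jq filter works on the real stream when piped from a curl --no-buffer request against /api/generate.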

vLLM and Ollama’s OpenAI-Compatible Endpoint: SSE

vLLM’s /v1/chat/completions endpoint (and Ollama’s equivalent at the same path) uses Server-Sent Events (SSE). Each token arrives as a data: line separated by blank lines:

data: {"id":"chatcmpl-0","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-1","choices":[{"delta":{"content":" world"}}]}

data: [DONE]

The Content-Type is text/event-stream.
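
An SSE client strips the data: prefix, skips the [DONE] sentinel, and parses the rest as JSON. A sketch with sed, grep, and jq — the printf lines stand in for a live stream; for live use, add -u to sed (GNU) and --line-buffered to grep so the pipeline itself does not buffer:

```shell
# Extract the token text from an SSE stream.
# The printf lines stand in for a live /v1/chat/completions response.
printf '%s\n' \
  'data: {"choices":[{"delta":{"content":"Hello"}}]}' \
  '' \
  'data: {"choices":[{"delta":{"content":" world"}}]}' \
  '' \
  'data: [DONE]' \
| sed -n 's/^data: //p' \
| grep -v '^\[DONE\]' \
| jq -rj '.choices[0].delta.content // empty'
echo
```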

Why This Matters for NGINX

Both formats rely on chunked transfer encoding to deliver tokens incrementally. Your NGINX reverse proxy Ollama or vLLM deployment must use HTTP/1.1 upstream and disable buffering. Without both settings, neither format streams correctly.

Minimal Streaming-Safe Configuration

Here is the minimum NGINX reverse proxy Ollama configuration with working token streaming. Create or edit /etc/nginx/conf.d/llm-proxy.conf:

upstream ollama {
    server 127.0.0.1:11434;
    keepalive 4;
}

server {
    listen 80;
    server_name llm.example.com;

    location / {
        proxy_pass http://ollama;

        # Required for streaming
        proxy_http_version 1.1;
        proxy_set_header Host localhost;
        proxy_set_header Connection "";
        proxy_buffering off;

        # LLM generation can take minutes
        proxy_read_timeout 600s;
    }
}

Test and reload:

sudo nginx -t && sudo systemctl reload nginx

Let’s examine each directive and why it is essential.

proxy_http_version 1.1

Switches the upstream connection from HTTP/1.0 to HTTP/1.1. This enables chunked transfer encoding, which is required for both NDJSON and SSE streaming. It also allows the keepalive directive in the upstream block to work, since HTTP/1.0 closes the connection after every request.

proxy_set_header Host localhost

Sends Host: localhost to the upstream instead of the upstream block name. Ollama validates the Host header and rejects requests from non-local origins with a 403 Forbidden. Without this directive, every request through the proxy fails silently.

For vLLM, this is not strictly required since vLLM does not validate the Host header. However, setting it to $host (the client’s original Host) is good practice.

proxy_set_header Connection ""

Clears the Connection header sent to the upstream. By default, NGINX forwards Connection: close when using HTTP/1.1, which prevents keepalive connections. Setting it to an empty string allows the upstream connection to persist across multiple requests.

proxy_buffering off

Disables response buffering entirely. NGINX passes each chunk from the upstream directly to the client as soon as it arrives. This is the single most important directive for LLM streaming.

For more background on how proxy buffering works in NGINX, see our detailed guide.

proxy_read_timeout 600s

Increases the read timeout to 10 minutes. Adjust this based on your longest expected generation time. For CPU-only inference, you may need even more.

The timeout resets each time NGINX receives data from the upstream. Actively streaming responses will not time out regardless of total duration.

keepalive 4

Maintains up to 4 idle keepalive connections to the upstream. This avoids the overhead of establishing a new TCP connection for every request. For a single-user setup, 4 is sufficient. For multi-user production, increase it to 16 or 32.

Verifying Streaming Works

With the minimal NGINX reverse proxy Ollama config in place, test that tokens stream correctly.

Test NDJSON Streaming (Ollama Native)

curl --no-buffer -X POST http://llm.example.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma3:1b","prompt":"What is 2+2? Reply with just the number.","options":{"num_predict":10}}'

You should see JSON objects appearing one at a time, not all at once. With gemma3:1b, the response will be a clean 4 — confirming both streaming delivery and correct model inference.

Test SSE Streaming (OpenAI-Compatible)

curl --no-buffer -X POST http://llm.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma3:1b","stream":true,"messages":[{"role":"user","content":"Say hello"}]}'

You should see data: {…} lines appearing incrementally, ending with data: [DONE].

The --no-buffer Flag

The --no-buffer (or -N) flag tells curl to disable output buffering. Without it, curl may buffer the streamed response on its side, making it look like NGINX is buffering when it is not.
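
One way to confirm incremental delivery end to end is to timestamp each line as it arrives. The for loop below is a stand-in producer that emits a line every 200 ms; replace it with your curl --no-buffer command to time the real stream. Roughly evenly spaced timestamps mean streaming works; a single burst of identical timestamps means something is still buffering. (date +%3N is a GNU coreutils extension.)

```shell
# Timestamp each line of a stream as it arrives.
# The for loop simulates a token producer; swap in curl --no-buffer.
for tok in '{"response":"Hello"}' '{"response":" world"}'; do
  printf '%s\n' "$tok"
  sleep 0.2
done \
| while IFS= read -r line; do
    printf '%s  %s\n' "$(date '+%H:%M:%S.%3N')" "$line"
  done
```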

See Your Self-Hosted LLM In Action

Once streaming works, the fun begins. Here are real responses from gemma3:1b running on a CPU-only Rocky Linux server, proxied through NGINX — all generated in under 3 seconds each.

Write a haiku about Linux servers:

curl -s http://llm.example.com/api/generate \
  -d '{"model":"gemma3:1b","prompt":"Write a haiku about Linux servers.","stream":false}' | jq -r .response

Code flows, swift and clean,
Servers hum, a steady beat,
Data’s gentle flow.

Explain recursion in one sentence, then explain it again:

curl -s http://llm.example.com/api/generate \
  -d '{"model":"gemma3:1b","prompt":"Explain recursion in one sentence, then explain it again.","stream":false}' | jq -r .response

Recursion is a technique where a function solves a problem by calling itself to solve smaller subproblems of the same type.

Recursion is a programming technique where a function solves a problem by breaking it down into smaller, self-similar instances of the same problem.

Why is 42 the answer to everything?

curl -s http://llm.example.com/api/generate \
  -d '{"model":"gemma3:1b","prompt":"Why is 42 the answer to everything? One paragraph.","stream":false}' | jq -r .response

The enduring fascination with the number 42 as the “answer to everything” stems from Douglas Adams’s The Hitchhiker’s Guide to the Galaxy. A supercomputer named Deep Thought calculates the answer to the ultimate question of life, the universe, and everything — and the answer is 42. However, Deep Thought cannot actually explain what the question is.

If NGINX and Apache had a rap battle, write 4 lines for NGINX:

curl -s http://llm.example.com/api/generate \
  -d '{"model":"gemma3:1b","prompt":"If NGINX and Apache had a rap battle, write 4 lines for NGINX.","stream":false}' | jq -r .response

Yo, I’m the king, a flexible design,
Load balancing strong, a truly divine
Configuration, I’m the master of flow,
While you’re a legacy, watch my data grow!

Translate to French: “The server is running perfectly.”

curl -s http://llm.example.com/api/generate \
  -d '{"model":"gemma3:1b","prompt":"Translate to French: The server is running perfectly.","stream":false}' | jq -r .response

Le serveur fonctionne parfaitement.

All of these responses came from a 1 billion parameter model running entirely on CPU. With a GPU and a larger model like llama3 or mistral, the quality and speed improve dramatically. The point is: your self-hosted LLM is real, it works, and NGINX delivers every token the instant it is generated.

Production-Ready Configuration for Ollama

A production NGINX reverse proxy Ollama deployment needs more than just streaming. You need TLS, forwarded client information, request size limits, caching disabled, and dangerous management endpoints blocked.

upstream ollama {
    server 127.0.0.1:11434;
    keepalive 8;
}

server {
    listen 443 ssl;
    server_name llm.example.com;

    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    # Block dangerous management endpoints
    location ~ ^/api/(pull|push|delete|copy)$ {
        default_type application/json;
        return 403 '{"error":"management endpoints are disabled"}';
    }

    # Proxy all other requests to Ollama
    location / {
        proxy_pass http://ollama;
        proxy_http_version 1.1;
        proxy_set_header Host localhost;
        proxy_set_header Connection "";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Disable buffering and caching for streaming
        proxy_buffering off;
        proxy_cache off;

        # Generous timeouts for LLM generation
        proxy_connect_timeout 10s;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;

        # Limit prompt size
        client_max_body_size 2m;
    }
}

Blocking Management Endpoints

Ollama exposes API endpoints for pulling, pushing, deleting, and copying models. If left accessible, anyone who can reach your proxy can download new models or delete your existing ones.

The location ~ ^/api/(pull|push|delete|copy)$ block returns a 403 Forbidden for these dangerous endpoints while allowing /api/generate, /api/chat, and /api/tags to pass through normally.

Why proxy_cache off Matters

NGINX may cache upstream responses if a caching configuration exists elsewhere in your config. Since every LLM response is unique and streaming, caching would cause stale or incomplete responses. Adding proxy_cache off explicitly prevents this.

Production-Ready Configuration for vLLM

vLLM uses the OpenAI-compatible API format exclusively. It also provides health and metrics endpoints useful for monitoring.

If you are using vLLM instead of Ollama, start it bound to localhost:

vllm serve meta-llama/Llama-3-8B --host 127.0.0.1 --port 8000

Then configure NGINX:

upstream vllm {
    server 127.0.0.1:8000;
    keepalive 16;
}

server {
    listen 443 ssl;
    server_name llm.example.com;

    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    # Health check (no auth needed)
    location = /health {
        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    # Prometheus metrics (restrict to monitoring network)
    location = /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;

        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    # OpenAI-compatible API
    location /v1/ {
        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Connection "";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Disable buffering for SSE streaming
        proxy_buffering off;
        proxy_cache off;

        # Generous timeouts for LLM generation
        proxy_connect_timeout 10s;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;

        client_max_body_size 2m;
    }
}

Protecting the Metrics Endpoint

vLLM exposes Prometheus-format metrics at /metrics, including request counts, token throughput, and GPU utilization. This data is valuable for monitoring but should not be publicly accessible.

The allow/deny directives restrict access to private network ranges where your monitoring system lives.

Securing Your LLM API with Authentication

Neither Ollama nor vLLM provides robust built-in authentication. Exposing an unauthenticated LLM API to the internet is an invitation for abuse — each request can consume significant GPU or CPU time.

NGINX provides several authentication methods you can layer on top.

API Key Authentication with map

For programmatic access, API key authentication via the Authorization header is the most practical approach. It works with any OpenAI-compatible client library without modifications:

# Validate API key format: Bearer sk-<32 alphanumeric chars>
map $http_authorization $llm_auth_valid {
    default 0;
    "~^Bearer sk-[a-zA-Z0-9]{32}$" 1;
}

server {
    listen 443 ssl;
    server_name llm.example.com;

    # ... SSL config ...

    location /v1/ {
        if ($llm_auth_valid = 0) {
            return 401 '{"error":"invalid or missing API key"}';
        }

        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
        proxy_connect_timeout 10s;
        proxy_send_timeout 600s;
        client_max_body_size 2m;
    }
}

This validates that the Authorization header matches the pattern Bearer sk- followed by exactly 32 alphanumeric characters. Requests without a valid key receive a 401 Unauthorized response.

To generate a key:

echo "sk-$(openssl rand -hex 16)"

Basic Authentication

For simple deployments where only a few users need access, HTTP Basic Auth works well:

location /v1/ {
    auth_basic "LLM API";
    auth_basic_user_file /etc/nginx/.llm-htpasswd;

    proxy_pass http://vllm;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 600s;
}

Generate the password file with htpasswd:

sudo dnf install httpd-tools
sudo htpasswd -c /etc/nginx/.llm-htpasswd myuser

Important: Always use Basic Auth over TLS. Without HTTPS, credentials are sent in base64 encoding, which is trivially decoded.
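
To see why, decode a Basic Auth header yourself. The sketch below uses a dummy myuser:s3cret pair:

```shell
# Basic Auth merely base64-encodes "user:password" -- it is encoding,
# not encryption, so TLS must carry it.
header=$(printf 'myuser:s3cret' | base64)
echo "Authorization: Basic $header"
printf '%s' "$header" | base64 -d; echo
```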

Rate Limiting LLM Requests

LLM inference is computationally expensive. A single request can monopolize your GPU for minutes. Rate limiting prevents accidental or malicious overload.

NGINX’s built-in rate limiting module handles this well:

# 10 requests per second with a burst of 20
limit_req_zone $binary_remote_addr zone=llm_api:10m rate=10r/s;

server {
    # ...

    location /v1/ {
        limit_req zone=llm_api burst=20 nodelay;

        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
    }
}

The rate=10r/s allows a sustained rate of 10 requests per second per client IP. The burst=20 handles short spikes of up to 20 simultaneous requests. The nodelay flag processes burst requests immediately instead of throttling them.

When the limit is exceeded, NGINX returns 503 Service Temporarily Unavailable. For an API, you may want a JSON error response:

limit_req_status 429;

# In the server block:
error_page 429 = @rate_limited;

location @rate_limited {
    default_type application/json;
    return 429 '{"error":"rate limit exceeded, try again later"}';
}

Per-API-Key Rate Limiting

If you use API key authentication, you can rate-limit per key instead of per IP. This is more fair for deployments behind a shared NAT or VPN:

limit_req_zone $http_authorization zone=llm_by_key:10m rate=5r/s;

The X-Accel-Buffering Header Alternative

If you cannot modify NGINX’s configuration (for example, in a managed hosting environment), the upstream application can send the X-Accel-Buffering: no response header. NGINX checks this header at response time and switches to non-buffered mode if the value is no.

Neither Ollama nor vLLM sends this header by default. However, if you place a middleware in front of them, you can inject it:

# Example: Flask middleware that adds X-Accel-Buffering
@app.after_request
def add_accel_header(response):
    if response.content_type == 'text/event-stream':
        response.headers['X-Accel-Buffering'] = 'no'
    return response

For most deployments, setting proxy_buffering off in NGINX is simpler and more reliable.

Load Balancing Multiple LLM Backends

If you run multiple Ollama or vLLM instances (for example, across multiple GPUs or machines), NGINX can load-balance requests across them:

upstream llm_cluster {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    keepalive 32;
}

The least_conn strategy routes each request to the backend with the fewest active connections. This is ideal for LLM inference because request durations vary wildly — a 10-token completion finishes in milliseconds while a 4096-token generation takes minutes.

Round-robin would overload servers that happen to receive long requests.
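
Open-source NGINX adds passive health checking on top of this via per-server parameters. A sketch with illustrative values (the max_fails/fail_timeout numbers are assumptions, not part of the tested setup above):

```nginx
upstream llm_cluster {
    least_conn;
    # After 3 failed attempts, take a backend out of rotation for 30s
    # so requests fail over to the remaining instances.
    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
```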

Troubleshooting Common Issues

Tokens Arrive in Bursts Instead of Streaming

Cause: proxy_buffering is still on, or a CDN/load balancer in front of NGINX is buffering.

Fix: Add proxy_buffering off; to your location block. If you use Cloudflare or another CDN, check their streaming/SSE settings.

403 Forbidden with Empty Body from Ollama

Cause: NGINX is sending a non-local Host header to Ollama. When you use proxy_pass http://upstream_name, NGINX sends Host: upstream_name by default. Ollama rejects requests from unrecognized hosts.

Fix: Add proxy_set_header Host localhost; to your Ollama location block.

502 Bad Gateway with “Permission denied”

Cause: SELinux is blocking NGINX from connecting to the upstream.

Fix: Run sudo setsebool -P httpd_can_network_connect on.

504 Gateway Timeout During Long Generations

Cause: proxy_read_timeout is too short (default 60 seconds).

Fix: Increase it to match your longest expected generation time:

proxy_read_timeout 600s;  # 10 minutes

Connection Resets Mid-Stream

Cause: proxy_send_timeout is too short, or a firewall is closing idle connections.

Fix: Increase proxy_send_timeout and check for firewalls or NAT devices with aggressive idle timeouts.

Streaming Works with curl but Not in the Browser

Cause: The browser’s fetch() API may buffer responses internally.

Fix: Use the ReadableStream API on the client side:

const response = await fetch('/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        model: 'gemma3:1b',
        stream: true,
        messages: [{ role: 'user', content: 'Hello' }]
    })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    console.log(decoder.decode(value));
}

Model Takes a Long Time to Respond Initially

Cause: Ollama loads models into memory on first request. This cold start can take several seconds to over a minute depending on model size.

Fix: Preload the model after Ollama starts:

curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"gemma3:1b","prompt":"warmup","options":{"num_predict":1}}' > /dev/null

For production, add this to your systemd service file’s ExecStartPost directive.
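
A sketch of such a drop-in (create it with sudo systemctl edit ollama; the 2-second sleep is an assumption to give the API time to come up, and the quoting through systemd and sh is fiddly enough that a small wrapper script is often cleaner):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
ExecStartPost=/bin/sh -c "sleep 2; curl -s http://127.0.0.1:11434/api/generate -d '{\"model\":\"gemma3:1b\",\"prompt\":\"warmup\",\"options\":{\"num_predict\":1}}' >/dev/null"
```

If you edit the file by hand instead of via systemctl edit, run sudo systemctl daemon-reload afterwards.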

Conclusion

You now have a complete self-hosted LLM setup: Ollama serving models locally, NGINX streaming tokens to clients without buffering, and production-grade security layered on top. The four critical NGINX changes — HTTP/1.1 upstream, Host header set to localhost, proxy buffering disabled, and extended read timeouts — are what separate a working NGINX reverse proxy Ollama deployment from a broken one.

All configurations in this article have been tested on Rocky Linux 10 with NGINX 1.28 and real Ollama inference using gemma3:1b. Use them as a starting point and adjust timeouts and rate limits based on your hardware and expected load.


Danila Vershinin

Founder & Lead Engineer

NGINX configuration and optimization • Linux system administration • Web performance engineering

10+ years NGINX experience • Maintainer of GetPageSpeed RPM repository • Contributor to open-source NGINX modules
