NGINX HTML Sanitize Module: Strip Dangerous HTML at the Edge

by Danila Vershinin



The NGINX HTML sanitize module (ngx_http_html_sanitize_module) lets you strip dangerous markup from untrusted HTML content directly at the web server level. Cross-site scripting (XSS) remains one of the most prevalent web security threats, and this module addresses it by turning NGINX into an HTML sanitization microservice. It parses HTML5 content and outputs only whitelisted elements, attributes, and CSS properties, so malicious markup is removed before your application stores or renders the content.

Built on top of Google’s gumbo-parser for standards-compliant HTML5 parsing and the katana-parser for inline CSS analysis, the NGINX HTML sanitize module provides a language-agnostic sanitization layer that any backend can use.

Why Sanitize HTML at the NGINX Level?

Most web frameworks include their own HTML sanitization libraries. However, there are several compelling reasons to move this responsibility to the NGINX layer:

  • Language-agnostic protection: A single NGINX sanitization endpoint serves applications written in any language — Python, PHP, Go, Java, or Node.js — without duplicating sanitization logic.
  • Centralized security policy: Security teams define the whitelist once in the NGINX configuration, rather than auditing sanitization rules across multiple codebases.
  • Reduced attack surface: Malicious HTML is stripped before application code processes it, which reduces the risk of parser-specific bypasses in individual frameworks.
  • Professional maintenance: Security engineers maintain the sanitization rules at the infrastructure level, independent of application developers.

This approach is especially useful in microservice architectures where multiple services accept HTML input. Additionally, it eliminates the need to keep sanitization libraries updated across every service individually.

If you are already using NGINX security headers or a WAF like NAXSI, this module is a natural extension of your defense-in-depth strategy.

How the NGINX HTML Sanitize Module Works

The module operates as a content handler, not a response filter. This is an important distinction: rather than filtering HTML in proxied responses, NGINX itself becomes the sanitization service.

Here is the typical workflow:

  1. A client sends an HTTP POST request with raw HTML in the request body.
  2. The NGINX HTML sanitize module parses the HTML using gumbo-parser.
  3. The module walks the parsed HTML tree and filters it against configured whitelists.
  4. For elements with inline CSS style attributes, katana-parser analyzes the CSS properties.
  5. NGINX returns the sanitized HTML in the response body.
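
From an application's point of view, the workflow above is a single HTTP round trip. The following Python sketch shows one way a backend might call the endpoint. The URL matches the example server block used later in this article; the function names and the Content-Type header are assumptions made for illustration, and error handling is omitted.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: matches the example server block shown later in this article.
SANITIZE_URL = "http://127.0.0.1:8888/sanitize"


def build_sanitize_request(html: str, **params: int) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw HTML in the body."""
    query = urllib.parse.urlencode(params)
    url = f"{SANITIZE_URL}?{query}" if query else SANITIZE_URL
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        method="POST",
        # Assumption: the article's curl examples rely on curl's default content type.
        headers={"Content-Type": "text/html; charset=UTF-8"},
    )


def sanitize(html: str, **params: int) -> str:
    """Send the request and return the sanitized HTML from the response body."""
    request = build_sanitize_request(html, **params)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")
```

A call such as sanitize(comment_html, element=2, attribute=2) would then return the filtered markup, provided the endpoint is running and configured.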

The module supports three filtering modes for elements, attributes, and CSS properties:

| Mode | Value | Behavior |
| --- | --- | --- |
| Disabled | 0 | Strip all (no output) |
| Allow all | 1 | Output everything |
| Whitelist | 2 | Output only whitelisted items |

These modes are controlled per request via query string parameters, giving you fine-grained control over sanitization behavior.
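
The three modes can be read as a small decision rule. The sketch below is a toy model of that rule in Python, not the module's C implementation; the names FilterMode and is_output are hypothetical.

```python
from enum import IntEnum


class FilterMode(IntEnum):
    DISABLED = 0   # strip all
    ALLOW_ALL = 1  # output everything
    WHITELIST = 2  # output only whitelisted items


def is_output(name: str, mode: FilterMode, whitelist: set[str]) -> bool:
    """Decide whether an element, attribute, or CSS property is kept."""
    if mode == FilterMode.DISABLED:
        return False
    if mode == FilterMode.ALLOW_ALL:
        return True
    return name in whitelist
```

The same rule applies independently to elements, attributes, and CSS properties, each with its own mode and whitelist.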

Installation

RHEL, CentOS, AlmaLinux, Rocky Linux

Install the module from the GetPageSpeed RPM repository:

sudo dnf install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf install nginx-module-html-sanitize

After installation, load the module by adding this directive to the top of /etc/nginx/nginx.conf:

load_module modules/ngx_http_html_sanitize_module.so;

Debian and Ubuntu

First, set up the GetPageSpeed APT repository, then install the module:

sudo apt-get update
sudo apt-get install nginx-module-html-sanitize

On Debian/Ubuntu, the package handles module loading automatically. No load_module directive is needed.

You can find the module’s RPM package details and APT package details on the respective pages.

Configuration

All directives for the NGINX HTML sanitize module are available in the location context only. This means you define a specific location block that acts as your sanitization endpoint.

Critical: Always Include html, head, and body

When using whitelist mode (element=2), you must include html, head, and body in the html_sanitize_element whitelist. The gumbo-parser internally wraps all input content in these structural tags. If they are not whitelisted, the tree traversal stops at the root and no child elements are output — resulting in an empty response.

These structural tags are not included in the response output by default (controlled by the html and document query parameters). However, they must be present in the whitelist for the module to traverse into child elements.

NGINX Directive Argument Limit

NGINX limits each directive to a maximum of 7 arguments (the NGX_CONF_MAX_ARGS constant is 8, which includes the directive name itself). If you specify more than 7 elements on a single html_sanitize_element line, the extra ones are silently ignored. Therefore, split long element lists across multiple directives:

# Correct: max 7 elements per line
html_sanitize_element html head body h1 h2 h3 h4;
html_sanitize_element h5 h6 p br pre div span;

# Wrong: more than 7 elements (extras silently ignored)
html_sanitize_element html head body h1 h2 h3 h4 h5 h6 p br pre div span;
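
If you generate the configuration from a longer list, splitting it by hand is error-prone. Here is a minimal Python sketch that emits directive lines of at most 7 arguments each, following the limit described above; the helper name directive_lines is hypothetical.

```python
MAX_ARGS_PER_DIRECTIVE = 7  # limit described in this section


def directive_lines(directive: str, values: list[str],
                    per_line: int = MAX_ARGS_PER_DIRECTIVE) -> list[str]:
    """Split a long value list into several directive lines."""
    lines = []
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]
        lines.append(f"{directive} {' '.join(chunk)};")
    return lines


elements = "html head body h1 h2 h3 h4 h5 h6 p br pre div span".split()
for line in directive_lines("html_sanitize_element", elements):
    print(line)
```

For the 14 elements above, this prints the same two lines as the "Correct" example.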

Basic Configuration Example

The following example sets up a sanitization endpoint at /sanitize that allows common HTML elements, safe attributes, and basic CSS properties:

server {
    listen 8888;

    location = /sanitize {
        add_header Content-Type "text/html; charset=UTF-8";

        client_body_buffer_size 10M;
        client_max_body_size 10M;

        html_sanitize on;

        # Required structural elements
        html_sanitize_element html head body;

        # Sections
        html_sanitize_element h1 h2 h3 h4 h5 h6;
        html_sanitize_element section nav article aside;
        html_sanitize_element header footer;

        # Grouping content
        html_sanitize_element p hr br pre blockquote;
        html_sanitize_element ol ul li dl dt dd;

        # Text-level semantics
        html_sanitize_element a q cite em strong small;
        html_sanitize_element mark dfn abbr time code;
        html_sanitize_element var samp kbd sub sup span;
        html_sanitize_element i b;

        # Edits
        html_sanitize_element ins del;

        # Embedded content
        html_sanitize_element img;

        # Tables
        html_sanitize_element table caption colgroup col;
        html_sanitize_element tbody thead tfoot tr td th;

        # Miscellaneous
        html_sanitize_element div legend;

        # Allowed attributes
        html_sanitize_attribute *.style;
        html_sanitize_attribute a.href a.rel a.name;
        html_sanitize_attribute img.src img.alt img.width;
        html_sanitize_attribute img.height;
        html_sanitize_attribute td.colspan td.rowspan;
        html_sanitize_attribute th.colspan th.rowspan th.scope;
        html_sanitize_attribute ol.type ol.reversed;
        html_sanitize_attribute table.border table.cellpadding;
        html_sanitize_attribute table.cellspacing table.width;

        # CSS properties
        html_sanitize_style_property color font-size;
        html_sanitize_style_property background-color text-align;
        html_sanitize_style_property font-weight;

        # URL restrictions
        html_sanitize_url_protocol http https;
    }
}

Directive Reference

html_sanitize

Syntax: html_sanitize on | off
Default: html_sanitize on
Context: location

Enables or disables the HTML sanitize handler in the given location. When enabled, the location processes POST request bodies as HTML and returns sanitized output.

html_sanitize_hash_max_size

Syntax: html_sanitize_hash_max_size size
Default: html_sanitize_hash_max_size 2048
Context: location

Sets the maximum size of the internal hash tables used for elements, attributes, style properties, URL protocols, and URL domains. Increase this value if you have a large number of whitelisted items.

html_sanitize_hash_bucket_size

Syntax: html_sanitize_hash_bucket_size size
Default: depends on CPU cache line size (typically 64)
Context: location

Sets the bucket size for the internal hash tables. The default value depends on the processor’s cache line size.

html_sanitize_element

Syntax: html_sanitize_element element ...
Default: none
Context: location

Defines whitelisted HTML5 elements. When the element=2 query string parameter is used, only elements listed in this directive are included in the output. You can use the directive multiple times. Remember to always include html, head, and body:

html_sanitize_element html head body;
html_sanitize_element p br div span;
html_sanitize_element h1 h2 h3 h4 h5 h6;
html_sanitize_element ul ol li;

html_sanitize_attribute

Syntax: html_sanitize_attribute attribute ...
Default: none
Context: location

Defines whitelisted HTML5 attributes using the element.attribute format. Additionally, the directive supports wildcards:

  • a.href — allow href only on <a> tags
  • *.style — allow style on any element
  • img.* — allow all attributes on <img> tags

Example:

html_sanitize_attribute *.style *.class;
html_sanitize_attribute a.href a.rel;
html_sanitize_attribute img.src img.alt img.width;
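
To make the wildcard forms concrete, the following Python sketch models how a rule set like the one above could be matched. It illustrates the semantics described here and is not the module's actual lookup code; attribute_allowed is a hypothetical name.

```python
def attribute_allowed(element: str, attribute: str, rules: set[str]) -> bool:
    """Return True if element.attribute matches an exact or wildcard rule."""
    element = element.lower()
    attribute = attribute.lower()
    candidates = {
        f"{element}.{attribute}",  # exact rule, e.g. a.href
        f"*.{attribute}",          # attribute allowed on any element, e.g. *.style
        f"{element}.*",            # any attribute on this element, e.g. img.*
    }
    return not candidates.isdisjoint(rules)


rules = {"*.style", "*.class", "a.href", "a.rel", "img.src", "img.alt", "img.width"}
print(attribute_allowed("img", "onerror", rules))  # not matched by any rule
print(attribute_allowed("p", "style", rules))      # matched by *.style
```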

html_sanitize_style_property

Syntax: html_sanitize_style_property property ...
Default: none
Context: location

Defines whitelisted CSS properties for inline style attributes. When style_property=2 is set in the query string, only listed CSS properties pass through:

html_sanitize_style_property color font-size;
html_sanitize_style_property background-color text-align;
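
As an illustration of what property whitelisting does to an inline style, here is a deliberately naive Python sketch. The module itself uses katana-parser; this version simply splits on semicolons, so it would mishandle values that contain semicolons (for example inside url() or quoted strings). The name filter_style is hypothetical.

```python
def filter_style(style: str, allowed: set[str]) -> str:
    """Keep only declarations whose property name is whitelisted."""
    kept = []
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if sep and prop.strip().lower() in allowed:
            kept.append(f"{prop.strip()}:{value.strip()};")
    return "".join(kept)


allowed = {"color", "font-size", "background-color", "text-align"}
print(filter_style("color:red;position:fixed;top:0;", allowed))
```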

html_sanitize_url_protocol

Syntax: html_sanitize_url_protocol protocol ...
Default: none
Context: location

Restricts URL protocols in linkable attributes (a.href, img.src, blockquote.cite, q.cite, del.cite, ins.cite, and CSS url() functions). Only absolute URLs are checked — relative URLs pass through unchanged:

html_sanitize_url_protocol http https;
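
The rule "check absolute URLs, let relative URLs through" can be sketched in a few lines of Python. This is a model of the behavior described above, not the module's code; it treats a URL without a scheme as relative, which is an assumption about edge cases such as protocol-relative URLs.

```python
from urllib.parse import urlsplit


def url_protocol_allowed(url: str, allowed: set[str]) -> bool:
    """Return True for relative URLs and for absolute URLs with a whitelisted scheme."""
    scheme = urlsplit(url.strip()).scheme.lower()
    if not scheme:
        return True  # relative URL: passes through unchanged
    return scheme in allowed


allowed = {"http", "https"}
print(url_protocol_allowed("javascript:alert(1)", allowed))
print(url_protocol_allowed("/photo.jpg", allowed))
```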

html_sanitize_url_domain

Syntax: html_sanitize_url_domain domain ...
Default: none
Context: location

Restricts URL domains in linkable attributes. Supports wildcard prefixes. This directive requires url_protocol=1 and url_domain=1 in the query string to take effect:

html_sanitize_url_domain example.com *.example.com;
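
The example lists both example.com and *.example.com. The Python sketch below shows why, under the assumption that a *. prefix matches subdomains only and not the bare domain (as with NGINX server_name wildcards). The module's exact matching rules are not documented here, so treat this as an illustration; host_allowed is a hypothetical name.

```python
def host_allowed(host: str, patterns: list[str]) -> bool:
    """Match a hostname against exact names and '*.' wildcard prefixes."""
    host = host.lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # "*.example.com" -> host must end with ".example.com"
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


print(host_allowed("cdn.example.com", ["example.com", "*.example.com"]))
print(host_allowed("example.com", ["*.example.com"]))
```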

html_sanitize_iframe_url_protocol

Syntax: html_sanitize_iframe_url_protocol protocol ...
Default: none
Context: location

Same as html_sanitize_url_protocol but applies exclusively to iframe.src attributes. This allows you to enforce different protocol rules for iframes:

html_sanitize_iframe_url_protocol https;

html_sanitize_iframe_url_domain

Syntax: html_sanitize_iframe_url_domain domain ...
Default: none
Context: location

Same as html_sanitize_url_domain but applies exclusively to iframe.src attributes:

html_sanitize_iframe_url_domain youtube.com *.youtube.com;
html_sanitize_iframe_url_domain vimeo.com *.vimeo.com;

Query String API

The module’s behavior is controlled per request using query string parameters. This design allows a single NGINX endpoint to serve different sanitization needs.

Element, Attribute, and Style Control

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| element | 0, 1, 2 | 0 | 0 = strip all elements, 1 = allow all, 2 = whitelist only |
| attribute | 0, 1, 2 | 0 | 0 = strip all attributes, 1 = allow all, 2 = whitelist only |
| style_property | 0, 1, 2 | 0 | 0 = strip CSS, 1 = allow all, 2 = whitelist only |
| style_property_value | 0, 1 | 0 | 1 = check CSS values for url() and IE expression() to prevent XSS |

URL Validation

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| url_protocol | 0, 1 | 0 | 1 = enforce whitelisted URL protocols |
| url_domain | 0, 1 | 0 | 1 = enforce whitelisted URL domains (requires url_protocol=1) |
| iframe_url_protocol | 0, 1 | 0 | Same as url_protocol but for iframe.src only |
| iframe_url_domain | 0, 1 | 0 | Same as url_domain but for iframe.src only |

Document Structure

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| document | 0, 1 | 0 | 1 = prepend <!DOCTYPE> to output |
| html | 0, 1 | 0 | 1 = wrap output in <html></html> |
| script | 0, 1 | 0 | 1 = allow <script> tags (must also pass element whitelist if element=2) |
| style | 0, 1 | 0 | 1 = allow <style> tags (must also pass element whitelist if element=2) |
| namespace | 0, 1, 2 | 0 | 0 = HTML, 1 = SVG, 2 = MathML |
| context | 0–149 | 38 (div) | Gumbo parser context tag (38 = <div>) |

Note on script and style parameters: The script and style flags act as additional blockers. Setting script=0 blocks <script> tags even if they are in the element whitelist. Setting script=1 lifts that block, but with element=2, the script tag must also be listed in html_sanitize_element to appear in the output. With element=1 (allow all), setting script=1 is sufficient.
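
The interaction between the script/style flags and the element mode is easy to get wrong, so here it is as a small Python truth function. It encodes only the rules stated in the note above; flagged_tag_output is a hypothetical name, not part of the module.

```python
def flagged_tag_output(tag: str, flag: int, element_mode: int,
                       whitelist: set[str]) -> bool:
    """Model whether a <script> or <style> tag appears in the output.

    tag          -- "script" or "style"
    flag         -- value of the matching query parameter (script= or style=)
    element_mode -- value of the element= query parameter
    """
    if flag == 0:
        return False             # the flag blocks the tag regardless of the whitelist
    if element_mode == 1:
        return True              # allow all: lifting the block is sufficient
    if element_mode == 2:
        return tag in whitelist  # must also be listed in html_sanitize_element
    return False                 # element=0 strips all elements
```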

Common Query String Combinations

For a typical use case — accepting user-generated content with whitelisted elements and safe URLs — use:

?element=2&attribute=2&style_property=2&style_property_value=1&url_protocol=1

To strip everything except plain text:

?element=0&attribute=0

To allow all elements but restrict attributes to the whitelist:

?element=1&attribute=2
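
When query strings are assembled in application code, it helps to validate them against the value ranges in the tables above. The following Python sketch does that; build_query and ALLOWED_VALUES are hypothetical names, and the ranges are copied from this article rather than from the module source.

```python
from urllib.parse import urlencode

# Value ranges as listed in the Query String API tables.
ALLOWED_VALUES = {
    "element": range(0, 3),
    "attribute": range(0, 3),
    "style_property": range(0, 3),
    "style_property_value": range(0, 2),
    "url_protocol": range(0, 2),
    "url_domain": range(0, 2),
    "iframe_url_protocol": range(0, 2),
    "iframe_url_domain": range(0, 2),
    "document": range(0, 2),
    "html": range(0, 2),
    "script": range(0, 2),
    "style": range(0, 2),
    "namespace": range(0, 3),
    "context": range(0, 150),
}


def build_query(**params: int) -> str:
    """Validate parameters and return a query string starting with '?'."""
    for name, value in params.items():
        if name not in ALLOWED_VALUES:
            raise ValueError(f"unknown parameter: {name}")
        if value not in ALLOWED_VALUES[name]:
            raise ValueError(f"{name}={value} is out of range")
    if params.get("url_domain") == 1 and params.get("url_protocol") != 1:
        raise ValueError("url_domain=1 requires url_protocol=1")
    return "?" + urlencode(params)


print(build_query(element=2, attribute=2, style_property=2,
                  style_property_value=1, url_protocol=1))
```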

Testing the Module

Once the module is configured, test it with curl. The module only accepts POST requests — GET requests return HTTP 405 (Method Not Allowed). Send HTML content via POST and include the appropriate query string parameters:

# Basic test: whitelist elements, allow all attributes
curl -X POST -d '<h1>Hello World</h1><script>alert("xss")</script>' \
  'http://127.0.0.1:8888/sanitize?element=2&attribute=1'

Expected output (the <script> tag is stripped because it was not in the html_sanitize_element whitelist):

<h1>Hello World</h1>

Test attribute filtering:

# Whitelist attributes: only configured attributes pass through
curl -X POST \
  -d '<img src="/photo.jpg" onerror="alert(1)" alt="Photo" />' \
  'http://127.0.0.1:8888/sanitize?element=2&attribute=2'

Expected output (the dangerous onerror attribute is stripped):

<img src="/photo.jpg" alt="Photo" />

Test inline CSS filtering:

# Whitelist style properties
curl -X POST \
  -d '<p style="color:red;position:fixed;top:0;">Styled text</p>' \
  'http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=2'

Expected output (position and top are not whitelisted, only color passes through):

<p style="color:red;">Styled text</p>

Test URL protocol filtering:

# Block javascript: URLs, allow https:
curl -X POST \
  -d '<a href="javascript:alert(1)">click</a><a href="https://safe.com">safe</a>' \
  'http://127.0.0.1:8888/sanitize?element=2&attribute=2&url_protocol=1'

Expected output (the javascript: href is removed):

<a>click</a><a href="https://safe.com">safe</a>
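
The four curl checks above can be turned into a repeatable smoke test. The Python sketch below records each input, its query parameters, and the expected output from this section, and reports which cases differ. The name failed_cases is hypothetical, and the sanitize argument is any function you supply that POSTs the HTML to your endpoint and returns the response body.

```python
from typing import Callable

# (input HTML, query parameters, expected output) taken from the curl examples.
CASES = [
    ('<h1>Hello World</h1><script>alert("xss")</script>',
     {"element": 2, "attribute": 1},
     '<h1>Hello World</h1>'),
    ('<img src="/photo.jpg" onerror="alert(1)" alt="Photo" />',
     {"element": 2, "attribute": 2},
     '<img src="/photo.jpg" alt="Photo" />'),
    ('<p style="color:red;position:fixed;top:0;">Styled text</p>',
     {"element": 2, "attribute": 1, "style_property": 2},
     '<p style="color:red;">Styled text</p>'),
    ('<a href="javascript:alert(1)">click</a><a href="https://safe.com">safe</a>',
     {"element": 2, "attribute": 2, "url_protocol": 1},
     '<a>click</a><a href="https://safe.com">safe</a>'),
]


def failed_cases(sanitize: Callable[[str, dict], str]) -> list[str]:
    """Run every case through sanitize() and return the inputs that failed."""
    failures = []
    for html, params, expected in CASES:
        actual = sanitize(html, params).strip()
        if actual != expected:
            failures.append(html)
    return failures
```

An empty list means every case produced the expected output. The comparison is an exact string match, so adjust the expected values if your build serializes whitespace or void elements differently.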

Verifying the Module Is Loaded

To confirm the module is correctly loaded, check that your configuration passes the syntax test:

nginx -t

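If the configuration file includes html_sanitize directives and nginx -t succeeds, the module is loaded and its directives are recognized. This check validates configuration syntax only; use the curl tests above to confirm that sanitization behaves as expected.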

Use Cases

User-Generated Content Platforms

Forums, comment systems, and social platforms often allow users to submit rich HTML content. The NGINX HTML sanitize module strips dangerous tags like <script>, <object>, and <embed> while preserving formatting elements like <p>, <strong>, and <a>.

CMS and WYSIWYG Editors

Content management systems with WYSIWYG editors produce HTML that may contain unexpected or malicious markup. Route the editor output through the sanitization endpoint before storing it in the database.

Email Template Processing

Email rendering engines sometimes process user-supplied HTML templates. Sanitizing these templates at the NGINX level removes dangerous constructs before the email engine processes them.

API Gateway Sanitization

In a microservice architecture, deploy the NGINX HTML sanitize module as an API gateway that all HTML-accepting services route through. This creates a single point of enforcement for HTML security policies. For comprehensive protection, combine it with ModSecurity and security headers.

Performance Considerations

The module processes HTML entirely in memory using gumbo-parser’s C implementation. Benchmark results from the module’s test suite on an Intel Xeon E5-2630 v3 show strong throughput for typical content sizes:

| Input | Size | Avg Latency | Requests/sec |
| --- | --- | --- | --- |
| Hacker News | 30 KB | 9 ms | 2,921 |
| Baidu | 76 KB | 13 ms | 1,815 |
| Arabic newspapers | 78 KB | 16 ms | 1,112 |
| BBC | 115 KB | 17 ms | 993 |
| Xinhua | 323 KB | 33 ms | 275 |
| Wikipedia | 511 KB | 57 ms | 160 |
| HTML5 spec | 7.7 MB | 1.6 s | 2 |
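
One way to compare these rows is to convert them into payload throughput (input size multiplied by requests per second). The short Python sketch below does that arithmetic for the rows given in KB. It assumes 1 MB = 1024 KB and says nothing about the concurrency used in the benchmark, which is not stated here.

```python
# (input, size in KB, requests per second) copied from the benchmark table.
BENCHMARKS = [
    ("Hacker News", 30, 2921),
    ("Baidu", 76, 1815),
    ("Arabic newspapers", 78, 1112),
    ("BBC", 115, 993),
    ("Xinhua", 323, 275),
    ("Wikipedia", 511, 160),
]


def throughput_mb_per_s(size_kb: float, requests_per_sec: float) -> float:
    """Payload processed per second, in MB (1 MB = 1024 KB)."""
    return size_kb * requests_per_sec / 1024


for name, size_kb, rps in BENCHMARKS:
    print(f"{name}: {throughput_mb_per_s(size_kb, rps):.1f} MB/s")
```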

For typical user-generated content (under 100 KB), the module processes requests in under 20 milliseconds. However, for very large HTML documents, set appropriate client_max_body_size limits to prevent memory exhaustion:

location = /sanitize {
    html_sanitize on;
    client_body_buffer_size 1M;
    client_max_body_size 1M;
    # ... whitelist directives
}

Security Best Practices

Always Use Whitelist Mode

Set element=2 and attribute=2 in your query strings rather than element=1 and attribute=1 (allow all). With a whitelist, anything you have not explicitly listed is blocked by default, including new or obscure elements that a blacklist would miss.

Enable CSS Value Checking

Always set style_property_value=1 when allowing inline styles. This parameter checks CSS values for url() functions and Internet Explorer’s expression() function. Both of these can be vectors for XSS attacks:

?element=2&attribute=2&style_property=2&style_property_value=1

Restrict URL Protocols

Enable url_protocol=1 and whitelist only http and https. This prevents javascript: protocol URLs, which are a common XSS vector:

html_sanitize_url_protocol http https;

Never Allow Script Tags

Keep the script query parameter at its default value of 0. There is rarely a legitimate reason to allow <script> tags in user-generated content.

Restrict iframe Domains

If your application requires iframe embeds, use html_sanitize_iframe_url_domain to whitelist only trusted providers:

html_sanitize_iframe_url_protocol https;
html_sanitize_iframe_url_domain youtube.com *.youtube.com;
html_sanitize_iframe_url_domain vimeo.com *.vimeo.com;

Limit Request Body Size

Set a reasonable client_max_body_size to prevent denial-of-service attacks through extremely large HTML payloads. Oversized requests are rejected with HTTP 413:

client_max_body_size 512k;
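
A calling service can avoid a wasted round trip by checking the payload size before sending it. The Python sketch below parses NGINX size values with the k and m suffixes and compares the UTF-8 byte length against the limit. The function names are hypothetical, and the limit must be kept in sync with your actual configuration by hand.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2}  # NGINX size suffixes (case-insensitive)


def parse_nginx_size(value: str) -> int:
    """Convert an NGINX size such as '512k' or '10M' to bytes."""
    value = value.strip().lower()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)


def fits_body_limit(html: str, limit: str = "512k") -> bool:
    """Return True if the encoded payload would pass client_max_body_size."""
    limit_bytes = parse_nginx_size(limit)
    if limit_bytes == 0:
        return True  # in NGINX, a value of 0 disables the body size check
    return len(html.encode("utf-8")) <= limit_bytes
```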

Troubleshooting

“unknown directive html_sanitize”

This error means the module is not loaded. Verify that you have the load_module directive at the top of nginx.conf:

load_module modules/ngx_http_html_sanitize_module.so;

Also verify that the module file exists (the path below applies to RHEL-family systems):

ls /usr/lib64/nginx/modules/ngx_http_html_sanitize_module.so

Empty Response Body

If the sanitization endpoint returns an empty body, check these common causes:

  1. Missing structural elements: The most common cause. You must include html, head, and body in your html_sanitize_element whitelist when using element=2. Without them, gumbo-parser’s internal tree structure blocks traversal to child elements.
  2. Wrong HTTP method: The module only accepts POST requests. GET requests return HTTP 405 (Method Not Allowed). Use curl -X POST -d "...", not GET requests.
  3. Too many arguments per directive: NGINX limits each directive to 7 arguments. If you list more than 7 elements on a single html_sanitize_element line, the extras are silently ignored.
  4. Missing query string: The module defaults to element=0 (strip all). Include at least element=2 (or element=1) to produce output.
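
The first and third causes can be caught before deployment with a simple check of the configuration text. The Python sketch below scans for html_sanitize_element directives using a regular expression; it is not a full NGINX config parser, and lint_element_whitelist is a hypothetical helper.

```python
import re

REQUIRED = {"html", "head", "body"}
MAX_ARGS = 7  # per-directive limit described in this article

# Matches uncommented html_sanitize_element directives and captures their arguments.
_DIRECTIVE = re.compile(r"^\s*html_sanitize_element\s+([^;#]+);", re.MULTILINE)


def lint_element_whitelist(config: str) -> list[str]:
    """Return a list of problems found in the element whitelist."""
    problems = []
    seen: set[str] = set()
    for match in _DIRECTIVE.finditer(config):
        args = match.group(1).split()
        if len(args) > MAX_ARGS:
            problems.append(
                f"more than {MAX_ARGS} elements in one directive: {' '.join(args)}"
            )
        # Only the first MAX_ARGS elements count, since extras are ignored.
        seen.update(args[:MAX_ARGS])
    missing = REQUIRED - seen
    if missing:
        problems.append("missing structural elements: " + ", ".join(sorted(missing)))
    return problems
```

An empty list means neither problem was detected. The HTTP method and query string still need to be checked on the client side.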

Hash Table Errors

If you see errors related to hash table initialization, increase the hash table size:

html_sanitize_hash_max_size 4096;
html_sanitize_hash_bucket_size 128;

Large Payloads Rejected

If large HTML payloads are rejected with a 413 error, increase the client body size limit:

client_body_buffer_size 10M;
client_max_body_size 10M;

Conclusion

The NGINX HTML sanitize module provides a powerful, language-agnostic approach to HTML sanitization at the infrastructure level. By leveraging Google’s gumbo-parser for standards-compliant HTML5 parsing, the module ensures that malicious elements, attributes, and CSS properties are stripped before they reach your application code.

For organizations running multiple backend services that accept HTML content, this module eliminates the need to maintain separate sanitization libraries in each language. Moreover, it centralizes security policy at the NGINX layer, making it easier to audit and update.

The module’s source code is available on GitHub under the Apache 2.0 license.

Danila Vershinin

Founder & Lead Engineer

NGINX configuration and optimization • Linux system administration • Web performance engineering

10+ years NGINX experience • Maintainer of GetPageSpeed RPM repository • Contributor to open-source NGINX modules
