The NGINX HTML sanitize module (ngx_http_html_sanitize_module) lets you strip dangerous markup from untrusted HTML content directly at the web server level. Cross-site scripting (XSS) remains one of the most prevalent web security threats, and this module addresses it by transforming NGINX into an HTML sanitization microservice. It parses HTML5 content and outputs only whitelisted elements, attributes, and CSS properties — before malicious code ever reaches your backend.
Built on top of Google’s gumbo-parser for standards-compliant HTML5 parsing and the katana-parser for inline CSS analysis, the NGINX HTML sanitize module provides a language-agnostic sanitization layer that any backend can use.
Why Sanitize HTML at the NGINX Level?
Most web frameworks include their own HTML sanitization libraries. However, there are several compelling reasons to move this responsibility to the NGINX layer:
- Language-agnostic protection: A single NGINX sanitization endpoint serves applications written in any language — Python, PHP, Go, Java, or Node.js — without duplicating sanitization logic.
- Centralized security policy: Security teams define the whitelist once in the NGINX configuration, rather than auditing sanitization rules across multiple codebases.
- Reduced attack surface: Malicious HTML is stripped before it reaches application code, which reduces the risk of parser-specific bypasses in individual frameworks.
- Professional maintenance: Security engineers maintain the sanitization rules at the infrastructure level, independent of application developers.
This approach is especially useful in microservice architectures where multiple services accept HTML input. Additionally, it eliminates the need to keep sanitization libraries updated across every service individually.
If you are already using NGINX security headers or a WAF like NAXSI, this module is a natural extension of your defense-in-depth strategy.
How the NGINX HTML Sanitize Module Works
The module operates as a content handler, not a response filter. This is an important distinction: rather than filtering HTML in proxied responses, NGINX itself becomes the sanitization service.
Here is the typical workflow:
- A client sends an HTTP POST request with raw HTML in the request body.
- The NGINX HTML sanitize module parses the HTML using gumbo-parser.
- The module walks the parsed HTML tree and filters it against configured whitelists.
- For elements with inline CSS style attributes, katana-parser analyzes the CSS properties.
- NGINX returns the sanitized HTML in the response body.
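The steps above can be sketched from the client side. This is a minimal Python sketch, assuming the endpoint used in this article's examples (http://127.0.0.1:8888/sanitize); build_sanitize_request and sanitize are hypothetical helper names, not part of the module.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: matches the example configuration in this article.
SANITIZE_URL = "http://127.0.0.1:8888/sanitize"


def build_sanitize_request(html, params, url=SANITIZE_URL):
    """Build a POST request carrying raw HTML in the body (hypothetical helper)."""
    query = urlencode(params)
    # No explicit Content-Type: urllib then sends the same default form type
    # that `curl -d` uses in this article's test commands.
    return Request(
        f"{url}?{query}",
        data=html.encode("utf-8"),
        method="POST",  # the module only handles POST requests
    )


def sanitize(html, params, url=SANITIZE_URL, timeout=5):
    """Send the HTML to the sanitization endpoint and return the cleaned markup."""
    request = build_sanitize_request(html, params, url)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A backend would call something like sanitize(user_html, {"element": 2, "attribute": 2}) before storing or rendering the content.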
The module supports three filtering modes for elements, attributes, and CSS properties:
| Mode | Value | Behavior |
|---|---|---|
| Disabled | 0 | Strip all (no output) |
| Allow all | 1 | Output everything |
| Whitelist | 2 | Output only whitelisted items |
These modes are controlled per request via query string parameters, giving you fine-grained control over sanitization behavior.
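To keep the numeric modes readable in client code, they can be given names. The sketch below is one possible way to do that in Python. The Mode enum and mode_query helper are hypothetical, while the parameter names element, attribute, and style_property are the ones listed in the Query String API section.

```python
from enum import IntEnum
from urllib.parse import urlencode


class Mode(IntEnum):
    """Filtering modes as listed in the table above."""
    DISABLED = 0   # strip all (no output)
    ALLOW_ALL = 1  # output everything
    WHITELIST = 2  # output only whitelisted items


def mode_query(element=Mode.DISABLED, attribute=Mode.DISABLED,
               style_property=Mode.DISABLED):
    """Return the query string selecting a mode for each filter (hypothetical helper)."""
    return urlencode({
        "element": int(element),
        "attribute": int(attribute),
        "style_property": int(style_property),
    })
```

For example, mode_query(Mode.WHITELIST, Mode.WHITELIST) yields a query string that whitelists elements and attributes and strips inline CSS.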
Installation
RHEL, CentOS, AlmaLinux, Rocky Linux
Install the module from the GetPageSpeed RPM repository:
sudo dnf install https://extras.getpagespeed.com/release-latest.rpm
sudo dnf install nginx-module-html-sanitize
After installation, load the module by adding this directive to the top of /etc/nginx/nginx.conf:
load_module modules/ngx_http_html_sanitize_module.so;
Debian and Ubuntu
First, set up the GetPageSpeed APT repository, then install the module:
sudo apt-get update
sudo apt-get install nginx-module-html-sanitize
On Debian/Ubuntu, the package handles module loading automatically. No load_module directive is needed.
You can find the module’s RPM package details and APT package details on the respective pages.
Configuration
All directives for the NGINX HTML sanitize module are available in the location context only. This means you define a specific location block that acts as your sanitization endpoint.
Critical: Always Include html, head, and body
When using whitelist mode (element=2), you must include html, head, and body in the html_sanitize_element whitelist. The gumbo-parser internally wraps all input content in these structural tags. If they are not whitelisted, the tree traversal stops at the root and no child elements are output — resulting in an empty response.
These structural tags are not included in the response output by default (controlled by the html and document query parameters). However, they must be present in the whitelist for the module to traverse into child elements.
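A toy model makes the traversal rule easier to see. This is a sketch of the behavior described above, not the module's C code: the tuple-based tree and the render function are hypothetical, and it assumes that a non-whitelisted element drops its entire subtree.

```python
# Structural tags are traversed but, by default, not written to the output.
STRUCTURAL = {"html", "head", "body"}


def render(node, whitelist):
    """Render a (tag, children) tree, keeping only whitelisted elements."""
    if isinstance(node, str):
        return node  # text node
    tag, children = node
    if tag not in whitelist:
        return ""  # traversal stops here, so children are never reached
    inner = "".join(render(child, whitelist) for child in children)
    if tag in STRUCTURAL:
        return inner
    return f"<{tag}>{inner}</{tag}>"


# The parser wraps a fragment such as <h1>Hello</h1> in html/head/body.
tree = ("html", [("head", []), ("body", [("h1", ["Hello"])])])
```

With a whitelist of only {"h1"}, render(tree, ...) returns an empty string because the walk never gets past the html root. Adding html, head, and body lets it reach the h1.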
NGINX Directive Argument Limit
NGINX limits each directive to a maximum of 7 arguments (the NGX_CONF_MAX_ARGS constant is 8, which includes the directive name itself). If you specify more than 7 elements on a single html_sanitize_element line, the extra ones are silently ignored. Therefore, split long element lists across multiple directives:
# Correct: max 7 elements per line
html_sanitize_element html head body h1 h2 h3 h4;
html_sanitize_element h5 h6 p br pre div span;
# Wrong: more than 7 elements (extras silently ignored)
html_sanitize_element html head body h1 h2 h3 h4 h5 h6 p br pre div span;
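When the whitelist is generated rather than hand-written, the split can be automated. The following Python sketch assumes the 7-argument limit described above; directive_lines is a hypothetical helper name.

```python
MAX_ARGS = 7  # per-directive argument limit described above


def directive_lines(directive, values, max_args=MAX_ARGS):
    """Split values into directive lines of at most max_args arguments each."""
    return [
        f"{directive} {' '.join(values[i:i + max_args])};"
        for i in range(0, len(values), max_args)
    ]


elements = "html head body h1 h2 h3 h4 h5 h6 p br pre div span".split()
for line in directive_lines("html_sanitize_element", elements):
    print(line)
```

The same helper works for html_sanitize_attribute and html_sanitize_style_property lists.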
Basic Configuration Example
The following example sets up a sanitization endpoint at /sanitize that allows common HTML elements, safe attributes, and basic CSS properties:
server {
listen 8888;
location = /sanitize {
add_header Content-Type "text/html; charset=UTF-8";
client_body_buffer_size 10M;
client_max_body_size 10M;
html_sanitize on;
# Required structural elements
html_sanitize_element html head body;
# Sections
html_sanitize_element h1 h2 h3 h4 h5 h6;
html_sanitize_element section nav article aside;
html_sanitize_element header footer;
# Grouping content
html_sanitize_element p hr br pre blockquote;
html_sanitize_element ol ul li dl dt dd;
# Text-level semantics
html_sanitize_element a q cite em strong small;
html_sanitize_element mark dfn abbr time code;
html_sanitize_element var samp kbd sub sup span;
html_sanitize_element i b;
# Edits
html_sanitize_element ins del;
# Embedded content
html_sanitize_element img;
# Tables
html_sanitize_element table caption colgroup col;
html_sanitize_element tbody thead tfoot tr td th;
# Miscellaneous
html_sanitize_element div legend;
# Allowed attributes
html_sanitize_attribute *.style;
html_sanitize_attribute a.href a.rel a.name;
html_sanitize_attribute img.src img.alt img.width;
html_sanitize_attribute img.height;
html_sanitize_attribute td.colspan td.rowspan;
html_sanitize_attribute th.colspan th.rowspan th.scope;
html_sanitize_attribute ol.type ol.reversed;
html_sanitize_attribute table.border table.cellpadding;
html_sanitize_attribute table.cellspacing table.width;
# CSS properties
html_sanitize_style_property color font-size;
html_sanitize_style_property background-color text-align;
html_sanitize_style_property font-weight;
# URL restrictions
html_sanitize_url_protocol http https;
}
}
Directive Reference
html_sanitize
Syntax: html_sanitize on | off
Default: html_sanitize on
Context: location
Enables or disables the HTML sanitize handler in the given location. When enabled, the location processes POST request bodies as HTML and returns sanitized output.
html_sanitize_hash_max_size
Syntax: html_sanitize_hash_max_size size
Default: html_sanitize_hash_max_size 2048
Context: location
Sets the maximum size of the internal hash tables used for elements, attributes, style properties, URL protocols, and URL domains. Increase this value if you have a large number of whitelisted items.
html_sanitize_hash_bucket_size
Syntax: html_sanitize_hash_bucket_size size
Default: depends on CPU cache line size (typically 64)
Context: location
Sets the bucket size for the internal hash tables. The default value depends on the processor’s cache line size.
html_sanitize_element
Syntax: html_sanitize_element element ...
Default: none
Context: location
Defines whitelisted HTML5 elements. When the element=2 query string parameter is used, only elements listed in this directive are included in the output. You can use the directive multiple times. Remember to always include html, head, and body:
html_sanitize_element html head body;
html_sanitize_element p br div span;
html_sanitize_element h1 h2 h3 h4 h5 h6;
html_sanitize_element ul ol li;
html_sanitize_attribute
Syntax: html_sanitize_attribute attribute ...
Default: none
Context: location
Defines whitelisted HTML5 attributes using the element.attribute format. Additionally, the directive supports wildcards:
- a.href: allow href only on <a> tags
- *.style: allow style on any element
- img.*: allow all attributes on <img> tags
Example:
html_sanitize_attribute *.style *.class;
html_sanitize_attribute a.href a.rel;
html_sanitize_attribute img.src img.alt img.width;
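The wildcard rules can be modeled in a few lines of Python. This is a sketch of the matching as described above, not the module's implementation; attribute_allowed is a hypothetical name, and case handling is left out for brevity.

```python
def attribute_allowed(element, attribute, whitelist):
    """Check element.attribute against entries such as a.href, *.style, img.*."""
    candidates = {
        f"{element}.{attribute}",  # exact entry, e.g. a.href
        f"*.{attribute}",          # attribute allowed on any element
        f"{element}.*",            # any attribute allowed on this element
    }
    return not candidates.isdisjoint(whitelist)


whitelist = {"*.style", "a.href", "img.*"}
```

Under this reading, img.* would also admit event-handler attributes such as onerror on <img>, so listing attributes explicitly is the more conservative choice.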
html_sanitize_style_property
Syntax: html_sanitize_style_property property ...
Default: none
Context: location
Defines whitelisted CSS properties for inline style attributes. When style_property=2 is set in the query string, only listed CSS properties pass through:
html_sanitize_style_property color font-size;
html_sanitize_style_property background-color text-align;
html_sanitize_url_protocol
Syntax: html_sanitize_url_protocol protocol ...
Default: none
Context: location
Restricts URL protocols in linkable attributes (a.href, img.src, blockquote.cite, q.cite, del.cite, ins.cite, and CSS url() functions). Only absolute URLs are checked — relative URLs pass through unchanged:
html_sanitize_url_protocol http https;
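The rule that only absolute URLs are checked can be illustrated with Python's standard URL parser. This is a model of the rule as stated above, not the module's own parsing; url_protocol_allowed is a hypothetical name.

```python
from urllib.parse import urlsplit


def url_protocol_allowed(url, protocols=("http", "https")):
    """Absolute URLs need a whitelisted scheme; relative URLs are not checked."""
    scheme = urlsplit(url.strip()).scheme
    if not scheme:
        return True  # relative URL: passes through unchanged
    return scheme.lower() in protocols
```

With the http/https whitelist, https://safe.com and /photo.jpg pass, while javascript:alert(1) is rejected.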
html_sanitize_url_domain
Syntax: html_sanitize_url_domain domain ...
Default: none
Context: location
Restricts URL domains in linkable attributes. Supports wildcard prefixes. This directive requires url_protocol=1 and url_domain=1 in the query string to take effect:
html_sanitize_url_domain example.com *.example.com;
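The wildcard prefix can be pictured with a small host matcher. The module's exact matching rules are not spelled out here, so this Python sketch rests on an assumption: *.example.com covers subdomains only, which would explain why the example lists both example.com and *.example.com. host_matches is a hypothetical name.

```python
def host_matches(host, patterns):
    """Match a host name against exact domains and *.domain wildcard entries."""
    host = host.lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # "*.example.com" -> ".example.com"
            # Assumption: the wildcard needs at least one label before the suffix.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


patterns = ["example.com", "*.example.com"]
```

Matching on the dot-prefixed suffix keeps look-alike hosts such as evilexample.com or example.com.evil.net from slipping through.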
html_sanitize_iframe_url_protocol
Syntax: html_sanitize_iframe_url_protocol protocol ...
Default: none
Context: location
Same as html_sanitize_url_protocol but applies exclusively to iframe.src attributes. This allows you to enforce different protocol rules for iframes:
html_sanitize_iframe_url_protocol https;
html_sanitize_iframe_url_domain
Syntax: html_sanitize_iframe_url_domain domain ...
Default: none
Context: location
Same as html_sanitize_url_domain but applies exclusively to iframe.src attributes:
html_sanitize_iframe_url_domain youtube.com *.youtube.com;
html_sanitize_iframe_url_domain vimeo.com *.vimeo.com;
Query String API
The module’s behavior is controlled per request using query string parameters. This design allows a single NGINX endpoint to serve different sanitization needs.
Element, Attribute, and Style Control
| Parameter | Values | Default | Description |
|---|---|---|---|
| element | 0, 1, 2 | 0 | 0 = strip all elements, 1 = allow all, 2 = whitelist only |
| attribute | 0, 1, 2 | 0 | 0 = strip all attributes, 1 = allow all, 2 = whitelist only |
| style_property | 0, 1, 2 | 0 | 0 = strip CSS, 1 = allow all, 2 = whitelist only |
| style_property_value | 0, 1 | 0 | 1 = check CSS values for url() and IE expression() to prevent XSS |
URL Validation
| Parameter | Values | Default | Description |
|---|---|---|---|
| url_protocol | 0, 1 | 0 | 1 = enforce whitelisted URL protocols |
| url_domain | 0, 1 | 0 | 1 = enforce whitelisted URL domains (requires url_protocol=1) |
| iframe_url_protocol | 0, 1 | 0 | Same as url_protocol but for iframe.src only |
| iframe_url_domain | 0, 1 | 0 | Same as url_domain but for iframe.src only |
Document Structure
| Parameter | Values | Default | Description |
|---|---|---|---|
| document | 0, 1 | 0 | 1 = prepend <!DOCTYPE> to output |
| html | 0, 1 | 0 | 1 = wrap output in <html></html> |
| script | 0, 1 | 0 | 1 = allow <script> tags (must also pass element whitelist if element=2) |
| style | 0, 1 | 0 | 1 = allow <style> tags (must also pass element whitelist if element=2) |
| namespace | 0, 1, 2 | 0 | 0 = HTML, 1 = SVG, 2 = MathML |
| context | 0–149 | 38 (div) | Gumbo parser context tag (38 = <div>) |
Note on script and style parameters: The script and style flags act as additional blockers. Setting script=0 blocks <script> tags even if they are in the element whitelist. Setting script=1 lifts that block, but with element=2, the script tag must also be listed in html_sanitize_element to appear in the output. With element=1 (allow all), setting script=1 is sufficient.
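The interaction described in the note can be written as a small decision function. This Python sketch models the note only; tag_emitted is a hypothetical name and does not reflect the module's internal structure.

```python
def tag_emitted(tag, element_mode, whitelist, script=0, style=0):
    """Decide whether a tag appears in the output under the given flags."""
    # The script and style flags act as extra blockers, checked first.
    if tag == "script" and not script:
        return False
    if tag == "style" and not style:
        return False
    if element_mode == 0:  # strip all elements
        return False
    if element_mode == 1:  # allow all elements
        return True
    return tag in whitelist  # element=2: whitelist only


whitelist = {"html", "head", "body", "p", "script"}
```

With element=2, a <script> tag is emitted only when script=1 and the tag is whitelisted. With element=1, script=1 alone is enough.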
Common Query String Combinations
For a typical use case — accepting user-generated content with whitelisted elements and safe URLs — use:
?element=2&attribute=2&style_property=2&style_property_value=1&url_protocol=1
To strip everything except plain text:
?element=0&attribute=0
To allow all elements but restrict attributes to the whitelist:
?element=1&attribute=2
Testing the Module
Once the module is configured, test it with curl. The module only accepts POST requests — GET requests return HTTP 405 (Method Not Allowed). Send HTML content via POST and include the appropriate query string parameters:
# Basic test: whitelist elements, allow all attributes
curl -X POST -d '<h1>Hello World</h1><script>alert("xss")</script>' \
'http://127.0.0.1:8888/sanitize?element=2&attribute=1'
Expected output (the <script> tag is stripped because it was not in the html_sanitize_element whitelist):
<h1>Hello World</h1>
Test attribute filtering:
# Whitelist attributes: only configured attributes pass through
curl -X POST \
-d '<img src="/photo.jpg" onerror="alert(1)" alt="Photo" />' \
'http://127.0.0.1:8888/sanitize?element=2&attribute=2'
Expected output (the dangerous onerror attribute is stripped):
<img src="/photo.jpg" alt="Photo" />
Test inline CSS filtering:
# Whitelist style properties
curl -X POST \
-d '<p style="color:red;position:fixed;top:0;">Styled text</p>' \
'http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=2'
Expected output (position and top are not whitelisted, only color passes through):
<p style="color:red;">Styled text</p>
Test URL protocol filtering:
# Block javascript: URLs, allow https:
curl -X POST \
-d '<a href="javascript:alert(1)">click</a><a href="https://safe.com">safe</a>' \
'http://127.0.0.1:8888/sanitize?element=2&attribute=2&url_protocol=1'
Expected output (the javascript: href is removed):
<a>click</a><a href="https://safe.com">safe</a>
Verifying the Module Is Loaded
To confirm the module is correctly loaded, check that your configuration passes the syntax test:
nginx -t
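If the configuration file includes html_sanitize directives and nginx -t succeeds, NGINX recognized those directives, which confirms that the module is loaded. To confirm that sanitization behaves as expected, run the curl tests from the previous section.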
Use Cases
User-Generated Content Platforms
Forums, comment systems, and social platforms often allow users to submit rich HTML content. The NGINX HTML sanitize module strips dangerous tags like <script>, <object>, and <embed> while preserving formatting elements like <p>, <strong>, and <a>.
CMS and WYSIWYG Editors
Content management systems with WYSIWYG editors produce HTML that may contain unexpected or malicious markup. Route the editor output through the sanitization endpoint before storing it in the database.
Email Template Processing
Email rendering engines sometimes process user-supplied HTML templates. Sanitizing these templates at the NGINX level removes dangerous constructs before the email engine processes them.
API Gateway Sanitization
In a microservice architecture, deploy the NGINX HTML sanitize module as an API gateway that all HTML-accepting services route through. This creates a single point of enforcement for HTML security policies. For comprehensive protection, combine it with ModSecurity and security headers.
Performance Considerations
The module processes HTML entirely in memory using gumbo-parser’s C implementation. Benchmark results from the module’s test suite on an Intel Xeon E5-2630 v3 show strong throughput for typical content sizes:
| Input | Size | Avg Latency | Requests/sec |
|---|---|---|---|
| Hacker News | 30 KB | 9 ms | 2,921 |
| Baidu | 76 KB | 13 ms | 1,815 |
| Arabic newspapers | 78 KB | 16 ms | 1,112 |
| BBC | 115 KB | 17 ms | 993 |
| Xinhua | 323 KB | 33 ms | 275 |
| Wikipedia | 511 KB | 57 ms | 160 |
| HTML5 spec | 7.7 MB | 1.6 s | 2 |
For typical user-generated content (under 100 KB), the module processes requests in under 20 milliseconds. However, for very large HTML documents, set appropriate client_max_body_size limits to prevent memory exhaustion:
location = /sanitize {
html_sanitize on;
client_body_buffer_size 1M;
client_max_body_size 1M;
# ... whitelist directives
}
Security Best Practices
Always Use Whitelist Mode
Set element=2 and attribute=2 in your query strings rather than element=1 or attribute=1 (allow all). A whitelist blocks by default, so new or obscure elements and attributes stay out of the output unless you explicitly permit them.
Enable CSS Value Checking
Always set style_property_value=1 when allowing inline styles. This parameter checks CSS values for url() functions and Internet Explorer’s expression() function. Both of these can be vectors for XSS attacks:
?element=2&attribute=2&style_property=2&style_property_value=1
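To show what kind of CSS values this check targets, the Python sketch below flags the two constructs named above. It is an illustration only: the regular expression and the risky_css_functions name are assumptions, and the module itself relies on a CSS parser rather than pattern matching.

```python
import re

# Assumed patterns: the two function names mentioned above, case-insensitive.
_RISKY = re.compile(r"\b(url|expression)\s*\(", re.IGNORECASE)


def risky_css_functions(value):
    """Return the sorted names of risky functions found in one CSS value."""
    return sorted({match.group(1).lower() for match in _RISKY.finditer(value)})
```

A regular expression like this is easy to evade with CSS comments or escapes, which is one reason to leave the real check to a parser-based sanitizer.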
Restrict URL Protocols
Enable url_protocol=1 and whitelist only http and https. This prevents javascript: protocol URLs, which are a common XSS vector:
html_sanitize_url_protocol http https;
Never Allow Script Tags
Keep the script query parameter at its default value of 0. There is rarely a legitimate reason to allow <script> tags in user-generated content.
Restrict iframe Domains
If your application requires iframe embeds, use html_sanitize_iframe_url_domain to whitelist only trusted providers:
html_sanitize_iframe_url_protocol https;
html_sanitize_iframe_url_domain youtube.com *.youtube.com;
html_sanitize_iframe_url_domain vimeo.com *.vimeo.com;
Limit Request Body Size
Set a reasonable client_max_body_size to prevent denial-of-service attacks through extremely large HTML payloads. Oversized requests are rejected with HTTP 413:
client_max_body_size 512k;
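A client can check the payload size before sending to avoid a round trip that ends in HTTP 413. This Python sketch assumes the 512k limit shown above; nginx_size_to_bytes and fits_body_limit are hypothetical helper names.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2}  # NGINX size suffixes, case-insensitive


def nginx_size_to_bytes(size):
    """Convert an NGINX size value such as 512k or 10M to bytes."""
    size = size.strip()
    unit = size[-1].lower()
    if unit in _UNITS:
        return int(size[:-1]) * _UNITS[unit]
    return int(size)  # plain number: already bytes


def fits_body_limit(html, limit="512k"):
    """Return True if the UTF-8 encoded payload is within client_max_body_size."""
    limit_bytes = nginx_size_to_bytes(limit)
    if limit_bytes == 0:
        return True  # in NGINX, 0 disables the body size check
    return len(html.encode("utf-8")) <= limit_bytes
```

The size is measured on the encoded bytes, not the character count, because multi-byte characters make the request body larger than the string length suggests.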
Troubleshooting
“unknown directive html_sanitize”
This error means the module is not loaded. Verify that you have the load_module directive at the top of nginx.conf:
load_module modules/ngx_http_html_sanitize_module.so;
Also verify the module file exists:
ls /usr/lib64/nginx/modules/ngx_http_html_sanitize_module.so
Empty Response Body
If the sanitization endpoint returns an empty body, check these common causes:
- Missing structural elements: The most common cause. You must include html, head, and body in your html_sanitize_element whitelist when using element=2. Without them, gumbo-parser’s internal tree structure blocks traversal to child elements.
- Wrong HTTP method: The module only accepts POST requests. GET requests return HTTP 405 (Method Not Allowed). Use curl -X POST -d "...", not GET requests.
- Too many arguments per directive: NGINX limits each directive to 7 arguments. If you list more than 7 elements on a single html_sanitize_element line, the extras are silently ignored.
- Missing query string: The module defaults to element=0 (strip all). Include at least element=2 (or element=1) to produce output.
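Two of these causes can be caught by scanning the configuration text. The Python sketch below is a rough lint under simplifying assumptions: one directive per line, no include files, and the 7-argument limit described above. lint_element_whitelist is a hypothetical name.

```python
REQUIRED = {"html", "head", "body"}
MAX_ARGS = 7


def lint_element_whitelist(config_text):
    """Report whitelist problems that typically produce an empty response."""
    problems, seen = [], set()
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip().rstrip(";")  # drop comments and ';'
        parts = line.split()
        if not parts or parts[0] != "html_sanitize_element":
            continue
        args = parts[1:]
        if len(args) > MAX_ARGS:
            problems.append(
                f"line {lineno}: {len(args)} elements, limit is {MAX_ARGS}"
            )
        seen.update(args[:MAX_ARGS])  # extras are described as ignored
    missing = REQUIRED - seen
    if missing:
        problems.append("missing structural elements: " + " ".join(sorted(missing)))
    return problems
```

An empty list means neither problem was found. The HTTP method and the query string still need to be checked on the client side.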
Hash Table Errors
If you see errors related to hash table initialization, increase the hash table size:
html_sanitize_hash_max_size 4096;
html_sanitize_hash_bucket_size 128;
Large Payloads Rejected
If large HTML payloads are rejected with a 413 error, increase the client body size limit:
client_body_buffer_size 10M;
client_max_body_size 10M;
Conclusion
The NGINX HTML sanitize module provides a powerful, language-agnostic approach to HTML sanitization at the infrastructure level. By leveraging Google’s gumbo-parser for standards-compliant HTML5 parsing, the module ensures that malicious elements, attributes, and CSS properties are stripped before they reach your application code.
For organizations running multiple backend services that accept HTML content, this module eliminates the need to maintain separate sanitization libraries in each language. Moreover, it centralizes security policy at the NGINX layer, making it easier to audit and update.
The module’s source code is available on GitHub under the Apache 2.0 license.
