Strip marketing or analytics query parameters with Varnish


There are often cases when you need Varnish to cache a page the same way whether or not its URL contains query parameters.
The most common example is when Google (AdWords, Analytics, etc.) adds tracking parameters to your website URLs.
Namely, gclid and utm_* parameters (utm_source, utm_medium, and so on) are appended to the final URL.

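But Varnish uses the full URL as part of its cache key, so each distinct combination of tracking parameters makes Varnish hold a separate cache entry for what is really a single page.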
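To make the problem concrete, here is a toy Python model of a cache keyed on the exact URL. It is not Varnish itself (Varnish's default hash also includes the Host header), and the URLs and the fetch helper are made up for illustration.

```python
# Toy model of a URL-keyed cache; the URLs below are hypothetical examples.
cache = {}

def fetch(url):
    """Return 'HIT' if this exact URL was cached before, else store it and return 'MISS'."""
    if url in cache:
        return "HIT"
    cache[url] = "<html>pricing page</html>"
    return "MISS"

requests = [
    "/pricing",
    "/pricing?gclid=abc123",
    "/pricing?utm_source=newsletter&utm_medium=email",
    "/pricing?gclid=def456",
]
results = [fetch(url) for url in requests]

print(results)     # ['MISS', 'MISS', 'MISS', 'MISS']
print(len(cache))  # 4 entries, all holding the same page
```

Every visitor who arrives with a fresh click ID misses the cache and hits the backend, even though the page content is identical.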

The solution is quite simple. Varnish VCL can do wonders and we can actually rewrite the final URL that will reach our backend (Nginx). Simplicity is beauty: we strip the specific parameters. As a result, Varnish will cache those pages properly.

How to change your VCL to strip ?gclid and ?utm parameters

Add the following to your vcl_recv subroutine (between sub vcl_recv { and the closing bracket }):


if (req.url ~ "(\?|&)(gclid|utm_[a-z]+)=") {
    set req.url = regsuball(req.url, "(gclid|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "");
    # remove trailing question mark and ampersand from URL
    # (inside a character class "|" is a literal pipe, so it is not needed here)
    set req.url = regsub(req.url, "[?&]+$", "");
}

The main regex has been checked against a range of URLs, including the case when a parameter’s value contains round brackets.

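The code strips Google Analytics campaign variables such as utm_source, utm_medium, utm_campaign, and gclid. Those variables are only needed by the JavaScript running on the page. Because Varnish only rewrites the URL it hashes and passes to the backend, the URL in the visitor's browser is unchanged and the JavaScript can still read them.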
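If you want to try the substitutions offline, the following Python sketch approximates the VCL above: one guard, one global substitution, and one trailing cleanup. It assumes Python's re module treats these particular patterns the same way as the PCRE engine in Varnish, and the sample URLs and the strip_tracking name are made up for illustration.

```python
import re

def strip_tracking(url):
    """Approximate the VCL snippet: drop gclid/utm_* parameters, then tidy the tail."""
    # Guard: only touch URLs that actually carry a tracking parameter.
    if re.search(r"(\?|&)(gclid|utm_[a-z]+)=", url):
        # Counterpart of regsuball: remove every "name=value" pair plus a following "&".
        url = re.sub(r"(gclid|utm_[a-z]+)=[-_A-z0-9+()%.]+&?", "", url)
        # Counterpart of regsub: remove a dangling "?" or "&" left at the end.
        url = re.sub(r"[?&]+$", "", url, count=1)
    return url

print(strip_tracking("/pricing?gclid=abc123"))
# /pricing
print(strip_tracking("/pricing?utm_source=newsletter&utm_medium=email"))
# /pricing
print(strip_tracking("/shop?page=2&gclid=Cj0K(x)&color=red"))
# /shop?page=2&color=red
print(strip_tracking("/shop?page=2&utm_campaign=spring%20sale"))
# /shop?page=2
```

The third URL covers a value with round brackets, and the last one shows why the trailing cleanup is needed: removing the final parameter leaves a stray "&" behind.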

vmod-querystring

You may want to look into using vmod-querystring for the same purpose. It has the advantage of a smaller memory footprint, especially if you have long URLs.

Installing vmod-querystring for CentOS/RHEL 7 and Varnish 4.x

sudo yum -y install https://extras.getpagespeed.com/release-latest.rpm
sudo yum -y install vmod-querystring

Installing vmod-querystring for CentOS/RHEL 7 and Varnish 6.0.x LTS

sudo yum -y install https://extras.getpagespeed.com/release-latest.rpm
sudo yum -y install yum-utils
sudo yum-config-manager --enable getpagespeed-extras-varnish60
sudo yum -y install vmod-querystring

Installing vmod-querystring for CentOS/RHEL 8 and Varnish 6.0.x LTS

sudo yum -y install https://extras.getpagespeed.com/release-latest.rpm
sudo yum -y install vmod-querystring

Using vmod-querystring for stripping (marketing) URL parameters

You can get documentation for the module by running man vmod_querystring.

But here’s a simple snippet of VCL to illustrate how you can strip marketing parameters using this VMOD:

import std;
import querystring;

sub vcl_init {
    new tracking_params_filter = querystring.filter();
    tracking_params_filter.add_string("gclid");
    tracking_params_filter.add_glob("utm_*"); # google analytics parameters
}

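sub vcl_recv {
    # log the tracking parameters that are about to be stripped
    std.log("tracking_params_filter:" + tracking_params_filter.extract(req.url, mode = keep));
    # drop every parameter matched by the filter
    set req.url = tracking_params_filter.apply(req.url);
}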

As you can see, using this VMOD allows for a cleaner VCL, because it often allows you to do things without fancy regex.
But in case you need to match parameter names that can only be expressed with a regex, this VMOD also has an .add_regex method.
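As a mental model only, the filter defined above can be pictured with the Python sketch below: a parameter is dropped when its name equals a listed string, matches a glob, or matches a regex. This is not the VMOD's implementation, and the apply_filter function and its arguments are hypothetical; check man vmod_querystring for the exact semantics.

```python
import re
from fnmatch import fnmatchcase

def apply_filter(url, strings=("gclid",), globs=("utm_*",), regexes=()):
    """Drop query parameters whose NAME matches any string, glob, or regex."""
    path, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to filter
    kept = []
    for param in query.split("&"):
        name = param.partition("=")[0]
        matched = (
            name in strings
            or any(fnmatchcase(name, glob) for glob in globs)
            or any(re.search(regex, name) for regex in regexes)
        )
        if not matched:
            kept.append(param)
    if not kept:
        return path  # every parameter was dropped, so drop the "?" too
    return path + "?" + "&".join(kept)

print(apply_filter("/shop?page=2&utm_source=news&gclid=abc&color=red"))
# /shop?page=2&color=red
print(apply_filter("/pricing?gclid=abc"))
# /pricing
print(apply_filter("/p?fbclid=9&id=3", regexes=(r"^fbclid$",)))
# /p?id=3
```

Matching on the parameter name rather than on the raw URL text is what keeps the rules short: an exact string or a glob says what you mean without a hand-written value pattern.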

  1. Tim

    Thank you very much for this VCL adaptation to strip out utm tags from urls. I added \%\. to strip out also % and . as these characters were in many of my utms

    • Danila Vershinin

      Hi Tim.

      Thanks for your input. I’ve added the missing bits to regex.

      Actually, I think escaping is not needed in “character class” part of regex, so it has been removed now.

  2. laura b

    Hi there, thanks for this post. I think my question is related. A cached version of a page with the gclid or fbclid parameters seems to be making its way into things like the pagination or filters of our site, so the marketing tracking is being appended to URLs within the site. The hard-coded links don’t have the tracking params; they are being applied by serving a cached version which has them. If we ensure that Varnish strips these out first, would it solve our problem, since nothing with a marketing utm or click ID could be cached in the first place and then served to general users who came via another source?

    In case my example isn’t too clear: I come to the site direct with a clean URL. I click on ‘dresses’, then I use the link at the bottom of the first page to visit “/p=2”. The link that I click to see page two applies a gclid for a ‘black dresses’ generic Google ad keyword campaign. If I buy, the sale is attributed to that keyword even though I came direct.

    It’s a big problem for us as it’s inflating ads campaigns and making optimisation impossible guesswork.

    I would be grateful for any advice on how to prevent this.

    • Danila Vershinin

      There’s obviously a problem in your framework, in how it constructs URLs for pagination. In all likelihood, it uses $_SERVER['REQUEST_URI'] (if we’re talking PHP) while constructing links, whereas it shouldn’t.
      Sure enough, applying the stripping as above in Varnish will resolve the gclid issue you’re having. But the framework should be fixed as well, as it will likely construct “bad” URLs from any arbitrary parameters.
