A silly mistake
At some point I started seeing strange behavior from my Varnish instance: it returned unexplained “backend fetch failed” errors. Only when I looked at syslog (for an entirely different task) did I spot that Varnish was panicking quite often:
Child (28380) Last panic at: Tue, 11 Sep 2018 19:18:42 GMT#012"Assert error in default_oc_getobj(), storage/stevedore.c line 60:#012 Condition(((o))->magic == (0x32851d42)) not true
My immediate reaction was to try downgrading and similar fixes. All in vain: the actual cause was my own misconfiguration. Cache segmentation was configured so that both the static files storage and the page cache storage pointed at the same file:
-s static=file,/var/lib/varnish/varnish_storage.bin,512M -s file,/var/lib/varnish/varnish_storage.bin,512M
Things were wrong on two levels:
- Two different storage backends pointed to the same file
- There is no need to segment the cache at all if both segments are stored on the same filesystem
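A mistake like this can be caught mechanically. Below is a minimal sketch, assuming the varnishd storage flags are available as a single string; the function name duplicate_storage_files is made up for illustration, and the 1G size in the fixed variant is my assumption, not a recommendation.

```shell
# List storage file paths used by more than one "-s ...file,PATH,SIZE" argument.
# $1: the varnishd arguments as a single string.
duplicate_storage_files() {
    printf '%s\n' "$1" \
        | tr ' ' '\n' \
        | grep 'file,' \
        | sed 's/^.*file,\([^,]*\).*$/\1/' \
        | sort \
        | uniq -d
}

# The broken flags from above: the same path appears twice, so it is reported.
duplicate_storage_files "-s static=file,/var/lib/varnish/varnish_storage.bin,512M -s file,/var/lib/varnish/varnish_storage.bin,512M"

# One possible fix (size is an assumption): a single storage, nothing is reported.
duplicate_storage_files "-s file,/var/lib/varnish/varnish_storage.bin,1G"
```

If you do want two named storages, each one needs its own file.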
Surely this was an easy fix. But the frustrating part was not knowing that something was wrong with the Varnish configuration until I spotted the panic messages in syslog, and that only by accident. How can we do better here?
How Varnish runs
The Varnish architecture is built on two main processes: the master and the child.
The child process is the one that actually caches stuff. It panics if there’s a problem. The master process is basically responsible for watching over the child (cache) process and restarting it as needed.
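Because the master keeps track of its child, it also counts the child's panics. A minimal sketch, assuming varnishstat is installed and your Varnish version exposes the MGT.child_panic counter; the helper name child_panic_count is made up for illustration.

```shell
# Extract the value of MGT.child_panic from "varnishstat -1" style output on stdin.
# Each line of that output starts with the counter name, followed by its value.
child_panic_count() {
    awk '$1 == "MGT.child_panic" { print $2 }'
}

# Usage on a live instance (needs access to the running Varnish):
#   varnishstat -1 -f MGT.child_panic | child_panic_count
```

A value above zero means the master has caught at least one child panic.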
Improving monitoring, and reliability a bit along the way, raises two questions:
- How can we easily spot Varnish panics and be alerted about them?
- Who is watching over the watcher (the master)?
Notification for Varnish panics
It’s easy to find out whether your running Varnish instance has had a panic, with the following command:
varnishadm panic.show
If a panic has happened, you will see its details. But how do we know to check for it in the first place? It would be nice to be notified. Here comes our simple Monit check. For example, place the following in your Monit configuration:
check program varnishpanic with path "/bin/varnishadm panic.show"
    if status != 1 then alert
The trick here is knowing that varnishadm panic.show exits with code 0 if a panic exists and 1 otherwise. This simple check ensures that you get an alert should there be any panic, so you can act on it early.
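The same exit-code logic can be written as a small shell function, which is handy for trying it out by hand. This is only a sketch: the function name check_varnish_panic and the VARNISHADM override variable are made up for illustration.

```shell
# Report panic state based on the exit code of "varnishadm panic.show".
# VARNISHADM is a hypothetical override for testing; it defaults to the real binary.
check_varnish_panic() {
    if "${VARNISHADM:-varnishadm}" panic.show >/dev/null 2>&1; then
        echo "PANIC"   # exit code 0: a panic is recorded
    else
        echo "OK"      # non-zero: no panic recorded, or varnishadm itself failed
    fi
}
```

Once you have investigated a panic, varnishadm panic.clear should reset it, so the check stops firing until the next one.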
Watch over master
The master Varnish process is quite reliable and is the least likely thing to crash. But why not add a bit of monitoring if we can?
There are basically two options here: use Monit to ensure that the main Varnish process is running, or let systemd do it.
With the arrival of systemd in CentOS 7, you no longer need extra tooling to bring Varnish back up should it crash completely.
Simply edit the Varnish unit file, e.g. with systemctl edit varnish, and add:
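A minimal drop-in for this, assuming you want systemd to restart the service whenever it exits, might look like the following. Restart=always is my assumption; Restart=on-failure is a more conservative alternative.

```ini
# Drop-in override for the varnish service.
[Service]
# Restart the main Varnish process whenever it stops, whatever the reason.
Restart=always
```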
Note that this isn’t needed if you use our Varnish 4 package, as this is already incorporated.
For the Monit option, a check in varnish.mon (assuming Varnish is set to run on port 80) could look like this:
check process varnish with pidfile /var/run/varnish.pid
    group www
    start program = "/usr/bin/systemctl start varnish"
    stop program = "/usr/bin/systemctl stop varnish"
    if failed host localhost port 80 protocol http
        and request "/" then restart
    if 3 restarts within 5 cycles then timeout