
Minimalistic Cache Warmer

Here’s a simple cron job to warm up your website cache using wget:

@daily /usr/bin/wget --directory-prefix=/tmp --spider --recursive --no-directories --quiet https://www.getpagespeed.com/
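If a daily full-site crawl is too aggressive for your server, wget can be throttled. The `--wait` and `--level` flags below are standard wget options; the values are illustrative, so adjust them to your site's size:

```shell
# Pause 1 second between requests and limit recursion depth to 5 levels
@daily /usr/bin/wget --directory-prefix=/tmp --spider --recursive --level=5 --wait=1 --no-directories --quiet https://www.getpagespeed.com/
```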

If your website publishes a sitemap.xml that lists all URLs (or a sitemap index pointing to child sitemaps), you can use curl to crawl it instead.
The benefit over wget is that you can warm multiple encodings (e.g. Brotli and gzip), which is useful if your server caches them separately.
Note that the one-liners below extract <loc> entries one level deep; if robots.txt points to a sitemap index, add one more fetch-and-extract stage to reach the page URLs.

To crawl through gzip and Brotli versions:

curl --no-buffer --silent https://www.example.com/robots.txt \
  | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
  | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: br' 

curl --no-buffer --silent https://www.example.com/robots.txt \
  | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
  | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: gzip' 
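The extraction step relies on GNU grep's `-oP` (PCRE) mode, so it is worth sanity-checking locally before pointing it at a live site. The sitemap fragment below is made up for illustration; no network access is needed:

```shell
# Verify that grep -oP pulls URLs out of <loc> tags as expected
sitemap='<urlset>
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog/</loc></url>
</urlset>'
urls=$(printf '%s\n' "$sitemap" | grep -oP '<loc>\K[^<]*')
printf '%s\n' "$urls"
```

On BSD/macOS grep, which lacks `-P`, a `sed`-based extraction would be needed instead.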

To crawl through multiple CDN servers (hosted on different IPs), add --resolve www.example.com:443:x.x.x.x to the commands above and run them once per edge server, substituting each edge server's IP address for x.x.x.x.
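Before wiring this into cron, a dry run can confirm the per-edge commands look right. The loop below only prints the curl invocations it would issue; the 192.0.2.x addresses are documentation IPs standing in for real edge servers:

```shell
# Dry run: print the curl command that would warm each edge server
edges=( 192.0.2.10 192.0.2.11 )
cmds=()
for ip in "${edges[@]}"; do
  cmds+=( "curl --silent --resolve www.example.com:443:$ip https://www.example.com/" )
done
printf '%s\n' "${cmds[@]}"
```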

It's best to put this in a script, crawlup.sh, make it executable (chmod +x /usr/local/bin/crawlup.sh), and call it from cron like so:

@daily /usr/local/bin/crawlup.sh >/dev/null 2>&1

Note the redirection order: >/dev/null 2>&1 silences both stdout and stderr, whereas 2>&1 >/dev/null would still let stderr through. The script itself:

#!/bin/bash

# crawl main server first:

curl --no-buffer --silent https://www.example.com/robots.txt \
  | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
  | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: br'

curl --no-buffer --silent https://www.example.com/robots.txt \
  | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
  | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: gzip'


# crawl edge servers, 2 in this case:

edges=( x.x.x.x y.y.y.y )

for ip in "${edges[@]}"
do
  curl --no-buffer --silent https://www.example.com/robots.txt \
    | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
    | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: br' --resolve "www.example.com:443:$ip"

  curl --no-buffer --silent https://www.example.com/robots.txt \
    | sed -n 's/^Sitemap: \(.*\)$/\1/p' | sed 's/\r$//g' | xargs -n1 curl --no-buffer --silent | grep -oP '<loc>\K[^<]*' \
    | xargs -n1 curl --no-buffer --silent -H 'Accept-Encoding: gzip' --resolve "www.example.com:443:$ip"
done

Follow-up

WordPress users may be interested in automatically re-warming the cache after it is purged by editing or adding content.
This is one of the features implemented in the WordPress Cacheability plugin.
