Hosting RPM repositories with nginx and CDN with blazing speed

Danila Vershinin

7 years ago

TL;DR: Why our repo rocks

Serves .rpm files as immutable resources (Far Future Expires)
Serves repomd.xml in a way to ensure repo updates go through CDN uncached
Proxy, CDN cache-able

If you want to deliver a bunch of custom packages for Red Hat and its clones, like CentOS, you will look into creating your own RPM repository.
What software should you use, and what is an efficient web stack for this? Let’s break down how we have set up our RPM repository with performance in mind.

RPM repository basics

Each RPM (YUM) repository consists mainly of these areas:

The repodata directory: this holds the metadata about packages, e.g. the list of packages in the repository with their version information, so that client yum programs can search packages in your repository without inspecting the .rpm files themselves. Think of it as an index of your packages, with the repomd.xml file being the main index file
The RPMS directory: holds the actual .rpm files

We won’t go into details on how to build .rpm files or create repodata. Let’s concentrate on how to host the repository you have already created.

Since none of the files in your RPM repository are dynamic in nature, NGINX is an excellent fit for hosting it. In fact, NGINX was designed to be efficient at serving static files from the very beginning. So we go ahead and create a server block in the nginx configuration, e.g.:

http {
    server {
        server_name repo.example.com;
        root /path/to/repo/files;
        autoindex on;
    }
}

Of course, we would also do things like configuring an SSL certificate, but is there more to it? How can we make it really fast without mirrors across the globe? 🙂

RPM repository caching essentials

To understand the best caching policy for RPM repositories, we have to break things down to essential categories.

1. URL resources that never change their content

An example of those would be .rpm files. Once you’ve built your RPM package, its filename contains version information.
When you put it to your repository, the URL would look like this:

https://repo.example.com/redhat/7/x86_64/RPMS/package-1.14.0.el7.x86_64.rpm

Are we ever going to see a different .rpm file at this exact URL than the one we initially put there? No. This means that this URL resource is immutable, and we can essentially cache it forever.

One edge case where we might have a different .rpm file at this URL is when we forgot to sign it. Once it’s correctly signed, the actual .rpm file will be different. However, with automated build systems, this case is ruled out. And even if we don’t use automated signing, we can simply bump the release number.

The secondary RPM metadata is another representative of URL resources with content that never changes. These files already bear version information within them (hash of RPM files). So these resources can be cached forever as well.

2. URL resources with changing content.

Primarily, this is the repomd.xml file. It contains references to the secondary metadata indexes, and while its URL stays the same, its content is going to be different after we build yet another package. Thus, we should not cache it forever. Proxies in the wild (that we have no control of) should never cache it.
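To make the two categories concrete, here is a minimal sketch that reads a repomd.xml and lists the secondary metadata files it points to. The XML sample is made up and heavily shortened (real files also carry checksums, sizes and timestamps), the hash prefixes are placeholders, and the function name is ours. The namespace is the one we assume createrepo-generated files use.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened repomd.xml for illustration only.
# The "aaaa1111"/"bbbb2222" prefixes stand in for real content hashes.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <revision>1</revision>
  <data type="primary">
    <location href="repodata/aaaa1111-primary.xml.gz"/>
  </data>
  <data type="filelists">
    <location href="repodata/bbbb2222-filelists.xml.gz"/>
  </data>
</repomd>
"""

# Assumption: the standard repomd namespace.
NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def secondary_metadata(repomd_text):
    """Map each metadata type to the file that repomd.xml references."""
    root = ET.fromstring(repomd_text)
    result = {}
    for data in root.findall("repo:data", NS):
        location = data.find("repo:location", NS)
        if location is not None:
            result[data.get("type")] = location.get("href")
    return result


for kind, href in sorted(secondary_metadata(SAMPLE_REPOMD).items()):
    print("%s -> %s" % (kind, href))
```

The URL of repomd.xml never changes, while every file it references gets a new hash-bearing name whenever its content changes. That is why only repomd.xml needs to stay uncached.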

Nginx caching policy for RPM repositories

So if you want to deliver packages efficiently, you want all the clients to cache both your .rpm files and secondary metadata indexes forever. Here’s a simple configuration:

# Applicable to directory listings and repomd.xml: always fresh for end clients and shared proxies
add_header Cache-Control "no-cache, no-store, must-revalidate";

# These resources never change: RPMs and secondary metadata
location ~ \.(d?rpm|xml\.gz|sqlite\.bz2)$ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

location ~ /repoview/.*\.html$ {
    # Files are not updated atomically by repoview, so to avoid SPDY errors:
    open_file_cache off;
}

The default no-cache policy will apply to directory listings that you see in your browser, as well as to the repomd.xml primary metadata index file.
So the primary metadata is always going to be pulled fresh, while caching proxies (or browsers) will happily cache downloaded .rpm files and reuse them, which speeds things up tremendously.
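As a quick sanity check, the same split can be mirrored in Python to see which paths end up with which header. This is only a sketch: the regular expression is copied from the location block above, while the function name and the sample paths (including the placeholder hash) are made up for illustration.

```python
import re

# Same pattern as the nginx location block: RPMs, delta RPMs, secondary metadata.
IMMUTABLE = re.compile(r"\.(d?rpm|xml\.gz|sqlite\.bz2)$")

IMMUTABLE_POLICY = "public, max-age=31536000, immutable"
DEFAULT_POLICY = "no-cache, no-store, must-revalidate"


def cache_policy(url_path):
    """Return the Cache-Control value we expect nginx to send for a path."""
    if IMMUTABLE.search(url_path):
        return IMMUTABLE_POLICY
    return DEFAULT_POLICY


# Hypothetical sample paths; "aaaa1111" is a placeholder hash.
SAMPLES = [
    "/redhat/7/x86_64/RPMS/package-1.14.0.el7.x86_64.rpm",
    "/redhat/7/x86_64/repodata/aaaa1111-primary.xml.gz",
    "/redhat/7/x86_64/repodata/repomd.xml",
    "/redhat/7/x86_64/",
]

for path in SAMPLES:
    print("%s -> %s" % (path, cache_policy(path)))
```

Note that repomd.xml does not match the pattern, because the expression requires the .xml.gz suffix, so it falls through to the default policy together with the directory listings.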

Shared proxies you can control

Things get more interesting if you use Cloudflare or Varnish. Both are shared caching proxies that you have control of. Consequently, you can tell them to cache everything indefinitely, or for as long as they allow. This opens up the possibility of great things like hosting the entire repository on the CDN. You can always purge the CDN cache after building a package.

A simple Cloudflare page rule allows for this:

URL: repo.example.com/*
Cache Level: Cache Everything
Edge Cache TTL: a month

The Cache Everything cache level instructs Cloudflare to cache .rpm and other repository files, in addition to the default static file types they cache.
With Edge Cache TTL we basically override our previously defined caching policy just for Cloudflare and have it cache (and deliver) the entire RPM repository on their CDN, worldwide.

After that, we only need to instruct Cloudflare to clear its caches after building a package. A sample Python 2.7 script will suffice for the job.

purge-cloudflare.py

The prerequisite for the script on CentOS 7 is the Cloudflare Python module. You can install it via yum install python2-cloudflare. Our script’s logic is simple:

Purge all the files which are not .rpm or secondary metadata file extensions
Purge all the directories

You would call this script after building a package and putting it into your RPM repository. This will ensure that metadata at Cloudflare is up-to-date, while keeping the cached RPMs intact at Cloudflare.

#!/usr/bin/env python

import os
import sys
import CloudFlare
import argparse
from pprint import pprint

parser = argparse.ArgumentParser()
parser.add_argument("--subdir", help="clear out specific directory", nargs='?', const='', default='')
args = parser.parse_args()

pprint(args.subdir)

httpdocs = '/path/to/repo/files' + args.subdir
siteurl = 'https://repo.example.com' + args.subdir
zone_name = 'example.com'

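# Collect the URLs to purge: every directory, plus every file
# that is not an RPM or a secondary metadata file
matches = []
exclude = ['.git', '.well-known']
for root, dirnames, filenames in os.walk(httpdocs, topdown=True):
    # Prune excluded directories in place so os.walk does not descend into them
    dirnames[:] = [d for d in dirnames if d not in exclude]
    for filename in filenames:
        if not filename.endswith(('.drpm', '.rpm', '.xml.gz', '.sqlite.bz2', '.index.html')):
            # Map the filesystem path to its public URL
            matches.append(os.path.join(root, filename).replace(httpdocs, siteurl))
    for dirname in dirnames:
        # Directory listings are purged by their URL with a trailing slash
        matches.append(os.path.join(root, dirname).replace(httpdocs, siteurl) + '/')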

cf = CloudFlare.CloudFlare()

# Grab the zone identifier
try:
    params = {'name': zone_name, 'per_page': 1}
    zone_info = cf.zones.get(params=params)
except CloudFlare.exceptions.CloudFlareAPIError as e:
    sys.exit('/zones %d %s - api call failed' % (e, e))
except Exception as e:
    sys.exit('/zones - %s - api call failed' % (e))

# Fail with a clear message if the zone lookup returned nothing
if not zone_info:
    sys.exit('/zones - zone %s not found' % zone_name)

# Purge the collected file and directory URLs; cached RPMs stay untouched
try:
    params = {'files': matches}
    r = cf.zones.purge_cache.post(zone_info[0]['id'], data=params)
except CloudFlare.exceptions.CloudFlareAPIError as e:
    sys.exit('/zones.purge_cache %d %s - api call failed' % (e, e))
except Exception as e:
    sys.exit('/zones.purge_cache - %s - api call failed' % (e))

You can purge a specific repository within the same domain by providing the path to it, e.g.:

/path/to/purge-cloudflare.py --subdir=/redhat/7
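One thing to watch out for: the script sends every collected URL in a single purge request, and Cloudflare caps how many URLs one purge-by-URL call may carry. The sketch below assumes a cap of 30, which is the figure we recall for standard plans, so check the current API documentation for yours. The helper name chunked is ours.

```python
# Assumption: at most 30 URLs per purge-by-URL request; adjust for your plan.
PURGE_BATCH_SIZE = 30


def chunked(urls, size=PURGE_BATCH_SIZE):
    """Yield successive lists holding at most `size` URLs each."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Usage inside the purge script (cf, zone_info and matches come from it):
#
#     for batch in chunked(matches):
#         cf.zones.purge_cache.post(zone_info[0]['id'], data={'files': batch})

# Hypothetical list of 65 URLs, just to show how it gets split.
example = ['https://repo.example.com/dir%d/' % i for i in range(65)]
print([len(batch) for batch in chunked(example)])
```

For a small repository the single request in the script is fine. Batching only matters once the number of directories and non-RPM files grows past the limit.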

So what we have here is:

An efficient caching mechanism for RPM files in both shared proxies we have no control of and Cloudflare CDN
A way to host the entire repository on the CDN edge with ability to purge when a new package is pushed
A big “eat my shorts” to packagecloud.io and the likes 🙂