In January, I upgraded my home Internet connection to 3 Gbps symmetric, because, strangely enough, it was cheaper than the package I already had at the time (1500 Mbps down, 940 Mbps up). This was connected to the second port on my ConnectX-3, allowing my home server to achieve the full speed where 2.5 Gbps Ethernet would have failed. Unfortunately, nothing I was doing could have harnessed the full speed of this Internet connection, or anywhere near it, so I started thinking…

In February, I realized that I could run a mirroring service for open-source software to serve the community at basically no additional cost—I am already paying for this 3 Gbps Internet connection and I have some spare disk space on my SSD. So I decided to do exactly that.

Today, I am happy to announce that this mirror, mirror.quantum5.ca, has been tested for a few months and is fully ready for production. If you find the service helpful, please feel free to support me via GitHub Sponsors, Ko-fi, Liberapay, or directly with credit card or bank through Stripe (CAD), though this is of course strictly optional.

If you are interested in how it’s all set up, please read on:

Beginnings

I started by mirroring Arch Linux, because it was decently popular, didn’t require much disk space, and was relatively welcoming to new mirrors. The process was fairly easy: I created a virtual host for mirror.quantum5.ca in nginx, set it to serve files from a directory, and created a cron job to periodically run rsync to update the Arch Linux files from another mirror. Then I simply filed a ticket on Arch Linux’s bug tracker, and in a few days, I became an official tier 2 mirror.
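As a rough illustration of the sync step, here is a minimal Python sketch of the kind of rsync call a cron job could run. The upstream URL, the local path, and the exact flag set are assumptions for illustration, not my actual configuration.

```python
import subprocess

# Hypothetical values: substitute the upstream mirror and local path you actually use.
UPSTREAM = "rsync://mirror.example.org/archlinux/"
DEST = "/srv/mirror/archlinux/"


def build_rsync_command(source: str, dest: str) -> list[str]:
    """Build an rsync invocation suitable for pulling a package mirror."""
    return [
        "rsync",
        "-rlptH",           # recurse; keep symlinks, permissions, times, hard links
        "--safe-links",     # ignore symlinks that point outside the tree
        "--delete-delay",   # remove stale files only at the end of the transfer
        "--delay-updates",  # move updated files into place together at the end
        source,
        dest,
    ]


def sync(source: str = UPSTREAM, dest: str = DEST) -> int:
    """Run one sync pass and return rsync's exit code."""
    return subprocess.run(build_rsync_command(source, dest), check=False).returncode
```

A cron entry would then only need to call sync() on a schedule. Delaying deletions and updates until the end keeps the tree consistent for clients that download while a sync is running.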

To keep the mirror from accidentally saturating my Internet connection and affecting everything else, I’ve configured it so that it can never use more than 2 Gbps of upload bandwidth, leaving plenty of room for other things.

Due to the way Arch Linux spreads the load between mirrors, almost immediately I started pushing over 100 GB a day. Surprisingly, people at work started noticing the mirror and thanking me for running it. Clearly, people care about this stuff.

Making it look pretty

Encouraged by people caring, I decided to build a nice-looking page at the root of the mirror domain. I started by doing the minimal amount of work necessary, pulling out Bootstrap to make a simple page that looks decent. I also wrote a Python script to render it as a Jinja2 template so I could have the last synchronization time and the size dynamically updated. The script is run every time rsync is run.
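The idea behind that script can be sketched roughly as follows. To keep the example self-contained, it uses string.Template from the standard library as a stand-in for Jinja2, and the placeholder names and page markup are made up for illustration.

```python
import os
from datetime import datetime, timezone
from string import Template

# Stand-in for the real Jinja2 template; the placeholder names are made up.
PAGE = Template("<p>Last synced: $last_sync UTC</p>\n<p>Size: $size</p>\n")


def directory_size(root: str) -> int:
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} TiB"


def render_home(root: str, synced_at: datetime) -> str:
    """Fill the page template with the sync time (in UTC) and the mirror size."""
    return PAGE.substitute(
        last_sync=synced_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        size=human_size(directory_size(root)),
    )
```

Running something like render_home() right after each rsync pass and writing the result to the web root keeps the page static, so nginx can serve it without any application server.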

However, nginx’s autogenerated index pages simply looked out of place, and I wanted them to be pretty. It wouldn’t be hard to write some web app that rendered the directory index pages, but I didn’t really want to maintain a separate application. Instead, I wanted something that just runs inside nginx. As it turns out, this was possible, and thus I became thoroughly nerd-sniped.

You see, nginx could output the autogenerated index in multiple formats—the default HTML, XML, JSON, and the ancient JSONP. nginx also has an XSLT processing module that allows you to transform any XML server-side. Naturally, the idea was to write an XSLT stylesheet that transformed the autogenerated XML index into HTML.

The result was an XSLT stylesheet that looked something like this (example output):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:str="http://exslt.org/strings" exclude-result-prefixes="str">
  <xsl:output method="html" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
  <xsl:param name="uri"/>
  <xsl:template match="/">
    <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>
    ...
    <h2 id="path">
      <xsl:text>Index of /</xsl:text>
      <xsl:variable name="path" select="str:tokenize($uri, '/')" />
      <xsl:variable name="levels" select="count($path)"/>
      <xsl:for-each select="$path">
        <xsl:variable name="pos" select="position()"/>
        <xsl:variable name="parent">
          <xsl:for-each select="$path[position() &lt;= $pos]">
            <xsl:value-of select="."/><xsl:text>/</xsl:text>
          </xsl:for-each>
        </xsl:variable>
        <a href="/{$parent}"><xsl:value-of select="."/></a>
        <xsl:text>/</xsl:text>
      </xsl:for-each>
      <xsl:text> </xsl:text>
      <a class="button" href="..">⬆️</a>
    </h2>

    <table class="table table-sm table-hover sortable">
      <thead class="thead-dark">
        <tr>
          <th scope="col">Name</th>
          <th scope="col" class="col-1 text-right">Size</th>
          <th scope="col" class="col-6 col-sm-5 col-md-4 col-xl-3 text-right">Updated (UTC)</th>
        </tr>
      </thead>
      <tbody>
        <xsl:for-each select="list/*">
          <xsl:variable name="name">
            <xsl:value-of select="."/>
          </xsl:variable>

          <xsl:variable name="size">
            <xsl:if test="string-length(@size) &gt; 0">
              <xsl:if test="number(@size) &gt; 0">
                <xsl:choose>
                  <xsl:when test="(@size div 1024) &lt; 0.9"><xsl:value-of select="@size" /></xsl:when>
                  <xsl:when test="(@size div 1048576) &lt; 0.9"><xsl:value-of select="format-number((@size div 1024), '0.0')" />k</xsl:when>
                  <xsl:when test="(@size div 1073741824) &lt; 0.9"><xsl:value-of select="format-number((@size div 1048576), '0.00')" />M</xsl:when>
                  <xsl:otherwise><xsl:value-of select="format-number((@size div 1073741824), '0.00')" />G</xsl:otherwise>
                </xsl:choose>
              </xsl:if>
            </xsl:if>
          </xsl:variable>

          <xsl:variable name="date">
            <xsl:value-of select="substring(@mtime,1,4)"/>-<xsl:value-of select="substring(@mtime,6,2)"/>-<xsl:value-of select="substring(@mtime,9,2)"/><xsl:text> </xsl:text>
            <xsl:value-of select="substring(@mtime,12,2)"/>:<xsl:value-of select="substring(@mtime,15,2)"/>:<xsl:value-of select="substring(@mtime,18,2)"/>
          </xsl:variable>

          <tr>
            <td><a href="{$name}"><xsl:value-of select="."/></a></td>
            <td class="text-right" data-sort="{@size}"><xsl:value-of select="$size"/></td>
            <td class="col-6 col-sm-5 col-md-4 col-xl-3 text-right"><xsl:value-of select="$date"/></td>
          </tr>
        </xsl:for-each>
      </tbody>
    </table>
    ...
  </xsl:template>
</xsl:stylesheet>
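
For readers less familiar with XSLT, the per-row logic is roughly equivalent to the Python sketch below. The sample XML is hand-written in the shape of nginx's XML index (a list element containing file and directory entries with mtime and size attributes), and the entries in it are made up. Rounding may differ slightly from format-number at exact boundaries.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of nginx's XML autoindex; the entries are made up.
SAMPLE = """<?xml version="1.0"?>
<list>
<directory mtime="2023-05-01T12:00:00Z">core</directory>
<file mtime="2023-05-01T12:34:56Z" size="1536">lastsync</file>
</list>"""


def format_size(raw):
    """Mirror the stylesheet's xsl:choose: blank for missing or zero sizes."""
    if not raw or float(raw) <= 0:
        return ""
    size = float(raw)
    if size / 1024 < 0.9:
        return raw                          # small files: plain byte count
    if size / 1024**2 < 0.9:
        return f"{size / 1024:.1f}k"
    if size / 1024**3 < 0.9:
        return f"{size / 1024**2:.2f}M"
    return f"{size / 1024**3:.2f}G"


def format_mtime(mtime):
    """Turn '2023-05-01T12:34:56Z' into '2023-05-01 12:34:56', like the substring() calls."""
    return f"{mtime[0:10]} {mtime[11:19]}"


def rows(xml_text):
    """Return (name, size, date) tuples, one per entry under the list element."""
    root = ET.fromstring(xml_text)
    return [
        (entry.text or "", format_size(entry.get("size")), format_mtime(entry.get("mtime", "")))
        for entry in root
    ]
```

The 0.9 threshold means a size switches to the next unit slightly before reaching it, so a 922-byte file is shown as 0.9k and not as 922.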

This involved some major struggle with XSLT, since nginx—or really, libxslt—only supported XSLT 1.0, which is rather underpowered compared to newer versions. This made things like linking to all the parent directories difficult. Nevertheless, with some crazy EXSLT extensions (that libxslt does support), I did work out a solution, but it probably wasn’t worth the effort in retrospect. Still, I was impressed by the power of XSLT and would recommend using it for times when running an application server is overkill.
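
To show why the parent links were awkward, here is what the breadcrumb part of the stylesheet computes, written as a short Python sketch. It assumes, as the stylesheet does, that tokenizing on "/" drops empty tokens, and the example URI is only an illustration.

```python
def breadcrumbs(uri):
    """Return (href, label) pairs for each directory level of a request URI."""
    # Like str:tokenize in the stylesheet, drop the empty pieces around slashes.
    parts = [p for p in uri.split("/") if p]
    crumbs = []
    for i, label in enumerate(parts):
        # Each link is the concatenation of every component up to this one.
        href = "/" + "".join(p + "/" for p in parts[: i + 1])
        crumbs.append((href, label))
    return crumbs
```

For a URI such as /archlinux/core/os/, this yields links to /archlinux/, /archlinux/core/, and /archlinux/core/os/. In XSLT 1.0 the same prefix-joining has to be spelled out with a nested for-each over the tokenized path, which is where most of the verbosity comes from.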

Of course, since I wanted to keep the same header and footer as the home page, I converted the Jinja2 template to use template inheritance instead and made a template to render the XSLT. This had several issues, the most notable of which was the different handling of self-closing tags between HTML and XML, but it was nothing that some variables couldn’t fix.

With this, the mirror finally looked good enough for my taste.

Drive failure

I also suffered a drive failure during the testing period, which was completely unexpected. This all stemmed from buying a cheap M.2 SSD, the 2 TB ADATA XPG GAMMIX S50 Lite. First of all, this SSD was constantly overheating under normal loads, causing it to thermal throttle, which then caused some I/O operations to time out. This forced me to install a fan to blow on it to keep the temperatures under control.[1] This didn’t bode well, but I didn’t do anything about it, which was a mistake.

Then finally, in April, the SSD gave up the ghost, suddenly failing to respond to any I/O, causing all I/O to time out. Since it contained the rootfs, the entire OS locked up, and the mirror (plus everything else) was dead. After a hard power cycle (turning off the PSU), it was working again… for 10 minutes before the same thing happened. Of course, this also happened right before I was going to sleep, so I had to do it while half-asleep.

Having no choice, I had to copy everything off the drive as fast as possible. In the end, I pulled out another SSD that I was using as a big flash drive and ended up using dd with a progress bar, restarting with seek and skip each time the drive locked up.[2] This was highly unpleasant, but fortunately, there was no data loss. After this, I bought a 2 TB Samsung 970 Evo Plus to relieve the other cheap SSD.
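
The restart trick can be sketched as a small Python helper that builds the next dd command from the number of bytes already copied. The device paths and the 4 MiB block size are hypothetical, and this is a reconstruction of the idea, not the exact commands I typed that night.

```python
def resume_dd_command(src, dst, bytes_copied, block_size=4 * 1024 * 1024):
    """Build a dd command that resumes a copy after `bytes_copied` bytes.

    The offset is rounded down to a whole block, so a partially written
    block is simply copied again.
    """
    blocks_done = bytes_copied // block_size
    return [
        "dd",
        f"if={src}",
        f"of={dst}",
        f"bs={block_size}",
        f"skip={blocks_done}",  # blocks to skip at the start of the input
        f"seek={blocks_done}",  # blocks to skip at the start of the output
        "status=progress",      # GNU dd: print a running byte count
    ]


# Hypothetical device names, for illustration only.
print(" ".join(resume_dd_command("/dev/nvme0n1", "/dev/sdb", 10 * 4194304 + 123)))
```

Using the same value for skip and seek keeps the input and output offsets aligned, so each restart continues exactly where the last fully copied block ended.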

Lessons learned:

  1. SSDs are said to be way more reliable than HDDs, but they still fail;
  2. SSDs are commonly cited to fail due to the NAND flash wearing out, which would force them into a permanent read-only mode, requiring replacement, but without data loss. However, the controller itself can fail too, and that could easily be catastrophic; and
  3. Don’t cheap out on storage or buy from non-reputable brands, at least for things you care about. There’s no reason to buy the latest generation and pay the early-adopter tax though, since I didn’t need a PCIe Gen 4 SSD—Gen 3 is more than enough to saturate my networking.

More mirroring

I just left the mirror sitting there for a while. In May, I saw the post by Kenneth Finnegan about what he calls “Micro Mirrors”. Clearly, there’s more demand for these things than I thought, so I decided to mirror a few more things.

Not sure what would be best to mirror, I wrote to Kenneth, and he kindly responded with some stats from mirror.fcix.net, the (macro) mirror he also runs. Dividing the daily traffic by the size of the directory allowed him to compute a CDN efficiency coefficient. The higher the number, the more impact mirroring a directory has for the amount of disk space it uses. From this data, I selected ubuntu-releases (ISOs) and LibreOffice to mirror, since these are reasonably sized and easy to mirror.
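
The coefficient is simple enough to sketch in a few lines of Python. The project names and numbers below are made-up placeholders to show the calculation, not Kenneth's actual statistics.

```python
GB = 10**9


def cdn_efficiency(daily_traffic_bytes, directory_size_bytes):
    """Bytes served per day for each byte of disk the directory occupies."""
    return daily_traffic_bytes / directory_size_bytes


def rank_by_efficiency(projects):
    """Sort project names from most to least traffic per byte stored."""
    return sorted(projects, key=lambda name: cdn_efficiency(*projects[name]), reverse=True)


# Placeholder data: name -> (daily traffic, directory size). Not real measurements.
EXAMPLE = {
    "project-a": (500 * GB, 100 * GB),   # coefficient 5.0
    "project-b": (200 * GB, 2000 * GB),  # coefficient 0.1
    "project-c": (50 * GB, 5 * GB),      # coefficient 10.0
}

print(rank_by_efficiency(EXAMPLE))
```

In this made-up example, project-c ranks first despite having the least traffic, because it needs so little disk space. That is the kind of candidate a small mirror benefits from most.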

I also mirrored some software that I regularly use, such as TeX (really, the Comprehensive TeX Archive Network or CTAN) and Termux.

There are some projects that I use but decided not to mirror, the most notable of which is Debian, which has not been processing new mirror tickets for almost a year at the time of writing. What was the point of blowing at least a terabyte on the Debian archive when no one will even hear about the mirror? This was rather disheartening, to be honest.

If you would like me to mirror more stuff, feel free to put your suggestions in the comments and donate some money to support this endeavour. The links are in the sidebar.

Notes

  1. A large passive heatsink might be enough, but my motherboard didn’t come with one, and the one the SSD came with was clearly insufficient. 

  2. I probably should have used ddrescue, but since this was also my router, my home Internet died as well and I didn’t want to figure out how to use ddrescue at some ungodly hour on mobile data when I had a working solution.