Backup Sampling

Sat 8 Mar 2025

Early in my career as a software engineer, I worked at a large database company. The company took data loss seriously. Their private data center contained computers of many types, including systems made by DEC, HP, IBM, SGI, Siemens, and Sun. They did backups constantly. Periodically, someone would transport a huge cart of metal boxes of tapes off site for storage. Seeing that cart in the elevator gave me confidence that my files were safe. At first.

In the three years I worked there, I submitted only two tickets to restore files from backup. One time, the folks who ran the backup service reported that the backup had been corrupted. The other time, the files I wanted were missing from the backup. Both times, the elaborate backup system failed, and I had to reconstruct the files from scratch.

I learned a lesson from that experience: Unless you test your backup, you haven't done a backup. However, while I'd love to check every file in every backup against the original, that would make the backups take about twice as long, which would encourage me to do them less often. Instead, I compare a random sampling of the files. The more files in the sample, the more confident I can be that the backup is complete and not corrupt.
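To put a rough number on that confidence, here is a small sketch. It assumes that some fraction p of the files in the backup are bad (missing or corrupt) and that each sampled file is an independent draw, which is a reasonable approximation when the sample is small compared to the whole backup. Under those assumptions, the chance that a sample of n files includes at least one bad file is 1 - (1 - p)^n. The function name detection_probability is mine, for illustration only.

```shell
#!/bin/sh
# Chance that a random sample of n files contains at least one bad file,
# assuming a fraction p of all files are bad and draws are independent.
# (Hypothetical helper for illustration; not part of compare-random-sample.)
detection_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}

# If 1 file in 1000 is bad, a 1000-file sample catches a problem
# a bit less than two thirds of the time.
detection_probability 0.001 1000   # prints 0.6323
```

Under the same assumptions, a 1000-file sample is almost certain to notice a backup in which 1% of the files are bad, but it can easily miss a single damaged file among millions. Sampling is good at catching widespread failure, not isolated damage.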

I've written a shell script to automate this work. Here's how to use it to compare 1000 files chosen at random from the ~/papers/ directory to the corresponding backup files:

compare-random-sample \
  /Users/arthur/papers/ \
  /Volumes/MacintoshHD\ backup\ 20230916/ \
  1000

The output will list each file that has been compared, and whether it is identical to, different from, or missing from the backup. The last line shows how many mismatches were found.
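If you're curious how a comparison like this can work, here is a minimal sketch of the idea in portable shell. It is not the actual compare-random-sample script: the function name compare_sample, the output wording, and the shuffling method are my assumptions for illustration. It skips hidden files and directories but does not read .gitignore, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# compare_sample SRC DST N
# Illustrative sketch only; not the real compare-random-sample.
compare_sample() {
  src=${1%/}
  dst=${2%/}
  count=$3
  mismatches=0
  list=$(mktemp) || return 1

  # List regular files relative to SRC, skipping anything under a hidden
  # path, then shuffle by sorting on a random key and keep the first N.
  # srand() seeds from the clock, so two runs within the same second
  # pick the same sample.
  (cd "$src" && find . -type f ! -path '*/.*') |
    awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
    LC_ALL=C sort -n | head -n "$count" | cut -f2- > "$list"

  while IFS= read -r rel; do
    rel=${rel#./}
    if [ ! -f "$dst/$rel" ]; then
      echo "missing: $rel"
      mismatches=$((mismatches + 1))
    elif cmp -s "$src/$rel" "$dst/$rel"; then
      # cmp -s compares byte by byte and reports only through its exit status.
      echo "identical: $rel"
    else
      echo "different: $rel"
      mismatches=$((mismatches + 1))
    fi
  done < "$list"

  rm -f "$list"
  echo "mismatches: $mismatches"
}
```

Calling compare_sample ~/papers /path/to/backup/papers 1000 would print one line per sampled file, followed by the mismatch count. Comparing contents with cmp rather than sizes or timestamps is the point of the exercise: a file that exists in the backup but holds the wrong bytes is precisely the failure a backup test is meant to catch.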

By default, the script ignores hidden directories and files as well as patterns from .gitignore. If you want to check those files, too, add the --no-ignore option.

You can find compare-random-sample on GitHub. I've tested it on Linux and macOS. I hope you find it useful. But even if you do, please be sure to test your backups manually every once in a while.