Backup Sampling

Arthur A. Gleckler — Sat, 8 Mar 2025 12:00:00 +0000

Early in my career as a software engineer, I worked at a large database company. The company took data loss seriously. Their private data center contained computers of many types, including systems made by DEC, HP, IBM, SGI, Siemens, and Sun. They did backups constantly. Periodically, someone would transport a huge cart of metal boxes of tapes off site for storage. Seeing that cart in the elevator gave me confidence that my files were safe. At first.

In the three years I worked there, I submitted only two tickets to restore files from backup. One time, the folks who ran the backup service reported that the backup had been corrupted. The other time, the files I wanted were missing from the backup. Both times, the elaborate backup system failed, and I had to reconstruct the files from scratch.

I learned a lesson from that experience: Unless you test your backup, you haven't done a backup. However, while I'd love to check every file in every backup against the original, that would make the backups take about twice as long, which would encourage me to do them less often. Instead, I compare a random sampling of the files. The more files in the sample, the more confident I can be that the backup is complete and not corrupt.
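As a rough guide to how large a sample needs to be: if a fraction f of the files in a backup is bad and n files are sampled independently, the chance that the sample contains at least one bad file is 1 - (1 - f)^n. This is a back-of-the-envelope model that assumes bad files are spread uniformly at random. The helper name detection_probability is hypothetical and is not part of the script described here. A minimal sketch using awk:

```shell
# Hypothetical helper: probability that a random sample of n files
# includes at least one bad file when a fraction f of all files is bad.
# Assumes independent draws, which is a close approximation when the
# sample is small relative to the whole backup.
detection_probability() {
  awk -v f="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - f) ^ n }'
}

detection_probability 0.001 1000
detection_probability 0.01 1000
```

Under this model, a sample of 1000 files catches damage affecting one file in a thousand about 63% of the time, and damage affecting one file in a hundred almost always.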

I've written a shell script to automate this work. Here's how to use it to compare 1000 files chosen at random from the ~/papers/ directory to the corresponding backup files:

compare-random-sample \
  /Users/arthur/papers/ \
  /Volumes/MacintoshHD\ backup\ 20230916/ \
  1000

The output will list each file that has been compared, and whether it is identical to, different from, or missing from the backup. The last line shows how many mismatches were found.

By default, the script ignores hidden directories and files as well as patterns from .gitignore. If you want to check those files, too, add the --no-ignore option.
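The idea behind the comparison can be sketched in a few lines of POSIX shell. This is not the actual compare-random-sample script. It is a simplified illustration that assumes GNU shuf is installed and that no filename contains a newline. It skips hidden files and directories but does not read .gitignore. The function name sample_compare and its output labels are made up for this sketch.

```shell
# Simplified illustration, not the real compare-random-sample script.
# Usage: sample_compare SOURCE_DIR BACKUP_DIR COUNT
sample_compare() {
  src=${1%/}
  dst=${2%/}
  count=$3
  mismatches=0
  list=$(mktemp) || return 2

  # Pick up to COUNT regular files at random, as paths relative to
  # SOURCE_DIR, skipping hidden files and anything under a hidden directory.
  (cd "$src" && find . -type f ! -path '*/.*' | shuf -n "$count") > "$list"

  # Read from a file rather than a pipe so the counter survives the loop.
  while IFS= read -r rel; do
    if [ ! -e "$dst/$rel" ]; then
      echo "missing: $rel"
      mismatches=$((mismatches + 1))
    elif cmp -s "$src/$rel" "$dst/$rel"; then
      echo "identical: $rel"
    else
      echo "different: $rel"
      mismatches=$((mismatches + 1))
    fi
  done < "$list"

  rm -f "$list"
  echo "mismatches: $mismatches"
  # Succeed only if every sampled file matched its backup copy.
  [ "$mismatches" -eq 0 ]
}
```

Called as sample_compare /path/to/original /path/to/backup 1000, the sketch compares files at the same relative path under both directories and returns a nonzero exit status if any sampled file is missing or different.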

You can find compare-random-sample on GitHub. I've tested it on Linux and macOS. I hope you find it useful. But even if you do, please be sure to test your backups manually every once in a while.

papers

Arthur A. Gleckler — Mon, 27 May 2024 12:00:00 +0000

command-line, miscellaneous, scheme

I left graduate school decades ago, but I still love to read academic papers. The field of computer science reinvents its wheels constantly, but academic literature is a great way to mine existing ideas and avoid that problem. It's a way to "stand on the shoulders of giants."

For a long time, I maintained and carefully indexed a collection of actual printed papers. Once it reached the hundreds, that approach became too cumbersome. I ended up throwing away papers in order to avoid having too many, but often regretted doing that when some half-remembered idea popped up again in another context.

Now I have a crude system that meets my needs. I keep notes on the most interesting papers using Org-mode files, and I keep my collection in a Git repo in purely digital form. Every paper appears in the top-level directory, and there's a subdirectory to-read/ for papers I haven't yet read. A little bit of automation helps, too. Now managing almost two thousand papers is no problem.

Here's the Scheme program, papers, I use for adding new papers:

#!/usr/local/bin/chibi-scheme -r
;;;; -*- mode: scheme -*-

;;;; Expect environment variable <p> to name a directory that holds
;;;; all papers, e.g. "papers/".  A subdirectory "to-read/" that holds
;;;; unread papers must also exist.

;;;; Hard-link specified documents into "$p/" directory.

;;;; If <--to-read> is specified, add symbolic links to them in
;;;; "$p/to-read/", too.

;;;; If <--commit> is specified, use Git to commit each of the new
;;;; documents.  Each document will be committed separately.  It is an
;;;; error if any other file is already staged.

(import (chibi filesystem)
        (chibi pathname)
        (only (chibi process)
              exit
              process->output+error+status)
        (scheme process-context)
        (scheme small)
        (scheme write)
        (only (srfi 1)
              any
              remove)
        (srfi 98)
        (only (srfi 130)
              string-join
              string-prefix?)
        (srfi 166 base))

(define (echo . rest)
  (apply show #true rest)
  (show #true nl))

(define (echo-command command)
  (apply echo (list "command: " (string-join command " "))))

(define (usage program)
  (echo "Usage: "
        (path-strip-directory program)
        " [--commit] [--to-read] <pathname> ..."
        nl
        nl
        "Environment variable p must be set to a directory for holding papers."
        nl
        "It must have a subdirectory called \"to-read/\"."
        nl))

(define (run command)
  (let* ((out+err+status (process->output+error+status command))
         (stdout (car out+err+status))
         (stderr (cadr out+err+status))
         (status (caddr out+err+status)))
    (unless (zero? status)
      (echo-command command)
      (write-string stderr)
      (write-string stdout)
      (exit 1))))

(define (run/assert command message)
  (unless (zero? (caddr (process->output+error+status command)))
    (echo-command command)
    (echo message)
    (exit 1)))

(define (act document papers commit? to-read?)
  (let ((filename (path-strip-directory document)))
    (link-file document (path-resolve filename papers))
    (when commit? (run `("git" "add" ,filename)))
    (when to-read?
      (let ((place (path-resolve filename (path-resolve "to-read" papers))))
        (symbolic-link-file (path-resolve filename "..") place)
        (when commit? (run `("git" "add" ,place)))))
    (when commit?
      (run `("git" "commit" "-m" ,filename)))))

(define (switch? string) (string-prefix? "--" string))

(define (main arguments)
  (let* ((program (car arguments))
         (options (cdr arguments))
         (papers (or (get-environment-variable "p")
                     (begin (usage program) (exit 1))))
         (valid-switches '("--commit" "--help" "--to-read")))
    (cond ((member "--help" options) (usage program) (exit 0))
          ((any (lambda (a)
                  (and (switch? a)
                       (not (member a valid-switches))))
                options)
           (usage program)
           (exit 1)))
    (let* ((commit? (member "--commit" options))
           (to-read? (member "--to-read" options))
           (cwd (current-directory))
           (documents (map (lambda (f) (path-resolve f cwd))
                           (remove switch? options))))

      (when (null? documents)
        (usage program)
        (exit 1))
      (change-directory papers)
      (when commit?
        (run/assert `("git" "diff" "--cached" "--quiet")
                    "Error: Files already staged."))
      (for-each (lambda (f) (act f papers commit? to-read?))
                documents))))

For example, to add one paper, including a link in to-read/, and commit it to the repo:

papers --commit --to-read /tmp/aim-349.pdf

I use MIT Scheme for most of my Scheme hacking, but Alex Shinn's Chibi Scheme is wonderful for implementing this kind of tool. It's small, R7RS-Small-compliant, and has many useful libraries. Thank you, Alex!

Fixed on Wed 29 May 2024 to handle relative pathnames.

Speechcode.com
