Scanning Workflow Notes for Summer 2021/2022

Summer is on its way, which means staying inside with the blinds closed, air conditioner on, cricket playing on the iPad and scanning my way through the pile of magazines I’ve collected over the past 12 months. I’ve written about my workflow previously but have recently re-jigged it for less fucking around.

Instead of uploading a 300dpi PDF and an archive of the original 600dpi TIFF files, I’m going to follow the big man’s advice and simply upload a CBZ file (aka a renamed ZIP archive with the images named sequentially) of the scans to the Internet Archive and let their servers do the rest.

https://twitter.com/textfiles/status/1258267195790024705

I’m also not going to bother to save a local copy of the archive. I never refer back to them and it’s getting expensive to store terabytes of this stuff.

My upload script is really simple now:

# compress TIFF files into a CBZ named after the current directory
7zz a -tzip "${PWD##*/}.cbz" *.tiff &&

# upload to Internet Archive, using the directory name as the item identifier
ia upload "${PWD##*/}" "${PWD##*/}.cbz" --retries 10 --metadata="mediatype:texts" --metadata="title:${PWD##*/}" --metadata="description:Uncompressed TIFF 600dpi scan" --metadata="subject:" --metadata="language:eng" &&

# delete the whole working directory (the TIFF files and the CBZ), only if the upload succeeded
rm -rf "$PWD"

Tip: if you’re uploading a CBZ, make sure it’s using the Deflate compression algorithm. Using anything fancy like LZMA isn’t compatible with the unzip program the Internet Archive uses so the derive process will fail.
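A quick way to sanity-check a CBZ before uploading, as a rough sketch. It assumes `7zz l -slt` prints one `Method = ...` line per entry, and the function names `only_deflate` and `check_cbz` are made up for this example. Letting Store (no compression) through is my own assumption too, on the basis that any unzip can read it.

```shell
# Sketch only: function names are hypothetical, not part of 7-Zip or the ia tool.

# Read a `7zz l -slt` listing on stdin. Fail if any entry uses a method other
# than Deflate or Store (directories show up with an empty method).
only_deflate() {
    awk -F' = ' '
        $1 == "Method" && $2 != "Deflate" && $2 != "Store" && $2 != "" { bad = 1 }
        END { exit bad + 0 }
    '
}

# Check one archive: $1 is the path to the CBZ.
check_cbz() {
    7zz l -slt "$1" | only_deflate
}
```

Something like `check_cbz "${PWD##*/}.cbz" || echo "re-compress with Deflate"` would then flag a bad archive before it wastes an upload.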

I’d like to make a more advanced script that I can simply point at a directory and it’ll work its way through each directory, compressing the TIFFs until there’s no more directories then uploading all the CBZs at once, but I’m too dumb. I’ll stick to opening up like 30 byobu sessions and running the script in each one.
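For what it's worth, the "point it at a directory" version could look something like the sketch below. It is untested against real scans. The function names and the `DRY_RUN` switch are invented for this example, and `-mm=Deflate` is just there to force the compression method the tip above calls for. Like the simple script, it uses each directory name as the Internet Archive identifier.

```shell
# Sketch only: compress every sub-directory of TIFFs, then upload all the CBZs.
# run, compress_all, upload_all and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Compress the TIFFs in each immediate sub-directory of $1 into $1/<name>.cbz.
compress_all() {
    root="$1"
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        name="$(basename "$dir")"
        (
            cd "$dir" || exit 1
            set -- *.tiff
            [ -e "$1" ] || exit 0   # no TIFFs in here, nothing to do
            run 7zz a -tzip -mm=Deflate "../$name.cbz" "$@"
        ) || return 1
    done
}

# Upload every CBZ sitting in $1, one Internet Archive item per file.
upload_all() {
    root="$1"
    for cbz in "$root"/*.cbz; do
        [ -e "$cbz" ] || continue
        name="$(basename "$cbz" .cbz)"
        run ia upload "$name" "$cbz" --retries 10 \
            --metadata="mediatype:texts" --metadata="title:$name" \
            --metadata="language:eng" || return 1
    done
}
```

Running `DRY_RUN=1 compress_all /path/to/scans` prints the 7zz commands without touching anything, and `compress_all /path/to/scans && upload_all /path/to/scans` does the real thing. It deliberately doesn't delete the TIFFs, so that step stays manual.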

Last time I did a big bunch of scanning, I had an HP DL560 with 32 cores to help me out. I sold that server because I needed the money, but I regret it now as all those cores made light work of all this compression. Instead I’m going to try to use the computers I do have:

  • Dell R220 II with a Pentium G-something shitbox CPU (file/backup server)
  • Dell R230 with an E3-1225v3 (piracy box)
  • HP ML10v2 with an i7-4770 (server I use for shits and giggles)
  • HP ProDesk-something with an i7-4790 (was a 2nd plex box for mates but they got their own so this is idle now)
  • Dell OptiPlex 7040 with an i5-6500 (wife’s old PC that’s now idle since she got a new one)
  • Mac Mini with an Apple Silicon M1 (my daily driver)

The plan is to set up the R220 II as a file server that the TIFF files are scanned directly to. All the other PCs will read from the file server over NFS/SMB and compress the TIFFs to their local disk, then upload that compressed file to the Internet Archive.
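One thing the plan doesn't cover is how the machines avoid grabbing the same magazine. A possible sketch, and it is only an assumption on my part: each worker claims a directory by creating a `.claimed` sub-directory inside it, relying on mkdir failing when the directory already exists. The names `claim`, `next_job` and `.claimed` are made up, and I haven't verified how reliably this behaves over SMB.

```shell
# Sketch only: claim, next_job and the ".claimed" marker are hypothetical.

# Try to claim directory $1. mkdir fails if ".claimed" already exists,
# so only the first worker to get there succeeds.
claim() {
    mkdir "$1/.claimed" 2>/dev/null
}

# Claim the first unclaimed sub-directory of $1 and print its path.
# Returns non-zero when everything has already been claimed.
next_job() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        if claim "$dir"; then
            printf '%s\n' "${dir%/}"
            return 0
        fi
    done
    return 1
}
```

Each worker would then loop on something like `while job="$(next_job /mnt/scans)"; do ...; done`, where `/mnt/scans` stands in for wherever the share is mounted, compressing and uploading one claimed directory at a time.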

The file server will be a bottleneck with at least 6 computers hitting it at once. On the network side, I’m gonna actually take advantage of the fancy network switch I paid far too much for and do some link aggregation to combine the two built-in gigabit ports for roughly 200MB/sec of bandwidth. The disks in that file server (2x 6TB HDDs) might choke on their own vomit with all the simultaneous reading, so I’ll probably get a cheap 4TB SSD, two HDDs in a RAID-0, or something else suitable in size and speed to clear its airway.

It could be a massive pain in the arse, so I’ll be on the look-out for a cheap 4-core server or Threadripper workstation or something 2nd hand. Lemme know if you see anything!