Preparing 18TB Of Data For Long Term Storage on LTO Tapes

18th May, 2022

Now that all my Internet Archive uploads are downloaded and sitting on two very precarious 5 year old HDDs, it's time to back them up. I already decided on LTO-5 tape and I have a drive along with all the accessories (except tapes, still waiting). The next step is to figure out how they'll be placed onto the tapes.

The data is a mix of ZIP, RAR and GZIP files. I want to "remaster" them all into the same compression format and use higher compression levels so they take up less space.

I installed all kinds of compression apps to test what speeds and file sizes are like with my particular set of data. Here are the results on a directory with 100 uncompressed TIFF files that clocks in at 9.37GB:

The clear winner is WinZip's ZipX format. I have no idea what it is, but it did a great job squeezing those files down to just 2.57GB, almost a full gigabyte less than the best ZIP/Deflate format at 3.32GB. There's weird shit like ZPAQ and ARC that worked really well, but I'm wary of using those formats. What are the chances there will be a modern decompression client available for a future platform in a decade? Out of the "mainstream" formats, old mate Bzip2 is the best.
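For anyone wanting to repeat this kind of test without installing a pile of apps, here is a minimal sketch using only the codecs in Python's standard library (Deflate, Bzip2 and LZMA). It does not cover ZipX, RAR, ZPAQ or ARC, the `benchmark` function is a made-up name for illustration, and the sample data is a stand-in, so the numbers will not match the results above.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data: bytes) -> dict:
    """Compress data with three stdlib codecs; return (size, seconds) per codec."""
    codecs = {
        "deflate (level 9)": lambda d: zlib.compress(d, 9),
        "bzip2 (level 9)": lambda d: bz2.compress(d, 9),
        "lzma (preset 9)": lambda d: lzma.compress(d, preset=9),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive bytes standing in for real image data.
    sample = bytes(range(256)) * 4096
    for name, (size, seconds) in benchmark(sample).items():
        ratio = size / len(sample)
        print(f"{name:20} {size:>10} bytes  {ratio:7.2%}  {seconds:.3f}s")
```

To test real files, read them with `open(path, "rb").read()` and pass the bytes to `benchmark`. Very large files would need a streaming approach instead.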

I also considered converting all the TIFF files to a different lossless image format, like PNG, to see what sort of compression I could get. I couldn't figure out how to configure ImageMagick to do AVIF and couldn't find any easy to use tools on Mac or Windows to generate lossless AVIF images, so that's unfortunately not in this set of results. JPEG XL is the clear winner.

The lossless JPEG XL files were generated via the JPEG XL reference library utility using the command cjxl -e 3 -d 0. It can't convert TIFF to JPEG XL directly, so the source PNGs were first converted from the TIFFs with ImageMagick's convert image.tiff image.png. Those same PNGs were run through OxiPNG with the command oxipng -o 3 --strip all. The WebP files were generated with ImageMagick 7 via the command convert image.tiff -define webp:lossless=true image.webp.
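The commands above could be batched over a folder of TIFFs with something like the sketch below. The flags are the ones quoted in the post; the `conversion_commands` helper, the output file naming and the `*.tiff` glob are assumptions made for illustration. It also assumes convert, cjxl and oxipng are on the PATH. OxiPNG runs last because, invoked this way, it optimises the PNG in place.

```python
import subprocess
import sys
from pathlib import Path


def conversion_commands(tiff: Path) -> list:
    """Build the command lines for one TIFF (output names are hypothetical)."""
    png = tiff.with_suffix(".png")
    jxl = tiff.with_suffix(".jxl")
    webp = tiff.with_suffix(".webp")
    return [
        # TIFF -> PNG, since cjxl can't read the TIFFs directly.
        ["convert", str(tiff), str(png)],
        # PNG -> lossless JPEG XL (-d 0 means distance 0, i.e. lossless).
        ["cjxl", str(png), str(jxl), "-e", "3", "-d", "0"],
        # TIFF -> lossless WebP.
        ["convert", str(tiff), "-define", "webp:lossless=true", str(webp)],
        # Optimise the PNG in place; this overwrites the plain PNG.
        ["oxipng", "-o", "3", "--strip", "all", str(png)],
    ]


if __name__ == "__main__":
    folder = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for tiff in sorted(folder.glob("*.tiff")):
        for command in conversion_commands(tiff):
            subprocess.run(command, check=True)
```

Keep a copy of the plain PNGs somewhere else if you want to compare their size against the OxiPNG output.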

I tried compressing the JPEG XL files with Bzip2 and 7zip, but it made no difference to the file size and took a while - which makes sense, the images are already heavily compressed. By the way, if you're after a way to check if two images are visually identical, use this ImageMagick feature (thanks Jon Sneyers): compare -metric pae image1 image2 null: and if it comes to 0, your images are the same.
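As a rough picture of what that metric is measuring: peak absolute error is the largest difference between any pair of corresponding samples in the two images. The sketch below is a conceptual illustration only, not ImageMagick's implementation. It assumes the images have already been decoded to raw 8-bit sample buffers with the same dimensions and channel order, and `peak_absolute_error` is a made-up name.

```python
def peak_absolute_error(pixels_a: bytes, pixels_b: bytes) -> int:
    """Largest per-sample difference between two equal-sized raw pixel buffers."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("buffers must have the same dimensions and channel layout")
    # Iterating over bytes yields integers 0-255, one per channel sample.
    return max((abs(a - b) for a, b in zip(pixels_a, pixels_b)), default=0)


if __name__ == "__main__":
    original = bytes([0, 10, 255, 128])
    identical = bytes([0, 10, 255, 128])
    altered = bytes([3, 10, 250, 128])
    print(peak_absolute_error(original, identical))
    print(peak_absolute_error(original, altered))
```

A result of 0 means no sample differs anywhere, which is why it works as a check that a lossless conversion really was lossless.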

All of this is just fucking around for shits and giggles, as the Internet Archive only accepts ZIP and RAR files (renamed to CBZ/CBR) in order to do the best auto-generation of PDFs and whatnot.

So yes, I could make everything a Bzip2 and use 13.5% less data, which applied to 9.87TB is a 1.34TB saving (one less LTO-5 tape). But it means I'd have to add Bzip2 to my workflow (compress the data twice, once for IA, once for my backup) and if I had to restore from these backups, I'd need to extract all that data then re-compress it as ZIP/RAR for the Internet Archive. Not a big deal for a handful of items, but for all 2400+, no thanks.
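The tape arithmetic can be sanity-checked with a few lines. This sketch assumes the native (uncompressed) LTO-5 capacity of 1.5TB per tape, a single copy of the data, and no allowance for filesystem overhead or for archives that don't pack neatly onto a tape. With the rounded 13.5% figure the saving comes out at roughly 1.33TB, close to the number quoted above.

```python
import math

LTO5_NATIVE_TB = 1.5  # native (uncompressed) capacity of one LTO-5 tape


def tapes_needed(data_tb: float, tape_tb: float = LTO5_NATIVE_TB) -> int:
    """Whole tapes needed for data_tb of data, ignoring any overhead."""
    return math.ceil(data_tb / tape_tb)


if __name__ == "__main__":
    archive_tb = 9.87      # current size of the archives, from the post
    bzip2_saving = 0.135   # 13.5% smaller with Bzip2, from the post

    saved_tb = archive_tb * bzip2_saving
    remaining_tb = archive_tb - saved_tb

    print(f"saving: {saved_tb:.2f}TB")
    print(f"tapes as-is: {tapes_needed(archive_tb)}")
    print(f"tapes with Bzip2: {tapes_needed(remaining_tb)}")
```

Under those assumptions it works out to 7 tapes as-is and 6 with Bzip2, which matches the "one less tape" claim.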

That means I have to pick either ZIP or RAR to compress all these files. I think I'm gonna go with RAR unless someone smarter than me tells me otherwise:

RAR compresses better (3% more efficient). ZIP can probably match it using Zopfli but it's extremely slow.

RAR has built-in recovery records. Can do the same with PAR files for ZIP, but having it built-in to the archive is nice.

RAR is faster at max compression than ZIP is. I didn't time it exactly, but just eyeballing it, RAR finished sooner.

RAR supports BLAKE2 file checksums. I don't know what that is, but this page tells me it is better than ye olde CRC32 for file checksumming.
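To make the checksum point a bit more concrete: CRC32 is a 32-bit value designed to catch accidental corruption, while BLAKE2 is a cryptographic hash with a much longer digest, so two different files are far less likely to share one. The sketch below computes both for a file using Python's standard library. It uses the plain BLAKE2s variant that hashlib provides, which is an assumption for illustration and may not be the exact variant RAR writes, so don't expect the digest to match what RAR reports. The `checksums` name is made up.

```python
import hashlib
import zlib


def checksums(path, chunk_size=1024 * 1024):
    """Return (crc32_hex, blake2s_hex) for a file, reading it in chunks."""
    crc = 0
    blake = hashlib.blake2s()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # running 32-bit checksum
            blake.update(chunk)           # running 256-bit hash
    return f"{crc:08x}", blake.hexdigest()


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        crc_hex, blake_hex = checksums(name)
        print(f"{name}\n  crc32:   {crc_hex}\n  blake2s: {blake_hex}")
```

The CRC32 comes out as 8 hex characters and the BLAKE2s digest as 64, which gives a feel for how much more room there is to tell files apart.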

The only downside of RAR is that it's a proprietary program and is technically not free. Considering its heavy use in piracy circles, I'd be surprised if decompression apps disappeared any time soon however.

Next steps now are to extract all this data so I can re-compress it. I was able to use the gzip -l, zipinfo and rar l commands to view the uncompressed size for each archive, which added up to 17.53TB. I'll need to buy at least 17.53TB of HDD space (2x 10TB HDDs), pop them in my server, extract everything to the new drives, erase the 2x 6TB HDDs with the archives on them, then compress it all again from the 2x 10TB HDDs to the 2x 6TB HDDs.
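Adding up the uncompressed sizes could also be scripted. This is a minimal sketch that handles ZIP (and CBZ) via Python's zipfile module and GZIP by reading the 4-byte size field at the end of the file. RAR is left out because the standard library can't read it, so those totals would still need rar l. The function names and the file extensions checked are assumptions. One caveat: the GZIP size field only stores the size modulo 4GiB, so a single gzip stream larger than that will be under-reported.

```python
import struct
import zipfile
from pathlib import Path


def zip_uncompressed_size(path) -> int:
    """Sum of the uncompressed sizes of every entry in a ZIP archive."""
    with zipfile.ZipFile(path) as zf:
        return sum(info.file_size for info in zf.infolist())


def gzip_uncompressed_size(path) -> int:
    """Read the size field from the gzip trailer (size modulo 2**32)."""
    with open(path, "rb") as fh:
        fh.seek(-4, 2)  # last 4 bytes of the file
        return struct.unpack("<I", fh.read(4))[0]


def total_uncompressed(root) -> int:
    """Walk a folder and total the uncompressed sizes of ZIP and GZIP files."""
    total = 0
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if suffix in (".zip", ".cbz"):
            total += zip_uncompressed_size(path)
        elif suffix == ".gz":
            total += gzip_uncompressed_size(path)
    return total


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{total_uncompressed(root) / 1e12:.2f}TB uncompressed")
```

Whether 1e12 or 2**40 is the right divisor depends on whether the drives and tapes in question are sold in decimal or binary terabytes; drive and tape vendors generally quote decimal.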