My scanning setup is all new for 2022. Heaps more scanners means I can digitise around 1,300 A4 pages an hour for the Internet Archive. Here’s a brief overview of what I’ve been using to make that happen.
Cut and Slice
First step is still cutting the spine off magazines or books. Not much has changed here. I’m still using a crappy guillotine I got off Facebook for $25 that doesn’t cut straight and needs its blade sharpened far too often. Each sharpening means a trip into Melbourne, a day’s wait and $30! Real pain in the arse.
I should really get a new guillotine - it’s starting to become a bottleneck now that I’ve got so many scanners. I can scan for 2-3 days straight and the guillotine will be blunt by the end of it.
Scan
The sliced up paper gets placed into one of my seven scanners. I’ve got 5x Avision AD260, a Canon DR-6050C and a Fujitsu ScanSnap S1500M.
The Avision AD260s were super cheap on eBay as refurbished units for $76 each. They have all the features you could want from a document scanner like duplex scanning and ultrasonic multi-feed detection. The only thing they lack is decent automatic image correction.
My Canon DR-6050C is an absolute workhorse. I can dump in an entire book or half a dozen magazines and let it do its thing. It’s got a killer feature called “prevent bleed through” that does what it says on the label - it’ll automatically adjust the contrast to stop text and images from the opposing page bleeding through, saving me having to do it manually. I love it but they’re so expensive. I got this one for a few hundo as a fluke. They still sell for many thousands of dollars.
The ScanSnap was kindly given to me for free by @kai_h on Twitter. Fujitsu dropped Mac support for it a while ago, so it’s no use to him. Unfortunately I forgot that the ScanSnap scanners do not have TWAIN/WIA drivers. It’s a proprietary driver that only works with Fujitsu’s software. The software isn’t that bad (it even has a prevent bleed through feature like Canon), but it doesn’t support TIFF, only JPEG. A bit of a bummer for archiving. Will play around with it some more before I write it off entirely.
Process
The Canon and Fujitsu scanners are plugged in to an old iMac running Windows 10.
The Avision scanners are hooked up to a Lenovo P700 workstation running VMware ESXi (that’s the P700 under the desk).
Each scanner has its own VM because the driver shits itself when two scanners are plugged in to the same PC. Instead of having 5 separate computers I made a VM for each scanner and control it over Remote Desktop, which is snappier than I thought and perfectly fine for my needs.
Specs on the P700:
- Dual E5-2620 v3 Xeon CPUs (12 cores total)
- 128GB DDR4-2133 registered ECC RAM (overkill, but 8x 16GB sticks are cheaper than 8x 8GB sticks)
- 1TB Crucial P5 NVMe PCIe SSD in a slot adapter because the P700 has no M.2 sockets
- Quad Intel PCIe NIC + 2x onboard Intel gigabit ports
- 8x USB 3.0 ports
Each VM gets 24GB of RAM, 2x vCPUs, 180GB of storage and a passed-through gigabit network port. Disappointingly, it kinda chugs when exporting with only 2x vCPUs, so I will likely upgrade the CPUs in the P700 to something with more cores.
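To see how tight that allocation is, here is a quick back-of-the-envelope check in Python. It is only a sketch: the numbers come from the spec list above, the 1TB SSD is treated as a round 1000GB, only physical cores are counted (no hyperthreading), and ESXi's own overhead is ignored. The constant and function names are made up for illustration.

```python
# Host resources, taken from the P700 spec list.
HOST_RAM_GB = 128
HOST_CORES = 12          # 2x E5-2620 v3, physical cores only (assumption: ignore hyperthreading)
HOST_STORAGE_GB = 1000   # 1TB NVMe SSD, rounded to 1000GB (assumption)

# Per-VM allocation, one VM per Avision scanner.
VM_COUNT = 5
VM_RAM_GB = 24
VM_VCPUS = 2
VM_STORAGE_GB = 180


def allocation_summary():
    """Return {resource: (allocated, spare)} for RAM, vCPUs and storage."""
    ram = VM_COUNT * VM_RAM_GB
    vcpus = VM_COUNT * VM_VCPUS
    storage = VM_COUNT * VM_STORAGE_GB
    return {
        "ram_gb": (ram, HOST_RAM_GB - ram),
        "vcpus": (vcpus, HOST_CORES - vcpus),
        "storage_gb": (storage, HOST_STORAGE_GB - storage),
    }


if __name__ == "__main__":
    for name, (used, spare) in allocation_summary().items():
        print(f"{name}: {used} allocated, {spare} spare")
```

On those assumptions the five VMs already account for 10 of the 12 physical cores, so there isn't much room to hand out extra vCPUs without more cores in the box.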
NAPS2 is the software I like to use for acquiring images from the scanners. I’ve tried other software like VueScan, Adobe Acrobat, whatever ABBYY software is bundled with scanners and some others, but NAPS2 is free and simple, with the main adjustments (deskew, rotate, brightness/contrast) built in so I don’t need to use a separate app for that kinda stuff.
Upload
Once the images are scanned in, they’re exported out of NAPS2 over the network to a file server in my garage that also prepares and uploads the scans to the Internet Archive.
The file server is a HP ML350 G9. Specs:
- Dual E5-2683 v3 (28 cores/56 threads)
- 64GB DDR4-2133 registered ECC RAM
- 2TB Kingston KC2500 PCIe SSD
- 1TB Kingston KC2500 PCIe SSD
- 4x 300GB 10K SAS HDDs in RAID 1+0 (boot/system volume)
- 4x gigabit ethernet ports
You’re probably wondering why this thing has so many cores! It’s for making light work of compressing all the scanned images. I upload a CBZ (comic book zip - which is just a zip file with a .cbz file extension) of the TIFF files and the Internet Archive’s servers do the rest.
Compression chews up CPU so having all those cores means I’m not waiting literally days for all the images to compress. The actual upload process is explained in this post I wrote a few weeks ago.
I put two SSDs in there to make sure there’s no disk bottleneck when compressing. The images are saved on the 2TB SSD from the scanners, and the CBZ files are written to the 1TB SSD and uploaded to the Internet Archive. Reading and writing to the same SSD can slow things down.
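A rough sketch of the CBZ step in Python, as an illustration only and not the exact script used here. It assumes each magazine or book sits in its own folder of TIFFs; `make_cbz` and `make_all` are hypothetical helper names. Python's `zipfile` compresses one archive on one core, so the sketch spreads folders across worker processes to keep lots of cores busy, and it reads from one directory and writes to another so the two can live on separate SSDs.

```python
import zipfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def make_cbz(scan_dir, out_dir):
    """Zip every .tif/.tiff in scan_dir into out_dir/<folder name>.cbz."""
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cbz_path = out_dir / (scan_dir.name + ".cbz")
    # Sort by filename so pages land in the archive in scan order.
    pages = sorted(
        p for p in scan_dir.iterdir()
        if p.suffix.lower() in {".tif", ".tiff"}
    )
    # A CBZ is an ordinary zip archive with a .cbz extension.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return cbz_path


def make_all(scans_root, out_dir, workers=None):
    """Build one CBZ per sub-folder of scans_root, one folder per worker process."""
    folders = sorted(p for p in Path(scans_root).iterdir() if p.is_dir())
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(make_cbz, folders, [out_dir] * len(folders)))


# Example call (hypothetical paths), run from under an
# `if __name__ == "__main__":` guard:
#     make_all("/mnt/scans", "/mnt/cbz")
```

Leaving `workers` as `None` lets Python pick a worker count based on the CPUs it can see.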
Network
This new scanning workflow also meant I had to upgrade my network. With up to 6 machines all sending up to a gigabit of data at once to the file server, a huge bottleneck forms when all I’ve got is a single gigabit port as backhaul to the switch that the server is connected to.
Luckily the “core” switch in my network, a Ubiquiti US-48-750W, has 2x 10 gig SFP+ ports on it. I replaced the $29 TP-Link 8-port gigabit switch with a Ubiquiti USW-Pro-24, which also has two 10 gig SFP+ ports. Slapped a UF-RJ45-10G transceiver in each switch and it works nicely to create a 10 gig link between my study and the core switch in the garage. It wasn’t cheap ($238 for the transceivers and $596 for the switch), but it integrates with my existing UniFi gear and should be good for a long time yet. I wrote about the decision process to drop that cash on network gear in this post.
The file server has 4x gigabit ethernet ports on it that I’ve configured into an 802.3ad bond in the server’s OS (Ubuntu 20.04 LTS) and on the UniFi switch. I’ve done a bit of load testing and if I copy a file from 4x computers at once, it hits 400MB/sec. Not bad, but I’m looking into the cheapest way to get a 10 gig link between the file server and the US-48-750W. Probably an Intel SFP+ NIC from AliExpress, two fibre transceivers and a 5m fibre patch lead.
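The bandwidth numbers in this section can be sanity-checked with a few lines of Python. This is just arithmetic on theoretical line rates (1 gigabit = 125MB/sec), ignoring protocol overhead; the function names are made up for illustration.

```python
GIGABIT_MB_PER_SEC = 1000 / 8  # 1 Gbit/s = 125 MB/s, ignoring protocol overhead


def link_capacity_mb(ports, gbit_per_port=1):
    """Theoretical capacity in MB/s of `ports` links at `gbit_per_port` each."""
    return ports * gbit_per_port * GIGABIT_MB_PER_SEC


def oversubscription(senders, uplink_mb):
    """How many times the senders' combined gigabit line rate exceeds the uplink."""
    return link_capacity_mb(senders) / uplink_mb


single_uplink = link_capacity_mb(1)   # the old single gigabit backhaul
bond = link_capacity_mb(4)            # 4x gigabit 802.3ad bond
ten_gig = link_capacity_mb(1, 10)     # a 10 gig SFP+ link

print(oversubscription(6, single_uplink))  # 6.0: six machines into one gigabit port
print(400 / bond)                          # 0.8: measured 400MB/sec vs the bond's 500MB/sec ceiling
print(ten_gig)                             # 1250.0
```

Worth noting that an 802.3ad bond balances traffic per flow, so a single copy from one machine still tops out at one port's worth of bandwidth. That's why the load test needs several computers copying at once to get near the combined figure.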
Costs
There we have it - that’s my scanning setup for 2022. Now the job is scrounging around Australia looking for crates of books and magazines to scan! Now to take a big swig of water and add up how much all this shit cost me…
- 5x Avision AD260 scanners - $380
- Canon DR-6050C scanner - $300 (I think, I purchased it a while ago!)
- Lenovo P700 workstation - $550
- 128GB DDR4-2133 RDIMM RAM - $308
- PCIe adapter for SSDs - $9
- 1TB Crucial P5 SSD - $149
- Quad gigabit NIC - $69
- HP ML350 G9 server - $300
- 2x Intel Xeon E5-2683 v3 CPUs - $240
- 64GB DDR4-2133 RDIMM RAM - $141
- 1TB Kingston KC2500 SSD - $288
- 2TB Kingston KC2500 SSD - $135
- 2x PCIe adapters for SSDs - $18
- 2x UF-RJ45-10G SFP+ transceivers - $238
- Ubiquiti USW-Pro-24 switch - $596
- 10x CAT6 network cables (5x 5M, 5x 10M) - $84
$3,805 - not counting time and travel. Oof.
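For anyone who wants to double-check the maths, here is the same list as a small Python script. The prices are straight from the list above; splitting them into four groups is just for illustration.

```python
# Prices in AUD, copied from the cost list. The grouping is an assumption
# made for this sketch, not something the list itself spells out.
COSTS = {
    "scanners": {
        "5x Avision AD260 scanners": 380,
        "Canon DR-6050C scanner": 300,
    },
    "P700 workstation": {
        "Lenovo P700 workstation": 550,
        "128GB DDR4-2133 RDIMM RAM": 308,
        "PCIe adapter for SSDs": 9,
        "1TB Crucial P5 SSD": 149,
        "Quad gigabit NIC": 69,
    },
    "file server": {
        "HP ML350 G9 server": 300,
        "2x Intel Xeon E5-2683 v3 CPUs": 240,
        "64GB DDR4-2133 RDIMM RAM": 141,
        "1TB Kingston KC2500 SSD": 288,
        "2TB Kingston KC2500 SSD": 135,
        "2x PCIe adapters for SSDs": 18,
    },
    "network": {
        "2x UF-RJ45-10G SFP+ transceivers": 238,
        "Ubiquiti USW-Pro-24 switch": 596,
        "10x CAT6 network cables": 84,
    },
}

# Subtotal per group, then the grand total across all groups.
subtotals = {group: sum(items.values()) for group, items in COSTS.items()}
total = sum(subtotals.values())

for group, amount in subtotals.items():
    print(f"{group}: ${amount:,}")
print(f"total: ${total:,}")
```

The grand total matches the figure above, with the file server and the P700 each coming in at a bit over a grand.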