How to download 2400 items off the Internet Archive

To make a backup of all my Internet Archive scans, I need to download them all. Sounds easy, but it was actually a bit of a hassle.

First up I had to get a list of all the URLs of my items. Luckily the Internet Archive has a handy Python app to interact with its collection.

Everything on there has a bunch of metadata attached, and using the command line app I can view it all for any item:

ia metadata dr_dobbs_journal-1987_08 > metadata.txt

It gives the result as a big blob of JSON:

But that’s okay, as I can dump it into a JSON viewer and read it in a more human-readable format:
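
For the curious, the ia tool is built on the internetarchive Python package, so the same metadata can be fetched and pretty-printed without leaving Python. A minimal sketch:

import json
from internetarchive import get_item

# Fetch the item's metadata and pretty-print it instead of pasting
# the blob into a JSON viewer.
item = get_item("dr_dobbs_journal-1987_08")
print(json.dumps(item.metadata, indent=2, sort_keys=True))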

I can see that the “uploader” field contains my email address, so everything I’ve uploaded carries that piece of metadata. I can now use that with the IA app to pull in every item where the uploader field is my email address:

ia search 'uploader:aagius@gmail.com' > myitems.txt

That dumps the identifiers (the unique names given to each item on the IA) of everything I’ve uploaded into a text file with 2400 records in it, looking like this:

{"identifier": "dick_smith_electronics-white_paper_ad"}
{"identifier": "dk_eyewitness-computer"}
{"identifier": "dr_dobbs_journal-1987_08"}

A quick find and replace in Sublime Text got rid of all the JSON stuff to give me a plain list like so:

dick_smith_electronics-white_paper_ad
dk_eyewitness-computer
dr_dobbs_journal-1987_08
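
As an aside, the same search can be run from Python and the bare identifiers written straight out, which skips the find-and-replace step entirely. A rough sketch using the library’s search_items function with the same query:

from internetarchive import search_items

# Run the same uploader search and write the bare identifiers
# straight to a file - no JSON clean-up needed.
with open("myitems.txt", "w") as out:
    for result in search_items("uploader:aagius@gmail.com"):
        out.write(result["identifier"] + "\n")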

Now that I’ve got a big list of all my identifiers, I want the URLs of my files so I can chuck them into something like aria2c to download them. The list command gives a list of all the files related to an identifier, and the -l switch gives the full URL instead of just the file name:

ia list -l dr_dobbs_journal-1987_08

Gives me this:

There are many files here that are auto-generated by the IA that I don’t need. All I want are the original scanned images I uploaded. For this item, that is dr_dobbs_journal-1987_08.cbr - a Comic Book RAR file (see this post for why I upload in Comic Book RAR/ZIP format). Some of my uploads are CBR, some are CBZ and some are GZIP files, from before I realised I should have been uploading CBR/CBZ. Luckily the IA app lets me select specific file types.

Running the command:

ia metadata --formats dr_dobbs_journal-1987_08

Results in a list of all the file formats that item contains:
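
In Python the same information lives in the item’s file list, since every file entry has a format field. A small sketch that collects the distinct formats:

from internetarchive import get_item

# Every entry in item.files carries a "format" field, so the distinct
# formats for an item can be collected into a set.
item = get_item("dr_dobbs_journal-1987_08")
print(sorted({f["format"] for f in item.files}))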

Now that I know what the Internet Archive calls each format, I can use this command to get just the files in those formats:

ia list -l --format='Comic Book RAR' --format='Comic Book ZIP' --format='GZIP' dr_dobbs_journal-1987_08

Spits out just the CBR file like I asked:

https://archive.org/download/dr_dobbs_journal-1987_08/dr_dobbs_journal-1987_08.cbr

Now I can feed the entire list of identifiers I generated earlier into this command and get a big list of URLs. This loop reads through myitems.txt line by line and runs the IA app on each identifier:

for line in $(cat /home/decryption/myitems.txt); do /home/decryption/ia list -l --format='Comic Book RAR' --format='Comic Book ZIP' --format='GZIP' "$line" >> urls.txt; done
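
The same loop works from Python too, with the bonus of starting one process instead of firing up the ia tool 2400 times. A sketch reusing the three formats from above:

from internetarchive import get_item

WANTED = ["Comic Book RAR", "Comic Book ZIP", "GZIP"]

# Read each identifier from myitems.txt, look up its files, and append
# the matching download URLs to urls.txt - one long-lived Python
# process instead of 2400 separate runs of the ia tool.
with open("myitems.txt") as ids, open("urls.txt", "w") as out:
    for line in ids:
        identifier = line.strip()
        if not identifier:
            continue
        for f in get_item(identifier).get_files(formats=WANTED):
            out.write(f.url + "\n")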

Now I’ve got a big fat list of URLs to pump into aria2c:

aria2c -c -s 16 -x 16 -k 1M -j 1 -i urls.txt

It will download one file at a time, but with 16 connections to the server for that single file. It worked, but I could not max out my gigabit connection - the best I’d get off a single file was around 3-4MB/s. With 2400 items at roughly 4GB each (9-10TB all up), it was going to take a very long time to download!

To brute force this bastard, I split the 2400 URLs into 6 files and ran aria2c in 6 separate sessions. That increased speeds to around 20MB/s, but adding a 7th or an 8th concurrent download did not improve the overall bandwidth much. It would take 5-6 days to download everything at that speed.
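
If you want to script that split rather than do it by hand, something like this sketch deals the URLs round-robin into six files (the urls_0.txt to urls_5.txt names are made up):

# Deal the URLs from urls.txt round-robin into six smaller files
# (urls_0.txt .. urls_5.txt), one per aria2c session.
PARTS = 6

outputs = [open(f"urls_{i}.txt", "w") for i in range(PARTS)]
with open("urls.txt") as urls:
    for n, url in enumerate(urls):
        outputs[n % PARTS].write(url)
for out in outputs:
    out.close()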

When I whinged about the slow speeds to @voltagex, he reminded me that every item on the IA has a torrent, which could be faster than downloading over HTTP. Getting the torrent files is just like getting the list of file URLs, but instead of asking for a list of files, I download the torrent file itself:

for line in $(cat /home/decryption/myitems.txt); do /home/decryption/ia download --format='Archive BitTorrent' "$line"; done

It took a while, but after about 5 hours I had downloaded all 2400 torrent files. I could have sped this up by running multiple commands at the same time (e.g. with GNU Parallel), but it was late at night so I just let it run while I slept.
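
The same speed-up is possible from Python without GNU Parallel by running a handful of downloads at once. A sketch using the library’s download helper and a thread pool, with the worker count being a guess rather than a tuned value:

from concurrent.futures import ThreadPoolExecutor
from internetarchive import download

def fetch_torrent(identifier):
    # Saves the .torrent into a directory named after the identifier.
    download(identifier, formats=["Archive BitTorrent"])

with open("myitems.txt") as ids:
    identifiers = [line.strip() for line in ids if line.strip()]

# Run 8 downloads at a time; the worker count is a guess, not tuned.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetch_torrent, identifiers))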

Now my problem is that all those torrent files will download all the auto-generated IA metadata I don’t need. It adds up to a huge amount of data, as there are PDFs and JP2 files adding a few extra gigabytes per torrent. Instead of needing around 10TB to download it all, I’d need 20TB. Not only would it take twice as long to download, I don’t have enough disk space!

Once again, @voltagex helped out and made a little Python script that added all the torrents to qBittorrent (my torrent client of choice) via its API, but filtered out all the files that aren’t .cbr, .cbz or .tar.gz - it worked perfectly.
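
That script isn’t reproduced here, but a rough equivalent using the qbittorrent-api package would add every torrent paused, mark anything that isn’t a .cbr, .cbz or .tar.gz as do-not-download, then resume them. Something like this sketch, with placeholder host, port and credentials:

from pathlib import Path
import qbittorrentapi

KEEP = (".cbr", ".cbz", ".tar.gz")

# Placeholder connection details for the qBittorrent Web UI.
client = qbittorrentapi.Client(host="localhost", port=8080,
                               username="admin", password="adminadmin")
client.auth_log_in()

# Add every torrent file in the current directory, paused, so nothing
# starts downloading before the unwanted files are filtered out.
for torrent_path in Path(".").glob("*.torrent"):
    client.torrents_add(torrent_files=str(torrent_path), is_paused=True)

# File priorities are set by index: mark everything that doesn't end in
# a wanted extension as "do not download" (priority 0), then resume.
for torrent in client.torrents_info():
    files = client.torrents_files(torrent_hash=torrent.hash)
    unwanted = [i for i, f in enumerate(files)
                if not f.name.lower().endswith(KEEP)]
    if unwanted:
        client.torrents_file_priority(torrent_hash=torrent.hash,
                                      file_ids=unwanted, priority=0)
    client.torrents_resume(torrent_hashes=torrent.hash)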

But speeds still didn’t get above 50-60MB/s. My crappy old 6TB HDDs in RAID-0 were struggling under the load, so I moved a 2TB SSD from one PC into the torrent box and told qBittorrent to store all the incomplete torrents on that 2TB drive and move them to the 12TB volume when complete. This sped things up to 100MB/s+! However, I could only download 2TB at a time: when the temporary SSD got full, downloads would just stop and I had to manually manage the torrents (delete a few to make space, then start again).

Another issue with all these torrents was constant stalling. Torrents would get to 99.something% and then just stop. I’d have to delete them then re-download to get it going again. Was a big waste of time.

Eventually they all downloaded but it took a week.

Then it took 2 days to verify that all the torrents were correctly downloaded - which I’m glad I did, as about 300 were not fully downloaded, and a random sampling of those 300 showed they didn’t extract properly and had to be re-downloaded. In hindsight, I should have just let the “slow” HTTP downloads go at around 20-30MB/s and it would have been complete in the same time or even faster, with less fucking around than the torrents.