Downloading all archive.org metadata

BermudaHighball ( @BermudaHighball@lemmy.dbzer0.com ) · 5 days ago


thingsiplay ( @thingsiplay@beehaw.org ) · 5 days ago

I just found out you can get all the metadata of an item with ia metadata ID > metadata.json (replace ID with the item's id, which is gamefaqs_txt in my example). From there you can extract any information you need, if you know how to handle json. (Edit: Just open metadata.json in your browser to see a better formatted view.)
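As a rough illustration of the "extract any information" step, here is a minimal Python sketch that pulls file names and sizes out of metadata.json-shaped data. The helper name list_files and the inline sample are made up for this example: the sample only imitates the "metadata" and "files" keys that the saved json normally contains, and its values are placeholders rather than the item's real data.

```python
import json

# Invented sample imitating the shape of `ia metadata gamefaqs_txt` output.
# The values below are placeholders, not the item's real data.
SAMPLE = """
{
  "metadata": {"identifier": "gamefaqs_txt", "mediatype": "data"},
  "files": [
    {"name": "gamefaqs_txt_meta.xml", "format": "Metadata", "size": "1234"},
    {"name": "gamefaqs_txt_archive.torrent", "format": "Archive BitTorrent", "size": "5678"}
  ]
}
"""


def list_files(doc):
    """Return (name, size_in_bytes) pairs for every file entry in the document."""
    pairs = []
    for entry in doc.get("files", []):
        # Sizes are strings in this sample; entries without a size get None.
        size = entry.get("size")
        pairs.append((entry["name"], int(size) if size is not None else None))
    return pairs


doc = json.loads(SAMPLE)
print(doc["metadata"]["identifier"])
for name, size in list_files(doc):
    print(name, size)
```

With a real download, the same function should work on the result of json.load() applied to the saved metadata.json instead of the inline sample, assuming the file has the shape sketched here.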

drspod ( @drspod@lemmy.ml ) · 5 days ago

I like to pipe my json to python -m json.tool for quick formatting in the terminal.

CHKMRK ( @CHKMRK@programming.dev ) · 4 days ago

Take a look at jq, a really nice tool for handling json in the terminal. There is also gron, which makes json easy to search.
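To give an idea of why a flattened view makes json easier to search, here is a small Python sketch that loosely imitates the kind of output gron produces: one line per leaf value, each prefixed with its full path. This is not gron's implementation. The function name flatten and the sample data are made up, keys are assumed to be simple identifiers, and empty objects or arrays produce no output in this sketch.

```python
import json


def flatten(value, path="json"):
    """Yield one 'path = value;' line per leaf value, loosely in gron's style."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # json.dumps keeps strings quoted and renders numbers, booleans and null as json.
        yield f"{path} = {json.dumps(value)};"


# Made-up sample in the shape of an item's metadata.
sample = {
    "metadata": {"identifier": "gamefaqs_txt"},
    "files": [{"name": "gamefaqs_txt_meta.xml"}],
}
lines = list(flatten(sample))
print("\n".join(lines))

# Every line carries its full path, so a plain substring search finds values in context.
matches = [line for line in lines if "meta.xml" in line]
print(matches)
```

Because each line is self-describing, a tool as simple as grep is enough to find where a value lives in a large document.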

thingsiplay ( @thingsiplay@beehaw.org ) · 5 days ago

I use the CLI tool, and I'm even waiting for some downloads to finish right now. The CLI tool can give you a list of all files in an item with ia list {ID} (replace {ID} with the actual id of the item you want to download). But you don’t even need to list the files, because you can download with a glob (for example *.torrent), just like in your shell. Or, if you have the ID anyway, you can specify the file names directly, such as {ID}_archive.torrent.

Here is an example of how to do this with my own upload at https://archive.org/details/gamefaqs_txt (the id is gamefaqs_txt):

ia download gamefaqs_txt --glob "*.torrent"

Or set the id in a variable and download all files whose names start with the id, which should be all the metadata files:

id=gamefaqs_txt ; ia download "${id}" --glob "${id}_*"
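To preview which files a pattern would select before downloading anything, here is a minimal Python sketch using the standard fnmatch module. It assumes that --glob follows the usual shell-style wildcard rules, which is an assumption and not something confirmed here. The file list and the select helper are hypothetical; the names only follow the {ID}_... convention mentioned above, plus one unrelated name for contrast.

```python
from fnmatch import fnmatchcase

# Illustrative file names following the {ID}_... pattern mentioned above;
# the actual contents of the item may differ.
FILES = [
    "gamefaqs_txt_archive.torrent",
    "gamefaqs_txt_files.xml",
    "gamefaqs_txt_meta.xml",
    "some_upload.zip",
]


def select(names, pattern):
    """Return the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


item_id = "gamefaqs_txt"
print(select(FILES, "*.torrent"))      # files ending in .torrent
print(select(FILES, f"{item_id}_*"))   # files whose names start with the id
```

The pattern is passed as a plain string here, which mirrors quoting it on the command line so that the shell does not expand it against the current directory first.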

BermudaHighball ( @BermudaHighball@lemmy.dbzer0.com ) · 5 days ago

Thank you for the tips. I am actually interested in enumerating the metadata for every “item” ever uploaded, with “item” as defined by the API page. For example, one item = one ID:

Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata.

You did cause me to look at the API docs again, though, and I think I found something that does enumerate all item names, and as a bonus, it will keep you updated when changes are made: https://archive.org/developers/changes.html

We’ll see how much progress I can make. It might take a while to get through all the millions of them.
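Working through millions of items means following the feed page by page. The loop could be sketched like this in Python, under loud assumptions: the response field names changes, identifier and next_token, and the idea of stopping when no token comes back, are assumptions to check against the linked docs page. The real feed may keep returning a token so that you can poll for new changes, in which case a different stop or sleep condition is needed. The actual request is left to a caller-supplied fetch_page function, and the canned pages use made-up identifiers.

```python
def iter_identifiers(fetch_page, token=None):
    """Yield item identifiers page by page until no continuation token is returned.

    fetch_page(token) is a caller-supplied function that performs the actual
    request and returns the decoded json response as a dict. The field names
    used below are assumptions about the response shape.
    """
    while True:
        page = fetch_page(token)
        for change in page.get("changes", []):
            yield change["identifier"]
        token = page.get("next_token")
        if not token:
            break


# Stand-in for the real endpoint: two canned pages with made-up identifiers.
FAKE_PAGES = {
    None: {
        "changes": [{"identifier": "item_a"}, {"identifier": "item_b"}],
        "next_token": "t1",
    },
    "t1": {"changes": [{"identifier": "item_c"}]},
}

identifiers = list(iter_identifiers(FAKE_PAGES.get))
print(identifiers)
```

Keeping the request outside the loop makes it easy to save the last token to disk between runs, so a long crawl can resume where it stopped.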

thingsiplay ( @thingsiplay@beehaw.org ) · 5 days ago

Aren’t “item” and “id” basically the same thing? Every item has a unique id, so in my example gamefaqs_txt would be both the item and the id.

BermudaHighball ( @BermudaHighball@lemmy.dbzer0.com ) · 5 days ago

Yes, I think so. I’ll definitely use the example for downloading some of the files (.torrent, metadata file) once I have some items. But first I need to find all the items ever uploaded.