The Cleveland Museum of Art recently released 30,000 images of art under CC-Zero (~public domain). Some of the good people on Wikimedia Commons have begun uploading them there, to be used by Wikipedia and Wikidata, among others.

But how do you find the relevant Wikipedia article (if there is one) or Wikidata item for such a picture? One way would be via titles, names etc. However, the filenames are currently something like Clevelandart_12345.jpg, and while the title is in the wiki text, it is neither easily found nor overly precise. Thankfully, the uploader also includes the accession number (aka inventory number). In conjunction with the museum, that number is a lot more precise, and it can be found in the rendered HTML reliably (to a degree).

But who wants to go through 30K files, extract the inventory numbers, and check if they are on, say, Wikidata? And what about other museums/GLAM institutions? That’s too big a task to do manually. So, I wrote some code.

First, I need to get museums and their categories on Commons. But how to find them? Wikidata comes to the rescue, yielding (at the time of writing) 9,635 museums with a Commons category.
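For illustration, here is a minimal sketch of how such a lookup could be written against the public Wikidata SPARQL endpoint. It is not necessarily the query used for the number above. The properties P31 (instance of), P279 (subclass of) and P373 (Commons category) and the item Q33506 (museum) are real Wikidata identifiers; the function names and the User-Agent string are made up for this example, and a query this broad may need to be split up if it times out.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint is used for the lookup.
ENDPOINT = "https://query.wikidata.org/sparql"

MUSEUM_QUERY = """
SELECT DISTINCT ?museum ?commonsCategory WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;   # instance of (a subclass of) museum
          wdt:P373 ?commonsCategory .     # has a Commons category
}
"""

def parse_museums(result):
    """Turn a SPARQL JSON result into (item ID, Commons category) pairs."""
    pairs = []
    for row in result["results"]["bindings"]:
        # Entity URIs end in the item ID, e.g. .../entity/Q123
        qid = row["museum"]["value"].rsplit("/", 1)[-1]
        pairs.append((qid, row["commonsCategory"]["value"]))
    return pairs

def fetch_museums():
    """Run the query over HTTP (hypothetical helper; needs network access)."""
    params = urllib.parse.urlencode({"query": MUSEUM_QUERY, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "inventory-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_museums(json.load(response))
```

Keeping the parsing separate from the HTTP call makes it easy to try the parser on a saved result without hitting the endpoint.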

Then, I need to check each museum’s category tree for files, and try to extract the inventory number. Because this requires HTML rendering on tens (hundreds?) of thousands of files, I don’t want to repeat the exercise at a later date. So, I keep a log in a database (s51203__inventory_p, if you are on Labs). It logs

  • the files that were found with an inventory number (file, number, museum)
  • the number of files with an inventory number per museum

That way, I can skip museum categories that do not have inventory numbers in their file descriptions. If one of them is updated with such numbers, I can always remove the entry that says “no inventory numbers”, thus forcing a re-index.
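The logging and skipping logic can be sketched roughly as follows. This is an illustration only: it uses SQLite from the Python standard library as a stand-in for the actual database, and the table, column and function names are assumptions, not the real schema.

```python
import sqlite3

def open_log(path=":memory:"):
    """Create the two log tables if they do not exist yet (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS file_inventory (
            file TEXT PRIMARY KEY,
            inventory_number TEXT NOT NULL,
            museum TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS museum_count (
            museum TEXT PRIMARY KEY,
            files_with_number INTEGER NOT NULL
        );
    """)
    return db

def record_scan(db, museum, found):
    """Log one museum scan; `found` maps file name -> inventory number."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR REPLACE INTO file_inventory VALUES (?, ?, ?)",
            [(file, number, museum) for file, number in found.items()],
        )
        db.execute(
            "INSERT OR REPLACE INTO museum_count VALUES (?, ?)",
            (museum, len(found)),
        )

def needs_scan(db, museum):
    """Scan unknown museums and those known to have inventory numbers."""
    row = db.execute(
        "SELECT files_with_number FROM museum_count WHERE museum = ?", (museum,)
    ).fetchone()
    if row is None:
        return True       # never scanned, or entry removed to force a re-index
    return row[0] > 0     # a logged zero means "no inventory numbers": skip

def forget_museum(db, museum):
    """Remove the per-museum entry so the next run re-indexes the category."""
    with db:
        db.execute("DELETE FROM museum_count WHERE museum = ?", (museum,))
```

A museum logged with zero hits is skipped until `forget_museum` removes its entry, which mirrors the "remove the entry to force a re-index" idea.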

My initial scan is now running, slowly filling the database. It will be run on a daily basis after that, just for categories that have inventory numbers, and new categories (as per Wikidata query above).
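The per-file step of such a scan boils down to pulling the inventory number out of the rendered file page. Here is a minimal sketch of that step, assuming the number sits in a table cell that directly follows a cell labelled "Accession number" or "Inventory number". That layout and the label texts are assumptions for this example; real file pages vary, which is why the extraction is only reliable "to a degree".

```python
from html.parser import HTMLParser

# Assumed label texts; adjust to whatever the rendered pages actually use.
LABELS = {"accession number", "inventory number"}

class InventoryNumberParser(HTMLParser):
    """Remember the text of the table cell that follows a label cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_text = []
        self.expect_value = False
        self.number = None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag not in ("td", "th") or not self.in_cell:
            return
        self.in_cell = False
        # Collapse whitespace, including text split up by nested links.
        text = " ".join("".join(self.cell_text).split())
        if self.expect_value and self.number is None:
            self.number = text or None
            self.expect_value = False
        elif text.lower().rstrip(":") in LABELS:
            self.expect_value = True

def extract_inventory_number(html):
    """Return the first inventory number found in `html`, or None."""
    parser = InventoryNumberParser()
    parser.feed(html)
    parser.close()
    return parser.number
```

A file for which this returns None simply does not get logged; only the first labelled value on the page is used.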

That’s the first part. The second part is to match the files against Wikidata items. I have that as well, to a degree; I am using the collection and inventory number properties to identify a work of art. Initial tests did generate some matches, for example, this file was automatically added to this item on Wikidata. This automatic matching will run daily as well.
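The matching idea can be illustrated with a small offline sketch. On Wikidata, the collection property is P195 and the inventory number property is P217. The function below assumes both sides have already been fetched into plain tuples, and the choice to accept only unambiguous matches is a cautious assumption of this example, not necessarily what the running code does.

```python
from collections import defaultdict

def match_files(files, items):
    """Match Commons files to Wikidata items by (museum, inventory number).

    files: iterable of (file name, inventory number, museum item ID)
    items: iterable of (item ID, collection item ID [P195], inventory number [P217])
    Returns a dict mapping file name -> item ID for unambiguous matches only.
    """
    index = defaultdict(set)
    for qid, collection, number in items:
        index[(collection, number.strip())].add(qid)

    matches = {}
    for file, number, museum in files:
        candidates = index.get((museum, number.strip()), set())
        if len(candidates) == 1:  # skip files with no match or several matches
            matches[file] = next(iter(candidates))
    return matches
```

Keying on the pair of museum and number matters because the same inventory number can occur in more than one collection.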

A third part is possible, but not implemented yet. Assuming that a work of art has an image on Commons, as well as a museum and an inventory number, but no Wikidata item, such an item could be created automatically. This would be easy to code, but might generate duplicate items where an item exists but does not have the inventory number set. Please let me know if this would be something to do. I’ll give you some numbers once the initial run is through.
