The File (Dis)connect

I’ll be going on about Wikidata, images, and tools. Again. You have been warned.

I have written a few image-related Wikimedia tools over the years (FIST and WD-FIST, to name two big ones), because I believe that images in articles and Wikidata items are important beyond their adorning effect. But despite everyone’s efforts, images on Wikidata (the Wikimedia site with the most images) are still few and far between. For example, less than 8% of taxa have an image; across all of Wikidata, the figure is closer to 5% of items.

On the other hand, Commons now has ~45M files, and other sites like Flickr also have vast amounts of freely licensed files. So how do we bring the two together? One problem is that, lacking prior knowledge, matching an item to an image means full-text searching the image site, which even these days takes time for thousands of items (and risks duplicating searches, stressing APIs unnecessarily). A current example of “have items, need images” is the Craig Newmark Pigeon Challenge by the WMF.

The answer (IMHO) is to prepare item-file matches beforehand: take a subset of items (such as people or species) that do not have images, and search for them on Commons, Flickr, etc. Then store the results, and present them to the user upon request. I had written some quick one-off scans like that before, together with the odd shoestring interface; now I have consolidated the data, the scan scripts, and the interface into a new tool, provisionally called File Candidates. Some details:

  • Already seeded with plenty of candidates, including >85K species items, >53K humans, and >800 paintings (more suggestions welcome)
  • Files are grouped by topic (e.g. species)
  • Files can be located on Commons or Flickr; more sites are possible (suggestions welcome)
  • One-click transfer of files from Flickr to Commons (with does-it-exist-on-Commons check)
  • One- or two-click adding of the file to the Wikidata item (one click for the image property, two for other properties)
  • Some configuration options (click the “⚙” button)
  • Can use a specific subset of items via SPARQL (example: Pigeon Challenge for species)
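To make the "prepare matches beforehand" idea concrete, here is a minimal Python sketch of one scan step. It is not the tool's actual code: the function names and the in-memory `store` dict are made up for illustration, and a real scan would send the SPARQL to the Wikidata query service and fetch the search URL over HTTP. The sketch only assumes the standard Wikidata properties (P31 "instance of", P18 "image") and the regular MediaWiki search API on Commons, where namespace 6 is the File namespace.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def items_without_image_sparql(class_qid: str, limit: int = 100) -> str:
    """Build SPARQL for items that are instances of class_qid and have no image (P18)."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        "  FILTER NOT EXISTS { ?item wdt:P18 [] }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def commons_search_url(label: str, limit: int = 10) -> str:
    """Build a full-text search URL restricted to the File namespace on Commons."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": label,
        "srnamespace": 6,  # namespace 6 = File:
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def extract_candidates(api_response: dict) -> list[str]:
    """Pull file names (without the 'File:' prefix) out of a search response."""
    hits = api_response.get("query", {}).get("search", [])
    return [hit["title"].removeprefix("File:") for hit in hits]


def store_candidates(store: dict, qid: str, files: list[str]) -> None:
    """Record candidates once per item, so later requests need no new search."""
    known = store.setdefault(qid, [])
    for name in files:
        if name not in known:
            known.append(name)


# Canned response in the shape the search API returns (hypothetical file names).
response = {"query": {"search": [{"title": "File:Example pigeon 1.jpg"},
                                 {"title": "File:Example pigeon 2.jpg"}]}}
store = {}
store_candidates(store, "Q0", extract_candidates(response))  # "Q0" is a placeholder ID
store_candidates(store, "Q0", extract_candidates(response))  # repeat scan adds nothing
```

Running the scan once and keeping the results is the whole point: the user-facing interface then only reads from the store, and the same search is never sent to the API twice.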

There is another aspect of this tool I am excited about: Miriam is looking into ranking images by quality through machine learning, and has given me a set of people-file-matches, which I have already incorporated into my “person” set, including the ranking. From the images that users add through this tool, we can then see how much the ranking algorithm agrees with the human decision. This can set us on a path towards AI-assisted editing!
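How exactly the agreement will be measured is not settled here, so the following Python sketch is just one plausible way to do it. It assumes each decision is available as a pair of the algorithm's ranked file list and the file the user actually added; the function names and the two metrics (top-1 rate and mean reciprocal rank) are my choice for illustration, not part of the tool or of Miriam's work.

```python
def rank_of_choice(ranked_files, chosen):
    """Return the 1-based position of the chosen file in the ranking, or None if absent."""
    try:
        return ranked_files.index(chosen) + 1
    except ValueError:
        return None


def agreement(decisions):
    """Compare ranking and human choice over (ranked_files, chosen_file) pairs.

    Returns (top1_rate, mean_reciprocal_rank). A choice that does not
    appear in the ranking counts as a miss for both metrics.
    """
    if not decisions:
        return 0.0, 0.0
    ranks = [rank_of_choice(ranked, chosen) for ranked, chosen in decisions]
    top1 = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r) / len(ranks)
    return top1, mrr


# Hypothetical decisions: the user picked the top file, the second file,
# and a file the ranking did not contain.
decisions = [
    (["a.jpg", "b.jpg", "c.jpg"], "a.jpg"),
    (["d.jpg", "e.jpg"], "e.jpg"),
    (["f.jpg", "g.jpg"], "x.jpg"),
]
top1, mrr = agreement(decisions)
```

The top-1 rate says how often the human simply took the algorithm's first suggestion; the mean reciprocal rank also gives partial credit when the chosen file was ranked second or third.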

ADDENDUM: Video with demo usage!