For a while now, Wikimedia pages (usually Wikipedia articles) have had a “page image”: an image from that page used as a thumbnail in article previews, e.g. in the mobile app. While it is not entirely clear to me how this image is chosen, it appears to be the first image of the article in most cases, probably excluding some icons.
Wikidata does something similar with the “image” property (P18). However, a P18 value needs to be an image of the item’s subject, not “something related to the item”. Wikipedia’s “page image” often turns out to be a painting made by the article’s subject, or a map, or something related to an event. This discrepancy prevents an automated import of the “page image” into Wikidata. However, exceptions aside, the “page image” presents a highly specific resource for P18-suitable images.
So I added a new function to my WD_FIST tool, to help facilitate the import of suitable images from that rich source into Wikidata. As a first step, a bot checks several large Wikipedias on a daily basis, and retrieves “page images” where the associated Wikidata item has none, and the “page image” is stored on Commons. It also skips “non-subject” pages like list articles. In a second stage, images (excluding PNG, GIF, and SVG) that are used as a “page image” on at least three Wikipedias for the same subject are put into a main candidate list. The image must also not be on the tool-internal “ignore” list. Even after all this filtering, >32K candidates remain in the current list.
Wiki | Candidates |
---|---|
dewiki | 346,204 |
enwiki | 700,832 |
frwiki | 255,527 |
itwiki | 148,041 |
nowiki | 73,508 |
plwiki | 181,323 |
svwiki | 109,349 |
Combined | 32,137 |
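To make the second filtering stage concrete, here is a minimal Python sketch of it. This is not the actual WD_FIST code: the function name `main_candidates`, the `(wiki, item_id, file_name)` tuple layout, and the assumption that the “ignore” list is a plain set of file names are all made up for illustration. It only shows the logic described above: drop PNG, GIF, and SVG files, drop ignored images, and keep an image for an item only if at least three Wikipedias use it as the “page image”.

```python
import os
from collections import defaultdict

# File types excluded from the main candidate list, as described in the post.
EXCLUDED_EXTENSIONS = {".png", ".gif", ".svg"}


def main_candidates(page_images, ignore_list=frozenset(), min_wikis=3):
    """Return sorted (item_id, file_name, wiki_count) tuples.

    page_images: iterable of (wiki, item_id, file_name) tuples, i.e. the
    hypothetical output of the first stage (one row per wiki and item).
    ignore_list: set of file names to skip (assumed format).
    """
    # Collect the set of wikis that use each image for each item.
    wikis_by_pair = defaultdict(set)
    for wiki, item_id, file_name in page_images:
        wikis_by_pair[(item_id, file_name)].add(wiki)

    candidates = []
    for (item_id, file_name), wikis in wikis_by_pair.items():
        extension = os.path.splitext(file_name)[1].lower()
        if extension in EXCLUDED_EXTENSIONS:
            continue  # PNG, GIF, and SVG are excluded
        if file_name in ignore_list:
            continue  # image is on the "ignore" list
        if len(wikis) >= min_wikis:
            candidates.append((item_id, file_name, len(wikis)))
    return sorted(candidates)
```

Lowering the inclusion threshold later would then just mean calling the function with a smaller `min_wikis`.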
I will likely add more Wikipedias to this list (es and pt will show up tomorrow), and eventually lower the inclusion threshold, as candidates are added to Wikidata, or to the “ignore” list.
As the candidate list is already heavily filtered, I am not applying some of the usual WD-FIST filters. This also helps with retrieving a candidate set of 50 very quickly. In this mode, the tool also lends itself well to mobile usage.
2 Comments
Hi Magnus, great article. I’m looking forward to watching this work progress.
I don’t know if this helps, but I recently had to figure out how the images are selected for Wikimedia wikis. One of the things the PageImages extension looks at is whether the image is freely licensed. It will only return images that are CC/Public Domain (not fair use), which can cause some confusion. That removes many first images on an article. Then images that are used in maintenance templates, stubs, or icons are omitted. Image position, width, height/width ratio, and a PageImage blacklist also contribute to the scoring of an image. The highest-scoring image is then used, assuming one is found. 🙂
A terrible example is this article (https://en.wikipedia.org/wiki/Meredith_Grey). The first two images are of the character, but the PageImages API returns the third, of a writer of the TV show (https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=Meredith_Grey).
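For anyone who wants to check other articles the same way, here is a small Python sketch that builds the same kind of PageImages query and reads the file name out of a JSON response. The helper names `pageimages_url` and `page_image_names` are made up for illustration, and the parsing assumes the default JSON layout, where `query.pages` is keyed by page ID and each page may carry a `pageimage` field.

```python
from urllib.parse import urlencode


def pageimages_url(title, api="https://en.wikipedia.org/w/api.php"):
    """Build a PageImages query URL for one page title."""
    params = {
        "action": "query",
        "prop": "pageimages",
        "piprop": "name",   # ask for the image file name only
        "titles": title,
        "format": "json",
    }
    return api + "?" + urlencode(params)


def page_image_names(response):
    """Map page title -> page image file name (None if the page has none).

    response: the decoded JSON of a PageImages query (assumed layout).
    """
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("pageimage") for page in pages.values()}
```

Fetching the URL and passing the decoded JSON to `page_image_names` shows which file the extension picked for the page.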
Thanks for this, very interesting! Luckily, I couldn’t use “fair use” images anyway, I only process images on Commons. Template exclusion is good as well, though it won’t work on many templates that have ““.