Reductionism

While I do occasionally write Wikimedia tools “to order”, I wrote quite a few of them because I required (or just enjoyed) the functionality myself. One thing I like to do is add images to Wikidata, using WD-FIST. Recently, I started to focus on a specific list: people with awards (of any kind). People with awards are, in general, more likely to have an image; also, it can be satisfying to see a “job list” shrink over time. So for this one, I logged some data points:

Screen Shot 2015-06-24 at 11.24.54

Over the last 2-3 weeks, even my sporadic use of the tool has reduced the list by 1/4 (note the plateau when Labs was offline!). Some thoughts along the way:

  • The list of item candidates is re-calculated on every page load, and is not stable. As awards are more likely to be added to than removed from items, the total list of people with awards is likely to be longer today than it was at the beginning of this exercise.
  • I cannot take credit for all of this reduction; images that were added to Wikidata independently, but to items on this list by chance, likewise reduce the number of items on the list.
  • Not all of the items I “dealt with” now have an image; for many, I suppressed the candidate images using a recently implemented function, because none of the Wikipedia candidate images actually depict the person, but rather a navbox icon, or something associated with the person (a sculpture made by the person, a house the person lived in, etc.)
  • Many items were “dealt with” by setting a “grave image”. These seem to be surprisingly (to me at least) popular on Wikipedia, especially for people from the former Soviet Union, for some reason.
  • I skipped many items where either the item label or the image name is in non-Latin characters. Oddly enough, I can match images to items quite well if both are in the same (non-Latin) script, by visual comparison 😉
  • I also skipped many items where a candidate image shows multiple people. I tried my hand at generating cropped images for specific people with the excellent CropTool, but that remains quite slow compared to the usual WD-FIST actions. Maybe if I can find a way to pre-fill the CropTool values (e.g. “create new image with this name”).
  • Based on a gut feeling, the “low-hanging fruit” will probably run out at ~10-15K items.
  • A sore point for me is statues of people; sometimes, I use close-ups of statues as an image of the person, when no proper image is available. I’m not sure if that is the right thing to do; it often captures the likeness of the person reasonably well (at least, better than “no image”), but somehow it feels like cheating…
  • There should be a “pictures of people” project somewhere, making prioritized lists of people to get an image for, then systematically “hunt them down” (e.g. ask these people or their heirs for free images, check other free image sources in print and online, group them by “likely event” where they could show up in the future, etc.).
  • I could really use some help for the “Cyrillic people”, towards the end of the list.

Wikidata has passed German Wikipedia in terms of articles/items with an image of the subject, and is now second only to English Wikipedia in that regard. As if in celebration, I added several new features to my Wikidata FIST tool, which makes adding images to Wikidata as easy as a single click (or two, if it’s a plaque, coat of arms, map etc.).

Screen Shot 2015-06-08 at 09.14.49

The first feature is to suppress image suggestions, if that image is already used in a Wikidata item. This cuts down on already associated “grave pictures”, as well as “symbol pictures” from infoboxes.

The second is “JPEG only”, if you are looking for actual photographs, and not scans, maps etc.

Third, each item with image candidates now has a little yellow button, which will prevent the image candidates from being shown again for this item. While there are several media properties for items, some things (painting by an artist, buildings by an architect etc.) will never be directly added to the item; instead, each painting/building should have its own item, and link to its creator. So if all image candidates are of that nature, spare yourself and others from having to go through them again next time.

Musing on lists

As some of you may know, I write the occasional tool to help support Wikipedia, Wikidata, Commons, and other projects in the WikiVerse. Most of my tools work on the same basic principle: Get some data to start with, think about it, and present a result. The input data is often a list of pages (or Wikidata items, which is similar), defined by some sort of query.

Now, the number of potential sources for such lists has been multiplying over the years. Off the top of my head, I can think of:

  • Manual lists (paste in a box)
  • Lists on Wikis (numbered, unnumbered, with or without links, with comments, in tables etc.)
  • Category trees
  • Category tree intersections (e.g. QuickIntersection)
  • More complex intersections of category trees, templates, etc. (e.g. CatScan2)
  • Wikidata Queries
  • Complex intersections of categories, WDQ, lists etc. (e.g. AutoList)
  • SQL queries (e.g. Quarry)
  • SPARQL queries (e.g. WDQS)
  • All the tools that use any combination of the above, and generate page lists in return, could be sources again

The problem, however, is more complicated than this:

  • Most tools that process lists could potentially use most of the above sources, or combinations thereof, even if this is not apparent at first glance; a Wikidata tool can still use lists of French Wikipedia articles, as they can be “converted” into corresponding Wikidata items, and vice versa
  • Any of these sources can be combined in several ways; e.g. only pages that are in list A and (list B or list C); a sketch follows after this list
  • These can be combined with non-list properties (last edited less than a month ago, excluding bots; created over 5 years ago; edited by one of these users; use the matching talk/content page; no redirects)
  • This can be done recursively; the same source “types” can be used several times, in a complex query
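To make the combination logic concrete, here is a minimal Python sketch of how such a pipeline might intersect and union page lists, with a non-list filter on top. All the helper functions and data in it are hypothetical stand-ins for the real tools, not an existing API:

```python
from datetime import datetime, timedelta

# Hypothetical source helpers -- stand-ins for category trees, WDQ, manual lists etc.
# Each returns a set of page titles; real implementations would call the tools' APIs.
def pages_in_category_tree(category, depth=3):
    return {"Brandaris", "Lange Jaap", "Westhoofd"}     # dummy data

def pages_from_wikidata_query(wdq):
    return {"Brandaris", "Westhoofd"}                   # dummy data

def pages_from_manual_list(text):
    return {line.strip() for line in text.splitlines() if line.strip()}

def page_last_edited(title):
    return datetime(2015, 6, 20)                        # dummy timestamp

# "Only pages that are in list A and (list B or list C)"
list_a = pages_in_category_tree("Lighthouses in the Netherlands")
list_b = pages_from_wikidata_query("claim[31:39715]")   # illustrative WDQ-style string
list_c = pages_from_manual_list("Brandaris\nLange Jaap")
combined = list_a & (list_b | list_c)

# A non-list filter on top: only pages edited within the last month
cutoff = datetime(2015, 6, 24) - timedelta(days=30)
result = {p for p in combined if page_last_edited(p) > cutoff}
print(sorted(result))
```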

I have previously tried to allow users to construct a query pipeline, combining the outputs of different tools, and processing (e.g. filtering) them through more tools in new and interesting ways. However, that attempt was not taken up by users or tool developers.

I tried again to solve the issue, this time by putting the “pipeline” into JavaScript, running right in the users’ web browser. However, usage numbers (except for a single, quite active user) show that again, there was no uptake by users in general.

Maybe I am the only one in the WikiVerse thinking about this? Maybe my attempts are still too clunky for the average user? Maybe there is just no demand, and all the tools run perfectly fine as they are?

There seems to be some general interest in lists; my list generating bot appears to be reasonably popular with users on Wikipedia, albeit not in the article namespace. And an experimental, manual list-generating feature called Gather on mobile Wikipedia seems to be popular. Maybe I am just missing the “killer application” for lists, though the point is that all tools and applications could benefit from list management.

The Game of Source

Wikidata has beautiful mechanisms to associate individual claims with sources for that claim. However, finding and adding such sources is surprisingly complex, and, between multiple open tabs and the somewhat sluggish interface, can strain the patience of the most well-meaning editor.

I had previously attempted to simplify adding sources to Wikidata statements; and while I believe this interface to be much easier to use than Wikidata proper, it is still clunky, and has issues on mobile.

Screen Shot 2015-06-01 at 22.18.35

So, I went ahead and reduced the issue to its most basic form: Does a short text snippet support a specific claim? To achieve such a simplified interface, the following must happen:

  • A Wikidata item is picked (at random)
  • The associated Wikipedia articles are investigated
  • The external links of these articles are merged
  • The HTML for these URLs is retrieved, and HTML tags are stripped, leaving only the plain text
  • Claims from the item are prepared. This includes getting the label of “item statements” (Pxx => Qyy), and formatting the dates of “time statements” in various ways (2015-06-01, “June 1, 2015”, etc.)
  • The claim values (labels and dates, for now) are searched for in the plain text of the external URLs above
  • The hits, including some flanking text, are stored in a database

This process is repeated over and again. Finally, an interface presents the hits for a specific claim to the user. A single click can now add that URL as a source to the claim on Wikidata (via WiDaR), together with the original retrieval date (example). The entire set can be marked as “Done” (as in, don’t show this again to anyone), or skipped (claim goes back into the “pool”).
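For the curious, the background process sketched above boils down to something like the following Python snippet. It is a simplified stand-in for the real code: the claim structure, the “database” (here just a list), and the snippet size are all assumptions.

```python
import re
import requests

def plain_text(url):
    """Fetch a URL and strip HTML tags, leaving only (roughly) the plain text."""
    html = requests.get(url, timeout=10).text
    return re.sub(r"<[^>]+>", " ", html)

def search_strings(claim):
    """Turn a claim value into the text snippets to look for."""
    if claim["type"] == "item":                 # Pxx => Qyy: search for the label
        return [claim["label"]]
    if claim["type"] == "time":                 # format dates in various ways
        y, m, d = claim["date"]
        months = ["January", "February", "March", "April", "May", "June", "July",
                  "August", "September", "October", "November", "December"]
        return ["%04d-%02d-%02d" % (y, m, d), "%s %d, %d" % (months[m - 1], d, y)]
    return []

def scan_item(external_urls, claims, hits):
    """Search the external links of one item for text supporting its claims."""
    for url in external_urls:
        text = plain_text(url)
        for claim in claims:
            for needle in search_strings(claim):
                pos = text.find(needle)
                if pos >= 0:
                    # store the hit with some flanking text for the game interface
                    hits.append({"claim": claim["id"], "url": url,
                                 "snippet": text[max(0, pos - 100):pos + len(needle) + 100]})
```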

It is early days for this interface now. No doubt, many improvements are possible, and even though claims are added to the database in the background, there are only ~1,000 claims in there at the time of writing this. Patience.

Ăœberlistet

User_Magnus_Manske_listeria_test_-_Wikipedia,_the_free_encyclopedia_-_2015-05-06_13.20.14

One of the early promises of Wikidata was the improvement of lists on Wikipedia. These would be automatically generated and displayed, solving a number of problems:

  • Make lists on the same topic consistent across Wikipedias
  • Keep all lists up-to-date
  • Track all possible members of the list via items, instead of per-Wikipedia red links
  • A single edit on Wikidata would propagate to all Wikipedias

Like many other features of Wikidata, this one has been delayed for some time now. With WDQ, and the upcoming SPARQL services, there are now several unofficial query services for Wikidata. It’s time to introduce a service for auto-generating lists.

Which brings me to the pun of the blog entry title: It’s the German word for “outwitted”, but it could also be read as “super-listed”. Sadly, umlauts can still cause problems with non-German speakers and keyboards, so I run this tool under a biology pun name: Listeria (actually, a genus of bacteria).

How does this work? On Wikipedia (currently, English and German are supported, but it would be easy to add more), one adds a pair of templates to a Wiki page. Once a day (or on manual request), a bot finds those pages, reads the template parameters, and generates a WDQ-based list of items. The list is implemented as a table, to allow for various properties, including images, to accompany the entry. Items are linked to the respective article on the wiki, or to the Wikidata item if no article exists. The list can be auto-sectioned on a Wikidata property (e.g. the administrative unit of an item). Once generated, the bot compares the list with the one already on the page (between the two templates); if different, the bot replaces the list on the page with the new, up-to-date list.
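To illustrate the “compare and replace” step, here is a rough Python sketch of what the bot’s update cycle might look like. The template markers, the table columns, and the toy data are assumptions for illustration; the real bot code and template parameters may well differ.

```python
import re

# Assumed template markers; the real template names and parameters may differ.
START = r"\{\{Wikidata list.*?\}\}"
END = r"\{\{Wikidata list end\}\}"

def render_list(items):
    """Render resolved items as a wikitext table (heavily simplified)."""
    lines = ['{| class="wikitable"', "! Item"]
    for label, has_article in items:
        # link the local article if it exists, otherwise link the Wikidata item
        lines.append("|-\n| " + ("[[%s]]" % label if has_article else "[[d:%s]]" % label))
    lines.append("|}")
    return "\n".join(lines)

def update_wikitext(old_text, items):
    """Replace whatever sits between the two templates with the fresh list;
    return None if nothing changed, so the bot can skip the edit."""
    pattern = re.compile("(" + START + ")(.*?)(" + END + ")", re.DOTALL)
    m = pattern.search(old_text)
    if not m:
        return None
    new_table = "\n" + render_list(items) + "\n"
    if m.group(2).strip() == new_table.strip():
        return None
    return old_text[:m.start(2)] + new_table + old_text[m.end(2):]

# Toy run with made-up items: one with a local article, one without.
page = "Intro.\n{{Wikidata list|wdq=claim[31:39715]}}\nold table\n{{Wikidata list end}}"
print(update_wikitext(page, [("Brandaris", True), ("Q12345", False)]))
```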

My example page lists Dutch lighthouses, auto-sectioned by administrative unit. I made an English and a German version, using the same template code. They will both be updated at least once a day by the bot; the top template also generates a link to manually trigger the update for a specific page. Starting a new automatic list is as easy as inserting the two templates into a page and filling in their parameters. So, Wikidata-based lists have arrived, after a fashion.

What’s that, you say? Your manual list contains more entries? Well, go to Wikidata, and create or link up items correctly so they all show on the automated list as well! Oh, your manual table contains more details? Add them to Wikidata! That way, any language edition of Wikipedia can enjoy the list and the information it contains. Also, comparing your list to the automatic one can highlight discrepancies, which may point to faulty information somewhere.

Don’t like lighthouses? How about 15th century composers instead, sectioned by nationality? Or 1980s video games, sectioned by company, ordered by date? Your imagination is the limit!

Now, if we only had numbers with units on Wikidata, so we could store the height of those lighthouses…

The games must go on

When I first announced the Wikidata Game almost a year ago, it certainly profited from its novelty value. Since then, it has seen a few new sub-games, and quite a number of code patches from others (which doesn’t happen often for my other tools!). But how does the game fare medium-/long-term?

With >200K “actions” (distinct game decisions, some of which result in edits on Wikidata) in March 2015 alone (an average of >6,500 actions per day, or roughly one action every 13 seconds), it has certainly dropped from its initial popularity (>30,000 actions/day over the first ten days), but is still going. Let’s look at the long-term number of actions per sub-game:

2015-03-game

Most games show the initial “popularity peak”, which does seem to cause the cheapo trend lines I added to point downwards. Some games have started later than others. Some games have ended, because the users have won the game (that is, few or no more candidates remained).

So action numbers are down but stable. But what about distinct user numbers? Let’s look at the “people without birth/death dates” game as an example:

users_people_no_date

Again, we see the initial peak drop off quickly to ~1/4 of its initial value; however, the number of distinct players has remained between 75 and 100 per month over the last 7 months.

All in all, it appears that the Wikidata Game is still in use, and contributing to Wikidata proper, one statement at a time.

Wikidata by country

Since November 2014, I have been collecting statistics on Wikidata by country, initially for Australia, France, New Zealand, and the United Kingdom. The raw, live numbers have always been online here, but staring at raw data has been known to result in aggravated academics. Therefore, I generated a few plots from said data, and accumulated them in a PDF (218kb). Data collection was interrupted for all countries but the UK for some time, but I like to think that even patchy data can make for an interesting read…

Sex and artists

Now that I got your attention … Prompted by a post from Jane Darnell, I thought I would quickly run some gender-related stats on artists in Wikidata. Specifically, the number of articles on Wikipedias for artists with a specific property, by gender.

First, RKDartists (at the moment of writing, 21,859 male and 2,801 female artists on Wikidata):

Number of articles on male (x axis) and female (y axis) artists with the RKDartists property. The line indicates the gender-unbiased coverage. Only Wikipedias with >= RKDartist articles were used.

As we can see, not a single Wikipedia reaches the unbiased line; all Wikipedias are biased towards male biographies. English, German, Dutch, and French Wikipedia come closest; however, that may be due to reaching saturation (as in, almost complete coverage of all RKDartists) rather than intrinsically unbiased views. Otherwise, Finnish Wikipedia seems to be closest to unbiased amongst the “mid-range” Wikipedias.
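For clarity about what I mean by the “unbiased line”: I take it to be the line whose slope is the overall female-to-male ratio in the data set, so a Wikipedia covering x male RKDartists would be expected to cover roughly x · (2,801 / 21,859) female ones. A tiny sketch with made-up per-wiki counts:

```python
# The Wikidata totals are from the post; the per-wiki article counts are made up.
TOTAL_MALE, TOTAL_FEMALE = 21859, 2801
UNBIASED_SLOPE = TOTAL_FEMALE / float(TOTAL_MALE)       # ~0.13 women per man

wikis = {"aawiki": (5000, 400), "bbwiki": (1200, 150)}  # (male, female) articles

for wiki, (male, female) in sorted(wikis.items()):
    expected = male * UNBIASED_SLOPE
    print("%s: %d women covered, %.0f expected for an unbiased wiki"
          % (wiki, female, expected))
```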

Doing the same for ULAN (33,057 men, 3,100 women) looks a little better:

mf_ulan

Here, English Wikipedia actually has a tiny bias towards women. Breton and Maltese appear as “less biased outliers”.

I have uploaded both data sets here.

UPDATE: ODNB

mf_odnb

Pictures, reloaded

About four months ago, I blogged about Wikipedia pages vs. Wikidata items using images. In that post, I predicted that Wikidata would pass German Wikipedia in about four months’ time, so about the end of this month. Using the same metrics, it turns out that it’s a close run:

Site                    2014-11    2015-03      Difference   Per day
enwiki                             1,726,772
enwiki (Commons only)              1,257,691
dewiki                  709,736      729,577        19,841       182
wikidata                604,925      720,360       115,435     1,059
frwiki                  602,664      623,400        20,736       190
ruwiki                  491,916      509,436        17,520       161
itwiki                  451,499      462,879        11,380       104
eswiki                  414,308      425,399        11,091       102
jawiki                  278,359      284,607         6,248        57

So, images in Wikidata items “grow” at about 1,000 per day, or ~900/day faster than German Wikipedia. The difference has shrunk to ~9,000 pages/items. As there are 10 days to go for my prediction, it looks like I’m spot on…

Now, assuming a similar rate for enwiki, Wikidata should pass the Commons usage on en.wp in about two years.
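Spelled out, the arithmetic behind both predictions looks roughly like this (using the table numbers above; the enwiki growth rate is an assumption, as the table has no second data point for it):

```python
# Growth rates from the table (items/pages with an image, per day)
wikidata_per_day, dewiki_per_day = 1059, 182

gap_dewiki = 729577 - 720360                       # dewiki's remaining lead (~9,000)
days_to_pass = gap_dewiki / float(wikidata_per_day - dewiki_per_day)
print(round(days_to_pass, 1))                      # ~10.5 days

# Assume enwiki's Commons-only count grows roughly like the other big Wikipedias,
# say ~200/day -- an assumption, not a figure from the table.
gap_enwiki = 1257691 - 720360
years_to_pass = gap_enwiki / float(wikidata_per_day - 200) / 365
print(round(years_to_pass, 1))                     # ~1.7 years
```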

Linkin mash

So I recently blogged about automatic descriptions based on Wikidata. And as nice as these APIs are, what could they be used for? You got it – demo time!

Linkin Park band member Dave Farrell has no article on English Wikipedia (only a redirect to Linkin Park, which is unhelpful). He does, however, have a Wikidata item with articles in 35 other languages. This is, essentially, the situation you get on smaller Wikipedias – lots of articles in other languages, just not in yours. There is information about the subject, but unless you can read any of those other languages, it’s closed to you.

On English Wikipedia, I created a template for this situation a while ago. Instead of a redlink, you specify the target and the Wikidata item, and you get the “normal” redlink, as well as links to Wikidata and Reasonator. An improvement, undoubtedly, but still rather clunky.

Thus, I resorted to something from the ’90s – a “mash-up” of multiple existing parts. A little bit of Wikipedia mobile view, my automatic description API, the Wikipedia API to render Wikitext as HTML, season with some JavaScript, and boom! we have a Wikipedia clone – with a twist. By default, this mash-up will happily display the Linkin Park article; however, under the “Band members” section (about the middle of the page), Dave Farrell now has just another, normal link:

Band section on Wikipedia

Band section in the mash-up

The mash-up code recognizes the markup generated by the template on Wikipedia, and replaces it with a normal-looking, special link. Clicking on that “Dave Farrell” will lead to a live-generated page. It uses the automatic description API to get Wikitext plus infobox, then uses the Wikipedia API to render that as (mobile) HTML. And while the text is a little bit dull, it looks just like any other Wikipedia page rendered through the mash-up, image and all.
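As a rough Python sketch of that chain: the description-API endpoint and its parameters below are assumptions from memory, while the MediaWiki parse call is the standard API; Q42 is only a stand-in item.

```python
import requests

def item_wikitext(qid, lang="en"):
    """Fetch auto-generated wikitext (description plus infobox) for a Wikidata item.
    The endpoint, parameter names, and response key here are assumed for illustration."""
    r = requests.get("https://tools.wmflabs.org/autodesc/",
                     params={"q": qid, "lang": lang, "mode": "long",
                             "links": "wikipedia", "format": "json"})
    return r.json()["result"]

def render_html(wikitext, lang="en"):
    """Render wikitext to HTML via the standard MediaWiki parse API."""
    r = requests.get("https://%s.wikipedia.org/w/api.php" % lang,
                     params={"action": "parse", "text": wikitext,
                             "contentmodel": "wikitext", "format": "json"})
    return r.json()["parse"]["text"]["*"]

# The mash-up would inject this HTML wherever it recognizes the template-generated
# link; Q42 is just an example item, not Dave Farrell's.
html = render_html(item_wikitext("Q42"))
```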

I am well aware of the current limitations of this approach, including the potential deterrent to creating “proper” articles. However, with the much-hyped next billion internet users, many of them limited to the smaller Wikipedias, banging on our virtual door, such a mechanism could be a stop-gap measure to provide at least basic information in smaller languages, in a user-friendly way. Details of text generation for those languages, infoboxes, integration into Wikipedia proper, redlink markup, etc. would have to be worked out for this to happen.

Dave Farrell, as rendered by the mash-up


UPDATE: Aaaand… my template introduction has been reverted, effectively breaking the demo. I’m not going to start an edit war over this; you still have the screenshots.