
A quick description

So there is a lively discussion about using descriptions from Wikidata in places like Wikipedia search results, especially on mobile. While everyone seems to agree that this is a good idea, camps are forming with supporters of manual and automatically generated descriptions, respectively. Time for an entirely manual description of my POV.

At the time of writing this, there are about 14 million items on Wikidata. The WikiVerse deals with about 250 languages. That comes to ~3.5 billion possible descriptions of items, a number that will only increase with time. Right now, less than 4% of these descriptions are filled in, many of them generated by bots (e.g. “Wikimedia disambiguation page”, not all of them correctly). And do not kid yourselves: the manual descriptions that do exist will stay as they are. They will continue to say “American actor”. Even after we add statements about his/her nationality, gender, birth and death dates, spouses, parents, children, important awards, etc., the description will still say “American actor”. There are not nearly enough volunteers to fill in >3 billion descriptions, especially in the ~240 or so non-“main” languages; most of those have few enough labels for the items as it is. Except for maybe English, there are no people to go around and improve existing descriptions, probably multiple times for the same item. For most people in the world, Wikidata manual item descriptions are a wasteland, and that wasteland is here to stay.

But there is an alternative. A bot can look at an item, see it’s about a person with nationality “U.S.”, and occupation “actor”. It can, from that, write “American actor”. It can, in fact, do much better than that, given the right statements. It will improve its description as more information becomes available. And it can do so in all 250 languages, given a little volunteer effort for each of them. It won’t win a literature contest any time soon, but it will get the basic message across, in most cases.


As a hands-on person, I wrote a little tool a while ago, which attempts to do just that. Limited by time and my 2-out-of-250 language abilities, it is far from perfect, or even working properly for many languages. But let me give an example.

There is an article about a specific model of “flying boat” in several languages, the Dornier Do J. On Wikidata, there is a (as in: one) manual description, in Italian, for the respective item, which reads “idrovolante Dornier-Werke”. I don’t speak Italian, but it looks … truncated? (Google translate agrees with this assessment.)

So I ran my automatic description on this, for a few languages:

English: Dornier Do J : Flying boat by Dornier
German: Dornier Wal : Flugboot von Dornier-Werke
French: Dornier Do J : Hydravion à coque par Dornier
Spanish: Dornier Do J : Hidrocanoa por Dornier Flugzeugwerke
Japanese: Do J : 飛行艇 by ドルニエ
Vietnamese: Dornier Do J : Tàu bay bởi Dornier Flugzeugwerke
Telugu: Q1245981 : ఎగిరే పడవ Dornier Flugzeugwerke చేత తయారు చేయబడినది

Perfect? Certainly not. Wrong? In some cases; “by” is not really a Japanese word, as far as I know. But I would think that most Japanese readers would know what the item is about, from that description.
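To make the idea concrete, here is a minimal sketch of how such a description could be assembled. It is not the code of my actual tool; the item structure, the `CONNECTORS` table and the `Q_TYPE`/`Q_MAKER` placeholder IDs are assumptions made up for illustration. It does reproduce two behaviours visible above: a missing label falls back to the item ID, and a missing connector word falls back to English.

```javascript
// Hypothetical, hand-written connector words; a real tool needs far more
// per-language knowledge (word order, grammar, gender, ...).
const CONNECTORS = { en: "by", de: "von", fr: "par", es: "por" };

// Pick the label in the requested language, falling back to the item ID.
function labelFor(item, lang) {
  return (item.labels && item.labels[lang]) || item.id;
}

// Build "<label> : <type> <connector> <maker>" from an item's statements.
function describe(item, lang) {
  const connector = CONNECTORS[lang] || CONNECTORS.en; // untranslated fallback
  const type = labelFor(item.instanceOf, lang);
  const text = item.manufacturer
    ? `${type} ${connector} ${labelFor(item.manufacturer, lang)}`
    : type;
  const capitalized = text.charAt(0).toUpperCase() + text.slice(1);
  return `${labelFor(item, lang)} : ${capitalized}`;
}

// Simplified stand-in for the Dornier Do J item; only Q1245981 is a real ID
// taken from the example above, the other IDs are placeholders.
const doJ = {
  id: "Q1245981",
  labels: { en: "Dornier Do J", de: "Dornier Wal", ja: "Do J" },
  instanceOf: {
    id: "Q_TYPE",
    labels: { en: "flying boat", de: "Flugboot", ja: "飛行艇" },
  },
  manufacturer: {
    id: "Q_MAKER",
    labels: { en: "Dornier", de: "Dornier-Werke", ja: "ドルニエ" },
  },
};

for (const lang of ["en", "de", "ja"]) {
  console.log(`${lang}: ${describe(doJ, lang)}`);
}
```

As more statements are added to an item, a function like `describe` can pick them up without anyone having to revisit the text by hand.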

Note that there are no Wikipedia articles about this topic in Vietnamese, nor Telugu. These texts (as good or bad as they may be) could show up in a Telugu Wikidata search. Or a Wikipedia one, even if no te.wikipedia results were found. The code exists, and is used (e.g. on Italian Wikipedia) already.

You can see the automatic descriptions for the Wikipedia page you are on yourself. Simply add

mw.loader.load("// Manske/autodesc.js&action=raw&ctype=text/javascript");

to your common.js User subpage, or to your global JavaScript page, which will activate it on all Wikipedias you work on. I found this a great way to see where the Wikidata item is lacking, and needs some more statements, or where items need a label in your language.

A suggestive tool

Do you know what “Stomatitis” is? Neither did I. But there is an article about it on German Wikipedia, and when I found it, it had a blank Wikidata item. Now, I happen to speak German, but I have run into plenty of other blank items with, say, a Russian Wikipedia article, which is not exactly my forte. I could go to Google Translate, but that is oh so inconvenient. And then I would still have to figure out which properties and items I should link to. “Human” (P31:Q5) is easy enough to remember, but what was the item for “sex:female” again?

Then I thought: I might not know what the text of the article says, but I bet it is in one or more categories. And these categories have other articles in them, articles similar to the article I can’t read. And many of these articles should have Wikidata items. So I could look at these items, see what statements are common among those, and some of the top ones will probably apply to my blank item as well.
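A rough sketch of that counting step, under the assumption that the statements of the “sibling” items have already been fetched into plain objects (the function name, the data layout and the `P_X`/`Q_A` placeholder IDs are made up here; only P31:Q5 is a real pair from above):

```javascript
// siblingItems: one object per Wikidata item whose article shares a category
// with the blank item, mapping property IDs to lists of value IDs.
// Returns the statements that occur in at least `minShare` of the siblings,
// most common first.
function rankStatements(siblingItems, minShare = 0.5) {
  const counts = new Map();
  for (const statements of siblingItems) {
    for (const [property, values] of Object.entries(statements)) {
      // Count each property:value pair at most once per item.
      for (const value of new Set(values)) {
        const key = `${property}:${value}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return [...counts.entries()]
    .map(([statement, count]) => ({
      statement,
      share: count / siblingItems.length,
    }))
    .filter((entry) => entry.share >= minShare)
    .sort((a, b) => b.share - a.share);
}

// Four hypothetical sibling items.
const siblings = [
  { P31: ["Q5"], P_X: ["Q_A"] },
  { P31: ["Q5"], P_X: ["Q_B"] },
  { P31: ["Q5"], P_X: ["Q_A"] },
  { P31: ["Q5"] },
];

const ranking = rankStatements(siblings);
console.log(ranking);
```

The threshold is a judgment call: set it too low and the suggestions fill up with statements that only apply to a few siblings, too high and sparse categories yield nothing.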

You all know what we need now: More tools! So I wrote some code on Labs, which will give me the “statement ranking” for a single language. I also wrote a JavaScript wrapper around it. It will add a link called “Suggestor” to your toolbar on the left. This is what it looks like:

[Screenshot: Suggestor results for the blank item]

It’s definitely a disease, and the “medical specialty” looks right too. If the source describes it, I do not know, but it might be worth finding out.

To add this handy function to your Wikidata experience, simply add

importScript( 'User:Magnus_Manske/suggestor.js' );

to your common.js user subpage. Enjoy!

Add it to the pile!

I have previously blogged about Wikipedia-related page lists, and how they relate to many tools and activities. I also lamented my previous, failed attempts at introducing a “tool pipeline system”.

Well, I am not one to give up easily! The latest, greatest iteration in this vein is PagePile. Essentially, this new tool is managing piles (newspeak for “lists”) of pages from Wikipedia, Wikidata, Commons, and other projects from the WikiVerse.


Filtering a list.

New piles can be taken from various sources, including manual lists, WDQ, and the Gather extension. Several of my tools can also generate piles, including AutoList, CatScan, QuickIntersection, and Not-in-the-other-language. Either way, you end up with a numeric PagePile ID.

What can you do with that ID? First of all, you can look at the list (that example leads to the list of all humans on Wikidata, ~2.8M items long), and download it in various formats.

You can filter the list, creating a new list (with a new ID) by following language links, resolving redirects, merging and subsetting with other lists, etc.

Finally, you can import them into several of my tools, including AutoList, FIST, WD-FIST, Not-in-the-other-language, and GetItemNames.

This list will likely grow; it is quite easy to add PagePiles as an input and/or output to a tool. Let me know if there is a tool you would like to see connected to the PagePile ecosystem; likewise for new filters.


If you are a tool author on Labs, you might want to consider linking up to the obvious possibilities of this system. I made a brief introduction for programmers, put the code on BitBucket, and I am working on some code documentation.

Basically, the tool manages a list of sqlite files, each of which represents a pile (=list) of pages on a wiki. You can get the file name of the sqlite3 file from the API or via the PHP class described in the intro. Via that class, or using sqlite3 directly, you can read and write that file, adding and changing lists. Please let me know if you have problems or comments, and if you start using PagePile in your tools, so I can add them to my consumer and/or generator lists.


While I do occasionally write Wikimedia tools “to order”, I wrote quite a few of them because I required (or just enjoyed) the functionality myself. One thing I like to do is adding images to Wikidata, using WD-FIST. Recently, I started to focus on a specific list, people with awards (of any kind). People with awards are, in general, more likely to have an image; also, it can be satisfying to see a “job list” shrink over time. So for this one, I logged some data points:

[Screenshot: plot of the logged data points]

Over the last 2-3 weeks, even my sporadic use of the tool has reduced the list by 1/4 (note the plateau when Labs was offline!). Some thoughts along the way:

  • The list of item candidates is re-calculated on every page load, and is not stable. As awards are more likely to be added to than removed from items, the total list of people with awards is likely to be longer today than it was at the beginning of this exercise.
  • I cannot take credit for all of this reduction; images that were added to Wikidata independently, but to items on this list by chance, likewise reduce the number of items on the list.
  • Not all of the items I “dealt with” now have an image; many had their candidate images suppressed thanks to a recently implemented function, for cases where none of the Wikipedia candidate images for a person depict the person, but rather a navbox icon, or something associated with the person (a sculpture made by the person, a house the person lived in, etc.)
  • Many items were “dealt with” by setting a “grave image”. These seem to be surprisingly (to me at least) popular on Wikipedia, especially for people from the former Soviet Union, for some reason.
  • I skipped many items where either the item label or the image name is in non-Latin characters. Oddly enough, I can match images to items quite well if both are in the same (non-Latin) script, by visual comparison 😉
  • I also skipped many items where a candidate image shows multiple people. I tried my hand at generating cropped images for specific people with the excellent CropTool, but that remains quite slow compared to the usual WD-FIST actions. Maybe that changes if I can find a way to pre-fill the CropTool values (e.g. “create new image with this name”).
  • Based on a gut feeling, the “low-hanging fruit” will probably run out at ~10-15K items.
  • A sore point for me are statues of people; sometimes, I use close-ups of statues as an image of the person, when no proper image is available. I’m not sure if that is the right thing to do; it often seems to cover the likeness of the person (at least, better than “no image”), but somehow it feels like cheating…
  • There should be a “pictures of people” project somewhere, making prioritized lists of people to get an image for, then systematically “hunt them down” (e.g. ask these people or their heirs for free images, check other free image sources in print and online, group them by “likely event” where they could show up in the future, etc.).
  • I could really use some help for the “Cyrillic people”, towards the end of the list.

Wikidata has passed German Wikipedia in terms of articles/items with an image of the subject, and is now second only to English Wikipedia in that regard. As if in celebration, I added several new features to my Wikidata FIST tool, which makes adding images to Wikidata as easy as a single click (or two, if it’s a plaque, coat of arms, map etc.).

The first feature suppresses an image suggestion if that image is already used in a Wikidata item. This cuts down on already associated “grave pictures”, as well as “symbol pictures” from infoboxes.

The second is “JPEG only”, if you are looking for actual photographs, and not scans, maps etc.

Third, each item with image candidates now has a little yellow button, which will prevent the image candidates from being shown again for this item. While there are several media properties for items, some things (paintings by an artist, buildings by an architect etc.) will never be directly added to the item; instead, each painting/building should have its own item, and link to its creator. So if all image candidates are of that nature, spare yourself and others from having to go through them again next time.
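Taken together, the three features amount to a filter over the candidate images. The following is only a sketch of that logic, not the WD-FIST code itself; the function name, the option names and the sample data are assumptions for illustration.

```javascript
// candidates: item ID -> list of candidate file names.
// options.usedOnWikidata: Set of file names already used in some item.
// options.suppressedItems: Set of item IDs where the "yellow button" was used.
// options.jpegOnly: keep only .jpg/.jpeg files when true.
function filterCandidates(candidates, options) {
  const { usedOnWikidata, suppressedItems, jpegOnly } = options;
  const result = {};
  for (const [itemId, files] of Object.entries(candidates)) {
    if (suppressedItems.has(itemId)) continue; // never show this item again
    const kept = files.filter((file) => {
      if (usedOnWikidata.has(file)) return false; // already someone's image
      if (jpegOnly && !/\.jpe?g$/i.test(file)) return false; // scans, maps...
      return true;
    });
    if (kept.length > 0) result[itemId] = kept;
  }
  return result;
}

// Hypothetical sample data.
const filtered = filterCandidates(
  {
    Q_PERSON_1: ["Portrait.jpg", "Grave.JPG", "Map.svg"],
    Q_PERSON_2: ["Sculpture by person.jpg"],
    Q_PERSON_3: ["Navbox icon.png"],
  },
  {
    usedOnWikidata: new Set(["Grave.JPG"]),
    suppressedItems: new Set(["Q_PERSON_2"]),
    jpegOnly: true,
  }
);
console.log(filtered);
```

Items that end up with no remaining candidates simply drop off the list, which is part of why the list shrinks faster than images are added.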

Musing on lists

As some of you may know, I write the occasional tool to help support Wikipedia, Wikidata, Commons, and other projects in the WikiVerse. Most of my tools work on the same basic principle: Get some data to start with, think about it, and present a result. The input data is often a list of pages (or Wikidata items, which is similar), defined by some sort of query.

Now, the number of potential sources for such lists has been multiplying over the years. Off the top of my head, I can think of:

  • Manual lists (paste in a box)
  • Lists on Wikis (numbered, unnumbered, with or without links, with comments, in tables etc.)
  • Category trees
  • Category tree intersections (e.g. QuickIntersection)
  • More complex intersections of category trees, templates, etc. (e.g. CatScan2)
  • Wikidata Queries
  • Complex intersections of categories, WDQ, lists etc. (e.g. AutoList)
  • SQL queries (e.g. Quarry)
  • SPARQL queries (e.g. WDQS)
  • All the tools that use any combination of the above, and generate page lists in return, could be sources again

The problem, however, is more complicated than this:

  • Most tools that process lists could potentially use most of the above sources, or combinations thereof, even if this is not apparent at first glance; a Wikidata tool can still use lists of French Wikipedia articles, as they can be “converted” into corresponding Wikidata items, and vice versa
  • Any of these sources can be combined in several ways; e.g. only pages that are in list A and (list B or list C)
  • These can be combined with non-list properties (last edited less than a month ago, excluding bots; created over 5 years ago; edited by one of these users; use the matching talk/content page; no redirects)
  • This can be done recursively; the same source “types” can be used several times, in a complex query
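To illustrate the recursive combination, here is a small sketch that evaluates a nested query over named page lists. The query format (`op`/`args`), the operator names and the sample lists are invented for this example; none of my tools expose exactly this interface.

```javascript
// A query is either the name of a source list, or an object
// { op: "and" | "or" | "minus", args: [query, ...] }.
function evaluate(query, sources) {
  if (typeof query === "string") {
    if (!Object.prototype.hasOwnProperty.call(sources, query)) {
      throw new Error(`Unknown source: ${query}`);
    }
    return new Set(sources[query]);
  }
  // Evaluate sub-queries first; this is where the recursion happens.
  const [first, ...rest] = query.args.map((arg) => evaluate(arg, sources));
  switch (query.op) {
    case "and":
      return rest.reduce(
        (acc, set) => new Set([...acc].filter((page) => set.has(page))),
        first
      );
    case "or":
      return rest.reduce((acc, set) => new Set([...acc, ...set]), first);
    case "minus":
      return rest.reduce(
        (acc, set) => new Set([...acc].filter((page) => !set.has(page))),
        first
      );
    default:
      throw new Error(`Unknown operator: ${query.op}`);
  }
}

// "Only pages that are in list A and (list B or list C)".
const pageLists = {
  A: ["Berlin", "Paris", "Rome"],
  B: ["Paris"],
  C: ["Rome", "Oslo"],
};
const combined = evaluate(
  { op: "and", args: ["A", { op: "or", args: ["B", "C"] }] },
  pageLists
);
console.log([...combined]);
```

Because every operator returns a list again, the result of one query can serve as a source for the next, which is the whole point of a pipeline.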

I have previously tried to allow users to construct a query pipeline, combining the outputs of different tools, and processing (e.g. filtering) them through more tools in new and interesting ways. However, that attempt was not taken up, either by users or by tool developers.

I tried again to solve the issue, this time by putting the “pipeline” into JavaScript, running right in the users’ web browser. However, usage numbers (except for a single, quite active user) show that again, there was no uptake by users in general.

Maybe I am the only one in the WikiVerse thinking about this? Maybe my attempts are still too clunky for the average user? Maybe there is just no demand, and all the tools run perfectly fine as they are?

There seems to be some general interest in lists; my list generating bot appears to be reasonably popular with users on Wikipedia, albeit not in the article namespace. And an experimental, manual list-generating feature called Gather on mobile Wikipedia seems to be popular. Maybe I am just missing the “killer application” for lists, though the point is that all tools and applications could benefit from list management.

The Game of Source

Wikidata has beautiful mechanisms to associate individual claims with sources for that claim. However, finding and adding such sources is surprisingly complex, and, between multiple open tabs and the somewhat sluggish interface, can strain the patience of the most well-meaning editor.

I had previously attempted to simplify adding sources to Wikidata statements; and while I believe this interface to be much easier to use than Wikidata proper, it is still clunky, and has issues on mobile.

So, I went ahead and reduced the issue to its most basic form: Does a short text snippet support a specific claim? To achieve such a simplified interface, the following must happen:

  • A Wikidata item is picked (at random)
  • The associated Wikipedia articles are investigated
  • The external links of these articles are merged
  • The HTML for these URLs is retrieved, and HTML tags are stripped, leaving only the plain text
  • Claims from the item are prepared. This includes getting the label of “item statements” (Pxx => Qyy), and formatting the dates of “time statements” in various ways (2015-06-01, “June 1, 2015”, etc.)
  • The claim values (labels and dates, for now) are searched for in the text retrieved from the external URLs above
  • The hits, including some flanking text, are stored in a database
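The text-processing part of these steps can be sketched as follows. This is a simplified illustration rather than the code running behind the game: the function names are made up, the tag stripping is deliberately crude, only three date formats are generated, and the search is a plain case-sensitive substring match.

```javascript
// Crude tag stripper: drops script/style blocks, then all remaining tags.
// Real-world HTML would call for a proper parser.
function stripTags(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// A few textual renderings of a date given as "YYYY-MM-DD".
function dateVariants(isoDate) {
  const [year, month, day] = isoDate.split("-").map(Number);
  const name = MONTHS[month - 1];
  return [isoDate, `${name} ${day}, ${year}`, `${day} ${name} ${year}`];
}

// Find every occurrence of each needle, with some flanking text for context.
function findHits(text, needles, flank = 30) {
  const hits = [];
  for (const needle of needles) {
    if (!needle) continue; // an empty needle would match everywhere
    let at = text.indexOf(needle);
    while (at !== -1) {
      hits.push({
        needle,
        snippet: text.slice(Math.max(0, at - flank), at + needle.length + flank),
      });
      at = text.indexOf(needle, at + needle.length);
    }
  }
  return hits;
}

// Hypothetical page fragment and a "time statement" value to look for.
const pageText = stripTags("<p>Born <b>June 1, 2015</b> in Cambridge.</p>");
const hits = findHits(pageText, dateVariants("2015-06-01"), 5);
console.log(pageText);
console.log(hits);
```

Storing the flanking text alongside the hit is what lets the interface show a human just enough context to decide, without loading the source page again.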

This process is repeated over and over again. Finally, an interface presents the hits for a specific claim to the user. A single click can now add that URL as a source to the claim on Wikidata (via WiDaR), together with the original retrieval date (example). The entire set can be marked as “Done” (as in, don’t show this again to anyone), or skipped (claim goes back into the “pool”).

It is early days for this interface now. No doubt, many improvements are possible, and even though claims are added to the database in the background, there are only ~1,000 claims in there at the time of writing this. Patience.


One of the early promises of Wikidata was the improvement of lists on Wikipedia. These would be automatically generated and displayed, solving a number of problems:

  • Solve inconsistent lists on the same topic across Wikipedias
  • Keep all lists up-to-date
  • Track all possible members of the list via items, instead of per-Wikipedia red links
  • A single edit on Wikidata would propagate to all Wikipedias

Like many other features of Wikidata, this one has been delayed for some time now. With WDQ, and the upcoming SPARQL services, there are now several unofficial query services for Wikidata. It’s time to introduce a service for auto-generating lists now.

Which brings me to the pun of the blog entry title, “Überlistet”: It’s the German word for “outwitted”, but it could also be read as “super-listed”. Sadly, umlauts can still cause problems with non-German speakers and keyboards, so I run this tool under a biology pun name: Listeria (actually, a genus of bacteria).

How does this work? On Wikipedia (currently, English and German are supported, but it would be easy to add more), one adds a pair of templates to a Wiki page. Once a day (or on manual request), a bot finds those pages, reads the template parameters, and generates a WDQ-based list of items. The list is implemented as a table, to allow for various properties, including images, to accompany the entry. Items are linked to the respective article on the wiki, or to the Wikidata item if no article exists. The list can be auto-sectioned on a Wikidata property (e.g. the administrative unit of an item). Once generated, the bot compares the list with the one already on the page (between the two templates); if different, the bot replaces the list on the page with the new, up-to-date list.
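The compare-and-replace step can be sketched like this. The template names `{{List start}}` and `{{List end}}` are placeholders for this example, not the names the bot actually looks for, and the list is rendered as simple bullets instead of a table to keep the sketch short.

```javascript
// Group items into sections by the value of one property.
function groupBySection(items, sectionProperty) {
  const sections = new Map();
  for (const item of items) {
    const key = item[sectionProperty] || "Other";
    if (!sections.has(key)) sections.set(key, []);
    sections.get(key).push(item.label);
  }
  return sections;
}

// Render the sections as wikitext (bullets here; the bot uses a table).
function renderList(sections) {
  const lines = [];
  for (const [heading, labels] of sections) {
    lines.push(`== ${heading} ==`);
    for (const label of labels) lines.push(`* ${label}`);
  }
  return lines.join("\n");
}

// Replace whatever sits between the two (placeholder) templates,
// but only if the freshly generated list actually differs.
function updatePage(pageText, newList) {
  const pattern = /(\{\{List start[^}]*\}\})([\s\S]*?)(\{\{List end\}\})/;
  const match = pageText.match(pattern);
  if (!match || match[2].trim() === newList.trim()) {
    return { text: pageText, changed: false };
  }
  // A replacer function avoids "$" in the list being treated specially.
  const text = pageText.replace(
    pattern,
    (whole, start, oldList, end) => `${start}\n${newList}\n${end}`
  );
  return { text, changed: true };
}

// Hypothetical query result and page.
const lighthouses = [
  { label: "Lighthouse A", adminUnit: "Province 1" },
  { label: "Lighthouse B", adminUnit: "Province 2" },
  { label: "Lighthouse C", adminUnit: "Province 1" },
];
const page = "Intro\n{{List start|section=adminUnit}}\nold list\n{{List end}}\nOutro";
const newList = renderList(groupBySection(lighthouses, "adminUnit"));
const firstRun = updatePage(page, newList);
const secondRun = updatePage(firstRun.text, newList);
console.log(firstRun.text);
console.log(firstRun.changed, secondRun.changed);
```

Skipping the edit when nothing has changed keeps the page history clean, which matters for a bot that visits every list once a day.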

My example page lists Dutch lighthouses, auto-sectioned by administrative unit. I made an English and a German version, using the same template code. They will both be updated at least once a day by the bot; the top template also generates a link to manually trigger the update for a specific page. Starting a new automatic list is as easy as inserting and filling the two templates into a page. So, Wikidata-based lists have arrived, after a fashion.

What’s that, you say? Your manual list contains more entries? Well, go to Wikidata, and create or link up items correctly so they all show on the automated list as well! Oh, your manual table contains more details? Add them to Wikidata! That way, any language edition of Wikipedia can enjoy the list and the information it contains. Also, comparing your list to the automatic one can highlight discrepancies, which may point to faulty information somewhere.

Don’t like lighthouses? How about 15th century composers instead, sectioned by nationality? Or 1980s video games, sectioned by company, ordered by date? Your imagination is the limit!

Now, if we only had numbers with units on Wikidata, so we could store the height of those lighthouses…

The games must go on

When I first announced the Wikidata Game almost a year ago, it certainly profited from its novelty value. Since then, it has seen a few new sub-games, and quite a number of code patches from others (which doesn’t happen often for my other tools!). But how does the game fare medium-/long-term?

With >200K “actions” (distinct game decisions, some of which result in edits on Wikidata) in March 2015 alone (an average of >6,500 actions per day, or roughly one action every 13 seconds), it has certainly dropped from its initial popularity (>30,000 actions/day over the first ten days), but is still going. Let’s look at the long-term number of actions per sub-game:


Most games show the initial “popularity peak”, which does seem to cause the cheapo trend lines I added to point downwards. Some games have started later than others. Some games have ended, because the users have won the game (that is, few or no more candidates remained).

So action numbers are down but stable. But what about distinct user numbers? Let’s look at the “people without birth/death dates” game as an example:

[Plot: distinct users per month for the “people without birth/death dates” game]

Again, we see the initial peak drop off quickly to ~1/4 of its initial value; however, the number of distinct players remains between 75 and 100 per month, over the last 7 months.

All in all, it appears that the Wikidata Game is still in use, and contributing to Wikidata proper, one statement at a time.

Wikidata by country

Since November 2014, I have been collecting statistics on Wikidata by country, initially for Australia, France, New Zealand, and the United Kingdom. The raw, live numbers have always been online here, but staring at raw data has been known to result in aggravated academics. Therefore, I generated a few plots from said data, and accumulated them in a PDF (218kb). Data collection was interrupted for all countries but the UK for some time, but I like to think that even patchy data can make for an interesting read…