
The Reference Wars

In a recent Wikipedia Signpost Op-Ed, Andreas Kolbe wrote about Wikidata and references. He comes to the conclusion that Wikidata needs more (non-Wikipedia) references, a statement I wholeheartedly agree with. He also divines that this will never happen, that Wikidata is doomed, while at the same time somehow being controlled by Google and Microsoft; I will not comment on these “conclusions”, as others have already done so elsewhere.

Andreas also uses my own Wikidata statistics to make his point about missing references on Wikidata. The numbers I show are useful, IMHO, to show the remarkable progress of Wikidata, but they are much too crude to draw conclusions about the state of references there. Also, the impression I get from Andreas’ text is that, while Wikipedia has some issues, references are basically OK, whereas they are essentially non-existent in Wikidata.

So I thought I’d have a look at some actual numbers, especially comparing Wikipedia and Wikidata in terms of references.

One key issue is that there is no built-in way to get metrics about statements and references from Wikipedia. I therefore developed my own approach. Given a Wikipedia article, I use the REST API to get HTML for the article. I then count the number of reference uses (essentially, <ref> tags) in the article; note that this number is larger than (or at least equal to) the number of references at the bottom of the page. Then, I strip the HTML tags, and count the number of sentences (starts with an upper-case character, has at least 50 characters, ends with a “.”); the numbers were checked manually for a few example articles against other sentence counting tools on the web, which yielded similar results. I then assume that each sentence in the article contains one statement (or fact); in reality, there are likely many such statements (such as the first sentence of a biographical article), but I am aiming for a lower boundary here. (Any sentence not containing a statement/fact should be deleted from Wikipedia anyway.) A useful metric derived from the number of reference uses and the number of statements (=sentences) is the references-per-statement (RPS) ratio.
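
The counting heuristic above could be sketched roughly as follows. This is not the script used for the numbers in this post; it assumes that each reference use appears in the HTML as a <sup> element carrying typeof="mw:Extension/ref", and the function names are made up for illustration.

```javascript
// Sketch only: assumes each reference use is rendered as a <sup> element
// carrying typeof="mw:Extension/ref"; adjust the patterns to the real markup.
const REF_OPEN = /<sup\b[^>]*typeof="[^"]*mw:Extension\/ref\b[^"]*"[^>]*>/g;
const REF_ELEMENT =
  /<sup\b[^>]*typeof="[^"]*mw:Extension\/ref\b[^"]*"[^>]*>[\s\S]*?<\/sup>/g;

// Number of reference uses (not the number of distinct references).
function countReferenceUses(html) {
  return (html.match(REF_OPEN) || []).length;
}

// Number of "sentences": chunks that end with ".", start with an
// upper-case letter, and are at least 50 characters long.
function countSentences(html) {
  const text = html
    .replace(REF_ELEMENT, ' ') // drop "[1]" markers so they do not start a chunk
    .replace(/<[^>]*>/g, ' ') // strip the remaining tags
    .replace(/\s+/g, ' ');
  const chunks = text.match(/[^.]+\./g) || [];
  return chunks
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length >= 50 && /^\p{Lu}/u.test(chunk)).length;
}

// References-per-statement, treating one sentence as one statement.
function referencesPerStatement(html) {
  const sentences = countSentences(html);
  return sentences === 0 ? 0 : countReferenceUses(html) / sentences;
}
```

Splitting on every “.” is crude (decimal numbers and abbreviations will cut sentences short), which is one more reason to read the result as a rough estimate.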

For Wikidata, a similar metric can be calculated. For practical purposes, I skip statements of the “string” type, as they are mostly external references in themselves (e.g. VIAF identifiers); I also skip “media”-type statements, as they should have “references” in their file description page on Commons. For references, I do not count “imported from Wikipedia”, as these are not “real” references, but rather placeholders for future improvement. Again, an RPS ratio can be computed.
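
A minimal sketch of the same count for a Wikidata item, assuming the entity is available in the JSON shape returned by the Wikidata API (claims grouped by property, each with a mainsnak and optional references). The rule that a reference is ignored as soon as it uses P143 is an interpretation of the text above, not a confirmed detail.

```javascript
// Statement datatypes that are skipped. Note: identifiers were "string" at
// the time of writing; they now have their own "external-id" datatype, which
// could be added to this set.
const SKIPPED_DATATYPES = new Set(['string', 'commonsMedia']);
// P143 is the "imported from" property used for Wikipedia imports.
const IMPORTED_FROM = 'P143';

function wikidataRps(entity) {
  let statements = 0;
  let references = 0;
  for (const claims of Object.values(entity.claims || {})) {
    for (const claim of claims) {
      if (SKIPPED_DATATYPES.has(claim.mainsnak.datatype)) continue;
      statements += 1;
      for (const ref of claim.references || []) {
        // Assumption: a reference counts only if it does not use P143 at all.
        if (!(IMPORTED_FROM in (ref.snaks || {}))) references += 1;
      }
    }
  }
  return { statements, references, rps: statements ? references / statements : 0 };
}
```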

I then calculated these ratios for 4,683 Featured Articles from English Wikipedia and their associated Wikidata items (data). As these articles have been significantly worked over and approved by the English Wikipedia community, they should represent the “best case scenario” for Wikipedia.

Indeed, the RPS ratio is higher for Wikipedia in 87% of cases, which would mean that Wikipedia is better referenced than Wikidata. But keep in mind that this represents the best of the best of the best of English Wikipedia articles, fifteen years in the making, compared to a three-and-a-half-year old Wikidata (and references were not supported for the first year or so). This is as good as it gets for Wikipedia, and still, Wikidata has a better RPS in about 13% of cases.

Even more interesting IMHO: Taking the mean of both number of statements and number of references for both Wikipedia and Wikidata, respectively, and calculating the RPS ratios for those means, yields 0.32 for Wikipedia and 0.15 for Wikidata. This seems counter-intuitive, given the previous 87/13 “ratio of ratios”. However, further investigation shows that only 1,305 (~28%) of Wikidata items have any references at all, but where there are references, they usually outshine Wikipedia; about half of the items with at least one reference have a better RPS ratio than the respective Wikipedia article. This seems to indicate a “care factor” at work; where someone cared about adding references to the item, it was done quite well. Wikidata RPS ratios range up to 1.5, meaning two statements are, on average, supported by three references, whereas Wikipedia reaches “peak RPS ratio” at 0.93, or slightly less than one reference per statement.
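
To see how a lopsided per-article comparison and a much closer aggregate ratio can coexist, here is a toy calculation. The numbers are invented for illustration and are not the measured data; the function names are likewise made up.

```javascript
// RPS for a single { statements, references } row.
const rps = (row) => (row.statements ? row.references / row.statements : 0);

// RPS computed from the totals (equivalent to using the means of both columns).
function aggregateRps(rows) {
  const total = rows.reduce(
    (sum, row) => ({
      statements: sum.statements + row.statements,
      references: sum.references + row.references,
    }),
    { statements: 0, references: 0 }
  );
  return rps(total);
}

// Share of pairs in which the first data set has the strictly better RPS.
function shareWonByFirst(first, second) {
  const wins = first.filter((row, i) => rps(row) > rps(second[i])).length;
  return wins / first.length;
}

// Invented toy numbers: four articles and their four items.
const wikipedia = [
  { statements: 40, references: 10 },
  { statements: 40, references: 10 },
  { statements: 40, references: 10 },
  { statements: 40, references: 10 },
];
const wikidata = [
  { statements: 8, references: 0 },
  { statements: 8, references: 0 },
  { statements: 8, references: 0 },
  { statements: 8, references: 4 }, // the one item somebody cared about
];

console.log(shareWonByFirst(wikipedia, wikidata)); // 0.75
console.log(aggregateRps(wikipedia)); // 0.25
console.log(aggregateRps(wikidata)); // 0.125
```

In this toy set, Wikipedia wins three out of four comparisons, yet the single referenced item has twice the RPS of its article (0.5 vs 0.25), which is the “care factor” pattern in miniature.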

I believe these numbers show that Wikidata can equal and surpass Wikipedia in terms of “referencedness”, but it is a function of attention to the items, which in turn is a matter of man- and bot-hours spent. Indeed, for the Wikidata showcase items (the equivalent of Featured Articles on Wikipedia), the Wikidata RPS ratio is better than that of the associated English Wikipedia article in 19 out of 24 cases (~80%).

So will Wikidata ever catch up to Wikipedia in terms of RPS ratio? I think so. The ability of Wikidata to be reliably edited by a machine allows for improvement by automated and semi-automated bots, tools, games, on-wiki gadgets, etc., which allow for a much steeper editing rate, as I demonstrated previously for images, where Wikidata went from nothing to second place in about two years, and is now angling for the pole position (~1.1M images at the moment). I see no reason to doubt this will happen to references as well.

The beatings will continue until morale improves

So over the weekend, Wikimedia Labs ran into a bit of trouble. Database replication broke, and was lagging about two days behind the live databases. But, thanks to tireless efforts by JCrespo, replication has now picked up again, and replication lag should be back to normal soon (even though there might be a few bits missing).

Now, this in itself is not something I would blog about; things break, things get fixed, life goes on. But then, I saw a comment by JCrespo with a preliminary analysis of what happened, and how to avoid it happening again:

“…it is due to the contraints we have for labs in terms of hardware and human resources. In order to prevent this in the future, I would like to discuss enforcing stronger constraints per user/tool.”

So, there are insufficient resources invested into (Tools) Labs. The solution, obviously, is to curtail the use of resources. This train of thought should be familiar to everyone whose country went through a phase of austerity in recent years, even though it now seems to be commonly agreed, outside the cloudy realm of politicians, that austerity is the wrong way to go. If you have a good thing going, and you require some more resources to keep it that way, you give it more resources. You do not cut away scarce resources even more! This is how you go the way of Greece.

This is how you go the way of the toolserver.

More mixin’, more matches

Mix’n’match has seen some updates in the past few days. There are about 170K new entries, in several catalogs:

Also, there is a brand-new import tool that anyone (who has made at least one mix’n’match “edit”) can use! Just paste or upload a tab-delimited text. Note that the tool is, as of yet, untested with “production” data, so please measure twice or thrice before importing.

Distributed stats

Just about a week after its inception, the Distributed Game has passed 10K (now 12K, since I started writing this text) actions. Enough to see some interesting patterns in the stats.

By actions (an action is any decision made), the most popular sub-games are mix’n’match (42%) and matching new articles to existing items (36%), followed by the classic “merge items” (10%) from the original Game. “Merge items” has the lowest rate of actions that lead to Wikidata edits (18%), probably because it has been running for a while on the original Game, and many of the obvious cases have been done.

Mix’n’match, “administrative unit”, and “items without image” all have over 80% actions that also edit Wikidata. The last one is easy to understand; the candidate images are taken from associated Wikipedia articles, and have been filtered against “common” images (that is, images that are used on many pages, like navbox icons), so the success rate is high.

“Administrative unit” is new, and the coordinate-based suggestions appear to be of high quality. Mix’n’match has been around in the shape of the original tool for quite a while, but has always lacked enough people to get all the “easy ones”; since many of the entries here are people, ambiguity seems to be low.

The “source meta” game has a Wikidata “edit rate” of over 60%, which might drop as soon as the easy cases are done. Of course, many more entries will follow soon, so this should keep players entertained for a while.

Enter the Distributed Game

The Wikidata Game has been a success, both in terms of work done, as well as a demonstration of micro-contributions to Wikidata. I consider ~2,500 players not a trivial thing, considering the games are neither particularly thrilling, nor resulting in awards (how about one repercussion-free vandalism on English Wikipedia for 1,000 actions in a sub-game?). However, I found the underlying system increasingly complex to work with, and did not add any more games recently. This is despite several people suggesting games, and even providing data to play on.

So, just in time for Wikidata’s third birthday, I present The Distributed Game. What’s new, you might ask? Just about everything!

First and foremost, the concept of “game”: The site does not hold any actual games. Rather, it holds a list of “game tile providers”, that is, other web tools (APIs) that generate game tiles to play. The great thing about this is, anyone with a little programming experience can write a new game, test it on the Distributed Game, and add it as a new game, all with a few clicks! So if you are a programmer and interested in having people play with your Wikidata-related data, check out the API doc.
For starters, I have created the following games myself:

  • The “merge game” from the original Game. Thanks to the API model, it runs on the same data set, and even credits your edits there as well!
  • A “mix’n’match game”. In my mix’n’match tool, you can match external catalog IDs to Wikidata items. One way to do that is to confirm (or remove) automated, name-based matches. These are now exposed as a game. As above, confirming or removing a match in the game will also credit you in mix’n’match (and on Wikidata, of course!).
  • A “match new articles” game. Again, this runs on existing data, in this case a list of Wikipedia articles with no Wikidata item in the duplicity tool.
  • A brand-new game, with a bespoke data set, that tries to infer the “administrative unit” for a Wikidata item from its coordinates, and other items nearby.

That last game shows a strength of the new gaming system: The “game tile” (essentially, the information required to complete the task) can consist of several types of display; in this case, a map (using the new Wikimedia maps service) to get a look at the region, the Wikidata item (as an automated description, with inline previews of associated Wikipedia pages), and buttons to decide on the task. Other types include a Wikipedia page intro, or “just” static text; adding more types is quite simple.

The Distributed Game can perform up to three actions on every decision you take:

  • Store your decision centrally, for Recent Changes, user contributions, statistics etc.
  • Feed your decision back to the remote API that provided the game tile, if only to mark this tile as “done” and not present it again.
  • Perform one or more edits on Wikidata.
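
The three steps above can be sketched as a small planning function. This is a hypothetical illustration: the tile shape and the field names (gameId, feedbackUrl, edits) are invented here and are not taken from the real API doc.

```javascript
// Hypothetical tile shape; none of these field names come from the real API.
// Returns the list of actions a decision would trigger, in order.
function planActions(tile, decision) {
  const actions = [
    // 1. Always record the decision centrally.
    { type: 'log', gameId: tile.gameId, tileId: tile.id, decision },
  ];
  // 2. Tell the tile provider, so it can mark the tile as done.
  if (tile.feedbackUrl) {
    actions.push({ type: 'feedback', url: tile.feedbackUrl, tileId: tile.id, decision });
  }
  // 3. Queue whatever Wikidata edits this decision implies (possibly none).
  for (const edit of (tile.edits && tile.edits[decision]) || []) {
    actions.push({ type: 'wikidata-edit', edit });
  }
  return actions;
}

// Example tile: "yes" adds one statement, "no" edits nothing.
const exampleTile = {
  gameId: 1,
  id: 42,
  feedbackUrl: 'https://example.org/api',
  edits: { yes: [{ item: 'Q1', property: 'P131', value: 'Q2' }], no: [] },
};
console.log(planActions(exampleTile, 'yes').map((action) => action.type));
```

Keeping the plan separate from its execution makes it easy to check that a “no” decision still gets logged and fed back, even though it edits nothing.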

Quite a few new ideas have gone into the interface. Wikidata item previews show image (if available) and map (based on coordinates) thumbnails. The next “game tiles” already show, but are greyed out. Page URLs update automatically to provide a link to the current “game state”. Games can offer different modes (for example, the “merge game” offers to only serve entries of people). You can set a list of languages, which will be passed to the individual games (it is up to the game to act on that appropriately).

At the moment of writing, there are several functions that the Distributed Game lacks, in comparison to the old one, including text highlighting, and detailed statistics. I have not yet decided which other games, if any, to port from the “old” game. I think for this “game 2.0”, it is more important to get the essentials right, and to make it visually appealing (I have borrowed some ideas from tufte.css), rather than to be “feature-complete”. The original Game will continue to work for the time being.

So, if you have a sub-game you really would like to see ported to the new system, or know of a data source that would make for an interesting game, please contact me. More importantly, though: If you are a programmer, check out the documentation, and have a go at writing your own. After all, building a whole from individual, diverse contributions is what makes us mighty.


Wikipedians love lists. Thus, my list-generating bot is now active on over a dozen wikis, most of them upon request by users, who have set up quite a variety of lists to generate and update.

However, a few issues with this approach have emerged. Some of them are technical; lists get too long for wikitext, some desired functions are hard to implement, and using templates to set up parameters is awkward. Some issues are social; while several wikis have no problems with bot-generated lists in the article namespace, some (OK, one) communities have concerns, ranging from the data quality of Wikidata, over style issues, to the fact that the list can only be edited via Wikidata, and not directly on the respective wiki.

A proposed solution is to implement Wikidata lists as a tool. This solves the “social issues” by moving the lists outside the wikis, while releasing storage and display options from the limitations of MediaWiki. So, for the impatient: Dynamic Wikidata Lists.

In this tool, everyone can view lists, and change options like language, columns, and sections on-the-fly; to create your own lists, use your trusted WiDaR login. Lists consist of two parts: The items resulting from a Wikidata Query (there could be other data sources down the line), which are stored in the tool, and updated every six hours, or on demand; and the data in the columns, which is loaded on-the-fly, directly from Wikidata. This is a trade-off; no need to store large amounts of data in the tool, and getting the latest data straight from the source, in exchange for a few seconds of waiting time for large lists. Once a list is loaded, the display can be changed with little or no need to load more data. Even my largest list, with over 4,300 items, loads the items in ~10sec, and the labels for the column values in another ~5sec, on my machine.

The “>” icon on top of the list opens the display options, and there are quite a few of those. Seven column types, three section types, multiple (sub-)section levels, an option to display the top-level section as tabs, arbitrary precision when using dates as sections (e.g. force birth dates into decades), options to override column titles per language, etc.

Links go to the Wikipedia article in the current language, or to Wikidata by default; columns with links to specific wikis are possible. Preferred statements are shown if present, normal ranks otherwise. Columns can be sorted by clicking on the column header (there is no default sort, as the labels to sort on are loaded after the table is created).
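
The rank rule (“preferred if present, normal otherwise”) is simple enough to sketch. The snippet assumes claims in the Wikidata JSON shape, where each claim carries a rank of "preferred", "normal", or "deprecated"; the function name is made up.

```javascript
// Pick the statements to display for one property:
// all preferred ones if any exist, otherwise all normal ones.
// Deprecated statements are never shown.
function statementsToShow(claims) {
  const preferred = claims.filter((claim) => claim.rank === 'preferred');
  if (preferred.length > 0) return preferred;
  return claims.filter((claim) => claim.rank === 'normal');
}

// Example: a population property with one preferred (current) value.
const population = [
  { rank: 'normal', value: 1000 },
  { rank: 'preferred', value: 1200 },
  { rank: 'deprecated', value: 5 },
];
console.log(statementsToShow(population).map((claim) => claim.value));
```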

A list, once created, can only be changed by the person who created it; but anyone with a WiDaR login can create a new list based on an existing one, and change it in any way desired. List and default language can be specified in the URL, so you can link to a list from Wikipedia in the local language.

There are, undoubtedly, things left to do; the next big one will be to allow Wikidata editing directly from within the tool. I hope this tool will, in addition to the bot, give everyone the power and flexibility to create and manage Wikidata-based lists, help improve Wikidata statements, and maybe even convert the occasional Wikidata nay-sayer :-)

Wikidata lists – Full Circle

So my Wikidata list-generating bot Listeria has become popular in certain circles, creating and updating lists of artworks, species, or ORCID ID holders. With the introduction of the Wikidata SPARQL service, Wikidata queries are becoming more mainstream, and lists are a logical next step.

At the same time, many Wikipedians lack an awareness of Wikidata, and hesitate to go there and edit. Micro-contributions are a way for people to improve Wikidata without much fuss, but are “hidden” in external tools.

So I added a little bit of code to Listeria. The output now contains a few minor extras, like class names for table cells. These are then used by JavaScript code to allow adding and editing of information in Listeria table cells, right on Wikipedia. Label, description (where unavoidable), item links, dates, coordinates, strings, and images are supported. Simply add

Dialog to add an item link


to your common.js page on Wikipedia, hover over a Listeria-generated table cell, and you will see add/edit options. Clicking on those will open a dialog to find/enter a value, validated through Wikidata itself. Clicking OK adds this information to Wikidata. Done! (Because of the table being static wikitext, your addition will only show after the next Listeria update, but it is already on Wikidata proper.)

The JavaScript code is adaptable, meaning it could be used to let people edit Wikidata-based infoboxes etc. Of course, it would be much more effective to have this enabled for all Wikipedia users, and with more Listeria lists around. But for now, I am content with this being a demo, which may inspire “official” functionality by the WMF, in a few years’ time.

A quick description

So there is a lively discussion about using descriptions from Wikidata in places like Wikipedia search results, especially on mobile. While everyone seems to agree that this is a good idea, camps are forming with supporters of manual and automatically generated descriptions, respectively. Time for an entirely manual description of my POV.

At the time of writing this, there are about 14 million items on Wikidata. The Wikiverse deals with about 250 languages. That comes to ~3.5 billion possible descriptions of items, a number that will only increase with time. Right now, less than 4% of these descriptions are filled in, many of them generated by bots (e.g. “Wikimedia disambiguation page”, not all of them correctly). And do not kid yourselves, those will stay. They will continue to say “American actor”. Even after we add statements about his/her nationality, gender, birth and death dates, spouses, parents, children, important awards, etc., the description will still say “American actor”. There are, by far, not enough volunteers to fill in >3 billion descriptions, especially in the ~240 or so non-“main” languages; most have few enough labels for the items. Except for maybe English, there are no people to go around and improve existing descriptions, probably multiple times for the same item. For most people in the world, Wikidata manual item descriptions are a wasteland, and it’s here to stay.

But there is an alternative. A bot can look at an item, see it’s about a person with nationality “U.S.”, and occupation “actor”. It can, from that, write “American actor”. It can, in fact, do much better than that, given the right statements. It will improve its description as more information becomes available. And it can do so in all 250 languages, given a little volunteer effort for each of them. It won’t win a literature contest any time soon, but it will get the basic message across, in most cases.
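
As a minimal sketch of the idea (not the code of any existing tool), a description can be assembled from an item’s statements plus a small per-language word table. The table below is a hand-written stand-in covering just two languages; a real tool would need labels and grammar rules per language. P27 is “country of citizenship”, P106 is “occupation”.

```javascript
// Illustrative word table: adjective for Q30 (United States) and
// noun for Q33999 (actor). Hand-written here, not loaded from Wikidata.
const WORDS = {
  en: { Q30: 'American', Q33999: 'actor' },
  de: { Q30: 'US-amerikanischer', Q33999: 'Schauspieler' },
};

// Build a short description from nationality (P27) and occupation (P106).
// Returns null if the language is unknown or nothing useful is available.
function describePerson(item, lang) {
  const words = WORDS[lang];
  if (!words) return null;
  const parts = [];
  for (const property of ['P27', 'P106']) {
    for (const target of item[property] || []) {
      if (words[target]) parts.push(words[target]);
    }
  }
  return parts.length > 0 ? parts.join(' ') : null;
}

const sparseItem = { P31: ['Q5'], P106: ['Q33999'] };
const richerItem = { P31: ['Q5'], P27: ['Q30'], P106: ['Q33999'] };
console.log(describePerson(sparseItem, 'en')); // "actor"
console.log(describePerson(richerItem, 'en')); // "American actor"
console.log(describePerson(richerItem, 'de')); // "US-amerikanischer Schauspieler"
```

The description improves by itself once the nationality statement is added, and a new language only needs a new entry in the word table.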


As a hands-on person, I wrote a little tool a while ago, which attempts to do just that. Limited by time and my 2-out-of-250 language abilities, it is far from perfect, or even working properly for many languages. But let me give an example.

There is an article about a specific model of “flying boat” in several languages, the Dornier Do J. On Wikidata, there is a (as in: one) manual description, in Italian, for the respective item, which reads “idrovolante Dornier-Werke”. I don’t speak Italian, but it looks … truncated? (Google translate agrees with this assessment.)

So I ran my automatic description on this, for a few languages:

English: Dornier Do J : Flying boat by Dornier
German: Dornier Wal : Flugboot von Dornier-Werke
French: Dornier Do J : Hydravion à coque par Dornier
Spanish: Dornier Do J : Hidrocanoa por Dornier Flugzeugwerke
Japanese: Do J : 飛行艇 by ドルニエ
Vietnamese: Dornier Do J : Tàu bay bởi Dornier Flugzeugwerke
Telugu: Q1245981 : ఎగిరే పడవ Dornier Flugzeugwerke చేత తయారు చేయబడినది

Perfect? Certainly not. Wrong? In some cases; “by” is not really a Japanese word, as far as I know. But I would think that most Japanese readers would know what the item is about, from that description.

Note that there are no Wikipedia articles about this topic in Vietnamese, nor Telugu. These texts (as good or bad as they may be) could show up in a Telugu Wikidata search. Or a Wikipedia one, even if no te.wikipedia results were found. The code exists, and is used (e.g. on Italian Wikipedia) already.

You can see the automatic descriptions for the Wikipedia page you are on yourself. Simply add

mw.loader.load("// Manske/autodesc.js&action=raw&ctype=text/javascript");

to your common.js User subpage, or to your global JavaScript page, which will activate it on all Wikipedias you work on. I found this a great way to see where the Wikidata item is lacking, and needs some more statements, or where items need a label in your language.

A suggestive tool

Do you know what “Stomatitis” is? Neither did I. But there is an article about it on German Wikipedia, and when I found it, it had a blank Wikidata item. Now, I happen to speak German, but I have run into plenty of other blank items with, say, a Russian Wikipedia article, which is not exactly my forte. I could go to Google translate, but oh so inconvenient. And then I’ll have to figure out which properties and items I should link to. Easy enough for “human” (P31:Q5) to remember, but what was the item for “sex:female” again?

Then I thought: I might not know what the text of the article says, but I bet it is in one or more categories. And these categories have other articles in them, articles similar to the article I can’t read. And many of these articles should have Wikidata items. So I could look at these items, see what statements are common among those, and some of the top ones will probably apply to my blank item as well.
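
The ranking step of this idea can be sketched as follows. It assumes the statements of the “similar” items (those whose articles share a category with the unreadable one) have already been collected as property:item strings; the function name and the 50% cut-off are arbitrary choices for illustration.

```javascript
// Rank statements by the share of similar items that carry them.
// `similarItems` is an array of statement lists, one list per item,
// with each statement written as "property:item", e.g. "P31:Q5".
function rankStatements(similarItems, minShare = 0.5) {
  const counts = new Map();
  for (const statements of similarItems) {
    // Count each statement at most once per item.
    for (const statement of new Set(statements)) {
      counts.set(statement, (counts.get(statement) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([statement, count]) => ({ statement, share: count / similarItems.length }))
    .filter((entry) => entry.share >= minShare)
    .sort((a, b) => b.share - a.share || a.statement.localeCompare(b.statement));
}

// Four items from the same (imaginary) category.
const categoryItems = [
  ['P31:Q5', 'P21:Q6581072'],
  ['P31:Q5', 'P21:Q6581072'],
  ['P31:Q5', 'P21:Q6581097'],
  ['P31:Q5'],
];
console.log(rankStatements(categoryItems));
```

For this input, “human” (P31:Q5) comes out on top with a share of 1, followed by “sex: female” (P21:Q6581072) at 0.5; the statement seen on only one item falls below the cut-off.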

You all know what we need now: More tools! So I wrote some code on Labs, which will give me the “statement ranking” for a single language. I also wrote a JavaScript wrapper around it. It will add a link called “Suggestor” to your toolbar on the left. This is what it looks like:

[Screenshot of the Suggestor output]

It’s definitely a disease, and the “medical specialty” looks right too. If the source describes it, I do not know, but it might be worth finding out.

To add this handy function to your Wikidata experience, simply add

importScript( 'User:Magnus_Manske/suggestor.js' );

to your common.js user subpage. Enjoy!

Add it to the pile!

I have previously blogged about Wikipedia-related page lists, and how they relate to many tools and activities. I also lamented my previous, failed attempts at introducing a “tool pipeline system”.

Well, I am not one to give up easily! The latest, greatest iteration in this vein is PagePile. Essentially, this new tool is managing piles (newspeak for “lists”) of pages from Wikipedia, Wikidata, Commons, and other projects from the WikiVerse.


Filtering a list.


New piles can be taken from various sources, including manual lists, WDQ, and the Gather extension. Several of my tools can also generate piles, including AutoList, CatScan, QuickIntersection, and Not-in-the-other-language. Either way, you end up with a numeric PagePile ID.

What can you do with that ID? First of all, you can look at the list (that example leads to the list of all humans on Wikidata, ~2.8M items long), and download it in various formats.

You can filter the list, creating a new list (with a new ID) by following language links, resolving redirects, merging and subsetting with other lists, etc.
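
Conceptually, merging and subsetting are plain set operations that always produce a new pile. A minimal sketch on arrays of page titles, with made-up function names (this says nothing about how PagePile stores or implements them):

```javascript
// Each function returns a new list and leaves its inputs untouched,
// mirroring "a new list (with a new ID)".

// Union: every page from either pile, without duplicates.
function mergePiles(a, b) {
  return [...new Set([...a, ...b])];
}

// Intersection: pages of `a` that are also in `b`.
function intersectPiles(a, b) {
  const keep = new Set(b);
  return a.filter((page) => keep.has(page));
}

// Difference: pages of `a` that are not in `b`.
function subtractPile(a, b) {
  const drop = new Set(b);
  return a.filter((page) => !drop.has(page));
}

const pileA = ['Q1', 'Q2', 'Q3'];
const pileB = ['Q3', 'Q4'];
console.log(mergePiles(pileA, pileB)); // [ 'Q1', 'Q2', 'Q3', 'Q4' ]
console.log(intersectPiles(pileA, pileB)); // [ 'Q3' ]
console.log(subtractPile(pileA, pileB)); // [ 'Q1', 'Q2' ]
```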

Finally, you can import them into several of my tools, including AutoList, FIST, WD-FIST, Not-in-the-other-language, and GetItemNames.

This list will likely grow; it is quite easy to add PagePiles as an input and/or output to a tool. Let me know if there is a tool you would like to see connected to the PagePile ecosystem; likewise for new filters.


If you are a tool author on Labs, you might want to consider linking up to the obvious possibilities of this system. I made a brief introduction for programmers, put the code on BitBucket, and I am working on some code documentation.

Basically, the tool manages a list of sqlite files, each of which represents a pile (=list) of pages on a wiki. You can get the file name of the sqlite3 file from the API or via the PHP class described in the intro. Via that class, or using sqlite3 directly, you can read and write that file, adding and changing lists. Please let me know if you have problems or comments, and if you start using PagePile in your tools, so I can add them to my consumer and/or generator lists.