
WDQ, obsolete?

For a few years now, I have run the WikiData Query tool (WDQ) to provide query functionality for Wikidata. Nowadays, the (confusingly similarly named) SPARQL-based WDQS is the “official” way to query Wikidata. WDQS has been improving a lot, and while some of my tools still support WDQ, I deliberately left that option out of new tools like PetScan. But before I shut down WDQ, and the tools that use it, for good, I wanted to know if it is still used, and if SPARQL could take over.

I therefore added a query logger to Autolist1 and Autolist2. The logs contain all WDQ queries run through those tools. I will monitor the results for a while, but here is what I have seen so far. For each query, I comment on the automatic translation to SPARQL via WDQ2SPARQL, the general feasibility of such a query in SPARQL, and the performance of WDQS. “OK” means the query could be converted automatically to SPARQL, runs, and produces a similar (as in, equal or more up-to-date) result.

WDQ Comment
CLAIM[279:13219666]  OK
BETWEEN[569,1016-1,1016-12]  BETWEEN not implemented in WDQS, but manual translation feasible
(CLAIM[1435:10387684] OR CLAIM[1435:10387575]) AND NOCLAIM[380] AND NOCLAIM[481]  OK
BETWEEN[569,1359-1,1359-12]  BETWEEN not implemented in WDQS, but manual translation feasible
CLAIM[31:5]  All humans; ~3.2M on Wikidata. Not really a useful query in these tools.
Q22686  Single item. Doesn’t really need a query?
Q22686  Single item. Doesn’t really need a query?
CLAIM[106:170790] AND CLAIM[27:35]  OK
CLAIM[195:842858]  OK
Gustav III  What the hell?
claim[17]  All items with “country”. Not really a useful query in these tools.
claim[31]  All items with “instance of”. Not really a useful query in these tools.
claim[106:82955] and claim[509:(tree[12078][][279])]  OK
claim[31:5]  All humans; ~3.2M on Wikidata. Not really a useful query in these tools.
claim[31:5]  All humans; ~3.2M on Wikidata. Not really a useful query in these tools.
claim[21]  All items with gender. Not really a useful query in these tools.
LINK[lvwiki] AND CLAIM[31:5]  OK
LINK[lvwiki] AND CLAIM[31:5]  OK
claim[27] and noclaim[21]  OK
LINK[lvwiki] AND CLAIM[31:56061]  OK
LINK[lvwiki] AND tree[56061][150][17,279]  OK
claim[31:(tree[16521][][279])]  OK

As far as I can tell, SPARQL could take over for WDQ immediately.
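To give a flavor of what such a translation looks like, here is a minimal sketch (emphatically not the actual WDQ2SPARQL code) that handles only the simplest CLAIM patterns, using the standard WDQS wdt:/wd: prefixes:

```javascript
// Hypothetical, minimal WDQ-to-SPARQL translator for illustration only.
// It handles CLAIM[prop] and CLAIM[prop:item]; real WDQ also has TREE,
// BETWEEN, NOCLAIM, LINK, and boolean combinations.
function wdqToSparql(wdq) {
  const m = wdq.trim().match(/^claim\[(\d+)(?::(\d+))?\]$/i);
  if (!m) throw new Error("unsupported WDQ expression: " + wdq);
  const prop = "wdt:P" + m[1];                      // property, e.g. P31
  const value = m[2] ? "wd:Q" + m[2] : "?value";    // fixed item, or any value
  return "SELECT ?item WHERE { ?item " + prop + " " + value + " . }";
}
```

The combinations (AND, OR, TREE) are exactly where automatic translation gets interesting, and where WDQ2SPARQL does the heavy lifting.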

i19n

Wikipedia has language editions; Wikidata has labels, aliases, descriptions, and some properties in multiple languages. This is a great resource for getting the world’s knowledge in your language! But looking at the technical side, things become a little dim. Wikimedia sites have their interface translated into many languages, but beyond that, English rules supreme. Despite many requests, only a few tools on Labs have a translatable (and translated) interface.

One exception is PetScan, which uses the i18n mechanism from its predecessor CatScan, namely a single wiki page on meta that contains all interface translations. This works in principle, as the many translations there show, but it has several disadvantages, ranging from bespoke wikitext parsing, through load/rendering times on meta, to the fact that there is no easy way to answer the question “which of these keys have not been translated into Italian?”. New software features require new interface strings, so the situation gets worse over time.

The answer I got when asking about good ways to translate interfaces is usually “just use TranslateWiki”, which IIRC is used for the official Wikimedia sites. This is a great project, with powerful applications, but I was looking for something more lightweight, both on the “add a translation” side and the “how to use this in my tool” side.

If you know me or my blog, then by this point you will already have guessed what happened next: I rolled my own (for more detailed information, see the manual page).

ToolTranslate is a tool that allows everyone (after the usual OAuth ceremony) to provide translations for interface texts, in almost 300 languages. I even made a video demonstrating how easy it is to add translations (ToolTranslate uses its own mechanism, so the demo edit shows up live in the interface). You can even add your own tool, without having to jump through bureaucratic hurdles, just with the press of a button!

On the tool-author side, you will have to change your HTML from <div>My text</div> to <div tt="mytext"></div>, and then add “My text” as a translation for the “mytext” key. Just use the language(s) you know; anyone can add translations in other languages later. I experienced this myself: after I uploaded the demo video, User:Geraki added Greek translations to the interface, before this blog post, or any other instructions, were available. Just, suddenly, as if by magic, Greek appeared as an interface option… You will also need to include a JavaScript file I provide, and add a single line of code (two, if you want a drop-down to switch languages live).
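The lookup behind this is simple; the following sketch shows the general idea, with language fallback (the actual ToolTranslate JavaScript and its JSON layout may well differ, so treat the key and field names as illustrative):

```javascript
// Illustrative sketch of a ToolTranslate-style lookup: given a key and a
// language, find the translation, fall back to a default language, and
// finally fall back to the key itself so untranslated text is still visible.
function translate(translations, key, lang, fallback = "en") {
  const forKey = translations[key] || {};
  return forKey[lang] ?? forKey[fallback] ?? key;
}

// Example translation data, as it might look in the exported JSON
const translations = {
  mytext: { en: "My text", el: "Το κείμενό μου" },
};
```

A client would then walk all elements with a tt attribute and fill in the result of translate() for the current interface language.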

There is a simplistic demo page, mainly intended for tool authors, to see how it works in practice. Besides ToolTranslate itself, I also used it on my WikiLovesMonuments tool, to show that it is feasible to retrofit an existing tool. This took less than 10 minutes.

I do provide the necessary JavaScript code to convert HTML/JS-based tools. I will be working on a PHP class next, if there is demand. All translations are also provided as JSON files online, so you can, in turn, “roll your own” code if you want. And if you have existing translations for your tool and want to switch to ToolTranslate, let me know, and I can import your existing translations.

First image, good image?

For a while now, Wikimedia pages (usually Wikipedia articles) have had a “page image”, an image from that page used as a thumbnail in article previews, e.g. in the mobile app. While it is not entirely clear to me how this image is chosen, it appears to be the first image of the article in most cases, probably excluding some icons.

Wikidata is doing something similar with the “image” property (P18); however, this needs to be an image of the item’s subject, not “something related to the item”. Wikipedia’s “page image” often turns out to be a painting made by the article’s subject, or a map, or something related to an event. This discrepancy prevents an automated import of the “page image” into Wikidata. However, exceptions aside, the “page image” presents a highly specific resource for P18-suitable images.

So I added a new function to my WD_FIST tool to help facilitate the import of suitable images from that rich source into Wikidata. As a first step, a bot checks several large Wikipedias on a daily basis and retrieves “page images” where the associated Wikidata item has none, and the “page image” is stored on Commons. It also skips “non-subject” pages like list articles. In a second stage, images (excluding PNG, GIF, and SVG) that are used as a “page image” on at least three Wikipedias for the same subject are put into a main candidate list. The image must also not be on the tool-internal “ignore” list. Even after all this filtering, >32K candidates remain in the current list.
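The second-stage filtering can be sketched as follows (the data layout and names here are made up for illustration; the actual bot code differs):

```javascript
// Illustrative sketch of the candidate filtering: skip PNG/GIF/SVG and
// ignored files, then keep an image only if at least minWikis Wikipedias
// use it as the "page image" for the same Wikidata item.
const SKIPPED_EXTENSIONS = /\.(png|gif|svg)$/i;

function candidateImages(pageImages, ignoreList, minWikis = 3) {
  const candidates = {};
  // pageImages: { Q123: { dewiki: "A.jpg", enwiki: "A.jpg", ... }, ... }
  for (const [item, byWiki] of Object.entries(pageImages)) {
    const counts = {};
    for (const image of Object.values(byWiki)) {
      if (SKIPPED_EXTENSIONS.test(image) || ignoreList.has(image)) continue;
      counts[image] = (counts[image] || 0) + 1;
    }
    for (const [image, n] of Object.entries(counts)) {
      if (n >= minWikis) candidates[item] = image;
    }
  }
  return candidates;
}
```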

dewiki 346,204
enwiki 700,832
frwiki 255,527
itwiki 148,041
nowiki 73,508
plwiki 181,323
svwiki 109,349
Combined 32,137

I will likely add more Wikipedias to this list (es and pt will show up tomorrow), and eventually lower the inclusion threshold, as candidates are added to Wikidata, or to the “ignore” list.

As the candidate list is already heavily filtered, I am not applying some of the usual WD-FIST filters. This also helps with retrieving a candidate set of 50 very quickly. In this mode, the tool also lends itself well to mobile usage.

A week of looking at women

Images and their use in the WikiVerse have always been a particular interest of mine, on Wikipedia, Commons, and of course, Wikidata. Commons holds the files and groups them by subject, author, or theme; Wikidata references images and files for key aspects of a subject; and Wikipedia uses them to enrich texts, and puts files into context.

Wikidata uses images for more subjects than any Wikipedia, save English, and it is slowly encroaching on the latter; the “break-even” should happen later this year. This is not just a purpose in itself, but will also massively benefit the many smaller Wikipedias, by holding such material in easily usable form at the ready.


Image candidates, ready to be added with a single click

So I did a small experiment, as to how much one person can do “on the side” (besides work, other interests, and such luxuries as sleeping or eating), to improve the Wikidata image fundus. I thus picked the German category for women, which currently holds >92K articles. I used my WD-FIST tool to find all potential images on all Wikipedias, for the Wikidata items corresponding to the German articles. This does not show items that already have an image, or items that have no possible candidate image anywhere; just the ones where a Wikipedia does have an image, and Wikidata does not.

A week ago, I started with 3,060 items of women that potentially had an image on Wikipedia, somewhere. A week later, I am down to ~290. Now, that does not mean I added ~2,700 images to Wikidata; a database query comes to about ~1,100 added images, and ~200 other file properties (spoken text, commemorative plaque image, etc.). Some items just had no suitable image on Wikipedia; others had group photos, which I tagged to be cropped on Commons (those tagged images will not show in the tool, while the crop template remains).

The image candidates for the remaining 290 or so items need to be investigated in more detail; some of them might be not actually images of the subject (hard to tell if the file name and description are in e.g. Russian), or they are low-resolution group pictures, which do not warrant cropping, as the resulting, individual image would be too grainy.

Adding the ~1,100 images is good, but only part of the point I am trying to make here. The other part is, no one will have to wade again through the ~90% of item/image suggestions I have resolved, one way or another. Ideally, the remaining 290 items should be resolved too; then, if an image were added on any Wikipedia, for any of the >92K women in the category, just that new image would show in the tool, which would make updating Wikidata so much easier. Even just one volunteer could drop by every few weeks and keep Wikidata up-to-date with images, for that group of items, with a few clicks’ worth.

The next step is, of course, all women on Wikidata (caution: that one will load a few minutes). The count of items with potential images is at 15,986 at the time of writing. At my speed, it would take one person about a month of late evening clicking to reduce that by 90%, though I do hope some of you have been inspired to help me out a bit.

Of cats and pets

CatScan is one of these workhorse tools that are familiar to many Wikimedia users, all the way back to the toolserver. Its popularity, however, has also caused problems with reliability time and again. As Labs became usable, I added QuickIntersection to the mix, allowing for a quicker and more reliable service at the expense of some complex functionality. Alas, despite my best efforts, CatScan reliability is fluctuating a lot. The reasons for that include the choice of PHP as a programming language, and the shared nature of Labs tools, where resources are concerned.

So I spent the last two weeks (as time allowed) on a complete rewrite of the tool, using C++ and a dedicated virtual machine on Labs. The result is one of the most complex tools I have developed to date. I call it PetScan, both to indicate that it does more than just cat(egorie)s, and as a pun on the more versatile PET scan (compared to the CAT scan).

Its basic interface is based on CatScan, and it is backwards-compatible for both URL parameters (so if you have a CatScan URL, you just need to replace the server name) and output (so the JSON output will be almost identical). It can also be switched to QuickIntersection output with a parameter, so it could replace that tool as well.

But PetScan is much more encompassing. Several times before, I tried to “connect” my (and other) tools, the last time via PagePile; however, the uptake was rather low. It is clear that most users prefer a tool that slices and dices. This is why PetScan can also process other data sources, like the Wikidata SPARQL query, manual page or item lists, and yes, PagePile. Given more than one source, it builds a subset of the respective results, even if they are on other wikis (via Wikidata).
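The core idea of combining sources, sketched here with made-up data structures (PetScan itself is C++, so this is purely illustrative), is to key every source’s results by Wikidata item and intersect:

```javascript
// Illustrative sketch: each source is a Map from Wikidata item id to the
// page title on that source's wiki. Keying by item lets results from
// different wikis (categories, SPARQL, PagePile, ...) be intersected.
function intersectSources(...sources) {
  return sources.reduce((acc, src) => {
    const keep = new Map();
    for (const [item, page] of acc) {
      if (src.has(item)) keep.set(item, page); // keep pages present everywhere
    }
    return keep;
  });
}
```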

You want a list of all cats known to Wikidata that are also in the category tree of the battleship “Bismarck” on English Wikipedia? No problem. You can choose which of the input wikis should be the output wiki, so you can get the same result as Wikidata items. For the latter, you might have seen an additional box at the top of the results; this is the full functionality of AutoList 2, directly available on your resulting items.

Additional goodies include:

  • The interface language can be switched “live”. The translations were copied from the CatScan translations, so that effort is re-used.
  • Namespaces are updated live when you change the wiki for the categories.
  • Both templates and (incoming) links can now be a primary source, instead of just being filters for categories.
  • You can filter the results by a regular expression. This works on page titles or Wikidata labels, respectively.
  • For Wikidata results, you can specify the label language used (defaults to the interface language).
  • You can show only Wikipedia pages without a Wikidata item.
  • Only the first 10K results will be shown in HTML mode, so as not to crash your browser. Other output formats (e.g. JSON) will get you all results.

I have tested PetScan on my own, but with a project of this complexity, bugs will only become apparent with many users, over time, so please help test it. Eventually, I believe this tool can (and will) replace CatScan, QuickIntersections, Autolist, and maybe others as well.

The Reference Wars

In a recent Wikipedia Signpost Op-Ed, Andreas Kolbe wrote about Wikidata and references. He comes to the conclusion that Wikidata needs more (non-Wikipedia) references, a statement I wholeheartedly agree with. He also divines that this will never happen, that Wikidata is doomed, while at the same time somehow being controlled by Google and Microsoft; I will not comment on these “conclusions”, as others have already done so elsewhere.

Andreas also uses my own Wikidata statistics to make his point about missing references on Wikidata. The numbers I show are useful, IMHO, to show the remarkable progress of Wikidata, but they are much too crude to draw conclusions about the state of references there. Also, the impression I get from Andreas’ text is that, while Wikipedia has some issues, references are basically OK, whereas they are essentially non-existent in Wikidata.

So I thought I’d have a look at some actual numbers, especially comparing Wikipedia and Wikidata in terms of references.

One key issue is that there is no built-in way to get metrics about statements and references from Wikipedia, so I developed my own approach. Given a Wikipedia article, I use the REST API to get HTML for the article. I then count the number of reference uses (essentially, <ref> tags) in the article; note that this number is larger than (or equal to) the number of references at the bottom of the page. Then, I strip the HTML tags and count the number of sentences (starts with an upper-case character, has at least 50 characters, ends with a “.”); the numbers were confirmed manually for a few example articles through other sentence-counting tools on the web, which yielded similar results. I then assume that each sentence in the article contains one statement (or fact); in reality, a sentence likely contains many such statements (take the first sentence of a biographical article), but I am aiming for a lower boundary here. (Any sentence not containing a statement/fact should be deleted from Wikipedia anyway.) A useful metric from both the number of reference uses and the number of statements (=sentences) is the references-per-statement (RPS) ratio.
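For illustration, here is a sketch of that heuristic, applied directly to HTML text (the real measurement ran on REST API output and was more careful, so treat this as an approximation):

```javascript
// Illustrative sketch of the Wikipedia RPS heuristic described above:
// count <ref> uses, strip tags, count "sentences" (upper-case start,
// at least 50 characters, ending in a period), and take the ratio.
function wikipediaRps(html) {
  const refUses = (html.match(/<ref[\s>]/g) || []).length;
  const text = html.replace(/<[^>]+>/g, " ");
  // [A-Z] + at least 48 non-period characters + "." = 50+ characters total
  const sentences = (text.match(/[A-Z][^.]{48,}\./g) || []).length;
  return { refUses, sentences, rps: sentences ? refUses / sentences : 0 };
}
```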

For Wikidata, a similar metric can be calculated. For practical purposes, I skip statements of the “string” type, as they are mostly external references in themselves (e.g. VIAF identifiers); I also skip “media”-type statements, as they should have “references” on their file description page on Commons. For references, I do not count “imported from Wikipedia”, as these are not “real” references, but rather placeholders for future improvement. Again, an RPS ratio can be computed.
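The Wikidata side can be sketched similarly, on a simplified statement structure (datatype names and reference markers are modelled loosely here, not in the real Wikibase JSON format):

```javascript
// Illustrative sketch of the Wikidata RPS metric: skip string- and
// media-type statements, and do not count "imported from Wikipedia"
// as a real reference.
function wikidataRps(statements) {
  let counted = 0, refs = 0;
  for (const s of statements) {
    if (s.datatype === "string" || s.datatype === "media") continue;
    counted += 1;
    refs += s.references.filter((r) => r !== "imported from Wikipedia").length;
  }
  return { statements: counted, references: refs, rps: counted ? refs / counted : 0 };
}
```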

I then calculated these ratios for 4,683 Featured Articles from English Wikipedia and their associated Wikidata items (data). As these articles have been significantly worked over and approved by the English Wikipedia community, they should represent the “best case scenario” for Wikipedia.

Indeed, the RPS ratio is higher for Wikipedia in 87% of cases, which would mean that Wikipedia is better referenced than Wikidata. But keep in mind that this represents the best of the best of the best of English Wikipedia articles, fifteen years in the making, compared to a three-and-a-half-year-old Wikidata (and references were not supported for the first year or so). This is as good as it gets for Wikipedia, and still, Wikidata has a better RPS ratio in about 13% of cases.

Even more interesting, IMHO: taking the mean of both the number of statements and the number of references, for Wikipedia and Wikidata respectively, and calculating the RPS ratios of those means, yields 0.32 for Wikipedia and 0.15 for Wikidata. This seems counter-intuitive, given the previous 87/13 “ratio of ratios”. However, further investigation shows that only 1,305 (~28%) of the Wikidata items have any references at all, but where there are references, they usually outshine Wikipedia; about half of the items with at least one reference have a better RPS ratio than the respective Wikipedia article. This seems to indicate a “care factor” at work: where someone cared about adding references to an item, it was done quite well. Wikidata RPS ratios range up to 1.5, meaning two statements are, on average, supported by three references, whereas Wikipedia reaches “peak RPS ratio” at 0.93, or slightly less than one reference per statement.
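A toy example (invented numbers, not the real data set) shows how the per-item win share and the ratio of means can tell rather different stories:

```javascript
// Compare per-item RPS "wins" against the RPS computed from totals.
// The numbers below are invented purely to illustrate the effect.
function rpsAggregates(pairs) {
  let wpWins = 0, wpS = 0, wpR = 0, wdS = 0, wdR = 0;
  for (const p of pairs) {
    if (p.wpRefs / p.wpStatements > p.wdRefs / p.wdStatements) wpWins += 1;
    wpS += p.wpStatements; wpR += p.wpRefs;
    wdS += p.wdStatements; wdR += p.wdRefs;
  }
  return { wpWinShare: wpWins / pairs.length, wpRps: wpR / wpS, wdRps: wdR / wdS };
}

const pairs = [
  { wpStatements: 100, wpRefs: 30, wdStatements: 20, wdRefs: 0 },
  { wpStatements: 100, wpRefs: 30, wdStatements: 20, wdRefs: 0 },
  { wpStatements: 100, wpRefs: 30, wdStatements: 20, wdRefs: 0 },
  { wpStatements: 100, wpRefs: 30, wdStatements: 20, wdRefs: 20 },
];
```

Here, Wikipedia “wins” 75% of items, yet the aggregate ratios (0.3 vs. 0.25) are much closer than that share suggests, because a single well-referenced Wikidata item pulls the totals up; the “care factor” in miniature.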

I believe these numbers show that Wikidata can equal and surpass Wikipedia in terms of “referencedness”, but it is a function of attention to the items. Which in turn is a matter of man- and bot-hours spent. Indeed, for the Wikidata showcase items (the equivalent of Featured Articles on Wikipedia), the Wikidata RPS ratio is better than that of the associated English Wikipedia article in 19 out of 24 cases (~80%).

So will Wikidata ever catch up to Wikipedia in terms of RPS ratio? I think so. The ability of Wikidata to be reliably edited by machine allows for improvement through automated and semi-automated bots, tools, games, on-wiki gadgets, etc., which allow for a much steeper editing rate, as I demonstrated previously for images, where Wikidata went from nothing to second place in about two years, and is now angling for pole position (~1.1M images at the moment). I see no reason to doubt this will happen for references as well.

The beatings will continue until morale improves

So over the weekend, Wikimedia Labs ran into a bit of trouble. Database replication broke, and was lagging about two days behind the live databases. But, thanks to tireless efforts by JCrespo, replication has now picked up again, and replication lag should be back to normal soon (even though there might be a few bits missing).

Now, this in itself is not something I would blog about; things break, things get fixed, life goes on. But then, I saw a comment by JCrespo with a preliminary analysis of what happened, and how to avoid it happening again:

“…it is due to the contraints we have for labs in terms of hardware and human resources. In order to prevent this in the future, I would like to discuss enforcing stronger constraints per user/tool.”

So, there are insufficient resources invested into (Tools) Labs. The solution, obviously, is to curtail the use of resources. This train of thought should be familiar to everyone whose country went through a phase of austerity in recent years, even though it now seems to be commonly agreed, outside the cloudy realm of politicians, that austerity is the wrong way to go. If you have a good thing going, and it requires some more resources to keep it that way, you give it more resources. You do not cut scarce resources even further! This is how you go the way of Greece.

This is how you go the way of the toolserver.

More mixin’, more matches

Mix’n’match has seen some updates in the past few days. There are ~170K new entries, in several catalogs.

Also, there is a brand-new import tool that anyone (who has made at least one mix’n’match “edit”) can use! Just paste or upload a tab-delimited text. Note that the tool is, as of yet, untested with “production” data, so please measure twice or thrice before importing.

Distributed stats

Just about a week after its inception, the Distributed Game has passed 10K (now 12K, since I started writing this text) actions. Enough to see some interesting patterns in the stats.

By actions (an action is any decision made), the most popular sub-games are mix’n’match (42%) and matching new articles to existing items (36%), followed by the classic “merge items” (10%) from the original Game. “Merge items” has the lowest rate of actions that lead to Wikidata edits (18%), probably because it has been running for a while on the original Game, and many of the obvious cases have been done.

Mix’n’match, “administrative unit”, and “items without image” all have over 80% actions that also edit Wikidata. The last one is easy to understand; the candidate images are taken from associated Wikipedia articles, and have been filtered against “common” images (that is, images that are used on many pages, like navbox icons), so the success rate is high.

“Administrative unit” is new, and the coordinate-based suggestions appear to be of high quality. Mix’n’match has been around in the shape of the original tool for quite a while, but has always lacked enough people to get all the “easy ones”; since many of the entries here are people, ambiguity seems to be low.

The “source meta” game has a Wikidata “edit rate” of over 60%, which might drop as soon as the easy cases are done. Of course, many more entries will follow soon, so this should keep players entertained for a while.

Enter the Distributed Game

The Wikidata Game has been a success, both in terms of work done, as well as a demonstration of micro-contributions to Wikidata. I consider ~2,500 players not a trivial thing, considering the games are neither particularly thrilling, nor resulting in awards (how about one repercussion-free vandalism on English Wikipedia for 1,000 actions in a sub-game?). However, I found the underlying system increasingly complex to work with, and did not add any more games recently. This is despite several people suggesting games, and even providing data to play on.

So, just in time for Wikidata’s third birthday, I present The Distributed Game. What’s new, you might ask? Just about everything!

First and foremost, the concept of “game”: the site does not hold any actual games. Rather, it holds a list of “game tile providers”, that is, other web tools (APIs) that generate game tiles to play. The great thing about this is that anyone with a little programming experience can write a new game, test it on the Distributed Game, and add it, all with a few clicks! So if you are a programmer and interested in having people play with your Wikidata-related data, check out the API doc.
For starters, I have created the following games myself:

  • The “merge game” from the original Game. Thanks to the API model, it runs on the same data set, and even credits your edits there as well!
  • A “mix’n’match game”. In my mix’n’match tool, you can match external catalog IDs to Wikidata items. One way to do that is to confirm (or remove) automated, name-based matches. These are now exposed as a game. As above, confirming or removing a match in the game will also credit you in mix’n’match (and on Wikidata, of course!).
  • A “match new articles” game. Again, this runs on existing data, in this case a list of Wikipedia articles with no Wikidata item in the duplicity tool.
  • A brand-new game, with a bespoke data set, that tries to infer the “administrative unit” for a Wikidata item from its coordinates, and other items nearby.

That last game shows a strength of the new gaming system: The “game tile” (essentially, the information required to complete the task) can consist of several types of display; in this case, a map (using the new Wikimedia maps service) to get a look at the region, the Wikidata item (as an automated description, with inline previews of associated Wikipedia pages), and buttons to decide on the task. Other types include a Wikipedia page intro, or “just” static text; adding more types is quite simple.
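To illustrate, a tile for the “administrative unit” game might look something like this (all field names here are invented; the real tile format is defined in the API documentation mentioned above):

```javascript
// Entirely hypothetical tile builder, showing the "several display types
// per tile" idea: a map section, an item section, and decision buttons.
function makeAdminUnitTile(itemId, lat, lon, candidates) {
  return {
    id: itemId,
    sections: [
      { type: "map", lat, lon },      // look at the region
      { type: "item", q: itemId },    // automated description with previews
      {
        type: "buttons",
        options: candidates
          .map((q) => ({ label: q, decision: q }))
          .concat([{ label: "skip", decision: null }]),
      },
    ],
  };
}
```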

The Distributed Game can perform up to three actions on every decision you take:

  • Store your decision centrally, for Recent Changes, user contributions, statistics etc.
  • Feed your decision back to the remote API that provided the game tile, if only to mark this tile as “done” and not present it again.
  • Perform one or more edits on Wikidata.

Quite a few new ideas have gone into the interface. Wikidata item previews show image (if available) and map (based on coordinates) thumbnails. The next “game tiles” already show, but are greyed out. Page URLs update automatically to provide a link to the current “game state”. Games can offer different modes (for example, the “merge game” offers to only serve entries of people). You can set a list of languages, which will be passed to the individual games (it is up to the game to act on that appropriately).

At the moment of writing, there are several functions that the Distributed Game lacks, in comparison to the old one, including text highlighting, and detailed statistics. I have not yet decided which other games, if any, to port from the “old” game. I think for this “game 2.0”, it is more important to get the essentials right, and to make it visually appealing (I have borrowed some ideas from tufte.css), rather than to be “feature-complete”. The original Game will continue to work for the time being.

So, if you have a sub-game you really would like to see ported to the new system, or know of a data source that would make for an interesting game, please contact me. More importantly, though: If you are a programmer, check out the documentation, and have a go at writing your own. After all, building a whole from individual, diverse contributions is what makes us mighty.