Picture this!

Recently, someone told me that “there are no images on Wikidata”. I found that rather hard to believe, as I had added quite a few using my own tools. So I had a quick look at the numbers.

For Wikidata, counting the number of items with images is straightforward. For Wikipedia, not so much; by default, navigation bar logos and various icons are counted just like actual photographs of the article topic. So I devised a crude filter, counting only articles with at least one image that is not used in three or more articles in total.
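
The actual counting was done as a database query, but to illustrate the idea of the filter, here is a rough sketch of the same check against the public MediaWiki API (Node 18+ with global fetch; the endpoint, the helper names, and the per-wiki-only usage count are my assumptions, not the query I actually ran):

// Rough idea of the filter: does an article embed at least one file that is
// used on fewer than three pages of the same wiki? Heavily re-used files are
// probably navigation icons or logos, not photographs of the article subject.
// (Checks one wiki only, not cross-wiki totals.)
const API = 'https://de.wikipedia.org/w/api.php';

async function apiGet(params) {
  const url = API + '?' + new URLSearchParams({ format: 'json', origin: '*', ...params });
  return (await fetch(url)).json();
}

async function hasRealImage(articleTitle) {
  const res = await apiGet({ action: 'query', prop: 'images', titles: articleTitle, imlimit: 'max' });
  const page = Object.values(res.query.pages)[0];
  for (const img of page.images || []) {
    const usage = await apiGet({ action: 'query', list: 'imageusage', iutitle: img.title, iulimit: 3 });
    if (usage.query.imageusage.length < 3) return true;
  }
  return false;
}

hasRealImage('Albert Einstein').then(console.log);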

I ran the actual query on some of the larger Wikipedias. While most of them ran fine, English Wikipedia failed to return a timely result; and since its generous sprinkling of “fair use” local images would inflate the number anyway, I am omitting this result here. Otherwise:

Site Articles/Items with images
dewiki 709,736
wikidata 604,925
frwiki 602,664
ruwiki 491,916
itwiki 451,499
eswiki 414,308
jawiki 278,359

As you can see, Wikidata already outperforms all but one of these Wikipedias (two, if you count en.wp). Since image addition to Wikidata is easy through tools (and games), and there are many “pre-filtered” candidates from the Wikipedias to use, I expect Wikidata to surpass German Wikipedia soon (assuming linear increase, in less than four months), and eventually English Wikipedia as well, at least for images from Commons (that is, not under “fair use”).

But even at this moment, I am certain there are thousands of Wikidata items with an image while the corresponding article on German (or Spanish, or Russian) Wikipedia remains a text desert. The hesitation of the Wikipedia communities to use these readily available images deprives their readers of something that helps make articles come alive, and all the empty talk of “quality” and “independence” does not compensate for that.

Also, the above numbers count all types of files on Wikipedia, whereas they count only images of the item subject on Wikidata. Not only does that bias the numbers in favour of Wikipedia, it also hides the various “special” file types that Wikidata offers: videos, audio recordings, pronunciations, maps, logos, coats of arms, to name just a few. It is likely that their use on Wikipedia is even more scattered than that of subject images. Great opportunities to improve Wikipedias of all languages, for those bold enough to nudge the system.

The way is shut

So I saw a mail about the new, revamped Internet Archive. Fantastic! All kinds of free, public domain (for the most part) files to play with! So I thought to myself: Why not celebrate that new archive.org by using a file to improve Wikidata? After all, I just have to upload it to Commons!

Easy, right? Well, I did write a tool to directly upload a file from a URL to Commons, but the IA only offers mp3, so I don’t know how that would work. Let’s do it the old-fashioned way, as every newcomer would: download it to disk, and upload it to Commons. Except Commons barfs at mp3 uploads. Commons is the domain of free formats, after all. And we could not possibly set non-free formats free by converting them automatically, oh no! I am sure there is a good reason why the WMF can’t turn non-free mp3 into free formats during upload; that reason just escapes me at the moment, as it surely escapes everyone else who tries this. Maybe they would have to (gasp!) license an mp3 decoder? Not sure if that is actually required, but it would surely irk the free-only purity of the organization. Never mind that the Foundation relies heavily on non-free software and services like Google internally; if they can’t get things done with free software and open-source services alone, non-free ones are obviously made available. Just not for the community.

The mp3 refusal surely means that there are well-documented ways to deal with this issue, right? The Upload Wizard itself is not very helpful, though; the dialog box that pops up says:

This wiki does not accept filenames that end in the extension “.mp3”.

That’s it. No reason why, no suggestion what to do about it, no links, nothing. Just “bugger off”, in so many words. Never mind; after all, there is a prominent, highlighted link in the Wizard to Upload help. Which, one would assume, offers help with uploading files. I search the page for “mp3” – no result. Ah well, this seems to be a list of questions rather than an actual help page, but there is a “search archive” function; surely this problem must have been discussed before! Nope. Neither does the FAQ cover the topic of mp3. But lo and behold, searching for “audio” gets me here, which tells me (finally!) that Commons accepts OGG and FLAC; OPUS is not mentioned, probably because there are “issues” with uploading OPUS to Commons (no, really?!?). There are some links to software and online converters, but I had found some of those on my own already by now.

I tried the Miro converter, but it “only” creates OGG, not FLAC, which I wanted to use in order to avoid re-encoding losses. Then I tried online-convert, which returned a 10MB FLAC file for my 1.6MB mp3. So I upload the FLAC. And by that, I mean, I try. The Wizard takes the file and starts “encoding”. And never finishes. Or at least, it has been at it for >10 min now, without showing any sign that it is alive.

This is my experience; I could probably get it to work, if I cared enough. I shudder to think how a newbie would fare with this task. Where audio (and, most likely, video) is concerned, Commons is, in effect, a community-driven media site that does not accept media files. It has been that way for years, but we are approaching 2015; it is time we did something about it. Merely preaching more free-format ideology is not a solution.

Clusterf…amilies

Wikidata, with its web of interconnected items, lends itself to automated clustering. I have used my Wikidata query tool to quickly (as in, within a few minutes) check all clusters of humans, that is, items about humans connected by properties such as mother, child, spouse, brother, etc.

At the time of writing, there are 11,784 clusters on Wikidata, each containing two or more humans. The largest one is the supercluster “European Royalty” with 20,543 members. 7,471 clusters contain only two humans, 1,955 contain three, and the numbers drop from there.
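
For those curious how such clusters fall out of the data: once the item-to-item relation pairs have been extracted (say, via the query tool), the clustering itself is just a union-find over those pairs. A minimal sketch with made-up helper names and a toy input; the property examples in the comment (mother P25, father P22, child P40, spouse P26) are real, everything else is illustrative:

// Minimal union-find over "human to human" relation pairs, e.g.
// [['Q9696', 'Q11673'], ...] extracted beforehand from properties
// such as mother (P25), father (P22), child (P40) or spouse (P26).
function clusterSizes(pairs) {
  const parent = new Map();
  const find = q => {
    if (!parent.has(q)) parent.set(q, q);
    while (parent.get(q) !== q) {
      parent.set(q, parent.get(parent.get(q))); // path halving
      q = parent.get(q);
    }
    return q;
  };
  for (const [a, b] of pairs) parent.set(find(a), find(b)); // union
  const sizes = new Map();
  for (const q of parent.keys()) {
    const root = find(q);
    sizes.set(root, (sizes.get(root) || 0) + 1);
  }
  // Largest clusters first: [rootItem, numberOfHumans]
  return [...sizes.entries()].sort((x, y) => y[1] - x[1]);
}

console.log(clusterSizes([['Q1', 'Q2'], ['Q2', 'Q3'], ['Q4', 'Q5']]));
// => [['Q3', 3], ['Q5', 2]]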

Beyond the royalty supercluster, the largest ones include:

Sadly, there is no good genealogy rendering software that is open source and JavaScript-only; and I don’t really have the bandwidth to develop one.

I have uploaded the cluster list here; each row gives an item to start with, and the size of the cluster (= number of humans). The members of the cluster can be retrieved with the Wikidata query web[start_item][22,25,40,7,9,26,45,1038]. If there is interest, I can re-calculate this cluster list at a later date.

The missing origin of species

Now, instead of humans and their relations, what about taxa? We recently talked about taxonomy on Wikidata at WikiCon 2014, so I thought I’d modify the script to show that taxonomy on Wikidata is in a good state. Sadly, it is not.

Using “parent taxon”, as well as the deprecated “family” and “order” properties, I get a whopping 193,040 separate clusters; and that does not even count completely “unconnected” items. The good news is that the main “supercluster” consists of 1,351,245 taxa that, presumably, can all be traced back to a common root, “biota”.

But, the next one is a cluster of 1,006 taxa unconnected to that root. Using a modified query, I can get the unconnected root of that cluster, Molophilus. I have uploaded the complete cluster list here; a list of items per cluster, as well as the unconnected root, can be retrieved using the respective start item, and the methods demonstrated above.
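
A sketch of what “getting the unconnected root” amounts to: simply walk the parent taxon (P171) chain upwards until there is no parent left. The real query tool does this server-side in bulk; the fetch-based version below is only meant to show the idea (it ignores the deprecated family/order properties, and the starting item is an arbitrary example):

// Walk the "parent taxon" (P171) chain upwards from a given item until an
// item without a parent is reached; that item is the cluster's root.
const WD_API = 'https://www.wikidata.org/w/api.php';

async function getParentTaxon(qid) {
  const url = WD_API + '?' + new URLSearchParams({
    action: 'wbgetclaims', entity: qid, property: 'P171', format: 'json', origin: '*'
  });
  const data = await (await fetch(url)).json();
  const claims = (data.claims && data.claims.P171) || [];
  if (!claims.length || claims[0].mainsnak.snaktype !== 'value') return null; // no parent
  return 'Q' + claims[0].mainsnak.datavalue.value['numeric-id'];
}

async function findRoot(qid) {
  const seen = new Set(); // guard against accidental cycles in the data
  while (true) {
    const parent = await getParentTaxon(qid);
    if (!parent || seen.has(parent)) return qid;
    seen.add(parent);
    qid = parent;
  }
}

// Example: starting from the lion (Q140), this should eventually reach the
// common root of the connected supercluster; for an orphaned cluster it
// stops at the unconnected root instead.
findRoot('Q140').then(root => console.log('Root:', root));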

The Men Who Stare at Media

Shortly after the 2014 London Wikimania, the happy world of Wikimedia experienced a localized earthquake when a dispute between some editors of the German Wikipedia and the Wikimedia Foundation escalated into exchanges of electronic artillery. Here, I try to untangle the threads of the resulting Gordian knot, interwoven with my own view on the issue.

Timeline

As best as I can tell, the following sequence of events is roughly correct:

  1. The WMF (Wikimedia Foundation) decides to update and, at least by intention, improve the viewing of files (mostly images), mainly when clicked on in Wikipedia. The tool for this, dubbed MediaViewer, would do what most people expect when they click on a thumbnail on a website in 2014, and be activated by default. This is aimed at the casual reader, comprising the vast majority of people using Wikipedia. For writers (that is, “old hands” with log-ins), there is an off switch.
  2. A small group of editors on English Wikipedia suggests that the MediaViewer, at least in its current state, is not suitable for default activation. This is ignored by the WMF, due to the low total number of votes.
  3. A “Meinungsbild” (literally “opinion picture”; basically, a non-binding poll) is initiated on German Wikipedia.
  4. The WMF posts on the Meinungsbild page that it (the WMF) reserves the right to overrule a negative result.
  5. About 300 editors vote on German Wikipedia, with ~2/3 against the default activation of the MediaViewer.
  6. The WMF, as announced, overrules the Meinungsbild and activates the MediaViewer by default.
  7. An admin on German Wikipedia implements a JavaScript hack that deactivates the MediaViewer.
  8. The WMF implements a “super-protect” right that locks out even admins from editing a page, reverts the hack to re-enable the MediaViewer, and protects the “hacked” page from further editing.
  9. Mailing list shitstorm ensues.

An amalgamate of issues

In the flurry of mails, talk page edits, tweets, blog posts, and press not-quite-breaking-news items, a lot of issues were thrown into the increasingly steaming-hot soup of contention-laden bones. Sabotage of the German Wikipedia by its admins, to prevent everyone from reading it, was openly suggested as a possible solution to the problem; Erik Möller of the WMF was called a Nazi; and WMF management was accused of raking in the donations for themselves while delivering only shoddy software. I’ll try to list the separate issues that are being bundled under the “MediaViewer controversy” label:

  • Technical issues. This includes claims that MediaViewer is useless, not suitable for readers, too buggy for prime time, violates copyright by hiding some licenses, etc.
  • WMF response. Claims that the Foundation is not responding properly to technical issues (e.g. bug reports), community wishes, etc.
  • WMF aim. Claims that the Foundation is focusing exclusively on readers and new editors, leaving the “old hands” to fend for themselves.
  • Authority. Should the WMF or the community of the individual language edition have the final word about software updates?
  • Representation. Does a relatively small pool of vocal long-time editors speak for all the editors, and/or all the readers?
  • Rules of engagement. Is it OK for admins to use technological means to enforce a point of view? Is it OK for the WMF to do so?
  • Ownership. Does the WMF own Wikipedia, or do the editors who wrote it?

A house needs a foundation

While the English word “foundation” is known to many Germans, I feel it is often interpreted as “Verein”, the title of the German Wikimedia chapter. The literal translation (“Fundament”), and thus its direct meaning, are often overlooked. The WMF is not “the project”; it is a means to an end, a facilitator, a provider of services for “the community” (by whatever definition) to get stuff done. At the same time, “the community” could not function without a foundation; some argue that the community needs a different foundation, because the next one will be much better, for sure. Thankfully, these heroic separatists are a rather minute minority.

The foundation provides stability and reliability; it takes care of a lot of necessary plumbing and keeps it out of everyone’s living room. At the same time, when the foundation changes (this is stretching the literal interpretation of the word a bit, unless you live in The Matrix), everything built on the foundation has to change with it. So what does this specific foundation provide?

  • The servers and the connectivity (network, bandwidth) to run the Wikis.
  • The core software (MediaWiki) and site-specific extensions. Yes, since it is open source, everyone can make a fork, so WMF “ownership” is limited; however, the WMF employs people to develop MediaWiki, with the specific aim of supporting the WMF’s projects. Third-party use is widespread, but not a primary aim.
  • The setup (aka installation) of MediaWiki and its components for the individual projects.
  • The people and know-how to make the above run smoothly.
  • Non-technical aspects, such as strategic planning, public relations and press management, legal matters, etc., which would be hard or impossible for “the community” to provide reliably.
  • The money to pay for all of the above. Again, yes, the money comes from donations; but the WMF collects, prioritizes, and distributes it; they plan and execute the fundraising that brings the money in.

The WMF does specifically not provide:

  • The content of Wikipedia, Commons, and other projects.
  • The editorial policies for these projects, beyond certain basic principles (“Wikipedia is an encyclopedia, NPOV, no original research”, etc.) which are common to all language editions of a project.

Authorities

I think that last point deserves attention in the light of the battle over the MediaViewer. The WMF is not just your hosting provider. It does stand for, and is tasked to uphold, some basic principles of the project, across communities and languages. For example, the “neutral point of view” is a basic principle on all Wikipedias. What if a “community” (again, by whatever definition) were to decide to officially abandon it, and have opinionated articles instead? Say, the Urdu edition, a language mostly spoken in Pakistan (which I chose as a random example here!). I think that most editors, from most “communities”, would want the WMF to intervene at that point, and rightly so. You want opinionated texts, get a blog (like this one); the web is large enough. In such a case, the WMF should go against the wishes of that “community” and, if necessary, enforce NPOV, even if it means de-adminning or blocking people on that project. And while I hope that such a situation will never develop, it would be a case where the WMF would, and should, enforce editorial policy (because otherwise, it wouldn’t be Wikipedia anymore). Which is a far more serious issue than some image viewer tool.

The point I am trying to make here is that there are situations where it is part of the mission and mandate of the WMF to overrule “the community”. The question at hand is: does the MediaViewer constitute such a situation? It is certainly a borderline case. On one hand, seen from the (German) “community” POV, it is a non-essential function that mostly gets in the way of the established editors who are most likely to show up at the Meinungsbild, and it admittedly has some software issues, with a generous sprinkling of bug reports. On the other hand, from the WMF’s point of view, the dropping number of editors is a major problem, and it is their duty to address it as best they can. Some reasons, e.g. “newbie-biting”, are up to the communities and essentially out of the WMF’s control. Other reasons for the lack of “fresh blood” in the wiki family include the somewhat antiquated technology exposed to the user, and that is something well within the WMF’s remit. The Visual Editor was developed to get more (non-technical) people to edit Wikipedia. The Upload Wizard and the MediaViewer were developed to get more people interested in (and adding to) the richness of free images and sounds available on the sites.

The Visual Editor (which seems to work a lot better than it used to) represents a major change in the way Wikipedia can be used by editors, and its initial limitations were well known. Here, the WMF did yield to the wishes of individual “communities”, and not even an option for the Visual Editor is shown on German Wikipedia for “anonymous” users.

The MediaViewer is, in this context, a little different. Most people (that is, anonymous readers of Wikipedia, all of whom are potential future editors) these days expect that, when you click on a thumbnail image on a website, you see a large version of it. Maybe even with next/prev arrows to cycle through the available images on the page. (I make no judgement about whether this is the right thing; it just is this way.) Instead, Wikipedia thus far treated the reader to a slightly larger thumbnail, surrounded by mostly incomprehensible text. And when I say “incomprehensible”, I mean people mailing me to ask if they could use my image from Commons; they skip right past the {{Information}} template and the license boxes to look for the uploader, which happens to be my Flickr/Wikipedia transfer bot.

So the WMF decided that, in this specific case, the feature should be rolled out as default, on all projects, instead of piecemeal like the Visual Editor (and do not kid yourself, it will come to every Wikipedia sooner or later). I do not know what prompted this decision; consistency for multilingual readers, simplicity of maintenance, pressure on the programmers to get the code into shape under the ensuing bug report avalanche, or simply the notion of this being a minor change that can be turned off even by anonymous users. I also do not know if this was the right technical decision to make, in light of quite a few examples where the MediaViewer does not work as it should. I am, however, quite certain that it was the WMF’s right to make that decision. It falls within two of their areas of responsibility, which are (a) the MediaWiki software and its components, and (b) improving reader and editor numbers by improving their experience of the site. Again, no judgement whether or not it was the right decision; just that it was the WMF’s decision to make, if they chose to do so.

Respect

I do, however, understand the “community’s” point of view as well; while I haven’t exactly been active on German Wikipedia for a while, I have been around through all of its history. The German community is very dedicated to quality; where the English reader may be exposed to an army of Pokemons, the article namespace in German Wikipedia is pruned rather rigorously (including an article about Yours Truly). There are no “mispeeling” redirects (apparently, if you can’t spell correctly, you have no business reading an encyclopedia!), and few articles have infoboxes (Wikipedia is an encyclopedia, not a trading card game!). There are “tagging categories”, e.g. for “man” and “woman”, with no subcategories; biographies generally have Persondata and authority control templates. In short, the German community is very much in favor of rigorously controlling many aspects of the pages, in order to provide the (in the community’s view) best experience for the user. This is an essential point: the German community cares very much about the reader experience! This is not to say that other languages don’t care; but, in direct comparison, English Wikipedia is an amorphous free-for-all playground (exaggerating a bit here, but only a bit). If you don’t believe me, ask Jimbo; he speaks some German, enough to experience the effect.

So some of the German editors saw (and continue to see) the default activation of the MediaViewer as an impediment to not only themselves, but especially to the reader. And while Germans are known for their “professional outrage”, and some just dislike everything new (“it worked for me so far, why change anything?”), I believe the majority of editors voting against the MediaViewer are either actually concerned about the reader experience, or were convinced (not to say “dragged into”) by those concerned to vote “no”.

The reactions by the WMF, understandable as they are from their perspective, namely

  1. announcing that it would ignore the “vote” (not a real, democratic vote, which is why it’s called “Meinungsbild” and not “Wahl”)
  2. proceeding to ignore the vote
  3. using “force” to enforce their decision

were interpreted by many editors as a lack of respect. We, the editors, wrote the encyclopedia, after all; how dare they (the WMF) change our carefully crafted user experience, and ignore our declared will? It is from that background that comparisons to corporate overlords etc. stem, barely kept in check by Mike Godwin himself. And while such exaggerations are a common experience for everyone on the web, they do not exactly help in getting the discussion back to where it should be. Which is: “where do we go from here?”

The road to hell

One thing is clear to me, and I suspect even to the most hardened edit warrior in the wikiverse: Both “sides”, community and WMF, actually want the same thing, which is to give the reader the best experience possible when browsing the pages of any Wikimedia project. The goal is not in question; the road to get there is. And whose authority it is to decide that.

On the technical side, one issue is the testing-and-fixing cycle. Traditionally, the WMF has made new functionality available for testing by the community quite early. By the same tradition, that option is ignored by most members of that community, only to complain about being steamrollered into it when it suddenly appears on the live site. On the other hand,  the WMF has rolled out both the Visual Editor and the MediaViewer in a state that would be called “early beta” in most software companies. “Release early, release often” is a time-honored motto in open source software development; but in this specific case, using early releases in production isn’t optional for the users. From discussions I had on Wikimania, I have the distinct impression that people expect a higher standard of quality for software rolled out by the WMF on the live sites, especially if it becomes default. How this should work without volunteers to test early remains a mystery; maybe a little more maturity on the initial release, followed by more widespread use of “beta” features, is part of the answer here.

On the votes-vs-foundation side, I am of the opinion that clearer lines need to be drawn. The WMF does have a responsibility for user experience, which includes software changes, some of which will have to be applied across the wikiverse to be effective; the upcoming “forced account unification” for (finally!) Single User Login comes to mind. And, in a twist on the famous Spiderman quote, with great responsibility needs to come great power to fulfill it. Responsibility without power is the worst state one can have in a job, which even the most uncompromising “community fighter” will agree to. So if and when the WMF makes such a decision within their remit, the energy of the community would be best spent in feeding back the flaws in order to get the best possible result, instead of half-assed attempts at sabotage (I much prefer full-assed attempts myself).

There is, of course, another side to that coin. In my opinion, the WMF should leave the decision about the default activation of a new feature to a representative vote of a community, unless the activation is necessary for (a) technical, (b) consistency, or (c) interdependency reasons. A security fix would fall under (a); the Single User Login will fall under (c); the MediaViewer falls under (b), though somewhat weakly IMHO. Now, the key word at the beginning of this paragraph is “representative”. I am not quite sure how this would work in practice. I am, however, quite sure it is not 300 editors (or Spartans) voting on some page. It could include votes by a randomized subset of readers. It could also include “calls to vote” as part of beta features, e.g. if you had the feature enabled in the last week. These could be repeated over time, as the “product” would change, sometimes significantly so, as happened with the Visual Editor; a “no” from three months ago would be quite invalid today.

Finally, I believe we need at least part of the above written out, and agreed upon, by both the WMF and “the communities”. It is my hope that enough people will share my opinion that both “parties” still have a common goal. Because the house that is Wikipedia cannot stand without a foundation, and a foundation without a house on top is but a dirty pond.

Evacuation

Wikimedia Commons is a great project. Over 21 million freely licensed files speak a clear language. But, like all projects of such a magnitude, there are some issues that somewhat dampen the joy. The major source of conflict is the duality of Commons: on one hand, it is a stand-alone repository of free files; on the other hand, it is the central repository for files used on many other projects of the Wikimedia family, first and foremost Wikipedia. Some Wikipedias, the Spanish one for example, have completely deactivated local file storage and rely entirely on Commons; others prefer Commons, but keep local file storage around for special cases, like the “fair use” files on English Wikipedia, and “Freedom of Panorama” images on German Wikipedia.

The ExCommons interface on the deletion page.

Commons admins (myself among them, albeit more in a technical capacity) want to keep Commons “clean”, and remove non-free images. While this is only proper, it can lead to issues with the projects depending on Commons, when files that are used e.g. on Wikipedia get deleted on Commons. The issue becomes aggravated if the deletion appears to be “overzealous”; some admins interpret “only free images” as “there must not be a shadow of a doubt”. When files that are very likely to be free, and will (in all likelihood) never see a takedown notice from a third party, are deleted because of a vague feeling of unease, and thus vanish from a dozen Wikipedias without warning, it is bound to raise the ire of someone. Yet Commons admins do have a point when they want to keep their project in excellent legal shape, no matter the consequences.

One of my most popular tools is CommonsHelper, which helps to transfer suitable images from other projects to Commons. Today, I try to at least reduce the impact of trigger-happy file deletions on Commons by throwing some tech at the admins there: I present ExCommons. It is a small JavaScript tool that reverses the direction of CommonsHelper: it can transfer files from Commons to other Wikimedia projects, using OAuth.

This is what an “evacuated” file looks like.

It presents a list of projects that use the file; the list loads automatically on the special page for deletion (Commons admins can try it here; but don’t actually delete the image!). A sketch of how the underlying usage data can be fetched follows the list below.

  • Sites that use the file in one or more articles are marked in bold; other sites use the file in other namespaces, where a loss might not be as critical
  • Sites that are known to have local upload disabled are shown but unavailable
  • Sites that are known to allow a wider range of files and use the file in an article are automatically checked; others can be checked manually
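
The sketch below shows roughly how such a usage list can be derived from the globalusage property of the Commons API; this is not the actual ExCommons code, just the idea (the grouping and the “article namespace means bold” rule mirror the list above, the plumbing and the assumption that guprop=namespace returns an ns field are mine):

// Group the global usage of a Commons file by wiki, and flag wikis that use
// it in the article namespace (ns 0) – those are the ones shown in bold.
const COMMONS_API = 'https://commons.wikimedia.org/w/api.php';

async function usageByWiki(fileTitle) {
  const url = COMMONS_API + '?' + new URLSearchParams({
    action: 'query', prop: 'globalusage', titles: fileTitle,
    guprop: 'namespace', gulimit: 'max', format: 'json', origin: '*'
  });
  const data = await (await fetch(url)).json();
  const page = Object.values(data.query.pages)[0];
  const wikis = new Map();
  for (const use of page.globalusage || []) {
    const entry = wikis.get(use.wiki) || { inArticles: false, uses: 0 };
    entry.uses++;
    if (parseInt(use.ns, 10) === 0) entry.inArticles = true;
    wikis.set(use.wiki, entry);
  }
  return wikis; // e.g. Map { 'en.wikipedia.org' => { inArticles: true, uses: 3 }, … }
}

usageByWiki('File:Example.jpg').then(console.log);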

If you have authorized OAuth file upload, you can click the “Copy file to selected wikis” button, and the tool will attempt to copy the file there. The upload will be done under your user name. Any Commons deletion template will be removed to avoid confusion; a {{From Commons}} template will be added. I created that template for the English Wikipedia; it states that the file was deleted on Commons and that it might not be suitable on en.wp either, and it includes a No-Commons template to prevent re-uploading, a tracking category, etc.

For this tool to be permanently enabled, add the line

importScript('MediaWiki:ExCommons.js') ;

to your common.js page.

This tool should allow Commons admins to quickly and painlessly “rescue” files to Wikipedias that use them, prior to their deletion on Commons. It is said that social problems cannot be fixed by technology; but sometimes they can be alleviated.

The Games Continue

Two weeks after releasing the first version of The Wikidata Game, I feel a quick look at the progress is in order.

First, thank you everyone for trying, playing, and giving feedback! The response has been overwhelming; sometimes quite literally so, thus I ask your forgiveness if I can’t quite keep up with the many suggestions coming in through the issue tracker, email, Twitter, and various Wikidata talk pages. Also, my apologies to the Wikidata admins, especially those patrolling RfD, for the flood of deletion requests that the merge sub-game can cause at times.

Now, some numbers. There are now six sub-games you can play, and play you do. At the time of writing, 643 players have made an astonishing 352,710 decisions through the game, many of which result in improving Wikidata directly, or at least keep other players from having to make the same decision over again.

Let’s look at a single game as an example. The merge game has a total of ~200K candidate item pairs, selected by identical labels; one of these pairs, selected at random, is presented to the user to decide whether the items describe the same “object” (and should thus be merged), or whether they just happen to have the same name, and should not be shown in the game again. ~20% of item pairs have such a decision so far, which comes to ~3,000 item pairs per day in this game alone. At that speed, all candidates could be checked in two months’ time; realistically, a “core” of pairs having only articles in smaller languages is likely to linger much longer.
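
(For the record, the “two months” figure is nothing fancier than this back-of-the-envelope calculation, using the rounded numbers above:)

// Back-of-the-envelope check, using the rounded figures from above
const totalPairs = 200000;     // ~200K candidate pairs
const alreadyDecided = 0.20;   // ~20% have a decision
const decisionsPerDay = 3000;  // ~3,000 pairs per day
const daysLeft = totalPairs * (1 - alreadyDecided) / decisionsPerDay;
console.log(Math.round(daysLeft) + ' days to go'); // ≈ 53 days, i.e. under two months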

Of the item pairs with decisions, ~30% were judged to be identical (and thus merged), while ~31% were found to be different. But wait, what about the other 39%? Well, there are automatic cleanup operations going on while you play! ~26% of item pairs, when loaded for presentation to the user, were found to contain at least one deleted item; most likely, someone merged them “by hand” (these probably include a few thousand species items that I accidentally created earlier…). ~6% contained at least one item that was marked as a disambiguation page since the candidate list was created. And ~9% were automatically discarded because one of the items had a link to the other, which implies a relation and, therefore, that the items are not identical.

As with the other sub-games, new candidates are automatically added every day. At the same time, users and automated filters resolve the candidate status. So far, resolving happens much quicker than addition of new candidates, which means there is light at the end of the tunnel.

Gender property assignments over time.

Merging is a complex and slow decision. Some “quicker” games look even better in terms of numbers: The “gender” game, assigning a male or female tag to person items, has completed 42% of its ~390K candidates, a rate of almost 12K per day. The “sex ratio” is ~80% male to ~18% female (plus ~2% items already tagged or deleted on Wikidata). This is slightly “better” for women than the Wikidata average (85% vs. 15%), maybe because it does “solve” rare and ambiguous names as well, which are usually not tagged by bots, or because it has no selection bias when presenting candidates.

The disambiguation game is already running out of candidates (at 82% of ~23K candidates). Even the “birth/death date” game, barely a day old, has already over 10K decisions made (with over 84% resulting in the addition of one or two dates to Wikidata).

In closing, I want to thank everyone involved again, and encourage you to keep playing, or to help this effort in other ways: by helping out on Wikidata RfD, by fixing items flagged as potentially problematic, by submitting code patches, or even by becoming a co-maintainer of The Game.

The Game Is On

Game main page.

Gamification. One of those horrible buzzwords that are thrown around by everyone these days, between “cloud computing” and “the internet of things” (as opposed to the internet of people fitted with ethernet jacks, or those who get a good WiFi signal on their tooth fillings). Sure enough, gamification in the Wiki-verse has not been met with resounding success yet, unless you consider Wikipedia itself an MMORPG, or count the various vandal-fighting tools. Now, editing Wikipedia is a complex issue, which doesn’t really lend itself to game-like activity. But there’s a (not quite so) new player in town: Wikidata. I saw an opening, and I went for it. Without further ado, I give you Wikidata – The Game (for desktop and mobile)! (Note for the adventurous: Tools Labs is slightly shaky at the moment of writing this; if the page doesn’t load, try again in an hour…)

So what’s the approach here? I feel the crucial issue for gamification is breaking complicated processes down into simple actions, which manifest themselves as plain decisions – “A”, “B”, or “I don’t want to decide this now!”. I believe the third option to be of essential importance; it is, unfortunately, mostly absent from Real Life™, and the last thing people want in a game is feeling pressured into making a decision. My Wikidata game acts as a framework of sub-games, all of which use that three-option approach. The framework takes care of things like the landing page, high scores, communication etc., so the individual game modules can focus on the essentials (a toy sketch of this framework/sub-game split follows the list below). For this initial release, I have:

The "merge items" game.

The “merge items” game.

  • Merge items shows you two items that have the same label or alias. Are they the same topic, and should thus be merged? One button will merge them on Wikidata (and leave a deletion request), the other will mark the pair as “not the same” within the game, not showing this specific combination again.
  • Person shows you an item that has no “instance of” property, but might be a person based on its label (the first word of the label is also the first word in another item, which is a person). One button sets “instance of:person” on Wikidata, the other prevents it from being offered in this game again.
  • Gender shows you an item that is a person, but has no gender property set. Set the property on Wikidata to “male” or “female”, or skip this item (like you can do with the other games – skipped items will show up again eventually).
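
As promised above, here is a toy sketch of the framework/sub-game split. None of this is the actual game code, and all names are made up; it merely shows how a sub-game only needs to supply candidates and its two “real” actions, while the framework owns the third option:

// Toy sketch (not the actual game code): a sub-game supplies candidates and
// its two "real" actions; the framework adds the all-important third option.
function makeGame(name, nextCandidate, actionA, actionB) {
  return { name, nextCandidate, actions: { A: actionA, B: actionB } };
}

// Stub "merge items" sub-game; the real candidates come from a server-side table
const mergeGame = makeGame(
  'merge',
  async () => ({ item1: 'Q111', item2: 'Q222' }),                   // pretend candidate pair
  async pair => console.log('would merge', pair.item1, pair.item2), // same topic
  async pair => console.log('would mark as different', pair)        // same name only
);

async function play(game, choice) {
  const candidate = await game.nextCandidate();
  if (choice === 'skip') return;          // third option: skipped items resurface later
  await game.actions[choice](candidate);  // 'A' or 'B' acts on Wikidata / the game database
}

play(mergeGame, 'A');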

There is also an option to randomly pick one game each time you press a button in the previous one – slightly more “challenging” than the single-game mode, which one can play at quite high speed. Of course, this simplification misses a lot of “fine-tuning” – what if you are asked to decide the gender of an item that has been accidentally tagged as “person”? What if the gender of this person is something other than “male” or “female”? Handling all these special cases would, of course, be possible – but it would destroy the simplicity of the three-button interface. The games always leave you a “way out” – when in doubt, skip the decision. Someone else will take care of it, eventually, probably on Wikidata proper.

Another point worth mentioning is the speed of the game. I took some measures to ensure the user never, ever has to wait for the game. First, all the potential decisions are made server-side and written into a database; for example, there are ~290K people waiting for “gender assignment”, and candidates are updated once a day. Upon loading the game website, a single candidate entry for each game is loaded in the background, so one will be ready for you instantaneously, no matter which game you choose. Upon opening a specific game, the cache is filled with four more candidates, and kept at that level; at no point will you have to wait for a new page to appear once you have made a decision on the current one (I actually had to add a brief fade-out-fade-in sequence, so that the user can notice that a new page has been loaded – it’s that fast). Actions (merging items, requesting deletions, adding statements, remembering not to show items again) are done in the background as well, so no waiting for that either.
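
The “never wait” part boils down to a small prefetching queue per game. A sketch of the idea (the target depth of four matches the description above; the candidate loader is a stand-in for the real server call):

// Keep a per-game queue of prefetched candidates topped up to four entries,
// so the next page is ready the instant the user decides on the current one.
function makeCandidateCache(loadOneCandidate, depth = 4) {
  const queue = [];
  const topUp = () => {
    while (queue.length < depth) queue.push(loadOneCandidate()); // fire requests now, await later
  };
  topUp();
  return {
    next: async () => {
      const candidate = queue.shift(); // usually already resolved: no visible waiting
      topUp();                         // refill in the background
      return candidate;
    }
  };
}

// Stand-in loader; the real game fetches a pre-computed candidate from the server
let n = 0;
const cache = makeCandidateCache(async () => ({ id: ++n }));
cache.next().then(c => console.log('first candidate:', c));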

What else is there to say? The tool requires the user to allow OAuth edits, both for high-score keeping and for accountability for the edits made through the game. The game interface is English-only at the moment, but at least the main page has been designed with i18n in mind. The games are designed to work on desktop and mobile alike; passing time on the bus has never been this world-knowledge-improving! As a small additional incentive, there are high-score lists per game, as well as the overall progress players have made in improving Wikidata. Finally, the code for the individual games is quite small; ~50 lines of code for the Person game, plus the updating code to find more candidates, run daily.

Finally, I hope some of you will enjoy playing Wikidata – The Game, and maybe some of you would like to work with me, either as programmers to share the tool (maybe even the good folks of WMF?), or with ideas for new games. I already have a few of those; I’m thinking images…

Is it a bot? Is it a user?

So I was recently blocked on Wikidata. That was one day after I passed a quarter million edits there. These two events are related, in an odd way. I have been using one of my tools to perform some rudimentary mass-adding of information; specifically, the tool was adding “instance of:human” to all Wikidata items in the English Wikipedia category “living people”. I had been running this for over a day before (there were ~50K items missing this basic fact!), but eventually, someone got annoyed with me for flooding Recent Changes, and I was blocked when I didn’t reply on-Wiki quickly enough.

I have since been unblocked, and I throttled the tool, which now waits 10 seconds between every (!) edit. No harm done, but I believe it is an early sign of a larger controversy: was I running an “unauthorized bot”, as the message on my talk page was titled? I don’t think I was. Let me explain.
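
For the curious, the throttled run amounts to little more than this (heavily simplified; the edit itself is stubbed out here, as the real tool performs it through the Wikidata API with the user’s OAuth grant, and “instance of: human” is P31:Q5):

// Heavily simplified sketch of the throttled run: add "instance of: human"
// (P31:Q5) to a list of items, waiting ten seconds between edits. The edit
// itself is stubbed out; the real tool performs it via the Wikidata API
// (wbcreateclaim), authenticated through the user's OAuth grant.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function addInstanceOfHuman(qid) {
  console.log(`would add P31:Q5 (instance of: human) to ${qid}`); // stand-in for the API call
}

async function run(itemIds) {
  for (const qid of itemIds) {
    await addInstanceOfHuman(qid);
    await sleep(10000); // 10 s throttle between every (!) edit
  }
}

run(['Q42']); // Douglas Adams – long since tagged, purely an example item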

Bots have been with Wikipedia and other Wikimedia projects almost since the beginning. Almost as old are complaints about bots, the best known probably being Rambot‘s addition of >30K auto-generated stubs on English Wikipedia. Besides making every second article on Wikipedia about a town in the U.S. no one ever heard of (causing exceptional dull-age in the “random page” function), it also flooded Recent Changes, which eventually led to bot policies and the bot flag, hiding bot edits from the default Recent Changes view. These days, bots account for a large share of Wikipedia editing; I seem to remember that most Wikipedia edits are actually done by bots, ranging from talk page archiving to vandal fighting.

So how was my mass-adding of information different from Rambot’s? Rambot was written to perform a very specific purpose: construct plain-text descriptions of towns from a dataset, then add these to Wikipedia. It was run once, for that specific purpose, by its creator. Other bots, like those automatically reverting certain types of vandalism, run without any supervision at the time of the edit (which is the whole point, in that case).

Herein lies the distinction: yes, I did write the tool, and I did operate it, but those are two different roles. That is, anyone with a Wikidata user name can use that tool, under their own user name, via OAuth. Also, while the tool does perform an algorithmically defined function, it is not really constrained to a single purpose, as a “classic” bot would be. That alone would most likely disqualify it from getting a “bot permission” on Wikidata (unless the mood has really changed for the better there since the last time I tried). Certainly, there are overlaps between what a bot does and what my tool does; that does not justify putting the “bot” label on it, just because it’s the only label you’ve got.

To be sure, no one (as far as I know) disputed that the edits were actually correct (unlike Rambot, which initially added a few thousand “broken” articles). And the fact that ~50K Wikidata items about living people were not even “tagged” as being about people surely highlights the necessity of such edits. Certainly, no one would object to me getting a list of items that need the “instance of:human” statement and adding the statement manually. All the tool does is make such editing easier and faster for me.

Now, there is the issue of me “flooding” the Recent Changes page. I do agree that this is an issue (which is why I am throttling the tool at the moment). I have filed a bug report to address it, so that I can eventually remove the throttling again, and users, bots, and OAuth-based tools can live in harmony once more.

Post scriptum

I am running a lot of tools on Labs. As with most software, the majority of feedback I get for those tools falls into one of two categories: bug reports and feature requests, the latter often in the form “can the tool get input from/filter on/output to…”. In many cases, that is quick to implement; others are more tricky. Besides increasing the complexity of tools, and filling up the interface with rarely-used buttons and input fields, the combinations (“…as you did in that other tool…”) would eventually exceed my coding bandwidth. And with “eventually”, I mean some time ago.

Wouldn’t it be better if users could “connect” tools on their own? Take the output of tool X and use it as the input of tool Y? About two years ago, I tried to let users pipeline some tools on their own; the uptake, however, was rather underwhelming, which might have been due to the early stage of this “meta-tool”, and its somewhat limited flexibility.

A script and its output.

So today, I present a new approach to the issue: scripting! Using toolscript, users can now take results from other tools such as category intersection and Wikidata Query, filter and combine them, and display the results, or even use tools like WiDaR to perform on-wiki actions. Many of these actions come “packaged” with this new tool, and the user has almost unlimited flexibility in operating on the data. This flexibility, however, is bought with the scary word “programming” (or “scripting”, as the euphemism goes). In essence, the tool runs JavaScript code that the user types or pastes into a text box.

Still here? Good! Because, first, there are some examples you can copy, run, and play with; if people can learn MediaWiki markup this way, JavaScript should pose little challenge. Second, I am working on a built-in script storage, which should add many more example scripts, ready to run (in the meantime, I recommend a wiki or pastebin). Third, all built-in functions use synchronous data access (no callbacks!), which makes JavaScript a lot more … scriptable, as in “logical linear flow”.

The basic approach is to generate one or more page lists (on a single Wikimedia project), and then operate on those. One can merge lists, filter them, “flip” from Wikipedia to associated Wikidata items and back, etc. Consider this script, which I wrote for my dutiful beta tester Gerard:

all_items = ts.getNewList('','wikidata');
cat = ts.getNewList('it','wikipedia').addPage('Category:Morti nel 2014') ;
cat_item = cat.getWikidataItems().loadWikidataInfo();
$.each ( cat_item.pages[0].wd.sitelinks , function ( site , sitelink ) {
  var s = ts.getNewList(site).addPage(sitelink.title);
  if ( s.pages[0].page_namespace != 14 ) return ;
  var tree = ts.categorytree({language:s.language,project:s.project,root:s.pages[0].page_title,redirects:'none'}) ;
  var items = tree.getWikidataItems().hasProperty("P570",false);
  all_items = all_items.join(items);
} )
all_items.show();

This short script will display a list of all Wikidata items that are in a “died 2014” category tree on any Wikipedia but do not yet have a death date. The steps are as follows:

  • Takes the “Category:Morti nel 2014” from it.wikipedia
  • Finds the associated Wikidata item
  • Gets the item data for that item
  • For all of the site links into different projects on this item:
    • Checks if the link is a category
    • Gets the pages in the category tree for that category, on that site
    • Gets the associated Wikidata items for those pages
    • Removes those items that already have a death date
    • Adds the ones without a death date to a “collection list”
  • Finally, displays that list of Wikidata items with missing death dates

Thus, with a handful of straightforward functions (like “get Wikidata items for these pages”), one can ask complex questions of Wikimedia sites. A slight modification could, for example, create Wikidata items for the pages in these categories. All functions are documented in the tool. Many more can be added on request; and, as with adding Wikidata labels, a single added function can enable many more use cases.

I hope that this tool can become a hub for users who want more than the “simple” tools, to answer complex questions, or automate tedious actions.

The OAuth haters’ FAQ

Shortly after the dawn of the toolserver, the necessity for authentication in some tools became apparent. We couldn’t just let anyone use tools to upload files to Commons, doing so anonymously, hiding behind the tools’ user account; that would be like allowing anyone to edit Wikipedia anonymously. Crazy talk! But toolserver policies forbade tools asking for users’ Wikipedia/Commons passwords. So, different tool authors came up with different solutions; I created TUSC to have a single authentication mechanism across my various tools.

As some of you may have noticed, some of my new tools that require authentication are using OAuth instead of TUSC. Not only that, but I have been busy porting some of my long-standing tools, like flickr2commons and now commonshelper, to OAuth. This has been met with … unease by various parties. I hope that this post can alleviate some concerns, or at least answer some questions.

Q: Why did you switch from TUSC to OAuth?
A: TUSC was a crutch, and always has been. Not only is OAuth a standard technology, it is now also the “official” way to use tools that require user rights on Wikimedia sites.

Q: So I’m not uploading as your bot, but as myself?
A: Yes! You took the time and interest to run the tool; that effort should be rewarded by having the upload associated with your user account. It will be so much easier to see who uploaded a file, to assign glory and blame alike. Also, the army of people who have been haunting me for re-use rights and image sources, just because my name was linked to the uploading bot account, will now come after YOU! Progress!!

Q: OK, maybe for new tools. But the old tools were working fine!
A: No, they were not. Years ago, when the WMF changed the API login procedure, I switched my tools to use the Peachy toolkit, so I would not have to do the “fiddly bits” myself for every tool. However, a few months ago, something changed again, causing the Peachy uploads to fail. It turned out that Peachy was no longer being developed, so I had to go back and hack my own upload code. Something was wrong with that as well, leading to a flurry of bug reports across my Commons-uploading tools. The subsequent switch to OAuth uploads wasn’t exactly smooth for me either, but now that it’s running, it should work nicely for a while. Yeah, right.

Q: But now all the tools are using JavaScript to upload. I hate JavaScript!
A: That pesky JavaScript is all over the web these days. So you probably have installed NoScript in your browser (if you don’t, do!). You can easily set an exception for the tools server (or specific tools), so you’re safe from the evil JavaScript out there, while the tools “just work”.

Q: But now it won’t work in my text browser!
A: Text browser? In 2014? Really? You seem to be an old-school 1337 hax0r to be using that. All tools on tools.wmflabs.org are multi-maintainer; I’ll be happy to add you as a co-maintainer, so you can add a text browser mode to any tool you like. You don’t have time, you say? Fancy that. In the meantime, you could use elinks which can support JavaScript.

Q: But you changed my favourite tool! I don’t like change.
A: Sorry, Sheldon. The only constant is change.

In closing, the good old bot has clocked over 1.2 million uploads on Commons, or ~6% of all files there. I think it deserves some well-earned rest.