
The Game of Source

Wikidata has beautiful mechanisms to associate individual claims with sources for that claim. However, finding and adding such sources is surprisingly complex, and, between multiple open tabs and the somewhat sluggish interface, can strain the patience of the most well-meaning editor.

I had previously attempted to simplify adding sources to Wikidata statements, and while I believe this interface to be much easier to use than Wikidata proper, it is still clunky and has issues on mobile.

So, I went ahead and reduced the issue to its most basic form: Does a short text snippet support a specific claim? To achieve such a simplified interface, the following must happen:

  • A Wikidata item is picked (at random)
  • The associated Wikipedia articles are investigated
  • The external links of these articles are merged
  • The HTML for these URLs is retrieved, and HTML tags are stripped, leaving only the plain text
  • Claims from the item are prepared. This includes getting the label of “item statements” (Pxx => Qyy), and formatting the dates of “time statements” in various ways (2015-06-01, “June 1, 2015”, etc.)
  • The claim values (labels and dates, for now) are searched for in the HTML of the external URLs above
  • The hits, including some flanking text, are stored in a database
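
The text-matching steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names (`strip_html`, `date_variants`, `find_hits`), the flank width, and the third date format are assumptions made for the example, and fetching the URLs and storing hits in a database are left out.

```python
import re
from datetime import date
from html.parser import HTMLParser

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def date_variants(d):
    """Format a date in several ways (the exact set is an assumption)."""
    month = MONTHS[d.month - 1]
    return [
        d.isoformat(),                  # 2015-06-01
        f"{month} {d.day}, {d.year}",   # June 1, 2015
        f"{d.day} {month} {d.year}",    # 1 June 2015
    ]


def find_hits(text, value, flank=60):
    """Return every occurrence of value in text with some flanking text."""
    hits = []
    for m in re.finditer(re.escape(value), text, flags=re.IGNORECASE):
        start = max(0, m.start() - flank)
        end = min(len(text), m.end() + flank)
        hits.append(text[start:end])
    return hits


if __name__ == "__main__":
    page = ("<html><head><style>p{}</style></head><body>"
            "<p>Douglas Adams was born on <b>March 11, 1952</b> "
            "in Cambridge.</p></body></html>")
    text = strip_html(page)
    for variant in date_variants(date(1952, 3, 11)):
        for hit in find_hits(text, variant):
            print(variant, "=>", hit)
```

A plain substring search like this is cheap, but it will also match values that appear by coincidence, which is why a human still has to confirm each hit.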

This process is repeated over and over. Finally, an interface presents the hits for a specific claim to the user. A single click can now add that URL as a source to the claim on Wikidata (via WiDaR), together with the original retrieval date (example). The entire set can be marked as “Done” (as in, don’t show this again to anyone), or skipped (claim goes back into the “pool”).
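
For illustration, a reference made of a URL plus a retrieval date is usually expressed on Wikidata with the properties P854 (reference URL) and P813 (retrieved). The sketch below builds such a reference in the JSON shape that, as far as I know, the Wikibase `wbsetreference` API expects in its `snaks` parameter. How WiDaR submits the edit is not described here, so the function name `build_reference_snaks` and the example URL are purely hypothetical.

```python
import json
from datetime import date

# Calendar model URI for the proleptic Gregorian calendar on Wikidata.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"


def build_reference_snaks(url, retrieved):
    """Build reference snaks: reference URL (P854) + retrieved (P813)."""
    return {
        "P854": [{
            "snaktype": "value",
            "property": "P854",
            "datavalue": {"type": "string", "value": url},
        }],
        "P813": [{
            "snaktype": "value",
            "property": "P813",
            "datavalue": {
                "type": "time",
                "value": {
                    "time": f"+{retrieved.isoformat()}T00:00:00Z",
                    "timezone": 0,
                    "before": 0,
                    "after": 0,
                    "precision": 11,  # 11 = day precision
                    "calendarmodel": GREGORIAN,
                },
            },
        }],
    }


if __name__ == "__main__":
    # Hypothetical URL; the date is when the page text was fetched.
    snaks = build_reference_snaks("http://example.org/bio", date(2015, 6, 1))
    print(json.dumps(snaks, indent=2))
```

Keeping the retrieval date from when the page was originally fetched, rather than the date of the click, records when the snippet was actually seen on that page.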

It is early days for this interface now. No doubt, many improvements are possible, and even though claims are added to the database in the background, there are only ~1,000 claims in there at the time of writing this. Patience.


  1. Thanks for a nice webservice (again).

    The first item/property I was presented with was Jack Bauer (Q24) and given name. There seem to be a few property-item combinations where it is not necessary to give a reference, such as given name, surname and perhaps title. Perhaps these properties should be left out?

    Tuesday, June 2, 2015 at 14:58 | Permalink
  2. Magnus wrote:

    Yes, I already filter out “instance of:taxon”, but I haven’t had a chance to collect bad ones yet. Start a list somewhere?

    Tuesday, June 2, 2015 at 15:00 | Permalink
  3. Nice game. Problematic properties I encountered are:

    (1) Things that have textual matches on almost all linked references, but where the match does not mean anything. For example: “given name”, “surname”, “parent taxon” (often a substring of the item label), very short value strings (e.g., “named after”::“1” for Sunday), very common strings (e.g., “English” or “in”), things where the value is the same as the label (“Portugal-country-Portugal” or “1909-point in time-1909”).

    (2) Things for which it is unclear how a proper reference could look.

    * “given name”/“surname” (hardly any source will highlight what the given name of a person is)
    * “native language” (words like “Korean” or “English” occur all over the place on pages about, say, a Korean football player, but where can I read which language someone spoke at home with his parents?)
    * “instance of” for many “obvious” cases. It is hard to find a proper reference for the fact that something is a “city” or “country”
    * “taxon rank” species: similar to “instance of”

    In some cases the reference finding seems like pulling yourself out of the swamp by your own collar. I cannot find any reference for the fact that Germany is a country without already making this assumption when looking for the reference (e.g., the CIA WFB will mention Germany-the-country as a country, but this does not tell me that /the/ Germany which we mean in Q183 is that Germany that the CIA WFB is talking about).

    For some cases, there would be better ways of finding candidates for sources. For example, personnel and publication year of films can always be found at IMDB and at Rotten Tomatoes; no need to show other things. The “is described by URL” and “website” properties would also be good candidates to look for references of many basic statements.

    It would be nice if the tool had a way to mark further statements as supported by a given reference. Often after reading a page I learned a lot of things about a topic (e.g., all sister cities, all actors, etc.) and it seems to be a pity that I only use this for one statement and forget the rest. Especially if the same property has many values, it could be likely that one reference covers all of them. More generally, it would be nice to study (after a while) which properties tend to use the same reference (e.g., birth date and birth place) to find synergies there.

    Monday, June 8, 2015 at 16:06 | Permalink