The Reference Wars

In a recent Wikipedia Signpost Op-Ed, Andreas Kolbe wrote about Wikidata and references. He comes to the conclusion that Wikidata needs more (non-Wikipedia) references, a statement I wholeheartedly agree with. He also divines that this will never happen, that Wikidata is doomed, while at the same time somehow being controlled by Google and Microsoft; I will not comment on these “conclusions”, as others have already done so elsewhere.

Andreas also uses my own Wikidata statistics to make his point about missing references on Wikidata. The numbers I show are useful, IMHO, to show the remarkable progress of Wikidata, but they are much too crude to draw conclusions about the state of references there. Also, the impression I get from Andreas’ text is that, while Wikipedia has some issues, references are basically OK, whereas they are essentially non-existent in Wikidata.

So I thought I’d have a look at some actual numbers, especially comparing Wikipedia and Wikidata in terms of references.

One key issue is that there is no built-in way to get metrics about statements and references from Wikipedia. I therefore developed my own approach. Given a Wikipedia article, I use the REST API to get HTML for the article. I then count the number of reference uses (essentially, <ref> tags) in the article; note that this number is larger than (or at least equal to) the number of references at the bottom of the page. Then, I strip the HTML tags, and count the number of sentences (starts with an upper-case character, has at least 50 characters, ends with a “.”); I checked the counts for a few example articles against other sentence-counting tools on the web, which yielded similar results. I then assume that each sentence in the article contains one statement (or fact); in reality, a sentence often contains several (such as the first sentence of a biographical article), but I am aiming for a lower bound here. (Any sentence not containing a statement/fact should be deleted from Wikipedia anyway.) A useful metric derived from both the number of reference uses and the number of statements (=sentences) is the references-per-statement (RPS) ratio.
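The counting steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the numbers in this post: the function names (`fetch_article_html`, `count_sentences`, `wikipedia_rps`), the User-Agent string, and the choice of the `page/html` REST endpoint are assumptions, as is the idea that a reference use shows up in the HTML as a `<sup>` element with the class `reference`.

```python
import re
import urllib.parse
import urllib.request

# Assumption: each reference use is rendered as <sup class="... reference ...">...</sup>.
REF_USE = re.compile(
    r'<sup\b[^>]*\bclass="[^"]*(?<![\w-])reference(?![\w-])[^"]*"[^>]*>.*?</sup>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def fetch_article_html(title, lang="en"):
    """Fetch rendered HTML for one article (endpoint choice is an assumption)."""
    url = (
        f"https://{lang}.wikipedia.org/api/rest_v1/page/html/"
        + urllib.parse.quote(title, safe="")
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "rps-sketch/0.1 (hypothetical example)"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def count_sentences(text):
    """Count chunks that start upper-case, have >= 50 characters and end with '.'."""
    chunks = re.split(r"(?<=\.)\s+", text.strip())
    return sum(
        1 for c in chunks if len(c) >= 50 and c[0].isupper() and c.endswith(".")
    )


def wikipedia_rps(html):
    """Return (reference_uses, sentences, rps) for one article's HTML."""
    ref_uses = len(REF_USE.findall(html))
    # Drop the reference markers (e.g. "[1]") first, then all remaining tags.
    text = TAG.sub(" ", REF_USE.sub(" ", html))
    text = re.sub(r"\s+", " ", text)
    sentences = count_sentences(text)
    rps = ref_uses / sentences if sentences else 0.0
    return ref_uses, sentences, rps
```

Calling `wikipedia_rps(fetch_article_html("Douglas Adams"))` would give the three numbers for one article. Splitting on a period followed by whitespace is crude (abbreviations such as "e.g." split a sentence early), which is in keeping with the lower-bound spirit of the count.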

For Wikidata, a similar metric can be calculated. For practical purposes, I skip statements of the “string” type, as they are mostly external references in themselves (e.g. VIAF identifiers); I also skip “media”-type statements, as they should have “references” in their file description page on Commons. For references, I do not count “imported from Wikipedia”, as these are not “real” references, but rather placeholders for future improvement. Again, an RPS ratio can be computed.
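A rough sketch of the Wikidata side, working on the JSON of a single item as returned by the Wikidata API (`claims`, `mainsnak`, `references`, `snaks`). The names `wikidata_rps`, `SKIPPED_DATATYPES` and `IMPORTED_FROM` are made up for illustration. Two assumptions are baked in: "media" is taken to mean the `commonsMedia` datatype, and a reference is treated as an "imported from" placeholder whenever it contains a P143 snak, even if it carries other snaks as well.

```python
# Datatypes left out of the statement count (assumption: "media" = commonsMedia).
SKIPPED_DATATYPES = {"string", "commonsMedia"}
# P143 is the "imported from" property; references using it are not counted.
IMPORTED_FROM = "P143"


def wikidata_rps(entity):
    """Return (statements, references, rps) for one Wikidata item's JSON."""
    statements = 0
    references = 0
    for claims in entity.get("claims", {}).values():
        for claim in claims:
            datatype = claim.get("mainsnak", {}).get("datatype")
            if datatype in SKIPPED_DATATYPES:
                continue  # skipped statements contribute neither count
            statements += 1
            for reference in claim.get("references", []):
                if IMPORTED_FROM not in reference.get("snaks", {}):
                    references += 1
    rps = references / statements if statements else 0.0
    return statements, references, rps
```

Keeping the skipped datatypes in one set makes it easy to rerun the numbers with a different exclusion list, for example one that also drops identifier-like properties.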

I then calculated these ratios for 4,683 Featured Articles from English Wikipedia and their associated Wikidata items (data). As these articles have been significantly worked over and approved by the English Wikipedia community, they should represent the “best case scenario” for Wikipedia.

Indeed, the RPS ratio is higher for Wikipedia in 87% of cases, which would mean that Wikipedia is better referenced than Wikidata. But keep in mind that this represents the best of the best of the best of English Wikipedia articles, fifteen years in the making, compared to a three-and-a-half-year-old Wikidata (and references were not supported for the first year or so). This is as good as it gets for Wikipedia, and still, Wikidata has a better RPS in about 13% of cases.

Even more interesting, IMHO: taking the mean of both the number of statements and the number of references for Wikipedia and Wikidata, respectively, and calculating the RPS ratios for those means, yields 0.32 for Wikipedia and 0.15 for Wikidata. This seems counter-intuitive, given the previous 87/13 “ratio of ratios”. However, further investigation shows that only 1,305 (~28%) of Wikidata items have any references at all, but where there are references, they usually outshine Wikipedia; about half of the items with at least one reference have a better RPS ratio than the respective Wikipedia article. This seems to indicate a “care factor” at work; where someone cared about adding references to the item, it was done quite well. Wikidata RPS ratios range up to 1.5, meaning two statements are, on average, supported by three references, whereas Wikipedia reaches “peak RPS ratio” at 0.93, or slightly less than one reference per statement.
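The difference between the per-item comparison and the ratio of the means is easy to see with a toy calculation. The sketch below uses a hypothetical helper, `compare_rps`, and invented numbers (not the real data set): three Wikidata items without any references and one that is well referenced.

```python
def compare_rps(items):
    """items: list of (wp_refs, wp_statements, wd_refs, wd_statements) tuples.

    Returns (share of items where Wikipedia has the higher RPS,
             RPS of the Wikipedia means, RPS of the Wikidata means).
    """
    if not items:
        raise ValueError("need at least one item")

    def ratio(refs, statements):
        return refs / statements if statements else 0.0

    wikipedia_wins = sum(
        1
        for wp_r, wp_s, wd_r, wd_s in items
        if ratio(wp_r, wp_s) > ratio(wd_r, wd_s)
    )
    # mean(refs) / mean(statements) equals sum(refs) / sum(statements).
    means_wp = ratio(sum(i[0] for i in items), sum(i[1] for i in items))
    means_wd = ratio(sum(i[2] for i in items), sum(i[3] for i in items))
    return wikipedia_wins / len(items), means_wp, means_wd


# Invented toy data: only the last Wikidata item has references at all.
toy_items = [
    (30, 100, 0, 20),
    (40, 100, 0, 20),
    (20, 100, 0, 20),
    (30, 100, 12, 20),
]
share, means_wp, means_wd = compare_rps(toy_items)
```

With these made-up numbers, Wikipedia has the higher RPS in 75% of the items and the ratios of the means are 0.3 and 0.15, yet the single referenced Wikidata item (0.6) is twice as well referenced as its article (0.3). A few well-referenced items can coexist with a poor overall picture, which is the pattern described above.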

I believe these numbers show that Wikidata can equal and surpass Wikipedia in terms of “referencedness”, but it is a function of attention to the items, which in turn is a matter of man- and bot-hours spent. Indeed, for the Wikidata showcase items (the equivalent of Featured Articles on Wikipedia), the Wikidata RPS ratio is better than that of the associated English Wikipedia article in 19 out of 24 cases (~80%).

So will Wikidata ever catch up to Wikipedia in terms of RPS ratio? I think so. The ability of Wikidata to be reliably edited by a machine allows for improvement by automated and semi-automated bots, tools, games, on-wiki gadgets, etc., which allow for a much steeper editing rate, as I demonstrated previously for images, where Wikidata went from nothing to second place in about two years, and is now angling for the pole position (~1.1M images at the moment). I see no reason to doubt this will happen to references as well.

6 Comments

  1. GerardM wrote:

    Thank you. This is exactly what was needed 🙂

    Thursday, January 7, 2016 at 08:29 | Permalink
  2. There seem to be quite a number of properties where references are less important and sort of given, e.g., where would I find a reference for Q13520818 being a human, male and having the given name ‘Magnus’?

    Thursday, January 7, 2016 at 14:46 | Permalink
  3. Magnus wrote:

    “human” and “given name” could be referenced to GND or VIAF. One of these may give gender as well.

    Thursday, January 7, 2016 at 14:50 | Permalink
  4. Andy Mabbett wrote:

    We could set a bot running, to add references for those properties, on each item about a human, that has a VIAF or GND (or ORCID, or other) identifier. That would improve the stats critiqued in The Signpost markedly. While of course, doing nothing to further the mission of providing free, global, access to the sum of all knowledge.

    Saturday, January 9, 2016 at 13:29 | Permalink
  5. Andy Mabbett wrote:

    …or we could just exclude, or list separately, such properties from the stats.

    Saturday, January 9, 2016 at 13:33 | Permalink
  6. Magnus wrote:

    I am already running such a bot, occasionally:
    https://www.wikidata.org/wiki/Special:Contributions/SourcererBot

    Over 160K edits so far, exclusively references.

    Saturday, January 9, 2016 at 14:08 | Permalink