
Sex and artists

Now that I got your attention… Prompted by a post from Jane Darnell, I thought I’d quickly run some gender-related stats on artists in Wikidata. Specifically, the number of articles on Wikipedias for artists with a specific property, by gender.
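For those who want to reproduce the counts, one WDQ query per catalog and gender does the trick. Below is a minimal sketch; the WDQ endpoint and its JSON shape are written from memory, and I am assuming P650 as the RKDartists property (P245 would be the ULAN equivalent).

```javascript
// Count Wikidata items that have the RKDartists property (assumed: P650),
// split by gender (P21: male Q6581097 / female Q6581072).
// WDQ endpoint and its {status, items: [...]} response shape are assumptions.
const WDQ = 'https://wdq.wmflabs.org/api?q=';

async function countItems(query) {
  const res = await fetch(WDQ + encodeURIComponent(query));
  const data = await res.json();
  return data.items.length;          // WDQ returns matching item IDs as a flat array
}

(async () => {
  const male   = await countItems('claim[650] and claim[21:6581097]');
  const female = await countItems('claim[650] and claim[21:6581072]');
  console.log({ male, female });     // compare with the numbers quoted below
})();
```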

First, RKDartists (at the moment of writing, 21,859 male and 2,801 female artists on Wikidata):

Number of articles on male (x axis) and female (y axis) artists with the RKDartists property.
The line indicates the gender-unbiased coverage. Only Wikipedias with >= RKDartist articles were used.

As we can see, not a single Wikipedia reaches the unbiased line; all Wikipedias are biased towards male biographies. English, German, Dutch, and French Wikipedia come closest; however, that may be due to their approaching saturation (as in, almost complete coverage of all RKDartists) rather than to intrinsically unbiased views. Otherwise, Finnish Wikipedia seems to be closest to unbiased amongst the “mid-range” Wikipedias.

Doing the same for ULAN (33,057 men, 3,100 women) looks a little better:

Number of articles on male (x axis) and female (y axis) artists with the ULAN property.

Here, English Wikipedia actually has a tiny bias towards women. Breton and Maltese appear as “less biased outliers”.

I have uploaded both data sets here.

UPDATE: ODNB

The same analysis, for the ODNB identifier (male on the x axis, female on the y axis).

Pictures, reloaded

About four months ago, I blogged about Wikipedia pages vs. Wikidata items using images. In that post, I predicted that Wikidata would pass German Wikipedia in about four months’ time, so about the end of this month. Using the same metrics, it turns out that it’s a close run:

Site 2014-11 2015-03 Difference Per day
enwiki 1,726,772
enwiki (Commons only) 1,257,691
dewiki 709,736 729,577 19,841 182
wikidata 604,925 720,360 115,435 1,059
frwiki 602,664 623,400 20,736 190
ruwiki 491,916 509,436 17,520 161
itwiki 451,499 462,879 11,380 104
eswiki 414,308 425,399 11,091 102
jawiki 278,359 284,607 6,248 57

So, images in Wikidata items “grow” at about 1,000 per day, or ~900 per day faster than German Wikipedia. The difference has shrunk to ~9,000 pages/items. As there are 10 days to go for my prediction, it looks like I’m spot on…

Now, assuming a similar rate for enwiki, Wikidata should pass the Commons usage on en.wp in about two years.
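The arithmetic behind both predictions is simple enough to spell out. A quick sketch, using the figures from the table above; treating the single enwiki (Commons only) figure as its current count, and assuming it grows at roughly dewiki’s rate, are my assumptions here.

```javascript
// Back-of-the-envelope projection from the table above.
const dewiki   = { count: 729577, perDay: 182 };
const wikidata = { count: 720360, perDay: 1059 };
const enwikiCommonsOnly = 1257691;   // assumption: treat as current count, growing at ~dewiki's rate

const closingRate = wikidata.perDay - dewiki.perDay;                                 // ~877 items/day
console.log(Math.ceil((dewiki.count - wikidata.count) / closingRate));               // ~11 days
console.log(((enwikiCommonsOnly - wikidata.count) / closingRate / 365).toFixed(1));  // ~1.7 years, i.e. "about two"
```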

Linkin mash

So I recently blogged about automatic descriptions based on Wikidata. And as nice as these APIs are, what could they be used for? You got it – demo time!

Linkin Park band member Dave Farrell has no article on English Wikipedia (only a redirect to Linkin Park, which is unhelpful). He does, however, have a Wikidata item with articles in 35 other languages. This is, essentially, the situation you get on smaller Wikipedias – lots of articles in other languages, just not in yours. There is information about the subject, but unless you can read any of those other languages, it’s closed to you.

On English Wikipedia, I created a template for this situation a while ago. Instead of a redlink, you specify the target and the Wikidata item, and you get the “normal” redlink, as well as links to Wikidata and Reasonator. An improvement, undoubtedly, but still rather clunky.

Thus, I resorted to something from the ’90s – a “mash-up” of multiple existing parts. A little bit of Wikipedia mobile view, my automatic description API, the Wikipedia API to render wikitext as HTML, seasoned with some JavaScript, and boom! We have a Wikipedia clone – with a twist. By default, this mash-up will happily display the Linkin Park article; however, under the “Band members” section (about the middle of the page), Dave Farrell now has just another, normal link:

Band section on Wikipedia

Band section in the mash-up

The mash-up code recognizes the code generated by the template on Wikipedia, and replaces it with a normal-looking, special link. Clicking on that “Dave Farrell” link will lead to a live-generated page. It uses the automatic description API to get wikitext plus an infobox, then uses the Wikipedia API to render that as (mobile) HTML. And while the text is a little bit dull, it looks just like any other Wikipedia page rendered through the mash-up, image and all.
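For the technically curious, the core of the mash-up boils down to two API calls. Here is a rough sketch; the description-API URL, its parameter names, and the name of its response field are assumptions on my part (check the tool itself for the real interface), while the action=parse call is the standard MediaWiki API.

```javascript
// (1) Ask the description API for a long description as wiki markup (URL,
//     parameters and response field name are assumptions, not documented facts).
// (2) Have Wikipedia's own action=parse API render that wikitext as HTML.
async function renderItemPage(item, lang = 'en') {
  const desc = await fetch(
    `https://autodesc.toolforge.org/?q=${item}&lang=${lang}&mode=long&format=json`
  ).then(r => r.json());

  const params = new URLSearchParams({
    action: 'parse', format: 'json', contentmodel: 'wikitext',
    prop: 'text', text: desc.result            // response field name assumed
  });
  const parsed = await fetch(`https://${lang}.wikipedia.org/w/api.php`,
    { method: 'POST', body: params }).then(r => r.json());

  return parsed.parse.text['*'];               // HTML fragment for the generated article page
}

// e.g. renderItemPage('Q42', 'en').then(html => { /* inject into the mash-up DOM */ });
```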

I am well aware of the current limitations of this approach, including the potential deterrent to creating “proper” articles. However, with the much-hyped next billion internet users, many of them limited to the smaller Wikipedias, banging on our virtual door, such a mechanism could be a stop-gap measure to provide at least basic information in smaller languages, in a user-friendly way. Details of text generation for those languages, infoboxes, integration into Wikipedia proper, redlink markup, etc. would have to be worked out for this to happen.

Dave Farrell, as rendered by the mash-up


UPDATE: Aaaand… my template introduction has been reverted, effectively breaking the demo. I’m not going to start an edit war over this, you still have the screenshots.

Thy data, writ large

Ever since Rambot effectively doubled the size of English Wikipedia in a matter of days, automatic text generation from a dataset has been met with suspicion in the Wikiverse. Some text is better than none, for most readers, say some; but number-heavy, boring bot text is not really an encyclopaedia entry, and it could also take away some of the joy of writing, say others. To this day, it is an issue that can split Wikipedians into fierce combatant groups like little else.

Change of scenery. Wikidata is a young but vibrant Wikimedia project, and in many aspects still finding its shape. Each item on Wikidata can have a brief, textual description. This is helpful, for example in the current Wikipedia mobile app, where these descriptions are superimposed on a header image, say some; it is a waste of volunteers’ time to write text that just reiterates the item statements, say others. Some (including myself) say that manual descriptions make sense for a few items, but that the vast majority of items do not require a human to describe them.

The solution for both above issues is, of course, bot-generated text on-the-fly; text that is written by software based on a data source, but that is not permanently stored. That way, essential information can be given to the reader, without discouraging writers, and without the need to maintain and update the bot-generated text, as it is never stored in the first place, but updated from the current dataset on demand.

I have previously written some code that does aspects of this; Wikidata search results are displayed on some Wikipedias (e.g. Italian) underneath the standard ones; they contain a brief, automatically generated description of each Wikidata item. And some people have seen my Reasonator tool, where (for some item types, and some languages) rather long descriptions can be generated.

But these examples are “trapped” in their respective tool or environment; other tools, websites, or third-party users have no way to get automated descriptions for Wikidata items easily. That is, until now. AutoDesc is a web API that can generate automated descriptions of almost any Wikidata item; the quality of the description improves with the quality of the statements in the item, of course.

And thanks to node.js, which is now available as a server on Wikimedia Labs (thanks to YuviPanda!), little rewriting of code was necessary; the “long description” generator is, in fact, the exact same source code used in Reasonator at this moment. This means that previous development by myself and other volunteers is not lost, but has paid off; and future improvement to either version of the text generator can simply be copied to the other.
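To illustrate how one file can serve both worlds, the pattern looks something like this; this is just the principle, not the actual Reasonator/AutoDesc source:

```javascript
// One generator function, exported for node.js (the web API) and attached to
// window for the browser (Reasonator). Purely illustrative.
function longDescription(item, lang) {
  // ... walk the item's statements and build a sentence or two ...
  return ((item.labels && item.labels[lang]) || item.id) + ' was …';
}

if (typeof module !== 'undefined' && module.exports) {
  module.exports = { longDescription };       // server-side (the API)
} else {
  window.longDescription = longDescription;   // browser (Reasonator)
}
```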

The API can take an item number, a language code, and some other options, and generate a description of that item. It can return the description wrapped in JSON(P), or as an HTML page. It can generate plain text, wiki markup, or HTML with Wikipedia/Wikidata/Reasonator links. If you request the long description, it will automatically fall back to the short one if the item type or language for a long description is not supported (yet!).

Now, a word of caution: As I cobbled the text generation together from previously existing code, and code that was intended for use in a browser at that, things may not run as smoothly as one would expect. There is, in fact, little caching, and the cache that exists is not invalidated until the next server restart; an event that will be necessary to put new code live, and that will mean several seconds (the horror!) downtime for the API. If you base anything important on the API at this moment in time, homework will be eaten, data will be lost, and the write-everything-by-hand-fanatics will win. Be warned!

That said, I will try to improve the code over the coming weeks; if you want to help out, you can find the code here.

Red vs. blue

Recently, @notconfusing has been living up to his name by presenting us with preliminary results from the Wikipedia Gender Inequality Index. For me, that report is also an annoyance, because I was not aware this was going on, and had started to prepare my own research, with intent to publish, on the same topic. Fact is, I’ve been “scooped”, though not intentionally of course. Ah well, bygones. So that my (quite early) work was not entirely in vain, I’ll show some titbits of it here; interested parties, feel free to contact me for access to the data and the full Google doc (which is not exactly in a polished state). All data presented here was collected between November 2014 and January 2015, using either WDQ or the Labs databases. As far as I can tell, my findings correlate well with @notconfusing’s, which is always nice.

Methods

WDQ was used to retrieve item counts for items marked as human (P31:Q5) on Wikidata, grouped by birth dates (P569) in the ranges of 0-1800, 1800-1900, 1900-1950, 1950-1980, and 1980-today. Item counts were further grouped by gender (P21), using male (Q6581097) and female (Q6581072) only (ignoring intersex, transgender, and genderqueer). Further subgrouping was done for items with identifiers from external catalogs (e.g. ODNB, VIAF), and for nationality (P27).
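In WDQ terms, the grouping looked roughly like the query strings below. These are reconstructions rather than the exact queries used; in particular, the between[] date syntax is written from memory.

```javascript
// One time range (1950–1980) as an example; the other ranges only differ in the
// between[] bounds. claim[31:5] = instance of human, claim[21:…] = gender,
// between[569,…] = birth date range, claim[27] = has a nationality.
const queries = {
  male:            'claim[31:5] and claim[21:6581097] and between[569,1950,1980]',
  female:          'claim[31:5] and claim[21:6581072] and between[569,1950,1980]',
  withNationality: 'claim[31:5] and claim[27] and between[569,1950,1980]',
};
// Each string is sent to the WDQ API, and the length of the returned item list
// is the count for that group (see the sketch in the artists post above for the actual call).
```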

To compare biographical article sizes in Wikipedias, the replica database for Wikidata in conjunction with the respective language Wikipedia replica database were used. Items that link to either male (Q6581097) or female (Q6581072) items were retrieved. For these items, corresponding Wikipedia articles in a language were interrogated for their size, measured in bytes of Wikitext markup.

Results

Datasets

Wikidata had, at the time of writing, 2,634,209 items tagged as human, of which 2,363,146 (~90%) have a gender (P21) assigned. A total of 1,575,028 items that are human and have a birth date were found on Wikidata, of which 909,075 (~58%) have a nationality assigned.

Total change over time, by country

Starting with the basics, this shows the percentage of male biographical items in the individual time ranges. While there are fewer male (and, thus, more female) biographies in recent times, the spread (variance by country) increases as well. Notably, there are always “low male” outliers; this seems to be mostly Sweden, for some reason.

 

Change over time by region

These two figures show the male percentage faceted by region and time range. The figure on the right also shows it by country; darker blue means less %men=more %women.

Date ranges, faceted by region

Faceted by region, raster by country

Biographical items gender by country

This figure shows the percentage of male biographical items by country, for countries with >= 30 items; blue=more male, red=more female. At a glance, one can see the male-dominated countries in Africa and South America, as well as the South-East Asian countries (which @notconfusing mostly calls “Confucian”, which I find confusing) with a high female percentage. “The West” appears to be stuck somewhere in the middle.

Map: percentage of male biographical items by country.

Number of articles per gender

This table shows the number of sitelinks (that is, Wikipedia articles, mostly) by gender. Interestingly, there are slightly more sitelinks per item for women than for men, though women have more items without sitelinks (relative to their total), and fewer images in absolute terms. This might be due to historical factors; there would be fewer images (remember, paintings cost serious money!) of women than of men from before, say, 1900. Also, items about women are often created out of “structural need”: the father and the husband both have an article, but to connect them, a new item about the daughter/wife is created, without sitelinks.

                             Male                  Female
Total items with sitelinks   1,973,773             367,194
Single sitelink              1,245,727 (~63.1%)    228,619 (~62.3%)
Mean sitelinks per item      2.48                  2.55
Items without sitelinks      23,600 (~1.2%)        5,454 (~1.5%)
Items with images            177,993 (~9%)         39,287 (~10.7%)

Size of biographical articles by language

For each wiki with at least 100 biographical articles, this figure shows the mean article size (in bytes). A few “high-size” wikis were removed from this figure; they appear to make heavy use of Unicode, thus increasing the byte size massively, though they roughly adhere to the same “shape”. Each dot represents a wiki; the dot size increases with the number of biographical items on the wiki. The X axis shows the mean bytes per male article, the Y axis the mean bytes per female article. Wikis above the line have more bytes per woman! The linear fit is surprisingly good (Pearson 0.9955423). Judging by the distance to the line, Mirandese Wikipedia is the most sexist one biased towards men, whereas Tamil Wikipedia is the most sexist one biased towards women :-)

Comparison to other biographical sources

A quick comparison between biographical items that have both a birth date and an ODNB or VIAF identifier. It seems ODNB (>85% of ODNB entries have a Wikidata item!) is more sexist than VIAF, which is more sexist than the Wikidata per-country mean!

External catalog Wikidata items Overall male %
ODNB 29,017 89.6%
VIAF 447,758 85.3%

And as a plot, by time range:

Total gender ratio, ODNB, VIAF

Summary

This would be “Discussion&Conclusion” in a proper publication, but as this is just a blog post…

Strong gender bias towards men exists in the number of biographical items on Wikipedia and Wikidata; however, this bias appears to be, to a large degree, due to historical and/or cultural bias, rather than generated by Wikimedians. Since our projects are not primary sources, we are restricted to material gathered by others, and so reflect their consistent bias. All the above data points to less bias towards men over time, and in Asian and (to a degree) Western cultures, a trend which is mirrored in other sources. It also shows that we have comparable numbers of articles about men and women, and comparable article sizes on Wikipedia, though the latter depends on the language to some degree; all Wikipedias with over 100,000 biographical items are on the “female side” of the article size distribution (data not shown, though it can be glimpsed in the article size plot), which would indicate to me that, given enough eyeballs, gender bias becomes less of an issue on Wikipedia and Wikidata.

Content Ours, or: the sum of the parts

Open source projects like Linux, and open content projects like Wikipedia and Wikidata, are fine things indeed by themselves. However, the power of individual projects is multiplied if they can be linked up. For free software, this can be taken literally; linking libraries to your code is what allows complex applications to exist. For open data, links can be direct (as in weblinks, or external catalog IDs on Wikidata), or via a third party.

Recently, and once again, Peter Murray-Rust (of Blue Obelisk, CML, and Wikimania 2014 fame) has put his code where his mouth is. ContentMine harvests open access scientific publications and automatically extracts “facts”, such as mentions of species names. These facts are accessible through an API. Due to resource limitations, the facts are only stored temporarily, and will be lost after some time (though they can be regenerated automatically from the publications). Likewise, the search function is rather rudimentary.

Why is this important? Surely these publications are Google-indexed, and you can find what you want by just typing keywords into a search engine; text/data mining would be a waste of time, right? Well, not quite. With over 50 million research papers published (as of 2009), your search terms will have to be very tightly phrased to get a useful answer. Of course, if you use overly specific search terms, you are likely to miss that one paper you were looking for.

At the time of writing this, ContentMine is only a few weeks old; it contains less than 2,000 facts (all of them species names), extracted from 18 publications. But, even this tiny amount of data allows for a demonstration of what the linking of open data projects can accomplish.

Since all facts from ContentMine are CC-BY, I wrote some code to archive the “fact stream” in a database on Labs. As a second step, I use WDQ to automatically match species names to Wikidata items, where possible. Then, I slapped a simple interface on top, which lets a user query the database. One can use either a (trivial) name search in facts, or use a WDQ query; the latter would return a list of papers that contain facts that match items returned from WDQ.

If that sounds too complicated, try the example query on the interface page. This will:

  1. get all species from the Wikidata species tree with root “human” (which is only the item for “human”); query string “tree[5][][171]”
  2. get all species from the Wikidata species tree with root “orangutan”; query string “tree[41050][][171]”
  3. show all papers that have at least one item from 1. and at least one item from 2. as facts

At the moment, this is only one paper, one that talks about both Homo sapiens (humans) and Pongo pygmaeus (Bornean orangutans). But here is the point: we did not search for Pongo pygmaeus! We only queried for “any species of orangutans”. ContentMine knows about species mentioned in papers, Wikidata knows about species, and WDQ can query Wikidata. By putting these parts together, even if only in such a simple fashion, we have multiplied the power of what we can achieve!
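Behind the scenes, that example query amounts to little more than two WDQ calls and an intersection. A sketch: the tree[] query strings are the ones from the list above, while the shape of the stored “facts” (here, a map from paper ID to the matched Wikidata item numbers) is an assumption about my own little database.

```javascript
// Find papers whose extracted species facts contain at least one item from the
// "human" species tree AND at least one from the "orangutan" species tree.
async function wdqItems(q) {
  const res = await fetch('https://wdq.wmflabs.org/api?q=' + encodeURIComponent(q));
  return new Set((await res.json()).items);
}

async function papersMentioningBoth(factsByPaper) {          // { paperId: [itemNumber, ...] }
  const humans     = await wdqItems('tree[5][][171]');       // species tree rooted at "human"
  const orangutans = await wdqItems('tree[41050][][171]');   // species tree rooted at "orangutan"
  return Object.entries(factsByPaper)
    .filter(([, items]) => items.some(i => humans.has(i)) &&
                           items.some(i => orangutans.has(i)))
    .map(([paper]) => paper);
}
```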

While this example might not strike you as particularly impressive, it should suffice to bring the point across. Imagine many more publications (and yes, thanks to a recent legal decision, ContentMine can also harvest “closed” journals), and many more types of facts (places, chemicals, genetic data, etc.). Once we can query millions of papers for the effects of a group of chemicals on  bacterial species in a certain genus, or with a specific property, the power of accessing structured knowledge will become blindingly obvious.

Picture this!

Recently, someone told me that “there are no images on Wikidata”. I found that rather hard to believe, as I had added quite a few using my own tools. So I had a quick look at the numbers.

For Wikidata, counting the number of items with images is straightforward. For Wikipedia, not so much; by default, navigation bar logos and various icons would be counted just like actual photographs of the article topic. So, I devised a crude filter, counting only articles that use at least one image which is not itself used in three or more articles in total.
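In SQL terms, the crude filter boils down to something like the query below, run against the standard MediaWiki replica tables (imagelinks, page) on Labs. This is a rough reconstruction, not necessarily the exact query I used.

```javascript
// Count main-namespace pages that use at least one image which itself is used
// on fewer than three pages in total. The SQL is meant to be run against a
// language's replica database (e.g. dewiki) on Labs; the constant is just a wrapper.
const crudeFilter = `
  SELECT COUNT(DISTINCT il.il_from) AS articles_with_own_image
  FROM imagelinks il
  JOIN page p ON p.page_id = il.il_from AND p.page_namespace = 0
  WHERE il.il_to IN (
    SELECT il_to FROM imagelinks GROUP BY il_to HAVING COUNT(*) < 3
  )`;
```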

I ran this query on some of the larger Wikipedias. While most of them ran fine, English Wikipedia failed to return a timely result; and since its generous sprinkling with “fair use” local images would inflate the number anyway, I am omitting this result here. Otherwise:

Site Articles/Items with images
dewiki 709,736
wikidata 604,925
frwiki 602,664
ruwiki 491,916
itwiki 451,499
eswiki 414,308
jawiki 278,359

As you can see, Wikidata already outperforms all but one of these Wikipedias (all but two, if you count en.wp). Since image addition to Wikidata is easy through tools (and games), and there are many “pre-filtered” candidates from Wikipedias to use, I expect Wikidata to surpass German Wikipedia soon (assuming linear increase, in less than four months), and eventually English Wikipedia as well, at least for images from Commons (not under “fair use”).

But even at this moment, I am certain there are thousands of Wikidata items with an image, while the corresponding article on German (or Spanish or Russian) Wikipedia remains a text desert. The hesitation of the Wikipedia communities to use these readily available images deprives their respective readers of something that helps make articles come alive, and all the empty talk of “quality” and “independence” does not serve as compensation.

Also, the above numbers count all types of files on Wikipedia, whereas they count only images of the item subject on Wikidata. Not only does that bias the numbers in favour of Wikipedia, it also hides the various “special” file types that Wikidata offers: videos, audio recordings, pronunciations, maps, logos, coats of arms, to name just a few. It is likely that their use on Wikipedia is even more scattered than that of subject images. Great opportunities to improve Wikipedias of all languages, for those bold enough to nudge the system.

The way is shut

So I saw a mail about the new, revamped Internet Archive. Fantastic! All kinds of free, public domain (for the most part) files to play with! So I thought to myself: Why not celebrate that new archive.org by using a file to improve Wikidata? After all, I just have to upload it to Commons!

Easy, right? Well, I did write a tool to directly upload a file from a URL to Commons, but the IA only offers mp3, so I don’t know how that would work. Let’s do it the old-fashioned way, as every newcomer would: Download it to disk, and upload it to Commons. Except Commons barfs at mp3 uploads. Commons is the domain of free formats, after all. And we could not possibly set non-free formats free by converting them automatically, oh no! I am sure there is a good reason why the WMF can’t turn non-free mp3 into free formats during upload; that reason just escapes me at the moment, as it will surely escape everyone else who tries this. Maybe they would have to – gasp! – license an mp3 decoder? Not sure if that is actually required, but it would surely irk the free-only purity of the organization. Never mind that the Foundation relies heavily on non-free software and services like Google internally; if they can’t get things done with free software and open-source services alone, obviously non-free ones are made available. Just not for the community.

The mp3 refusal surely means that there are well-documented ways to deal with this issue, right? The Upload Wizard itself is not very helpful, though; the dialog box that pops up says:

This wiki does not accept filenames that end in the extension “.mp3”.

That’s it. No reason why, no suggestion what to do about it, no links, nothing. Just “bugger off”, in so many words. Never mind; after all, there is a prominent, highlighted link in the Wizard to Upload help. Which, one would assume, offers help with uploading files. I search the page for “mp3” – no result. Ah well, this seems to be a list of questions rather than an actual help page, but there is a “search archive” function; surely, this problem must have been discussed before! Nope. Neither does the FAQ cover the topic of mp3. But lo and behold, searching for “audio” gets me here, which tells me (finally!) that Commons accepts OGG and FLAC; OPUS is not mentioned, probably because there are “issues” with uploading OPUS to Commons (no, really?!?). There are some links to software and online converters, but I had found some of those on my own already by now.

I tried the Miro converter, but it “only” creates OGG, not FLAC, which I wanted to use in order to avoid re-encoding losses. Then I tried online-convert, which returned me a 10MB FLAC file for my 1.6MB mp3. So I upload the FLAC. And by that, I mean, I try. The Wizard takes the file, and starts “encoding”. And never finishes. Or at least, it’s been at it for >10 minutes now, and is not showing any sign it’s alive.

This is my experience; I could probably get it to work, if I cared enough. I shudder to think how a newbie would fare with this task. Where audio (and, most likely, video) is concerned, Commons is, in effect, a community-driven media site that does not accept media files. It has been for years, but we are approaching 2015; time we do something about that. Merely preaching more free format ideology is not a solution.

Clusterf…amilies

Wikidata, with its web of interconnected items, lends itself to automated clustering. I have used my Wikidata query tool to quickly (as in, a few minutes) check all clusters of humans, that is, items about humans connected by properties such as mother, child, spouse, brother, etc.

At the time of writing, there are 11,784 clusters on Wikidata, each containing two or more humans. The largest one is the supercluster “European Royalty” with 20,543 members. 7,471 clusters contain only two humans, 1,955 contain three, and the numbers drop from there.

Beyond the royalty supercluster, the largest ones include:

Sadly, there is no good genealogy rendering software that is open source and JavaScript-only; and I don’t really have the bandwidth to develop one.

I have uploaded the cluster list here; each row has an item to start with, and the size of the cluster (=number of humans). The members of the cluster can be retrieved with the Wikidata query web[start_item][22,25,40,7,9,26,45,1038]. If there is interest, I can re-calculate this cluster list again later.
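For anyone wanting to reproduce a cluster, the retrieval looks roughly like this; the WDQ endpoint and response shape are from memory, and the property list is the one from the query string above.

```javascript
// Walk the family-relation properties from a start item and return all items
// connected to it, i.e. the members of that item's cluster.
const FAMILY_PROPS = '22,25,40,7,9,26,45,1038';   // properties from the web[] query above

async function clusterMembers(startItem) {        // numeric ID or 'Q…' string
  const id = String(startItem).replace(/^Q/, '');
  const q = `web[${id}][${FAMILY_PROPS}]`;
  const res = await fetch('https://wdq.wmflabs.org/api?q=' + encodeURIComponent(q));
  return (await res.json()).items;                // numeric item IDs in the cluster
}
// Pass any start item from the uploaded list; the length of the result should
// match the cluster size given there.
```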

The missing origin of species

Now, instead of humans and their relations, what about taxa? We recently talked about taxonomy on Wikidata at WikiCon 2014, so I thought I’d modify the script to show that taxonomy on Wikidata is in a good state. Sadly, it is not.

Using “parent taxon”, as well as the deprecated “family” and “order” properties, I get a whopping 193,040 separate clusters; and that doesn’t even count completely “unconnected” items. The good news is, the main “supercluster” consists of 1,351,245 taxa that, presumably, can all be traced back to a common root “biota”.

But, the next one is a cluster of 1,006 taxa unconnected to that root. Using a modified query, I can get the unconnected root of that cluster, Molophilus. I have uploaded the complete cluster list here; a list of items per cluster, as well as the unconnected root, can be retrieved using the respective start item, and the methods demonstrated above.

The Men Who Stare at Media

Shortly after the 2014 London Wikimania, the happy world of Wikimedia experienced a localized earthquake when a dispute between some editors of the German Wikipedia and the Wikimedia Foundation escalated into exchanges of electronic artillery. Here, I try to untangle the threads of the resulting Gordian knot, interwoven with my own view on the issue.

Timeline

As best as I can tell, the following sequence of events is roughly correct:

  1. The WMF (Wikimedia Foundation) decides to update and, at least by intention, improve the viewing of files (mostly images), mainly when clicked on in Wikipedia. The tool for this, dubbed MediaViewer, would do what most people expect when they click on a thumbnail on a website in 2014, and be activated by default. This is aimed at the casual reader, comprising the vast majority of people using Wikipedia. For writers (that is, “old hands” with log-ins), there is an off switch.
  2. A small group of editors on English Wikipedia suggest that the MediaViewer, at least in its current state, is not suitable for default activation. This is ignored by the WMF due to lack of total votes.
  3. A “Meinungsbild” (literally “opinion picture”; basically, a non-binding poll) is initiated on German Wikipedia.
  4. The WMF posts on the Meinungsbild page that it (the WMF) reserves the right to overrule a negative result.
  5. About 300 editors vote on German Wikipedia, with ~2/3 against the default activation of the MediaViewer.
  6. The WMF, as announced, overrules the Meinungsbild and activates the MediaViewer by default.
  7. An admin on German Wikipedia implements a JavaScript hack that deactivates the MediaViewer.
  8. The WMF implements a “super-protect” right that locks out even admins from editing a page, reverts the hack to re-enable the MediaViewer, and protects the “hacked” page from further editing.
  9. Mailing list shitstorm ensues.

An amalgam of issues

In the flurry of mails, talk page edits, tweets, blog posts, and press not-quite-breaking-news items, a lot of issues were thrown into the increasingly steaming-hot soup of contention-laden bones. Sabotage of the German Wikipedia by its admins, to prevent everyone from reading it, was openly suggested as a possible solution to the problem, Erik Möller of WMF was called a Nazi, and WMF management is raking in the donations for themselves while only delivering shoddy software. I’ll try to list the separate issues that are being bundled under the “MediaViewer controversy” label:

  • Technical issues. This includes claims that MediaViewer is useless, not suitable for readers, too buggy for prime time, violates copyright by hiding some licenses, etc.
  • WMF response. Claims that the Foundation is not responding properly to technical issues (e.g. bug reports), community wishes, etc.
  • WMF aim. Claims that the Foundation is focusing exclusively on readers and new editors, leaving the “old hands” to fend for themselves.
  • Authority. Should the WMF or the community of the individual language edition have the final word about software updates?
  • Representation: Does a relatively small pool of vocal long-time editors speak for all the editors, and/or all the readers?
  • Rules of engagement: Is it OK for admins to use technological means to enforce a point of view? Is it OK for the WMF to do so?
  • Ownership: Does the WMF own Wikipedia, or do the editors who wrote it?

A house needs a foundation

While the English word “foundation” is known to many Germans, I feel it is often interpreted as “Verein”, the title of the German Wikimedia chapter. The literal translation (“Fundament”), and thus its direct meaning, are often overlooked. The WMF is not “the project”; it is a means to an end, a facilitator, a provider of services for “the community” (by whatever definition) to get stuff done. At the same time, “the community” could not function without a foundation; some argue that the community needs a different foundation, because the next one will be much better, for sure. Thankfully, these heroic separatists are a rather minute minority.

The foundation provides stability and reliability; it takes care of a lot of necessary plumbing and keeps it out of everyone’s living room. At the same time, when the foundation changes (this is stretching the literal interpretation of the word a bit, unless you live in The Matrix), everything built on the foundation has to change with it. So what does this specific foundation provide?

  • The servers and the connectivity (network, bandwidth) to run the Wikis.
  • The core software (MediaWiki) and site-specific extensions. Yes, since it’s open source, everyone can make a fork, so WMF “ownership” is limited; however, WMF employs people to develop MediaWiki, with the specific aim of supporting WMF’s projects. Third-party use is widespread, but not a primary aim.
  • The setup (aka installation) of MediaWiki and its components for the individual projects.
  • The people and know-how to make the above run smoothly.
  • Non-technical aspects, such as strategic planning, public relations and press management, legal aspects etc. which would be hard/impossible for “the community” to provide reliably.
  • The money to pay for all of the above. Again, yes, the money comes from donations; but WMF collects, prioritizes, and distributes it; they plan and execute the fundraising that gets the money in.

The WMF does specifically not provide:

  • The content of Wikipedia, Commons, and other projects.
  • The editorial policies for these projects, beyond certain basic principles (“Wikipedia is an encyclopedia, NPOV, no original research”, etc.) which are common to all language editions of a project.

Authorities

I think that last point deserves attention in the light of the battle of MediaViewer. The WMF is not just your hosting provider. It does stand for, and is tasked to uphold, some basic principles of the project, across communities and languages. For example, the “neutral point of view” is a basic principle on all Wikipedias. What if a “community” (again, by whatever definition) were to decide to officially abandon it, and have opinionated articles instead? Say, the Urdu edition, a language mostly spoken in Pakistan (which I chose as a random example here!). I think that most editors, from most “communities”, would want the WMF to intervene at that point, and rightly so. You want opinionated texts, get a blog (like this one); the web is large enough. In such a case, the WMF should go against the wishes of that “community” and, if necessary, enforce NPOV, even if it means de-adminning or blocking people on that project. And while I hope that such a situation will never develop, it would be a case where the WMF would, and should, enforce editorial policy (because otherwise, it wouldn’t be Wikipedia anymore). Which is a far more serious issue than some image viewer tool.

The point I am trying to make here is that there are situations where it is part of the mission and mandate of the WMF to overrule “the community”. The question at hand is, does MediaViewer comprise such a situation? It is certainly a borderline case. On one hand, seen from the (German) “community” POV, it is a non-essential function that mostly gets in the way of the established editors who are most likely to show up at the Meinungsbild, and admittedly has some software issues with a generous sprinkling of bug reports. On the other hand, from the WMF’s point of view, the dropping number of editors is a major problem, and it is their duty to solve it as best they can. Some reasons, e.g. “newbie-biting”, are up to the communities and essentially out of the WMF’s control. Other reasons for the lack of “fresh blood” in the wiki family include the somewhat antiquated technology exposed to the user, and that is something well within its remit. The Visual Editor was developed to get more (non-technical) people to edit Wikipedia. The Upload Wizard and the MediaViewer were developed to get more people interested in (and adding to) the richness of free images and sounds available on the sites.

The Visual Editor (which seems to work a lot better than it used to) represents a major change in the way Wikipedia can be used by editors, and its initial limitations were well known. Here, the WMF did yield to the wishes of individual “communities”, and not even an option for the Visual Editor is shown on German Wikipedia for “anonymous” users.

The MediaViewer is, in this context, a little different. Most people (that is, anonymous readers of Wikipedia, all of whom are potential future editors) these days expect that, when you click on a thumbnail image on a website, you see a large version of it. Maybe even with next/prev arrows to cycle through the available images on the page. (I make no judgement about whether this is the right thing; it just is this way.) Instead, Wikipedia thus far treated the reader to a slightly larger thumbnail, surrounded by mostly incomprehensible text. And when I say “incomprehensible”, I mean people mailing me to ask if they could use my image from Commons; they skip right past the {{Information}} template and the license boxes to look for the uploader, which happens to be my Flickr/Wikipedia transfer bot.

So the WMF decided that, in this specific case, the feature should be rolled out as default, on all projects instead of piecemeal like the Visual Editor (and do not kid yourself, it will come to every Wikipedia sooner or later). I do not know what prompted this decision; consistency for multilingual readers, simplicity of maintenance, pressure on the programmers to get the code into shape under the ensuing bug report avalanche, or simply the notion of this being a minor change that can be turned off even by anonymous users. I also do not know if this was the right technical decision to make, in light of quite a few examples where MediaViewer does not work as correctly as it should. I am, however, quite certain that it was the WMF’s right to make that decision. It falls within two of their areas of responsibility, which are (a) MediaWiki software and its components, and (b) improving reader and editor numbers by improving their experience of the site. Again, no judgement whether or not it was the right decision; just that it was the WMF’s decision to make, if they chose to do so.

Respect

I do, however, understand the “community’s” point of view as well; while I haven’t exactly been active on German Wikipedia for a while, I have been around through all of its history. The German community is very dedicated to quality; where the English reader may be exposed to an army of Pokemons, the article namespace in German Wikipedia is pruned rather rigorously (including an article about Yours Truly). There are no “mispeeling” redirects (apparently, if you can’t spell correctly, you have no business reading an encyclopedia!), and few articles have infoboxes (Wikipedia is an encyclopedia, not a trading card game!). There are “tagging categories”, e.g. for “man” and “woman”, with no subcategories; biographies generally have Persondata and authority control templates. In short, the German community is very much in favor of rigorously controlling many aspects of the pages, in order to provide the (in the community’s view) best experience for the user. This is an essential point: the German community cares very much about the reader experience! This is not to say that other languages don’t care; but, in direct comparison, English Wikipedia is an amorphous free-for-all playground (exaggerating a bit here, but only a bit). If you don’t believe me, ask Jimbo; he speaks some German, enough to experience the effect.

So some of the German editors saw (and continue to see) the default activation of the MediaViewer as an impediment to not only themselves, but especially to the reader. And while Germans are known for their “professional outrage”, and some just dislike everything new (“it worked for me so far, why change anything?”), I believe the majority of editors voting against the MediaViewer are either actually concerned about the reader experience, or were convinced (not to say “dragged into”) by those concerned to vote “no”.

The reactions by the WMF, understandably as they are from their perspective, namely

  1. announcing to ignore the “vote” (not a real, democratic vote, which is why it’s called “Meinungsbild” and not “Wahl”)
  2. proceeding to ignore the vote
  3. using “force” to enforce their decision

were interpreted by many editors as a lack of respect. We the people editors wrote the encyclopedia, after all; how dare they (the WMF) change our carefully crafted user experience, and ignore our declared will? It is from that background that comparisons to corporate overlords etc. stem, barely kept in check by Mike Godwin himself. And while such exaggerations are a common experience to everyone on the web, they do not exactly help in getting the discussion back to where it should be. Which is “where do we go from here”?

The road to hell

One thing is clear to me, and I suspect even to the most hardened edit warrior in the wikiverse: Both “sides”, community and WMF, actually want the same thing, which is to give the reader the best experience possible when browsing the pages of any Wikimedia project. The goal is not in question; the road to get there is. And whose authority it is to decide that.

On the technical side, one issue is the testing-and-fixing cycle. Traditionally, the WMF has made new functionality available for testing by the community quite early. By the same tradition, that option is ignored by most members of that community, only to complain about being steamrollered into it when it suddenly appears on the live site. On the other hand,  the WMF has rolled out both the Visual Editor and the MediaViewer in a state that would be called “early beta” in most software companies. “Release early, release often” is a time-honored motto in open source software development; but in this specific case, using early releases in production isn’t optional for the users. From discussions I had on Wikimania, I have the distinct impression that people expect a higher standard of quality for software rolled out by the WMF on the live sites, especially if it becomes default. How this should work without volunteers to test early remains a mystery; maybe a little more maturity on the initial release, followed by more widespread use of “beta” features, is part of the answer here.

On the votes-vs-foundation side, I am of the opinion that clearer lines need to be drawn. The WMF does have a responsibility for user experience, which includes software changes, some of which will have to be applied across the wikiverse to be effective; the upcoming “forced account unification” for (finally!) Single User Login comes to mind. And, in a twist on the famous Spiderman quote, with great responsibility needs to come great power to fulfill it. Responsibility without power is the worst state one can have in a job, which even the most uncompromising “community fighter” will agree to. So if and when the WMF makes such a decision within their remit, the energy of the community would be best spent in feeding back the flaws in order to get the best possible result, instead of half-assed attempts at sabotage (I much prefer full-assed attempts myself).

There is, of course, another side of that coin. In my opinion, the WMF should leave the decision for default activation of a new feature to a representative vote of a community, unless the activation is necessary for (a) technical, (b) consistency, or (c) interdependency reasons. A security fix would fall under (a); the Single User Login will fall under (c); MediaViewer falls under (b), though somewhat weakly IMHO. Now, the key word in the beginning of this paragraph is “representative”. I am not quite sure how this would work in practice. I am, however, quite sure it is not 300 editors (or Spartans) voting on some page. It could include votes by a randomized subset of readers. It could also include “calls to vote” as part of beta features, e.g. if you had the feature enabled in the last week. These could be repeated over time, as the “product” would change, sometimes significantly so, as it happened with the Visual Editor; a “no” three months ago would be quite invalid today.

Finally, I believe we need at least part of the above written out, and agreed upon, by both the WMF and “the communities”. It is my hope that enough people will share my opinion that both “parties” still have a common goal. Because the house that is Wikipedia cannot stand without a foundation, and a foundation without a house on top is but a dirty pond.