Red vs. blue

Recently, @notconfusing has been living up to his name by presenting us with preliminary results from the Wikipedia Gender Inequality Index. For me, that report is also an annoyance, because I was not aware this was going on, and had started to prepare my own research on the same topic, with intent to publish. Fact is, I’ve been “scooped”, though not intentionally of course. Ah well, bygones. So that my (quite early) work was not entirely in vain, I’ll show some titbits of it here; interested parties, feel free to contact me for access to the data and the full Google doc (which is not exactly in a polished state). All data presented here was collected between November 2014 and January 2015, using either WDQ or the Labs databases. As far as I can tell, my findings correlate well with @notconfusing’s, which is always nice.

Methods

WDQ was used to retrieve item counts for items marked as human (P31:Q5) on Wikidata, grouped by birth dates (P569) in the ranges of 0-1800, 1800-1900, 1900-1950, 1950-1980, and 1980-today. Item counts were further grouped by gender (P21), using male (Q6581097) and female (Q6581072) only (ignoring intersex, transgender, and genderqueer). Further subgrouping was done for items with identifiers from external catalogs (e.g. ODNB, VIAF), and for nationality (P27).
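For anyone wanting to retrace the grouping, here is a minimal sketch of how the gender and birth-range combinations could be turned into query strings. It only builds strings and calls no service. The `CLAIM[...]` and `BETWEEN[...]` syntax is written from memory of WDQ's query language and should be treated as an assumption, as should the helper name `wdq_query` and the use of 2015 as a stand-in for "today"; it is not the exact set of queries used for this post.

```python
# Hypothetical helper: builds WDQ-style query strings only (no network access).
# The CLAIM/BETWEEN syntax is an assumption, not taken from the post.

HUMAN = 5  # Q5, the value of "instance of" (P31)
GENDERS = {"male": 6581097, "female": 6581072}  # values of P21

# Birth-year ranges from the text; 2015 stands in for "today" (assumption).
RANGES = [(0, 1800), (1800, 1900), (1900, 1950), (1950, 1980), (1980, 2015)]


def wdq_query(gender_q, start, end):
    """Return a query for humans of one gender born between two years."""
    return (
        f"CLAIM[31:{HUMAN}] AND CLAIM[21:{gender_q}] "
        f"AND BETWEEN[569,{start},{end}]"
    )


# One query per (gender, birth range) combination.
queries = {
    (name, rng): wdq_query(q, *rng)
    for name, q in GENDERS.items()
    for rng in RANGES
}

for key, query in queries.items():
    print(key, query)
```

Subgrouping by nationality or external catalog would add one more condition on P27 or on the catalog's identifier property to each string.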

To compare biographical article sizes across Wikipedias, the replica database for Wikidata was used in conjunction with the replica database of the respective language Wikipedia. Items that link to either the male (Q6581097) or the female (Q6581072) item were retrieved. For these items, the corresponding Wikipedia articles in a given language were queried for their size, measured in bytes of wikitext markup.

Results

Datasets

Wikidata had, at the time of writing, 2,634,209 items tagged as human, of which 2,363,146 (~90%) have a gender (P21) assigned. A total of 1,575,028 items that are human and have a birth date were found on Wikidata, of which 909,075 (~58%) have a nationality assigned.

Total change over time, by country

Starting with the basics, this shows the percentage of male biographical items in the individual time ranges. While there are fewer male (and, thus, more female) biographies in recent times, the spread (variance by country) increases as well. Notably, there are always “low male” outliers; this seems to be mostly Sweden, for some reason.

 

Change over time by region

These two figures show the male percentage faceted by region and time range. The figure on the right also shows it by country; darker blue means a lower percentage of men, that is, a higher percentage of women.

[Figure, left: date ranges, faceted by region. Figure, right: faceted by region, raster by country.]

Biographical items gender by country

This figure shows the percentage of male biographical items by country, for countries with >= 30 items; blue=more male, red=more female. At a glance, one can see the male-dominated countries in Africa and South America, as well as the South-East Asian countries (which @notconfusing mostly calls “Confucian”, which I find confusing) with a high female percentage. “The West” appears to be stuck somewhere in the middle.

[Figure: map]

Number of articles per gender

This table shows the number of sitelinks (that is, Wikipedia articles, mostly) by gender. Interestingly, there are slightly more articles about women than men, though women have more items without sitelinks, and fewer images. This might be due to historical factors; there would be fewer images (remember, paintings cost serious money!) of women than of men from before, say, 1900. Also, items about women are often created for “structural need”; the father and the husband both have an article, but to connect them, a new item about the daughter/wife is created, without sitelinks.

| | Male | Female |
|---|---|---|
| Total items with sitelinks | 1,973,773 | 367,194 |
| Single sitelink | 1,245,727 (~63.1%) | 228,619 (~62.3%) |
| Mean sitelinks per item | 2.48 | 2.55 |
| Items without sitelinks | 23,600 (~1.2%) | 5,454 (~1.5%) |
| Items with images | 177,993 (~9%) | 39,287 (~10.7%) |
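The percentages in the table above can be re-derived from the counts. The sketch below assumes that each percentage is taken relative to the first row of the same column; that reading is an assumption, but it reproduces every rounded value shown. The names `totals`, `rows` and `share` are only illustrative.

```python
# Counts copied from the table above.
totals = {"male": 1_973_773, "female": 367_194}
rows = {
    "single sitelink": {"male": 1_245_727, "female": 228_619},
    "without sitelinks": {"male": 23_600, "female": 5_454},
    "with images": {"male": 177_993, "female": 39_287},
}


def share(count, total):
    """Percentage of `total` that `count` represents, to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: the denominator is the per-gender total in the first table row.
shares = {
    label: {gender: share(count, totals[gender]) for gender, count in counts.items()}
    for label, counts in rows.items()
}

for label, by_gender in shares.items():
    print(label, by_gender)
```

Read this way, female items have fewer images in absolute numbers, but a slightly higher share of them carry an image (about 10.7% against about 9%).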

Size of biographical articles by language

For each wiki with at least 100 biographical articles, this figure shows the mean size (in bytes) of its biographical articles. A few “high-size” wikis were removed from this figure; they appear to make heavy use of Unicode, thus increasing the byte size massively, though they roughly adhere to the same “shape”. Each dot represents a wiki; the dot size increases with the number of biographical items on the wiki. The X axis shows the mean bytes per male article, the Y axis the mean bytes per female article. Wikis above the line have more bytes per article about women! The linear fit is surprisingly good (Pearson correlation 0.9955423). According to the distance to the line, Mirandese Wikipedia is the most sexist one biased towards men, whereas Tamil Wikipedia is the most sexist one biased towards women :-)
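To make the "distance to the line" idea concrete, here is a small sketch using made-up numbers; the wiki names and byte values are hypothetical and not taken from my data. It computes the Pearson correlation, an ordinary least-squares line, and each wiki's signed perpendicular distance to that line. Whether the line in the figure is the fitted line or the diagonal y = x is an assumption on the reader's side; using the diagonal instead means setting slope to 1 and intercept to 0.

```python
from math import sqrt

# Made-up (mean bytes per male article, mean bytes per female article) pairs.
wikis = {
    "wiki_a": (2000.0, 2100.0),
    "wiki_b": (3500.0, 3400.0),
    "wiki_c": (5000.0, 5300.0),
    "wiki_d": (8000.0, 8100.0),
}


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def signed_distance(x, y, slope, intercept):
    """Perpendicular distance to the line; positive means above the line."""
    return (y - (slope * x + intercept)) / sqrt(1 + slope ** 2)


xs = [male for male, _ in wikis.values()]
ys = [female for _, female in wikis.values()]

r = pearson(xs, ys)
slope, intercept = least_squares(xs, ys)
distances = {
    name: signed_distance(male, female, slope, intercept)
    for name, (male, female) in wikis.items()
}

# Furthest below the line: leans male; furthest above: leans female.
most_male_leaning = min(distances, key=distances.get)
most_female_leaning = max(distances, key=distances.get)
print(r, slope, intercept, most_male_leaning, most_female_leaning)
```

A distance measured against a fitted line only says how a wiki compares to the other wikis; measured against the diagonal it says whether female articles are longer than male ones in absolute terms.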

Comparison to other biographical sources

Here is a quick comparison of biographical items that have both a birth date and an ODNB or VIAF identifier. It seems ODNB (>85% of ODNB entries have a Wikidata item!) is more sexist than VIAF, which is more sexist than the Wikidata per-country mean!

| External catalog | Wikidata items | Overall male % |
|---|---|---|
| ODNB | 29,017 | 89.6% |
| VIAF | 447,758 | 85.3% |

And as a plot, by time range:

[Figure: total gender ratio, ODNB, VIAF]

Summary

This would be “Discussion&Conclusion” in a proper publication, but as this is just a blog post…

Strong gender bias towards men exists in the number of biographical items on Wikipedia and Wikidata. However, this bias appears to be to a large degree due to historical and/or cultural bias, rather than generated by Wikimedians. Since our projects are not primary sources, we are restricted to material gathered by others, and so reflect their consistent bias. All the above data points to less bias towards men over time, and in Asian and (to a degree) Western cultures, a trend which is mirrored in other sources. It also shows that we have comparable numbers of articles about men and women, and comparable article sizes on Wikipedia, though the latter depends on the language to some degree. All Wikipedias with over 100,000 biographical items are on the “female side” of the article size distribution (data not shown, though it can be glimpsed in the article size plot), which would indicate to me that, given enough eyeballs, gender bias becomes less of an issue on Wikipedia and Wikidata.

One Comment

  1. Nemo wrote:

    Thanks! Too bad you weren’t able to comment on the research by Max & Piotr earlier, but this is a great addition to the genre.

    I find it a bit confusing that you translate “Mean sitelinks per item” with “there are slightly more articles about women than men”: it’s *proportionally* more articles. If you say “more articles”, IMHO you are implicitly assuming what you’re meant to prove, i.e. that Wikidata is neutral and any bias would be on the content projects’ side.

    I’d rather say that existing articles about women probably get translated more often, or that articles on men are probably more localistic. There is still a huge gap.

    I don’t know if the comparison to VIAF is fair, and certainly it’s not a complete picture, because VIAF is mainly about authors, while Wikimedia projects are about all kinds of people. Are the Wikidata items linked to VIAF a representative sample of VIAF itself?

    Finally, the “size” graph is very interesting, a challenge for interpretation. It’s worth working more on, IMHO, as nobody else assessed the “weight” of existing articles other than counting them. The fact that women articles are apparently bigger, and also more global, might mean that covered women are on average “more important” (or given more importance) than men. Which makes sense, given e.g. the number of irrelevant male football players 😉 to whom hopefully Wikimedia projects don’t give as much importance as many seem to think from brutal countings.

    Tuesday, January 27, 2015 at 17:14