Recently, @notconfusing has been living up to his name by presenting us with preliminary results from the Wikipedia Gender Inequality Index. For me, that report is also an annoyance, because I was not aware this was going on, and had started to prepare my own research, with intent to publish, about the same topic. Fact is, I’ve been “scooped”, though not intentionally of course. Ah well, bygones. So that my (quite early) work was not entirely in vain, I’ll show some titbits of it here; interested parties, feel free to contact me for access to the data and the full Google doc (which is not exactly in a polished state). All data presented here was collected between November 2014 and January 2015, using either WDQ or Labs databases. As far as I can tell, my findings correlate well with @notconfusing’s, which is always nice.
WDQ was used to retrieve item counts for items marked as human (P31:Q5) on Wikidata, grouped by birth dates (P569) in the ranges of 0-1800, 1800-1900, 1900-1950, 1950-1980, and 1980-today. Item counts were further grouped by gender (P21), using male (Q6581097) and female (Q6581072) only (ignoring intersex, transgender, and genderqueer). Further subgrouping was done for items with identifiers from external catalogs (e.g. ODNB, VIAF), and for nationality (P27).
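The grouping described above can be sketched in a few lines of Python. This is only an illustration of the binning logic, not the WDQ queries actually used: the sample records are made up, the function names are hypothetical, and treating each birth-year range as half-open (so that 1800 falls into 1800-1900) is an assumption, since the ranges as listed share their boundaries.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the birth-year ranges from the post. Treating each range as
# half-open [start, next_start) is an assumption made for this sketch.
RANGE_STARTS = [0, 1800, 1900, 1950, 1980]
RANGE_LABELS = ["0-1800", "1800-1900", "1900-1950", "1950-1980", "1980-today"]

MALE, FEMALE = "Q6581097", "Q6581072"  # P21 values named in the post


def birth_range(year):
    """Return the label of the range containing `year`, or None before year 0."""
    if year < RANGE_STARTS[0]:
        return None
    return RANGE_LABELS[bisect_right(RANGE_STARTS, year) - 1]


def count_by_range_and_gender(records):
    """Count (range label, gender) pairs, keeping only male and female items."""
    counts = Counter()
    for year, gender in records:
        label = birth_range(year)
        if label is not None and gender in (MALE, FEMALE):
            counts[(label, gender)] += 1
    return counts


def male_percentage(counts, label):
    """Percentage of male items in one range, or None if the range is empty."""
    male = counts[(label, MALE)]
    female = counts[(label, FEMALE)]
    total = male + female
    return 100.0 * male / total if total else None


# Hypothetical (birth year, P21 value) pairs, not real Wikidata results.
sample = [(1750, MALE), (1799, FEMALE), (1800, MALE), (1920, MALE),
          (1985, FEMALE), (1990, MALE), (1990, "other")]
counts = count_by_range_and_gender(sample)

for label in RANGE_LABELS:
    print(label, male_percentage(counts, label))
```

The same counting could be repeated per nationality (P27) or per external catalog by adding that value to the key.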
To compare biographical article sizes in Wikipedias, the Wikidata replica database was used in conjunction with the replica database of the respective language Wikipedia. Items whose gender (P21) links to either male (Q6581097) or female (Q6581072) were retrieved. For these items, the corresponding Wikipedia article in each language was interrogated for its size, measured in bytes of wikitext markup.
Wikidata had, at the time of writing, 2,634,209 items tagged as human, of which 2,363,146 (~90%) have a gender (P21) assigned. A total of 1,575,028 items that are human and have a birth date were found on Wikidata, of which 909,075 (~58%) have a nationality assigned.
Total change over time, by country
Starting with the basics, this shows the percentage of male biographical items in the individual time ranges. While there are fewer male (and, thus, more female) biographies in recent times, the spread (variance by country) increases as well. Notably, there are always “low male” outliers; this seems to be mostly Sweden, for some reason.
Change over time by region
These two figures show the male percentage faceted by region and time range. The figure on the right also shows it by country; darker blue means a lower percentage of men (and thus a higher percentage of women).
Biographical items gender by country
This figure shows the percentage of male biographical items by country, for countries with >= 30 items; blue=more male, red=more female. At a glance, one can see the male-dominated countries in Africa and South America, as well as the South-East Asian countries (which @notconfusing mostly calls “Confucian”, which I find confusing) with a high female percentage. “The West” appears to be stuck somewhere in the middle.
Number of articles per gender
This table shows the number of sitelinks (that is, Wikipedia articles, mostly) by gender. Interestingly, there are slightly more articles per item about women than about men, though women have more items without sitelinks, and fewer images. This might be due to historical factors; there would be fewer images (remember, paintings cost serious money!) of women than men from before, say, 1900. Also, items about women are often created for “structural need”; the father and the husband both have an article, but to connect them, a new item about the daughter/wife is created, without sitelinks.
|Metric||Male||Female|
|Total items with sitelinks||1,973,773||367,194|
|Single sitelink||(~63.1%) 1,245,727||(~62.3%) 228,619|
|Mean sitelinks per item||2.48||2.55|
|Items without sitelinks||(~1.2%) 23,600||(~1.5%) 5,454|
|Items with images||(~9%) 177,993||(~10.7%) 39,287|
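
As a quick sanity check, the percentages in the table can be recomputed from the raw counts. The sketch below assumes that the first column of numbers refers to men and the second to women (inferred from the totals), and that every percentage is taken relative to "Total items with sitelinks" for that gender; the dictionary keys are just labels chosen for this example.

```python
# Counts copied from the table above. The (male, female) column order is an
# assumption inferred from the totals, not stated in the table itself.
table = {
    "male": {
        "with_sitelinks": 1_973_773,
        "single_sitelink": 1_245_727,
        "without_sitelinks": 23_600,
        "with_images": 177_993,
    },
    "female": {
        "with_sitelinks": 367_194,
        "single_sitelink": 228_619,
        "without_sitelinks": 5_454,
        "with_images": 39_287,
    },
}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


for gender, row in table.items():
    base = row["with_sitelinks"]  # assumed denominator for all percentages
    print(
        gender,
        pct(row["single_sitelink"], base),
        pct(row["without_sitelinks"], base),
        pct(row["with_images"], base),
    )

# Share of male items among all items with sitelinks.
total = table["male"]["with_sitelinks"] + table["female"]["with_sitelinks"]
male_share = pct(table["male"]["with_sitelinks"], total)
print("male share of items with sitelinks:", male_share)
```

Under these assumptions the recomputed values match the rounded percentages shown in the table.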
Size of biographical articles by language
For each wiki with at least 100 biographical articles, this figure shows the mean size (in bytes) of its biographical articles. A few “high-size” wikis were removed from this figure; they appear to make heavy use of multi-byte Unicode characters, thus increasing the byte size massively, though they roughly adhere to the same “shape”. Each dot represents a wiki; the dot size increases with the number of biographical items on the wiki. The X axis shows the mean bytes per male article, the Y axis the mean bytes per female article. Wikis above the line have more bytes per article about women than about men! The linear fit is surprisingly good (Pearson 0.9955423). According to the distance to the line, Mirandese Wikipedia is the most sexist one biased towards men, whereas Tamil Wikipedia is the most sexist one biased towards women 🙂
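For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the calculation: a Pearson correlation, a least-squares line, and each wiki's vertical distance from that line. The wiki names and byte values are invented for illustration, and using the vertical residual from the fitted line (rather than, say, the distance to the diagonal y = x) as the "distance to the line" is an assumption of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(points, slope, intercept):
    """Vertical distance of each wiki from the line (positive = above it)."""
    return {name: y - (slope * x + intercept) for name, (x, y) in points.items()}


# Hypothetical data: wiki -> (mean bytes per male article, mean bytes per
# female article). These are not the measured values from the figure.
wikis = {
    "wiki_a": (1000, 1100),
    "wiki_b": (2000, 1900),
    "wiki_c": (3000, 3200),
    "wiki_d": (4000, 3900),
}

xs = [x for x, _ in wikis.values()]
ys = [y for _, y in wikis.values()]
slope, intercept = least_squares(xs, ys)
res = residuals(wikis, slope, intercept)

print("Pearson r:", pearson(xs, ys))
print("furthest above the line:", max(res, key=res.get))
print("furthest below the line:", min(res, key=res.get))
```

With this choice of line, a positive residual means more bytes per female article than the overall trend predicts, and a negative one means fewer.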
This is a quick comparison of biographical items that have both a birth date and an ODNB or VIAF identifier. It seems ODNB (>85% of ODNB entries have a Wikidata item!) is more sexist than VIAF, which is more sexist than the Wikidata per-country mean!
|External catalog||Wikidata items||Overall male %|
And as a plot, by time range:
This would be “Discussion&Conclusion” in a proper publication, but as this is just a blog post…
Strong gender bias towards men exists in the number of biographical items on Wikipedia and Wikidata. However, this bias appears to be to a large degree due to historical and/or cultural bias, rather than generated by Wikimedians. Since our projects are not primary sources, we are restricted to material gathered by others, and so reflect their consistent bias. All the above data points to less bias towards men over time, and in Asian and (to a degree) Western cultures, a trend which is mirrored in other sources. It also shows that we have comparable numbers of articles about men and women, and comparable article sizes on Wikipedia, though the latter depends on the language to some degree. All Wikipedias with over 100,000 biographical items are on the “female side” of the article size distribution (data not shown, though it can be glimpsed in the article size plot), which would indicate to me that, given enough eyeballs, gender bias becomes less of an issue on Wikipedia and Wikidata.