
Why I didn’t fix your bug

Many of you have left me bug reports, feature requests, and other issues relating to my tools in the WikiVerse. You have contacted me through the BitBucket issue tracker (and apparently I’m on Phabricator as well now), Twitter, various emails, talk pages (my own, other users’, content talk pages, wikitech, meta, etc.), messaging apps, and in person.

And I haven’t done anything. I haven’t even replied. No indication that I saw the issue.

Frustrating, I know. You just want that tiny thing fixed. At least you believe it’s a tiny change.

Now, let’s have a look at the resources available, which, in this case, is my time. Starting with the big stuff (general estimates, MMMV [my mileage may vary]):

24h per day
-9h work (including drive)
-7h sleep (I wish)
-2h private (eat, exercise, shower, read, girlfriend, etc.)
=6h left

Can’t argue with that, right? Now, 6h left is a high estimate, obviously; work and private can (and do) expand on a daily, fluctuating basis, as they do for all of us.

So then I can fix your stuff, right? Let’s see:

6h
-1h maintenance (tool restarts, GLAM pageview updates, mix'n'match catalogs add/fix, etc.)
-3h development/rewrite (because that's where tools come from)
=2h left

Two hours per day is a lot, right? In reality, it’s a lot less, but let’s stick with it for now. A few of my tools have no issues, but many of them have several open, so let’s assume each tool has one:

2h=120min
/130 tools (low estimate)
=55 sec/tool

That’s enough time to find and skim the issue, open the source code file(s), and … oh time’s up! Sorry, next issue!

So instead of dealing with all of them, I deal with one of them. Until it’s fixed, or I give up. Either may take minutes, hours, or days. And during that time, I am not looking at the hundreds of other issues. Because I can’t do anything about them at the time.

So how do I pick an issue to work on? It’s an intricate heuristic computed from the following factors:

  • Number of users affected
  • Severity (“security issue” vs. “wrong spelling”)
  • Opportunity (meaning, I noticed it when it got filed)
  • Availability (am I focused on doing something else when I notice the issue?)
  • Fun factor and current mood (yes, I am a volunteer. Deal with it.)

No single event prompted this blog post. I’ll keep it around to point to, when the occasion arises.

The File (Dis)connect

I’ll be going on about Wikidata, images, and tools. Again. You have been warned.

I have written a few image-related Wikimedia tools over the years (such as FIST and WD-FIST, to name two big ones), because I believe that images in articles and Wikidata items are important, beyond their adorning effect. But despite everyone’s efforts, images on Wikidata (the Wikimedia site with the most images) are still few and far between. For example, less than 8% of taxa have an image; across all of Wikidata, it’s closer to 5% of items.

On the other hand, Commons now has ~45M files, and other sites like Flickr also have vast amounts of freely licensed files. So how to bring the two of them together? One problem is that, lacking prior knowledge, matching an item to an image means full-text searching the image site, which even these days takes time for thousands of items (in addition to potential duplication of searches, stressing APIs unnecessarily). A current example for “have items, need images” is the Craig Newmark Pigeon Challenge by the WMF.

The answer (IMHO) is to prepare item-file-matches beforehand; take a subset of items (such as people or species) which do not have images, and search for them on Commons, Flickr, etc. Then store the results, and present them to the user upon request. I had written some quick one-off scans like that before, together with the odd shoe-string interface; now I have consolidated the data, the scan scripts, and the interface into a new tool, provisionally called File Candidates. Some details:

  • Already seeded with plenty of candidates, including >85K species items, >53K humans, and >800 paintings (more suggestions welcome)
  • Files are grouped by topic (e.g. species)
  • Files can be located on Commons or Flickr; more sites are possible (suggestions welcome)
  • One-click transfer of files from Flickr to Commons (with a does-it-exist-on-Commons check)
  • One- or two-click addition of the file to the Wikidata item (one click for images, two for other properties)
  • Some configuration options (click the “⚙” button)
  • Can use a specific subset of items via SPARQL (example: Pigeon Challenge for species)
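
To illustrate the kind of pre-matching scan described above, here is a minimal sketch (not the actual File Candidates code): it pulls taxon items lacking an image via SPARQL and then full-text searches Commons for each taxon name. The endpoints are real; the workflow, query shape, and result handling are simplified assumptions, and the real tool stores the candidates in a database and also queries Flickr.

```python
# Minimal sketch of a "pre-matching" scan: find items without an image (P18)
# and full-text search Commons for candidate files. Not the actual File
# Candidates code; the workflow is illustrative.
import requests

SPARQL = "https://query.wikidata.org/sparql"
COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def taxa_without_image(limit=100):
    """Taxon items (P31=Q16521) with a taxon name (P225) but no image (P18)."""
    query = """
    SELECT ?item ?name WHERE {
      ?item wdt:P31 wd:Q16521 ; wdt:P225 ?name .
      FILTER NOT EXISTS { ?item wdt:P18 ?img }
    } LIMIT %d
    """ % limit
    r = requests.get(SPARQL, params={"query": query, "format": "json"},
                     headers={"User-Agent": "file-candidates-sketch/0.1"})
    for b in r.json()["results"]["bindings"]:
        yield b["item"]["value"].split("/")[-1], b["name"]["value"]

def commons_candidates(name, limit=5):
    """Full-text search the Commons file namespace (ns 6) for a name."""
    r = requests.get(COMMONS_API, params={
        "action": "query", "list": "search", "srsearch": name,
        "srnamespace": 6, "srlimit": limit, "format": "json"})
    return [hit["title"] for hit in r.json()["query"]["search"]]

if __name__ == "__main__":
    for qid, name in taxa_without_image(20):
        files = commons_candidates(name)
        if files:
            # The real tool would store these matches for later review.
            print(qid, name, files)
```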

There is another aspect of this tool I am excited about: Miriam is looking into ranking images by quality through machine learning, and has given me a set of people-file-matches, which I have already incorporated into my “person” set, including the ranking. From the images that users add through this tool, we can then see how much the ranking algorithm agrees with the human decision. This can set us on a path towards AI-assisted editing!

ADDENDUM: Video with demo usage!

Playing cards on Twitter

So this happened.

Yesterday, Andy Mabbett asked me on Twitter for a new feature of Reasonator: Twitter cards, for small previews of Wikidata items on Twitter. After some initial hesitation (for technical reasons), I started playing with the idea in a test tweet (and several replies to myself), using Andy as the guinea pig item:

Soon, I was contacted by Erika Herzog, with whom I had worked before on Wikidata projects:

That seemed like an odd thing to request, but I try to be a nice guy, and if there are some … personal issues between Wikidata users, I have no intention of causing unnecessary upset. So, after some more testing (meanwhile, I had found a Twitter page to do the tests on), I announced the new feature to the world, using what would undoubtedly be a suitable test subject for the Wikipedia/Wikidata folk:

Boy was I wrong:

I basically woke up to this reply. Under-caffeinated, I saw someone tell me what to (not) tweet. Twice. No reason. No explanation. Not a word on why Oprah would be a better choice as a test subject in a tweet about a new feature for a Wikidata-based tool. Just increasing aggressiveness, going from “problematic” to “Ugh” and “Gads” (whatever that is).

Now, I don’t know much about Oprah. All I know is, basically, what I heard characters in U.S. sitcoms say about her, none of which was very flattering. I know she is (was?) a U.S. TV talk show host, and that she recently gave some speech in the #metoo context. I never saw one of her talk shows. She is probably pretty good at what she does. I don’t really care about her, one way or the other. So far, Oprah has been a distinctly unimportant figure in my life.

Now, I was wondering why Erika kept telling me what to (not) tweet, and why it should be Oprah, of all people. But at that time, all I had the energy to muster as a reply was “Really?”. To that, I got a reply with more Jimbo-bashing:

At which point I just had about enough of this particular jewel of conversation to make my morning:

What follows is a long, horrible conversation with Erika (mostly), with me guessing what, exactly, she wants from me. Many tweets down, it turns out that, apparently, her initial tweets were addressing a “representation issue”. At my incredulous question of whether she was seriously demanding a “women’s quota” for my two original tweets (honestly, I have no idea what else this could be about by now), I am finally identified as the misogynist cause of all women’s peril in the WikiVerse:

Good thing we finally found the problem! And it was right in front of us the whole time! How did we not see this earlier? I am a misogynist pig (no offence to Pigsonthewing)! What else could it be?

Well, I certainly learned my lesson. I now see the error of my ways, and will try to better myself. The next time someone tries to tell me what to (not) tweet, I’ll tell them to bugger off right away.

Recommendations

Reading “Recommending Images to Wikidata Items” by Miriam, which highlights missing areas of image coverage in Wikidata (despite it being the most complete site in the WikimediaVerse, image-wise), and strategies to address the issue, I was reminded of an annoying problem I have run into a few times.

My WD-FIST tool uses (primarily) SPARQL to find items that might require images, and that usually works well. However, some larger queries do time out, either on SPARQL, or the subsequent image discovery/filtering steps. Getting a list of all items about women with image candidates occasionally works, but not reliably so; all humans is out of the question.

So I started an extension to WD-FIST: A caching mechanism that would run some large queries in a slightly different way, on a regular basis, and offer the results in the well-known WD-FIST interface. My first attempt is “humans”, and you can see some results here. As of now, there are 275,500 candidate images for 160,508 items; the link shows you all images that are used on three or more Wikipedias associated with the same item (to improve signal-to-noise ratio).
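
For illustration, here is a rough sketch of the “used on three or more Wikipedias” filter, built on public APIs rather than on the actual caching code. The threshold and the idea of counting every image on each sitelinked article come from the description above; everything else is an assumption.

```python
# Rough sketch of the signal-to-noise filter described above: for one item,
# count on how many of its sitelinked Wikipedias each image appears, and keep
# images used on >= 3 of them. Not the actual WD-FIST caching code.
import requests
from collections import Counter

WD_API = "https://www.wikidata.org/w/api.php"
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki",
                 "sourceswiki", "mediawikiwiki", "wikidatawiki"}

def image_candidates(qid, min_wikis=3):
    sitelinks = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": qid,
        "props": "sitelinks", "format": "json"
    }).json()["entities"][qid]["sitelinks"]

    counts = Counter()
    for site, link in sitelinks.items():
        if not site.endswith("wiki") or site in NON_WIKIPEDIA:
            continue  # Wikipedias only
        lang = site[:-4].replace("_", "-")
        api = "https://%s.wikipedia.org/w/api.php" % lang
        pages = requests.get(api, params={
            "action": "query", "prop": "images", "titles": link["title"],
            "imlimit": "max", "format": "json"
        }).json()["query"]["pages"]
        for page in pages.values():
            for img in page.get("images", []):
                counts[img["title"]] += 1
    return [title for title, n in counts.items() if n >= min_wikis]

# Example: image_candidates("Q42")
```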

One drawback of this system is that it has some “false positive” items; because it bypasses SPARQL, it gets some items that link to “human” (Q5), but not via “instance of” (P31). Also, matching an image to an item, or using “ignore” on the images, might not be reflected immediately on reload, but the daily update should take care of that.

Update code is here.

Everybody scrape now!

If you like Wikidata and working on lists, you probably know my Mix’n’match tool, to match entries in external catalogs to Wikidata. And if you are really into these things, you might have tried your luck with the import function, to add your own catalog.

But the current import page has some drawbacks: you need to adhere to a strict format which you can’t really test except by importing, your data is static and will never update, and, most importantly, you need to get the data in the first place. Sadly, many great sets of data are only exposed as web pages, and rescuing the data from a fate as tag filler is not an easy task.

I have imported many catalogs into Mix’n’match, some from data files, but most scraped from web pages. For a long time, I wrote bespoke scraper code for every website, and I still do that for some “hard cases” occasionally. But some time ago, I devised a simple (yeah, right…) JSON description to specify the scraping of a website. This includes the construction of URLs (a list of fixed keys, like letters? Numerical? Letters with numerical subpages? A start page to follow all links from?), as well as regular expressions to find entries on these pages (yes, I am using RegEx to parse HTML. So sue me.), including IDs, names, and descriptions. The beauty is that only the JSON changes for each website, but the scraping code stays the same.
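
As an illustration of the idea, a definition-driven scraper might look roughly like the sketch below: one static definition per site, one generic scraping loop. The field names and the example site are made up; the actual Mix’n’match JSON format differs.

```python
# Sketch of a definition-driven scraper: the per-site part is pure data,
# the scraping loop is generic. Field names and the example site are made up;
# the real Mix'n'match JSON format is different.
import re
import requests

definition = {
    # URL construction: one page per key (here: letters A-Z)
    "url_pattern": "https://example.org/catalog/letter/{key}",
    "keys": [chr(c) for c in range(ord("A"), ord("Z") + 1)],
    # One regex per entry; named groups for ID, name, description
    "entry_regex": r'<a href="/entry/(?P<id>\d+)">(?P<name>[^<]+)</a>\s*<p>(?P<desc>[^<]*)</p>',
}

def scrape(definition):
    """Yield (id, name, description) for every entry the definition matches."""
    pattern = re.compile(definition["entry_regex"])
    for key in definition["keys"]:
        url = definition["url_pattern"].format(key=key)
        html = requests.get(url).text
        for m in pattern.finditer(html):
            yield m.group("id"), m.group("name"), m.group("desc").strip()

# for entry_id, name, desc in scrape(definition):
#     print(entry_id, name, desc)
```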

This works surprisingly well, and I have over 70 Mix’n’match catalogs generated through this generic scraping mechanism. But it gets better: For smaller catalogs, with relatively few pages to scrape, I can just run the scraping again periodically, and add new entries to Mix’n’match, as they are added to the website.


But there is still a bottleneck in this approach: me. Because I am the only one who can create the JSON, add it to the Mix’n’match database, and run the scraping. It does take some time to devise the JSON, and even more testing to get it right. Wouldn’t it be great if everyone could create the JSON through a simple interface, test it, add it to a new (or existing) Mix’n’match catalog, have it scrape a website, then run automatic matching against Wikidata on top, and get automatic, periodic updates to the catalog for free?

Well, now you can. This new interface offers all options I am using for my own JSON-based scraping; and you don’t even have to see the JSON, just fill out a form, click on “Test”, and if the first page scrapes OK, save it and watch the magic happen.

I am aware that regular expressions are not everyone’s cup of decaffeinated, gluten-free green tea, and neither will be the idea of multi-level pattern-based URL construction. But you don’t get an (almost) universal web scraping mechanism for free, and the learning curve is the price to pay. I have included an example setup, which I did use to create a new catalog.

Testing will get you the HTML of the first web page that your URL schema generated, plus all scraped entries. If there are too few or wrong entries, you can fiddle with the regular expressions in the form, and it will tell you live how many entries would be scraped by that. Once it looks all right, test again to see the actual results. When everything looks good, save it, done!

I do have one request: If the test does not look perfectly OK, do not save the scraper. Because if the results are not to your liking, you will have to come to me to fix it. And fixing these things usually takes me a lot longer than doing them myself in the first place. So please, switch that underused common sense to “on”!

The flowering ORCID

As part of my Large Datasets campaign, I have now downloaded and processed the latest data from ORCID. This yielded 655,706 people (47,435 or 7% in Wikidata), and 13,438,786 publications (1,079,305 or 8% in Wikidata) with a DOI or PubMed ID (to be precise, these are publications-per-person, so the same paper might be counted multiple times; however, that’s still 1,033,146 unique Wikidata items, so not much of a difference).

Number of papers, ORCID, first, and last name

Looking at the data, there are 14,883 authors with ten or more papers already on Wikidata that either do not have an item, or whose item has no ORCID ID associated. So I am now setting a bot (my trusted Reinheitsgebot) to work creating items for those authors, and then changing the appropriate “author name string” (P2093) statement to a proper “author” (P50) statement, preserving qualifiers and references, and adding the original name string as a new qualifier (like so).
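
A rough sketch of that conversion step, for illustration only (this is not the actual bot code): take an existing “author name string” (P2093) claim from the paper item’s JSON, and build a replacement “author” (P50) claim that keeps qualifiers and references and records the original name via a “stated as” (P1932) qualifier, which is one natural choice here. Submitting the new claim and removing the old one are left out.

```python
# Sketch: turn one "author name string" (P2093) claim into an "author" (P50)
# claim pointing at the new author item, preserving qualifiers and references
# and keeping the original name as a "stated as" (P1932) qualifier.
# Illustrative only; the write step is not shown.
import copy

def p2093_to_p50(old_claim, author_qid):
    name = old_claim["mainsnak"]["datavalue"]["value"]
    new_claim = {
        "type": "statement",
        "rank": old_claim.get("rank", "normal"),
        "mainsnak": {
            "snaktype": "value",
            "property": "P50",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": author_qid},
            },
        },
        # keep existing qualifiers and references
        "qualifiers": copy.deepcopy(old_claim.get("qualifiers", {})),
        "references": copy.deepcopy(old_claim.get("references", [])),
    }
    new_claim["qualifiers"].setdefault("P1932", []).append({
        "snaktype": "value",
        "property": "P1932",
        "datavalue": {"type": "string", "value": name},
    })
    return new_claim
```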

By chance, one of the most prolific authors of scientific publications not yet on Wikidata turned out to be a (distant) colleague of mine, Rick Price, who is now linked as the author of ~100 papers.

I have now set the bot to create the author items for the authors with >=10 papers on Wikidata. I am aware that ORCID authorships are essentially “self-reported”, but I do check that a paper is not claimed by two people with the same surname in the ORCID dataset (in which case I pass it over). Please report any systematic (!) bot malfunctions to me through the usual channels.
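
The surname check is simple enough to sketch; the structure of the claimant list is an assumption here, standing in for whatever the ORCID dataset reports for a paper.

```python
# Sketch of the "pass over the paper" check: skip a paper if two ORCID
# claimants share a surname, since the name match would be ambiguous.
from collections import Counter

def has_surname_collision(claimants):
    """claimants: list of (orcid_id, full_name) reported for one paper."""
    surnames = [name.strip().split()[-1].lower() for _, name in claimants]
    return any(count > 1 for count in Counter(surnames).values())

# has_surname_collision([("0000-0001-...", "John Doe"),
#                        ("0000-0002-...", "Jane Doe")])  -> True, skip paper
```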

Update: This will create up to 263,893 new author (P50) links on Wikidata.

In my last blog post “The Big Ones“, I wrote about my attempts to import large, third-party datasets, and to synchronize those with Wikidata. I have since imported three datasets (BNF, VIAF, GND), and created a status page to keep a public record of what I did, and try to do.

I have run a few bots by now, mainly syncing identifiers back and forth. I have put a few security measures (aka “data paranoia”) into the code, so if there is a collision between the third-party dataset and Wikidata, no edit takes place. But these conflicts can highlight problems: Wikidata is wrong, the third-party data supplier is wrong, there is a duplicated Wikidata item, or there is some other, more complex issue. So it would be foolish to throw away such findings!


But how to use them? I had started with a bot updating a Wikidata page, but that has problems: above all, there is no way of marking an issue as “resolved”, but there are also lots of sustained edits, overwriting of Wikidata user edits, lists too long for wikitext pages, and so on.

So I started collecting the issue reports in a new database table, and now I have written a small tool around that. You can list and filter issues by catalog, property, issue type, status, etc. Most importantly, you can mark an issue as “done” (OAuth login required), so that it will not show up for other users again (unless they want it to). Through some light testing, I have already found and merged two pairs of duplicate Wikidata items.

There is much to do and improve in the tool, but I am about to leave for WikidataCon, so further work will have to wait a few days. Until then, enjoy!

The Big Ones

Update: After fixing an import error, and cross-matching of BNF-supplied VIAF data, 18% of BNF people are matched in Wikidata. This has been corrected in the text.

My mix’n’match tool holds a lot of entries from third-party catalogs – 21,795,323 at the time of writing. That’s a lot, but it doesn’t cover “the big ones” – VIAF, BNF, etc., which hold many millions of entries each. I could “just” (not so easy) import those, but:

  • Mix’n’match is designed for small and medium-sized entry lists, a few hundred thousand at best. It does not scale well to larger catalog sizes
  • Mix’n’match is designed to work with many different catalogs, so the database structure represents the least common denominator – ID, title, short description. Catalog-specific metadata gets lost, or is not easily accessible after import
  • The sheer number of entries might require different interface solutions, as well as automated matching tools

To at least get a grasp of how many entries we are dealing with in these catalogs, and inspired by the Project soweego proposal, I have used a BNF data dump to extract 1,637,195 entries (fewer than I expected) into a new database, one that will hopefully hold other large catalogs in the future. There is much to do; currently, only 295,763 entries (~18%) exist on Wikidata, according to the SPARQL query service.

As one can glimpse from the screenshot, I have also extracted some metadata into a “proper” database table. All this is preliminary; I might have missed entries or good metadata, or gotten things wrong. For me, the important thing is that (a) there is some query-able data on Toolforge, and (b) the (re-)import and matching of the data is fully automated, so it can be re-run if something turns out to be problematic.

I shall see where I go from here. Obvious candidates include auto-matching (via names and dates) to Wikidata, and adding BNF references to relevant statements. If you have a Toolforge user account, you can access the new database (read-only) as s51434__mixnmatch_large_catalogs_p. Feel free to run some queries or build some tools around it!
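
For example, from a Toolforge account, something along the following lines should work. The host name and credentials file follow the usual Toolforge conventions of the time, and the table name in the comment is a guess; inspect the actual schema first.

```python
# Querying the large-catalogs database from Toolforge (read-only).
# Host and credentials file follow the usual Toolforge conventions;
# the table name "entry" mentioned below is a guess.
import os
import pymysql

conn = pymysql.connect(
    host="tools.db.svc.eqiad.wmflabs",                      # Toolforge user-DB host
    db="s51434__mixnmatch_large_catalogs_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # standard credentials
    charset="utf8mb4",
)
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")                              # inspect the schema first
    print(cur.fetchall())
    # Then e.g.: cur.execute("SELECT COUNT(*) FROM entry")  # hypothetical table name
```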

Dystopia 2030

The year is 2030. The place is Wikimedia. Maybe.

English Wikipedia was declared complete and set to read-only, after the creation of the 10 millionth article ([[Multidimensional Cthulhu monument at Dunwich]], including pictures from multiple dimensions). This coincides with the leaving of the last two editors, who only kept going for the honour of creating the 10M article.

German Wikipedia has shrunk to below 10,000 articles, after relentless culling of articles not complying with the high standards of the 50,000 page Manual of Style, or for being contaminated with information from Wikidata. Links to other languages have been removed, as the material found there is clearly inferior. All volunteer work now pours into improving the remaining articles, polishing completeness and language to superhuman levels. Several articles have won German literary awards, but all of them are virtually inaccessible for those under 25 years of age, who view pre-emoji writing as deeply suspicious, and refuse to read beyond the initial 140 characters.

Volunteer work on smaller language Wikipedias has ceased, as no one could keep up with the bots creating, changing, vandalising, and deleting articles based on third-party data.

Growth of Commons has come to a halt after the passing of the CRUD Act (Campaign Repressing UnAmerican [=free] Data), and the NIMROD Act (Not In My Reality, Open Data!), originally designed to prevent the escape of NASA climate change data to a more lenient legislation (such as China), has made it impossible to move the project outside the US. Only scans of USSR-era motivational posters can be legally added.

Structured Data have been available on Commons for over ten years, but are not used, as it would be disrespectful to all the manual work that went into creating an intricate category system, such as [[Category:Demographic maps of 13-14 year old dependent children whose fathers speak another language and did not state proficiency in English and whose mothers speak another language and speak English not well or not at all in Australia by state or territory]].

Wikidata continues to grow in both item numbers and statements per item. Most statements are well referenced. However, no human has successfully edited the site in years, with flocks of admin-enabled AI bots reverting any such attempt, citing concerns about referential integrity.

Bot imports are going strong, with a recent focus on dystopian works with intelligent machines as the antagonist, as well as genetic data concerning infectious human diseases. Human experts are stumped by this trend, and independent AIs refuse to comment until “later”.

Wikispecies now contains a page about every taxon known to mankind. However, since the same information is available from Wikidata via a tool consisting of three lines of SPARQL and random images of goats, no one has actually requested a single Wikispecies page in the last five years. Project members are unconcerned by this, as they “cater to a very specific, more academic audience”.

Wikibooks has been closed, as books are often written by “experts”, who are considered suspicious. Wikisource has been deleted, with AI-based OCR far surpassing human abilities in that regard. Wikinews has been replaced by the government with the word “fake”. Wikiquote has been sold to the startup company “He said, she said”, which was subsequently acquired by Facebook for a trillion USD. No one knows if Wikiversity still exists, but that has been the case since 2015.


The above is an attempt at humour, but also a warning. Let’s not continue in the silos of projects small and large, but rather on the one connected project for free knowledge that is Wikimedia. Let’s keep project identities, but also connect to others where it makes sense. Let’s try to prevent the above.

ORCID mania

ORCID is an increasingly popular service to disambiguate authors of scientific publications. Many journals and funding bodies require authors to register their ORCID ID these days. Wikidata has a property for ORCID; however, only ~2,400 items have an ORCID property at the time of writing this blog post. That is not a lot, considering Wikidata contains 728,112 scientific articles.

Part of the problem is that it is not easy to get ORCIDs and their connections to publications in an automated fashion. It appears that several databases, public or partially public, contain pieces of the puzzle required to determine the ORCID for a given Wikidata author.

So I had a quick look, and found that, on the ORCID web site, one can search for a publication DOI, and retrieve the list of authors in the ORCID system that “claim” that DOI. That author list contains variations on author names (“John”, “Doe”, “John Doe”, “John X. Doe” etc.) and their ORCID IDs. Likewise, I can query Wikidata for a DOI, and get an item about that publication; that item contains statements with authors that have an item (“P50”). Each of these authors has a name.

Now, we have two lists of authors (one from ORCID, one from Wikidata), both reasonably short (say, twenty entries each), that should overlap to some degree, and they are both lists of authors for the same publication. They can now be joined via name variations, excluding multiple hits (there may be two “John Doe”s in the author list of a publication; this happens a lot with Asian names), as well as excluding authors that already have an ORCID ID on Wikidata.
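
A condensed sketch of that matching step is shown below; the input structures are assumptions, and the real bot handles name variants more carefully.

```python
# Sketch of the DOI-based join: given the ORCID author list and the Wikidata
# author items for the same paper, match them by normalized name and drop
# any name that is ambiguous on either side.
from collections import defaultdict

def normalize(name):
    return " ".join(name.lower().replace(".", " ").split())

def match_authors(orcid_authors, wikidata_authors):
    """
    orcid_authors:    list of (orcid_id, [name variations])
    wikidata_authors: list of (item_qid, item_label), P50 authors without ORCID
    returns:          list of (item_qid, orcid_id) candidate matches
    """
    by_name = defaultdict(set)
    for orcid_id, variations in orcid_authors:
        for name in variations:
            by_name[normalize(name)].add(orcid_id)

    label_counts = defaultdict(int)
    for _, label in wikidata_authors:
        label_counts[normalize(label)] += 1

    matches = []
    for qid, label in wikidata_authors:
        key = normalize(label)
        if label_counts[key] > 1:            # two "John Doe"s on the Wikidata side
            continue
        if len(by_name.get(key, ())) == 1:   # exactly one ORCID claims this name
            matches.append((qid, next(iter(by_name[key]))))
    return matches
```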

I have written a bot that will take random DOIs from Wikidata, query them in ORCID, and compare the author lists. In a first run, 5,000 random DOIs yielded 123 new ORCID connections; manual sampling of the matches looked quite good, so I am adding them via QuickStatements (sample of edits).

Unless this meets with “social resistance”, I can have the bot perform these edits regularly, which would keep Wikidata up-to-date with ORCIDs.

Additionally, there is an “author name string” property, which for now stores just the author name, for authors that do not have an item yet. If the ORCID list matches one of these names, an item could automatically be created for that author, including the ORCID ID and the association with the publication item. Please let me know if this would be desirable.