
The Hand-editor’s Tale

Disclaimer: I am the author of Listeria, and maintainer of ListeriaBot.

In January 2016, User:Emijrp had an idea. Why not use that newfangled Listeria tool, a bot that generates lists based on Wikidata, and puts them on Wikipedia pages, to maintain a List of Women Linguists on English Wikipedia? It seemed that a noble cause had met with (at the time) cutting edge technology to provide useful information for both readers and editors (think: red links) on Wikipedia.

The bot thus began its work, and continued dutifully for almost a year (until January 2, 2017). At that time, a community decision was made to deactivate ListeriaBot edits on the page, but to keep the list it had last generated, for manual curation. No matter what motivated that decision, it is interesting to evaluate the progress of the page in its “manual mode”.

Since the bot was deactivated, 712 days have passed (at the time of writing, 2018-12-17). Edit frequency dropped from one edit every 1-2 days by the bot, to one edit every 18 days on average.

In that time, the number of entries increased from 663 (last bot edit) to 673 (adding one entry every 71 days on average). The query (women, linguists, but no translators) used to generate the Listeria list now yields 1,673 entries. This means the list on English Wikipedia is now exactly 1,000 entries (or 148%) out of date. A similar lag for images and birth/death dates is to be expected.

The manual editors kept the “Misc” section, which was used by ListeriaBot to group entries of unknown or “one-off” (not warranting their own section) nationalities. It appears that few, if any, have been moved into appropriate sections.

It is unknown if manual edits to the list (example), the protection of which was given as a main reason to deactivate the bot, were propagated to the Wikidata item, or to the articles on Wikipedia where such exist (here, es and ca), or if they are destined to wither on the list page.

The list on Wikipedia links to 462 biography pages (likely a slight overestimate) on English Wikipedia. However, there are 555 Wikidata items from the original query that have a sitelink to English Wikipedia. The list on Wikipedia thus fails to link to (at least) 93 women linguists that exist on the same site. One example of a missing entry would be Antonella Sorace, a Fellow of the Royal Society. She is, of course, linked from another, ListeriaBot-maintained page.

Humans are vastly superior to machines in many respects. Curating lists is not necessarily one of them.

What else?

Structured Data on Commons is approaching. I have done a bit of work on converting Infoboxes into statements, that is, to generate structured data. But what about using it? What could that look like?

Taxon hierarchy for animals (image page)

Inspired by a recent WMF blog post, I wrote a simple demo of what you might call “auto-categorisation”. You can try it out by adding the line

importScript('User:Magnus Manske/whatelse.js') ;

to your common.js script.

It works for files on Commons that are used in a Wikidata item (so, ~2.8M files at the moment), though that could be expanded (e.g. scanning for templates with Qids, “depicts” in Structured data, etc.). The script then investigates the Wikidata item(s), and tries to find ways to get related Wikidata items with images.
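For illustration, here is one way (a hedged sketch, not the actual whatelse.js code) to find the Wikidata item(s) that use a given Commons file as their image, by asking the Wikidata Query Service for P18; the file name is a made-up example, and the real script considers more than just P18:

// Sketch only: find Wikidata items whose image (P18) is a given Commons file.
// WDQS stores commonsMedia values as Special:FilePath IRIs.
var fileName = 'The Night Watch - Example.jpg'; // hypothetical file name
var fileIri = 'http://commons.wikimedia.org/wiki/Special:FilePath/' + encodeURIComponent(fileName);
var sparql = 'SELECT ?item WHERE { ?item wdt:P18 <' + fileIri + '> }';
fetch('https://query.wikidata.org/sparql?format=json&query=' + encodeURIComponent(sparql))
  .then(function (r) { return r.json(); })
  .then(function (d) {
    d.results.bindings.forEach(function (b) { console.log('Used in item:', b.item.value); });
  });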

The Night Watch (image page)

These could be simple things such as “all items that have the same creator (and an image)”, but I also added a few bespoke ones.

If the item is a taxon (e.g. the picture is of an animal), it finds the “taxon tree” by following the “parent taxon” property. It even follows branches, and constructs the longest path possible, to get as many taxon levels as possible (I stole that code from Reasonator).

A similar thing happens for all P31 (“instance of”) values, where it follows the subclass hierarchy; the London Eye is “instance of:Ferris wheel”, so you get “Ferris wheel”, its super-class “amusement ride” etc.

The same, again, for locations, all the way up to country. If the item has a coordinate, there are also some location-based “nearby” results.

Finally, some date fields (birthdays, creation dates) are harvested for the years.
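Most of these hierarchy walks follow the same pattern. The demo walks the item JSON step by step, as described above, but as a hedged illustration, a single SPARQL property-path query against the Wikidata Query Service can collect the whole chain at once; Q140 (lion) is just an example item:

// Sketch only, not the demo's actual code: collect the "parent taxon" chain
// (P171) for taxa, or the "instance of"/"subclass of" chain (P31/P279) otherwise.
var qid = 'Q140'; // hypothetical example: lion
var sparql =
  'SELECT DISTINCT ?ancestor ?ancestorLabel WHERE {\n' +
  '  { wd:' + qid + ' wdt:P171+ ?ancestor }\n' +          // taxon tree, upwards
  '  UNION\n' +
  '  { wd:' + qid + ' wdt:P31/wdt:P279* ?ancestor }\n' +  // class hierarchy
  '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n' +
  '}';
fetch('https://query.wikidata.org/sparql?format=json&query=' + encodeURIComponent(sparql))
  .then(function (r) { return r.json(); })
  .then(function (d) {
    d.results.bindings.forEach(function (b) {
      console.log(b.ancestor.value, '=', b.ancestorLabel.value);
    });
  });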

The London Eye (image page)

Each of these, if applicable, gets its own section in a box floating on the right side of the image. They link to a gallery-type SPARQL query result page, showing all items that match a constraint and have an image. So, if you look at The Night Watch on Commons, the associated Wikidata item has “Creator:Rembrandt”. Therefore, you get a “Creator” section, with a “Rembrandt” link, that opens a page showing all Wikidata items with “Creator:Rembrandt” that have an image.

In a similar fashion, there are links to “all items with inception year 1642”, or items with “movement: baroque”. You get the idea.
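For the curious, such a gallery link opens a query roughly like the following (a sketch with the IDs I believe apply: P170 = creator, Q5598 = Rembrandt, P18 = image; the #defaultView:ImageGrid hint makes the query service render the results as a gallery):

// Sketch of a gallery-type query: all items created by Rembrandt that have an image.
var sparql =
  '#defaultView:ImageGrid\n' +
  'SELECT ?item ?itemLabel ?image WHERE {\n' +
  '  ?item wdt:P170 wd:Q5598 ;\n' + // creator: Rembrandt
  '        wdt:P18  ?image .\n' +   // ...and it has an image
  // For the "inception year 1642" link, one would filter on the year instead:
  // '  ?item wdt:P571 ?date . FILTER( YEAR(?date) = 1642 )\n' +
  '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n' +
  '}';
// Opening this URL shows the gallery in the Wikidata Query Service GUI:
console.log('https://query.wikidata.org/#' + encodeURIComponent(sparql));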

Now, this is just a demo, and there are several issues with it. First, it uses Wikidata, as there is no Structured Data on Commons yet. That limits it to files used in Wikidata items, and to the property schema and tree structure used on Wikidata. Some of the links on offer lead to ridiculously large queries (all items that are an instance of a subclass of “entity”, anyone?), some just return the same file you came from (because it is the only item with an image created by Painter X), and some look useful but time out anyway. And, as it is, the way I query the APIs would likely not be sustainable for use by everyone by default.

But then, this is what a single guy can hack in a few hours, using a “foreign” database that was never intended to make browsing files easy. Given these limitations, when I think about what the community will be able to do with bespoke, fit-for-purpose Structured Data and some well-designed code, I am very hopeful.

Note: Please feel free to work with the JS code; it also contains my attempt to show the results in a dialog box on the File Page, but I couldn’t get it to look nice, so I keep using external links.

Match point

Mix’n’match is one of my more popular tools. It contains a number of catalogs, each in turn containing hundreds or even millions of entries, that could (and often should!) have a corresponding Wikidata item. The tool offers various ways to make it easier to match an entry in a catalog to a Wikidata item.

While the user-facing end of the tool does reasonably well, the back-end has become a bit of an issue. It is a bespoke, home-grown MySQL database that has changed a lot over the years to incorporate more (and more complex) metadata to go with the core data of the entries. Entries, birth and death dates, coordinates, third-party identifiers are all stored in separate, dedicated tables. So is full-text search, which is not exactly performant these days.

Perhaps the biggest issue, however, is the bottleneck in maintaining that data – myself. As the only person with write access to the database, all maintenance operations have to run through me. And even though I have added import functions for new catalogs, and run various automatic update and maintenance scripts on a regular basis, the simple task of updating an existing catalog depends on me, and it is rather tedious work.

At the 2017 Wikimania in Montreal, I was approached by the WMF about Mix’n’match; the idea was that they would start their own version of it, in collaboration with some of the big providers of what I call catalogs. My recommendation to the WMF representative was to use Wikibase, the data management engine underlying Wikidata, as the back-end, to allow for a community-based maintenance of the catalogs, and use a task specific interface on top of that, to make the matching as easy as possible.

As it happens with the WMF, a good idea vanished somewhere in the mills of bureaucracy, and was never heard from again. I am not a system administrator (or, let’s say, it is not the area where I traditionally shine), so setting up such a system myself was out of the question at that time. However, these days, there is a Docker image by the German chapter that incorporates MediaWiki, Wikibase, Elasticsearch, the Wikibase SPARQL service, and QuickStatements (so cool to see one of my own tools in there!) in a single package.

Long story short, I set up a new Mix’n’match using Wikibase as the back-end.

Automatic matches

The interface is similar to the current Mix’n’match (I’ll call it V1, and the new one V2), but a complete re-write. It does not support all of the V1 functionality – yet. I have set up a single catalog in V2 for testing, one that is also in V1. Basic functionality in V2 is complete, meaning you can match (and unmatch) entries in both Mix’n’match and Wikidata. Scripts can import matches from Wikidata, and do (preliminary) auto-matches of entries to Wikidata, which need to be confirmed by a user. This, in principle, is similar to V1.

There are a few interface perks in V2. There can be more than one automatic match for an entry, and they are all shown as a list; one can set the correct one with a single click. And manually setting a match will open a full-text Wikidata search drop-down inline, often sparing one the need to search on Wikidata and then copying the QID to Mix’n’match. Also, the new auto-matcher takes the type of the entry (if any) into account; given a type Qx, only Wikidata items with name matches that are either “instance of” (P31) Qx (or one of the subclasses of Qx), or items with name matches but without P31 are used as matches; that should improve auto-matching quality.
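As an illustration of that last point, the type-aware lookup could boil down to a query like this (a hedged sketch, not necessarily what the auto-matcher actually runs; name and type are made-up examples):

// Sketch: name matches that are either an instance of the given type (or one of
// its subclasses), or that have no P31 at all.
var name = 'Jane Doe';   // hypothetical entry name
var type = 'Q5';         // hypothetical catalog type: human
var sparql =
  'SELECT DISTINCT ?item WHERE {\n' +
  '  ?item rdfs:label|skos:altLabel "' + name + '"@en .\n' +
  '  OPTIONAL { ?item wdt:P31 ?someType }\n' +
  '  FILTER( !BOUND(?someType) || EXISTS { ?item wdt:P31/wdt:P279* wd:' + type + ' } )\n' +
  '}';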

Manual matching with inline Wikidata search

But the real “killer app” lies in the fact that everything is stored in Wikibase items. All of Mix’n’match can be edited directly in MediaWiki, just like Wikidata. Everything can be queried via SPARQL, just like Wikidata. Mass edits can be done via QuickStatements, just like… well, you get the idea. But users will just see the task-specific interface, hiding all that complexity, unless they really want to peek under the hood.

So much for the theory; sadly, I have run into some real-world issues that I do not know how to fix on my own (or do not have the time and bandwidth to figure out; same effect). First, as I know from bitter experience, MediaWiki installations attract spammers. Because I really don’t have time to clean up after spammers on this one, I have locked account creation and editing; that means only I can run QuickStatements on this Wiki (let me know your Wikidata user name and email, and I’ll create an account for you, if you are interested!). Of course, this kind of defeats the purpose of having the community maintain the back-end, but what can I do? Since the WMF has bowed out in silence, the wiki isn’t using the WMF single sign-on. The OAuth extension, which was originally developed for that specific purpose, ironically doesn’t work for MediaWiki as a client.

But how can people match entries without an account, you ask? Well, for the Wikidata side, they have to use my Widar login system, just like in V1. And for the V2 Wiki, I have … enabled anonymous editing of the item namespace. Yes, seriously. I just hope that Wikibase data spamming is a bit in the future, for now. Your edits will still be credited using your Wikidata user name in edit summaries and statements. Yes, I log all edits as Wikibase statements! (Those are also used for V2 Recent Changes, but since Wikibase only stores day-precision timestamps, Recent Changes looks a bit odd at the moment…)

I also ran into a few issues with the Docker system, and I have no idea how to fix them. These include:

  • Issues with QuickStatements (oh the irony)
  • SPARQL linking to the wrong server
  • Fulltext search is broken (this also breaks the V2 search function; I am using prefix search for now)
  • I have no idea how to backup/restore any of this (bespoke configuration, MySQL)

None of the above are problems with Mix’n’match V2 in principle, but rather engineering issues to fix. Help would be most welcome.

Other topics that would need work and thought include:

  • Syncing back to Wikidata (probably easy to do).
  • Importing of new catalogs, and updating of existing ones. I am thinking about a standardized interchange format, so I can convert from various input formats (CSV files, auto-scrapers, MARC 21, SPARQL interfaces, MediaWiki installations, etc.); see the sketch after this list.
  • Meta-data handling. I am thinking of a generic method of storing Wikidata property Px ID and a corresponding value as Wikibase statements, possibly with a reference for the source. That would allow most flexibility for storage, matching, and import into Wikidata.
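To make the last two points a bit more concrete, here is a purely hypothetical sketch of what a single interchange entry with generic metadata could look like; none of the field names are settled, and the example values are invented:

// Hypothetical interchange entry: core fields plus generic property/value pairs.
var entry = {
  catalog: 'example-catalog',   // which catalog this entry belongs to
  external_id: 'ABC123',        // the ID in the source catalog
  name: 'Jane Doe',
  description: 'linguist and lexicographer',
  type: 'Q5',                   // suggested "instance of", if known
  statements: [
    // Wikidata property, value, and an optional source reference
    { property: 'P569', value: '+1901-02-03T00:00:00Z/11', source: 'https://example.org/ABC123' },
    { property: 'P27',  value: 'Q145' }  // country of citizenship: United Kingdom
  ]
};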

I would very much like to hear what you think about this approach, and this implementation. I would like to go ahead with it, unless there are principal concerns. V1 and V2 would run in parallel, at least for the time being. Once V2 has more functionality, I would import new catalogs into V2 rather than V1. Suggestions for test catalogs (maybe something with interesting metadata) are most welcome. And every bit of technical advice, or better hands-on help, would be greatly appreciated. And if the WMF or WMDE want to join in, or take over, let’s talk!

The blind referee

A quick blog post, before the WordPress editor dies on me again… Wikidata is great. Wikidata with references is even better. So I have written a little tool called Referee. It checks a Wikidata item, collects web pages that are linked via external ID statements, and via associated Wikipedia pages, and checks them for potential matches of statements (birth dates, locations, etc.). If you add a little invocation to your common.js page on Wikidata:
importScript( 'User:Magnus_Manske/referee.js' );
it will automatically check on every Wikidata item load if there are potential references for that item in the cache, and display them with the appropriate statement, with one-click add/reject links. If you put this line:
referee_mode = 'manual' ;
before the importScript invocation, it will not check automatically, but wait for you to click the “Referee” link in the toolbox sidebar. However, in manual mode, it will force a check (might take a few seconds) in case there are no reference candidates; the toolbar link will remain highlighted while the check is running. I made a brief video demonstrating the interface. Enjoy. Addendum: Forgot to mention that the tool does not work on certain instances (P31) of items, namely “taxon” and “scholarly article”. This is to keep the load on the scraper low, plus these types are special cases and would likely profit more from dedicated scraper systems.

Wikipedia, Wikidata, and citations

As part of an exploratory census of citations on Wikipedia, I have generated a complete (yeah, right) list of all scientific publications cited on Wikispecies, English and German Wikipedia. This is based on the rendered HTML of the respective articles, in which I try to find DOIs, PubMed, and PubMed Central IDs. The list is kept up to date (with only a few minutes lag). I also continuously match the publications I find to Wikidata, and create the missing items, most cited ones first.
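The scanning itself is conceptually simple; a simplified sketch (not the actual harvester code) of pulling identifiers out of rendered HTML might look like this:

// Sketch: extract DOIs, PubMed IDs, and PubMed Central IDs from rendered HTML.
function extractIds(html) {
  var dois  = html.match(/\b10\.\d{4,9}\/[^\s"'<>]+/g) || []; // DOIs
  var pmids = [], pmcs = [], m;
  var pmidRe = /pubmed\/(\d+)/g;                              // .../pubmed/12345678
  var pmcRe  = /\bPMC(\d+)\b/g;                               // PMC1234567
  while ((m = pmidRe.exec(html)) !== null) pmids.push(m[1]);
  while ((m = pmcRe.exec(html)) !== null) pmcs.push(m[1]);
  return { dois: dois, pmids: pmids, pmcs: pmcs };
}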

A bit about the dataset (“citation” here means that an article mentions/links to a publication ID) at this point in time:

  • 476,560 distinct publications, of which
    • 261,486 have a Wikidata item
    • 214,425 have no Wikidata match
    • 649 cannot be found or created as a Wikidata item (parsing error, or DOI does not exist)
  • 1,968,852 articles tracked across three Wikimedia projects (some citing publications)
  • 717,071 total citations (~1.5 citations per publication)
  • The most cited publication is Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences (used 3,403 times)
  • Publications with a Wikidata item are cited 472,793 times, those without 244,191 times
  • 266 publications are cited from all three Wikimedia sites (263 have a Wikidata item)

There is no interface for this project yet. If you have a Toolforge (formerly known as Labs) account, you can look at the database as s52680__science_source_p.

Judgement Day

Wikipedia label by Gmhofmann on Commons

At the dawn of Wikidata, I wrote a tool called “Terminator”. Not just because I wanted to have one of my own, but as a pun on the term “term”, used in the database table name (“wb_term”) where Wikidata labels, descriptions, and aliases are stored. The purpose of the tool is to find important (by some definition) Wikidata items that lack a label in a specific language. This can be very powerful, especially in languages with low Wikidata participation; setting the label for teacher (Q37226) in a language will immediately allow all Wikidata items using that item (as an occupation, perhaps) to show that label. A single edit can improve hundreds or thousands of items, and make them more accessible in that language.
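As a hedged illustration of the underlying idea, this is roughly the question the tool answers, phrased as SPARQL: humans with many sitelinks but no Welsh label. On full Wikidata such a query times out easily, which is exactly why the tool keeps its own precomputed database instead:

// Sketch only: well-linked humans that lack a label in Welsh ("cy").
var sparql =
  'SELECT ?item ?sitelinks WHERE {\n' +
  '  ?item wdt:P31 wd:Q5 ;\n' +
  '        wikibase:sitelinks ?sitelinks .\n' +
  '  FILTER( ?sitelinks > 50 )\n' +
  '  FILTER NOT EXISTS { ?item rdfs:label ?l . FILTER( LANG(?l) = "cy" ) }\n' +
  '}\n' +
  'ORDER BY DESC(?sitelinks)\n' +
  'LIMIT 100';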

Well, Wikidata has grown a lot since I started that tool, and the Terminator didn’t cope well with the growth; it was limited to a handful of languages, and the daily update was compute intensive. Plus, the interface was slow and ugly. Time for a rewrite!

So without further ado, I present version 2 of the Terminator tool. Highlights:

  • Now covers all Wikidata languages
  • Get the top items with missing labels, descriptions, or Wikipedia articles
  • Sort items by total number of claims, external IDs, sitelinks, or a compound score
  • The database currently contains the top (by compound score) ~4.1 million items on Wikidata
  • Updated every 10 minutes
  • Search for missing labels in multiple languages (e.g. German, Italian, or Welsh)
  • Only show items that have labels in languages you know
  • Automatically hides “untranslatable” items (scientific articles, humans, Wikipedia-related pages such as templates and categories), unless you want those as well
  • Can use a SPARQL query to filter items (only shows items that match all the above, plus are in the SPARQL result, for results with <10K items or so)
  • Game mode (single, unsorted random result, more details, re-flows on mobile)

Please let me know through the usual channels about bugs and feature requests. I have dropped some functionality from the old version, such as data download; but that version is still linked from the new main page. Enjoy!

More topics

After my recent blog post about the TopicMatcher tool, I had quite a few conversations about the general area of “main topic”, especially relating to the plethora of scientific publications represented on Wikidata. Here’s a round-up of related things I did since:

As a first attempt, I queried all subspecies items from Wikidata, searched for scientific publications, and added them to TopicMatcher.

That worked reasonably well, but didn’t yield a lot of results, and they need to be human-confirmed. So I came at the problem the other way: Start with a scientific publication, try to find a taxon (species etc.) name, and then add the “main subject” match. Luckily, many such publications put taxon names in () in the title. Once I have the text in between, I can query P225 for an exact match (excluding cases where there is more than one!), and then add “main subject” directly to the paper item, without having to confirm it by a user. I am aware that this will cause a few wrong matches, but I imagine those are few and far between, can be easily corrected when found, and are dwarfed by the usefulness of having publications annotated this way.
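A simplified sketch of that matching step (the real job does more sanity checks, and the example title is invented):

// Sketch: take the text in (trailing) parentheses from a title, look for exactly
// one taxon with that name (P225), and only then add "main subject" (P921).
function taxonCandidate(title) {
  var m = title.match(/\(([^()]+)\)\s*$/);
  return m ? m[1].trim() : null;
}
var name = taxonCandidate('A new frog species from Madagascar (Boophis)'); // hypothetical title
var sparql = 'SELECT ?taxon WHERE { ?taxon wdt:P225 "' + name + '" } LIMIT 2';
// Exactly one result: add "main subject" (P921) to the paper item;
// zero or more than one: skip, and record the title for later analysis.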

There are millions of publications to check, so this is running on a cronjob, slowly going through all the scientific publications on Wikidata. I find quite a few topics in () that are not taxa, or have some issue with the taxon name; I am recording those, to run some analysis (and maybe other, advanced auto-matching) at a later date. So far, I see mostly disease names, which seem to be precise enough to match, in many cases.

Someone suggested using Mix’n’match sets to find e.g. chemical substances in titles that way, but this requires both “common name” and ID to be present in the title for a sufficient degree of reliability, which is rarely the case. Some edits have been made for E numbers, though. I have since started a similar mechanism running directly off Wikidata (initial results).

Then, I discovered some special cases of publications that lend themselves to automated matching, especially obituaries, as they often contain name, birth, and death date of a person, which is precise enough to automatically set the “main subject” property. For cases where there is no match found, I add them to TopicMatcher, for manual resolution.
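Again as a hedged sketch of the idea (not the production code): a name plus birth and death dates is usually specific enough to identify a single person item, using P569 (date of birth) and P570 (date of death):

// Sketch: match an obituary's parsed name and dates against person items.
var person = { name: 'John Smith', born: '1923-04-05', died: '2001-06-07' }; // hypothetical parse
var sparql =
  'SELECT ?item WHERE {\n' +
  '  ?item wdt:P31 wd:Q5 ;\n' +
  '        rdfs:label "' + person.name + '"@en ;\n' +
  '        wdt:P569 ?b ;\n' +
  '        wdt:P570 ?d .\n' +
  '  FILTER( STRSTARTS(STR(?b), "' + person.born + '") && STRSTARTS(STR(?d), "' + person.died + '") )\n' +
  '} LIMIT 2';
// Exactly one result: set "main subject" automatically; none: hand it to TopicMatcher.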

I have also added “instance of:erratum” to ~8,000 papers whose titles indicate this. This might be better placed in “genre”, but at least we have a better handle on those now.

Both errata and obituaries will run regularly, to update any new publications accordingly.

As always, I am happy to get more ideas to deal with this vast realm of publications versus topics.

On Topic

Wikidata already contains a lot of information about topics – people, places, concepts etc. It also contains items about works that are about such topics, e.g. a painting of a person, a biographical article about someone, a scientific publication about a species. Ideally, Wikidata also describes the connection between the work and the subject. Such connections can be tremendously useful in many contexts, including GLAM and scientific research.

This kind of work can generally not be done by bots, as that would require machine-readable, reliable source data to begin with. However, manually finding items about works, and then finding the matching subject item on Wikidata, is somewhat tedious. Thus, I give you TopicMatcher!

In a nutshell, I prepare a list of Wikidata items that are about creative works – paintings, biographical articles, books, scientific publications. Then, I try to guesstimate what Wikidata item they are (mainly) about. Finally, you can get one of these “work items” and their potential subject, with buttons to connect them. At the moment, I have biographical articles looking for a “main subject”, and paintings lacking a “depicts” statement. That comes to a total of 13,531 “work items”, with 54,690 potential matches.

You will get the expected information about the work item and the potential matches, using my trusty AutoDesc. You also get a preview of the painting (if there is an image) and a search function. Below that is a page preview that differs with context; depending on the work item, you could get

  • a WikiSource page, for biographical articles there
  • a GLAM page, if the item has a statement with an external reference that can be used to construct a URL (see the sketch after this list)
  • a publication page, using PMC, DOI, or PubMed IDs
  • the Wikidata page of the item, if nothing else works
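The GLAM page case works roughly like this (a sketch; the technique is simply to plug the external ID value into the property’s “formatter URL” (P1630), with invented example values):

// Sketch: build a GLAM page URL from an external-ID value and a formatter URL (P1630).
function glamUrl(formatterUrl, idValue) {
  return formatterUrl.replace('$1', idValue);
}
// Hypothetical example: external ID "ABC-123" with a formatter URL taken from P1630.
console.log(glamUrl('https://www.example-museum.org/collection/$1', 'ABC-123'));
// https://www.example-museum.org/collection/ABC-123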

The idea of the page preview is to find more information about the work, which will allow you to determine the correct match. If there are no suggested subjects in the database, a search is performed automatically, in case new items have been created since the last update.

Once you are done with the item, you can click “Done” (which marks the work item as finished, so it is not shown again), or “Skip”, to keep the item in the pool. Either way, you will get another random item; the reward for good work is more work….

At the top of the page are some filtering options, if you prefer to work on a specific subset of work items. The options are a bit limited for now, but should improve when the database grows to encompass new types of works and subjects.

Alternatively, you can also look for potential works that cover a specific subject. George Washington is quite popular.

I have chosen the current candidates because they are computationally cheap and reasonably accurate to generate. However, I hope to expand to more work and subject areas over time. Scientific articles that describe species come to mind, but the queries to generate candidate matches are quite slow.

If you have ideas for queries, or just work/subject areas, or even some candidate lists, I would be happy to incorporate those into the tool!

The Quickening

My QuickStatements tool has been quite popular, in both version 1 and 2. It appears to be one of the major vectors of adding large amounts of prepared information to Wikidata. All good and well, but, as with all well-used tools, some wrinkles appear over time. So, time for a do-over! It has been a long time coming, and while most of it concerns the interface and interaction with the rest of the world, the back-end learned a few new tricks too.

For the impatient, the new interface is the new default at QuickStatements V2. There is also a link to the old interface, for now. Old links with parameters should still work, let me know if not.

What has changed? Quite a lot, actually:

  • Creating a new batch in the interface now allows you to choose a site. Right now, only Wikidata is on offer, but Commons (and others?) will join it in the near future. Cross-site configuration has already been tested with FactGrid
  • You can now load an existing batch and see the commands
  • If a batch that has (partially) run threw errors, some of them can be reset and re-run. This works for most statements, except the ones that use LAST as the item, after item creation
  • You can also filter for “just errors” and “just commands that didn’t run yet”
  • The above limitation is mostly of historic interest, as QuickStatements will now automatically group a CREATE command and the following LAST item commands into a single, more complex command. So no matter how many statements, labels, sitelinks etc. you add to a new item, it will just run as a single command, which means item creation just got a lot faster, and it goes easier on the Wikidata database/API too (a small example batch follows after this list)
  • The STOP button on both the batch list and the individual batch page should work properly now (let me know if not)! There are instructions for admins to stop a batch
  • Each server-side (“background”) batch now gets a link to the EditGroups tool, which lets you discuss an entire batch, gives information about the batch details, and most importantly, lets you undo an entire batch at the click of a button
  • A batch run directly from the browser now gets a special, temporary ID as well, that is added to the edit summary. Thanks to the quick work of the EditGroups maintainers, even browser-based batch runs are now one-click undo-able
  • The numbers for a running batch are now updated automatically every 5 seconds, on both the batch list and the individual batch page
  • The MERGE command, limited to QuickStatements V1 until now, now works in V2 as well. Also, it will automatically merge into the “lower Q number” (=older item), no matter the order of the parameters
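To make the CREATE/LAST grouping concrete, here is a minimal batch in QuickStatements V1 format (tab-separated; the values are invented). The CREATE line and all the LAST lines that follow it are what now gets executed as a single command:

// A minimal, hypothetical QuickStatements V1 batch creating one item.
var batch = [
  'CREATE',
  'LAST\tLen\t"Jane Doe"',               // English label
  'LAST\tDen\t"fictional linguist"',     // English description
  'LAST\tP31\tQ5',                       // instance of: human
  'LAST\tP569\t+1901-02-03T00:00:00Z/11' // date of birth, day precision
].join('\n');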

I have changed my token by now…

For the technically inclined:

  • I rewrote the interface using vue.js, re-using quite a few pre-existing components from other projects
  • You can now get a token to use with your user name, to programmatically submit batches to QuickStatements. They will show up and be processed as if you had pasted them into the interface yourself. The token can be seen on your new user page
  • You can also open a pre-filled interface, using GET or POST, like so

One thing, however, got lost from the previous interface, and that is the ability to edit commands directly in the interface. I do not know how often that was used in practice, but I suspect it was not often, as it is not really suited for a mass-edit tool. If there is huge demand, I can look into retro-fitting that. Later.

 

Why I didn’t fix your bug

Many of you have left me bug reports, feature requests, and other issues relating to my tools in the WikiVerse. You have contacted me through the BitBucket issue tracker (and apparently I’m on phabricator as well now), Twitter, various emails, talk pages (my own, other users, content talk pages, wikitech, meta etc.), messaging apps, and in person.

And I haven’t done anything. I haven’t even replied. No indication that I saw the issue.

Frustrating, I know. You just want that tiny thing fixed. At least you believe it’s a tiny change.

Now, let’s have a look at the resources available, which, in this case, is my time. Starting with the big stuff (general estimates, MMMV [my mileage may vary]):

24h per day
-9h work (including drive)
-7h sleep (I wish)
-2h private (eat, exercise, shower, read, girlfriend, etc.)
=6h left

Can’t argue with that, right? Now, 6h left is a high estimate, obviously; work and private can (and do) expand on a daily, fluctuating basis, as they do for all of us.

So then I can fix your stuff, right? Let’s see:

6h
-1h maintenance (tool restarts, GLAM pageview updates, mix'n'match catalogs add/fix, etc.)
-3h development/rewrite (because that's where tools come from)
=2h left

Two hours per day is a lot, right? In reality, it’s a lot less, but let’s stick with it for now. A few of my tools have no issues, but many of them have several open, so let’s assume each tool has one:

2h=120min
/130 tools (low estimate)
=55 sec/tool

That’s enough time to find and skim the issue, open the source code file(s), and … oh time’s up! Sorry, next issue!

So instead of dealing with all of them, I deal with one of them. Until it’s fixed, or I give up. Either may take minutes, hours, or days. And during that time, I am not looking at the hundreds of other issues. Because I can’t do anything about them at the time.

So how do I pick an issue to work on? It’s an intricate heuristic computed from the following factors:

  • Number of users affected
  • Severity (“security issue” vs. “wrong spelling”)
  • Opportunity (meaning, I noticed it when it got filed)
  • Availability (am I focused on doing something else when I notice the issue?)
  • Fun factor and current mood (yes, I am a volunteer. Deal with it.)

No single event prompted this blog post. I’ll keep it around to point to, when the occasion arises.