
The Depicts

So Structured Data on Commons (SDC) has been going for a while. Time to reap some benefits!

Besides free-text image descriptions, the first, and likely most used, element one can add to a picture via SDC is “depicts”. This can be one or several Wikidata items which are visible (prominently or as background) on the image. Many people have done so, manually or via JavaScript- or Toolforge-based mass editing tools.

This is all well and good, but what to do with that data? It can be searched for, if you know the magic incantation for the search engine, but that’s pretty much it for now. A SPARQL query engine would be insanely useful for more complex queries, especially if it worked seamlessly with the Wikidata one, but no usable, up-to-date one is in sight so far.

Inspired by a tweet by Hay, and with some help from Maarten Dammers, I found a way to use SDC “depicts” information in my File Candidates tool. It suggests files that might be useful to add to specific Wikidata items.

Now, since proper SDC support is … let’s say incomplete at the moment, I had to go a bit off the beaten path. First, I use the “random” sort in the Commons API search to find files with a “depicts” statement. That way, I get 50 such files with one query. Then, I use the wikibase API on Commons to get the structured data for these files. The structured data tells me which Wikidata item(s) each file depicts.
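For the curious, here is a minimal sketch of those two steps, using plain reqwest and serde_json rather than any of my actual tool code; parameter names and the response layout are written from memory, so treat it as an illustration rather than gospel:

```rust
// Assumed dependencies: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // 1. Ask the Commons search API for 50 random files that have a "depicts" (P180) statement.
    let search: Value = client
        .get("https://commons.wikimedia.org/w/api.php")
        .query(&[
            ("action", "query"),
            ("list", "search"),
            ("srsearch", "haswbstatement:P180"),
            ("srnamespace", "6"),
            ("srlimit", "50"),
            ("srsort", "random"),
            ("format", "json"),
        ])
        .send()?
        .json()?;

    // 2. For each file, fetch its MediaInfo entity ("M" + page ID) and read its P180 statements.
    let empty = vec![];
    for hit in search["query"]["search"].as_array().unwrap_or(&empty) {
        let mediainfo_id = format!("M{}", hit["pageid"]);
        let entities: Value = client
            .get("https://commons.wikimedia.org/w/api.php")
            .query(&[
                ("action", "wbgetentities"),
                ("ids", mediainfo_id.as_str()),
                ("format", "json"),
            ])
            .send()?
            .json()?;
        // Each statement's main snak value holds the depicted Wikidata item, e.g. "Q42".
        if let Some(depicts) = entities["entities"][mediainfo_id.as_str()]["statements"]["P180"].as_array() {
            for statement in depicts {
                println!("{}\tdepicts\t{}", hit["title"], statement["mainsnak"]["datavalue"]["value"]["id"]);
            }
        }
    }
    Ok(())
}
```

The item IDs printed at the end are what feeds the next step below.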

Armed with these Wikidata item IDs, I use the database replicas on Toolforge to retrieve the subset of items that (a) have no image (P18), (b) have P31 “instance of”, (c) have no P279 “subclass of”, and (d) do not link to any of a number of “unsuitable” items (e.g. templates or given names). For that subset, I get the files the items already use, e.g. as a logo image (so as not to suggest those again), and then I add an entry to the database saying “this item might use this image”, according to the depicts statements in the respective image (the code is here, in case you are interested).

50 files (a restriction imposed by the Commons API) are not much, especially since many images with depicts statements are probably already used as an image on the respective Wikidata item. So I keep running such random requests in the background and collect the results for the File Candidates tool. At the time of writing, over 12k such candidates exist.

Happy image matching, and don’t forget to check out the other candidate image groups in the tool (including potentially useful free images from Flickr!).

Eval, not evil

My Mix’n’match tool deals with third-party catalogs and helps match their entries to Wikidata. This involves, as a necessity, importing minimal information about those entries into the Mix’n’match database, but ideally also additional metadata, such as (for biographical entries) gender, birth/death dates, VIAF etc., which is invaluable for automatically matching entries to Wikidata items, and thus greatly reduces volunteer workload.

However, virtually none of the (currently) ~2600 catalogs in Mix’n’match offers a standardized format to retrieve either basic or meta-data. Some catalogs are imported by volunteers from tab-delimited files, but most are “scraped”, that is, automatically read and parsed from the source website.

Some source websites are set up in a way that allows a standardized scraping tool to run there, and I offer a web form to create new scrapers; over 1400 of these scrapers have run successfully, and ~750 of them can automatically run again on a regular basis.

But the autoscraper does not handle metadata, such as birth/death dates, and many catalogs need bespoke import code even for the basic information. Until recently, I had hundreds of scripts, some of them consisting of thousands of lines of code, running data retrieval and parsing:

  • Basic information (ID, name, URL) retrieval from the source site
  • Creating or amending entry descriptions from the source site
  • Importing auxiliary data (other IDs such as VIAF, coordinates, etc.) from the source site
  • Extraction of birth/death dates from descriptions, taking care not to use estimates, “floruit” etc.
  • Extraction of auxiliary data from descriptions
  • Linking of two related catalogs (e.g. one for painters, one for paintings) to improve matching (e.g. restricting artwork matches on Wikidata to that artist)

and many others.

Over time, all this has become unwieldy, unstructured, and repetitive; more than once I have written a bespoke scraper only to find that I already had one somewhere else.

So I set out to radically redesign all these processes. My approach: since only a small piece of code performs the actual scraping/parsing logic in each case, these code fragments are now stored in the Mix’n’match database, associated with the respective catalog. I imported many code fragments from the “old” scripts into this table. I also wrote function-specific wrapper code that can load and execute a code fragment (via the eval function, which is often considered “evil”, hence the title of this blog post) on its associated catalog. An example of such code fragments for a catalog can be seen here.

I can now use that web interface to retrieve, create, test, and save code, without having to touch the command line at all.

In an ideal world, I would let everyone add and edit code here; however, since the framework executes PHP code, this would open the way for all kinds of malicious attacks. I cannot think of a way to safeguard against (deliberate or accidental) destructive code, though I have put some mitigations in place in case I make a mistake myself. So, for now, you can look, but you can’t touch. If you want to contribute code (new or patches), please give it to me, and I’ll be happy to add it!

This code migration is just in its infancy; so far, I support four functions, with a total of 591 code fragments. Many more to come, over time.

A Scanner Rusty

One of my most-used WikiVerse tools is PetScan. It is a complete re-write of several other PHP-based tools, in C++ for performance reasons. PetScan has turned into the Swiss Army Knife of doing things with Wikipedia, Wikidata, and other projects.

But PetScan has also developed a few issues over time. It is suffering from the per-tool database connection limit of 10, enforced by the WMF. It also has some strange bugs, one of them creating weirdly named files on disk, which generally does not inspire confidence. Finally, from a development/support POV, it is the “odd man out”, as none of my other WikiVerse tools are written in C++.

So I went ahead and re-wrote PetScan, this time in Rust. If you read this blog, you’ll know that Rust is my recent go-to language. It is fast, safe, and comes with a nice collection of community-maintained libraries (called “crates”). The new PetScan:

  • uses MediaWiki and Wikibase crates, which simplifies coding considerably
  • automatically chunks database queries (see the sketch after this list), which should improve reliability, and could be developed into multi-threaded queries
  • pools replica database access from several of my other HTML/JS-only tools (which do not use the database, but still get allocated connections)
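The chunking itself is nothing magical; here is a minimal sketch of the idea, with a hypothetical run_query placeholder standing in for PetScan’s actual replica code:

```rust
/// Run a potentially huge list of page IDs against the replicas in manageable chunks.
/// `run_query` is a hypothetical placeholder, not PetScan's actual database code.
fn query_in_chunks(page_ids: &[u64], chunk_size: usize) -> Vec<String> {
    let mut results = Vec::new();
    for chunk in page_ids.chunks(chunk_size) {
        // Build something like "... WHERE page_id IN (1,2,3)" for this chunk only.
        let id_list: Vec<String> = chunk.iter().map(|id| id.to_string()).collect();
        let sql = format!("SELECT page_title FROM page WHERE page_id IN ({})", id_list.join(","));
        results.extend(run_query(&sql));
    }
    results
}

// Placeholder so the sketch is self-contained; the real code talks to the Toolforge replicas.
fn run_query(_sql: &str) -> Vec<String> {
    Vec::new()
}

fn main() {
    let page_ids: Vec<u64> = (1..=2500).collect();
    let titles = query_in_chunks(&page_ids, 500); // five small queries instead of one giant one
    println!("{} titles", titles.len());
}
```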

Long story short, I want to replace the C++ version with the Rust version. Most of the implementation is done, but I can’t think of all possible corner cases myself. So I ask interested community members to give the test instance of PetScan V2 a whirl. I re-used the web interface from PetScan V1, so it should look very familiar. If your query works, it should do so much more reliably than in V1. The code is on GitHub, and so is the new issue tracker, where you can file bug reports and feature requests.

Batches of Rust

QuickStatements is a workhorse for Wikidata, but it has had a few problems of late.

One of those is bad performance with batches. Users can submit a batch of commands to the tool, and these commands are then run on the Labs server. This mechanism has been bogged down for several reasons:

  • Batch processing written in PHP
  • Each batch running in a separate process
  • Limitation of 10 database connections per tool (web interface, batch processes, testing etc. together) on Labs
  • Limitation of (16? observed but not validated) simultaneous processes per tool on Labs cloud
  • No good way to auto-start a batch process when it is submitted (currently, a PHP process auto-starts every 5 minutes, and exits if there is nothing to do)
  • A large backlog developing

Amid a continued bombardment on wiki talk pages, Twitter, Telegram etc. that “my batch is not running (fast enough)”, I set out to mitigate the issue. My approach is to run all the batches in a new processing engine, written in Rust (a rough sketch of the processing loop follows the list below). This has several advantages:

  • Faster and easier on the resources than PHP
  • A single process running on Labs cloud
  • Each batch is a thread within that process
  • Checking for a batch to start every second (if you submit a new batch, it should start almost immediately)
  • Use of a database connection pool (the individual thread might have to wait a few milliseconds to get a connection, but the system never runs out)
  • Limiting simultaneous batch processing for batches from the same user (currently: 2 batches max) to avoid the MediaWiki API “you-edit-too-fast” error
  • Automatic handling of maxlag, bot/OAuth login etc. by using my mediawiki crate
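For the curious, the processing loop boils down to something like the following sketch; the batch bookkeeping functions here are hypothetical placeholders, not the actual QuickStatements code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

const MAX_BATCHES_PER_USER: usize = 2;

// Hypothetical stand-ins for the real batch bookkeeping and command processing.
struct Batch { id: u64, user: String }
fn next_open_batches() -> Vec<Batch> { Vec::new() }
fn process_batch(batch: &Batch) {
    println!("running batch {} for {}", batch.id, batch.user);
    // ...run the commands, grabbing a connection from the pool as needed...
}

fn main() {
    // How many batches each user currently has running, shared between threads.
    let running: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));
    loop {
        for batch in next_open_batches() {
            {
                // Enforce the per-user limit to avoid "editing too fast" API errors.
                let mut counts = running.lock().unwrap();
                let count = counts.entry(batch.user.clone()).or_insert(0);
                if *count >= MAX_BATCHES_PER_USER {
                    continue;
                }
                *count += 1;
            }
            // Each batch becomes a thread within this single process.
            let running = Arc::clone(&running);
            thread::spawn(move || {
                process_batch(&batch);
                *running.lock().unwrap().entry(batch.user.clone()).or_insert(1) -= 1;
            });
        }
        // Check for newly submitted batches every second.
        thread::sleep(Duration::from_secs(1));
    }
}
```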

This is now running on Labs, processing all (~40 at the moment) open batches simultaneously. Grafana shows the spikes in edits, but no increased lag so far. The process is given 4GB of RAM, but could probably do with a lot less (for comparison, each individual PHP process used 2GB).

A few caveats:

  • This is a “first attempt”. It might break in new, fun, unpredicted ways
  • It will currently not process batches that deal with Lexemes. This is mostly a limitation of the wikibase crate I use, and will likely be solved soon. In the meantime, please run Lexeme batches only within the browser!
  • I am aware that I now have code duplication (the PHP and the Rust processing). For me, the solution will be to implement QuickStatements command parsing in Rust as well, and replace PHP completely. I am aware that this will impact third-party use of QuickStatements (e.g. the Wikibase docker container), but the PHP and Rust sources are independent, so there will be no breakage; of course, the Rust code will likely evolve away from PHP in the long run, possibly causing incompatibilities

So far, it seems to be running fine. Please let me know if you encounter any issues (unusual errors in your batch, weird edits etc.)!

Bad credentials

So there has been an issue with QuickStatements on Friday.

As users of that tool will know, you can run QuickStatements either from within your browser, or “in the background” from a Labs server. Originally, these “batch edits” were performed as QuickStatementsBot, mentioning the batch and the user who submitted it in the edit summary. Later, through a pull request, QuickStatements gained the ability to run batch edits as the user who submitted the batch. This is done by storing the OAuth information of the user, and playing it back to the Wikidata API for the edits. So far so good.

However, as with many of my databases on Labs, I had made the QuickStatements database open for “public reading”, that is, any Labs tool account could see its contents. Including the OAuth login credentials. Thus, since the introduction of the “batch edit as user” feature, up until last Friday, anyone with a login on Labs could, theoretically, have performed edits as anyone who had submitted a QuickStatements batch, by copying the OAuth credentials.

We (WMF staff and volunteers, including myself) are not aware that any such user account spoofing has taken place (security issue). If you suspect that this has happened, please contact WMF staff or myself.

Once the issue was reported, the following actions were taken:

  • deactivation of the OAuth credentials of QuickStatements, so no more edits via spoofed user OAuth information could take place
  • removal of the “publicly” (Labs-internally) visible OAuth information from the database
  • deactivation of the QuickStatements bots and web interface

Once spoofed edits were no longer possible, I went ahead and moved the OAuth storage to a new database that only the QuickStatements “tool user” (the instance of the tool that is running, and myself) can see. I then got a new OAuth consumer for QuickStatements, and restarted the tool. You can now use QuickStatements as before. Your OAuth information will be secure now. Because of the new OAuth consumer for QuickStatements, you will have to log in again once.

This also means that all the OAuth information stored prior to Friday is no longer usable, and has been deleted. Batches submitted before Friday will therefore fall back on the aforementioned QuickStatementsBot, and no longer edit under your user account. If it is very important to you that those edits appear under your user account, please let me know. All new batches will run edits under your user account, as before.

My apologies for this incident. Luckily, there appears to be no actual damage done.

The Corfu Projector

I recently spent a week on Corfu. I was amazed by the history, the culture, the traditions, and, of course, the food. I was, however, appalled by the low coverage of Corfu localities on Wikidata. While I might be biased, living in the UK where every postbox is a historic monument, a dozen or so items for Corfu’s capital still seemed a bit thin. So I thought a bit about how to change this. Not just for Corfu, but in general for a geographic entity. I know I can’t do this all by myself, even for “just” Corfu, but I can make it easier and more appealing for anyone who might have an interest in helping out.

An important first step is to get an idea of what items there are. For a locality, this can be done in two ways: by coordinates, or by administrative unit. Ideally, relevant items should have both, which makes a neat first to-do list (items with coordinates but without administrative unit, and vice versa). Missing images are also a low-hanging fruit, to use the forbidden management term. But there are more items that relate to a locality, besides geographical objects. People are born, live, work, and die there. A region has typical, local food, dresses, traditions, and events. Works are written about it, paintings are painted, and songs are sung. Plants and animals can be specific to the place. The list goes on.

All this does not sound like just a list of buildings; it sounds like its own project. A WikiProject. And so I wrote a tool specifically to support WikiProjects on Wikidata that deal with a locality. I created a new one for Corfu, and added some configuration data as a sub-page. Based on that data, the Projector tool can generate a map and a series of lists with items related to the project (Corfu example).

The map shows all items with coordinates (duh), either via the “administrative unit” property tree, or via coordinates within pre-defined areas. You can draw new areas on the map, and then copy the area definition into the configuration page. Items without an administrative unit will have a thick border, and items without an image will be red.

There are also lists of items, including the locations from the map (plus the ones in the administrative unit, but without coordinates), people related to these locations, creative works (books, paintings etc.), and “things” (food, organizations, events, you name it). All of this is generated on page load via SPARQL. The lists can be filtered (e.g. only items without an image), and the items in the list can be opened in other tools, including WD-FIST to find images, Terminator to find missing labels, Recent Changes within these items over the last week, PetScan (all items or just the filtered ones), and Tabernacle. And, of course, WikiShootMe for the entire region. I probably forgot a few tools; suggestions are (as always) welcome.

Adopting this should be straightforward for other WikiProjects: just copy my Corfu example configuration (without the areas), adapt the “root regions”, and it should work (using “X” for “WikiProject X”). I am looking forward to growing this tool in functionality, and maybe extending it to other, non-location-based projects.

Papers on Rust

I have written about my attempt with Rust and MediaWiki before. This post is an update on my progress.

I started out writing a MediaWiki API crate to be able to talk to MediaWiki installations from Rust. I was then pointed to a wikibase crate by Tobias Schönberg and others, to which I subsequently contributed some code, including improvements to the existing codebase, but also an “entity cache” struct (to retrieve and manage larger amounts of entities from Wikibase/Wikidata), as well as an “entity diff” struct.

The latter is something I had started in PHP before, but never really finished. The idea is that, when creating or updating an entity, instead of painstakingly testing whether each statement/label/etc. exists, one simply creates a new, blank item, fills it with all the data that should be in there, and then generates a “diff” against a blank (for creating) or the existing (for updating) entity. That diff can then be passed to the wbeditentity API action. The diff generation can be fine-tuned, e.g. only add English labels, or add/update (but not remove) P31 statements.
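Reduced to labels only, and to plain serde_json instead of the wikibase crate’s own types, the principle looks roughly like this (function and field names are just for illustration):

```rust
use serde_json::{json, Map, Value};

/// Build a wbeditentity-style "labels" payload: every label the target entity
/// should have, but which the existing entity is missing or has differently.
fn diff_labels(existing: &Value, target: &Value) -> Value {
    let empty = Map::new();
    let mut labels = Map::new();
    for (lang, label) in target["labels"].as_object().unwrap_or(&empty) {
        if existing["labels"][lang.as_str()]["value"] != label["value"] {
            labels.insert(lang.clone(), label.clone());
        }
    }
    json!({ "labels": Value::Object(labels) })
}

fn main() {
    // The entity as it is on Wikidata, and the entity as we want it to be.
    let existing = json!({ "labels": { "en": { "language": "en", "value": "Douglas Adams" } } });
    let target = json!({ "labels": {
        "en": { "language": "en", "value": "Douglas Adams" },
        "de": { "language": "de", "value": "Douglas Adams" }
    }});
    // Only the German label ends up in the diff; that JSON is what would go to wbeditentity.
    println!("{}", diff_labels(&existing, &target));
}
```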

Armed with these two crates, I went to re-create a functionality that I had written in PHP before: creation and updating of items for scientific publications, mainly used in my SourceMD tool. The code underlying that tool has grown over the years, meaning it’s a mess, and has also developed some idiosyncrasies that lead to unfortunate edits.

For a rewrite in Rust, I also wanted to make the code more modular, especially regarding the data sources. I did find a crate to query CrossRef, but not much else. So I wrote new crates to query pubmed, ORCID, and Semantic Scholar. All these crates are completely independent of MediaWiki/Wikibase; they can be re-used in any kind of Rust code related to scientific publications. I consider them a sound investment into the Rust crate ecosystem.

With these crates in a basic but usable state, I went on to write papers, Rust code (not a crate just yet) that gathers data from the above sources and injects it into Wikidata. I wrote a Rust trait to represent a generic source, and then wrote adapter structs for each of the sources. Finally, I added some wrapper code that takes a list of adapters, queries them about a paper, and updates Wikidata accordingly. It can already

  • iteratively gather IDs (supply a PubMed ID, PubMed might get a DOI, which then can get you data from ORCID)
  • find item(s) for these IDs on Wikidata
  • gather information about the authors of the paper
  • find items for these authors on Wikidata
  • create new author items, or update existing ones, on Wikidata
  • create new paper items, or update existing ones, on Wikidata (no author statement updates yet)

The adapter trait is designed both to unify data across sources (e.g. using standardized author information) and to allow updating paper items with source-specific data (e.g. publication dates, MeSH terms). The system is open to adding more adapters for different sources. It is also flexible enough to extend to other, similar “publication types”, such as books, or maybe even artworks. My test example shows how easy it is to use in other code; indeed, I am already using it in Rust code I developed for my work (publication in progress).
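To give an idea of the shape of it, here is a deliberately simplified sketch; the trait and method names are made up for illustration and do not match the actual papers code:

```rust
/// A simplified "generic publication source"; each source (CrossRef, PubMed,
/// ORCID, Semantic Scholar...) would get its own adapter implementing it.
#[derive(Debug)]
enum PublicationId {
    Doi(String),
    PubmedId(String),
}

#[derive(Debug)]
struct Author {
    name: String,
    orcid: Option<String>,
}

trait PublicationSource {
    /// Human-readable name of the source, e.g. for edit summaries.
    fn name(&self) -> &str;
    /// Given one known ID, try to find further IDs (e.g. a PubMed ID yielding a DOI).
    fn related_ids(&self, id: &PublicationId) -> Vec<PublicationId>;
    /// Standardized author information for a publication.
    fn authors(&self, id: &PublicationId) -> Vec<Author>;
}

/// Example adapter; a real one would use the pubmed crate to query the API.
struct PubmedAdapter;

impl PublicationSource for PubmedAdapter {
    fn name(&self) -> &str { "PubMed" }
    fn related_ids(&self, _id: &PublicationId) -> Vec<PublicationId> { Vec::new() }
    fn authors(&self, _id: &PublicationId) -> Vec<Author> { Vec::new() }
}

fn main() {
    // The wrapper code takes a list of adapters and queries each of them about one paper.
    let sources: Vec<Box<dyn PublicationSource>> = vec![Box::new(PubmedAdapter)];
    let id = PublicationId::PubmedId("12345678".to_string());
    for source in &sources {
        println!("{}: {:?}", source.name(), source.related_ids(&id));
    }
}
```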

I see all of this as seeding the Rust crate ecosystem with easily reusable, MediaWiki-related code. I will add more such code in the future, and hope this will help the adoption of Rust in the MediaWiki programmer community. Watch this space.

Update: Now available as a crate!

Deleted gender wars

After reading the excellent analysis of AfD vs. gender by Andrew Gray, where he writes about the articles that faced and survived the “Articles for Deletion” (AfD) process, I couldn’t help but wonder what happened to the articles that were not kept, that is, where AfD was “successful”.

So I quickly took all article titles from English Wikipedia, article namespace, that were recorded as “deleted” in the logging table of the database replica (this may include cases where the article was deleted but later re-created under the same name), a total of 4,263,370 at the time of writing.

I filtered out some obvious non-name article titles, but which of the remaining ones were about humans? And which of those were about men, which about women? If there only were an easily accessible online database that has male and female first names… oh wait…

So I took male and female first names, as well as last (family) names, from Wikidata. For titles that end in a known family name, I grouped the title as male, female, or unknown, based on the first name.
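The classification itself is straightforward; roughly like this, with tiny hard-coded lists standing in for the name sets taken from Wikidata:

```rust
use std::collections::HashSet;

/// Classify an article title, based on its first name, for titles that end
/// in a known family name. Returns None if the title does not look like a person.
fn classify(
    title: &str,
    male: &HashSet<&str>,
    female: &HashSet<&str>,
    family: &HashSet<&str>,
) -> Option<&'static str> {
    let parts: Vec<&str> = title.split(' ').collect();
    if parts.len() < 2 || !family.contains(parts[parts.len() - 1]) {
        return None; // does not end in a known family name
    }
    Some(if male.contains(parts[0]) {
        "male"
    } else if female.contains(parts[0]) {
        "female"
    } else {
        "unknown"
    })
}

fn main() {
    // Tiny stand-ins for the first/family name lists from Wikidata.
    let male: HashSet<&str> = ["John", "Andrew"].into_iter().collect();
    let female: HashSet<&str> = ["Mary", "Ada"].into_iter().collect();
    let family: HashSet<&str> = ["Smith", "Lovelace"].into_iter().collect();
    for title in ["Ada Lovelace", "John Smith", "Alex Smith", "Some Band"] {
        println!("{}\t{:?}", title, classify(title, &male, &female, &family));
    }
}
```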

Gender    Articles    Share
Male       289,036      50%
Female      86,281      16%
Unknown    196,562      34%
Total      571,879

Of the (male+female) articles, 23% are about women, which almost exactly matches Andrew’s ratio of women in BLP (Biographies of Living People). That would indicate no significant gender bias in actually deleted articles.

Of course, this was just a quick calculation. The analysis code and the raw data, plus instructions how to re-create it, are available here, in case you want to play further.

Dealing with the Rust

Rust is an up-and-coming programming language, developed by the Mozilla Foundation, and is used in the Firefox rendering engine, as well as the Node Package Manager, amongst others. There is a lot to say about Rust; suffice it to say that it is designed to be memory-safe and fast (think: C or better), it can compile to WebAssembly, and it has been voted “most loved language” on StackOverflow in 2017 and 2018. As far as new-ish languages go, this is one to keep an eye on.

Rust comes with a fantastic package manager, Cargo, which, by default, uses crates (aka libraries) from crates.io, the central repository for Rust code. As part of my personal and professional attempt at grasping Rust (it has a bit of a learning curve), I wrote a tiny crate to access the API of MediaWiki installations. Right now, it can

  • run the usual API queries, and return the result as JSON (using the Rust serde_json crate)
  • optionally, continue queries for which there are more results available, and merge all the results into a single JSON result (see the sketch after this list)
  • log in your bot, and edit
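To illustrate what the “continue and merge” part does, here is roughly the same logic written by hand with reqwest and serde_json; the crate hides this behind a single call, and its actual API differs from this sketch:

```rust
// Assumed dependencies: reqwest (blocking + json features) and serde_json.
use std::collections::HashMap;
use serde_json::Value;

/// Run a MediaWiki API query and follow "continue" until all results are in,
/// merging the pages of each chunk into one list.
fn query_all(api_url: &str, mut params: HashMap<String, String>) -> Result<Vec<Value>, Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let mut pages = Vec::new();
    loop {
        let result: Value = client.get(api_url).query(&params).send()?.json()?;
        if let Some(chunk) = result["query"]["pages"].as_array() {
            pages.extend(chunk.iter().cloned());
        }
        // The API returns a "continue" object whose key/value pairs go into the next request.
        match result["continue"].as_object() {
            Some(cont) => {
                for (key, value) in cont {
                    let value = match value {
                        Value::String(s) => s.clone(),
                        other => other.to_string(),
                    };
                    params.insert(key.clone(), value);
                }
            }
            None => return Ok(pages),
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut params = HashMap::new();
    for (key, value) in [
        ("action", "query"), ("format", "json"), ("formatversion", "2"), ("continue", ""),
        ("generator", "categorymembers"), ("gcmtitle", "Category:Physics"), ("gcmlimit", "50"),
    ] {
        params.insert(key.to_string(), value.to_string());
    }
    let pages = query_all("https://en.wikipedia.org/w/api.php", params)?;
    println!("{} pages", pages.len());
    Ok(())
}
```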

This represents a bare-bones approach, but it would already be quite useful for simple command-line tools and bots. Because Rust compiles into stand-alone binaries, such tools are easy to run; no issues with Composer and PHP versions, node.js version/dependency hell, Python virtualenvs, etc.

The next functionality to implement might include OAuth logins, and a SPARQL query wrapper.

If you know Rust, and would like to play, the crate is called mediawiki, and the initial version is 0.1.0. The repository has two code examples to get you started. Ideas, improvements, and Pull Requests are always welcome!

Update: A basic SPARQL query functionality is now in the repo (not the crate – yet!)

Inventory

The Cleveland Museum of Art recently released 30,000 images of art under CC-Zero (~public domain). Some of the good people on Wikimedia Commons have begun uploading them there, to be used, amongst others, by Wikipedia and Wikidata.

But how to find the relevant Wikipedia article (if there is one) or Wikidata item for such a picture? One way would be via titles, names etc. But the filenames are currently something like Clevelandart_12345.jpg, and while the title is in the wiki text, it is neither easily found, nor overly precise. But, thankfully, the uploader also includes the accession number (aka inventory number). That one is, in conjunction with the museum, a lot more precise, and it can be found in the rendered HTML reliably (to a degree).
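As an aside, pulling such a number out of the rendered HTML essentially boils down to a regular expression; here is a minimal sketch, where the markup pattern is an assumption for illustration, not what my code actually matches:

```rust
use regex::Regex;

/// Try to pull an accession/inventory number out of a rendered file description page.
/// The HTML structure assumed here is purely illustrative.
fn extract_inventory_number(html: &str) -> Option<String> {
    // e.g. "<th>Accession number</th><td>1916.1034</td>"
    let re = Regex::new(r"Accession number\s*</t[dh]>\s*<td[^>]*>\s*([A-Za-z0-9./-]+)").ok()?;
    re.captures(html)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str().to_string())
}

fn main() {
    let html = r#"<tr><th>Accession number</th><td>1916.1034</td></tr>"#;
    println!("{:?}", extract_inventory_number(html)); // Some("1916.1034")
}
```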

But who wants to go through 30K of files, extract the inventory numbers, and check if they are on, say, Wikidata? And what about other museums/GLAM institutions? That’s too big a task to do manually. So, I wrote some code.

First, I need to get museums and their categories on Commons. But how to find them? Wikidata comes to the rescue, yielding (at the time of writing) 9,635 museums with a Commons category.
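A minimal sketch of that lookup, querying the Wikidata SPARQL endpoint directly with reqwest (the exact query my code uses may differ; Q33506 is “museum”, P373 is “Commons category”):

```rust
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // All items that are (subclasses of) museum and have a Commons category.
    let sparql = r#"
        SELECT ?museum ?commonsCategory WHERE {
            ?museum wdt:P31/wdt:P279* wd:Q33506 ;
                    wdt:P373 ?commonsCategory .
        }"#;
    let result: Value = reqwest::blocking::Client::new()
        .get("https://query.wikidata.org/sparql")
        .query(&[("query", sparql), ("format", "json")])
        .header(reqwest::header::USER_AGENT, "inventory-sketch/0.1 (example)")
        .send()?
        .json()?;
    let empty = vec![];
    for binding in result["results"]["bindings"].as_array().unwrap_or(&empty) {
        println!(
            "{}\t{}",
            binding["museum"]["value"],          // item URI, e.g. http://www.wikidata.org/entity/Q...
            binding["commonsCategory"]["value"]  // Commons category name
        );
    }
    Ok(())
}
```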

Then, I need to check each museum’s category tree for files, and try to extract the inventory number. Because this requires HTML rendering on tens (hundreds?) of thousands of files, I don’t want to repeat the exercise at a later date. So, I keep a log in a database (s51203__inventory_p, if you are on Labs). It logs

  • the files that were found with an inventory number (file, number, museum)
  • the number of files with an inventory number per museum

That way, I can skip museum categories that do not have inventory numbers in their file descriptions. If one of them is updated with such numbers, I can always remove the entry that says “no inventory numbers”, thus forcing a re-index.

My initial scan is now running, slowly filling the database. It will be run on a daily basis after that, just for categories that have inventory numbers, and new categories (as per Wikidata query above).

That’s the first part. The second part is to match the files against Wikidata items. I have that as well, to a degree; I am using the collection and inventory number properties to identify a work of art. Initial tests did generate some matches, for example, this file was automatically added to this item on Wikidata. This automatic matching will run daily as well.

A third part is possible, but not implemented yet. Assuming that a work of art has an image on Commons, as well as a museum and an inventory number, but no Wikidata item, such an item could be created automatically. This would pose little difficulty to code, but might generate duplicate items where an item already exists but does not have the inventory number set. Please let me know if this would be something worth doing. I’ll give you some numbers once the initial run is through.