Orthogonal Recent Changes

Recent Changes is a core functionality of all wikis. It shows which articles or, in the case of Wikidata, items have changed in the last minutes, hours, days. As useful as this is, for Wikidata it is like drinking from the proverbial fire hose; if you look for a specific type of change, it is a lot of data to process.

I found myself thinking that, for some of my tools like Mix’n’match, it would be useful to monitor Wikidata Recent Changes for edits of a specific property, one that is associated with a Mix’n’match catalog. If I could see that an item had, say, a statement with a specific property added, I could check the associated catalog and set that match there as well, to keep both systems in sync. Similarly, if an item had such a statement removed, the entry should be unlinked in the catalog as well.

A significant part of the community is involved in bringing more languages to Wikidata. For them, it would be good to monitor label, alias, and description changes in a specific language. Also, changes in sitelinks (eg adding a Wikipedia page in a specific language to a Wikidata item) are relevant here.

This view on Recent Changes is orthogonal to the standard one; it is primarily concerned with the type of change, rather than with the order of edits. Of course, the final result would be similar in nature, since it is still Recent Changes.

So without further ado, I present Wikidata Recent Changes (my apologies for the boring name). This is a simple front-end to the actual API. The data is updated in almost real time (<10 sec behind live Wikidata). You can query for either statement changes based on properties, or label/alias/description/sitelink changes. The default format is JSONL, where each row is one independent result in JSON format. That makes it both easier to generate the data (no intermediate storage required) and easier to read (line-by-line, no need to download and parse a giant JSON object). Traditional JSON and simple HTML are also available.
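
As a quick illustration of why JSONL is convenient to consume, here is a sketch in Python; the field names and values are made up for the example and are not the tool's actual schema:

```python
import json
from io import StringIO

# Stand-in for a JSONL response: each line is one self-contained change event.
# Field names here are illustrative; check the tool's output for the real schema.
jsonl = StringIO(
    '{"item": "Q42", "property": "P227", "type": "added", "revision": 1}\n'
    '{"item": "Q64", "property": "P227", "type": "removed", "revision": 2}\n'
)

# Process line by line: no need to hold a giant JSON array in memory.
events = [json.loads(line) for line in jsonl if line.strip()]
added = [e["item"] for e in events if e["type"] == "added"]
print(added)  # ['Q42']
```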

You can also specify any combination of added, changed, and removed, when it comes to event types; by default, all are returned.

To make processing faster, several consecutive edits may be grouped into one “event”; the revision number and timestamp returned just say “this was the case as of this revision”, not necessarily the exact revision of the change. It tells you that an item has changed in a way you requested, but it is up to you to make sense of that change, and to check the current status of the item. To save database storage, I also do not include the actual values of the labels/statements/etc.; again, you have to figure this out yourself.

I started the data collection yesterday, and I see ~1M rows/day added. If this gets too large, I will prune older data (after one month, say). But until then, please give this a whirl!

AutoDesc Reloaded

A long, long time ago, in a codebase far, far away, I wrote some code to generate automatic descriptions of Wikidata items for Reasonator. This turned out to be very useful, and I forked the code into its own tool/API, AutoDesc. Many of my tools, including the popular Mix-n-match, use AutoDesc under the hood.

As the original Reasonator code was in JavaScript, I decided to try a node.js implementation.

This worked reasonably well for a while, but bitrot and changes to the Toolforge ecosystem made it more unreliable over time, and the node.js version is quite deprecated now, even on Toolforge.

A few weeks ago, AutoDesc started to fail completely, that is, it could not be used any longer. Rather than patching the failing code to work with a new node.js version, I decided to stabilise it by converting it to Python. And by converting, I mean opening a text editor and running regular expression search/replace to change JavaScript into Python. (I did look around for code that can do that automatically, but without much joy.)

I am happy to report that, for some days now, the new Python code is running on Toolforge. It only does short descriptions for now, but at least you can get something from it. The code for the long descriptions is partially ready, but does not actually work yet. I will continue on this as I have time, and I would welcome any help in the form of pull requests.

The Buggregator

As you may know, I have a lot of tools for Wikipedia, Wikidata, Commons, etc. A lot of tools means a lot of code, and that means a lot of bugs, things that could work better, feature requests, and so on.

How do I learn about such issues as people encounter them? In a variety of ways. Most of my tools have git repositories, usually on GitHub and BitBucket, where issues and feature requests can be posted. But many people just leave comments on one of my many talk pages (de, en, commons, wikidata, meta, …), on the talk page of a tool (usually for the documentation page), and sometimes I get messages via email, Twitter, Telegram (the app), Signal, etc.

There is just no way for me to keep track of all of these. Some of the talk pages I visit rarely (eg meta), messages on Twitter are forgotten once they scroll out of sight, and so on. I needed a way to aggregate all those bug reports in one place.

Enter The Buggregator.

I started by creating a database of my tools. Yes, all of them. Including the JavaScript ones on the wikis. Including the old, obsolete, and unmaintained ones. Including the sub-tools within amalgamated Toolforge tools like “wikidata-todo”. Everything.

Then, I wrote import scripts: one for GitHub, one for BitBucket, one for talk pages. (Tweets can be added manually at the moment.) These run once a day, for everything I can think of. A heuristic tries to assign tools to (for example) new talk page topics.
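
For illustration, such a heuristic could be as simple as looking for a known tool name in the topic text; the tool list and normalization below are made up for the example, not Buggregator's actual code:

```python
# Hypothetical tool list; Buggregator's real database covers far more tools.
TOOLS = ["Mix'n'match", "PetScan", "Listeria", "QuickStatements"]

def normalize(text):
    # Ignore case, apostrophes, hyphens, and spaces, so eg "Mix n match" still matches.
    for ch in ("'", "\u2019", "-", " "):
        text = text.replace(ch, "")
    return text.lower()

def guess_tool(topic_text):
    haystack = normalize(topic_text)
    for tool in TOOLS:
        if normalize(tool) in haystack:
            return tool
    return None

print(guess_tool("ListeriaBot stopped updating my page"))  # Listeria
print(guess_tool("Unrelated question about infoboxes"))    # None
```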

I started out with 1795 issues, many of them “historical”, and am now down to 1516. Some way to go, but that has never discouraged me before.

I did this a few months ago and forgot to write about it, as I was busy with some personal issues (not tracked in Buggregator!). The reason I remembered is that Toolhub is now in production, and I think it’s great. But while it is using entries from Hay’s Directory (HD), it does not have all the tools I have in Buggregator.

So I made a script that takes the Buggregator tool list, checks it against the Toolhub list, and (once a day) creates an HD-style tool list, which I just added to the HD sources. That should add the tools to HD, which in turn should add them to Toolhub. There are ~200 of my tools not in HD and (probably) not in Toolhub, though this might create some duplicates. But better to have a tool listed twice (until that can be cleaned up) than not listed at all, IMHO.

If you find tools of mine I forgot (entirely possible), or see some ways to get that count of open issues down, please let me know!

The Listeria Evolution

My Listeria tool has been around for years now, and is used on over 72K pages across 80 wikis in the Wikimediaverse. And while it still works in principle, it has some issues, and, being a single PHP script, it is not exactly flexible enough to adapt to new requirements.

Long story short, I rewrote the thing in Rust. The PHP-based bot has been deactivated, and all edits by ListeriaBot (marked as “V2”; example) since 2020-11-12 are done by the new version.

I tried to keep the output as compatible with the previous version as possible, but some minute changes are to be expected, so there will be a one-time “wave” of edits by the bot. Once every page has been updated, things should stabilize again.

As best I can tell, the new version does everything the old one did, but it can already do more, and has some foundations for future expansion:

  • Multiple lists per page (a much requested feature), eliminating the need for subpage transclusion.
  • Auto-linking external IDs (eg VIAF) instead of just showing the value.
  • Multiple list rows per item, depending on the SPARQL (another requested feature). This requires the new one_row_per_item=no parameter.
  • Foundation to use other SPARQL engines, such as the one being prepared for Commons (as there is an OAuth login required for the current test one, I have not completed that yet). This could generate lists for SDC queries.
  • Portability to generic Wikibase installations (untested; might require some minor configuration changes). Could even be bundled with Docker, as QuickStatements is now.
  • Foundation to use the Commons Data namespace to store the lists, then display them on a wiki via Lua. This would allow lists to be updated without editing the wikitext of the page, and no part of the list would be directly editable by users (thus, no possibility of the bot overwriting human edits, a reason given to disallow Listeria edits in the main namespace). The code is actually pretty complete already (including the Lua), but it got bogged down a bit in the details of encoding information like sections, which is not “native” to tabular data. An example with both wiki and “tabbed” versions is here.

As always with new code, there will be bugs and unwanted side effects. Please use the issue tracker to log them.

The Toolforge Composition

Toolforge, formerly known as wmflabs, is changing its URLs. Where there was one shared host (tools.wmflabs.org) before, each tool now gets its own sub-domain (eg mix-n-match.toolforge.org).

Until now, I have used my WiDaR tool as a universal OAuth login for many of my tools, so users only have to sign in once. However, since this solution only works within the same sub-domain, it is no longer viable with the new Toolforge URL schema.

I am scrambling to port my tools that use OAuth to their own sign-in. To make this easier, I turned my WiDaR tool into a PHP class that can be reused across tools; the individual tool API can then pick up the requests that were previously sent to WiDaR. Some tools, like Mix-n-match, have already been ported.

This brought me back to something that has been requested of some of my tools before – portability, namely to MediaWiki/Wikibase installations other than the Wikimedia ones. A tool with its own WiDaR would be much more portable to such installations.

But the new WiDaR class is included via the shared file system of Toolforge; how to make it portable? Just copying it seems like a duplication of effort, and the copy won’t receive updates etc.

The solution, in the PHP world, is called Composer, a package manager for PHP. While I was at it, I ported several of my often-reused PHP “library scripts” to Composer; they are available as code here, or as an installable package here.

Since the source files for composer slightly differ from the ones I use internally on Toolforge, I wrote a script to “translate” my internal scripts into composer-compatible copies.

The first tool I equipped with this Composer-based WiDaR is Tabernacle. It should be generic enough to be useful on other Wikibase installations, and is very lightweight (the PHP part just contains a small wrapper API around the WiDaR class). Installation instructions are in the repo README.

I will continue converting tools to the new URL schema, as time allows. I hope I will beat the hard deadline of June 15.

The Depicts

So Structured Data on Commons (SDC) has been going for a while. Time to reap some benefits!

Besides free-text image descriptions, the first, and likely most used, element one can add to a picture via SDC is “depicts”. This can be one or several Wikidata items which are visible (prominently or as background) on the image. Many people have done so, manually or via JavaScript- or Toolforge-based mass editing tools.

This is all well and good, but what to do with that data? It can be searched for, if you know the magic incantation for the search engine, but that’s pretty much it for now. A SPARQL query engine would be insanely useful for more complex queries, especially if it would work seamlessly with the Wikidata one, but no usable, up-to-date one is in sight so far.

Inspired by a tweet by Hay, and with some help from Maarten Dammers, I found a way to use SDC “depicts” information in my File Candidates tool. It suggests files that might be useful to add to specific Wikidata items.

Now, since proper SDC support is … let’s say, incomplete at the moment, I had to go a bit off the beaten path. First, I use the “random” sort in the Commons API to search for files with a “depicts” statement. That way, I get 50 such files with one query. Then, I use the wikibase API on Commons to get the structured data for these files. The structured data contains the information about which Wikidata item(s) each file depicts.
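
To give an idea of the second step, here is a sketch of pulling the depicted item IDs out of the kind of JSON the Commons wikibase API returns for a file's structured data; the sample response below is hand-made and abbreviated to just the fields used:

```python
# Hand-made, abbreviated sample of a structured-data response for one file
# (M-id); only the fields used below are included.
sample = {
    "entities": {
        "M12345": {
            "statements": {
                "P180": [  # "depicts"
                    {"mainsnak": {"datavalue": {"value": {"id": "Q42"}}}},
                    {"mainsnak": {"datavalue": {"value": {"id": "Q64"}}}},
                ]
            }
        }
    }
}

def depicted_items(response):
    """Collect Wikidata item IDs from P180 (depicts) statements."""
    qids = []
    for entity in response["entities"].values():
        for statement in entity.get("statements", {}).get("P180", []):
            datavalue = statement["mainsnak"].get("datavalue")
            if datavalue:  # "somevalue"/"novalue" snaks carry no datavalue
                qids.append(datavalue["value"]["id"])
    return qids

print(depicted_items(sample))  # ['Q42', 'Q64']
```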

Armed with these Wikidata item IDs, I use the database replicas on Toolforge to retrieve the subset of items that (a) have no image (P18), (b) have P31 “instance of”, (c) have no P279 “subclass of”, and (d) do not link to any of a number of “unsuitable” items (eg templates or given names). For that subset, I get the files the items already use, eg as a logo image (so as not to suggest their usage with the item), and then I add an entry to the database that says “this item might use this image”, according to the depicts statements in the respective image. (Code is here, in case you are interested.)
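
The four criteria can be illustrated as a predicate over a simplified item record; this is just a sketch (the real filter runs as SQL against the database replicas, and its “unsuitable” check follows links via any property, not only P31 as here):

```python
# One example of an "unsuitable" target; the real list is longer.
UNSUITABLE = {"Q11266439"}  # "Wikimedia template"

def is_candidate(item):
    claims = item["claims"]  # property -> list of target item IDs (simplified)
    return (
        "P18" not in claims                              # (a) no image yet
        and "P31" in claims                              # (b) instance of something
        and "P279" not in claims                         # (c) not a class
        and not set(claims.get("P31", [])) & UNSUITABLE  # (d) not unsuitable
    )

print(is_candidate({"claims": {"P31": ["Q5"]}}))                  # True
print(is_candidate({"claims": {"P31": ["Q5"], "P18": ["x.jpg"]}}))  # False
```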

50 files (a restriction imposed by the Commons API) are not much, especially since many images with depicts statements are probably already used as an image on the respective Wikidata item. So I keep running such random requests in the background and collect the results for the File Candidates tool. At the time of writing, over 12k such candidates exist.

Happy image matching, and don’t forget to check out the other candidate image groups in the tool (including potentially useful free images from Flickr!).

Eval, not evil

My Mix-n-match tool deals with third-party catalogs, and helps match their entries to Wikidata. This involves, as a necessity, importing minimal information about those entries into the Mix’n’match database, but ideally also additional metadata, such as (in the case of biographical entries) gender, birth/death dates, VIAF etc., which are invaluable in automatically matching entries to Wikidata items, and thus greatly reduce volunteer workload.

However, virtually none of the (currently) ~2600 catalogs in Mix’n’match offers a standardized format to retrieve either basic or meta-data. Some catalogs are imported by volunteers from tabbed files, but most are “scraped”, that is, automatically read and parsed, from the source website.

Some source websites are set up in a way that allows a standardized scraping tool to run there, and I offer a web form to create new scrapers; over 1400 of these scrapers have run successfully, and ~750 of them can automatically run again on a regular basis.

But the autoscraper does not handle metadata, such as birth/death dates, and many catalogs need bespoke import code even for the basic information. Until recently, I had hundreds of scripts, some of them consisting of thousands of lines of code, running data retrieval and parsing:

  • Basic (ID, name, URL) information retrieval from source site
  • Creating or amending entry descriptions from source site
  • Importing auxiliary data (other IDs, such as VIAF, coordinates, etc.) from source site
  • Extraction of birth/death dates from descriptions, taking care not to use estimates, “floruit” etc
  • Extraction of auxiliary data from descriptions
  • Linking of two related catalogs (e.g. one for painters, one for paintings) to improve matching (e.g. matching artworks on Wikidata only to that artist)

and many others.

Over time, all this has become unwieldy, unstructured, repetitive; I have written bespoke scrapers only to find that I already had one somewhere else etc.

So I set out to radically redesign all these processes. Since only a small piece of code performs the actual scraping/parsing logic for each catalog, these code fragments are now stored in the Mix’n’match database, associated with their respective catalog. I imported many code fragments from the “old” scripts into this table. I also wrote function-specific wrapper code that can load and execute a code fragment (via the eval function, which is often considered “evil”, hence the blog post title) on its associated catalog. An example of such code fragments for a catalog can be seen here.
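
The pattern is easiest to show in Python terms, even though Mix'n'match itself stores and evals PHP; the fragment text and the catalog/function keys below are made up for the example:

```python
# Hypothetical stored fragments: (catalog_id, function_name) -> code text.
# In Mix'n'match, these live in a database table next to the catalog.
fragments = {
    (1234, "parse_entry"): "entry['born'] = desc.split('(')[1].split('-')[0]",
}

def run_fragment(catalog_id, function_name, desc):
    """Load a stored code fragment and execute it on one entry description."""
    entry = {}
    code = fragments[(catalog_id, function_name)]
    # Execute the stored fragment with a controlled set of variables; this is
    # exactly why write access to the fragments must stay restricted.
    exec(code, {}, {"entry": entry, "desc": desc})
    return entry

print(run_fragment(1234, "parse_entry", "Jane Doe (1901-1977), painter"))
```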

I can now use that web interface to retrieve, create, test, and save code, without having to touch the command line at all.

In an ideal world, I would let everyone add and edit code here; however, since the framework executes PHP code, this would open the way for all kinds of malicious attacks. I cannot think of a way to safeguard against (deliberate or accidental) destructive code, though I have put some mitigations in place, in case I make a mistake myself. So, for now, you can look, but you can’t touch. If you want to contribute code (new or patches), please give it to me, and I’ll be happy to add it!

This code migration is just in its infancy; so far, I support four functions, with a total of 591 code fragments. Many more to come, over time.

A Scanner Rusty

One of my most-used WikiVerse tools is PetScan. It is a complete re-write of several other PHP-based tools, in C++ for performance reasons. PetScan has turned into the Swiss Army Knife of doing things with Wikipedia, Wikidata, and other projects.

But PetScan has also developed a few issues over time. It is suffering from the per-tool database connection limit of 10, enforced by the WMF. It also has some strange bugs, one of them creating weirdly named files on disk, which generally does not inspire confidence. Finally, from a development/support POV, it is the “odd man out”, as none of my other WikiVerse tools are written in C++.

So I went ahead and re-wrote PetScan, this time in Rust. If you read this blog, you’ll know that Rust is my recent go-to language. It is fast, safe, and comes with a nice collection of community-maintained libraries (called “crates”). The new PetScan:

  • uses MediaWiki and Wikibase crates, which simplifies coding considerably
  • automatically chunks database queries, which should improve reliability, and could be developed into multi-threaded queries
  • pools replica database access from several of my other HTML/JS-only tools (which do not use the database, but still get allocated connections)
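
The chunking mentioned above is simple in principle; here is a Python sketch of the idea (the real code is Rust, and the batch size is illustrative):

```python
def chunks(items, size):
    """Split a long list (eg page IDs for a database IN clause) into batches."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

page_ids = list(range(25))
batches = list(chunks(page_ids, 10))
print([len(b) for b in batches])  # [10, 10, 5]
```

Each batch then becomes one smaller database query, which keeps individual queries fast and could later be run in parallel threads.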

Long story short, I want to replace the C++ version with the Rust version. Most of the implementation is done, but I can’t think of all possible corner cases myself. So I ask interested community members to give the test instance of PetScan V2 a whirl. I re-used the web interface from PetScan V1, so it should look very familiar. If your query works, it should do so much more reliably than in V1. The code is on GitHub, and so is the new issue tracker, where you can file bug reports and feature requests.

Batches of Rust

QuickStatements is a workhorse for Wikidata, but it has had a few problems of late.

One of those is bad performance with batches. Users can submit a batch of commands to the tool, and these commands are then run on the Labs server. This mechanism has been bogged down for several reasons:

  • Batch processing written in PHP
  • Each batch running in a separate process
  • Limitation of 10 database connections per tool (web interface, batch processes, testing etc. together) on Labs
  • Limitation of simultaneous processes per tool on Labs cloud (16? observed but not validated)
  • No good way to auto-start a batch process when it is submitted (currently, a PHP process auto-starts every 5 minutes, and exits if there is nothing to do)
  • A large backlog developing

Amongst continued bombardment on wiki talk pages, Twitter, Telegram etc. that “my batch is not running (fast enough)”, I set out to mitigate the issue. My approach is to run all the batches in a new processing engine, written in Rust. This has several advantages:

  • Faster and easier on the resources than PHP
  • A single process running on Labs cloud
  • Each batch is a thread within that process
  • Checking for a batch to start every second (if you submit a new batch, it should start almost immediately)
  • Use of a database connection pool (the individual thread might have to wait a few milliseconds to get a connection, but the system never runs out)
  • Limiting simultaneous batch processing for batches from the same user (currently: 2 batches max) to avoid the MediaWiki API “you-edit-too-fast” error
  • Automatic handling of maxlag, bot/OAuth login etc. by using my mediawiki crate
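
The thread-per-batch and connection-pool design above can be sketched in Python terms (the real engine is Rust; the “connections” here are just placeholder tokens, and the pool size is illustrative):

```python
import queue
import threading

# A pool of 3 "connections"; threads block until one is free, so the system
# never exceeds the connection limit.
pool = queue.Queue()
for conn_id in range(3):
    pool.put(conn_id)

results = []
lock = threading.Lock()

def run_batch(batch_id):
    conn = pool.get()  # block until a connection is available
    try:
        with lock:
            results.append((batch_id, conn))  # stand-in for "process the batch"
    finally:
        pool.put(conn)  # always return the connection to the pool

threads = [threading.Thread(target=run_batch, args=(b,)) for b in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 8
```

All eight "batches" complete even though only three connections exist; the waiting happens inside `pool.get()` rather than by failing with an out-of-connections error.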

This is now running on Labs, processing all (~40 at the moment) open batches simultaneously. Grafana shows the spikes in edits, but no increased lag so far. The process is given 4GB of RAM, but could probably do with a lot less (for comparison, each individual PHP process used 2GB).

A few caveats:

  • This is a “first attempt”. It might break in new, fun, unpredicted ways
  • It will currently not process batches that deal with Lexemes. This is mostly a limitation of the wikibase crate I use, and will likely get solved soon. In the meantime, please run Lexeme batches only within the browser!
  • I am aware that I now have code duplication (the PHP and the Rust processing). For me, the solution will be to implement QuickStatements command parsing in Rust as well, and replace PHP completely. I am aware that this will impact third-party use of QuickStatements (e.g. the WikiBase docker container), but the PHP and Rust sources are independent, so there will be no breakage; of course, the Rust code will likely evolve away from the PHP in the long run, possibly causing incompatibilities

So far, it seems to be running fine. Please let me know if you encounter any issues (unusual errors in your batch, weird edits etc.)!

Bad credentials

So there has been an issue with QuickStatements on Friday.

As users of that tool will know, you can run QuickStatements either from within your browser, or “in the background” from a Labs server. Originally, these “batch edits” were performed as QuickStatementsBot, mentioning the batch and the user who submitted it in the edit summary. Later, through a pull request, QuickStatements gained the ability to run batch edits as the user who submitted the batch. This is done by storing the OAuth information of the user, and playing it back to the Wikidata API for the edits. So far so good.

However, as with many of my databases on Labs, I had made the QuickStatements database open for “public reading”, that is, any Labs tool account could see its contents. Including the OAuth login credentials. Thus, from the introduction of the “batch edit as user” feature up until last Friday, anyone with a login on Labs could, theoretically, have performed edits as anyone who had submitted a QuickStatements batch, by copying the OAuth credentials.

We (WMF staff and volunteers, including myself) are not aware that any such user account spoofing has taken place (security issue). If you suspect that this has happened, please contact WMF staff or myself.

Once the issue was reported, the following actions were taken:

  • deactivation of the OAuth credentials of QuickStatements, so no more edits via spoofed user OAuth information could take place
  • removal of the “publicly” (Labs-internally) visible OAuth information from the database
  • deactivation of the QuickStatements bots and web interface

Once spoofed edits were no longer possible, I went ahead and moved the OAuth storage to a new database that only the QuickStatements “tool user” (the instance of the tool that is running, and myself) can see. I then got a new OAuth consumer for QuickStatements, and restarted the tool. You can now use QuickStatements as before. Your OAuth information will be secure now. Because of the new OAuth consumer for QuickStatements, you will have to log in again once.

This also means that all the OAuth information that was stored prior to Friday is no longer usable, and was deleted. The batches you submitted until Friday will therefore fall back on the aforementioned QuickStatementsBot, and no longer edit as your user account. If it is very important to you that your edits appear under your user account, please let me know. All new batches will run edits under your user account, as before.

My apologies for this incident. Luckily, there appears to be no actual damage done.