
Lists. The bane of managing things, but also surprisingly useful for many tasks, including Wikimedia-related ones. Mix’n’match manages lists of third-party entries. PetScan generates lists from Wikipedia and Wikidata. And Listeria generates lists on-wiki.

But there is a need for generic, Wikimedia-related, user-curated lists. In the past, I have tried to quell that demand with PagePile, but the community wants more. So here you go: I present GULP, the Generic Unified List Processor. At the moment, this is a work in progress, but is already usable, though lacking many nice-to-have features.

How is this different from PagePile? Let’s see:

  • A list can have multiple columns (so, a table, technically)
  • Each column has a defined type (Wiki page, Location, Plain text)
  • Lists can be updated (file upload, load from URL)
  • Lists can be edited by the list owner (and by others, to be implemented)
  • Lists can have snapshots; a snapshot is like a version number, but it does not create a copy of the list; rather, further changes to the list are logged as such, and the snapshot remains untouched
  • Lists can be downloaded in various formats, and accessed via API
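
To give an idea of what scripted access could look like, here is a minimal sketch; the host, path, and parameter names below are placeholders rather than the documented GULP API:

```python
import requests

# Everything here is hypothetical: host, path, and parameter names are
# placeholders for illustration, not the documented GULP API.
BASE = "https://gulp.toolforge.org"        # assumed host
params = {"list": 123, "format": "json"}   # made-up parameter names

rows = requests.get(f"{BASE}/list", params=params).json()
for row in rows:
    print(row)  # one value per typed column (wiki page, location, plain text)
```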

You can already create your own lists, update and modify them, and they are available for public download. Have a look at some demo lists:

Next steps for GULP include:

  • Adding/removing rows
  • More data sources and output formats
  • More column types
  • All API actions (including editing etc) will be available via a token system, so you can write scripts to edit your lists

For bug reports and feature requests, feel free to use the issue tracker. Also, if you want to integrate GULP into your tools, please let me know!

Turn the AC on

A large part of Wikidata is the collection of external identifiers for items. For some item types, such as items about people (Q5), some of this is what is known as Authority Control (AC) data, for example, VIAF (P214). One thing that distinguishes AC data from other external IDs is that AC data sources are often available in machine-readable form. This can be used for Wikidata in several ways:

  • to add new statements (eg occupation or birth place of a person)
  • to add new references to existing statements
  • to find and add more AC identifiers

Over the years, I wrote several bespoke tools and scripts that would query one of these AC websites, and add bits and pieces to Wikidata items, but I always wanted a more unified solution. So I finally got around to it and wrote a new tool on Toolforge, AC2WD. This presents an API to:

  • query multiple AC sources via an ID
  • create new Wikidata items (in memory, not on Wikidata!) from each source
  • merge several such “virtual” items into one (the sum of all AC data knowledge)
  • construct a “diff”, a JSON structure containing instructions to add new information (statements, references) to an existing Wikidata item

Given an existing Wikidata item ID, the API extracts the AC identifiers already on the item and loads information from the respective AC sources. It then checks that new information for further usable AC identifiers, and repeats. Once all available AC data has been loaded, it returns the “diff” data to extend the item with the new information (via https://ac2wd.toolforge.org/extend/Q_YOUR_ITEM ), to be applied using the wbeditentity action of the Wikidata API.
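
For script authors, the whole flow can be sketched roughly like this; the /extend/ URL pattern and the wbeditentity action are as described above, while authentication is elided (the CSRF token is a placeholder), and it is assumed that the returned diff is the JSON structure wbeditentity expects as its data parameter:

```python
import json
import requests

ITEM = "Q42"        # any item that already carries at least one supported AC identifier
CSRF_TOKEN = "..."  # placeholder; a real script needs an authenticated session and CSRF token

# 1. Ask AC2WD for the "diff" that would extend this item (URL pattern from above)
diff = requests.get(f"https://ac2wd.toolforge.org/extend/{ITEM}").json()

# 2. Apply the diff with the Wikidata API's wbeditentity action
resp = requests.post(
    "https://www.wikidata.org/w/api.php",
    data={
        "action": "wbeditentity",
        "id": ITEM,
        "data": json.dumps(diff),
        "token": CSRF_TOKEN,
        "format": "json",
    },
)
print(resp.json())
```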

Now this is all very technical for most users, so I wrote a little JavaScript utility for Wikidata, rather predictably also called AC2WD (usage instructions on the page). Any Wikidata item with at least one supported AC property will have an “AC2WD” link in the tool sidebar. If you click on it, it will fetch the “diff” from the Toolforge tool, and attempt to make an edit to add new information (reload the page to see it, or check out the history to see the changes in detail).

I am happy to extend the API to support new AC sources (that have a Wikidata property!), please let me know your favourite candidates. I may also be able to extract more information from certain AC sources than I currently do; again, please let me know if you have a preference.

Trust in Rust

So Toolforge is switching from grid engine to Kubernetes. This also means that tool owners such as myself need to change their tool background jobs to the new system. Mix’n’match was my tool with the most diverse job setup. But resource constraints and the requirement to “name” jobs meant that I couldn’t just port things one-to-one.

Mix’n’match has its own system of jobs that run once or on a regular basis, depend on other jobs finishing before them, etc. On the grid engine, I could start a “generic” job every few minutes that would pick up the next job and run it, with plenty of RAM assigned. Kubernetes resource restrictions make this impossible, so I had to refactor or rewrite several jobs, and make them usable as PHP classes rather than as individual scripts to run.

The Mix’n’match classes have grown rather large, with >10K lines of code. Unsurprisingly, despite my best efforts, jobs got “stuck” for no apparent reason, bringing the whole system to a halt. This made new Mix’n’match catalogs in particular rather unusable, with no automated matches etc.

Rather than fiddling with the intricacies of a hard-to-maintain codebase, I decided to replace the failing job types with new Rust code. This is already live for several job types, mainly preliminary match and person name/date match, and I am adding more. Thanks to the easy multi-threading and async/await capabilities of Rust, many jobs can run in parallel in a single process. One design feature for the new code is batched processing, so memory requirements are low (<200MB) even for multiple parallel jobs. Also, jobs now keep track of their position in the batch, and can resume if the process is stopped (eg to deploy new code).
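
The batching-and-resume idea, reduced to a minimal (non-Rust) sketch; the state file and the batch helper are invented for illustration, not the actual Mix’n’match code:

```python
import json
from pathlib import Path

STATE = Path("job_state.json")  # invented file name; where the real code stores its position is not shown here
BATCH_SIZE = 1000

def load_offset() -> int:
    return json.loads(STATE.read_text())["offset"] if STATE.exists() else 0

def save_offset(offset: int) -> None:
    STATE.write_text(json.dumps({"offset": offset}))

def fetch_batch(offset: int, size: int) -> list:
    # stand-in for "read the next chunk of entries from the database"
    return []  # empty list = nothing left to do

def run_job() -> None:
    offset = load_offset()               # resume where the last run stopped
    while batch := fetch_batch(offset, BATCH_SIZE):
        for entry in batch:
            pass                         # process one entry (match, write result, ...)
        offset += len(batch)
        save_offset(offset)              # persist after every batch, so a restart loses little work

run_job()
```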

I strongly doubt I will replace the entire code base, especially since much of the scraping code involves user-supplied PHP code that gets dynamically included. But safe, fast, and reliable Rust code serves its purpose in this complex tool.

Vue le vue

A while ago, Wikimedia sites, including Wikidata, started to use the Vue.js framework to ease the future development of user interface components. Vue is, to some degree, also available for user scripts.



I have a few user scripts on Wikidata, and some of them had a seriously outdated interface. There was a modal (page-blocking) dialog box for authority control, a floating dialog box for “useful” shortcuts, and a box at the top of the page for the relatively recent Mix’n’match gadget. They all looked different, and they took up increasing amounts of screen real estate. They can also add statements to the item, but the confirmation for that was a small note thrown into the HTML, rather than an actual statement box.

So I decided to re-style (and partially rewrite) them, and also to centralize shared functionality, to avoid duplication in these and other scripts. I created a JavaScript component that renders a nice, central “tabbed box” at the top of the page, making it easy to switch between tools without wasting screen space. Each tool gets its own tab and output area. Additionally, the new component can create new statements in the current item, and put new statement entries into the interface. Wikidata does not expose (or at least, does not properly document) the methods to do this, so I had to write my own, which look similar to the originals but sadly have none of their functionality. I hope Wikidata will expose the relevant interface code in the future, so everything can work nicely without me having to (badly) duplicate it.

I have now updated the existing scripts; all their users should see the new interface, though you might have to force-reload the browser once.

The updated scripts are:

If you want other scripts of mine to be updated like this, or you want to update/develop your own, please let me know!

Join the Quest

I recently came across an interesting, semi-automated approach to create statements in Wikidata. A SPARQL query would directly generate commands for QuickStatements. These would then be manually checked before running them. The queries exist as links on Wikidata user pages, and have to be run by clicking on them.
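
The general idea, as a rough sketch: run a SPARQL query against the Wikidata Query Service and turn each result into a QuickStatements V1 command for a human to review before running. The query here is a deliberately simple placeholder, and the commands are formatted in Python rather than inside the query itself:

```python
import requests

# Placeholder query: items with an occupation (P106) but no instance-of (P31) at all.
# The suggested fix (adding P31=Q5, "human") is exactly the kind of edit that
# should be checked by a person before it is run.
QUERY = """
SELECT ?item WHERE {
  ?item wdt:P106 [] .
  FILTER NOT EXISTS { ?item wdt:P31 [] }
} LIMIT 20
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "quest-sketch/0.1"},
)
for row in r.json()["results"]["bindings"]:
    qid = row["item"]["value"].rsplit("/", 1)[-1]
    # QuickStatements V1 command: item <TAB> property <TAB> value
    print(f"{qid}\tP31\tQ5")
```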


That seemed useful but tedious. I decided to make it less tedious.

Enter the QuickStatements User Evaluation of Statements and Terms, or QUEST for short. Using this tool (while logged in), you will be presented with a random set of QuickStatements commands, visually enhanced, letting you decide whether to run each command or flag it as “bad”. If you decide to run a command, the edit is made under your user account.

The list of commands is fuelled by SPARQL queries. You can add your own; a help text is available. Every query has an interval (hours, days) and is run periodically in the background to add new commands when possible. New or edited queries should be run within a few seconds. You can see all queries, the queries of a specific user, and the commands from a specific query.

I hope this will help the Wikidata community with recurring tasks, and with queries they want to “keep an eye on”.

Orthogonal Recent Changes

Recent Changes is a core functionality of all wikis. It shows which articles or, in the case of Wikidata, items have changed in the last minutes, hours, days. As useful as this is, for Wikidata it is like drinking from the proverbial fire hose; if you look for a specific type of change, it is a lot of data to process.

I found myself thinking that, for some of my tools like Mix’n’match, it would be useful to monitor Wikidata Recent Changes for edits to a specific property, one that is associated with a Mix’n’match catalog. If I could see that an item had, say, a statement with a specific property added, I could check the associated catalog and set that match there as well, to keep both systems in sync. Similarly, if an item had such a statement removed, the entry should be unlinked in the catalog as well.

A significant part of the community is involved in bringing more languages to Wikidata. For them, it would be good to monitor label, alias, and description changes in a specific language. Also, changes in sitelinks (eg adding a Wikipedia page in a specific language to a Wikidata item) are relevant here.

This view on Recent Changes is orthogonal to the standard one; it is primarily concerned with the type of change, rather than with the order of edits. Of course, the final result would be similar in nature, since it is still Recent Changes.

So without further ado, I present Wikidata Recent Changes (my apologies for the boring name). This is a simple front-end to the actual API. The data is updated in almost real time (<10 sec behind live Wikidata). You can query either for statement changes based on properties, or for changes to labels/aliases/descriptions/sitelinks. The default format is JSONL, where each row is one independent result in JSON format. That makes it both easier to generate the data (no intermediate storage required), and easier to read (line by line, no need to download and parse a giant JSON object). Traditional JSON and simple HTML are also available.

You can also specify any combination of added, changed, and removed, when it comes to event types; by default, all are returned.

To make processing faster, several consecutive edits may be grouped into one “event”; the revision number and timestamp returned just say “this was the case as of this revision”, not necessarily the exact revision of the change. It tells you that an item has changed in a way you requested, but it is up to you to make sense of that change, and to check the current state of the item. To save database storage, I also do not include the actual values of the labels/statements/etc.; again, you have to look those up yourself.
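
A consumer of the feed might look roughly like the sketch below; the endpoint, parameter names, and event fields are placeholders (check the tool itself for the real ones), but the JSONL handling and the follow-up check against live Wikidata via the standard wbgetclaims action are the point:

```python
import json
import requests

API = "https://example.toolforge.org/api"  # hypothetical endpoint, not the real URL
params = {"property": "P214", "events": "added,removed", "format": "jsonl"}  # made-up parameter names

with requests.get(API, params=params, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        event = json.loads(line)  # one independent JSON object per line
        # The event only says "this item changed"; the current state has to be
        # checked against live Wikidata, e.g. with the wbgetclaims action.
        # ("item" is an assumed field name in the event.)
        claims = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={"action": "wbgetclaims", "entity": event["item"],
                    "property": "P214", "format": "json"},
        ).json()
        # ...compare against the Mix'n'match catalog here...
```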

I started the data collection yesterday, and I see ~1M rows/day being added. If this gets too large, I will prune older data (older than a month, say). But until then, please give this a whirl!

AutoDesc Reloaded

A long, long time ago, in a codebase far, far away, I wrote some code to generate automatic descriptions of Wikidata items for Reasonator. This turned out to be very useful, and I forked the code into its own tool/API, AutoDesc. Many of my tools, including the popular Mix-n-match, use AutoDesc under the hood.

As the original Reasonator code was in JavaScript, I decided to try a node.js implementation.

This worked reasonably well for a while, but bitrot and changes to the Toolforge ecosystem made it more unreliable over time, and the node.js version is quite deprecated now, even on Toolforge.

A few weeks ago, AutoDesc started to fail completely, that is, it could not be used any longer. Rather than patching the failing code into a new node.js version, I decided to stabilise it by converting it to Python. And by converting, I mean opening a text editor and running regular-expression search/replace to change JavaScript into Python. (I did look around for code that can do that automatically, but not much joy.)
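
To give a flavour of what that looked like, here are a few substitutions of the kind involved; this is illustrative only, not the actual rule set:

```python
import re

# A few mechanical substitutions of the kind involved (illustrative only;
# the real conversion needed many more rules, plus plenty of manual cleanup).
RULES = [
    (r"function\s+(\w+)\s*\((.*?)\)\s*\{", r"def \1(\2):"),  # function headers
    (r"\}\s*else\s*\{", "else:"),                            # else branches
    (r"^\s*\}\s*$", ""),                                     # closing braces on their own line
    (r";\s*$", ""),                                          # trailing semicolons
    (r"\bvar\s+", ""),                                       # variable declarations
    (r"//", "#"),                                            # line comments
]

def js_to_py(src: str) -> str:
    lines = []
    for line in src.splitlines():
        for pattern, repl in RULES:
            line = re.sub(pattern, repl, line)
        lines.append(line)
    return "\n".join(lines)

print(js_to_py("function add(a, b) { // sum\n  var c = a + b;\n  return c;\n}"))
```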

I am happy to report that, for some days now, the new Python code is running on Toolforge. It only does short descriptions for now, but at least you can get something from it. The code for the long descriptions is partially ready, but does not actually work yet. I will continue on this as I have time, and I would welcome any help in the form of pull requests.

The Buggregator

As you may know, I have a lot of tools for Wikipedia, Wikidata, Commons, etc. A lot of tools means a lot of code, and that means a lot of bugs, things that could work better, feature requests, and so on.

How do I learn about such issues as people encounter them? In a variety of ways. Most of my tools have git repositories, usually on GitHub or BitBucket, where issues and feature requests can be posted. But many people just leave comments on one of my many talk pages (de, en, commons, wikidata, meta, …), on a tool’s talk page (usually that of its documentation page), and sometimes I get messages via email, Twitter, Telegram (the app), Signal, etc.

There is just no way for me to keep track of all of these. Some of the talk pages I visit rarely (eg meta), messages on Twitter are forgotten once they scroll out of sight, and so on. I needed a way to aggregate all those bug reports in one place.

Enter The Buggregator.

I started by creating a database of my tools. Yes, all of them. Including the JavaScript ones on the wikis. Including the old, obsolete, and unmaintained ones. Including the sub-tools of amalgamated Toolforge tools like “wikidata-todo”. Everything.

Then, I wrote import scripts: one for GitHub, one for BitBucket, one for talk pages. (Tweets can be added manually at the moment.) These run once a day, for everything I can think of. A heuristic tries to assign incoming items, for example new talk page topics, to the right tool.
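
The heuristic is essentially name matching; a minimal sketch of the idea, with made-up tool names and topic titles:

```python
# Match known tool names against the title of a new talk page topic
# (case-insensitive substring match). Names and titles are made up for illustration.
TOOLS = ["Mix'n'match", "PetScan", "Listeria", "PagePile", "Tabernacle"]

def guess_tool(topic_title: str) -> str | None:
    title = topic_title.lower()
    for tool in TOOLS:
        if tool.lower() in title:
            return tool
    return None

print(guess_tool("ListeriaBot overwrote my list"))    # -> "Listeria"
print(guess_tool("General question about Wikidata"))  # -> None (needs manual triage)
```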

I started out with 1795 issues, many of them “historical”, and am now down to 1516. Some way to go, but that has never discouraged me before.

I did this a few months ago and forgot to write about it, as I was busy with some personal issues (not tracked in Buggregator!). The reason I remembered is that Toolhub is now in production, and I think it’s great. But while it is using entries from Hay’s Directory (HD), it does not have all the tools I have in Buggregator.

So I made a script that takes the Buggregator tool list, checks it against the Toolhub list, and (once a day) creates an HD-style tool list, which I just added to the HD sources. That should add the tools to HD, which in turn should add them to Toolhub. There are ~200 of my tools not in HD and (probably) not in Toolhub, though this might create some duplication. But better to have a tool listed twice (until that can be cleaned up) than not listed at all, IMHO.

If you find tools of mine I forgot (entirely possible), or see some ways to get that count of open issues down, please let me know!

The Listeria Evolution

My Listeria tool has been around for years now, and is used on over 72K pages across 80 wikis in the Wikimediaverse. And while it still works in principle, it has some issues, and, being a single PHP script, it is not exactly flexible when it comes to adapting to new requirements.

Long story short, I rewrote the thing in Rust. The PHP-based bot has been deactivated, and all edits by ListeriaBot since 2020-11-12 (marked as “V2”, example) are done by the new version.

I tried to keep the output as compatible with the previous version as possible, but some minute changes are to be expected, so there should be a one-time “wave” of editing by the bot. Once every page has been updated, things should stabilize again.

As best as I can tell, the new version does everything the old one did, but it can do more already, and has some foundations for future expansions:

  • Multiple lists per page (a much requested feature), eliminating the need for subpage transclusion.
  • Auto-linking external IDs (eg VIAF) instead of just showing the value.
  • Multiple list rows per item, depending on the SPARQL (another requested feature). This requires the new one_row_per_item=no parameter.
  • Foundation to use other SPARQL engines, such as the one being prepared for Commons (as there is an OAuth login required for the current test one, I have not completed that yet). This could generate lists for SDC queries.
  • Portability to generic Wikibase installations (untested; might require some minor configuration changes). Could even be bundled with Docker, as QuickStatements is now.
  • Foundation to use the Commons Data namespace to store the lists, then display them on a wiki via Lua. This would allow lists to be updated without editing the wikitext of the page, and no part of the list would be directly editable by users (thus, no possibility of the bot overwriting human edits, a reason given to disallow Listeria edits in the main namespace). The code is actually pretty complete already (including the Lua), but it got bogged down a bit in the details of encoding information, like sections, that is not “native” to tabular data. An example with both wiki and “tabbed” versions is here.

As always with new code, there will be bugs and unwanted side effects. Please use the issue tracker to log them.

The Toolforge Composition

Toolforge, formerly known as wmflabs, is changing its URLs. Where there was one host (tools.wmflabs.org) before, each tool now gets its own sub-domain (eg mix-n-match.toolforge.org).

Until now, I have used my WiDaR tool as a universal OAuth login for many of my tools, so users only have to sign in once. However, since this solution only works within the same sub-domain, it is no longer viable with the new Toolforge URL schema.

I am scrambling to port my tools that use OAuth to their own sign-in. To make this easier, I put my WiDaR code into a PHP class that can be reused across tools; the individual tool API can then pick up the requests that were previously sent to WiDaR. Some tools, like Mix’n’match, have already been ported.

This brought me back to something that has been requested of some of my tools before – portability, namely to MediaWiki/Wikibase installations other than the Wikimedia ones. A tool with its own WiDaR would be much more portable to such installations.

But the new WiDaR class is included via the shared file system of Toolforge; how to get it portable? Just copying it seems like a duplication of effort, and it won’t receive updates etc.

The solution, in the PHP world, is called composer, a package manager for PHP. While I was at it, I ported several of my often-reused PHP “library scripts” to composer, and they are available in code here, or as an installable package here.

Since the source files for composer slightly differ from the ones I use internally on Toolforge, I wrote a script to “translate” my internal scripts into composer-compatible copies.

The first tool I equipped with this composer-based WiDaR is Tabernacle. It should be generic enough to be useful on other Wikibase installations, and is very lightweight (the PHP part just contains a small wrapper API around the WiDaR class). Installation instructions are in the repo README.

I will continue converting tools to the new URL schema, as time allows. I hope I will beat the hard deadline of June 15.