Skip to content

Get into the Flow

Unix philosophy contains the notion that each program should perform an single function (and perform that function exceptionally well), and then be used together with other single-function programs to form a powerful “toolbox”, with tools connected via the geek-famous pipe (“|”).

A ToolFlow workflow, filtering and re-joining data rows

The Wikimedia ecosystem has lots of tools that perform a specific function, but there is little in terms of interconnection between them. Some tool prorammers have added other tools as (optional) inputs (eg PetScan) or outputs (eg PagePile), but that is the extend of it. There is PAWS, a Wikimedia-centric Jupyter notebook, but it does require a certain amount of coding, which excludes many volunteers.

So I finally got around to implementing the “missing pipe system” for Wikimedia tools, which I call ToolFlow. I also started a help page explaining some concepts. The basic idea is that a user can create a new workflow (or fork an existing one). A workflow would usually start with one or more adapters that represent the output of some tool (eg PetScan, SPARQL query, Quarry) for a specific query. The adapter queries the tool, and represents the tool output in a standardized internal format (JSONL) that is written into a file on the server. These files can then be filtered and combined to form sub- and supersets. Intermediate files will be cleaned up automatically, but the output of steps (ie nodes) that have no further steps after them is kept, and can only be cleared by re-running the entire workflow run. Output can also be exported via a generator; at the moment, the only generator is a Wikipage editor, which will create a Wikitext table on a Wiki page of your choice from an output file.

Only the owner (=creator) of a workflow can edit or run it, but the owner can also set a scheduler (think UNIX cronjob) for the workflow to be run regularly every day, week, or month. ToolFlow remembers your OAuth details, so it can edit wiki pages regularly, updating a wikitext table with the current results of your workflow.

I have created some demo workflows:

Now, this is a very complex software, spanning two code repoitories (one for the HTML/JS/PHP front-end, and one for the Rust back-end). Many things can go wrong here, and many tools, filters etc can be added. Please use the issue trackers for problems and suggestions. I would especially like some more suggestions for tools to use as input. And despite my best efforts, the interface is still somewhat complicated to understand, so please feel free to improve the help page.

Mix’n’match background sync

My Mix’n’match tool helps matching third-party catalogs to Wikidata items. Now, things happen on Mix’n’match and Wikidata in parallel, amongst them:

  • Wikidata items are deleted
  • Wikidata items are merged, leving one to redirect to the other
  • External IDs are added to Wikidata

This leads to the states of Mix’n’match and Wikidata diverging over time. I already had some automated measures in place to keep them in sync, and there is a “manual sync” function on each catalog that has a Wikidata property, but it is not ideal, especially since deleted/redirect items can show up as mismatches in various places.

I previously blogged about another tool of mine, Wikidata Recent Changes (WDRC), which records fine-grained changes of Wikidata items, and offers an API to check on them. The other day, I added recording of item creations, deletions, and redirect creations. So it was finally time to use WDRC myself. Every 15 minutes, Mix’n’match now

  • un-matches all deleted items
  • points matches to redirected items to the redirect target
  • checks all catalogs with a Wikidata property for new external IDs on Wikidata, and sets them as matches (if the Mix’n’match entry is not already set).

Please note that this applies only to edits from now on; there may be many Mix’n’match entries that are still matched to deleted/redirected items on Wikidata. But, one thing at a time.

Technical remark: I am using the JSONL output format of WDRC, for several reasons:

  • On the WDRS side, the lines are just printed out, so no need to cache some giant result set when generating it
  • On the consumer side (Mix’n’match), I can stream the API result into a temporary file, which consumes no memory. Then, I read the file line-by-line, using only memory for a single entry

This way, I can request (almost) unlimited output from the API, and process it reliably, with very little resources (which are at a premium these days on Toolforge).

Dude, where are my pictures?

Like many people, I take a lot of pictures with my phone. After all, the best camera is the one you have with, and phone cameras produce high quality pictures these days. And like many Wikimedia volunteers, sometimes I take pictures thinking “this should go on Commons/Wikipedia/Wikidata”. But then days pass, and I don’t remember which things I took a picture of, or it becomes too much of a hassle to find, upload, and use them in Wikimedia projects.

Now, I back up the pictures from my phone to a disk at home automatically, instead of cloud. Also, I have the geolocation in pictures set to “on”. As a result, I have (tens of) thousands of pictures on my disk that have a geolocation in their EXIF data. However, that’s where it ends; the data is just hidden there. So recently (well, yesterday), I wrote a little command line tool called img_coords, which can scan a directory structure for pictures with EXIF location data, and aggregate that data into a single file (KML or GeoJSON), using the file path as point label.

As a final step, I added a new function to my trusty WikiShootMe! tool. “Load GeoSJON” from the burger dropdown menu lets you load a valid GeoJSON file, which then gets displayed as an additional layer. This includes GeoJSON you create with the img_coords tool. You can then see the locations of where you took pictures, next to Wikidata items (with/without image), Wikipedia articles, etc. Clicking on one of the blue GeoJSON dots shows the label, in my case the path of the file on my local disk. For Mac/Linux, the label turns into a link that you can copy&paste into a new tab (it’s a local file so your browser won’t just let you click on it). You can now esaily find Wikipedia articles and Wikidata items you have taken a picture of, by location. Happy uploading!

UPDATE: Now with thumbnails!

Cram as cram can

So I am trying to learn (modern) Greek, for reasons. I have books, and online classes, and the usual apps. But what I was missing was a simple way to rehearse common words. My thoughts went to Wikidata items, and then to lexemes. Lexemes are something I have not worked with a lot, so this seemed like a good opportunity. Pauken (German for to cram) is the result.

I decided to start with nouns. Nouns have probably a corresponding Wikidata item, like “dog”. A small number of (somehow related) items makes for a nice learning exercise. I could just take the label in the language to be learned, and be done. Or, I could make it more complicated! So, given a small set of items to learn, I am using the “item for this sense” property on lexemes to link them. Then, I use:

  • lexeme label, pronounciation audio, and gammatical gender
  • item image, and (as a fallback) label

to generate entries on a page. These will show the image from the item, or the label in the “known” language (which I determine by the browser settings) as a fallback. There are currently two modes:

  • Reveal shows you the entries, with images and audio (where available), but with the label as “???”. You can try to recall the correct label, then click on the image, and the label (and grammatical gender) will be revealed. Another click hides it again.
  • Pick hides everything except the image, and offers an audio “play” button, with a randomly chosen pronounciation audio. Play the audio, then click the correct image; it will mark a correct answer in green, or a wrong one in red (with the correct one in blue). The color marking will disappear after a moment, and the next random audio will play automatically (so you dont have to click the “play” button all the time).

This is a simple working demo, but already quite useful IMHO. But beyond learning a language, you are also encouraged to add and edit lexemes; a missing lexeme, or rather no lexeme linking to the item, will show in a traditional red link. A lexeme with missing pronounciation audio will show a microphone, linking to lingualibre.org, so you can record the missing audio.

Also, the example links on the main page of the tool can be improved by editing Wikidata.

Musings on the backend

So the Wikidata query service (WDQS), currently powered by Blazegraph, does not perform well. Even simple queries time out, it lags behind the live site, and individual instances can (and do) silently go out of sync. The WMF is searching for a replacement.

One of the proposed alternatives to blazegraph is Virtuoso. It models the graph structure of Wikidata in MySQL. Even before I read about that, I had a similar thought. But unlike Virtuoso, which is a general solution, Wikidata is a special case, with optimization potential. So I went to do a bit of tinkering. The following was done in the last few days, so it’s not exactly pretty, but initial results are promising.

I downloaded a test dataset of ~108K RDF triples from Wikidata. I then wrote some Rust code to (hackishly) parse the triples, and transform them. I am creating (on demand, so to speak) MySQL tables, based on the RDF predicate, which (often enough) is the Wikidata property used. For example, the data__PropertyDirect_P31__Entity__Entity table contains the entity-to-entity mapping for P31 (instance of). Since we know that the entire table is about P31, we don’t have to encode the property in the rows; two columns of VARCHAR(64) are enough. This could be compressed further (two INT columns and an enum, to encode Q/P/M/L and the sub-lexeme types should do) but I’m keeping it simple for now.

Importing the 108K triples into MySQL creates 1434 tables. MySQL should scale to millions of tables, so no problem. The initial import takes 8 minutes, but that seems to be due to the table creation, which would only happen initially on a full Wikidata import. Indeed, re-running the exact same import finishes in 48 seconds (data won’t change, thanks to INSERT IGNORE). I don’t know how much of this is network or database; commenting out the INSERT command on a re-import processes the 108K entries in 2.5 seconds, so that’s the parsing speed, thanks to the generous use of multi-threaded async Rust.

Adding (and removing) RDF triples as they change on Wikidata should therefore be no problem; I suspect changes could be represented with almost immeasurable delay. As for querying, Wikipedia uses read-only replicas, which can be scaled to demand.

But can we actually query? Yes we can! I wrote a “simple case” query generator that takes triples, and can “connect” them, right now with “and” (the “.” in SPARQL). So:

let mut qt1 = QueryTriples::from_str(&app,"?person","wdt:P31","wd:Q5").await?;
let qt2 = QueryTriples::from_str(&app,"?person","wdt:P21","wd:Q6581072").await?;
qt1.and(&qt2)?;
let result = qt1.run().await?;

asks for all people (“P31:Q5”) that are female (“P21:Q6581072”). These triples are transformed into individual SQL queries, which then are joined. The query generated in this case is:

SELECT t1.person 
FROM (SELECT k0 AS `person` FROM `data__PropertyDirect_P31__Entity__Entity` WHERE `v0`="Q5") AS t1
INNER JOIN (SELECT k0 AS `person` FROM `data__PropertyDirect_P21__Entity__Entity` WHERE `v0`="Q6581072") AS t2
ON t1.person=t2.person;

In the test dataset, there is one result (Q241961). The query runs ~100ms (on Toolforge shared tool database, from remote). The same query on Blazegraph takes about a minute (yes, it yields 2 million results, but even with all that data loaded into this database, the SQL query would be the same, and an indexed INNER JOIN is probably as fast as we can get it).

This implementation is obviously far from complete, and could be optimized a lot. I do not have the time to complete it, especially since it would likely not be used anyway. I guess the days of my weekend hack becoming production code are over. But maybe it can give inspiration to other.

You can find my code here, and if you have a Toolforge account, you can point it to the ss55360__wdqsbe_p database.

Lists. The plague of managing things. But also surprisingly useful for many tasks, including Wikimedia-related issues. Mix’n’match is a list of third-party entries. PetScan generates lists from Wikipedia and Wikidata. And Listeria generates lists on-wiki.

But there is a need for generic, Wikimedia-related, user-curated lists. In the past, I have tried to quell that demand with PagePile, but the community wants more. So here you go: I present GULP, the Generic Unified List Processor. At the moment, this is a work in progress, but is already usable, though lacking many nice-to-have features.

How is this different from PagePile? Let’s see:

  • A list can have multiple columns (so, a table, technically)
  • Each column has a defined type (Wiki page, Location, Plain text)
  • Lists can be updated (file upload, load from URL)
  • Lists can be edited by the list owner (and by others, to be implemented)
  • Lists can have snapshots; a snapshot is like a version number, but it does not create a copy of the list; rather, further changes to the list are logged as such, and the snapshot remains untouched
  • Lists can be downloaded in various formats, and accessed via API

You can already create your own lists, update and modify them, and they are available for public download. Have a look at some demo lists:

Next steps for GULP include:

  • Adding/removing rows
  • More data sources and output formats
  • More column types
  • All API actions (including editing etc) will be available via a token system, so you can write scripts to edit your lists

For bug reports and feature requests, feel free to use the issue tracker. Also, if you want to integrate GULP into your tools , please let me know!

Turn the AC on

A large part of Wikidata is the collection of external identifiers for items. For some item types, such as items about people (Q5), some of this is what is known as Authority Control (AC) data, for example, VIAF (P214). One thing that distinguishes AC data from other external IDs is that AC data sources are often available in machine-readable form. This can be used for Wikidata in several ways:

  • to add new statements (eg occupation or birth place of a person)
  • to add new references to existing statements
  • to find and add more AC identifiers

Over the years, I wrote several bespoke tools and scripts that would query one of these AC websites, and add bits and pieces to Wikidata items, but I always wanted a more unified solution. So I finally got around to it and wrote a new tool on Toolforge, AC2WD. This presents an API to:

  • query multiple AC sources via an ID
  • create new Wikidata items (in memory, not on Wikidata!) from each source
  • merge several such “virtual” items into one (the sum of all AC data knowledge)
  • construct a “diff”, a JSON structure containing instructions to add new information (statements, references) to an existing Wikidata item

By giving an existing Wikidata item ID to this API, it will extract existing AC identifiers from the item, and load information from the respective AC sources. It will then check the new information for new, usable AC identifiers, and repeat. Once all possible AC data has been loaded, it will return the “diff” data to extend the item with the new information (via https://ac2wd.toolforge.org/extend/Q_YOUR_ITEM ), using the wbeditentity action of the Wikidata API.

Now this is all very technical for most users, so I wrote a little JavaScript utility for Wikidata, rather predictably also called AC2WD (usage instructions on the page). Any Wikidata item with at least one supported AC property will have an “AC2WD” link in the tool sidebar. If you click on it, it will fetch the “diff” from the Toolforge tool, and attempt to make an edit to add new information (reload the page to see it, or check out the history to see the changes in detail).

I am happy to extend the API to support new AC sources (that have a Wikidata property!), please let me know your favourite candidates. I may also be able to extract more information from certain AC sources than I currently do; again, please let me know if you have a preference.

Trust in Rust

So Toolforge is switching from grid engine to Kubernetes. This also means that tool owners such as myself need to change their tool background jobs to the new system. Mix’n’match was my tool with the most diverse job setup. But resource constraints and the requirement to “name” jobs meant that I couldn’t just port things one-to-one.

Mix’n’match has its own system of jobs that run once or on a regular basis, depend on other jobs finishing before them etc. For the grid engine, I could start a “generic” job every few minutes, that would pick up the next job and run it, with plenty of RAM assigned. Kubernetes resource restriction make this impossible. So I had to refactor/rewrite several jobs, and make them usable as PHP classes, rather than individual scripts to run.

Mix’n’match classes have become rather significant in code size, with >10K lines of code. Unsurprisingly, despite my best efforts, jobs got “stuck” for no apparent reason, bringing the whole system to a halt. This made especially new Mix’n’match catalogs rather unusable, with no automated matches etc.

Rather than fiddling with the intricacies of a hard-to-maintain codebase, I decided to replace the failing job types with new Rust code. This is already live for several job types, mainly preliminary match and person name/date match, and I am adding more. Thanks to the easy multi-threading and async/await capabilities of Rust, many jobs can run in parallel in a single process. One design feature for the new code is batched processing, so memory requirements are low (<200MB) even for multiple parallel jobs. Also, jobs now keep track of their position in the batch, and can resume if the process is stopped (eg to deploy new code).

I strongly doubt I will replace the entire code base, especially since much of the scraping code involve user-supplied PHP code that gets dynamically included. But safe, fast, and reliable Rust code serves its purpose in this complex tool.

Vue le vue

A while ago, Wikimedia sites, including Wikidata, started to use the Vue.js framework to ease the future development of user interface components. Vue is, to some degree, also available for user scripts.



I have a few user scripts on Wikidata, and some of them have a seriously outdated interface. There was a modal (page-blocking) dialog box for authority control, a floating dialog box for “useful” shortcuts, and a box at the top of the page for the relatively recent Mix’n’Match gadget. The appearance of all of these was different, and they took increasing amounts of screen real estate. They also can add statements to the item, but the confirmation for that were small notes thrown into the HTML, rather than actual statement boxes.

So I decided to re-style (and partially rewrite) them, but also to centralize shared functionality, to avoid duplication in these and other scripts. So I created a JavaScript component that can create a nice, central “tabbed box” at the top of the page, to make it easy to change between tools while not wasting screen space. Each tool gets its own tab and output space. Additionally, this new component can create new statements in the current item, and put new statement entries into the interface. Now, Wikidata does not expose (or at least, properly document) the methods to do this, so I have to write my own, which look similar to the original but sadly have none of the functionality. I hope Wikidata will expose the visual code to do that in the future, so everything could work nicely without me having to (badly) duplicate code.

I have updated the existing scripts to the new interface now; all users of these scripts should see the new interface, though you might have to force-reload the browser once.

The updated script are:

If you want other scripts of mine to be updated like this, or you want to update/develop your own, please let me know!

Join the Quest

I recently came across an interesting, semi-automated approach to create statements in Wikidata. A SPARQL query would directly generate commands for QuickStatements. These would then be manually checked before running them. The queries exist as links on Wikidata user pages, and have to be run by clicking on them.


That seemed useful but tedious. I decided to make it less tedious.

Enter the Quickstatements User Evaluation of Statements and Terms, or QUEST for short. Using this tool (logged in), you will be presented with a random set of QuickStatement commands, visually enhanced, to let you take a decision on running the command, or flagging it as “bad”. If you decide to run it, the command will edit under your user account.

The list of commands is fuelled by SPARQL queries. You can add your own; a help text is available. Every query has an interval (hours, days) and it will be periodically run in the background, to add new commands if possible. New or edited queries should be run within a few seconds. You can see all queries, those by a specific user, and commands from a specific query.

I hope this will help the Wikidata community with repeating tasks, and queries to “keep an eye on”.