
Merge and diff

Originally, I wanted to blog about adding new properties (taxon data specifically: NCBI, GBIF, and iNaturalist) to my AC2WD tool (originally described here). If you have the user script installed on Wikidata, AC2WD will automatically show up on relevant taxon items.

But then I realized that the underlying tech might be useful to others, if exposed as an API. The tool checks at least one, and likely multiple, external IDs (eg GND, NCBI taxon) for useful information. Instead of trying to patch an existing item, I build new ones in memory, one for each external ID. Then, I merge all the new items into one. Finally, I merge that new item into the existing Wikidata item. Each merge gives me a “diff”, a JSON structure that can be sent to the wbeditentity action in the Wikidata API. For the first merges (all the new items into one), I ignore the diffs, since none of these items exist on Wikidata and there is no point in keeping them; instead, I keep the merged item. In the last step, I ignore the resulting item but keep the diff, which can then be applied to Wikidata. This is what the user script does; it retrieves the diff from the AC2WD API and applies it on-wiki. So I am now exposing the merge/diff functionality in the API.

Why does this matter? Because many edits to Wikidata, especially automated ones, are additions: labels, statements, and the like, or references to existing statements. But if you want to add a new statement, you have to check whether such a statement already exists. If it does, you need to check the references you have: which ones are already in the statement, and which should be added? This is tedious and error-prone to do by hand. But now, you can just take your input data, create the item you want in memory, send it together with the Wikidata item in question, and apply the resulting diff with wbeditentity. You can even use the same code to create a new item (with “new=item”).
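
To illustrate “create the item you want in memory”: a minimal Rust sketch that builds plain Wikibase-style entity JSON with one statement and one reference. The property/value choices (P31/Q5, a VIAF reference) are just examples, and whether the merge API expects exactly this JSON shape is an assumption on my part.

// Sketch only: build an in-memory item as Wikibase-style JSON.
use serde_json::json;

fn main() {
    let new_item = json!({
        "labels": { "en": { "language": "en", "value": "Example person" } },
        "claims": {
            "P31": [{
                "type": "statement",
                "rank": "normal",
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": { "entity-type": "item", "id": "Q5" }
                    }
                },
                // One reference block; P214 (VIAF) used as an example source
                "references": [{
                    "snaks": {
                        "P214": [{
                            "snaktype": "value",
                            "property": "P214",
                            "datavalue": { "type": "string", "value": "12345" }
                        }]
                    }
                }]
            }]
        }
    });
    println!("{}", serde_json::to_string_pretty(&new_item).unwrap());
}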

Statements are considered the same if they have the same property, value, and qualifiers. If they are the same, references will be added if they do not already exist (ignoring eg “retrieved” (P813) timestamps). A label in the new item will become an alias in the merged one, unless it is already the label or an alias. All you have to do is generate an item that has the information you want to be in the Wikidata item, and the AC2WD merge API will do the rest. And if you write in Rust, you can use the merge functionality directly, without going through the API.
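
To make those merge rules concrete, here is a simplified Rust sketch; the types are stand-ins for illustration (not the actual AC2WD data structures), and the P813 exclusion is left out:

// Simplified illustration of the merge rule described above.
#[derive(Clone, PartialEq)]
struct Snak { property: String, value: String }

struct Statement {
    property: String,
    value: String,
    qualifiers: Vec<Snak>,
    references: Vec<Vec<Snak>>,
}

impl Statement {
    // "Same" statement: same property, value, and qualifiers.
    fn same_as(&self, other: &Self) -> bool {
        self.property == other.property
            && self.value == other.value
            && self.qualifiers == other.qualifiers
    }

    // Merge: keep the existing statement, add only references it lacks.
    fn merge_references(&mut self, other: &Self) {
        for reference in &other.references {
            if !self.references.contains(reference) {
                self.references.push(reference.clone());
            }
        }
    }
}

fn merge_statements(existing: &mut Vec<Statement>, incoming: Vec<Statement>) {
    for new_statement in incoming {
        if let Some(pos) = existing.iter().position(|s| s.same_as(&new_statement)) {
            existing[pos].merge_references(&new_statement);
        } else {
            existing.push(new_statement);
        }
    }
}

fn main() {
    let mut existing = vec![];
    let incoming = vec![Statement {
        property: "P31".into(), value: "Q5".into(), qualifiers: vec![], references: vec![],
    }];
    merge_statements(&mut existing, incoming);
    println!("{} statement(s) after merge", existing.len());
}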

I see this as filling a niche between hand-rolled code to edit Wikidata and offloading your edits to QuickStatements. The merge function is a bit opinionated at the moment (no deletion of statements, no changing of values), but I can add desired functionality if you let me know.

A quick comparison

Visually comparing artworks (example)

Over the years, Mix’n’match has helped to connect many (millions?) of third-party entries to Wikidata. Some entries can be identified and matched in a fully automated fashion (eg people with birth and death dates), but the majority of entries require human oversight. For some entries that works nicely, but others are hard to disambiguate from text alone. Maps (example) for entries with coordinates can help, but pictures also speak a proverbial thousand words.

To aid the matching of entries, I wrote a new function called quick compare that shows a random, automatically matched entry from a catalog, as well as the matched Wikidata item. For both, it shows the usual basic information, but also (where available) an image, and a location on a map.

The external image will have to be extracted from the catalog first, which requires a manual “image pattern” to be created. This is reasonably easy to do in most cases; please let me know if you want this for any specific catalog.

Comparing castles by image and location (example)

The image itself is never copied, but inserted as an <img src="" /> element, hot-loading it from the external source. This is only ever done on the “quick compare” page. While hot-loading from external pages is sometimes frowned upon, the low volume and specific context in which it is done here should qualify as fair use (images are displayed no larger than 300×250px). In the end, it saves the user a click through to the containing web page, and it saves the site the loading of all associated HTML/JS/etc. files.

I hope this will ease the matching of certain types of entries, including (but not limited to) artworks and buildings.

Get into the Flow

Unix philosophy contains the notion that each program should perform a single function (and perform that function exceptionally well), and then be used together with other single-function programs to form a powerful “toolbox”, with tools connected via the geek-famous pipe (“|”).

A ToolFlow workflow, filtering and re-joining data rows

The Wikimedia ecosystem has lots of tools that perform a specific function, but there is little in terms of interconnection between them. Some tool programmers have added other tools as (optional) inputs (eg PetScan) or outputs (eg PagePile), but that is the extent of it. There is PAWS, a Wikimedia-centric Jupyter notebook, but it does require a certain amount of coding, which excludes many volunteers.

So I finally got around to implementing the “missing pipe system” for Wikimedia tools, which I call ToolFlow. I also started a help page explaining some concepts. The basic idea is that a user can create a new workflow (or fork an existing one). A workflow usually starts with one or more adapters that represent the output of some tool (eg PetScan, SPARQL query, Quarry) for a specific query. The adapter queries the tool and represents the tool output in a standardized internal format (JSONL) that is written into a file on the server. These files can then be filtered and combined to form sub- and supersets. Intermediate files are cleaned up automatically, but the output of steps (ie nodes) that have no further steps after them is kept, and can only be cleared by re-running the entire workflow. Output can also be exported via a generator; at the moment, the only generator is a Wikipage editor, which creates a wikitext table on a wiki page of your choice from an output file.
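
To illustrate the kind of work a filter step does, here is a small Rust sketch that streams one JSONL file into another; the file names and the “title” field are assumptions, since the actual ToolFlow row schema is not shown here:

// Illustration only: a "filter" step streaming one JSONL file into another.
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    let input = BufReader::new(File::open("step1.jsonl")?);
    let mut output = BufWriter::new(File::create("step2.jsonl")?);
    for line in input.lines() {
        let line = line?;
        let row: serde_json::Value = match serde_json::from_str(&line) {
            Ok(v) => v,
            Err(_) => continue, // skip malformed rows
        };
        // Keep only rows whose (assumed) "title" field starts with "A"
        if row["title"].as_str().map_or(false, |t| t.starts_with('A')) {
            writeln!(output, "{line}")?;
        }
    }
    output.flush()?;
    Ok(())
}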

Only the owner (=creator) of a workflow can edit or run it, but the owner can also set a scheduler (think UNIX cronjob) for the workflow to be run regularly every day, week, or month. ToolFlow remembers your OAuth details, so it can edit wiki pages regularly, updating a wikitext table with the current results of your workflow.

I have created some demo workflows:

Now, this is a rather complex piece of software, spanning two code repositories (one for the HTML/JS/PHP front-end, and one for the Rust back-end). Many things can go wrong here, and many tools, filters etc can be added. Please use the issue trackers for problems and suggestions. I would especially like some more suggestions for tools to use as input. And despite my best efforts, the interface is still somewhat complicated to understand, so please feel free to improve the help page.

Mix’n’match background sync

My Mix’n’match tool helps matching third-party catalogs to Wikidata items. Now, things happen on Mix’n’match and Wikidata in parallel, amongst them:

  • Wikidata items are deleted
  • Wikidata items are merged, leaving one to redirect to the other
  • External IDs are added to Wikidata

This leads to the states of Mix’n’match and Wikidata diverging over time. I already had some automated measures in place to keep them in sync, and there is a “manual sync” function on each catalog that has a Wikidata property, but it is not ideal, especially since deleted/redirected items can show up as mismatches in various places.

I previously blogged about another tool of mine, Wikidata Recent Changes (WDRC), which records fine-grained changes of Wikidata items, and offers an API to check on them. The other day, I added recording of item creations, deletions, and redirect creations. So it was finally time to use WDRC myself. Every 15 minutes, Mix’n’match now

  • un-matches all deleted items
  • points matches to redirected items to the redirect target
  • checks all catalogs with a Wikidata property for new external IDs on Wikidata, and sets them as matches (if the Mix’n’match entry is not already set).

Please note that this applies only to edits from now on; there may be many Mix’n’match entries that are still matched to deleted/redirected items on Wikidata. But, one thing at a time.

Technical remark: I am using the JSONL output format of WDRC, for several reasons:

  • On the WDRC side, the lines are just printed out, so there is no need to cache some giant result set when generating it
  • On the consumer side (Mix’n’match), I can stream the API result into a temporary file, which consumes no memory. Then, I read the file line-by-line, using only memory for a single entry

This way, I can request (almost) unlimited output from the API, and process it reliably, with very little resources (which are at a premium these days on Toolforge).
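
In Rust, that streaming approach looks roughly like this; the WDRC URL and parameters below are placeholders, not the real API interface:

// Sketch of the streaming approach: response goes straight to a temp file,
// then is read back one line (= one change record) at a time.
use std::error::Error;
use std::io::{BufRead, BufReader, Seek, SeekFrom, Write};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut tmp = tempfile::tempfile()?; // anonymous temp file, deleted on drop
    let mut resp = reqwest::get("https://wdrc.example/api?action=changes&format=jsonl").await?;

    // Stream the response to disk chunk by chunk; nothing big is held in memory.
    while let Some(chunk) = resp.chunk().await? {
        tmp.write_all(&chunk)?;
    }

    // Read it back line by line, using memory for a single entry only.
    tmp.seek(SeekFrom::Start(0))?;
    for line in BufReader::new(tmp).lines() {
        let record: serde_json::Value = serde_json::from_str(&line?)?;
        // ... un-match deleted items, follow redirects, set new matches, etc.
        let _ = record;
    }
    Ok(())
}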

Dude, where are my pictures?

Like many people, I take a lot of pictures with my phone. After all, the best camera is the one you have with you, and phone cameras produce high-quality pictures these days. And like many Wikimedia volunteers, sometimes I take pictures thinking “this should go on Commons/Wikipedia/Wikidata”. But then days pass, and I don’t remember which things I took a picture of, or it becomes too much of a hassle to find, upload, and use them in Wikimedia projects.

Now, I back up the pictures from my phone to a disk at home automatically, instead of to the cloud. Also, I have the geolocation in pictures set to “on”. As a result, I have (tens of) thousands of pictures on my disk that have a geolocation in their EXIF data. However, that’s where it ends; the data is just hidden there. So recently (well, yesterday), I wrote a little command-line tool called img_coords, which can scan a directory structure for pictures with EXIF location data, and aggregate that data into a single file (KML or GeoJSON), using the file path as the point label.
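
For reference, the output is roughly this kind of GeoJSON (one Point feature per picture, file path as label); the exact property names img_coords writes are an assumption on my part:

// Sketch: build a GeoJSON FeatureCollection from (path, lat, lon) tuples.
use serde_json::json;

fn main() {
    let pictures = vec![
        ("photos/2023/castle.jpg", 52.5186, 13.3761), // (path, lat, lon) from EXIF
        ("photos/2023/museum.jpg", 52.5200, 13.4050),
    ];
    let features: Vec<_> = pictures
        .iter()
        .map(|(path, lat, lon)| {
            json!({
                "type": "Feature",
                "geometry": { "type": "Point", "coordinates": [lon, lat] }, // GeoJSON order is [lon, lat]
                "properties": { "name": path }
            })
        })
        .collect();
    let collection = json!({ "type": "FeatureCollection", "features": features });
    println!("{}", serde_json::to_string_pretty(&collection).unwrap());
}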

As a final step, I added a new function to my trusty WikiShootMe! tool. “Load GeoJSON” from the burger dropdown menu lets you load a valid GeoJSON file, which then gets displayed as an additional layer. This includes GeoJSON you create with the img_coords tool. You can then see the locations where you took pictures, next to Wikidata items (with/without image), Wikipedia articles, etc. Clicking on one of the blue GeoJSON dots shows the label, in my case the path of the file on my local disk. For Mac/Linux, the label turns into a link that you can copy&paste into a new tab (it’s a local file, so your browser won’t just let you click on it). You can now easily find Wikipedia articles and Wikidata items you have taken a picture of, by location. Happy uploading!

UPDATE: Now with thumbnails!

Cram as cram can

So I am trying to learn (modern) Greek, for reasons. I have books, and online classes, and the usual apps. But what I was missing was a simple way to rehearse common words. My thoughts went to Wikidata items, and then to lexemes. Lexemes are something I have not worked with a lot, so this seemed like a good opportunity. Pauken (German for to cram) is the result.

I decided to start with nouns. Nouns usually have a corresponding Wikidata item, like “dog”. A small number of (somehow related) items makes for a nice learning exercise. I could just take the label in the language to be learned, and be done. Or, I could make it more complicated! So, given a small set of items to learn, I am using the “item for this sense” property on lexemes to link them (a query along these lines is sketched below, after the list of modes). Then, I use:

  • lexeme label, pronunciation audio, and grammatical gender
  • item image, and (as a fallback) label

to generate entries on a page. These will show the image from the item, or, as a fallback, the label in the “known” language (which I determine from the browser settings). There are currently two modes:

  • Reveal shows you the entries, with images and audio (where available), but with the label as “???”. You can try to recall the correct label, then click on the image, and the label (and grammatical gender) will be revealed. Another click hides it again.
  • Pick hides everything except the image, and offers an audio “play” button, with a randomly chosen pronunciation audio. Play the audio, then click the correct image; it will mark a correct answer in green, or a wrong one in red (with the correct one in blue). The color marking will disappear after a moment, and the next random audio will play automatically (so you don’t have to click the “play” button all the time).
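
For the curious, the lexeme lookup can be expressed as a SPARQL query along these lines, here run from Rust against the Wikidata Query Service. Apart from “item for this sense” (P5137), the IDs are my assumptions about suitable choices (P443 pronunciation audio, P5185 grammatical gender, Q36510 Modern Greek, Q144 “dog”), not necessarily what Pauken actually uses:

// Sketch: find Greek lexemes whose senses link to a given item via P5137.
use std::error::Error;

const QUERY: &str = r#"
SELECT ?lexeme ?lemma ?audio ?gender WHERE {
  ?lexeme dct:language wd:Q36510 ;      # Modern Greek (assumed item ID)
          wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 wd:Q144 .            # item for this sense: "dog"
  OPTIONAL { ?lexeme ontolex:lexicalForm/wdt:P443 ?audio . }  # pronunciation audio (on forms)
  OPTIONAL { ?lexeme wdt:P5185 ?gender . }                    # grammatical gender
}
"#;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let result: serde_json::Value = reqwest::Client::new()
        .get("https://query.wikidata.org/sparql")
        .query(&[("query", QUERY), ("format", "json")])
        .header(reqwest::header::USER_AGENT, "pauken-sketch/0.1 (example)")
        .send()
        .await?
        .json()
        .await?;
    println!("{}", serde_json::to_string_pretty(&result)?);
    Ok(())
}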

This is a simple working demo, but already quite useful IMHO. But beyond learning a language, you are also encouraged to add and edit lexemes; a missing lexeme, or rather no lexeme linking to the item, will show as a traditional red link. A lexeme with missing pronunciation audio will show a microphone, linking to lingualibre.org, so you can record the missing audio.

Also, the example links on the main page of the tool can be improved by editing Wikidata.

Musings on the backend

So the Wikidata query service (WDQS), currently powered by Blazegraph, does not perform well. Even simple queries time out, it lags behind the live site, and individual instances can (and do) silently go out of sync. The WMF is searching for a replacement.

One of the proposed alternatives to Blazegraph is Virtuoso. It models the graph structure of Wikidata in MySQL. Even before I read about that, I had a similar thought. But unlike Virtuoso, which is a general solution, Wikidata is a special case, with optimization potential. So I went to do a bit of tinkering. The following was done in the last few days, so it’s not exactly pretty, but initial results are promising.

I downloaded a test dataset of ~108K RDF triples from Wikidata. I then wrote some Rust code to (hackishly) parse the triples, and transform them. I am creating (on demand, so to speak) MySQL tables, based on the RDF predicate, which (often enough) is the Wikidata property used. For example, the data__PropertyDirect_P31__Entity__Entity table contains the entity-to-entity mapping for P31 (instance of). Since we know that the entire table is about P31, we don’t have to encode the property in the rows; two columns of VARCHAR(64) are enough. This could be compressed further (two INT columns, plus an enum to encode Q/P/M/L and the sub-lexeme types, should do), but I’m keeping it simple for now.
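
To make the table layout concrete, here is a tiny Rust sketch of the predicate-to-table mapping and the DDL it would generate; the column names k0/v0 follow the generated SQL shown further below, while the primary key is my assumption (something unique has to back the INSERT IGNORE):

// Sketch of the "one table per predicate" idea; the real code surely differs.
fn table_name(predicate: &str) -> Option<String> {
    // e.g. http://www.wikidata.org/prop/direct/P31 -> data__PropertyDirect_P31__Entity__Entity
    let prop = predicate.strip_prefix("http://www.wikidata.org/prop/direct/")?;
    Some(format!("data__PropertyDirect_{prop}__Entity__Entity"))
}

fn create_table_sql(table: &str) -> String {
    format!(
        "CREATE TABLE IF NOT EXISTS `{table}` (
            `k0` VARCHAR(64) NOT NULL,
            `v0` VARCHAR(64) NOT NULL,
            PRIMARY KEY (`k0`,`v0`)
        )"
    )
}

fn main() {
    let t = table_name("http://www.wikidata.org/prop/direct/P31").unwrap();
    println!("{}", create_table_sql(&t));
    println!("INSERT IGNORE INTO `{t}` (`k0`,`v0`) VALUES ('Q241961','Q5')");
}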

Importing the 108K triples into MySQL creates 1434 tables. MySQL should scale to millions of tables, so no problem. The initial import takes 8 minutes, but that seems to be due to the table creation, which would only happen initially on a full Wikidata import. Indeed, re-running the exact same import finishes in 48 seconds (data won’t change, thanks to INSERT IGNORE). I don’t know how much of this is network or database; commenting out the INSERT command on a re-import processes the 108K entries in 2.5 seconds, so that’s the parsing speed, thanks to the generous use of multi-threaded async Rust.

Adding (and removing) RDF triples as they change on Wikidata should therefore be no problem; I suspect changes could be reflected with barely measurable delay. As for querying, Wikipedia uses read-only replicas, which can be scaled to demand.

But can we actually query? Yes we can! I wrote a “simple case” query generator that takes triples, and can “connect” them, right now with “and” (the “.” in SPARQL). So:

let mut qt1 = QueryTriples::from_str(&app,"?person","wdt:P31","wd:Q5").await?;
let qt2 = QueryTriples::from_str(&app,"?person","wdt:P21","wd:Q6581072").await?;
qt1.and(&qt2)?;
let result = qt1.run().await?;

asks for all people (“P31:Q5”) that are female (“P21:Q6581072”). These triples are transformed into individual SQL queries, which then are joined. The query generated in this case is:

SELECT t1.person 
FROM (SELECT k0 AS `person` FROM `data__PropertyDirect_P31__Entity__Entity` WHERE `v0`="Q5") AS t1
INNER JOIN (SELECT k0 AS `person` FROM `data__PropertyDirect_P21__Entity__Entity` WHERE `v0`="Q6581072") AS t2
ON t1.person=t2.person;

In the test dataset, there is one result (Q241961). The query runs in ~100ms (on the Toolforge shared tool database, from remote). The same query on Blazegraph takes about a minute (yes, it yields 2 million results, but even with all that data loaded into this database, the SQL query would be the same, and an indexed INNER JOIN is probably as fast as we can get it).

This implementation is obviously far from complete, and could be optimized a lot. I do not have the time to complete it, especially since it would likely not be used anyway. I guess the days of my weekend hack becoming production code are over. But maybe it can give inspiration to others.

You can find my code here, and if you have a Toolforge account, you can point it to the ss55360__wdqsbe_p database.

Lists. The plague of managing things. But also surprisingly useful for many tasks, including Wikimedia-related issues. Mix’n’match is a list of third-party entries. PetScan generates lists from Wikipedia and Wikidata. And Listeria generates lists on-wiki.

But there is a need for generic, Wikimedia-related, user-curated lists. In the past, I have tried to quell that demand with PagePile, but the community wants more. So here you go: I present GULP, the Generic Unified List Processor. At the moment, this is a work in progress, but is already usable, though lacking many nice-to-have features.

How is this different from PagePile? Let’s see:

  • A list can have multiple columns (so, a table, technically)
  • Each column has a defined type (Wiki page, Location, Plain text)
  • Lists can be updated (file upload, load from URL)
  • Lists can be edited by the list owner (and by others, to be implemented)
  • Lists can have snapshots; a snapshot is like a version number, but it does not create a copy of the list; rather, further changes to the list are logged as such, and the snapshot remains untouched (a possible model is sketched after this list)
  • Lists can be downloaded in various formats, and accessed via API
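
One way the snapshot behaviour could be modelled (my guess at an implementation, not necessarily how GULP does it): rows carry “created in”/“removed in” snapshot markers, so reading an older snapshot never needs a copy of the list. A minimal Rust sketch:

// Hypothetical snapshot model: rows are never copied; they carry markers instead.
struct Row {
    cells: Vec<String>,
    created_in: u32,          // snapshot ID when the row was added
    removed_in: Option<u32>,  // snapshot ID when the row was removed (if ever)
}

// The view of the list at a given snapshot.
fn rows_at_snapshot(rows: &[Row], snapshot: u32) -> Vec<&Row> {
    rows.iter()
        .filter(|r| r.created_in <= snapshot && r.removed_in.map_or(true, |d| d > snapshot))
        .collect()
}

fn main() {
    let rows = vec![
        Row { cells: vec!["Q1".into()], created_in: 1, removed_in: None },
        Row { cells: vec!["Q2".into()], created_in: 1, removed_in: Some(2) },
        Row { cells: vec!["Q3".into()], created_in: 2, removed_in: None },
    ];
    println!("snapshot 1 has {} rows", rows_at_snapshot(&rows, 1).len()); // 2
    println!("snapshot 2 has {} rows", rows_at_snapshot(&rows, 2).len()); // 2 (Q2 removed, Q3 added)
}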

You can already create your own lists, update and modify them, and they are available for public download. Have a look at some demo lists:

Next steps for GULP include:

  • Adding/removing rows
  • More data sources and output formats
  • More column types
  • All API actions (including editing etc) will be available via a token system, so you can write scripts to edit your lists

For bug reports and feature requests, feel free to use the issue tracker. Also, if you want to integrate GULP into your tools, please let me know!

Turn the AC on

A large part of Wikidata is the collection of external identifiers for items. For some item types, such as items about people (Q5), some of this is what is known as Authority Control (AC) data, for example, VIAF (P214). One thing that distinguishes AC data from other external IDs is that AC data sources are often available in machine-readable form. This can be used for Wikidata in several ways:

  • to add new statements (eg occupation or birth place of a person)
  • to add new references to existing statements
  • to find and add more AC identifiers

Over the years, I wrote several bespoke tools and scripts that would query one of these AC websites, and add bits and pieces to Wikidata items, but I always wanted a more unified solution. So I finally got around to it and wrote a new tool on Toolforge, AC2WD. This presents an API to:

  • query multiple AC sources via an ID
  • create new Wikidata items (in memory, not on Wikidata!) from each source
  • merge several such “virtual” items into one (the sum of all AC data knowledge)
  • construct a “diff”, a JSON structure containing instructions to add new information (statements, references) to an existing Wikidata item

Given an existing Wikidata item ID, the API will extract existing AC identifiers from the item, and load information from the respective AC sources. It will then check the new information for new, usable AC identifiers, and repeat. Once all possible AC data has been loaded, it will return the “diff” data to extend the item with the new information (via https://ac2wd.toolforge.org/extend/Q_YOUR_ITEM), to be used with the wbeditentity action of the Wikidata API.
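
A rough Rust sketch of that round trip (the item ID and the CSRF token are placeholders; authentication and error handling are omitted):

// Sketch: fetch the "diff" from AC2WD and apply it with wbeditentity.
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let item = "Q42"; // example item
    let client = reqwest::Client::new();

    // 1. Get the diff JSON from the AC2WD API
    let diff: String = client
        .get(format!("https://ac2wd.toolforge.org/extend/{item}"))
        .send()
        .await?
        .text()
        .await?;

    // 2. Apply it to the existing item via the Wikidata API
    let csrf_token = "CSRF_TOKEN_HERE"; // obtain via action=query&meta=tokens on an authenticated session
    let params = [
        ("action", "wbeditentity"),
        ("id", item),
        ("data", diff.as_str()),
        ("token", csrf_token),
        ("format", "json"),
    ];
    let resp = client
        .post("https://www.wikidata.org/w/api.php")
        .form(&params)
        .send()
        .await?
        .text()
        .await?;
    println!("{resp}");
    Ok(())
}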

Now this is all very technical for most users, so I wrote a little JavaScript utility for Wikidata, rather predictably also called AC2WD (usage instructions on the page). Any Wikidata item with at least one supported AC property will have an “AC2WD” link in the tool sidebar. If you click on it, it will fetch the “diff” from the Toolforge tool, and attempt to make an edit to add new information (reload the page to see it, or check out the history to see the changes in detail).

I am happy to extend the API to support new AC sources (that have a Wikidata property!), please let me know your favourite candidates. I may also be able to extract more information from certain AC sources than I currently do; again, please let me know if you have a preference.

Trust in Rust

So Toolforge is switching from grid engine to Kubernetes. This also means that tool owners such as myself need to change their tool background jobs to the new system. Mix’n’match was my tool with the most diverse job setup. But resource constraints and the requirement to “name” jobs meant that I couldn’t just port things one-to-one.

Mix’n’match has its own system of jobs that run once or on a regular basis, depend on other jobs finishing before them, etc. For the grid engine, I could start a “generic” job every few minutes that would pick up the next job and run it, with plenty of RAM assigned. Kubernetes resource restrictions make this impossible. So I had to refactor/rewrite several jobs, and make them usable as PHP classes, rather than individual scripts to run.

Mix’n’match classes have become rather significant in code size, with >10K lines of code. Unsurprisingly, despite my best efforts, jobs got “stuck” for no apparent reason, bringing the whole system to a halt. This made new Mix’n’match catalogs in particular rather unusable, with no automated matches etc.

Rather than fiddling with the intricacies of a hard-to-maintain codebase, I decided to replace the failing job types with new Rust code. This is already live for several job types, mainly preliminary match and person name/date match, and I am adding more. Thanks to the easy multi-threading and async/await capabilities of Rust, many jobs can run in parallel in a single process. One design feature for the new code is batched processing, so memory requirements are low (<200MB) even for multiple parallel jobs. Also, jobs now keep track of their position in the batch, and can resume if the process is stopped (eg to deploy new code).
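
The batching pattern boils down to something like this (a sketch with hypothetical names and a simulated workload; the real jobs obviously do more):

// Sketch: process entries in fixed-size batches, persisting the current
// position so a stopped job can resume where it left off.
const BATCH_SIZE: usize = 5_000;
const TOTAL_ENTRIES: usize = 12_345; // stand-in for "rows left to process"

fn process_batch(offset: usize) -> usize {
    // ... load up to BATCH_SIZE entries starting at `offset`, try to match them, etc.
    BATCH_SIZE.min(TOTAL_ENTRIES.saturating_sub(offset))
}

fn load_position() -> usize {
    std::fs::read_to_string("job.pos").ok().and_then(|s| s.trim().parse().ok()).unwrap_or(0)
}

fn save_position(pos: usize) {
    let _ = std::fs::write("job.pos", pos.to_string());
}

fn main() {
    let mut offset = load_position();
    loop {
        let done = process_batch(offset);
        offset += done;
        save_position(offset); // after a restart (eg a new deployment), we resume here
        if done < BATCH_SIZE {
            break; // last (partial) batch processed
        }
    }
    println!("processed up to entry {offset}");
}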

I strongly doubt I will replace the entire code base, especially since much of the scraping code involves user-supplied PHP code that gets dynamically included. But safe, fast, and reliable Rust code serves its purpose in this complex tool.