My Mix’n’match tool helps match third-party catalogs to Wikidata items. Now, things happen on Mix’n’match and Wikidata in parallel, amongst them:
- Wikidata items are deleted
- Wikidata items are merged, leaving one to redirect to the other
- External IDs are added to Wikidata
This leads to the states of Mix’n’match and Wikidata diverging over time. I already had some automated measures in place to keep them in sync, and there is a “manual sync” function on each catalog that has a Wikidata property, but this is not ideal, especially since deleted/redirected items can show up as mismatches in various places.
I previously blogged about another tool of mine, Wikidata Recent Changes (WDRC), which records fine-grained changes of Wikidata items, and offers an API to check on them. The other day, I added recording of item creations, deletions, and redirect creations. So it was finally time to use WDRC myself. Every 15 minutes, Mix’n’match now
- un-matches all deleted items
- points matches to redirected items to the redirect target
- checks all catalogs with a Wikidata property for new external IDs on Wikidata, and sets them as matches (unless the Mix’n’match entry is already matched)
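The logic of those three steps can be sketched roughly as follows. This is a simplified illustration, not the actual Mix’n’match code: the event and match data shapes (`type`, `item`, `entry` keys, a `matches` dict from entry ID to Wikidata item) are made up for the example.

```python
# Sketch of the 15-minute sync pass. `matches` maps a Mix'n'match entry ID
# to a Wikidata item ID (or None if unmatched); `events` is a list of WDRC
# change records; `redirects` maps a redirected item to its target.
# All data shapes here are hypothetical.

def apply_events(matches, events, redirects):
    """Apply deletions, redirects, and new external IDs to the match table."""
    for ev in events:
        if ev["type"] == "delete":
            # Un-match every entry pointing at the deleted item
            for entry, item in matches.items():
                if item == ev["item"]:
                    matches[entry] = None
        elif ev["type"] == "redirect":
            # Point matches at the redirect target instead
            target = redirects[ev["item"]]
            for entry, item in matches.items():
                if item == ev["item"]:
                    matches[entry] = target
        elif ev["type"] == "new_external_id":
            # A new external ID on Wikidata becomes a match,
            # unless the entry is already matched
            entry = ev["entry"]
            if matches.get(entry) is None:
                matches[entry] = ev["item"]
    return matches

# Example run with invented data:
matches = {"e1": "Q1", "e2": "Q2", "e3": None}
events = [
    {"type": "delete", "item": "Q1"},
    {"type": "redirect", "item": "Q2"},
    {"type": "new_external_id", "entry": "e3", "item": "Q5"},
]
result = apply_events(matches, events, {"Q2": "Q3"})
# e1 is un-matched, e2 now points at Q3, e3 is matched to Q5
```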
Please note that this applies only to edits from now on; there may be many Mix’n’match entries that are still matched to deleted/redirected items on Wikidata. But, one thing at a time.
Technical remark: I am using the JSONL output format of WDRC, for several reasons:
- On the WDRC side, the lines are just printed out as they are generated, so there is no need to cache some giant result set
- On the consumer side (Mix’n’match), I can stream the API result into a temporary file, which consumes no memory. Then, I read the file line-by-line, using only memory for a single entry
This way, I can request (almost) unlimited output from the API, and process it reliably, with very little resources (which are at a premium these days on Toolforge).
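The consumer side described above could look something like this minimal sketch. The chunked-response interface and entry handler are assumptions for the example; the point is that the full API result never lives in memory, only one line at a time.

```python
import json
import tempfile

def process_jsonl_stream(chunks, handle_entry):
    """Stream JSONL response chunks to a temp file, then parse line by line.

    `chunks` stands in for a streamed HTTP response body (e.g. iterating
    over a chunked download); `handle_entry` receives one parsed entry
    at a time, so memory use stays constant regardless of result size.
    """
    with tempfile.TemporaryFile(mode="w+") as tmp:
        # Phase 1: stream the response to disk, chunk by chunk
        for chunk in chunks:
            tmp.write(chunk)
        tmp.seek(0)
        # Phase 2: read back one JSON object per line
        for line in tmp:
            line = line.strip()
            if line:
                handle_entry(json.loads(line))

# Example with invented data; chunk boundaries need not align with lines:
seen = []
process_jsonl_stream(['{"item": "Q1"}\n{"it', 'em": "Q2"}\n'], seen.append)
```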