
In my last blog post, “The Big Ones”, I wrote about my attempts to import large, third-party datasets and to synchronize them with Wikidata. I have since imported three datasets (BNF, VIAF, GND), and created a status page to keep a public record of what I have done and what I am trying to do.

I have run a few bots by now, mainly syncing identifiers back and forth. I have put a few security measures (aka “data paranoia”) into the code, so that if there is a collision between the third-party dataset and Wikidata, no edit takes place. But these conflicts can highlight problems: Wikidata may be wrong, the third-party data supplier may be wrong, there may be a duplicate Wikidata item, or there may be some other, more complex issue. So it would be foolish to throw away such findings!
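The rule itself is simple. A minimal sketch of that “data paranoia” check might look like this; it is an illustration only, not the actual bot code, and the function name and return values are my own:

```python
# Sketch of the "data paranoia" rule: decide what to do with one identifier,
# given the value Wikidata already has and the value from the third-party dataset.

def decide_action(wikidata_value, external_value):
    """Return 'add', 'skip', or 'report_conflict' for a single identifier."""
    if wikidata_value is None:
        return "add"              # Wikidata has no value yet: safe to write
    if wikidata_value == external_value:
        return "skip"             # already in sync, nothing to do
    return "report_conflict"      # values disagree: make no edit, log an issue instead

# Example: the external dataset and Wikidata disagree -> no edit, just a report.
print(decide_action("118540238", "118540239"))  # report_conflict
print(decide_action(None, "118540238"))         # add
```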

But how to use them? I had started with a bot updating a Wikidata page, but that approach has problems: most importantly, there is no way of marking an issue as “resolved”, but also the constant stream of edits, the overwriting of Wikidata users’ edits, lists too long for wikitext pages, and so on.

So I started collecting the issue reports in a new database table, and I have now written a small tool around it. You can list and filter issues by catalog, property, issue type, status, etc. Most importantly, you can mark an issue as “done” (OAuth login required), so that it will not show up for other users again (unless they want it to). During some light testing, I have already found and merged two pairs of duplicate Wikidata items.
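To give an idea of the shape of such an issue table and the “done” flag, here is a rough sketch using SQLite; the column names and values are assumptions for illustration, not the tool’s actual schema:

```python
# Hypothetical issue table: one row per conflict found by the sync bots,
# filterable by catalog, property, issue type, and status.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE issues (
        id          INTEGER PRIMARY KEY,
        catalog     TEXT,                  -- e.g. 'VIAF', 'GND', 'BNF'
        property    TEXT,                  -- Wikidata property, e.g. 'P227'
        item        TEXT,                  -- Wikidata item, e.g. 'Q42'
        issue_type  TEXT,                  -- e.g. 'mismatch', 'duplicate'
        status      TEXT DEFAULT 'open'    -- 'open' or 'done'
    )
""")

# List open issues for one catalog and property (the tool's filtered view).
rows = conn.execute(
    "SELECT id, item, issue_type FROM issues "
    "WHERE catalog = ? AND property = ? AND status = 'open'",
    ("GND", "P227"),
).fetchall()

# Mark an issue as done, so it no longer shows up for other users by default.
conn.execute("UPDATE issues SET status = 'done' WHERE id = ?", (1,))
conn.commit()
```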

There is much to do and improve in the tool, but I am about to leave for WikidataCon, so further work will have to wait a few days. Until then, enjoy!