Mix’n’match is one of my more popular tools. It contains a number of catalogs, each holding hundreds to millions of entries that could (and often should!) have a corresponding Wikidata item. The tool offers various ways to make it easier to match an entry in a catalog to a Wikidata item.
While the user-facing end of the tool does reasonably well, the back-end has become a bit of an issue. It is a bespoke, home-grown MySQL database that has changed a lot over the years to incorporate more (and more complex) metadata alongside the core data of the entries. Entries, birth and death dates, coordinates, and third-party identifiers are all stored in separate, dedicated tables. So is full-text search, which is not exactly performant these days.
Perhaps the biggest issue, however, is the bottleneck in maintaining that data: myself. Since I am the only person with write access to the database, all maintenance operations have to run through me. And even though I have added import functions for new catalogs, and run various automatic update and maintenance scripts on a regular basis, the simple task of updating an existing catalog still depends on me, and it is rather tedious work.
At the 2017 Wikimania in Montreal, I was approached by the WMF about Mix’n’match; the idea was that they would start their own version of it, in collaboration with some of the big providers of what I call catalogs. My recommendation to the WMF representative was to use Wikibase, the data management engine underlying Wikidata, as the back-end, to allow for community-based maintenance of the catalogs, and to put a task-specific interface on top of that, to make the matching as easy as possible.
As it happens with the WMF, a good idea vanished somewhere in the mills of bureaucracy, and was never heard from again. I am not a system administrator (or, let’s say, it is not the area where I traditionally shine), so setting up such a system myself was out of the question at that time. However, these days, there is a Docker image by the German chapter that incorporates MediaWiki, Wikibase, Elasticsearch, the Wikibase SPARQL service, and QuickStatements (so cool to see one of my own tools in there!) in a single package.
Long story short, I set up a new Mix’n’match using Wikibase as the back-end.
The interface is similar to the current Mix’n’match (I’ll call it V1, and the new one V2), but a complete re-write. It does not support all of the V1 functionality – yet. I have set up a single catalog in V2 for testing, one that is also in V1. Basic functionality in V2 is complete, meaning you can match (and unmatch) entries in both Mix’n’match and Wikidata. Scripts can import matches from Wikidata, and do (preliminary) auto-matches of entries to Wikidata, which need to be confirmed by a user. This, in principle, is similar to V1.
There are a few interface perks in V2. There can be more than one automatic match for an entry, and they are all shown as a list; one can set the correct one with a single click. And manually setting a match will open a full-text Wikidata search drop-down inline, often sparing one the need to search on Wikidata and then copy the QID to Mix’n’match. Also, the new auto-matcher takes the type of the entry (if any) into account: given a type Qx, only name-matched Wikidata items that are either an “instance of” (P31) Qx (or one of its subclasses), or that have no P31 at all, are used as matches; that should improve auto-matching quality.
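To make that rule concrete, here is a minimal sketch (in Python, not the actual V2 matcher code) of such a type filter, run against the public Wikidata SPARQL endpoint; the function name and the example candidates are purely illustrative:

```python
# Minimal sketch of the type filter described above (not the actual V2 code):
# keep candidates that are an instance of the given type (or of a subclass of it),
# or that have no P31 statement at all.
import requests

WDQS = "https://query.wikidata.org/sparql"

def filter_by_type(candidate_qids, type_qid):
    values = " ".join(f"wd:{q}" for q in candidate_qids)
    query = f"""
    SELECT DISTINCT ?item WHERE {{
      VALUES ?item {{ {values} }}
      OPTIONAL {{ ?item wdt:P31 ?class }}
      FILTER( !BOUND(?class) || EXISTS {{ ?class wdt:P279* wd:{type_qid} }} )
    }}"""
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "mnm-type-filter-example/0.1"})
    r.raise_for_status()
    return {b["item"]["value"].rsplit("/", 1)[-1]
            for b in r.json()["results"]["bindings"]}

# Example: keep only name-match candidates that are humans (Q5) or untyped.
# filter_by_type(["Q42", "Q937"], "Q5")
```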
But the real “killer app” lies in the fact that everything is stored in Wikibase items. All of Mix’n’match can be edited directly in MediaWiki, just like Wikidata. Everything can be queried via SPARQL, just like Wikidata. Mass edits can be done via QuickStatements, just like… well, you get the idea. But users will just see the task-specific interface, hiding all that complexity, unless they really want to peek under the hood.
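Just to illustrate that point (the endpoint URL and property ID below are placeholders, not the actual V2 schema), querying the V2 back-end works the same way as querying Wikidata:

```python
# Illustration only: list a few matched entries from the V2 Wikibase via SPARQL.
# The endpoint URL and the "Wikidata QID" property (P2 here) are placeholder
# assumptions, not the real V2 configuration.
import requests

V2_SPARQL = "https://mixnmatch.wmflabs.org/query/sparql"       # placeholder URL
MATCH_PROP = "<https://mixnmatch.wmflabs.org/prop/direct/P2>"  # hypothetical property

query = f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entry ?entryLabel ?qid WHERE {{
  ?entry {MATCH_PROP} ?qid .
  ?entry rdfs:label ?entryLabel .
  FILTER(LANG(?entryLabel) = "en")
}}
LIMIT 10
"""

r = requests.get(V2_SPARQL, params={"query": query, "format": "json"})
for row in r.json()["results"]["bindings"]:
    print(row["entryLabel"]["value"], "->", row["qid"]["value"])
```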
So much for the theory; sadly, I have run into some real-world issues that I do not know how to fix on my own (or do not have the time and bandwidth to figure out; same effect). First, as I know from bitter experience, MediaWiki installations attract spammers. Because I really don’t have time to clean up after spammers on this one, I have locked account creation and editing; that means only I can run QuickStatements on this wiki (let me know your Wikidata user name and email, and I’ll create an account for you, if you are interested!). Of course, this rather defeats the purpose of having the community maintain the back-end, but what can I do? Since the WMF has bowed out in silence, the wiki isn’t using the WMF single sign-on. The OAuth extension, which was originally developed for that specific purpose, ironically doesn’t work with MediaWiki as a client.
But how can people match entries without an account, you ask? Well, for the Wikidata side, they have to use my Widar login system, just like in V1. And for the V2 wiki, I have … enabled anonymous editing of the item namespace. Yes, seriously. I just hope that Wikibase data spamming still lies a bit in the future, for now. Your edits will still be credited to your Wikidata user name in edit summaries and statements. Yes, I log all edits as Wikibase statements! (Those are also used for the V2 Recent Changes, but since Wikibase only stores day-precision timestamps, Recent Changes looks a bit odd at the moment…)
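For the technically curious, this is roughly what recording a match through the standard Wikibase action API could look like, with the Wikidata user name credited in the edit summary; the wiki URL and property ID are placeholder assumptions, and this is a sketch, not the code V2 actually runs:

```python
# Sketch only: record a match on the V2 wiki via the standard Wikibase action API,
# crediting the Wikidata user name in the edit summary. The API URL and the
# "Wikidata QID" property (P2) are placeholder assumptions.
import json
import requests

API = "https://mixnmatch.wmflabs.org/w/api.php"  # placeholder URL

def set_match(entry_item, wikidata_qid, wikidata_user):
    s = requests.Session()
    # Even anonymous users need a CSRF token for edits.
    token = s.get(API, params={
        "action": "query", "meta": "tokens", "type": "csrf", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]
    return s.post(API, data={
        "action": "wbcreateclaim",
        "entity": entry_item,                 # e.g. "Q1234" on the V2 wiki
        "property": "P2",                     # hypothetical "Wikidata QID" property
        "snaktype": "value",
        "value": json.dumps(wikidata_qid),    # string value, JSON-encoded
        "summary": f"Match set by Wikidata user {wikidata_user} via Mix'n'match V2",
        "token": token,
        "format": "json",
    }).json()
```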
I also ran into a few issues with the Docker system, and I have no idea how to fix them. These include:
- Issues with QuickStatements (oh the irony)
- SPARQL linking to the wrong server
- Full-text search is broken (this also breaks the V2 search function; I am using prefix search for now)
- I have no idea how to back up/restore any of this (bespoke configuration, MySQL)
None of the above are problems with Mix’n’match V2 in principle, but rather engineering issues to fix. Help would be most welcome.
Other topics that would need work and thought include:
- Syncing back to Wikidata (probably easy to do).
- Importing of new catalogs, and updating of existing ones. I am thinking about a standardized interchange format, so I can convert from various input formats (CSV files, auto-scrapers, MARC 21, SPARQL interfaces, MediaWiki installations, etc.); see the sketch after this list.
- Metadata handling. I am thinking of a generic method of storing a Wikidata property ID (Px) and a corresponding value as Wikibase statements, possibly with a reference for the source. That would allow maximum flexibility for storage, matching, and import into Wikidata.
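To give an idea of what I mean by the last two points, here is a rough sketch of such an interchange record; the field names are invented for illustration, and nothing here is a finalized format:

```python
# Rough sketch of a standardized interchange record (field names are invented).
# Metadata is kept as generic (Wikidata property, value, optional source) triples,
# which map naturally onto Wikibase statements with references.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MetadataClaim:
    prop: str                         # Wikidata property ID, e.g. "P569" (date of birth)
    value: str                        # value in some serialized form
    source_url: Optional[str] = None  # optional reference for the claim

@dataclass
class CatalogEntry:
    external_id: str                  # identifier in the source catalog
    name: str
    description: str = ""
    entry_type: Optional[str] = None  # Wikidata QID of the expected type, e.g. "Q5"
    claims: List[MetadataClaim] = field(default_factory=list)

# The same record, whether the source was a CSV file, MARC 21, or a SPARQL endpoint:
entry = CatalogEntry(
    external_id="12345",
    name="Albert Einstein",
    description="physicist",
    entry_type="Q5",
    claims=[MetadataClaim("P569", "1879-03-14",
                          source_url="https://example.org/record/12345")],
)
```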
I would very much like to hear what you think about this approach, and this implementation. I would like to go ahead with it, unless there are fundamental concerns. V1 and V2 would run in parallel, at least for the time being. Once V2 has more functionality, I would import new catalogs into V2 rather than V1. Suggestions for test catalogs (maybe something with interesting metadata) are most welcome. And every bit of technical advice, or, better, hands-on help, would be greatly appreciated. And if the WMF or WMDE want to join in, or take over, let’s talk!
8 Comments
Wow. Will try in the next few days. Quick comment on account creation: isn’t the ConfirmAccount extension better, to manage manual requests?
Brilliant; thanks for all your work on this. I’m wanting to try matching PastScape up with Wikidata; might this be a candidate?
@Andrew I wrote new scraper code for V2, running now for PastScape: https://mixnmatch.wmflabs.org/interface/#/catalog/Q1215
Will take a while to import…
Hi Magnus. We’re enormous admirers at the Foundation of your work. Maybe I can help kickstart the discussion internally again. Please email me at jorlowitz@wikimedia.org. Best, Jake Orlowitz (Head of the Wikipedia Library)
1. Able to add more statements (using Wikidata properties) to matching entries once we can use these properties (You may query them, and if you create an item, you can fill in more things than just P31)
2. Define a licence for the backend instance (much of the data will be non-free as a whole, though individual items may not be copyrightable)
3. Try to resolve T110460 so that we can use Wikidata accounts
Hi Magnus, I think this approach has an enormous potential for extension – e.g.: Users may provide a SPARQL query on a “catalog” endpoint with result variables named after WD properties (e.g., ?P571 for the inception of a company) for matching.
I’d be happy to provide and optimize the 20th Century Press Archives companies subset, for which I’ve recently set up a SPARQL endpoint with accompanying documentation (see https://github.com/zbw/cdv2018-pressemappe20), as a test case. Cheers, Joachim
Another big advantage: The multilingual features of Wikibase could help improve the matching for “catalogs” which provide labels in multiple languages and/or synonyms (think of thesauri like EuroVoc). Cheers, Joachim
@Joachim: I agree, there is much potential. I have already started a “potential Wikidata property/value” schema, see here: https://mixnmatch.wmflabs.org/wiki/Item:Q43692
However, right now the SPARQL situation looks dire:
https://phabricator.wikimedia.org/T207133