
Recommendations

Reading Recommending Images to Wikidata Items by Miriam, which highlights areas of missing image coverage in Wikidata (despite it being the most image-complete site in the WikimediaVerse) and strategies to address the issue, I was reminded of an annoying problem I have run into a few times.

My WD-FIST tool uses (primarily) SPARQL to find items that might require images, and that usually works well. However, some larger queries do time out, either on SPARQL, or the subsequent image discovery/filtering steps. Getting a list of all items about women with image candidates occasionally works, but not reliably so; all humans is out of the question.
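For illustration, this is roughly the kind of query involved; a minimal sketch in Python against the public query service, not the actual WD-FIST query. The property and item IDs are the standard Wikidata ones (P31/Q5 for humans, P21/Q6581072 for women, P18 for image).

```python
# A minimal sketch (not the actual WD-FIST query): find items about women
# (P31=Q5, P21=Q6581072) that have no image (P18) yet, via the public
# Wikidata Query Service.
import requests

SPARQL = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P21 wd:Q6581072 .
  FILTER NOT EXISTS { ?item wdt:P18 ?image }
}
LIMIT 1000
"""

def run_query(query):
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wd-fist-example/0.1 (demo)"},
    )
    r.raise_for_status()
    return [b["item"]["value"] for b in r.json()["results"]["bindings"]]

print(len(run_query(SPARQL)), "items without an image")
```

Scale the LIMIT up (or drop it) and variants of this query start running into the query service timeout, which is exactly the problem the caching mechanism works around.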

So I started an extension to WD-FIST: A caching mechanism that would run some large queries in a slightly different way, on a regular basis, and offer the results in the well-known WD-FIST interface. My first attempt is “humans”, and you can see some results here. As of now, there are 275,500 candidate images for 160,508 items; the link shows you all images that are used on three or more Wikipedias associated with the same item (to improve signal-to-noise ratio).

One drawback of this system is that it includes some “false positive” items; because it bypasses SPARQL, it picks up some items that link to “human” (Q5), but not via “instance of” (P31). Also, matching an image to an item, or using “ignore” on an image, might not be reflected immediately on reload, but the daily update should take care of that.

Update code is here.

Everybody scrape now!

If you like Wikidata and working on lists, you probably know my Mix’n’match tool for matching entries in external catalogs to Wikidata. And if you are really into these things, you might have tried your luck with the import function, to add your own catalog.

But the current import page has some drawbacks: you need to adhere to a strict format which you can’t really test except by importing, your data is static and will never be updated, and, most importantly, you need to get the data in the first place. Sadly, many great sets of data are only exposed as web pages, and rescuing the data from a fate as mere tag filler is not an easy task.

I have imported many catalogs into Mix’n’match, some from data files, but most scraped from web pages. For a long time, I wrote bespoke scraper code for every website, and I still do that for some “hard cases” occasionally. But some time ago, I devised a simple (yeah, right…) JSON description to specify the scraping of a website. This includes the construction of URLs (a list of fixed keys, like letters? Numerical? Letters with numerical subpages? A start page to follow all links from?), as well as regular expressions to find entries on these pages (yes, I am using RegEx to parse HTML. So sue me.), including IDs, names, and descriptions. The beauty is that only the JSON changes for each website, but the scraping code stays the same.
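To give an idea of the approach (the field names below are invented for this sketch and do not match the actual Mix’n’match JSON format), the generic scraper boils down to something like this:

```python
# Illustration of the JSON-driven scraping idea; the field names are invented
# for this sketch and do not match the actual Mix'n'match format. Only the
# description changes per website, the scraping code stays the same.
import re
import requests

scraper_description = {
    # URL construction: here, one page per letter on a hypothetical site.
    "url_pattern": "https://example.org/catalog/{key}",
    "keys": list("abcdefghijklmnopqrstuvwxyz"),
    # Regular expression capturing ID, name, and description of each entry.
    "entry_regex": r'<a href="/entry/(\d+)">([^<]+)</a>\s*<span class="desc">([^<]*)</span>',
}

def scrape(description):
    entries = []
    for key in description["keys"]:
        url = description["url_pattern"].format(key=key)
        html = requests.get(url).text
        for ext_id, name, desc in re.findall(description["entry_regex"], html):
            entries.append({"id": ext_id, "name": name.strip(), "desc": desc.strip()})
    return entries
```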

This works surprisingly well, and I have over 70 Mix’n’match catalogs generated through this generic scraping mechanism. But it gets better: For smaller catalogs, with relatively few pages to scrape, I can just run the scraping again periodically, and add new entries to Mix’n’match, as they are added to the website.


But there is still a bottleneck in this approach: me. I am the only one who can create the JSON, add it to the Mix’n’match database, and run the scraping. It does take some time to devise the JSON, and even more testing to get it right. Wouldn’t it be great if everyone could create the JSON through a simple interface, test it, add it to Mix’n’match as a new (or existing) catalog, have it scrape a website, run automatic matching against Wikidata on top, and get automatic, periodic updates to the catalog for free?

Well, now you can. This new interface offers all options I am using for my own JSON-based scraping; and you don’t even have to see the JSON, just fill out a form, click on “Test”, and if the first page scrapes OK, save it and watch the magic happen.

I am aware that regular expressions are not everyone’s cup of decaffeinated, gluten-free green tea, and neither will be the idea of multi-level pattern-based URL construction. But you don’t get an (almost) universal web scraping mechanism for free, and the learning curve is the price to pay. I have included an example setup, which I used to create a new catalog.

Testing will get you the HTML of the first web page that your URL schema generated, plus all scraped entries. If there are too few or wrong entries, you can fiddle with the regular expressions in the form, and it will tell you live how many entries would be scraped by that. Once it looks all right, test again to see the actual results. When everything looks good, save it, done!

I do have one request: if the test does not look perfectly OK, do not save the scraper, because if the results are not to your liking, you will have to come to me to fix it. And fixing these things usually takes me a lot longer than doing them myself in the first place. So please, switch that underused common sense to “on”!

The flowering ORCID

As part of my Large Datasets campaign, I have now downloaded and processed the latest data from ORCID. This yielded 655,706 people (47,435 or 7% in Wikidata), and 13,438,786 publications (1,079,305 or 8% in Wikidata) with a DOI or PubMed ID (to be precise, these are publications-per-person, so the same paper might be counted multiple times; however, that’s still 1,033,146 unique Wikidata items, so not much of a difference).

Number of papers, ORCID, first, and last name

Looking at the data, there are 14,883 authors with ten or more papers already on Wikidata who either do not have an item, or whose item has no ORCID ID associated with it. So I am now setting a bot (my trusted Reinheitsgebot) to work at creating items for those authors, then changing the appropriate “author name string” statement to a proper “author” statement, preserving qualifiers and references, and adding the original name string as a new qualifier (like so).
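The conversion step, reduced to a sketch (this is not the actual Reinheitsgebot code, and reference copying is omitted; P2093 is “author name string”, P50 is “author”, and P1932 “stated as” holds the original name string):

```python
# Sketch only, not the actual Reinheitsgebot code: replace an
# "author name string" (P2093) statement on a paper item with a proper
# "author" (P50) statement, keeping the original name as a
# "stated as" (P1932) qualifier. Copying of references is omitted.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def convert_author(paper_qid, author_qid, author_name):
    item = pywikibot.ItemPage(repo, paper_qid)
    item.get()
    for claim in item.claims.get("P2093", []):
        if claim.getTarget() != author_name:
            continue
        new_claim = pywikibot.Claim(repo, "P50")
        new_claim.setTarget(pywikibot.ItemPage(repo, author_qid))
        item.addClaim(new_claim, summary="author name string -> author")
        qualifier = pywikibot.Claim(repo, "P1932")
        qualifier.setTarget(author_name)
        new_claim.addQualifier(qualifier)
        item.removeClaims([claim], summary="replaced by P50 statement")
```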

By chance, one of the most prolific authors of scientific publications not yet on Wikidata turned out to be a (distant) colleague of mine, Rick Price, who is now linked as the author of ~100 papers.

I have now set the bot to create the author items for the authors with >=10 papers on Wikidata. I am aware that ORCID authorships are essentially “self-reported”, but I do check that a paper is not claimed by two people with the same surname in the ORCID dataset (in which case I skip it). Please report any systematic (!) bot malfunctions to me through the usual channels.

Update: This will create up to 263,893 new author (P50) links on Wikidata.

In my last blog post “The Big Ones”, I wrote about my attempts to import large, third-party datasets, and to synchronize those with Wikidata. I have since imported three datasets (BNF, VIAF, GND), and created a status page to keep a public record of what I have done, and what I am trying to do.

I have run a few bots by now, mainly syncing identifiers back and forth. I have put a few security measures (aka “data paranoia”) into the code, so if there is a collision between the third-party dataset and Wikidata, no edit takes place. But these conflicts can highlight problems: Wikidata is wrong, the third-party data supplier is wrong, there is a duplicated Wikidata item, or there is some other, more complex issue. So it would be foolish to throw away such findings!
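The “data paranoia” principle is simple enough to show as a sketch (illustrative only, not the actual bot code):

```python
# The "data paranoia" principle as a sketch (not the actual bot code):
# only write an identifier if Wikidata has none; if Wikidata disagrees
# with the third-party dataset, record the conflict instead of editing.
def sync_identifier(item_qid, prop, external_value, wikidata_value, report):
    if wikidata_value is None:
        return ("edit", item_qid, prop, external_value)
    if wikidata_value == external_value:
        return ("ok", item_qid, prop, wikidata_value)
    report.append({
        "item": item_qid,
        "property": prop,
        "wikidata": wikidata_value,
        "external": external_value,
        "status": "open",      # can later be marked "done" in the issue tool
    })
    return ("conflict", item_qid, prop, None)
```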


But how to use them? I had started with a bot updating a Wikidata page, but that approach has problems: most importantly, there is no way to mark an issue as “resolved”, but there are also the constant edits, the overwriting of Wikidata user edits, lists too long for wikitext pages, and so on.

So I started collecting the issue reports in a new database table, and now I have written a small tool around that. You can list and filter issues by catalog, property, issue type, status, etc. Most importantly, you can mark an issue as “done” (OAuth login required), so that it will not show up for other users again (unless they want it to). Through some light testing, I have already found and merged two duplicated Wikidata item pairs.

There is much to do and improve in the tool, but I am about to leave for WikidataCon, so further work will have to wait a few days. Until then, enjoy!

The Big Ones

Update: After fixing an import error, and cross-matching of BNF-supplied VIAF data, 18% of BNF people are matched in Wikidata. This has been corrected in the text.

My mix’n’match tool holds a lot of entries from third-party catalogs – 21,795,323 at the time of writing. That’s a lot, but it doesn’t cover “the big ones” – VIAF, BNF, etc., which hold many millions of entries each. I could “just” (not so easy) import those, but:

  • Mix’n’match is designed for small and medium-sized entry lists, a few hundred thousand at best. It does not scale well to larger catalog sizes
  • Mix’n’match is designed to work with many different catalogs, so the database structure represents the least common denominator – ID, title, short description. Catalog-specific metadata gets lost, or is not easily accessible after import
  • The sheer number of entries might require different interface solutions, as well as automated matching tools

To at least get a grasp of how many entries we are dealing with in these catalogs, and inspired by the Project soweego proposal, I have used a BNF data dump to extract 1,637,195 entries (fewer than I expected) into a new database, one that will hopefully hold other large catalogs in the future. There is much to do; currently, only 295,763 entries (~18%) exist on Wikidata, according to the SPARQL query service.

As one can glimpse from the screenshot, I have also extracted some metadata into a “proper” database table. All this is preliminary; I might have missed entries or useful metadata, or gotten things wrong. For me, the important thing is that (a) there is some query-able data on Toolforge, and (b) the (re-)import and matching of the data is fully automated, so it can be re-run if something turns out to be problematic.

I shall see where I go from here. Obvious candidates include auto-matching (via names and dates) to Wikidata, and adding BNF references to relevant statements. If you have a Toolforge user account, you can access the new database (read-only) as s51434__mixnmatch_large_catalogs_p. Feel free to run some queries or build some tools around it!
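For example, from a Toolforge account, something along these lines should work (the ToolsDB host name is an assumption, as it has changed over the years, and the table names are left for you to discover):

```python
# A sketch for read-only access from a Toolforge account. The ToolsDB host
# name is an assumption, and table names are left to be discovered;
# credentials come from replica.my.cnf as usual.
import configparser
import os
import pymysql

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/replica.my.cnf"))

conn = pymysql.connect(
    host="tools.db.svc.eqiad.wmflabs",  # assumed ToolsDB host
    user=cfg["client"]["user"].strip("'"),
    password=cfg["client"]["password"].strip("'"),
    database="s51434__mixnmatch_large_catalogs_p",
)
with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        print(table)
```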

Dystopia 2030

The year is 2030. The place is Wikimedia. Maybe.

English Wikipedia was declared complete and set to read-only, after the creation of the 10 millionth article ([[Multidimensional Cthulhu monument at Dunwich]], including pictures from multiple dimensions). This coincides with the leaving of the last two editors, who only kept going for the honour of creating the 10M article.

German Wikipedia has shrunk to below 10,000 articles, after relentless culling of articles not complying with the high standards of the 50,000 page Manual of Style, or for being contaminated with information from Wikidata. Links to other languages have been removed, as the material found there is clearly inferior. All volunteer work now pours into improving the remaining articles, polishing completeness and language to superhuman levels. Several articles have won German literary awards, but all of them are virtually inaccessible for those under 25 years of age, who view pre-emoji writing as deeply suspicious, and refuse to read beyond the initial 140 characters.

Volunteer work on smaller language Wikipedias has ceased, as no one could keep up with the bots creating, changing, vandalising, and deleting articles based on third-party data.

Growth of Commons has come to a halt after the passing of the CRUD Act (Campaign Repressing UnAmerican [=free] Data), and the NIMROD Act (Not In My Reality, Open Data!), originally designed to prevent the escape of NASA climate change data to a more lenient jurisdiction (such as China), has made it impossible to move the project outside the US. Only scans of USSR-era motivational posters can be legally added.

Structured Data have been available on Commons for over ten years, but are not used, as it would be disrespectful to all the manual work that went into creating an intricate category system, such as [[Category:Demographic maps of 13-14 year old dependent children whose fathers speak another language and did not state proficiency in English and whose mothers speak another language and speak English not well or not at all in Australia by state or territory]].

Wikidata continues to grow in both item numbers and statements per item. Most statements are well referenced. However, no human has successfully edited the site in years, with flocks of admin-enabled AI bots reverting any such attempt, citing concerns about referential integrity.

Bot imports are going strong, with a recent focus on dystopian works with intelligent machines as the antagonist, as well as genetic data concerning infectious human diseases. Human experts are stumped by this trend, and independent AIs refuse to comment until “later”.

Wikispecies now contains a page about every taxon known to mankind. However, since the same information is available from Wikidata via a tool consisting of three lines of SPARQL and random images of goats, no one has actually requested a single Wikispecies page in the last five years. Project members are unconcerned by this, as they “cater to a very specific, more academic audience”.

Wikibooks has been closed, as books are often written by “experts”, who are considered suspicious. Wikisource has been deleted, with AI-based OCR far surpassing human abilities in that regard. Wikinews has been replaced by the government with the word “fake”. Wikiquote has been sold to the startup company “He said, she said”, which was subsequently acquired by Facebook for a trillion USD. No one knows if Wikiversity still exists, but that has been the case since 2015.


The above is an attempt at humour, but also a warning. Let’s not continue in the silos of projects small and large, but rather on the one connected project for free knowledge that is Wikimedia. Let’s keep project identities, but also connect to others where it makes sense. Let’s try to prevent the above.

ORCID mania

ORCID is an increasingly popular service to disambiguate authors of scientific publications. Many journals and funding bodies require authors to register an ORCID ID these days. Wikidata has a property for ORCID; however, only ~2,400 items have an ORCID statement at the time of writing this blog post. That is not a lot, considering Wikidata contains 728,112 scientific articles.

Part of the problem is that it is not easy to get ORCID IDs and their connections to publications in an automated fashion. It appears that several databases, public or partially public, each contain part of the puzzle required to determine the ORCID for a given Wikidata author.

So I had a quick look, and found that, on the ORCID web site, one can search for a publication DOI, and retrieve the list of authors in the ORCID system that “claim” that DOI. That author list contains variations on author names (“John”, “Doe”, “John Doe”, “John X. Doe” etc.) and their ORCID IDs. Likewise, I can query Wikidata for a DOI, and get an item about that publication; that item contains statements with authors that have an item (“P50”). Each of these authors has a name.

Now, we have two lists of authors (one from ORCID, one from Wikidata), both reasonably short (say, twenty entries each), that should overlap to some degree, and they are both lists of authors for the same publication. They can now be joined via name variations, excluding multiple hits (there may be two “John Doe”s in the author list of a publication; this happens a lot with Asian names), as well as excluding authors that already have an ORCID ID on Wikidata.
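Expressed as a (heavily simplified) sketch in Python: the ORCID endpoints and JSON fields below are assumptions based on the v2.1 public API, name matching is reduced to exact case-insensitive comparison, and null handling is omitted. P356 is the Wikidata DOI property.

```python
# Heavily simplified sketch of the DOI-based matching. The ORCID endpoints
# and JSON fields are assumptions (based on the v2.1 public API), name
# matching is exact and case-insensitive, and null handling is omitted.
import requests

HEADERS = {"Accept": "application/json"}

def orcid_ids_for_doi(doi):
    """ORCID iDs whose records claim this DOI (assumed search endpoint)."""
    r = requests.get(
        "https://pub.orcid.org/v2.1/search/",
        params={"q": f'doi-self:"{doi}"'},
        headers=HEADERS,
    )
    r.raise_for_status()
    return [hit["orcid-identifier"]["path"] for hit in r.json().get("result") or []]

def orcid_name(orcid):
    """Public name of an ORCID record, as 'given family' (assumed endpoint)."""
    r = requests.get(f"https://pub.orcid.org/v2.1/{orcid}/person", headers=HEADERS)
    r.raise_for_status()
    name = r.json()["name"]
    return f'{name["given-names"]["value"]} {name["family-name"]["value"]}'

def wikidata_authors_for_doi(doi):
    """Map author label -> QID for the paper with this DOI (P356, P50)."""
    query = """
    SELECT ?author ?authorLabel WHERE {
      ?paper wdt:P356 "%s" ; wdt:P50 ?author .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }""" % doi.upper()
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "orcid-match-example/0.1"},
    )
    r.raise_for_status()
    return {
        b["authorLabel"]["value"]: b["author"]["value"].rsplit("/", 1)[-1]
        for b in r.json()["results"]["bindings"]
    }

def matches(doi):
    wd_authors = wikidata_authors_for_doi(doi)
    for orcid in orcid_ids_for_doi(doi):
        name = orcid_name(orcid)
        hits = [qid for label, qid in wd_authors.items() if label.lower() == name.lower()]
        if len(hits) == 1:   # skip ambiguous matches (e.g. two "John Doe"s)
            yield hits[0], orcid
```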

I have written a bot that will take random DOIs from Wikidata, query them in ORCID, and compare the author lists. In a first run, 5,000 random DOIs yielded 123 new ORCID connections; manual sampling of the matches looked quite good, so I am adding them via QuickStatements (sample of edits).

Unless this meets with “social resistance”, I can have the bot perform these edits regularly, which would keep Wikidata up-to-date with ORCIDs.

Additionally, there is an “author name string” property, which for now stores just the author name, for authors that do not have an item yet. If the ORCID list matches one of these names, an item could automatically be created for that author, including the ORCID ID and the association with the publication item. Please let me know if this would be desirable.

Comprende!

tl;dr: I wrote a quiz interface on top of a MediaWiki/WikiBase installation. It ties together material from Wikidata, Commons, and Wikipedia, to form a new educational resource. I hope the code will eventually be taken up by a Wikimedia chapter, as part of an OER strategy.


The past

There have been many attempts in the WikiVerse to get a foot into the education domain. Wikipedia is used extensively in this domain, but it is more useful as an introduction to a topic, or as a reference, than as a learning tool. Wikiversity was an attempt to get into university-level education, but even I do not know anyone who actually uses it. Wikibooks has more and better contents, but many wikibooks are mere sub-stub equivalents, rather than usable, fully-fledged textbooks. There has been much talk about OER, offline content for internet-challenged areas, etc. But the fabled “killer app” has so far failed to emerge.

Enter Charles Matthews, who, like myself, is situated in Cambridge. Among other things, he organises the Cambridge Wikipedia meetup, and we do meet occasionally for coffee between those. In 2014, he started talking to me about quizzes. At the time, he was designing teaching material for Wikimedia UK, using Moodle, as a component in Wikipedia-related courses. He quickly became aware of the limitations of that software, which include (but are not limited to) general software bloat, significant hardware requirements, and hurdles in re-using questions and quizzes in other contexts. Despite all this, Moodle is rather widely used, and the MediaWiki Quiz extension does not exactly present itself as a viable replacement.

A quiz can be a powerful tool for education. It can be used by teachers and mentors to check on the progress of their students, and by the students themselves, to check their own progress and readiness for an upcoming test.

As the benefits are obvious, and the technical requirements appeared rather low, I wrote (at least) two versions of a proof-of-concept tool named wikisoba. The interface looked somewhat appealing, but storage was a sore point: the latest version uses JSON stored as a wiki page, which needs to be edited manually. Clearly not an ideal way to attract users these days.

Eventually, a new thought emerged. A quiz is a collection of “pages” or “slides”, each representing a question (of various types), or maybe a text to read beforehand. A question, in turn, consists of a title, a question text (usually), possible answers, etc. A question is therefore the main “unit”, and should be treated on its own, separate from other questions. Questions can then be bundled into quizzes; this allows for re-use of questions in multiple quizzes, maybe awarding different points (a question could yield high points in an entry-level quiz, but fewer points in an advanced quiz). The separation of question and quiz makes for a modular, scalable, reusable architecture. Treating each question as a separate unit is therefore a cornerstone of any successful system for (self-)teaching and (self-)evaluation.

It would, of course, be possible to set up a database for this, but then it would require an interface, constraint checking, all the things that make a project complicated and prone to fail. Luckily, there exists software that already offers adequate storage, querying, an interface, etc.: I speak of WikiBase, the MediaWiki extension used to power Wikidata (and soon Commons as well). Each question could be an item, with the details encoded in statements. Likewise, a quiz would be an item, referencing question items. WikiBase offers a powerful API to manage, import, and export questions; it comes with built-in openness.
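To make the idea concrete, a question item might conceptually look like the sketch below; the property IDs and value structures here are purely hypothetical and do not necessarily match the actual installation.

```python
# A purely hypothetical sketch of the data model; the property IDs and value
# structures are invented and do not necessarily match the actual setup.
question = {
    "labels": {"en": "Which organelle produces most of a cell's ATP?"},
    "claims": {
        "P1": "multiple-choice",                     # question type
        "P2": [                                      # possible answers
            {"text": {"en": "Mitochondrion"}, "correct": True},
            {"text": {"en": "Ribosome"}, "correct": False},
            {"text": {"en": "Golgi apparatus"}, "correct": False},
        ],
        "P3": {"article": "Mitochondrion", "section": "Function"},  # intro section
    },
}

quiz = {
    "labels": {"en": "Cell biology basics"},
    "claims": {
        # ordered list of question items, each with the points it awards here
        "P4": [("Q101", 5), ("Q102", 3), ("Q103", 10)],
    },
}
```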

The present

There is a small problem, however; the default WikiBase interface is not exactly appealing for non-geeks. Also, there is obviously no way to “play” a quiz in a reasonable manner. So I decided to use my recent experience with vue.js to write an alternative interface to MediaWiki/WikiBase, designed to generate questions and quizzes, and to play a quiz in a more pleasant way. The result has the working title Comprende!, and can be regarded as a fully functional, initial version of a WikiBase-driven question/quiz system. The underlying “vanilla” WikiBase installation is also accessible. To jump right in, you can test your biology knowledge!

There are currently three question types available:

  • Multiple-choice questions, the classic
  • “Label image” presents an image from Commons, letting you assign labels to marked points in the image
  • Info panels, presenting information to learn (to be interspersed with actual questions)

All aspects of the questions are stored in WikiBase; they can have a title, a short text, and an intro section; for the moment, the latter can be a specific section of a Wikipedia article (of a specific revision, by default), but other types (Commons images, for example) are possible. When used in “info panel” type questions (example), a lot of markup, including images, is preserved; for intro sections in other question types, it is simplified to mere text.

Live translating of interface text.

Wikidata is multi-lingual by design, and so is Comprende!. An answer or image label can be a text stored as multi-lingual (or “monolingual”, in WikiBase nomenclature) strings, or a Wikidata item reference, giving instant access to all the translations there. Also, all interface text is stored in an item, and translations can be done live within the interface.

Questions can be grouped and ordered into a quiz. Everyone can “play” and design a quiz (Chrome works best at the moment), but you need to be logged into the WikiBase setup to save the result. Answers can be added, dragged around to change the order, and each question can be assigned a number of points, which will be awarded based on the correct “sub-answers”. You can print the current quiz design (no need to save it), and most of the “chrome” will disappear, leaving only the questions; instant old-fashioned paper test!

While playing the quiz, one can see how many points they have, how many questions are left etc. Some mobile optimisations like reflow for portrait mode, and a fixed “next question” button at the bottom, are in place. At the end of the quiz, there is a final screen, presenting the user with their quiz result.

To demonstrate the compatibility with existing question/quiz systems, I added a rudimentary Moodle XML import; an example quiz is available. Another obvious import format to add would be GIFT. Moodle XML export is also on the to-do list.

The future

All this is obviously just a start. A “killer feature” would be a SPARQL setup federating Wikidata. Entry-level quizzes for molecular biology? Questions whose answers are Wikidata items about chemicals? I can see educators flocking to this, especially if material is available in, or easily translated into, their language. More question types could emphasise the strength of this approach. Questions could even be mini-games, etc.

Another aspect I have not worked on yet is logging results. This could be done per user, where the user adds their result for a quiz to a dedicated tracking item for their user name. Likewise, a quiz could record user results (automatically or voluntarily).

One possibility would be for the questions, quizzes etc. to live in a dedicated namespace on Wikidata (so as not to contaminate the default namespace). That would simplify the SPARQL setup, and get the existing community involved. The Wiktionary-related changes on Wikidata will cover all that is needed on the backend; the interface is all HTML/JS, not even an extension is required, so there are next to no security or integration issues. Ah, one can dream, right?

Mix’n’match interface update

I have been looking into a JavaScript library called vue.js lately. It is similar to React, but not encumbered by licensing issues (which might prevent React’s use on WMF servers in the future), faster (or so they claim), and, most of all, it can work without any server-side component; all I need for my purposes is to include the vue.js file in the HTML.

So why would you care? Well, as usual, I learn new technology by working it into an actual project (rather than just vigorously nodding over a manual). This time, I decided to rewrite the slightly dusty interface of Mix’n’match using vue.js. This new version went “live” a few minutes ago, and I am surprised myself at how much more responsive it has become. This might be best exemplified by the single entry view (example), which (for unmatched entries) will search Wikidata, the respective language Wikipedia, and the Mix’n’match database for the entry title. It also searches Wikidata via SPARQL to check if the ID for the respective property is already in use. This all happens in a nicely modular fashion, so I can re-use lots of code for different modules.

Most of the functions in the previous version have been implemented in the new one. Redirect code is in place, so if you have bookmarked a page on Mix’n’match, you should end up in the right place. One new function is the ability to sort and group the catalogs (almost 400 now!) on the main page (example).

As usual, feel free to browse the code (vue.js-based HTML and JavaScript, respectively). Issues (for the new interface, or Mix’n’match in general) go here.

Mix’n’match post-mortem

So this, as they say, happened.

On 2016-12-27, I received an update on a Mix’n’match catalog that someone had uploaded. That update had improved names and descriptions for the catalog. I try to avoid such updates, because I made the import function so I do not have to deal with every catalog myself, and also because the update process is entirely manual, therefore somewhat painful and error-prone, as we will see. Now, as I was on vacation, I was naturally in a hurry, and (as it turned out later) there were too many tabs in the tab-delimited update file.

Long story short, something went wrong with the update. For some reason, some of the SQL commands I generated from the update file did not specify some details about which entry to update. Like, its ID, or the catalog. So when I checked what was taking so long, just short of 100% of Mix’n’match entries had the label “Kelvinator stove fault codes”, and the description “0”.

Backups, you say? Well, of course, but, look over there! /me runs for the hills

Well, not all was lost. Some of the large catalogs were still around from my original import. Also, my scraping scripts for specific catalogs generate JSON files with the data to import, and those are still around as well. There was also a SQL dump from 2015. That was a start.

Of course, I did not keep the catalogs imported through my web tool. Because they were safely stored in the database, you know? What could possibly go wrong? Thankfully, some people still had their original files around and gave them to me for updating the labels.

I also wrote a “re-scraping” script, which uses the external URLs I store for each entry in Mix’n’match, together with the external ID. Essentially, I get the respective web page, and write a few lines of code to parse the <title> tag, which often includes the label. This works for most catalogs.
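The core of that script is just this (a sketch; the real one also deals with per-catalog quirks and rate limiting):

```python
# The gist of the re-scraping repair: fetch the stored external URL and
# recover the label from the <title> tag. A sketch; the real script also
# deals with per-catalog quirks and rate limiting.
import re
import requests

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def label_from_url(url):
    html = requests.get(url, timeout=30).text
    m = TITLE_RE.search(html)
    if not m:
        return None
    # Many sites append the site name after a separator; keep the first part.
    return m.group(1).split("|")[0].split(" - ")[0].strip()
```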

So, at the time of writing, over 82% of labels in Mix’n’match have been successfully restored. That’s the good news.

The bad news is that the remaining ~17% are distributed across 133 catalogs. Some of these do not have URLs to scrape, some URLs don’t play nicely (session-based Java horrors, JS-only pages etc.), and the rest need site-specific <title> scraping code. Fixing those will take some time.

Apart from that, I fixed up a few things:

  • Database snapshots (SQL dump) will now be taken once a week
  • The snapshot from the previous week is preserved as well, in case damage went unnoticed
  • Catalogs that are uploaded through the import tool will be preserved as individual files

Other than the remaining entries that require fixing, Mix’n’match is open for business, and while my one-man-show is spread thin as usual, subsequent blunders should be easier to mitigate. Apologies for the inconvenience, and all that.