
Post scriptum

I am running a lot of tools on Labs. As with most software, the majority of feedback I get for those tools falls into one of two categories: bug reports and feature requests, the latter often in the form "can the tool get input from/filter on/output to…". In many cases, that is quick to implement; other cases are more tricky. Besides increasing the complexity of the tools, and filling up the interface with rarely-used buttons and input fields, the requested combinations ("…as you did in that other tool…") would eventually exceed my coding bandwidth. And by "eventually", I mean some time ago.

Wouldn’t it be better if users could “connect” tools on their own? Take the output of tool X and use it as the input of tool Y? About two years ago, I tried to let users pipeline some tools on their own; the uptake, however, was rather underwhelming, which might have been due to the early stage of this “meta-tool”, and its somewhat limited flexibility.

A script and its output.

So today, I present a new approach to the issue: scripting! Using toolscript, users can now take results from other tools such as category intersection and Wikidata Query, filter and combine them, display the results, or even use tools like WiDaR to perform on-wiki actions. Many of these actions come "packaged" with this new tool, and the user has almost unlimited flexibility in operating on the data. This flexibility, however, is bought with the scary word programming (for which "scripting" is the euphemism). In essence, the tool runs JavaScript code that the user types or pastes into a text box.

Still here? Good! Because, first, there are some examples you can copy, run, and play with; if people can learn MediaWiki markup this way, JavaScript should pose little challenge. Second, I am working on a built-in script storage, which should add many more example scripts, ready to run (in the meantime, I recommend a wiki or pastebin). Third, all built-in functions use synchronous data access (no callbacks!), which makes JavaScript a lot more … scriptable, as in "logical linear flow".

The basic approach is to generate one or more page lists (on a single Wikimedia project), and then operate on those. One can merge lists, filter them, “flip” from Wikipedia to associated Wikidata items and back, etc. Consider this script, which I wrote for my dutiful beta tester Gerard:

var all_items = ts.getNewList('','wikidata');              // empty list to collect the results
var cat = ts.getNewList('it','wikipedia').addPage('Category:Morti nel 2014');
var cat_item = cat.getWikidataItems().loadWikidataInfo();  // the Wikidata item for that category
$.each ( cat_item.pages[0].wd.sitelinks , function ( site , sitelink ) {
  var s = ts.getNewList(site).addPage(sitelink.title);
  if ( s.pages[0].page_namespace != 14 ) return ;          // skip sitelinks that are not categories
  var tree = ts.categorytree({language:s.language,project:s.project,root:s.pages[0].page_title,redirects:'none'}) ;
  var items = tree.getWikidataItems().hasProperty("P570",false); // no death date (P570) yet
  all_items = all_items.join(items);
} ) ;
all_items.show();

This short script will display a list of all Wikidata items that are in a "died 2014" category tree on any Wikipedia and that do not yet have a death date. The steps are as follows:

  • Takes the "Category:Morti nel 2014" from it.wikipedia
  • Finds the associated Wikidata item
  • Gets the item data for that item
  • For each of the sitelinks to other projects on this item:
    • Checks if the link is a category
    • Gets the pages in the category tree for that category, on that site
    • Gets the associated Wikidata items for those pages
    • Removes those items that already have a death date
    • Adds the ones without a death date to a “collection list”
  • Finally, displays that list of Wikidata items with missing death dates

Thus, with a handful of straightforward functions (like "get Wikidata items for these pages"), one can ask complex questions of Wikimedia sites. A slight modification could, for example, create Wikidata items for the pages in these categories; a sketch of another simple variation follows below. All functions are documented in the tool. Many more can be added on request; and, as with adding Wikidata labels, a single added function can enable many more use-cases.
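
To give an idea of how little needs to change, here is a hedged sketch of such a variation, using only the functions that already appear in the script above (untested, so treat it as an illustration rather than a recipe): it lists the items for one category tree that have no image (P18) yet.

var cat = ts.getNewList('en','wikipedia').addPage('Category:2014 deaths');
var tree = ts.categorytree({language:'en',project:'wikipedia',root:cat.pages[0].page_title,redirects:'none'});
var no_image = tree.getWikidataItems().hasProperty("P18",false); // items lacking an image (P18)
no_image.show();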

I hope that this tool can become a hub for users who want more than the “simple” tools, to answer complex questions, or automate tedious actions.

The OAuth haters’ FAQ

Shortly after the dawn of the toolserver, the necessity for authentication in some tools became apparent. We couldn't just let anyone use tools to upload files to Commons anonymously, hiding behind the tools' user account; that would be like allowing anyone to edit Wikipedia anonymously. Crazy talk! But toolserver policies forbade tools from asking for users' Wikipedia/Commons passwords. So, different tool authors came up with different solutions; I created TUSC to have a single authentication mechanism across my various tools.

As some of you may have noticed, some of my new tools that require authentication are using OAuth instead of TUSC. Not only that, but I have been busy porting some of my long-standing tools, like flickr2commons and now commonshelper, to OAuth. This has been met with … unease by various parties. I hope that this post can alleviate some concerns, or at least answer some questions.

Q: Why did you switch from TUSC to OAuth?
A: TUSC was a crutch, and always had been. Not only is OAuth a standard technology, it is now also the "official" way to use tools that require user rights on Wikimedia sites.

Q: So I’m not uploading as your bot, but as myself?
A: Yes! You took the time and interest to run the tool; that effort should be rewarded by having the upload associated with your user account. It will be so much easier to see who uploaded a file, to assign glory and blame alike. Also, the army of people who have been hounding me for re-use rights and image sources, just because my name was linked to the uploading bot account, will now come after YOU! Progress!!

Q: OK, maybe for new tools. But the old tools were working fine!
A: No, they were not. Years ago, when the WMF changed the API login procedure, I switched my tools to use the Peachy toolkit, so I would not have to do the "fiddly bits" myself for every tool. However, a few months ago, something changed again, causing the Peachy uploads to fail. It turned out that Peachy was no longer being developed, so I had to go back and hack my own upload code. Something was wrong with that as well, leading to a flurry of bug reports across my Commons-uploading tools. The subsequent switch to OAuth uploads wasn't exactly smooth for me either, but now that it's running, it should work nicely for a while. Yeah, right.

Q: But now all the tools are using JavaScript to upload. I hate JavaScript!
A: That pesky JavaScript is all over the web these days. So you have probably installed NoScript in your browser (if you haven't, do!). You can easily set an exception for the tools server (or specific tools), so you're safe from the evil JavaScript out there, while the tools "just work".

Q: But now it won’t work in my text browser!
A: Text browser? In 2014? Really? You seem to be an old-school 1337 hax0r to be using that. All tools on tools.wmflabs.org are multi-maintainer; I'll be happy to add you as a co-maintainer, so you can add a text browser mode to any tool you like. You don't have time, you say? Fancy that. In the meantime, you could use elinks, which supports JavaScript.

Q: But you changed my favourite tool! I don’t like change.
A: Sorry, Sheldon. The only constant is change.

In closing, the good old bot has clocked over 1.2 million uploads on Commons, or ~6% of all files there. I think it deserves some well-earned rest.

Qualified answers

Wikidata Query (WDQ), one of my pet projects, which I have blogged about before, has gained new functionality: search by qualifier. That is, a query for items with specific statements can now be fine-tuned by querying the qualifiers as well.

Too abstract? Let’s see: Who won the Royal Medal in 1853? Turns out it was Charles Darwin. Or how about this: Who received any award, medal, etc. in 1835? According to Wikidata, it’s just one other person besides Darwin; Patrice de Mac-Mahon was awarded a rank in the French Legion of Honour.

This might seem like a minor gimmick, but I think it's huge. It is, to the best of my knowledge, the only way to query Wikidata on qualifiers; and with qualifiers becoming more important (replacing groups of properties with simpler, "qualified" ones, for example in taxonomy) and more numerous, such queries will become more important to the "ecosystem" around Wikidata.

With the main point delivered, here are some points for those interested in using this system, or the development of WDQ:

  • Currently, you can only use qualifier queries in combination with the CLAIM command, though I’ll add it to string, time, location, and quantity commands next.
  • A subquery in {curly brackets} directly after the CLAIM command is applied to the qualifiers of the statements that match the "main" command (see the sketch after this list).
  • Within the qualifier subquery, you can use all other commands, ranges, trees etc.
  • Adding qualifiers to the in-memory database required some major retrofitting of WDQ. It started with using "proper" statement representations (C++ template classes, anyone?), followed by creating a qualifier data type (to re-use all the existing commands within the subquery, each qualifier set is basically its own Wikidata!), extending the dump parsing, binary storage and loading, and a zillion other small changes.
  • While I believe the code quality has improved significantly, some optimization remains to be done. For example, memory requirements have skyrocketed from <1GB to almost 4GB. The current machine has 16GB RAM, so that should hold for a while, but I believe I can do better.
  • There is an olde memory leak, which appears to have become worse with the increased RAM usage. WDQ will restart automatically if the process dies, but queries will be delayed for a minute or so every time that happens.
  • I have prepared the code to use ranks, but neither parsing nor querying is implemented yet.
  • I need to update the API documentation to reflect the qualifier queries. Also, I need to write a better query builder. In due time.
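
To make the syntax concrete, here is a rough sketch of what the Royal Medal example above might look like as a qualifier query. The property and item numbers (P166 "award received", Q253439 "Royal Medal", P585 "point in time") are given from memory, and the exact date format of BETWEEN should be checked against the WDQ documentation:

CLAIM[166:253439]{BETWEEN[585,1853,1854]}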

Points of view

For many years, Henrik has single-handedly (based on data by Domas) done what the world's top-5 website has consistently failed to provide: page view information, per page, per month/day. Requested many times, repeatedly promised, page view data has remained the proverbial vaporware, the Duke Nukem Forever of the Wikimedia Foundation (except DNF was delivered in 2011). A quick search of my mail found a request for this data from 2007, but I would be surprised if that is the oldest instance of such a query.

Now, it would be ridiculous to assume the Foundation does not actually have the data; indeed they do, and it is provided as unwieldy files for download. So what's all the complaining about? First, the downloadable data cannot be queried in any reasonable fashion; if I want to know how often Sochi was viewed in January 2014, I have to parse an entire file. Just kidding; it's not one file. Actually, it's one file for every single hour. With the page titles URL-encoded as requested, that is, not normalized; a single page can have dozens of different "keys", and have fun finding them all!
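
To make the pain concrete, here is a minimal Node.js sketch of what answering even the one-hour version of that question involves; the file name and the list of encoding variants are made up for illustration, and the assumed line format is the classic "project title count bytes" of the raw pagecount files:

// Sketch: sum one hour of views for a single article from one hourly dump file.
var fs = require('fs');
var readline = require('readline');

// every URL-encoded variant of the title has to be listed by hand
var wanted = ['Sochi', 'Sochi%C2%A0'];
var total = 0;

var rl = readline.createInterface({input: fs.createReadStream('pagecounts-20140115-120000')}); // one file per hour
rl.on('line', function (line) {
  var parts = line.split(' '); // project, title, count, bytes
  if (parts[0] === 'en' && wanted.indexOf(parts[1]) !== -1) total += parseInt(parts[2], 10);
});
rl.on('close', function () { console.log('views of Sochi in this one hour:', total); });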

But I can get that information from Henrik's fantastic page, right? Right. Unless I want to query a lot of pages. Which I have to request one by one. Henrik has done a great job, and single queries seem fast, but it adds up. Especially if you do it for thousands of pages. And try to be interactive about it. (My attempt to run queries in parallel ended with Henrik temporarily blocking my tools for DDOSing his server. And rightly so. Sorry about that.)

GLAMs (Galleries, Libraries, Archives, and Museums) are important partners of the Wikimedia Foundation in the realm of free content, and increasingly so. Last month, the Wellcome Trust released 100,000 images under the CC-BY license. Wikimedia UK is working on a transfer of these images to Wikimedia Commons. Like other GLAMs, the Wellcome Trust would surely like to know if and how these images are used in the Wikiverse, and how many people are seeing them. I try to provide a tool for that information, but, using Henrik's server, it runs for several days to collect data for a single month, for some of the GLAM projects we have. And, having to hit a remote server with hundreds of thousands of queries via HTTP each month, things sometimes go wrong, and then people write to me asking why their view count is waaaay down this month, and I'll go and fix it. Currently, I am fixing data from last November, by re-running that particular subset and crossing my fingers that it will run smoothly this time.

Like others, I have tried to get the Foundation to provide the page view data in a more accessible and local (as in toolserver/Labs) way. Like others, I failed. The last iteration was a video meeting with the Analytics team (newly restarted, as the previous Analytics team didn't really work out for some reason; I didn't inquire too deeply), which ended with a promise to get this done Real Soon Now™, and the generous offer to use the page view data from their Hadoop cluster. Except the cluster turned out to be empty; I was then encouraged to import the view data myself. (No, this is not a joke. I have the emails to prove it.) As much as I enjoy working with and around the Wikiverse, I have neither the time, the bandwidth, nor the inclination to do your paid jobs for you, thank you very much.

As the sophisticated reader might have picked up at this point, the entire topic is rather frustrating for me and others, and being able to offer only a patchy, error-prone data set to GLAMs who have released hundreds of thousands of files under a free license into Commons is, quite frankly, disgraceful. What is asked of the Foundation is not unreasonable; providing what Henrik has been doing for years on his own would be quite sufficient. Not even that is required; others and I have volunteered to write interfaces if the back-end data is provided in a usable form.

Of the tools I try to provide in the GLAM realm, some don't really work at the moment due to the constraints described above; some work so-so, kept running with a significant amount of manual fixing. Adding 100,000 Wellcome Trust images may be enough for them to come to a grinding halt. And when all the institutions who have so graciously contributed free content to the Wikiverse come a-running, I will make it perfectly clear that there is only the Foundation to blame.

Modest doubt is called the BEACON of the wise

Now that I have got Shakespeare out of the way: I picked up chatter (as they say in the post-Snowden world) about the BEACON format again, which seemed to have fallen quiet for a while. In short, BEACON is a simple text format linking entries in two different catalogs, usually web sites. There are many examples "in the wild", often Wikipedia-to-X, with X being VIAF, GND, or the like.
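
For readers who have not come across it: a BEACON file is little more than a few "#" header lines followed by one ID-to-ID mapping per line. A rough sketch (header fields from memory, and the VIAF number is only illustrative; check the BEACON spec for the real details) might look like this:

#FORMAT: BEACON
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://viaf.org/viaf/
Q1339|12304462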

Now, one thing has changed in the last year, and that is the emergence of Wikidata. Among other things, it allows us to have a single identifier for an item in the Wikiverse, without having to point at a Wikipedia in one language, with a title that is prone to change. When I saw the BEACON discussion starting up again, I realized that, not only do we now have such identifiers, but that I was sitting on the mother lode of external catalog linkage: Wikidata Query. So, I did what I always do: Write a tool.

I have not checked if there are any other Wikidata BEACON mappings out there, but my solution has several points going for it that would be hard to replicate otherwise:

  • An (almost) up-to-date representation of the state of Wikidata, usually no more than ~10-15 minutes behind the live site.
  • Quick on-the-fly data generation.
  • A large number of supported catalogs (76 at the moment).
  • The option to link two external catalogs via Wikidata items that have identifiers for both.

This service is pretty new (I just hacked it together while waiting for other code to run…), and I may have gotten some of the BEACON header wrong. If so, please tell me!

The Reason For It All

Gerard has blogged tirelessly about improvements to Reasonator, my attempt at making Wikidata a little more accessible. Encouraged by constant feedback, suggestions, and increasing view numbers, it has grown from “that thing that shows biography data” into a versatile and, more importantly, useful view of Wikidata. This is my attempt at a summary of the state of Reasonator.

Johann Sebastian Bach (Q1339). Reasonator and Wikidata side-by-side comparison.

The Cake

Reasonator attempts to show a Wikidata item in a way that is easy for humans to access. By focusing on display rather than editing, a lot of screen real estate can be used to get the information to the reader.

Besides the main body of text, there is a sidebar reminiscent of the Wikipedia infoboxes, containing images (of the item itself, as well as signature, coat of arms, seal, audio, video, etc.), simple string-based information (such as postal codes), links to external sources via their ID (such as VIAF), links to other Wikimedia projects associated with the item, and miscellaneous tidbits such as a QRpedia code (by special request).

The main text consists of a header (with title, Wikidata item number and link, aliases, manual and automatic description, where available), a body mostly made of linked property-value lists, and a footer with links to an associated Commons category, and images from related items. Qualifiers are shown beneath the respective claims. So far, not much more than what you get on Wikidata, except some images and pretty layout.

The Icing

Cambridge (Q350) with Wikivoyage banner, location hierarchy, and maps.

One fundamental difference between the Wikidata and Reasonator displays is that the latter is context-sensitive; it supports specialized information rendering for certain item types, and only uses the generic one as a fallback. Thus, for a biographical item, there are the relatives (if any) and a link to the family tree display; for a location, a Wikivoyage banner, OpenStreetMap displays (two zoom levels), a map image, and a hierarchy list; for a species, an automatically generated taxonomy list.

Even the generic item display can generate a hierarchy via the “subclass of” property. Such hierarchical lists are generated by Wikidata Query for speed reasons, but can also be calculated on the “live” data, if you want to check your latest Wikidata edit. There are already requests for other specialized item types, such as books and book editions.

The bespoke display variations and hierarchical lists hint at the "reason" part of the Reasonator name: it does not "just" display the item, but also looks at the item type, as well as related items. Statues depicting Bach, things that are named after him, items for music he has composed – all are displayed on the item page, even though the item itself "knows" nothing about those related items. Reasonator also has some built-in inferences: any son of Bach's parents is his brother, even if neither item has the "brother of" property.
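
As an illustration of what such an inference amounts to (this is not Reasonator's actual code; the getParents and getChildren helpers stand in for reading the "father", "mother", and "child" statements), deriving siblings from shared parents takes only a few lines:

// Sketch: infer siblings from the parents' children, even when
// no item carries an explicit "brother of"/"sister of" statement.
function inferSiblings(itemId, getParents, getChildren) {
  var siblings = {};
  getParents(itemId).forEach(function (parentId) {
    getChildren(parentId).forEach(function (childId) {
      if (childId !== itemId) siblings[childId] = true;
    });
  });
  return Object.keys(siblings);
}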

Hover box for J.C.F. Bach.

Every link to another Wikidata item (which will point to the appropriate Reasonator page for browsing continuity) has a “hover box” that will pop up when you move the mouse over the link. The box contains the label of the item; a link to Wikidata; links to Wikipedia, Wikisource, and Wikivoyage in the current language; a manual description; an automatic description (with appropriate Reasonator links); and a thumbnail image, all subject to availability.

Strange Ingredients

Reasonator has a faux "Universal Language Selector" to easily switch languages. (I decided to just steal the ULS button and roll my own, with a few lines of JavaScript, rather than including no less than 15 additional files into the HTML page.) Item and property labels will show in your preferred language, with a fallback to the "big ones" (as determined by Wikipedia sizes) first, and a random "minor language" as a last resort. Items using such a "fallback label" are marked with a dotted red underline, to alert native speakers to missing labels in their language. (A "translate this label now" utility, built into the hover box, is currently held up by a Wikidata bug.) The interface text is translatable as well, via a wiki page.
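
The label fallback described above boils down to a few lines of JavaScript; this is a sketch of the idea, not Reasonator's actual implementation (the language ranking and the shape of the labels object are assumptions):

// Sketch: preferred language first, then "big" languages, then anything.
var bigLanguages = ['en','de','fr','es','ru','it','ja']; // rough stand-in for "by Wikipedia size"

function pickLabel(labels, preferred) {
  if (labels[preferred]) return {label: labels[preferred], fallback: false};
  for (var i = 0; i < bigLanguages.length; i++) {
    if (labels[bigLanguages[i]]) return {label: labels[bigLanguages[i]], fallback: true};
  }
  var any = Object.keys(labels)[0]; // "random" minor language as a last resort
  return {label: any ? labels[any] : null, fallback: true}; // fallback:true => dotted red underline
}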

A few candles on top

One simple Reasonator function I personally use a lot is "random page". The results can be both astonishing (the "richness" of some items is remarkable) and depressing ("blank" items, with no statements, or maybe just a Commons category). Non-content items, such as category or template pages, are not shown by this function, if it knows about them. If you load a statement-less item whose label contains a ":", Reasonator offers you a one-click "tagging" of the item as template, category, etc., if you have WiDaR enabled. (If you don't, you really should!) In a similar manner, Reasonator suggests items to use as a "parent taxon" for species without such a statement, based on the species' taxonomic name.

There is also a built-in search function, based on the Wikidata search, enriching it with automatic descriptions and hover boxes. As usual, missing labels are highlighted, and a missing auto-description hints at items lacking fundamental statements. (You can go and edit the item on Wikidata via the link in the hover box, without having to load the Reasonator page first.)

A calendar example (the Tunguska event).

The latest addition is a calendar function. This special type of page is linked to from all date values within Reasonator, using the values' accuracy/resolution. It is inspired by the "year" pages on Wikipedia, but requires no additional effort by humans, besides the addition of date values to Wikidata (which should be done anyway!). The page uses Wikidata Query to aggregate items with events inside the set time period (day, month, year). At this moment, events ("point in time"), foundation/creation/discovery dates, births, and deaths are shown. Additionally, there is a list of "ongoing events", which started less than 10 years before and ended less than 10 years after the viewed date. A sidebar allows navigation by date, and hover boxes give a quick insight into the listed items.

Here as well, Reasonator uses more than a simple "list of stuff". Relevant dates are shown beside the items, e.g., death dates in the birthday list. For a specific day, births and deaths are shown side-by-side to avoid blank space; for month/year displays, both birth and death dates are shown (as they are not specifically implied by the date), and the display wraps after the births list.

For a whole, “recent” year (such as 1900), it can take a while to show the page. Counter-intuitively, this is not because it takes time to find the items; all the items for a year are returned by Wikidata Query in less than half a second. It is the loading of several thousand (in this example, over 3,000) items, their labels, descriptions, Wikipedia sites, etc. from the Wikidata API (at a maximum of 50 apiece) that takes a while.

Serving suggestion

This leads nicely to another perk of Reasonator: speed. Recently, I optimized Reasonator to load as fast as possible. The result is the odd situation that the J.S. Bach entry takes, on my laptop, 43 seconds to load on Wikidata (logged out, force-reload), but shows all text in Reasonator in 8 seconds, and finishes loading all images after another 7 seconds. This is despite Reasonator loading over 600 additional items to provide the context described above. It was achieved by reducing the number of serial requests (where one request depends on the result of the previous one) to the Wikidata API, loading scripts and interface texts in parallel, grouping and parallelizing Wikidata API requests, and "outsourcing" complex queries, such as the hierarchical lists and the calendar, to Wikidata Query (while providing a native JavaScript fallback for most).
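
The "grouping and parallelizing" of API requests mentioned above essentially means respecting the 50-IDs-per-call limit of wbgetentities while firing the calls concurrently. Here is a hedged sketch of the idea, using jQuery for brevity; it is not Reasonator's actual code:

// Sketch: load many items from the Wikidata API in parallel batches of 50.
function loadItems(ids, callback) {
  var batches = [];
  for (var i = 0; i < ids.length; i += 50) batches.push(ids.slice(i, i + 50));

  var entities = {};
  var requests = batches.map(function (batch) {
    return $.getJSON('https://www.wikidata.org/w/api.php?callback=?', {
      action: 'wbgetentities',
      ids: batch.join('|'),
      props: 'labels|descriptions|sitelinks',
      format: 'json'
    }).done(function (data) { $.extend(entities, data.entities); });
  });

  // all batches run in parallel; the callback fires once every one has returned
  $.when.apply($, requests).always(function () { callback(entities); });
}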

A wafer-thin mint

So what's the point of all of this? For most people who come across Wikidata, it presents itself either as a back-end to Wikipedia (for language links, and infobox statements on some wikis), or as a somewhat dull editing interface for form fetishists. I have tried to breathe some life into aspects of Wikidata with other tools, but Reasonator lets you delve into all of its information.

And while a human-written English Wikipedia will always be infinitely more readable than any Wikidata interface, I hope that Reasonator could one day be a useful addition to small-language Wikipedias, and scale for such a community; on Wikidata/Reasonator, you only have to translate “singer” once to have everyone with that occupation labelled as such; you don’t have to manually include hundreds of thousands of images from Commons; and you don’t have to manually curate birthday lists.

Finally, I hope that browsing Reasonator will not only be interesting, but will also encourage people to participate and improve Wikidata. A claim-less page can be a big motivator to improve it, and even a showcase item is more fun if it doesn't take a minute to load…

Scaling up Wikidata editing

Wikidata continues to grow. The growth is not in the number of items (which are roughly limited by the total number of Wikipedia articles at the moment) but in labels for different languages and, of course, statements. These statements are supplied by two groups of editors: the bots, adding vast numbers of statements based on programmatically defined criteria, and the humans, painstakingly adding one statement at a time. While some statements are researched and take a human to investigate, basic statements (e.g. “this item is a human”) are still missing from many items. JavaScript-based tools are trying to ease the monotonous task of adding these by hand, but do not really scale well.

Víðarr

Through the recent addition of OAuth to all Wikimedia projects, an opportunity has presented itself to ease this burden. Wikimedians can now authorize certain tools to edit on their behalf, under their control. This keeps the responsibility for edits where it always was, in the hands of the editing user, while simplifying the mechanics of editing, uploading, or otherwise modifying a wiki. Thus, I created WiDaR, the WikiData Remote editor (and Norse deity). Widar, by himself, does not come with a user interface, except the authorization link (which you need to click to use it). It serves as a central conduit for other tools to edit Wikidata on your behalf.

AutoList, reloaded

The first tool to use Widar is the improved AutoList. If you have signed up to Widar, you can now use checkboxes on AutoList to set a property:item statement (aka "claim") on Wikidata for up to 50 items at a time. Select the items you wish to modify, and use the "Claim!" button. Enter the property and target item numbers (e.g. 31 and 5 for P31="instance of" and Q5="human" to mark the items as human beings) in the new dialog, click "Set claims", and you're done. A new window will open, representing the Widar batch operation of setting your new claims, which will take a few seconds. At the same time, the items you had selected are removed from AutoList so you can process the next batch. There is no need to wait for Widar to finish your previous batch; batches can run in parallel.
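
For the curious, setting such a claim boils down to one Wikidata API call per item. This is a schematic sketch of the underlying wbcreateclaim request, not Widar's actual code; the OAuth signing that Widar handles for you, and the cross-domain plumbing, are omitted:

// Sketch: what "P31 = Q5" amounts to at the API level.
var params = {
  action: 'wbcreateclaim',
  entity: 'Q123',      // placeholder item ID
  property: 'P31',     // "instance of"
  snaktype: 'value',
  value: JSON.stringify({'entity-type': 'item', 'numeric-id': 5}), // Q5, "human"
  token: '…',          // a valid edit (CSRF) token
  format: 'json'
};
$.post('https://www.wikidata.org/w/api.php', params, function (result) {
  console.log(result); // the newly created claim, or an error
});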

AutoList now also supports lists based on a Wikipedia category tree; that way, you can generate a "pre-selected" item list (e.g. based on [[Category:Fictional cats]]), and then set Wikidata claims on the respective items appropriately. Remember, all Wikidata edits will be attributed to your account, and tagged with the Widar tool. While it is tempting to fire-and-forget batches, please take a few seconds to check the list, to prevent wrong statements from being added to Wikidata.

The narrow path?

Unsurprisingly, the potential for abuse of this type of tool has not escaped me. However, my experience with TUSC has told me that there is little vandalism that cannot be handled on-wiki with both built-in and third-party tools. Also, a technology like OAuth that allows users to mass-edit allows administrators to mass-revert as well; even if it is initially abused in some cases, proper countermeasures exist or can be developed. Potential vandalism should not be used as a "look, terrorists!" club; otherwise, Wikipedia (and Wikidata!) would not exist.

Red link lists on steroids

I have long been a fan of red link lists, collections of topics that ought to have a Wikipedia article. Often, these are compiled from other sources, such as the public domain 1911 Encyclopaedia Britannica, or the Dictionary of National Biography. Such lists often give a good overview of an area of knowledge, such as general encyclopedic topics or biographies, and, in the early days of Wikipedia, were also a crude marker of "how good are we", compared to established works. However, having worked on both compiling and working down such lists, I am also aware of the temptation they present to the OCD-inclined; seeing the list shrink and the percentage of checked-off topics grow is satisfying, as it presents a clear goal, in contrast to the "more articles!" drive often encountered.

That said, the practicalities of red link lists on Wikipedia can be quite painful. Tens of thousands of links need to be chopped into many subpages, and these have to be divided into sections for easy editing. In the beginning, we tended to remove "done" links, but later we switched to tagging each entry with a {{done}} template. Also, just because a link has turned "blue" doesn't mean it actually points to the topic stated in the list; it may be a different person or concept, or a disambiguation page. It also becomes tedious to cross-check between rendered links and wikitext, even when editing sections.

Another point that always bothered me is that these lists are, by their very implementation, limited to a single-language Wikipedia. There may be a perfectly good German article about a "redlink" on the English site. These days, the latter issue could, in principle, be solved quite elegantly by Wikidata. Also, linking Wikidata items to external resources via ID properties is a big step ahead. But Wikidata, in its current state, does not really lend itself to the basic problem, which is listing items we should have articles about, but currently do not. Technically, this could be solved by creating "empty" items that only have an external ID property; however, I feel this would be frowned upon. Furthermore, for this to work in an automated fashion, it needs to be clear that, at the time of item creation, there is no existing item that should be used instead; "fuzzy" matching would lead to tens of thousands of empty, duplicated items. All this assumes that there actually is a property for the external resource in question.

The catalog list.

So, finally, I took a request by the master of the DNB articles as an excuse to write Yet Another Tool. Called mix'n'match, it can manage entries in "catalogs", that is, third-party resources, and their relation to Wikidata items. Initially, I have imported entries from ODNB, BBC's Your Paintings, Appletons', and the 1913 Catholic Encyclopedia; some of these we already have as redlink lists on the English Wikipedia. I'll be happy to add more catalogs on request. The entries, slightly over 100,000 at the time of writing, have individual links back to the respective resource, titles, and (partial) descriptions.

Individual entry editing.

Entries can be unmatched to any Wikidata item (the default state), matched to an item manually by a user (requiring TUSC to log in, as a vandalism precaution), automatically matched by some fuzzy name matching, or flagged as "not applicable" (N/A) if an entry is in the resource but should never have a Wikidata item.
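
The automatic matching is deliberately simple. As a rough illustration (this is not the tool's actual algorithm), a name-based match can be little more than comparing normalized strings:

// Sketch: naive "fuzzy" name matching between a catalog entry and Wikidata candidates.
function normalizeName(s) {
  return s.toLowerCase()
          .replace(/\([^)]*\)/g, ' ')  // drop bracketed additions like "(painter)"
          .replace(/[^a-z]+/g, ' ')    // crude: everything else becomes a space
          .trim();
}

function suggestMatches(entryName, candidates) { // candidates: [{q:'Q…', label:'…'}, …]
  var target = normalizeName(entryName);
  return candidates.filter(function (c) { return normalizeName(c.label) === target; });
}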

Entry matches can be individually changed by:

  • specifying a Wikidata “Q number”, e.g. after searching Wikidata or Wikipedia
  • confirming an automatically suggested match
  • removing a match that was set automatically or by a user
  • flagging an entry as not applicable
Recent Changes.

Lists of entries and their status can be generated in chunks, and filtered by their status; for example, you can show only automatic suggestions for easy, one-click matching (name and description of the suggested Wikidata item are shown as well, where available), or only unmatched entries to get the cases that are a little bit harder. All changes are tracked with edit time and user; there is even a Recent Changes page, as well as a rudimentary search function.

I hope this tool will reach a critical mass of fellow obsessive list-checkers! And while you're at it, feel free to add some statements to the odd Wikidata item that shows up as a candidate…

Wikidata, or: Wikipedia, the other 60%

As Gerard has so eloquently described, over 60% of Wikidata items have no corresponding article in the English Wikipedia; once we leave the “top five”, this exceeds 90%. The shoot-from-the-hip response would be: write more articles! While there is no issue in principle with this approach, it might not scale well, even with all our volunteers.

The next best thing, as Gerard describes, is to have labels (and maybe descriptions) for Wikidata items in all these languages. But that too takes time; how can we take advantage of the wealth of Wikidata in the meantime? You will, undoubtedly, be surprised (shocked, even!) to learn that I wrote some code in that regard. It is JavaScript that lives on the English Wikipedia, but it could be used verbatim on all other Wikipedias. On the search results page, it starts a separate background search on Wikidata for your search term, and displays the results below the normal search results.

The search can find Wikidata items that match your query, even if no article exists on your current (here: English) Wikipedia. As this Wikidata search is performed across all languages, it can find matching items even if there is no label in your current language. This is handy for items that have the same or similar names across languages, for example, items about people. The example shows a search for "Alain Danilet" on the English Wikipedia. No such article exists there, not even a mention. However, there exists a Wikidata item that has such a label in French (though not in English). The item is found through the language-insensitive search, and displayed as a link (see the bottom of the screenshot). Also, another piece of JavaScript is used to generate an automatic description from statements, in the current language. Furthermore, some parts of the description are linked to the relevant pages on the current Wikipedia.
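
The mechanics are straightforward. A minimal sketch of the idea follows; it is not the actual wdsearch.js (the result rendering, descriptions, and the R link are omitted), and it assumes the mediawiki.util module is loaded:

// Sketch: run the search term against Wikidata's full-text search (all languages)
// and append any hits below the normal search results, as a user script.
var term = mw.util.getParamValue('search'); // the term the user searched for
if (term) {
  $.getJSON('https://www.wikidata.org/w/api.php?callback=?', {
    action: 'query',
    list: 'search',
    srsearch: term,
    srlimit: 10,
    format: 'json'
  }).done(function (data) {
    $.each(data.query.search, function (i, hit) {
      // hit.title is the item ID, e.g. "Q42"; link it below the results
      $('#mw-content-text').append(
        $('<div>').append($('<a>').attr('href', '//www.wikidata.org/wiki/' + hit.title).text(hit.title))
      );
    });
  });
}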

Thus, you learn that Alain Danilet was a French politician, male, born in 1947, and died in 2012. Again, at the time of writing, this information does not exist on the English Wikipedia; there is no English label for this item on Wikidata; and there is no English description of him anywhere in the Wikiverse. In the current version, a small link labeled R gets you to the relevant Reasonator page, where you can find a picture of Monsieur Danilet, birth place, political party membership etc.

The code page contains brief instructions on how you can add this feature to your own Wikipedia experience, and make use of the long, long tail of Wikidata.

The Apple Tax Dare

So yesterday, October 31, I experienced the ultimate Halloween horror: My 5-year-old 24″ iMac died. It was my primary machine, and, while occasionally a little sluggish, worked quite well. Five years lifetime is acceptable these days, but now I need to get a replacement. I like the form factor of the all-in-one machine, and shudder at the memories of giant tower cases and space-eating desktop “screen stands”, screaming their existence into the world through oh-too-many fans.

So my first instinct took me to the Apple Store web page for the 27″ iMac – I have gotten used to large screen real estate, and this seems to be the only one comparable to the previous 24″ 4:3-ratio screen. This sells, in its basic configuration, for ~£1,600. But wait, I thought, what about the "Apple tax"? The persistent rumor that Apple hardware costs much more than it should, just because of the logo and ooooh, shiny! Also, as a U.S. company, who knows who'll have a peek into my machine from, say, Cupertino or Maryland? I am quite comfortable with Linux, so why shouldn't I get a similar, cheaper, and more secure machine instead? (Don't even mention Windows. Just don't.) So, let the contestants enter the arena!

First, I remembered reading about the Sable Complete from System76, a dedicated Linux-friendly all-in-one machine (the PC lingo for “iMac”, apparently). But, as nice as it looks, it comes only in 21″. Sorry.

Next, I found the Dell XPS 27. Yes, it's plastic all the way, but it has a touch screen (not required, but hey!), an i7 instead of an i5 processor, 16GB instead of 8GB RAM, a 2TB instead of a 1TB hard disk, and all for the same price as the iMac! Done deal! Except for some small details, such as that Dell won't sell it to me, or that some minor hardware, such as the graphics card, won't work under Linux.

But fret not, for HP offers the HP Z1, with a special emphasis on “works with Linux”! NOW we’re talking! I’ll just configure it to match the basic 27″ iMac and … end up with a machine that costs ~£1,900. Thus, I did not bother to check if it’s actually sold in the UK, or if the Linux claims are actually true. At this point, I wouldn’t be surprised if neither is the case.

So here is my challenge to the hardware manufacturers, Linux community and Apple haters alike (just to clarify, I am not a fanboy either; proud owner of both an Android phone and a tablet): I dare you to find me a machine that

  • is hardware-equivalent to the basic 27″ iMac 2013 model (no need for identical components, but similar screen, processor, RAM, graphics card, disk, and form factor; it does not have to be pretty)
  • runs Linux without major hoop-jumping, and supports all built-in hardware (all the way up to the web cam!)
  • is significantly (say, £200) cheaper
  • can be bought in the UK and delivered within a week

I don't have much hope for this. Therefore, tomorrow (November 2), I shall walk into the local Apple shop and have a really close look at the iMac. I'll check the comments here before entering the shop, just in case. But so far, it doesn't look like 2013 is the year of the Linux desktop for me.