
Thy data, writ large

Ever since Rambot effectively doubled the size of English Wikipedia in a matter of days, automatic text generation from a dataset has been met with suspicion in the Wikiverse. Some text is better than none for most readers, say some; number-heavy, boring bot text is not really an encyclopaedia entry, and it could also take away some of the joy of writing, say others. To this day, it is an issue that can split Wikipedians into fiercely opposed camps like little else.

Change of scenery. Wikidata is a young but vibrant Wikimedia project, in many respects still finding its shape. Each item on Wikidata can have a brief textual description. This is helpful, for example, in the current Wikipedia mobile app, where these descriptions are superimposed on a header image, say some; it is a waste of volunteers' time to write text that just reiterates the item's statements, say others. Some (including myself) say that manual descriptions make sense for a few items, but that the vast majority of items do not require a human to describe them.

The solution to both issues above is, of course, bot-generated text on the fly: text that is written by software based on a data source, but that is never permanently stored. That way, essential information can be given to the reader without discouraging writers, and without the need to maintain and update the bot-generated text, as it is not stored in the first place but generated from the current dataset on demand.

I have previously written code that implements aspects of this: Wikidata search results are displayed on some Wikipedias (e.g. the Italian one) underneath the standard ones, with a brief, automatically generated description of each Wikidata item. And some people have seen my Reasonator tool, where (for some item types and some languages) rather long descriptions can be generated.

But these examples are “trapped” in their respective tool or environment; other tools, websites, or third-party users have no way to get automated descriptions for Wikidata items easily. That is, until now. AutoDesc is a web API that can generate automated descriptions of almost any Wikidata item; the quality of the description improves with the quality of the statements in the item, of course.

And thanks to node.js, which is now available as a server on Wikimedia Labs (thanks to YuviPanda!), little rewriting of code was necessary; the “long description” generator is, in fact, the exact same source code currently used in Reasonator. This means that previous development by myself and other volunteers is not lost but has paid off, and future improvements to either version of the text generator can simply be copied to the other.

The API takes an item number, a language code, and some other options, and generates a description of that item. It can return the description wrapped in JSON(P) or as an HTML page, and it can generate plain text, wiki markup, or HTML with Wikipedia/Wikidata/Reasonator links. If you request the long description, it will automatically fall back to the short one if the item type or language is not supported for a long description (yet!).
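To make that concrete, here is a minimal sketch of what calling such an API could look like from Python. The parameter names (`q`, `lang`, `mode`, `format`) and the endpoint URL are my assumptions for illustration, not the documented interface; check the tool itself for the real parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the real tool URL may differ.
AUTODESC_BASE = "https://tools.wmflabs.org/autodesc/"

def build_autodesc_url(item, lang="en", mode="short", fmt="json"):
    """Build a request URL asking for an automated description of a
    Wikidata item. All parameter names here are assumptions."""
    params = {"q": item, "lang": lang, "mode": mode, "format": fmt}
    return AUTODESC_BASE + "?" + urlencode(params)

def fetch_description(item, lang="en"):
    """Fetch and decode the JSON-wrapped description (requires network;
    the shape of the response is likewise an assumption)."""
    with urlopen(build_autodesc_url(item, lang)) as resp:
        return json.load(resp)

print(build_autodesc_url("Q42", lang="en"))
```

The point of wrapping the URL construction in its own function is that a third-party tool can switch between short and long descriptions, or between JSON and HTML output, by changing a single argument rather than hand-assembling query strings.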

Now, a word of caution: As I cobbled the text generation together from previously existing code, and code that was intended for use in a browser at that, things may not run as smoothly as one would expect. There is, in fact, little caching, and the cache that exists is not invalidated until the next server restart; an event that will be necessary to put new code live, and that will mean several seconds (the horror!) downtime for the API. If you base anything important on the API at this moment in time, homework will be eaten, data will be lost, and the write-everything-by-hand-fanatics will win. Be warned!

That said, I will try to improve the code over the coming weeks; if you want to help out, you can find the code here.