Livin’ on the edge

A few days ago, Lydia posted about the first prototype of the new structured data system for Commons, based on Wikidata technology. While this is just a first step, structured data for Commons seems finally within reach.

And that brings home the reality of over 32 million files on Commons, all with unstructured data about them in the form of file description pages. It would be an enormous task to manually transcribe all these descriptions, licenses, etc. into the appropriate data structures. And while we will have to do just that for many of the files, the ones that can be transcribed by a machine should be.

So I went ahead and re-wrote a prototype tool I had built for just this occasion a while ago. I call it CommonsEdge (a play on Common sedge). It is both an API and an interface to that API. It parses a file description page on Commons and returns a JSON object with the data elements corresponding to the description page. An important detail is that this parser does not just pick out the elements it understands and ignore the rest; internally, it tries to “explain” all elements of the description (templates, links, categories, etc.) as data, and fails if it cannot explain one. That’s right, the API call will fail with an error unless 100% of the page is represented in the JSON object returned. This prevents “half-parsed” pages; a file description page that is successfully parsed by the API can safely be replaced in its entirety by the resulting structured data. In case of failure, the error message is usually quite specific and detailed about the cause; this allows for incremental improvements of the parser.
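To illustrate the all-or-nothing idea, here is a minimal Python sketch. It is not the CommonsEdge code; the handler list, the `parse_strict` function, and the `UnexplainedElementError` name are all made up for this example, and a real parser would need far more than two regular expressions. The point is only the final check: anything left over after all handlers have run turns the whole call into an error.

```python
import re


class UnexplainedElementError(Exception):
    """Raised when some part of the page could not be turned into data."""


# Hypothetical handlers: each pairs a pattern with a function that turns
# a match into a (type, value) data element.
HANDLERS = [
    (re.compile(r"\[\[Category:([^\]|]+)\]\]"),
     lambda m: ("category", m.group(1).strip())),
    (re.compile(r"\{\{(cc-[a-z0-9.\-]+)\}\}", re.IGNORECASE),
     lambda m: ("license", m.group(1).lower())),
]


def parse_strict(wikitext):
    """Return data elements, or fail if any text remains unexplained."""
    elements = []
    remaining = wikitext
    for pattern, build in HANDLERS:
        for match in pattern.finditer(remaining):
            elements.append(build(match))
        # Remove everything this handler explained from the page text.
        remaining = pattern.sub("", remaining)
    leftover = remaining.strip()
    if leftover:
        # Be specific about what failed, so the parser can be improved.
        raise UnexplainedElementError(f"cannot explain: {leftover[:60]!r}")
    return elements
```

A page consisting only of a known license template and a category parses cleanly, while a page with a bespoke template raises an error that names the offending text instead of returning a partial result.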

At the moment of writing, I find that ~50-60% of file descriptions (based on sets of 1000 random files) produce a JSON object, that is, can be completely understood by the parser, and completely represented in the result. That’s 16-19 million file descriptions that can be converted to structured data automatically, today. Most of the failures appear to be due to bespoke templates; the more common ones can be added over time.
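The back-of-the-envelope arithmetic behind that range can be spelled out as follows. The total of 32 million files is the rounded figure from this post, and the function name is just for illustration; the rates come from small random samples, so the result is an estimate, not a count.

```python
TOTAL_FILES = 32_000_000  # "over 32 million files", rounded down


def estimate_convertible(total, low_percent, high_percent):
    """Scale a sampled success rate (in percent) up to the whole collection."""
    return total * low_percent // 100, total * high_percent // 100


low, high = estimate_convertible(TOTAL_FILES, 50, 60)
print(f"{low / 1e6:.1f} to {high / 1e6:.1f} million files")
```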

A word about the output: Since the structured data setup, including properties and foreign keys, is still in flux, I opted for a simple output format. It is not Wikibase format, but similar; most elements (except categories and coordinates, I think) are just lists of type-and-value tuples (example). I try to use URLs as much as possible, for example, when referencing users on Commons (or other Wikimedia projects) or Flickr. Licenses are currently links to the Wikidata item corresponding to the template used (ideally, I would like to resolve that through Wikidata properties pointing to the appropriate license).
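As a rough idea of what "lists of type-and-value tuples" with URLs could look like, here is a small Python sketch. The field names (`author`, `license`, `categories`), the type labels, and the helper functions are assumptions made for this example and are not the actual CommonsEdge output; the Wikidata ID below is a placeholder, not a real license item.

```python
import json
from urllib.parse import quote


def commons_user_url(username):
    """Build a URL for a Commons user page (spaces become underscores)."""
    return "https://commons.wikimedia.org/wiki/User:" + quote(username.replace(" ", "_"))


def wikidata_url(item_id):
    """Build a URL for a Wikidata item such as a license."""
    return "https://www.wikidata.org/wiki/" + item_id


def build_record(author, license_item, categories):
    """Assemble a hypothetical record: lists of [type, value] pairs."""
    return {
        "author": [["commons_user", commons_user_url(author)]],
        "license": [["wikidata_item", wikidata_url(license_item)]],
        # Categories are kept as a plain list rather than typed pairs.
        "categories": list(categories),
    }


# "Q00000" is a placeholder ID, used only to show the shape of the data.
record = build_record("Example User", "Q00000", ["Carex"])
print(json.dumps(record, indent=2))
```

Using URLs rather than bare names keeps each value unambiguous across projects, which matters when the same username exists on Commons and on Flickr.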

Source code is available. Pull requests are welcome.