In my last post, I talked about Wikidata Query (WDQ), a web API and tool to run complex queries on the Wikidata corpus. There has been some interest in this system “as is”, but also in the possibility of using the code in third-party tools. While the code (C++, if you want to know) is open source, it was originally developed only for this specific web service. Even loading the data was clunky: a Wikidata dump had to be processed by a PHP (!) script into a tabbed file format storing claims, and then piped into the actual service.
Now, things have changed for the better! For one, the code has been reorganized into three parts:
- Shared “core” code for reading data, storing it in RAM, and running queries
- The web server running the API and query tool
- A simple tool for rudimentary tasks around WDQ, also serving as a demo for how to write your own tools
A second improvement is the ability to read (decompressed) Wikidata XML dumps directly. Not all information is currently used: labels, descriptions, wiki-links, qualifiers, and sources are ignored. With the exception of qualifiers, however, these have limited value for data queries.
While this does away with the PHP script, the latest Wikidata dump (September 22, 2013; 1.8GB, bzip2-compressed) still takes a whopping 42 minutes (tested on the WDQ machine) to read into memory. This time can undoubtedly be improved, but a significant part of it (30%, by my rough estimate) is spent reading, decompressing, and piping the XML, which cannot really be improved much.
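To give a feel for what reading a dump directly involves, here is a minimal Python sketch of streaming pages out of the XML one at a time. This is not the C++ parser used in WDQ. It assumes the usual MediaWiki export layout of `<page>` elements containing a `<title>` and a `<text>`, and the function names and sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for every <page> element, one at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release this page's subtree once it has been handled


# Tiny in-memory stand-in for a real dump (hypothetical content).
SAMPLE = b"""<mediawiki>
  <page><title>Q1</title><revision><text>{"claims": []}</text></revision></page>
  <page><title>Q2</title><revision><text>{"claims": []}</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a real dump, the same function could be fed `sys.stdin.buffer` with the decompressed XML piped in, which is the "reading, decompressing, and piping" cost mentioned above.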
So, as a third improvement, I am introducing a bespoke binary format. Like the XML parser above, it is incomplete, but sufficient for the task at hand. What is the advantage of this? First, file size: the 1.8GB bzip2-compressed dump file shrinks to 224MB (uncompressed), or about 12%. While this format is highly “lossy”, omitting a lot of information, it still carries all the data necessary to perform WDQ queries.
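Why is a binary format so compact and quick to read? The following Python sketch illustrates the general idea with fixed-width records. The layout is purely hypothetical (three little-endian unsigned 32-bit integers per item-to-item connection) and is not the actual WDQ file format; the example IDs are illustrative only.

```python
import struct

# Hypothetical record layout, NOT the real WDQ format:
# item ID, property ID, target item ID as three little-endian uint32 values.
RECORD = struct.Struct("<III")


def pack_connections(connections):
    """Serialize (item, property, target) triples into one bytes object."""
    return b"".join(RECORD.pack(*c) for c in connections)


def unpack_connections(blob):
    """Read the triples back; no text parsing is needed, only fixed-size slices."""
    return list(RECORD.iter_unpack(blob))


connections = [(42, 31, 5), (64, 17, 183)]  # illustrative IDs
blob = pack_connections(connections)
restored = unpack_connections(blob)
print(len(blob), restored)
```

Each record has a known size (12 bytes here), so a loader can read the file in large chunks and fill its in-memory tables without tokenizing XML or JSON.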
The second, and more important, advantage is load time, which goes down from 42 minutes to 25 seconds, or about 1%. WDQ requires ~500MB of RAM to hold this data in a query-able way. Yes, this is for the entirety of Wikidata as of last week, including:
- 4,239,729 monolingual strings
- 265,770 time/date values
- 749,051 coordinates
- 15,284,254 item-to-item connections
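The size and load-time figures quoted above can be checked with a few lines of arithmetic. This sketch uses only the numbers given in this post, treating 1.8GB as 1800MB; the variable names are my own.

```python
dump_mb = 1800           # 1.8GB bzip2-compressed XML dump, taken as 1800MB
binary_mb = 224          # uncompressed binary file
xml_load_s = 42 * 60     # 42 minutes to load from XML
binary_load_s = 25       # 25 seconds to load from the binary file

size_ratio = binary_mb / dump_mb          # fraction of the original file size
time_ratio = binary_load_s / xml_load_s   # fraction of the original load time
speedup = xml_load_s / binary_load_s      # how many times faster loading is

# Total of the four counts listed above.
listed = 4_239_729 + 265_770 + 749_051 + 15_284_254

print(f"size: {size_ratio:.1%}, time: {time_ratio:.1%}, "
      f"speedup: {speedup:.0f}x, listed values: {listed:,}")
```

In other words, loading from the binary file is roughly a hundred times faster than loading from the XML dump.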
I hope that these and future improvements will get more people interested in this code base: maybe volunteers to help me improve and maintain it, and of course people who adapt it for their own tools. After all, Wikidata just got a bit more accessible.