
Musings on the backend

So the Wikidata query service (WDQS), currently powered by Blazegraph, does not perform well. Even simple queries time out, it lags behind the live site, and individual instances can (and do) silently go out of sync. The WMF is searching for a replacement.

One of the proposed alternatives to Blazegraph is Virtuoso. It models the graph structure of Wikidata in relational (SQL) tables. Even before I read about that, I had a similar thought. But Virtuoso is a general-purpose solution, whereas Wikidata is a special case with optimization potential. So I went and did a bit of tinkering. The following was done in the last few days, so it’s not exactly pretty, but initial results are promising.

I downloaded a test dataset of ~108K RDF triples from Wikidata. I then wrote some Rust code to (hackishly) parse the triples and transform them. I create MySQL tables on demand, based on the RDF predicate, which (often enough) is the Wikidata property used. For example, the data__PropertyDirect_P31__Entity__Entity table contains the entity-to-entity mapping for P31 (instance of). Since we know that the entire table is about P31, we don’t have to encode the property in the rows; two columns of VARCHAR(64) are enough. This could be compressed further (two INT columns plus an enum to encode Q/P/M/L and the sub-lexeme types should do), but I’m keeping it simple for now.
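To make the mapping concrete, here is a minimal sketch of how a predicate could be turned into a table name and a matching CREATE TABLE statement. The function names and the key layout are my own illustration, not taken from the actual code; only the naming scheme and the two VARCHAR(64) columns follow the description above.

// Sketch of the predicate-to-table mapping described above.
// Function names and index layout are illustrative assumptions.

/// Turn a direct-property predicate like "wdt:P31" into a table name
/// following the data__PropertyDirect_P31__Entity__Entity convention.
fn table_for_predicate(predicate: &str) -> Option<String> {
    let prop = predicate.strip_prefix("wdt:")?; // e.g. "P31"
    Some(format!("data__PropertyDirect_{prop}__Entity__Entity"))
}

/// DDL for such a table: just two VARCHAR(64) columns, since the
/// property itself is already encoded in the table name.
fn create_table_sql(table: &str) -> String {
    format!(
        "CREATE TABLE IF NOT EXISTS `{table}` (
            `k0` VARCHAR(64) NOT NULL,
            `v0` VARCHAR(64) NOT NULL,
            PRIMARY KEY (`k0`,`v0`),
            KEY (`v0`)
        )"
    )
}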

Importing the 108K triples into MySQL creates 1434 tables. MySQL should scale to millions of tables, so no problem. The initial import takes 8 minutes, but that seems to be due to the table creation, which would only happen initially on a full Wikidata import. Indeed, re-running the exact same import finishes in 48 seconds (data won’t change, thanks to INSERT IGNORE). I don’t know how much of this is network or database; commenting out the INSERT command on a re-import processes the 108K entries in 2.5 seconds, so that’s the parsing speed, thanks to the generous use of multi-threaded async Rust.
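For illustration, a batched import with INSERT IGNORE could look roughly like the sketch below. I’m assuming the mysql_async crate here; the actual importer may use a different client, and the function itself is hypothetical.

// Sketch of a batched INSERT IGNORE import (assumes the mysql_async crate).
use mysql_async::{prelude::*, Pool};

async fn import_pairs(
    pool: &Pool,
    table: &str,                // e.g. "data__PropertyDirect_P31__Entity__Entity"
    pairs: &[(String, String)], // (k0, v0) rows for that table
) -> Result<(), mysql_async::Error> {
    let mut conn = pool.get_conn().await?;
    // INSERT IGNORE makes re-imports cheap: rows that already exist are skipped.
    let sql = format!("INSERT IGNORE INTO `{table}` (`k0`,`v0`) VALUES (?,?)");
    conn.exec_batch(sql, pairs.iter().map(|(k, v)| (k.as_str(), v.as_str())))
        .await?;
    Ok(())
}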

Adding (and removing) RDF triples as they change on Wikidata should therefore be no problem; I suspect changes could be reflected with barely measurable delay. As for querying, Wikipedia uses read-only replicas, which can be scaled to demand.
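As a purely hypothetical illustration of such an update, a changed statement would boil down to one DELETE of the old triple and one INSERT IGNORE of the new one on the matching property table (again assuming mysql_async; the helper and the change format are made up for this sketch):

// Hypothetical update path: a change on Wikidata becomes a DELETE of the old
// triple and an INSERT IGNORE of the new one in the per-property table.
use mysql_async::{prelude::*, Pool};

async fn apply_change(
    pool: &Pool,
    table: &str,
    entity: &str,            // e.g. "Q42"
    old_value: Option<&str>, // None if the statement was just added
    new_value: Option<&str>, // None if the statement was just removed
) -> Result<(), mysql_async::Error> {
    let mut conn = pool.get_conn().await?;
    if let Some(old) = old_value {
        conn.exec_drop(
            format!("DELETE FROM `{table}` WHERE `k0`=? AND `v0`=?"),
            (entity, old),
        )
        .await?;
    }
    if let Some(new) = new_value {
        conn.exec_drop(
            format!("INSERT IGNORE INTO `{table}` (`k0`,`v0`) VALUES (?,?)"),
            (entity, new),
        )
        .await?;
    }
    Ok(())
}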

But can we actually query? Yes we can! I wrote a “simple case” query generator that takes triples and can “connect” them, currently only with “and” (the “.” in SPARQL). So:

// Triple 1: ?person is an instance of (P31) human (Q5)
let mut qt1 = QueryTriples::from_str(&app, "?person", "wdt:P31", "wd:Q5").await?;
// Triple 2: ?person has sex or gender (P21) female (Q6581072)
let qt2 = QueryTriples::from_str(&app, "?person", "wdt:P21", "wd:Q6581072").await?;
// Combine the two triples (the "." in SPARQL) and run the generated SQL
qt1.and(&qt2)?;
let result = qt1.run().await?;

asks for all people (“P31:Q5”) who are female (“P21:Q6581072”). These triples are transformed into individual SQL queries, which are then joined. The query generated in this case is:

SELECT t1.person 
FROM (SELECT k0 AS `person` FROM `data__PropertyDirect_P31__Entity__Entity` WHERE `v0`="Q5") AS t1
INNER JOIN (SELECT k0 AS `person` FROM `data__PropertyDirect_P21__Entity__Entity` WHERE `v0`="Q6581072") AS t2
ON t1.person=t2.person;
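To give a sense of how such SQL could come about, here is a simplified sketch of the idea only, not the actual QueryTriples implementation: each triple becomes a subquery on its property table, and “and” joins two subqueries on their shared variable.

// Simplified sketch of the query generation idea, not the actual QueryTriples code.
struct SubQuery {
    variable: String, // e.g. "person"
    sql: String,      // subquery returning that variable
}

/// Per-triple subquery: the property picks the table, the fixed object
/// becomes the WHERE clause, the subject variable is the output column.
fn triple_subquery(variable: &str, property: &str, object: &str) -> SubQuery {
    SubQuery {
        variable: variable.to_string(),
        sql: format!(
            "SELECT k0 AS `{variable}` FROM `data__PropertyDirect_{property}__Entity__Entity` WHERE `v0`=\"{object}\""
        ),
    }
}

/// "and" two subqueries: INNER JOIN them on the shared variable.
fn and(t1: &SubQuery, t2: &SubQuery) -> String {
    format!(
        "SELECT t1.{v} FROM ({q1}) AS t1 INNER JOIN ({q2}) AS t2 ON t1.{v}=t2.{v}",
        v = t1.variable,
        q1 = t1.sql,
        q2 = t2.sql
    )
}

Calling triple_subquery for ("person", "P31", "Q5") and ("person", "P21", "Q6581072") and joining the results with and() yields essentially the query shown above.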

In the test dataset, there is one result (Q241961). The query runs in ~100 ms (on the shared Toolforge tool database, accessed remotely). The same query on Blazegraph takes about a minute (yes, it yields 2 million results, but even with all that data loaded into this database, the SQL query would be the same, and an indexed INNER JOIN is probably as fast as we can get it).

This implementation is obviously far from complete, and could be optimized a lot. I do not have the time to complete it, especially since it would likely not be used anyway. I guess the days of my weekend hack becoming production code are over. But maybe it can give inspiration to others.

You can find my code here, and if you have a Toolforge account, you can point it to the ss55360__wdqsbe_p database.
