Open source projects like Linux, and open content projects like Wikipedia and Wikidata, are fine things indeed by themselves. However, the power of individual projects is multiplied if they can be linked up. For free software, this can be taken literally; linking libraries to your code is what allows complex applications to exist. For open data, links can be direct (as in weblinks, or external catalog IDs on Wikidata), or via a third party.
Recently, and once again, Peter Murray-Rust (of Blue Obelisk, CML, and Wikimania 2014 fame) has put his code where his mouth is. ContentMine harvests open access scientific publications and automatically extracts “facts”, such as mentions of species names. These facts are accessible through an API. Due to resource limitations, the facts are only stored temporarily, and will be lost after some time (though they can be regenerated automatically from the publications). Likewise, the search function is rather rudimentary.
Why is this important? Surely these publications are Google-indexed, and you can find what you want by just typing keywords into a search engine; text/data mining would be a waste of time, right? Well, not quite. With over 50 million research papers published (as of 2009), your search terms will have to be very tightly phrased to get a useful answer. Of course, if you use overly specific search terms, you are likely to miss that one paper you were looking for.
At the time of writing this, ContentMine is only a few weeks old; it contains fewer than 2,000 facts (all of them species names), extracted from 18 publications. But even this tiny amount of data allows for a demonstration of what the linking of open data projects can accomplish.
Since all facts from ContentMine are CC-BY, I wrote some code to archive the “fact stream” in a database on Labs. As a second step, I use WDQ to automatically match species names to Wikidata items, where possible. Then, I slapped a simple interface on top, which lets a user query the database. One can use either a (trivial) name search in facts, or use a WDQ query; the latter would return a list of papers that contain facts that match items returned from WDQ.
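To make the pipeline a little more concrete, here is a minimal sketch of the archive-and-match idea, not the actual code running on Labs. It assumes facts arrive as (paper, species name) pairs, uses an in-memory SQLite database, and replaces the WDQ lookup with a plain dictionary; the table layout, function names, paper labels and item IDs are all made up for illustration.

```python
import sqlite3


def open_archive():
    """Create an in-memory archive; a real one would be a persistent database."""
    db = sqlite3.connect(":memory:")
    # item holds the matched Wikidata item and stays NULL until a match is found
    db.execute(
        "CREATE TABLE facts ("
        " paper TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " item TEXT,"
        " UNIQUE (paper, name))"
    )
    return db


def archive_facts(db, facts):
    """Store (paper, species name) pairs; facts seen before are ignored."""
    db.executemany(
        "INSERT OR IGNORE INTO facts (paper, name) VALUES (?, ?)", facts
    )


def match_items(db, lookup):
    """Fill in item IDs for names the lookup knows; other facts stay unmatched."""
    for name, item in lookup.items():
        db.execute(
            "UPDATE facts SET item = ? WHERE name = ? AND item IS NULL",
            (item, name),
        )


def search_by_name(db, text):
    """The trivial name search: papers with a fact whose name contains the text."""
    rows = db.execute(
        "SELECT DISTINCT paper FROM facts WHERE name LIKE ? ORDER BY paper",
        (f"%{text}%",),
    )
    return [paper for (paper,) in rows]


# Made-up fact stream; the last entry repeats the first one.
stream = [
    ("paper-A", "Homo sapiens"),
    ("paper-A", "Pongo pygmaeus"),
    ("paper-B", "Homo sapiens"),
    ("paper-A", "Homo sapiens"),
]
db = open_archive()
archive_facts(db, stream)
# Placeholder ID standing in for whatever the real name matching returns.
match_items(db, {"Homo sapiens": "item-human"})
print(search_by_name(db, "Pongo"))
```

Keeping the item column empty until a match is found means that names without a Wikidata item are still archived and still show up in the name search.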
If that sounds too complicated, try the example query on the interface page. This will:
1. get all species from the Wikidata species tree with root “human” (which is only the item for “human”); query string “tree”
2. get all species from the Wikidata species tree with root “orangutan”; query string “tree”
3. show all papers that have at least one item from 1. and at least one item from 2. as facts
At the moment, this is only one paper, one that talks about both Homo sapiens (humans) and Pongo pygmaeus (Bornean orangutans). But here is the point: we did not search for Pongo pygmaeus! We only queried for “any species of orangutans”. ContentMine knows about species mentioned in papers, Wikidata knows about species, and WDQ can query Wikidata. By putting these parts together, even if only in such a simple fashion, we have multiplied the power of what we can achieve!
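The combining step itself is small. As a rough sketch, assume the facts have already been matched to items and each sub-query has returned a set of items; a paper then qualifies if its facts overlap with every one of those sets. The function name, the paper labels and the item names below are placeholders for illustration, not real Wikidata IDs or the interface's actual code.

```python
from collections import defaultdict


def papers_matching_all(facts, item_groups):
    """Return papers that have at least one fact item in every group.

    facts: iterable of (paper, item) pairs, item being a matched item ID.
    item_groups: list of sets of item IDs, one set per sub-query result.
    """
    items_by_paper = defaultdict(set)
    for paper, item in facts:
        items_by_paper[paper].add(item)
    return sorted(
        paper
        for paper, items in items_by_paper.items()
        if all(items & group for group in item_groups)
    )


# Made-up matched facts; item names are readable placeholders.
facts = [
    ("paper-A", "human"),
    ("paper-A", "bornean-orangutan"),
    ("paper-B", "human"),
    ("paper-C", "sumatran-orangutan"),
]
human_tree = {"human"}
orangutan_tree = {"orangutan", "bornean-orangutan", "sumatran-orangutan"}

result = papers_matching_all(facts, [human_tree, orangutan_tree])
print(result)
```

Here only paper-A qualifies: it is the only paper with a fact from both trees, even though nobody asked for the Bornean orangutan by name.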
While this example might not strike you as particularly impressive, it should suffice to get the point across. Imagine many more publications (and yes, thanks to a recent legal decision, ContentMine can also harvest “closed” journals), and many more types of facts (places, chemicals, genetic data, etc.). Once we can query millions of papers for the effects of a group of chemicals on bacterial species in a certain genus, or with a specific property, the power of accessing structured knowledge will become blindingly obvious.