Papers on Rust

I have written before about my attempts at using Rust with MediaWiki. This post is an update on my progress.

I started out writing a MediaWiki API crate to be able to talk to MediaWiki installations from Rust. I was then pointed to a wikibase crate by Tobias Schönberg and others, to which I subsequently contributed some code: improvements to the existing codebase, an “entity cache” struct (to retrieve and manage larger numbers of entities from Wikibase/Wikidata), and an “entity diff” struct.

The latter is something I had started in PHP before, but never really finished. The idea is that, when creating or updating an entity, instead of painstakingly testing if each statement/label/etc. exists, one simply creates a new, blank item, fills it with all the data that should be in there, and then generates a “diff” to a blank (for creating) or existing (for updating) entity. That diff can then be passed to the wbeditentity API action. The diff generation can be fine-tuned, e.g. only add English labels, or add/update (but not remove) P31 statements.
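To illustrate the idea, here is a minimal sketch of such a diff, reduced to labels only. It is not the API of the wikibase crate: `Item`, `DiffParams`, `LabelDiff`, `diff_labels` and the `item` helper are hypothetical names made up for this example, and a real diff would also cover statements, descriptions, aliases and sitelinks before being serialized for wbeditentity.

```rust
use std::collections::BTreeMap;

/// Hypothetical, heavily simplified entity: labels only, keyed by language code.
#[derive(Debug, Default, Clone, PartialEq)]
struct Item {
    labels: BTreeMap<String, String>,
}

/// Hypothetical fine-tuning options for the diff.
struct DiffParams {
    /// Languages the diff may touch; `None` means all languages.
    languages: Option<Vec<String>>,
    /// Whether labels missing from the desired item may be removed.
    allow_remove: bool,
}

/// The changes needed to turn the existing item into the desired one.
#[derive(Debug, Default, PartialEq)]
struct LabelDiff {
    set: BTreeMap<String, String>,
    remove: Vec<String>,
}

/// Compare the desired item against an existing (or blank) item.
fn diff_labels(desired: &Item, existing: &Item, params: &DiffParams) -> LabelDiff {
    let allowed = |lang: &str| match &params.languages {
        Some(list) => list.iter().any(|l| l.as_str() == lang),
        None => true,
    };
    let mut diff = LabelDiff::default();
    // Add or update every desired label that is missing or different.
    for (lang, label) in &desired.labels {
        if allowed(lang.as_str()) && existing.labels.get(lang) != Some(label) {
            diff.set.insert(lang.clone(), label.clone());
        }
    }
    // Optionally remove labels that exist but are no longer wanted.
    if params.allow_remove {
        for lang in existing.labels.keys() {
            if allowed(lang.as_str()) && !desired.labels.contains_key(lang) {
                diff.remove.push(lang.clone());
            }
        }
    }
    diff
}

/// Small helper to build an item from (language, label) pairs.
fn item(labels: &[(&str, &str)]) -> Item {
    Item {
        labels: labels
            .iter()
            .map(|(lang, label)| (lang.to_string(), label.to_string()))
            .collect(),
    }
}
```

Diffing against `Item::default()` covers the "create" case, since every desired label then ends up in `set`. With `languages: Some(vec!["en".to_string()])` and `allow_remove: false`, only English labels are added or changed, which mirrors the "only add English labels" kind of fine-tuning.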

Armed with these two crates, I went to re-create functionality that I had written in PHP before: the creation and updating of items for scientific publications, mainly used in my SourceMD tool. The code underlying that tool has grown over the years, meaning it’s a mess, and it has also developed some idiosyncrasies that lead to unfortunate edits.

For a rewrite in Rust, I also wanted to make the code more modular, especially regarding the data sources. I did find a crate to query CrossRef, but not much else. So I wrote new crates to query PubMed, ORCID, and Semantic Scholar. All these crates are completely independent of MediaWiki/Wikibase; they can be re-used in any kind of Rust code related to scientific publications. I consider them a sound investment in the Rust crate ecosystem.

With these crates in a basic but usable state, I went to write papers, Rust code (not a crate just yet) to gather data from the above sources and inject it into Wikidata. I wrote a Rust trait to represent a generic source, and then wrote adapter structs for each of the sources. Finally, I added some wrapper code to take a list of adapters, query them about a paper, and update Wikidata accordingly. It can already:

  • iteratively gather IDs (supply a PubMed ID, PubMed might get a DOI, which then can get you data from ORCID)
  • find item(s) for these IDs on Wikidata
  • gather information about the authors of the paper
  • find items for these authors on Wikidata
  • create new author items, or update existing ones, on Wikidata
  • create new paper items, or update existing ones, on Wikidata (no author statement updates yet)

The adapter trait is designed both to unify data across sources (e.g. use standardized author information) and to allow updating paper items with source-specific data (e.g. publication dates, MeSH terms). The system is open to adding more adapters for different sources. It is also flexible enough to extend to other, similar “publication types”, such as books, or maybe even artworks. My test example shows how easy it is to use in other code; indeed, I am already using it in Rust code I developed for my work (publication in progress).
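As a rough sketch of how such a trait and the iterative ID gathering might fit together (this is not the actual papers code): `SourceAdapter`, `Paper`, `IdKind`, `gather` and the two mock adapters are hypothetical names for this example, and the mocks return hard-coded, made-up values instead of querying PubMed or ORCID.

```rust
use std::collections::BTreeMap;

/// Kinds of external identifiers a paper can have (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum IdKind {
    PubMed,
    Doi,
}

/// Unified view of a paper, filled in by the adapters.
#[derive(Debug, Default)]
struct Paper {
    ids: BTreeMap<IdKind, String>,
    title: Option<String>,
    authors: Vec<String>,
}

/// A generic data source (hypothetical trait, not the real one).
trait SourceAdapter {
    /// Given the IDs known so far, return any further IDs this source knows.
    fn discover_ids(&self, known: &BTreeMap<IdKind, String>) -> Vec<(IdKind, String)>;
    /// Add source-specific data to the paper.
    fn update_paper(&self, paper: &mut Paper);
}

/// Mock adapter: maps one made-up PubMed ID to a made-up DOI and title.
struct MockPubMed;

impl SourceAdapter for MockPubMed {
    fn discover_ids(&self, known: &BTreeMap<IdKind, String>) -> Vec<(IdKind, String)> {
        match known.get(&IdKind::PubMed) {
            Some(pmid) if pmid.as_str() == "12345" => {
                vec![(IdKind::Doi, "10.1000/example".to_string())]
            }
            _ => vec![],
        }
    }
    fn update_paper(&self, paper: &mut Paper) {
        if paper.ids.contains_key(&IdKind::PubMed) {
            paper.title = Some("An example paper".to_string());
        }
    }
}

/// Mock adapter: can only contribute author data once a DOI is known.
struct MockOrcid;

impl SourceAdapter for MockOrcid {
    fn discover_ids(&self, _known: &BTreeMap<IdKind, String>) -> Vec<(IdKind, String)> {
        vec![]
    }
    fn update_paper(&self, paper: &mut Paper) {
        if paper.ids.contains_key(&IdKind::Doi) {
            paper.authors.push("Jane Doe".to_string());
        }
    }
}

/// Ask all adapters for IDs until no new ones turn up, then collect the data.
fn gather(adapters: &[Box<dyn SourceAdapter>], seed: (IdKind, String)) -> Paper {
    let mut paper = Paper::default();
    paper.ids.insert(seed.0, seed.1);
    loop {
        let mut found_new = false;
        for adapter in adapters {
            for (kind, value) in adapter.discover_ids(&paper.ids) {
                if !paper.ids.contains_key(&kind) {
                    paper.ids.insert(kind, value);
                    found_new = true;
                }
            }
        }
        if !found_new {
            break;
        }
    }
    for adapter in adapters {
        adapter.update_paper(&mut paper);
    }
    paper
}
```

Calling `gather` with a list such as `vec![Box::new(MockOrcid), Box::new(MockPubMed)]` and a PubMed ID as the seed shows why the ID loop comes first: the ORCID mock only has something to contribute once the PubMed mock has supplied the DOI, regardless of the order of the adapters. Supporting a new source then means writing one more struct that implements the trait.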

I see all of this as a seeding of the Rust crate system with easily reusable, MediaWiki-related code. I will add more such code in the future, and hope this will help in the adoption of Rust in the MediaWiki programmer community. Watch this space.

Update: Now available as a crate!

One Comment

  1. GerardM wrote:

    When will this code go live in SourceMD ?? Would it be possible to add a “sanity check”? It would be cool when double entries of author names strings are handled and the deletion when it has not been done before.

    Thursday, May 16, 2019 at 11:22 | Permalink