ORCID mania

ORCID is an increasingly popular service for disambiguating authors of scientific publications. Many journals and funding bodies require authors to register their ORCID ID these days. Wikidata has a property for ORCID; however, only ~2400 items have an ORCID property at the time of writing this blog post. That is not a lot, considering Wikidata contains 728,112 scientific articles.

Part of the problem is that it is not easy to get ORCIDs and their connections to publications in an automated fashion. It appears that several databases, public or partially public, each contain parts of the puzzle required for determining the ORCID for a given Wikidata author.

So I had a quick look, and found that, on the ORCID web site, one can search for a publication DOI, and retrieve the list of authors in the ORCID system that “claim” that DOI. That author list contains variations on author names (“John”, “Doe”, “John Doe”, “John X. Doe” etc.) and their ORCID IDs. Likewise, I can query Wikidata for a DOI, and get an item about that publication; that item contains statements with authors that have an item (“P50”). Each of these authors has a name.
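The Wikidata half of that lookup can be sketched as a SPARQL query. This is a minimal illustration, not the bot's actual code: the function name `build_author_query` is made up for this example, and it assumes the usual Wikidata properties P356 (DOI), P50 (author) and P496 (ORCID iD), plus the convention that DOIs are stored in upper case.

```python
def build_author_query(doi: str) -> str:
    """Return a SPARQL query listing the author items (P50) of the work with this DOI.

    Hypothetical helper for illustration only; it just builds the query text.
    """
    # Assumption: Wikidata stores DOIs in upper case. Escape backslashes and
    # quotes so the DOI cannot break out of the string literal.
    safe_doi = doi.strip().upper().replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?work ?author ?authorLabel ?orcid WHERE {{
  ?work wdt:P356 "{safe_doi}" .
  ?work wdt:P50 ?author .
  OPTIONAL {{ ?author wdt:P496 ?orcid . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


if __name__ == "__main__":
    # "10.1000/xyz123" is a placeholder DOI, not a real publication.
    print(build_author_query("10.1000/xyz123"))
```

The resulting text could be sent to the Wikidata Query Service; each row would then give one author item, its label, and its ORCID if it already has one.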

Now, we have two lists of authors (one from ORCID, one from Wikidata), both reasonably short (say, twenty entries each), that should overlap to some degree, and they are both lists of authors for the same publication. They can now be joined via name variations, excluding multiple hits (there may be two “John Doe”s in the author list of a publication; this happens a lot with Asian names), as well as excluding authors that already have an ORCID ID on Wikidata.
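A minimal sketch of such a join, under stated assumptions: the input shapes, the `normalize` rule (lower-casing, dropping periods) and the function names are invented for this example and are not taken from the bot. It keeps a match only when a Wikidata author name maps to exactly one ORCID, that name occurs only once in the Wikidata list, and the author has no ORCID yet.

```python
from collections import defaultdict


def normalize(name):
    """Lower-case a name, drop periods, and collapse whitespace (assumed rule)."""
    return " ".join(name.lower().replace(".", " ").split())


def match_authors(orcid_authors, wikidata_authors):
    """Join two author lists for the same publication via name variations.

    orcid_authors:    list of (orcid_id, [name variations]) tuples
    wikidata_authors: list of dicts with keys "item", "name", "orcid"
    Returns {wikidata_item: orcid_id} for unambiguous, new matches only.
    """
    # Map each normalized name variation to the set of ORCIDs that use it.
    orcids_by_name = defaultdict(set)
    for orcid_id, variations in orcid_authors:
        for variation in variations:
            orcids_by_name[normalize(variation)].add(orcid_id)

    # Count how often each normalized name occurs among the Wikidata authors.
    name_counts = defaultdict(int)
    for author in wikidata_authors:
        name_counts[normalize(author["name"])] += 1

    matches = {}
    for author in wikidata_authors:
        if author.get("orcid"):
            continue  # already has an ORCID on Wikidata
        key = normalize(author["name"])
        candidates = orcids_by_name.get(key, set())
        # Exclude multiple hits on either side (e.g. two "John Doe"s).
        if len(candidates) == 1 and name_counts[key] == 1:
            matches[author["item"]] = next(iter(candidates))
    return matches


if __name__ == "__main__":
    # All identifiers below are placeholders.
    orcid_side = [
        ("0000-0001-0000-0001", ["John", "Doe", "John Doe", "John X. Doe"]),
        ("0000-0001-0000-0002", ["Wei Zhang"]),
        ("0000-0001-0000-0003", ["Wei Zhang"]),
    ]
    wikidata_side = [
        {"item": "Q1", "name": "John X. Doe", "orcid": None},
        {"item": "Q2", "name": "Wei Zhang", "orcid": None},
        {"item": "Q3", "name": "John Doe", "orcid": "0000-0001-0000-0009"},
    ]
    print(match_authors(orcid_side, wikidata_side))
```

In the example data, only "John X. Doe" yields a match: "Wei Zhang" is claimed by two ORCIDs, and the third author already has an ORCID.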

I have written a bot that takes random DOIs from Wikidata, queries them in ORCID, and compares the author lists. In a first run, 5,000 random DOIs yielded 123 new ORCID connections; manual sampling of the matches looked quite good, so I am adding them via QuickStatements (sample of edits).
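For illustration, matches of this kind could be turned into QuickStatements input roughly as follows. This is a sketch, not the bot's code: it assumes the tab-separated QuickStatements (version 1) command format and the ORCID iD property P496, and the function name and item IDs are placeholders.

```python
import re

# Shape of an ORCID iD: four groups of four characters, last one may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def to_quickstatements(matches):
    """Turn {wikidata_item: orcid_id} into tab-separated QuickStatements lines.

    Entries whose ORCID does not have the expected shape are skipped.
    """
    lines = []
    for item, orcid in sorted(matches.items()):
        if not ORCID_PATTERN.match(orcid):
            continue  # malformed identifier, do not emit an edit
        # String values are wrapped in double quotes in QuickStatements.
        lines.append(f'{item}\tP496\t"{orcid}"')
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder item IDs and ORCID iDs.
    print(to_quickstatements({"Q1": "0000-0001-0000-0001", "Q2": "not-an-orcid"}))
```

Checking the identifier shape before emitting a line is a cheap guard against pushing obviously broken values; it does not verify the ORCID checksum.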

Unless this meets with “social resistance”, I can have the bot perform these edits regularly, which would keep Wikidata up-to-date with ORCIDs.

Additionally, there is an “author name string” property, which for now stores just the author name, for authors that do not have an item yet. If the ORCID list matches one of these names, an item could automatically be created for that author, including the ORCID ID and the association to the publication item. Please let me know if this would be desirable.

13 Comments

  1. Andy Mabbett wrote:

    Thank you so much for this, Magnus. It’s just the kind of solution I was hoping for when I wrote this:

    https://groups.google.com/a/wikimedia.org/forum/#!topic/wikicite-discuss/orRFkcPdt6s

    in my capacity of Wikimedian in Residence at ORCID

    Tuesday, June 20, 2017 at 10:49 | Permalink
  2. Andy Mabbett wrote:

    …and yes; please create new items for authors, as you suggest.

    Tuesday, June 20, 2017 at 10:55 | Permalink
  3. Jakob Voß wrote:

    Thanks for the quick analysis and solution! I hesitate to welcome new items for authors if all we have is an ORCID and a name. There should be some more information such as homepage to justify an item. Moreover Wikidata may already contain a person item e.g. connected to VIAF so we would create duplicates without further checking.

    Tuesday, June 20, 2017 at 11:48 | Permalink
  4. Andy Mabbett wrote:

    Jakob: It’s not a case of “all we have is an ORCID and a name”; we also have a work.

    Duplicates in such cases are a “necessary evil”, and we have methods to detect and merge them as more data is added.

    Tuesday, June 20, 2017 at 13:08 | Permalink
  5. Magnus wrote:

    The bot will now check if a “string author” already is known to Wikidata by ORCID, and use that.
    Also, adding homepage URL via ORCID to new authors (code active but not tested, few people seem to have that).

    Tuesday, June 20, 2017 at 14:03 | Permalink
  6. Andy Mabbett wrote:

    You should also be able to obtain other IDs from ORCID records, such as Scopus and Researcher ID, where present.

    Tuesday, June 20, 2017 at 14:44 | Permalink
  7. Magnus wrote:

    A separate bot might be better for that, because this one doesn’t check ORCID if an author already has an ID.

    Tuesday, June 20, 2017 at 14:47 | Permalink
  8. The sample edits look good. Great work – thanks! Would be nice to run that as a cron job.

    I’d be hesitant to recommend automatic creation of author items based on P2093 statements on paper items, even if the ORCID record for that paper has an author whose name matches that P2093 string – as Jakob points out, this is likely to result in duplicates.

    What about bringing that information into Mix’n’Match instead of editing Wikidata directly?

    Other points to consider: for a paper that is indexed in both ORCID and Wikidata, can we leverage existing information from ORCID for Wikidata (or vice versa) on its association with authors?

    Likewise, for a given author indexed on both platforms and linked to at least one paper, can we harvest other identifiers or education/ employment etc. or even co-author information (e.g. via author order; for this and/ or other papers) for Wikidata?

    This Python wrapper of the ORCID API may be useful: https://github.com/pyOpenSci/pyApiToolkit/blob/master/code/orcid.py .

    Tuesday, June 20, 2017 at 23:54 | Permalink
  9. Magnus wrote:

    It is now running as a cron job. Including the creation of new authors. I am aware of the risk of duplicates, but I believe “seeding the ORCID pool” on Wikidata is more important, long term, than avoiding a few duplicates.

    Checking with ORCID on other things might help here as well. At the very least, we could get, for an author, the papers on Wikidata where s/he is an author (according to ORCID), and find collisions/duplicates.

    Meanwhile, two or more items with the same ORCID: http://tinyurl.com/y8kz5rf2

    Wednesday, June 21, 2017 at 08:35 | Permalink
  10. Andy Mabbett wrote:

    I’ve been watching this job as it progresses. We have already more than *doubled* the number of ORCID iDs in Wikidata.

    I’ve seen relatively few duplicates, and where they have been created, merging them has enabled items about papers using author strings to instead have author items, added with a high degree of confidence, and which of course include an external iD.

    This is all most satisfying.

    What someone could usefully do at some point, is to look at all the ORCID records we link to, and import ISNIs, ResearcherIDs and Scopus IDs, for which ORCID has distinct properties, and to extract Twitter, Google Scholar and ResearchGate IDs, and perhaps others, plus LinkedIn Profile URLs, from their generic “websites” field. I would like us to be able to import home page URLs as well, but many researchers have the unfortunate trait of linking to an institutional page or site, rather than something specific to themselves.

    Thursday, June 22, 2017 at 23:19 | Permalink
  11. Andy Mabbett wrote:

    As luck would have it, immediately after I made my previous comment, I found a very rich example, that illustrates just what is possible.

    All the data added here came from the subject’s ORCID record:

    https://www.wikidata.org/w/index.php?title=Q5543353&type=revision&diff=504969386&oldid=484572831

    Thursday, June 22, 2017 at 23:50 | Permalink
  12. Christian Kleineidam wrote:

    I support both adding ORCIDs to existing authors and the creation of new author items through the proposed process.

    As a matter of process I however don’t think this blog is the right venue to have that discussion. How about requesting permission for the bot that creates new author items at https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot ?

    Monday, June 26, 2017 at 16:37 | Permalink
  13. Andy Mabbett wrote:

    @Christian – The bot used for this task already has approval:

    https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/QuickStatementsBot

    Thursday, June 29, 2017 at 11:42 | Permalink