The Big Ones

Update: After fixing an import error and cross-matching BNF-supplied VIAF data, 18% of BNF people are matched in Wikidata. The text has been corrected accordingly.

My mix’n’match tool holds a lot of entries from third-party catalogs – 21,795,323 at the time of writing. That’s a lot, but it doesn’t cover “the big ones” – VIAF, BNF, etc., which hold many millions of entries each. I could “just” (not so easy) import those, but:

  • Mix’n’match is designed for small and medium-sized entry lists, a few hundred thousand entries at most. It does not scale well to larger catalogs
  • Mix’n’match is designed to work with many different catalogs, so the database structure represents the least common denominator – ID, title, short description. Catalog-specific metadata gets lost, or is not easily accessible after import
  • The sheer number of entries might require different interface solutions, as well as automated matching tools

To at least get a sense of how many entries we are dealing with in these catalogs, and inspired by the Project soweego proposal, I have used a BNF data dump to extract 1,637,195 entries (fewer than I expected) into a new database, one that will hopefully hold other large catalogs in the future. There is much to do; currently, only 295,763 entries (~18%) exist on Wikidata, according to the SPARQL query service.
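As a quick sanity check on the corrected percentage, here is a minimal Python sketch that recomputes it from the two counts quoted in this post. The SPARQL string is only an assumption about what such a count could look like: it uses the Wikidata property P268 (BnF ID) and counts every item carrying that property, which is not necessarily the exact query behind the number above.

```python
# Figures quoted in this post.
BNF_ENTRIES = 1_637_195         # entries extracted from the BNF data dump
MATCHED_ON_WIKIDATA = 295_763   # entries that exist on Wikidata


def matched_share(matched: int, total: int) -> float:
    """Return the matched entries as a percentage of all entries."""
    return 100.0 * matched / total


# Assumption: one possible way to count Wikidata items with a BnF ID (P268).
# This is an illustrative query, not necessarily the one used for the post.
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?matched) WHERE {
  ?item wdt:P268 ?bnf_id .
}
"""

if __name__ == "__main__":
    share = matched_share(MATCHED_ON_WIKIDATA, BNF_ENTRIES)
    print(f"{share:.1f}% of BNF entries are matched")
```

Rounded to a whole number, this gives the ~18% quoted in the text.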

As one can glimpse from the screenshot, I have also extracted some metadata into a “proper” database table. All this is preliminary; I might have missed entries or good metadata, or gotten things wrong. For me, the important thing is that (a) there is some queryable data on Toolforge, and (b) (re-)import and matching of the data is fully automated, so it can be re-run if something turns out to be problematic.

I shall see where I go from here. Obvious candidates include auto-matching (via names and dates) to Wikidata, and adding BNF references to relevant statements. If you have a Toolforge user account, you can access the new database (read-only) as s51434__mixnmatch_large_catalogs_p. Feel free to run some queries or build some tools around it!
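To illustrate what auto-matching via names and dates could look like, here is a rough Python sketch. It is not the code behind the database; the `Person` record, the identifiers, and the rule "accept only a single candidate with the same normalized name, birth year and death year" are all assumptions made for the example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    """Hypothetical record shape; the real tables may look different."""
    ident: str                  # catalog ID or Wikidata item ID
    name: str
    born: Optional[int] = None  # year of birth
    died: Optional[int] = None  # year of death


def normalize(name: str) -> str:
    """Lower-case, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())


def match_key(person: Person) -> Tuple[str, Optional[int], Optional[int]]:
    return (normalize(person.name), person.born, person.died)


def auto_match(catalog: Iterable[Person],
               wikidata: Iterable[Person]) -> Dict[str, str]:
    """Map catalog IDs to Wikidata IDs where name and both years agree.

    Entries lacking a year, and entries with more than one candidate,
    are skipped so that they can be matched by hand instead.
    """
    index: Dict[tuple, List[str]] = {}
    for item in wikidata:
        if item.born is None or item.died is None:
            continue
        index.setdefault(match_key(item), []).append(item.ident)

    matches: Dict[str, str] = {}
    for entry in catalog:
        if entry.born is None or entry.died is None:
            continue
        candidates = index.get(match_key(entry), [])
        if len(candidates) == 1:  # only accept unambiguous matches
            matches[entry.ident] = candidates[0]
    return matches
```

Requiring both years and a single candidate trades recall for precision: fewer entries get matched automatically, but the ones that do are less likely to be wrong.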