The Whelming

My Mix’n’match tool works with third-party catalogs and helps match their entries to Wikidata. This requires importing at least minimal information about those entries into the Mix’n’match database, but ideally it also imports additional metadata, such as (for biographical entries) gender, birth/death dates, VIAF, etc. Such metadata is invaluable for automatically matching entries to Wikidata items, and thus greatly reduces volunteer workload.

However, virtually none of the (currently) ~2600 catalogs in Mix’n’match offers a standardized format to retrieve either basic data or metadata. Some catalogs are imported by volunteers from tab-delimited files, but most are “scraped”, that is, automatically read and parsed, from the source website.

Some source websites are set up in a way that allows a standardized scraping tool to run there, and I offer a web form to create new scrapers; over 1400 of these scrapers have run successfully, and ~750 of them can automatically run again on a regular basis.

But the autoscraper does not handle metadata, such as birth/death dates, and many catalogs need bespoke import code even for the basic information. Until recently, I had hundreds of scripts, some of them consisting of thousands of lines of code, running data retrieval and parsing:

Retrieving basic information (ID, name, URL) from the source site
Creating or amending entry descriptions from the source site
Importing auxiliary data (other IDs, such as VIAF, coordinates, etc.) from the source site
Extracting birth/death dates from descriptions, taking care not to use estimates, “floruit” dates, etc.
Extracting auxiliary data from descriptions
Linking two related catalogs (e.g. one for painters, one for paintings) to improve matching (e.g. considering only artworks by that artist on Wikidata)

and many others.
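To make the date-extraction task above more concrete, here is a minimal sketch in Python (not the PHP the actual scripts use). It assumes descriptions carry life dates in a parenthetical such as "(1850-1912)"; the function name `extract_life_dates` and the list of uncertainty markers are made up for illustration, and real catalogs will need more patterns than this.

```python
import re

# Text inside one pair of parentheses, e.g. "(1850-1912)".
PAREN = re.compile(r"\(([^()]*)\)")
# Two standalone four-digit years joined by a hyphen or an en dash.
YEAR_PAIR = re.compile(r"(?<!\w)(\d{4})\s*[-\u2013]\s*(\d{4})(?!\w)")
# Hypothetical list of markers for estimated or "floruit" dates.
UNCERTAIN = re.compile(r"\b(?:c|ca|circa|fl|floruit|about|approx)\b|\?", re.IGNORECASE)


def extract_life_dates(description):
    """Return (birth_year, death_year), or None if no safe dates are found."""
    for inner in PAREN.findall(description):
        pair = YEAR_PAIR.search(inner)
        if pair is None:
            continue  # this parenthetical holds no year range
        if UNCERTAIN.search(inner):
            return None  # estimate or floruit: better no date than a wrong one
        birth, death = int(pair.group(1)), int(pair.group(2))
        return (birth, death) if birth < death else None
    return None
```

The sketch deliberately errs on the side of returning nothing: a missing date only costs an automatic match, while a wrong date could produce a wrong match.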

Over time, all this became unwieldy, unstructured, and repetitive; more than once I wrote a bespoke scraper only to find that I already had one somewhere else.

So I set out to radically redesign all these processes. Since only a small piece of code performs the actual scraping/parsing logic for any given catalog, these code fragments are now stored in the Mix’n’match database, each associated with its catalog. I imported many code fragments from the “old” scripts into this table. I also wrote function-specific wrapper code that can load a code fragment and execute it on its associated catalog (via the eval function, which is often considered “evil”, hence the blog post title). An example of such code fragments for a catalog can be seen here.
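The load-and-execute pattern can be sketched roughly as follows. This is an illustration in Python using `exec` and an in-memory SQLite database, not the actual PHP framework; the table `code_fragments`, its columns, the function label `clean_description`, and the `entry` variable are all assumptions made up for the example.

```python
import sqlite3


def load_fragment(conn, catalog_id, function_name):
    """Fetch the stored code for one catalog and one function, or None."""
    row = conn.execute(
        "SELECT code FROM code_fragments WHERE catalog_id = ? AND function_name = ?",
        (catalog_id, function_name),
    ).fetchone()
    return row[0] if row else None


def run_fragment(conn, catalog_id, function_name, entry):
    """Wrapper: run the catalog's fragment on a copy of the entry."""
    code = load_fragment(conn, catalog_id, function_name)
    if code is None:
        return dict(entry)  # no fragment stored for this catalog
    # Emptying the builtins is a small mitigation, NOT a real sandbox.
    scope = {"__builtins__": {}, "entry": dict(entry)}
    exec(code, scope)
    return scope["entry"]


# Hypothetical table layout and fragment, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE code_fragments (catalog_id INTEGER, function_name TEXT, code TEXT)"
)
conn.execute(
    "INSERT INTO code_fragments VALUES (?, ?, ?)",
    (1, "clean_description", 'entry["desc"] = entry["desc"].strip().rstrip(".")'),
)
result = run_fragment(conn, 1, "clean_description", {"id": "a1", "desc": "  English painter. "})
```

The wrapper owns the database access and the shape of the data, so each stored fragment only has to contain the few catalog-specific lines that transform `entry`.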

I can now use that web interface to retrieve, create, test, and save code, without having to touch the command line at all.

In an ideal world, I would let everyone add and edit code here; however, since the framework executes PHP code, this would open the door to all kinds of malicious attacks. I cannot think of a way to fully safeguard against (deliberately or accidentally) destructive code, though I have put some mitigations in place in case I make a mistake myself. So, for now, you can look, but you can’t touch. If you want to contribute code (new or patches), please send it to me, and I’ll be happy to add it!

This code migration is just in its infancy; so far, I support four functions, with a total of 591 code fragments. Many more to come, over time.

Eval, not evil
