Everybody scrape now!

If you like Wikidata and working on lists, you probably know my Mix’n’match tool, which matches entries in external catalogs to Wikidata. And if you are really into these things, you might have tried your luck with the import function to add your own catalog.

But the current import page has some drawbacks: you need to adhere to a strict format that you can’t really test except by importing, your data is static and will never update, and, most importantly, you need to get the data in the first place. Sadly, many great datasets are only exposed as web pages, and rescuing the data from its fate as mere tag filler is not an easy task.

I have imported many catalogs into Mix’n’match, some from data files, but most scraped from web pages. For a long time, I wrote bespoke scraper code for every website, and I still do that for some “hard cases” occasionally. But some time ago, I devised a simple (yeah, right…) JSON description to specify the scraping of a website. This covers the construction of URLs (a list of fixed keys, like letters? Numeric? Letters with numeric subpages? A start page to follow all links from?), as well as regular expressions to find entries on those pages (yes, I am using RegEx to parse HTML; so sue me), extracting IDs, names, and descriptions. The beauty is that only the JSON changes for each website; the scraping code stays the same.
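To make the idea concrete, here is a minimal sketch of how such a description-driven scraper could work. All field names (`url_pattern`, `keys`, `entry_regex`, `fields`) and the example URL are hypothetical, not the actual Mix’n’match JSON format; the point is only that the config varies per site while the scraping function stays generic:

```python
import re

# Hypothetical scraper description; the real Mix'n'match JSON uses its own fields.
config = {
    "url_pattern": "https://example.org/catalog/{key}",  # one page per key
    "keys": ["a", "b", "c"],                             # e.g. letters of the alphabet
    "entry_regex": r'<a href="/person/(\d+)">([^<]+)</a>\s*'
                   r'<span class="desc">([^<]*)</span>',
    "fields": ["id", "name", "description"],             # capture-group order
}

def scrape_page(html, config):
    """Apply the entry regex to one page and return a list of entry dicts."""
    entries = []
    for match in re.finditer(config["entry_regex"], html):
        entries.append(dict(zip(config["fields"], match.groups())))
    return entries

# A tiny fake page, standing in for one downloaded catalog page:
sample_html = """
<a href="/person/42">Ada Lovelace</a> <span class="desc">mathematician</span>
<a href="/person/7">Alan Turing</a> <span class="desc">computer scientist</span>
"""

print(scrape_page(sample_html, config))
```

The actual tool would loop over the URLs generated from `url_pattern` and `keys`, fetching each page first; only the regex and URL scheme change between websites.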

This works surprisingly well, and I have over 70 Mix’n’match catalogs generated through this generic scraping mechanism. But it gets better: For smaller catalogs, with relatively few pages to scrape, I can just run the scraping again periodically, and add new entries to Mix’n’match, as they are added to the website.
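The periodic-update step described above boils down to a set difference on external IDs: re-scrape, then keep only entries the catalog does not know yet. A sketch under that assumption (function and field names hypothetical):

```python
def find_new_entries(scraped, known_ids):
    """Return only the scraped entries whose external ID is not yet in the catalog."""
    return [entry for entry in scraped if entry["id"] not in known_ids]

known_ids = {"42", "7"}  # IDs already present in the Mix'n'match catalog
scraped = [
    {"id": "42", "name": "Ada Lovelace"},
    {"id": "99", "name": "Grace Hopper"},  # newly appeared on the website
]

print(find_new_entries(scraped, known_ids))  # only the id-99 entry is new
```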

But there is still a bottleneck in this approach: me. I am the only one who can create the JSON, add it to the Mix’n’match database, and run the scraping. It does take some time to devise the JSON, and even more testing to get it right. Wouldn’t it be great if everyone could create the JSON through a simple interface, test it, add it to Mix’n’match as a new (or existing) catalog, have it scrape a website, run automatic matching against Wikidata on top, and get automatic, periodic updates to the catalog for free?

Well, now you can. This new interface offers all the options I use for my own JSON-based scraping, and you don’t even have to see the JSON: just fill out a form, click on “Test”, and if the first page scrapes OK, save it and watch the magic happen.

I am aware that regular expressions are not everyone’s cup of decaffeinated, gluten-free green tea, and neither will be the idea of multi-level, pattern-based URL construction. But you don’t get an (almost) universal web scraping mechanism for free, and the learning curve is the price to pay. I have included an example setup, which I used to create a new catalog.

Testing will get you the HTML of the first web page that your URL schema generated, plus all scraped entries. If there are too few entries, or wrong ones, you can fiddle with the regular expressions in the form, and it will tell you, live, how many entries they would scrape. Once it looks all right, test again to see the actual results. When everything looks good, save it, done!
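Behind the scenes, that live count while you fiddle with the regex can be as simple as counting matches against the already-fetched page. A sketch, with hypothetical names and a toy page (an incomplete pattern mid-typing should just count as zero, not crash the form):

```python
import re

# Toy stand-in for the fetched first page of the catalog:
html = '<li id="e1">One</li> <li id="e2">Two</li> <li id="e3">Three</li>'

def count_entries(pattern, html):
    """How many entries would the current regex scrape from this page?"""
    try:
        return len(re.findall(pattern, html))
    except re.error:  # half-typed patterns are invalid regexes; report 0 matches
        return 0

print(count_entries(r'<li id="(e\d+)">([^<]+)</li>', html))  # 3 entries found
```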

I do have one request: If the test does not look perfectly OK, do not save the scraper. Because if the results are not to your liking, you will have to come to me to fix it. And fixing these things usually takes me a lot longer than doing them myself in the first place. So please, switch that underused common sense to “on”!