So this, as they say, happened.
On 2016-12-27, I received an update on a Mix’n’match catalog that someone had uploaded. That update had improved names and descriptions for the catalog. I try to avoid such updates, because I made the import function so I do not have to deal with every catalog myself, and also because the update process is entirely manual, and therefore somewhat painful and error-prone, as we will see. Now, as I was on vacation, I was naturally in a hurry, and (as it turned out later) there were too many tabs in the tab-delimited update file.
Long story short, something went wrong with the update. For some reason, some of the SQL commands generated from the update file did not specify which entry to update: no entry ID, no catalog. So when I checked what was taking so long, just short of 100% of Mix’n’match entries had the label “Kelvinator stove fault codes”, and the description “0”.
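To make the failure mode concrete, here is a minimal sketch of how a generator for such UPDATE statements can be hardened against stray tabs. This is not the actual update script; the row layout, table, and column names are hypothetical.

```python
def make_update(line):
    """Build an UPDATE for one tab-delimited row: id<TAB>label<TAB>description.

    Hypothetical schema; the real Mix'n'match tables differ. The point is
    the validation: a row with a shifted or missing ID must never produce
    an UPDATE whose WHERE clause matches the wrong entries.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or not parts[0].isdigit():
        raise ValueError("malformed row: %r" % line)
    entry_id, label, description = parts
    # Return a parameterized statement instead of interpolating strings.
    return (
        "UPDATE entry SET label=%s, description=%s WHERE id=%s",
        (label, description, int(entry_id)),
    )

# A row with an extra tab now fails loudly instead of silently yielding
# an UPDATE that hits entries it should not:
#   make_update("123\tKelvinator stove fault codes\t0\textra")  -> ValueError
```

Using parameterized queries rather than pasting values into the SQL string also sidesteps quoting bugs, which are another way a tab-damaged field can corrupt a statement.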
Backups, you say? Well, of course, but, look over there! /me runs for the hills
Well, not all was lost. Some of the large catalogs were still around from my original import. Also, my scraping scripts for specific catalogs generate JSON files with the data to import, and those are still around as well. There was also a SQL dump from 2015. That was a start.
Of course, I did not keep the catalogs imported through my web tool. Because they were safely stored in the database, you know? What could possibly go wrong? Thankfully, some people still had their original files around and gave them to me for updating the labels.
I also wrote a “re-scraping” script, which uses the external URLs I store for each entry in Mix’n’match, together with the external ID. Essentially, I get the respective web page, and write a few lines of code to parse the <title> tag, which often includes the label. This works for most catalogs.
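The core of that re-scraping idea can be sketched in a few lines of standard-library Python. The real script is necessarily site-specific; the function names here are mine, and the generic regex stands in for the per-catalog parsing code.

```python
import re
import urllib.request

def extract_title(html):
    """Pull the contents of the first <title> tag, if any."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def scrape_label(url):
    """Fetch the stored external URL for an entry and recover a
    candidate label from the page title."""
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_title(html)
```

In practice the title usually needs trimming too (site names, separators like " | " or " – "), which is where the per-catalog code comes in.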
So, at the time of writing, over 82% of labels in Mix’n’match have been successfully restored. That’s the good news.
The bad news is that the remaining ~17% are distributed across 133 catalogs. Some of these do not have URLs to scrape, some URLs don’t play nicely (session-based Java horrors, JS-only pages etc.), and the rest need site-specific <title> scraping code. Fixing those will take some time.
Apart from that, I fixed up a few things:
- Database snapshots (SQL dump) will now be taken once a week
- The snapshot from the previous week is preserved as well, in case damage went unnoticed
- Catalogs that are uploaded through the import tool will be preserved as individual files
Other than the remaining entries that require fixing, Mix’n’match is open for business, and while my one-man show is spread thin as usual, subsequent blunders should be easier to mitigate. Apologies for the inconvenience, and all that.