For many years, Henrik has single-handedly (based on data by Domas) done what one of the world’s top-5 websites has consistently failed to provide: page view information, per page, per month and day. Requested many times, repeatedly promised, page view data has remained the proverbial vaporware, the Duke Nukem Forever of the Wikimedia Foundation (except DNF was actually delivered, in 2011). A quick search of my mail found a request for this data from 2007, but I would be surprised if that were the oldest instance of such a query.
Now, it would be ridiculous to assume the Foundation does not actually have the data; indeed it does, and it is provided as unwieldy files for download. So what’s all the complaining about? First, the downloadable data cannot be queried in any reasonable fashion; if I want to know how often Sochi was viewed in January 2014, I have to parse an entire file. Just kidding; it’s not one file. It’s one file for every single hour. And the page titles are URL-encoded exactly as they were requested, that is, not normalized; a single page can have dozens of different “keys”, have fun finding them all!
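To give an idea of the busywork involved: here is a minimal sketch of what aggregating one page’s views from such a dump looks like. It assumes the publicly documented line format of the hourly pagecounts files (`<project> <title> <count> <bytes>`); the normalization rule is my own simplification, the real set of title variants is messier.

```python
from collections import defaultdict
from urllib.parse import unquote

def normalize_title(raw):
    """Collapse the many URL-encoded spellings of one page title.

    The hourly dump files record titles as they were requested, so
    'Duke%20Nukem_Forever' and 'Duke_Nukem_Forever' count as
    separate keys until we fold them together.
    """
    title = unquote(raw).replace("_", " ").strip()
    return title[:1].upper() + title[1:] if title else title

def aggregate(lines, project="en"):
    """Sum hourly view counts per normalized title for one project.

    Assumed line format: '<project> <title> <views> <bytes>'.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.split(" ")
        if len(parts) < 3 or parts[0] != project:
            continue
        totals[normalize_title(parts[1])] += int(parts[2])
    return dict(totals)

# A few made-up dump lines:
sample = [
    "en Sochi 42 123456",
    "en Sochi%2C_Russia 1 999",   # a different page, stays separate
    "en sochi 3 100",             # case variant, folded in
    "en Sochi 5 2000",
]
totals = aggregate(sample)
```

And remember: to answer one question about one month, you would run something like this over roughly 720 such files.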
But I can get that information from Henrik’s fantastic page, right? Right. Unless I want to query a lot of pages. Which I have to request one by one. Henrik has done a fantastic job, and single queries seem fast, but it adds up. Especially if you do it for thousands of pages. And try to be interactive about it. (My attempt to run queries in parallel ended with Henrik temporarily blocking my tools for DDOSing his server. And rightly so. Sorry about that.)
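The only safe way to query a volunteer-run service like this, as I learned the hard way, is strictly one request at a time with a pause in between. A sketch, with a stand-in fetch function instead of a real HTTP call (the actual endpoint is not shown here):

```python
import time

def fetch_views(titles, fetch_one, delay=1.0):
    """Query view counts one title at a time, politely throttled.

    fetch_one is whatever callable hits the stats server for a single
    title (a placeholder here); delay is the pause in seconds between
    requests, so the server is not hammered.
    """
    results = {}
    for title in titles:
        results[title] = fetch_one(title)
        time.sleep(delay)
    return results

# Stand-in fetcher instead of a real HTTP request:
fake_counts = {"Sochi": 123456, "Main Page": 9999999}
views = fetch_views(["Sochi", "Main Page"], fake_counts.get, delay=0.0)
```

With a one-second delay, thousands of pages means the better part of an hour per run, which is exactly why this approach does not scale to interactive use.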
GLAMs (Galleries, Libraries, Archives, and Museums) are important partners of the Wikimedia Foundation in the realm of free content, and increasingly so. Last month, the Wellcome Trust released 100,000 images under the CC-BY license. Wikimedia UK is working on a transfer of these images to Wikimedia Commons. Like other GLAMs, the Wellcome Trust would surely like to know if and how these images are used in the Wikiverse, and how many people are seeing them. I try to provide a tool for that information, but, using Henrik’s server, it runs for several days to collect data for a single month, for some of the GLAM projects we have. And, having to hit a remote server with hundreds of thousands of queries via HTTP each month, things sometimes go wrong, and then people write to ask me why their view count is waaaay down this month, and I go and fix it. Currently, I am fixing data from last November, by re-running that particular subset and crossing my fingers that it will run smoothly this time.
Like others, I have tried to get the Foundation to provide the page view data in a more accessible and local (as in toolserver/Labs) way. Like others, I failed. The last iteration was a video meeting with the Analytics team (newly restarted, as the previous Analytics team didn’t really work out for some reason; I didn’t inquire too deeply), which ended with a promise to get this done Real Soon Now™, and the generous offer to use the page view data from their Hadoop cluster. Except the cluster turned out to be empty; I was then encouraged to import the view data myself. (No, this is not a joke. I have the emails to prove it.) As much as I enjoy working with and around the Wikiverse, I have neither the time, the bandwidth, nor the inclination to do your paid jobs for you, thank you very much.
As the sophisticated reader might have picked up at this point, the entire topic is rather frustrating for me and others, and being able to offer only a patchy, error-prone data set to GLAMs who have released hundreds of thousands of files under a free license into Commons is, quite frankly, disgraceful. What is asked of the Foundation is not unreasonable; providing what Henrik has been doing for years on his own would be quite sufficient. Not even that is required; others and I have volunteered to write interfaces if the back-end data is provided in a usable form.
Of the tools I try to provide in the GLAM realm, some don’t really work at the moment due to the constraints described above; some work so-so, kept running with a significant amount of manual fixing. Adding 100,000 Wellcome Trust images may be enough to bring them to a grinding halt. And when all the institutions who have so graciously contributed free content to the Wikiverse come a-running, I will make it perfectly clear that the Foundation alone is to blame.