Points of view

For many years, Henrik has single-handedly (based on data by Domas) done what the world’s top 5 website has consistently failed to provide: page view information, per page, per month/day. Requested many times, repeatedly promised, page view data has remained proverbial vaporware, the Duke Nukem Forever of the Wikimedia Foundation (except DNF was delivered in 2011). A quick search of my mail found a request for this data from 2007, but I would be surprised if that is the oldest instance of such a query.

Now, it would be ridiculous to assume the Foundation does not actually have the data; indeed they do, and they are provided as unwieldy files for download. So what’s all the complaining about? First, the download data cannot be queried in any reasonable fashion; if I want to know how often Sochi was viewed in January 2014, I will have to parse an entire file. Just kidding; it’s not one file. Actually, it’s one file for every single hour. With the page titles URL-encoded as requested, that is, not normalized; a single page can have dozens of different “keys”, have fun finding them all!
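To make the problem concrete, here is a minimal Python sketch of what answering the "Sochi" question from the raw files involves. It rests on assumptions not stated above: that each line of an hourly file has four space-separated fields (project code, page title as requested, view count, bytes transferred), and that decoding the URL encoding, swapping spaces for underscores and capitalizing the first letter is enough to collapse the different "keys" of a page. Real title normalization has more cases than that, and the function names are made up for illustration.

```python
from collections.abc import Iterable
from urllib.parse import unquote


def normalize_title(raw_title: str) -> str:
    """Collapse URL-encoded variants of a title into one key (simplified)."""
    title = unquote(raw_title).replace(" ", "_").strip("_")
    # Assumption: the wiki capitalizes the first letter of every title.
    return title[:1].upper() + title[1:]


def count_views(lines: Iterable[str], project: str, wanted_title: str) -> int:
    """Sum the views of one page over lines taken from hourly dump files."""
    target = normalize_title(wanted_title)
    total = 0
    for line in lines:
        parts = line.split()
        # Assumed line layout: project, title, view count, bytes.
        if len(parts) != 4 or parts[0] != project:
            continue
        if normalize_title(parts[1]) != target:
            continue
        try:
            total += int(parts[2])
        except ValueError:
            continue  # skip malformed counts instead of aborting the run
    return total


# Made-up sample lines standing in for one hourly file.
sample = [
    "en Sochi 10 1000",
    "en sochi 2 200",
    "en So%63hi 3 300",
    "de Sochi 7 700",
    "en Sochi_(disambiguation) 1 10",
]
print(count_views(sample, "en", "Sochi"))
```

For a whole month, the same loop would have to run over every line of roughly 24 × 31 files, which is exactly why the raw dumps are no substitute for a queryable service.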

But I can get that information from Henrik’s fantastic page, right? Right. Unless I want to query a lot of pages. Which I have to request one by one. Henrik has done a fantastic job, and single queries seem fast, but it adds up. Especially if you do it for thousands of pages. And try to be interactive about it. (My attempt to run queries in parallel ended with Henrik temporarily blocking my tools for DDOSing his server. And rightly so. Sorry about that.)

GLAMs (Galleries, Libraries, Archives, and Museums) are important partners of the Wikimedia Foundation in the realm of free content, and increasingly so. Last month, the Wellcome Trust released 100.000 images under the CC-BY license. Wikimedia UK is working on a transfer of these images to Wikimedia Commons. Like other GLAMs, the Wellcome Trust would surely like to know if and how these images are used in the Wikiverse, and how many people are seeing them. I try to provide a tool for that information, but, using Henrik’s server, it runs for several days to collect data for a single month, for some of the GLAM projects we have. And, having to hit a remote server with hundreds of thousands of queries via http each month, things sometimes go wrong, and then people write me why their view count is waaaay down this month, and I’ll go and fix it. Currently, I am fixing data from last November. By re-running that particular subset and crossing my fingers it will run smoothly this time.

Like others, I have tried to get the Foundation to provide the page view data in a more accessible and local (as in toolserver/Labs) way. Like others, I failed. The last iteration was a video meeting with the Analytics team (newly restarted, as the previous Analytics team didn’t really work out for some reason; I didn’t inquire too deeply), which ended with a promise to get this done Real Soon Now™, and the generous offer to use the page view data from their hadoop cluster. Except the cluster turned out to be empty; I was then encouraged to import the view data myself. (No, this is not a joke. I have the emails to prove it.) As much as I enjoy working with and around the Wikiverse, I have neither the time, the bandwidth, nor the inclination to do your paid jobs for you, thank you very much.

As the sophisticated reader might have picked up at this point, the entire topic is rather frustrating for myself and others, and being unable to offer more than a patchy, error-prone data set to GLAMs who have released hundreds of thousands of files under a free license into Commons is, quite frankly, disgraceful. The requirement for the Foundation is not unreasonable; providing what Henrik has been doing for years on his own would be quite sufficient. Not even that is required; others and I have volunteered to write interfaces if the back-end data is provided in a usable form.

Of the tools I try to provide in the GLAM realm, some don’t really work at the moment due to the constraints described above; some work so-so, kept running with a significant amount of manual fixing. Adding 100.000 Wellcome Trust images may be enough for them to come to a grinding halt. And when all the institutions who so graciously have contributed free content to the Wikiverse come a-running, I will make it perfectly clear that there is only the Foundation to blame.

6 Comments

  1. bawolff wrote:

    Is the blame game really necessary? Yes, it would be nice if the foundation provided the data in aggregated form. However, that’s not exactly a critical mission objective, it’s not like the foundation is actively preventing third parties from doing so, and the foundation is providing the raw data necessary for others to do so.

    Tuesday, February 18, 2014 at 01:39 | Permalink
  2. Sadads wrote:

    Have you tried interacting with https://tools.wmflabs.org/wikiviewstats/ ? I just discovered it recently, and it reads numbers slightly differently than stats.grok.se

    Tuesday, February 18, 2014 at 01:45 | Permalink
  3. pfctdayelise wrote:

    Wow, reading this brought back memories.

    Good luck with your quest this time. Love your work, as always :)

    Tuesday, February 18, 2014 at 10:07 | Permalink
  4. Magnus wrote:

    @bawolff The simple truth is, if nothing changes, the one remaining, marginally functioning GLAM stats tool of mine will cease to work. At which point the GLAMs will not get any more updates on basic view stats for their images. Since the underlying cause of this will be the unavailability of view data in a reasonable form (and /you/ try to load those data dumps into a database!), this will reflect very badly on the Foundation and future GLAM collaborations.
    If all I wanted to do is blame the Foundation, I’d just have to wait a little longer (the NASA files for January alone are now running for more than 24h; not that long to wait…) and watch the fireworks once the tools stop working. The reason for my blog post is that I /don’t/ want this to happen. There is still time for the Foundation to get their act together on this. Not a lot, though.

    Tuesday, February 18, 2014 at 11:29 | Permalink
  5. Toby wrote:

    I’m the Director of Analytics at the WMF, and I want to address Magnus’ concerns directly. First of all, I’m sorry that we let Magnus and other folks down on the page views APIs — we made some commitments late last year that we weren’t able to meet. Not only that, these failures echoed previous points of frustration with the Foundation.

    I do want to note that we actively support the infrastructure that feeds data to stats.grok.se. We’ve fixed a number of issues with that pipeline, most recently last week. We understand the importance of this data to the community.

    The page view API project has been challenging for a number of reasons — the size of the data, the fact that definitions of page views have not been updated to stay in line with the changing traffic (mobile, bots, API requests, etc) and the challenges in aggregating various aliases. We’ve needed to revisit our definitions of page views in order to get this right as well as design and build a global architecture for collecting these and other metrics. In addition, we’ve tried to do this with a perspective of privacy and respect for our users.

    To this end, we presented an approach to measuring page views in MediaWiki at FOSDEM in January and have made progress towards our new infrastructure by deploying middleware delivering unsampled page view data from mobile devices from our globally distributed datacenters to our compute cluster for analysis.

    However, these initiatives are complex and will take several months to complete at the earliest. In the meantime, we’re working with Henrik to scale up stats.grok.se.

    I also want to call out that the Analytics team has been supporting a wide range of users and stakeholders during the year. We’ve developed WikiMetrics, a tool for measuring editor productivity that is used by WMF program evaluation and community members; provided dashboards and support for Wikipedia Zero, our program to partner with our mobile partners to enable mobile Wikipedia access free from data charges; and supported product teams, researchers both inside and outside of the foundation.

    We’ve been prioritizing and working on these projects as our resources allow and it’s important to understand that the team has not been idle. While we’ve done a less than stellar job in communicating our progress to the community, information on what we’ve been doing is available via our planning pages on mediawiki. In the future, we will be more proactive in communicating with the community regarding our goals and projects.

    If you have questions and follow-up please feel free to reach out to the team and myself at analytics@lists.wikimedia.org.

    Thursday, February 20, 2014 at 00:08 | Permalink
  6. Nemo wrote:

    I asked for results for 2014 at https://meta.wikimedia.org/wiki/Grants_talk:APG/Proposals/2013-2014_round2/Wikimedia_Foundation/Proposal_form#Multiplication_of_tools

    The last update we had is https://en.wikipedia.org/w/index.php?title=User_talk:Henrik&diff=600917917&oldid=600897425

    Saturday, April 12, 2014 at 13:29 | Permalink