Why I Hate Mechanical Turk Research
by admin on November 11, 2010
Update (5/12/2011): I wrote a longer (but still short) paper about this for the Crowdsourcing workshop at CHI. It’s available here.
Ok, it’s not that I actually hate it, but in reviewing a number of Turk papers, reading many more, and listening to many, many, many more planned Turk projects, I find myself increasingly frustrated. Don’t get me wrong: we use it all the time in my group for evaluation purposes (or labeling training/test data). It allows us to cheaply (an interesting debate in itself) evaluate a number of new ideas in front of a large population.
Like a good reviewer, let me also mention the type(s) of Turk-specific research that I do like (they’re related, bear with me):
- projects that help us understand the demographics, motivations and behavioral patterns of the Turkers,
- papers that identify unique design patterns and optimizations,
- mechanism papers that demonstrate effective ways to post-process Turk data to get signal from noise.
We basically need all these to enable us to effectively use the Turk (run the experiment), and address issues of bias and validity (produce believable, generalizable, reproducible results).
Think of it this way: we’ve been given some new instrument with some rough edges and without a manual. We’re collectively writing one. Unfortunately, at some point the production of the manual and the sharpening of the tool, which up until now have been a scientific process, are no longer interesting as science. We seem to be rapidly arriving at this point: there are only so many design patterns, so many demographic studies, so many measures of noise, and so many inference techniques for noisy data for this particular instrument (and arguably most instruments of this type). In other words, we’re running out of low hanging fruit.
Showing that Humans can do Human Tasks is not a Contribution
To stretch the metaphor, the lack of low hanging fruit of the first type has led to the pursuit of an entirely different kind of fruit: the kind that has already fallen on the floor (and occasionally the kind we’ve already pretty much eaten in pie*). Basically, this means taking AI-hard problems (those that we know people can do but that machines are not so good at) and farming them out to Turkers. We know that most people can do this kind of work (find a car in a picture, pick out cancer cells, label sentences). The surprise would be if Turkers couldn’t.
In the past, when dealing with computationally taxing problems, one would either a) make simplifying assumptions, b) optimize, optimize, optimize, or c) propose some difficult-to-compute, partial solution and argue that current technology trends will allow us to catch up (proof-by-Moore’s law). Although not always satisfying, the key feature of all approaches was that we built automated solutions–hopefully creating true advancements on the boundaries.
The Turk has created a certain laziness. When I recently pointed out in a review that a particular hard computational step had been glossed over, the authors countered with a small experiment that indicated that Turkers could be paid 4 cents per task. Though they made no arguments about the total cost of this approach (which could have easily been in the millions given the scope of their project), the authors felt this was a sufficient explanation. QED, proof-by-Turk-existence.
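To make the missing cost argument concrete, here is a back-of-the-envelope sketch. The only number taken from the anecdote is the 4 cents per task; the task count and the redundancy factor are hypothetical values I picked for illustration, not figures from the paper in question.

```python
def total_cost_usd(num_tasks: int, price_cents_per_task: int = 4, redundancy: int = 1) -> float:
    """Total payout in dollars for a crowdsourced job.

    num_tasks            -- number of distinct items to process (hypothetical)
    price_cents_per_task -- payment per completed task, in cents (4 in the anecdote)
    redundancy           -- how many workers see each item (hypothetical)
    """
    return num_tasks * redundancy * price_cents_per_task / 100


if __name__ == "__main__":
    # Assumed scale: 50 million items, each shown to 3 workers.
    print(f"${total_cost_usd(50_000_000, 4, 3):,.0f}")
```

The cost is linear in the number of items and in the redundancy, and nothing in the formula gets cheaper with time. Platform fees and the cost of rejected work are left out, so this is an optimistic estimate.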
The difference between the two “proofs” is subtle but important. When we argue that our algorithm will eventually run reasonably well, we do so by relying on the fact that computers are getting faster and cheaper. We frequently design things that we know won’t run well for a few years yet. The same can’t be said for the Turk. Humans don’t really get cheaper and their ability to do work doesn’t magically scale.
Are there situations where crowdsourcing is desirable? Sure. If you’re a biologist and the advancement in your discipline is hampered by some hard AI step, don’t waste your time on solving the AI part. Your artifact/contribution is a drug and you’re stuck on how proteins fold? By all means, crowdsource!**
The switch in “fruit” has led to a deluge of papers that spend 98% of the space demonstrating that yes, people can solve easy people-problems; that yes, the mechanical Turk makes this easy; and that yes, it’s pretty cheap. The other 2% (or 1% or 0%) might be some minor observation: an attempt at adding something to the manual, some undiscovered toggle somewhere. Maybe, if we’re lucky, it will be novel.
But are we really that lucky? Have we stumbled on some population or work environment that is so fundamentally different that it requires 1000s of pages of research? Or does this really boil down to things we already knew? Turkers are human, Turkers are unreliable and Turkers maximize “profits.”
We seem to have forgotten the original mechanical Turk. Most generously, it was an elegant illusion, but really it was a hoax and a con with the purpose of deceiving. So are we just tricking ourselves into believing we are doing something interesting? When was the last time paying-people-to-do-stuff was a novel scientific or engineering contribution?
So now that we have our manual, can we go back to the hard problems: trying to build reliable mechanized systems that just do what we ask? I have some electrons I can pay with.
* The worst is when we already know that there’s a good solution.
** Want to win a Nobel? Hint: solve the automation problem. But, seriously, figuring out how to get people to do biology without knowing biology… that’s pretty slick too (GWAP++).
Update: I’ve been told by some beta readers that it’s not entirely obvious what I don’t like. While it’s easy to point to papers that I think are good (I do above, in the 3 bullet points), I’m uncomfortable pointing at specific papers I think are bad. Luckily, most of the really bad ones don’t make it past review. But to be clear: it’s the papers that don’t advance the field in any dimension that I object to. These fail the “so what?” test by demonstrating that humans can do work that we knew humans could do (contrary to popular belief, this is not a contribution). Or they don’t show anything surprising about the instrument (the Turk). These almost always get the “well, duh!” response from me as they really just confirm what we know. There are a lot of these.
Update 2: Someone (he can self-identify if he wants) pointed out to me that the difficult part of Turk system building is that while a human is good at a task, humans (plural) are not. Reminds me of one of my favorite quotes: “A person is smart. People are dumb, panicky, dangerous animals, and you know it.” I totally agree: this is an interesting challenge that needs to be addressed by technical and non-technical interventions. We either make individual components less noisy or just learn to derive signal from the noise. Here we can, though we rarely do, turn to other scientific disciplines. Von Neumann taught us about majority voting, and the hundreds of system architecture and OS researchers that followed him only added to this literature (voting, RAID, parallel development, etc.). Similarly, “manipulating” and “incentivizing” people to behave better (making the component less noisy) is also thoroughly addressed in psych, survey design, and marketing research. This doesn’t mean there’s nothing new here. We have people coming in and out, issues of money, cultural issues, etc. The combination is novel and unique enough that we need that new operating manual. But why do we need to rediscover everything? New to us does not mean new.
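For anyone who hasn’t seen the voting idea spelled out, here is a minimal sketch of deriving signal from noisy components. It assumes a binary task, an odd number of workers, and workers who err independently with the same accuracy. Those assumptions are mine and are exactly the ones real Turkers tend to violate. The function names are made up for this example.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Return the most common label, or None if the input is empty or tied."""
    ranked = Counter(labels).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: no majority
    return ranked[0][0]


def majority_accuracy(p, n):
    """Probability that a strict majority of n workers is correct.

    Assumes a binary task, n odd, and each worker independently
    correct with probability p (an idealization).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["car", "car", "truck"]))
    for n in (1, 3, 5, 7):
        print(n, round(majority_accuracy(0.7, n), 3))
```

Under these assumptions, redundancy only helps when individual workers are better than chance, and each extra vote is another payment. That is the trade-off between making the component less noisy and learning to derive signal.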