Why I Hate Mechanical Turk Research

by admin on November 11, 2010

Update (5/12/2011): I wrote a longer (but still short) paper about this for the Crowdsourcing workshop at CHI.  It’s available here.

Ok, it’s not that I actually hate it, but in reviewing a number of Turk papers, reading many more, and listening to many, many, many more planned Turk projects I find myself increasingly frustrated.  Don’t get me wrong: we use it all the time in my group for evaluation purposes (or labeling training/test data).  It allows us to cheaply–an interesting debate in itself–evaluate a number of new ideas in front of a large population.

Like a good reviewer, let me also mention the type(s) of Turk-specific research that I do like (they’re related, bear with me):

  1. projects that help us understand the demographics, motivations and behavioral patterns of the Turkers,
  2. papers that identify unique design patterns and optimizations,
  3. mechanism papers that demonstrate effective ways to post-process Turk data to get signal from noise.

We basically need all these to enable us to effectively use the Turk (run the experiment), and address issues of bias and validity (produce believable, generalizable, reproducible results).

Think of it this way:  we’ve been given some new instrument with some rough edges and without a manual.  We’re collectively writing one.  Unfortunately, at some point the production of the manual and the sharpening of the tool, which up until now have been a scientific process, are no longer interesting as science.  We seem to be rapidly arriving at this point: there are only so many design patterns, so many demographic studies, so many measures of noise, and so many inference techniques for noisy data for this particular instrument (and arguably most instruments of this type).  In other words, we’re running out of low hanging fruit.

Showing that Humans can do Human Tasks is not a Contribution

To stretch the metaphor, the lack of low hanging fruit of the first type has led to the pursuit of an entirely different kind of fruit, the ones that already fell on the floor (and occasionally ones we’ve already pretty much eaten in pie*).  Basically, taking AI-hard problems–those that we know people can do but that machines are not so good at–and farming them out to Turkers.  We know that most people can do this kind of work (find a car in a picture, pick out cancer cells, label sentences).  The surprise would be if Turkers couldn’t.

In the past, when dealing with computationally taxing problems, one would either a) make simplifying assumptions, b) optimize, optimize, optimize, or c) propose some difficult-to-compute, partial solution and argue that current technology trends will allow us to catch up (proof-by-Moore’s law).  Although not always satisfying, the key feature of all approaches was that we built automated solutions–hopefully creating true advancements on the boundaries.

The Turk has created a certain laziness.  When I recently pointed out in a review that a particular hard computational step had been glossed over, the authors countered with a small experiment that indicated that Turkers could be paid 4 cents per task.  Though they made no arguments about the total cost of this approach (which could have easily been in the millions given the scope of their project), the authors felt this was a sufficient explanation.  QED, proof-by-Turk-existence.

The difference between the two “proofs” is subtle, but important.  When we argue that our algorithm will eventually run reasonably well, we do so by relying on the fact that computers are getting faster and cheaper.  We frequently design things that we know won’t run well for a few years out.  The same can’t be said for the Turk.  Humans don’t really get cheaper and their ability to do work doesn’t magically scale.

Are there situations where crowdsourcing is desirable?  Sure. If you’re a biologist and the advancement in your discipline is hampered by some hard AI step, don’t waste your time on solving the AI part.  Your artifact/contribution is a drug and you’re stuck on how proteins fold?  By all means, crowdsource!**

The switch in “fruit” has led to a deluge of papers that spend 98% of the space demonstrating that yes, people can solve easy people-problems; that yes, the mechanical Turk makes this easy; and that yes, it’s pretty cheap.  The other 2% (or 1% or 0%) might be some minor observation: an attempt at adding something to the manual, some undiscovered toggle somewhere.  Maybe, if we’re lucky, it will be novel.

But are we really that lucky? Have we stumbled on some population or work environment that is so fundamentally different that it requires 1000s of pages of research?  Or does this really boil down to things we already knew: Turkers are human, Turkers are unreliable, and Turkers maximize “profits”?

We seem to have forgotten the original mechanical Turk.  Most generously, it was an elegant illusion, but really it was a hoax and a con with the purpose of deceiving.  So are we just tricking ourselves into believing we are doing something interesting? When was the last time paying-people-to-do-stuff was a novel scientific or engineering contribution?

So now that we have our manual, can we go back to the hard problems?  To trying to build reliable mechanized systems that just do what we ask?  I have some electrons I can pay with.

* The worst is when we already know that there’s a good solution

** Want to win a Nobel?  Hint: solve the automation problem.  But, seriously, figuring out how to get people to do biology without knowing biology… that’s pretty slick too (GWAP++).

Update: I’ve been told by some beta readers that it’s not entirely obvious what I don’t like.  While it’s easy to point to papers that I think are good (I do above, in the 3 bullet points), I’m uncomfortable pointing at specific papers I think are bad.  Luckily, most of the really bad ones don’t make it past review.  But to be clear: it’s the papers that don’t advance the field in any dimension that I object to.  These fail the “so what?” test by demonstrating that humans can do work that we knew humans could do (contrary to popular belief, this is not a contribution).  Or they don’t show anything surprising about the instrument (the Turk).  These almost always get the “well, duh!” response from me as they really just confirm what we know.  There are a lot of these.

Update 2: Someone (he can self-identify if he wants) pointed out to me that the difficult part of Turk system building is that while a human is good at a task humans (plural) are not.  Reminds me of one of my favorite quotes: “A person is smart. People are dumb, panicky, dangerous animals, and you know it.”  I totally agree, this is an interesting challenge which needs to be addressed by technical and non-technical interventions: we either make individual components less noisy or just learn to derive signal.  Here we can, though we rarely do, turn to other scientific disciplines.  Von Neumann taught us about majority voting, and the hundreds of system architecture and OS researchers that followed him only added to this literature (voting, RAID, parallel development, etc.).  Similarly, “manipulating” and “incentivizing” people to behave better (making the component less noisy) is also thoroughly addressed in psych, survey design, and marketing research.  This doesn’t mean there’s nothing new here.  We have people coming in and out, issues of money, cultural issues, etc.  The combination is novel and unique enough that we need that new operating manual.  But why do we need to rediscover everything? New to us does not mean new.
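The von Neumann-style majority voting mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical item IDs and labels, not the method of any particular paper: collect several redundant, noisy Turker labels per item and keep the plurality answer.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among redundant annotations.

    Ties are broken by first occurrence, which is arbitrary; real
    aggregation schemes weight annotators by estimated reliability.
    """
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: three Turkers label each of two images.
annotations = {
    "img_001": ["car", "car", "truck"],
    "img_002": ["cat", "cat", "cat"],
}
consensus = {item: majority_vote(votes) for item, votes in annotations.items()}
# consensus == {"img_001": "car", "img_002": "cat"}
```

The point of the sketch is that the mechanism itself is old news; the interesting questions are the ones the update raises, such as how many redundant votes to buy and how to handle annotators who are systematically, rather than randomly, noisy.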


The interesting challenge for me is to build extremely non-trivial crowd-powered systems. What can we do with crowd platforms that you wouldn’t think would be otherwise possible?

The design space used to look like a tradeoff between user and AI. The user could step in when an AI couldn’t solve a problem, and guide it in the right direction. Now, we have a triangular (simplex) design space: user, AI and crowd. What do we rely on the user to do? What can an AI do? What can a crowd do? How can they play off each others’ strengths to reach new heights?

We’ve chatted about this, but a similar critique could be levied against most paradigm shifts in science and engineering. Researchers rush in, and it produces bad work as well as good work. This happens in mechanized systems as well. “Why I hate incremental LDA research”, “Yet Another Slight Tweak on SVMs”, “Why I Review Too many Twitter Papers that Don’t Have A Good Grip On Twitter”, etc.

by Michael Bernstein on November 11, 2010 at 4:03 pm.

No question… s/turk/buzzword/ in the above post (that’s “replace” for those of you who don’t speak regexp). I should do that every year when I do CHI reviews.

by admin on November 11, 2010 at 4:57 pm.

I work on the “building signal from noise” aspect of using Turk, so I’m glad that you value that. While I haven’t reviewed a Turk paper yet, I have to admit that I have a low bar for Turk papers. As long as they show me something about the nature of tasks that *need* to be done by ‘trained people’ and the tasks can be done by ‘Turkers’, I’m usually happy. (Special note: I consider the training of an annotator to be inducement of bias, which is a more complicated discussion.)

For many conferences, this isn’t enough of a contribution, but if someone were churning out blog posts about the difference between Turkers and ‘trained’ annotators (that are commonly used in social science research), I would think it was an interesting contribution.

Having worked on many, many annotation projects, I am deeply skeptical of the annotation process, especially as it is applied both in academia and in business. If I can’t use more sophisticated science than is typically applied to the annotations, I have grave concerns about the work. So, a movement to large N annotators (possible with Turk), can shed a lot of light on how ‘regular’ people respond under a variety of conditions instead of measuring the responses of WEIRD subjects (where WEIRD is an acronym from recent research that, if you don’t know it, you need to read). ;)

by Stephen Purpura on November 11, 2010 at 5:01 pm.

The clever counterpoint would be to design a system for evaluating papers about Mechanical Turk, and to parcel out the work so that the Turkers themselves are the reviewers. (Peer review and all that.)

by Edward Vielmetti on November 11, 2010 at 5:08 pm.

Really interesting post. As a Turk simpleton, two things strike me:
1. It needs a decent reputation system.
2. But more importantly, I feel the digital sweat shop nature of the mturk (partly due to the reputation problems) is due to a misunderstanding of what computers and humans should be for. The Human Use of Human Beings by Norbert Wiener and also Computer Power and Human Reason by Joseph Weizenbaum dealt a long time ago with how we should divide tasks between humans and machines.

If Norbert Wiener was willing to worry about these problems that seems to say that it is important the mturk be used effectively for experiments on what humans and computers should do.

by iamreddave on November 11, 2010 at 6:18 pm.
