2019-08-14: Building the Better Crowdsourced Study - Literature on Mechanical Turk

The XKCD comic "Study" parodies the challenges of recruiting study participants.

As part of "Social Cards Probably Provide For Better Understanding Of Web Archive Collections" (recently accepted for publication by CIKM2019), I had to learn how to conduct user studies. One of the most challenging problems to solve while conducting user studies is recruiting participants. Amazon's Mechanical Turk (MT) solves this problem by providing a marketplace where participants can earn money by completing studies for researchers. This blog post summarizes the lessons I have learned from other studies that have successfully employed MT. I have found parts of this information scattered throughout different bodies of knowledge, but not gathered in one place; thus, I hope it is a useful starting place for future researchers.

MT is by far the largest source of study participants, with over 100,000 available participants. MT is an automated system that facilitates the interaction of two actors: the requester and the worker. A worker signs up for an Amazon account and must wait a few days to be approved. Once approved, MT provides the worker with a list of assignments to choose from. An MT assignment is called a Human Intelligence Task (HIT). Workers perform HITs for anywhere from $0.01 up to $5.00 or more, and can earn as much as $50 per week completing them. Workers are the equivalent of the subjects or participants found in traditional research studies.

Workers can browse HITs to complete via Amazon's Mechanical Turk.
Requesters are the creators of HITs. After a worker completes a HIT, the requester decides whether or not to accept the HIT and thus pay the worker. Requesters use the MT interface to specify the amount to be paid for a HIT, how many unique workers may complete each HIT, how much time to allot to workers, and when the HIT will no longer be available for work (expire). Requesters can also restrict a HIT to workers with specific qualifications, such as age, gender, employment history, or handedness. The Masters Qualification is assigned automatically by the MT system based on worker behavior. Requesters can also require that workers meet a minimum approval rate.
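These requester-side settings map directly onto fields in Amazon's MTurk API, which can be driven from Python via the boto3 library's `create_hit` call. Below is a minimal sketch of the parameters only; the title, reward, and threshold values are hypothetical, and actually publishing the HIT would additionally require AWS credentials and a `Question` payload:

```python
# Sketch of the parameters a requester specifies for a HIT. The values
# here are hypothetical examples; with AWS credentials configured, this
# dict could be passed to boto3's MTurk client as create_hit(**hit_params).
hit_params = {
    "Title": "Answer questions about a web archive collection",  # shown to workers
    "Reward": "0.25",                    # payment per assignment, in USD
    "MaxAssignments": 5,                 # unique workers who may complete this HIT
    "AssignmentDurationInSeconds": 600,  # time allotted to each worker
    "LifetimeInSeconds": 86400,          # when the HIT expires
    "QualificationRequirements": [
        {
            # System qualification for the worker's approval rate
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],       # require a 95% or better approval rate
        }
    ],
}
```

The `QualificationRequirements` entry is how a requester enforces the approval-rate restriction described above.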

Requesters can create HITs using the MT interface, which provides a variety of templates.
The HITs themselves are HTML forms entered into the MT system. Requesters have considerable freedom within the interface to design HITs to meet their needs, even incorporating JavaScript. Once the requester has entered the HTML into the system, they can preview the HIT to ensure that it looks and responds as expected, and then save it for use. HITs may contain variables for links to visualizations or other external information. When the requester is ready to publish a HIT for workers to perform, they submit a CSV file containing the values for these variables, and MT creates one HIT per row in the CSV file. Amazon requires the requester to deposit enough money into their account to pay for the number of HITs they have specified. Once the HITs are paid for, workers can see them and begin their submissions. The requester then reviews each submission as it comes in and pays the workers.
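As a sketch of the variable-substitution step: if the HIT template contains a placeholder such as `${viz_url}`, publishing amounts to uploading a CSV with one column per variable and one row per HIT. The script below generates such a file; the column name and URLs are hypothetical:

```python
import csv

# Each data row becomes one HIT. The column header must match the
# ${viz_url} placeholder in the HIT's HTML template; these URLs are
# hypothetical stand-ins for links to visualizations.
rows = [
    {"viz_url": "https://example.org/collection/1"},
    {"viz_url": "https://example.org/collection/2"},
    {"viz_url": "https://example.org/collection/3"},
]

with open("hit_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["viz_url"])
    writer.writeheader()
    writer.writerows(rows)  # MT creates one HIT per data row
```

Uploading this file would yield three HITs, each showing a different visualization link to its workers.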

The MT environment is different from that used in traditional user studies. MT participants can use their own devices to complete the study wherever they have a connection to the Internet. Requesters are limited in the amount of data that they can collect on MT participants. For each completed HIT, the MT system supplies the completion time and the responses provided by the MT participant. A requester may also employ JavaScript in the HIT to record additional information.

In contrast, traditional user studies allow a researcher to completely control the environment and record the participant's physical behavior. Because of these differences, some scholars have questioned the effectiveness of MT's participants. To assuage this doubt, Heer et al. reproduced the results of a classic visualization experiment. The original experiment used participants recruited using traditional methods. Heer recruited participants via MT and demonstrated that the results were consistent with the original study. Kosara and Ziemkiewicz reproduced one of their previous visualization studies and likewise found that the MT results were consistent with the earlier study. Bartneck et al. conducted the same experiment with both traditionally recruited participants and MT workers. They also confirmed consistent results between these groups.

MT is not without its criticism. Fort, Adda, and Cohen raise questions on the ethical use of MT, focusing on the potentially low wages offered by requesters. In their overview of MT as a research tool, Mason and Suri further discuss such ethical issues as informed consent, privacy, and compensation. Turkopticon is a system developed by Irani and Silberman that helps workers safely voice grievances about requesters, including issues with payment and overall treatment.

In traditional user studies, the presence of the researcher may engender some social motivation to complete a task accurately. MT participants are motivated to maximize their revenue over time by completing tasks quickly, leading some not to exercise the same level of care as traditional participants. Because of these differences in motivation and environment, MT studies require specialized design. Based on the work of multiple academic studies, we have the following advice for requesters developing meaningful tasks with Mechanical Turk:
  • complex concepts, like understanding, can be broken into smaller tasks that collectively provide a proxy for the broader concept (Kittur 2008)
  • successful studies ensure that each task has questions with verifiable answers (Kittur 2008)
  • limiting participants by their approval rate has been successful for ensuring higher-quality responses (Micallef 2012, Borkin 2013)
  • participants can repeat a task – make sure each set of responses corresponds to a unique participant by using tools such as Unique Turker (Paolacci 2010)
  • be fair to participants; because MT is a competitive market for participants, they can refuse to complete a task, and thus a requester's actions lead to a reputation that causes participants to avoid them (Paolacci 2010)
  • better payment may improve results on tasks with factually correct answers (Paolacci 2010, Borkin 2013, PARC 2009) – and can address the ethical issue of proper compensation
  • being up front with participants and explaining why they are completing a task can improve their responses (Paolacci 2010) – this can also help address the issue of informed consent
  • attention questions can be useful for discouraging or weeding out malicious or lazy participants that may skew the results (Borkin 2013, PARC 2009)
  • bonus payments may encourage better behavior from participants (Kosara 2010) – and may also address the ethical issue of proper compensation
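Two of the items above — unique participants and attention questions — are also straightforward to enforce when post-processing results. A minimal sketch, with hypothetical field names (adapt them to the columns in your own results file; MT's result CSVs identify workers with a WorkerId column):

```python
# Filter HIT results: keep only each worker's first submission, and drop
# submissions that fail an attention question. The field names and the
# expected attention answer are hypothetical examples.
ATTENTION_ANSWER = "blue"  # the one correct answer to the attention question

def filter_results(submissions):
    seen_workers = set()
    kept = []
    for sub in submissions:
        if sub["WorkerId"] in seen_workers:
            continue  # repeat participant: keep only the first response
        if sub["attention_check"].strip().lower() != ATTENTION_ANSWER:
            continue  # failed the attention question
        seen_workers.add(sub["WorkerId"])
        kept.append(sub)
    return kept

results = [
    {"WorkerId": "A1", "attention_check": "Blue", "answer": "good"},
    {"WorkerId": "A1", "attention_check": "Blue", "answer": "again"},  # repeat
    {"WorkerId": "A2", "attention_check": "red", "answer": "rushed"},  # failed check
    {"WorkerId": "A3", "attention_check": "blue", "answer": "fine"},
]
print([s["WorkerId"] for s in filter_results(results)])  # → ['A1', 'A3']
```

Tools such as Unique Turker can prevent repeat participation up front; a filter like this catches whatever slips through.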
MT provides a place to recruit participants, but recruitment is only one part of successfully conducting user experiments. To create successful user experiments, I recommend starting with "Methods for Evaluating Interactive Information Retrieval Systems with Users" by Diane Kelly.

For researchers starting down the road of user studies, begin with Kelly's work and then circle back to the other resources noted here when developing an experiment.

-- Shawn M. Jones