Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk

Political Analysis - Volume 20, Issue 3 - Pages 351-368 - 2012
Adam J. Berinsky1, Gregory A. Huber2, Gabriel S. Lenz3
1Department of Political Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (corresponding author)
2Institution for Social and Policy Studies, Yale University, New Haven, CT 06511
3Department of Political Science, University of California, Berkeley, Berkeley, CA 94720

Abstract

We examine the trade-offs associated with using Amazon.com's Mechanical Turk (MTurk) as a subject recruitment tool. We first describe MTurk and its promise as a vehicle for performing low-cost and easy-to-field experiments. We then assess the internal and external validity of experiments performed using MTurk, employing a framework that can be used to evaluate other subject pools. First, we investigate the characteristics of samples drawn from the MTurk population. We show that respondents recruited in this manner are often more representative of the U.S. population than in-person convenience samples (the modal sample in published experimental political science) but less representative than subjects in Internet-based survey panels or national probability samples. Finally, we replicate important published experimental work using MTurk samples.

Keywords


References

We also find a somewhat different pattern of signs for the coefficients on the control variables—notably education and income. However, the coefficients on these variables—both in our analysis and in the original article by Kam and Simas—fall short of statistical significance by a wide margin.

Another promise of MTurk is as an inexpensive tool for conducting panel studies. Panel studies offer several potential advantages. For example, recent research in political science on the rate at which treatment effects decay (Chong and Druckman 2010; Gerber, Gimpel, Green, and Shaw 2011) has led to concerns that survey experiments may overstate the effects of manipulations relative to what one would observe over longer periods of time. For this reason, scholars are interested in mechanisms for exposing respondents to experimental manipulations and then measuring treatment effects over the long term. Panels also allow researchers to conduct pretreatment surveys and then administer a treatment distant from that initial measurement (allowing time to serve as a substitute for a distracter task). Another potential use of a panel study is to screen a large population and then to select from that initial pool of respondents a subset who better match desired sample characteristics.

The MTurk interface provides a mechanism for performing these sorts of panel studies. To conduct a panel survey, the researcher first fields a task as described above. Next, the researcher posts a new task on the MTurk workspace. We recommend that this task be clearly labeled as open only to prior research participants. Finally, the researcher notifies those workers she wishes to perform the new task of its availability. We have written and tested a customizable Perl script that does just this (see the Supplementary data). In particular, after it is edited to work with the researcher's MTurk account and to describe the new task, it interacts with the Amazon.com API to send messages through the MTurk interface to each invited worker. As with any other task, workers can be directed to an external Web site and asked to submit a code to receive payment.

Our initial experiences with using MTurk to perform panel studies are positive. In one study, respondents were offered 25 cents for a 3-min follow-up survey conducted 8 days after a first-wave survey. Two reminders were sent. Within 5 days, 68% of the original respondents took the follow-up. In a second study, respondents were offered 50 cents for a 3-min follow-up survey conducted 1–3 months after a first-wave interview. Within 8 days, almost 60% of the original respondents took the follow-up. Consistent with our findings, Buhrmester, Kwang, and Gosling (2011) report a two-wave panel study, conducted 3 weeks apart, also achieving a 60% response rate. They paid respondents 50 cents for the first wave and 50 cents for the second. Analysis of our two studies suggests that the demographic profile does not change significantly in the follow-up survey. Based on these results, we see no obstacle to oversampling demographic or other groups in follow-up surveys, which could allow researchers to study specific groups or improve the representativeness of samples.
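The authors' notification script is written in Perl and distributed with the Supplementary data. As a rough illustration of the same workflow (an assumption on our part, not the authors' code), the sketch below uses the boto3 Python client for the MTurk API to invite prior participants to a follow-up task; the worker IDs and message text are placeholders.

```python
# Hedged sketch (not the authors' Perl script): inviting prior respondents to a
# follow-up HIT by messaging them through the MTurk API with boto3.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Worker IDs collected from the first-wave HIT's assignments (placeholder values).
prior_workers = ["A1EXAMPLEWORKER", "A2EXAMPLEWORKER"]

subject = "Follow-up survey now available"
message = (
    "You recently completed our survey of public affairs. A short follow-up "
    "HIT is now posted and is open only to prior participants. "
    "Payment: 25 cents for about 3 minutes."
)

# NotifyWorkers accepts at most 100 worker IDs per call, so send in batches.
for i in range(0, len(prior_workers), 100):
    mturk.notify_workers(
        Subject=subject,
        MessageText=message,
        WorkerIds=prior_workers[i:i + 100],
    )
```

Because the message is delivered inside MTurk, no e-mail addresses are ever collected, which keeps the panel design consistent with the anonymity of the worker pool.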

Sorokin, 2008, Utility data annotation with Amazon Mechanical Turk, Computer Vision and Pattern Recognition Workshops ‘08, 51, 1

10.1017/S000305541000047X

10.1073/pnas.0705435104

10.1006/obhd.1995.1046

In the case of experiments involving deception, it is also feasible to debrief at the conclusion of the experiment.

Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2011. Replication data for: Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. IQSS Dataverse Network [Distributor] V1 [Version]. http://hdl.handle.net/1902.1/17220 (accessed January 19, 2012).

The MTurk sample does have fewer blacks than either of the Berinsky and Kinder adult samples.

10.1017/S0043887110000195

Chandler, Dana, and Adam Kapelner. 2010. Breaking monotony with meaning: Motivation in crowdsourcing markets. University of Chicago Mimeo.

We chose to replicate the study by Kam and Simas because it is an excellent example of the way in which contemporary political scientists use experimentation to understand key political dynamics. That study examines the importance both of framing and of the relationship between framing and underlying preferences for risk aversion (i.e., heterogeneity in treatment effects).

It should be noted that other convenience samples, such as student or local intercept samples, may also have significant numbers of habitual experimental participants. However, it is important to determine whether this is especially a problem in the MTurk sample, where subjects can easily participate in experiments from their home or work computers.

The support for increased spending is, on average, somewhat higher in both conditions on the GSS. Specifically, in 2010, the GSS data show that 24% think that too little is being spent on welfare, whereas 68% think that too little is spent on assistance to the poor.

Kam and Simas also employed a within-subjects design to show that high levels on the risk acceptance scale reduce susceptibility to framing effects across successive framing scenarios. We replicated these results as well (see Supplementary data).

10.1086/269158

10.1145/1753326.1753688

Takemura, 1994, Influence of elaboration on the framing of decision, Journal of Psychology, 128

These discrepancies also suggest the potential utility of using an initial survey to screen large numbers of individuals and then inviting a more representative subset of those respondents to participate in the experiment itself. We discuss the technique for contacting selected respondents for a follow-up survey in footnote 8.

As with the drug benefit, this difference may be due to age or to differences in political circumstances.

Other researchers have surveyed MTurk respondents and found a similar demographic profile (e.g., Ross et al. 2010).

As of October 2011, Google Scholar lists 769 social sciences articles with the phrase "Mechanical Turk." Relevant studies by economists include, for example, Chandler and Kapelner (2010), Chen and Horton (2010), Horton and Chilton (2010), and Paolacci et al. (2010). Computer scientists have also tested MTurk's suitability as a source of data for training machine learning algorithms (e.g., Sheng et al. 2008; Sorokin and Forsyth 2008). For example, Snow et al. (2008) assessed the quality of MTurkers' responses to several classic human language problems, finding that the quality was no worse than the expert data that most researchers use.

We conducted the habitual responder analysis for the welfare and Asian flu experiments (see Supplementary data). We do not perform this analysis for the study by Kam and Simas because it was conducted a year after our data on frequent participants was collected.

On demographics, the only nonsignificant differences between MTurk and the other samples are on gender, marriage separation, Catholic, and region.

Horton, J., and L. Chilton. 2010. The labor economics of paid crowdsourcing. Proceedings of the 11th ACM Conference on Electronic Commerce, Cambridge, MA.

10.1037/0022-3514.51.3.515

To conduct these tests, we pooled the MTurk and GSS samples. We then ran three ordered probits using the three-category welfare spending response scale as the dependent variable—one for each of the demographic variables (men versus women; college educated versus other; blacks versus all other races). For each of these probits, we included as independent variables a dummy variable for sample (GSS versus MTurk), a dummy variable for the treatment (welfare versus assistance to the poor question form), the demographic variable of interest (education, gender, or race), and interactions between all the variables. The interactions between the treatment and the demographic variables allow us to test whether heterogeneous treatment effects are present, whereas the three-way interaction between the demographic variable, the sample, and the treatment allows us to test whether the treatment effect varies by demographic subgroup across forms. In all cases, these interaction terms were insignificant (the p values on the terms range from .28 to .78). The full ordered probit results are presented in the Supplementary data.
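For readers who want to reproduce this style of analysis, the following is a minimal sketch using the ordered-probit estimator in the Python statsmodels package; it is not the authors' estimation code, and the pooled data file and column names are hypothetical. It regresses the three-category spending item on the sample indicator, the treatment indicator, one demographic indicator, and all of their interactions, as described above.

```python
# Hedged sketch: ordered probit with sample x treatment x demographic interactions.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("pooled_mturk_gss.csv")  # hypothetical pooled data set

# 0/1 indicators: mturk (vs. GSS), welfare_frame (vs. "assistance to the poor"),
# college (the demographic of interest in this run).
df["mturk_x_frame"] = df["mturk"] * df["welfare_frame"]
df["mturk_x_college"] = df["mturk"] * df["college"]
df["frame_x_college"] = df["welfare_frame"] * df["college"]
df["threeway"] = df["mturk"] * df["welfare_frame"] * df["college"]

exog = df[["mturk", "welfare_frame", "college",
           "mturk_x_frame", "mturk_x_college", "frame_x_college", "threeway"]]

# spending3 is the three-category response; OrderedModel estimates cut points,
# so no constant term is included among the regressors.
model = OrderedModel(df["spending3"], exog, distr="probit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```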

Green and Kern, 2010, Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees

10.1016/S0167-4870(00)00032-5

We present screen shots of a sample HIT from a Worker's view in the Supplementary data.

Replication code and data are available at Political Analysis Dataverse (Berinsky, Huber, and Lenz 2011).

Analyses have generally found that experiments on Internet samples yield results similar to traditional samples. Based on a comprehensive analysis, for example, Gosling et al. (2004) conclude that Internet samples tend to be diverse, are not adversely affected by nonserious or habitual responders, and produce findings consistent with traditional methods.

The MTurk platform is of course limited to conducting research that does not require physical interactions between the subject and either the researcher or other subjects (e.g., to gather DNA samples, administer physical interventions, or observe face-to-face interactions among subjects).

We have successfully used commercial Web sites like SurveyGizmo and Qualtrics for this process, and any Web survey service that can produce a unique worker code should be suitable. Providing subjects with a unique code and having them enter it in the MTurk Web site ensures that they have completed the task.
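A minimal sketch of the completion-code logic, assuming the survey platform can display a server-generated string on its final page; the function names below are ours and not part of any MTurk or survey-vendor API.

```python
# Issue each respondent a random completion code at the end of the survey and
# later check submitted codes against the issued set before approving work.
import secrets

issued_codes = set()

def issue_completion_code() -> str:
    """Generate a code shown to the respondent on the survey's final page."""
    code = secrets.token_hex(4).upper()   # e.g. '9F3A1C0B'
    issued_codes.add(code)
    return code

def verify_submission(submitted_code: str) -> bool:
    """Approve an MTurk submission only if its code was actually issued."""
    return submitted_code.strip().upper() in issued_codes
```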

If the researcher has arranged for the external Web site to produce a unique identifier, she can then use these identifiers to reject poor quality work on the MTurk Web site. For example, if the experiment included mandatory filter questions or questions designed to verify the subject was reading instructions, the worker's compensation can be made contingent on responses. Finally, a unique identifier also allows the researcher to pay subjects a bonus based on their performance using either the MTurk Web interface or Amazon.com's Application Programming Interface (API). A Python script we have developed and tested to automate the process of paying individual bonuses appears in the Supplementary data.
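The authors' bonus-payment script appears in the Supplementary data. As an illustration of the general approach rather than that script, the sketch below pays per-worker bonuses through the MTurk API's SendBonus operation using boto3, reading worker and assignment IDs from a hypothetical CSV file.

```python
# Hedged sketch: paying individual bonuses via the MTurk API.
import csv
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Each row of the (hypothetical) file: worker_id, assignment_id, bonus (dollars, as a string).
with open("bonuses.csv", newline="") as f:
    for row in csv.DictReader(f):
        mturk.send_bonus(
            WorkerId=row["worker_id"],
            AssignmentId=row["assignment_id"],
            BonusAmount=row["bonus"],                 # e.g. "0.50"
            Reason="Performance bonus for the public affairs survey",
            UniqueRequestToken=row["assignment_id"],  # guards against double payment on retries
        )
```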

Ross, 2010, CHI EA 2010

We are unaware of research using the MTurk interface to recruit large numbers of subjects for longer surveys, although Buhrmester et al. (2011) report being able to recruit about five subjects per hour for a survey advertised as taking 30 min for a $.02 payment. Other scholars have reported that higher pay increases the speed at which subjects are recruited but does not affect accuracy (Buhrmester, Kwang, and Gosling 2011; Mason and Watts 2009; but see Downs et al. 2010 and Kittur, Chi, and Suh 2008 on potentially unmotivated subjects, a topic addressed in greater detail below).

The HIT was described as follows: Title: Survey of Public Affairs and National Conditions. Description: Complete a survey to gauge your opinion of national conditions and current events (USA only). Should be no more than 10 mins. Keywords: survey, current affairs, research, opinion, politics, fun. Detailed Posting: Complete this research survey. Usually takes no more than 10 minutes. You can find the survey here: [URL removed]. At the end of the survey, you'll find a code. To get paid, please enter the code below.
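For readers unfamiliar with how such a posting is created, the sketch below shows one way to post a HIT of this kind as an MTurk "external question" pointing at a survey URL, using the boto3 Python client. The reward, durations, assignment count, and URL are illustrative assumptions, not the authors' settings; the title, description, and keywords are those quoted above.

```python
# Hedged sketch: posting an external-question HIT that sends workers to a survey URL.
import boto3

EXTERNAL_QUESTION = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/survey</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

mturk = boto3.client("mturk", region_name="us-east-1")
hit = mturk.create_hit(
    Title="Survey of Public Affairs and National Conditions",
    Description="Complete a survey to gauge your opinion of national conditions "
                "and current events (USA only). Should be no more than 10 mins.",
    Keywords="survey, current affairs, research, opinion, politics, fun",
    Reward="0.25",                        # dollars, as a string; illustrative amount
    MaxAssignments=500,                   # illustrative sample size
    AssignmentDurationInSeconds=30 * 60,  # time allowed per worker
    LifetimeInSeconds=7 * 24 * 60 * 60,   # how long the HIT stays posted
    Question=EXTERNAL_QUESTION,
)
print(hit["HIT"]["HITId"])
```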

MTurk classifies individuals as 18 or older based on self-reports. MTurk does not reveal how it classifies individuals as living in a particular country but may rely on mailing addresses and credit card billing addresses.

These individuals may reside in the United States but be traveling or studying abroad. Additionally, although IP address locators seem reliable, we are unaware of research benchmarking their accuracy. Still, so as to provide as conservative a picture of our sample as is possible, we excluded these questionable respondents. Our results did not change when we included them.

Moreover, as the material in the Supplementary data makes clear, many other studies do not report any information about sample characteristics.

Prospective respondents were offered $10 per month to complete surveys on the Internet for 30 min each month.

We therefore only report significance tests in the exceptional cases when they are not statistically significant, relying on Kolmogorov-Smirnov tests of differences in distributions (and proportion tests for categorical variables). As shown in the Supplementary data, about 85% of the tests between MTurk and the other samples are statistically significant at the 0.10 threshold. In comparison, about 60% of the tests between ANES 2008 and CPS 2008 are significant.
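A minimal sketch of these comparisons, assuming respondent-level files for the MTurk and comparison samples with hypothetical column names: a Kolmogorov-Smirnov test for an ordinal or continuous item and a two-sample proportion test for a categorical one.

```python
# Hedged sketch: distributional and proportion tests between two samples.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical files, one row per respondent, with 'age' and 'female' columns.
mturk = pd.read_csv("mturk_sample.csv")
anes = pd.read_csv("anes_sample.csv")

# Continuous or ordinal item: KS test of equality of distributions across samples.
ks_stat, ks_p = ks_2samp(mturk["age"].dropna(), anes["age"].dropna())

# Categorical item: two-sample proportion (z) test on the share female.
counts = np.array([mturk["female"].sum(), anes["female"].sum()])
nobs = np.array([mturk["female"].notna().sum(), anes["female"].notna().sum()])
z_stat, prop_p = proportions_ztest(counts, nobs)

print(f"KS p = {ks_p:.3f}; proportion-test p = {prop_p:.3f}")
```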

The need for cognition and need to evaluate scales are from the 2008 ANES. These items were placed on a separate survey of 699 MTurk respondents conducted in May 2011. This study also contained the Kam and Simas (2010) replication discussed below. The HIT was described as follows: Title: Survey of Public Affairs and Values. Description: Relatively short survey about opinions and values (USA only). 10–12 minutes. Keywords: survey, relatively short. Detailed Posting: Complete this research survey. Usually takes 10–12 minutes. You can find the survey here: [URL removed]. At the end of the survey, you'll find a code. To get paid, please enter the code below.

Snow, 2008, Evaluating non-expert annotations for natural language tasks

In fact, differences between MTurk and ANES 2008 on registration and turnout are not statistically significant (see the Supplementary data). There are inconsistencies in the ANESP's measures of turnout and registration (e.g., the survey contains respondents who say they are not registered to vote, but report voting) that suggest caution here.

10.1145/1357054.1357127

This result is somewhat odd because workers visit MTurk to make money, not because they are interested in politics. The higher levels of interest may be due to advertising the survey as about “public affairs.”

Prior work similarly finds no evidence that gender or race are associated with differences in effect sizes in the GSS in years earlier than 2010 (Green and Kern 2010).

10.1007/s11109-007-9037-6

This survey was fielded in January 2010. The HIT was described as follows: Title: Answer a survey about current affairs and your beliefs. Description: "Answer a survey about current affairs and your beliefs. Should take less than 5 minutes." Paolacci et al. (2010) also report an MTurk replication of this experiment.

Lawson et al. (2010) successfully replicate the ratings of 2006 Senate candidate faces on MTurk by Ballew and Todorov (2007). Horton, Rand, and Zeckhauser (2010) replicate several experimental findings in economics. Gabriele Paolacci's Experimental Turk blog (http://experimentalturk.wordpress.com/) has collected reports of successful replications of several canonical experiments from a diverse group of researchers, including the Asian Disease Problem discussed in this section and other examples from psychology and behavioral economics.

Researchers can reject and block future work by suspected retakers or simply exclude duplicate work from their analysis by selecting only the first observation from a given IP address.
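A minimal sketch of the IP-based deduplication, assuming a response file with hypothetical submit_time and ip_address columns.

```python
# Keep only the earliest submission from each IP address, dropping suspected retakes.
import pandas as pd

responses = pd.read_csv("responses.csv")            # one row per submission
responses = responses.sort_values("submit_time")    # earliest submission first
deduped = responses.drop_duplicates(subset="ip_address", keep="first")
```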

10.1111/j.1468-2508.2006.00451.x

10.1002/(SICI)1099-0992(199803/04)28:2<287::AID-EJSP861>3.0.CO;2-U

10.1177/1745691610393980

Chen, Daniel L., and John J. Horton. 2010. The wages of pay cuts: Evidence from a field experiment. Harvard University Mimeo.

10.1017/S0003055410000493

10.1017/CBO9780511921452.004

10.1037/0003-066X.59.2.93

Horton, John J., David G. Rand, and Richard J. Zeckhauser. 2010. The online laboratory: Conducting experiments in a real labor market. Available at SSRN: http://ssrn.com/abstract=1591202 (accessed January 19, 2012).

To check whether MTurk subjects looked up answers to knowledge questions on the Internet, we asked two additional multiple choice questions of much greater difficulty: who was the first Catholic to be a major party candidate for president and who was Woodrow Wilson's vice president. Without cheating, we expected respondents to do no better than chance. On the question about the first Catholic candidate, MTurk subjects did worse than chance with only 10% answering correctly (Alfred Smith; many chose an obvious but wrong answer, John F. Kennedy). About a quarter did correctly answer the vice presidential question (Thomas Marshall), exactly what one would expect by chance. These results suggest political knowledge is not inflated much by cheating on MTurk.

10.3758/BF03197268

10.1017/S0022381609990806

10.1145/1600150.1600175

10.1037/h0043424

Paolacci, 2010, Running experiments on Amazon Mechanical Turk, Judgment and Decision Making, 5, 10.1017/S1930297500002205

Sheng, 2008, Get another label?

10.1093/pan/mpn012

10.1126/science.7455683