A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory

Sprouse, Jon

doi:10.3758/s13428-010-0039-7

A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory

Open access
Published: 25 November 2010

Volume 43, pages 155–167, (2011)
Cite this article

Download PDF

You have full access to this open access article

Behavior Research Methods Aims and scope Submit manuscript

A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory

Download PDF

Jon Sprouse¹

8985 Accesses
318 Citations
6 Altmetric
Explore all metrics

Abstract

Amazon’s Mechanical Turk (AMT) is a Web application that provides instant access to thousands of potential participants for survey-based psychology experiments, such as the acceptability judgment task used extensively in syntactic theory. Because AMT is a Web-based system, syntacticians may worry that the move out of the experimenter-controlled environment of the laboratory and onto the user-controlled environment of AMT could adversely affect the quality of the judgment data collected. This article reports a quantitative comparison of two identical acceptability judgment experiments, each with 176 participants (352 total): one conducted in the laboratory, and one conducted on AMT. Crucial indicators of data quality—such as participant rejection rates, statistical power, and the shape of the distributions of the judgments for each sentence type—are compared between the two samples. The results suggest that aside from slightly higher participant rejection rates, AMT data are almost indistinguishable from laboratory data.

Questionnaire Design

Natural Language Processing

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Article Open access 01 April 2016

Sander Greenland, Stephen J. Senn, … Douglas G. Altman

From a purely methodological point of view, syntacticians are interested in identifying the properties of syntactic representations. Over the past 50 years, the dominant method for identifying the properties of syntactic representations has involved comparing two (or more) minimally different representations using a behavioral response known as an acceptability judgment as a proxy for grammatical well-formedness (Chomsky, 1965; Schütze, 1996). Traditionally, these acceptability judgments have been collected using an informal experiment consisting of only a handful of participants (usually the researcher’s colleagues) and a handful of experimental items (Marantz, 2005). This informal methodology has worked well because acceptability judgments of linguistic phenomena tend to be strikingly robust, even at very small sample sizes (for a large-scale quantitative evaluation, see Sprouse & Almeida, 2010). The success of informal experiments notwithstanding, over the past 15 years, a number of syntacticians have argued that formal experimental methods—such as full-scale surveys, large samples, and sophisticated scaling tasks like magnitude estimation—can provide an additional level of detail (usually in the form of statistical models) that can help clarify some theoretical questions in syntactic theory (e.g., Bard, Robertson, & Sorace, 1996; Cowart, 1997; Featherston, 2005a, 2005b; Keller, 2000; Myers, 2009; Sorace & Keller, 2004; Sprouse, 2009; Sprouse & Cunningham, submitted for publication; Sprouse, Wagers, & Phillips, 2010). Of course, the additional information gained by formal acceptability experiments is offset by the fact that they take considerably more time to deploy than informal acceptability experiments: an informal experiment can be conducted in a matter of minutes, whereas formal experiments can require several weeks for recruiting and running a full sample (e.g., 25–30 participants).

Several free software solutions, such as WebExp (Keller, Gunasekharan, Mayo, & Corley, 2009) and MiniJudge (Myers, 2009), have been developed to allow acceptability judgments to be collected over the Web, and thus reduce some of the collection time. Though successful at reducing physical data collection time, these software solutions still require the experimenter to invest time in participant recruitment (and compensation disbursement), which can still take weeks to complete. It has been recently suggested that syntacticians could use the Amazon Mechanical Turk marketplace (henceforth, AMT) to completely automate the recruitment of participants, the administration of surveys, and the disbursement of compensation, thus virtually eliminating the time cost of formal experiments (see, e.g., Gibson & Fedorenko, in press). AMT is an online marketplace where companies or individuals (called requesters) can post small tasks (called Human Intelligence Tasks, or HITs) that cannot easily be automated, and therefore require human workers (called workers) for completion. These HITs are generally very small in nature (such as identifying the contents of an image), and generally very high in quantity (it is not unusual for requesters to post thousands of tasks in a single batch). Requesters generally pay very little per HIT (e.g., $0.02 U.S.) and retain the ability to accept or reject the results of each HIT before Amazon sends payment to the worker. In this way, requesters are able to crowdsource (cf. outsource) tasks that would previously have required hours of work by in-house employees at considerably more expensive compensation rates. HITs can be posted using an online interface (www.mturk.com), and results can be downloaded in CSV format. From the point of view of an experimenter, AMT provides instantaneous access to thousands of potential participants and provides the tools necessary to distribute surveys, collect responses, and disburse payments.

It should be noted that AMT has already proven useful in at least one area of language research, computational linguistics, where it has been used for corpus annotation and evaluation—two tasks that have historically consumed significant time and resources (see, e.g., the recent NAACL HLT 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk; proceedings available online at www.aclweb.org/anthology/W/W10/W10-07.pdf). However, AMT has yet to be widely adopted by syntacticians who run formal acceptability experiments. The primary concern among syntacticians is that moving formal acceptability judgments out of the experimenter-controlled environment of the laboratory and onto the user-controlled environment of AMT may adversely affect the quality of the data collected and potentially negate the quantitative advantages that motivate formal experiments in the first place. In the laboratory, the experimenter can ensure that all participants are part of the population of interest (e.g., native speakers of U.S. English), control the environmental distractions, influence the rate of completion (“don’t rush”), verify that participants understand the task, and answer any questions that may arise. Before syntacticians can widely adopt AMT, they will need to be reasonably sure that the loss of this control will not affect the quality of the data that are collected. To that end, the goal of this article is to compare the results of a large-scale laboratory-based experiment (176 participants) and an identical AMT-based experiment (176 participants) along all of the quantitative measures of interest to linguists: time, cost (in money), participant rejection rate, detection rates of several known effects (both strong and weak) at a range of sample sizes, and differences in the shapes of the distributions of ratings for each condition (peak, dispersion, etc.).

Experimental details

Quantitative validation studies such as this require two large data sets: a reference data set and a target (AMT) data set. Given the relative scarcity of funding in linguistics, it seems unlikely that syntacticians will devote their limited resources to collecting two large data sets simply to validate AMT. However, Sprouse, Wagers, and Phillips (2010) collected a large data set as part of a theoretically motivated study: 176 participants, 24 different sentence types, 16 different lexicalizations (tokens) of each sentence type, and four judgments per sentence type per participant. This data set serves as the reference data for the AMT validation. The details of the experiment are given in the rest of this section.

Method

Participants

A group of 176 (152 female) self-reported monolingual native speakers of English, all University of California Irvine undergraduates, participated in the laboratory experiment for either course credit or $5. Another 176 (102 female) unique AMT workers participated in the AMT experiment for $3.

Materials

A total of 24 sentence types (conditions) were tested in this experiment. Sixteen lexicalizations of each sentence type were created and distributed among four lists using a Latin-square procedure. This meant that each list consisted of four tokens per sentence type, for a total of 96 items per list. Two orders for each of the four lists were created by pseudorandomizing the items such that related sentence types were never presented successively. This resulted in eight different surveys.

Procedure

The task for both samples was magnitude estimation of acceptability (Bard et al., 1996; Featherston, 2005a; Keller, 2000; Sprouse & Cunningham, submitted for publication). In a magnitude estimation task, participants are asked to rate experimental items in proportion to a reference item (the standard). The standard is preassigned a numerical value (the modulus). In the example below, the standard has been assigned a modulus of 100. If the participant believes that an experimental item is twice as acceptable as the standard, he or she would assign it a value of 200. If the participant believes that an experimental item is half as acceptable as the standard, he or she would assign it a value of 50.

(1)
An Example of Magnitude Estimation of Acceptability

Standard:	Who said my brother was kept tabs on by the FBI? 100
Item:	What did Lisa meet the man that bought? ____

The standard and modulus do not change throughout the experiment. Participants are instructed that they can use any positive number that they feel is appropriate. The standard was identical for all eight surveys and was in the middle range of acceptability: Who said my brother was kept tabs on by the FBI?

Presentation in the laboratory

The experiment began with a practice phase during which participants estimated the lengths of seven lines using another line as a standard set to a modulus of 100. This practice phase ensured that participants understood the concept of magnitude estimation. During the main phase of the experiment, 10 items were presented per page (except for the final page), with the standard appearing at the top of every page inside a textbox with black borders. The first 9 items of the survey were practice items (3 each of low, medium, and high acceptability). These practice items were not marked as such—that is, the participants did not know they were practice items—and they did not vary between participants in order or lexicalization. Including the practice items, each survey was 105 items long. The task directions are available on the author’s Web site (www.ling.cogsci.uci.edu/~jsprouse/tools/amt/). Participants were under no time constraints during their visit.

Presentation on AMT

The primary difference between the laboratory and AMT presentations was that the AMT survey appeared as a Web page rather than as a paper survey (see Fig. 1 for a screen shot). There were no page delineations in the Web page, therefore all of the items appeared as one long page (600 pixels in height) that required the participants to scroll. The standard and modulus were repeated in boldface every seven items to ensure that they were always visible on the page during scrolling. The HTML template used for the AMT presentation is available on the author’s site (www.ling.cogsci.uci.edu/~jsprouse/tools/amt/). All other experimental details were identical.

Preprocessing of responses

The responses to the nine practice items were removed, and the remaining responses for each participant were z-score transformed prior to analysis. The z-score transformation is a standardization procedure that corrects for some kinds of scale bias between participants by converting a participant’s scores into units that convey the number of standard deviations each score is from that participant’s mean score.

Case studies for analysis

Fourteen of the 24 sentence types will be analyzed in this comparison. These 14 sentence types can be paired (one in the experimental condition and one control) to form seven theoretically relevant phenomena from the syntactic and sentence-processing literature. The first four phenomena are called island effects (Chomsky, 1986; Huang, 1982; Ross, 1967). Island effects are ideal case studies for AMT, since they have many of the properties of other syntactic phenomena: They are discussed in dozens of articles and textbooks, the source of the unacceptability is generally too abstract for naive participants to identify or correct, and they have been reported to demonstrate a good deal of variability among native speakers (Grimshaw, 1986; Hofmeister, & Sag, 2010; Kuno, 1973).

(2)
Whether Island Effect
What do you think that John bought?
(control)
*What do you wonder whether John bought?
(violation)
(3)
Complex Noun Phrase Island Effect
What did you claim that John bought?
(control)
*What did you make the claim that John bought?
(violation)
(4)
Subject Island Effect
What do you think interrupted the TV show?
(control)
*What do you think the speech about interrupted the TV show?
(violation)
(5)
Adjunct Island Effect
What do you think that John forgot at the office?
(control)
*What do you worry if John forgets at the office?
(violation)

The next three case studies are contrasts that have historically proven particularly difficult to replicate in acceptability judgment tasks, but are nonetheless detectable with very large sample sizes like those in this study (Sprouse & Almeida, 2010). They are the center embedding illusion (e.g., Frazier, 1985; Gibson & Thomas, 1999), the comparative illusion (e.g., Phillips, Wagers, & Lau, in press), and the agreement attraction illusion (e.g., Wagers, Lau, & Phillips, 2009). These contrasts are likely difficult to detect with acceptability judgments because they are not caused by a static property of the syntactic representations, but rather by the way the sentences are processed. Such processing-based effects are generally investigated using measures with high temporal resolution, such as reaction times or event-related potentials, rather than untimed acceptability judgments; however, these three contrasts have been reported using untimed acceptability judgments, and therefore provide an interesting case study in the detection of extremely weak effects using an AMT sample.

(6)

Center Embedding Illusion

*The ancient manuscript that the grad student who the new card catalog had confused a great deal was studying in the library was missing a page.	(violation)
?The ancient manuscript that the grad student who the new card catalog had confused a great deal was missing a page.	(illusion)

(7)
Comparative Illusion
*More people have graduated law school than I have.
(violation)
?More people have been to Russia than I have.
(illusion)
(8)
Agreement Attraction Illusion
*The slogan on the poster unsurprisingly were designed to get attention.
(violation)
?The slogan on the posters unsurprisingly were designed to get attention.
(illusion)

Time, cost, and participant rejection

There are many aspects of the experimental procedure that could be affected by the change of venue from the laboratory to AMT, such as the time it takes to create and run the experiment, the methods available for ensuring an appropriate sample (e.g., only native speakers of English), and the number of participants that must be removed from the sample prior to analysis. This section provides an in-depth comparison of these preanalysis aspects of the experimental procedure.

Time

Preparation

Laboratory experiments require the use of experimental software (e.g., WebExp, MiniJudge) or the creation of paper surveys; AMT experiments require the creation of an HTML survey. It took about 3 h to explore the AMT documentation (tutorials and discussion threads), and another hour to create the HTML template for the surveys, for a total of 4 h of initial setup time, which seems comparable to the initial setup of other software options. This is a one-time investment, and the HTML template is reusable; therefore, additional experiments will take only a matter of minutes to publish. The HTML template used here can be downloaded for free from the author’s Web site (www.ling.cogsci.uci.edu/~jsprouse/tools/amt/).

Data collection

The primary advantage of AMT is in data collection. The laboratory-based sample took approximately 88 experimenter hours spread over a 3-month period, whereas AMT returned 170 surveys in 2 h. That is a rate of 85 participants per hour. Because a few of the participants were excluded during data collection (see the Participant Rejection section below), the total time to collect 176 correctly completed surveys was 4 h. These rates suggest that a standard-sized sample (25–35 participants) could be collected in less than 1 h using AMT.

Cost

The laboratory-based participants were paid $5 or given course credit for a 30-min visit to the laboratory. The AMT participants were paid $3 per survey. The $3 compensation rate was chosen on the basis of the other HITs available on AMT: HITs generally pay $0.02 per single task, and these surveys required 105 judgments in addition to the reading of detailed instructions. AMT charges a 10% fee in addition to the compensation given to workers, so the total participant compensation cost was $3.30 per participant ($580.80 for 176 participants). The participant compensation cost of AMT is likely to be a concern for linguists without funding. Whereas laboratory-based experiments can be run at no cost through the use of university participant pools that grant course credit, the AMT system is cash only. At these rates, a standard 30 participant/100 item experiment on AMT would cost approximately $100.

Participant rejection

Selection

Participant selection criteria will obviously vary from experiment to experiment; however, there are at least two criteria that every experiment will include that can be used as case studies to understand the dynamics of participant selection on AMT:

1.
Participants must be native speakers of the language of interest (e.g., U.S. English).
2.
Participants must take the experiment only once.

The AMT documentation indicates that requesters can require that workers complete a qualification exam prior to completing HITs. These qualification exams are intended to assess the worker’s skill at a particular task. It is theoretically possible to create a qualification exam that will screen out nonnative speakers and participants who have already completed a related survey. However, workers can retake qualification exams. This means that a worker who is disqualified for being a nonnative speaker can potentially retake the exam and change his or her answers to avoid disqualification. This situation is not ideal, as it potentially encourages misrepresentation. Furthermore, several discussion threads on the AMT forum suggest that qualification exams severely decrease participation rates, as many AMT workers routinely ignore HITs that require qualification.

Given the retake possibility of the qualification exams, it seems that the only option for participant selection is to rely on self-identification by the participants in combination with postcollection participant rejection criteria. To that end, the description of the experiment said “You must be a native speaker of U.S. English to participate in this experiment.” This description is visible to workers while they are browsing the list of available HITs. Similarly, the first paragraph of the survey instructions explained that this HIT is actually an experiment, and that only native speakers of U.S. English should take it because nonnative speakers could contaminate the data. Participants were then told that a native speaker of U.S. English meets the following two criteria, and were asked to choose YES or NO using radio buttons for each criterion:

1.
You lived in the United States from birth until age 13.
2.
Both of your parents spoke English to you during those years.

Participants were paid $3 regardless of their answers to these criteria. This ensured that there was no incentive to answer untruthfully and that the responses could be used to reject participants prior to analysis. Only 3 participants answered NO to one or more of the native speaker criteria. These 3 participants were still compensated for their time, so $9.90 was lost to self-identified nonnative speakers.

To ensure that participants only completed one of the eight surveys that were part of this experiment, a paragraph was placed at the end of the survey (after all of the judgments) that instructed workers not to take any of the seven other HITs available as part of this HIT batch. They were told that they would only be paid for the first survey that they completed, so there was no monetary incentive to complete additional HITs in this batch. Because AMT assigns each worker a unique alphanumeric ID number, it is relatively straightforward to search the results for workers who have completed multiple surveys and to reject their later surveys using the AMT approval/rejection feature. If a worker is rejected through the approval/rejection feature, he or she is not compensated for that HIT, and that HIT is automatically returned to the list of available HITs to be completed by a different worker. The approval/rejection feature thus ensures that there is no monetary incentive for workers to take more than one survey in a single experiment. One participant submitted three surveys. Only the first was approved; the other two were rejected and returned to the AMT system for completion by other participants.

False submission

Because laboratory experiments are conducted in person, there are generally no false submissions. There can be participants who fail to show for a scheduled appointment, but at many universities there are penalties to dissuade no-shows. On the AMT system, there are no such penalties. Seven participants submitted incomplete surveys. These participants were rejected using the AMT rejection/approval system, which means that they were not compensated for their time, and their surveys were automatically returned to the AMT system to be taken by other participants. Together with the two repeated surveys mentioned in the previous subsection, this means that 9 out of 176 surveys were rejected using the AMT rejection/approval system and returned to the AMT system (5.1%). Identifying these 9 surveys took less than 10 min of experimenter time and resulted in no monetary loss.

Rejections

Because acceptability judgments are by definition subjective (there is no external measurement method), there are no universally agreed upon criteria for identifying participants who are not performing the task correctly. One possibility explored by Sprouse and Cunningham (submitted for publication) was to plot the mean ratings of each condition in ascending order and identify a subset of conditions that appear to have a definitive rank order in the sample mean data. The rank order of those items could then be computed for each participant and compared to their rank order in the sample mean data (the “true” ordering) to derive a measure of divergence between each participant’s rank order and the sample rank order. One such measure of rank order comparison is the tau rank correlation (Kendall, 1938). The tau rank correlation is based on Kendall’s tau, which is a distance measure between two rank orders based on how many pairwise “flips” of adjacent numbers are necessary to turn one rank order into another. The tau rank correlation yields a coefficient for each participant between –1 and 1. A perfect match between the two ranks yields a 1, no relation between two ranks yields a 0, and the most dissimilar rank yields a –1. The tau rank correlation coefficients can then be plotted in a histogram to identify any participants whose rank order is qualitatively different from the sample rank order. Crucially, for the purposes of this report, this procedure does not have to be the best possible outlier identification procedure; it merely has to return results that (1) are logically interpretable and (2) allow for a comparison to be made between the two samples.

To derive a baseline rank order for comparison, eight conditions were chosen that appeared to have a reliable set of ordering relations on the basis of the mean ratings of all participants in both samples. In ascending order, these were (a) adjunct island violations, (b) whether island violations, (c) agreement attraction violations, (d) agreement attraction illusions, (e) matrix wh- questions with embedded adjunct clauses, (f) long distance wh- questions with embedded that clauses, (g) matrix wh- questions with embedded complex NPs, and (h) matrix wh- questions with embedded that clauses.

(9)
Examples of the Eight Conditions Chosen for the Rank Order Analysis
1. a.
  What do you worry if the lawyer forgets at the office?
2. b.
  What does the detective wonder whether Paul took?
3. c.
  The slogan on the poster unsurprisingly were designed to get attention.
4. d.
  The slogan on the posters unsurprisingly were designed to get attention.
5. e.
  Who worries if the lawyer forgets his briefcase at the office?
6. f.
  What does the detective think Paul took?
7. g.
  Who made the claim that Amy stole the pizza?
8. h.
  Who thinks Paul took the necklace?

The R statistical computing environment (R Development Core Team, 2009) was used to compute the order of those eight conditions for each participant and compare each one’s order with the baseline. The tau correlation coefficients for each sample are presented in Fig. 2.

The tau coefficients for the laboratory sample are much more tightly clustered at the high end of the scale than the AMT sample, which has a much heavier leftward tail. At a practical level, this means that it is much easier to identify outliers in the laboratory sample: the 3 participants with tau coefficients below 0 are obviously distinct from the primary mass of participants. Furthermore, their negative tau coefficients indicate that their rank order was nearly reverse from the sample rank order. The picture is less clear for the AMT sample. A large majority of the participants still have tau coefficients above .5, but there are many more participants with tau coefficients near or below 0, and there is a less clear separation between the primary mass of participants and the potential outliers. Adopting a cutoff criterion similar to the one for the laboratory sample (~.15) results in the elimination of 22 participants from the AMT sample and coincides with a minor mode in the tail of the distribution. The fact that this criterion is difficult to establish without a comparison to the laboratory sample raises a potential problem for the use of this method of participant removal with AMT samples; however, for the purposes of this validation study, it provides us with a conservative estimate that is logically comparable to the laboratory sample.

In total, 25 out of 176 participants (14.2%) were excluded from the AMT sample for either self-identifying as nonnative (3) or providing results in which the rank order differed significantly from the sample rank order (22). Although the AMT rejection rate appears to compare unfavorably with the 3 rejections for the laboratory sample (1.7%), it should be noted that 14.2% is well within the range of rejection rates for other behavioral methodologies such as self-paced reading and lexical decision, and lower than the rejection rates for electrophysiological methodologies such as EEG and MEG. The minor increase in participant rejections in the AMT sample seems to be more than offset by the 90:1 time advantage. To adjust for this slightly higher rejection rate, syntacticians may want to consider adding 15% to the target sample size (e.g., 35 instead of 30). The statistical analyses presented in the following sections were performed on the remaining 173 participants in the laboratory sample and the remaining 151 participants in the AMT sample.

Statistical power

The primary concern of syntacticians is that the noise introduced by the uncontrolled environment of AMT might lead to lower statistical power than traditional laboratory-based experiments. To investigate this concern empirically, resampling simulations were run on each of the phenomena presented in the Case Studies for Analysis section above. These resampling simulations were designed to estimate the rate of statistical detectability for each phenomenon for every sample size between 5 and 173 for the laboratory sample, and between 5 and 151 for the AMT sample. In other words, these resampling simulations provide an answer to the questions: How likely am I to detect phenomenon X with a sample size of Y in the laboratory? And how likely am I to detect phenomenon X with a sample size of Y with AMT?

The algorithm for the resampling simulations can be described as follows (see Sprouse & Almeida, 2010, for more details):

1.
Choose one of the two samples (laboratory or AMT).
2.
Choose a sample size (e.g., 5).
3.
Randomly sample (with replacement) a number of participants equal to that size (e.g., 5) from the full data set.
4.
Randomly choose one judgment for each condition from each of the participants in the sample.
5.
Run a paired t test on the sample.
6.
Repeat Steps 3–5 a total of 1,000 times.
7.
Calculate the proportion of significant results (p < .05) out of those 1,000 samples; this is an estimate of the detection rate at that sample size.
8.
Repeat Steps 2–7 for all of the other possible sample sizes (5–173 for the laboratory sample, 5–151 for the AMT sample).
9.
Repeat Steps 2–8 for every possible number of judgments per participant per condition (in this case, 1–4).
10.
Repeat Steps 2–9 for the other sample (laboratory or AMT).

It should be noted that sample sizes below 5 were not tested because paired t tests are not necessarily computable for sample sizes smaller than 5. Only graphs for one judgment per participant per condition and four judgments per participant per condition are presented in Fig. 3, as these were the upper and lower bounds made possible by the design of the experiment. Because all of the island effects tested asymptoted at 100% detectability with relatively small samples, the figure only presents the detectability estimates for sample sizes up to 30.

Although there does appear to be a slight loss of statistical power in the AMT sample, this difference is relatively small by experimental standards: The AMT sample requires 3 or 4 more participants than the laboratory sample to reach 100% detectability. This suggests that any concern that syntacticians may have about AMT can be alleviated by increasing the sample size slightly. It should also be noted that both the laboratory sample and the AMT sample reached 100% detectability with fewer than 20 participants in the relatively underpowered one-judgment analysis. Given that the standard sample size in formal acceptability judgments is 25–30 and that it is standard to give each participant more than one judgment per condition, it seems unlikely that syntacticians would notice the slight power loss under normal experimental design conditions. In short, these results suggest that AMT is well suited to detect standard syntactic phenomena without any noticeable loss in statistical power.

The three weak phenomena presented in Fig. 4 have historically been difficult to detect with standard acceptability judgment experiments, likely because they are not caused by static properties of the final syntactic representation, but rather by dynamic properties of the way these representations are constructed during real-time sentence processing. Nonetheless, these effects are detectable with extremely large samples, as demonstrated in Fig. 4. This makes them an ideal test case for the ability to detect extremely weak effects using AMT.

For the center embedding and agreement attraction effects, the AMT sample once again appears to yield slightly lower detectability rates than the laboratory sample: The AMT sample requires 10 additional participants to reach detectability rates that are comparable to the laboratory sample. This does not appear to pose a significant problem for the use AMT, given the ease with which an additional 10 participants can be recruited. However, the comparative illusion detection rate in the AMT sample is potential cause for concern: The AMT sample appears to require 50 additional participants to reach detectability rates that are comparable to the laboratory sample. Given that two of the three extremely weak effects were detected within the AMT sample at rates comparable to the laboratory sample, it seems likely that the lower detection rate for comparative illusions may say more about comparative illusions than it does about the use of AMT. In fact, as we shall see in the next section, the distributions of the comparative illusion data suggest that fewer AMT participants were fooled by the illusion, which suggests that the lower detectability of the effect in the AMT sample may be indicative of more accurate judging by the AMT participants. Taken together with the fact that none of these effects are well suited to investigation using (nonspeeded) acceptability judgments in the first place, these results strongly suggest that syntacticians need not worry about the statistical power of AMT samples for true syntactic phenomena.

The shapes of the distributions

One final analysis that may be of interest to syntacticians considering the use of AMT is a direct comparison of the shapes of the distributions of each condition in the laboratory and AMT samples. Whereas the resampling simulations in the previous section confirmed that differences between condition means arise at approximately the same rates in each sample, the direct comparison of the distributions can confirm that the sources of the differences between condition means are identical for each sample (i.e., the location of the peak (mode) vs. the heaviness of the tail). To aid in the visualization of the distributions, density curves for each condition were calculated using the function density in the base statistics package {stats} in R. These density curves are plotted in Fig. 5.

The distributions of the two samples are very similar for each of the conditions constituting the island effects: the peaks (modes) are approximately equal in location and frequency, and the overall shapes and widths of the distributions are approximately equal. It does appear that the rightward tail of the AMT distributions is slightly heavier than the rightward tail of the laboratory distributions, which may account for the marginal power difference between the two samples. But overall, the variation between the distributions appears to be well within the bounds of normal variation between samples.

The first point to note about the illusions in Fig. 6 is that the mean differences are not driven by as clear a peak (mode) separation as the island effects; instead, the differences between the control violations (solid lines) and the illusions (dashed lines) appear to be driven by both a small shift in the locations of the distributions along the x-axis and small changes in the shapes of the distributions. Nonetheless, the shapes of the laboratory and AMT distributions for each condition again appear to be relatively similar. It should be noted that the reason for the discrepancy between the two samples with respect to the detectability of the comparative illusion may be visible in the density curves in Fig. 6: Although the peaks of the illusion conditions appear to be equal in the two samples, the laboratory illusion condition appears to have a slightly heavier right side than the AMT illusion condition. This suggests that fewer AMT participants were fooled by the illusion, which would result in the lower detectability rates of the comparative illusion in the previous section. This raises the interesting possibility that the AMT sample included more accurate participants than did the laboratory sample, at least for the comparative illusion. Of course, additional research on the comparative illusion itself is necessary to better understand the differences between the two samples.

Conclusion

Data quality

The quantitative comparison of these two large-scale samples suggests that Amazon Mechanical Turk is a viable alternative to laboratory-based acceptability judgment experiments. AMT provides impressive time savings (the collection rate is about 85 participants per hour) without any meaningful disadvantage on the measures of concern to syntacticians:

The participant rejection rate is less than 15%, which is well within the normal bounds for behavioral experiments.
There is no evidence of a meaningful power loss for syntactic phenomena, and only a slight power loss for extremely weak (processing-based) effects.
There is no evidence of meaningful differences in the shapes or locations of the judgment distributions.

Limitations

The most obvious limitation of AMT is the cost: AMT is a payment-only marketplace, and therefore requires research funding (e.g., $3.30 per participant for a 105-item survey). Although these sums are relatively small, they do lead to a significant increase over the (free) university participant pools that syntacticians are accustomed to. In addition to cost, there are also other, less obvious limitations imposed by the AMT environment that syntacticians should keep in mind as they switch from laboratory-based experiments to online AMT experiments:

The online-only interface means that there is no way to ensure that the participants understand the task. This may contribute to the increased participant rejection rate over laboratory-based experiments.
There is similarly no way to debrief participants after the experiment to identify potential problems with the design, instructions, responses, and so forth. The only option is to include debriefing questions as part of the survey itself, which limits the ability to follow up based on the participant’s responses.
The increased participant rejection rate suggests a need for standard participant rejection criteria. Unfortunately, at present there are no standard participant rejection methods in the acceptability judgment literature.
The HTML foundation of AMT means that audio and visual stimuli may be used instead of text (as long as Web browsers support the multimedia file type). However, Amazon provides no mechanism for uploading multimedia files. Instead, researchers must store the multimedia files on their own Web server and link to the files in the HIT itself. An example template for audio files (an auditory acceptability judgment task) is included on the author’s Web site (see the Supplemental Materials section below).
The AMT system provides no mechanism for the collection of reaction times. The only time recorded by the AMT system is HIT completion time (the time from acceptance of the HIT to submission of the HIT), which can be used for participant rejection. If reaction times are crucial to the acceptability judgment experiment, one could use an independent experimental platform (such as WebExp) and use AMT to recruit participants and direct them to the independent experimental platform.
The AMT system does not include functions to aid in experimental design (as is common in dedicated experimental platforms). For example, AMT cannot automatically randomize the order of presentation in a survey. Instead, the experimenter must create randomized versions of the surveys by hand. If the experimenter does not create a novel randomization for each participant, then several participants will see the same randomization (as in this experiment). This adds some time to the construction phase of the experiment.
At present, the AMT worker pool is primarily composed of residents of the U.S. (46.8%) and residents of India (34%) (Ipeirotis, 2010). The composition of the worker pool is a direct reflection of Amazon’s payment system, which is currently configured to pay in U.S. dollars and Indian rupees only. The composition may change in the future as Amazon’s payment system expands; however, at present the lack of geographic diversity will likely affect the collection rates for languages other than English and Hindi, potentially limiting the benefits of AMT for cross-linguistic studies.

Recommendations

In addition to being aware of the limitations discussed above, I also strongly recommend the following practices to help control the unique properties of the AMT environment:

Any questions about native speaker ability should be informational only and, crucially, should not lead to nonpayment. This discourages misrepresentations, so that the answers can be used as participant rejection criteria during data analysis.
Researchers should run some sort of participant rejection or outlier removal process prior to analysis, since the AMT outlier rate is higher than the laboratory rate (14.2% vs. 1.7%).
Target sample sizes should be increased by 15% to accommodate the higher participant rejection rate.
If extremely weak effects are being investigated (i.e., effects that require sample sizes of 100 or more), 10 additional participants should be added to accommodate the slightly lower statistical power of the AMT sample.

Supplemental Materials

HTML templates for five different acceptability judgment tasks (magnitude estimation, 7-point scale, yes–no, forced choice, and auditory) can be found on the author’s Web site (currently, www.ling.cogsci.uci.edu/~jsprouse/tools/amt/). This page also includes links to R scripts that may aid in the analysis of data collected using AMT and an online tutorial offered by Amazon about using the AMT Web site.

References

Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72, 32–68.
Article Google Scholar
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge: MIT Press.
Google Scholar
Chomsky, N. (1986). Barriers. Cambridge: MIT Press.
Google Scholar
Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. Thousand Oaks: Sage.
Google Scholar
Featherston, S. (2005a). Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua, 115, 1525–1550.
Article Google Scholar
Featherston, S. (2005b). Universals and grammaticality: Wh-constraints in German and English. Linguistics, 43, 667–711.
Article Google Scholar
Frazier, L. (1985). Syntactic complexity. In D. Dowty, L. Karttunen, & A. Zwicky (Eds.), Natural language processing: Psychological, computational and theoretical perspectives (pp. 129–189). Cambridge: Cambridge University Press.
Google Scholar
Gibson, E., & Fedorenko, E. (in press). The need for quantitative methods in syntax. Language and Cognitive Processes.
Gibson, E., & Thomas, J. (1999). Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14, 225–248.
Article Google Scholar
Grimshaw, J. (1986). Subjacency and the S/S' parameter. Linguistic Inquiry, 17, 364–369.
Google Scholar
Hofmeister, P., & Sag, I. (2010). Cognitive constraints and island effects. Language, 86, 366–415.
Article Google Scholar
Huang, C.-T. (1982). Move WH in a language without WH movement. Linguistic Review, 1, 369–416.
Article Google Scholar
Ipeirotis, P. G. (2010). Demographics of Mechanical Turk. Center for Digital Economy Research Working Papers, 10. Available at http://hdl.handle.net/2451/29585
Keller, F. (2000). Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. University of Edinburgh: Unpublished doctoral dissertation.
Google Scholar
Keller, F., Gunasekharan, S., Mayo, N., & Corley, M. (2009). Timing accuracy of Web experiments: A case study using the WebExp software package. Behavior Research Methods, 41, 1–12.
Article PubMed Google Scholar
Kendall, M. (1938). A new measure of rank correlation. Biometrika, 30, 81–89.
Google Scholar
Kuno, S. (1973). Constraints on internal clauses and sentential subjects. Linguistic Inquiry, 4, 363–385.
Google Scholar
Marantz, A. (2005). Generative linguistics within the cognitive neuroscience of language. Linguistic Review, 22, 429–445.
Article Google Scholar
Myers, J. (2009). The design and analysis of small-scale syntactic judgment experiments. Lingua, 119, 425–444.
Article Google Scholar
Phillips, C., Wagers, M., & Lau, E. (in press). Grammatical illusions and selective fallibility in real-time language comprehension. Language and Linguistics Compass.
R Development Core Team. (2009). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available at www.R-project.org
Ross, J. R. (1967). Constraints on variables in syntax. Unpublished doctoral dissertation, MIT, Cambridge, MA.
Schütze, C. (1996). The empirical base of linguistics: Grammaticality judgments and linguistic methodology. Chicago: University of Chicago Press.
Google Scholar
Sorace, A., & Keller, F. (2004). Gradience in linguistic data. Lingua, 115, 1497–1524.
Article Google Scholar
Sprouse, J. (2009). Revisiting satiation: Evidence for an equalization response strategy. Linguistic Inquiry, 40, 329–341.
Article Google Scholar
Sprouse, J., & Almeida, D. (2010). A quantitative defense of linguistic methodology. Manuscript submitted for publication.
Sprouse, J., Wagers, M., & Phillips, C. (2010). A test of the relation between working memory capacity and island effects. Manuscript submitted for publication.
Wagers, M., Lau, E., & Phillips, C. (2009). Agreement attraction in comprehension: Representations and processes. Journal of Memory and Language, 61, 206–237.
Article Google Scholar

Download references

Author Notes

This research was supported in part by National Science Foundation Grant BCS-0843896. I thank Diogo Almeida for helpful comments, Jessamy Norton-Ford for assistance in the early stages of this project, and two anonymous reviewers for their thoughtful comments.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

Department of Cognitive Sciences, University of California, 3151 Social Science Plaza A, Irvine, CA, 92697-5100, USA
Jon Sprouse

Authors

Jon Sprouse
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jon Sprouse.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Reprints and permissions

About this article

Cite this article

Sprouse, J. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behav Res 43, 155–167 (2011). https://doi.org/10.3758/s13428-010-0039-7

Download citation

Published: 25 November 2010
Issue Date: March 2011
DOI: https://doi.org/10.3758/s13428-010-0039-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

What do you think that John bought?	(control)
*What do you wonder whether John bought?	(violation)

What did you claim that John bought?	(control)
*What did you make the claim that John bought?	(violation)

What do you think interrupted the TV show?	(control)
*What do you think the speech about interrupted the TV show?	(violation)

What do you think that John forgot at the office?	(control)
*What do you worry if John forgets at the office?	(violation)

A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory

Abstract

Similar content being viewed by others

Questionnaire Design

Natural Language Processing

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Experimental details

Method

Participants

Materials

Procedure

Presentation in the laboratory

Presentation on AMT

Preprocessing of responses

Case studies for analysis

Time, cost, and participant rejection

Time

Preparation

Data collection

Cost

Participant rejection

Selection

False submission

Rejections

Statistical power

The shapes of the distributions

Conclusion

Data quality

Limitations

Recommendations

Supplemental Materials

References

Author Notes

Open Access

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation