Elsevier

Physiology & Behavior

Volume 82, Issue 1, August 2004, Pages 109-114
Valid across-group comparisons with labeled scales: the gLMS versus magnitude matching

https://doi.org/10.1016/j.physbeh.2004.02.033

Abstract

Labeled scales are commonly used for across-group comparisons. The labels consist of adjective/adverb intensity descriptors (e.g., “very strong”). The relative distances among descriptors are essentially constant, but the absolute perceived intensities they denote vary with the domain to which they are applied (e.g., a “very strong” rose odor is weaker than a “very strong” headache), as if descriptors were printed on an elastic ruler that compresses or expands to fit the domain of interest. Variation in individual experience also causes the elastic ruler to compress or expand. Taste varies genetically: supertasters perceive the most intense tastes; nontasters, the weakest; and medium tasters, intermediate tastes. Taste intensity descriptors on conventional labeled scales denote different absolute perceived intensities to the three groups, making comparisons across the groups invalid. Magnitude matching provides valid comparisons by asking subjects to express tastes relative to a standard not related to taste (e.g., supertasters match tastes to louder sounds than do nontasters). Borrowing the logic of magnitude matching, we constructed a labeled scale using descriptors unrelated to taste. We reasoned that expressing tastes on a scale labeled in terms of all sensory experience might work. We generalized an existing scale, the Labeled Magnitude Scale (LMS), by placing the label “strongest imaginable sensation of any kind” at the top. One hundred subjects rated tastes and tones using the generalized LMS (gLMS) and magnitude matching. The two methods produced similar results, suggesting that the gLMS is valid for taste comparisons across nontasters, medium tasters, and supertasters.

Introduction

Thomas Kuhn [1] described an “… implicit body of intertwined theoretical and methodological belief …” as a paradigm in science. The key to paradigm shifts in science, he wrote, is the discovery of anomaly. Kuhn noted that “… novelty emerges only with difficulty, manifested by resistance, against a background provided by expectation.” Kuhn described global paradigm shifts that affect broad areas of science but also noted paradigm shifts within branches of science of critical importance to the investigators in these fields. Smith [2] is well known for his interest in paradigms in appetitive research. His insights about these conceptual issues led us to consult him when we first encountered anomalies in taste that led to the revisiting of some long-held ideas about psychophysical scaling. His encouragement was very important to our work and we acknowledge that debt to him here.

A variety of problems in medicine and science require comparisons of sensory experiences across groups: the comparison of pain intensities across patients as a function of disease or injury, the comparison of the perception of satiety across bulimics and controls, the comparison of taste intensities across groups varying in the number of taste buds with which they were born, etc. There are many systematic group differences (e.g., sex, age, ethnicity, race, clinical status, genetic status, etc.) that have been the subject of sensory studies. The focus here is the validity of these group comparisons using labeled scales.

The labels on scales [Likert, category, Visual Analogue Scale (VAS)] derive from the way we describe our sensory experiences in everyday life: “That tastes very strong to me. Does it taste very strong to you?” The descriptor is an adjective (strong) modified by an adverb (very). The rules of grammar reveal the essential problem with this comparison. Intensity adjectives modify nouns. They are dependent on the noun and have no absolute meaning unless the noun is specified. Stevens [3] demonstrated this in the following quote: “Mice may be called large or small, and so may elephants, and it is quite understandable when someone says it was a large mouse that ran up the trunk of the small elephant.” Adverbs modify adjectives so we can speak of a very large mouse and a very small elephant (for accounts of how intensity adverbs modify adjectives, see [4], [5]). When we use adjective/adverb intensity descriptors, they denote only relative intensities (e.g., small is less than large) until their noun (e.g., sensory domain) is specified. Once we know the domain (mice or elephants), we have an idea of the absolute intensities involved, so we know that even a large mouse will be smaller than a small elephant.

When we quantify the intensities of sensations, we have an additional problem. We cannot share experiences so we cannot compare perceived sensations directly. Thus, we have no check on the absolute perceived intensities of our sensations. All of us may taste the same sugar and describe it as “very strong” but we cannot tell whether or not we are experiencing the same sweetness intensity. This is the heart of the problem with labeled scales that is the subject of this study.

Consider this problem in the context of a serious medical issue: the perception of pain intensity. Pain is typically assessed with a scale from 0 to 10, where 0=no pain and 10=most intense pain you have ever experienced. This scale is fine for detecting change in an individual patient's pain but it produces flawed comparisons across patients. For a woman who has experienced a particularly painful childbirth, “most intense pain you have ever experienced” likely denotes a more intense pain than for a child whose most intense pain to date is a skinned knee. At a more practical level, let us suppose that when pain is rated “4” on the scale, patients are given medication. For the woman who has experienced a painful childbirth, “4” denotes a more intense pain than “4” denotes for our child with the skinned knee. In this scenario, the woman would have to be in greater pain to be prescribed an analgesic. This has impact not only on clinical medicine but also on the evaluation of analgesia efficacy, etc. Finding a solution to this problem would likely have considerable practical benefit.

Category scales are very old [e.g., the Greek astronomer Hipparchus (190–120 B.C.) devised one to assess the brightness of the stars]. Two category scales known to contemporary investigators are the Likert scale [6], which was devised for the measurement of attitudes (“Likert” is now sometimes used incorrectly as a synonym for “category”) and the Natick nine-point scale [7], [8]. Of special importance, Stevens [9] noted that category scales have only ordinal and not ratio properties. For example, a rating of “8” on the Natick nine-point scale does not denote a perceived intensity twice as intense as does the rating “4.”

The VAS drops all of the intermediate labels and retains only the line with labels at its ends as anchors: the “minimum and the maximum rating” for the attribute of interest (e.g., see Ref. [10]). The VAS traces back to the 1960s [11], [12], [13]. Aitken first used his VAS to quantify emotions and moods. The VAS was adopted for pain studies [14] and is still widely used in that field. Of special interest, Price et al. [15] determined that the VAS used for pain had ratio properties (i.e., a rating at the top of the scale denotes a perceived intensity twice that of a rating half way up the scale), a considerable advantage in measurement.

The intensity descriptors used for the labels on category scales and used to anchor the ends of the VAS are subject to the problem described above: they may denote systematically different perceived intensities to the different groups under study. This problem has been addressed by distinguished scientists in different fields (e.g., [16], [17], [18]).

The taste anomalies that first led us to question the validity of across-group comparisons with labeled scales began with the discovery of supertasters, individuals who experience unusually intense taste sensations [19]. Fox [20] discovered taste blindness; many individuals (tasters) perceive a bitter taste from the crystals of phenylthiocarbamide (PTC) while others (nontasters) fail to do so. Early family studies suggested that taste blindness was a Mendelian trait [21]. The early descriptive assessments (yes–no to perceiving bitterness) gave way to a threshold measure [22] that was to dominate PTC studies until the 1970s.

Although some of the studies of this early era linked PTC tasting and various diseases, research on associations between PTC tasting and health really began with the work of Fischer in the 1960s. Fischer made minor changes in the threshold method but his most important contributions were the substitution of PROP (6-n-propylthiouracil) for PTC (PROP, a chemical relative of PTC, shows taste blindness but lacks the sulfurous odor of PTC) and focus on the impact of taster status on food preferences (e.g., see Ref. [23]). Amazingly, he predicted many of the PROP/health links that we study today. He associated PROP status with body weight (tasters were thinner) and food preferences (tasters liked bitter foods less than did nontasters) and suggested that changes in PROP taste could result from stress and hormonal variation.

As Fischer's career was winding down, Stevens was revolutionizing psychophysics. The threshold methods used to study the ability to taste PROP dated back to the great 19th century psychophysicist, Fechner. Fechner's view of quantification of sensation above threshold rested on the concept of the “just noticeable difference” (JND), the amount a stimulus had to be increased to make it noticeably different. Fechner thought that the JND could be considered a psychological unit. To “scale” the bitterness of a given concentration of PROP, we needed only to count the JNDs from the absolute threshold for PROP to the concentration of interest. Stevens demolished this idea by noting that the JND did not act like a constant unit. A stimulus 8 JNDs above threshold was more than twice as intense as one 4 JNDs above threshold [24]. Stevens argued that the best psychological scales would have ratio properties. That is, a stimulus rated as “8” should be twice as intense as a stimulus rated “4.” Not surprisingly, he thought little of category scales limited only to ordinal measurement. Stevens's most famous method was magnitude estimation. Subjects were instructed to assign numbers to their sensations such that a sensation twice as strong as another would be given a number twice as large. Note however that this produces only information about relative perceived intensities. Magnitude estimates do not provide information about absolute perceived intensities and so cannot be compared across subjects/groups.

The problem of making comparisons across individuals or groups would be solved if we had a standard that we knew was the same absolute perceived intensity to all. We could simply ask subjects to rate any sensation of interest relative to that evoked by the standard. As a result of research on cross-modality matching [25], [26], we know that subjects can match ratings of perceived intensity across various sensory modalities. Unfortunately, we can never prove that we have a standard perceived the same by all because we cannot share sensory experiences. Is there a solution to this dilemma? There is if we focus on group comparisons. Suppose we have a standard that we have reason to believe is unrelated to the stimuli of interest. Even if that standard is not perceived the same by all, if the standard and stimuli of interest are unrelated, then any variability in the standard will likely be the same across groups varying in their perceptions of the stimuli of interest. Consider an example from taste. The early studies on taste blindness attributed differences in the ability to taste PTC/PROP to the N−C=S group on a molecule. A taste stimulus like NaCl does not have such a chemical group. Thus, any variability in the perception of NaCl would be independent of the presence or absence of the PROP receptor. We could compare nontasters' and tasters' abilities to taste PROP by having them rate the bitterness of PROP (e.g., 0.0032 M, near saturation) relative to the saltiness of NaCl (e.g., 1 M). Nontasters rated the saltiness of NaCl to be much more intense than the bitterness of PROP. Tasters divided into two groups: Some (supertasters) rated the bitterness of PROP to be much more intense than the saltiness of NaCl, while others (medium tasters) rated the two as similar in intensity. By treating the average rating of NaCl as equivalent in the three groups, we could measure the difference in bitterness across nontasters, medium tasters, and supertasters.
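The normalization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject identifiers and ratings are invented, and the helper name `normalize_to_standard` is hypothetical. Each subject's rating of PROP is divided by that subject's own rating of the NaCl standard, so the private number scale each subject used cancels out.

```python
# Hypothetical magnitude estimates (invented for illustration only).
# Each subject uses a private number scale, so raw numbers are not comparable.
ratings = {
    "s1": {"nacl": 10.0, "prop": 2.0},   # PROP much weaker than NaCl
    "s2": {"nacl": 50.0, "prop": 55.0},  # PROP roughly equal to NaCl
    "s3": {"nacl": 4.0, "prop": 12.0},   # PROP much stronger than NaCl
}


def normalize_to_standard(subject_ratings, standard="nacl"):
    """Express every rating as a ratio to the subject's rating of the standard."""
    reference = subject_ratings[standard]
    if reference <= 0:
        raise ValueError("the standard must have a positive rating")
    return {stim: value / reference for stim, value in subject_ratings.items()}


normalized = {sid: normalize_to_standard(r) for sid, r in ratings.items()}

# PROP bitterness relative to NaCl saltiness, per subject.
prop_ratio = {sid: n["prop"] for sid, n in normalized.items()}

for sid, ratio in prop_ratio.items():
    print(f"{sid}: PROP / NaCl = {ratio:.2f}")
```

The comparison across subjects is only as good as the assumption that the standard is perceived equally, on average, in every group. The same arithmetic applies if the standard is something other than NaCl.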

The weakness of this approach is the assumption that the saltiness of NaCl is genuinely independent of the bitterness of PROP. In fact, it turned out that this assumption was wrong. Thanks to insights from Miller, who had discovered that the number of taste buds was associated with the intensity of neural response to bitter compounds in mice [27], we found that PROP tasting correlated with density of taste buds in humans [28]. Supertasters not only perceived the greatest bitterness from PROP but also tended to perceive the greatest taste intensity from other tastants, including NaCl [29]. This meant that our original comparisons of the bitterness of PROP across nontasters, medium tasters, and supertasters underestimated the differences across the groups.

Marks and Stevens [30], [31] generalized their work on cross-modality matching to create the method of magnitude matching, the logic of which was described above: Select a standard unlikely to be related to taste and express perceived taste intensities relative to that standard. The formalization of this method opened the way to select any of a number of standards. We replaced our NaCl standard with sound and found that PROP effects were larger using a tone standard (e.g., see Ref. [32]).

The differences across nontasters, medium tasters, and supertasters ultimately proved so large that it seemed this sensory variation was likely responsible for the links Fischer saw between genetic variation in taste, food behavior, and health. However, Fischer did not have the advantages of magnitude matching and so could not quantify the magnitude of the sensory variation. With magnitude matching, we can study the sensory characteristics of foods themselves and link these to behavior [33].

However, not all investigators have found this large sensory variation. Given the arguments above, the explanation is now clear. Supertasters live in the most vivid taste world and nontasters, the least vivid. Thus, the labels applied to taste denote different absolute perceived intensities to these groups [34]. Consider what this would do to ratings on the VAS (the argument is similar for category scales). Suppose we ask nontasters and supertasters to rate the sweetness of a very concentrated solution of sucrose on the VAS (0=“no sweet”; top of the scale=“sweetest ever tasted”). Because this solution may well be close to the sweetest thing members of either group have ever tasted, both groups will produce ratings near the top of the VAS, suggesting that PROP tasting is not associated with the perception of sweetness. But if we were to then ask each group to match the sweetness they perceived to the loudness of a tone, the supertasters would match the sweetness to a much louder tone than would the nontasters. This is the fundamental distinction between the magnitude matching data we collected and the labeled scale data collected by others.
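A toy calculation makes this argument concrete. Everything in the sketch is an assumption made for illustration: the "true" perceived intensities are invented numbers on a common scale that cannot actually be observed, the function names are hypothetical, and the tone is assumed to be perceived identically by both groups.

```python
# Invented perceived intensities on a common, unobservable scale.
groups = {
    "nontaster": {"sucrose": 27.0, "sweetest_ever": 30.0},
    "supertaster": {"sucrose": 81.0, "sweetest_ever": 90.0},
}


def vas_rating(perceived, own_maximum):
    """VAS position (0-100) when the top anchor is the rater's own maximum."""
    return 100.0 * perceived / own_maximum


def matched_tone(perceived):
    """Loudness of the tone judged equal to the taste.

    Assumes tone loudness is perceived the same way in both groups,
    so the match reflects the perceived taste intensity directly.
    """
    return perceived


vas = {name: vas_rating(g["sucrose"], g["sweetest_ever"]) for name, g in groups.items()}
tone = {name: matched_tone(g["sucrose"]) for name, g in groups.items()}

for name in groups:
    print(f"{name}: VAS = {vas[name]:.0f}, matched tone = {tone[name]:.0f}")
```

With these invented numbers both groups mark the same position on the VAS, because each group's ruler stretches to its own maximum, while the matched tone differs by a factor of three.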

It is important to delineate the conditions under which the VAS and category scales are appropriately used. The scales are obviously valid for within-subject comparisons and for across-group comparisons when members of the groups have been randomly assigned. The scales are invalid for across-group comparisons unless the experimenter can argue that the labels denote the same absolute perceived intensity, on average, to all groups. Unfortunately, if a group difference in sensory experience really exists, this is likely to make the labels on scales denote different absolute perceived intensities to the two groups.

Magnitude matching is the gold standard method for making valid across-group comparisons. Could we create a labeled scale that provides equally valid comparisons? The descriptors on labeled scales serve the same purpose as the standard in magnitude matching. The key is to select labels that are not related to the sensation of interest. Note that the VAS is labeled in terms of the attribute of interest. This is the fundamental flaw in the use of the VAS for across-group comparisons.

Can we generalize labeled scales to encompass all sensory experience? If the most intense sensation ever experienced is not correlated with taste, then anchoring a scale with that label at the top should permit valid across-group comparisons of taste experiences. The issues involved with creating such a scale are reviewed below.

First, we begin with a labeled scale with ratio properties. Several investigators have created such scales by spacing the descriptors appropriately [35], [36], [37], [38], [39], [40]. Interestingly, the relative spacings were similar across studies using a variety of domains (e.g., food attributes, pain, oral sensations). This suggests that the relative spacing of intensity descriptors is essentially invariant and the entire scale compresses or expands to fit the domain to which it is applied [41]. We reasoned that stretching that scale to its maximum might create a scale that could produce valid across-group comparisons. The most general of the hybrid ratio/labeled scales is one devised by Green et al. [39], the Labeled Magnitude Scale (LMS). This scale was devised for oral sensations and included a series of typical intensity descriptors (barely detectable, weak, moderate, strong, very strong); the top of the scale was labeled “strongest imaginable” oral sensation. We stretched this scale to its maximum by labeling the top “strongest imaginable sensation of any kind.” If the “strongest imaginable sensation of any kind” is independent of taste, then this gLMS should permit valid across-group comparisons. The purpose of the present study is to test this prediction.

Methods

Subjects (67 females, 33 males) with a mean age of 23.8±0.6 years were recruited from the Yale community; 55 were Caucasian, 20 African American, 12 Asian, 8 Hispanic, with 5 unknown or of other races. Subjects rated concentration series of NaCl, sucrose, citric acid, quinine, and PROP (6-n-propylthiouracil) as well as tones using magnitude estimation and the gLMS. The two methods were tested in separate sessions (order counterbalanced). PROP solutions (0.0001, 0.00032, 0.001, and 0.0032 M in random

Results

Fig. 1 shows a scatter plot of the responses to 0.0032 M PROP with each of the methods.

Fig. 2 shows concentration functions for PROP. Subjects' responses to 0.0032 M PROP were ranked for each method. For the purposes of this analysis, the 25% of subjects with the highest responses were considered to be supertasters (ST); the 25% of subjects with the lowest responses, nontasters (NT); and the 50% with the intermediate responses, medium tasters (MT).
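The rank-based grouping can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `classify_by_rank` and the example responses are hypothetical, and ties are broken by the order in which subjects appear, which the text does not specify. The function would be applied once per method, since subjects were ranked separately for each.

```python
def classify_by_rank(responses):
    """Assign NT / MT / ST from ranked responses to the top PROP concentration.

    responses: dict mapping subject id -> rating of 0.0032 M PROP.
    Lowest 25% -> "NT", highest 25% -> "ST", remaining 50% -> "MT".
    """
    ordered = sorted(responses, key=responses.get)  # ascending by rating
    n = len(ordered)
    low_cut = n // 4          # first index that is no longer a nontaster
    high_cut = n - n // 4     # first index that counts as a supertaster
    labels = {}
    for rank, subject in enumerate(ordered):
        if rank < low_cut:
            labels[subject] = "NT"
        elif rank >= high_cut:
            labels[subject] = "ST"
        else:
            labels[subject] = "MT"
    return labels


# Invented ratings for eight hypothetical subjects.
example = {"a": 5, "b": 12, "c": 30, "d": 41, "e": 47, "f": 58, "g": 77, "h": 95}
labels = classify_by_rank(example)
print(labels)
```

With eight subjects the split is two nontasters, four medium tasters, and two supertasters. When the sample size is not divisible by four, the integer division here puts the extra subjects in the medium group, which is one of several reasonable conventions.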

Discussion

The similarity of the responses collected with the gLMS and magnitude matching with a tone standard suggests that the gLMS provides valid comparisons across PROP groups. Note that PROP functions diverge for both methods. Thus, the most accurate way to sort subjects into groups is to use a very high concentration of PROP.

We note that since the gLMS is a ratio scale, gLMS data can be normalized to standards just as magnitude estimates can. Thus, an investigator can evaluate gLMS data by comparing

Acknowledgements

This study was funded by NIH grant number DC 00283.

References (43)

  • F. Mosteller et al. Quantifying probabilistic expressions. Stat. Sci. (1990)
  • R. Likert. A technique for the measurement of attitudes. Arch. Psychol. (1932)
  • L.V. Jones et al. Development of a scale for measuring soldier's food preferences. Food Res. (1955)
  • D.R. Peryam et al. Advanced taste test method. Food Eng. (1952)
  • S.S. Stevens. Mathematics, measurement, and psychophysics
  • M.M. Hetherington et al. Methods of investigating human eating behavior
  • R.C.B. Aitken. Measurement of feelings using visual analogue scales. Proc. R. Soc. Med. (1969)
  • P.R.F. Clarke et al. Reliability and sensitivity in the self-assessment of well-being. Bull. Br. Psychol. Soc. (1964)
  • T. Silverstone et al. The anorectic effect of dexamphetamine sulphate. Br. J. Pharmacol. Chemother. (1968)
  • L. Narens et al. How we may have been misled into believing in the interpersonal comparability of utility. Theory Decis. (1983)
  • M. Biernat et al. Shifting standards and stereotype-based judgements. J. Pers. Soc. Psychol. (1994)