Original Article
Differential Item Functioning in the Danish Translation of the SF-36

https://doi.org/10.1016/S0895-4356(98)00111-5

Abstract

Statistical analyses of Differential Item Functioning (DIF) can be used for rigorous translation evaluations. DIF techniques test whether each item functions in the same way, irrespective of the country, language, or culture of the respondents. For a given level of health, the score on any item should be independent of nationality. This requirement can be tested through contingency-table methods, which are efficient for analyzing all types of items. We investigated DIF in the Danish translation of the SF-36 Health Survey, using two general population samples (USA, n = 1,506; Denmark, n = 3,950). DIF was identified for 12 out of 35 items. These results agreed with independent ratings of translation quality, but the statistical techniques were more sensitive. When included in scales, the items exhibiting DIF had little impact on conclusions about cross-national differences in health in the general population. However, if used as single items, the DIF items could seriously bias results from cross-national comparisons. Also, the DIF items might have a larger impact on cross-national comparisons of groups with poorer health status. We conclude that analysis of DIF is useful for evaluating questionnaire translations.

Introduction

During the last 10 to 15 years, the methodology of medical outcomes research has become more rigorous, and interest in multinational studies and cross-national comparisons has grown. To assess patient-reported outcomes, questionnaires that are applicable across countries and languages are needed. Several research groups have undertaken the development of such questionnaires while maintaining critical scrutiny of the comparability of wording and concepts in different languages and cultures. Examples of such efforts are the translations of the SF-36 1, 2, 3, the NHP 4, 5, 6, the EuroQOL 7, 8, the EORTC QLQ-C30 9, 10, 11, and the WHOQOL 12, 13.

Traditionally, researchers have tried to achieve cross-language comparability through elaborate translation procedures, including use of several parallel translations, rating of translation quality, back translations, review of translations by focus groups, and coordinated multinational questionnaire development 14, 15. Although these procedures are well suited for translation improvement, they give insufficient evidence for translation evaluation, that is, a critical assessment of the equivalence of questionnaires across languages [16]. Researchers have used a number of statistical methods for translation evaluation; one statistical approach is the analysis of Differential Item Functioning (DIF, also called item bias) 17, 18. DIF methods build on the requirement that items should function in the same way, whatever subgroups are investigated: males versus females, old versus young, respondents using one language versus respondents using another language.

This requirement can be labeled homogeneous item functioning and can be defined in the following way for a specific item: All persons at a given level of the attribute measured (or at a given scale score) should have an equal probability of scoring positively on the item regardless of their group membership as to sex, age, or other exogenous variables (modified from [19]). We talk about DIF when this requirement does not hold.
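
The requirement can also be stated compactly as a conditional independence. The notation below is introduced here only for illustration and is not taken from [19]: X_i is the response to item i, S is the level of the attribute (or the scale score), and G is the group variable (e.g., sex, age group, or language).

```latex
P(X_i = x \mid S = s,\; G = g) \;=\; P(X_i = x \mid S = s)
\qquad \text{for all } x,\ s,\ g .
```

Under this reading, DIF for item i means that the equality fails for at least one combination of x, s, and g.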

A simple example can illustrate the idea of DIF. Suppose that:

  1. We compare physical functioning in the United States and in Denmark using a ten-item scale in a questionnaire that we have translated from English into Danish.

  2. The real level of physical functioning in the two populations is the same.

  3. The questionnaire shows no difference between the United States and Denmark on nine items, but a strong difference on the tenth item (i.e., the item shows DIF).

We may encounter such a result if we have translated the item wrongly or if the item in part measures concepts that are specific to one country (or culture). An example of a culture-specific question would be a question about participation in sports like American football or baseball. Such a question will measure physical functioning but also national habits in choosing sports, since few Danes play American football or baseball. Use of a literal translation of this question will therefore sacrifice the objectivity of comparisons of physical functioning in Denmark and the United States. We might say that the sports question is biased since it in part measures concepts that are irrelevant for assessing physical functioning in Denmark. While such extreme cases of bias or DIF often will be recognized during the translation process, more subtle problems may not be noticed.

This paper will describe contingency-table methods for testing Differential Item Functioning across languages and investigate Differential Item Functioning between the U.S. version and the Danish version of the SF-36. The paper thus supplements previous research on the comparability of the U.S. and Danish versions of the SF-36 16, 20, 21. In short, we translated the SF-36 into Danish using the standard methodology of the IQOLA project 1, 2, 16. Initial studies have shown that the Danish version has adequate internal consistency, discriminant validity, and a factor structure that closely resembles the U.S. version 21, 22, 23. Analyses by the Rasch model of the physical function subscale have shown that the item difficulties of the Danish and the U.S. versions are generally congruent [20]. However, responses to questions about walking indicate that this is more difficult for U.S. respondents in the sense that Americans with a given level of physical functioning tend to indicate more limitations in walking than Danes with the same level of physical functioning. Bathing and dressing, on the other hand, are more difficult for Danish respondents [20].

In our hypothetical example we assumed that the Danish and the American populations have the same level of physical functioning. Based on this assumption it is easy to identify biased items. However, the Danish and the American populations may have different levels of physical functioning, and results from simple two-way analyses (item score by country) therefore cannot be taken as an indication of DIF. The nine items that did not show any difference may be bad items that are unable to distinguish between different levels of functioning, while the tenth item may be a very good item that is sensitive to small differences in physical functioning. Techniques for investigating DIF should therefore correct for real differences in the attribute being measured, thus comparing item responses for respondents with equal health status.

Item-response models achieve this goal by estimating one or more item parameters that are independent of the health status of the population. Item-response models assume that the response on a particular item is caused by two types of parameters: a person parameter reflecting the health of the person in this particular domain, and item parameters reflecting, for example, the difficulty of the item and its ability to discriminate between persons with good and bad health. The basic item-response models assume that the scale is unidimensional, that is, that all items in the scale measure exactly the same concept (the person parameter). The relation between the person parameter and the probability of answering an item in a certain way can be displayed as an Item Characteristic Curve. For a dichotomous item (e.g., “yes/no”), we define item difficulty as the value of the person parameter where the probabilities of “yes” and “no” are equal. Item discrimination is the slope of the curve at this point. Different items in a scale may differ in difficulty and discrimination, but a specific item should have the same difficulty and discrimination irrespective of the race, gender, language, or nationality of the population under study [18]. In a cross-language study, large differences in item parameters between languages are taken as an indication of DIF.
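
The Item Characteristic Curve can be sketched in a few lines of Python. The example below assumes a two-parameter logistic model, which is only one of several item-response models and is not necessarily the one used in the cited studies; the function names and all parameter values are hypothetical. In this parameterization the slope of the curve at the difficulty point equals one quarter of the discrimination parameter.

```python
import math

def icc_2pl(theta, difficulty, discrimination):
    """Probability of answering "yes" for a person with parameter theta
    under a two-parameter logistic model (an assumed model form)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def slope(theta, difficulty, discrimination, h=1e-5):
    """Central-difference approximation of the curve's slope at theta."""
    upper = icc_2pl(theta + h, difficulty, discrimination)
    lower = icc_2pl(theta - h, difficulty, discrimination)
    return (upper - lower) / (2.0 * h)

# Hypothetical (difficulty, discrimination) pairs for the same item
# in two language versions; the values are invented for illustration.
versions = {"version A": (0.0, 1.5), "version B": (0.8, 1.5)}

for name, (b, a) in versions.items():
    # At theta == difficulty, "yes" and "no" are equally likely.
    print(name, "P(yes) at difficulty:", icc_2pl(b, b, a),
          "slope there:", round(slope(b, b, a), 4))

# Two respondents with the same person parameter (theta = 0.4) have
# different probabilities of "yes" when the item difficulty differs
# between versions, which is what DIF looks like in this framework.
gap = icc_2pl(0.4, 0.0, 1.5) - icc_2pl(0.4, 0.8, 1.5)
print("difference in P(yes) at theta = 0.4:", round(gap, 3))
```

Equal curves in the two versions would give a difference of zero at every value of theta.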

Item-response models have frequently been used for detection of DIF across languages 24, 25, 26, 27, 28, and these models can also be used to correct for DIF (see [28]). However, the efficiency of the methods depends on the fit of the model. If the particular item-response model does not fit the data for each country separately, the model is not optimal for investigating DIF. Furthermore, relatively large sample sizes are required to achieve reliable estimates. This problem is magnified when the number of item parameters increases, as is the case with two- and three-parameter models or models for items with more than two response categories (polytomous items). Thus, a two-parameter item-response model has been shown to give poor parameter estimates with a sample size of 300 in each group (country) [30], and 1,000 participants from each group (country) are recommended for the three-parameter model for dichotomous items [31]. Often, cross-national studies do not achieve such numbers 24, 25, 26.

In recent years, the contingency-table approach to DIF analysis has become popular 18, 32. With this approach the relation between item response and country is analyzed controlling for the manifest scale score. Thus, cross-national DIF is investigated in three-way tables: item score by country by scale score. An item shows DIF if we find a significant association between item score and country, controlling for scale score. Conversely, homogeneous item functioning implies that for a given scale score, the item scores are independent of nationality. Advantages of the contingency-table method are smaller sample size requirements and that no assumption of a unidimensional latent variable is necessary [33]. The requirement of independence between item scores and nationality (or other variables) for a given scale score can be motivated by a requirement of statistical sufficiency of the scale [33].
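
The three-way layout can be illustrated with a short Python sketch. It is a minimal example under our own assumptions: each respondent is represented as a tuple (item score, country, scale score), the helper names are hypothetical, and the records are invented rather than taken from the survey data.

```python
from collections import Counter

def three_way_table(records):
    """Count respondents in each (scale_score, country, item_score) cell.

    records: iterable of (item_score, country, scale_score) tuples.
    """
    return Counter((scale, country, item) for item, country, scale in records)

def stratum(table, scale_score, countries, item_scores):
    """Country-by-item-score table for respondents with one scale score."""
    return [[table[(scale_score, country, item)] for item in item_scores]
            for country in countries]

# Invented records: (item score, country, scale score).
records = [
    (1, "US", 10), (0, "US", 10), (1, "DK", 10), (1, "DK", 10),
    (0, "US", 11), (1, "DK", 11),
]

table = three_way_table(records)
for score in sorted({scale for _, _, scale in records}):
    # Rows are countries, columns are item scores 0 and 1.
    print("scale score", score, stratum(table, score, ["US", "DK"], [0, 1]))
```

Each printed two-way table compares countries among respondents with the same scale score, which is the comparison that a DIF test then summarizes across all score groups.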

Several contingency-table tests have been proposed to test for DIF (see 34, 35). A chi-square test can examine DIF in a single score group, or across score groups. However, the chi-square test is nonspecific and may therefore not be optimal for investigating specific kinds of DIF. Further, the chi-square test is very sensitive to the problems of large, sparse tables [36], and one often has to combine scores to achieve score groups of sufficient size, an artifice that may invalidate the test if there are large variations of the scores within the score groups (“fat matching”). The Mantel-Haenszel test has become popular for examining DIF, since it overcomes these difficulties [35]. The Mantel-Haenszel method tests for a specific kind of DIF: the item score is uniformly higher (or rather, the odds for a high answer are greater) in one country compared to another country for all levels of the scale score (uniform DIF). The Mantel-Haenszel test is a one-degree-of-freedom test and is quite robust toward small cell frequencies, so it is often unnecessary to group scores. Also, the Mantel-Haenszel estimator (the Mantel-Haenszel odds ratio) can be used to judge the size of DIF. For educational research, the Educational Testing Service has suggested standards for interpretation of DIF, building on the odds ratio (see Table 1).
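
As an illustration of the Mantel-Haenszel idea, the following Python sketch computes the common odds ratio and the one-degree-of-freedom chi-square statistic from a set of 2 x 2 tables, one per scale score. It uses the standard textbook formulas with a continuity correction, which is an assumption on our part about the variant of the test; the function name and the counts are hypothetical.

```python
import math

def mantel_haenszel(strata):
    """Mantel-Haenszel odds ratio and continuity-corrected chi-square test.

    strata: list of (a, b, c, d) counts, one tuple per scale score, where
      a = country 1, high answer    b = country 1, low answer
      c = country 2, high answer    d = country 2, low answer
    Returns (odds_ratio, chi_square, p_value).
    """
    num = den = observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # such a stratum carries no information
        num += a * d / n
        den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = num / den if den > 0 else float("inf")
    if variance == 0:
        chi_square = 0.0
    else:
        chi_square = max(0.0, abs(observed - expected) - 0.5) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return odds_ratio, chi_square, p_value

# Invented counts for three scale-score groups.
example = [(10, 20, 5, 25), (20, 20, 12, 28), (30, 10, 22, 18)]
odds_ratio, chi_square, p_value = mantel_haenszel(example)
print("MH odds ratio:", round(odds_ratio, 2))
print("chi-square (1 df):", round(chi_square, 2), "p:", round(p_value, 4))
```

An odds ratio near 1 indicates no uniform DIF; because the statistic pools evidence over all score groups into a single degree of freedom, individual score groups may be small.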

The Mantel-Haenszel approach can only be used when items are dichotomous (e.g., “right”/“wrong” or “yes”/“no”) and when only two countries are compared. In medical, sociological, and psychological research, items are often polytomous (i.e., have more than two response categories). For rank-scaled polytomous items (like Likert-type items), Kreiner has proposed the use of the partial version of the gamma coefficient as a test statistic [36]. The gamma coefficient ranges from −1 to +1, and gamma may be interpreted as a modification of Kendall’s nonparametric correlation coefficient tau [37]. Moreover, the gamma coefficient can in certain situations be transformed to the odds ratio [38] (Table 1 shows examples of equivalent values of the odds ratio and the gamma). In our opinion, use of the gamma statistic is a simple, flexible, and efficient way to extend the Mantel-Haenszel idea to polytomous rank-scaled items.
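
A partial gamma coefficient can be sketched as follows. The sketch assumes the usual definition, gamma = (C − D) / (C + D), where C and D are the numbers of concordant and discordant pairs, and it forms the partial coefficient by summing C and D over scale-score groups so that only respondents with the same scale score are compared. Whether this matches every detail of the procedure in [36] is an assumption; the function names and counts are hypothetical. For a single 2 x 2 table, gamma reduces to (OR − 1) / (OR + 1), where OR is the odds ratio, which is one situation in which the two measures can be converted into each other.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered two-way table.

    table: list of rows (e.g., countries), each a list of counts over
    ordered item-response categories (lowest category first).
    """
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for later_row in table[i + 1:]:
                for k, other in enumerate(later_row):
                    if k > j:
                        concordant += count * other
                    elif k < j:
                        discordant += count * other
    return concordant, discordant

def partial_gamma(strata):
    """Gamma pooled over strata (one table per scale-score group)."""
    total_c = total_d = 0
    for table in strata:
        c, d = concordant_discordant(table)
        total_c += c
        total_d += d
    if total_c + total_d == 0:
        return float("nan")  # no informative pairs
    return (total_c - total_d) / (total_c + total_d)

# Invented counts: rows are two countries, columns are three ordered
# response categories; one table per scale-score group.
strata = [
    [[12, 8, 3], [6, 9, 8]],
    [[7, 10, 6], [3, 9, 11]],
]
print("partial gamma:", round(partial_gamma(strata), 3))

# Check of the 2 x 2 special case against the odds ratio.
two_by_two = [[10, 20], [5, 25]]
odds_ratio = (10 * 25) / (20 * 5)
print(partial_gamma([two_by_two]), (odds_ratio - 1) / (odds_ratio + 1))
```

A positive value here means that, at equal scale scores, respondents in the second row tend to choose higher response categories than respondents in the first row.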

Although we have used cross-national DIF as our example throughout the paper, DIF can also occur in relation to other variables such as age, sex, and ethnicity. When investigating cross-national DIF, one should therefore use populations that are comparable with respect to such basic variables. Otherwise, interpreting the cause of DIF is difficult.

Data Collection

Data for the Danish population were collected from February to August 1994 as part of a population health survey. A representative sample of 5,983 noninstitutionalized Danish adults (over the age of 15) was drawn from the Civil Registration System, which registers addresses of all Danes. The survey included a 30-minute structured personal interview regarding social and demographic data, health behavior, health status, and diseases. After the interview, respondents were given the SF-36

Results

There are consistent differences between the U.S. and the Danish samples in basic demographic variables. The Danish samples have higher proportions of men, people below 30 years, and people never married. A smaller proportion of the Danish samples are above 64 years of age, and fewer are married, divorced/separated, or widowed.

Weighting the Danish data to the age and gender distribution in the U.S. samples, the two Danish samples have a higher (5–11 point) mean score than the U.S. samples

Discussion

Using a contingency-table methodology, we found significant indications of cross-language DIF in 12 out of 35 items in the Danish translation of the SF-36. These results were confirmed when we adjusted for ethnicity, controlled for differences in age and gender, and used another approach for dealing with missing responses. Judged by the size of the gamma coefficients, five items had moderate to severe DIF, six items had slight to moderate DIF, and one item had negligible DIF. Cross-language DIF

Acknowledgements

The International Quality of Life Assessment (IQOLA) Project is sponsored by Glaxo Wellcome, Inc. (Research Triangle Park, North Carolina), Schering-Plough Corporation (Kenilworth, New Jersey), and other companies. In addition, this study has been supported by grants from Glaxo Research Institute, from the Danish Medical Research Council, and from the Danish Health Insurance Fund. We would like to thank Barbara Gandek and an anonymous reviewer for comments on a previous version of this paper.

References (46)

  • B.B. Ellis. Assessing intelligence cross-nationally: A case for differential item functioning detection. Intelligence (1990)
  • Ware JE, Gandek B, The IQOLA Project Group. The SF-36 Health Survey: Development and use in mental health and the IQOLA...
  • N.K. Aaronson et al. International Quality of Life Assessment (IQOLA) Project. Quality Life Res (1992)
  • H. Thorsen et al. The Danish version of the Nottingham Health Profile: Its adaptation and reliability. Scandinavian Journal of Primary Health Care (1993)
  • EuroQol Group. EuroQol - a new facility for the measurement of health-related quality of life. Health Policy (1990)
  • J. Brazier et al. Testing the validity of the Euroqol and comparing it with the SF-36 health survey questionnaire. Quality Life Res (1993)
  • N.K. Aaronson et al. The EORTC core quality-of-life questionnaire: Interim results of an international field study
  • N.K. Aaronson et al. The European Organization for Research and Treatment of Cancer QLQ-C30: A quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst (1993)
  • Sprangers MAG, Cull A, Bjordal K, Groenvold M, Aaronson NK, for the EORTC Quality of Life Study Group. The European...
  • Anonymous. Study protocol for the World Health Organization project to develop a Quality of Life assessment instrument (WHOQOL). Quality Life Res (1993)
  • Anonymous. The World Health Organization Quality of Life Assessment (WHOQOL): Position Paper from the World Health Organization. Soc Sci Med (1995)
  • S.M. Hunt et al. Cross-cultural comparability of quality of life measures. Br J Med Economics (1992)
  • Holland PW, Wainer H. Holland PW, Wainer H, Eds. Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum...