Depressive Response Sets due to gender and culture-based Differential Item Functioning

doi:10.1016/S0191-8869(01)00203-3

Personality and Individual Differences

Volume 33, Issue 6, 19 October 2002, Pages 937-954


Abstract

Two studies tested a “strong” version of Nolen-Hoeksema's [Nolen-Hoeksema, S. (1987). Sex differences in unipolar depression: evidence and theory. Psychological Bulletin, 101, 259–282.] hypothesis of depressive response sets in samples of Australian respondents (Study I, n=1111) and US respondents (Study II, n=300), using a Rasch version of Thalbourne's Manic-Depressiveness Scale (MDS) whose contents are consistent with atypical depression (i.e. depressive episodes with hypomanic symptoms). As predicted, tests for differential item functioning in both studies revealed that women are more likely to worry about “being poor” than equally depressive men (P<0.05), thus ruling out the alternative hypothesis that depressive response sets are simply a byproduct of more frequent or stronger depression in women. Australian women and men, and US women used the MDS items in a similar fashion, whereas equally depressed US men seriously underreported their symptoms. Yet, using top-down purification to derive an unbiased baseline set of items, a “split” 12-item Rasch measure (R-MDS) could be developed that is not affected by differential test functioning due to gender or cultural differences. Comparison of the R-MDS measure and the original MDS scores revealed that while women are more depressive than men by 0.4 SD, the absence of gender bias in the R-MDS decreased this effect by about 20% (0.08 SD). Moreover, gender bias interacted with culture, suggesting that comparisons of depression levels of diverse groups within the same culture may be biased as well. Raw score to R-MDS conversion tables are included.

Introduction

The available literature agrees that women are almost twice as likely as men to report protracted sadness, apathy, low self-esteem and other symptoms that are indicative of depression. Yet, there continues to be less than perfect agreement on how men and women differ with respect to their expression of depressive symptoms (Nolen-Hoeksema, 1987; Nolen-Hoeksema, 1994; Santor and Ramsay, 1998). For instance, based on factor analytic techniques Williamson (1998) identified social withdrawal, indecisiveness, and irritability as characteristic of depressed men. Wilhelm and Hadzi-Pavlovic (1997) suggest that women are more likely to report anxiety, but this hypothesis disagrees with Breslau, Schultz, and Peterson's (1996) finding that anxiety disorders predict the occurrence of depressive disorders for men and women alike. Given this conflicting picture, it is understandable that “negative symptoms” (i.e. symptoms of depression) are often difficult to identify in psychiatric practice (Greden and Tandon, 1991; Muller and Davids, 1999). This problem is exacerbated by the fact that depressive symptoms may vary with the nature of patients' psychiatric illnesses (Gibbons, Clark, Cavanaugh, & Davis, 1985).

A coherent framework to address the expression of depressive symptoms is provided by Nolen-Hoeksema's (1987) “response set” explanation according to which women tend to show passive ruminating responses, whereas men actively use distraction to cut off depression before it ramifies. The response set explanation has received consistent empirical support in descriptive research (e.g. Butler & Nolen-Hoeksema, 1994) as well as in laboratory research (e.g. Lyubomirsky & Nolen-Hoeksema, 1993). The detrimental effects of negative rumination are further demonstrated in an experiment showing that the level of depression in mildly-to-moderately depressed subjects increased when they were induced to ruminate, while depression decreased when distractive behaviors were induced (Nolen-Hoeksema & Morrow, 1994). Despite such findings, the possibility remains that in daily life similar levels of depression naturally produce similar negative ruminative behaviors in men and women. Thus, the greater rumination exhibited by women might be a direct consequence of their more frequent depression, thereby implying that gender differences merely serve to moderate the expression of depressive response sets.

This paper tests the “strong” response set hypothesis according to which women express depression differently than equally depressed men. Although Nolen-Hoeksema's theorizing clearly implies the existence of qualitative gender differences, the response set hypothesis has not been tested in this form. We attribute this to the fact that the stronger version poses technical problems that are difficult to solve within the framework of classical test theory (Thissen, Steinberg, & Gerrard, 1986). In particular, a rigorous test of the strong response set hypothesis requires that depression first be assessed in a gender-neutral fashion. No bias tests were performed in previous research on depressive symptoms (Butler and Nolen-Hoeksema, 1994; Christensen et al., 1999; Lyubomirsky and Nolen-Hoeksema, 1993; Nolen-Hoeksema, 1994; Nolen-Hoeksema and Morrow, 1994), and it is therefore not clear whether the requirement of gender-neutral measurement was met. The following describes how the strong version of Nolen-Hoeksema's basic hypothesis can be tested by removing gender bias within a Rasch (1960) scaling approach.

Most applications of Rasch scaling, as well as related Item Response Theory methods, in clinical and personality testing (e.g. Birenbaum, 1986; Carter and Wilkinson, 1984; Cooke and Michie, 1997; Gibbons et al., 1985; Lange and Houran, 1999) are primarily intended to obtain unidimensional scales with interval level measurement properties. Unfortunately, as pointed out by Thissen et al. (1986) and Santor and Ramsay (1998), one of the most useful properties of such methods, namely, their ability to detect and quantify a particular kind of item bias called Differential Item Functioning (DIF), is often ignored. DIF violates the assumption of local independence which requires that items' measurement properties should not be affected by external variables such as gender or culture. As such, DIF is evidenced by the fact that equally depressed men and women respond systematically differently to the same item. Although formal methods to determine the presence of differential item functioning are discussed in a later section, we point out that DIF is also often accompanied by psychometric anomalies (Tanzer, 1996) and that it may produce “phantom” (i.e. actually non-existing) factors in factor analysis (Lange, Irwin, & Houran, 2000).
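What DIF means can be made concrete with a minimal sketch under the dichotomous Rasch model, in which the probability of endorsing a True/False item depends only on the difference between a respondent's trait level and the item's location. The function name `rasch_probability` and the item locations below are illustrative assumptions, not values estimated in this paper.

```python
import math

def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical item locations (in logits) for the same item in two groups.
# A DIF-free item would have an identical location in both groups.
difficulty_women = -0.5
difficulty_men = 0.5

theta = 0.0  # identical trait level for both respondents

p_women = rasch_probability(theta, difficulty_women)
p_men = rasch_probability(theta, difficulty_men)
dif_contrast = difficulty_men - difficulty_women  # group difference in logits

print(f"P(endorse | woman) = {p_women:.3f}")
print(f"P(endorse | man)   = {p_men:.3f}")
print(f"DIF contrast       = {dif_contrast:.2f} logits")
```

In this toy case the two respondents are equally depressed by construction, yet their endorsement probabilities differ, which is the pattern a DIF test is designed to detect.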

Because Nolen-Hoeksema's (1987) response set hypothesis entails that women are more ruminative than equally depressed men, it follows that measures of depression and rumination should show DIF (cf. Santor & Ramsay, 1998). There is indirect evidence, however, that gender is not the only source of DIF in depression-related measures and that cultural factors may play a role as well. For instance, Thompson, Kaslow, Weiss, and Nolen-Hoeksema (1998) administered a revised version of the Children's Attributional Style Questionnaire (CASQ) to large samples of African American and Caucasian American girls and boys. Suggestive of cultural DIF, the psychometric properties of the CASQ varied by race since the Caucasian data showed higher internal consistency. To address issues related to cultural DIF we compare samples of North American and Australian respondents. Differential item and test functioning in depression due to age have recently been reported as well (Christensen et al., 1999). We judged, however, that our samples contained too few older respondents to address this factor in a satisfactory manner.

The strong response set hypothesis will be tested in two studies. To address the possible effects of cultural DIF, Study I uses Australian respondents, while Study II is based on North American respondents. Both studies combine the data sets of projects in which Thalbourne's Manic-Depressiveness Scale (MDS) was administered for purposes not related to the present research. The MDS addresses history of manic symptoms as well as history of depression (Thalbourne, Delin, & Bassett, 1994). While relatively new, the MDS has profitably been used in clinical research (for a review see Thalbourne, Keough, & Crawley, 1999) and two preliminary validation studies have been reported (Lester, 1999; Thalbourne and Bassett, 1998). The MDS consists of 18 True/False type questions which were largely derived from the clinical criteria listed in the DSM-III and DSM-III-R. Table 1 lists twelve of the MDS questions in abbreviated form. The remaining questions are listed in the Method section to Study I.

Note that Item 1 (“worried about being poor”) corresponds most closely to Nolen-Hoeksema's notion of negative rumination, and we predict therefore that women are more likely to endorse this item than equally manic-depressive men. A test of this hypothesis requires that the remaining MDS items contain no gender DIF as such DIF may combine to produce bias at the test level, thereby making it impossible to identify equally manic-depressive respondents. Bias at the test level is customarily referred to as Differential Test Functioning (DTF). When tests consist of many nearly equivalent items (e.g. as in educational testing), DTF can be removed simply by discarding the items showing DIF. However, this approach is not feasible for shorter instruments like the MDS as too few items would remain. For this reason we use Rasch (1960) scaling as this allows DTF to be neutralized by equating differentially functioning items across genders or cultures based on an unbiased, i.e. DIF-free, baseline set of items. Where the context allows, we use the terms DIF and DTF as synonymous with bias.
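One common way to realize this kind of equating is to “split” a differentially functioning item into group-specific copies, while the remaining DIF-free items serve as the shared baseline. The sketch below only illustrates that data layout. The helper name `split_item`, the group labels, and the toy responses are assumptions for illustration, and this is not the authors' actual procedure or software.

```python
from typing import List, Optional

Response = Optional[int]  # 1 = True, 0 = False, None = treated as not administered

def split_item(
    responses: List[List[Response]],
    groups: List[str],
    item: int,
    group_a: str,
    group_b: str,
) -> List[List[Response]]:
    """Replace column `item` by two group-specific columns appended at the end.

    Each respondent keeps a response only in the column for their own group;
    the other group's column is marked as missing (None).
    """
    result = []
    for row, group in zip(responses, groups):
        baseline = [r for j, r in enumerate(row) if j != item]
        col_a = row[item] if group == group_a else None
        col_b = row[item] if group == group_b else None
        result.append(baseline + [col_a, col_b])
    return result

# Toy data: three respondents, three items; item 0 is assumed to show gender DIF.
responses = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
groups = ["F", "M", "F"]

split = split_item(responses, groups, item=0, group_a="F", group_b="M")
for row in split:
    print(row)
```

The two group-specific columns can then be calibrated as separate items, with the untouched columns anchoring both groups to a common scale.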

Section snippets

Participants

We combined the results of eight different research projects conducted at Adelaide University, Australia, into a single data set consisting of 742 women and 369 men. The average age of these 1111 individuals was 27.7 years (SD=8.51, Median=22, range=17–74 years). Most respondents were undergraduate students who volunteered their participation. We estimate that at most 1% of the respondents were non-Australian.

Materials

The Manic-Depressiveness Scale (MDS) consists of 18 True/False items, twelve of which

Study II

To replicate our basic findings, while simultaneously addressing the possible effects of cultural differences on the expression of depressive symptoms, we also analyzed a data set of responses obtained from volunteer student participants at Stockton College, USA. Because the data had originally been gathered for different purposes, it was not possible to screen out non-US respondents and the sample probably contained a small number of foreign students as well. Again, the basic hypothesis is

Conclusions

Consistent with Lester's (1999) findings, six of the manic items of the MDS did not survive the top-down item purification process due to severe item misfit and low internal consistency. In addition, several of the twelve remaining items showed statistically significant differential item functioning. However, an unbiased baseline set of items could be identified and this allowed the “splitting” of the remaining items to arrive at a Rasch scale that is neutral with respect to gender and culture. This

Acknowledgements

We would like to thank Ben Wright and Anne Sustik, as well as the attendees of the Midwestern Objective Measurement Seminar, held at the MESA Institute, University of Chicago, June 4, 1999, for their valuable suggestions concerning this research.

References (51)

F. Benazzi. Atypical depression with hypomanic symptoms. Journal of Affective Disorders (2001)
D. Faries et al. The responsiveness of the Hamilton Depression Rating Scale. Journal of Psychiatric Research (2000)
R.D. Gibbons et al. Application of modern psychometric theory in psychiatric research. Journal of Psychiatric Research (1985)
R. Lange et al. Scaling MacDonald's AT-20 using item-response theory. Personality and Individual Differences (1999)
R. Lange et al. Top-down purification of Tobacyk's Revised Paranormal Belief Scale. Personality and Individual Differences (2000)
J.M. Linacre et al. The structure and stability of the functional independence measure. Archives of Physical Medicine and Rehabilitation (1994)
Diagnostic and statistical manual of mental disorders (1994)
M. Birenbaum. Effect of dissimulation motives and anxiety on response pattern appropriateness. Applied Psychological Measurement (1986)
L. Bond. Comments on the O'Neil and McPeek paper
N. Breslau et al. Sex differences in depression: a role for preexisting anxiety. Psychiatry Research (1996)
L.D. Butler et al. Gender differences in responses to depressed mood in a college sample. Sex Roles (1994)
J.E. Carter et al. A latent trait analysis of the MMPI. Multivariate Behavior Research Monographs (1984)
H. Christensen et al. Age differences in depression and anxiety symptoms: a structural equation modelling analysis of data from a general population sample. Psychological Medicine (1999)
B.E. Clauser et al. Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice (1998)
D.J. Cooke et al. An item response theory analysis of the Hare Psychopathy Checklist-Revised. Psychological Assessment (1997)
S.E. Embretson. The new rules of measurement. Psychological Assessment (1996)
J.F. Greden et al. Negative schizophrenic symptoms: pathophysiology and clinical implications (1991)
J. Hattie. Methodology review: assessing unidimensionality of tests and items. Applied Psychological Measurement (1985)
P.C. Kendall et al. Issues and recommendations regarding the use of the Beck Depression Inventory. Cognitive Therapy and Research (1987)
D. Lester. Comment on “Manic-depressiveness and its correlates”. Psychological Reports (1999)
J.M. Linacre et al. A user's guide to Winsteps, Bigsteps, Ministep Rasch-Model computer programs (1998)
S. Lyubomirsky et al. Self-perpetuating properties of dysphoric rumination. Journal of Personality and Social Psychology (1993)
R.J. Mislevy et al. BILOG 3: item analysis and test scoring with binary logistic models (1990)
M.J. Muller et al. Relationship of psychiatric experience and interrater reliability in assessment of negative symptoms. Journal of Nervous and Mental Disease (1999)
R. Nandakumar et al. Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics (1993)
