Original Article
An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary
Introduction
There is growing acknowledgment of the need for systematic reviews of studies evaluating the accuracy of diagnostic and screening tests. Recent years have seen increasing numbers of such reviews being published: the Database of Abstracts of Reviews of Effects (DARE) maintained by the Centre for Reviews and Dissemination at the University of York [1] includes 27 diagnostic test accuracy reviews for 1998, increasing to 49 in 2003. The Cochrane Collaboration is planning to include reviews of test accuracy studies in the Cochrane Library. There has also been an increase in methodological work in this area [2], [3], [4], [5], [6], [7], [8].
Considerable uncertainty remains, however, over the best way to formally synthesize test accuracy studies. Statistical methodology is much more varied than in meta-analysis of therapeutic interventions. There are a number of different measures of diagnostic test accuracy, and a variety of ways of meta-analyzing them [7], [9], [10], [11]. These meta-analytic methods may produce either summary estimates of test characteristics (e.g., sensitivity, specificity, positive and negative likelihood ratios) or a summary receiver operating characteristic (SROC) curve.
A survey of reviews of diagnostic accuracy that were included in the Centre for Reviews and Dissemination's DARE up to 2002 [9] found that of 133 reviews in which meta-analysis was performed, 52% computed one or more summary measures of accuracy, 18% conducted only SROC analyses, and 30% did both. Of the 109 reviews that computed summary measures of accuracy, 89% used sensitivity and/or specificity, 24% used likelihood ratios, and 10% used predictive values. There are also several alternative ways of computing both SROC curves and summary measures of accuracy, differing in the weighting given to each study or whether a transformation is used [9]. There is a clear need for consensus on the most appropriate methods for meta-analysis of test accuracy studies.
Our aim in this paper was to compare two statistically rigorous methods involving hierarchical models, which are not widely used at present, with simpler and more commonly used methods, to assess whether the simpler methods are adequate in practice. We start by presenting brief results of a survey of the statistical methods used in recently published systematic reviews of test accuracy studies. We then review these methods and evaluate their results when applied to data from eight systematic reviews. We conclude with recommendations for best practice in future reviews.
Brief review of use of methods in the literature
Diagnostic systematic reviews published in 2003, the most recent complete year available, were identified from the DARE [1] (http://www.york.ac.uk/inst/crd/crddatabases.htm#DARE) maintained by the Centre for Reviews and Dissemination at the University of York. We extracted data on the summary accuracy measures presented and the meta-analysis methods used.
Of 49 systematic reviews of diagnostic accuracy published in 2003 and identified from the DARE database, 34 (69%) included a meta-analysis.
Review of methods for meta-analysis of diagnostic accuracy studies
Several features distinguish the results of diagnostic accuracy studies from those of studies of therapeutic interventions and necessitate different methods of meta-analysis. First, diagnostic accuracy is usually quantified by two measures, sensitivity and specificity (or positive and negative likelihood ratios), and cannot be reduced to a single summary measure such as a diagnostic odds ratio without losing important information [10]. Second, declaring a test result to be positive involves choosing a positivity threshold, and varying this threshold trades sensitivity off against specificity across studies.
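As a concrete illustration of the paired accuracy measures discussed above, the following sketch (added editorially, not part of the original article; the 2×2 counts are hypothetical) computes sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio from a single study's 2×2 table:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Standard test-accuracy measures from a 2x2 table.

    tp: diseased, test positive;  fn: diseased, test negative;
    fp: non-diseased, test positive;  tn: non-diseased, test negative.
    """
    sens = tp / (tp + fn)        # sensitivity (true positive rate)
    spec = tn / (tn + fp)        # specificity (true negative rate)
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    dor = lr_pos / lr_neg        # diagnostic odds ratio = (tp*tn)/(fp*fn)
    return sens, spec, lr_pos, lr_neg, dor

# Hypothetical study: 100 diseased, 100 non-diseased subjects
sens, spec, lr_pos, lr_neg, dor = accuracy_measures(tp=90, fp=20, fn=10, tn=80)
```

Note that the diagnostic odds ratio collapses the (sensitivity, specificity) pair into one number, which is precisely the information loss referred to above.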
Empirical comparison of methods
We compared the results of meta-analysis based on the different proposed methods, using data from a sample of eight systematic reviews (Table 3) [23], [24], [25], [26], [27], [28], [29], [30]. These were purposively sampled from reviews in which one or more of the authors have been involved, to illustrate a variety of tests with different ranges of, and variability in, sensitivity and specificity. For each data set, we plotted the individual study results in ROC space. We applied each of the methods described above to these data sets and compared the resulting summary points and SROC curves.
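The commonly used Littenberg–Moses SROC approach compared in this study can be sketched as follows (an editorial illustration of the simple unweighted variant, not the authors' code; the study counts are invented). It regresses the log diagnostic odds ratio D on a "threshold" axis S, both built from logit-transformed rates, then back-transforms to a curve in ROC space:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def littenberg_moses(studies):
    """Unweighted Littenberg-Moses SROC regression.

    studies: list of (tp, fp, fn, tn) tuples; 0.5 is added to every
    cell as a continuity correction, as is conventional.
    Returns (a, b), the intercept and slope of D = a + b*S.
    """
    D, S = [], []
    for tp, fp, fn, tn in studies:
        tp, fp, fn, tn = tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5
        sens = tp / (tp + fn)
        fpr = fp / (fp + tn)
        D.append(logit(sens) - logit(fpr))  # log diagnostic odds ratio
        S.append(logit(sens) + logit(fpr))  # proxy for threshold
    n = len(D)
    sbar, dbar = sum(S) / n, sum(D) / n
    b = (sum((s - sbar) * (d - dbar) for s, d in zip(S, D))
         / sum((s - sbar) ** 2 for s in S))  # ordinary least squares slope
    a = dbar - b * sbar
    return a, b

def sroc_sens(a, b, fpr):
    """Sensitivity on the fitted SROC curve at a given false positive rate.

    Rearranging D = a + b*S gives logit(sens) = (a + (1+b)*logit(fpr)) / (1-b).
    """
    return 1 / (1 + math.exp(-(a + (1 + b) * logit(fpr)) / (1 - b)))

a, b = littenberg_moses([(90, 20, 10, 80), (80, 10, 20, 90)])
```

Treating D and S as fixed quantities in an ordinary regression ignores sampling error in both axes and between-study heterogeneity, which is the core criticism the hierarchical methods address.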
Discussion
Our empirical comparison of the results of different meta-analytic methods applied to eight datasets showed that the commonly used Littenberg–Moses method of generating SROC curves, and simple pooling of sensitivity and specificity, can give results that differ markedly from those derived using the statistically rigorous bivariate/HSROC method involving hierarchical models. Separate random-effects meta-analysis of logit-transformed sensitivity and specificity gave summary points, but not SROC curves.
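The separate random-effects analysis mentioned above can be sketched as a DerSimonian–Laird meta-analysis of logit-transformed sensitivities, with the same procedure applied independently to specificities (an editorial illustration with made-up counts, not the authors' implementation):

```python
import math

def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects pooling of study-level effects."""
    w = [1 / v for v in variances]               # inverse-variance weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                # moment estimate of heterogeneity
    w_re = [1 / (v + tau2) for v in variances]   # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return pooled, tau2

# Hypothetical studies as (true positives, false negatives) pairs
studies = [(90, 10), (70, 30), (85, 15)]
logits = [math.log(tp / fn) for tp, fn in studies]   # logit sensitivity
var_logits = [1 / tp + 1 / fn for tp, fn in studies] # approximate variance

pooled_logit, tau2 = dersimonian_laird(logits, var_logits)
pooled_sens = 1 / (1 + math.exp(-pooled_logit))      # back-transform
```

Because sensitivity and specificity are pooled independently here, this approach yields a summary point but ignores the correlation between the two measures, which is what the bivariate/HSROC model captures.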
Limitations of this study
The main limitation of the present study is that we applied the methods to a limited sample of eight data sets. However, our chosen data sets capture a range of features common in the literature: all exhibit substantial between-study variability, but they differ widely in the regions of ROC space in which the study estimates lie. There remains scope for future work comparing the methods both on a larger number of data sets, and on simulated data, where the true parameter values are known and the estimates can be checked against them.
Conclusions
We have reviewed methods for meta-analysis of diagnostic accuracy studies and compared their results when applied to the eight example data sets. The bivariate/HSROC method is the most statistically rigorous and can be used to give confidence and prediction regions and a summary ROC curve, in addition to the summary sensitivity and specificity. We believe this method should be adopted as the standard approach. The Littenberg–Moses method and separate random-effects meta-analysis did not match these results: the former can give markedly different estimates, and the latter yields summary points but no SROC curve.
Acknowledgments
This work was supported by the MRC Health Services Research Collaboration. Dr Bachmann's work (grants no. 3233B0-103182 and 3200B0-103183) was supported by the Swiss National Science Foundation.
References
- et al. Systematic reviews with individual patient data meta-analysis to evaluate diagnostic tests. Eur J Obstet Gynecol Reprod Biol (2003)
- et al. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol (1995)
- et al. Studies reporting ROC curves of diagnostic and prediction data can be incorporated into meta-analyses using corresponding odds ratios. J Clin Epidemiol (2007)
- et al. The conditional relative odds ratio provided less biased results for comparing diagnostic test accuracy in meta-analyses. J Clin Epidemiol (2004)
- et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol (2005)
- et al. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol (2006)
- Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol (2004)
- et al. Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol (2003)
- Database of abstracts of reviews of effects
- et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative. Ann Intern Med (2003)
- Systematic reviews in health care: systematic reviews of evaluations of diagnostic and screening tests. BMJ
- Conducting systematic reviews of diagnostic studies: didactic guidelines. BMC Med Res Methodol
- Evidence based diagnostics. BMJ
- Challenges in systematic reviews of diagnostic technologies. Ann Intern Med
- Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med
- A methodological review of how heterogeneity has been examined in systematic reviews of diagnostic test accuracy. Health Technol Assess
- Evaluations of diagnostic and screening tests
- Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics
- A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics
- Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making
- Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med