Original article
Receiver Operating Characteristic Analysis: A Tool for the Quantitative Evaluation of Observer Performance and Imaging Systems

https://doi.org/10.1016/j.jacr.2006.02.021

Receiver operating characteristic (ROC) analysis provides the most comprehensive description of diagnostic accuracy available to date, because it estimates and reports all of the combinations of sensitivity and specificity that a diagnostic test is able to provide. After sketching the 6 levels at which diagnostic efficacy can be assessed, this paper explains the conceptual foundations of conventional ROC analysis, describes a variety of indices that can be used to summarize ROC curves, and describes several forms of generalized ROC analysis that address situations in which more than 2 decision alternatives are available. Key issues that arise in ROC curve fitting and statistical testing are addressed in an intuitive way to provide a basis for judging the validity of ROC studies reported in the literature. A list of sources for free ROC software is provided. Receiver operating characteristic methodology has reached a level of maturity at which it can be recommended broadly as the approach of choice for radiologic imaging system comparisons.

Introduction

A hierarchical model for diagnostic efficacy developed by a scientific committee of the National Council on Radiation Protection and Measurements [2] provides a concise conceptual overview of the issues involved in evaluating diagnostic systems. This model’s 6 levels are as follows:

  • (1)

    Technical efficacy: at the model’s lowest level, a diagnostic test is considered effective if its result is accurate and precise in a physical sense, for example, if the test measures 1 or more physical properties of the human body in a way that agrees with a “gold standard,” and its results are reproducible. Aspects of technical efficacy in medical imaging include spatial or temporal resolution, noise magnitude and texture, and contrast sensitivity.

  • (2)

    Diagnostic accuracy: the second level of efficacy concerns the extent to which the results of a diagnostic test agree, in some statistical sense, with patients’ actual states of health or disease. Virtually all practical measures of diagnostic accuracy quantify the ability of a test to distinguish between 2 (usually composite) states of truth, such as “normal” vs “abnormal” or “positive” vs “negative,” with respect to a specified disease. Examples of diagnostic-accuracy measures include percentage correct, sensitivity and specificity, and receiver operating characteristic (ROC) curves. Of these, ROC curves provide the most comprehensive description, because they indicate all of the combinations of sensitivity and specificity that a diagnostic test is able to provide as the test’s “decision criterion” is varied.

  • (3)

    Diagnostic-thinking efficacy: given the prevalence of a particular disease and given the sensitivity and specificity (or more generally the ROC curve) of a diagnostic test for the presence of that disease, one can easily compute the factor by which the prior odds of disease change after the test’s result is obtained. However, the extent to which a diagnostic test affects physicians’ subjective estimates of disease likelihood must be answered empirically. This level of efficacy is sometimes difficult to quantify, but it provides a conceptual link between the more easily interpreted tiers above and below.

  • (4)

    Therapeutic efficacy: this is the lowest level at which the effects of a diagnostic test on patient management are assessed directly. The basic question is how and by how much a particular diagnostic test changes the way in which patients are treated; for example, how does therapy differ when it is chosen without or with knowledge of a test’s result?

  • (5)

    Patient-outcome efficacy: here, the goal of diagnostic medicine is confronted directly: a diagnostic test is considered effective at this level only if patient health (as measured, eg, in “quality-adjusted life years”) is demonstrably improved by use of the test. This is the kind of efficacy that is of greatest interest to most patients and physicians, and it is an indispensable component of any meaningful “cost-benefit” or “cost-effectiveness” analysis. However, a definitive assessment of efficacy at this level requires prospective randomized and controlled clinical trials, in which practical, statistical, and ethical problems can be formidable [2].

  • (6)

    Societal efficacy: any cost-benefit or cost-effectiveness analysis of a diagnostic test at level 5 focuses on the benefits and personal risks that accrue to the patients who are candidates for the test. However, the fact that medical costs are borne increasingly by society as a whole implies that social utilities should somehow be taken into account when benefits and costs are evaluated. This is the domain of “societal efficacy,” which in principle merges private and public considerations to assess diagnostic tests within the context of the social endeavor.
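The odds calculation mentioned under diagnostic-thinking efficacy (level 3) can be made concrete. The sketch below assumes a positive test result and uses Bayes’ rule in odds form: the post-test odds equal the pre-test odds multiplied by the positive likelihood ratio, LR+ = sensitivity / (1 − specificity). The numerical values are illustrative, not drawn from the article.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = TPF / FPF = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)

def post_test_probability(prevalence, sensitivity, specificity):
    """Probability of disease after a positive result (Bayes' rule in odds form)."""
    prior_odds = prevalence / (1.0 - prevalence)
    posterior_odds = prior_odds * positive_likelihood_ratio(sensitivity, specificity)
    return posterior_odds / (1.0 + posterior_odds)

# Illustrative numbers: 90% sensitivity, 95% specificity, 1% prevalence.
# A positive result multiplies the prior odds by LR+ = 0.90 / 0.05 = 18,
# yet the post-test probability is still only about 15%.
p = post_test_probability(0.01, 0.90, 0.95)
```

This factor (the likelihood ratio) is the "easily computed" quantity referred to above; how far physicians' subjective estimates actually move is the empirical question.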

Efficacy at this hierarchical model’s upper levels usually is of greatest direct interest, but lower-level efficacy is almost always easier to quantify reliably. Fortunately, efficacy at the higher levels sometimes can be estimated from measurements at lower levels by the use of collateral data and appropriate assumptions. Most studies of diagnostic efficacy in medical imaging focus on the measurement of diagnostic accuracy (level 2), because this is the lowest level at which human observers are included and often the highest level at which scientifically rigorous methods can be used.

For many years, diagnostic accuracy was measured and reported in terms of a kind of “batting average”: the percentage of diagnostic decisions that proved to be correct. This “percentage-correct” measure has the fairly obvious limitation that it can depend strongly on disease prevalence [3]: if only 1% of the patients in a screening population have a particular disease, for example, then a system can be “99% accurate” simply by blindly calling all patients negative with respect to that disease. Moreover, the percentage-correct measure does not reveal the relative frequencies of false-positive and false-negative errors, which usually have substantially different clinical consequences.
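The prevalence pitfall described above can be reduced to one line of arithmetic. The sketch below is a hypothetical "classifier" that blindly labels every patient negative; the 1% prevalence figure matches the screening example in the text.

```python
def percent_correct_all_negative(prevalence):
    """Percentage correct when every case is called negative: all
    actually-negative cases are right, all actually-positive cases are wrong."""
    return 100.0 * (1.0 - prevalence)

# At 1% prevalence the blind rule is about "99% accurate" --
# yet its sensitivity is 0, because it misses every diseased patient.
accuracy = percent_correct_all_negative(0.01)
```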

Both of these disadvantages are overcome if diagnostic performance is reported in terms of a pair of indices: “sensitivity” (the fraction of patients actually having the disease in question who are correctly diagnosed as positive) and “specificity” (the fraction of patients actually without the disease who are correctly diagnosed as negative). In effect, these indices quantify separately the “accuracies” of the system for actually positive and actually negative patients, respectively. False-negative and false-positive diagnoses are accounted for implicitly by these indices, and a change in disease prevalence does not affect their numerical values if constant decision criteria are used. The terms true-positive fraction (TPF) and true-negative fraction are synonymous with sensitivity and specificity, respectively. In a complementary way, the “false-negative fraction” and the “false-positive fraction” (FPF) represent the conditional probabilities or frequencies with which actually positive and actually negative patients are diagnosed incorrectly [3, 4]; thus, false-negative fraction = 1 − TPF = 1 − sensitivity, and FPF = 1 − true-negative fraction = 1 − specificity. Because of the interrelationships among these measures, it is necessary only to indicate a single pair; conventionally, either sensitivity and specificity or TPF and FPF are used. The use of sensitivity or TPF alone is inadequate, because the performance of the diagnostic system with regard to actually negative patients is then unknown.
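The fraction definitions and complements above follow directly from a 2 × 2 table of counts. A minimal sketch, with made-up counts for illustration:

```python
def roc_fractions(tp, fn, fp, tn):
    """Return (TPF, FPF) -- i.e. (sensitivity, 1 - specificity) --
    from true-positive, false-negative, false-positive, and
    true-negative counts."""
    tpf = tp / (tp + fn)   # sensitivity: fraction of actually positive cases called positive
    fpf = fp / (fp + tn)   # 1 - specificity: fraction of actually negative cases called positive
    return tpf, fpf

# Illustrative counts: 50 actually positive cases, 100 actually negative.
tpf, fpf = roc_fractions(tp=45, fn=5, fp=10, tn=90)
# sensitivity = 45/50 = 0.90; specificity = 90/100 = 0.90; FPF = 0.10.
# The complements follow: FNF = 1 - TPF and TNF = 1 - FPF, so the
# (TPF, FPF) pair determines all four fractions.
```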

The sensitivity-specificity pair, or one of its equivalents, describes diagnostic accuracy more meaningfully than the single index of percentage correct, and it has been used widely in the medical literature. A single pair of numbers representing sensitivity and specificity is not entirely adequate, however, because it confounds 2 aspects of diagnostic accuracy that can vary independently: (1) the inherent capacity of a diagnostic system to distinguish between actual states of health and disease, and (2) the balance between the frequencies of false-positive and false-negative errors that a decision maker chooses to adopt in a clinical task when a particular discrimination capacity is available [5].

The limitations of reporting diagnostic accuracy in terms of a single sensitivity-specificity or TPF-FPF pair are most evident in studies that attempt to compare diagnostic tests, because often, one test is found to have higher sensitivity (higher TPF) but lower specificity (higher FPF) than the other; in other words, one test is more accurate for actually positive patients, whereas the other is more accurate for actually negative patients. This is clearly problematic in deciding which of the 2 tests to use, because diagnostic testing would not be needed if the presence or absence of the disease were known.

Section snippets

The basic concepts of ROC analysis

The dilemma that arises when one diagnostic test has higher sensitivity but lower specificity than another can be resolved by noting that the sensitivity and specificity of virtually all diagnostic tests can be changed by modifying the “threshold of abnormality” or “decision threshold” that is used for the test. For example, consider 2 radiologists with equal skill who read mammograms to detect breast cancer. If one of these radiologists reads the images more aggressively than the other,
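The threshold-varying idea introduced above can be sketched computationally: sweeping the decision threshold over a continuous test score traces out a sequence of (FPF, TPF) operating points, which is the empirical ROC curve. The scores and truth labels below are invented for illustration.

```python
def empirical_roc(scores, labels):
    """Return the (FPF, TPF) points obtained by calling every case with
    score >= threshold 'positive', for each observed score value."""
    pos = sum(labels)              # number of actually positive cases
    neg = len(labels) - pos        # number of actually negative cases
    points = [(0.0, 0.0)]          # strictest criterion: nothing called positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
curve = empirical_roc(scores, labels)
# The curve runs from (0, 0) (strictest criterion) to (1, 1)
# (most lenient criterion: every case called positive).
```

A more aggressive reader corresponds to a lower threshold: a point farther up and to the right on the same curve, with higher sensitivity but lower specificity.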

ROC-related indices of diagnostic accuracy

Experience indicates that real-world ROC curves must be described by at least 2 parameters that represent separately each curve’s height and the strength or weakness of its symmetry around the negative (−45°) diagonal of the unit square in which ROC curves are plotted. Therefore, ROC curves usually cannot be summarized fully by a single number, and system comparisons that are based entirely on a single-number summary index may lead to erroneous conclusions. The concepts of “better” and “worse”
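The most widely used single-number summary is the area under the ROC curve (AUC), which has a rank-statistic (Mann-Whitney) interpretation: the probability that a randomly chosen actually positive case receives a higher score than a randomly chosen actually negative case, with ties counted as one half. A minimal sketch, using invented scores:

```python
def auc_mann_whitney(scores, labels):
    """Empirical AUC as the Mann-Whitney probability that a positive
    case outscores a negative case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation of the two groups gives AUC = 1.0;
# chance performance gives AUC = 0.5.
auc = auc_mann_whitney([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

As the text cautions, two curves with the same AUC can still cross, so a single-number comparison may mislead when the curves differ in shape.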

Generalized ROC analysis

Conventional ROC analysis fully describes all of the trade-offs that a particular human or automated decision maker can achieve among the frequencies of true-positive, true-negative, false-positive, and false-negative decisions in any particular 2-group classification task, that is, in any situation in which only 2 states of truth are relevant, in which the decision maker must decide to which of the 2 states each test case belongs, and in which a population of test cases has been defined and

Curve fitting

Curve fitting in any branch of science involves 3 steps: (1) choosing a family of fitting functions with adjustable parameters that is able to summarize the data of interest with adequate fidelity; (2) adopting a measure that quantifies the goodness of any particular fit; and (3) computing the values of the fitting function’s parameters that provide the best fit, according to the adopted measure.

Conventional ROC curves are usually fit by using the so-called binormal model, which assumes that
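Under the binormal model, the underlying decision variable is assumed to be normally distributed in each truth state, which yields a two-parameter curve: TPF = Φ(a + b·Φ⁻¹(FPF)), where Φ is the standard normal cumulative distribution function. The area under that curve has the closed form Φ(a / √(1 + b²)). A hedged sketch of the AUC formula, with illustrative (not fitted) parameter values:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binormal_auc(a, b):
    """Area under a binormal ROC curve: Phi(a / sqrt(1 + b^2)).
    a and b are the model's two fitted parameters; a = 0 corresponds
    to chance performance (AUC = 0.5)."""
    return phi(a / sqrt(1.0 + b * b))

auc = binormal_auc(a=1.5, b=1.0)   # roughly 0.86 for these illustrative values
```

In practice the parameters a and b are estimated from rating data by maximum likelihood; the fitting programs listed under Free Software implement that step.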

Testing the statistical significance of differences

Differences between estimated ROC curves must be subjected to statistical significance testing to determine whether those differences can be ascribed to random variation or, instead, are likely to be real. A sometimes bewildering variety of statistical tests has been developed to assess the significance of differences between ROC curve estimates. Although the details of those techniques are beyond the scope of this paper, 2 fundamental issues are mentioned briefly here to provide a basis for

Free software

Although the calculations associated with most meaningful ROC curve-fitting and significance-testing techniques are extremely complicated, free software for these purposes is available from several investigators:

  • A number of programs for fitting conventional ROC curves as well as software for MRMC ROC analysis can be obtained from Kevin Berbaum, PhD, Department of Radiology, University of Iowa ([email protected]).

  • A version of the Iowa MRMC software that has been modified to test the

References (27)

  • C.E. Metz

    ROC methodology in radiologic imaging

    Invest Radiol

    (1986)
  • J.A. Swets

    ROC analysis applied to the evaluation of medical imaging techniques

    Invest Radiol

    (1979)
  • D.K. McClish

    Analyzing a portion of the ROC curve

    Med Decis Making

    (1989)