NeuroImage, Volume 36, Issue 4, 15 July 2007, Pages 1207-1224

On evaluating brain tissue classifiers without a ground truth

https://doi.org/10.1016/j.neuroimage.2007.04.031

Abstract

In this paper, we present a set of techniques for the evaluation of brain tissue classifiers on a large data set of MR images of the head. Due to the difficulty of establishing a gold standard for this type of data, we focus our attention on methods which do not require a ground truth, but instead rely on a common agreement principle. Three different techniques are presented: the Williams’ index, a measure of common agreement; STAPLE, an Expectation Maximization algorithm which simultaneously estimates performance parameters and constructs an estimated reference standard; and Multidimensional Scaling, a visualization technique to explore similarity data. We apply these different evaluation methodologies to a set of eleven different segmentation algorithms on forty MR images. We then validate our evaluation pipeline by building a ground truth based on human expert tracings. The evaluations with and without a ground truth are compared. Our findings show that comparing classifiers without a gold standard can provide a substantial amount of useful information. In particular, outliers can be easily detected, strongly consistent or highly variable techniques can be readily discriminated, and the overall similarity between different techniques can be assessed. On the other hand, we also find that some information present in the expert segmentations is not captured by the automatic classifiers, suggesting that common agreement alone may not be sufficient for a precise performance evaluation of brain tissue classifiers.

Introduction

Automatic segmentation of medical images has been an essential component of many applications, and considerable effort has been invested in finding reliable and accurate algorithms for this difficult problem. Many techniques have been proposed, with different levels of automation and ranges of applicability. However, proposing a new algorithm is not enough on its own: a thorough evaluation of its performance, with quantifiable measurements of accuracy and variability, is also necessary.

The problem of measuring the performance of segmentation algorithms is the subject of this article. We investigate different techniques to assess the quality of multiple segmentation methods on a problem-specific data set. We are especially interested in cases where there is no ground truth available. We focus on the evaluation of brain tissue classifiers, although our framework can be applied to any segmentation problem.

Before we turn our attention to situations where no ground truth is available, we briefly review the key aspects of evaluation with a ground truth. In this scenario, the accuracy of the evaluation depends on two important components. First, one needs to have or design a suitable ground truth, and second, one needs to choose appropriate similarity metrics for the problem being evaluated.

Defining a ground truth in a medical context is not trivial, and several approaches have been proposed. A common approach is to compare automatic techniques against a group of human experts (Grau et al., 2004, Rex et al., 2004). In this framework, one assumes that human raters hold some prior knowledge of the ground truth that is reflected in their manual tracings. Unfortunately, human raters make errors, and considerations of accuracy and variability must be addressed (Zijdenbos et al., 2002). Another common technique is the use of phantoms. For segmentation problems, phantoms are usually synthetic images for which the true segmentation is known (Collins et al., 1998, Zhang et al., 2001, Ashburner and Friston, 2003). A physical object can also be used as a phantom ground truth: the phantom is first measured, then imaged, and performance is assessed by comparing the true measurements with the segmentation measurements (Klingensmith et al., 2000). Studies with cadavers have been conducted in a similar fashion (Klingensmith et al., 2000, Yoo et al., 2000). Unfortunately, it is exceedingly difficult to design phantoms that appropriately mimic in vivo data, and postmortem data differ from perfused, living tissue.

Once a ground truth is created, the key task of evaluation is to measure the similarity between the reference and the automatic segmentation. It is still unclear whether a generic set of measurements can be used for all segmentation problems, although some measures have been popular. Differences in volume have often been used, possibly because volume is such a central measurement in MR imaging studies (Zijdenbos et al., 1994). However, two objects with the same volume can be quite dissimilar, so alternative measures are needed. To address this issue, different forms of distances between boundaries of segmented objects have been used, a popular choice being the Hausdorff distance (Chalana and Kim, 1997, Gerig et al., 2001). Measures of spatial overlap have also been considered an important alternative to volume differences (Zijdenbos et al., 1994, Zijdenbos et al., 2002, Ashburner and Friston, 2003, Grau et al., 2004, Pohl et al., 2004). We investigate these in detail in the Similarity measures section.
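To make the overlap idea concrete, the following is a minimal sketch (not the authors' implementation) of the Jaccard coefficient between two binary masks; the function name and the NumPy-based interface are our own illustrative choices.

```python
import numpy as np

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return float(np.logical_and(a, b).sum()) / float(union)
```

The closely related Dice coefficient is a monotonic function of the Jaccard coefficient, so either can serve as the overlap measure.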

For many medical problems, as noted previously, phantom studies are considered insufficient for validation and manual tracings are simply not available. In the work presented here, we focus on the automatic classification of the brain into four major tissue classes: Gray Matter (GM), White Matter (WM), CerebroSpinal Fluid (CSF) and background (BG). For this specific problem, manual tracing of the entire data set, a total of forty cases, is simply not feasible. Nevertheless, if one were to start a new neuroimaging study, one would certainly like to evaluate the automatic classifiers on the entire population. We thus have to turn to methods that measure performance when no ground truth is available. A rather intuitive approach is to perform such an evaluation based on common agreement. That is, if nine out of ten algorithms classify voxel x in subject i as white matter, then one says there is a 90% chance this voxel truly is white matter. This simple technique is appealing but limited: all algorithms have equal voting power, and a voxel may end up with equal probability of belonging to several tissue classes. Nevertheless, this notion of common agreement is useful and can be quantified directly through measures such as the Williams’ index (Chalana and Kim, 1997, Klingensmith et al., 2000, Martin-Fernandez et al., 2005). One can also create a reference from the majority vote of the segmentations and then use it as a ground truth for further performance measurements. A more elaborate technique is the one developed by Warfield et al. (2004), which simultaneously creates a reference standard and estimates performance parameters within an Expectation Maximization framework.
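As an illustration of this voting intuition, here is a minimal sketch of a per-voxel vote count and majority reference; the array layout, label encoding and function name are hypothetical and not taken from the paper.

```python
import numpy as np

def vote_reference(labels: np.ndarray, n_classes: int = 4):
    """Per-voxel label frequencies and the resulting majority-vote reference.

    labels: (n_classifiers, n_voxels) integer array, e.g. 0=BG, 1=CSF, 2=GM, 3=WM.
    """
    n_classifiers, n_voxels = labels.shape
    counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    probabilities = counts / n_classifiers   # e.g. 9 of 10 votes -> 0.9 for that class
    reference = counts.argmax(axis=0)        # ties are broken arbitrarily by argmax
    return probabilities, reference
```

The argmax line makes the limitation mentioned above explicit: with equal voting power, ties between tissue classes can only be resolved arbitrarily.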

This evaluation approach based on common agreement is the foundation of the work we present here. Our MR brain segmentation problem suffers from the lack of a readily available database that contains both the type of input data we use and accurate reference classifications. Without a gold standard, the problem is clearly ill-posed, and we believe common agreement is a sensible solution. Some care must be taken when analyzing the results, as one cannot state with certainty that one algorithm clearly outperforms the others purely on the basis of a common agreement principle. Nevertheless, we will at least be able to observe and study many aspects of segmentation performance, such as robustness and variability across cases, brain regions, or tissue classes. We will also be able to infer how different the algorithms are and whether some techniques tend to behave similarly. One key requirement for the success of our study is that the input to the common agreement be unbiased. If a subset of the methods tested always behaves similarly, the agreement will be biased towards these methods and the evaluation may be incorrect. In our work, we selected 11 segmentation techniques which we believe represent a well-balanced set. We include a discussion of bias as part of our analysis in the Discussion section.

In this article, we make several contributions. First, we present a framework in which one can assess segmentation performance purely on the basis of common agreement. Three methods form the basis of this framework: the Williams’ index, a technique we recently introduced (Martin-Fernandez et al., 2005); the STAPLE algorithm (Warfield et al., 2004); and a novel visualization based on Multidimensional Scaling (MDS), a statistical tool for exploring (dis)similarity data (Borg and Groenen, 1997, Cox and Cox, 2000). Second, we discuss the validity of our results by comparing our framework (purely based on common agreement) with an evaluation against a set of manual segmentations (used as ground truth). Our findings suggest that common agreement evaluation provides almost the same information as evaluating against a ground truth with respect to robustness, variability and even ranking. Nevertheless, we do observe that some of the information captured by the human experts is not present in the automatic classifications. Common agreement alone may thus not be sufficient to accurately rank automatic segmentation algorithms. Finally, as our experiments test eleven state-of-the-art segmentation algorithms on a real and rather large data set, we provide the community with new and useful knowledge about the performance of these algorithms.
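For readers unfamiliar with the Williams’ index, the sketch below implements one common formulation: the ratio of a rater's mean agreement with the other raters to the mean agreement among those other raters. It is our reading of the cited literature under illustrative assumptions (a symmetric agreement matrix of, e.g., Jaccard values), not the paper's own code.

```python
import numpy as np

def williams_index(agreement: np.ndarray, j: int) -> float:
    """Williams' index of rater j from a symmetric pairwise agreement matrix.

    Values near (or above) 1 suggest rater j agrees with the group at least
    as well as the group members agree with each other.
    """
    n = agreement.shape[0]
    others = [k for k in range(n) if k != j]
    mean_with_j = agreement[j, others].mean()
    # Mean agreement among the remaining raters (each unordered pair counted once).
    pair_sum = sum(agreement[k, kp]
                   for i, k in enumerate(others)
                   for kp in others[i + 1:])
    mean_among_others = 2.0 * pair_sum / ((n - 1) * (n - 2))
    return mean_with_j / mean_among_others
```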

In the following section, we give a detailed description of the design of the evaluation framework. We start by introducing different similarity measures to compare binary images. We then give detailed information on how the Williams’ index is computed and present a brief review of STAPLE’s underlying principles and of how it is used in our experiments. We give a more in-depth description of MDS, as we have not seen this technique used for evaluation elsewhere.
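To give a flavor of STAPLE's underlying principles ahead of that review, the following is a compact sketch of the binary EM updates described by Warfield et al. (2004): the E-step computes a per-voxel posterior for the hidden true label, and the M-step re-estimates each rater's sensitivity and specificity. Initialization values, iteration count and variable names are illustrative assumptions, and the paper's multi-class tissue setting may be handled differently (e.g., one binary problem per tissue class).

```python
import numpy as np

def staple_binary(decisions: np.ndarray, n_iter: int = 50, prior: float = 0.5):
    """Binary STAPLE-style EM: probabilistic reference w plus per-rater
    sensitivity p and specificity q.

    decisions: (n_raters, n_voxels) array of 0/1 labels.
    """
    d = decisions.astype(float)
    n_raters = d.shape[0]
    p = np.full(n_raters, 0.9)  # initial sensitivities (assumed)
    q = np.full(n_raters, 0.9)  # initial specificities (assumed)
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is truly foreground.
        a = prior * np.prod(p[:, None] ** d * (1 - p[:, None]) ** (1 - d), axis=0)
        b = (1 - prior) * np.prod(q[:, None] ** (1 - d) * (1 - q[:, None]) ** d, axis=0)
        w = a / (a + b + 1e-12)
        # M-step: re-estimate each rater's performance parameters.
        p = (d * w).sum(axis=1) / (w.sum() + 1e-12)
        q = ((1 - d) * (1 - w)).sum(axis=1) / ((1 - w).sum() + 1e-12)
    return w, p, q
```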

The Experiments section describes our experimental setup: which data set is used, which algorithms are tested and what kinds of tests are performed. This section starts with an experiment in which absolutely no ground truth is available and only common agreement is used. We then create gold standards based on human tracings of a small subset of the data in order to validate whether common agreement is indeed a sensible approach. We analyze our results in the Results section, and discuss the feasibility, accuracy, robustness, scalability and significance of evaluating brain tissue classification algorithms purely on the basis of their common agreement in the Discussion section. The Conclusion section summarizes the achieved results.

Section snippets

Measuring segmentation quality

The main underlying principle of our evaluation is the notion of agreement. In our work, the agreement of two segmentation techniques is defined as the similarity between their respective outputs. Once a similarity measure is chosen, one can compute a similarity matrix capturing how well all the segmentations match each other, or a reference segmentation (manual or estimated).
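As a minimal sketch of how such a matrix could be assembled and then visualized with MDS, the snippet below computes pairwise Jaccard agreements and embeds the resulting dissimilarities in two dimensions using scikit-learn's MDS with a precomputed dissimilarity matrix; the mask list and names are hypothetical, and this is not the paper's implementation.

```python
import numpy as np
from sklearn.manifold import MDS

def jaccard(a, b):
    # Same Jaccard helper as in the earlier sketch.
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum()) / float(union)

def similarity_matrix(masks):
    """Symmetric matrix of pairwise Jaccard agreements between binary masks."""
    n = len(masks)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = jaccard(masks[i], masks[j])
    return sim

# Hypothetical usage: 'wm_masks' would hold one binary WM mask per classifier.
# sim = similarity_matrix(wm_masks)
# coords = MDS(n_components=2, dissimilarity="precomputed",
#              random_state=0).fit_transform(1.0 - sim)
# Each row of 'coords' is a 2-D point; nearby points indicate similar classifiers.
```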

Data set

Our data set consists of forty female subjects. The acquisition protocol involves two MR pulse sequences acquired on a 1.5-T GE scanner. First, a SPoiled Gradient-Recalled (SPGR) sequence yielded a coronally sliced MR volume of size 256 × 256 × 124 with voxel dimensions 0.9375 × 0.9375 × 1.5 mm. Second, a double-echo spin-echo sequence gave two axially sliced MR volumes (proton density and T2-weighted) of size 256 × 256 × 54 with voxel dimensions 0.9375 × 0.9375 × 3 mm. For each subject, both axial volumes were

Evaluation with no manual segmentation reference

Fig. 4 presents the mean/standard deviation plots of the Jaccard coefficient (JC). Our first observation is that, for GM and WM, the Williams’ index and STAPLE give a very similar ordering, whereas for CSF they differ considerably. This may be due to the difficulty of segmenting CSF, which leads to higher variability and lower accuracy of the output, and consequently to less reliable agreement measurements.

Concerning the ranking of techniques, a few observations can be made. First, we have the surprising result that single

Common agreement and bias

The main hypothesis underlying the common agreement principle is that each classifier makes decisions independently of the others. In fact, the assumption is more subtle: independence is conditional on the underlying truth and on the performance parameters of each classifier. This notion is central to both the Williams’ index and STAPLE. As the truth and performance are not known a priori, one cannot test for independence and generally needs to assume classifiers that make highly uncorrelated

Conclusion

In this paper, we investigated how to evaluate automatic segmentation algorithms without a ground truth. Common agreement was used as the foundation of our comparison study. We chose the Jaccard Coefficient as the metric to measure agreement between two segmentations. We then used three different methods to evaluate and visualize the notion of common agreement. Of these three methods, the Williams’ Index provided us with a simple and efficient way of measuring whether a particular classifier agreed with

Acknowledgments

This work is very much a collaborative endeavor and could not have been completed without the help of V. Grau, G. Kindlmann, K. Krissian, S. K. Warfield, W. M. Wells and C.F. Westin from Brigham and Women’s Hospital; K. Pohl from Massachusetts Institute of Technology; S. M. Smith from the Oxford University Centre for Functional MRI of the Brain; O. Ivanov and A. Zijdenbos from McGill University Brain Imaging Centre.

We acknowledge the support of NIH (K02 MH01110, R01 MH50747, P41 RR13218, U54

References (32)

  • S. Warfield, Fast kNN classification for multichannel image data, Pattern Recogn. Lett. (1996)
  • J. Ashburner et al., Spatial normalization using basis functions (2003)
  • I. Borg et al., Modern Multidimensional Scaling: Theory and Applications (1997)
  • V. Chalana et al., A methodology for evaluation of boundary detection algorithms on medical images, IEEE Trans. Med. Imag. (1997)
  • D.L. Collins et al., Design and construction of a realistic digital brain phantom, IEEE Trans. Med. Imag. (1998)
  • T.F. Cox et al., Multidimensional Scaling (2000)
  • G. Gerig et al., Valmet: a new validation tool for assessing and improving 3D object segmentation (2001)
  • V. Grau et al., Improved watershed transform for medical image segmentation using prior information, IEEE Trans. Med. Imag. (2004)
  • G. Hripcsak et al., Measuring agreement in medical informatics reliability studies, J. Biomed. Inform. (2002)
  • P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bull. Soc. Vaudoise Sci. Nat. (1901)
  • J.D. Klingensmith et al., Evaluation of three-dimensional segmentation algorithms for the identification of luminal and medial–adventitial borders in intravascular ultrasound images, IEEE Trans. Med. Imag. (2000)
  • K. Krissian, Flux-based anisotropic diffusion applied to enhancement of 3-D angiograms, IEEE Trans. Med. Imag. (2002)
  • M. Martin-Fernandez et al., Two methods for validating brain tissue classifiers (2005)
  • K.M. Pohl et al., Anatomical guided segmentation with non-stationary tissue class distributions in an expectation–maximization framework (2004)
  • D.E. Rex et al., A meta-algorithm for brain extraction in MRI, NeuroImage (2004)
  • T. Rohlfing et al., Expectation maximization strategies for multi-atlas multi-label segmentation