Introduction

Relative to the wealth of epidemiologic data on hip fractures, population-based studies of vertebral fractures are quite limited [1], due mainly to methodologic issues surrounding vertebral fracture diagnosis [2]. For example, there has been debate about the degree of reduction in vertebral body height necessary to declare a new vertebral fracture on serial radiographs [3, 4]. Although various definitions have provided comparable results when treatment and placebo groups were compared in a clinical trial [5], more stringent definitions do lower absolute vertebral fracture incidence estimates [5, 6]. Much more problematic is the assessment of vertebral fracture prevalence cross-sectionally at baseline [7]. The more severe vertebral fractures that typically come to clinical attention are easily recognized both by clinical semiquantitative readings and vertebral morphometry [8]. For early intervention, however, it might also be important to detect mild vertebral fractures, which can be relatively asymptomatic [9, 10]. In this regard, vertebral morphometry is more objective and systematic, but clinical readers can more easily detect abnormalities in vertebral size and shape and exclude those from other causes, such as congenital abnormalities [8]. In particular, morphometric assessments are beset by false positive readings of various sorts, and efforts to improve specificity correspondingly reduce sensitivity. Several strategies have been advanced to minimize these false positive results [11, 12]. Recognizing that the occurrence of one vertebral fracture increases the risk of additional fractures [13, 14], one test of these approaches is to estimate subsequent incidence rates in subsets of patients with baseline vertebral fractures defined in various ways [7]. Here we report such an analysis among an age-stratified sample of postmenopausal Rochester, Minnesota women.

Methods

Study subjects

Potential study subjects were identified using the medical records linkage system of the Rochester Epidemiology Project [15]. Since over half of the Rochester population is seen annually in this system, and the large majority is attended in any 3-year period, the enumerated population approximates the underlying population of the community from which random samples can be drawn. Age-stratified samples were obtained of 201 Rochester women aged 50 years or over from among those who attended in 1980±1 year [16], 561 women from those seen in 1983±1 year [17], and 199 additional women in this age-group who were identified from among residents seen in 1990±1 year [18]. Sixteen women in the latter sample had been members of earlier samples, and their original baseline data were used in these analyses. Of these 945 unique subjects, 512 (54%) had follow-up radiographs, and they comprise the study group.

Assessment of baseline vertebral deformities

Anteroposterior and lateral radiographs of the thoracic and lumbar spine were obtained at a standard source-to-film distance of 122 cm. The thoracic film was centered over T-7, while the lumbar film was centered over L-2. As described in detail elsewhere [19], each vertebra (T-4 to L-5) was marked on the lateral thoracic and lumbar radiographs, and a computer algorithm generated anterior (ha), middle (hm) and posterior (hp) heights of each vertebra to the nearest 0.1 millimeter [coefficient of variation (CV): 1.5% of mean heights]. Of 7168 potentially assessable vertebrae (14 vertebrae for each of 512 subjects), baseline data were available for all but eight (n=7160), and paired (baseline and follow-up) data were available on all but 13 vertebrae (n=7155).

Baseline deformities were evaluated by a number of alternative approaches, most of which were based on ratios of anterior (ha), middle (hm) and posterior (hp) vertebral heights to normalize for absolute differences in bone size [2]. Height ratios characterized anterior wedge (ha/hp), concavity or endplate (hm/hp), and compression or crush (hp/hp′) deformities, where hp′ is the posterior height of the vertebra above or below [16]. A vertebra was classified as deformed (approximates fractured) if the relevant ratio was less than a minimum cut-off value. In the first approach, cut-off values were variously defined as reductions in height ratios of 15, 20, or 30% or more, respectively, from the mean ratios for a given vertebra as judged either on the basis of the entire sample of women (“unadjusted”), or on the normal women alone (“normal”) after excluding those with clinically-evident vertebral fractures or metabolic bone disease [16]. Alternatively, we defined the minimum cut-off value as 3 SD (or 4 SD) below the expected mean values for each ratio as judged from normal women [19] or based on a modified algorithm (“trimmed”) originally developed by Black and colleagues [20] that excludes abnormal ratios statistically [17]. To help reduce the problem of false positives, we also evaluated percentage reductions after standardizing the vertebral heights (“standardized”) using a complex algorithm described in detail elsewhere [21]; a program for accomplishing this is available from Dr. Guirong Jiang (g.jiang@sheffield.ac.uk). In addition, we used the Ross/Wasnich algorithm [11], which trims vertebral heights statistically but then focuses on the heights themselves rather than height ratios (“Z-score”).
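The ratio-based definitions described above can be illustrated with a minimal Python sketch. It is not the software used in the study: the class and function names, the example heights, and the reference means and SDs are all hypothetical, and the trimming and standardization steps of the cited algorithms are omitted.

```python
from dataclasses import dataclass


@dataclass
class Heights:
    """Vertebral heights in millimeters (hypothetical container)."""
    ha: float  # anterior height
    hm: float  # middle height
    hp: float  # posterior height


def height_ratios(v, hp_adjacent):
    """Wedge (ha/hp), concavity (hm/hp) and crush (hp/hp') ratios."""
    return {
        "wedge": v.ha / v.hp,
        "concavity": v.hm / v.hp,
        "crush": v.hp / hp_adjacent,
    }


def percent_cutoff(mean_ratio, reduction):
    """Cut-off defined as a fractional reduction from the reference mean."""
    return mean_ratio * (1.0 - reduction)


def sd_cutoff(mean_ratio, sd_ratio, n_sd):
    """Cut-off defined as n_sd standard deviations below the reference mean."""
    return mean_ratio - n_sd * sd_ratio


def deformity_types(ratios, cutoffs):
    """Names of the ratios that fall below their cut-off values."""
    return [name for name in ratios if ratios[name] < cutoffs[name]]


# Illustrative values only; these are not reference data from the study.
ref_mean = {"wedge": 0.95, "concavity": 0.92, "crush": 1.00}
ref_sd = {"wedge": 0.05, "concavity": 0.05, "crush": 0.05}

vertebra = Heights(ha=16.0, hm=19.0, hp=21.0)
ratios = height_ratios(vertebra, hp_adjacent=21.5)

cut15 = {k: percent_cutoff(m, 0.15) for k, m in ref_mean.items()}
cut20 = {k: percent_cutoff(m, 0.20) for k, m in ref_mean.items()}
cut3sd = {k: sd_cutoff(ref_mean[k], ref_sd[k], 3) for k in ref_mean}
cut4sd = {k: sd_cutoff(ref_mean[k], ref_sd[k], 4) for k in ref_mean}

print(deformity_types(ratios, cut15), deformity_types(ratios, cut20))
print(deformity_types(ratios, cut3sd), deformity_types(ratios, cut4sd))
```

With these assumed numbers the same vertebra is a wedge deformity under the 15% and 3 SD rules but normal under the 20% and 4 SD rules, which is the kind of definition-dependent classification the analysis examines.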

We also employed the “prevalence” method of McCloskey and coworkers [4], as modified by Black and colleagues [22], which considers the three vertebral height ratios but, for wedge and biconcavity fractures, requires a reduction of 3 SD (or 4 SD) or more compared not only to the observed posterior height of each vertebra (hp) but also compared to the posterior height predicted from values for four adjacent vertebrae (hpp). Vertebral collapse or crush fractures were defined as a posterior height more than 3 SD (or 4 SD) less than expected (hp:hpp) combined with a comparably reduced anterior height (ha:hpp). Finally, we assessed the effect of deformities confirmed on a qualitative clinical reading of the radiographs by an experienced radiologist (DEW) who did not know the morphometry results.
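The double requirement of this method, as summarized in the paragraph above, might be sketched as follows. This is a simplified illustration rather than the published algorithm: the predicted posterior height (hpp) is taken as an input because its derivation from adjacent vertebrae is not detailed here, and the function name and reference values are hypothetical.

```python
def mccloskey_types(ha, hm, hp, hpp, ref, n_sd=3.0):
    """Classify one vertebra using observed (hp) and predicted (hpp) posterior heights.

    ref maps a ratio name to a (mean, sd) pair of reference values.
    """
    def low(name, numerator, denominator):
        mean, sd = ref[name]
        return numerator / denominator < mean - n_sd * sd

    found = []
    # Wedge: anterior height reduced relative to both hp and hpp.
    if low("ha/hp", ha, hp) and low("ha/hpp", ha, hpp):
        found.append("wedge")
    # Biconcavity: middle height reduced relative to both hp and hpp.
    if low("hm/hp", hm, hp) and low("hm/hpp", hm, hpp):
        found.append("biconcavity")
    # Crush: posterior and anterior heights both reduced relative to hpp.
    if low("hp/hpp", hp, hpp) and low("ha/hpp", ha, hpp):
        found.append("crush")
    return found


# Illustrative reference values only; not taken from the study.
ref = {
    "ha/hp": (0.95, 0.04), "ha/hpp": (0.95, 0.05),
    "hm/hp": (0.93, 0.04), "hm/hpp": (0.93, 0.05),
    "hp/hpp": (1.00, 0.05),
}

# Low ha/hp only because hp is unusually tall: not counted as a wedge.
print(mccloskey_types(ha=20.0, hm=21.0, hp=25.0, hpp=22.0, ref=ref))
# Anterior height reduced against both hp and hpp: wedge.
print(mccloskey_types(ha=16.0, hm=21.0, hp=22.0, hpp=22.0, ref=ref))
# Whole vertebra reduced against the predicted height: crush.
print(mccloskey_types(ha=17.0, hm=17.5, hp=18.0, hpp=22.0, ref=ref))
```

The first example shows how comparison with the predicted posterior height can suppress a deformity that a single ratio would have flagged.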

Ascertainment of new vertebral deformities

Alternative definitions of a new fracture took into consideration a 15, 20, or 30% reduction in at least one of the three heights (ha, hm or hp) within a vertebra between any two examinations, along with a widely used criterion comprising a 20% reduction in height combined with an absolute change of at least 4 mm. However, 4 mm corresponds to approximately 20% even in the smallest vertebra (T-4), and Black and colleagues [6] showed that the absolute change criterion had little effect when the vertebral height reduction was as much as 20%. Since they further demonstrated that the 20% height change criterion maximized power for clinical trials, this was taken as the base case for the analysis. In these analyses, each patient was evaluated with respect to her own data so that the actual cut-off value varied from subject to subject. The mean value of a 15% reduction in the anterior height of T-12, for example, was 3.0±0.7 mm. We further evaluated changes in height ratios (ha/hp, hm/hp and hp/hp′) of 3 SD or 4 SD, based on SDs from the baseline ratios, as well as changes in the ratios of 15, 20 or 30%. We also evaluated follow-up radiographs using the McCloskey method noted above [4], i.e., vertebrae were evaluated independently at each assessment (serial prevalence), and a new fracture declared when the algorithm identified a deformity in a vertebra previously classified as normal.
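The height-change criteria can be expressed compactly. The sketch below is an illustration under stated assumptions, not the study's code: the function name and the example heights are hypothetical, and only the comparison of absolute heights (not of height ratios) is shown.

```python
def incident_fracture(baseline, follow_up, min_fraction=0.20, min_mm=0.0):
    """True if any of ha, hm or hp fell by at least min_fraction of its
    baseline value and by at least min_mm millimeters."""
    for key in ("ha", "hm", "hp"):
        loss = baseline[key] - follow_up[key]
        if loss >= min_fraction * baseline[key] and loss >= min_mm:
            return True
    return False


# Hypothetical small vertebra: anterior height falls from 19.0 to 15.1 mm,
# a loss of about 3.9 mm (roughly 20.5% of baseline).
baseline = {"ha": 19.0, "hm": 19.5, "hp": 20.0}
follow_up = {"ha": 15.1, "hm": 19.0, "hp": 19.8}

print(incident_fracture(baseline, follow_up, 0.15))              # 15% rule
print(incident_fracture(baseline, follow_up, 0.20))              # 20% rule
print(incident_fracture(baseline, follow_up, 0.20, min_mm=4.0))  # 20% and >= 4 mm
print(incident_fracture(baseline, follow_up, 0.30))              # 30% rule
```

Because each threshold is a fraction of the subject's own baseline height, the cut-off in millimeters differs from vertebra to vertebra, and the added 4 mm requirement can only change the result in vertebrae shorter than 20 mm.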

Statistical analysis

Information on incident vertebral fractures was obtained from a baseline radiograph and the last available follow-up radiograph, where the new fracture, if any, occurred at some time point during the time interval between the two. Unadjusted incidence rates represent the total number of subjects who experienced a fracture divided by the sum of individual follow-up time (person-years). Age-adjusted rates were created by partitioning each subject’s follow-up into 5-year age-specific categories. For instance, a subject who was age 50 years at baseline and 58 years at last follow-up contributed 5 years of follow-up to the 50- to 54-year age-group and 3 years of follow-up to the 55- to 59-year age-group. A fracture that occurred during this interval was partitioned to the different age categories in proportion to the time the subject was in each category. Because these data came from an age-stratified sample, these age-group-specific counts and follow-up were summed using the appropriate weights from the U.S. 2000 census to create overall incidence rates directly age-adjusted to the population structure of U.S. white women ≥50 years of age in 2000. Confidence intervals were calculated assuming that the incident cases follow a Poisson distribution.
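The partitioning of follow-up time and the direct age adjustment can be sketched in a few lines of Python. This is a minimal illustration, assuming 5-year bands starting at age 50; the function names are hypothetical, and the band-specific counts and weights in the example are placeholders rather than study data or census figures.

```python
def partition_follow_up(age_start, age_end, width=5, first_band=50):
    """Split follow-up between two ages into age bands.

    Returns {band_start_age: years contributed to that band}.
    """
    years = {}
    age = float(age_start)
    while age < age_end:
        band = first_band + width * int((age - first_band) // width)
        step = min(age_end, band + width) - age
        years[band] = years.get(band, 0.0) + step
        age += step
    return years


def partition_event(years):
    """Allocate one fracture across bands in proportion to time spent in each."""
    total = sum(years.values())
    return {band: y / total for band, y in years.items()}


def directly_standardized_rate(events, person_years, weights):
    """Weighted average of band-specific rates (events / person-years)."""
    weighted = sum(weights[b] * events[b] / person_years[b] for b in weights)
    return weighted / sum(weights.values())


# The worked example from the text: age 50 at baseline, 58 at last follow-up.
years = partition_follow_up(50, 58)
print(years)                   # 5 years in the 50-54 band, 3 in the 55-59 band
print(partition_event(years))  # fracture split 5/8 and 3/8 between the bands

# Placeholder inputs to show the weighting step only.
events = {50: 2.0, 55: 3.0}
person_years = {50: 200.0, 55: 150.0}
weights = {50: 0.6, 55: 0.4}
print(1000 * directly_standardized_rate(events, person_years, weights))
```

The standardized rate is expressed per 1000 person-years by the final multiplication, matching the units used in the Results.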

The relationship of incidence rates to prevalent vertebral fracture status at baseline was assessed using generalized linear models assuming a Poisson error structure and considering the subject – rather than the individual vertebrae – to be the unit of analysis. In order to account for the variable length of follow-up, log (follow-up years) was included in the model with a fixed coefficient of 1 (i.e., as an offset term). Risk ratios (RR) and confidence intervals were obtained from these models.
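For a single binary predictor (baseline deformity present or absent), a Poisson model with a log link and log(person-years) as offset has a closed-form solution: the risk ratio is the ratio of the two crude rates, and the standard error of its logarithm is the square root of the sum of the reciprocal event counts. The sketch below illustrates that special case with a Wald interval; it is not the software used for the analyses, and the function name and counts are hypothetical.

```python
import math


def poisson_rate_ratio(events_1, py_1, events_0, py_0, z=1.96):
    """Rate ratio and Wald confidence limits for exposed (1) vs unexposed (0).

    Equivalent to the fitted coefficient of a Poisson log-linear model with
    one binary covariate and log(person-years) as an offset.
    """
    rr = (events_1 / py_1) / (events_0 / py_0)
    se_log_rr = math.sqrt(1.0 / events_1 + 1.0 / events_0)
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 20 women with a new fracture in 400 person-years among
# those with a baseline deformity, 50 in 2000 person-years among those without.
rr, lower, upper = poisson_rate_ratio(20, 400.0, 50, 2000.0)
print(f"RR = {rr:.2f} (95% CI: {lower:.2f} to {upper:.2f})")
```

Models with additional covariates have no such closed form and would need an iterative fit in a statistical package.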

Agreement with the qualitative clinical reading was summarized as overall agreement (all agreements divided by the number of subjects) and by the kappa (κ) statistic. Because clinical readings cannot be considered a reliable “gold standard,” we also calculated the c-index [23] to compare the ability of the different baseline definitions of vertebral deformity to predict a new deformity, using the 20% vertebral height reduction criterion. The c-index is a rank correlation that compares the predicted probabilities from a model to the observed responses (whether or not fracture was detected). It was estimated from equivalent Cox models, using the known relationship between the Poisson and Cox models. The c-index is a useful statistic for comparing models – the higher the c, the better the model does at predicting the event.
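Both summary statistics can be illustrated with short Python functions. This is a sketch under simplifying assumptions: the c-index here is the binary-outcome version (equivalent to the area under the ROC curve) and ignores follow-up time, whereas the study estimated it from Cox models. The function names and the small data vectors are hypothetical.

```python
def agreement_and_kappa(rating_a, rating_b):
    """Overall agreement and Cohen's kappa for two binary (0/1) ratings."""
    n = len(rating_a)
    observed = sum(a == b for a, b in zip(rating_a, rating_b)) / n
    p_a = sum(rating_a) / n
    p_b = sum(rating_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    return observed, (observed - expected) / (1 - expected)


def c_index(scores, outcomes):
    """Fraction of (event, non-event) pairs in which the event has the higher
    score; tied scores count as one half."""
    concordant = ties = pairs = 0
    for s_i, y_i in zip(scores, outcomes):
        for s_j, y_j in zip(scores, outcomes):
            if y_i == 1 and y_j == 0:
                pairs += 1
                if s_i > s_j:
                    concordant += 1
                elif s_i == s_j:
                    ties += 1
    return (concordant + 0.5 * ties) / pairs


# Hypothetical readings for 10 women (1 = deformity present).
morphometry = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
clinical = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(agreement_and_kappa(morphometry, clinical))

# Hypothetical baseline deformity flags and new-fracture outcomes for 6 women.
baseline_deformity = [1, 1, 0, 0, 0, 0]
new_fracture = [1, 0, 1, 0, 0, 0]
print(c_index(baseline_deformity, new_fracture))
```

When the predictor is itself binary, this c-index reduces to the average of sensitivity and specificity, which helps explain why a more sensitive but less specific definition can yield a c-index similar to that of a stricter one.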

Results

These 512 mostly white (98%) postmenopausal women (median age at baseline: 63 years; range: 50–91 years) were followed radiographically for a total of 4455 person-years. Altogether, 451 (88%) of the participants had a follow-up radiograph at least 4 years later, 325 (63%) after 8 or more years, and 67 (13%) after 12 years. The number of new vertebral deformities identified on these follow-up studies depended on the definition used (Table 1). As expected, the more stringent the criterion for change, the smaller the number of new deformities and the lower the corresponding vertebral fracture incidence rate (per 1000 person-years). For example, the number of women with a new fracture was 187, 112, and 60 for 15, 20, and 30% reductions in height, respectively, and 96 for a 20% reduction in height with an absolute change of ≥4 mm. In subsequent analyses, we used a 20% reduction in any vertebral height to define a new vertebral fracture (n=197). Of these, 25 represented progression of a baseline vertebral deformity (by the Eastell/Melton 3 SD definition), and 172 were new fractures, as noted in Methods. By the 20% change criterion, and counting women not vertebrae, the overall age-adjusted vertebral fracture incidence rate was 23 per 1000 person-years (95% CI: 18–28).

Table 1 Risk of a new vertebral deformity among Rochester women followed for a median of 10 years, by criterion for change

The number of deformed vertebrae at baseline is provided in Table 2, along with the number of affected women, given each definition. Depending on the definition used, the prevalence of a vertebral deformity in these postmenopausal women varied from almost 90% with the most generous criterion down to 3% with the most restrictive ones. A subset of 367 radiographs (the 1980 and 1983 cohorts) had been read clinically, and agreement with the qualitative clinical reading ranged from 32 to 84% for the various baseline definitions. With the exception of the unadjusted 15% criterion, where κ (agreement with the clinical reading beyond chance alone) was only 0.05, κ for the remaining definitions varied from 0.22 (i.e., “fair” agreement) up to 0.53 (i.e., “moderate” agreement). Interestingly, application of the Jiang/Eastell approach to standardizing vertebral heights changed the poorly performing 15% criterion from one of the worst (32% agreement; κ, 0.05) to one of the best (84% agreement; κ, 0.53).

Table 2 Number of affected vertebrae and affected Rochester women, by baseline deformity definition and agreement with a qualitative clinical reading of the radiographs

The likelihood of a new vertebral deformity, based on the 20% change criterion, is given in Table 3 for various definitions of a baseline deformity. Generally speaking, the predictive power of the more generous criteria (e.g., ≥3 SD) resulted partly from the inclusion in such categories of severe deformities (e.g., ≥4 SD). The results for each definition, after excluding the more severe deformities contained within each one, are also shown in Table 3. Thus, deformities defined on the basis of the unadjusted 15% rule were associated with an overall 1.9-fold increase in the risk of a new vertebral fracture, but this was mostly attributable to the 3.9-fold increase associated with 30% deformities; there was no significant increase in fracture risk among women with 15% deformities alone (p=0.114). Again, however, the Jiang/Eastell standardized height approach seemed to resolve this problem, since those with 15% but not 20% deformities by this method still had a statistically significant 2.3-fold increase in risk. For comparison, women with 4 SD deformities by the Eastell/Melton approach (trimmed values) had a 3.8-fold increase in the risk of a new vertebral fracture, but the 1.5-fold increase in those who only had a 3 SD deformity at baseline was not statistically significant (p=0.118). By contrast, the 3 SD criterion for the McCloskey/Kanis method was associated with a significant 4.9-fold increase in risk, although relatively few women would be considered affected by this definition (Table 3). However, the ability of the different baseline definitions to predict a new deformity, as assessed by the c-index, was more or less comparable for all approaches (Table 3).

Table 3 Number of affected Rochester women and their relative risk (RR) of a new deformity (20% change in height criterion), by baseline definition

Discussion

It is difficult to envision a simple morphometric algorithm that could outperform an experienced radiologist in diagnosing vertebral fractures, but not all readers in routine clinical practice are experts and performance varies even among experienced radiologists [24–28]. Moreover, the introduction of densitometric Vertebral Fracture Assessment [29, 30] creates an opportunity for automated assessment of vertebral deformities, although expert evaluation may be superior in that application as well [31, 32]. Previous studies have shown that generous morphometric criteria for baseline deformities (e.g., a 15% reduction in mean vertebral height ratios) may declare an excessive number of vertebral fractures relative to expert readings of the same radiographs [16, 33], as shown here as well. However, we also find that significant improvements in morphometry may be possible. Thus, after standardizing vertebral heights using the Jiang/Eastell approach [21], there was much better agreement with the clinical interpretation.

However, many disagreements relate to mild (e.g., 3 SD or grade 1) deformities [17], and this study importantly shows that the power to predict future vertebral fractures is driven by the inclusion in such definitions of more severe (e.g., 4 SD or grade 2) deformities. When isolated, the 3 SD height ratio deformities alone were not associated with a significant increase in future fracture risk. Others have found that, compared to 4 SD deformities, 3 SD vertebral deformities are not associated with much, if any, excess in the frequency or severity of back pain [9, 10], height loss [9, 34, 35], low lumbar spine bone density values [19, 34, 36, 37], or specific risk factors such as corticosteroid use [38]. Moreover, isolated 3 SD (excluding 4 SD) deformities are distributed evenly down the spine instead of concentrating in the mid-thoracic region and thoraco-lumbar junction like more severe deformities do [17, 20, 36, 39, 40], suggesting that they may represent incidental findings. These observations raise questions about the utility of including these mild deformities at all, especially since they represent such a large proportion of the apparent vertebral fractures detected in cross-sectional population surveys. Their exclusion would lower the estimated prevalence of vertebral fractures in postmenopausal white women from 20–25% to about 10% [41].

One of the critical issues in this area of research is the absence of an unequivocal gold standard with which to assess the relative performance of different clinical or morphometric approaches to the recognition of a vertebral fracture [2]. Based on the knowledge that the presence of one vertebral fracture in postmenopausal women increases the risk of another at least fourfold [13, 14], we chose to use subsequent vertebral fracture risk as the metric to explore this issue, as recommended by others [7]. Of course, there are also choices to make with respect to the definition of a new vertebral deformity, but Black and colleagues [6] have made the case that a 20% reduction in vertebral height is not only easy to apply in practice, insofar as no external normative data are required, but also correlates well with height loss, severe back pain, and low lumbar spine bone density and minimizes sample size requirements in clinical trials. With this definition, the overall age-adjusted incidence of a new vertebral fracture in these postmenopausal women was 23 per 1000 person-years, compared to 5.3 per 1000 (comparably adjusted to U.S. white women ≥50 years of age in 2000) for clinically evident vertebral fractures in this population [42]. This ratio of morphometric to clinical fracture incidence rates is consistent with recent results directly comparing the two approaches to vertebral fracture assessment in a clinical trial [43].

Nonetheless, there still may be some limitations to this approach. First, because our focus was on the influence on fracture risk of the deformity definitions themselves, we ignored other variables (e.g., vertebral fracture type and location) that may have prognostic significance [44]. Moreover, to the extent that vertebral heights can increase again following a fracture [45] – for example, “dynamic mobility” [46] – new deformities may be underestimated to some degree by our approach. There are also issues with respect to the comparability of serial radiographs, such as magnification errors [47], but these films were all taken at a single facility (albeit at different times) using consistent methods. The precision of the measurements themselves was quite good (CV: 1.5%) and, excluding the vertebrae considered newly deformed by the 20% change criterion, the mean change between baseline and the last follow-up radiograph in all remaining vertebral heights combined was 1% (interquartile range: −2 to 5%). In addition, the data are limited by the relatively small number of women with follow-up radiographs as well as the fact that films were taken at varying intervals associated with periodic reassessments of the subjects in the underlying prospective epidemiological study. Finally, the study subjects were mostly white women of Northern European extraction.

The c-index data (analogous to the area under the ROC curve) indicate that the overall ability of the different baseline definitions to predict a subsequent vertebral fracture is similar even though some approaches are more sensitive but less specific and vice versa. This suggests that the approach to vertebral fracture assessment may ultimately depend on the information needed. If, for example, it is important not to miss any potential vertebral fractures, then a generous morphometric assessment (high sensitivity) could be useful in flagging films for expert reading, even if most of the mild deformities only represent normal variants in vertebral shape. Done by hand, this entails substantial cost, but it would be more feasible if the vertebral height measurements were automated.

However, none of the deformity definitions clearly distinguished the population at greatest risk of a future fracture, and it has yet to be determined how one could unequivocally recognize from among all mild deformities those that represent early vertebral fractures, where prophylactic treatment might be most beneficial. This would not be necessary if all clinically-evident vertebral fractures involved severe deformities at onset, but this appears not to be the case [43, 48]. Moreover, it was previously shown that 21% of postmenopausal women with a single grade 1 deformity had a clinical history compatible with an acute vertebral fracture [17], and even a 3 SD criterion misses some vertebral fractures identified from changes on serial radiographs [49]. More recently, it was found that a quarter of all clinically-evident vertebral fractures were missed by morphometry altogether [43]. The solution to distinguishing these types of patients morphometrically from the remaining women with mild deformities may lie in consideration of the entire three-dimensional shape of each vertebral body, which is now possible with quantitative computed tomography and magnetic resonance imaging; these new capabilities should be explored. In the interim, more systematic approaches to radiographic assessment [7] may help.