Introduction

Neck pain is a common musculoskeletal disorder. The incidence of neck pain in the Netherlands has been estimated as 23.1 per 1,000 person years [12]. In general, women have more neck pain than men [12, 27]. In the Netherlands, 51% of patients with acute non-specific neck pain who consulted their general practitioners were referred to physiotherapists for treatment [79].

Neck pain may result from many causes (trauma, infections or inflammatory conditions, rheumatic disorders and congenital diseases), but most often no specific cause can be found and the condition is labelled as non-specific neck pain [10]. In their clinical examination, physiotherapists and other healthcare providers may routinely perform an assessment of the active cervical range of motion (ACROM) to assess the level of impairment associated with neck pain as well as the results of treatment. Typical ACROM assessments of the cervical spine include flexion and extension in the sagittal plane, lateral flexion in the frontal plane and rotation in the transverse plane. Tests or instruments which are used to examine ACROM in patients with non-specific neck pain should meet several clinimetric prerequisites, such as an acceptable level of reproducibility, validity and responsiveness [29].

Recently, several systematic reviews have been published on the assessment of passive cervical range of motion (PCROM) and palpation procedures of the cervical spine [62, 64, 75]. Among the palpation procedures, pain provocation tests were found to be the most reliable [62, 64] and the assessment of regional PCROM was found to be more reliable than segmental PCROM [62]. However, in these reviews it was also concluded that most studies were of a poor methodological quality [62, 64, 75].

In 2000, a review was published that assessed the reliability of tools used to measure ACROM. This review concluded that the cervical range of motion device has shown promising reliability [38]. Our overview takes account not only of reliability, but also of validity and responsiveness.

The objective of the present systematic review, therefore, is to provide an overview of the current knowledge on clinimetric properties of instruments that are practical to use when evaluating ACROM in patients with non-specific neck pain.

Methods

Study selection

An extensive search was conducted in the MEDLINE (1982 to January 2007), CINAHL (1982 to January 2007) and EMBASE (1996 to January 2007) databases. The following search terms were used: neck, cervical, reproducibility of results, reliability, reproducibility, validation studies, validity, responsiveness, range of motion, active motion, movement. In addition, after the selection of relevant studies, the specific names of identified assessment instruments were used for an additional computerized search to identify supplementary relevant studies. References from retrieved papers were searched for additional studies. The principal investigator (CK) screened the potentially relevant papers retrieved for eligibility according to the following inclusion criteria:

  • The paper had to be written in English or Dutch;

  • Studies had to pertain to the cervical or upper thoracic spine;

  • Studies had to investigate the reproducibility, validity or responsiveness of instruments or tests for measuring ACROM;

  • The instrument or test used had to be described clearly, enabling eventual replication of the test;

  • The instrument or test had to be portable, affordable (maximum 1,000 Euros) and easy to use (time to test maximum 5 min) by healthcare professionals in daily practice.

Studies were excluded if they were non-published papers (thesis studies).

Data abstraction and quality assessment

We investigated the following clinimetric properties: intraobserver reliability, interobserver reliability, agreement, construct validity, responsiveness and interpretability. To interpret the data, a checklist was composed that was partly based on criteria developed by the Scientific Advisory Committee of the Medical Outcomes Trust [46] and on a checklist developed by Bot et al. [11] (Table 1).

Table 1 Checklist used for the assessment of clinimetric properties of the studies included

Description of the instruments for the assessment of ACROM

Descriptive data extracted from the publications included the target population and the examiners, description of test/instrument and protocol used, description of test–retest interval, blinding of examiners for participants and each other’s or reference test result, and explanation of withdrawals.

Reproducibility

Reproducibility is the extent to which an instrument yields stable scores over time among respondents who are assumed not to have changed [67]. Reproducibility was assessed by rating reliability and agreement. Reliability represents the extent to which individuals can be distinguished from each other, despite measurement errors. Agreement represents a lack of measurement error [67].

The weighted kappa was considered adequate for calculating the reliability of ordinal data, and the intra-class correlation coefficient (ICC) was considered an adequate measure for ordinal or parametric data [31]. Intraobserver reliability and interobserver reliability were rated positive if the ICC was >0.85 and >0.70, respectively [29, 66]. A kappa coefficient above 0.60 was rated positive for intra- and interobserver reliability. This is based on the Landis and Koch scale [43]: 0.41–0.60 moderate correlation, 0.61–0.80 substantial correlation and 0.81–1.00 almost perfect correlation. Application of the Pearson correlation coefficient as a reliability statistic was rated as doubtful, as it neglects systematic observer bias [31].
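The rating rules in this paragraph can be written down as a small decision function. The sketch below is only an illustration in Python: the function names, the string labels and the use of "inadequate" for an ICC below the threshold are assumptions made for the example, not part of the checklist itself.

```python
def landis_koch_label(kappa: float) -> str:
    """Label a kappa value using the Landis and Koch bands quoted in the text."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    # Lower bands are not listed in the text, so they are not named here.
    return "below moderate"


def rate_reliability(statistic: str, value: float, design: str) -> str:
    """Rate a reliability result.

    statistic: "icc", "kappa" or "pearson"
    design:    "intra" (intraobserver) or "inter" (interobserver)
    """
    if design not in ("intra", "inter"):
        raise ValueError("design must be 'intra' or 'inter'")
    if statistic == "pearson":
        # Pearson's r ignores systematic observer bias, hence "doubtful".
        return "doubtful"
    if statistic == "icc":
        threshold = 0.85 if design == "intra" else 0.70
        # Assumption: a value at or below the threshold is labelled "inadequate".
        return "positive" if value > threshold else "inadequate"
    if statistic == "kappa":
        return "positive" if value > 0.60 else "inadequate"
    raise ValueError("unknown statistic: " + statistic)
```

For example, an ICC of 0.80 would be rated positive for interobserver reliability but not for intraobserver reliability, because the intraobserver threshold is stricter.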

Agreement is the ability to achieve the same value with repeated measurements. For this review, calculations of the 95% limits of agreement (LoA), standard error of measurement (SEM), smallest detectable change (SDC) or minimal detectable change (MDC) were considered sufficient. The SDC or MDC reflect the smallest within-person change in score that can be interpreted as a real change, above measurement error [11, 67]. It is not possible to define adequate cut-off points for the result of an agreement study. For that reason, a positive rating was given when an adequate method for agreement was used.
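As an illustration of the agreement parameters named above, the following Python sketch computes them from commonly used formulas. These formulas are assumptions for the example (the review does not prescribe how the included studies calculated them): SEM = SD × √(1 − ICC), MDC at the 95% level = 1.96 × √2 × SEM, and LoA = mean difference ± 1.96 × SD of the differences. All function names and input values are hypothetical.

```python
import math
from statistics import mean, stdev


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the sample SD and the ICC."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem: float) -> float:
    """Minimal (smallest) detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2.0) * sem


def limits_of_agreement(first, second):
    """95% limits of agreement for paired measurements (in degrees)."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)                    # systematic difference
    half_width = 1.96 * stdev(diffs)      # random measurement error
    return bias - half_width, bias + half_width


# Hypothetical rotation measurements (degrees) from two sessions.
session_1 = [50.0, 52.0, 48.0, 55.0]
session_2 = [49.0, 54.0, 47.0, 53.0]
lower, upper = limits_of_agreement(session_1, session_2)
print(f"LoA: {lower:.1f} to {upper:.1f} degrees")
print(f"MDC95: {mdc95(sem_from_icc(10.0, 0.84)):.1f} degrees")
```

A change in a patient's ACROM smaller than the MDC cannot be distinguished from measurement error, which is why these parameters matter for evaluative use.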

Validity

Validity is the degree to which an instrument measures what it is supposed to measure. Construct validity is the extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concept being measured [67]. A Pearson’s correlation coefficient or Spearman correlation coefficient above 0.65 was rated positively for construct validity [29, 66].
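The construct validity criterion can be illustrated with a short Python sketch that computes a Pearson correlation between two instruments and applies the 0.65 cut-off. The function names are hypothetical, and labelling a value at or below the cut-off as "doubtful" follows how such results are rated later in this review.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rate_construct_validity(r: float) -> str:
    """Positive if the correlation exceeds the predefined 0.65 criterion."""
    return "positive" if r > 0.65 else "doubtful"
```

In practice, `x` would hold the measurements of the instrument under study and `y` those of the comparison measure for the same participants.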

Responsiveness

Responsiveness refers to the ability of an instrument to detect important change over time in the concept being measured and is therefore considered to be a measure of longitudinal validity. There is no single agreed method of assessing or expressing an instrument’s responsiveness [29, 67]. Responsiveness is considered to have been studied adequately if hypotheses have been specified and the results corresponded to these hypotheses [11]. It was not possible to define adequate cut-off points for the result of a responsiveness study. For that reason, a positive rating was given when an adequate method for responsiveness was used.

Interpretability

Interpretability is defined as the degree to which scores and change scores can be interpreted and qualitative meaning can be assigned to quantitative scores. The investigators should provide information about what difference in score would be clinically meaningful. We rated this on the basis of whether the authors had presented a minimal important change (MIC) or if information was presented that could aid in interpreting scores—for instance, presentation of means and standard deviations (SD) of patient scores before and after treatment, data on distribution of scores in relevant subgroups and relating changes in the instrument score to patients’ global perceived change [11, 67].

Overall quality

To obtain an overall score for the quality of the instruments, the number of positive ratings on the above-mentioned points was summed for each instrument.
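The summation can be sketched in a few lines of Python. The rating labels and the example ratings below are hypothetical and serve only to show the counting; they are not results from this review.

```python
from typing import Mapping


def overall_quality(ratings: Mapping[str, str]) -> int:
    """Count how many clinimetric properties received a positive rating."""
    return sum(1 for rating in ratings.values() if rating == "positive")


# Hypothetical ratings for one instrument.
example_ratings = {
    "intraobserver reliability": "positive",
    "interobserver reliability": "positive",
    "agreement": "doubtful",
    "construct validity": "positive",
    "responsiveness": "insufficient",
}
print(overall_quality(example_ratings))
```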

Two investigators (C.K. & S.H.) independently assessed the studies included according to the criteria list. Disagreements between the reviewers were resolved by discussion. If disagreement persisted about the assignment of a score to an item, a third person (E.H.) was consulted to decide on the final rating.

Results

Selection of the studies

The search generated 549 hits. After screening titles and abstracts, 481 studies were excluded. Of the remaining 68 studies, 33 publications were included after reading the full article [1–3, 8, 15, 16, 20, 32–37, 41, 48, 51, 53–56, 59, 61, 68–71, 73, 77, 78, 80, 84–86].

Reasons for exclusion were the cost of the instruments used (n = 24; these instruments were mainly tested in university laboratories and we estimated that they cost more than 1,000 Euros) [4, 5, 7, 14, 17, 19, 22–24, 28, 39, 40, 42, 44, 45, 49, 50, 52, 57, 58, 63, 65, 81, 83]; no clinimetric evaluation of the instrument (n = 9) [6, 9, 18, 24, 25, 30, 60, 72, 82]; and assessment of PCROM only [47].

The instruments that were included based on the criteria of clinical acceptability were goniometers/inclinometers (30 articles), visual estimation (two articles) and tape measurements (four articles). Relevant data on study population, examiners, study protocol and results from these studies are displayed in Tables 2, 3 and 4. Initially, there was 94% agreement on the items that were rated. All disagreements between the two reviewers were resolved by discussion.

Table 2 Characteristics of included studies as regards goniometric/inclinometric measurements
Table 3 Characteristics of included studies as regards visual estimation
Table 4 Characteristics of included studies as regards tape measurement

Goniometers/inclinometers

Goniometers are versatile devices that measure range of motion in degrees and depend on landmarks. Inclinometers are fluid-filled goniometric instruments that depend on gravity. In the literature we included, the terms goniometer and inclinometer are used interchangeably. The instruments included are: variations of universal goniometers [20, 41, 48, 56, 73, 86], a Myrin goniometer [8, 78], a Spin-T goniometer [1, 2, 32], the cervical range of motion instrument (CROM) [16, 33, 54, 55, 61, 69–71, 84, 85], single inclinometers [15, 20, 33, 59, 80], an electronic digital inclinometer EDI-320 [35, 68], a simple inclinometer [51], a gravity action inclinometer [37], a liquid inclinometer [3] and a spirit inclinometer [53] (Fig. 1).

Fig. 1 Inclinometer

Information on reliability was found for all instruments. In general, the ICC was reported, except in most articles published before 1995. The reliability of the CROM, a single inclinometer and the EDI-320 was rated positively (ICC intraobserver > 0.85, ICC interobserver > 0.70).

Information on agreement was found for the CROM [54], a universal goniometer [20], the Spin-T goniometer [32], a single inclinometer [20, 59, 80] and the EDI-320 [35].

The EDI-320 (LoA intraobserver: F/E −2.5 ± 11.1°, LFL −0.1 ± 10.4°, ROT −5.9 ± 13.5°; LoA interobserver: F/E 3.3 ± 17°, LFL 0.5 ± 17°, ROT −1.3 ± 24.6°; F/E, flexion/extension; LFL, lateral flexion; ROT, rotation) and a single inclinometer (SEM ranged from 3.6° to 7° and MDC ranged from 10° to 19°) were rated positively (Fig. 2).

Fig. 2 Cervical range of motion instrument

Construct validity of the CROM was determined by comparing the CROM with radiography, a single inclinometer and opto-electronic systems [33, 55, 69–71]. The Spin-T goniometer was compared with motion star equipment [2]. The rangiometer and universal goniometer were compared with each other [86] and with the age and disease duration of patients with ankylosing spondylitis [48]. A single inclinometer was compared with results on the neck disability index (NDI) [59] and with radiography [15].

Construct validity of the Spin-T goniometer, universal goniometer, rangiometer and single inclinometer was rated as doubtful, either because statistics other than the predefined Pearson correlation coefficient (r) were used or because r was below 0.65. The CROM received a positive rating (r > 0.65).

Information on responsiveness was found only for the universal goniometer, measured in patients with ankylosing spondylitis [48]; its responsiveness was therefore rated as doubtful for the present patient group.

Visual estimation

In visual estimation, the patient sits and the examiner visually estimates the ACROM. Two articles described the interobserver reliability of visual estimation [34, 77]. Both articles presented kappa values that were below the predefined criteria, and the reliability of visual estimation was therefore rated as inadequate. Information on agreement, validity and responsiveness was not found.

Tape measurement

Neck mobility was measured with tape in centimetres, with different landmarks as reference marks. Different measurement protocols were found. Two studies reported the Pearson correlation coefficient [8, 36] as a statistical measure of intra- and interobserver reliability. Information on reliability (described by the ICC), agreement and responsiveness was found for patients with ankylosing spondylitis [48, 78]. Tape measurement was rated as doubtful for reproducibility and responsiveness.

Overall quality

The rating of the clinimetric qualities of the instruments included is presented in Table 5, summarizing each aspect as positive, inadequate, doubtful or insufficient quality. Only a few studies gave an adequate description of the study design and population characteristics. Eight studies included only patients with non-specific neck pain; twenty publications provided insufficient information on the methodological aspects to enable a good appraisal of the study design. Furthermore, information about non-response and subjects’ loss to follow-up was often lacking.

Table 5 Summary of the quality assessment of clinimetric properties of the instruments included

Discussion

An extensive search strategy led to the identification of three different types of instruments for the evaluation of active cervical ROM which are easily applicable in daily clinical practice and for which clinimetric properties have been investigated. Overall, the CROM, a single inclinometer and the EDI-320 had the best ratings on such clinimetric aspects as reproducibility, validity and responsiveness. When clinical acceptability is taken into account, the CROM or a single inclinometer can be considered the most appropriate instruments for the assessment of ACROM in patients with non-specific neck pain. In other patient groups, for example patients with cervical disc prostheses, the CROM could possibly be used to monitor range of motion during follow-up. The authors believe that the CROM may be a cheap and safe alternative to radiography. Radiography would then be needed only in those patients in whom ACROM does not improve as expected.

The CROM has been studied most extensively. None of the instruments received positive ratings for all items of the methodological quality checklist. It has been advocated that agreement parameters are required for instruments that are used for evaluative purposes and reliability parameters are required for instruments that are used for discriminative purposes [21]. Instruments used to measure ACROM are mainly used for evaluative purposes, but reliability parameters have been studied more extensively than agreement parameters. Agreement, interpretability and responsiveness are clinimetric properties that, in general, have not been tested. For evaluative purposes, these clinimetric properties are important because measurement error should be smaller than the minimal change that is considered to be important [67]. Agreement parameters from studies with an adequate study design were found only for the single inclinometer and the EDI-320. MDC values of a single inclinometer ranged from 10° (lateral flexion) to 16° (flexion/extension) [20, 59, 80].

The methodological quality of the studies included varied. In total we included 33 publications. In general, the methodological quality of the older literature (published before 1995) is lower than that of the more recent literature. Older articles mainly describe reliability, and eight of those articles did not use the ICC as a statistical measure of reliability [3, 8, 16, 36, 37, 41, 51, 86].

In order to ensure external validity it is necessary to include patients with neck pain who are likely to undergo the same measurement procedure in daily practice [13]. Nineteen articles used healthy subjects [2, 3, 8, 15, 16, 32, 33, 36, 51, 53–56, 68, 69, 71, 73, 85, 86], four articles used patients other than non-specific neck pain patients [37, 48, 78, 80] and nine articles included patients with non-specific neck pain [20, 34, 35, 41, 59, 61, 70, 77, 84].

Thirteen publications did not describe the training that the raters underwent before the test, or its results, although this is an important aspect of external validity [74]. Blinding is an important aspect of the internal validity of a study. It can be divided into examiner–participant blinding and examiner–examiner blinding. Examiner–participant blinding is often not described in the publications included in this review. Examiner–examiner blinding is given more consideration in the publications.

This review has several limitations. Although much effort was made to find all the published studies, selection bias may have occurred because only Dutch or English-language articles were included. Effort was put into reference tracking but it is possible that studies were missed. Furthermore, unpublished studies were not included. Reviewer bias is also a possible limitation of this review because reviewers were not blinded to the authors.

There is no gold standard for evaluating clinimetric qualities. The checklist used in our review was based on the checklist made by Bot et al. [11]. This list has been used previously for patient-assessed questionnaires instead of instruments to evaluate the functional status of the patient [11, 26, 76]. This checklist, however, was chosen for its quality and international consensus on terminology. Assigning value labels for ranges of Kappa and ICC statistics and correlation coefficients was done in accordance with other authors [29, 43, 66, 67].

Jordan [38] concluded in his review that the CROM has shown promise as regards reliability but its practicality for clinical use is questionable because of the costs involved and its dimensions. The author also concluded that evidence on the single inclinometer is lacking. Since the publication of this review new data have been published on the validity of the CROM [69–71] and on the reproducibility of the inclinometer [15, 59, 80]. These studies generally have a high methodological quality and show good reproducibility of the inclinometer and good construct validity of the CROM.

The findings of this systematic review have implications for research and clinical practice. Researchers should give careful consideration to the study design and presentation of the results. The construct validity of a single inclinometer should be investigated by making comparisons with other instruments to measure ACROM. Future research should also report agreement parameters. Clinicians need to be cognizant that ACROM should be measured with the CROM or a single inclinometer and that visual estimation is not reliable. Furthermore, future research should study different patient groups, including, for example, patients with cervical disc prostheses, in order to validate the CROM against radiography.

Conclusion

The present review provides information for researchers and clinicians to facilitate choice amongst existing instruments for measuring ACROM. A systematic computerized literature search of databases revealed three different types of instruments that are practical to use when measuring ACROM in patients with non-specific neck pain: visual estimation, tape measurements and different types of goniometers/inclinometers. When a healthcare professional decides that measuring ACROM in a patient with non-specific neck pain is necessary, a single inclinometer and the CROM are recommended on the basis of their ratings for clinimetric properties and practicality. Visual estimation should not be used to measure ACROM.