Validity and reliability of the on-road driving assessment with senior drivers

https://doi.org/10.1016/j.aap.2007.09.012

Abstract

The on-road driving assessment is widely regarded as the criterion measure for driving performance despite a paucity of evidence concerning its psychometric properties. The purpose of this study was 2-fold. First, we examined the psychometric properties of an on-road driving assessment with 100 senior drivers between 60 and 86 years (80 healthy volunteers and 20 with specific vision deficits) using Rasch modeling. Second, we compared the outcome of the gestalt decision made by trained professionals with that based on weighted error scores from the standardized assessment.

Rasch analysis provided good evidence for construct validity and inter-rater reliability of the on-road assessment and some evidence for internal reliability. Goodness-of-fit statistics for all items were within an acceptable range and the item hierarchy was logical. The test had a moderate reliability index (0.67). The best cut-off score yielded sensitivity of 81% and specificity of 95% compared with the gestalt decision. Further research is required with less competent drivers to more fully examine reliability.

Healthy senior drivers failed to check blind spots when changing lanes and made errors when asked to report road markings and traffic signs as they drove. In addition, unsafe drivers had difficulty negotiating intersections and lane changes.

Introduction

When assessing driving performance of “at risk” drivers, validating off-road tests of driving competence or examining effectiveness of driving simulators, the on-road assessment has traditionally been the criterion measure. Because of face validity, the on-road assessment is assumed to be the most accurate measure of driving competence. However, despite being widely used, there is a paucity of statistical evidence for the validity and reliability of the on-road assessment (Withaar et al., 2000).

Researchers and clinicians generally agree that an on-road assessment should be conducted on a standardized route in real traffic, in a vehicle with dual controls and with separation of responsibility for maintenance of safety and scoring of driving performance (Fox et al., 1998, Mazer et al., 2004). To determine the outcome of the on-road assessment, a gestalt decision based on overall driving performance has traditionally been used, but a decision based on standard observation and scoring procedures is increasingly being recommended (Di Stefano and Macdonald, 2003, Justiss et al., 2006, Odenheimer et al., 1994, Withaar et al., 2000). Different scoring procedures have been employed. Some researchers (Galski et al., 1990, Hunt et al., 1997, Justiss et al., 2006, Odenheimer et al., 1994) rated performance on specific manoeuvres while others (Baldock et al., 2006, Dobbs et al., 1998, Janke and Eberhard, 1998, Staplin et al., 1998) weighted errors according to severity. These total scores were then compared with a gestalt decision.

The results of these studies have been mixed, with some finding that the gestalt decision is consistent with the scored decision (Baldock et al., 2006, Dobbs et al., 1998, Hunt et al., 1997, Justiss et al., 2006, Odenheimer et al., 1994, Staplin et al., 1998) and others finding that only some behaviours or errors are related to the decision (Galski et al., 1990, Janke and Eberhard, 1998). More recently, the need for driving instructor intervention rather than error scores was found to be predictive of failure (Di Stefano and Macdonald, 2003). Inconclusive results such as these raise questions about the theoretical construct being evaluated. The deconstruction of the complex task of driving into smaller component parts may result in the loss of critical information (i.e., the whole is greater than the sum of the parts). In addition, previous research comprised varying sample sizes, different client groups (healthy older drivers, “at risk” drivers and medically impaired drivers) and employed varying statistical analyses, all of which could have contributed to the lack of consistent findings.

Item response theory (IRT) with its focus on the items rather than multiple parameters has been identified as a means of addressing important issues associated with on-road evaluation (Justiss et al., 2006) including using ordinal rather than interval scores, uni- versus multi-dimensionality, sample-free measurement, and logicality of the item hierarchy. Rasch modeling, the simplest of the IRT models, is being used increasingly to evaluate tests of human performance to address these concerns. The relationship between person ability and item difficulty is the basis of Rasch analysis. It converts ordinal scores into interval scores, orders items and people on a continuum of difficulty and ability respectively, and examines goodness of fit of items and people along the line (Bond and Fox, 2001).
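The relationship between person ability and item difficulty described above can be illustrated with a minimal sketch. It assumes the simplest, dichotomous form of the Rasch model, in which the probability of success depends only on the difference between ability and difficulty, both expressed in logits. The function name, driver labels and numeric values are hypothetical and are not taken from the study, which may have used a different Rasch variant.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical person abilities and item difficulties (logits),
# chosen only to illustrate the ordering along a single continuum.
abilities = {"driver_A": -1.0, "driver_B": 0.0, "driver_C": 1.5}
difficulties = {"easy_item": -1.0, "hard_item": 1.0}

for driver, theta in abilities.items():
    row = {
        item: round(rasch_probability(theta, b), 2)
        for item, b in difficulties.items()
    }
    print(driver, row)
```

When ability equals difficulty the probability of success is 0.5, and it rises as ability exceeds difficulty, which is what allows items and people to be placed on the same interval scale.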

The purpose of this study was 2-fold. First, we examined the psychometric properties of a standard on-road assessment with healthy older drivers and drivers with vision deficits using Rasch modeling. Second, we compared the outcome of the gestalt decision made by trained professionals with that based on weighted error scores from the standardized assessment.
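The second aim, comparing a score-based decision with the gestalt decision, can be sketched as follows. This is an illustration under stated assumptions and not the authors' analysis: the weights of 1, 2 and 5 are those given in the recommendations for practice in this article, whereas the driver data, the cut-off of 10 and the function names are hypothetical. A "positive" is taken to mean a fail, so sensitivity is the proportion of gestalt fails that the score also flags.

```python
from typing import Dict, Sequence, Tuple

# Weights for habitual, hazardous and critical errors, as given in the
# recommendations for practice; all other values below are hypothetical.
WEIGHTS = {"habitual": 1, "hazardous": 2, "critical": 5}


def weighted_error_score(error_counts: Dict[str, int]) -> int:
    """Sum error counts multiplied by their severity weights."""
    return sum(WEIGHTS[kind] * count for kind, count in error_counts.items())


def sensitivity_specificity(
    scores: Sequence[int], gestalt_fail: Sequence[bool], cutoff: int
) -> Tuple[float, float]:
    """Compare score >= cutoff (predicted fail) with the gestalt decision."""
    tp = fn = tn = fp = 0
    for score, failed in zip(scores, gestalt_fail):
        predicted_fail = score >= cutoff
        if failed and predicted_fail:
            tp += 1
        elif failed:
            fn += 1
        elif predicted_fail:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical drivers: (error counts, gestalt decision was "fail")
drivers = [
    ({"habitual": 3, "hazardous": 0, "critical": 0}, False),
    ({"habitual": 5, "hazardous": 1, "critical": 0}, False),
    ({"habitual": 2, "hazardous": 4, "critical": 1}, True),
    ({"habitual": 6, "hazardous": 3, "critical": 2}, True),
    ({"habitual": 8, "hazardous": 2, "critical": 0}, False),
    ({"habitual": 4, "hazardous": 2, "critical": 0}, True),
]

scores = [weighted_error_score(counts) for counts, _ in drivers]
gestalt = [failed for _, failed in drivers]
sens, spec = sensitivity_specificity(scores, gestalt, cutoff=10)
print(scores, round(sens, 2), round(spec, 2))
```

Repeating the last step over a range of cut-off values is one simple way to look for the cut-off that best balances sensitivity and specificity.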

Study design

The study was a prospective, masked, observational design approved by The University of Sydney Human Research Ethics Committee. It was part of a larger investigation of the relationship between vision and driving performance. Only Stage 1 of three stages is presented in this paper.

Participants

A group of 100 senior (≥60 years) volunteer drivers was recruited from the community and from referrals by ophthalmologists in Sydney, Australia. Community members were recruited through Probus Clubs (Senior Rotarians

Weighting of errors

The frequency of each error is recorded in Table 1. Applying a weighting of “1”, “5” and “10”, respectively, for habitual, hazardous and critical errors yielded a separation index of 1.14, a participant reliability index of .57, and poor item hierarchy and separation on the map. A weighting of “3”, “5” and “10”, respectively, yielded an improved separation index of 1.4, a participant reliability index of .60 and an improved item hierarchy map. Finally, applying a weighting of

Discussion

The purpose of this study was to examine validity and reliability of the on-road driving assessment with healthy older drivers and those with vision deficits and to determine how accurately the total error score matched a gestalt decision for on-road driving performance. The findings yielded strong evidence for construct validity indicating that the on-road test measures a single theoretical construct, namely driving errors, indicative of driving safety. There was also strong evidence for

Limitations

Several limitations contribute to the need to exercise caution in generalizing the findings of this study. Firstly, the on-road assessment was shorter (duration of 20–30 min) than many others (duration of 45–60 min) reported in the literature. Most of the required elements were included, but Rasch analysis identified that a sufficiently demanding cognitive task was also required to separate levels of competent drivers. Secondly, there was a small number of participants with vision deficits (N = 20)

Conclusion and recommendation for future research

Rasch analysis of the on-road driving assessment provided strong evidence for construct validity and inter-rater reliability and limited evidence for internal reliability. The addition of more cognitively demanding items and using the assessment with a more varied population, specifically with less competent drivers, would reveal important additional information about the test's psychometric properties and provide avenues of further research. The total error score predicted the assessment

Recommendations for practice

To ensure validity and reliability, on-road driving assessments for senior drivers should be conducted over a standardized route using a vehicle with dual controls to ensure safety. It is recommended that clinicians record errors in performance, then categorize them as habitual, hazardous or critical and weight them by a factor of “1”, “2” or “5”, respectively, for severity of threat to safety. Although clinicians generally do not have access to software to enable them to statistically analyze

Acknowledgements

This study was funded by the Faculty of Health Sciences, The University of Sydney. Lynnette Kay's contribution was also funded by an Australian Postgraduate Award as the study was undertaken in partial fulfillment of the requirements for her Ph.D. The authors wish to thank the clinicians at Driver Rehabilitation and Fleet Safety Services and the Discipline of Orthoptics at The University of Sydney for their support in conducting the study.

References (23)

  • T. Galski et al. An assessment of measures to predict the outcome of driving evaluations in patients with cerebral damage. Am. J. Occup. Ther. (1990)