Spatial modelling of disease using data- and knowledge-driven approaches

https://doi.org/10.1016/j.sste.2011.07.007

Abstract

The purpose of spatial modelling in animal and public health is three-fold: describing existing spatial patterns of risk, attempting to understand the biological mechanisms that lead to disease occurrence and predicting what will happen in the medium to long-term future (temporal prediction) or in different geographical areas (spatial prediction). Traditional methods for temporal and spatial predictions include general and generalized linear models (GLM), generalized additive models (GAM) and Bayesian estimation methods. However, such models require both disease presence and absence data which are not always easy to obtain. Novel spatial modelling methods such as maximum entropy (MAXENT) and the genetic algorithm for rule set production (GARP) require only disease presence data and have been used extensively in the fields of ecology and conservation, to model species distribution and habitat suitability. Other methods, such as multicriteria decision analysis (MCDA), use knowledge of the causal factors of disease occurrence to identify areas potentially suitable for disease. In addition to their less restrictive data requirements, some of these novel methods have been shown to outperform traditional statistical methods in predictive ability (Elith et al., 2006). This review paper provides details of some of these novel methods for mapping disease distribution, highlights their advantages and limitations, and identifies studies which have used the methods to model various aspects of disease distribution.

Highlights

► A range of data- and knowledge-driven spatial modeling methods are reviewed. ► Data-driven methods include those that use presence-absence or presence-only data. ► Knowledge-driven modeling focuses on GIS-based multi-criteria decision analysis. ► Approaches for model validation and describing model uncertainty are outlined.

Introduction

The purpose of spatial modelling in animal and public health is three-fold: describing existing spatial patterns of risk using techniques such as kernel smoothing, kriging or Bayesian smoothing (i.e. descriptive), attempting to understand biological mechanisms that lead to the occurrence of disease (i.e. explanatory) and attempting to predict what will happen in the medium to long-term future or in different geographical areas (i.e. predictive). The results of such models are used for a variety of purposes including targeting areas for surveillance, risk management, simulating different control scenarios, predicting what will happen under different environmental conditions such as those resulting from climate change (i.e. temporal prediction) and identifying new geographical areas suitable for the introduction of diseases (i.e. spatial prediction).

Descriptive risk mapping aims to illustrate the spatial variation in disease risk while simultaneously removing excess noise or outliers using interpolation techniques such as kernel smoothing, kriging or Bayesian methods (Pfeiffer et al., 2008b). The data used in such maps are frequently obtained via surveys or surveillance. These data are only really useful when accompanied by denominator data so that the distribution of the population can be accounted for by calculating and mapping the disease risk or rate. However, descriptive risk maps can be incomplete or even misleading as they are frequently based on data which may have a sampling bias. In contrast, predictive risk mapping aims to identify new, unsampled geographical areas suitable for disease by extrapolating beyond the boundary of the data points used in the model (Peterson et al., 2004). However, extrapolation generally leads to greater uncertainty associated with the risk estimate. Such models can be thought of as disease distribution models (DDM), similar to the species distribution models (SDM) widely used in the fields of ecology and conservation to describe the species niche and to identify habitats suitable for supporting a species. SDM have been described as ‘geographical modeling of biospatial patterns in relation to environmental gradients’ (Franklin, 1995) but unlike SDM which are generally concerned with macro-organisms, DDM are more concerned with modelling the distribution and habitat suitability of disease micro-organisms or their vectors. 
Although the same methods and principles apply independent of the physical size of the species being modeled or whether the purpose is conservation of endangered species or identification of areas suitable for the introduction of a disease vector or micro-organism, modelling potential disease distribution introduces additional layers of complexity to methods which are generally best suited to modelling the relationship between a single species and its environment. Thus, certain SDM, while useful tools for modelling potential disease distribution, may not always be the most appropriate method for modelling complex disease patterns.

Traditional methods of temporal and spatial prediction include general and generalized linear models (GLM), generalized additive models (GAM) and Bayesian estimation methods (Lawson, 2006). These models comprise well-established algorithms and can effectively account for spatial dependence within the data. However, such models require both disease presence and disease absence points. Absence data are generally collected through observational studies which can be costly and, due to logistical constraints, may only cover a small geographical area. Yet DDM results are frequently applied to large areas (e.g. south-east Asia or Africa), even though the only available inputs may be disease presence data obtained through surveillance or knowledge of the causal factors leading to disease occurrence.
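To make the presence/absence requirement concrete, the following is a minimal sketch of a logistic GLM with a single environmental predictor, fitted by plain gradient ascent. The function names, the toy data and the learning-rate settings are assumptions made for illustration; they are not taken from the paper, and a real analysis would use an established statistics package and account for spatial dependence.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(presence) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood. ys must contain both 1 (presence)
    and 0 (absence) records."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)
            g0 += residual
            g1 += residual * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability of disease presence at predictor value x."""
    return sigmoid(b0 + b1 * x)

# Hypothetical survey data: a scaled environmental predictor and
# the observed disease status at each surveyed location.
predictor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
status = [0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(predictor, status)
risk_at_unsampled_site = predict(b0, b1, 0.75)
```

Without the absence records (the zeros in `status`) the likelihood above cannot be formed, which is the practical constraint the paragraph describes.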

In the past decade the ecological and conservation fields have seen an increase in the application of SDM which require only presence data. These models, which are static and probabilistic, relate the species distribution to their current environment through a range of predictor variables believed to limit the species distribution (Guisan and Zimmermann, 2000). In addition to only requiring presence data, these models can also be extrapolated beyond the current distribution of the species or disease to show the potential distribution further afield. Some of these models have been shown to have predictive ability equal to or greater than that of traditional statistical methods (Elith et al., 2006) and have begun to be used in the animal and public health fields to produce maps showing the potential distribution of a variety of diseases (Peterson, 2006). This review paper provides details of some of these novel methods for modelling the spatial distribution of disease (which can be divided into data- and knowledge-driven methods), highlights their advantages and limitations, and identifies studies which have used the methods to model various aspects of disease distribution.

Section snippets

Classification and regression trees (CART)

Classification and regression trees (CART) is a non-parametric, decision-tree-based method which explains the variation in a response variable with respect to one or more predictor variables (Breiman et al., 1984, Sutton et al., 2005). Classification trees and regression trees are decision trees applied to qualitative and quantitative response variables, respectively. Construction of a decision tree occurs in three stages: tree building, tree stopping and tree pruning (Olden et al., 2008). During the building phase a tree is
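As a rough illustration of the tree-building stage only, the sketch below searches one predictor for the binary split that minimises the weighted Gini impurity of a presence/absence response. The choice of Gini impurity, the function names and the toy data are assumptions for illustration; stopping and pruning are not shown.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split
    on a single predictor, or (None, parent_gini) if no split helps."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: a temperature-like predictor and disease status.
temperature = [10, 12, 15, 20, 22, 25]
disease = [0, 0, 0, 1, 1, 1]
print(best_split(temperature, disease))  # (17.5, 0.0)
```

A full tree would apply `best_split` recursively to each resulting subset and across all predictors until a stopping rule is met.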

Data-driven methods: presence-only data

Although statistical and machine learning methods require both disease presence and absence data, absence data can be problematic because it can be difficult to distinguish between the absence of a disease and the lack of observation or reporting of disease events in an area. Alternatively, the disease species may be absent – even though the habitat is suitable for its occurrence – due to a geographical or man-made barrier preventing its spread into the area (Hirzel et al., 2002). These
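One workaround sometimes used when true absences are unreliable is to draw pseudo-absence (background) locations at random from cells with no recorded presence. The sketch below shows that idea only; it is an assumption of this example rather than a method described in the snippet above, and the function and variable names are hypothetical.

```python
import random

def sample_pseudo_absences(grid_cells, presence_cells, n, seed=0):
    """Draw n cells at random from the study grid, excluding every
    cell that holds a presence record. Cells are (row, col) tuples."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    candidates = sorted(set(grid_cells) - set(presence_cells))
    if n > len(candidates):
        raise ValueError("not enough non-presence cells to sample from")
    return rng.sample(candidates, n)

# Hypothetical 5 x 5 study grid with three presence records.
grid = [(r, c) for r in range(5) for c in range(5)]
presences = [(0, 0), (2, 3), (4, 4)]
pseudo_absences = sample_pseudo_absences(grid, presences, n=6)
```

Such points are not confirmed absences: a sampled cell may be suitable but unreported, or unreachable because of a barrier, which is exactly the ambiguity raised above.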

Comparison of data-driven methods

A comprehensive comparison of 16 modelling methods over 226 animal species from six regions of the world found that some of the newer methods outperformed the traditional statistical methods with regard to predicting species distribution (Elith et al., 2006). The authors divided the models into three groups based on their performance as assessed by AUC and correlation. The group with the best performance included BRT and MAXENT; most of the standard regression methods (GLM and GAM), including
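For readers unfamiliar with AUC, the sketch below computes it in its rank-based form: the proportion of presence/absence pairs in which the presence site receives the higher predicted score, with ties counted as one half. The function name and the toy scores are assumptions for illustration and do not reproduce the evaluation procedure of the cited comparison.

```python
def auc(scores_presence, scores_absence):
    """Area under the ROC curve as the probability that a randomly
    chosen presence site outscores a randomly chosen absence site."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))

# Hypothetical model predictions at held-out sites.
presence_scores = [0.9, 0.8, 0.4]
absence_scores = [0.5, 0.3, 0.2]
print(round(auc(presence_scores, absence_scores), 3))  # 0.889
```

A value of 0.5 corresponds to a model that ranks sites no better than chance, and 1.0 to perfect separation of presence and absence sites.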

GIS-based multicriteria decision analysis (MCDA)

In data-sparse situations, maps identifying areas suitable for disease can be produced using GIS-based multicriteria decision analysis (MCDA) – also known as multicriteria decision modelling or making (MCDM). This method uses decision rules derived from existing knowledge to identify areas potentially suitable for disease (Pfeiffer et al., 2008a). Details of the modelling process have been provided by a number of authors (Eastman, 1997, Eastman et al., 1995, Malczewski, 1999, Malczewski, 2000,
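One decision rule commonly used in GIS-based MCDA is a weighted linear combination of standardised factor layers, optionally masked by a Boolean constraint layer. The sketch below assumes that rule, which the snippet above does not itself specify; the factor names, weights and grids are hypothetical.

```python
def weighted_linear_combination(factors, weights, constraint=None):
    """Combine standardised factor grids (values in 0..1) into one
    suitability grid. Weights are normalised to sum to 1. Cells where
    the constraint grid is False are set to 0 (unsuitable)."""
    total = sum(weights[name] for name in factors)
    names = list(factors)
    rows = len(factors[names[0]])
    cols = len(factors[names[0]][0])
    suitability = []
    for r in range(rows):
        row = []
        for c in range(cols):
            score = sum(weights[k] / total * factors[k][r][c] for k in names)
            if constraint is not None and not constraint[r][c]:
                score = 0.0
            row.append(score)
        suitability.append(row)
    return suitability

# Hypothetical 1 x 2 raster with two factors, weights from expert opinion.
factors = {"factor_a": [[1.0, 0.0]], "factor_b": [[0.0, 1.0]]}
weights = {"factor_a": 3, "factor_b": 1}
print(weighted_linear_combination(factors, weights))  # [[0.75, 0.25]]
```

Because the weights come from existing knowledge rather than from fitted data, the resulting map shows relative suitability and should not be read as an estimated probability of disease.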

Model uncertainty and validation

Disease maps should always be used in conjunction with explicit information regarding the source and magnitude of any uncertainties incorporated in the modelling process (Barry and Elith, 2006, Pfeiffer et al., 2008a). Model uncertainty can be either epistemic (e.g. measurement error, natural variability, model uncertainty) or stochastic (i.e. uncertainty due to inherent variability in the underlying biological processes) (Refsgaard et al., 2007). Although uncertainty resulting from data errors

Conclusion

Spatial modelling and disease mapping are seldom ends in themselves. Rather, they are a means to an end: tools for informing surveillance, decision making and disease risk management. However, there is a need to communicate the outputs of such models more effectively to gain the trust of decision makers and risk managers. In addition, modellers need to communicate any uncertainty and biases inherent in the model outputs in a way that makes them easily understood by decision makers so that

References (107)

  • D.R.B. Stockwell et al. Effects of sample size on accuracy of species distribution models. Ecol Modell (2002)
  • D.R.B. Stockwell et al. The use of the GARP genetic algorithm and internet grid computing in the Lifemapper world atlas of species biodiversity. Ecol Modell (2006)
  • J.C.Z. Adjemian et al. Analysis of genetic algorithm for rule-set production (GARP) modeling approach for predicting distributions of fleas implicated as vectors of plague, Yersinia pestis, in California. J Med Entomol (2006)
  • D. Ayala et al. Habitat suitability, ecological niche profile of major malaria vectors in Cameroon. Malar J (2009)
  • S. Barry et al. Error and uncertainty in habitat models. J Appl Ecol (2006)
  • M. Benedict et al. Spread of the tiger: global risk of invasion by the mosquito Aedes albopictus. Vector Borne Zoonotic Dis (2007)
  • J.K. Blackburn et al. Modeling the geographic distribution of Bacillus anthracis, the causative agent of anthrax disease, for the contiguous United States using predictive ecologic niche modeling. Am J Trop Med Hyg (2007)
  • G.E.P. Box et al. Empirical model-building and response surfaces (1987)
  • L. Breiman. Bagging predictors. Mach Learn (1996)
  • L. Breiman. Random forests. Mach Learn (2001)
  • L. Breiman et al. Classification and regression trees (1984)
  • J. Cappelle et al. Ecological modeling of the spatial distribution of wild waterbirds to identify the main areas where avian influenza viruses are circulating in the Inner Niger Delta, Mali. EcoHealth (2010)
  • L. Chamaillé et al. Environmental risk mapping of canine leishmaniasis in France. Parasit Vectors (2010)
  • A.C.A. Clements et al. Application of knowledge-driven spatial modelling approaches and uncertainty management to a study of Rift Valley fever in Africa. Int J Health Geog (2006)
  • A.C.A. Clements et al. Spatial risk assessment of Rift Valley fever in Senegal. Vector Borne Zoonotic Dis (2007)
  • M.G. Colacicco-Mayhugh et al. Ecological niche model of Phlebotomus alexandri and P. papatasi (Diptera: Psychodidae) in the Middle East. Int J Health Geog (2010)
  • M. de Souza Muñoz et al. OpenModeller: a generic approach to species’ potential distribution modelling. GeoInformatica (2009)
  • G. De’ath. Boosted trees for ecological modeling and prediction. Ecology (2007)
  • G. De’ath et al. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology (2000)
  • K. DeJong. Learning with genetic algorithms: an overview. Mach Learn (1988)
  • A.P. Dempster. New methods for reasoning towards posterior distributions based on sample data. Ann Math Stat (1966)
  • A.P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Ann Math Stat (1967)
  • J.R. Eastman. GIS and uncertainty management: new directions in software development. Lurralde (1997)
  • J.R. Eastman. IDRISI andes: guide to GIS and image processing (2006)
  • J.R. Eastman et al. Raster procedures for multi-criteria/multi-objective decisions. Photogramm Eng Remote Sensing (1995)
  • J. Elith et al. Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. Ecography (2009)
  • J. Elith et al. Novel methods improve prediction of species’ distributions from occurrence data. Ecography (2006)
  • J. Elith et al. A working guide to boosted regression trees. J Anim Ecol (2008)
  • A.A. Elnaggar et al. Application of remote-sensing data and decision-tree analysis to mapping salt-affected soils over large areas. Remote Sens (2010)
  • R. Engler et al. An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data. J Appl Ecol (2004)
  • A. Estrada-Pena et al. Climate niches of tick species in the Mediterranean region: modeling of occurrence data, distributional constraints and impact of climate change. J Med Entomol (2007)
  • J.M. Fitzpatrick et al. Genetic algorithms in noisy environments. Mach Learn (1988)
  • J. Franklin. Predictive vegetation mapping: geographic modeling of biospatial patterns in relation to environmental gradients. Prog Phys Geog (1995)
  • J. Franklin. Model evaluation
  • J. Franklin. Machine learning methods
  • J. Franklin. Implementation of species distribution models
  • J. Franklin. Classification, similarity, and other methods for presence-only data
  • Freund Y, Schapire R. Experiments with a new boosting algorithm. Machine Learning: Proceedings of the 13th...
  • C. Gonzalez et al. Climate change and risk of leishmaniasis in North America: predictions from ecological niche models of vector and reservoir species. PLoS Negl Trop Dis (2010)
  • J. Grinnell. Field tests of theories concerning distributional control. Am Nat (1917)