Spatial modelling of disease using data- and knowledge-driven approaches
Highlights
► A range of data- and knowledge-driven spatial modeling methods are reviewed.
► Data-driven methods include those that use presence-absence or presence-only data.
► Knowledge-driven modeling focuses on GIS-based multi-criteria decision analysis.
► Approaches for model validation and describing model uncertainty are outlined.
Introduction
The purpose of spatial modelling in animal and public health is three-fold: describing existing spatial patterns of risk using techniques such as kernel smoothing, kriging or Bayesian smoothing (i.e. descriptive), attempting to understand biological mechanisms that lead to the occurrence of disease (i.e. explanatory) and attempting to predict what will happen in the medium to long-term future or in different geographical areas (i.e. predictive). The results of such models are used for a variety of purposes including targeting areas for surveillance, risk management, simulating different control scenarios, predicting what will happen under different environmental conditions such as those resulting from climate change (i.e. temporal prediction) and identifying new geographical areas suitable for the introduction of diseases (i.e. spatial prediction).
Descriptive risk mapping aims to illustrate the spatial variation in disease risk while simultaneously removing excess noise or outliers using interpolation techniques such as kernel smoothing, kriging or Bayesian methods (Pfeiffer et al., 2008b). The data used in such maps are frequently obtained via surveys or surveillance. These data are only really useful when accompanied by denominator data so that the distribution of the population can be accounted for by calculating and mapping the disease risk or rate. However, descriptive risk maps can be incomplete or even misleading as they are frequently based on data which may have a sampling bias. In contrast, predictive risk mapping aims to identify new, unsampled geographical areas suitable for disease by extrapolating beyond the boundary of the data points used in the model (Peterson et al., 2004). However, extrapolation generally leads to greater uncertainty associated with the risk estimate. Such models can be thought of as disease distribution models (DDM), similar to the species distribution models (SDM) widely used in the fields of ecology and conservation to describe the species niche and to identify habitats suitable for supporting a species. SDM have been described as ‘geographical modeling of biospatial patterns in relation to environmental gradients’ (Franklin, 1995) but unlike SDM which are generally concerned with macro-organisms, DDM are more concerned with modelling the distribution and habitat suitability of disease micro-organisms or their vectors. 
Although the same methods and principles apply independent of the physical size of the species being modeled or whether the purpose is conservation of endangered species or identification of areas suitable for the introduction of a disease vector or micro-organism, modelling potential disease distribution introduces additional layers of complexity to methods which are generally best suited to modelling the relationship between a single species and its environment. Thus, certain SDM, while useful tools for modelling potential disease distribution, may not always be the most appropriate method for modelling complex disease patterns.
Traditional methods of temporal and spatial prediction include general and generalized linear models (GLM), generalized additive models (GAM) and Bayesian estimation methods (Lawson, 2006). These models comprise well-established algorithms and can effectively account for spatial dependence within the data. However, such models require both disease presence and disease absence points. Absence data are generally collected through observational studies which can be costly and, due to logistical constraints, may only cover a small geographical area. Yet DDM results are frequently applied to large areas (e.g. south-east Asia or Africa), even though the only available inputs may be disease presence data obtained through surveillance or knowledge of the causal factors leading to disease occurrence.
In the past decade the ecological and conservation fields have seen an increase in the application of SDM which require only presence data. These models, which are static and probabilistic, relate the species distribution to their current environment through a range of predictor variables believed to limit the species distribution (Guisan and Zimmermann, 2000). In addition to only requiring presence data, these models can also be extrapolated beyond the current distribution of the species or disease to show the potential distribution further afield. Some of these models have been shown to have predictive ability equal to or greater than that of traditional statistical methods (Elith et al., 2006) and have begun to be used in the animal and public health fields to produce maps showing the potential distribution of a variety of diseases (Peterson, 2006). This review paper provides details of some of these novel methods for modelling the spatial distribution of disease, which can be divided into data- and knowledge-driven methods. It highlights their advantages and limitations, and identifies studies which have used the methods to model various aspects of disease distribution.
Section snippets
Classification and regression trees (CART)
Classification and regression trees (CART) is a non-parametric, decision-tree based method which explains the variation in a response variable with respect to one or more predictor variables (Breiman et al., 1984, Sutton et al., 2005). Classification trees and regression trees are decision trees applied to qualitative and quantitative response variables, respectively. Construction of a decision tree occurs in three stages: tree building, tree stopping and tree pruning (Olden et al., 2008). During the building phase a tree is
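As a rough illustration of the tree-building stage, the following minimal sketch searches for the single best binary split on one predictor. It assumes a presence/absence (0/1) response and uses Gini impurity as the splitting criterion; the snippet above does not say which criterion is used, so that choice, the function names `gini` and `best_split`, and the toy rainfall data are all hypothetical.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(x: List[float], y: List[int]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted impurity) of the best binary split on x.

    The threshold is None if no split improves on the unsplit node.
    """
    best_thr, best_imp = None, gini(y)
    values = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp


# Hypothetical toy data: rainfall (mm) and disease presence (1) / absence (0).
rainfall = [100.0, 150.0, 200.0, 400.0, 450.0, 500.0]
presence = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(rainfall, presence)
```

A full tree would apply `best_split` recursively to each child node across all predictors until a stopping rule is met, and would then be pruned; those stages are not shown here.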
Data-driven methods: presence-only data
Although statistical and machine learning methods require both disease presence and absence data, absence data can be problematic because it can be difficult to distinguish between the absence of a disease and the lack of observation or reporting of disease events in an area. Alternatively, the disease species may be absent – even though the habitat is suitable for its occurrence – due to a geographical or man-made barrier preventing its spread into the area (Hirzel et al., 2002). These
Comparison of data-driven methods
A comprehensive comparison of 16 modelling methods over 226 animal species from six regions of the world found that some of the newer methods outperformed the traditional statistical methods with regard to predicting species distribution (Elith et al., 2006). The authors divided the models into three groups based on their performance as assessed by AUC and correlation. The group with the best performance included BRT and MAXENT; most of the standard regression methods (GLM and GAM), including
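For readers unfamiliar with the AUC used in such comparisons, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen presence site receives a higher predicted suitability than a randomly chosen absence site, with ties counted as one half. The function name `auc` and the evaluation data are hypothetical, and this is not the evaluation code of the cited study.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve from pairwise presence/absence comparisons."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # presence site ranked above absence site
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predicted suitability at six evaluation sites.
predicted = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
observed = [1, 1, 1, 0, 0, 0]
model_auc = auc(predicted, observed)
```

An AUC of 0.5 corresponds to a model that ranks sites no better than chance, and 1.0 to perfect separation of presence and absence sites.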
GIS-based multicriteria decision analysis (MCDA)
In data-sparse situations, maps identifying areas suitable for disease can be produced using GIS-based multicriteria decision analysis (MCDA) – also known as multicriteria decision modelling or making (MCDM). This method uses decision rules derived from existing knowledge to identify areas potentially suitable for disease (Pfeiffer et al., 2008a). Details of the modelling process have been provided by a number of authors (Eastman, 1997, Eastman et al., 1995, Malczewski, 1999, Malczewski, 2000,
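One widely used way of combining factor layers in GIS-based MCDA is a weighted linear combination. The snippet above does not name the decision rule, so the choice of this rule is an assumption, as are the layer names, the weights, the linear rescaling, and the convention that higher values indicate greater suitability. A minimal sketch on small rasters stored as nested lists:

```python
from typing import Dict, List

Grid = List[List[float]]


def rescale(grid: Grid) -> Grid:
    """Linearly rescale a raster layer to the 0-1 range."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant layer carries no information; map it to zero.
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def weighted_linear_combination(layers: Dict[str, Grid],
                                weights: Dict[str, float]) -> Grid:
    """Cell-wise weighted sum of rescaled layers; weights are normalised to sum to 1."""
    total = sum(weights[name] for name in layers)
    names = list(layers)
    rows, cols = len(layers[names[0]]), len(layers[names[0]][0])
    out = [[0.0] * cols for _ in range(rows)]
    for name in names:
        scaled = rescale(layers[name])
        w = weights[name] / total
        for r in range(rows):
            for c in range(cols):
                out[r][c] += w * scaled[r][c]
    return out


# Hypothetical 2x2 factor layers and expert-assigned weights.
layers = {
    "rainfall": [[0.0, 50.0], [100.0, 100.0]],
    "host_density": [[10.0, 10.0], [0.0, 20.0]],
}
weights = {"rainfall": 3.0, "host_density": 1.0}
suitability = weighted_linear_combination(layers, weights)
```

Each output cell lies between 0 and 1. In practice the weights would be elicited from experts or the literature, and the sensitivity of the resulting map to those weights would be reported alongside it.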
Model uncertainty and validation
Disease maps should always be used in conjunction with explicit information regarding the source and magnitude of any uncertainties incorporated in the modelling process (Barry and Elith, 2006, Pfeiffer et al., 2008a). Model uncertainty can be either epistemic (e.g. measurement error, natural variability, model uncertainty) or stochastic (i.e. uncertainty due to inherent variability in the underlying biological processes) (Refsgaard et al., 2007). Although uncertainty resulting from data errors
Conclusion
Spatial modelling and disease mapping are seldom ends in themselves. Rather, they are a means to an end: tools for informing surveillance, decision making and disease risk management. However, there is a need to communicate the outputs of such models more effectively to gain the trust of decision makers and risk managers. In addition, modellers need to communicate any uncertainty and biases inherent in the model outputs in a way that makes them easily understood by decision makers so that
References (107)
- et al.
Living on the edge – modelling habitat suitability for species at the edge of their fundamental niche
Ecol Modell
(2008)
- et al.
Predictive habitat distribution models in ecology
Ecol Modell
(2000)
GIS-based land-use suitability analysis: a critical overview
Prog Plann
(2004)
Ordered weighted averaging with fuzzy quantifiers: GIS-based multicriteria evaluation for land-use suitability analysis
Int J App Earth Obs Geoinf
(2006)
- et al.
Lutzomyia vectors for cutaneous leishmaniasis in Southern Brazil: ecological niche models, predicted geographic distributions, and climate change effects
Int J Parasitol
(2003)
- et al.
Rethinking receiver operating characteristic analysis applications in ecological niche modeling
Ecol Modell
(2008)
- et al.
Maximum entropy modeling of species geographic distributions
Ecol Modell
(2006)
- et al.
A framework for dealing with uncertainty due to model structure error
Adv Water Resour
(2006)
- et al.
Uncertainty in the environmental modelling process – a framework and guidance
Environ Modell Softw
(2007)
- et al.
Induction of sets of rules from animal distribution data: a robust and informative method of data analysis
Math Comput Simul
(1992)
Effects of sample size on accuracy of species distribution models
Ecol Modell
The use of the GARP genetic algorithm and internet grid computing in the Lifemapper world atlas of species biodiversity
Ecol Modell
Analysis of genetic algorithm for rule-set production (GARP) modeling approach for predicting distributions of fleas implicated as vectors of plague, Yersinia pestis, in California
J Med Entomol
Habitat suitability, ecological niche profile of major malaria vectors in Cameroon
Malar J
Error and uncertainty in habitat models
J Appl Ecol
Spread of the tiger: global risk of invasion by the mosquito Aedes albopictus
Vector Borne Zoonotic Dis
Modeling the geographic distribution of Bacillus anthracis, the causative agent of anthrax disease, for the contiguous United States using predictive ecologic niche modeling
Am J Trop Med Hyg
Empirical model-building and response surfaces
Bagging predictors
Mach Learn
Random forests
Mach Learn
Classification and regression trees
Ecological modeling of the spatial distribution of wild waterbirds to identify the main areas where avian influenza viruses are circulating in the Inner Niger Delta, Mali
EcoHealth
Environmental risk mapping of canine leishmaniasis in France
Parasit Vectors
Application of knowledge-driven spatial modelling approaches and uncertainty management to a study of Rift Valley fever in Africa
Int J Health Geog
Spatial risk assessment of Rift Valley fever in Senegal
Vector Borne Zoonotic Dis
Ecological niche model of Phlebotomus alexandri and P. papatasi (Diptera: Psychodidae) in the Middle East
Int J Health Geog
OpenModeller: a generic approach to species’ potential distribution modelling
GeoInformatica
Boosted trees for ecological modeling and prediction
Ecology
Classification and regression trees: a powerful yet simple technique for ecological data analysis
Ecology
Learning with genetic algorithms: an overview
Mach Learn
New methods for reasoning towards posterior distributions based on sample data
Ann Math Stat
Upper and lower probabilities induced by a multivalued mapping
Ann Math Stat
GIS and uncertainty management: new directions in software development
Lurralde
IDRISI andes: guide to GIS and image processing
Raster procedures for multi-criteria/multi-objective decisions
Photogramm Eng Remote Sensing
Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models
Ecography
Novel methods improve prediction of species’ distributions from occurrence data
Ecography
A working guide to boosted regression trees
J Anim Ecol
Application of remote-sensing data and decision-tree analysis to mapping salt-affected soils over large areas
Remote Sens
An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data
J Appl Ecol
Climate niches of tick species in the Mediterranean region: modeling of occurrence data, distributional constraints and impact of climate change
J Med Entomol
Genetic algorithms in noisy environments
Mach Learn
Predictive vegetation mapping: geographic modeling of biospatial patterns in relation to environmental gradients
Prog Phys Geog
Model evaluation
Machine learning methods
Implementation of species distribution models
Classification, similarity, and other methods for presence-only data
Climate change and risk of leishmaniasis in North America: predictions from ecological niche models of vector and reservoir species
PLoS Negl Trop Dis
Field tests of theories concerning distributional control
Am Nat