Main

West Nile virus first appeared in the New World in 1999, in New York, USA. By the end of 2002, this virus had spread to Canada and 44 states of the USA (plus the District of Columbia) and, in 2002 alone, caused clinical infections in more than 4,000 people and 284 deaths1.

Severe acute respiratory syndrome (SARS) was first recognized in China in mid-November 2002. By mid-June 2003, it had affected more than 8,000 people, killing 790, in 33 countries of all the inhabited continents of the world2.

Monkeypox, which is related to smallpox, is known historically only in Africa, but was (probably) imported into Texas, USA, in early April 2003, in a shipment of Gambian rats that were destined for the pet trade. Transfer to native prairie dogs, which were in turn shipped to other pet traders or were displayed and handled in pet fairs, carried monkeypox to Wisconsin, Indiana, Illinois, Missouri, Kansas and Ohio, and resulted in a total of 71 human infections in these states by early July 2003 (Ref. 3).

In all of these examples, little or nothing is known about the natural circulation of the pathogens concerned within their countries of origin, or about the precise method, or likelihood, of their transfer to new areas. For every 'new' or 'emerging' disease of this type, there could be many others that are carried by the same mechanisms, but which then fail to establish in new countries for reasons that are also unclear.

These examples, and many more like them, show that the modern equivalent of the spread of cholera from the Broad Street pump to Londoners in 1854 (Ref. 4) could now encompass the world and its population. 'Small-world' effects increasingly 'connect' geographically distant places5, affecting not only knowledge, trade and tourism, but also infectious diseases. The immediate questions to be answered are which diseases will be involved in such transfers, from which sources, to which destinations and by what routes; and what will be the chances of the diseases establishing and spreading from any point of introduction?

A geographical information system (GIS), which can hold both the disease data and any other information within the same geographical framework, has the analytical power to help answer these questions. When the diseases in question are known, or suspected, to have environmental risk factors, the addition of remotely sensed (RS) environmental data to the GIS greatly enhances its explanatory power6. This article explains how using RS data in a GIS can help us understand the spatio-temporal dynamics of a wide range of disease systems, especially, but not exclusively, those with environmental correlates.

Steps to understanding epidemiology

To predict where diseases might spread, it is necessary to understand their epidemiology in the regions in which they have been historically recorded. First, we must identify the pieces of the puzzle — the pathogen, its vertebrate hosts and the routes of transmission between hosts. Second, the patterns of each disease must be recorded — both the distribution in space and changes with time. An essential part of this second step is an appreciation of the environment in which transmission is taking place. Third, we need to understand the dynamic processes of transmission that ultimately determine the patterns that are observed. Given this baseline knowledge, we might finally be able to estimate the likelihood of pathogen spread to, and establishment in, new areas.

Diagnosis. The first task is essentially one of diagnosis, a task that is not always simple in complex natural biological systems. Identifying the pieces of the puzzle can lead to an expectation of the resulting patterns, but it can be difficult to anticipate the number of important pieces that are involved. For example, it might have been expected that West Nile virus in North America would be transmitted by only a few native American mosquito species (the vectors). A pathogen in a new continent is unlikely to find many competent vectors, and many flaviviruses (the family to which West Nile virus belongs) typically have only one principal vector species and relatively few additional species of any transmission importance in any region7. By August 2003, however, West Nile virus, its RNA or antigens had been detected in 43 mosquito species in the United States8 (although this does not prove full transmission competence). Had this large number of potential vector species been known or predicted in advance, the rapid spread of West Nile virus in North America would not have been so surprising.

Disease patterns in space and time — statistical models. The second set of problems — that of the distribution, incidence or prevalence of a disease — is usually tackled using statistical techniques. As empirical disease data are typically sparse, even for well-established, perennial diseases, researchers seek to make predictive maps to fill in the gaps using richer environmental data. All available information is gathered — for example, records of the known presence or intensity of the disease from areas as geographically extensive, and covering as wide a range of natural environments, as possible. This information is stored within a GIS that is, at its core, little more than a database that also records the geographical location of each observation9. In addition, the GIS must contain the environmental and other data in the same geographical framework, thereby allowing correlations to be established between these data and the disease data10.

Whilst the database origins allow all the usual calculations of summary or averaged data, the geographical essence of GISs allows them to be used to reveal the spatio-temporal structure of disease cases. For example, disease clusters — the geographical co-occurrence of cases that indicates local transmission of infectious diseases or the presence of environmental determinants of non-infectious diseases — are common. Childhood arthritis is not traditionally thought of as infectious, but the recognition of a cluster of cases in Old Lyme, Connecticut in the late 1970s11 (albeit without the benefit of a GIS) ultimately led to the discovery of tick-borne Lyme borreliosis12, the most widespread and prevalent vector-borne infection in the northern temperate world.

Satellite imagery for monitoring environments. Satellite imagery (Box 1) is a powerful component of modern disease GISs. Satellite sensors provide data from which information about rainfall, temperature, humidity and vegetation conditions at the Earth's surface can be derived. These conditions are crucial for the indirect transmission of pathogens by vectors or intermediate hosts, such as insects, ticks, snails or rodents6,13,14. Environmental conditions might also be important for any directly transmitted (for example, host-to-host) pathogens that must survive for any period of time outside the host. Bovine tuberculosis outbreaks in cattle herds are thought by many to be caused by contamination from infected wild animals, and high-risk areas have been predicted accurately using seasonal features of atmospheric humidity and air temperature15. This result might offer some clues about the precise route of transmission, which is still unknown.

Seasonal changes and risk maps. For many pathogens, and especially for those with intermediate hosts, transmission is seasonal. Furthermore, the continuing existence of the pathogen depends not just on conditions during the transmission season (usually spring through to autumn in temperate regions, and the rainy season in tropical regions), but also on conditions during winter or the dry season — these conditions can determine the survival of the pathogen or its vector from one year to the next. Satellite images with high spatial resolution, such as Landsat imagery, are not recorded sufficiently often to capture the full details of seasonal cycles; their infrequent images often miss important periods.

Satellite imagery at high temporal resolution (Box 1), however, can produce clear monthly pictures (although at the expense of spatial resolution), but the result is large volumes of data, which often show strong serial correlations that affect the power of statistical analyses. To reduce the volume of data and remove these serial correlations without losing the biologically meaningful signals, a technique of time-series analysis that was invented by the French mathematician Joseph Fourier (1768–1830) is used. Fourier solved a problem in calculus that had defeated Newton and subsequent generations of mathematicians, by showing that a complex time series can always be expressed as the sum of a series of sine curves with different amplitudes, frequencies and phases (that is, timings) around a characteristic mean. Using Fourier's techniques, therefore, it is possible to extract information about the annual, bi-annual and tri-annual cycles of rainfall, temperature and other parameters that characterize the natural environments of diseases from the multi-temporal satellite data (Fig. 1) (it is this shift from the time to the frequency domain that removes the serial correlation in the satellite data). The output of temporal Fourier analysis is a set of orthogonal (that is, uncorrelated) variables that capture the seasonality that is of interest in epidemiology6,13,16, and these variables can therefore be used to classify habitats and describe vector and pathogen distributions. For disease systems, Fourier variables are the environmental equivalent of the genes of individual pathogens, and whole Fourier-processed images (Fig. 2) that capture all the interactive space–time features of a habitat can be likened to the organismal genome.
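The decomposition described above can be illustrated with a minimal sketch. It assumes a complete monthly series for a single pixel and fits a mean plus three harmonics by ordinary least squares; the harmonics are read here as having periods of 12, 6 and 4 months, which is one interpretation of the 'annual, bi-annual and tri-annual' cycles. The function name, argument names and synthetic data are hypothetical and are not taken from the processing chain used for Figs 1 and 2.

```python
import numpy as np

def temporal_fourier(series, period=12, n_harmonics=3):
    """Fit a mean plus n_harmonics sinusoids to a monthly series (least squares).

    Returns (mean, amplitudes, phases, fitted), where phases are expressed
    in months from the start of the series to the first peak of each harmonic.
    """
    y = np.asarray(series, dtype=float)
    t = np.arange(y.size)
    columns = [np.ones(y.size)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k * t / period
        columns += [np.cos(w), np.sin(w)]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    mean = coef[0]
    a = coef[1::2]                      # cosine coefficients, k = 1..n
    b = coef[2::2]                      # sine coefficients, k = 1..n
    amplitudes = np.hypot(a, b)
    # a*cos(wt) + b*sin(wt) = A*cos(wt - phi), with phi = atan2(b, a)
    k = np.arange(1, n_harmonics + 1)
    phases = (np.arctan2(b, a) % (2.0 * np.pi)) * period / (2.0 * np.pi * k)
    return mean, amplitudes, phases, X @ coef

# Synthetic 4-year "NDVI-like" signal (illustrative values only):
months = np.arange(48)
ndvi = (0.5
        + 0.2 * np.cos(2 * np.pi * (months - 6) / 12)        # annual cycle, peak at month 6
        + 0.05 * np.cos(2 * np.pi * 2 * (months - 1) / 12))  # twice-yearly cycle
mean, amplitudes, phases, fitted = temporal_fourier(ndvi)
```

Applied pixel by pixel to an image time series, the mean, amplitudes and phases returned here would each become one image layer, which is how a few dozen monthly images reduce to a handful of seasonal descriptors.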

Figure 1: Normalized difference vegetation index signal from a point in central Wales, UK.

The monthly normalized difference vegetation index (NDVI) signal from a point in central Wales, UK (3.58°W, 52.34°N) for 1996–1999 is shown in light blue, the mean monthly signal for this period is shown in red (displaced vertically by 0.1 for clarity), the temporal Fourier fit to these data is shown in orange, and the annual, bi-annual and tri-annual components of this fit are shown in yellow, purple and green, respectively (right-hand scale). The temporal Fourier fit describes the average annual cycle at this site well.

Figure 2: Normalized difference vegetation index (NDVI) Fourier images of Europe.

a | The mean is shown in red. b | The annual amplitude is shown in blue. c | The annual phase (that is, the timing of the annual peak) is shown in green. d | All three signals are shown together, which shows how this method of analysis captures habitat seasonality across Europe (notice, for example, that the mean is generally higher in western Europe, with the exception of many parts of Spain, but the annual amplitude is higher in eastern Europe). The white arrow in the mean image points to the site from which the data for Fig. 1 were obtained.

In a GIS, the statistical relationships that are established between the disease and environmental data sets are applied at the full spatial resolution of the latter, richer data sets, to produce a 'risk map' (Fig. 3). This effectively shows the similarity of environmental conditions in unsurveyed places to environmental conditions in which the disease has been recorded as being either present or absent. This similarity is usually expressed as the probability with which each area on the ground (corresponding to one picture element, or 'pixel', of the satellite imagery) belongs to the class of areas that are known to contain the disease. Errors in risk maps arise for several reasons; sometimes the input data are out-of-date, or simply wrong; sometimes the explanatory variables that are used are inappropriate; and sometimes the model itself is wrong. Only false-negatives (that is, false predictions of absence) are clear indicators of an incorrect or inappropriate model, and much can be learnt by investigating why these arise. Even when none of these applies, risk maps often indicate larger areas as 'at risk' than are known to be affected by the disease at present (that is, many false-positives), because an organism or disease will not occupy all 'suitable' habitats. False-positives also highlight an important application of predictive risk maps — to warn health agencies of the potential spread of a disease into these areas17. The significant increases in the incidence of tick-borne encephalitis (TBE) over the past two decades have indeed been accompanied by new records in many supposedly 'false-positive' regions18,19,20,21.
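A risk map of this kind can be sketched in a few lines. The example below assumes a two-class Gaussian discriminant with a pooled covariance matrix and equal prior probabilities; this is a stand-in chosen for brevity, not necessarily the statistical model used for the maps discussed here. The function names and the structure of the inputs (one row per surveyed site or pixel, one column per environmental predictor) are hypothetical.

```python
import numpy as np

def fit_discriminant(X, present):
    """Estimate class means and the inverse pooled covariance from surveyed sites."""
    X = np.asarray(X, dtype=float)
    present = np.asarray(present, dtype=bool)
    mu_present = X[present].mean(axis=0)
    mu_absent = X[~present].mean(axis=0)
    resid = np.vstack([X[present] - mu_present, X[~present] - mu_absent])
    pooled_cov = resid.T @ resid / (len(X) - 2)
    return mu_present, mu_absent, np.linalg.inv(pooled_cov)

def posterior_presence(X, model, prior=0.5):
    """Probability that each pixel belongs to the 'disease present' class."""
    mu_present, mu_absent, inv_cov = model
    X = np.asarray(X, dtype=float)
    # Squared Mahalanobis distance of every pixel to each class mean
    d_present = np.einsum('ij,jk,ik->i', X - mu_present, inv_cov, X - mu_present)
    d_absent = np.einsum('ij,jk,ik->i', X - mu_absent, inv_cov, X - mu_absent)
    log_odds = -0.5 * (d_present - d_absent) + np.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + np.exp(-log_odds))

def accuracy_summary(predicted, observed):
    """Fractions correct, false-positive and false-negative, plus sensitivity and specificity."""
    predicted = np.asarray(predicted, dtype=bool)
    observed = np.asarray(observed, dtype=bool)
    tp = np.sum(predicted & observed)
    tn = np.sum(~predicted & ~observed)
    fp = np.sum(predicted & ~observed)
    fn = np.sum(~predicted & observed)
    n = predicted.size
    return {
        "correct_fraction": (tp + tn) / n,
        "false_positive_fraction": fp / n,
        "false_negative_fraction": fn / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

In use, the model would be fitted to the surveyed sites only and then applied to the predictor values of every pixel in the image, so that the output of posterior_presence becomes the risk map; thresholding it at 0.5 and comparing with the survey records gives the accuracy figures.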

Figure 3: Results of statistical modelling of the distribution of vectors and disease using selected temporal Fourier-processed images as predictor variables.

a | The distribution of Glossina morsitans (three subspecies) throughout Africa. b | The distribution of Glossina palpalis (two subspecies) in West Africa. c | The distribution of tick-borne encephalitis in Europe (historical mapped data are shown cross-hatched in black). The predicted habitat suitability for the vectors or disease is shown on a probability scale from 0 (red) to 1 (green) (see inset for legend). Model accuracy: (a) 91% correct with 7% false-positives (that is, false predictions of presence) and 2% false-negatives (that is, false predictions of absence); sensitivity 0.96, specificity 0.857; (b) 96% correct with 3% false-positives and 1% false-negatives; sensitivity 0.97, specificity 0.94; (c) 90.5% correct with 8.1% false-positives and 1.4% false-negatives; sensitivity 0.966, specificity 0.865. Modified with permission from Refs 40,41 © (2000) Elsevier Science.

With every powerful technique there is a disadvantage. The ability of GISs to produce multiple, derived layers of data often compounds any errors that are present in each contributory variable. This is especially likely when no attempt is made to verify these derived layers through field observations (for example, checking vegetation classifications that are based on satellite data), and the effects are especially severe when the errors are systematic. The production of visually appealing, even statistically sound, results that do not reveal anything useful about either pattern or process is perhaps the greatest danger facing newcomers to this powerful technology.

Longer-term changes. With historical, geo-referenced disease records over several (preferably many) years, time-series analysis can be applied to make predictions about the future of the disease. Long-term records of disease are relatively rare, and usually apply to point or small-area locations. The problem with statistical approaches (for example, various auto-regressive techniques) to these types of data is that they assume that the transmission dynamics of disease in the future will be similar to those of the past. In reality, pathogen–host systems are always evolving and their environments are changing, which makes the future unpredictable using such statistical approaches.

Infection processes in space and time

Satellite imagery at high temporal resolution is also a powerful tool in the third step of understanding epidemiology using dynamic biological or process-based models. These models capture details of pathogen transmission in a series of mathematical equations that describe the birth and death of organisms, and transmission processes, directly. The same remotely sensed data that are used to describe the patterns of infection in statistical models can also be used to explain transmission in biological models. For example, satellite data have been correlated with key behavioural or demographic rates of the vectors (Fig. 4a) of indirectly transmitted diseases, thereby identifying the conditions that are best for pathogen transmission. A model of a tsetse-fly population, which is driven by satellite-derived variables, captures 93% of the variance in the observed seasonal pattern (Fig. 4b). Given the extra layers of complexity that are involved in pathogen transmission by vectors, it is not surprising that the model of trypanosomiasis seasonal prevalence, which is driven, in turn, by this tsetse model, fits the disease data less well (62% of the variance; Fig. 4b). In this case, the model assumed that cattle supply only 30% of the blood meals for the tsetse, but are the only significant source of trypanosome infections for the flies, when in reality wildlife species also play a part in transmission22. This emphasizes our first point: all the pieces of the puzzle must be identified before reliable predictions can be made.

Figure 4: Biological models driven by remotely sensed data.

a | The relationship between the mortality rate of the tsetse Glossina morsitans submorsitans at Bansang, The Gambia, West Africa, and the satellite-derived land surface temperature for the previous month. Mortality is the two-point (previous, present month) smoothed density-independent mortality (d.i.m.) that is calculated from regular trapped samples of flies. b | A least-squares-fitted biological tsetse population model (red line), which is driven by satellite data through the relationship shown in (a), describes 93% of the variance of the observed mean monthly field data (yellow line). The fitted tsetse model was then used in a simple disease transmission model (lower, green line), which describes 62% of the variance of the observed field data (blue line). Field data are the average monthly values from the sample sites, and are repeated over a 3-year period in the figure for clarity. Modified with permission from Ref. 40 © (2000) Elsevier Science.

The persistence of an infectious agent can also be predicted from the widely used R0 equation, which arises from rearranging the transmission equations to determine whether, at very low prevalence in an entirely susceptible population, an infection will increase or decline to extinction23. Given that six out of the seven parameters and variables of the R0 equation for vector-borne pathogens are determined not by the vertebrate host but by the vector24, which is highly susceptible to climatic influences, it is possible to produce R0 maps for vector-borne diseases using satellite imagery; however, the first attempts have not yet been validated. A map describing the relative transmission risk of the directly transmitted foot and mouth disease in the United Kingdom revealed considerable spatial heterogeneity25, but the mapping was not based on any environmental variables that might help explain this spatial pattern. A map of fully parameterized R0 values would show potential sources (R0 ≥ 1) and sinks (R0 < 1) of infection, and would be of obvious benefit to control efforts, because only sources of infection require control, and only to levels where they also become sinks.
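The source and sink classification can be sketched as follows. The text does not reproduce the R0 equation of Ref. 24, so this example assumes the classical Ross-Macdonald form for a vector-borne pathogen, R0 = m a² b c exp(-μn) / (r μ), whose seven quantities are the vector-to-host ratio (m), the vector biting rate (a), the two transmission coefficients (b, c), the vector mortality rate (μ), the incubation period in the vector (n) and the host recovery rate (r). All parameter values below are invented for illustration, and the function names are hypothetical.

```python
import numpy as np

def r0_vector_borne(m, a, b, c, mu, n, r):
    """Ross-Macdonald basic reproduction number (assumed form, see lead-in).

    m  : vectors per host
    a  : bites per vector per day on the host species
    b  : transmission coefficient, vector to host
    c  : transmission coefficient, host to vector
    mu : daily vector mortality rate
    n  : incubation period of the pathogen in the vector (days)
    r  : daily recovery rate of infected hosts
    """
    return m * a**2 * b * c * np.exp(-mu * n) / (r * mu)

def classify_pixels(r0):
    """Label each pixel as a potential source (R0 >= 1) or sink (R0 < 1)."""
    return np.where(np.asarray(r0) >= 1.0, "source", "sink")

# Two hypothetical pixels that differ only in vector mortality, as might be
# estimated from satellite-derived temperature (values are illustrative).
mu_per_pixel = np.array([0.1, 0.5])
r0_map = r0_vector_borne(m=10.0, a=0.3, b=0.5, c=0.5, mu=mu_per_pixel, n=10.0, r=0.05)
labels = classify_pixels(r0_map)
```

Because vector mortality appears both in the denominator and in the exponent, modest climate-driven changes in μ move a pixel across the R0 = 1 threshold quickly, which is why environmentally driven estimates of the vector parameters matter so much for such maps.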

Climate and global spread of disease

Taken on a global scale, these same ideas can identify potential sources of infection for new disease outbreaks elsewhere. As there are so few global disease data sets, we are generally forced to use one or other statistical approach to disease mapping. Climatic similarities between areas that are, at present, disease-affected and disease-free, can be used as a guide to the environmental permissiveness of these latter areas for the diseases in question. Like risk mapping, however, climate matching is a probabilistic exercise; similarity of climates does not guarantee similarity of diseases, but simply makes it more likely, all other things being equal. The present-day match of the UK climate to global climates (Fig. 5) indicates that UK inhabitants might expect to share climatically determined diseases not just with parts of mainland Europe, but also with China and North America. Given suitable connections by trade or travel26, if only one of these climatically matched regions is a source of emergence of any new disease — as China is for new strains of influenza — all of these regions could potentially suffer from it.
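Climate matching of this kind can be sketched in two steps: cluster the climate variables of the reference region into zones, then assign every pixel elsewhere to its nearest zone, or to no zone if it is too dissimilar. The example below substitutes plain k-means for the ISODATA clustering named in the Fig. 5 legend, and the distance threshold for 'no significant matching' is an arbitrary assumption. Function names and inputs (one row per pixel, one column per climate variable) are hypothetical.

```python
import numpy as np

def define_zones(reference, k, n_iter=50, seed=0):
    """Cluster standardized climate variables of the reference region into k zones.

    Returns (centroids, mean, std); mean and std are the standardization
    constants, which must be reused when matching other regions.
    """
    reference = np.asarray(reference, dtype=float)
    mean, std = reference.mean(axis=0), reference.std(axis=0)
    Z = (reference - mean) / std
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centroids[j] = Z[labels == j].mean(axis=0)
    return centroids, mean, std

def match_zones(pixels, centroids, mean, std, max_dist=1.0):
    """Assign each pixel to its nearest zone, or -1 if no zone is within max_dist."""
    Z = (np.asarray(pixels, dtype=float) - mean) / std
    dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    labels[dist.min(axis=1) > max_dist] = -1
    return labels
```

The label -1 plays the role of the white, unmatched areas on a global map. As the text stresses, a match indicates only that the climate is permissive, not that the disease will occur there.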

Figure 5: The use of meteorological data averages to define ten climate zones in Great Britain.

The mean, maxima and minima of temperature, rainfall and vapour pressure measurements from gridded meteorological data averages for 1961–1990 (Ref. 33) were clustered within an image-processing system (ERDAS Imagine © Leica Geosystems, ISODATA clustering) to define ten climate zones in Great Britain (see map on right, arbitrary colours). The global map shows how matches to these climate zones occur in Europe, North America and China, with smaller area matches elsewhere. White indicates no significant matching. Reproduced with permission from Ref. 42 © (2002). Crown copyright material is reproduced with the permission of the Controller of HMSO and the Queen's Printer for Scotland.

Much attention has been focused recently on the impacts of past and future climate changes on diseases27,28. To identify causality rather than coincidence in past events as a prerequisite for future predictions, we must have evidence of significant changes both in the climatic drivers of infection and in the incidence of the diseases themselves, at the same times, in the same places and in the right 'direction' according to our understanding of how climate affects transmission. The step increase in the incidence of TBE in Sweden from 1983 to 1986 (Ref. 29), for example, preceded the sudden increase in temperature at the end of the 1980s (Ref. 21), showing that milder winters and springs cannot have been the causal factors, as claimed30. The analytical power of a GIS offers the obvious way9 to explore which of many disparate factors (biological and non-biological) might be causal for spatially heterogeneous changes in disease incidence on continental scales, as has been recorded for TBE in Europe31 and malaria in Africa32.

To predict future impacts, future climate scenarios must be obtained at a sufficiently high spatial resolution to be useful, together with good models of climate and disease linkages. Unfortunately, at present, the spatial resolution of both past and future climate data sets is inadequate; the first is owing to historically inadequate meteorological station coverage in many (mostly tropical) areas of the world33, and the second is a consequence of the computing requirements of global circulation models34. We can do little about the historical climatic data sets, and computing power is not predicted to increase sufficiently rapidly to address the second problem in the near future. Climatically speaking, we know neither where we are coming from, nor where we are headed, with anything like the precision that is required to predict spatially variable, climatically sensitive diseases.

Conclusions

With today's global connectivity, humans are now the new vectors of infectious agents, travelling and trading over greater distances, and at faster speeds, than any pathogen or its vectors could achieve unaided. The consequent increasingly urgent need to develop global disease early-warning systems (DEWS) is fortunately matched by rapidly developing tools: RS imagery within a GIS now offers greater power for describing, explaining and predicting epidemiological phenomena than ever before. Furthermore, although global routes of infection are increasingly common, global information flow has increased in parallel, thanks to the internet, and information can be transmitted even faster than any pathogen. In the most recent example, once the international community became aware of SARS, the response in laboratories and health agencies around the world, which resulted in an unprecedentedly rapid identification of the agent, its routes of transmission and the networks of connectivity, apparently achieved a (temporary?) halt to its progress within much less than a year. By contrast, although the response to the introduction of West Nile virus in the United States was equally impressive (a delay of less than 2 months between the first (unrecognized) human case in New York and identification of the virus), the spread of this disease has not been stopped, and it must now be regarded as endemic throughout much of the United States. The differences between these two diseases can be attributed to many factors, but all will be embedded within one or other parameter or variable of the transmission and R0 equations.

Finally, we highlight two key issues in epidemic disease prediction that have pervaded this review — space and time. Although epidemiological surveillance networks, or global satellite systems, constrain the collection of data to particular intervals of time (the reporting period) or space (the pixel size), diseases do not respect such artificial units. We must develop methods that allow us to move seamlessly across scales of time and space as we analyse epidemics that occur throughout the spectrum from small isolated communities to the global population. Classical time-series analysis of disease data, or Fourier analyses of temporal satellite series, assume a constancy (mathematical stationarity) in the disease system or the environment that is rarely shown in practice. Wavelet techniques, which essentially divide time into smaller units, allow unit-to-unit variation of disease–environment interactions35,36; and other advanced techniques allow continuous variation in these interactions37. Similar techniques can be applied in the spatial domain to satellite imagery. The enormous range of pixel sizes that are available in commercial satellite imagery (from 0.6 m to >1 km) challenges us to develop methods to identify the appropriate data resolutions, and to integrate these data into spatial units that are relevant to disease transmission, such as the movement range of an infected individual for a directly transmitted pathogen, or the flight range of an insect for a vector-borne pathogen. Again, wavelet and other techniques might help here, by enabling us to look at the environment from the perspective of disease systems that are driven more by biological realism than by technological wizardry. A unified space–time theory of epidemics, in which a three-dimensional space–time pixel replaces a two-dimensional one as the basic recording unit, is still some way off38, but GIS and RS technologies are the vehicles that, potentially, will get us there the fastest. 
All we have to decide is in which direction to go.