Pitfalls in QSAR

https://doi.org/10.1016/S0166-1280(02)00616-4

Abstract

There are no formal guidelines for the development of quantitative structure–activity relationships (QSARs). However, there are a number of practices that should be avoided. This paper describes the pitfalls in QSAR development and the problems that arise when they are not avoided. The emphasis is on QSARs for toxicity, covering environmental endpoints and drugs, but the discussion is equally applicable to pharmacological endpoints. Problems may arise from all three areas of the QSAR, namely the biological activity, the physico-chemical and/or structural descriptors, and the statistical technique. Biological data for use in a QSAR should be of a known (and preferably high) quality. Physico-chemical descriptors and statistical processes should be appropriate for the endpoint being modelled. They should allow for the development of a clear, transparent and mechanistically interpretable QSAR. To have any practical utility, QSARs should be validated by means of an external test set.

Introduction

The development of quantitative structure–activity relationships (QSARs) is a science that has grown up without a defined framework, series of rules, or guidelines for methodology. The goal of QSAR is to develop models on a training set of compounds; these models then allow the biological activity of related chemicals to be predicted. Ideally these models should be simple, transparent and mechanistically comprehensible. Ultimately, however, QSARs are predictive techniques based on the relationship, for a series of chemicals, between some form of biological activity and some measure(s) of physico-chemical or structural properties. As such, there are a number of limitations to the use and application of QSARs. It is the concern of the authors that these are often not appreciated, or may be forgotten, by the developers of QSARs. To assist the developer of QSARs, and as the basis of this paper, ‘essentials’ and ‘desirables’ for QSARs are listed in Tables 1 and 2, respectively.

There are three components to any QSAR, namely the biological data, physico-chemical and/or structural properties, and some form of statistical technique that relates the two. The aim of this paper is to review the potential pitfalls in the development of QSARs in relation to each of these three areas. It should be noted at the outset of this article that these comments represent the views of the authors, following a number of years not only developing their own QSARs, but also appraising the literature and being involved in the peer-review process of journal papers. The purpose of this article is not to be critical of extant literature, but to illustrate pitfalls. To do so, in most cases examples have been taken from the authors' own work.
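The three components and the role of an external test set can be illustrated with a minimal sketch. It assumes a single descriptor related to activity by a straight line fitted by ordinary least squares. All data values, the function names `fit_linear_qsar` and `r_squared`, and the choice to reference the external statistic to the training-set mean (one common convention among several) are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical training set: one descriptor (for example log P) and a
# measured activity for each chemical. Values are invented for illustration.
x_train = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y_train = np.array([0.9, 1.4, 2.1, 2.4, 3.1, 3.5])

# Hypothetical external test set: chemicals NOT used to fit the model.
x_test = np.array([0.8, 1.8, 2.8])
y_test = np.array([1.2, 2.3, 3.3])


def fit_linear_qsar(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    design = np.column_stack([x, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs[0], coeffs[1]


def r_squared(y_obs, y_pred, reference_mean):
    """1 - SS_res / SS_tot, with SS_tot taken about reference_mean."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - reference_mean) ** 2)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_linear_qsar(x_train, y_train)

# Goodness of fit on the chemicals used to build the model.
r2_train = r_squared(y_train, slope * x_train + intercept, y_train.mean())

# Predictivity on the external test set, referenced to the training mean.
q2_ext = r_squared(y_test, slope * x_test + intercept, y_train.mean())

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"r2 (training)={r2_train:.3f} q2 (external)={q2_ext:.3f}")
```

The fit statistic on the training set only shows how well the model reproduces the data it was built from. The statistic computed on the held-out chemicals is the one that speaks to predictive use.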

Section snippets

Biological data

Knowledge of the information on which models are based is essential for the development of any predictive system. The function of a QSAR is to predict biological activity, which may be in terms of a pharmacological, toxicological or pesticidal response. QSARs are therefore predictive models that rest, ultimately, on some biological data. Too often, however, QSARs are developed for which little, or nothing, is known regarding the information

Descriptors of physico-chemical properties

The assumption of a QSAR is that the biological activity of a chemical depends, in some manner, on its physico-chemical and/or structural properties. Such a relationship becomes quantitative when a series of chemicals is considered. It is not the purpose of this section to review physico-chemical descriptors per se, as excellent reviews exist elsewhere (cf. [27], [28], [29]), but to make some observations regarding the use of descriptors for the development of QSARs. Many of

Statistical analyses

A statistical technique is required to forge the link between the biological activities of a series of chemicals and their physico-chemical properties. Commonly these techniques range from linear least squares regression analysis, through to multivariate techniques including the use of principal component analysis and partial least squares, as well as a variety of neural networks. Different techniques are required for continuous and categoric data. All techniques have advantages and
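As a rough illustration of one of the multivariate techniques named above, the sketch below runs a principal component analysis on a small descriptor matrix using a singular value decomposition. The descriptor values are invented, the function name `pca_scores` is hypothetical, and autoscaling the columns before the decomposition is an assumption made here rather than a recommendation taken from the paper.

```python
import numpy as np

# Hypothetical descriptor matrix: rows are chemicals, columns are descriptors.
# Columns 0 and 1 are deliberately near-collinear. Values are invented.
descriptors = np.array([
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.9],
    [4.0, 8.0, 0.3],
    [5.0, 9.8, 0.7],
])


def pca_scores(matrix):
    """Autoscale the columns, then return PCA scores and the fraction of
    variance explained by each component, computed from an SVD."""
    centred = matrix - matrix.mean(axis=0)
    scaled = centred / matrix.std(axis=0, ddof=1)
    u, s, _vt = np.linalg.svd(scaled, full_matrices=False)
    scores = u * s                      # coordinates of each chemical
    explained = s ** 2 / np.sum(s ** 2)  # variance fraction per component
    return scores, explained


scores, explained = pca_scores(descriptors)
print(np.round(explained, 3))
```

With two nearly collinear descriptors, the last component carries almost no variance. This shows how a projection method can reveal that the descriptor set holds fewer independent pieces of information than it has columns.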

Conclusions

A large number of pitfalls encountered in the development of QSARs are described in this paper. These range from issues with data quality, to appropriate use of physico-chemical descriptors and statistical techniques. The variety of potential pitfalls emphasises that QSAR is a multi-disciplinary practice. It requires biologists, chemists, and statisticians who have a feel for what they are attempting to do. Problems normally arise when specialists in one field make assumptions about subjects

Acknowledgements

Stimulating discussions with Dr Alex Tropsha from the School of Pharmacy, University of North Carolina at Chapel Hill, USA, are gratefully acknowledged.

References (51)

  • L.L. Wright, Toxicology (2001)
  • D.V. Sweet et al., Chem. Health Saf. (1999)
  • M.A. Warne et al., Chemosphere (1999)
  • M.T.D. Cronin et al., Sci. Tot. Environ. (1997)
  • R.L. Lipnick, Sci. Tot. Environ. (1991)
  • Y.H. Zhao et al., Sci. Tot. Environ. (1998)
  • M.T.D. Cronin et al., Eur. J. Pharm. Sci. (1999)
  • R.J. Scheuplein et al., J. Invest. Dermatol. (1969)
  • M.E. Johnson et al., J. Pharm. Sci. (1995)
  • J.R. Seward et al., Aquat. Toxicol. (2001)
  • G.P. Romanelli et al., J. Mol. Struct. (Theochem) (2000)
  • A. Golbraikh et al., J. Mol. Graphics Mod. (2002)
  • C.L. Russom et al., Environ. Toxicol. Chem. (1997)
  • T.W. Schultz, Toxicol. Meth. (1997)
  • M.T.D. Cronin et al., SAR QSAR Environ. Res. (1994)
  • K.L.E. Kaiser et al., Water Pollut. Res. J. Can. (1991)
  • A.D. Gunatilleka et al., Analyst (1999)
  • M.A. Warne et al., SAR QSAR Environ. Res. (1999)
  • K.L.E. Kaiser, Environ. Health Perspect. (1998)
  • M.T.D. Cronin et al., SAR QSAR Environ. Res. (2000)
  • C. Helma et al., Environ. Health Perspect. (2000)
  • E. Gottmann et al., Environ. Health Perspect. (2001)
  • L.S. Gold et al., Handbook of Carcinogenicity Potency and Genotoxicity Databases (1997)
  • W.J. Egan et al., Anal. Chem. (1998)
  • M.T.D. Cronin et al.