Cheminformatics analysis and learning in a data pipelining environment

  • Full-length paper
  • Molecular Diversity

Summary

Workflow technology is being increasingly applied in discovery information to organize and analyze data. SciTegic's Pipeline Pilot is a chemically intelligent implementation of a workflow technology known as data pipelining. It allows scientists to construct and execute workflows using components that encapsulate many cheminformatics-based algorithms. In this paper we review SciTegic's methodology for molecular fingerprints, molecular similarity, molecular clustering, maximal common subgraph search, and Bayesian learning. Case studies are described showing the application of these methods to the analysis of discovery data such as chemical series and high-throughput screening results. The paper demonstrates that the methods are well suited to a wide variety of tasks, such as building and applying predictive models of screening data, identifying molecules for lead optimization, and organizing molecules into families with structural commonality.
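To make the notion of fingerprint-based molecular similarity concrete, the following is a minimal sketch of the widely used Tanimoto coefficient, computed on fingerprints represented as sets of integer feature identifiers. It is an illustration under stated assumptions, not the Pipeline Pilot implementation: the function name `tanimoto`, the feature identifiers, and the convention of returning 0.0 for two empty fingerprints are all hypothetical choices made for this example.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient between two fingerprints stored as feature sets.

    Defined as |A and B| / |A or B|: shared features over all distinct features.
    """
    if not fp_a and not fp_b:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical feature identifiers standing in for fingerprint features.
query = {101, 202, 303, 404}
candidate = {202, 303, 404, 505, 606}

# 3 shared features out of 6 distinct features in total.
print(tanimoto(query, candidate))  # 0.5
```

Representing a fingerprint as a set rather than a fixed-length bit vector keeps the sketch short; the same ratio applies to bit vectors by counting the bits set in both fingerprints and in either fingerprint.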

Abbreviations

MCSS: maximal common substructure search

ECFP: extended connectivity fingerprints

FCFP: functional class fingerprints

MDDR: MDL Drug Data Report

WDI: World Drug Index

CATS: chemically advanced template search

BKD: binary kernel discrimination

CDK2: cyclin-dependent kinase 2

DHFR: Escherichia coli dihydrofolate reductase

Author information

Corresponding author

Correspondence to Moises Hassan.

About this article

Cite this article

Hassan, M., Brown, R.D., Varma-O’Brien, S. et al. Cheminformatics analysis and learning in a data pipelining environment. Mol Divers 10, 283–299 (2006). https://doi.org/10.1007/s11030-006-9041-5

  • DOI: https://doi.org/10.1007/s11030-006-9041-5
