Getting to the (c)ore of knowledge: mining biomedical literature

https://doi.org/10.1016/S1386-5056(02)00050-3

Abstract

Literature mining is the process of extracting and combining facts from scientific publications. In recent years, many computer programs have been designed to extract various molecular biology findings from Medline abstracts or full-text articles. The present article describes the range of text mining techniques that have been applied to scientific documents. It divides ‘automated reading’ into four general subtasks: text categorization, named entity tagging, fact extraction, and collection-wide analysis. Literature mining offers powerful methods to support knowledge discovery and the construction of topic maps and ontologies. An overview is given of recent developments in medical language processing. Special attention is given to the particularities of the molecular biology domain and to the emerging synergy between literature mining and molecular databases accessible through the Internet.

Introduction

With an overwhelming amount of biomedical information available as text, it is natural to ask whether it can be read automatically. For several decades, natural language processing (NLP) has been applied in biomedicine to automatically ‘read’ patient records, resulting in a growing but fairly homogeneous body of research. Now, with the explosive growth of molecular biology research, there is a tremendous amount of text of a different sort: journal articles. The text collection in Medline can be mined to learn about a subfield, find supporting evidence for new experiments, add to molecular biology databases, or support evidence-based medicine.

Literature mining can be compared with reading and understanding literature but is performed automatically by a computer. Like reading, most literature mining projects target a specific goal. In bioinformatics, examples are:

  • finding protein–protein interactions (among others [1], [2], [3]);

  • finding protein–gene interactions [4];

  • finding subcellular localization of proteins [5], [6], [7];

  • functional annotation of proteins [8], [9];

  • pathway discovery [10], [11];

  • vocabulary construction [12], [13], [14];

  • assisting BLAST or SCOP search with evidence found in literature [15], [16];

  • discovering gene functions and relations [17].

A few examples in medicine include:

  • charting a literature by clustering articles [18];

  • discovery of hidden relations between, for instance, diseases and medications [19], [20], [21];

  • using medical text to support the construction of knowledge bases [22].

With this wide variety of goals, it is not surprising that many different tools have been adopted or invented by the various researchers. Although the approaches differ, they can all be seen as examples of one or more stages of a reading process.

Most of the studies that work with biomedical literature use Medline abstracts. This underlines the immense value of the Medline collection. Its size has passed 12 million citations, most of which include abstracts. Our hope is that in future years, more and more initiatives will and can be directed towards the full text of articles. A number of publishers now offer free on-line access to full articles, and standards for web layout and metatagging are gaining acceptance. Algorithms that scale better, together with the continuing increase in affordable computing power, are ready to tackle full-text collections, or soon will be.
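For readers who wish to experiment with the collection themselves, Medline abstracts can be retrieved programmatically through the NCBI E-utilities. The short Python sketch below is illustrative only; the query string and the result handling are our own assumptions and are not drawn from any of the studies reviewed here.

    # Illustrative sketch: fetching Medline abstracts via the NCBI E-utilities.
    # The query and parameters below are example assumptions.
    import json
    import urllib.parse
    import urllib.request

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def fetch_abstracts(query, retmax=5):
        # Step 1: esearch returns PubMed identifiers (PMIDs) matching the query.
        params = urllib.parse.urlencode(
            {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"})
        with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{params}") as resp:
            pmids = json.load(resp)["esearchresult"]["idlist"]

        # Step 2: efetch returns the citations, here as plain-text abstracts.
        params = urllib.parse.urlencode(
            {"db": "pubmed", "id": ",".join(pmids),
             "rettype": "abstract", "retmode": "text"})
        with urllib.request.urlopen(f"{EUTILS}/efetch.fcgi?{params}") as resp:
            return resp.read().decode("utf-8")

    if __name__ == "__main__":
        print(fetch_abstracts("protein protein interaction[Title/Abstract]"))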

Free availability of material is at this moment trapped between two forces. There is the growing pressure from the (noncommercial) scientific community to freely share material. On the other end of the see-saw sits the growing pressure on companies to make a profit on the web and, therefore, to regulate access to material.

In biomedicine, the efforts of the US National Library of Medicine are once more invaluable on this matter. PubMed Central aims to facilitate and/or host full-text access to participating journals in a common format, and requires that access be free no later than 1 year after publication, and preferably sooner. Currently, more than 25 journals have committed to this initiative.

This article reviews a number of studies on literature mining applied to biomedicine, and takes a look at the range of techniques that have been (or could be) applied to modules within the literature mining process. The nature of an article such as this is that it can only present a snapshot of the state of the art at one point in time. For a more up-to-date overview of NLP studies applied to molecular biology and other biomedical domains, see our on-line, partially annotated, extensive bibliography at http://textomy.iit.nrc.ca/cgi-bin/BNLPB_ix.cgi.

Very recently, an overview of genomics and NLP appeared [23]. That article is written from a genomics perspective, and as such concentrates partly on Information Retrieval (IR) techniques (possibly including a literature corpus) to support sequence finding and annotation. Our article is written from an NLP researcher's point of view, and reviews the ways in which recent studies, notably in the area of molecular biology and literature searching, have added to the field of NLP in biomedicine. We see the two articles as complementary.

Section snippets

Natural language processing in biomedicine: a brief overview

The application of NLP to molecular biology might be relatively new, but NLP has been applied to biomedical text for decades, in fact since soon after computerized clinical record systems were introduced in the mid-1960s [24]. The computerization of clinical records increased the tension in the field of medical reporting and recording. Structured reporting, on the one hand, ensures rigidity and optimal retrievability of records. Natural language narrative, on the other hand, ensures flexibility and …

Text mining as a modular process

Text mining is a process very similar to reading. A reader first selects what they will read, then identifies important entities and relations between those entities, and finally combines this new information with other articles and other knowledge. This reading process (see Fig. 1) forms the backbone of this article. In the following sections, the various studies on text mining applied to molecular biology literature are aligned with this modular view of the reading process.
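To make this modular view concrete, the following sketch arranges the four subtasks as stages of a small pipeline. It is a minimal sketch over toy documents and a toy lexicon; the class and function names are our own illustrative choices, not components of any system discussed in this review.

    # Illustrative sketch of the four-stage 'reading' pipeline described above.
    # Stage and class names are our own; the bodies are deliberately toy-sized.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        pmid: str
        text: str
        category: str = "unlabeled"
        entities: list = field(default_factory=list)
        facts: list = field(default_factory=list)

    def categorize(doc):
        # Stage 1: decide whether the document is relevant to the mining goal.
        molecular_cues = ("protein", "gene", "p53", "mdm2")
        doc.category = "molecular" if any(w in doc.text.lower() for w in molecular_cues) else "other"
        return doc

    def tag_entities(doc):
        # Stage 2: mark gene/protein names (here a toy dictionary lookup).
        for name in ("p53", "MDM2"):
            if name.lower() in doc.text.lower():
                doc.entities.append(name)
        return doc

    def extract_facts(doc):
        # Stage 3: relate the entities found in stage 2 (toy co-occurrence rule).
        if doc.category == "molecular" and len(doc.entities) >= 2 and "interact" in doc.text.lower():
            doc.facts.append(("interacts_with", doc.entities[0], doc.entities[1]))
        return doc

    def analyse_collection(docs):
        # Stage 4: combine facts across documents, e.g. count supporting evidence.
        support = {}
        for doc in docs:
            for fact in doc.facts:
                support[fact] = support.get(fact, 0) + 1
        return support

    docs = [Document("1", "We show that p53 interacts with MDM2."),
            Document("2", "MDM2 and p53 interact in vivo.")]
    print(analyse_collection([extract_facts(tag_entities(categorize(d))) for d in docs]))
    # -> {('interacts_with', 'p53', 'MDM2'): 2} on this toy input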

Document categorization

Document categorization, at its most basic, divides a collection of documents into disjoint subsets. This is also known as Document or Text Classification, but categorization is the most common term. The categories are usually predefined; if they are not, the process is actually document clustering (grouping documents through their superficial characteristics, e.g. [51]). By this definition, IR is one form of categorization: the collection is divided into two categories of documents, one …
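Purely as an illustration of what a simple categorizer looks like today, the sketch below trains a bag-of-words classifier with scikit-learn. The training abstracts and the 'molecular'/'clinical' labels are invented for the example and do not correspond to any system reviewed here.

    # Illustrative sketch: a bag-of-words document categorizer (scikit-learn).
    # The training abstracts and labels are toy examples invented for this sketch.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "The yeast two-hybrid screen revealed a novel protein-protein interaction.",
        "We report binding of the kinase to its substrate in vitro.",
        "Patients were randomized to receive the new antihypertensive drug.",
        "A double-blind clinical trial of statin therapy in elderly patients.",
    ]
    train_labels = ["molecular", "molecular", "clinical", "clinical"]

    classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    classifier.fit(train_texts, train_labels)

    print(classifier.predict(["The receptor interacts with a membrane protein."]))
    # -> ['molecular'] on this toy data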

Named entity tagging

The main reason to read an article is to find out what it says. Similarly, the goal of Information Extraction is to fill in a database record with specific information from the article. The first level of this task is to identify what entities or objects the article mentions. This is called named entity tagging, where the beginning and end of entities might be marked with SGML or XML tags—see Fig. 2.

In molecular biology, most of the entities are molecules, such as RNA, genes and proteins, and …
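As a minimal illustration of such markup, the sketch below wraps dictionary matches in XML-style tags. The tiny lexicon and the <ne> tag name are our own assumptions; real systems combine far larger terminologies with rules or statistical models.

    # Illustrative sketch: dictionary-based named entity tagging with XML markup.
    # The lexicon below is a toy assumption.
    import re

    LEXICON = {"p53": "protein", "MDM2": "protein", "TP53": "gene"}

    def tag_entities(text):
        # Build one alternation pattern; match longest names first, whole words only.
        names = sorted(LEXICON, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
        return pattern.sub(
            lambda m: f'<ne type="{LEXICON[m.group(1)]}">{m.group(1)}</ne>', text)

    print(tag_entities("TP53 encodes p53, which binds MDM2."))
    # -> TP53, p53 and MDM2 wrapped in <ne type="gene"> / <ne type="protein"> tags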

Fact extraction

Readers do not understand text if they merely know the entities. They must also grasp the interactions or relationships between those entities. Fact extraction is the identification of entities and their relations. To have a machine do this correctly for arbitrary relationships would require a full natural language intelligence, something that is many years away. There are several approximations that have been tried, from purely statistical co-occurrence to imperfect parsing and coreference …
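To illustrate the crude end of that spectrum, the following sketch extracts candidate interaction facts by matching a verb pattern between two known entity names within a sentence. The entity list, the verb list and the sentence splitter are simplifying assumptions made for the example.

    # Illustrative sketch: pattern-based fact extraction at the crude end of the
    # spectrum (sentence-level co-occurrence around an interaction verb).
    import re

    ENTITIES = {"p53", "MDM2", "BRCA1"}
    # A deliberately naive pattern: ENTITY ... interaction-verb ... ENTITY.
    ENT = "|".join(ENTITIES)
    PATTERN = re.compile(
        rf"\b({ENT})\b[^.]*?\b(binds|interacts with|inhibits|activates)\b[^.]*?\b({ENT})\b")

    def extract_facts(text):
        facts = []
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            for subj, verb, obj in PATTERN.findall(sentence):
                facts.append((subj, verb, obj))
        return facts

    print(extract_facts("We found that MDM2 binds p53 in vivo. BRCA1 was unaffected."))
    # -> [('MDM2', 'binds', 'p53')] on this toy input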

Collection-wide analysis

Thinking new thoughts and using what is known requires integrating information across documents. This opens the door to knowledge discovery, where combined facts form the basis of a novel insight. The well-known Swanson study [19], [20] on the relation between Raynaud's disease and fish oil was a starting point of formal literature-based knowledge discovery. Weeber et al. [21] discuss an automated replication of that study and similar ones.

Other studies have addressed knowledge discovery in …
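As a hedged illustration of how such Swanson-style discovery can be mechanized, the sketch below links two terms that never co-occur directly but share intermediate terms across documents; the toy 'documents' are invented for the example.

    # Illustrative sketch of Swanson-style (A-B-C) literature-based discovery:
    # A and C never co-occur, but both co-occur with intermediate terms B.

    # Toy 'documents', each reduced to the set of index terms it mentions.
    documents = [
        {"raynaud disease", "blood viscosity"},
        {"raynaud disease", "platelet aggregation"},
        {"fish oil", "blood viscosity"},
        {"fish oil", "platelet aggregation"},
        {"aspirin", "platelet aggregation"},
    ]

    def cooccurring(term):
        # All terms that appear in at least one document together with `term`.
        return {other for doc in documents if term in doc for other in doc} - {term}

    def hidden_links(a, c):
        # A and C are 'hidden' neighbours if they share intermediates
        # but never appear in the same document.
        if any(a in doc and c in doc for doc in documents):
            return set()
        return cooccurring(a) & cooccurring(c)

    print(hidden_links("raynaud disease", "fish oil"))
    # -> {'blood viscosity', 'platelet aggregation'} on this toy data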

Concluding remarks

This overview showed a very wide variety of current applications and techniques for literature mining on biomedical text. The field is likely to grow even wider in the future. On-line access to molecular databases and medical knowledge structures will augment the knowledge component in literature mining systems. Large-scale statistical methods will continue to challenge the position of the more syntax- and semantics-oriented approaches, although both will hold their own place. Literature mining …

References (79)

  • T. Sekimizu et al., Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts, Genome Inf. Ser. Workshop Genome Inf. (1998)

  • M. Craven, Learning to extract relations from Medline. AAAI-99 Workshop on Machine Learning for Information Extraction, ...

  • M. Craven, J. Kumlien, Constructing biological knowledge bases by extracting information from text sources, Proc. Int. ...

  • B.J. Stapley, L.A. Kelley, M.J. Sternberg, Predicting the sub-cellular location of proteins from text using support...

  • M.A. Andrade et al., Automatic annotation for biological sequences by extraction of keywords from Medline abstracts. Development of a prototype system, Proc. Int. Conf. Intell. Syst. Mol. Biol. (1997)

  • A. Renner, A. Aszodi, High-throughput functional annotation of novel gene products using document clustering, Pac. ...

  • C. Friedman et al., GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics (2001)

  • S.K. Ng et al., Toward routine automatic pathway discovery from on-line scientific text abstracts, Genome Inf. Ser. Workshop Genome Inf. (1999)

  • Y. Ohta et al., Automatic construction of knowledge base from biological papers, Proc. Int. Conf. Intell. Syst. Mol. Biol. (1997)

  • T.C. Rindflesch et al., Mining molecular binding terminology from biomedical text, Proc. AMIA Symp. (1999)

  • M. Yoshida et al., PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary, Bioinformatics (2000)

  • J.T. Chang et al., Including biological literature improves homology search, Pac. Symp. Biocomput. (2001)

  • R.M. MacCallum et al., SAWTED: structure assignment with text description—enhanced detection of remote homologues with automated Swiss-Prot annotation comparisons, Bioinformatics (2000)

  • H. Shatkay et al., Genes, themes and microarrays: using information retrieval for large-scale gene analysis, Proc. Int. Conf. Intell. Syst. Mol. Biol. (2000)

  • W.J. Wilbur, A thematic analysis of the AIDS literature, Pac. Symp. Biocomput. (2002)...

  • D.R. Swanson, Fish oil, Raynaud's syndrome, and undiscovered public knowledge, Perspect. Biol. Med. (1986)

  • D.R. Swanson, Medical literature as a potential source of new knowledge, Bull. Med. Libr. Assoc. (1990)

  • M. Weeber et al., Text-based discovery in biomedicine: the architecture of the DAD-system, Proc. AMIA Symp. (2000)

  • U. Hahn, M. Romacker, S. Schulz, Automatic knowledge engineering in medicine: the MedSynDiKaTe text mining system....

  • M.D. Yandell et al., Genomics and natural language processing, Nat. Rev. Genet. (2002)

  • J.J. Baruch, Progress in programming for processing English language medical records, Ann. New York Acad. Sci. (1965)

  • P. Spyns, Natural language processing in medicine: an overview, Methods Inf. Med. (1996)

  • C. Friedman et al., Natural language processing and its future in medicine, Acad. Med. (1999)

  • D.C. Berrios, Automated indexing for full text information retrieval, Proc. AMIA Symp. (2000)...

  • R.H. Baud et al., Alternative ways for knowledge collection, indexing and robust language retrieval, Methods Inf. Med. (1998)

  • U. Hahn, M. Honeck, M. Piotrowski, S. Schulz, Subword segmentation—leveling out morphological variations for medical...

  • A.R. Aronson, O. Bodenreider, H.F. Chang, S.M. Humphrey, J.G. Mork, S.J. Nelson, T.C. Rindflesch, W.J. Wilbur, The NLM...

  • K. Baclawski, J. Cigna, M.M. Kokar, P. Mager, B. Indurkhya, Knowledge representation and indexing using the unified...

  • O. Bodenreider, Using UMLS semantics for classification purposes, Proc. AMIA Symp. (2000)