Getting to the (c)ore of knowledge: mining biomedical literature
Introduction
With an overwhelming amount of biomedical information available as text, it is natural to ask whether it can be read automatically. For several decades, natural language processing (NLP) has been applied in biomedicine to automatically ‘read’ patient records, resulting in a growing but fairly homogeneous body of research. Now, with the explosive growth of molecular biology research, there is a tremendous amount of text of a different sort: journal articles. The text collection in Medline can be mined to learn about a subfield, find supporting evidence for new experiments, add to molecular biology databases, or support Evidence Based Medicine.
Literature mining can be compared with reading and understanding literature but is performed automatically by a computer. Like reading, most literature mining projects target a specific goal. In bioinformatics, examples are:
- finding protein–protein interactions (among others [1], [2], [3]);
- finding protein–gene interactions [4];
- finding subcellular localization of proteins [5], [6], [7];
- functional annotation of proteins [8], [9];
- pathway discovery [10], [11];
- vocabulary construction [12], [13], [14];
- assisting BLAST or SCOP search with evidence found in literature [15], [16];
- discovering gene functions and relations [17].
A few examples in medicine include:
- charting a literature by clustering articles [18];
- discovery of hidden relations between, for instance, diseases and medications [19], [20], [21];
- using medical text to support the construction of knowledge bases [22].
With this wide variety of goals, it is not surprising that many different tools have been adopted or invented by the various researchers. Although the approaches differ, they can all be seen as examples of one or more stages of a reading process.
Most studies that work with biomedical literature use Medline abstracts, which underlines the immense value of the Medline collection. It now holds more than 12 million citations, most of which include abstracts. We hope that in future years more initiatives can be directed towards the full text of articles. A number of publishers now offer free on-line access to full articles, and standards in web layout and metatagging are gaining acceptance. Algorithms that scale better and a continuing increase in affordable computing power are, or will be, ready to handle that volume.
Free availability of material is at this moment trapped between two forces. There is the growing pressure from the (noncommercial) scientific community to freely share material. On the other end of the see-saw sits the growing pressure on companies to make a profit on the web and, therefore, to regulate access to material.
In biomedicine, the efforts of the US National Library of Medicine are once more invaluable on this matter. PubMed Central aims to facilitate and/or host full-text access to participating journals in a common format, and requires that access be free no later than 1 year after publication, and preferably sooner. Currently, more than 25 journals have committed to this initiative.
This article reviews a number of studies on literature mining applied to biomedicine, and takes a look at the range of techniques that have been (or could be) applied to modules within the literature mining process. By its nature, an article such as this can only present a snapshot of the state of the art at one point in time. For a more up-to-date overview of NLP studies applied to molecular biology and other biomedical domains, see our extensive, partially annotated on-line bibliography at http://textomy.iit.nrc.ca/cgi-bin/BNLPB_ix.cgi.
Very recently, an overview on Genomics and NLP appeared [23]. That article is written from a genomics perspective, and as such concentrates partly on Information Retrieval (IR) techniques (possibly including a literature corpus) to support sequence finding and annotation. Our article is written from an NLP researcher's point of view and reviews how recent studies, notably in the area of molecular biology and literature searching, have added to the field of NLP in biomedicine. We see the two articles as complementary.
Section snippets
Natural language processing in biomedicine: a brief overview
The application of NLP to molecular biology might be relatively new, but NLP has been applied to biomedical text for decades; in fact, this work began soon after computerized clinical record systems were introduced in the mid 1960s [24]. The computerization of clinical records increased the tension in the field of medical reporting and recording. Structured reporting, on the one hand, ensures rigidity and optimal retrievability of records. Natural language narrative, on the other hand, ensures flexibility and
Text mining as a modular process
Text mining is a process very similar to reading. A reader first selects what they will read, then identifies important entities and relations between those entities, and finally combines this new information with other articles and other knowledge. This reading process (see Fig. 1) forms the backbone of this article. In the following sections, the various studies on text mining applied to molecular biology literature are aligned with this modular view of the reading process.
Document categorization
Document categorization, at its most basic, divides a collection of documents into disjoint subsets. This is also known as Document or Text Classification, but categorization is the most common term. The categories are usually predefined; if they are not, the process is actually document clustering (grouping documents through their superficial characteristics, e.g. [51]). By this definition IR is one form of categorization: the collection is divided into two categories of documents, one
Named entity tagging
The main reason to read an article is to find out what it says. Similarly, the goal of Information Extraction is to fill in a database record with specific information from the article. The first level of this task is to identify what entities or objects the article mentions. This is called named entity tagging, where the beginning and end of entities might be marked with SGML or XML tags (see Fig. 2).
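As a rough illustration of marking entity boundaries with XML tags, the following Python sketch performs simple dictionary lookup. The tag name `entity`, the `type` attribute, and the toy `LEXICON` are assumptions chosen for this example; they are not the markup of Fig. 2 or of any reviewed system, and dictionary lookup is only one of several possible tagging strategies.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical toy dictionary mapping surface forms to entity types.
LEXICON = {"p53": "protein", "MDM2": "protein", "BRCA1": "gene"}


def tag_entities(text, lexicon=LEXICON):
    """Wrap every dictionary term found in the text in an XML element."""
    # Try longer names first so a longer name wins at the same position.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untagged text so the output stays well-formed XML.
        pieces.append(escape(text[last:match.start()]))
        name = match.group(1)
        pieces.append(
            f'<entity type="{lexicon[name]}">{escape(name)}</entity>'
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


print(tag_entities("MDM2 binds p53."))
```

The call above prints `<entity type="protein">MDM2</entity> binds <entity type="protein">p53</entity>.` Matching is case-sensitive and exact, so this sketch misses spelling variants and abbreviations.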
In molecular biology, most of the entities are molecules, such as RNA, genes and proteins, and
Fact extraction
Readers do not understand text if they merely know the entities. They must also grasp the interactions or relationships between those entities. Fact extraction is the identification of entities and their relations. To have a machine do this correctly for arbitrary relationships would require a full natural language intelligence, something that is many years away. There are several approximations that have been tried, from purely statistical co-occurrence to imperfect parsing and coreference
Collection-wide analysis
Thinking new thoughts and using what is known requires integrating information across documents. This opens the door to knowledge discovery, where combined facts form the basis of a novel insight. The well-known Swanson study [19], [20] on the relation between Raynaud's disease and fish oil was a starting point of formal literature-based knowledge discovery. Weeber et al. [21] discuss an automated replication of that study and similar ones.
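One common way to describe this kind of discovery is as a search for indirect links: a starting term A and a candidate term C are never reported together, but both are reported with one or more intermediate terms B. The Python sketch below illustrates that idea under stated assumptions. The function name `hidden_links` and the input pairs are invented toy data, not results from the cited studies, and the sketch is not a reconstruction of the systems in [19], [20] or [21].

```python
from collections import defaultdict


def hidden_links(cooccurrences, start):
    """Find terms linked to `start` only through intermediate terms.

    `cooccurrences` is an iterable of (term_a, term_b) pairs, each meaning
    that the two terms were reported together in at least one article.
    """
    neighbours = defaultdict(set)
    for a, b in cooccurrences:
        neighbours[a].add(b)
        neighbours[b].add(a)

    direct = neighbours[start]
    candidates = defaultdict(set)
    for intermediate in direct:
        for target in neighbours[intermediate]:
            # Keep only terms that never co-occur with the start term.
            if target != start and target not in direct:
                candidates[target].add(intermediate)

    # Rank by number of linking terms, then alphabetically.
    return sorted(
        ((target, sorted(links)) for target, links in candidates.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Toy input for illustration only; "drug X" is a placeholder term.
pairs = [
    ("raynaud's disease", "blood viscosity"),
    ("raynaud's disease", "platelet aggregation"),
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("drug X", "platelet aggregation"),
]
ranked = hidden_links(pairs, "raynaud's disease")
```

With this toy input, `ranked` lists "fish oil" first because two intermediate terms connect it to the starting term, while "drug X" is connected by only one. Counting linking terms is a deliberately simple ranking; it says nothing about whether a suggested link is biologically plausible.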
Other studies have addressed knowledge discovery in
Concluding remarks
This overview showed a very wide variety of current applications and techniques for literature mining on biomedical text. The field is likely to widen further in the future. On-line access to molecular databases and medical knowledge structures will augment the knowledge component in literature mining systems. Large-scale statistical methods will continue to challenge the position of the more syntax-semantics oriented approaches, although both will hold their own place. Literature mining
References (79)
- et al.
Supporting the classification of pathology reports: comparing two information retrieval methods
Comput. Methods Programs Biomed.
(2000)
- et al.
Morpheme-based, cross-lingual indexing for medical document retrieval
Int. J. Med. Inf.
(2000)
- et al.
A comparison of classification algorithms to automatically identify chest X-ray reports that support pneumonia
J. Biomed. Inf.
(2001)
- et al.
Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method—abstract
J. Biomed. Inf.
(2001)
- et al.
Automatic SNOMED classification—a corpus-based method
Comput. Methods Programs Biomed.
(1997)
- et al.
Using BLAST for identifying gene and protein names in journal articles
Gene
(2000)
- Using combinatory categorial grammar to extract biomedical information
IEEE Intell. Syst.
(2001)
- et al.
Automatic extraction of biological information from scientific text: protein–protein interactions
Proc. Int. Conf. Intell. Syst. Mol. Biol.
(1999)
- et al.
Automated extraction of information on protein–protein interactions from the biological literature
Bioinformatics
(2001)
- J. Thomas, D. Milward, C. Ouzounis, S. Pulman, M. Carroll, Automatic extraction of protein interactions from scientific...
- Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts
Genome Inf. Ser. Workshop Genome Inf.
- Automatic annotation for biological sequences by extraction of keywords from Medline abstracts. Development of a prototype system
Proc. Int. Conf. Intell. Syst. Mol. Biol.
- GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles
Bioinformatics
- Toward routine automatic pathway discovery from on-line scientific text abstracts
Genome Inf. Ser. Workshop Genome Inf.
- Automatic construction of knowledge base from biological papers
Proc. Int. Conf. Intell. Syst. Mol. Biol.
- Mining molecular binding terminology from biomedical text
Proc. AMIA Symp.
- PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary
Bioinformatics
- Including biological literature improves homology search
Pac. Symp. Biocomput.
- SAWTED: structure assignment with text description—enhanced detection of remote homologues with automated Swiss-Prot annotation comparisons
Bioinformatics
- Genes, themes and microarrays: using information retrieval for large-scale gene analysis
Proc. Int. Conf. Intell. Syst. Mol. Biol.
- Fish oil, Raynaud's syndrome, and undiscovered public knowledge
Perspect. Biol. Med.
- Medical literature as a potential source of new knowledge
Bull. Med. Libr. Assoc.
- Text-based discovery in biomedicine: the architecture of the DAD-system
Proc. AMIA Symp.
- Genomics and natural language processing
Nat. Rev. Genet.
- Progress in programming for processing English language medical records
Ann. New York Acad. Sci.
- Natural language processing in medicine: an overview
Methods Inf. Med.
- Natural language processing and its future in medicine
Acad. Med.
- Alternative ways for knowledge collection, indexing and robust language retrieval
Methods Inf. Med.
- Using UMLS semantics for classification purposes
Proc. AMIA Symp.