Elsevier

Journal of Web Semantics

Volume 9, Issue 3, September 2011, Pages 316-324

NCBO Resource Index: Ontology-based search and mining of biomedical resources

https://doi.org/10.1016/j.websem.2011.06.005

Abstract

The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will help accelerate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index – a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics “under the hood.”

Highlights

► The NCBO presents an ontology-based index of various biomedical data resources.
► We have automatically indexed 22 data resources using more than 200 ontologies.
► The semantics that the ontologies encode is used to improve the search experience.
► NCBO BioPortal provides an easy-to-use interface to search and mine data resources.

Introduction

Researchers in biomedicine produce and publish enormous amounts of data describing everything from genomic information and pathways to drug descriptions, clinical trials, and diseases. These data are stored in many different databases accessible through Web sites, using idiosyncratic schemas and access mechanisms. Our goal is to enable a researcher to browse and analyze the information stored in these diverse resources. Then, for instance, a researcher studying allelic variations in a gene can find all the pathways that the gene affects, the drug effects that these variations modulate, any disease that could be caused by the gene, and the clinical trials that involve the drug or diseases related to that specific gene. The information that we need to answer such questions is available in public biomedical resources; the problem is finding that information.

The research community agrees that terminologies and ontologies are essential for data integration and translational discoveries to occur [1], [2], [3]. However, the metadata that describe the information in data resources are usually unstructured, often come in the form of free-text descriptions, and are rarely labelled or tagged using terms from ontologies that are available for the domains. Users often prefer labels from ontologies because they provide a clear point of reference during their search and mining tasks [4], [5], [6]. For example, researchers and curators widely use the Gene Ontology to describe the molecular functions, cellular location, and biological processes of gene products. These annotations enable the integration of the descriptions of gene products across several model organism databases [7].

However, besides these examples, semantic annotation of biomedical resources is still minimal and is often restricted to a few resources and a few ontologies [8]. Usually, the textual content of these online resources is indexed (e.g., using Lucene) to enable querying the resources with keywords. However, keyword-based indexing has obvious limits, such as its handling of synonyms and polysemy and its lack of domain knowledge. Furthermore, having to perform keyword searches at each Web site individually makes the navigation and aggregation of the available information extremely cumbersome, if not impractical. Search engines, like Entrez (www.ncbi.nlm.nih.gov/Entrez), facilitate search across several resources, but they do not currently use many of the available and relevant biomedical ontologies.

The National Center for Biomedical Ontology (NCBO) Resource Index addresses these two problems by (1) providing a unified index of and access to multiple heterogeneous biomedical resources; and (2) using ontologies and the semantic representation that they encode to enhance the search experience for the user. The NCBO BioPortal – an open library of more than 200 ontologies in biomedicine [9] – serves as the source of ontologies for the Resource Index. We use the terms from these ontologies to annotate, or “tag,” the textual descriptions of the data that reside in biomedical resources and we collect these annotations in a searchable and scalable index (Fig. 1). The key contributions to the field are (i) to build the search system for such a large number of ontologies and resources and (ii) to use the semantics that the ontologies encode.

In the context of our research, we call data element any identifiable entity or record (e.g., document, article, experimentation report) which belongs to a biomedical data resource (e.g., database of articles, experiments, trials). Usually, an element has an identifier and can be linked by a URL. For instance, the trial NCT00924001 is an element of the ClinicalTrials.gov data resource that can be accessed with: http://clinicaltrials.gov/ct2/show/NCT00924001. We call annotation – a central component – a link from an ontology term to a data element, indicating that the data element refers to the term either explicitly or not [10], [11]. We then use these annotations to “bring together” the data elements.
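The two definitions above can be pictured as a small data model. The following is only a minimal sketch under stated assumptions: the class names, field names, and the term identifiers `EX:0001` and `EX:0002` are hypothetical and are not taken from the Resource Index implementation; only the trial identifier and its URL come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """An identifiable record that belongs to a data resource."""
    resource: str   # name of the resource holding the element
    local_id: str   # identifier of the element inside that resource
    url: str        # URL at which the element can be reached


@dataclass(frozen=True)
class Annotation:
    """A link from an ontology term to a data element."""
    term_id: str          # hypothetical ontology term identifier
    element: DataElement
    explicit: bool        # True if the element refers to the term explicitly


def group_by_term(annotations):
    """Map each term identifier to the set of elements annotated with it."""
    index = {}
    for ann in annotations:
        index.setdefault(ann.term_id, set()).add(ann.element)
    return index


trial = DataElement(
    resource="ClinicalTrials.gov",
    local_id="NCT00924001",
    url="http://clinicaltrials.gov/ct2/show/NCT00924001",
)
# Placeholder term identifiers, for illustration only.
annotations = [
    Annotation("EX:0001", trial, explicit=True),
    Annotation("EX:0002", trial, explicit=False),
]
index = group_by_term(annotations)
```

Grouping annotations by term is one simple way to "bring together" data elements: every element reachable from the same term identifier ends up in the same set, whichever resource it came from.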

We currently index 22 resources, which are maintained by a variety of different institutions, with terms from more than 200 ontologies included in BioPortal (Appendix A). As of January 2011, our 1.5 TB MySQL database, which stores the annotations in the Resource Index, contains 11 billion annotations, 3.3 million ontology concepts, and 3.2 million data elements. The user interface is available at http://bioportal.bioontology.org/resources.

A preliminary version of the system was presented in [12]. In this paper, we illustrate use case scenarios (Section 2), describe the system implementation (Section 3), detail the indexing workflow (Section 3.3), and present the different means to access the Resource Index (Section 3.4). We demonstrate how semantic technologies enable information retrieval and mining scenarios that were not possible otherwise (Section 4).

Section snippets

Use case scenarios

We will describe the functionality of the Resource Index through three use case scenarios.

Scenario 1: Multiple-term search across resources. The user is interested in the role of tumor protein p53 in breast cancer. He can search the Resource Index for “Tumor Protein p53” AND “Breast Carcinoma” as defined in the NCI Thesaurus (Fig. 2). The search results summarize the number of elements per resource annotated with both terms. The user can see there is relevant data linking p53 to breast cancer
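The per-resource summary of an AND search in Scenario 1 can be sketched as follows. This is an illustrative toy, not the system's query code: the function name, the tuple layout, and the resource names `ResourceA` and `ResourceB` are assumptions, and the annotations are made-up data.

```python
from collections import Counter


def count_elements_with_all_terms(annotations, required_terms):
    """Count, per resource, the elements annotated with every required term.

    annotations: iterable of (term, resource, element_id) tuples.
    """
    required = set(required_terms)
    terms_by_element = {}
    for term, resource, element_id in annotations:
        if term in required:
            terms_by_element.setdefault((resource, element_id), set()).add(term)

    counts = Counter()
    for (resource, _element_id), terms in terms_by_element.items():
        if terms == required:  # the element carries all the query terms
            counts[resource] += 1
    return dict(counts)


# Made-up annotations over two hypothetical resources.
toy_annotations = [
    ("Tumor Protein p53", "ResourceA", "a1"),
    ("Breast Carcinoma", "ResourceA", "a1"),
    ("Tumor Protein p53", "ResourceA", "a2"),
    ("Breast Carcinoma", "ResourceA", "a2"),
    ("Tumor Protein p53", "ResourceA", "a3"),
    ("Breast Carcinoma", "ResourceB", "b1"),
    ("Tumor Protein p53", "ResourceB", "b1"),
    ("Breast Carcinoma", "ResourceB", "b2"),
]
summary = count_elements_with_all_terms(
    toy_annotations, ["Tumor Protein p53", "Breast Carcinoma"]
)
print(summary)
```

Elements `a3` and `b2` carry only one of the two terms, so they do not contribute to the counts.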

The NCBO Resource Index

To create the Resource Index, we process metadata describing data elements in a variety of heterogeneous resources to create semantic annotations of these metadata. We use the publicly available biomedical ontologies in BioPortal as a source of terms, their synonyms, and the relations between terms (Section 3.1). We use resource-specific access tools to process metadata that describe data elements in different resources (Section 3.2). We use an off-the-shelf concept-recognition tool to identify

Discussion and related work

The Resource Index provides semantically-enabled uniform access to a large set of heterogeneous biomedical resources. It leverages the semantics expressed in the ontologies in several different ways:

Preferred names and synonyms: Many biomedical ontologies specify, as class properties, not only labels (preferred names) but also synonyms for the class names, which we use during annotation. For example, a keyword search of the caNanoLab resource with “adriamycin” would normally obtain no results.
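The difference between a plain keyword search and a synonym-aware one can be sketched with a toy example. Everything here is hypothetical: the class identifier `EX:doxorubicin`, the element text, and the helper names are invented for illustration, and the sketch assumes an ontology class labelled "Doxorubicin" that lists "Adriamycin" (a trade name for doxorubicin) as a synonym. Simple substring matching stands in for a real concept-recognition tool.

```python
def build_lexicon(classes):
    """Map every lowercased label or synonym to its class identifier."""
    lexicon = {}
    for class_id, names in classes.items():
        for name in [names["label"], *names["synonyms"]]:
            lexicon[name.lower()] = class_id
    return lexicon


def annotate(texts, lexicon):
    """Index element ids by class id, matching labels and synonyms in the text."""
    index = {}
    for element_id, text in texts.items():
        lowered = text.lower()
        for name, class_id in lexicon.items():
            if name in lowered:  # naive substring match, for illustration only
                index.setdefault(class_id, set()).add(element_id)
    return index


def semantic_search(query, lexicon, index):
    """Resolve the query to a class, then return the elements annotated with it."""
    class_id = lexicon.get(query.lower())
    return sorted(index.get(class_id, set()))


# Hypothetical ontology fragment and element description.
classes = {"EX:doxorubicin": {"label": "Doxorubicin", "synonyms": ["Adriamycin"]}}
texts = {"element-17": "Nanoparticle formulation loaded with doxorubicin"}

lexicon = build_lexicon(classes)
index = annotate(texts, lexicon)

keyword_hits = [eid for eid, text in texts.items() if "adriamycin" in text.lower()]
semantic_hits = semantic_search("adriamycin", lexicon, index)
```

In this toy setting the keyword search finds nothing because the element text never contains the string "adriamycin", while the synonym-aware lookup resolves the query to the class and returns the element annotated with it.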

Challenges and future plans

We are currently working on expanding the Resource Index to include more resources. Our goal is to index up to 100 public resources, including PubMed, which provides access to all research articles in biomedicine (approximately 20 Million elements). We have analyzed the metrics on ontologies in order to re-structure the database backend for the Resource Index. This restructuring has enabled us to reduce the processing time for one of our larger datasets from one week to one hour [18]. With this

Conclusions

We have presented an ontology-based workflow to annotate biomedical resources automatically as well as an index constructed using this workflow. Ontology-based indexing is not new in biomedicine; however, it is usually restricted to indexing a specific resource with a specific ontology (vertical approach). We adopt a horizontal approach, accessing annotations for many important resources using a large number of ontologies. This approach follows the translational bioinformatics and Semantic Web

Acknowledgements

This work was supported in part by the National Center for Biomedical Ontology, under roadmap-initiative Grant U54 HG004028 from the National Institutes of Health. The NCBO Resource Index won the first prize in the Semantic Web Challenge 2010 (http://challenge.semanticweb.org/).

References (33)

  • I. Spasic et al.

    Text mining and ontologies in biomedicine: making sense of raw text

    Briefings in Bioinformatics

    (2005)
  • S.Y. Rhee et al.

    Use and misuse of the gene ontology annotations

    Nature Reviews Genetics

    (2008)
  • N.H. Shah, N. Bhatia, C. Jonquet, D.L. Rubin, A.P. Chiang, M.A. Musen, Comparison of concept recognizers for building...
  • N.F. Noy et al.

    BioPortal: ontologies and integrated data resources at the click of a mouse

    Nucleic Acids Research

    (2009)
  • S. Handschuh, S. Staab (Eds.), Annotation for the Semantic Web, vol. 96 of Frontiers in Artificial Intelligence and...
  • N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, M.A. Musen, Ontology-driven Indexing of Public Datasets for...