NCBO Resource Index: Ontology-based search and mining of biomedical resources
Highlights
► The NCBO presents an ontology-based index of various biomedical data resources.
► We have automatically indexed 22 data resources using more than 200 ontologies.
► The semantics that the ontologies encode is used to improve the search experience.
► NCBO BioPortal provides an easy-to-use interface to search and mine data resources.
Introduction
Researchers in biomedicine produce and publish enormous amounts of data describing everything from genomic information and pathways to drug descriptions, clinical trials, and diseases. These data are stored in many different databases, accessible through Web sites that use idiosyncratic schemas and access mechanisms. Our goal is to enable a researcher to browse and analyze the information stored in these diverse resources. For instance, a researcher studying allelic variations in a gene could then find all the pathways that the gene affects, the drug effects that these variations modulate, any disease that could be caused by the gene, and the clinical trials that involve the drugs or diseases related to that specific gene. The information needed to answer such questions is available in public biomedical resources; the problem is finding it.
The research community agrees that terminologies and ontologies are essential for data integration and translational discoveries to occur [1], [2], [3]. However, the metadata that describe the information in data resources are usually unstructured, often come in the form of free-text descriptions, and are rarely labelled or tagged using terms from ontologies that are available for the domains. Users often prefer labels from ontologies because they provide a clear point of reference during their search and mining tasks [4], [5], [6]. For example, researchers and curators widely use the Gene Ontology to describe the molecular functions, cellular location, and biological processes of gene products. These annotations enable the integration of the descriptions of gene products across several model organism databases [7].
Beyond such examples, however, semantic annotation of biomedical resources is still minimal and is often restricted to a few resources and a few ontologies [8]. Usually, the textual content of these online resources is indexed (e.g., using Lucene) to enable querying the resources with keywords. Keyword-based indexing has obvious limits: it does not handle synonyms or polysemy, and it lacks domain knowledge. Furthermore, having to perform keyword searches at each Web site individually makes the navigation and aggregation of the available information extremely cumbersome, if not impractical. Search engines such as Entrez (www.ncbi.nlm.nih.gov/Entrez) facilitate search across several resources, but they do not currently use many of the available and relevant biomedical ontologies.
The National Center for Biomedical Ontology (NCBO) Resource Index addresses these two problems by (1) providing a unified index of and access to multiple heterogeneous biomedical resources; and (2) using ontologies and the semantic representation that they encode to enhance the search experience for the user. The NCBO BioPortal – an open library of more than 200 ontologies in biomedicine [9] – serves as the source of ontologies for the Resource Index. We use the terms from these ontologies to annotate, or “tag,” the textual descriptions of the data that reside in biomedical resources, and we collect these annotations in a searchable and scalable index (Fig. 1). The key contributions to the field are (i) building the search system for such a large number of ontologies and resources and (ii) using the semantics that the ontologies encode.
In the context of our research, we call data element any identifiable entity or record (e.g., document, article, experimentation report) which belongs to a biomedical data resource (e.g., database of articles, experiments, trials). Usually, an element has an identifier and can be linked by a URL. For instance, the trial NCT00924001 is an element of the ClinicalTrials.gov data resource that can be accessed with: http://clinicaltrials.gov/ct2/show/NCT00924001. We call annotation – a central component – a link from an ontology term to a data element, indicating that the data element refers to the term either explicitly or not [10], [11]. We then use these annotations to “bring together” the data elements.
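The two definitions above can be pictured as a small data model. The following is only a minimal sketch, not the Resource Index implementation: the class names, the field names, the term identifiers `TERM:A` and `TERM:B`, and the second resource are all hypothetical and chosen for illustration. Only the ClinicalTrials.gov trial and its URL come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """An identifiable record that belongs to a data resource."""
    resource: str   # name of the data resource
    local_id: str   # identifier of the element within that resource
    url: str        # URL under which the element can be accessed


@dataclass(frozen=True)
class Annotation:
    """A link from an ontology term to a data element."""
    term_id: str          # hypothetical ontology term identifier
    element: DataElement
    explicit: bool        # whether the element refers to the term explicitly


def elements_by_term(annotations):
    """Bring together the data elements that share an ontology term."""
    index = defaultdict(set)
    for annotation in annotations:
        index[annotation.term_id].add(annotation.element)
    return dict(index)


trial = DataElement(
    "ClinicalTrials.gov",
    "NCT00924001",
    "http://clinicaltrials.gov/ct2/show/NCT00924001",
)
# Hypothetical second element, from a made-up resource.
other = DataElement("ExampleResource", "E1", "http://example.org/E1")

annotations = [
    Annotation("TERM:A", trial, explicit=True),
    Annotation("TERM:A", other, explicit=False),
    Annotation("TERM:B", trial, explicit=True),
]
index = elements_by_term(annotations)
```

In this sketch, looking up `index["TERM:A"]` returns both elements even though they live in different resources, which is the sense in which annotations "bring together" data elements.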
We currently index 22 resources, which are maintained by a variety of different institutions, with terms from more than 200 ontologies included in BioPortal (Appendix A). As of January 2011, our 1.5 TB MySQL database, which stores the annotations in the Resource Index, contains 11 billion annotations, 3.3 million ontology concepts, and 3.2 million data elements. The user interface is available at http://bioportal.bioontology.org/resources.
A preliminary version of the system was presented in [12]. In this paper, we illustrate use case scenarios (Section 2) and describe the system implementation (Section 3), including the details of the indexing workflow (Section 3.3) and the different means to access the Resource Index (Section 3.4). We then demonstrate how semantic technologies enable information retrieval and mining scenarios that were not possible otherwise (Section 4).
Section snippets
Use case scenarios
We will describe the functionality of the Resource Index through three use case scenarios.
Scenario 1: Multiple-term search across resources. The user is interested in the role of tumor protein p53 in breast cancer. The user can search the Resource Index for “Tumor Protein p53” AND “Breast Carcinoma” as defined in the NCI Thesaurus (Fig. 2). The search results summarize the number of elements per resource annotated with both terms. The user can see there is relevant data linking p53 to breast cancer
The NCBO Resource Index
To create the Resource Index, we process metadata describing data elements in a variety of heterogeneous resources to create semantic annotations of these metadata. We use the publicly available biomedical ontologies in BioPortal as a source of terms, their synonyms, and the relations between terms (Section 3.1). We use resource-specific access tools to process metadata that describe data elements in different resources (Section 3.2). We use an off-the-shelf concept-recognition tool to identify
Discussion and related work
The Resource Index provides semantically-enabled uniform access to a large set of heterogeneous biomedical resources. It leverages the semantics expressed in the ontologies in several different ways:
Preferred names and synonyms: Many biomedical ontologies specify, as class properties, not only labels (preferred names) but also synonyms for the class names, which we use during annotation. For example, a keyword search of the caNanoLab resource with “adriamycin” would normally obtain no results.
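The role of synonyms can be illustrated with a short sketch. It assumes a toy term table and a toy annotation index; the term identifier `T1`, the element identifier `element-42`, and the function names are hypothetical. The pairing of "Adriamycin" with the preferred name "Doxorubicin" is our own illustration of a synonym, not a claim about how any particular ontology or resource stores it.

```python
def build_label_index(terms):
    """Map every lower-cased label (preferred name or synonym) to its term id."""
    label_to_term = {}
    for term_id, (preferred_name, synonyms) in terms.items():
        for label in (preferred_name, *synonyms):
            label_to_term[label.lower()] = term_id
    return label_to_term


def search(keyword, label_to_term, annotated_elements):
    """Resolve a keyword to a term through any of its labels, then return
    the data elements annotated with that term."""
    term_id = label_to_term.get(keyword.lower())
    if term_id is None:
        return set()
    return annotated_elements.get(term_id, set())


# Hypothetical term: one preferred name plus one synonym.
terms = {"T1": ("Doxorubicin", ["Adriamycin"])}
# Hypothetical annotation index: term id -> identifiers of annotated elements.
annotated_elements = {"T1": {"element-42"}}

label_to_term = build_label_index(terms)
hits_by_synonym = search("adriamycin", label_to_term, annotated_elements)
hits_by_name = search("doxorubicin", label_to_term, annotated_elements)
```

Because both labels resolve to the same term, the synonym query and the preferred-name query return the same elements, whereas a plain keyword match on the element text would depend on which label the text happens to use.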
Challenges and future plans
We are currently working on expanding the Resource Index to include more resources. Our goal is to index up to 100 public resources, including PubMed, which provides access to all research articles in biomedicine (approximately 20 million elements). We have analyzed the metrics on ontologies in order to restructure the database backend for the Resource Index. This restructuring has enabled us to reduce the processing time for one of our larger datasets from one week to one hour [18]. With this
Conclusions
We have presented an ontology-based workflow to annotate biomedical resources automatically, as well as an index constructed using this workflow. Ontology-based indexing is not new in biomedicine; however, it is usually restricted to indexing a specific resource with a specific ontology (vertical approach). We adopt a horizontal approach, accessing annotations for many important resources using a large number of ontologies. This approach follows the translational bioinformatics and Semantic Web
Acknowledgements
This work was supported in part by the National Center for Biomedical Ontology, under roadmap-initiative Grant U54 HG004028 from the National Institutes of Health. The NCBO Resource Index won the First prize in the Semantic Web Challenge 2010 (http://challenge.semanticweb.org/).
References (33)
- et al. State of the nation in data integration for bioinformatics. Biomedical Informatics (2008)
- et al. A comparative evaluation of full-text, concept-based, and context-sensitive search. American Medical Informatics Association (2007)
- et al. Knowledge-based methods to help clinicians find answers in medline. American Medical Informatics Association (2007)
- et al. Semantic annotation for knowledge management: requirements and a survey of the state of the art. Web Semantics: Science, Services and Agents on the World Wide Web (2006)
- et al. A review of ontology based query expansion. Information Processing and Management (2007)
- et al. SAPHIRE – an information retrieval system featuring concept matching, automatic indexing, probabilistic retrieval, and hierarchical relationships. Computers and Biomedical Research (1990)
- et al. MedicoPort: a medical search engine for all. Computer Methods and Programs in Biomedicine (2007)
- et al. Essie: a concept-based search engine for structured biomedical text. American Medical Informatics Association (2007)
- et al. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics (2006)
- A.J. Butte, R. Chen, Finding disease-related genomic experiments within an international repository: first steps in...
- Text mining and ontologies in biomedicine: making sense of raw text. Briefings in Bioinformatics
- Use and misuse of the gene ontology annotations. Nature Reviews Genetics
- BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research