Introduction

Functional annotation of genes is an important challenge once the sequence of a genome has been completed. Gene annotation encompasses a variety of functional attributes, from structural motifs, through cellular function, to associations with specific functions and processes at the level of the organism.

Apart from descriptors at the gene and protein level, the phenotype effect of a mutated or deleted gene forms part of its functional annotation. Systematic mutation and RNA interference screens have been performed for selected phenotypes in Drosophila. melanogaster, Caenorhabditis. elegans, and Saccharomyces. cerevisiae.1, 2, 3 Also for Mus. musculus, an ambitious project to mutagenize most or all genes has been conceived.4

Previous studies have correlated various attributes of human genes, such as predicted function or amino-acid sequence length, with the chance of causing a disease.5, 6, 7 However, only limited attention has been paid to the grouping of (disease) phenotypes into a matrix as a means of predicting biological relations between genes and proteins. Such systematic grouping of genes by their associated phenotypes may be referred to as phenomics.

The existence of specific disease phenotype groups suggests that phenomics is possible in humans. Also, there are large numbers of naturally occurring mutations and we have a detailed knowledge of the phenotypes that are associated.8, 9 Qualitatively, the human mutation data set surpasses that of most model organisms, because we can detect and describe human anomalies in more detail than in other species.10 Specific examples illustrate that individual genes that cause a given phenotype tend to be linked at the biological levels as interacting proteins, as components of a multiprotein complex, or as steps in a biochemical pathway.

We note that current human disease phenotype data set(s) such as Online Mendelian Inheritance in Man (OMIM) are far from ideal for bioinformatics purposes as they were not designed with the intention to systematically describe phenotypes. Therefore, this study can only provide proof of concept, and certainly not a finished product. Our analyses show that, these limitations notwithstanding, there is a detectable association between phenotype clusters and the function of the underlying genes. We have further devised and used a system for assigning similarity scores, which allows all genes with known phenotypes to be compared. This approach is very different from that which uses an artificial division into predetermined entities.11 We have classified over 5000 disease phenotypes in humans into a single human phenome system. Given that the human phenome reflects the biology of the system, any phenotype classification should reflect other measures of gene function at least to some extent. We have therefore compared the organization of genes within this human phenome map to their known interactions, and similarities at multiple levels including sequence, protein motifs, and assigned gene ontology (GO) functions. Our analyses show that for similar human disease phenotypes, there is indeed a consistent association at multiple levels of gene annotation.

The phenotype similarity data are accessible through a web interface (http://www.cmbi.ru.nl/MimMiner/). This interface, called the ‘MimMiner’, enables the user to retrieve the similarity ranking for a specific OMIM phenotype.

Methods

The OMIM database

The OMIM (November 25, 2004) database12 contains record-based textual information, one gene or one genetic disorder per record. OMIM also contains literature references and links to other databases. We have used the full-text (TX) and clinical synopsis (CS) fields of all records that describe genetic disorders. We will refer to this combination of the TX and CS fields as a ‘record’.

OMIM is a rich data set containing 16 357 TX records of which 5132 describe a disease phenotype (semi-automatically selected, manually verified). The remaining records contain variation, mutation, gene/protein, or other information.

OMIM was originally designed to be read by humans, not by computer. We have automatically extracted the phenotypic features from OMIM using text analysis techniques.

Creation of ‘feature vectors’

We used the anatomy (A) and the disease (C) sections of the medical subject headings vocabulary (MeSH) to extract terms from OMIM. MeSH terms and their plurals and components are concepts. MeSH provides a standardized way to retrieve information that uses different terminology to refer to the same concepts. Its size and internal hierarchical structure make it a rich dictionary, which is needed to match the OMIM texts. MeSH concepts serve as phenotype features characterizing OMIM records: every entry in the feature vectors represents an MeSH concept. The number of times the terms for a given concept are found in an OMIM record reflects the concept's relevance to the phenotype.13, 14 Nonspecific concepts like ‘syndrome’ or ‘disease’ were excluded.

Refinement of the feature vectors

MeSH concepts can be very broad like ‘Eye’ or more specific like ‘Retina’. MeSH includes a concept hierarchy that describes relationships such as ‘Eye’–’Retina’–’Photoreceptors’. ‘Eye’ is called a hypernym of ‘Retina’, which in turn is a hypernym of ‘Photoreceptors’, etc. Conversely, ‘Retina’ is called a hyponym of ‘Eye’. To ensure that the concepts ‘Eye’ and ‘Retina’ are recognized as similar, we use the MeSH hierarchy to encode this similarity in the feature vectors by increasing the value of all hypernyms as described in (Figure 1)

Figure 1
figure 1

Example of concept expansion using the MeSH hierarchical structure. The concept ‘Photoreceptors’ (with MeSH descriptor: D010786) is found twice in an OMIM record. Expansion of this concept gives the hypernym ‘Retina’ (D012160). The relevance of the concept ‘Retina’ is derived from ‘Photoreceptors’ according to equation (1). ‘Retina’ has seven descendants or hyponyms, thus its relevance becomes 1/7 × 2 (for two times ‘Photoreceptors’). Similarly, Retina's hypernym ‘Eye’ (D005123) has 11 hyponyms, so that ‘Photoreceptors’ contributes 1/11 × 1/7 × 2 (=0.026) to the relevance of ‘Eye’. (Squared brackets indicate the MeSH tree position and a ‘+’ sign that there are more specific concepts underneath).

For any concept c, its relevance rc becomes the actual count of the concept in a document rc,counted plus the relevance sum of the concept's hyponyms rhypo's. This sum is divided by the number of hyponyms nhypo,c. This equation is applied iteratively from the most detailed level in the MeSH tree, till the highest hypernym level is reached. Not all concepts in the OMIM records are equally informative. For example, ‘retina pigment epithelium’ occurs rarely, and thus provides more specific information than very frequently occurring terms such as ‘Brain’. We allowed for differences in the importance of concept frequencies by using the inverse document frequency measure15

The inverse document frequency or global weight of concept c (gwc) is the logarithm of the total number of records analyzed (N; N=5080) divided by the number of records that contain concept c, nc. Not all OMIM records contain equally extensive descriptions. These differences will make a comparison between records difficult because the diversity and the frequency of concepts in the larger records will exceed those in the smaller records. Equation (3) was used to (partly) correct for these record size differences.15

The local weight of concept c in a record is a function of the concept’s frequency rc divided by the frequency of the most occurring MeSH concept in that record, rmf. The three feature vector corrections were applied in the order of equations (1), (2) and (3).

Comparing OMIM records

The similarity between OMIM records can be quantified by comparing the feature vectors that are expanded and corrected (equations (1), (2) and (3)). Similarities between feature vectors were determined by the cosines of their angles (equation (4)).16

The similarity between the feature vectors X and Y (s(X, Y)) is a function of their respective concept frequencies xi and yi. The index i runs from 1 to the number of MeSH concepts l.

Phenotype–genotype correlations

The matrix of all pairwise vector similarities was denoted the phenomap. A subset of this phenomap containing the 1653 OMIM records for which the causative gene and protein are known was used as a starting point for determining the relation between phenotypic similarities on the one hand and genotypic similarities on the other. All 1653 phenotypes associated with a protein in the UniProt database17 were then compared to four genotype-related data sets. The average of 10 randomized phenomaps was used as a control for background signal. Feature vectors were randomly permutated using Fisher–Yates shuffling18 before computation of the phenomap.

The PFAM database19 was used to determine whether pairs of genes share similar domains.

To check if genes are similar at the sequence level, we compared the proteins associated with the 1653 phenotypes from the UniProt database in an all-against-all Smith–Waterman analysis20 (Blosum-90). Sequence pairs with an alignment e-value better than 10−6 were considered similar.21

Protein–protein interactions were extracted from the interaction section of the HPRD database22 and used to check whether the proteins are part of the same complex or interact in any other way.

The GO database23 and the GO annotations (GOA)24 were used to determine if two genes fall within the same functional category.

Two genes/proteins were considered related when they shared at least three GO terms at the sixth annotation level.25 Annotations at the more detailed levels (level 7, 8, etc) were converted to the corresponding annotation at the sixth level. For example, the RDS protein [UniProt:P23942] has the detailed annotation ‘visual perception’ at level 7 [GO:0007601], which gets converted to the more general ‘sensory perception of light’ [GO:0050953] at level 6.

Clustering

Multidimensional scaling methods did not show evident clusters, which makes every clustering attempt intrinsically subjective. In order to objectively visualize the relations between the phenotypes, we performed average-link clustering that results in a tree structure.26 This tree can be explored at http://www.cmbi.ru.nl/MimMiner/.

Results

Feature vectors

A total of 5132 OMIM records (31%) describe a disease phenotype, and their TX and CS fields were analyzed for the presence of concepts from the anatomy (A) and disease (C) sections in the MeSH thesaurus. For 5080 OMIM records, we could match one or more MeSH terms.

Comparing OMIM records

The 5080 × 5080 pair-wise feature vector similarities form the phenomap. The distribution of these similarity scores is shown in Figure 2a. We used multi-dimensional scaling methods to test whether the phenotypes form distinct groups. The goal of these methods is to detect meaningful underlying dimensions from high-dimensional data. For computational reasons, we used a sample that contains only the OMIM records with a CS field. Principal component analysis, classic multidimensional scaling, and independent component analysis implementations from the ‘R’ software package27 were used, but none of the methods showed evident groups. This suggests that the human phenotypes when based on MeSH terms form a continuous spectrum rather than distinct classes (data not shown).

Figure 2
figure 2

Histograms of phenotype similarity scores. (a) Histogram of all pairwise phenotype similarity scores of the 5082 phenotype records. The vertical axis is logarithmic; most phenotype–phenotype pairs have a low similarity score. (b) The best scores for all phenotypes in the disease phenotype datas et (nearest-neighbor similarity).

For each OMIM record, the most similar of the other 5079 records was identified (Figure 2b). We asked whether moderately similar phenotype pairs might still yield reasonable hypotheses. For gene-finding in humans, this appeared to be the case in a number of instances. Even in the 0.3–0.4 bin, individual cases displayed potentially relevant phenotypic similarities. For example, ‘Fibromuscular Dysplasia of Arteries’ [OMIM:135580] and ‘Cardiomyopathy, Familial Hypertrophic’ [OMIM:192600] have a phenotypic similarity score of 0.31, the ninth most similar score. The observation that these conditions are clinically inter-related is supported by two case reports.28, 29

Next, we asked if clinically different conditions caused by mutation of the same gene were likely to be detected as having considerable phenotype overlap by our system. Indeed, approximately 40% of phenotypes caused by allelic mutations were more similar to each other than to any other phenotype in the data set (data not shown).

The conclusion is that the more phenotypes resemble each other, the more likely they are to share an interaction or pathway. However, we cannot define a general cutoff for similarity scores because even low-scoring OMIM records can occasionally contribute to our understanding of the diseases.

Phenotype–sequence similarity correlations

We asked whether the similarity at the phenotype level predicts similarity in gene/protein function. The causative gene/protein is known for 1653 of the 5080 OMIM records that describe a phenotype (33%). The corresponding 1653 × 1653 phenomap was extracted from the total phenomap. These 1653 phenotypes linked to 2168 corresponding protein sequences (1401 unique sequences). Sequences were extracted from the UniProt database and used to calculate all-against-all Smith–Waterman sequence alignments. Figure 3a shows the fraction of significant sequence alignment similarities as a function of the phenotype similarity scores. The percentage of phenotype pairs for which the causative proteins are similar increases with increasing phenotype similarity score from 0.6% to a maximum of 26.6%. Approximately half of these are owing to different mutations in the same gene causing similar phenotypes. For example, ‘Robinow–Sorauf syndrome’ [OMIM:180750] and the related ‘Saethre–Chotzen syndrome’ [OMIM:101400] are both caused by a mutation in TWIST1 [UniProt:Q15672]. The other relations are owing to mutations in different genes that share sequence similarity. For example, the ‘Rufous Oculocutaneous Albinism’ [OMIM:278400] and ‘Albinism, Oculocutaneous, Type IB’ [OMIM:606952] phenotypes show 0.68 phenotypic similarity. These diseases are caused by mutations in the TYRP1 [UniProt:P17643] and TYR [UniProt:P14679] proteins, respectively, that are 43% identical at the sequence level. Mutations in TYRP1 also cause ‘Oculocutaneous Albinism Type 3’ [OMIM:203290].

Figure 3
figure 3

Phenotype similarity versus genotype relation categories. (a) Sequence similarity (Smith–Waterman, threshold e-value 1e−6). (b) Sharing of at least one PFAM domain. (c) Protein–Protein interactions according to the HPRD database. (d) Sharing of three or more GOA at the sixth or more detailed level. The 0.8–0.9 and 0.9–1.0 phenotype similarity bins suffer from low counts. The average signal of 10 randomized phenomaps is at the level of the two lowest bins (data not shown).

Many proteins have multiple functional domains. If two proteins share a functional domain, then mutations may disrupt the same process and thereby lead to similar phenotypes. This can be true, even if the full protein sequences do not share significant sequence identity. Figure 3b shows the percentage sequence pairs that share a PFAM domain as a function of the phenotype similarity scores. The majority of pairs that share a PFAM domain (67% or 119/181, disallowing the same gene in the 0.5–0.6 bin) also share significant overall sequence similarity. Only in a minority did sequence comparison based on domain sharing add new information compared to overall sequence similarity. For some phenotype pairs, the proteins lack significant overall sequence identity, but do share one or more common structural features. For instance, ‘Long Qt Syndrome 3’ [OMIM:603830] is caused by a mutation in SCN5A [UniProt:Q14524] and shares phenotypic characteristics with ‘Jervell And Lange–Nielsen Syndrome’ [OMIM:220400] that can be caused by a mutation in KCNQ1 [UniProt:P51787]. These two proteins have an ‘Ion Transporter’ domain in common [PFAM:PF00520].

Phenotype–protein interaction correlations

We checked which of the 1653 × 1653 pairs have an interaction described in the HPRD (Figure 3c). Although the HPRD dataset is sparser than the other datasets, and thus reveals fewer relations, 54% of these HPRD relations were not yet detected by the sequence alignments or the PFAM analyses. To check whether possible relations can still be found in the lower similarity ranges, we performed a detailed inspection on the 212 protein interactions that are listed in the HPRD for pairs that have a phenotypic similarity between 0.3 and 0.4. Over 50% of the relations suggested a plausible reason for the phenotypic similarities. For example, ‘Wiskott–Aldrich Syndrome’ [OMIM:301000] and ‘Fleisher Syndrome’ [OMIM:307200] are caused by mutations in WAS [UniProt:P42768] and BTK [UniProt:Q06187], respectively. WAS and BTK do not share significant sequence similarity, nor do they share a PFAM domain. However, both are involved in cell growth regulation and cytoskeleton processes.30 WAS is phosphorylated by BTK, leading to activation of the Arp2/3 complex.31 In general, despite its sparse nature, the HPRD provided biologically plausible information.

Phenotype–functional process correlations

Various excellent databases describing pathway information are available, for example, KEGG,32 BRENDA,33 Reactome,34 etc. Most of these databases focus on metabolic pathways, whereas less than 10% of the OMIM phenotype records relate to metabolic disorders. To get an impression of possible functional relations between genes and proteins, we compared their GOA. We defined GO similarity by the sharing of at least three GOA at the sixth or more detailed GO level.25 The signal we find is well above the average of 10 randomized matrices (7%) over all bins. The percentage of pairs that share three or more GOA increased as a function of the phenotypic similarity (Figure 3d). Compared to the three other sets, 76% of the relations were new. A random set of 50 of the 786 proteins that share three or more GOA in the 0.5–0.6 bin were inspected in more detail. A plausible reason for the phenotypic similarities was found in 37 of these 50 cases. Unlike in Figure 3a–c, less than half of the relations are owing to a defect in the same gene, which suggests the criterion of sharing three GOA at the sixth level is less stringent than in the other data sets.

Nevertheless, sharing GOA can be regarded as a relatively nonspecific characterization of gene function with a high noise level (Figure 3d). Overall, when the genotype by phenotype relations were normalized using random phenomaps, the HPRD was most efficient in providing nonrandom gene–gene relationships (Figure 4).

Figure 4
figure 4

Histogram of normalized genotype relations as a function of phenotype–phenotype similarity. All values are corrected for random information levels. Although the HPRD data set contains fewer relations than the other sets, the normalized signal is more than two times as strong, reflecting the quality of the HPRD data set. The 0.8–0.9 and 0.9–1.0 phenotype similarity bins suffer from low counts.

Discussion

We have developed a text-mining approach to map relationships between more than 5000 human genetic disease phenotypes from the OMIM database. OMIM does not contain a standardized system for scoring phenotypes. We decided to use OMIM for this initial analysis because it is still by far the most comprehensive of available databases. The resulting phenotype matrix has a number of characteristics that suggest that it (or similar systems based on more comprehensive phenotyping efforts) might be a useful addition to functional genomics tools such as the HPRD and KEGG. As expected, we find that different phenotypes associated with mutations of a single gene show considerable overlap. Such allelic conditions are each others best phenotypic hit in 40% of the cases. After exclusion of allelic conditions, there remained a positive correlation between phenotypic similarities on the one hand, and sharing of gene sequence, protein motifs, functional annotation, and known protein interactions on the other (Figure 3). In these comparisons, we found a relatively high level of noise for GO. We believe that our finding that all gene attributes correlate with phenotype sharing can be explained by assuming that human phenotypes reflect disturbance of functional modules, more than of individual genes. One striking finding was that biologically meaningful relationships were mostly detected in the small fraction of the phenotype relations with a similarity score greater than 0.4 (Figure 3). The combined data suggest that we may indeed use phenotypic relationships as general indicators of biological and functional interactions at the gene and protein levels.

Several applications can be envisaged for the phenomap. First, our analysis suggests that the phenotype matrix may aid in the prediction of candidate genes for the 3400 traits listed in OMIM whose molecular basis remains to be defined. Second, it is conceivable that one would take phenotypic relationships as the starting point for biochemical and cell biological experiments in order to prove a suspected link at the gene and protein levels. Experiments of this type have already been shown to be successful (eg polycystic kidney disease (PKD1, PKD2), tuberous sclerosis (TSC1, TSC2), breast and ovarian cancer (BRCA1, BRCA2), and Fanconi anemia (FANCA-G)).35, 36, 37, 38

Finally, there may well be a point in pursuing large-scale phenotype analyses using more precise measures of the phenotypes themselves. OMIM was not designed as a structured database for phenotype analysis. Indeed, it does not contain rules for feature assignment and most of the phenotype information collected by our text-mining approach derives from free text fields. A more standardized method for phenotype description including frequency estimates for each feature would greatly increase the yield of the analyses of genotype–phenotype correlations.39

Phenotype clustering reflects the modular nature of human disease genetics. Thus, the phenomap may be used to predict candidate genes for diseases. Such predictions will further improve if a unified system of phenotype descriptors is developed.