Protein protein interactions, evolutionary rate, abundance and age

Saeed, Ramazan; Deane, Charlotte M

doi:10.1186/1471-2105-7-128

Research article
Open access
Published: 13 March 2006


BMC Bioinformatics volume 7, Article number: 128 (2006)

Abstract

Background

Does a relationship exist between a protein's evolutionary rate and its number of interactions? This relationship has been put forward many times, based on a biological premise that a highly interacting protein will be more restricted in its sequence changes. However, to date several studies have voiced conflicting views on the presence or absence of such a relationship.

Results

Here we perform a large-scale study over multiple datasets to demonstrate that the major reason for conflict between previous studies is the use of different but overlapping datasets. We show that a lack of correlation between evolutionary rate and number of interactions in a dataset is related to the error rate of that dataset. We also demonstrate that the correlation is not an artifact of the underlying distributions of evolutionary distance and interactions and is therefore likely to be biologically relevant. Furthermore, we consider the claim that the dependence is due to gene expression levels and find some supporting evidence. A strong and positive correlation between the number of interactions and the age of a protein is also observed, and we show that this relationship is independent of expression levels.

Conclusion

A correlation between number of interactions and evolutionary rate is observed, but it is dependent on the accuracy of the dataset being used. However, it appears that the number of interactions a protein participates in depends more on the age of the protein than on the rate at which it changes.

Background

It has been suggested many times that the rate at which a protein evolves decreases with the number of physical interactions it participates in [3–6]. The intuition behind this idea is that proteins with a greater fraction of amino acid residues playing an essential role will, on the whole, evolve slower than those with a small fraction of such crucial residues. Thus highly interacting proteins will evolve at a slower rate. A study by Fraser et al (2002) demonstrated the negative correlation between number of protein-protein interactions and evolutionary rate that this theory would suggest. The negative correlation was determined by estimating the evolutionary distance between orthologous proteins from the yeast Saccharomyces cerevisiae and the nematode worm Caenorhabditis elegans. Using interaction data from studies conducted by Uetz and "core data" from Ito, it was shown that yeast proteins possessing a large number of interacting partners evolve slower than those that have fewer interacting partners. However, this relationship proved to be contentious, and the observation was challenged by Jordan et al (2003) on the basis that a correlation between a protein's evolutionary rate and number of interactions arises only because a few highly interacting proteins evolved more slowly. In Jordan's study a smaller set of interactions was considered (the MIPS dataset) and a different measure of evolutionary distance was adopted, using the distance between two yeast species, S. cerevisiae and S. pombe [7]. Although one would expect that such a comparison would result in an increase in the strength of the relationship, as more orthologs could be found between two closely related species and evolutionary distance could be estimated with greater precision, this was shown not to be the case and only a very weak correlation was detected.
This finding was immediately rebutted by Fraser et al (2003) who claimed that the dataset used in the study conducted by Jordan et al (2003) was too small. They also stated that the method used to obtain evolutionary distance resulted in low confidence data and hence the lack of any correlation [8]. Bloom et al (2003) then demonstrated that the presence of any correlation between evolutionary rate and number of interactions was dependent on the dataset that was used [9].

Experimental methods used to obtain interaction datasets include the Yeast two-hybrid (Y2H) assay and Mass spectrometry (MS) of purified complexes. In the Y2H method, pairs of proteins to be tested for interaction are expressed as fusion proteins in yeast (hybrids): one protein is fused to a DNA-binding domain, the other to a transcriptional activator domain. Any interaction between them is detected by the observation of a reporter gene which results from the formation of a transcription factor [10]. It is an in vivo technique and both transient and unstable interactions can be detected. It is independent of native protein expression levels and has a fine resolution enabling interaction mapping between proteins.

Drawbacks include the fact that only two proteins are tested at a time (no cooperative binding is detected) and the binding takes place in the nucleus. Consequently many proteins are not in their native compartment and interactions between proteins are unrelated to the physiological setting. Auto activation of the transcription factor can also occur and the fusion process may malform the hybrids.

In the Mass Spectrometry of purified complexes, individual proteins are tagged and used as 'hooks' to biochemically purify whole protein complexes. These are then separated and their components identified by mass spectrometry. Two protocols are widely used: tandem affinity purification (TAP) and high-throughput mass-spectrometric protein complex identification (HMS-PCI). In this method several members of a complex can be tagged, giving an internal check for consistency; and it detects real complexes in physiological settings. Drawbacks include the fact that some complexes that are not present under the given conditions may be missed, tagging may disturb complex formation, and loosely associated components may be washed off during purification [11].

Previous studies have shown that data generated from such large scale experiments have varying error rates and that the number of overlapping interactions is low [12–14]. Explanations for this include that: the methods have not reached saturation point; different methods produce a large number of false positives; and some methods may have difficulties detecting certain types of interactions. Studies that have assessed the reliability of these datasets have uniformly acknowledged that data obtained from the Y2H studies contain high error rates and protein complex purification methods have a slightly higher level of accuracy. There remains a lack of analysis on the error rates within protein interactions databases that have gathered interaction data from numerous sources.

A further complication was highlighted by Bloom. Some of the experimental methods were shown to be biased towards counting more interactions for abundant proteins [9]. This is not a universally accepted conclusion and Fraser et al insisted that it is entirely possible that this manner of relationship between expression levels and number of interactions is an intrinsic characteristic of yeast rather than any experimental bias [15].

This link between abundance and experimental methods is of particular interest as it is known that highly abundant proteins evolve slower [16]. Bloom et al. demonstrated a strong negative correlation between the rate of evolution and the abundance of a protein. This reported correlation was far stronger than the correlation between evolutionary rate and connectivity [9]. Bloom et al (2003) assert that the relationship between expression levels and connectivity was responsible for the negative correlation between evolutionary rate and connectivity.

Some of the studies that observed correlations between connectivity and rate of change did not control for the abundance levels of proteins [17, 18]. It is clear that there is a strong relationship between connectivity and the expression level of a protein in individual experimental datasets [9]. Whether this relationship is still observed in accumulative interaction datasets (sets containing interaction data from multiple experimental sources) has yet to be investigated.

Wuchty (2004) examined the relationship between protein essentiality, connectivity, rate of change and conservation [18]. A negative correlation was found between rate of change and connectivity. However using a novel method to quantify the conservation of a protein, Excess Retention (ER), it was observed that both essentiality and connectivity correlated better with ER than with evolutionary rate. Unfortunately all these contradicting studies conducted their analyses on different datasets. Studies in which a correlation was observed used different data to that of studies where no correlation was observed. This leads to the possibility that the discrepancies between studies arise from the different protein interactions datasets, particularly if the errors in these datasets vary.

Here we analyse six widely accessible protein-protein interaction databases for the yeast S. cerevisiae. We calculated the evolutionary distance to Mus musculus and to the yeast species S. paradoxus using varying methods and then examined the resulting correlations. We considered the overlap of interactions in all the datasets, and calculated three separate measures for the accuracy of each dataset.

In general where no correlation was found in a dataset, it was because the dataset had a large number of interactions derived from experimentally inaccurate methods. Datasets with a high overlap from more robust experimental methods showed an obvious relationship. We show that where a negative correlation is observed it is not due to the simple combination of the distribution of evolutionary rate and number of interactions, but because of some underlying biological factor.

We also examine the impact of gene expression levels and protein age on our observed correlations. In line with previous findings we show that, in all datasets, highly expressed proteins evolve slower. In datasets where we observed a correlation between the number of interactions and evolutionary rate we also find that proteins that are highly expressed are also highly connected. We also find that older proteins possess a larger number of interactions and this is independent of protein expression levels.

Methods

Data

The following protein-protein interaction datasets were used in this study: DIP, MIPS, BIND, GRID, MINT and INTACT [19–24]. A self interaction was counted as one interaction against the interacting protein. Duplicates, including duplicates by virtue of inversion, were removed from the interaction sets.
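The clean-up described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example protein labels are hypothetical, and interactions are assumed to be given as pairs of protein identifiers.

```python
from collections import Counter


def unique_interactions(pairs):
    """Collapse duplicates and inverted duplicates (A-B vs B-A) into one set.

    A frozenset ignores the order of the two partners; a self interaction
    (A, A) becomes a one-element frozenset.
    """
    return {frozenset(pair) for pair in pairs}


def interaction_counts(pairs):
    """Count interactions per protein; a self interaction counts once."""
    counts = Counter()
    for interaction in unique_interactions(pairs):
        for protein in interaction:
            counts[protein] += 1
    return counts


# Hypothetical example data (labels are placeholders, not real ORF names).
raw_pairs = [
    ("A", "B"),
    ("B", "A"),  # duplicate by inversion
    ("A", "A"),  # self interaction
    ("A", "C"),
    ("A", "C"),  # plain duplicate
]
counts = interaction_counts(raw_pairs)
```

Storing each interaction as an unordered set is one simple way to make inverted duplicates compare equal without a separate normalisation pass.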

Protein sequences for the interaction sets were, where possible, downloaded from the dataset's corresponding website. In all other cases they were obtained from either SGD or UniProt, depending on the annotation of proteins [25, 26]. Not all interacting proteins could be assigned protein sequences.

The MIPS dataset contained two types of interactions, physical and genetic. Physical interactions are those ascertained from Y2H studies and purified complexes while genetic interactions were obtained from suppression mutation and synthetic lethality tests. These two types were treated as two different sets.

The INTACT dataset housed a small set of interactions that contained results from small to medium scale experiments, as well as the results from the four large scale studies [27–30]. These were treated as individual sets.

Evolutionary distance

Best Reciprocal Hit (BRH) orthologs in each interaction set were found using the BLASTP program [31], comparing against the entire proteome of either Mus musculus [1] or Saccharomyces paradoxus [2]. The evolutionary rate was estimated using two methods. The first [32] required us to numerically solve the equation q = ln(1 + 2d) / (2d), where q is the proportion of identical sites between aligned sequences and d is the evolutionary distance. The second method used the gamma distance correction [7], d = α[(1 - p)^(-1/α) - 1], where d is the evolutionary distance between two protein sequences, p is the proportion of different residues, and α is the estimated gamma shape parameter, α = 1.53 [33].
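Both distance estimates can be illustrated with a short sketch. The source does not say which numerical solver was used for the first equation, so bisection is an assumption here, as are the function names and the tolerance. In the second function p is treated as a proportion between 0 and 1.

```python
import math


def grishin_distance(q, tol=1e-10):
    """Solve q = ln(1 + 2d) / (2d) for d by bisection (solver is an assumption).

    q is the proportion of identical sites, 0 < q <= 1.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if q == 1.0:
        return 0.0  # identical sequences: zero distance

    def f(d):
        # Decreasing in d: close to 1 - q as d -> 0, tends to -q as d grows.
        return math.log1p(2.0 * d) / (2.0 * d) - q

    lo, hi = 1e-12, 1.0
    while f(hi) > 0.0:  # expand the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def gamma_distance(p, alpha=1.53):
    """Gamma-corrected distance d = alpha * ((1 - p)**(-1/alpha) - 1).

    p is the proportion of differing residues, 0 <= p < 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


# Hypothetical alignment with 60% identical sites.
d_first = grishin_distance(0.60)
d_second = gamma_distance(1.0 - 0.60)
```

Bisection is used only because the left-hand function decreases monotonically in d, which guarantees a single root; any standard root finder would serve equally well.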

Randomisation

The randomisation test was conducted by systematically selecting a protein in a dataset, and assigning it an evolutionary rate by sampling the distribution of evolutionary distances. These random rates were then plotted against the number of interactions. 100 sets were generated, their correlation coefficients were calculated and compared to the correlation coefficient of the original experimental set.
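A minimal sketch of such a randomisation test is given below. It assumes that random rates are drawn with replacement from the observed distances and that Spearman's rank correlation is the statistic compared; the function names, the seed and the example values are hypothetical.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # undefined; treated as no correlation in this sketch
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def randomised_correlations(distances, interactions, n_sets=100, seed=0):
    """Correlations for n_sets random assignments of distances to proteins."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sets):
        # Assumption: sample with replacement from the observed distances.
        sampled = [rng.choice(distances) for _ in interactions]
        results.append(spearman(sampled, interactions))
    return results


# Hypothetical data: one distance and one interaction count per protein.
distances = [0.1, 0.4, 0.2, 0.9, 0.5]
interactions = [5, 2, 4, 1, 3]
observed = spearman(distances, interactions)
null = randomised_correlations(distances, interactions)
# Share of random sets at least as negative as the observed correlation.
fraction_as_negative = sum(r <= observed for r in null) / len(null)
```

If the observed coefficient lies well outside the spread of the randomised coefficients, the correlation is unlikely to be a by-product of the two marginal distributions alone.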

Overlap

We universalised the labelling of all protein interactors in order to overcome the use of different notations to mark proteins. This was done by matching the sequence of each protein interactor to every sequence in the yeast genome (entire genome downloaded from SGD). Only 100% sequence matches were reannotated with the GenBank ID (GI code). Protein attrition was no more than 5% in all datasets.

Abundance

Gene expression level data was taken from the Young lab [34].

Error rates

We used three methods to assess the accuracy of the different interaction data sets.

The first was the expression profile reliability (EPR) index [14]. The EPR index is calculated using an online server [19]. An expression based distance score is calculated for all interacting protein pairs in a set.

The resulting distribution of distance scores is compared to the distance score distributions of standard interacting and noninteracting sets. The comparison yields the approximate percentage of true interactions in the set.

The second error rate indicator, the Reference Index, involves comparing each interaction against a reference set, following work done by Von Mering et al (2002). The reference set used was the DIP_Core dataset which is considered to be a good interaction set [19]. This dataset contains protein interactions that have been computationally verified or observed in more than one large-scale experiment or those that come from small scale experiments. The percentage of interactions, from the dataset of interest, present in the DIP_Core dataset is taken as an indicator of the reliability of that set.
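The Reference Index calculation can be sketched as below. The function name and the example pairs are hypothetical, and the sketch assumes that protein labels have already been made consistent between the two sets.

```python
def reference_index(interactions, reference):
    """Percentage of a dataset's interactions that also occur in the reference set.

    Pairs are compared without regard to partner order.
    """
    dataset = {frozenset(pair) for pair in interactions}
    ref = {frozenset(pair) for pair in reference}
    if not dataset:
        raise ValueError("dataset contains no interactions")
    return 100.0 * len(dataset & ref) / len(dataset)


# Hypothetical example: two of four interactions appear in the reference set.
dataset_pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
reference_pairs = [("B", "A"), ("D", "C"), ("X", "Y")]
index = reference_index(dataset_pairs, reference_pairs)
```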

The third estimator of error rate was the percentage of interacting proteins that shared the same subcellular localisation. In an extension to the logic that interacting proteins would share similar functional roles, it is also possible to say that they would share similar subcellular compartments [35]. The number of interactions in which both interacting partners share the same compartment is used to give a measure of error within an interaction dataset. The subcellular localisations of yeast proteins into 19 compartmental categories is available [36]. The Localisation Similarity (LS) index was the fraction of interactions in which both proteins were from the same compartmental category.
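As an illustration, the LS index could be computed along these lines. The sketch assumes one compartment category per protein and skips interactions where either partner lacks an annotation; both choices, like the names and example data, are assumptions rather than details given in the text.

```python
def ls_index(interactions, compartment):
    """Fraction of annotated interactions whose partners share a compartment.

    compartment maps each protein to a single compartmental category.
    """
    scored = 0
    shared = 0
    for a, b in interactions:
        if a not in compartment or b not in compartment:
            continue  # assumption: unannotated pairs are left out
        scored += 1
        if compartment[a] == compartment[b]:
            shared += 1
    return shared / scored if scored else float("nan")


# Hypothetical annotations and interactions.
compartment = {"A": "nucleus", "B": "nucleus", "C": "cytoplasm", "D": "nucleus"}
pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")]  # "E" is unannotated
ls = ls_index(pairs, compartment)
```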

In order to quantify the effect of error rates on each dataset, we ranked the datasets according to each measure of error, with 1 being the highest rank and 0 assigned if no error measure existed for the dataset. By calculating the mean rank of each dataset we obtain a consensus measure of error.
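One possible reading of this ranking scheme is sketched below. It assumes that rank 1 goes to the dataset with the highest value of an index, that a missing measure contributes a rank of 0 to the mean, and that ties are broken arbitrarily. The dataset names and index values are invented for illustration and are not taken from Table 2.

```python
def consensus_ranks(scores):
    """Mean rank per dataset across all measures.

    scores maps dataset name -> {measure name: value or None}.
    Rank 1 is the highest value; a missing value is given rank 0.
    """
    measures = sorted({m for values in scores.values() for m in values})
    ranks = {name: [] for name in scores}
    for measure in measures:
        present = [n for n in scores if scores[n].get(measure) is not None]
        present.sort(key=lambda n: scores[n][measure], reverse=True)
        position = {name: i + 1 for i, name in enumerate(present)}
        for name in scores:
            ranks[name].append(position.get(name, 0))
    return {name: sum(r) / len(r) for name, r in ranks.items()}


# Invented example values (percentages), not results from the study.
example_scores = {
    "set_a": {"EPR": 80.0, "REF": 60.0, "LS": 70.0},
    "set_b": {"EPR": 50.0, "REF": 20.0, "LS": 40.0},
    "set_c": {"EPR": None, "REF": 30.0, "LS": None},
}
consensus = consensus_ranks(example_scores)
```

Because a missing measure pulls the mean towards 0, datasets with missing measures are not directly comparable with the rest under this reading.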

Evolutionary excess retention

To estimate the age of a protein we calculated the Evolutionary Excess Retention (ER), previously used as a measure for conservation of a protein [18]. The ER is a value that depicts the propensity of a protein to have orthologs in other fully sequenced genomes. It should be noted that ER does not estimate the exact age of a protein and is not necessarily correct for all proteins, as it does not identify gene loss or consider horizontal gene transfer. Therefore proteins that have a high ER value are most likely to be old but proteins with a low ER value may not necessarily be new.

We searched for orthologs of S. cerevisiae proteins in H. sapiens, D. melanogaster, C. elegans, M. musculus and A. thaliana. Orthologs were taken from the InParanoid database [37], and we used only the core pairs of each cluster that had a confidence of 100%.

Results

Evolutionary rate

The number of proteins and their respective interactions varied in each dataset, as shown in Table 1. The largest set according to number of proteins was GRID, followed by the DIP_Full set. The number of proteins in a set does not necessarily dictate the total number of interactions in that set. The DIP_Full set contains 15,481 interactions while the BIND dataset, with only ~5% fewer proteins, possesses approximately half that number of interactions. The smallest dataset was INTACT_Small, which consists of data solely from a few small scale experiments.

Table 1 Datasets and their correlations. The different datasets that were used in the study and the number of their constituent proteins and interactions. The total number of proteins is shown, as well as the number of proteins for which sequence information was obtainable and the number of orthologs found in Mus musculus. The final four columns show the correlations between three factors: Evolutionary Distance (ED) as measured using Grishin's method, Abundance (A) and Number of interactions (I). P-values for these correlations were calculated; values in bold have a P-value greater than 0.03. The final column shows the result of a partial correlation between evolutionary rate and number of interactions where abundance has been controlled for. The BIND datasets lacked expression information due to nomenclature issues.
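The partial correlation in the final column of Table 1 can be illustrated with the standard first-order formula. The text does not state how the partial correlation was computed, so this formula is an assumption, and the input coefficients below are invented rather than taken from the table.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z.

    Here x is evolutionary distance, y is number of interactions and
    z is abundance; each argument is a pairwise correlation coefficient.
    """
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("partial correlation is undefined when |r| = 1")
    return (r_xy - r_xz * r_yz) / denominator


# Invented coefficients: ED vs I, ED vs A, I vs A.
r_partial = partial_correlation(-0.20, -0.30, 0.25)
```

If the correlation between evolutionary distance and number of interactions shrinks towards zero once abundance is controlled for, abundance accounts for much of the original association.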

Due to nomenclature and curation errors a few proteins without sequence data remained (Table 1). This resulted in the loss of some interactions from our final analysis. The BIND dataset was particularly affected.

Figure 1A shows the correlation between evolutionary rate and number of interactions for the DIP_Core dataset. The Spearman's rank correlation coefficient returned by the DIP_Core data was -0.1682 with a P-value of 1.29 × 10^-11, indicating the statistical significance of this weak correlation. This negative correlation suggests that proteins with a larger number of interactions tend to evolve slower.

Varying the model used to estimate evolutionary rate between Grishin's and Ota's method had little effect. The Spearman's rank correlations for all the datasets, using Grishin's method for evolutionary rate, are shown in Table 1. None of the sets returned very strong correlations. The strongest correlation was observed in the INTACT_Ho dataset, a fully experimental set obtained by complex purification. INTACT_Ito, a purely Y2H dataset, shows no correlation at all (Spearman's ρ: 0.03, P-value: 0.8905). The INTACT_SMALL dataset has a very small number of interactions and returned the second strongest correlation; however, its statistical significance was low (P-value: 0.0508). Figure 2 shows all the datasets ranked by their correlation values. The five datasets with the worst correlations (Spearman's ρ > -0.1) returned high P-values (> 0.01).

We also estimated the evolutionary rate of proteins by finding orthologs in the species S. paradoxus. Using a more closely related species to estimate evolutionary rate resulted in a greater number of orthologs. In all but the BIND dataset, we found orthologs for over 90% of the interacting proteins. Figure 2 shows the Spearman's rank correlations for evolutionary rate, estimated from S. paradoxus orthologs using Grishin's method, against number of interactions.

Error rates and overlap

To obtain the error rates in the datasets we used three indicators of correctness. The EPR index, Reference index and the localisation similarity (LS) index. All three error rates for the datasets are listed in Table 2. Annotational issues with the BIND dataset resulted in an inability to calculate its EPR or LS index. Approximately 40% of the interactions in the BIND dataset had at least one partner protein, for which no expression information or localisation categorisation could be found.

Table 2 Error Rate. The EPR index is an estimate of the percentage of true positives in the set. The Reference Index is the percentage of interactions in the set of interest that are also found in the reference set. The LS index is the percentage of interactions where both interacting partners shared the same subcellular localisation. The correlation observed between the Evolutionary rate (ED) and the number of Interactions (I) is shown. Values in bold type had a P-value greater than 0.03.

Not all error rate measures agree for particular datasets. For example, the MIPS_Genetic dataset has a high EPR Index. This would indicate a large number of true positives, yet when comparing to a reference set the overlap is only 4.08% which is a low value compared to other datasets. The corresponding localisation similarity is at 27%, which is on the lower end of the LS index spectrum.

The INTACT_SMALL dataset showed a strong correlation between connectivity and evolutionary rate when S. cerevisiae and M. musculus orthologs were used, yet the statistical significance of this correlation was very low. Furthermore, it returns an abnormally high EPR Index. This is because of the statistical nature of both tests: the small size of the INTACT_SMALL dataset makes the significance of any results highly dubious. The DIP_Core dataset, our reference dataset, has one of the lowest error rates. It possesses the second highest EPR Index and also a high LS index. The ITO and UETZ datasets (both large scale Y2H experiments) have high error rates, corroborating previous error rate analysis [13].

To obtain a consensus measure of the error rates, we calculated the mean accuracy rank for each dataset, based upon its rank in each measure. Figure 3 shows a graph of the consensus measure for all the datasets with the exception of the BIND, INTACT_SMALL and MIPS_GENETIC datasets. These three datasets are excluded as accurate error rates cannot be calculated for them.

The overlaps between the accumulative datasets (sets containing data from many experimental sources) and the four major experimental studies were also calculated. Table 3 shows the percentage overlap between the five INTACT datasets (the four major experimental studies and a fifth set containing interactions from small scale experiments) and data from the remaining databases. An interesting observation is that all the accumulative datasets that showed no correlation, i.e. both the MIPS datasets and the BIND dataset, have very little overlap with the affinity purification datasets. They do, however, contain a substantial number of interactions obtained from the large scale Y2H studies.

Table 3 Overlap. The percentage overlap between single experimentally derived sets (HO, GAVIN, ITO, UETZ, SMALL) and compound datasets. Compound datasets are those sets that contain information from a range of different experimental sets.
