The mission of structural coverage of most protein domain families, pioneered in PSI phases 1 and 2, is well on its way to completion [6]. NMR has played an integral role in this endeavor [35, 43]. Achieving structural coverage at a sequence identity level of ~30% for most protein domains in nature will be a monumental achievement for humankind, contributing in many ways to our understanding of the relationships between protein sequence, structure, and function. As we ponder the future contributions of structural genomics (SG) to biomedical research, we envision many opportunities beyond structure production that have been created by these high throughput structural biology platforms.

In the coming years, target selection strategies will likely go beyond the current sparse sampling of representative members of protein families to strategies aimed at providing extensive structural coverage of functional biological systems at high resolution. These systems could include (i) signaling networks and metabolic pathways, (ii) proteomes of medically important species, particularly humans, (iii) proteins related to human disease, including infectious diseases, (iv) the human and environmental microbiomes (‘metagenomics’), and (v) comparative analysis of structure, dynamics, and biochemical function across protein families. Applying SG platforms to one or more of these biological systems would leverage NIH’s investment in SG pipelines to further our understanding of fundamental mechanisms of protein function, molecular evolution, biological processes, and human disease at a reduced cost. Alternatively, SG centers could be redefined to focus on increasing the range and types of structures that presently cannot be routinely determined or modeled; for example, membrane proteins, higher order protein complexes, and eukaryotic proteins with extensive natively disordered regions and/or posttranslational modifications.

In considering future efforts, we note that the purified proteins themselves are among the most valuable products of SG efforts. The largest expense in SG is the preparation of pure, soluble protein. Much more could be done with these proteins, particularly the large fraction that does not readily yield structures. Given that all proteins carry out their biochemical functions through interactions with other molecules, we propose that fully realizing the potential of SG platforms requires integrating studies of the functionally relevant interacting molecules for each protein target. We therefore envision that a key element of future SG projects or platforms would be a systematic attempt to integrate experimental protein binding and/or biochemical information with structural data. Examples of such strategies, which would include HTP biochemical characterization of proteins, are (i) screening of ligand binding coupled with 3D structure analysis of functional protein-ligand complexes (see, for example, [23, 37]), (ii) screening or characterization of enzymatic activity coupled with 3D structures of relevant protein substrate/cofactor/inhibitor complexes (see, for example, [28]), and (iii) identification of protein-protein interaction partners coupled with 3D structures of relevant multiprotein complexes. A particularly powerful application of such integrated SG/functional studies would be the systematic and comprehensive structural characterization of proteins with related, but distinct, binding profiles, so as to understand the structural basis of their ligand (or substrate) binding specificity. Here we define “ligand” as any small molecule or macromolecule that interacts functionally with a protein. By adopting this approach, SG would have stronger synergy with functional genomics activities, and better integration with systems biology. These studies would also identify complexes that stabilize protein structures, and enable structures to be determined for otherwise refractory proteins.

NMR spectroscopy has a unique and valuable role in SG

During the course of PSI phases 1 and 2, we have shown that NMR is a highly complementary approach to X-ray crystallography for protein structure determination [32, 44]. Many proteins that provide good NMR spectra have not been successfully crystallized. In particular, in contrast to X-ray crystallography, NMR is about equally successful for prokaryotic and eukaryotic proteins. Therefore, comprehensive structural coverage of any protein system involving small to medium sized proteins would benefit from an NMR component.

NMR data provide the basis for extending the static structural view of proteins, through the rapid identification of natively unfolded proteins and residue-specific characterization of disordered protein segments, including functionally important flexible surface loops. NMR is also an essential tool for characterizing alternative conformations and allosteric states. In some cases, the minor conformational states that can only be characterized by NMR studies are critically important for biological function. NMR can also be used to measure the rates of transitions between these conformational states. As such, future SG efforts seeking to understand the evolution of structural, functional, and dynamic diversity across a protein family will require NMR studies to provide dynamic information.

NMR is also a powerful method for screening functional protein-ligand, protein-protein, and protein-nucleic acid interactions. While other biophysical techniques are also capable of identifying such interactions, NMR is uniquely able to identify even transient, but functionally important, interactions. The protein samples, and most of the instrumentation and techniques required for rapid NMR screening studies, are the same as those already used in PSI NMR structure determination pipelines, allowing easy integration of functional screening techniques. NMR methods are also valuable for validating initial ‘hits’ identified in HTP screening. It is important to recognize that the use of NMR as an HTP screening tool is not limited by protein size, since one may monitor either the protein or the ligand to detect the interaction.

Finally, NMR data are used to generate new functional hypotheses, and to confirm functional annotations, interactions, or biochemical reaction rates revealed in other ‘omics’ projects (e.g., functional genomics, transcriptomics, or metabonomics). Hence, we envision that NMR will play a key role in connecting SG with these ‘omics’ approaches, thereby better integrating SG into systems biology.

Accomplishments of NMR SG groups during PSI

One Large Scale Center, the Northeast Structural Genomics Consortium (NESG), and one Specialized Center, the Center for Eukaryotic Structural Genomics (CESG), have made major commitments to protein NMR sample and structure production. The two centers have deposited into the PDB some 300 protein NMR structures (>90% of the PSI NMR structures) over the first 8 years of the PSI program. Thus, with ~12% of PSI resources dedicated to NMR pipelines, ~10% of PSI structures have been determined by NMR. Given similar levels of support and priority in these two centers, NMR makes contributions to structure production that are comparable to those of X-ray crystallography (Fig. 1, left panel). The Joint Center for Structural Genomics (JCSG), Center for Structure of Membrane Proteins (CSMP), and New York Center on Membrane Protein Structure (NYCOMPS) have also used NMR effectively, though with a smaller percentage effort. Many of these structures would not have been solved without the participation of NMR. Indeed, ~15% of the small proteins that other Large Scale Centers provided to NESG NMR groups because they could not be crystallized subsequently yielded 3D structures by NMR. Many other opportunities to solve PSI target structures may have been missed by the other PSI centers, where NMR-tractable proteins have been produced but not pursued by NMR analysis.

Fig. 1
(left panel) In the two PSI centers with major commitments to NMR sample and structure production, some 37% of structures were determined by NMR (42% and 22% in NESG and CESG, respectively). (right panel) MW distributions for protein NMR structures (>50 residues) determined by PSI groups and non-SG groups in the same time period are similar. Inset: histogram plot of MW distribution of PSI NMR structures. Statistics were compiled in October 2008.

Comparison of PSI and non-SG protein NMR structures deposited in the PDB during the same time period reveals that (i) the average molecular weight (MW) of PSI NMR structures, ~13 kDa, is similar to that of non-SG structures (Fig. 1, right panel), (ii) the fraction of homo-oligomeric protein structures (~15%) is also about the same, but (iii) the quality of PSI NMR structures is significantly better, as judged by PROCHECK dihedral angle distributions and MolProbity atomic clash scores (Fig. 2). As a consequence, PSI NMR structures are generally of sufficiently high accuracy to be used in crystallographic molecular replacement studies [30], and are as useful as medium-resolution (1.8–2.5 Å) X-ray crystal structures for high-quality homology modeling (e.g., [22, 24]). The PSI NMR structure pipelines have also demonstrated that they can address challenging protein targets, including proteins with MW 20–35 kDa (Fig. 1, right panel), dimeric and tetrameric proteins, and membrane proteins.

Fig. 2
NMR structure quality, assessed by MolProbity [7] using Z scores defined in the Protein Structure Validation Suite (PSVS) software [5], is significantly better for structures generated in the PSI Centers (median Z score = −4.4 over 8 years) than for NMR structures deposited in the PDB by groups not involved in structural genomics (median Z score = −8.6) during the same time period. The analysis included statistics for all NMR structures deposited in the PDB by PSI centers (left panel) compared with statistics for 50 NMR structures chosen randomly from those deposited in the PDB by non-SG research groups (right panel) during the same time period. In these box plots, the central horizontal line is the median value, the bottom and top of the box denote the first and third quartiles, and the vertical “whiskers” denote ±1.5 times the interquartile range (approximately two standard deviations). Outlier values are indicated by open circles.
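The box-plot conventions described in this caption can be illustrated with a minimal sketch. The function name `box_plot_summary` and the Z scores in the usage note are hypothetical and are not PSI data. The sketch assumes the common Tukey convention, in which each whisker is drawn to the most extreme data point lying within 1.5 times the interquartile range of the box; the plotting software used for Fig. 2 may compute quartiles slightly differently.

```python
import statistics


def box_plot_summary(values):
    """Summarize values the way a Tukey-style box plot displays them.

    Assumption: whiskers end at the most extreme data points that lie
    within 1.5 x IQR of the box; points beyond those fences are outliers.
    """
    data = sorted(values)
    # Three cut points dividing the data into four equal parts:
    # first quartile, median, third quartile.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical clash-score Z scores, for illustration only.
    example = [-20.0, -6.0, -5.0, -4.5, -4.0, -3.5, -3.0, -2.0, -1.0]
    print(box_plot_summary(example))
```

For the hypothetical input shown, the value −20.0 falls below the lower fence and would be drawn as an open circle, while the remaining values determine the box and whiskers.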

NESG, CESG, and JCSG have also developed new methodology for lowering the costs per NMR structure, including (i) protocols for HTP preparation of 13C/15N- and 13C/15N/2H-enriched samples using novel eukaryotic wheat-germ based cell-free expression systems [39, 40] and bacterial single protein production (SPP) systems [29, 33, 34], (ii) HTP NMR screening platforms using microprobe robotics for buffer and construct optimization [1], (iii) GFT NMR [2, 3, 19, 20, 36], and related HIFI [8] and APSY [13–15] NMR experiments for reducing NMR measurement times by more than an order of magnitude, (iv) software for semi-automated data analysis and structure calculations [4, 9–18, 21, 25, 26, 41, 46], (v) software and protocols for structure validation and refinement based on residual dipolar couplings (RDCs) and chemical shifts [31, 38, 42], and (vi) software and servers for comprehensive structure quality assessment [5, 17] and refinement [30]. These methods have reduced the average time required per structure to 2–3 weeks for small to medium sized proteins; in favorable cases, NMR structures are determined in only a few days. Although not in the original charge to the PSI NMR groups, recent efforts in technology development have focused on addressing larger proteins, oligomeric structures, and protein-protein complexes. For example, NYCOMPS and CSMP have made significant advances in developing new methods for sample preparation and NMR analysis of membrane protein structures [27, 45].

A promising future for NMR contributions to SG and the larger biomedical community

NMR’s role in structural biology is still rapidly evolving. Unlike X-ray crystallography, which has matured to a state in which almost all aspects can be highly automated, NMR is still approaching this goal. We are very optimistic that over the next decade NMR will continue to make gains analogous to those seen for crystallography over the past few decades. For example, recent advances demonstrate that sparse constraints, such as chemical shift data, residual dipolar coupling data, and/or small numbers of long-range distance constraints, can be combined with conformational energy calculations to provide good quality protein structures. These emerging technologies will expand the range of proteins that can be addressed at high resolution by NMR, as well as the speed with which this can be done. The new avenues of biological research opened by SG platforms will be greatly enhanced by these NMR technologies. Clearly, NMR approaches offer tremendous opportunities for SG projects, and will be required in order to extract the greatest knowledge and understanding of whichever biological systems are targeted in the next phase of SG research.