Introduction

“Scaffold hopping” or “molecular hopping” is a recent concept in the field of medicinal chemistry that refers to the search for compounds with bioactivities similar to given original structures1, 2. Many computational approaches have been used to generate heterogeneous structures with bioactivity similar to that of a given structure of interest. These include de novo molecular design, virtual screening, pharmacophore search, topology similarity search and shape similarity search3, 4, 5, 6, 7, 8. Some approaches require the three-dimensional structure of the target, so that the binding site information can be taken into account in the search for molecules with favorable interactions. Other approaches require known active compounds, so that information about the ligand's properties, such as molecular hydrophobicity and hydrophilicity, charge, and shape features, can be used to constrain database mining. Scaffold hopping has proved very effective in molecular design, as shown in many review papers. For example, in work conducted at Abbott Laboratories, Zhao et al. performed scaffold modification on a screening hit and identified potent and selective growth hormone secretagogue receptor (GHS-R) antagonists9. The scaffold of the hit antagonist was changed from a phenylisoxazole ring to a tetralin carboxamide, dramatically improving its binding affinity and other physicochemical properties. In addition, several pharmaceutical companies have modified the hydrophobic ring portion of the classic statin class of HMG-CoA reductase inhibitors and thereby invented new entities that they have been able to bring to the drug market.

There are several methods of molecule fragmentation that produce meaningful fragments, such as scaffolds and functional groups. This analysis usually involves three steps:

  1. divide the molecules into fragments based on some rules to produce substructures;

  2. obtain a unique list of the identified substructures;

  3. process the substructures to assess their importance and to identify the most interesting molecules.

The first fragmentation step is crucial, because it determines what kinds of substructures will be produced. The method proposed by Bemis and Murcko10 fragments the whole molecule into rings, linkers and chains. A linker is a minimum atom path connecting ring parts. The rings are familiar to chemists and obvious from the chemistry of the molecule; all remaining atoms belong to the chains. The framework of the molecule consists of all ring and linker fragments. This method has been implemented by several research groups with slight modifications. Xu11 modified the ring definition to include the unsaturated ring-bonded atoms in order to maintain the charge and geometric properties of the ring system. By doing this, he was able to program a system that classifies compounds based on their molecular framework. Another important approach, introduced by Lewell et al., is to fragment the molecule based on 11 simple chemical reaction types12. The resulting fragments are ready for use in in silico synthesis to form a virtual library. These two basic yet very important methods complement each other in some situations. The former tends to produce a large scaffold, with no clues as to how to synthesize it by means of organic chemistry, whereas the latter tends to cut a meaningful scaffold into several small building blocks.

Although scaffold hopping methods have empowered researchers to optimize their lead compounds, to the best of our knowledge, there is no publicly accessible database containing this invaluable scaffold information. Many public small molecule databases are focused on the whole molecule level, such as the ZINC13 and PubChem databases14. These databases are used to search for entire molecules by similarity or substructures, rather than to identify interesting substitution fragments within individual molecules. To support Web-based molecular hopping, we have constructed a comprehensive database of unique scaffold structures by systematically fragmenting the ZINC molecular database, a large database (derived from several commercially available molecular libraries) containing more than 4.6 million compounds. We have also performed the same operations on the small molecular ligand dataset of the DrugBank database and the MDL Drug Data Report (MDDR)15 database in order to derive additional scaffolds. These fragment structures are associated with the properties of the original compounds from which the scaffold was derived. All this information, as well as the 2D structures of the scaffolds, is stored in a Web-based database system called ScafBank, which also implements substructure- and fingerprint-based similarity searches to enable researchers to quickly find feasible scaffolds. We believe that this valuable scaffold database will support medicinal chemists by allowing them to search with their own input fragments, facilitating molecular hopping studies.

Materials and methods

Molecular databases

The commercially available small molecular libraries used for high throughput screening were retrieved from the ZINC Web site (http://blaster.docking.org/zinc/). After molecules with similarities greater than 0.9 based on the fingerprint comparison method were removed, only 819 061 of the 4 600 000 small molecules in the ZINC database remained. We downloaded these datasets for scaffold and functional group analysis. The physicochemical properties of these molecules were calculated using JChem software and saved for later use. In order to compare these data with bioactive compounds, 1030 approved drugs contained in DrugBank16 and approximately 160 000 compounds from the MDDR database were collected and analyzed.
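The redundancy filter described above can be sketched as follows. This is only an illustration: the text states the 0.9 threshold but not the fingerprint type, the similarity coefficient, or the selection order, so the Tanimoto coefficient, the greedy first-come selection, and the toy fingerprints (sets of on-bit positions) are all assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_similar(fingerprints, threshold=0.9):
    """Greedily keep molecules that are at most `threshold` similar to every kept one."""
    kept = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy fingerprints, not real ZINC data.
toy = {
    "mol_a": frozenset(range(0, 20)),
    "mol_b": frozenset(range(0, 19)),   # shares 19 of 20 bits with mol_a
    "mol_c": frozenset(range(10, 30)),  # shares 10 of 30 bits with mol_a
}
kept = filter_similar(toy)
```

With these toy inputs, `mol_b` exceeds the threshold against `mol_a` and is dropped, while `mol_c` is retained. A greedy pass like this is order-dependent; a clustering-based selection would be an equally plausible reading of the text.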

Scaffold analysis

To identify the scaffold structures hidden in the molecules, the recursive scaffold analysis method was adopted and implemented on the basis of the open source C++ programming library OpenBabel 2.0 (ref. 17). As described by Bemis and Murcko10, the algorithm works by first going through the molecule graph to trim off chain atoms. This is done by repeatedly removing the atoms bonded to only one heavy atom until no more such atoms can be found. The recursive scaffold analysis implemented here is based on the HierS system18. Contiguous fragment searching was conducted to find the side chain-trimmed molecules, as illustrated in Figure 1. For every fragment, the basic ring scaffolds were found and deleted one by one to produce a new molecule (which may contain several fragments). The new molecules were subjected to a further recursive scaffold analysis. To identify the basic ring system, the smallest set of smallest rings (SSSR) method was used to indicate the ring atoms19. If two rings are connected by one or more atoms, they are associated together as one ring scaffold. This process is continued until no ring can be associated with any other. The recursive scaffold process was performed until only one ring system remained. Each of the resulting fragments was added to the final scaffold list and written out as a molecule.
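The side chain trimming step can be illustrated with a small graph routine. This is a minimal sketch in plain Python rather than the OpenBabel-based implementation used in the work; the adjacency-dictionary representation, the function names, and the toy molecule are assumptions made for the example. Ring perception (SSSR) and the recursive ring removal are not covered here.

```python
def build_graph(bonds):
    """Build an undirected heavy-atom graph from (atom, atom) bond pairs."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def prune_side_chains(adjacency):
    """Repeatedly strip atoms with at most one neighbour; rings and linkers remain."""
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [a for a, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:
            continue  # already removed via another path
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)  # neighbour has become a chain end
    return graph


# Hypothetical toy molecule: two six-membered rings joined by a one-atom
# linker (atom 7), with a two-atom side chain (atoms 14 and 15) on atom 13.
ring_one = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
ring_two = [(8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 8)]
linker = [(6, 7), (7, 8)]
side_chain = [(13, 14), (14, 15)]

framework = prune_side_chains(build_graph(ring_one + ring_two + linker + side_chain))
```

In this toy case the side chain atoms are removed and the two rings plus the linker atom are kept. A purely acyclic molecule is reduced to an empty graph, since every atom eventually becomes terminal.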

Figure 1

Illustration of the recursive scaffold analysis procedure, which fragments the molecule into scaffolds.

The RECAP method was used to produce functional groups from the molecular database for analysis. This method was chosen because it is reaction-based, so the resulting functional groups are ready to be used in virtual library construction. The Fragment program in the JChem package was used to accomplish this task20. The reverse reactions were represented in the SMIRKS language and written into an XML file for fragmentation, giving the program the ability to recognize the reaction center and to produce the final functional groups.

To remove duplicate scaffolds and functional groups, the Java programming library Chemistry Development Kit (CDK) was used as the basis for implementing a canonical SMILES representation of the chemical structures21. The canonical scaffolds and functional groups were processed to obtain a unique list of molecular fragments. Python scripts were then written to collect information about the original molecules (logP, molecular weight, hydrogen bond donor number, hydrogen bond acceptor number and ring number). The distributions of these properties were calculated and stored in a MySQL database for analysis22.
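The deduplication and property collection can be sketched as a simple grouping step. The example assumes that each scaffold already carries a canonical SMILES string produced by a cheminformatics toolkit; the record layout, function names, and numeric values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def collect_unique_scaffolds(records):
    """Group parent-molecule properties under each canonical scaffold string."""
    grouped = defaultdict(list)
    for canonical_smiles, parent_props in records:
        grouped[canonical_smiles].append(parent_props)
    return dict(grouped)


def summarize(grouped, prop):
    """Mean of one parent-molecule property for every unique scaffold."""
    return {smi: mean(p[prop] for p in parents) for smi, parents in grouped.items()}


# Hypothetical records: (canonical scaffold SMILES, properties of the parent molecule).
records = [
    ("c1ccccc1", {"mw": 92.0, "logp": 2.5}),
    ("c1ccccc1", {"mw": 94.0, "logp": 1.5}),
    ("c1ccncc1", {"mw": 79.0, "logp": 0.5}),
]
grouped = collect_unique_scaffolds(records)
mean_mw = summarize(grouped, "mw")
```

Because identical structures map to identical canonical strings, the dictionary keys form the unique scaffold list, and each value holds the parent properties from which distributions can be computed.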

In order to compare the results from these commercially available datasets with those from drug-like or marketed drug molecules, the 2D structures of bioactive molecules in the MDDR and DrugBank databases were subjected to the same analyses. To facilitate identification of the privileged structures, the frequencies of the scaffolds in the MDDR database and the drug target information associated with their original compounds were analyzed. This target information was used to calculate the Shannon entropy of each scaffold's target distribution using the following equations:

where STE is the scaffold target entropy, Ni is the number of molecules associated with the ith target class, and Nall is the total number of molecules associated with this scaffold. To normalize this entropy, the value was scaled by the entropy of all molecules.
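Written out from the definitions above, and assuming the standard Shannon form, the entropy and its normalization would read as follows. The logarithm base and the symbol STE_ref (the entropy computed over all molecules) are assumptions introduced here, not taken from the original equations.

```latex
\mathrm{STE} = -\sum_{i} \frac{N_i}{N_{\mathrm{all}}} \log \frac{N_i}{N_{\mathrm{all}}},
\qquad
\mathrm{STE}_{\mathrm{norm}} = \frac{\mathrm{STE}}{\mathrm{STE}_{\mathrm{ref}}}
```

Under this reading, a scaffold whose molecules all act on a single target class has STE = 0, and the value grows as the molecules spread more evenly over more target classes.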

Database system

The widely used MySQL database management system was selected to build the scaffold database22. All scaffolds were manipulated using the JChem database management system due to its efficiency at structure searching. The two-dimensional structures were imported into the database by the jcman program from the JChem package and then stored in a structure table. Other relevant information, such as the molecular weight and logP distributions of the original molecules associated with this scaffold, was stored in another MySQL table. Substructure and fingerprint-based similarity searching was implemented to facilitate Web-based searching. When querying using substructure and structure similarity, the JChem database was searched, and the IDs of the resulting molecules were collected and further used in querying other information tables. These results were then combined together and shown in a Web page format.
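The two-stage query flow described above (structure search first, then lookup of the associated information by ID) can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names, and values are hypothetical. The structure search itself is replaced by a fixed list of IDs, since the JChem search is not reproduced here.

```python
import sqlite3

# In-memory stand-in for the property table; schema and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scaffold_props ("
    "scaffold_id INTEGER PRIMARY KEY, mean_mw REAL, mean_logp REAL)"
)
conn.executemany(
    "INSERT INTO scaffold_props VALUES (?, ?, ?)",
    [(1, 280.0, 2.1), (2, 310.5, 3.4), (3, 198.2, 1.0)],
)


def fetch_properties(connection, hit_ids):
    """Look up stored property summaries for the scaffold IDs returned by a search."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" for _ in hit_ids)
    sql = (
        "SELECT scaffold_id, mean_mw, mean_logp FROM scaffold_props "
        f"WHERE scaffold_id IN ({placeholders}) ORDER BY scaffold_id"
    )
    return connection.execute(sql, list(hit_ids)).fetchall()


hit_ids = [3, 1]  # stand-in for the IDs returned by the structure search
rows = fetch_properties(conn, hit_ids)
```

Keeping the structure table and the property tables separate, linked only by scaffold ID, lets the structure search engine and the relational queries be tuned independently.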

Results and Discussion

Analysis of scaffolds derived from the ZINC database

A recursive scaffold analysis similar to the HierS method18 was adopted to analyze the ZINC database and store the unique scaffolds in a MySQL database. To provide an overview of the database, the scaffold occurrence numbers in the original ZINC database13 were calculated and are shown in Figure 2. Most of the scaffolds are unique in the ZINC database (only one molecule contains that scaffold). The most frequent scaffold is benzene, which is contained in approximately half of the ZINC database molecules. About 80 000 scaffolds have a molecular occurrence greater than 2, and about 10 000 scaffolds have an occurrence greater than 10. The scaffold database was also analyzed based on some common properties, such as molecular weight, ring number, aromatic atom number, and aliphatic atom number. This analysis shows that the molecular weight distribution has a typical Gaussian distribution shape and a mean of 280 Dalton, similar to the distribution in the ZINC database. Scaffolds are composed of ring and linker atoms. An analysis of the ring numbers in the scaffolds indicates that about 100 000 scaffolds contain three ring systems and two hetero-ring systems. This gives the database a large number of ring combinations to support scaffold hopping, which benefits researchers seeking substitutions for their query scaffolds. From the distribution of aromatic/aliphatic atoms, the mean number of aromatic atoms in the scaffold database was found to be approximately eight, and for aliphatic atoms, it was about twelve. For comparison, these properties were also calculated from the original ZINC database. Most of the properties were found to have distributions similar to those in the scaffold database.

Figure 2

Molecular occurrences in the ZINC database of scaffolds in ScafBank. The occurrence number was scaled by log10 for clarity.

We also analyzed the functional groups in the ZINC database. To do this, the default RECAP method12 was used to fragment the molecules in the ZINC database to get reaction-based building blocks. Surprisingly, this resulted in only 11 958 unique fragments. This represents a small portion of the 819 061 molecules in the ZINC database. This may indicate that the molecules in the ZINC database are made from a relatively small number of building blocks with simple chemical reactions. The same RECAP method was used to analyze the 1 030 approved drugs in DrugBank16, where it yielded a total of 1 599 unique fragments. These results demonstrate that the ZINC database, a collection of commercially available molecular databases, tends to contain common functional groups and may only represent a small fraction of the total functional group space.

Scaffold analysis of MDDR database

As many researchers have demonstrated, bioactive molecular databases contain valuable information about target family-related privileged structures. These privileged structures could be used to design focused libraries that target specific protein families. To extract this information, a scaffold analysis of the MDDR database was conducted using the same procedure described above. After the identification of unique scaffolds, the target Shannon entropy was calculated for each scaffold in order to identify interesting fragments. Larger normalized scores indicate that a scaffold is found in more targets and can be considered a privileged structure. Some of these are listed in Table 1. This is consistent with earlier findings that most of these valuable scaffolds occur in modulators of the GPCR protein family. Compared with previous investigations, this method provides a systematic way to extract privileged structures and also gives a ranking for these scaffolds, enabling researchers to check more easily for the interesting ones. Further library design using methods of combinatorial chemistry is underway and will be reported elsewhere.
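The entropy-based ranking can be sketched in a few lines. This is an illustration under stated assumptions: the natural logarithm is used, the normalization divides by the entropy of the target-class distribution over all molecules, and the target labels and scaffold names are made up for the example.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a collection of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log(n / total) for n in counts if n > 0)


def normalized_ste(scaffold_targets, dataset_targets):
    """Scaffold target entropy scaled by the entropy over all molecules."""
    ste = shannon_entropy(list(Counter(scaffold_targets).values()))
    reference = shannon_entropy(list(Counter(dataset_targets).values()))
    return ste / reference if reference > 0 else 0.0


# Hypothetical target-class labels, one per molecule.
dataset = ["GPCR"] * 4 + ["kinase"] * 4
scaffolds = {
    "scaffold_A": ["GPCR", "GPCR", "kinase", "kinase"],  # spread over two classes
    "scaffold_B": ["GPCR", "GPCR", "GPCR"],              # confined to one class
}
scores = {name: normalized_ste(t, dataset) for name, t in scaffolds.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the scaffold spread evenly over both target classes receives the highest score, while the scaffold confined to a single class scores zero, which matches the interpretation that larger scores flag candidate privileged structures.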

Table 1 Examples of interesting scaffolds with large entropy scores.

To further assess the uniqueness of these scaffolds, we compared the scaffolds derived from the ZINC database with those derived from the DrugBank and MDDR databases15. As shown in Table 2, there is only a small overlap between the ZINC and MDDR scaffolds (12 946 scaffolds in common), indicating that the two scaffold databases complement each other to cover a larger scaffold space. When the ZINC and MDDR database scaffolds (after removing the marketed drug compounds) were each compared with the DrugBank scaffolds, the ZINC scaffolds covered about 53.1% of the scaffolds found in DrugBank, whereas the MDDR scaffolds covered about 78.6%. This is consistent with the fact that the MDDR database is more drug-like than the ZINC database. A combination of the ZINC and MDDR scaffolds covers about 83.44% of the DrugBank scaffold space, suggesting that researchers are likely to find a suitable scaffold for their projects in most cases.
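The coverage figures above are set-overlap percentages, which can be sketched as follows. The scaffold identifiers are hypothetical placeholders for canonical SMILES strings, and the toy numbers are unrelated to the values reported in Table 2.

```python
def coverage(reference, candidates):
    """Percentage of reference scaffolds that also occur among the candidates."""
    if not reference:
        return 0.0
    return 100.0 * len(reference & candidates) / len(reference)


# Hypothetical scaffold sets; real entries would be canonical SMILES strings.
drugbank = {"A", "B", "C", "D"}
zinc = {"A", "B", "X"}
mddr = {"B", "C", "Y"}

zinc_cov = coverage(drugbank, zinc)
mddr_cov = coverage(drugbank, mddr)
combined_cov = coverage(drugbank, zinc | mddr)
```

The combined coverage is computed on the union of the two sets, so it is at least as large as either individual coverage but smaller than their sum whenever the two sets share scaffolds.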

Table 2 Comparison of scaffolds across the three datasets derived from the ZINC, MDDR and DrugBank databases. The value in each cell is the number of common scaffolds found in the datasets and the number in parentheses is the percentage of common scaffolds.

Web interface and searching options

To facilitate the use of the database by researchers, we constructed a Web site called ScafBank (http://202.127.30.184:8080/db.html) to host these analyzed data. Through the Web site, users can browse the unique scaffolds, as well as the associated information. In addition, they can search the database using substructure- or fingerprint-based similarity measures. As shown in Figure 3, users can draw the 2D structure online with the program Marvin or they can upload a molecule into Marvin. A database search can then be conducted to find similar scaffolds in the database. The interface provides the option of specifying filtering rules, such as how many molecules to output or how many hydrogen bonds the resulting molecules should contain. This gives the researcher the flexibility to retrieve scaffolds based on their own scaffold hopping research. After the database is searched, the molecules retrieved are depicted on Web pages. Each scaffold is associated with the molecular property distribution of its original molecules, which may be useful for combinatorial library design. Also, users can double click on the scaffold and open a new Marvin window, in which they can calculate additional properties of the scaffolds, such as conformation and charge.

Figure 3

The Web interface of the ScafBank database.

To further demonstrate the capability of this scaffold database, we queried ScafBank with a two-ring scaffold (Figure 4). A similarity search at a threshold of 0.7 returned a total of 97 hits from the MDDR scaffold database. We collected the results and found some interesting scaffolds that could be used as substitutes for the query, some of which are listed in Figure 4. These scaffolds are reasonable from the viewpoint of medicinal chemists. Further real applications of ScafBank through combinatorial library design and synthesis are in progress and will be reported elsewhere.

Figure 4

A case study. The query scaffold is in the middle and some of the results are listed around the query structure.

Discussion

Scaffold hopping is an active research field in chemoinformatics, and many computational methods are being devised to help medicinal chemists develop novel ideas for hit-to-lead optimization and improve the druggability of bioactive compounds. Here we conducted scaffold analysis on three common databases and compiled the resulting scaffolds into a relational database, which will enable researchers to perform scaffold substitution queries. The original databases used to extract the scaffolds include most of the commercially available compounds. Using the canonical SMILES representation, we collected unique scaffolds into a relational database, which removes redundancies and simplifies the post-analysis of the query results.

As demonstrated by numerous medicinal chemistry studies, scaffold hopping is a more general application of bioisosteric design, a process in which a target scaffold is replaced by another scaffold that is sometimes considerably different in structure but still has similar properties. We hope that the ScafBank database may be used in scaffold hopping to obtain molecules with better bioavailability or selectivity. Another straightforward application of the scaffold database is the identification of important scaffolds as starting points for combinatorial chemistry library design. This “privileged structure” approach has already demonstrated its potential in developing GPCR modulators2. The bioactivity-related entropy score in ScafBank could be used to prioritize the scaffolds, helping researchers judge their importance and select the most interesting ones with which to construct a combinatorial library, thereby increasing the chance of finding hit or lead compounds.

Although scaffold substitution is a useful method in medicinal chemistry, as reviewed by Babaoglu and Shoichet23, molecules are composed of various fragments, and the bioactivity of a molecule is not simply the sum of the contributions of these fragments. Sometimes the molecule acts in an integrated way. In the case of scaffold hopping, changing one part of the ligand may also affect other parts of the molecule because of subtle changes in torsion angles, in the orientations of other groups connected to the scaffold, and in the physicochemical properties of the resulting molecule. It should be noted that in our database system, only 2D chemical similarity is considered for scaffold hopping. Users should therefore not regard the scaffolds returned from a database search as a final choice for substitution. Instead, they are a starting point from which to further assess the feasibility of scaffold hopping.

Conclusion

In summary, a comprehensive, Web-accessible scaffold database was built by recursive scaffold analysis of the ZINC, DrugBank and MDDR databases. By comparing these unique scaffolds with the scaffolds derived from approved drugs in DrugBank, it was found that the scaffolds covered approximately 83% of DrugBank scaffold space. To our knowledge, this is the first public database specifically constructed for scaffolds. This database may assist researchers in pharmaceutical fields in conducting scaffold hopping to design novel molecules with higher potency or pharmaceutical potential.

Author contribution

Bing XIONG and Jing-kang SHEN designed the project; Bi-bo YAN, Meng-zhu XUE, Ke LIU and Bing XIONG performed the research; Ding-yu HU did the case study and analyzed the data; Bing XIONG and Jing-kang SHEN wrote the article.