Protein-protein docking

The functions of most protein are subordinated to the interaction with one or more partners: small molecules, other proteins, nucleic acids, etc. Protein-protein complexes are involved in many cellular processes and consequently are of great interest for the biochemical and biological functionnal study of their components. As of today, while structural genomics projects increases dramatically the number of isolated protein structures, still very little is known on the multimolecular assemblies.

The experimental resolution of the three dimensional structure of a complex is a difficult task, often impossible for multiple reasons. Firstly, the experimental resolution of an isolated protein structure is already difficult, because it requires producing the protein in large quantities, and obtaining a pure and concentrated sample. Production and purification difficulties are even worse when dealing with a complex, since obtaining the native complex often requires producing the proteins jointly. Secondly, the affinities of the two partners are often insufficient to allow crystallization. Finally, one protein often has multiple partners.

These last years, multiple studies aiming at the detection of protein-protein complexes at the genome level have been published [1]. In Bkare’s yeast for example, such experimental studies have been conducted [2] [3] [4] [5]. These lead to a list of more than 12,000 potential protein-protein complexes. It is obviously not possible to solve experimentally the structures for all of them. Consequently, modeling methods are needed.

Many attempts have been made at implementing a protein-protein docking software. As clearly demonstrated by the CAPRI experiment [6] [7] [8], some methods are available to predict accurately the complexe’s structure in some cases. The available methods are of different types: some are very fast (a few seconds on a PC), but rely only on geometrical consideration, and thus not very accurate; others are very CPU-demanding (tens of hours on a luster), and consequently don’t allow to explore all the possible solutions. Finally, some of these methods make use of the available biological information, and allow, through the reduction of the solution space to be explored, to obtain very good results. However, whatever the methods, the choice of the best solution still relies on the expert.

In the context of high-throughput technologies, the number of protein-protein complexes to be explored is too important to allow manual selection of solutions, or consideration of biological information. Indeed, in yeast there are at least 12,000 different complexes. Thus, the amount of time spent on a single complex must be very low, ideally in the order of seconds. However, when it comes to studying a particular complex, available information will have to be considered, and will allow to validate and refine the prediction. The way two proteins interact is based on two criterions: the geometrical complementarity of the interacting surfaces, and the physico-chemical affinities of the residues in contact [9] [10] [11]. These two criterions have to be taken into account for the modeling.

We have initiated the implementation of a new method, making use of the Voronoi tessealtion, both for geomtrical and physico-chemical aspects. This work has been conducted in collaboration with Jérôme Azé (LRI, Université Paris-Sud).

Tesselation construction

There are many ways to build the Voronoi tessealtion of a protein. We have chosen as centroids the geometric centers of lateral chains plus Cα for each residue. We thus have only one point for each amino acid. This greatly simplifies the computations. It also allows to account for the mobility of the surface side chains, since when the side chain moves, the geometric center is only slightly modified.

The tesselation constructions were implemented in C++, using the CGAL library (Computational Geometry Algorithms Library, [12] [13]).

Parameters and learning

The number of parameters that might be computed from a Voronoi diagram is very high. In this first study, we have chosen to work with the most evident ones, taken at the interface:

  1. The total number of residues.
  2. The area of the interface.
  3. The frequency of occurence for each type of residue (20 parameters)
  4. The mean Voronoi volume of each type of residue (20 parameters)
  5. The frequency of occurence for each type of interacting pair (210 parameters)
  6. The mean distance between the geometric centers for each interacting pair (210 parameters).

However, the number of positive exemples (the number of known complexes structures) that can be used for learning is reduced to a hundred. Consequently, we had to reduce the number of parameters. To this aim, the residues are grouped in six categories: non-aromatic hydrophobic (VILM), aromatics (YWF), positively charged (HKR), negatively charged (DE), non-charged polar (NQ) and small residues (CGSTAP). Using this classification for pair parameters, we have 84 parameters.

To generate a scoring function, which would allow to distinguish native form non-native solutions, we have tried two different learning methods: SVMs and ROGER, developped in the LRI [14] [15] [16]). The principles of ROGER is to use genetic algorithms to maximize the area under the ROC curve. We have chosen to optimize a scoring function of the type:

(PNG)

The contribution of each parameter Xp depends on its weight wp, and it’s centering value cp, both optimized by the algorithm.

Data sets

Before using a learning method, one has to generate learning data sets. In our case, it was necessary to have a list of complexes with know three dimensional structure. For different reasons, establishing such a list is not straightforward:

  1. It’s not possible to automatically determine from PDB headers if the file concerns a complex or an isolated protein.
  2. A complexe’s structure is usefull for us only if the coordinates of at least one of the isolated partner is also available.
  3. The PDB file format is very complete, but unfortunately not well respected, rendering file reading difficult.

In a first data set, we had considered the list of complexes published by Lo Conte, Chothia and Janin [17]. However, the number of positive examples was two low. Consequently, we have implemented an extraction method form the PDB, which allowed us to establish a list of 102 non-redundant complexes.

Negative examples were generated using DOCK [18] [19] [20] [21]. Non-native solutions were randomly selected from the solution sets generated by this program.

Present state

We have shown that Voronoi cells of residues in contact in a protein protein complex have very particular properties, different from those of inner residues, and different from those of other surface residues.

For example, if not considering water molecules, stacking of atoms at the protein-protein interfaces is less dense than inside of the protein. If taking into account water molecule, the density of stacking is the same within the protein, at the protein-protein interface and at the protein nucleic-acid interface [22] [23] [24]. Our results were confirmed by an independant study which has shown that the number of water molecules ar the interface is different in a complex, a biological dimer and a crystallographic dimer [25].

We have shown that the characteristics of the interface Voronoi cells are different in a native and in a non-native conformation. The study of these characterictics on native and non-native datasets gave us the statistics that will be used in the learning procedure.

(JPEG)

The comparision between parameter values between native and non-native conformations shows that a native interface is enriched in aromatic amino acids, the mean Voronoi volumes of residues are lower, the hydrophobic and hydrophobic-small are more frequent, and the pair distances smaller. Even if these results are not surprising, quantifying these differences allows to establish the scoring function.

To evaluate the physical significance of the scores, we have considered these scores as a binding energy, and applied the random energy model (REM) [26] [27], as had been done previously with the energy [28]. This allows for example to estimate the specific temperature of a complex between 600K and 1200K, showing that at 300K, the dominant species is the complex. The critical temperature is estimated to be lower than 300K, which means that non-ntaive interaction modes in competition with the native mode are not those of lower energy, but a large spectrum of non-native states. Finalyy, it allows to show that the two states model is not suited for the description of protein-protein interactions, and the REM better describes the properties of the system.

Avaluating tthe accuracy of the scoring function is not an easy task. Indeed, what we would really like is to rank first a near-native solution. To measure the accuracy of our scoring function, we have re-ranked the solutions form two different docking software for some of the CAPRI targets (these complexes were not part of the learning dataset). The results, presented in the table below, show that our function almost always ranks one of the best available solutions in the top 10. This was done reranking solutions generated by DOCK and HADDOCK [29] [30] [31]. The results from reranking is betters for HADDOCK solutions because these are of better quality than the solutions generated by DOCK. For all targets but one, the ranking with our function is better than the original HADDOCK ranking.

CibleNb SolBest class of the setRang1Original rankRang2Rang3Number of hits
HADDOCK
11 200 2 4 50 1 9 4
12 200 1 78 119 78 49 0
13 200 1 1 8 7 5 1
14 200 1 2 3 1 3 10
15 200 3 57 29 1 20 0
DOCK
93446 4 4 49 1 3 10
11 32 4 11 28 3 2 3
12 139 3 27 64 14 20 4
13 2263 4 84 1740 7 1 0
14 2460 4 5 244 352 1 2
15 9 5 9 8 1 8 1
16 165 4 1 65 2 5 2
17 72 4 16 39 1 2 1
18 972 3 24 226 1 4 1
19 1979 4 22 94 25 1 2
Definition of classes
Class 1 : Fnat > 0.75 Class 2 : 0.5 < Fnat ≤ 0.75 Class 3 : 0.25 < Fnat ≤ 0.5 Class 4 : 0 < Fnat ≤ 0.25 Class 5 : Fint > 0 and Fnat = 0 Class 6 : Fint = 0 and Fnat = 0

Fnat : fraction of native contacts present in the solution Fint : fraction of interface residues correctly predicted

Best class in the set : the sets of solutions generated by DOCK or HADDOCK do not necessarily contain a "good" solution. The best calss in the set is the class of the best solution present in the set. For example, in the set generated by DOCK for target 18, the best solution in the set has 31% of native contacts. The best solution is the set is thus class 3.

Rank 1, rank 2, rank 3 : rank 1 is the best rank given by our scoring function to a solution of the best class in the set. Rank 2 is the best rank given to a second best class solution of the set, and so on for rank 3. For target 18 for example, when analyzing the results given by DOCK, rank 1 is equal to 24. This means that the first solution of class 3 (which is the best class in the set) is ranked 24. This solution was ranked 226. Rank 2 is equal to 1, which means that the first solution of the second best class (class 4 in this case) is ranked 1.

Number of hits : number of best class solutions in the top 50

Perspectives and work in progress

Until now, we have used the Voronoi tesselation to study the possible conformations for a complex. We could also use this construction to define a surface on each partner well-suited for for the search of a good solution. Indeed, two types of surface representation have been classicaly used: a set of points chosen on the surface (of the order of 10e6 points), or a set of spheres. Both representation involve very high computation times. Moreover, the interaction often requires the modification of both surfaces, which these representations do not allow. The surface defined by the Voronoi tesselation is a polyhedron, and consists in a reduced number of points (usually less than 1000 polyhedrons at the surface).

We have now started the implementation of a new method for the generation of solution using the Voronoi surface of the two partners. This should allow us to generated a new, more accurate, scoring function.

A tool for computing the score of a given complex confromation is available here

[1] Janin, J., and Seraphin, B. (2003) Curr Opin Struct Biol 13, 383-8.

[2] Gavin, A. C., et al. (2002) Nature 415, 141-7.

[3] Ho, Y., et al. (2002) Nature 415, 180-3.

[4] Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S., and Sakaki, Y. (2000) Proc Natl Acad Sci U S A 97, 1143-7.

[5] Uetz, P., et al. (2000) Nature 403, 623-7.

[6] CAPRI : Critical Assessment of PRediction of Interactions. This experriment, co-organized by Joël Janin consists in collecting the experimental structures of complexes before the coordinates are made public, and proposing them to modelers. The participating groups propose up to 10 models form the structures of the isolated partners. The models are then compared to the crystallographic structure, which allows the evaluation of the different methods.

[7] Janin, J. (2005) Protein Sci 14, 278-83.

[8] Mendez, R., Leplae, R., Lensink, M. F., and Wodak, S. J. (2005) Proteins 60, 150-69.

[9] Chothia, C., and Janin, J. (1975) Nature 256, 705-8.

[10] Janin, J., and Wodak, S. J. (2002) Adv Protein Chem 61, 1-8.

[11] Wodak, S. J., and Janin, J. (1978) J Mol Biol 124, 323-42.

[12] http://www.cgal.org/

[13] Boissonat, J.-D., and Yvinec, M. (1995) Géométrie Algorithmique, Ediscience international.

[14] Roche, M., Azé, J., Kodratoff, Y., and Sebag, M. (2004) in "ROCAI", pp. 81-88, Valencia.

[15] Sebag, M., Azé, J., and Lucas, N. (2003) in "ICMD", pp. 637-40, Melboune, Florida, USA.

[16] Sebag, M., Azé, J., and Lucas, N. (2003) in "6th International Conference on Artificial Evolution", pp. 384-96, Marseilles, France.

[17] Lo Conte, L., Chothia, C., and Janin, J. (1999) J Mol Biol 285, 2177-98.

[18] Wodak, S. J., and Janin, J. (1978) J Mol Biol 124, 323-42.

[19] Cherfils, J., Duquerroy, S., and Janin, J. (1991) Proteins 11, 271-80.

[20] Duquerroy, S., Cherfils, J., and Janin, J. (1991) Ciba Found Symp 161, 237-49; discussion 50-2.

[21] Janin, J., and Wodak, S. J. (1985) Biopolymers 24, 509-26.

[22] Lo Conte, L., Chothia, C., and Janin, J. (1999) J Mol Biol 285, 2177-98.

[23] Nadassy, K., Tomas-Oliveira, I., Alberts, I., Janin, J., and Wodak, S. J. (2001) Nucleic Acids Res 29, 3362-76.

[24] Nadassy, K., Wodak, S. J., and Janin, J. (1999) Biochemistry 38, 1999-2017.

[25] Rodier, F., Bahadur, R. P., Chakrabarti, P., and Janin, J. (2005) Proteins 60, 36-45.

[26] Bernauer, J., Azé, J., Janin, J., and Poupon, A. (2005) in "JOBIM 2005".

[27] Bernauer, J., Poupon, A., Aze, J., and Janin, J. (2005) Phys Biol 2, S17-23.

[28] Janin, J. (1996) Proteins 25, 438-45.

[29] Bonvin, A. M., Boelens, R., and Kaptein, R. (2005) Curr Opin Chem Biol 9, 501-8.

[30] Dominguez, C., Boelens, R., and Bonvin, A. M. (2003) J Am Chem Soc 125, 1731-7.

[31] van Dijk, A. D., de Vries, S. J., Dominguez, C., Chen, H., Zhou, H. X., and Bonvin, A. M. (2005) Proteins 60, 232-8.