One condition required for the experimental resolution of a protein's three-dimensional structure is its ability to fold into a stable conformation. Many proteins consist of multiple structural domains separated by flexible regions. In this case, the experimental structural study of the complete protein is not possible, and the individual domains have to be studied separately. However, no available method accurately detects the limits of these domains: most of the available methods are based on sequence comparison.
Such predictions have already been made successfully using the HCA method (Hydrophobic Cluster Analysis) and sequence alignment methods.
HCA is based on the fact that the distribution of hydrophobic residues along a sequence depends on the structure. In the HCA diagram of the human protein XRCC1, for example (below), one can see two different "textures": one corresponding to globular domains, and the other to an unfolded region.

This makes it possible to determine very rapidly whether a sequence contains unfolded regions, and to locate their limits approximately. Sequence comparison methods then make it possible to refine these limits.
To quantify these texture differences, we studied the statistical distribution of the 20 amino acids in folded and unfolded regions. For any window in a protein sequence, this makes it possible to compute its probability of occurrence in a structured region (S) and in a linker region (L). We also compute, for each residue of each sequence, the distance to the closest hydrophobic cluster. It was then possible to infer decision rules from this distance and from the ratio between the two probabilities [1]. These rules accurately predict folded and unfolded regions.
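The two quantities described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the published method: the amino acid frequencies are toy values, residues in a window are treated as independent, each hydrophobic residue stands in for a hydrophobic cluster, and the window width and thresholds are arbitrary.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: hydrophobic alphabet used here as a stand-in for HCA clusters.
HYDROPHOBIC = set("VILFMYW")


def toy_frequencies(enriched, factor=3.0):
    """Build a normalized toy frequency table (NOT real statistics)."""
    weights = {a: (factor if a in enriched else 1.0) for a in AMINO_ACIDS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical frequency tables for structured (S) and linker (L) regions.
FREQ_S = toy_frequencies("VILFMYW")
FREQ_L = toy_frequencies("PSGEQK")


def log_ratio(window, freq_s=FREQ_S, freq_l=FREQ_L):
    """log[P(window | L) / P(window | S)], assuming independent residues."""
    return sum(math.log(freq_l[a] / freq_s[a]) for a in window)


def window_scores(seq, width=5):
    """Log-ratio of the window centred on each residue (truncated at the ends)."""
    half = width // 2
    return [log_ratio(seq[max(0, i - half): i + half + 1]) for i in range(len(seq))]


def hydrophobic_distances(seq):
    """Distance from each residue to the nearest hydrophobic residue."""
    positions = [i for i, a in enumerate(seq) if a in HYDROPHOBIC]
    if not positions:
        return [len(seq)] * len(seq)
    return [min(abs(i - p) for p in positions) for i in range(len(seq))]


def predict_linker(seq, min_log_ratio=0.0, min_distance=3):
    """Toy decision rule: linker if the window favours L and is far from hydrophobic residues."""
    scores = window_scores(seq)
    distances = hydrophobic_distances(seq)
    return [s > min_log_ratio and d >= min_distance for s, d in zip(scores, distances)]


if __name__ == "__main__":
    sequence = "VILFVILFPSGPSGPSG"  # invented sequence, for illustration only
    flags = predict_linker(sequence)
    print("".join("L" if f else "S" for f in flags))
```

In this sketch a residue is called "linker" only when both criteria agree, which mirrors the idea of combining the probability ratio with the distance to hydrophobic clusters; the actual rules in [1] were inferred from data and may take a different form.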

We then used these rules to predict unfolded regions in proteins that were not part of the training dataset. The software is available here.
As explained above, what matters most for structural studies, and especially for structural genomics, is not so much to know whether there are unfolded regions as to know whether these regions should be deleted from the sequence to improve expression, solubility or crystallization.
In a new study, we asked whether it is possible, from predictions of unfolded regions, to design a construct (a subsequence) that would be better expressed, more soluble, or would lead to better crystals than the full-length protein.
Using experimental data collected in our team, we predicted the unfolded regions with 9 different methods and compared the predictions with the experimental results [2].
We were able to show that, although Prelink is among the best-performing methods, it correctly predicts the experimental behaviour of the construct in only half of the cases. Since the different methods are based on different principles, we searched for a way to combine them. To this end, we used two different approaches: decision trees, and decision rules generated with the software ICL. The best decision tree gives good predictions in 60% of the cases. The generated rules give good predictions in 80% of the cases, which represents considerable progress. A server giving access to both methods will soon be available on this website.
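The idea of combining predictors through simple rules can be illustrated with a small sketch. It is not the ICL procedure and does not reproduce the study's data: the method names, the predictions and the outcomes are invented toy values, and the search is restricted to "A and B" or "A or B" rules over pairs of methods.

```python
from itertools import combinations


def accuracy(predictions, outcomes):
    """Fraction of constructs for which the prediction matches the outcome."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)


def best_pair_rule(method_predictions, outcomes):
    """Exhaustively score 'A and B' / 'A or B' rules; return (accuracy, rule)."""
    operators = (
        ("and", lambda x, y: x and y),
        ("or", lambda x, y: x or y),
    )
    best = None
    for a, b in combinations(sorted(method_predictions), 2):
        for name, op in operators:
            combined = [
                op(x, y)
                for x, y in zip(method_predictions[a], method_predictions[b])
            ]
            score = accuracy(combined, outcomes)
            # Strict comparison: the first rule found wins ties.
            if best is None or score > best[0]:
                best = (score, f"{a} {name} {b}")
    return best


# Toy data (invented): True means "the construct behaves better than the
# full-length protein", as predicted by a method or as observed.
OUTCOMES = [True, True, False, False, True, False]
PREDICTIONS = {
    "method_a": [True, True, True, False, True, True],
    "method_b": [True, True, False, True, True, False],
    "method_c": [False, True, False, False, False, False],
}

if __name__ == "__main__":
    for name, preds in sorted(PREDICTIONS.items()):
        print(name, accuracy(preds, OUTCOMES))
    print(best_pair_rule(PREDICTIONS, OUTCOMES))
```

Because the rule is selected and scored on the same constructs, the accuracy printed here is optimistic; a realistic evaluation would score the selected rule on held-out constructs.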