Ph.D. Theses

Indexing Methods for Protein Tertiary and Predicted Structures

By Feng Gao
Advisor: Mohammed J. Zaki
December 6, 2006

This thesis focuses on the problem of fast sub-structure search and remote homology detection in proteins by finding similar (sub)structures. That is, given a query protein and a large database of protein structures, we want to rapidly retrieve all similar structures from the database. For example, we may want to retrieve all structures that contain sub-structures similar to the query, or that contain a specific 3D arrangement of surface residues. As the number of proteins deposited in the database grows, such searches become difficult and time-consuming. Searches such as these are the first step towards building a systems-level model for protein interactions. In fact, high-throughput proteomics methods are already accumulating the protein interaction data that we would wish to model, but fast computational methods for database searching lag far behind; biologists need a means to search the protein structure databases rapidly, similar to the way BLAST searches the sequence databases.

We are interested in two main problems that arise in sub-structure and remote homology searches: indexing protein tertiary structures, and indexing predicted structures for those proteins whose structures have not been determined experimentally. In our tertiary structure indexing approach, we present a new method for extracting local feature vectors from protein structures. Each residue is represented by a triangle, and the relationship between residues in a set is described by the distances between their alpha-carbon atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For each query segment, the suffix tree is used to retrieve the maximal matches efficiently; these matches are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. In our predicted structure indexing approach, a hidden Markov model (HMMSTR) built on local motifs with high sequence-structure correlation (the I-sites library) is used to generate the feature vectors for the structure predicted for a given sequence. Remote homologous proteins are then detected by using the suffix tree index over the predicted structures.
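
The local features described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the choice of three atoms per residue as the triangle, the position of the alpha carbon within that triple (ca_index), the use of consecutive residue pairs only, and the bin widths used to turn features into discrete symbols are all assumptions made for illustration. The resulting symbol sequence is the kind of string that could then be stored in a suffix tree.

```python
import math

# Each residue is a triangle: a tuple of three (x, y, z) points.
# ASSUMPTION: which three atoms form the triangle is not stated in the
# abstract; ca_index marks which vertex is taken to be the alpha carbon.

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def length(u):
    return math.sqrt(dot(u, u))

def unit_normal(tri):
    """Unit normal of the plane through the triangle's three vertices."""
    a, b, c = tri
    n = cross(sub(b, a), sub(c, a))
    size = length(n)
    if size == 0.0:
        raise ValueError("degenerate triangle: vertices are collinear")
    return (n[0] / size, n[1] / size, n[2] / size)

def pair_feature(tri_i, tri_j, ca_index=1):
    """Return (alpha-carbon distance, angle between plane normals in degrees)."""
    distance = length(sub(tri_i[ca_index], tri_j[ca_index]))
    cosine = dot(unit_normal(tri_i), unit_normal(tri_j))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return distance, math.degrees(math.acos(cosine))

def discretize(feature, dist_bin=1.0, angle_bin=30.0):
    """Map a feature to a discrete symbol (hypothetical bin widths)."""
    distance, angle = feature
    return (int(distance // dist_bin), int(angle // angle_bin))

def feature_string(triangles, ca_index=1):
    """Symbol sequence over consecutive residue pairs (an assumption)."""
    return [discretize(pair_feature(triangles[k], triangles[k + 1], ca_index))
            for k in range(len(triangles) - 1)]
```

Discretizing the features is one simple way to obtain symbols that can be compared exactly, which a suffix tree requires; the normalization used in the thesis may differ.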

We test our algorithms on several real datasets. Our methods improve both the running time and the accuracy of tertiary structure indexing and classification. They also find more remote homologous proteins in the database of predicted structures than competing methods.
