|
||||||||||||||||
Research Ph.D. ThesesStructured Motifs in Biological Sequences: Search and Extraction
By Yongquiang Zhang
Searching biological sequence(s) for motifs is a fundamental task in bioinformatics. In this thesis, we develop efficient algorithms for the discovery of interesting structured motifs from biological sequences, which may play important roles in various biological functions. Such patterns include transcription factor binding sites, tandem repeats, retrotransposons, protein sites, and so on. A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We tackle two main problems in structured motif search: a) given a motif we search for all occurrences, b) we discover/extract the structured motifs. We allow several variations, including approximate matching, missing components, overlapping motifs, and length ranges of simple motifs. We propose an efficient algorithm, called sMotif, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, sMotif searches the sequences for all occurrences of the motif. Potential applications of sMotif include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. sMotif can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMaRTFinder, a state-of-the-art algorithm for structured motif search. Experimental results show that sMotif is about 7 times faster and consumes 100 times less memory than SMaRTFinder. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. We next look at the problem of mining structured motifs and propose an efficient algorithm, called exMotif, that given some sequence(s), and a structured motif template, extracts all frequent structured motifs that have quorum q. Potential applications of exMotif include the extraction of single/composite regulatory binding sites in DNA sequences. exMotif is efficient in terms of both time and space and outperforms RISO, a state-of-the-art algorithm. Experimental results show that exMotif is about 3-5 times faster than RISO. It is also successful in discovering potential single/composite transcription factor binding sites. Return to main PhD Theses page |
||||||||||||||||
|