- 1:20 PM
653c

Computational Methods of a Generic Motif Discovery Algorithm for Sequential Data

Mark Styczynski1, Kyle L. Jensen1, Isidore Rigoutsos2, and Gregory N. Stephanopoulos3. (1) Chemical Engineering, MIT, 66-264 MIT 77 Massachusetts Ave, Cambridge, MA 02139, (2) Research Division, IBM, Thomas J. Watson Research Center, Yorktown Heights, NY 10598, (3) Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., 56-439, Cambridge, MA 02139

Motif discovery in sequential data is a problem of great interest and with many applications in biology and chemical engineering. However, previous methods for motif discovery have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems. While in some cases simply finding a good set of pattern occurrences is acceptable, in many cases we would like to find all possible patterns in a dataset that meet some threshold of importance, as each set of pattern occurrences may have distinct biochemical relevance.

Here we present a Generic Motif Discovery Algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real--valued data. Gemoda deterministically discovers motifs that are maximal in composition and length. As well, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation--agnostic: they can be represented using regular expressions, position weight matrices, or any number of other models for any type of sequential data. Since motif discovery tools with even two or three of these qualities are exceedingly rare, Gemoda is particularly novel in having all four of these characteristics.

We demonstrate a number of applications of the algorithm, ranging from the discovery of motifs in amino acid and nucleotide sequences to the discovery of conserved protein sub—structures. We also briefly highlight the potential applications of Gemoda for metabolomic studies and for simple classification tasks. Finally, we discuss some of the issues faced in implementing the algorithm and the computational concepts taken from chemical engineering that were adapted or used as a basis for methods implemented in Gemoda. These methods are of particular importance, as they frequently bring otherwise intractable problems into the realm of feasibility using intelligent reduction of search space.