Here we present a Generic Motif Discovery Algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real--valued data. Gemoda deterministically discovers motifs that are maximal in composition and length. As well, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation--agnostic: they can be represented using regular expressions, position weight matrices, or any number of other models for any type of sequential data. Since motif discovery tools with even two or three of these qualities are exceedingly rare, Gemoda is particularly novel in having all four of these characteristics.
We demonstrate a number of applications of the algorithm, ranging from the discovery of motifs in amino acid and nucleotide sequences to the discovery of conserved protein sub—structures. We also briefly highlight the potential applications of Gemoda for metabolomic studies and for simple classification tasks. Finally, we discuss some of the issues faced in implementing the algorithm and the computational concepts taken from chemical engineering that were adapted or used as a basis for methods implemented in Gemoda. These methods are of particular importance, as they frequently bring otherwise intractable problems into the realm of feasibility using intelligent reduction of search space.