The first major focus of the poster is Gemoda, a Generic Motif Discovery Algorithm for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices, or any number of other models for any type of sequential data. Since motif discovery tools with even two or three of these qualities are exceedingly rare, Gemoda is particularly novel in having all four of these characteristics. I will present a number of applications of the algorithm, ranging from the discovery of motifs in amino acid and nucleotide sequences to the discovery of conserved protein substructures. I will also briefly highlight the potential applications of Gemoda for metabolomic studies and for simple classification tasks.
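As a rough illustration of what a metric-agnostic comparison can look like, the sketch below compares every pair of fixed-width windows using a caller-supplied similarity function. It is not Gemoda's actual implementation: the names `similar_windows`, `identity_score`, and `neg_euclidean`, along with the window width and threshold parameters, are assumptions made for this example, and the sketch stops short of assembling the matching windows into maximal motifs.

```python
from itertools import combinations


def similar_windows(seqs, width, similarity, threshold):
    """Return pairs of window positions whose similarity meets the threshold.

    Each position is a (sequence index, start offset) tuple. The similarity
    function is supplied by the caller, so the same routine serves
    categorical and real-valued data alike.
    """
    # Enumerate every window of the given width in every sequence.
    windows = [
        ((i, j), seq[j:j + width])
        for i, seq in enumerate(seqs)
        for j in range(len(seq) - width + 1)
    ]
    # Exhaustive pairwise comparison: deterministic, quadratic in window count.
    return [
        (pos_a, pos_b)
        for (pos_a, win_a), (pos_b, win_b) in combinations(windows, 2)
        if similarity(win_a, win_b) >= threshold
    ]


def identity_score(a, b):
    """Hypothetical metric for categorical data: number of matching positions."""
    return sum(x == y for x, y in zip(a, b))


def neg_euclidean(a, b):
    """Hypothetical metric for real-valued data: negated Euclidean distance."""
    return -(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)
```

For example, `similar_windows(["ACGT", "TACG"], 3, identity_score, 3)` finds the shared window "ACG", and the same call with `neg_euclidean` works on lists of floats without any change to the comparison routine.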
The second major focus of the poster is SpectConnect, a method for tracking "unknown" metabolites in GC-MS metabolomic data. The goal of metabolomics, the metabolite analog of genomics and proteomics, is the measurement of concentrations (or "metabolite profiles") of as many cellular metabolites as possible, usually with applications to functional genomics. While certain aspects of metabolomics suggest that exhaustive metabolite profiling may be possible, obstacles persist, one of the most significant being the chemical diversity of metabolites. Unlike DNA or proteins, metabolites do not adhere to a subunit-based chemistry, so assaying for many metabolites (with many chemistries) simultaneously is difficult. Gas chromatography-mass spectrometry (GC-MS) is one method frequently used to assay for a variety of metabolites, and the aim of this work is to improve the downstream analysis of GC-MS data independent of upstream experimental protocols. Analysis of metabolomic profiling data from GC-MS measurements usually relies upon reference libraries of metabolite mass spectra to structurally identify and track metabolites. In general, techniques to enumerate and track unidentified metabolites are non-systematic and require manual curation. Here I present SpectConnect, a method and software implementation freely available at http://spectconnect.mit.edu, that can systematically detect components that are conserved across samples without the need for a reference library or manual curation. This approach is validated by correctly identifying the components in a known mixture and the discriminating components in a spiked mixture. An application of this approach is demonstrated with a brief analysis of the Escherichia coli metabolome. I will also present recent results of our efforts to better characterize the metabolome of Saccharomyces cerevisiae using SpectConnect.
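To make the idea of library-free tracking concrete, here is a minimal sketch that keeps only those spectra from one sample that have a close match in every other sample. It is an illustration under simplifying assumptions, not SpectConnect's actual procedure: spectra are modeled as dictionaries mapping m/z to intensity, similarity is plain cosine similarity, retention time is ignored, and the names `cosine` and `conserved_components` and the 0.9 threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    dot = sum(intensity * b.get(mz, 0.0) for mz, intensity in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty or all-zero spectrum matches nothing.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def conserved_components(samples, threshold=0.9):
    """Return spectra from the first sample that recur in every other sample.

    No reference library is consulted: a component counts as conserved
    purely because a similar spectrum appears in each sample.
    """
    if not samples:
        return []
    reference, *others = samples
    return [
        spectrum
        for spectrum in reference
        if all(
            any(cosine(spectrum, candidate) >= threshold for candidate in other)
            for other in others
        )
    ]
```

A component that shows up in only some samples is dropped here; comparing which components are conserved within one group of samples but not another is one way such a routine could point to discriminating components.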
In addition, I will briefly present other work relevant to my thesis, including an analysis of the accuracy of the BLOSUM amino acid substitution matrices, which are used (even if unknowingly) by most biologists and biochemical engineers. This project, as well as the other two primary aspects of my thesis work, will be covered in more detail in oral presentations during the conference.
My goal is to leverage my computational experience for both intrinsic and extrinsic ends. Computational research in and of itself is an important part of the development of chemical and biochemical engineering. Beyond that, though, I aim to use that research to better formulate hypotheses and projects in areas from metabolomics to more traditional metabolic engineering. A closed-loop system like this inevitably leads to more effective research and discoveries that may not have otherwise been accessible.