467629 Biologically Consistent Annotation for Metabolite Identification from Metabolomics Data
Advances in detection technology have greatly expanded the number and type of metabolites that can be detected. Coupled with appropriate chromatographic separation methods, high-resolution mass spectrometry (MS) can be used to perform sophisticated untargeted experiments to achieve broad coverage of the chemical species in a sample, while also obtaining rich information on the species’ molecular structure through their fragmentation patterns and corresponding MS/MS spectra. On the other hand, datasets resulting from untargeted experiments are very complex, and can contain 10^3 to 10^4 spectral signatures or “features.” Analyzing such datasets to obtain meaningful biological information depends on reliably and efficiently resolving the chemical identities of the detected features.
Despite progress, the identification of metabolites from spectral data remains a major bottleneck. In untargeted metabolomics, the objective is to characterize as many compounds as possible. The gold standard is to annotate the data by referencing a library of high-purity chemical standards generated using the same analytical method as the samples. However, this is often impractical, since the sample could contain thousands of unknown metabolites. A common alternative is to search publicly available spectral libraries for entries that match the observed MS/MS spectra. Yet the number of pure chemical standards covered by MS/MS spectral libraries is very small compared to the number of metabolites occurring in nature. Two commonly used libraries, METLIN and HMDB, contain MS/MS data for only ~16% of the metabolites catalogued in KEGG.
In recent years, a number of computational tools have been developed that utilize in silico fragmentation, molecular fingerprinting, and machine learning. While these tools can greatly increase the number of features that can be assigned a putative identity, difficulties remain in functionally annotating the data. Given that many compounds share the same chemical formula (and hence mass) and fragment similarly, associating a spectral signature with a chemical identity can be ambiguous. This is also reflected in the outputs of in silico fragmentation tools, which typically suggest many possible matches for a given experimentally observed MS/MS spectrum.
This paper describes a novel computational workflow that utilizes the biological context of a sample to annotate and interpret the metabolomics data. The basis for this context-driven approach is that the metabolites present in a sample reflect enzyme-catalyzed biochemical reactions active in the corresponding biological system. We develop this workflow to specifically address the comparison or differential analysis of samples representing two or more different experimental conditions, a typical problem in metabolomics.
The major steps in the workflow are as follows. First, a metabolic model is constructed from genome annotation data in KEGG to define a set of metabolic reactions that could be expressed in the biological system of interest. Second, the detected features are mapped to candidate compounds in the model, creating a graph that connects the candidate compounds via edges representing reactions. The nodes in this graph are weighted by scores reflecting the confidence in the mapping of detected features to candidate compounds. These scores are calculated based on matches of features to MS/MS spectra in databases and to predicted spectra generated using in silico fragmentation tools. The edges in the graph are weighted by the confidence that the corresponding reactions are expressed in the biological system. Third, a local neighborhood analysis is performed on the weighted graph to further refine the confidence calculations. If a large fraction of the connected metabolites in a subset of the graph, or neighborhood, consists of high-confidence metabolites, this increases confidence in the assigned identities of all metabolites in the neighborhood. Features associated with multiple candidate metabolites are preferentially mapped to such high-confidence neighborhoods, since there is a greater likelihood that the reactions are actively engaged. The results of this analysis also inform pathway engagement, as pathways covering many high-confidence neighborhoods are also likely engaged.
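The neighborhood refinement in the third step can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the blending rule (a weighted average of a candidate's own score and the edge-weighted mean score of its neighbors), the parameter alpha, the function names, and the toy compounds and scores are all hypothetical.

```python
from collections import defaultdict

def refine_scores(node_score, edges, alpha=0.5):
    """Blend each candidate's own score with the edge-weighted mean
    score of its graph neighbors (hypothetical rule, not the authors')."""
    neighbors = defaultdict(list)
    for (u, v), weight in edges.items():
        # Reactions are treated as undirected links between compounds.
        neighbors[u].append((v, weight))
        neighbors[v].append((u, weight))
    refined = {}
    for compound, score in node_score.items():
        total_weight = sum(w for _, w in neighbors[compound])
        if total_weight > 0:
            support = sum(node_score.get(n, 0.0) * w
                          for n, w in neighbors[compound]) / total_weight
        else:
            support = 0.0  # isolated candidate: no neighborhood support
        refined[compound] = (1 - alpha) * score + alpha * support
    return refined

def assign_features(feature_candidates, refined):
    """Map each feature to its candidate with the highest refined score."""
    return {feature: max(cands, key=lambda c: refined[c])
            for feature, cands in feature_candidates.items()}

# Toy data (illustrative only): node scores stand in for spectral-match
# confidence, edge weights for confidence that a reaction is expressed.
node_score = {"A": 0.9, "B": 0.8, "C": 0.5, "X": 0.55, "D": 0.1}
edges = {("A", "C"): 1.0, ("B", "C"): 1.0, ("X", "D"): 0.5}
feature_candidates = {"f1": ["C", "X"]}

refined = refine_scores(node_score, edges)
assignment = assign_features(feature_candidates, refined)
print(assignment)
```

In this toy case, candidate X has a slightly higher spectral score than C, but C sits next to two high-confidence compounds, so feature f1 is mapped to C after refinement. This mirrors the preference for high-confidence neighborhoods described above, under the stated assumptions.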
We validate the workflow using experimental data from two case studies. The first case deals with metabolomic data collected on a single, well-characterized cell type, whereas the second case deals with data on a complex microbial community harboring only partially characterized member species. Untargeted metabolomics experiments were run using multiple chromatography methods in combination with both positive and negative ionization modes, which broadened coverage of unique masses by >50%. Use of an accurate metabolic model is critical; up to 90% of the candidate compounds that match KEGG compounds based on mass (precursor m/z) could be eliminated as likely false matches by considering only compounds present in the model.
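The effect of restricting mass-based matches to compounds in the metabolic model can be sketched as follows. This is a simplified illustration rather than the workflow's actual implementation: it assumes a single [M+H]+ adduct in positive mode, a 10 ppm mass tolerance, and a small hypothetical compound table and model; the function names are likewise hypothetical.

```python
PROTON_MASS = 1.007276  # mass shift assumed for an [M+H]+ adduct

def candidates_by_mass(mz, compounds, ppm=10.0, adduct_shift=PROTON_MASS):
    """Return compounds whose monoisotopic mass matches the neutral mass
    implied by the precursor m/z, within a ppm tolerance."""
    neutral_mass = mz - adduct_shift
    tolerance = neutral_mass * ppm * 1e-6
    return [name for name, mass in compounds.items()
            if abs(mass - neutral_mass) <= tolerance]

def filter_by_model(candidates, model_compounds):
    """Keep only candidates that occur in the organism's metabolic model."""
    return [c for c in candidates if c in model_compounds]

# Hypothetical compound table: three hexose isomers share one formula
# (C6H12O6) and therefore one monoisotopic mass.
compounds = {
    "glucose": 180.06339,
    "fructose": 180.06339,
    "galactose": 180.06339,
    "citrate": 192.02700,
}
model_compounds = {"glucose", "citrate"}  # assumed model content

mass_matches = candidates_by_mass(181.07067, compounds)
model_matches = filter_by_model(mass_matches, model_compounds)
eliminated_fraction = 1 - len(model_matches) / len(mass_matches)
print(mass_matches, model_matches, round(eliminated_fraction, 2))
```

Here the precursor mass alone cannot separate the three isomers, and the model filter removes two of the three candidates. The fraction eliminated in real data depends on the model and the database, as reported above.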
The annotation generated using our workflow was compared against annotation results collected from available databases (METLIN and HMDB) and recently published in silico prediction tools (CFM-ID and MetFrag). For selected metabolites with available chemical standards, the predicted annotation was also confirmed experimentally. Our results to date indicate that all three steps in the workflow greatly contribute to improving the annotation yield and accuracy.