Wednesday, November 7, 2007
515n

A Powerful Data Mining Approach for Developing Assay-Specific Differentially Expressed Signatures from Gene Expression Data

Derrick K. Rollins Sr.1, Ai-ling Teh1, and Dan Nettleton2. (1) Chemical and Biological Engineering, Iowa State University, 2114 Sweeney Hall, Ames, IA 50011, (2) Statistics, Iowa State University, 2114 Sweeney Hall, Ames, IA 50011

Rollins et al. (2006) introduced a powerful method for determining a rank-ordered list of genes for assay-specific signatures from microarray gene expression data. In this approach, principal component analysis (PCA) is used to identify relevant signature groups through a novel use of eigenvalues generated from treating the assays as the variables, called eigengenes, and from treating the genes as the variables, called eigenassays. For an identified signature group, the approach determines the contribution of each individual gene. The genes are then ranked by their contribution, and the contributions are plotted against a constantly incrementing index. Rollins et al. proposed two ways to determine the size of a signature, that is, where to cut off the list of genes. The first way is to use the inflection point, the point where the second derivative of this plot changes sign. This approach gave the largest signatures. The second way is to use the maximum curvature (MC), which is essentially the point where the second derivative is a maximum. The method proposed by Rollins et al. was shown to be effective using real and simulated data and by comparison with the method of Misra et al. (2002).
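
As an illustration of the two cutoff rules, the following Python sketch locates both points on a ranked contribution curve. It is not the implementation of Rollins et al.; it assumes the curve is sampled at integer ranks, approximates the second derivative with second finite differences, and uses a hypothetical function name, signature_cutoffs. Real contribution curves are noisy and would likely need smoothing before differencing.

```python
import numpy as np

def signature_cutoffs(contributions):
    """Return (mc_rank, inflection_rank) for a ranked contribution curve.

    Assumptions (not from the source): ranks are 0, 1, 2, ... and the
    second derivative is approximated by second finite differences.
    inflection_rank is None when the second difference never changes sign.
    """
    # Rank genes from largest to smallest contribution.
    c = np.sort(np.asarray(contributions, dtype=float))[::-1]
    # d2[i] = c[i+2] - 2*c[i+1] + c[i] approximates the second derivative at rank i + 1.
    d2 = np.diff(c, n=2)
    # MC rule: rank where the second derivative is largest.
    mc_rank = int(np.argmax(d2)) + 1
    # Inflection rule: first rank at which the sign of the second difference changes.
    changes = np.flatnonzero(np.diff(np.sign(d2)) != 0)
    inflection_rank = int(changes[0]) + 2 if changes.size else None
    return mc_rank, inflection_rank

# Synthetic curve: convex at first, then concave, with an inflection near rank 20.5.
ranks = np.arange(41)
curve = -(ranks - 20.5) ** 3
mc_rank, inflection_rank = signature_cutoffs(curve)
```

On this synthetic curve the MC rule gives the shorter list and the inflection rule the longer one, which matches the ordering of signature sizes reported above, although the curve itself is purely illustrative.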

The work presented in this talk extends the method of Rollins et al. to determine a rank-ordered set of genes that are expressed most differently in two groups of assays. The method selects the principal component that most distinguishes the two groups and creates test statistics based on the differential contribution of each gene. This work evaluates two new test statistics for this purpose. The first, which we shall call Tdiff, is a difference of the linear combinations for each group created from this principal component. The second, which we shall call Tscale, is Tdiff divided by its estimated standard deviation. The current approach uses the common pooled Student's t statistic (Tpooled) (Devore, 2004), which is simply a scaled statistic that weights each assay equally in each group. This talk compares these three test statistics in three studies. The first study revisits the single-group analyses in Rollins et al. involving exposure of E. coli cells to two different levels of ethanol concentration (Gonzalez et al., 2003). The resulting signature contained a large number of the genes in the two signatures found in Rollins et al. as well as some genes in neither of them, while a significant number of genes from those two signatures were absent from the single signature produced by the differential method. This signature essentially eliminated the genes that were highly expressed in both groups, raised the rank of genes with moderate expression in one group and low expression in the other (as weighted by the PCA coefficients, or loadings), and kept genes that were highly expressed in one group and weakly expressed in the other.
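
The abstract does not give formulas for the three statistics, so the Python sketch below is only one possible reading of them, not the authors' definitions. It assumes that each assay receives a nonnegative weight taken from the absolute loadings of the selected principal component, normalized to sum to one within each group; that the distinguishing component is the one whose mean loadings differ most between groups; and that the standard deviation of Tdiff is estimated from the pooled per-gene variance. The function names loading_weights, weighted_stats, and pooled_t are hypothetical. Under these assumptions, equal weights reduce Tscale to the usual pooled t statistic.

```python
import numpy as np

def loading_weights(X, in_a):
    """Per-assay weights from one principal component (an assumed construction).

    X is genes x assays; in_a is a boolean mask marking the assays of group A.
    """
    X = np.asarray(X, dtype=float)
    in_a = np.asarray(in_a, dtype=bool)
    Xc = X - X.mean(axis=0)                        # center each assay (column)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    # Assumed selection rule: component whose mean loadings differ most between groups.
    gap = np.abs(vt[:, in_a].mean(axis=1) - vt[:, ~in_a].mean(axis=1))
    v = np.abs(vt[np.argmax(gap)])
    w = np.empty_like(v)
    w[in_a] = v[in_a] / v[in_a].sum()              # weights sum to 1 within group A
    w[~in_a] = v[~in_a] / v[~in_a].sum()           # and within group B
    return w

def weighted_stats(X, in_a, w):
    """Return (tdiff, tscale) per gene for assay weights w."""
    X = np.asarray(X, dtype=float)
    in_a = np.asarray(in_a, dtype=bool)
    w = np.asarray(w, dtype=float)
    a, b = X[:, in_a], X[:, ~in_a]
    # Difference of the two weighted linear combinations.
    tdiff = a @ w[in_a] - b @ w[~in_a]
    na, nb = a.shape[1], b.shape[1]
    # Pooled per-gene variance estimate.
    s2 = ((na - 1) * a.var(axis=1, ddof=1) + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    # Var(tdiff) = sigma^2 * sum(w^2) for independent errors with common variance.
    tscale = tdiff / np.sqrt(s2 * np.sum(w ** 2))
    return tdiff, tscale

def pooled_t(X, in_a):
    """Pooled two-sample t statistic: the equal-weight special case."""
    in_a = np.asarray(in_a, dtype=bool)
    w = np.where(in_a, 1.0 / in_a.sum(), 1.0 / (~in_a).sum())
    return weighted_stats(X, in_a, w)[1]
```

Genes would then be ranked by the absolute value of the chosen statistic. A gene with zero pooled variance would make the scaled statistics undefined, which this sketch does not handle.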

The second study applied the proposed method to data from Steelman et al. (2006), a study of myostatin, an inhibitor of skeletal muscle growth, with five 5-wk-old myostatin-knockout (called “mutant”) mice and five non-treated (called “wild-type”) mice. In this study we compared the results of the proposed method with the q-value approach developed by Storey and Tibshirani (2003), which uses Tpooled and a novel method to determine significance while controlling for false positives. The agreement between these methods was very good but not perfect. More specifically, the signature from the proposed method contained a small but significant number of highly ranked genes that were not in the signature of the q-value method. As a result, the third study was launched to compare the strengths of the two methods in a simulation study based on the statistical properties of this data set.

The specific details of the artificially generated data in the third study are as follows. There were five mice in each group. For each case of data, 40,000 genes were simulated from a population similar to the real data set. For each of the mice in one of the groups, the mean levels for 200 of these genes were δ units greater than those of the corresponding genes for the other group of five mice. δ was 1 or 3 for each case of simulated data. Each result is based on an average of five cases under a constant set of conditions except for random measurement error in the expressed value of each gene. This study consisted of two parts. The first part held the variance of the measurement error constant for each case of data, and the second part varied this value randomly from gene to gene. In the first part, Tdiff was slightly better than Tscale and much better than Tpooled, and it performed excellently at the conditions most similar to the real data (i.e., about 100% accuracy). In comparing the q-value method with the MC-method, the MC-method performed better. In the second part, Tdiff performed the best with δ = 3 but much worse than both of the other test statistics with δ = 1. Tscale performed slightly worse than Tdiff with δ = 3 and the best with δ = 1. Tpooled was significantly worse than Tscale in the case with δ = 1. Thus, Tscale consistently performed well in all cases. In this part the q-value method and the MC-method performed very similarly. Thus, the use of the proposed PCA approach has the effect of significantly increasing statistical power (finding the significantly differentially expressed genes with high probability) without adversely increasing false positives. Secondly, the MC-method appears to be at least as effective as the q-value method in determining the cutoff for the number of genes that defines the length of an assay-specific signature.
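
The simulation design described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: measurement errors are drawn from a zero-mean normal distribution, the baseline mean of every gene is zero, and accuracy is scored as the fraction of truly shifted genes among the top-ranked ones. The names simulate_case and top_k_recall are hypothetical, and a plain difference of group means stands in for the test statistics compared in the talk.

```python
import numpy as np

def simulate_case(n_genes=40_000, n_de=200, delta=3.0, n_per_group=5, sigma=1.0, seed=0):
    """Simulate one case: genes x assays matrix, group mask, and true DE mask.

    sigma may be a scalar (constant error variance, as in the first part) or
    an array with one value per gene (gene-to-gene variance, as in the second).
    The normal error model is an assumption, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    in_a = np.arange(2 * n_per_group) < n_per_group      # first half of assays = group A
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (n_genes,))
    X = rng.normal(size=(n_genes, 2 * n_per_group)) * sigma[:, None]
    is_de = np.zeros(n_genes, dtype=bool)
    is_de[rng.choice(n_genes, size=n_de, replace=False)] = True
    X[np.ix_(is_de, in_a)] += delta                      # shift the DE genes in group A only
    return X, in_a, is_de

def top_k_recall(stat, is_de):
    """Fraction of truly shifted genes among the k largest |stat|, with k = number shifted."""
    k = int(is_de.sum())
    top = np.argsort(-np.abs(stat))[:k]
    return float(is_de[top].mean())

# One case with delta = 3 and constant error variance.
X, in_a, is_de = simulate_case(delta=3.0)
mean_diff = X[:, in_a].mean(axis=1) - X[:, ~in_a].mean(axis=1)
recall = top_k_recall(mean_diff, is_de)
```

Averaging the recall over five seeds for each value of δ would mirror the five cases per condition described above, and passing a per-gene array for sigma would mimic the second part of the study.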

1. Rollins, D. K., D. Zhai, A. L. Joe, J. W. Guidarelli, and R. Gonzalez, "A Novel Data Mining Method to Identify Assay-Specific Signatures in Functional Genomic Studies," BMC Bioinformatics, 7, 377-395 (2006).

2. Misra, J., Schmitt, W., Hwang, D., Hsiao, LL., Gullans, S., Stephanopoulos, G., & Stephanopoulos, G. (2002) Genome Res. 12, 1112-1120.

3. Gonzalez, R., Tao, H., Purvis, J. E., Shanmugam, K.T., York, S.W., & Ingram, L.O. (2003) Biotechnol. Prog. 19, 612-623.

4. Devore, J. L., "Probability and Statistics for Engineering and the Sciences," 6th Edition (2004).

5. Steelman, C. A., J. C. Recknor, D. Nettleton, and J. M. Reecy, “Transcriptional profiling of myostatin-knockout mice implicates Wnt signaling in postnatal skeletal muscle growth and hypertrophy,” The FASEB Journal 10.1096/fj.05-5125fje (2006).

6. Storey, J. D., and Tibshirani, R., "Statistical significance for genomewide studies," Proceedings of the National Academy of Sciences 100, 9440-9445 (2003).