The work presented in this talk extends the method of Rollins et al. to determine a rank-ordered set of genes that are expressed most differently between two groups of assays. The method selects the principal component that most distinguishes the two groups and creates test statistics based on the differential contribution of each gene. This work evaluates two new test statistics for this purpose. The first, which we call Tdiff, is the difference between the linear combinations created from this principal component for each group. The second, which we call Tscale, is Tdiff divided by its estimated standard deviation. The current approach uses the common pooled Student's t statistic (Tpooled) (Devore, 2007), which is simply a scaled statistic that weighs each assay equally within each group. This talk compares these three test statistics in three studies. The first study revisits the single-group analyses in Rollins et al. involving exposure of E. coli cells to two different ethanol concentrations (Gonzalez et al., 2003). Application of the proposed method produced a single signature that contained a large number of the genes in the two signatures found in Rollins et al. and some genes not in either signature, while omitting a significant number of genes that appeared in the two Rollins et al. signatures. This signature essentially eliminated the genes that were highly expressed in both groups, raised the rank of genes with moderate expression in one group and low expression in the other (as weighted by the PCA coefficients, or loadings), and retained genes that were highly expressed in one group and low in the other.
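The statistics described above can be sketched in a few lines of NumPy. Only the pooled t statistic below follows a standard textbook definition. The rule used to pick the "most distinguishing" principal component (largest absolute pooled t of the component scores) and the formulas for Tdiff and Tscale (loading times the difference in group means, then divided by an unpooled standard error) are assumptions made for illustration, and the function names are hypothetical; the abstract does not give the exact definitions.

```python
import numpy as np

def pooled_t(a, b):
    """Pooled two-sample t statistic per row; a, b have shape (genes, assays)."""
    na, nb = a.shape[1], b.shape[1]
    # Pooled variance: each group's sample variance weighted by its degrees of freedom.
    sp2 = ((na - 1) * a.var(axis=1, ddof=1)
           + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def separating_loadings(a, b):
    """Loadings of the PC whose scores best separate the groups (assumed rule)."""
    x = np.hstack([a, b]).T                 # rows: assays, columns: genes
    x = x - x.mean(axis=0)                  # center each gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > 1e-10 * s[0]                 # drop numerically empty components
    scores = (u * s)[:, keep]               # shape (assays, kept components)
    na = a.shape[1]
    t = pooled_t(scores[:na].T, scores[na:].T)
    return vt[keep][int(np.argmax(np.abs(t)))]

def loading_weighted_stats(a, b, loadings):
    """Hypothetical Tdiff and Tscale; an assumed reading, not the authors' formulas."""
    na, nb = a.shape[1], b.shape[1]
    # Assumed Tdiff: loading times the difference in group means for each gene.
    t_diff = loadings * (a.mean(axis=1) - b.mean(axis=1))
    # Assumed Tscale: Tdiff over its estimated standard deviation (unpooled variances).
    se = np.abs(loadings) * np.sqrt(a.var(axis=1, ddof=1) / na
                                    + b.var(axis=1, ddof=1) / nb)
    return t_diff, t_diff / se
```

Under this reading, genes would be ranked by the absolute value of either statistic. A gene whose loading is exactly zero would need separate handling in `loading_weighted_stats`, since its standard error is then zero.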
The second study applied the proposed method to data from Steelman et al. (2006) involving myostatin as an inhibitor of skeletal muscle growth, with five 5-wk-old mice in each of two groups: myostatin (called “mutant”) and non-treated (called “wild-type”). In this study we compared the results of the proposed method with the q-value approach developed by Storey and Tibshirani (2003), which uses Tpooled and a novel method to determine significance while controlling for false positives. The agreement between these methods was very good but not perfect. More specifically, the signature from the proposed method contained a small but significant number of highly ranked genes that were not in the signature of the q-value method. As a result, the third study was launched to compare the strengths of the two methods in a simulation based on the statistical properties of this data set.
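For readers unfamiliar with q-values, the following is a minimal sketch of how they can be computed from a vector of per-gene p-values. It is a simplification: the proportion of true null hypotheses is estimated at a single tuning value (`lam = 0.5`, an assumption), whereas Storey and Tibshirani smooth that estimate over a range of tuning values. The function name `q_values` is hypothetical.

```python
import numpy as np

def q_values(p, lam=0.5):
    """Simplified q-values from p-values, with a fixed-lambda estimate of pi0."""
    p = np.asarray(p, dtype=float)
    m = p.size
    # Estimate the proportion of true nulls from the p-values above lam.
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    # Estimated false discovery rate when rejecting the i smallest p-values.
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: q at rank i is the minimum over all ranks >= i.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)      # return in the original gene order
    return q
```

A signature would then consist of the genes whose q-value falls below a chosen threshold. With `pi0` equal to 1 this reduces to the Benjamini-Hochberg adjustment.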
The specific details of the artificially generated data in the third study are as follows. There were five mice in each group. For each case of data, 40,000 genes were simulated from a population similar to the real data set. For each of the mice in one of the groups, the mean levels for 200 of these genes were δ units greater than those of the corresponding genes for the other group of five mice. The value of δ was 1 or 3 for each case of simulated data. Each result is based on an average of five cases under a constant set of conditions except for random measurement error in the expressed value of each gene. This study consisted of two parts. The first part held the variance of the measurement error constant for each case of data, and the second part varied this value randomly from gene to gene. In the first part, Tdiff was slightly better than Tscale and much better than Tpooled, and it performed excellently at the conditions most similar to the real data (i.e., about 100% accuracy). In comparing the q-value method with the MC-method, the MC-method performed better. In the second part, Tdiff performed the best with δ = 3 but much worse than both of the other test statistics with δ = 1. Tscale performed slightly worse than Tdiff with δ = 3 and the best with δ = 1. Tpooled was significantly worse than Tscale in the case with δ = 1. Thus, Tscale consistently performed well in all cases. In this part the q-value method and the MC-method performed very similarly. Thus, the proposed PCA approach has the effect of significantly increasing statistical power (finding the significantly differentially expressed genes with high probability) without adversely increasing false positives. Secondly, the MC-method appears to be at least as effective as the q-value method in determining the cutoff for the number of genes that defines the length of an assay-specific signature.
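The simulation design described above can be sketched as follows. The sizes (40,000 genes, 200 shifted genes, five mice per group, δ of 1 or 3) come from the text; the normal error model, the zero baseline mean, the unit standard deviation, and the accuracy measure (fraction of the top 200 ranked genes that were truly shifted) are assumptions made for illustration, since the study drew its values from a population similar to the real data. Only Tpooled is ranked here, and the function names are hypothetical.

```python
import numpy as np

def pooled_t(a, b):
    """Pooled two-sample t statistic per gene; a, b have shape (genes, mice)."""
    na, nb = a.shape[1], b.shape[1]
    sp2 = ((na - 1) * a.var(axis=1, ddof=1)
           + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def simulate_case(delta, n_genes=40_000, n_shifted=200, n_mice=5, sigma=1.0, seed=0):
    """One simulated case: the first n_shifted genes are raised by delta in one group."""
    rng = np.random.default_rng(seed)
    # Assumed error model: independent normal noise around a zero baseline.
    wild = rng.normal(0.0, sigma, size=(n_genes, n_mice))
    mutant = rng.normal(0.0, sigma, size=(n_genes, n_mice))
    mutant[:n_shifted] += delta
    return wild, mutant

def top_k_recovery(stat, n_shifted):
    """Fraction of the n_shifted top-ranked genes that were truly shifted."""
    top = np.argsort(-np.abs(stat))[:n_shifted]
    return float(np.mean(top < n_shifted))

for delta in (1, 3):
    wild, mutant = simulate_case(delta)
    print(delta, top_k_recovery(pooled_t(mutant, wild), 200))
```

Passing an array of shape `(n_genes, 1)` as `sigma` would give each gene its own error standard deviation, which loosely mirrors the second part of the study; averaging over several seeds would mirror the five cases per condition.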
1. Rollins, D. K., D. Zhai, A. L. Joe, J. W. Guidarelli, and R. Gonzalez, "A Novel Data Mining Method to Identify Assay-Specific Signatures in Functional Genomic Studies," BMC Bioinformatics, 7, 377-395 (2006).
2. Misra, J., Schmitt, W., Hwang, D., Hsiao, LL., Gullans, S., Stephanopoulos, G., & Stephanopoulos, G. (2002) Genome Res. 12, 1112-1120.
3. Gonzalez, R., Tao, H., Purvis, J. E., Shanmugam, K.T., York, S.W., & Ingram, L.O. (2003) Biotechnol. Prog. 19, 612-623.
4. Devore, J. L., "Probability and Statistics for Engineering and the Sciences," 6th Edition (2004).
5. Steelman, C. A., J. C. Recknor, D. Nettleton, and J. M. Reecy, “Transcriptional profiling of myostatin-knockout mice implicates Wnt signaling in postnatal skeletal muscle growth and hypertrophy,” The FASEB Journal 10.1096/fj.05-5125fje (2006).
6. Storey, J. D., and Tibshirani, R., "Statistical significance for genomewide studies," Proceedings of the National Academy of Sciences 100, 9440-9445 (2003).