I'm a mathematician trying to test some things on gene expression data
Question
I'm a mathematician trying to test some things on gene expression data, so I'm skimming various articles, such as Sotiriou et al., to understand what is typically done with such data sets. Several things confuse me; in particular, a paragraph in Sotiriou et al. reads:
"Clinical parameters such as ER status, [...] affect the behavior of breast cancers. We asked whether these clinical/pathologic characteristics were associated with differential gene expression. Parametric t tests identified 606 probe elements of 7,650 elements represented in our array that could segregate ER+ and ER- breast tumors (P < 0.001)."
As segregation of ER+/- based on gene expression is one of several things I'm interested in attempting through novel methods, I have been trying to understand precisely what is meant by the above paragraph. To recap the article, there are 99 patients, each with 7,650 probe expression values and one ER+/- value. The article sets out to determine which of those 7,650 probes successfully segregate the dataset into ER+ and ER-.
I've run the above paragraph by a nearby statistician, and he could not for the life of him figure out what was done, and had not even heard of such a thing as a "parametric t test". This leads me to suspect that the term is specific to biology, so I ask: what is meant? It is also unclear to me (and him) what the P-value means in this context.
I hope the scope of this question isn't too broad. Of course I want to avoid asking "explain this article to me, the outsider, please"; I do believe the paragraph above is relatively self-contained in the context of gene expression.
Explanation / Answer
I understand this in the following way:
For each probe you have two sets of measurements, one from the ER+ tumors and one from the ER- tumors. You run a t-test on these two sets to test whether their means differ significantly (this is what they mean by "segregate"). As I understand it, "parametric" just emphasizes that the t-test is a parametric test, i.e. it assumes the values in each group are approximately normally distributed. Repeating the test for all 7,650 probes gives a set of 7,650 p-values, where each p-value is the probability of seeing a difference in means at least this large if the two groups truly had the same mean for that probe. You then apply some multiple-testing correction, such as a Bonferroni correction (I haven't checked whether the paper did this, but they obviously should have). Finally, they find that 606 of the p-values are significant (P < 0.001 in the quoted paragraph), suggesting that those probes can "separate" ER+ from ER-.
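To make the procedure concrete, here is a minimal sketch in Python with NumPy and SciPy. It is not the paper's code: the data are synthetic, and the 65/34 split of the 99 patients, the number of truly shifted probes (600) and the size of the shift are all assumptions made up for illustration. Only the 7,650 probes, the 99 patients and the 0.001 threshold come from the quoted paragraph. Note that with 7,650 tests at P < 0.001 you would expect about 7,650 × 0.001 ≈ 7.65 false positives even if no probe were related to ER status, which is why the correction matters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_probes = 7650          # from the quoted paragraph
n_pos, n_neg = 65, 34    # ASSUMPTION: hypothetical ER+/ER- split of the 99 patients
n_shifted = 600          # ASSUMPTION: number of probes that truly differ
alpha = 0.001            # threshold from the quoted paragraph

# Synthetic expression matrices: one row per probe, one column per patient.
expr_pos = rng.normal(0.0, 1.0, size=(n_probes, n_pos))
expr_neg = rng.normal(0.0, 1.0, size=(n_probes, n_neg))
expr_pos[:n_shifted] += 1.5  # ASSUMPTION: made-up mean shift for the "real" probes

# One two-sample t-test per probe (row), giving 7,650 p-values.
t_stat, p_val = stats.ttest_ind(expr_pos, expr_neg, axis=1)

# Probes called significant without and with a Bonferroni correction.
hits_raw = p_val < alpha
hits_bonf = p_val < alpha / n_probes

# Expected number of false positives at the raw threshold if nothing differed.
expected_fp = alpha * n_probes

print("significant at raw threshold:", int(hits_raw.sum()))
print("significant after Bonferroni:", int(hits_bonf.sum()))
print("expected false positives under the null:", expected_fp)
```

Bonferroni is the simplest correction but also a conservative one; it is used here only because it is the one mentioned above.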
As a computational biologist, I would advise you to look specifically at bioinformatics papers if you want to develop new methods, since the analysis in "pure biology" papers is often lacking and will not give you a good picture of state-of-the-art analysis methods. For the specific question of separating groups based on gene expression, you should look into machine learning, as it has been widely applied to this problem.