limma analysis of firma scores

53 views
Skip to first unread message

Frank

unread,
Mar 3, 2009, 8:04:19 PM3/3/09
to aroma.affymetrix
Hi everybody,

I would like to ask some general questions about the limma analysis.

Suppose you have an experiment with two groups (n_1 and n_2 arrays for
each group) and you have selected a specific gene g with n_g
probesets. If you would like to test a probeset of that gene for
differential expression/alternative splicing you could just do a t-
test for that probeset, right?
Now you want to test the whole gene for alternative splicing. Since
you have n_g probesets you have a multiple comparison and therefor you
should use a multicomparison test, right?

My impression was, the limma-package can do this multicomparison test
by executing:

#fs is the 'standard' FirmaSet-object
fsDF <- extractDataFrame(fs, addNames=TRUE)
fsDF <- fsDF[(1:1000),] #only for test purpose to speed things up
fsDF[,-c(1:5)] <- log2(fsDF[,-c(1:5)])
design <- cbind(Grp1=1,Grp2=c(rep(0,n_1),rep(1,n_2)))
fit<-lmFit(fsDF[,-c(1:5)],design)
fit<-eBayes(fit)
fit$genes<-fsDF[,1]
(x<-topTable(fit,coef="Grp2",adjust="BH",number=10,genelist=fsDF[,
1]))
...
ID logFC t P.Value
adj.P.Val B
964 ENSG00000004455 -5.397440e+00 -1.075052e+01 5.735220e-07
0.0005654927 6.0790934
961 ENSG00000004455 3.835360e+00 8.000724e+00 9.142608e-06
0.0045073059 3.8392302
963 ENSG00000004455 2.538539e+00 6.312477e+00 7.306077e-05
0.0240126401 1.9954011
962 ENSG00000004455 2.896122e+00 5.319056e+00 2.944042e-04
0.0706149086 0.7042825
...

Obviously this provides test-statistics per probeset and not per gene.
Then what is the difference to the normal t-test per exon mentioned
above (moderated t-statistic?)?
The User's Guide, Chapter 10 gives some explanations for the topTable-
statistics. For example the B-statistic = log-odds that the >GENE< is
differentially expressed. But with the above input I get a B-statistic
for every >PROBESET<. (Or maybe I got the meaning of "gene" in the
User's Guide wrong?)


So my problem is: shouldn't I somehow group my probesets into genes or
'tell' limma which probesets relate to which gene, if I want to draw
any conclusions on a gene-level and not on a probeset-level. It seems
that limma (if executed like above) does not, what I expected. And so
I got big problems interpreting the limma results. Could someone help
me out and tell me, how to interpret the given output of topTable to
me and especially direct me how I can come from probeset-based
conclusions to gene-based conclusions.

Furthermore I got no idea what the difference is between
design <- cbind(Grp1=1,Grp2=c(rep(0,n_1),rep(1,n_2)))
design <- cbind(Grp1=c(rep(1,n_1),rep(0,n_2)),Grp2=c(rep(0,n_1),rep
(1,n_2)))
Both designs are given in the User's Guide and were already discussed
here in another thread, but I can't tell the difference. Could it be
explained in more practical words? I guess that you also have to
adjust the coef-variable in topTable according to the used design-
variable.


I know these are very general questions, but maybe someone can help me
out. Thanks a lot,
Frank

Mark Robinson

unread,
Mar 4, 2009, 1:06:05 AM3/4/09
to aroma-af...@googlegroups.com
Hi Frank.

Some comments below.


On 04/03/2009, at 12:04 PM, Frank wrote:

>
> Hi everybody,
>
> I would like to ask some general questions about the limma analysis.
>
> Suppose you have an experiment with two groups (n_1 and n_2 arrays for
> each group) and you have selected a specific gene g with n_g
> probesets. If you would like to test a probeset of that gene for
> differential expression/alternative splicing you could just do a t-
> test for that probeset, right?

I don't think its quite this simple. At least the way we've been
working with exon arrays, testing for differential expression is done
at the gene-level (by using all probes for the gene to calculate the
expression summaries) whereas testing for differential splicing is
done at the "probeset"-level (where probeset here typically means 4
probes for a probe selection region [PSR]).

And, testing for differential expression (gene-level) and testing for
differential splicing (probeset-level) is done independently.


> Now you want to test the whole gene for alternative splicing. Since
> you have n_g probesets you have a multiple comparison and therefor you
> should use a multicomparison test, right?

See above. As currently implemented, we are not proposing tests for
differential splicing at the gene level. It may be possible though.
You get a B/t/M statistic for every PROBESET from topTable() because
you gave limma a table of PROBESET-level data. FIRMA scores are
calculated at the PROBESET level (which is your data), not the GENE
level.

Lets take a step back. What are we trying to do? When I made the
suggestion to use limma, that was in response to having a table of
FIRMA scores for each probeset and each sample. The goal is to find
probesets (i.e. not directly genes) that have very different FIRMA
scores. My feeling was that the moderation that limma does is
generally helpful and would help stabilize the detection of
differentially spliced probesets.



> So my problem is: shouldn't I somehow group my probesets into genes or
> 'tell' limma which probesets relate to which gene, if I want to draw
> any conclusions on a gene-level and not on a probeset-level. It seems
> that limma (if executed like above) does not, what I expected. And so
> I got big problems interpreting the limma results. Could someone help
> me out and tell me, how to interpret the given output of topTable to
> me and especially direct me how I can come from probeset-based
> conclusions to gene-based conclusions.


It seems to me that for splicing, you'd be interested to highlight
individual exons that are differentially spliced. For many types of
alternative splicing, only specific exons (e.g. cassette exons) will
deviate. You could highlight genes, but then you'd have to go back to
find which exon is contributing. Any, maybe in doing so, the exons
that don't deviate much will dampen the gene-level effect.



> Furthermore I got no idea what the difference is between
> design <- cbind(Grp1=1,Grp2=c(rep(0,n_1),rep(1,n_2)))
> design <- cbind(Grp1=c(rep(1,n_1),rep(0,n_2)),Grp2=c(rep(0,n_1),rep
> (1,n_2)))
> Both designs are given in the User's Guide and were already discussed
> here in another thread, but I can't tell the difference. Could it be
> explained in more practical words? I guess that you also have to
> adjust the coef-variable in topTable according to the used design-
> variable.


This is just a change in parameterization. The results for testing
the difference between the two groups will be identical (if done
right), but the models that are fit are slightly different.

Model 1:

Y = alpha + beta * x (where x is a binary indicator variable)
- test the null hypothesis that beta=0

Model 2:

Y = mu1 * x1 + mu2 * x2 (where x1, x2 are both binary indicator
variables)
- test the null hypothesis that mu1=mu2

Hope that clears it up.

Cheers,
Mark


>
> I know these are very general questions, but maybe someone can help me
> out. Thanks a lot,
> Frank
> >

------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.rob...@garvan.org.au
e: mrob...@wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------




Reply all
Reply to author
Forward
0 new messages