Expression profiles using ggplot + geom_smooth


Kevin Rue

Mar 14, 2014, 7:48:02 AM
to openseq...@googlegroups.com
Hi Karsten, Markus, Lisa,

After our openSequencing meeting this Monday, I finally took the time to test whether removing the 0h control samples would fix the huge confidence interval observed for the control group at later time points.

In short: yes, removing the 0h control samples does reduce the confidence interval "significantly".
I added four examples in the attached presentation to illustrate the point.

As Markus mentioned during the meeting, we are sort of "doing physics" in those graphs. The extra time point for the control samples puts additional "tension" on the linear regression of the control group, because the interval between 0h and 2h is short compared to the later intervals while the difference in expression across it is non-negligible.

The treated groups do not have this extra "tension", which eases their linear regression, with smaller confidence intervals.

Bottom line: fitted curves and their confidence intervals are more likely to be comparable when all groups have the same number of data points, i.e. the same time points.

Best,
Kevin




--
Kévin RUE-ALBRECHT
Wellcome Trust Computational Infection Biology PhD Programme
University College Dublin
Ireland
http://fr.linkedin.com/pub/k%C3%A9vin-rue/28/a45/149/en
OpenSequencing_removeExtraTimePoint.pptx

Ricardo Segurado

Mar 14, 2014, 8:55:32 AM
to openseq...@googlegroups.com
Sorry to chime in out of the blue (I can rarely find time free on Mondays... so I'm a lurker here).

What am I looking at in these plots? The curves don't seem to be linear regression fits, and the x-axis is time, so are there repeated measures for each sample? And you did fit a model which had a different number of time-points for different samples, so it must be a hierarchical/mixed model if it didn't just drop out the "cases" due to list-wise deletion.

Nosey parker here wonders what software you're using, and what the data is.

Thanks!
Ricardo



Kevin Rue

Mar 14, 2014, 11:14:44 AM
to openseq...@googlegroups.com
Hi all,

So for those who couldn't make it to the meeting (baaaaaaaaad... ^_^)
  • These plots show the log2 CPM (counts per million) for a given gene in different treatment groups at (0,) 2, 6, 24 and 48 hours post-treatment.
  • Each treatment group contains 10 biological replicates at each timepoint (timepoints marked by the black dotted lines).
  • The control group has an additional time point at 0 hours. Obviously, it doesn't make sense to have a time zero for treated samples.
  • At each timepoint in each group, one biological replicate is an aliquot of alveolar macrophages from one cow. Alveolar macrophages were collected from ten different cows. An aliquot from each cow was exposed to each of the three treatments.
Now, my bad. I wrote "linear regression", which is definitely not what's there!
It is a loess regression, where each treatment group is represented by a line (the estimated mean) and a colored area (the confidence interval at level 0.99). Loess is in fact a non-linear, non-parametric method that fits a weighted local regression in the neighbourhood of each point and joins these local fits into a smooth curve. (http://en.wikipedia.org/wiki/Local_regression)

Indeed, the point of my initial email was that the graphs I presented at the opensequencing meeting (the ones on the left side of the slides) did not include the same number of time points in every treatment group. More specifically, the control group had 5 timepoints (0, 2, 6, 24, 48) while MB and TB only had 4 timepoints (2, 6, 24, 48).

The software I am using is R, more specifically the "ggplot2" package.
The function is ggplot() + geom_smooth(aes(...), level = 0.99); note that level is an argument of geom_smooth() itself, not an aesthetic inside aes().
The default level is 0.95 but I boldly went where a bunch of bioinformaticians went before.
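
For anyone who wants to try this kind of plot, here is a minimal R sketch of the call described above. It is not the code behind the slides: the data frame `design`, its column names (`animal`, `treatment`, `time_h`, `log2cpm`), the helper `plot_profile()` and the simulated expression values are all assumptions made up for illustration. Only the design (10 animals, control at 0, 2, 6, 24, 48 h, MB and TB at 2, 6, 24, 48 h) and the use of geom_smooth() with loess at level 0.99 come from this thread.

```r
library(ggplot2)

set.seed(1)  # reproducible simulated values

# Hypothetical design mirroring the thread: 10 animals, three groups,
# five time points, then drop 0 h for the treated groups (MB, TB).
design <- expand.grid(
  animal    = factor(1:10),
  treatment = c("control", "MB", "TB"),
  time_h    = c(0, 2, 6, 24, 48),
  stringsAsFactors = FALSE
)
design <- design[!(design$treatment != "control" & design$time_h == 0), ]

# Simulated log2 CPM values for one made-up gene (NOT real data).
design$log2cpm <- rnorm(nrow(design), mean = 5, sd = 0.5)

# Observations per group and time point: only control has a 0 h column.
print(table(design$treatment, design$time_h))

# Hypothetical helper: one loess curve and 99% band per treatment group.
# `level` is passed to geom_smooth() directly, outside aes().
plot_profile <- function(d) {
  ggplot(d, aes(x = time_h, y = log2cpm,
                colour = treatment, fill = treatment)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "loess", formula = y ~ x, level = 0.99) +
    geom_vline(xintercept = sort(unique(d$time_h)), linetype = "dotted") +
    labs(x = "Hours post-treatment", y = "log2 CPM")
}

p_all  <- plot_profile(design)                       # control keeps its 0 h samples
p_no0h <- plot_profile(design[design$time_h > 0, ])  # same time points in every group
```

Calling print(p_all) and print(p_no0h) renders the two versions side by side for comparison. With only four or five distinct time values per group, loess may emit warnings when the plots are drawn, and how wide the bands come out will depend on the data.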

I think I have addressed all the questions now; otherwise, let me know! :)

Best,
Kevin

Ricardo Segurado

Mar 14, 2014, 11:53:48 AM
to openseq...@googlegroups.com
I will turn up some day, I promise. I'm not much of a bioinformatician, but can I sit at the back and make statistical comments? Or at least learn some new uses of R!

Thanks very much for the explanation, I think I get it now. 10 cows -> 3 aliquots from each -> 1 aliquot per treatment -> 4 or 5 readings per aliquot. Something like that?

It's a nice way to visualise the treatment effects, and even with a small number of cows you can see the treatment differences by 48h.

Regards,
Ricardo

Kevin Rue

Mar 14, 2014, 12:31:25 PM
to openseq...@googlegroups.com
Hi Ricardo,

Close enough, I just confirmed with the post-doc who actually participated in the experiment:

10 cows (in fact calves) ->
6 aliquots from each ->
2 aliquots per treatment (in case one gets contaminated, the other one can be used; if both are okay, the aliquots are pooled before analysis) ->
RNA purified, barcoded, multiplexed, and sequenced on an Illumina platform.

Not sure what you had in mind regarding the number of readings. In the end, we had a single expression value per gene per animal per treatment per timepoint.
Therefore, 10 expression values per gene per treatment per timepoint in the figures I shared.

Kevin

Ricardo Segurado

Mar 14, 2014, 12:59:36 PM
to openseq...@googlegroups.com
I see. By the number of readings, I meant one reading at each time-point, i.e. some cells are drawn for sequencing from the same aliquot/plate at (0h), 2h, 6h, 24h and 48h, for each treatment x animal.

Would anyone happen to have a link to any roughly similar published studies, so I can go off and educate myself a bit?

Thanks again
Ricardo