COAD survival analysis with mRNAseq gene expression

136 views
Skip to first unread message

Roberto D'Ambrosio

unread,
Aug 26, 2016, 9:50:04 AM8/26/16
to gd...@broadinstitute.org
Dear GDAC-Staff,

I'm trying to perform the survival analysis that is reported here: 
(http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/COAD-TP/Correlate_Clinical_vs_mRNAseq/nozzle.html)

From the link "mRNAseq_Preprocess", I've downloaded the archive "gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0".

I noticed that in your report you wrote:
"The input file "COAD-TP.uncv2.mRNAseq_RSEM_normalized_log2.txt" is generated in the pipeline mRNAseq_Preprocess in the stddata run"

But in the archive I can only find the file: COAD.uncv2.mRNAseq_RSEM_normalized_log2
 
Are those file the same? What is the meaning of "TP" in the name of the file COAD-TP?

Working on the data I noticed that there are several (~43) samples that are duplicated. How did you handled them? Did you remove the duplicates? If yes, according to which criterion? 


Concerning the clinical data, I've downloaded the archive "Merge_Clinical " and extracted the information of interest from the file gdac.broadinstitute.org_COAD.Merge_Clinical.Level_1.2016012800.0.0

I used the approach described in the report to set the survival time.
I noticed that the patient tcga-aa-3521 has no information for the time end-point. Moreover the sample  tcga-ck-6746 has a negative number of days to the last follow up. How those two cases are handled in your analysis?

I will appreciate any help that you can provide.

Best regards,

Roberto D'Ambrosio
 



Michael Noble

unread,
Aug 26, 2016, 10:54:29 AM8/26/16
to rbrtda...@gmail.com, gd...@broadinstitute.org
Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,
-Mike
Michael S. Noble
Associate Director for Data Science
Cancer Genome Computational Analysis
The Broad Institute of MIT & Harvard
Manager, Genome Data Analysis Center
The Cancer Genome Atlas

Hailei Zhang

unread,
Aug 29, 2016, 2:24:03 PM8/29/16
to Michael Noble, rbrtda...@gmail.com, GDAC Pipeline Role Account
Hi Robert,

You downloaded the file is not the directly input file for this correlation analysis. This correlation analysis, we only consider primary tumor samples, this is why you see the input file called "COAD-TP". "TP" means tumor primary. 

You downloaded the file of "COAD.uncv2.mRNAseq_RSEM_normalized_log2" which includes all coad samples (tumors, normal tissues and other types.). 

The names with this pattern: "TCGA.**.****.01" means primary tumor. 

For the clinical file, we did not process the any clinical files.All the values is from the original clinical file. 

We will ignore the patient with negative value of days to last follow up or missing value.
 
Best,
Hailei 

On Fri, Aug 26, 2016 at 10:54 AM, Michael Noble <mno...@broadinstitute.org> wrote:
Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,
-Mike
Michael S. Noble
Associate Director for Data Science
Cancer Genome Computational Analysis
The Broad Institute of MIT & Harvard
Manager, Genome Data Analysis Center
The Cancer Genome Atlas

Roberto D'Ambrosio

unread,
Aug 31, 2016, 10:47:08 AM8/31/16
to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account
Thanks Michael and Hailei for your answer and explanations.

I have  additional questions regarding the methodology used to estimate the correlation of the gene with the survival time.
According to the report:

"For survival clinical features, logrank test in univariate Cox regression analysis with proportional hazards model (Andersen and Gill 1982) was used to estimate the P values comparing quantile intervals using the 'coxph' function in R. Kaplan-Meier survival curves were plotted using quantile intervals at c(0, 0.25, 0.50, 0.75, 1). If there is only one interval group, it will not try survival analysis."

I can't really understand which is the procedure that you used.

My best guess is that you used the univariate cox model (coxph function in R) to estimate a risk score for each gene. As second step this score is used to divide the samples (patients) into groups according to the quantile.  Then the logrank test is computed from the  Kaplan-Meier curves obtained for each group.

Am I right? If yes, what do you mean with "If there is only one interval group, it will not try survival analysis."?
Is the logrank computed on all the curves at once or performing pairwise comparisons between the curves? Did you use any correction for multiple testing?

Again I will really appreciate any help that you can provide.

Best Regards,

Roberto D'Ambrosio




On Mon, 29 Aug 2016 at 20:24 Hailei Zhang <hai...@broadinstitute.org> wrote:
Hi Robert,

You downloaded the file is not the directly input file for this correlation analysis. This correlation analysis, we only consider primary tumor samples, this is why you see the input file called "COAD-TP". "TP" means tumor primary. 

You downloaded the file of "COAD.uncv2.mRNAseq_RSEM_normalized_log2" which includes all coad samples (tumors, normal tissues and other types.). 

The names with this pattern: "TCGA.**.****.01" means primary tumor. 

For the clinical file, we did not process the any clinical files.All the values is from the original clinical file. 

We will ignore the patient with negative value of days to last follow up or missing value.
 
Best,
Hailei 
On Fri, Aug 26, 2016 at 10:54 AM, Michael Noble <mno...@broadinstitute.org> wrote:
Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,
-Mike
Michael S. Noble
Associate Director for Data Science
Cancer Genome Computational Analysis
The Broad Institute of MIT & Harvard
Manager, Genome Data Analysis Center
The Cancer Genome Atlas

Hailei Zhang

unread,
Sep 1, 2016, 12:45:51 PM9/1/16
to Roberto D'Ambrosio, Michael Noble, GDAC Pipeline Role Account
Hi Reberto,

For the survival analysis, we divide samples by each gene expression data. Then we use Surv(), coxph() to calculate the logrank P value. At last, we use p.adjust() to get Q value. 

Here is our script for this survival anlaysis:

cox.simple=function(vv, st, se) {
    vv=as.numeric(vv)
    ind=complete.cases(vv, st, se)
    vv=vv[ind]
    if (length(vv)<5) return(rep(NA, 5))
    
    #first, check whether or not the data quantile is good to try survival curve.  
    brks0=quantile(vv, c(0, 0.25, 0.50, 0.75, 1))
    brks0 <- unique(brks0)
    brks0b=round(brks0, 2)
    vv2=as.numeric(cut(vv, brks0, include.lowest=T))
    
    # the number of intervals of quantile is the number of clusters to try survival curve.
    n.quantile.interval = length(table(vv2)) 
    #print("trying cox.simple...check. how many intervals of data quantile for survival curve?")
    #print(n.quantile.interval)
    if(n.quantile.interval==1){
      res=NA
    }else{
      ss=Surv(st[ind], se[ind])
      # Try coxph to get p.val using qantile interval categories of the data
      fit = coxph(ss~factor(vv2))
      fit2=summary(fit)
      logrank_P=as.vector(fit2$sctest[3]) #c('logrank_P')
      res= logrank_P
    }
    return(res)
}


Best,
Hailei 

Hello Robert,

Roberto D'Ambrosio

unread,
Sep 7, 2016, 10:42:33 AM9/7/16
to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account
Dear Hailei,

thanks for the explanation and the code. That really helped.

Unfortunately I'm still having problems in achieving the same results obtained in the report

I downloaded the 
analysis results and I was trying to replicate the results of the gene  PPFIA4|8497 (see attached pdf V1-1.ex.pdf).
The quantile intervals are  -0.602, 2.972, 4.049, 4.927, 7.946.

I generate a script that read the gene expression from the file COAD.uncv2.mRNAseq_RSEM_normalized_log2.txt
using only the primary tumor samples. The clinical data are extracted from COAD.clin.merged.txt
Using my script (attached main_FH.R) I couldn't replicate the same values. The intervals that I obtained are:
[1] -0.6020875  3.0205699  4.0473938  4.9207626  7.9464507

The KM curves  relative to the data are in the file MyPlotPPFIA4|8497.pdf.
As you can see these curves are different from those propose in your analysis.

In my opinion this mismatch is due to some mistakes on my side regarding the definition of the events and the survival time.
In order to define these survival function I fallowed the description in the section "Selected clinical features" of the report:
http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/COAD-TP/Correlate_Clinical_vs_mRNAseq/nozzle.html

From the description in:
https://confluence.broadinstitute.org/download/attachments/29790363/selection_union.87.txt?version=1&modificationDate=1414611230000

I derived the definition of:
vital_status 
days_to_death 
days_to_last_followup

Continuing fallowing the description in the section Selected clinical features I combined the temporal information (days_to_death,days_to_last_followup) in a single variable days_to_death_or_fup.
The negatives values are removed (only one case). There is a missing value but is ignored since will be removed by the function complete.cases in your script.



I'm wondering if you spot any mistakes in my procedure.
I will really appreciate any help on this topic .

Best regards,

Roberto D'ambrosio
 
main_FH.R
MyPlotPPFIA4|8497.pdf
V1-1.ex.pdf

Roberto D'Ambrosio

unread,
Sep 23, 2016, 9:37:15 AM9/23/16
to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account
Dear Hailei and Michael,


I have partially solved the issues that I described in my previous email.

My survival time and vital_status were incomplete.
In the definition of the clinical information given in:
https://confluence.broadinstitute.org/download/attachments/29790363/selection_union.87.txt?version=1&modificationDate=1414611230000
the days_to_last_followup, days_to_last_death variables are builded using only the quantities
"patient.days_to_last_followup" / "patient.follow_ups.follow_up.days_to_last_followup" and 
"patient.days_to_death" / "patient.follow_ups.follow_up.days_to_death" respectively.

But in order to obtain the same plots available in the report results (as an example the attached files V1-1, V1-2, V1-3) it is necessary to include also:
"patient.follow_ups.follow_up-2.days_to_last_followup", "patient.follow_ups.follow_up-3.days_to_last_followup", "patient.follow_ups.follow_up-4.days_to_last_followup"
and
"patient.follow_ups.follow_up-2.days_to_death" ,"patient.follow_ups.follow_up-3.days_to_death","patient.follow_ups.follow_up-4.days_to_death".

A similar approach is necessary for the vital_status variable.



Considering these clinical information in time and vital_status variables, i'm able to replicate yours Kaplan Meier (KM) curves: (attached files My_PPFIA4|8497, My_LBX2|85474, My_ENO3|2027) 
 
As you can see in the attached files, even obtaining curves almost identical, the quantile groups and the Hazard ratio are slightly different.
Since 
- the KM curves are the same (time and vital_status definition is ok), 
- the code that I'm using to compute the quantile and the coxph is the one that you provided to me,
 I believe that there is a mismatch between the expression data that I'm using and the one used to produce the results.

It is possible that there is a difference between the data in the archive "gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0
(file: COAD.uncv2.mRNAseq_RSEM_normalized_log2, maintaining only PT) and the version of the data used to generate the report?
 
Best Regards,

Roberto D'Ambrosio
--
Roberto D'Ambrosio, Post-doc
 
Department of Computing Science and Engineering (INGI) 
ICTEAM Institute - Polytechnic School of Louvain
Université catholique de Louvain
Place Sainte Barbe 2 - Louvain-la-Neuve - Belgium

Tel: +32 470 56 22 47      
My_LBX2_85474.pdf
My_ENO3_2027.pdf
V1-3.ex.pdf
V1-1.ex.pdf
My_PPFIA4_8497.pdf
V1-2.ex.pdf
Reply all
Reply to author
Forward
0 new messages