COAD survival analysis with mRNAseq gene expression

Roberto D'Ambrosio

unread,

Aug 26, 2016, 9:50:04 AM8/26/16

to gd...@broadinstitute.org

Dear GDAC-Staff,

I'm trying to perform the survival analysis that is reported here:
(http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/COAD-TP/Correlate_Clinical_vs_mRNAseq/nozzle.html)

From the link "mRNAseq_Preprocess", I've downloaded the archive "gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0".

I noticed that in your report you wrote:
"The input file "COAD-TP.uncv2.mRNAseq_RSEM_normalized_log2.txt" is generated in the pipeline mRNAseq_Preprocess in the stddata run"

But in the archive I can only find the file: COAD.uncv2.mRNAseq_RSEM_normalized_log2

Are those file the same? What is the meaning of "TP" in the name of the file COAD-TP?

Working on the data I noticed that there are several (~43) samples that are duplicated. How did you handled them? Did you remove the duplicates? If yes, according to which criterion?

Concerning the clinical data, I've downloaded the archive "Merge_Clinical " and extracted the information of interest from the file gdac.broadinstitute.org_COAD.Merge_Clinical.Level_1.2016012800.0.0

I used the approach described in the report to set the survival time.
I noticed that the patient tcga-aa-3521 has no information for the time end-point. Moreover the sample tcga-ck-6746 has a negative number of days to the last follow up. How those two cases are handled in your analysis?

I will appreciate any help that you can provide.

Best regards,

Roberto D'Ambrosio

Michael Noble

unread,

Aug 26, 2016, 10:54:29 AM8/26/16

to rbrtda...@gmail.com, gd...@broadinstitute.org

Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,

-Mike

—

Michael S. Noble

Associate Director for Data Science

Cancer Genome Computational Analysis

The Broad Institute of MIT & Harvard

Manager, Genome Data Analysis Center

The Cancer Genome Atlas

Hailei Zhang

unread,

Aug 29, 2016, 2:24:03 PM8/29/16

to Michael Noble, rbrtda...@gmail.com, GDAC Pipeline Role Account

Hi Robert,

You downloaded the file is not the directly input file for this correlation analysis. This correlation analysis, we only consider primary tumor samples, this is why you see the input file called "COAD-TP". "TP" means tumor primary.

You downloaded the file of "COAD.uncv2.mRNAseq_RSEM_normalized_log2" which includes all coad samples (tumors, normal tissues and other types.).

The names with this pattern: "TCGA.**.****.01" means primary tumor.

For the clinical file, we did not process the any clinical files.All the values is from the original clinical file.

We will ignore the patient with negative value of days to last follow up or missing value.

Best,

Hailei

On Fri, Aug 26, 2016 at 10:54 AM, Michael Noble <mno...@broadinstitute.org> wrote:

Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,

-Mike
—

Michael S. Noble
Associate Director for Data Science
Cancer Genome Computational Analysis
The Broad Institute of MIT & Harvard
Manager, Genome Data Analysis Center
The Cancer Genome Atlas

Roberto D'Ambrosio

unread,

Aug 31, 2016, 10:47:08 AM8/31/16

to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account

Thanks Michael and Hailei for your answer and explanations.

I have additional questions regarding the methodology used to estimate the correlation of the gene with the survival time.
According to the report:

"For survival clinical features, logrank test in univariate Cox regression analysis with proportional hazards model (Andersen and Gill 1982) was used to estimate the P values comparing quantile intervals using the 'coxph' function in R. Kaplan-Meier survival curves were plotted using quantile intervals at c(0, 0.25, 0.50, 0.75, 1). If there is only one interval group, it will not try survival analysis."

I can't really understand which is the procedure that you used.

My best guess is that you used the univariate cox model (coxph function in R) to estimate a risk score for each gene. As second step this score is used to divide the samples (patients) into groups according to the quantile. Then the logrank test is computed from the Kaplan-Meier curves obtained for each group.

Am I right? If yes, what do you mean with "If there is only one interval group, it will not try survival analysis."?
Is the logrank computed on all the curves at once or performing pairwise comparisons between the curves? Did you use any correction for multiple testing?

Again I will really appreciate any help that you can provide.

Best Regards,

Roberto D'Ambrosio

On Mon, 29 Aug 2016 at 20:24 Hailei Zhang <hai...@broadinstitute.org> wrote:

Hi Robert,

You downloaded the file is not the directly input file for this correlation analysis. This correlation analysis, we only consider primary tumor samples, this is why you see the input file called "COAD-TP". "TP" means tumor primary.

You downloaded the file of "COAD.uncv2.mRNAseq_RSEM_normalized_log2" which includes all coad samples (tumors, normal tissues and other types.).

The names with this pattern: "TCGA.**.****.01" means primary tumor.

For the clinical file, we did not process the any clinical files.All the values is from the original clinical file.

We will ignore the patient with negative value of days to last follow up or missing value.

Best,
Hailei

On Fri, Aug 26, 2016 at 10:54 AM, Michael Noble <mno...@broadinstitute.org> wrote:

Hello Robert,

I wanted to acknowledge your note, and thank you for using our work, but a full reply to this will have to wait until early next week because we are extremely short staffed right now with summer vacations etc.

Best,

-Mike
—

Michael S. Noble
Associate Director for Data Science
Cancer Genome Computational Analysis
The Broad Institute of MIT & Harvard
Manager, Genome Data Analysis Center
The Cancer Genome Atlas

Hailei Zhang

unread,

Sep 1, 2016, 12:45:51 PM9/1/16

to Roberto D'Ambrosio, Michael Noble, GDAC Pipeline Role Account

Hi Reberto,

For the survival analysis, we divide samples by each gene expression data. Then we use Surv(), coxph() to calculate the logrank P value. At last, we use p.adjust() to get Q value.

Here is our script for this survival anlaysis:

cox.simple=function(vv, st, se) {

vv=as.numeric(vv)

ind=complete.cases(vv, st, se)

vv=vv[ind]

if (length(vv)<5) return(rep(NA, 5))

#first, check whether or not the data quantile is good to try survival curve.

brks0=quantile(vv, c(0, 0.25, 0.50, 0.75, 1))

brks0 <- unique(brks0)

brks0b=round(brks0, 2)

vv2=as.numeric(cut(vv, brks0, include.lowest=T))

# the number of intervals of quantile is the number of clusters to try survival curve.

n.quantile.interval = length(table(vv2))

#print("trying cox.simple...check. how many intervals of data quantile for survival curve?")

#print(n.quantile.interval)

if(n.quantile.interval==1){

res=NA

}else{

ss=Surv(st[ind], se[ind])

# Try coxph to get p.val using qantile interval categories of the data

fit = coxph(ss~factor(vv2))

fit2=summary(fit)

logrank_P=as.vector(fit2$sctest[3]) #c('logrank_P')

res= logrank_P

}

return(res)

}

Best,

Hailei

Hello Robert,

Roberto D'Ambrosio

unread,

Sep 7, 2016, 10:42:33 AM9/7/16

to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account

Dear Hailei,

thanks for the explanation and the code. That really helped.

Unfortunately I'm still having problems in achieving the same results obtained in the report.
I downloaded the analysis results and I was trying to replicate the results of the gene PPFIA4|8497 (see attached pdf V1-1.ex.pdf).
The quantile intervals are -0.602, 2.972, 4.049, 4.927, 7.946.

I generate a script that read the gene expression from the file COAD.uncv2.mRNAseq_RSEM_normalized_log2.txt
using only the primary tumor samples. The clinical data are extracted from COAD.clin.merged.txt
Using my script (attached main_FH.R) I couldn't replicate the same values. The intervals that I obtained are:
[1] -0.6020875 3.0205699 4.0473938 4.9207626 7.9464507

The KM curves relative to the data are in the file MyPlotPPFIA4|8497.pdf.
As you can see these curves are different from those propose in your analysis.

In my opinion this mismatch is due to some mistakes on my side regarding the definition of the events and the survival time.
In order to define these survival function I fallowed the description in the section "Selected clinical features" of the report:
http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/COAD-TP/Correlate_Clinical_vs_mRNAseq/nozzle.html

From the description in:
https://confluence.broadinstitute.org/download/attachments/29790363/selection_union.87.txt?version=1&modificationDate=1414611230000

I derived the definition of:
vital_status
days_to_death
days_to_last_followup

Continuing fallowing the description in the section Selected clinical features I combined the temporal information (days_to_death,days_to_last_followup) in a single variable days_to_death_or_fup.
The negatives values are removed (only one case). There is a missing value but is ignored since will be removed by the function complete.cases in your script.

I'm wondering if you spot any mistakes in my procedure.
I will really appreciate any help on this topic .

Best regards,

Roberto D'ambrosio

main_FH.R

MyPlotPPFIA4|8497.pdf

V1-1.ex.pdf

Roberto D'Ambrosio

unread,

Sep 23, 2016, 9:37:15 AM9/23/16

to Hailei Zhang, Michael Noble, GDAC Pipeline Role Account

Dear Hailei and Michael,

I have partially solved the issues that I described in my previous email.

My survival time and vital_status were incomplete.
In the definition of the clinical information given in:
https://confluence.broadinstitute.org/download/attachments/29790363/selection_union.87.txt?version=1&modificationDate=1414611230000
the days_to_last_followup, days_to_last_death variables are builded using only the quantities

"patient.days_to_last_followup" / "patient.follow_ups.follow_up.days_to_last_followup" and
"patient.days_to_death" / "patient.follow_ups.follow_up.days_to_death" respectively.

But in order to obtain the same plots available in the report results (as an example the attached files V1-1, V1-2, V1-3) it is necessary to include also:

"patient.follow_ups.follow_up-2.days_to_last_followup", "patient.follow_ups.follow_up-3.days_to_last_followup", "patient.follow_ups.follow_up-4.days_to_last_followup"

and
"patient.follow_ups.follow_up-2.days_to_death" ,"patient.follow_ups.follow_up-3.days_to_death","patient.follow_ups.follow_up-4.days_to_death".

A similar approach is necessary for the vital_status variable.

Considering these clinical information in time and vital_status variables, i'm able to replicate yours Kaplan Meier (KM) curves: (attached files My_PPFIA4|8497, My_LBX2|85474, My_ENO3|2027)

As you can see in the attached files, even obtaining curves almost identical, the quantile groups and the Hazard ratio are slightly different.
Since
- the KM curves are the same (time and vital_status definition is ok),

- the code that I'm using to compute the quantile and the coxph is the one that you provided to me,

I believe that there is a mismatch between the expression data that I'm using and the one used to produce the results.

It is possible that there is a difference between the data in the archive "gdac.broadinstitute.org_COAD.mRNAseq_Preprocess.Level_3.2016012800.0.0
(file: COAD.uncv2.mRNAseq_RSEM_normalized_log2, maintaining only PT) and the version of the data used to generate the report?

Best Regards,

Roberto D'Ambrosio

--

Roberto D'Ambrosio, Post-doc

Department of Computing Science and Engineering (INGI)
ICTEAM Institute - Polytechnic School of Louvain
Université catholique de Louvain
Place Sainte Barbe 2 - Louvain-la-Neuve - Belgium

Tel: +32 470 56 22 47

My_LBX2_85474.pdf

My_ENO3_2027.pdf

V1-3.ex.pdf

V1-1.ex.pdf

My_PPFIA4_8497.pdf

V1-2.ex.pdf

Reply all

Reply to author

Forward