TCGA:
From what I can tell, the “data_RNA_Seq_v2_expression_median.txt” file contains the RSEM values, described here:
https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
The meta file says:
profile_description: Expression levels for 20532 genes in 550 prad cases (RNA Seq V2 RSEM).
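A minimal sketch of how the unit of each file could be checked from its meta file, assuming the meta file is plain `key: value` lines like the one quoted above. `parse_meta` is a hypothetical helper written for illustration, not part of any cBioPortal tool:

```python
def parse_meta(text):
    """Parse 'key: value' lines (as in the quoted meta file) into a dict."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without a key separator.
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        meta[key.strip()] = value.strip()
    return meta


# The single line quoted from the TCGA meta file above.
example = (
    "profile_description: Expression levels for 20532 genes "
    "in 550 prad cases (RNA Seq V2 RSEM).\n"
)
meta = parse_meta(example)
print(meta["profile_description"])
```

Reading `profile_description` (or `profile_name`) this way for each study makes it easy to see which files are RSEM and which are RPKM before trying to compare them.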
Something to keep in mind: you can alternatively get different versions of the TCGA data at these locations:
https://portal.gdc.cancer.gov/ to download the STAR 2-pass HTSeq-counts or FPKM data.
http://gdac.broadinstitute.org/ to download RNAseq-v2 RSEM (I believe this should be similar to what is at cBioPortal).
Robinson:
They have two versions of RNA expression based on two different types of libraries. One is polyA, where the library method includes a polyA-selection step and generates a sequencing library from the mRNA. The other library instead “captures” the coding genes from the total RNA using exonic probes, which is a different way of removing ribosomal RNA but might end up including unprocessed/unspliced RNA.
I typically use the polyA library data, but you could compare and decide which one you think is better. They describe this in the extended version of the article:
http://www.cell.com/cell/pdfExtended/S0092-8674(15)00548-6
data_RNA_Seq_expression_median.txt
profile_name: mRNA expression / polyA (RNA Seq RPKM)
data_RNA_Seq_expression_capture.txt
profile_description: mRNA expression from capture (RNA Seq RPKM)
Trento/Cornell:
data_expression_median.txt
profile_description: Expression levels.
I believe these are FPKMs, pipeline described here:
https://www.nature.com/articles/nm.4045
In your opinion, would I be able to compare these datasets like this, using FPKM data per gene, or should I download the BAM or FASTQ files somewhere (it would be great if you could indicate a place where I could download them in bulk) and analyse everything all over again?
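To make the "FPKM data per gene" option concrete, here is a rough sketch of one way two FPKM/RPKM tables could be put on a common per-sample scale before comparing genes. It assumes each file is tab-separated with a `Hugo_Symbol` column, an optional `Entrez_Gene_Id` column, and one column per sample; the helper names and the toy values are made up for illustration. It uses the standard rescaling TPM = FPKM / sum(FPKM) × 10^6, applied here over the shared genes only. It does not handle the TCGA RSEM values, which are a different unit, and it does not remove library or pipeline differences (e.g. polyA vs capture).

```python
import csv
import io

ID_COLUMNS = ("Hugo_Symbol", "Entrez_Gene_Id")  # assumed non-sample columns


def read_expression(handle):
    """Read a gene-by-sample table into {sample: {gene: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ID_COLUMNS]
    table = {s: {} for s in samples}
    for row in reader:
        gene = row["Hugo_Symbol"]
        for s in samples:
            try:
                table[s][gene] = float(row[s])
            except (TypeError, ValueError):
                continue  # skip NA or empty cells
    return table


def shared_genes(*tables):
    """Genes that have a value in every sample of every table."""
    gene_sets = [set(vals) for t in tables for vals in t.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def rescale_to_tpm(table, genes):
    """Rescale each sample so the chosen genes sum to one million."""
    out = {}
    for sample, vals in table.items():
        total = sum(vals[g] for g in genes)
        out[sample] = {
            g: (vals[g] / total * 1e6 if total else 0.0) for g in genes
        }
    return out


# Toy tables with made-up values, standing in for two downloaded files.
toy_a = "Hugo_Symbol\tEntrez_Gene_Id\tS1\nGENE_A\t1\t30.0\nGENE_B\t2\t60.0\nGENE_C\t3\t10.0\n"
toy_b = "Hugo_Symbol\tS2\nGENE_A\t5.0\nGENE_B\t15.0\nGENE_D\t80.0\n"

table_a = read_expression(io.StringIO(toy_a))
table_b = read_expression(io.StringIO(toy_b))
genes = shared_genes(table_a, table_b)
tpm_a = rescale_to_tpm(table_a, genes)
tpm_b = rescale_to_tpm(table_b, genes)
print(sorted(genes), tpm_a["S1"], tpm_b["S2"])
```

Restricting to shared genes before rescaling is a design choice in this sketch, so that every sample sums to the same total over the same gene set; whether that is enough for a fair comparison across these cohorts is exactly the open question.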