Problem of cufflinks working with STAR alignment bam file

Leon Liu

unread,

Jan 16, 2014, 10:31:07 PM1/16/14

to rna-...@googlegroups.com

Hi,

I was trying to analysis some long RNA-seq from ENCODE/Cold Spring Harbor Lab, wgEncodeCshlLongRnaSeqA549CellLongnonpolyaAlnRep1.bam, which has already been mapped against hg19 using Spliced Transcript Alignment and Reconstruction (STAR). I wanted to use cufflink to calculate the transcription level, the command I used was

cufflinks -o cufflink_out/ -g hg19genes.gtf -b hg19.fa -u -N wgEncodeCshlLongRnaSeqA549CellLongnonpolyaAlnRep1.bam

But it didn't work. The error was:

BAM record error: found spliced alignment without XS attribute

Is this problem due to STAR alignment ?

I also tried to submit into galaxy website to do the cufflinks analysis. But I also didn't get the results.

Can anyone have any suggestion? I really very appreciate it! Thank you so much.

Best

Leon

Alexander Dobin

unread,

Jan 18, 2014, 2:27:48 PM1/18/14

to rna-...@googlegroups.com

Hi Leon,

the ENCODE RNA-seq alignments on the UCSC porta had been generated by a 3-year old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks).

To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here: ftp://ftp2.cshl.edu/gingeraslab/tracks/Alex/Important/ENCODE2/BAM/

This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:

http://genome.ucsc.edu/ENCODE/

To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

The files are directly compatible with Cufflinks.

CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand

Caltech and HAIB data are unstranded and can be run with default --library-type.

STAR version: STAR_2.3.1u (2013/11/24)

Genome: hg19 + phiX + NIST ERCC spike-ins

Annotations: Gencode18

Cheers

Alex

Leon Liu

unread,

Jan 28, 2014, 10:48:31 AM1/28/14

to rna-...@googlegroups.com

Hi Dobin,

Thank you so much. But when I tried to analysis expression for A549 cell line longnonPolyA RNA-seq data (wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam), using software cufflinks, there is no results for this data (both Rep1 and Rep2 bam file). The command that I used for cufflinks was :

cufflinks -o cufflinks_rep1_output/ -g hg19KnownGene.gtf -b hg19.fa -u -N --library-type fr-firststrand wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam

The annotation file I wanted to use was the hg19 known gene annotation.

But when I applied the same the command for GM12878 cell line, it worked. The results from cufflinks looks like reasonable. So is there some problem for the

BAM file of A549 posted on line, or is there something wrong when I used clufflinks to calculate expression level ?

Thank you so much for your reply

Best

Leon

Leon Liu

unread,

Jan 28, 2014, 11:23:51 AM1/28/14

to rna-...@googlegroups.com

Hi Dobin,

Sorry again. It is me. Actually the fpkm.tracking file given from cufflinks looks like this:

tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status

CUFF.1 - - CUFF.1 - - chr1:11873-14409 - - 0 1.79769e+308 0 OK

CUFF.2 - - CUFF.2 - - chr1:14361-29961 - - 11.5278 1.79769e+308 0 OK

uc001aak.3 - - uc001aak.3 - - chr1:34610-36081 - - 0.0162641 1.79769e+308 0 OK

CUFF.3 - - CUFF.3 - - chr1:69090-70008 - - 0 1.79769e+308 0 OK

I was wondering the FPKM_conf_lo is as high as powers of 308. It looks weird for me. The same for Caltech data. Is this fine, i.e. The FPKM value with such high FPKM_conf_lo is still reliable?

Best

Leon

On Saturday, January 18, 2014 2:27:48 PM UTC-5, Alexander Dobin wrote:

Alexander Dobin

unread,

Jan 31, 2014, 1:40:19 PM1/31/14

to rna-...@googlegroups.com

Hi Leon,

these FPKM values are certainly wrong, something went wrong with the Cufflinks run.

I am trying to run this .bam file through Cufflinks with default parameters. In my case it got stuck in the first pass - "Inspecting reads and determining fragment length distribution" at a locus on chr21.

Note that this is A- RNA-seq data, which contains a lot of retained introns, and might not be suitable for Cufflinks at all.

Cheers

Alex

Alexander Dobin

unread,

Feb 4, 2014, 5:06:31 PM2/4/14

to rna-...@googlegroups.com

Hi Leon,

I was able to run Cufflinks on this file in the following way:

1. Select only main chromosomes from the .bam file:

$ samtools view -bh wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM > chrOnly.bam

2. Run cufflinks with default parameters:

/path/to/cufflinks-2.1.1.Linux_x86_64/cufflinks -p6 --library-type fr-firststrand chrOnly.bam

This produced the fpkm.tracking files with reasonable FPKM values, so the .bam file seems to be OK.

Please try this procedure, an if it works fine, you can try again the non-default cufflinks parameters.