Problem of cufflinks working with STAR alignment bam file

1,165 views
Skip to first unread message

Leon Liu

unread,
Jan 16, 2014, 10:31:07 PM1/16/14
to rna-...@googlegroups.com
Hi,

   I was trying to analysis some long RNA-seq from ENCODE/Cold Spring Harbor Lab, wgEncodeCshlLongRnaSeqA549CellLongnonpolyaAlnRep1.bam, which has already been mapped against hg19 using Spliced Transcript Alignment and Reconstruction (STAR). I wanted to use cufflink to calculate the transcription level, the command I used was 

   cufflinks -o cufflink_out/ -g hg19genes.gtf -b hg19.fa -u -N wgEncodeCshlLongRnaSeqA549CellLongnonpolyaAlnRep1.bam

But it didn't work. The error was:

   BAM record error: found spliced alignment without XS attribute 

   Is this problem due to STAR alignment ? 

   I also tried to submit into galaxy website to do the cufflinks analysis. But I also didn't get the results. 


   Can anyone have any suggestion? I really very appreciate it! Thank you so much. 

Best
Leon

 

Alexander Dobin

unread,
Jan 18, 2014, 2:27:48 PM1/18/14
to rna-...@googlegroups.com
Hi Leon,

the ENCODE RNA-seq alignments on the UCSC porta had been generated by a 3-year old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks). 

To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here: ftp://ftp2.cshl.edu/gingeraslab/tracks/Alex/Important/ENCODE2/BAM/

This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:

To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

The files are directly compatible with Cufflinks. 
CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand
Caltech and HAIB data are unstranded and can be run with default --library-type.

STAR version: STAR_2.3.1u (2013/11/24)
Genome: hg19 + phiX + NIST ERCC spike-ins
Annotations: Gencode18

Cheers
Alex

Leon Liu

unread,
Jan 28, 2014, 10:48:31 AM1/28/14
to rna-...@googlegroups.com
Hi Dobin,

     Thank you so much.  But when I tried to analysis expression for A549 cell line longnonPolyA RNA-seq data (wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam), using software cufflinks, there is no results for this data (both Rep1 and Rep2 bam file). The command that I used for cufflinks was :  

     cufflinks  -o  cufflinks_rep1_output/  -g  hg19KnownGene.gtf -b  hg19.fa -u -N --library-type fr-firststrand  wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam

The annotation file I wanted to use was the hg19 known gene annotation. 

     But when I applied the same the command for GM12878 cell line, it worked. The results from cufflinks looks like reasonable. So is there some problem for the
BAM file of A549 posted on line, or is there something wrong when I used clufflinks to calculate expression level ?
 
    Thank you so much for your reply

Best
Leon

Leon Liu

unread,
Jan 28, 2014, 11:23:51 AM1/28/14
to rna-...@googlegroups.com
Hi Dobin, 

    Sorry again. It is me. Actually the fpkm.tracking file given from cufflinks looks like this: 
tracking_id     class_code      nearest_ref_id  gene_id gene_short_name tss_id  locus   length  coverage        FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
CUFF.1  -       -       CUFF.1  -       -       chr1:11873-14409        -       -       0       1.79769e+308    0       OK
CUFF.2  -       -       CUFF.2  -       -       chr1:14361-29961        -       -       11.5278 1.79769e+308    0       OK
uc001aak.3      -       -       uc001aak.3      -       -       chr1:34610-36081        -       -       0.0162641       1.79769e+308    0       OK
CUFF.3  -       -       CUFF.3  -       -       chr1:69090-70008        -       -       0       1.79769e+308    0       OK

    I was wondering the FPKM_conf_lo is as high as powers of 308. It looks weird for me. The same for  Caltech data. Is this fine, i.e. The FPKM value with such high FPKM_conf_lo is still reliable? 

Best
Leon

On Saturday, January 18, 2014 2:27:48 PM UTC-5, Alexander Dobin wrote:

Alexander Dobin

unread,
Jan 31, 2014, 1:40:19 PM1/31/14
to rna-...@googlegroups.com
Hi Leon,

these FPKM values are certainly wrong, something went wrong with the Cufflinks run.

I am trying to run this .bam file through Cufflinks with default parameters. In my case it got stuck in the first pass - "Inspecting reads and determining fragment length distribution" at a locus on chr21.
Note that this is A- RNA-seq data, which contains a lot of retained introns, and might not be suitable for Cufflinks at all. 

Cheers
Alex

Alexander Dobin

unread,
Feb 4, 2014, 5:06:31 PM2/4/14
to rna-...@googlegroups.com
Hi Leon,

I was able to run Cufflinks on this file in the following way:
1. Select only main chromosomes from the .bam file:
$ samtools view -bh wgEncodeCshlLongRnaSeqA549CellLongnonpolyaRep1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM > chrOnly.bam

2. Run cufflinks with default parameters:
/path/to/cufflinks-2.1.1.Linux_x86_64/cufflinks -p6 --library-type fr-firststrand  chrOnly.bam

This produced the fpkm.tracking files with reasonable FPKM values, so the .bam file seems to be OK. 
Please try this procedure, an if it works fine, you can try again the non-default cufflinks parameters.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages