Cuffdiff taking too long - how can I diagnose why?

Alexander Predeus

unread,

Jun 15, 2014, 2:18:59 AM6/15/14

to tuxedo-to...@googlegroups.com

Hello all,

I'm running cuffdiff v. 2.2.1 with the following options:

cuffdiff --no-update-check -q -o cuff_out -p 8 --library-type fr-firststrand -b $FA -u -L $COND1,$COND2 $GTF $BAMS1 $BAMS2

there are 2 bam files per each condition, about ~25 M paired-end reads each.

FA is a reference mm10 sequence.

GTF is Gencode vM2 GTF file.

I run cuffdiff on a whole node with 48 Gb memory and 8 CPUs.

And it is not done in 24 hours.

Is is expected behaviour? I mean differential expression analysis probably should not take this long, right? Please let me know what should I look into to take care of this.

All the best,

-- Alex

Alexander Predeus

unread,

Jun 15, 2014, 2:10:45 PM6/15/14

to tuxedo-to...@googlegroups.com

Ok, so it turns out that it's the same problem that was brought up here by several people: Cuffdiff/Cufflinks is choking on "gene" records from Gencode.

When I removed those fields, it only took 2-3 hours on 8 cpus.

If anybody with some knowledge of suite's code can comment as to why is this happening? Because many believe that Gencode is the best reference annotation out there, and it would be nice to be use it without any doubts.

At least if it's ok to just filter out those lines? It seems like the cufflinks/cuffdiff have no problem extracting actual gene info from other records of the GTF file.

Jason Williams

unread,

Jul 3, 2014, 8:35:03 PM7/3/14

to tuxedo-to...@googlegroups.com

I am also noticing an unreasonable runtime for cufflinks;

I have jobs running on two machines (8 CPU, 16GB RAM and 4 CPU, 32GB RAM) with appropriate -p options.

My command is:
cufflinks -o ./cufflinks_cntrl_rep1 -p 4 -g genes.gtf -b genome.fa --multi-read-correct --library-type fr-firststrand --min-intron-length 5 ./tophat_out_cntrl_rep1/accepted_hits.bam

After about 28 hours I get:
[08:19:16] Loading reference annotation.
[08:19:26] Inspecting reads and determining fragment length distribution.
> Processed 392247 loci.                       [*************************] 100%
> Map Properties:
>    Normalized Map Mass: 50985031.00
>    Raw Map Mass: 50985031.00
>    Number of Multi-Reads: 30756077 (with 123420955 total hits)
>    Fragment Length Distribution: Empirical (learned)
>                  Estimated Mean: 145.06
>               Estimated Std Dev: 56.93
[12:12:46] Assembling transcripts and initializing abundances for multi-read correction.
> Processing Locus 10:99998362-100010333       [***                      ] 14%

Some stats on the Tophat results from align_summary
Left reads:
          Input     : 55451716
           Mapped   : 53894442 (97.2% of input)
            of these: 31634855 (58.7%) have multiple alignments (471340 have >20)
Right reads:
          Input     : 55451716
           Mapped   : 53340860 (96.2% of input)
            of these: 31355839 (58.8%) have multiple alignments (471751 have >20)
96.7% overall read mapping rate.

Aligned pairs: 52644373
     of these: 30992723 (58.9%) have multiple alignments
                 3768739 ( 7.2%) are discordant alignments
88.1% concordant pair alignment rate.

I am using the genome and annotations from the Illumina iGenomes for Maize.

I don't know if this is related but hope someone can give some ideas!

Alexander Predeus

unread,

Jul 7, 2014, 11:32:35 AM7/7/14

to tuxedo-to...@googlegroups.com

Hi Jason,

First of all, you are using -g option which not only quantifies the reads but also assembles the transcriptome using the GTF file as a guide. That is supposed to take quite a long time. If you want it to just quantify known isoforms, use -G.

Also I'd check if your annotation has "gene_id" and "transcription_id" fields set correctly - Cufflinks/Cuffdiff was choking on "gene" fields of Gencode GTFs because they had those fields set to the same value, for whatever reason (I'm guessing for visualization).

Also using "-q --no-update-check" options makes logs a LOT smaller and more readable, and saves some time (I've had these logs take up Gbs of space each).

cheers

-- Alex Predeus

Shangzhong Li

unread,

Jul 11, 2014, 1:57:59 PM7/11/14

to tuxedo-to...@googlegroups.com

Hi Alexander,

I used NCBI mouse annotation files. After removing the 'gene' record, cuffdiff still takes a long time. I compared the 3rd column of NCBI annotation and gencode annotation, I found that in NCBI there are many other features such as 'Region','cDNA_match','D_loop', which doesn't exist in gencode annotation. I wonder do you know anything about this? Thank you very much.

Alexander Predeus

unread,

Jul 13, 2014, 12:03:59 PM7/13/14

to tuxedo-to...@googlegroups.com

Hello Shangzhong,

I'm not sure about the format of NCBI files, but there are two things I've learnt about how cufflinks handles annotations:

1. You can take only fields labelled "exon" in column 3 of the GTF file

2. You should make sure that gene_id and transcript_id are set correctly for each exon, and that they cover all of the features described in the annotation.

E.g. you can parse the file for "exon" only entries, then sort and find the number of unique gene_ids and transcript_ids

If that matches the number you would expect from the annotation (e.g. from counting unique "gene" and "transcript" entries), then you are good to go.

Many other features (CDS. UTR, start/stop codon, etc) are irrelevant for expression quantification, so there's no need to worry about them.

I hope this helps

cheers

-- Alex

Ephraim Trakhtenberg

unread,

Aug 12, 2014, 3:43:10 PM8/12/14

to tuxedo-to...@googlegroups.com

Dear Alex,

I understand that CD, UTR, start/stop codon lines could be deleted from the Gencode gtf, as they will not affect expression quantification.

The Selenocysteine lines I assume also could be deleted, as there is no differences on a transcript level whether or not Selenocysteine is translated from it?

Is there any tophat/cufflinks output file that might be effected by deleting these lines? Any reason to think that if I use STAR instead of Tophat these omissions might affect some outputs?

What command would you recommend for deleting rows from the gtf? I am still new to Linux and so far have been using excel for these purposes...

Thank you,

Ephraim

Alexander Predeus

unread,

Aug 12, 2014, 6:20:14 PM8/12/14

to tuxedo-to...@googlegroups.com

Hello Ephraim,

basically you can only take the lines that have "exon" in them. They basically consist of CDS+UTR.

> The Selenocysteine lines I assume also could be deleted, as there is no differences on a transcript level whether or not Selenocysteine is translated from it?

if there are any exons in those genes, they will be listed separately as "exon" as well.

> Any reason to think that if I use STAR instead of Tophat these omissions might affect some outputs?

STAR automatically only uses "exon" field only (to generate the genome reference). I mean the results of tophat and STAR can be different, because they use different algorithms, but not because of the different GTF parsing.

> What command would you recommend for deleting rows from the gtf?

Well I would do something like this:

grep -P "\texon\t" name.gtf > name.exon.gtf

and then use the newly produced file.

Cheers

-- Alex

Ephraim Trakhtenberg

unread,

Aug 12, 2014, 9:38:50 PM8/12/14

to tuxedo-to...@googlegroups.com

Alex, thank you! I just posted a different Gencode related question (link below), please take a look.

https://groups.google.com/forum/#!topic/tuxedo-tools-users/0kydiKjBMbM

Best,

Ephraim

Reply all

Reply to author

Forward