Soft clipping and Cufflinks

1,555 views
Skip to first unread message
Assigned to ado...@gmail.com by me

Luciano Cosme

unread,
Nov 25, 2014, 4:57:34 PM11/25/14
to rna-...@googlegroups.com
Hi,
   I followed your recommendations on the section 4.2.3 of STAR's manual to make the output compatible with Cufflinks (--outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical and I also used --outSAMtype BAM SortedByCoordinate). Then I run Cufflinks following this protocol http://cufflinks.cbcb.umd.edu/tutorial.html#discovery. I start on step 2. Now on step 3 I need to run cuffmerge. However, I get an error saying that the coordinates of some transcripts were wrongly assigned by Cufflinks. See error message below. I also saw this earlier post https://groups.google.com/forum/#!topic/rna-star/Ta1Z2u4bPfc however the awk command is not working for me. I have 65 samples and it would take forever to run TopHat. Besides, I optimized STAR's mapping parameters for my samples. Any suggestion would be appreciated.

Thank you.


Error message:


[14:51:50] Assembling transcripts and estimating abundances.
Processed 49568 loci.                      
[Tue Nov 25 14:52:58 2014] Comparing against reference file None
You are using Cufflinks v2.2.1, which is the most recent release.
Error (GFaSeqGet): end coordinate (326376) cannot be larger than sequence length 326363
Error (GFaSeqGet): end coordinate (326376) cannot be larger than sequence length 326363
Error (GFaSeqGet): end coordinate (280765) cannot be larger than sequence length 280754
Error (GFaSeqGet): end coordinate (280765) cannot be larger than sequence length 280754
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (262350) cannot be larger than sequence length 262345
Error (GFaSeqGet): end coordinate (266564) cannot be larger than sequence length 266549
Error (GFaSeqGet): end coordinate (259490) cannot be larger than sequence length 259488
Error (GFaSeqGet): end coordinate (259490) cannot be larger than sequence length 259488
Error (GFaSeqGet): end coordinate (259490) cannot be larger than sequence length 259488
Error (GFaSeqGet): end coordinate (2611283) cannot be larger than sequence length 2611281
Error (GFaSeqGet): end coordinate (2611283) cannot be larger than sequence length 2611281
Error (GFaSeqGet): end coordinate (186949) cannot be larger than sequence length 186940
Error (GFaSeqGet): end coordinate (173274) cannot be larger than sequence length 173270
Error (GFaSeqGet): end coordinate (173274) cannot be larger than sequence length 173270
Error (GFaSeqGet): end coordinate (173274) cannot be larger than sequence length 173270
Error (GFaSeqGet): end coordinate (141661) cannot be larger than sequence length 141655
Error (GFaSeqGet): end coordinate (141661) cannot be larger than sequence length 141655
Error (GFaSeqGet): end coordinate (141112) cannot be larger than sequence length 141098
Error (GFaSeqGet): end coordinate (141112) cannot be larger than sequence length 141098
Error (GFaSeqGet): end coordinate (141112) cannot be larger than sequence length 141098
Error (GFaSeqGet): end coordinate (118790) cannot be larger than sequence length 118789
Error (GFaSeqGet): end coordinate (110135) cannot be larger than sequence length 110123
Error (GFaSeqGet): end coordinate (110135) cannot be larger than sequence length 110123
Error (GFaSeqGet): end coordinate (110135) cannot be larger than sequence length 110123
Error (GFaSeqGet): end coordinate (110135) cannot be larger than sequence length 110123
Error (GFaSeqGet): end coordinate (110135) cannot be larger than sequence length 110123
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2333036) cannot be larger than sequence length 2333031
Error (GFaSeqGet): end coordinate (2163578) cannot be larger than sequence length 2163576
Error (GFaSeqGet): end coordinate (2163585) cannot be larger than sequence length 2163576
Error (GFaSeqGet): end coordinate (69689) cannot be larger than sequence length 69686
Error (GFaSeqGet): end coordinate (91548) cannot be larger than sequence length 91544
Error (GFaSeqGet): end coordinate (91548) cannot be larger than sequence length 91544
Error (GFaSeqGet): end coordinate (60081) cannot be larger than sequence length 60078
Error (GFaSeqGet): end coordinate (1976168) cannot be larger than sequence length 1976150
Error (GFaSeqGet): end coordinate (1976168) cannot be larger than sequence length 1976150
Error (GFaSeqGet): end coordinate (24034) cannot be larger than sequence length 24026
Error (GFaSeqGet): end coordinate (24034) cannot be larger than sequence length 24026
Error (GFaSeqGet): end coordinate (24034) cannot be larger than sequence length 24026
Error (GFaSeqGet): end coordinate (24034) cannot be larger than sequence length 24026
Error (GFaSeqGet): end coordinate (20914) cannot be larger than sequence length 20903
Error (GFaSeqGet): end coordinate (20914) cannot be larger than sequence length 20903
Error (GFaSeqGet): end coordinate (20914) cannot be larger than sequence length 20903
Error (GFaSeqGet): end coordinate (40791) cannot be larger than sequence length 40788
Error (GFaSeqGet): end coordinate (40791) cannot be larger than sequence length 40788
Error (GFaSeqGet): end coordinate (15781) cannot be larger than sequence length 15776
Error (GFaSeqGet): end coordinate (15000) cannot be larger than sequence length 14999
Error (GFaSeqGet): end coordinate (15000) cannot be larger than sequence length 14999
Error (GFaSeqGet): end coordinate (2048567) cannot be larger than sequence length 2048554
Error (GFaSeqGet): end coordinate (2048567) cannot be larger than sequence length 2048554
Error (GFaSeqGet): end coordinate (2048567) cannot be larger than sequence length 2048554
Error (GFaSeqGet): end coordinate (13598) cannot be larger than sequence length 13597
Error (GFaSeqGet): end coordinate (13598) cannot be larger than sequence length 13597
Error (GFaSeqGet): end coordinate (12176) cannot be larger than sequence length 12175
Error (GFaSeqGet): end coordinate (12176) cannot be larger than sequence length 12175
Error (GFaSeqGet): end coordinate (12147) cannot be larger than sequence length 12146
Error (GFaSeqGet): end coordinate (12147) cannot be larger than sequence length 12146
Error (GFaSeqGet): end coordinate (11894) cannot be larger than sequence length 11870
Error (GFaSeqGet): end coordinate (10112) cannot be larger than sequence length 10106
Error (GFaSeqGet): end coordinate (10112) cannot be larger than sequence length 10106
Error (GFaSeqGet): end coordinate (8662) cannot be larger than sequence length 8654
Error (GFaSeqGet): end coordinate (8357) cannot be larger than sequence length 8343
Error (GFaSeqGet): end coordinate (8357) cannot be larger than sequence length 8343
Error (GFaSeqGet): end coordinate (1444879) cannot be larger than sequence length 1444863
Error (GFaSeqGet): end coordinate (1444879) cannot be larger than sequence length 1444863
Error (GFaSeqGet): end coordinate (1444879) cannot be larger than sequence length 1444863
Error (GFaSeqGet): end coordinate (1389108) cannot be larger than sequence length 1389101
Error (GFaSeqGet): end coordinate (7272) cannot be larger than sequence length 7269
Error (GFaSeqGet): end coordinate (7272) cannot be larger than sequence length 7269
Error (GFaSeqGet): end coordinate (6812) cannot be larger than sequence length 6804
Error (GFaSeqGet): end coordinate (6812) cannot be larger than sequence length 6804
Error (GFaSeqGet): end coordinate (6750) cannot be larger than sequence length 6749
Error (GFaSeqGet): end coordinate (6750) cannot be larger than sequence length 6749
Error (GFaSeqGet): end coordinate (6732) cannot be larger than sequence length 6728
Error (GFaSeqGet): end coordinate (6732) cannot be larger than sequence length 6728
Error (GFaSeqGet): end coordinate (6450) cannot be larger than sequence length 6448
Error (GFaSeqGet): subsequence cannot be larger than 6339
Error getting subseq for CUFF.47628.1 (1..6348)!
    [FAILED]
Error: could not execute cuffcompare

Alexander Dobin

unread,
Nov 25, 2014, 5:44:19 PM11/25/14
to rna-...@googlegroups.com
Hi Luciano,

I have a patch for this issue which will filter out all reads soft-clipped past the end of the chromosome. I will include it in the tomorrow's release.

Cheers
Alex

Alexander Dobin

unread,
Nov 27, 2014, 11:25:29 AM11/27/14
to rna-...@googlegroups.com
Hi Luciano,

with --alignSoftClipAtReferenceEnds No , and let me know if it solves the Cuffmerge problem.

Cheers
Alex

Luciano Cosme

unread,
Dec 1, 2014, 5:20:40 PM12/1/14
to rna-...@googlegroups.com
Thank you. I will run it today.

Best,

Luciano

Ambily Sivadas

unread,
Dec 18, 2014, 4:02:52 AM12/18/14
to rna-...@googlegroups.com
Hi Alex,

I faced the same issue and tried the option you mentioned - But I still get the same error. The only way I get to run Cuffmerge successfully is by removing the reference genome fasta file option itself.

Regards,
Ambily

Alexander Dobin

unread,
Dec 18, 2014, 12:05:19 PM12/18/14
to rna-...@googlegroups.com
Hi Ambily,

Unfortunately, I cannor reproduce this problem on my test dataset, Cufflinks/merge run without any problems.
Please send me the Log.out file, the Cufflinks/merge commands that you used.
Also, I would need a minimal sample of your data  that causes the error. My guess is that these are reads that are mapped to small genomic scaffolds, you can guess which scaffolds by looking at the length
of the "sequence length xxx" in the error message like this:
Error (GFaSeqGet): end coordinate (326376) cannot be larger than sequence length 326363

Cheers
Alex

François Lefebvre

unread,
Dec 18, 2014, 7:23:45 PM12/18/14
to rna-...@googlegroups.com
Hi Alex, I am having the same problem. Unfortunately  

--alignSoftClipAtReferenceEnds No 

is still producing alignments with offending soft-clips. Read HWI-ST521:144:C1MF9ACXX:1:2107:2259:104234 seems like a good example.

Thanks
CSB10A_v1_contig_8921.sam.zip

Alexander Dobin

unread,
Dec 19, 2014, 4:19:12 PM12/19/14
to rna-...@googlegroups.com
Hi François,

thanks for the test example. Could you also post the Cufflinks/merge commands that produce the error with this file?
The read you referred to seems to be OK to me:
HWI-ST521:144:C1MF9ACXX:1:2107:2259:104234      99      CSB10A_v1_contig_8921   5883    255     95M     =       6498    642     ATTGGAGAAGAAAAGAGAGATTCAGTGGAAACCATATTTCCTCCCATATTGCTTGACATCACTTGGCAATCCCCATACATTTTTTTTCACTTTAC  @@??DDDD3ADHF>BGF9EAAEEHA<CFEFH;E;CFG@H9?F<B;BHDA?GHEHA@B8CDHHB===CCDCA9@;ACEDB@;;3;<>=>@CC>CCA NH:i:1   HI:i:1  AS:i:118        nM:i:1  RG:Z:HI.0659.001.Index_13.1
HWI-ST521:144:C1MF9ACXX:1:2107:2259:104234      147     CSB10A_v1_contig_8921   6498    255     37S27M  =       5883    -642    AAAACAAAAAGAGGAAGAAGAAGAAGAAGAAGCCAAGAAAGAAGAAGAAGAAGAAGAAGAAGCC B3FF??9D9?*:0*D:?9?*F<BFC4BC9C@HF<AE?@CGGHHEE@HF>HDBDF?C4BAFD?B;        NH:i:1  HI:i:1  AS:i:118        nM:i:1  RG:Z:HI.0659.001.Index_13.1

The soft clipping happens on the "left" of the second line, so it cannot go past the end of the contig. The 27M matched bases start at 6498, so they end at 6498+27-1=6524, while this contig length is 6526.
What were the error messages for Cufflinks referring to this contigs?

Cheers
Alex

François Lefebvre

unread,
Dec 19, 2014, 6:10:27 PM12/19/14
to rna-...@googlegroups.com
Hello Alex. the (path-shortened) command is

cufflinks -o 1 -p 24 -b genome.fa -u --library-type fr-firststrand merged.bam

and I have attached other files that may help: the tracks for the assembled transcript, the contig fasta, and the cufflinks output file.

Perhaps this is just cufflinks using POS and the cigar string including softclips to calculate the other end of the read? Then it would be a cufflinks GFaSeqGet problem. 6498+37+27 - 1  = 6561

Thanks again.
1.cufflinks.o134833
temp.gtf
temp.fa

Alexander Dobin

unread,
Dec 28, 2014, 10:38:38 PM12/28/14
to rna-...@googlegroups.com
Hi François,

thanks for the files, I will look at them when I am back in the office on Jan 6.
If Cufflinks adds the soft-clip to calculate the end of the read, this would be a very serious bug in Cufflinks - it would mean that all reads with soft-clipping are processed wrongly.
This cannot be fixed on STAR side, so  I would recommend using the --alignEndsType EndToEnd which prohibits soft-clipping.

Cheers
Alex

François Lefebvre

unread,
Jan 5, 2015, 9:27:58 AM1/5/15
to rna-...@googlegroups.com
Hello Alex, would you also recommend removing soft-clips as suggested in this post?

Alexander Dobin

unread,
Jan 7, 2015, 2:46:49 PM1/7/15
to rna-...@googlegroups.com
Hi François,

this script is supposed to "hard-clip" the read sequences that were soft-clipped by STAR, but I am not sure if this will solve the problem with Cufflinks. If it works for you, please let me know, since this might be the way to circumvent the problem.

Cheers
Alex

François Lefebvre

unread,
Jan 8, 2015, 1:56:04 PM1/8/15
to rna-...@googlegroups.com
Hello Alex, replacing hard-clipping the soft-clips solved the problem, although I am not sure this will not somehow violate an assumption for the cufflinks assembler.

The antisense transcripts mentioned off-list is a different, unrelated issue.

Thanks again

Alexander Dobin

unread,
Jan 8, 2015, 4:41:00 PM1/8/15
to rna-...@googlegroups.com
Hi François,

I guess hard-clipping could mess up the statistical model, if it assumes that all reads have to have the same length. Also, the insert size distribution for clipped reads is probably a bit different from the un-clipped.
I think the solution with --alignEndsType EndToEnd make STAR output closer to TopHat's, which Cufflinks expects to see - so this might be a better workaround.
My reservation about end-to-end alignments is that they tend to favor pseudogenes (alignments with mismatches rather then clipped).
I will try to reach Cole Trapnell about this issue.

Cheers
Alex

Alexander Dobin

unread,
Jan 13, 2015, 5:35:57 PM1/13/15
to
Hi François,

I have confirmed your suspicion that Cufflinks does not properly deal with soft-clipping. It's a serious bug that affects not only the bias correction, but also the boundaries of the transcripts.
From now on I will strongly recommend for alignments intended to be used with Cufflinks either
(i) using --alignEndsType EndToEnd 
(ii) or converting soft-clipping into hard clipping after mapping with this awk one-liner:
awk 'BEGIN {OFS="\t"} {if (substr($1,1,1)=="@" {print;next}; split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/); if (C[2]=="S") {$10=substr($10,L[1]+1); $11=substr($11,L[1]+1)}; if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1]; $10=substr($10,1,L1); $11=substr($11,1,L1); }; gsub(/[0-9]*S/,"",$6); print}' Aligned.out.sam > Aligned.noS.sam

Cheers
Alex

Alexander Dobin

unread,
Jan 14, 2015, 10:27:18 AM1/14/15
to rna-...@googlegroups.com
Per François' suggestion, the awk script is modified to avoid messing up SAM header lines:

awk 'BEGIN {OFS="\t"} {if (substr($1,1,1)=="@") {print;next}; split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/); if (C[2]=="S") {$10=substr($10,L[1]+1); $11=substr($11,L[1]+1)}; if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1]; $10=substr($10,1,L1); $11=substr($11,1,L1); }; gsub(/[0-9]*S/,"",$6); print}'  Aligned.out.sam > Aligned.noS.sam

On Tuesday, January 13, 2015 at 5:35:57 PM UTC-5, Alexander Dobin wrote:
Hi François,

I have confirmed your suspicion that Cufflinks does not properly deal with soft-clipping. It's a serious bug that affects not only the bias correction, but also the boundaries of the transcripts.
From now on I will strongly recommend for alignments intended to be used with Cufflinks either
(i) using --alignEndsType EndToEnd 
(ii) or converting soft-clipping into hard clipping after mapping with this awk one-liner:
awk 'BEGIN {OFS="\t"} {split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/); if (C[2]=="S") {$10=substr($10,L[1]+1); $11=substr($11,L[1]+1)}; if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1]; $10=substr($10,1,L1); $11=substr($11,1,L1); }; gsub(/[0-9]*S/,"",$6); print}' Aligned.out.sam > Aligned.noS.sam

Cheers
Alex

Robert King

unread,
Jan 29, 2015, 2:02:57 PM1/29/15
to rna-...@googlegroups.com
Thanks I've tried this and seems to solve the problem but was wondering if just clipping the ends of the reference overhangs usng CleanSam.jar from picard tools do this and may be more compatible with galaxy etc? I'm trialling it but wondering what people thought.
Best wishes
Rob

Robert King

unread,
Jan 29, 2015, 4:51:08 PM1/29/15
to rna-...@googlegroups.com
cufflinks doesn't give me the error using CleanSam.jar so if got star in galaxy and picard tools suggest trying that route.

Best wishes
Rob

Alexander Dobin

unread,
Jan 29, 2015, 5:25:17 PM1/29/15
to rna-...@googlegroups.com
Hi Rob,

even though Cufflinks did not throw an error after this clean-up, it still does misplaces the soft-clipped reads, which will result in inaccurate transcripts.
So I would strongly recommend either using  --alignEndsType EndToEnd at the mapping step, or the awk script.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages