STAR and HTSeq-count

Nicolas Robine

Dec 10, 2012, 11:38:01 AM
to rna-...@googlegroups.com
Hi Alex, hi all,

I am trying to quantify intervals with htseq-count from a STAR alignment, and I got this error:

Error occured in line 29 of file Aligned.out.sam.
Error: ("SAM optional field with illegal type letter ':'", 'line 29 of file Aligned.out.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1174]

The first alignment lines of the SAM file are:

HWI-ST1333:76:D1HURACXX:2:1103:14896:58889      77      *       0       0       *       *       0       0       CACCGTTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCG      CCCFFFFFHHHHHJJJJJJIJJJJJGIJGHHIIJJJDIJJJJIIGIIJII       NH:i:0  HI:i:0  AS:i:49 nM:i:0  uT:A:1
HWI-ST1333:76:D1HURACXX:2:1103:14896:58889      141     *       0       0       *       *       0       0       CCTAAAGTGGGGCAAGGAGAGGGCTGGGCGCGGTGGCTCACGCCTGTAAT      BCCFFFFBHHHHHJJJJJIIIJJJIJJJJJIII@EHIGIIIJHHHHEDFF       NH:i:0  HI:i:0  AS:i:49 nM:i:0  uT:A:1
HWI-ST1333:76:D1HURACXX:2:1103:14754:58981      163     chr10   45935977        255     50M     =       45938492        2565    CACGTCCACCAGACCATCACCCACCTTCTGCGAACACATCTGGTGTCTGA      CCCFFFFFHHHHHJJIJJJIJJJJJIJJJJJIJIE8009?BGGCHHHJJJ       NH:i:1  HI:i:1  AS:i:97 nM:i:0  jM:B:c,-1       jI:B:i,-1

The first two lines are fine, but the third one triggers the error. Apparently the optional fields (jM:B:c,-1 and jI:B:i,-1) are the problem.

I am not sure if it's a STAR problem, a SAM problem or an HTSeq problem, but figured I could ask here first.
What I see from the SAM specifications is the following:

For an integer or numeric array (type `B'), the first letter indicates the type of numbers in the following
comma separated array. The letter can be one of `cCsSiIf', corresponding to int8_t (signed 8-bit
integer), uint8_t (unsigned 8-bit integer), int16_t, uint16_t, int32_t, uint32_t and float, respectively.
During import/export, the element type may be changed if the new type is also compatible
with the array.
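
To make the quoted rule concrete, here is a minimal Python sketch of how a type-B optional field could be split into its tag, element type, and values. The function name `parse_b_field` and the `B_SUBTYPES` table are illustrative assumptions, not part of HTSeq, samtools, or STAR.

```python
# Element-type letters allowed after "B:", as listed in the SAM spec excerpt above.
B_SUBTYPES = {
    "c": "int8", "C": "uint8", "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32", "f": "float",
}

def parse_b_field(field):
    """Split a TAG:B:<subtype>,v1,v2,... field into (tag, element type, values).

    Hypothetical helper for illustration only.
    """
    tag, type_letter, payload = field.split(":", 2)
    if type_letter != "B":
        raise ValueError(f"not an array field: {field!r}")
    # The first comma-separated item is the element-type letter, the rest are values.
    subtype, *raw = payload.split(",")
    if subtype not in B_SUBTYPES:
        raise ValueError(f"illegal array subtype {subtype!r} in {field!r}")
    convert = float if subtype == "f" else int
    return tag, B_SUBTYPES[subtype], [convert(v) for v in raw]

print(parse_b_field("jM:B:c,-1"))
print(parse_b_field("jI:B:i,-1"))
```

Both fields from the third alignment line parse cleanly under this reading, which suggests they are well-formed according to the specification text.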

It seems that only `cCsSiIf' are valid first letters for a type B optional field. What do they represent, exactly?
I didn't find any mention of this problem anywhere else, and would like to find a solution.
Let me know if you need more information about this problem.

Nicolas (another one...)
--
Nicolas Robine, PhD
Bioinformatics Scientist, NYGC
nro...@nygenome.org 

Alexander Dobin

Dec 10, 2012, 12:47:43 PM
to rna-...@googlegroups.com
Hi Nicolas,

I believe this is a problem with HTSeq not recognizing the latest SAM specification, which introduced the "B" type for SAM attributes.
Note, that you would need the latest version of samtools, 0.1.18, for dealing with these attributes. The latest version of Picard also validates these files without
Cufflinks has the same problem with these attributes. 
If you are planning to use the same files with HTSeq or Cufflinks,
do *not* use the option --outSAMattributes All.
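
If the alignments have already been produced with --outSAMattributes All, one possible workaround (not suggested in this thread, so treat it as an assumption) is to drop the type-B optional fields from each SAM record before passing the file to a tool that cannot parse them. A minimal Python sketch follows; `strip_array_fields` is a hypothetical helper name.

```python
def strip_array_fields(line):
    """Remove type-B optional fields (e.g. jM:B:..., jI:B:...) from one SAM line.

    Hypothetical helper: header lines (starting with '@') pass through unchanged.
    """
    if line.startswith("@"):
        return line
    cols = line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    kept = cols[:11] + [f for f in cols[11:] if f.split(":")[1:2] != ["B"]]
    return "\t".join(kept) + "\n"

# Shortened example record (sequence and qualities abbreviated for readability).
record = ("read1\t163\tchr10\t45935977\t255\t50M\t=\t45938492\t2565\t"
          "ACGT\tIIII\tNH:i:1\tHI:i:1\tAS:i:97\tnM:i:0\tjM:B:c,-1\tjI:B:i,-1\n")
print(strip_array_fields(record), end="")
```

Applying this to every line of Aligned.out.sam would leave the NH, HI, AS and nM attributes intact and remove only the array attributes.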

Note that these extra attributes are not required by Cufflinks or HTSeq. They may be convenient, but, in principle, this information can be extracted from the SAM alignments and the genome sequence:
jM:B:c,M1,M2,… Intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
jI:B:I,Start1,End1,Start2,End2,… Start and End of introns for all junctions (1-based)
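
As an illustration of the two attributes described above, the following Python sketch decodes a jM motif value and recomputes the intron coordinates from POS and CIGAR, which is the information jI carries. The names `decode_motif` and `introns_from_cigar` are hypothetical, and treating -1 as "no junction" is an assumption based on the 50M alignment quoted earlier in this thread.

```python
import re

# Motif codes as listed above for the jM attribute.
MOTIFS = {0: "non-canonical", 1: "GT/AG", 2: "CT/AC", 3: "GC/AG",
          4: "CT/GC", 5: "AT/AC", 6: "GT/AT"}

def decode_motif(code):
    """Return (motif, annotated) for one jM value.

    Assumption: a negative value (-1) means the read has no junction.
    """
    if code < 0:
        return None, False
    annotated = code >= 20          # 20 is added for annotated junctions
    return MOTIFS[code - 20 if annotated else code], annotated

def introns_from_cigar(pos, cigar):
    """1-based (start, end) of every N operation, given the 1-based POS."""
    introns = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "N":
            introns.append((ref, ref + n - 1))
        if op in "MDN=X":           # operations that consume the reference
            ref += n
    return introns

print(decode_motif(21))
print(introns_from_cigar(100, "25M1000N25M"))
print(introns_from_cigar(45935977, "50M"))
```

Recovering the motif itself, rather than just decoding the code, would additionally require the genome sequence at the intron boundaries, which this sketch does not attempt.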

I will try to contact the Cufflinks and HTSeq authors to request that they comply with the latest SAM specification.

Cheers
Alex

Ruben Bautista

Apr 17, 2014, 9:42:50 AM
to rna-...@googlegroups.com
Hi Alex,

I faced this same problem recently and contacted one of the HTSeq authors to ask him if something was amiss, either with my htseq-count command or STAR's SAM output.

That was before I found this post, which confirms my suspicion that the problem lies in htseq-count rather than STAR. For the record, the latest version of HTSeq (0.6.1) still fails on these fields.

Thankfully, based on what you say about Cufflinks and HTSeq not needing these fields, I have relaunched my alignments without the option --outSAMattributes All. Hopefully that will solve the problem: I had added this option only for my 2-pass alignments, and the first-pass SAM files, which did not contain the optional fields jM:B:c,-1 and jI:B:i,-1, ran perfectly fine with htseq-count.

Thanks,

Ruben