compatibility with GATK

myou...@g.ucla.edu

May 2, 2014, 11:30:19 PM

to rna-...@googlegroups.com

The GATK group recommends using STAR for alignment prior to running the GATK variant caller on RNA-seq data (see their article "Calling variants in RNAseq"). This is all experimental stuff. I have a few suggestions, based on my experience so far with this pipeline, that would streamline the process. None of these are show-stoppers.

1. It is necessary to add read groups (RG header and tags) with picard tools. This step could be avoided if STAR had an option to include read groups in the original SAM (soon to be BAM, I hope) file.

2. It is then necessary to run GATK's SplitNCigarReads to split reads into exon segments and convert mapping quality scores from 255 to 60. To quote the GATK article:

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignMappingQuality read filter to reassign all MAPQs to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignMappingQuality read filter to the splitter command.

3. The SAM files produced by STAR fail strict validation with Picard tools' ValidateSamFile, generating this warning for each record:

NM tag (nucleotide differences) is missing

It's possible to ignore this warning, but it would be nice if STAR could produce the value. I see that the STAR SAMs have a nM tag. Is this intended to be NM, or something else? Lower case letters are reserved for local tags.
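The three points above can be checked on a SAM file before it goes to GATK. Below is a minimal sketch in plain Python, assuming tab-separated SAM text as input. The function name `check_sam_records` and the example records are made up for illustration; this is not part of STAR, Picard or GATK. It counts records that have no RG tag, no NM tag, or a MAPQ of 255.

```python
def check_sam_records(sam_lines):
    """Count alignment records that would trip the three issues listed above."""
    report = {"records": 0, "missing_RG": 0, "missing_NM": 0, "mapq_255": 0}
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        report["records"] += 1
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
        # Tag names are case-sensitive, so STAR's nM does not count as NM.
        tags = {field.split(":", 1)[0] for field in fields[11:]}
        if "RG" not in tags:
            report["missing_RG"] += 1
        if "NM" not in tags:
            report["missing_NM"] += 1
        if fields[4] == "255":  # column 5 is MAPQ
            report["mapq_255"] += 1
    return report


# Hypothetical records: r1 resembles the STAR output described above,
# r2 resembles what the downstream tools expect.
example = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tHI:i:1\tAS:i:3\tnM:i:0",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tNM:i:0\tRG:Z:xxx",
]
print(check_sam_records(example))
```

On a real file the same function can be fed an open file handle, since it only iterates over lines.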

Alexander Dobin

Jul 24, 2014, 10:31:34 AM

Most of these features are available in the latest patch:

--outSAMattrRGline

string: SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z".

xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.

--outSAMmapqUnique

int: 0 to 255: the MAPQ value for unique mappers - it's still not really meaningful, but can be set to any value for compatibility

--outSAMattributes

string: a string of desired SAM attributes, in the order desired for the output SAM

NH HI AS nM NM MD jM jI XS

NM and MD have the same meaning as in samtools.

nM is the number of mismatches per read pair
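As a rough illustration of how the three options fit together, here is a small Python sketch that assembles a STAR argument list. The helper name `star_gatk_args`, the paths and the read group values are hypothetical; the MAPQ of 60 is simply the value the GATK article quoted earlier reassigns to. Passing each read group field as its own list element plays the same role as double quoting on the command line: spaces inside a value stay inside that field.

```python
import shlex


def star_gatk_args(genome_dir, reads, rg_fields, mapq_unique=60):
    """Assemble a STAR argument list using the three options described above."""
    if not rg_fields or not rg_fields[0].startswith("ID:"):
        raise ValueError('the first read group field must start with "ID:"')
    if not 0 <= mapq_unique <= 255:
        raise ValueError("MAPQ for unique mappers must be between 0 and 255")
    args = ["STAR", "--genomeDir", genome_dir, "--readFilesIn", *reads]
    # One list element per RG field, so "DS:z z z" stays a single field.
    args += ["--outSAMattrRGline", *rg_fields]
    args += ["--outSAMmapqUnique", str(mapq_unique)]
    # Standard NM and MD are requested alongside STAR's own nM.
    args += ["--outSAMattributes", "NH", "HI", "AS", "nM", "NM", "MD"]
    return args


# Hypothetical paths and read group values, for illustration only.
args = star_gatk_args(
    "/path/to/genomeDir",
    ["sample_R1.fastq", "sample_R2.fastq"],
    ["ID:xxx", "CN:yy", "DS:z z z"],
)
print(shlex.join(args))  # shell-quoted form of the same command
```

The list can be handed to `subprocess.run(args)` as is, which avoids a second round of shell quoting.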

I will ask GATK team to update their recommendations.

Cheers

Alex

Jason Ross

Jul 22, 2014, 12:40:01 AM

to rna-...@googlegroups.com

Hi Alex,

I've been using STAR and it's great! However, I've been getting caught a little by the bleeding edge options, so I thought I'd start posting corrections/bug alerts.

The example usage of --outSAMattrRGline above needs to be corrected: "outSAMattrRG" should be "outSAMattrRGline". Also, is it correct to quote the DS tag in its entirety, i.e. "DS:z z z" as opposed to DS:"z z z"? Finally, the DS value is parsed incorrectly when a comma is present: the argument --outSAMattrRGline ID:xxx DS:"z1, z2 z3" raises the exception below:

EXITING because of FATAL INPUT ERROR: the first word of a line from --outSAMattrRGline=z2 z3 does not start with ID:xxx read group identifier

Cheers,

Jason.

On Wednesday, 7 May 2014 06:46:02 UTC+10, Alexander Dobin wrote:

Most of these features are available in the latest patch:

--outSAMattrRGline
string: SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRG ID:xxx CN:yy "DS:z z z".

xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.

--outSAMmapqUnique
int: 0 to 255: the MAPQ value for unique mappers - it's still not really meaningful, but can be set to any value for compatibility

--outSAMattributes
string: a string of desired SAM attributes, in the order desired for the output SAM
NH HI AS nM NM MD jM jI XS

NM and MD have the same meaning as in the samtools.
nM is the number of mismatches per read pair

I will ask GATK team to update their recommendations.

Cheers
Alex

Alexander Dobin

Jul 24, 2014, 6:09:17 PM

to rna-...@googlegroups.com

Hi Jason,

Thanks for reporting these problems. I have fixed the problem with commas in --outSAMattrRGline; you can now have them inside the tag values. Here is the new patch:

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARreleases/Patches/STAR_2.3.1z14.tgz

A comma may also have another meaning: if a comma is surrounded by spaces and is not within double quotes, it will separate the RG tags to be assigned to the reads from the different files in the comma-separated list of input files in --readFilesIn, e.g.

--readFilesIn Ra1,Rb1 Ra2,Rb2 --outSAMattrRGline ID:rg_a "DS:z1 , z2 z3" , ID:rg_b DS:"y1 , y2"

You can use either "DS:z z z" or DS:"z z z". The important rule is that unquoted white space will be treated as a separator between tags and converted to "\t" in the SAM header.
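To make the quoting and comma rules concrete, here is a toy re-implementation in Python. It is only a sketch of the rules as stated in this thread, not STAR's actual parser, and the function name `split_rg_lines` is made up. It works on the raw text as typed, before any shell has removed the quotes: unquoted white space separates tags, a lone unquoted comma starts the read group for the next input file, and the tags of each read group are joined with tabs.

```python
def split_rg_lines(text):
    """Split raw --outSAMattrRGline text into one tab-joined string per read group."""
    groups = [[]]
    token = []
    in_quotes = False
    quoted = False  # True once the current token has included a quoted part

    def flush():
        nonlocal token, quoted
        if token or quoted:
            word = "".join(token)
            if word == "," and not quoted:
                groups.append([])  # lone unquoted comma: next read group
            else:
                groups[-1].append(word)
        token, quoted = [], False

    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # quotes delimit, they are not kept
            quoted = True
        elif ch.isspace() and not in_quotes:
            flush()  # unquoted white space ends the current tag
        else:
            token.append(ch)
    flush()
    return ["\t".join(group) for group in groups]


# The example from the post above, as raw text.
raw = 'ID:rg_a "DS:z1 , z2 z3" , ID:rg_b DS:"y1 , y2"'
for rg in split_rg_lines(raw):
    print(repr(rg))
```

Under these rules the two quoting styles give the same result, because the quotes only protect the white space and commas inside them.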