StringTie, not cufflinks: --outSAMattributes XS without --outSAMstrandField intronMotif

1,440 views
Skip to first unread message

matt....@gladstone.ucsf.edu

unread,
Nov 2, 2016, 9:23:06 PM11/2/16
to rna-star
Hi Alex,
I have PE75 stranded reads and am trying to generate the XS attribute for StringTie without the --outSAMstrandField intronMotif option. I'm hoping this will force RNA strandedness by R1 orientation for all reads (spliced and unspliced). Will this work, or, if not, is there another way to accomplish this? I don't want to lose the benefit of creating a stranded library (also trying to use StringTie instead of Cufflinks).

Thank you!
Matt
 

matt....@gladstone.ucsf.edu

unread,
Nov 2, 2016, 9:24:07 PM11/2/16
to rna-star
To clarify I am using --outSAMattributes NH HI AS nM NM XS.
Thanks again!

Alexander Dobin

unread,
Nov 4, 2016, 4:00:05 PM11/4/16
to rna-star
Hi Matt,

the XS strand is only added to spliced alignments based on their intron motifs.
It does not take into account the strandedness information.
For stranded RNA-seq, Cufflinks can be run with the --library-type parmameter, while StringTie for some reason lacks this option, and requires XS to be supplied for each alignment. I have a simple script that can add XS flag to all alignments for stranded data:


It works on a SAM file:
$ awk -v strType=2 -f tagXSstrandedData.awk Aligned.out.sam > Aligned.XS.sam
For a BAM file:

samtools view -h Aligned.out.bam | awk -v strType=2 -f tagXSstrandedData.awk | samtools view -bS - > Aligned.XS.bam
strType defines the strandedness of the RNA-seq library prep, and specifies the read (1 or 2) which strand agrees with the RNA strand.
For the Illumina Tru-Seq dUTP protocol strType=2.
If the reads are unpaired, they are assumed to be “read 1”.

I have not tested this code with StringTie - please let me know whether it works for you, and I will add it to the official STAR release.

Cheers
Alex

Neelanjan Mukherjee

unread,
Nov 10, 2016, 3:02:57 PM11/10/16
to rna-star
Dear Matt and Alex,
This really solved by related problem and I think it would be great if it could be included in the official release. I have paired-end strand-specific data and wanted to use a 2 pass STAR strategy (rather than HISAT/2) as an input for stringtie. After running stringtie I had many single intron transcripts without any strand because stringtie was expecting the XS tag. I did use the "intronMotif" option, but the unspliced reads were not receiving XS tags leading to the problem in stringtie. I tried to resolve it by counting the reads on either strand of the "unstranded" gtf entries, but it was not always clear what to change the strand to. Anyway, I really appreciate the script and the ability to add the XS tags was quite helpful. 
Thanks very much.
Neel

Alexander Dobin

unread,
Nov 17, 2016, 12:51:35 PM11/17/16
to rna-star
Hi Neel,

thanks a lot for testing the script. I have included it in the STAR distribution package.
Please let me know if you see any issues.

Cheers
Alex

matt....@gladstone.ucsf.edu

unread,
Nov 17, 2016, 8:13:47 PM11/17/16
to rna-star
Thanks so much, Alex!
I'll report back once I get through the pipeline.
Best,
Matt


Dorit Hockman

unread,
Nov 22, 2016, 10:06:39 AM11/22/16
to rna-star
Dear Alex,

Will this script work on bam files that have already been mapped with STAR using the --outSAMstrandField intronMotif option? 

Thank you for help!

be well
Dorit

Alexander Dobin

unread,
Nov 22, 2016, 11:27:10 AM11/22/16
to rna-star
Hi Dorit,

it will not work, since it does not check for the XS tag, and will add 2nd XS tag for those alignments that already have.
I think the best way to strip the old XS tag, e.g.
$ samtools view -h Aligned.sortedByCoord.out.bam | sed 's/\tXS:A:[-+]//' | awk -f tagXSstrandedData.awk | samtools view -bS - > A.bam

Cheers
Alex

Charlotte Capitanchik

unread,
Jul 4, 2017, 11:11:32 PM7/4/17
to rna-star
Hi Alex,

I'm trying to use the awk script to add the XS attribute to stranded data for use with a downstream splicing analysis software, but I get this error:

9: (FILENAME=- FNR=2821) fatal: not enough arguments to satisfy format string
        `SRR594434.48536661     419     1       1210619 3       80M     =      1218933  8394    ACTAACTCCATCACAGTGACTTCTAATCTAGACCATAGACATCATACTGGAGGGTTAAGTGAGACCAGCCAAAGTCACCA        HHHHHHHHHHHHHHHHFHBHHHDHHHHHHHHFEHHHHHHHHHHHFHFHHHGIIG%GDEEDFFFADHFF;GBCEEAEHHHH        NH:i:2  HI:i:2  AS:i:157        nM:i:0'
                                                                                                                                                                                                      ^ ran out for this one
The bam file I'm using was generated in the STAR mapping step with the --outSAMtype BAM SortedByCoordinate argument.
Any help is much appreciated, thank you!

Best,
Charlotte

Alexander Dobin

unread,
Jul 5, 2017, 4:15:04 PM7/5/17
to rna-star
Hi Charlotte,

I have fixed a non-canonical form for the awk printf statement - the problem should go away.
Please try the new version of the script from github master: https://github.com/alexdobin/STAR/blob/master/extras/scripts/tagXSstrandedData.awk

Cheers 
Alex

jxf...@case.edu

unread,
Mar 14, 2018, 10:32:09 AM3/14/18
to rna-star
Hi there! I just attempted to use this script with the following shell command:

samtools view -h A1Aligned.sortedByCoord.out.bam | awk -v strType=2 -f tagXSstrandedData.awk | samtools view -bS - > A1Aligned.XS.bam

I received back the following errors from my Ubuntu VM:

awk: tagXStrandedData.awk: line 35: function and never defined
awk
: tagXStrandedData.awk: line 35: function and never defined
awk
: tagXStrandedData.awk: line 35: function and never defined
awk
: tagXStrandedData.awk: line 35: function and never defined
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

Did I align my reads incorrectly? My mapping input settings were: 
./STAR --runMode alignReads --runThreadN 8 --limitGenomeGenerateRAM 50000000000 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --sjdbOverhang 149 --outFileNamePrefix ~/Documents/Data/CLP1MN/TotalRNASeq/Aligned/A1 --genomeDir ~/Documents/Reference/Index/STAR/hg38_noref --sjdbGTFfile ~/Documents/Reference/GTF/UCSC_Gencodev27_Comprehensive.gtf --readFilesIn ~/Documents/Data/CLP1MN/TotalRNASeq/Raw/A1_1.fq ~/Documents/Data/CLP1MN/TotalRNASeq/Raw/A1_2.fq

I received back ~93% uniquely mapped reads in my final log file for the run. Maybe there's just some extra formatting that needs to be done before this command can take the BAM file?

Thanks,

Jordan

Alexander Dobin

unread,
Mar 14, 2018, 5:11:04 PM3/14/18
to rna-star
Hi Jordan,

I think the problem is with the awk installation on your server which prevents the script from running properly. Can you try to run:
$ awk 'BEGIN {print and(1,1)}'
You may also try to use gawk instead of awk.

If it does not work, you would need to ask your sys-admin to install awk with the "and" function compiled.

Cheers
Alex

Jordan Farr

unread,
Mar 15, 2018, 10:09:34 AM3/15/18
to rna-star
Changing awk to gawk worked like a charm!

Thanks a bunch Alex.

Jordan
Reply all
Reply to author
Forward
0 new messages