Formatting read group(s) for STAR


Sean Davis

Sep 6, 2014, 10:53:37 AM
to rna-...@googlegroups.com
Thanks, Alex, for including inline read group argument addition.  I have several sets of FASTQ files for each run of STAR and want to include read group information for all the sets.  Could you provide a little more detail on the formatting for the --outSAMattrRGline argument?  It wasn't clear (to me) from the parameter defaults available in the source distribution.  

Thanks,
Sean

Alexander Dobin

Sep 8, 2014, 11:46:25 AM
Hi Sean,

if you used multiple files in --readFilesIn A_R1,B_R1,C_R1  A_R2,B_R2,C_R2
you can use multiple read group entries also separated by commas:
--outSAMattrRGline ID:sampleA CN:AA DS:AAA , ID:sampleBB CN:bb DS:bbbb , ID:sampleC CN:ccc DS:cccc
Each entry has to start with the ID: field; this field will be used as the RG tag in each read.
The whole entry will be used in the SAM header, e.g. @RG ID:sampleA CN:AA DS:AAA
If you need to have spaces in one of the fields, you have to use quotes, e.g. ID:sampleA "CN:A A" "DS:A AA" .
EDIT: commas have to be separated by spaces on left and right in the  --outSAMattrRGline
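
For scripting, the rules above can be sketched as a small shell helper. This is only an illustration: the function name build_rg_line, the sample name and the IDs are made up, and it assumes that no field value contains spaces (such fields would need the quoting described above).

```shell
# build_rg_line SAMPLE ID [ID ...]
# Hypothetical helper: prints one "ID:<id> SM:<sample>" entry per ID,
# joined by " , " (a comma with a space on each side).
build_rg_line() {
  sample=$1
  shift
  line=""
  for id in "$@"; do
    entry="ID:${id} SM:${sample}"
    if [ -z "$line" ]; then
      line=$entry
    else
      line="$line , $entry"
    fi
  done
  printf '%s\n' "$line"
}

build_rg_line mySample sampleA sampleB sampleC
# prints: ID:sampleA SM:mySample , ID:sampleB SM:mySample , ID:sampleC SM:mySample
```

The output can then be passed unquoted, e.g. --outSAMattrRGline $(build_rg_line mySample sampleA sampleB sampleC), so that the shell splits it into separate words and each comma arrives as its own argument.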

Cheers
Alex

Sean Davis

Sep 8, 2014, 4:20:34 PM
to rna-...@googlegroups.com
Thanks, Alex.  Here is my command line (2.4.0a):

STAR --genomeDir /data/CCRBioinfo/public/STAR/hg19_gencode14_ov60 --runThreadN 32 --outSAMattributes Standard --alignIntronMax 100000 --readFilesIn /data/CCRBioinfo/fastq/6_1_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz,/data/CCRBioinfo/fastq/1_1_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz /data/CCRBioinfo/fastq/6_3_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz,/data/CCRBioinfo/fastq/1_3_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFileNamePrefix bam/PASEBY/RNA/ --outSAMunmapped Within --chimSegmentMin 25 --chimJunctionOverhangMin 25 --outSAMattrRGline ID:571 LB:1039 SM:PASEBY_Tumor PL:Illumina, ID:574 LB:1039 SM:PASEBY_Tumor PL:Illumina --outStd BAM_SortedByCoordinate --outTmpDir /scratch/tmp

And my BAM header contains only one RG line:

@RG ID:571 ID:571 LB:1039 SM:PASEBY_Tumor PL:Illumina, ID:574 LB:1039SM:PASEBY_Tumor PL:Illumina

I tried with the comma having a space after and without and got the same result.  I'm not sure what detail I might be missing as your instructions are pretty straightforward.  

Thanks,
Sean



Alexander Dobin

Sep 9, 2014, 3:02:53 PM
to rna-...@googlegroups.com
Hi Sean,

sorry, I forgot about one rule - commas have to be separated by spaces on left and right in the  --outSAMattrRGline:
ID:sampleA CN:AA DS:AAA , ID:sampleBB CN:bb DS:bbbb , ID:sampleC CN:ccc DS:cccc 

Also, you need to have the same number of RG entries separated by commas as the number of files in --readFilesIn separated by commas (or you can have just one RG entry, which will be assigned to all files).
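
A quick pre-flight check for both rules (commas surrounded by spaces, and either one RG entry per input file or a single entry for all files) could look like the sketch below. The function names are hypothetical and nothing here is part of STAR; it also assumes that no field value itself contains a comma.

```shell
# Succeeds if a comma is glued to a neighbouring character,
# e.g. "DS:AAA, ID:..." or "DS:AAA ,ID:...".
has_glued_comma() {
  printf '%s\n' "$1" | grep -Eq '[^ ],|,[^ ]'
}

# Number of comma-separated paths in one --readFilesIn list.
count_files() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Number of read-group entries separated by " , ".
count_rg_entries() {
  printf '%s\n' "$1" | awk -F' , ' '{ print NF }'
}

# check_rg FILES RGLINE: succeeds when the RG line looks consistent.
check_rg() {
  if has_glued_comma "$2"; then
    echo "comma without spaces on both sides" >&2
    return 1
  fi
  n_files=$(count_files "$1")
  n_rg=$(count_rg_entries "$2")
  if [ "$n_rg" -ne 1 ] && [ "$n_rg" -ne "$n_files" ]; then
    echo "$n_rg RG entries for $n_files files" >&2
    return 1
  fi
  return 0
}

check_rg "A_R1,B_R1" "ID:571 SM:x , ID:574 SM:x" && echo "looks OK"
check_rg "A_R1,B_R1" "ID:571 SM:x, ID:574 SM:x" || echo "needs fixing"
```

The second call mirrors the "PL:Illumina, ID:574" pattern from the command line earlier in the thread, where the comma has no space on its left.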

Cheers
Alex

mihindu...@gmail.com

Dec 14, 2018, 4:10:57 PM
to rna-star
Hi Alex,
I am trying to merge 3 reverted bams (same sample, 3 read groups) and cannot get it to work.  Here is what I am hoping to get for each read:
C4HFCACXX140624:1:2204:2892:6170        419     chr1    10464   3       91S10M  =       631749  621386  GGATGTTCCAGCGGGCCGCTGTCTCGCCATTCCTCTCCACCCTGGGCACTGACTCCGTCTCAAAAAAAAAAAAAACAAAAAAAAAAAACCCACCCTCGCGG     #####################################################################################################   NH:i:2  HI:i:2  AS:i:102        nM:i:2  NM:i:0    RG:Z:C4HFC.1 SM:881_130918

Here is what I get:
C4HFCACXX140624:1:2204:2892:6170        419     chr1    10464   3       91S10M  =       631749  621386  GGATGTTCCAGCGGGCCGCTGTCTCGCCATTCCTCTCCACCCTGGGCACTGACTCCGTCTCAAAAAAAAAAAAAACAAAAAAAAAAAACCCACCCTCGCGG     #####################################################################################################   NH:i:2  HI:i:2  AS:i:102        nM:i:2  NM:i:0    RG:Z:C4HFC.1 SM:881_130918      RG:Z:C4HFC.1

My command:
/home/mihinduk/STAR-2.6.0a/bin/Linux_x86_64/STAR --runThreadN 12 --genomeDir /40/AD/Expression/2017_03_MendelianVsSporadics/06.-References/GrCh38_100n --twopassMode Basic --runMode alignReads --readFilesType SAM PE --readFilesCommand samtools view -h --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --outFileNamePrefix /40/tmp/pipeline_test/881_130918_t2_ --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes  --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --genomeLoad NoSharedMemory --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType WithinBAM SoftClip --chimMainSegmentMultNmax 1 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline "ID:C4HFC.1 SM:881_130918" , "ID:C4HFC.2 SM:881_130918" , "ID:C4HFC.3 SM:881_130918"

Thank you,
Kathie

Alexander Dobin

Dec 14, 2018, 4:53:39 PM
to rna-star
Hi Kathie,

I am not sure if I understand correctly: do you want the content of the RG:Z: tag for each read to be "C4HFC.1 SM:881_130918", with a space separating the two words?
Or do you want SM tag to be listed in the header @RG line only, and RG:Z:C4HFC.1 listed for each read? 
Note that SM:881_130918 cannot be a SAM attribute for each read by itself.

Cheers
Alex

mihindu...@gmail.com

Dec 17, 2018, 9:28:59 AM
to rna-...@googlegroups.com
Hi Alex,
I was trying to follow the GTEx pipeline, but found that including their recommended parameter, --outSAMattrRGline ID:C4HFC.1 SM:881_130918, results in an extra RG:Z tag.  I was hoping to get: RG:Z:C4HFC.1 SM:881_130918, but got RG:Z:C4HFC.1 SM:881_130918      RG:Z:C4HFC.1.  If I turn off --outSAMattrRGline, I can capture the sample name in the SAM header and the read group for each alignment, but I have been trying to figure out the GTEx parameters: https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md.  In particular, I was trying to apply this to BAM files that contain 1 sample with multiple read groups.
Can each read have RG:Z:RG SM?

I should mention that I am using bams as input with the options: 
--runMode alignReads --readFilesType SAM PE --readFilesCommand samtools view -h --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam

So, perhaps this issue is bam-specific.

Thank you,
Kathie

Alexander Dobin

Dec 18, 2018, 9:39:46 AM
to rna-star
Hi Kathie,

I think the problem is that the BAM files you are mapping from already contain the RG:Z: tag, and you are adding another one.
To remove all the tags from the BAMs, you could do one of the following:

1. Convert each BAM separately into SAM, keeping only the first 11 fields:
$ samtools view -h a1.bam  | cut -f1-11 > a1.sam
$ samtools view -h a2.bam  | cut -f1-11 > a2.sam
$ samtools view -h a3.bam  | cut -f1-11 > a3.sam
then map 
$ STAR  --outSAMattrRGline ID:C4HFC.1 SM:881_130918 , ID:C4HFC.2 SM:881_130918 , ID:C4HFC.3 SM:881_130918 --readFilesType SAM PE --readFilesIn a1.sam,a2.sam,a3.sam
Note that you do not need the --readFilesCommand option in this case.

2. Or you can do all of the above on the fly:
Create a script convertBAM.sh that contains one line and make it executable:

samtools view -h "$1" | cut -f1-11

Then map with  
STAR --readFilesType SAM PE --readFilesCommand /path/to/convertBAM.sh --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam --outSAMattrRGline ID:C4HFC.1 SM:881_130918 , ID:C4HFC.2 SM:881_130918 , ID:C4HFC.3 SM:881_130918


The commands above will add the RG tag (with the ID value) to each of the reads, and put ID: and SM: into the @RG line in the header.
If you want to add SM to each read, you can experiment with  --outSAMattrRGline ID:"C4HFC.1 SM:881_130918" , ID:"C4HFC.2 SM:881_130918" , ID:"C4HFC.3 SM:881_130918"
but I do not recommend it as it defies the standard BAM formatting rules.
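
To see what the cut -f1-11 step does without needing a BAM file, here is a small sketch on a made-up SAM stream. The function names and the record are invented for illustration; the only assumption is the usual SAM layout of 11 mandatory tab-separated columns followed by optional tags.

```shell
# Keep only the 11 mandatory SAM columns. Header lines with fewer than
# 11 tab-separated fields pass through unchanged.
strip_optional_tags() {
  cut -f1-11
}

# Made-up SAM stream: one @RG header line and one unaligned record
# that already carries an RG:Z: tag in column 12.
demo_sam() {
  printf '@RG\tID:C4HFC.1\tSM:881_130918\n'
  printf 'read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\t####\tRG:Z:C4HFC.1\n'
}

demo_sam | strip_optional_tags
```

The @RG header line comes through intact, while the record loses its RG:Z: field, so the tag is no longer duplicated when a new one is added with --outSAMattrRGline. A header line with more than 11 tab-separated fields would be truncated by this approach.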

Cheers
Alex

mihindu...@gmail.com

Dec 18, 2018, 10:18:40 AM
to rna-star
Thank you very much, Alex.  This was very helpful.  I think the best way forward for us is to just skip the --outSAMattrRGline command, since the RG is captured for each read and the SM is captured in the @RG header. 
Kathie