Formatting read group(s) for STAR

Sean Davis

Sep 6, 2014, 10:53:37 AM

to rna-...@googlegroups.com

Thanks, Alex, for adding the inline read group argument. I have several sets of FASTQ files for each run of STAR and want to include read group information for all of the sets. Could you provide a little more detail on the formatting of the --outSAMattrRGline argument? It wasn't clear (to me) from the parameter defaults available in the source distribution.

Thanks,

Sean

Alexander Dobin

Sep 8, 2014, 11:46:25 AM

Hi Sean,

If you used multiple files in --readFilesIn A_R1,B_R1,C_R1 A_R2,B_R2,C_R2

you can use multiple read group entries, also separated by commas:

--outSAMattrRGline ID:sampleA CN:AA DS:AAA , ID:sampleBB CN:bb DS:bbbb , ID:sampleC CN:ccc DS:cccc

Each of the entries has to start with the ID: field; this field will be used as the RG tag in each read.

The whole entry will be used in the SAM header, e.g. @RG ID:sampleA CN:AA DS:AAA

If you need to have spaces in one of the fields, you have to use quotes, e.g. ID:sampleA "CN:A A" "DS:A AA" .

EDIT: commas have to be surrounded by spaces on the left and right in --outSAMattrRGline.
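As a minimal sketch of the formatting rules above (not part of STAR itself), the following POSIX shell fragment assembles an --outSAMattrRGline value with " , " between entries. The read group IDs and the sample name are hypothetical placeholders, and the sketch assumes that no field contains spaces.

```shell
# Hypothetical read group IDs and sample name; substitute your own.
ids="sampleA sampleB sampleC"
sample="mySample"

rg_line=""
for id in $ids; do
  # Every entry starts with the ID: field.
  entry="ID:${id} SM:${sample}"
  if [ -z "$rg_line" ]; then
    rg_line="$entry"
  else
    # Entries are joined by a comma with a space on each side.
    rg_line="${rg_line} , ${entry}"
  fi
done

echo "--outSAMattrRGline ${rg_line}"
```

In this sketch the value would be passed unquoted (STAR ... --outSAMattrRGline $rg_line) so that the shell splits it into separate tokens, matching how the entries are typed by hand in the example above.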

Cheers

Alex

Sean Davis

Sep 8, 2014, 4:20:34 PM

to rna-...@googlegroups.com

Thanks, Alex. Here is my command line (2.4.0a):

STAR --genomeDir /data/CCRBioinfo/public/STAR/hg19_gencode14_ov60 --runThreadN 32 --outSAMattributes Standard --alignIntronMax 100000 --readFilesIn /data/CCRBioinfo/fastq/6_1_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz,/data/CCRBioinfo/fastq/1_1_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz /data/CCRBioinfo/fastq/6_3_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz,/data/CCRBioinfo/fastq/1_3_62BLFAAXX.211_BUSTARD-2011-01-27.fq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFileNamePrefix bam/PASEBY/RNA/ --outSAMunmapped Within --chimSegmentMin 25 --chimJunctionOverhangMin 25 --outSAMattrRGline ID:571 LB:1039 SM:PASEBY_Tumor PL:Illumina, ID:574 LB:1039 SM:PASEBY_Tumor PL:Illumina --outStd BAM_SortedByCoordinate --outTmpDir /scratch/tmp

And my BAM header contains only one RG line:

@RG ID:571 ID:571 LB:1039 SM:PASEBY_Tumor PL:Illumina, ID:574 LB:1039SM:PASEBY_Tumor PL:Illumina

I tried both with and without a space after the comma and got the same result. I'm not sure what detail I might be missing, as your instructions are pretty straightforward.

Thanks,

Sean

Alexander Dobin

Sep 9, 2014, 3:02:53 PM

to rna-...@googlegroups.com

Hi Sean,

Sorry, I forgot about one rule: commas have to be surrounded by spaces on the left and right in --outSAMattrRGline:

ID:sampleA CN:AA DS:AAA , ID:sampleBB CN:bb DS:bbbb , ID:sampleC CN:ccc DS:cccc

Also, you need to have the same number of RG entries separated by commas as the number of files in --readFilesIn separated by commas (or you can have just one RG entry, which will be assigned to all files).
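The two rules above (commas surrounded by spaces, and either one entry or one entry per input file) can be checked before launching a long run. This is a rough sketch in POSIX shell and awk; check_rg_line is a hypothetical helper, not a STAR feature, and it would wrongly flag a quoted field whose value itself contains a comma.

```shell
# Hypothetical helper: check_rg_line FILES RGLINE
#   FILES  : comma-separated mate-1 list as given to --readFilesIn
#   RGLINE : the --outSAMattrRGline value as one string
check_rg_line() {
  files=$1
  rgline=$2
  # Number of input files = number of comma-separated items.
  n_files=$(printf '%s\n' "$files" | awk -F',' '{ print NF }')
  printf '%s\n' "$rgline" | awk -v n_files="$n_files" '
    {
      n_rg = 1
      for (i = 1; i <= NF; i++) {
        if ($i == ",") {
          n_rg++                      # a standalone comma starts a new entry
        } else if ($i ~ /,/) {
          print "comma not surrounded by spaces: " $i
          bad = 1
        }
      }
      if (bad) exit 1
      if (n_rg != 1 && n_rg != n_files) {
        print "RG entries: " n_rg ", input files: " n_files
        exit 1
      }
      print "ok"
    }'
}

# Made-up inputs: the first call follows both rules, the second has a glued comma.
check_rg_line "A_R1,B_R1" "ID:571 PL:Illumina , ID:574 PL:Illumina"
check_rg_line "A_R1,B_R1" "ID:571 PL:Illumina, ID:574 PL:Illumina" || echo "rejected"
```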

Cheers

Alex

mihindu...@gmail.com

Dec 14, 2018, 4:10:57 PM

to rna-star

Hi Alex,

I am trying to merge 3 reverted bams (same sample, 3 read groups) and cannot get it to work. Here is what I am hoping to get for each read:

C4HFCACXX140624:1:2204:2892:6170 419 chr1 10464 3 91S10M = 631749 621386 GGATGTTCCAGCGGGCCGCTGTCTCGCCATTCCTCTCCACCCTGGGCACTGACTCCGTCTCAAAAAAAAAAAAAACAAAAAAAAAAAACCCACCCTCGCGG ##################################################################################################### NH:i:2 HI:i:2 AS:i:102 nM:i:2 NM:i:0 RG:Z:C4HFC.1 SM:881_130918

Here is what I get:

C4HFCACXX140624:1:2204:2892:6170 419 chr1 10464 3 91S10M = 631749 621386 GGATGTTCCAGCGGGCCGCTGTCTCGCCATTCCTCTCCACCCTGGGCACTGACTCCGTCTCAAAAAAAAAAAAAACAAAAAAAAAAAACCCACCCTCGCGG ##################################################################################################### NH:i:2 HI:i:2 AS:i:102 nM:i:2 NM:i:0 RG:Z:C4HFC.1 SM:881_130918 RG:Z:C4HFC.1

My command:
/home/mihinduk/STAR-2.6.0a/bin/Linux_x86_64/STAR --runThreadN 12 --genomeDir /40/AD/Expression/2017_03_MendelianVsSporadics/06.-References/GrCh38_100n --twopassMode Basic --runMode alignReads --readFilesType SAM PE --readFilesCommand samtools view -h --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --outFileNamePrefix /40/tmp/pipeline_test/881_130918_t2_ --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --genomeLoad NoSharedMemory --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType WithinBAM SoftClip --chimMainSegmentMultNmax 1 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline "ID:C4HFC.1 SM:881_130918" , "ID:C4HFC.2 SM:881_130918" , "ID:C4HFC.3 SM:881_130918"

Thank you,

Kathie

Alexander Dobin

Dec 14, 2018, 4:53:39 PM

to rna-star

Hi Kathie,

I am not sure I understand correctly: do you want the content of the RG:Z: tag for each read to be "C4HFC.1 SM:881_130918", with a space separating the two words?

Or do you want the SM tag to be listed in the header @RG line only, and RG:Z:C4HFC.1 listed for each read?

Note that SM:881_130918 cannot be a SAM attribute for each read by itself.

Cheers

Alex

mihindu...@gmail.com

Dec 17, 2018, 9:28:59 AM

to rna-...@googlegroups.com

Hi Alex,

I was trying to follow the GTEx pipeline, but found that including their recommended parameter, --outSAMattrRGline ID:C4HFC.1 SM:881_130918, results in an extra RG:Z tag. I was hoping to get RG:Z:C4HFC.1 SM:881_130918, but got RG:Z:C4HFC.1 SM:881_130918 RG:Z:C4HFC.1. If I turn off --outSAMattrRGline, I can capture the sample name in the SAM header and the read group for each alignment, but I have been trying to figure out the GTEx parameters: https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md. I was particularly trying to apply this to BAM files that contain one sample with multiple read groups.

Can each read have RG:Z:RG SM?

I should mention that I am using bams as input with the options:

--runMode alignReads --readFilesType SAM PE --readFilesCommand samtools view -h --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam

So, perhaps this issue is bam-specific.

Thank you,

Kathie

Alexander Dobin

Dec 18, 2018, 9:39:46 AM

to rna-star

Hi Kathie,

I think the problem is that the BAM files you are mapping from already contain the RG:Z: tag, and you are adding another one.
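One way to confirm this diagnosis is to count alignment records carrying more than one RG:Z: tag. A minimal sketch, assuming tab-separated SAM text on standard input; count_dup_rg is a hypothetical helper and the record below is made up for illustration.

```shell
# Hypothetical helper: print how many alignment records on stdin
# carry more than one RG:Z: tag.
count_dup_rg() {
  awk -F'\t' '
    /^@/ { next }                       # skip header lines
    {
      n = 0
      for (i = 12; i <= NF; i++)        # optional tags start at field 12
        if ($i ~ /^RG:Z:/) n++
      if (n > 1) dup++
    }
    END { print dup + 0 }'
}

# Made-up record with a duplicated RG:Z: tag, similar to the report above.
printf 'r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\t####\tRG:Z:C4HFC.1\tRG:Z:C4HFC.1\n' | count_dup_rg
```

On real data the same function could be fed from samtools, e.g. samtools view -h out.bam | count_dup_rg, where out.bam stands for whichever BAM you want to inspect.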

To remove all the tags from the BAMs, you could do one of the following:

1. Convert each BAM separately into SAM, keeping only the first 11 fields:

$ samtools view -h a1.bam | cut -f1-11 > a1.sam

$ samtools view -h a2.bam | cut -f1-11 > a2.sam

$ samtools view -h a3.bam | cut -f1-11 > a3.sam

then map

$ STAR --outSAMattrRGline ID:C4HFC.1 SM:881_130918 , ID:C4HFC.2 SM:881_130918 , ID:C4HFC.3 SM:881_130918 --readFilesType SAM PE --readFilesIn a1.sam,a2.sam,a3.sam

Note that you do not need the --readFilesCommand option in this case.

2. Or you can do all of the above on the fly:

Create a script convertBAM.sh that contains one line and make it executable:

samtools view -h "$1" | cut -f1-11

Then map with

STAR --readFilesType SAM PE --readFilesCommand /path/to/convertBAM.sh --readFilesIn C4HFC.1.bam,C4HFC.2.bam,C4HFC.3.bam --outSAMattrRGline ID:C4HFC.1 SM:881_130918 , ID:C4HFC.2 SM:881_130918 , ID:C4HFC.3 SM:881_130918

The commands above will add the RG tag (taken from the ID: field) to each of the reads, and put the ID: and SM: fields into the @RG line in the header.
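To see what the cut -f1-11 step does, here is a small sketch on a made-up two-line SAM stream (toy_sam and strip_tags are hypothetical names). cut passes lines with fewer than 11 tab-separated fields through unchanged, so short header lines survive, while alignment lines lose their optional tags from field 12 onward.

```shell
# Hypothetical stand-in for "samtools view -h file.bam":
# one header line and one unaligned record that carries an RG:Z: tag.
toy_sam() {
  printf '@RG\tID:old\tSM:s1\n'
  printf 'r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\t####\tRG:Z:old\n'
}

# Keep only the 11 mandatory SAM fields of each line.
strip_tags() {
  cut -f1-11
}

toy_sam | strip_tags
```

One caveat worth checking on your own files: a header line with more than 11 tab-separated fields would also be truncated by cut.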

If you want to add SM to each read, you can experiment with --outSAMattrRGline ID:"C4HFC.1 SM:881_130918" , ID:"C4HFC.2 SM:881_130918" , ID:"C4HFC.3 SM:881_130918"

but I do not recommend it as it defies the standard BAM formatting rules.

Cheers

Alex

mihindu...@gmail.com

Dec 18, 2018, 10:18:40 AM

to rna-star

Thank you very much, Alex. This was very helpful. I think the best way forward for us is to just skip the --outSAMattrRGline option, since the RG is captured for each read and the SM is captured in the @RG header.

Kathie
