Aligning reads of different length

Veer Singh Marwah

unread,

May 15, 2014, 7:04:37 AM5/15/14

to rna-...@googlegroups.com

Hi,

We are planning to use the STAR 2-pass method for calling variations as described in GATK RNA-Seq best practices. But there are a few queries we need clarifications for.

When creating the genome index sjdbOverhang should be specified as mate_length-1, but we have datasets with varying read lengths because we trim the reads for adapter contamination and low quality. In such a case how does one use the parameter sjdbOverhang and if one wishes to ignore it does it have any default value. Or is its necessary that all the reads be of same length?

We have also generated an annotation (known and novel) by comparing RNA-Seq information from 50 replicates. I plan to use this annotation as the only source of splice junction information for the genome index, do you think that it will still be advisable to use 2-pass method instead?

Thank you,

Veer

Alexander Dobin

unread,

May 19, 2014, 4:02:41 PM5/19/14

to rna-...@googlegroups.com

Hi Veer,

if you have reads of varying lengths, I recommend using --sjdbOverhang <N> with N=min( 100, max(ReadLength)-1 ). Changing this parameter usually affects only a very small % of reads.

If you have multiple samples, I would recommend the following 2-pass procedure:

(i) run 1st pass of STAR for all samples using genome file with GTF annotations

(ii) collect and filter splice junctions from all samples, and re-generate the genome using the GTF annotations and the merged splice junctions

(iii) re-map all samples with the new genome.

This will provide the highest accuracy for SNP detection.

Cheers

Alex

Veer Singh Marwah

unread,

May 21, 2014, 10:07:06 AM5/21/14

to rna-...@googlegroups.com

Hi Alex,

Thank you for replying to my query, I am implementing your suggestions. I need a little clarification with regards to combining splice junctions from 1pass alignment step. In another post you have mentioned your commands formatting and combining as following.

prepare splice junctions:
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' ../SampleTest.trimQ20.STAR-2pass.sjbd99.SJ.out.tab > SJin.tab

What I understand from that command is that you are formatting the SJ.out.tab file to 'Chr\tStart\tEnd\tStrand' and by 5th field '$5>0', are you filtering 'intron motif' here. After looking at your command I also checked the manual where you have very clearly mentioned format for --sjdbFileChrStartEnd as 'Chr\tStart\tEnd\tStrand'.

So I gather that have to format the SJ.out.tab file before giving to --sjdbFileChrStartEnd option at all times. Or is it that it is required only in case we are combining different samples SJ.out.tab files? I am asking this because in the GATK RNA-Seq best practices they have used the SJ.out.tab file with --sjdbFileChrStartEnd option without formatting it.

It seems they forgot mention the formatting part but I wanted to be sure. Also apologies if quoting from another post seems like a cross post.

Regards,

Veer

PS: Also I guess I should remove redundancy from the combined splice junction annotation file.

Alexander Dobin

unread,

May 22, 2014, 12:23:12 AM5/22/14

to rna-...@googlegroups.com

Hi Veer,

the $5>0 filtering removes non-canonical junctions, they are generally less reliable. However, I could imagine that GATK prefers as large database of junctions as possible.

On the other hand, I recommend strand fromatting 012->.+- in all cases, otherwise the novel junctions will be considered unstranded.

I will have to ask GATK people if they avoided this transformation on purpose, thanks for pointing this out.

Collapsing the merged junctions is not really necessary, STAR will do it for you - although it may be useful to see how many collapsed novel junctions you are adding.

One other filtering you may want to add - removing junctions on Mitochondrion genome: those are likely artifacts, and may significantly slow the 2nd pass.

Cheers

Alex

Veer Singh Marwah

unread,

May 26, 2014, 11:26:28 AM5/26/14

to

Hi Alex,

Thank you for providing the clarification and for suggesting the changes to improve the analysis.

I performed the analysis as discussed. 1pass with GTF annotation and 2pass with GTF annotation and combined & filtered SJ_out from all samples. Further for my curiosities sake I performed variation calling on processed alignments (read grouping, indel-realignment and base quality re-calibration) from both 1pass and 2pass.

I filtered the variations as suggested in GATK RNA-Seq best practices and used vcftools to compare them. Following is the comparison result.

Found 153308 SNPs common to both files.

Found 2265 SNPs only in main file.

Found 2894 SNPs only in second file.

There are 2 thousand odd variations specific to either 1pass or 2pass.I loaded the alignments in IGV to check some of the variations which were specific to 1pass and 2pass.

The variations which were specific to 2pass displayed similar coverage in both alignments and similar number of alternate bases but still it was called as variation in 2pass but not in 1pass.

Example screenshot is attached herewith as 2pass_specific_variation.png

The variations which were specific to 1pass displayed much higher coverage in 1pass alignment compared to 2pass alignment.Reads could have been rearranged to a different location.

Example screenshot is attached herewith as 1pass_specific_variation.png

One thing I noted in both alignment was that quite a lot of reads were tagged as "Alignment NOT primary" by IGV. These must be multi-mapped reads. Do you think it is better to align without multi-mapping or remove multi-mapping reads before performing variation calling.

The only reason I am trying to compare both 1pass and 2pass is because this analysis is an addition to an earlier analysis we had performed where we have merged and filtered the reported genes and transcripts from 50 samples to obtain a GTF annotation which is being used for this analysis. And we have performed 1pass with that GTF annotation, so I wanted to see the difference in that alignment vs the 2pass which is GTF as well as the SJ reported in 1pass.

Looking forward to your comments on the comparison of 1pass and 2pass. Thank you.

Regards,

Veer

PS: Well actually the number of multi-mapped reads are 3-4 % as reported by the alignment statistics. So not that many actually

1pass_specific_variation.png

2pass_specific_variation.png

Alexander Dobin

unread,

May 29, 2014, 11:24:56 AM5/29/14

to rna-...@googlegroups.com

Hi Veer,

thanks for sharing these results. I do not have experience with calling SNPs, so I can only comment very superficially. I am meeting with GATK people next week and will try to understand better the whole process, and will discuss you comparisons with them.

>>> One thing I noted in both alignment was that quite a lot of reads were tagged as "Alignment NOT primary" by IGV. These must be multi-mapped reads. Do you think it is better to align without multi-mapping or remove multi-mapping reads before performing variation calling.

In general, I would advise against using multi-mappers for SNP calling, and STAR assgins very low mapping quality to them, so I hope GATK filters them out.

>>> The variations which were specific to 2pass displayed similar coverage in both alignments and similar number of alternate bases but still it was called as variation in 2pass but not in 1pass.

The only explanation I have is that these alignments are multi-mappers in the 1st pass, and SNPs are not called on them, and they become unique mappers in the 2nd pass, and now SNPs are called on them.

You can check this by making an IGV track with unique alignments only.

>>>The variations which were specific to 1pass displayed much higher coverage in 1pass alignment compared to 2pass alignment.Reads could have been rearranged to a different location.

This is likely the reason why GATK prefers 2-pass scheme, I imagine many of the SNPs called in the 1st pass but not in the 2nd are false positives. In the 2nd pass STAR will align these reads to the novel junctions, rather than to 1st pass unspliced loci with mismatches.

Cheers

Alex

On Monday, May 26, 2014 11:26:28 AM UTC-4, Veer Singh Marwah wrote:

Hi Alex,

Thank you for providing the clarification and for suggesting the changes to improve the analysis.

I performed the analysis as discussed. 1pass with GTF annotation and 2pass with GTF annotation and combined & filtered SJ_out from all samples. Further for my curiosities sake I performed variation calling on processed alignments (read grouping, indel-realignment and base quality re-calibration) from both 1pass and 2pass.

I filtered the variations as suggested in GATK RNA-Seq best practices and used vcftools to compare them. Following is the comparison result.
Found 153308 SNPs common to both files.
Found 2265 SNPs only in main file.
Found 2894 SNPs only in second file.

There are 2 thousand odd variations specific to either 1pass or 2pass.I loaded the alignments in IGV to check some of the variations which were specific to 1pass and 2pass.

The variations which were specific to 2pass displayed similar coverage in both alignments and similar number of alternate bases but still it was called as variation in 2pass but not in 1pass.
Example screenshot is attached herewith as 2pass_specific_variation.png

The variations which were specific to 1pass displayed much higher coverage in 1pass alignment compared to 2pass alignment.Reads could have been rearranged to a different location.
Example screenshot is attached herewith as 1pass_specific_variation.png

One thing I noted in both alignment was that quite a lot of reads were tagged as "Alignment NOT primary" by IGV. These must be multi-mapped reads. Do you think it is better to align without multi-mapping or remove multi-mapping reads before performing variation calling.

The only reason I am trying to compare both 1pass and 2pass is because this analysis is an addition to an earlier analysis we had performed where we have merged and filtered the reported genes and transcripts from 50 samples to obtain a GTF annotation which is being used for this analysis. And we have performed 1pass with that GTF annotation, so I wanted to see the difference in that alignment vs the 2pass which is GTF as well as the SJ reported in 1pass.

Looking forward to your comments on the comparison of 1pass and 2pass. Thank you.

Regards,

Veer

PS: Well actually the number of multi-mapped reads are 3-4 % as reported by the alignment statistics. So not that many actually

Reply all

Reply to author

Forward