Very low processing in 2pass and "drop in unique hits" / "increase in multi mappings"

54 views
Skip to first unread message

Urs Lahrmann

unread,
Nov 16, 2017, 8:58:18 AM11/16/17
to rna-star
Hi Alex,

I have a currently running sample which takes ages (compared to other samples) in the 2pass. Furthermore, there is a drop in "Mapped unique" from 1pass to 2pass which I have never seen before during processing. Could you tell whether this is an indicator for "something strange" in this sample? The sample was in no way an outlier in my previous QC analyses and samples with equal or more reads passed just fine?! I attached the progress log and final log of the 1 pass of the "problematic file" (Sample2), of a sample with similar reads that was processed shortly before (Sample1) and of a sample with many more reads that was processed ~3 weeks with the exact same settings (Sample3).

My (simplified) run command is:

starProg="/foo/bar/Star_2.5.1b/bin/Linux_x86_64_static/STAR"
starGenomeDir="/foo/bar/genomeFiles/Gencode_GRCh38_v24p5/Star251b_sjdbOver99_Ref"
threads="15"

"$starProg" --genomeDir $starGenomeDir --readFilesIn "$forRead" "$revRead" --outFileNamePrefix "$outPrefix" --readFilesCommand zcat --twopassMode Basic --outSAMmapqUnique 60 --outSAMtype BAM Unsorted --sjdbOverhang 99 --outFilterMultimapNmax 20 --outFilterType BySJout --outReadsUnmapped Fastx --winAnchorMultimapNmax 100 --seedSearchLmax 30 --seedSearchStartLmax 30 --seedPerReadNmax 2000 --seedPerWindowNmax 100 --outFilterMismatchNmax 60 --chimSegmentMin 12 --chimJunctionOverhangMin 12 --alignSJoverhangMin 8 --alignSJDBoverhangMin 4 --alignIntronMin 20 --alignMatesGapMax 200000 --alignIntronMax 200000 --outSAMattrRGline ID:"$rgID" CN:"$rgCN" SM:"$rgSM" DS:"$rgDS" PL:"$rgPL" --runThreadN $threads > "$outErrLog" 2>&1

Don't know if this is important, but I'm analyzing scRNA-Seq data from cancer patients sequenced on a HiSeq4k.

Best regards,
Urs

ps: I know that I'm not running the very latest star version, but I would need to re-run a couple of samples if it is a version issue => that's why I did not simply try the latest star version, yet.
Sample1_Star251b_H38V27P10G_1pass_Log.final.out
Sample1_Star251b_H38V27P10G_Log.progress.out
Sample2_Star251b_H38V27P10G_1pass_Log.final.out
Sample2_Star251b_H38V27P10G_Log.progress.out
Sample3_Star251b_H38V27P10G_1pass_Log.final.out
Sample3_Star251b_H38V27P10G_Log.progress.out

Alexander Dobin

unread,
Nov 17, 2017, 6:04:49 PM11/17/17
to rna-star
Hi Urs,

sample2 is the "bad" one, right?
It stands out in two ways that might affect the 2nd step mapping: 
1. The mismatch rate is relatively high, 2.8 vs 1.4 and 1.8 in the other samples.
2. The number of unannotated splices (=Total-Annotated) is 325k compared to 150k for sample1, while the number of uniquely mapped reads increases only by 50%.

Could you also send me the Log.final.out files for the 2nd pass, and the Log.out file for the run?

Cheers
Alex

Urs Lahrmann

unread,
Nov 20, 2017, 4:34:14 AM11/20/17
to rna-star
Hi Alex,

yes, sample2 is the "bad" one. Please find the requested files attached. As these are patient samples, I had to anonymize parts of the file path/names in the *.Log.out files. That means real sample ids became "Sample1,2, ...", real path became "/foo/bar, ProjectX, ...", everything else was not changed at all.

Btw: In the meantime, the "problem" occurred again. The last sample (Sample4) has the same characteristic as sample 2 (high mismatch rate and very high number of unannotated splices (~2.7M). Here, ~11M reads were processed in 2.5days :(

I should maybe also add 2 things concerning my data processing:
1) I filter my reads before mapping to reduce false positives (first: adapter/qual trimming (with bbduk from the bbmap package), second: contamination screening (with biobloomtools))
2) There is a potential technical issue with the flowcell with some tiles having strongly reduced Phred scores (This however, effects of course all samples and not just the "bad" performing)

Cheers,
Urs
Sample1_Star251b_H38V27P10G_2pass_Log.final.out
Sample1_Star251b_H38V27P10G_Log.out
Sample2_Star251b_H38V27P10G_2pass_Log.final.out
Sample2_Star251b_H38V27P10G_Log.out
Sample3_Star251b_H38V27P10G_2pass_Log.final.out
Sample3_Star251b_H38V27P10G_Log.out
Sample4_Star251b_H38V27P10G_1pass_Log.final.out
Sample4_Star251b_H38V27P10G_Log.out
Sample4_Star251b_H38V27P10G_Log.progress.out

Alexander Dobin

unread,
Nov 21, 2017, 12:40:11 PM11/21/17
to rna-star
Hi Urs,

thanks for the files!
There are only ~5000 junctions inserted after the 1st pass in Sample2, and it's even less than the number of inserted junctions in Sample1.
So I now think it's not the number of inserted junctions, but rather the junctions themselves. Most likely, there are a few junctions in Sample2 (and 4) that overlap a highly expressed locus, which makes trouble for both spliced and unspliced alignments in the 2nd pass. Often this happens in loci where there should be no splicing at all, e.g. on chrM, or in rRNA loci.

Could you look at the SJ.out.tab of the first pass to check if you are getting unannotated junctions from unusual places, such as chrM or scaffolds? You may need to do some extra filtering on the SJ.out.tab files before the 2nd pass.
Or you can send it to me, if it's not considered private.

Cheers
Alex

Urs Lahrmann

unread,
Nov 22, 2017, 5:04:12 AM11/22/17
to rna-star
Hi Alex,

you are partially correct in your assumption :)
There are indeed very few known junctions which have very high expression (1M+ reads!)
However, they do not overlap with chrM/scaffolds/rRNA, but are located in immunglobulins.

Please find attached a summary I created from the sj.out.tab files; columns are:
chr - chromosome
unanno - number of unannotated junctions ("0" in column 6 of sj.out.tab) in chr
ua_uniq - sum of all column 7 values (unique mapping support) with column 6 = "0" in chr
ua_multi - sum of all column 8 values (multiple mapping support) with column 6 = "0" in chr
known, kn_uniq, kn_multi - same as above but for annotated/known junctions ("1" in column 6 of sj.out.tab)

If I look at the data in detail, I see for example the following highest expression in Sample2:
chr2    88885621    88885839    1    1    1    2432135    429    74
chr14    106762399    106762484    2    2    1    1951870    91051    70

And in Sample4:
chr2    89142873    89143059    2    2    1    3858291    119201    75
chr2    88857684    88861885    2    2    1    2804002    428709    73

Highest expression values in other files are <100k.

I will try later today whether simply removing those high expressions from the pass1 run resolves the mapping issue in pass2.
However, as these are probably real splice junctions, I would of course like to know whether you can suggest a more sophisticated way of dealing with this ;)

Thanks for the help!

Cheers,
Urs
SJ.out.tab.summary

Alexander Dobin

unread,
Nov 27, 2017, 9:51:30 PM11/27/17
to rna-star
Hi Urs,

thanks for the interesting summary.
The "problem" cannot be caused by annotated junctions as they are present in the reference for both 1st and 2nd path.
The problem is most likely caused by those novel junctions that are detected at the 1st pass with low to moderate expression, but overlap the highly expressed loci. In this case, the many reads that map to these loci unspliced in the 1st pass, in the 2nd pass have to be "checked" for splicing, and I think this is what causes slowdown and increase in multimappers.

In your example, in Sample2, the chr14 (IG locus?) stands out compared to"good" Samples1/3, in large numbers of unannotated junctions. The fact that these junctions get high counts indicates that these loci are highly expressed.
Similarly, Sample4 stands out in chr14 and chr2.

Now, the question you need to answer is whether 2-pass alignment really adds any value for your downstream analyses. Note that 2-pass main effect is to increase the number of reads mapping to unannotated junctions. If you are not interested that much in unannotated junctions, you may simply use 1-pass runs with annotations.

If you still want to get more accurate counts for unannotated junctions, you would need to filter out 1st pass novel junctions that map to highly expressed loci on chr14 and chr2. After that, you can run the 2nd pass "manually" with --sjdbFileChrStartEnd filtered.SJ.out.tab .

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages