occasional low mapping rate

hans-herma...@mdc-berlin.de

unread,

Nov 15, 2016, 4:26:39 AM11/15/16

to rna-star

Hi,

I am aligning some paired-end stranded RNAseq on a cluster and i realized that i got occasional low mapping rates.

I created test input files and submitted the same jobs over and over to different cluster nodes and saw, that they all finish, but sometime at a really low mapping rate.

(all libraries do in general align and are filtered for quality, this can be excluded)

This happened independent of the nodes and file system i used.

I realized that this coincided with differences in the memory allocation needed as seen from the Log.out file.

Whenever STAR gave low alignment rates, the same runs apparently used only low memory for sorting:

Do you have any idea, what that is?

Thanks

Harm

in general i run it:

#$ -pe smp 4

#$ -l mem_free=40G,h_vmem=40G

Here an exert

./4401_SRR070291/4401_SRR070291.Log.final.out: Uniquely mapped reads % | 89.21%

./4408_SRR070286/4408_SRR070286.Log.final.out: Uniquely mapped reads % | 90.46%

./4396_SRR111869/4396_SRR111869.Log.final.out: Uniquely mapped reads % | 44.12%

./4402_SRR111871/4402_SRR111871.Log.final.out: Uniquely mapped reads % | 83.32%

./4396_SRR189833/4396_SRR189833.Log.final.out: Uniquely mapped reads % | 84.58%

./4403_SRR070276/4403_SRR070276.Log.final.out: Uniquely mapped reads % | 94.84%

./4398_SRR111870/4398_SRR111870.Log.final.out: Uniquely mapped reads % | 84.69%

./4410_SRR189834/4410_SRR189834.Log.final.out: Uniquely mapped reads % | 44.90%

./4398_SRR070270/4398_SRR070270.Log.final.out: Uniquely mapped reads % | 77.30%

./4397_SRR070275/4397_SRR070275.Log.final.out: Uniquely mapped reads % | 24.07%

./4404_SRR111875/4404_SRR111875.Log.final.out: Uniquely mapped reads % | 2.30%

./4399_SRR070290/4399_SRR070290.Log.final.out: Uniquely mapped reads % | 2.28%

./4399_SRR111874/4399_SRR111874.Log.final.out: Uniquely mapped reads % | 2.54%

./4401_SRR070291/4401_SRR070291.Log.out:Max memory needed for sorting = 1435182561

./4408_SRR070286/4408_SRR070286.Log.out:Max memory needed for sorting = 281129370

./4396_SRR111869/4396_SRR111869.Log.out:Max memory needed for sorting = 181685402

./4402_SRR111871/4402_SRR111871.Log.out:Max memory needed for sorting = 398699002

./4396_SRR189833/4396_SRR189833.Log.out:Max memory needed for sorting = 1057454380

./4403_SRR070276/4403_SRR070276.Log.out:Max memory needed for sorting = 1208972986

./4398_SRR111870/4398_SRR111870.Log.out:Max memory needed for sorting = 135017297

./4410_SRR189834/4410_SRR189834.Log.out:Max memory needed for sorting = 804864334

./4398_SRR070270/4398_SRR070270.Log.out:Max memory needed for sorting = 562884626

./4397_SRR070275/4397_SRR070275.Log.out:Max memory needed for sorting = 321132447

./4404_SRR111875/4404_SRR111875.Log.out:Max memory needed for sorting = 17019814

./4399_SRR070290/4399_SRR070290.Log.out:Max memory needed for sorting = 25818993

./4399_SRR111874/4399_SRR111874.Log.out:Max memory needed for sorting = 18033328

Another behavior i noticed is, that while alignment rates throughout the job are normally very stable (see example 1), some jobs seem to be relatively unstable (see example 2).

example 1

Time Speed Read Read Mapped Mapped Mapped Mapped Unmapped Unmapped Unmapped Unmapped

M/hr number length unique length MMrate multi multi+ MM short other

Nov 12 02:05:19 86.5 1632947 128 83.4% 128.1 0.4% 4.6% 9.2% 0.1% 2.6% 0.2%

Nov 12 02:06:25 102.1 3801340 128 83.4% 128.1 0.4% 4.6% 9.2% 0.1% 2.5% 0.2%

Nov 12 02:07:31 107.5 5971920 128 83.4% 128.1 0.4% 4.6% 9.2% 0.1% 2.5% 0.2%

Nov 12 02:08:33 107.9 7855040 128 83.3% 128.1 0.4% 4.6% 9.2% 0.1% 2.6% 0.2%

Nov 12 02:10:42 72.4 7862730 128 83.3% 128.1 0.4% 4.6% 9.2% 0.1% 2.5% 0.2%

ALL DONE!

./4402_SRR111871/4402_SRR111871.Log.progress.out (END)

example 2

Time Speed Read Read Mapped Mapped Mapped Mapped Unmapped Unmapped Unmapped Unmapped

M/hr number length unique length MMrate multi multi+ MM short other

Nov 14 17:46:40 112.2 2306504 133 87.0% 133.2 0.3% 4.0% 7.6% 0.1% 1.0% 0.3%

Nov 14 17:47:56 123.3 5136261 133 74.8% 133.2 0.3% 3.6% 6.5% 0.1% 14.8% 0.2%

Nov 14 17:48:58 123.3 7259639 133 53.7% 133.0 0.3% 2.8% 4.7% 0.0% 38.7% 0.2%

Nov 14 17:49:58 124.2 9385364 133 42.0% 132.9 0.3% 2.4% 3.7% 0.0% 51.7% 0.1%

Nov 14 17:51:01 125.5 11676196 133 34.3% 132.7 0.3% 2.1% 3.1% 0.0% 60.5% 0.1%

Nov 14 17:52:05 126.0 13963848 133 29.1% 132.5 0.3% 1.9% 2.6% 0.0% 66.3% 0.1%

Nov 14 17:53:13 126.6 16428849 133 25.1% 132.4 0.3% 1.7% 2.3% 0.0% 70.9% 0.1%

Nov 14 17:57:26 85.9 17189550 133 24.1% 132.3 0.3% 1.7% 2.2% 0.0% 72.0% 0.1%

ALL DONE!

./4397_SRR070275/4397_SRR070275.Log.progress.out (END)

Alexander Dobin

unread,

Nov 17, 2016, 1:14:23 PM11/17/16

to rna-star

Hi Hans-Hermann,

the amount of RAM required for sorting is proportional to the number of mapped reads, so it would make sense that the low-mapping samples require less RAM for sorting.

The decrease of mapped % in the Log.progress.out file is uncommon. This would indicate that the quality of the sequencing get poorer towards the end of the file.

Are these files coming directly from the Illumina pipeline, or are they pre-sorted in any way? In the former case, is it possible that some "tiles" of the flow cell have poorer quality than others?

Please post the full Log.final.out file - there maybe hints there for the low mapping rate.

Cheers

Alex

hans-herma...@mdc-berlin.de

unread,

Nov 21, 2016, 5:33:18 AM11/21/16

to rna-star

Hi Alex,

by now I can explain the issue a bit better and found away around it.

Actually, I did not post the observation of occasional low mapping. Here an example, in which i just run my pipeline 40 times with always the same 1*10^6 paired-end reads.

$ find ./3[1-9][0-9] -name "*Log.fin*" | xargs grep "Uniquely mapped reads %"

./331/old.Log.final.out: Uniquely mapped reads % | 76.33%

./332/old.Log.final.out: Uniquely mapped reads % | 76.33%

./333/old.Log.final.out: Uniquely mapped reads % | 76.33%

./334/old.Log.final.out: Uniquely mapped reads % | 76.33%

./335/old.Log.final.out: Uniquely mapped reads % | 76.33%

./336/old.Log.final.out: Uniquely mapped reads % | 76.33%

./337/old.Log.final.out: Uniquely mapped reads % | 3.14%

./338/old.Log.final.out: Uniquely mapped reads % | 76.33%

./339/old.Log.final.out: Uniquely mapped reads % | 76.33%

./340/old.Log.final.out: Uniquely mapped reads % | 76.33%

./341/old.Log.final.out: Uniquely mapped reads % | 3.12%

./342/old.Log.final.out: Uniquely mapped reads % | 76.33%

./343/old.Log.final.out: Uniquely mapped reads % | 3.12%

./344/old.Log.final.out: Uniquely mapped reads % | 76.33%

./345/old.Log.final.out: Uniquely mapped reads % | 76.33%

./346/old.Log.final.out: Uniquely mapped reads % | 76.33%

./347/old.Log.final.out: Uniquely mapped reads % | 76.33%

./348/old.Log.final.out: Uniquely mapped reads % | 3.80%

./349/old.Log.final.out: Uniquely mapped reads % | 3.83%

./350/old.Log.final.out: Uniquely mapped reads % | 76.33%

./351/old.Log.final.out: Uniquely mapped reads % | 76.33%

./352/old.Log.final.out: Uniquely mapped reads % | 76.33%

./353/old.Log.final.out: Uniquely mapped reads % | 3.12%

./354/old.Log.final.out: Uniquely mapped reads % | 3.86%

./355/old.Log.final.out: Uniquely mapped reads % | 76.33%

./356/old.Log.final.out: Uniquely mapped reads % | 76.33%

./357/old.Log.final.out: Uniquely mapped reads % | 3.83%

./358/old.Log.final.out: Uniquely mapped reads % | 76.33%

./359/old.Log.final.out: Uniquely mapped reads % | 76.33%

./360/old.Log.final.out: Uniquely mapped reads % | 76.33%

./361/old.Log.final.out: Uniquely mapped reads % | 3.14%

./362/old.Log.final.out: Uniquely mapped reads % | 76.33%

./363/old.Log.final.out: Uniquely mapped reads % | 3.80%

./364/old.Log.final.out: Uniquely mapped reads % | 76.33%

./365/old.Log.final.out: Uniquely mapped reads % | 76.33%

./366/old.Log.final.out: Uniquely mapped reads % | 76.33%

./367/old.Log.final.out: Uniquely mapped reads % | 76.33%

./368/old.Log.final.out: Uniquely mapped reads % | 3.83%

./369/old.Log.final.out: Uniquely mapped reads % | 76.33%

./370/old.Log.final.out: Uniquely mapped reads % | 76.33%

My pipeline at that time included 2 filtering steps, in which I subsequently mapped against rRNA sequences and a viral data base. Each time i kept the unmapped reads.

Both indices I generated according the the manual guidelines, (meaning i accounted for the indices being small; rRNA --genomeSAindexNbases 7 (79250 bases), and viral --genomeSAindexNbases 9 (895439 bases))

With both or either one of these two filter steps, this occasional drop out happened, while it was not happening, when skipping both filtering steps.

I do believe that the filtering, or the usage of the unmapped read output from STAR is somehow the issue here.

I believe this:

1. because It happens independent from user/environment/clusternode/read quality

2. when i use bowtie for filtering, the pipeline runs stably.

3. when i take the unmapped reads from any filtering step of a pipeline runs which failed, and repeat the genome alignment x times with it, this genome alignment always fails. And conversely, when i take the unmapped output of a run, which apparently ran through, then the genome alignment also runs through every time.

4. i see slight file size variation in the unmapped reads, which is not perfectly, but somehow coinciding with genome mapping dropouts (see below).

./331/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55

./332/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55

./333/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55

./334/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:59

./335/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:59

./336/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:00

./337/old.mate1.viralfiltered.fastq 3.14% 36226334 Nov 16 13:00

./339/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:04

./338/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:04

./340/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:05

./341/old.mate1.viralfiltered.fastq 3.12% 36221098 Nov 16 13:08

./342/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:09

./343/old.mate1.viralfiltered.fastq 3.12% 36221098 Nov 16 13:09

./344/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:11

./346/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15

./347/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15

./345/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15

./348/old.mate1.viralfiltered.fastq 3.80% 32539027 Nov 16 13:17

./351/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:21

./349/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:21

./350/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:21

./352/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:22

./353/old.mate1.viralfiltered.fastq 3.12% 36226334 Nov 16 13:26

./354/old.mate1.viralfiltered.fastq 3.86% 32539027 Nov 16 13:26

./355/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:26

./356/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:27

./357/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:31

./358/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32

./360/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32

./359/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32

./361/old.mate1.viralfiltered.fastq 3.14% 36226334 Nov 16 13:34

./362/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:35

./363/old.mate1.viralfiltered.fastq 3.80% 32539027 Nov 16 13:35

./364/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:35

./365/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38

./366/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38

./367/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38

./368/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:38

./369/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:41

./370/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:41

I attached 2 Log files, of run 337, which dropped out, and 369, which ran through.

369.old.Log.out

337.old.Log.out

Alexander Dobin

unread,

Nov 21, 2016, 11:31:48 AM11/21/16

to rna-star

Hi Hans-Hermann,

it seems like there is a problem with STAR unmapped output in some cases.

Could you please send me the Log.final.out files for the 337 case, both with STAR and bowtie pre-filtering?

Cheers

Alex

hans-herma...@mdc-berlin.de

unread,

Nov 21, 2016, 12:24:46 PM11/21/16

to rna-star

Hi,

The runs 331 - 370 have been done containing both separate filtering steps, as described before.

The runs 831 - 870 have been done by swapping out STAR for bowtie, keeping the rest unchanged.

The same input was used. Technically all these runs would be the requested 337 with bowtie preprocessing.

As a comparison, i also post 369, as i have mentioned posted the Log.out for that one before.

see below

Started job on | Nov 16 13:00:59

Started mapping on | Nov 16 13:02:29

Finished on | Nov 16 13:03:46

Mapping speed, Million of reads per hour | 10.71

Number of input reads | 229159

Average input read length | 130

UNIQUE READS:

Uniquely mapped reads number | 7188

Uniquely mapped reads % | 3.14%

Average mapped length | 121.70

Number of splices: Total | 664

Number of splices: Annotated (sjdb) | 655

Number of splices: GT/AG | 660

Number of splices: GC/AG | 4

Number of splices: AT/AC | 0

Number of splices: Non-canonical | 0

Mismatch rate per base, % | 0.40%

Deletion rate per base | 0.01%

Deletion average length | 1.58

Insertion rate per base | 0.01%

Insertion average length | 1.62

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 5803

% of reads mapped to multiple loci | 2.53%

Number of reads mapped to too many loci | 2667

% of reads mapped to too many loci | 1.16%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.01%

% of reads unmapped: too short | 93.15%

% of reads unmapped: other | 0.00%

337/old.Log.final.out (END)

Started job on | Nov 16 16:32:30

Started mapping on | Nov 16 16:32:31

Finished on | Nov 16 16:33:17

Mapping speed, Million of reads per hour | 16.63

Number of input reads | 212534

Average input read length | 130

UNIQUE READS:

Uniquely mapped reads number | 157796

Uniquely mapped reads % | 74.25%

Average mapped length | 129.23

Number of splices: Total | 27818

Number of splices: Annotated (sjdb) | 27671

Number of splices: GT/AG | 27468

Number of splices: GC/AG | 334

Number of splices: AT/AC | 6

Number of splices: Non-canonical | 10

Mismatch rate per base, % | 0.44%

Deletion rate per base | 0.02%

Deletion average length | 2.00

Insertion rate per base | 0.01%

Insertion average length | 1.78

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 8561

% of reads mapped to multiple loci | 4.03%

Number of reads mapped to too many loci | 33627

% of reads mapped to too many loci | 15.82%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.12%

% of reads unmapped: too short | 5.47%

% of reads unmapped: other | 0.31%

837/old.Log.final.out (END)

Started job on | Nov 16 13:41:23

Started mapping on | Nov 16 13:41:24

Finished on | Nov 16 13:42:06

Mapping speed, Million of reads per hour | 17.65

Number of input reads | 205871

Average input read length | 130

UNIQUE READS:

Uniquely mapped reads number | 157144

Uniquely mapped reads % | 76.33%

Average mapped length | 129.24

Number of splices: Total | 27818

Number of splices: Annotated (sjdb) | 27671

Number of splices: GT/AG | 27468

Number of splices: GC/AG | 334

Number of splices: AT/AC | 6

Number of splices: Non-canonical | 10

Mismatch rate per base, % | 0.43%

Deletion rate per base | 0.02%

Deletion average length | 2.16

Insertion rate per base | 0.01%

Insertion average length | 1.80

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 8485

% of reads mapped to multiple loci | 4.12%

Number of reads mapped to too many loci | 33627

% of reads mapped to too many loci | 16.33%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.13%

% of reads unmapped: too short | 2.77%

% of reads unmapped: other | 0.32%

369/old.Log.final.out (END)

Alexander Dobin

unread,

Nov 21, 2016, 5:27:44 PM11/21/16

to rna-star

Hi Hans-Hermann,

I have been confused about these results being from different samples.

If 331-370 runs all used the same input, then we have two problems.

1. After filtering, the .fastq files that are used for mapping have different sizes, i.e. filtering is not consistent.

2. Even on the same size input, mapping some times yields very different map rates.

For boths cases, the issue is most likely with the output of Unmapped reads.

Before we dig into it, can you try one simple thing - run the whole pipeline with 2.5.2b latest release.

I have fixed some problems with the unmapped reads over time, it could be that the problem has been already fixed.

If this does not help, could you send me the filtered .fastq files for one of failed runs?