occasional low mapping rate

35 views
Skip to first unread message

hans-herma...@mdc-berlin.de

unread,
Nov 15, 2016, 4:26:39 AM11/15/16
to rna-star
Hi,

I am aligning some paired-end stranded RNAseq on a cluster and i realized that i got occasional low mapping rates.

I created test input files and submitted the same jobs over and over to different cluster nodes and saw, that they all finish, but sometime at a really low mapping rate.
(all libraries do in general align and are filtered for quality, this can be excluded)
This happened independent of the nodes and file system i used.


I realized that this coincided with differences in the memory allocation needed as seen from the Log.out file.
Whenever STAR gave low alignment rates, the same runs apparently used only low memory for sorting:

Do you have any idea, what that is?

Thanks
Harm



in general i run it:
#$ -pe smp 4
#$ -l mem_free=40G,h_vmem=40G

Here an exert

./4401_SRR070291/4401_SRR070291.Log.final.out:                        Uniquely mapped reads % | 89.21%

./4408_SRR070286/4408_SRR070286.Log.final.out:                        Uniquely mapped reads % | 90.46%

./4396_SRR111869/4396_SRR111869.Log.final.out:                        Uniquely mapped reads % | 44.12%

./4402_SRR111871/4402_SRR111871.Log.final.out:                        Uniquely mapped reads % | 83.32%

./4396_SRR189833/4396_SRR189833.Log.final.out:                        Uniquely mapped reads % | 84.58%

./4403_SRR070276/4403_SRR070276.Log.final.out:                        Uniquely mapped reads % | 94.84%

./4398_SRR111870/4398_SRR111870.Log.final.out:                        Uniquely mapped reads % | 84.69%

./4410_SRR189834/4410_SRR189834.Log.final.out:                        Uniquely mapped reads % | 44.90%

./4398_SRR070270/4398_SRR070270.Log.final.out:                        Uniquely mapped reads % | 77.30%

./4397_SRR070275/4397_SRR070275.Log.final.out:                        Uniquely mapped reads % | 24.07%

./4404_SRR111875/4404_SRR111875.Log.final.out:                        Uniquely mapped reads % | 2.30%

./4399_SRR070290/4399_SRR070290.Log.final.out:                        Uniquely mapped reads % | 2.28%

./4399_SRR111874/4399_SRR111874.Log.final.out:                        Uniquely mapped reads % | 2.54%



./4401_SRR070291/4401_SRR070291.Log.out:Max memory needed for sorting = 1435182561

./4408_SRR070286/4408_SRR070286.Log.out:Max memory needed for sorting = 281129370

./4396_SRR111869/4396_SRR111869.Log.out:Max memory needed for sorting = 181685402

./4402_SRR111871/4402_SRR111871.Log.out:Max memory needed for sorting = 398699002

./4396_SRR189833/4396_SRR189833.Log.out:Max memory needed for sorting = 1057454380

./4403_SRR070276/4403_SRR070276.Log.out:Max memory needed for sorting = 1208972986

./4398_SRR111870/4398_SRR111870.Log.out:Max memory needed for sorting = 135017297

./4410_SRR189834/4410_SRR189834.Log.out:Max memory needed for sorting = 804864334

./4398_SRR070270/4398_SRR070270.Log.out:Max memory needed for sorting = 562884626

./4397_SRR070275/4397_SRR070275.Log.out:Max memory needed for sorting = 321132447

./4404_SRR111875/4404_SRR111875.Log.out:Max memory needed for sorting = 17019814

./4399_SRR070290/4399_SRR070290.Log.out:Max memory needed for sorting = 25818993

./4399_SRR111874/4399_SRR111874.Log.out:Max memory needed for sorting = 18033328




Another behavior i noticed is, that while alignment rates throughout the job are normally very stable (see example 1), some jobs seem to be relatively unstable (see example 2).


example 1

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped

                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other

Nov 12 02:05:19     86.5     1632947      128    83.4%    128.1     0.4%     4.6%     9.2%     0.1%     2.6%     0.2%

Nov 12 02:06:25    102.1     3801340      128    83.4%    128.1     0.4%     4.6%     9.2%     0.1%     2.5%     0.2%

Nov 12 02:07:31    107.5     5971920      128    83.4%    128.1     0.4%     4.6%     9.2%     0.1%     2.5%     0.2%

Nov 12 02:08:33    107.9     7855040      128    83.3%    128.1     0.4%     4.6%     9.2%     0.1%     2.6%     0.2%

Nov 12 02:10:42     72.4     7862730      128    83.3%    128.1     0.4%     4.6%     9.2%     0.1%     2.5%     0.2%

ALL DONE!

./4402_SRR111871/4402_SRR111871.Log.progress.out (END)



example 2

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped

                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other

Nov 14 17:46:40    112.2     2306504      133    87.0%    133.2     0.3%     4.0%     7.6%     0.1%     1.0%     0.3%

Nov 14 17:47:56    123.3     5136261      133    74.8%    133.2     0.3%     3.6%     6.5%     0.1%    14.8%     0.2%

Nov 14 17:48:58    123.3     7259639      133    53.7%    133.0     0.3%     2.8%     4.7%     0.0%    38.7%     0.2%

Nov 14 17:49:58    124.2     9385364      133    42.0%    132.9     0.3%     2.4%     3.7%     0.0%    51.7%     0.1%

Nov 14 17:51:01    125.5    11676196      133    34.3%    132.7     0.3%     2.1%     3.1%     0.0%    60.5%     0.1%

Nov 14 17:52:05    126.0    13963848      133    29.1%    132.5     0.3%     1.9%     2.6%     0.0%    66.3%     0.1%

Nov 14 17:53:13    126.6    16428849      133    25.1%    132.4     0.3%     1.7%     2.3%     0.0%    70.9%     0.1%

Nov 14 17:57:26     85.9    17189550      133    24.1%    132.3     0.3%     1.7%     2.2%     0.0%    72.0%     0.1%

ALL DONE!

./4397_SRR070275/4397_SRR070275.Log.progress.out (END) 


Alexander Dobin

unread,
Nov 17, 2016, 1:14:23 PM11/17/16
to rna-star
Hi Hans-Hermann,

the amount of RAM required for sorting is proportional to the number of mapped reads, so it would make sense that the low-mapping samples require less RAM for sorting.

The decrease of mapped % in the Log.progress.out file is uncommon. This would indicate that the quality of the sequencing get poorer towards the end of the file.
Are these files coming directly from the Illumina pipeline, or are they pre-sorted in any way? In the former case, is it possible that some "tiles" of the flow cell have poorer quality than others?

Please post the full Log.final.out file - there maybe hints there for the low mapping rate.

Cheers
Alex

hans-herma...@mdc-berlin.de

unread,
Nov 21, 2016, 5:33:18 AM11/21/16
to rna-star
Hi Alex,

by now I can explain the issue a bit better and found away around it.

Actually, I did not post the observation of occasional low mapping. Here an example, in which i just run my pipeline 40 times with always the same 1*10^6 paired-end reads.

$ find ./3[1-9][0-9] -name "*Log.fin*" | xargs grep "Uniquely mapped reads %"

./331/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./332/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./333/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./334/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./335/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./336/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./337/old.Log.final.out:                        Uniquely mapped reads % | 3.14%

./338/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./339/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./340/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./341/old.Log.final.out:                        Uniquely mapped reads % | 3.12%

./342/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./343/old.Log.final.out:                        Uniquely mapped reads % | 3.12%

./344/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./345/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./346/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./347/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./348/old.Log.final.out:                        Uniquely mapped reads % | 3.80%

./349/old.Log.final.out:                        Uniquely mapped reads % | 3.83%

./350/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./351/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./352/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./353/old.Log.final.out:                        Uniquely mapped reads % | 3.12%

./354/old.Log.final.out:                        Uniquely mapped reads % | 3.86%

./355/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./356/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./357/old.Log.final.out:                        Uniquely mapped reads % | 3.83%

./358/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./359/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./360/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./361/old.Log.final.out:                        Uniquely mapped reads % | 3.14%

./362/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./363/old.Log.final.out:                        Uniquely mapped reads % | 3.80%

./364/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./365/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./366/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./367/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./368/old.Log.final.out:                        Uniquely mapped reads % | 3.83%

./369/old.Log.final.out:                        Uniquely mapped reads % | 76.33%

./370/old.Log.final.out:                        Uniquely mapped reads % | 76.33%


My pipeline at that time included 2 filtering steps, in which I subsequently mapped against rRNA sequences and a viral data base. Each time i kept the unmapped reads.
Both indices I generated according the the manual guidelines, (meaning i accounted for the indices being small; rRNA --genomeSAindexNbases 7 (79250 bases), and viral --genomeSAindexNbases 9 (895439 bases))
With both or either one of these two filter steps, this occasional drop out happened, while it was not happening, when skipping both filtering steps.

I do believe that the filtering, or the usage of the unmapped read output from STAR is somehow the issue here.
I believe this:
1. because It happens independent from user/environment/clusternode/read quality 
2. when i use bowtie for filtering, the pipeline runs stably.
3. when i take the unmapped reads from any filtering step of a pipeline runs which failed, and repeat the genome alignment x times with it, this genome alignment always fails. And conversely, when i take the unmapped output of a run, which apparently ran through, then the genome alignment also runs through every time. 
4. i see slight file size variation in the unmapped reads, which is not perfectly, but somehow coinciding with genome mapping dropouts (see below).  
                     
./331/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55
./332/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55
./333/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:55
./334/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:59
./335/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 12:59
./336/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:00
./337/old.mate1.viralfiltered.fastq 3.14% 36226334 Nov 16 13:00
./339/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:04
./338/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:04
./340/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:05
./341/old.mate1.viralfiltered.fastq 3.12% 36221098 Nov 16 13:08
./342/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:09
./343/old.mate1.viralfiltered.fastq 3.12% 36221098 Nov 16 13:09
./344/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:11
./346/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15
./347/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15
./345/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:15
./348/old.mate1.viralfiltered.fastq 3.80% 32539027 Nov 16 13:17
./351/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:21
./349/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:21
./350/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:21
./352/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:22
./353/old.mate1.viralfiltered.fastq 3.12% 36226334 Nov 16 13:26
./354/old.mate1.viralfiltered.fastq 3.86% 32539027 Nov 16 13:26
./355/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:26
./356/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:27
./357/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:31
./358/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32
./360/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32
./359/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:32
./361/old.mate1.viralfiltered.fastq 3.14% 36226334 Nov 16 13:34
./362/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:35
./363/old.mate1.viralfiltered.fastq 3.80% 32539027 Nov 16 13:35
./364/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:35
./365/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38
./366/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38
./367/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:38
./368/old.mate1.viralfiltered.fastq 3.83% 32539027 Nov 16 13:38
./369/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:41
./370/old.mate1.viralfiltered.fastq 76.33% 32539027 Nov 16 13:41



I attached 2 Log files, of run 337, which dropped out, and 369, which ran through.


369.old.Log.out
337.old.Log.out

Alexander Dobin

unread,
Nov 21, 2016, 11:31:48 AM11/21/16
to rna-star
Hi Hans-Hermann,

it seems like there is a problem with STAR unmapped output in some cases.
Could you please send me the Log.final.out files for the 337 case, both with STAR and bowtie pre-filtering?

Cheers
Alex

hans-herma...@mdc-berlin.de

unread,
Nov 21, 2016, 12:24:46 PM11/21/16
to rna-star

Hi,

The runs 331 - 370 have been done containing both separate filtering steps, as described before.
The runs 831 - 870 have been done by swapping out STAR for bowtie, keeping the rest unchanged.
The same input was used. Technically all these runs would be the requested 337 with bowtie preprocessing.
As a comparison, i also post 369, as i have mentioned posted the Log.out for that one before.

see below

                                 Started job on |       Nov 16 13:00:59

                             Started mapping on |       Nov 16 13:02:29

                                    Finished on |       Nov 16 13:03:46

       Mapping speed, Million of reads per hour |       10.71


                          Number of input reads |       229159

                      Average input read length |       130

                                    UNIQUE READS:

                   Uniquely mapped reads number |       7188

                        Uniquely mapped reads % |       3.14%

                          Average mapped length |       121.70

                       Number of splices: Total |       664

            Number of splices: Annotated (sjdb) |       655

                       Number of splices: GT/AG |       660

                       Number of splices: GC/AG |       4

                       Number of splices: AT/AC |       0

               Number of splices: Non-canonical |       0

                      Mismatch rate per base, % |       0.40%

                         Deletion rate per base |       0.01%

                        Deletion average length |       1.58

                        Insertion rate per base |       0.01%

                       Insertion average length |       1.62

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       5803

             % of reads mapped to multiple loci |       2.53%

        Number of reads mapped to too many loci |       2667

             % of reads mapped to too many loci |       1.16%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches |       0.01%

                 % of reads unmapped: too short |       93.15%

                     % of reads unmapped: other |       0.00%

337/old.Log.final.out (END) 


                                Started job on |       Nov 16 16:32:30

                             Started mapping on |       Nov 16 16:32:31

                                    Finished on |       Nov 16 16:33:17

       Mapping speed, Million of reads per hour |       16.63


                          Number of input reads |       212534

                      Average input read length |       130

                                    UNIQUE READS:

                   Uniquely mapped reads number |       157796

                        Uniquely mapped reads % |       74.25%

                          Average mapped length |       129.23

                       Number of splices: Total |       27818

            Number of splices: Annotated (sjdb) |       27671

                       Number of splices: GT/AG |       27468

                       Number of splices: GC/AG |       334

                       Number of splices: AT/AC |       6

               Number of splices: Non-canonical |       10

                      Mismatch rate per base, % |       0.44%

                         Deletion rate per base |       0.02%

                        Deletion average length |       2.00

                        Insertion rate per base |       0.01%

                       Insertion average length |       1.78

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       8561

             % of reads mapped to multiple loci |       4.03%

        Number of reads mapped to too many loci |       33627

             % of reads mapped to too many loci |       15.82%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches |       0.12%

                 % of reads unmapped: too short |       5.47%

                     % of reads unmapped: other |       0.31%

837/old.Log.final.out (END) 


 

                                 Started job on |       Nov 16 13:41:23

                             Started mapping on |       Nov 16 13:41:24

                                    Finished on |       Nov 16 13:42:06

       Mapping speed, Million of reads per hour |       17.65


                          Number of input reads |       205871

                      Average input read length |       130

                                    UNIQUE READS:

                   Uniquely mapped reads number |       157144

                        Uniquely mapped reads % |       76.33%

                          Average mapped length |       129.24

                       Number of splices: Total |       27818

            Number of splices: Annotated (sjdb) |       27671

                       Number of splices: GT/AG |       27468

                       Number of splices: GC/AG |       334

                       Number of splices: AT/AC |       6

               Number of splices: Non-canonical |       10

                      Mismatch rate per base, % |       0.43%

                         Deletion rate per base |       0.02%

                        Deletion average length |       2.16

                        Insertion rate per base |       0.01%

                       Insertion average length |       1.80

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       8485

             % of reads mapped to multiple loci |       4.12%

        Number of reads mapped to too many loci |       33627

             % of reads mapped to too many loci |       16.33%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches |       0.13%

                 % of reads unmapped: too short |       2.77%

                     % of reads unmapped: other |       0.32%

369/old.Log.final.out (END) 

Alexander Dobin

unread,
Nov 21, 2016, 5:27:44 PM11/21/16
to rna-star
Hi Hans-Hermann,

I have been confused about these results being from different samples.
If 331-370 runs all used the same input, then we have two problems.
1. After filtering, the .fastq files that are used for mapping have different sizes, i.e. filtering is not consistent.
2. Even on the same size input, mapping some times yields very different map rates.

For boths cases, the issue is most likely with the output of Unmapped reads.
Before we dig into it, can you try one simple thing - run the whole pipeline with 2.5.2b latest release.
I have fixed some problems with the unmapped reads over time, it could be that the problem has been already fixed.

If this does not help, could you send me the filtered .fastq files for one of failed runs?

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages