Seeding comprehension & relevance of soft-clipping in a context of Ribosome Profiling

325 views
Skip to first unread message

Alexandra Bomane

unread,
Mar 25, 2016, 3:35:19 AM3/25/16
to rna-star
Hi Alexander,

I work on Ribosome Profiling data of Human (Ensembl 83) and I use STAR v.2.4.0k to map these data to the reference genome.
The average length of my reads is 30 b and I index the genome with an overhang of 28.

I want to deal with 2 main topics : seeding & soft-clipping.

1) Seeding

1.a) I don't really understand how the seeding used by STAR works. Indeed, here a practical case a little bit confusing :

When I use parameters --outSAMunmapped Within --outFilterMismatchNoverLmax 0.06 --quantMode TranscriptomeSAM and let the default parameter --seedSearchStartLmax 50, I get this quite good alignment :

NS500801:5:H7YMFBGXX:1:13211:5985:10031    0    3    172525696    255    19M7679N11M    *    0    0    CATTTGTAAAGCAGAAGCTGAGAATCTTAT    FAF7F...<A)F7FF7FFF)F.AFFFA7.F    NH:i:1    HI:i:1    AS:i:27    nM:i:1

When I use same parameters except --seedSearchStartLmax 15 (as recommended in http://daryavanichkina.com/wikipages/ngs/star.html : "At the mapping step, you can increase sensitivity with –seedSearchStartLmax 20, or even 15."), I get this alignment for the same read :

NS500801:5:H7YMFBGXX:1:13211:5985:10031 4       *       0       0       *       *       0       0       CATTTGTAAAGCAGAAGCTGAGAATCTTAT  FAF7F...<A)F7FF7FFF)F.AFFFA7.F  NH:i:0  HI:i:0  AS:i:18 nM:i:0  uT:A:1

I check the 1st alignment in Ensembl database (http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=vega;g=OTTHUMG00000156762;r=3:172807706-172820311;t=OTTHUMT00000346165) and it match quasi-perfectly : 19 matches on first exon with one mismatch at position 16 + gap of 7679 for intron + 11 perfect-matches on the following exon.

So, I don't understand why with --seedSearchStartLmax 15 this alignment is set as unmapped and more precisely I don't understand what you mean by decrease the seed to increase the sensitivity because of this odd example.

I wonder if there is a maximal error rate in the seed not indicated in the documentation ?

1.b) In another post (https://groups.google.com/forum/#!searchin/rna-star/ribosome$20profiling/rna-star/-cqI7uVse7U/TLNwnfYxCQAJ : "Recommended settings for mapping ribosome profiling reads"), you recommend --seedSearchStartLmax 13 & --winAnchorMultimapNmax 200 parameters.

Is --seedSearchStartLmax 13 long enough to avoid False Positive alignments (I wonder if 18 could be sensitive enough and avoid FP) ?
Could the number of multi-reads increases with --winAnchorMultimapNmax 200 ? I don't really understand this option.

2) Soft-clipping

I guess soft-clipping optimizes alignments to get better scores but in some cases, I have 11 bases out of 30 that are soft-clipped and it is not a very good situation.

But keeping soft-clipping is widely recommended by the community, whereas I think disable it could be more relevant in the context of Ribosome Profilig (knowing that reads belong to very short fragments binded to ribosomes), even if the number of unmapped reads increase highly because of "too many mismatches".

So, I want to know what could be the drawback(s) of disabling soft-clipping (knowing that I use annotations systematically to index genome) in my context ? Is there a way to limit it (minimize the number of bases soft-clipped regarding the length of processed read) ?

In a post (https://www.biostars.org/p/62544/) someone recommend to "write a script that filters reads with a lower ratio of number of bases aligned to the total length of the read". I want to try this way setting the ratio at 0.85.

Someone else advised me to disable soft-clipping BUT trimming the 3 last bases (3' end) of each read before in order to avoid problems because of bad quality of sequencing at 3' end. Do you think that it could be a good idea ?


Thanks in advance for your help ! ;)
Alexandra

Alexander Dobin

unread,
Mar 25, 2016, 7:00:16 AM3/25/16
to rna-star
Hi Alexandra,

1). Seeding
Reducing --seedSearchStartLmax should increase overall sensitivity, i.e. you should see more reads mapped and less soft-clipping.
However, it's not guaranteed for every read - for a few reads it will result in a poorer mapping. 
I will look more closely at your example to see if there are other parameters that can be tweaked to make it mappable.

Seeding works in the following way. The read is split into pieces no longer than --seedSearchStartLmax.
Each piece defines a "starting" point, from which a maximum mappable prefix is found.

2). Soft-clipping
Soft clipping allows to gracefully report alignments which have unmappable ends, which could be caused by
(i) poor sequencing quality at the ends; (ii) end modifications (e.g. A-tails  adapters, etc); (iii) short overhangs of unannotated splice junctions.
If EndToEnd option is chosen, the alignment will be forced to extend to the very ends causing a lot of mismatches.
If there is a similar locus, it may result in false alignments - this often creates bias towards pseudogenes for case (iii).

You can reduce the amount of allowed soft-clipping by setting --outFilterScoreMinOverLread and --outFilterMatchNminOverLread
which define the minimum alignment score and number of matched bases normalized to read length. By default they are both 0.66
If you set --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.85, it will "filter reads with a lower ratio of number of bases aligned to the total length of the read", do no need to write a script.

Cheers
Alex

Alexandra Bomane

unread,
Mar 25, 2016, 8:20:27 AM3/25/16
to rna-star
Hi Alexander,

Thank you very much for your quick reply.

I'll try --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.85 as you said and wait for your recommendations about the seeding.
Indeed, when I decrease the seed length, I get more mapped reads. However, how can I be confident in the fact that these "new" mapped reads are not False Positive ?

When you say "each piece defines a "starting" point", do you mean that the first base of the seed is the "starting" point and that the maxilmum mappable prefix is found until the last base of the seed ?

Cheers,
Alexandra

Alexandra Bomane

unread,
Mar 31, 2016, 11:38:21 AM3/31/16
to rna-...@googlegroups.com
Hi Alexander,

I forgot to mention that with --seedSearchStartLmax 5, I found :


NS500801:5:H7YMFBGXX:1:13211:5985:10031    0    3    172525696    255    19M7679N11M    *    0    0    CATTTGTAAAGCAGAAGCTGAGAATCTTAT    FAF7F...<A)F7FF7FFF)F.AFFFA7.F    NH:i:1    HI:i:1    AS:i:27    nM:i:1

But I think that a seed of 5 is maybe too small to be confident in the result.

Cheers,
Alexandra

Alexander Dobin

unread,
Mar 31, 2016, 12:53:53 PM3/31/16
to rna-star
Hi Alexandra,

the first base of each "piece" is a starting point for seed search, however, the length of the seed is not limited by the piece length - the seeds have "maximum mappable" lengths.
This is why reducing the piece length does not always lead to increase in sensitivity - some of the starting points may yield seeds mapped to wrong positions on the genome.

Cheers
Alex

Alexandra Bomane

unread,
Apr 6, 2016, 11:55:18 AM4/6/16
to rna-star
Hi Alexander,

1) Soft-clipping issue

I have a little problem with parameters --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.85 that you recommended. Indeed, using these parameters, I get this unexpected result :

NS500801:5:H7YMFBGXX:1:22112:14031:11218    0    19    49962311    255    10M286254N16M6S    *    0    0    CCCCGGCGTCGCCCCCGAGCAGCCGCGGCAGC    FFFF<.FFFFF)FF)FAFFF.F).<FAFFFFA    NH:i:1    HI:i:1    AS:i:21    nM:i:0

Calculating the ratio of matched bases to total read length, we have : (10+16)/32 = 26/32 = 0,8125. Besides, to my mind AS:i:21 is not a very satisfying score. I have others reads in the same case (I have not observed very extreme cases for the moment, i.e 50% soft-clipped bases in a read, but I want to take precautions).

I wonder if the problem is caused by the parameter --outFilterScoreMinOverLread 0 allowing low-score alignments to pass.

Another alignement with the same situation :

NS500801:5:H7YMFBGXX:2:12311:24047:1969 16      17      33478276        255     27M5S   *       0       0       GGCCTCGATCAGAAGGACTTGGCCCCCCACGG        FAFFFFFF)FFFFF7FFFFFFFFFFFFFFFF7        NH:i:1  HI:i:1  AS:i:24 nM:i:1

Here, the ratio is : 27/32 = 0,84375 which is not so far of my threshold (0.85). So, I would like to change --outFilterScoreMinOverLread 0 to --outFilterScoreMinOverLread 0.76. Indeed, here the ratio of score to read length is : 24/32 = 0.75. I'll keep the parameter --outFilterMatchNminOverLread 0.85.

Do you think this solution could solve my problem or have you another recommendation ?

2) About seeding


I did some test for seeding decreasing gradully the parameter --seedSearchStartLmax from 27 to 16 (to avoid the case of my first post).

Here the logs with --seedSearchStartLmax 27 (complete command line : STAR --runThreadN 12 --genomeDir /root --readFilesIn Sample2.fastq --seedSearchStartLmax 27 --outFilterMismatchNoverLmax 0.06 --outSAMunmapped Within --quantMode TranscriptomeSAM) :

                                 Started job on |       Mar 15 11:02:27
                             Started mapping on |       Mar 15 11:05:12
                                    Finished on |       Mar 15 11:05:50
       Mapping speed, Million of reads per hour |       275.50

                          Number of input reads |       2908048
                      Average input read length |       30
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1293943
                        Uniquely mapped reads % |       44.50%
                          Average mapped length |       28.65
                       Number of splices: Total |       156416
            Number of splices: Annotated (sjdb) |       148748
                       Number of splices: GT/AG |       153132
                       Number of splices: GC/AG |       1780
                       Number of splices: AT/AC |       170
               Number of splices: Non-canonical |       1334
                      Mismatch rate per base, % |       1.14%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.42
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.04
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       515794
             % of reads mapped to multiple loci |       17.74%
        Number of reads mapped to too many loci |       109340
             % of reads mapped to too many loci |       3.76%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.48%
                 % of reads unmapped: too short |       33.50%
                     % of reads unmapped: other |       0.03%

Here the logs with --seedSearchStartLmax 16 (complete command line : STAR --runThreadN 12 --genomeDir /root --readFilesIn Sample2.fastq  --outFilterMismatchNoverLmax 0.06 --seedSearchStartLmax 16 --outSAMunmapped Within --quantMode TranscriptomeSAM) :

Started job on |       Apr 04 14:11:02
                             Started mapping on |       Apr 04 14:11:45
                                    Finished on |       Apr 04 14:13:09
       Mapping speed, Million of reads per hour |       124.63

                          Number of input reads |       2908048
                      Average input read length |       30
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1307393
                        Uniquely mapped reads % |       44.96%
                          Average mapped length |       28.65
                       Number of splices: Total |       160987
            Number of splices: Annotated (sjdb) |       152000
                       Number of splices: GT/AG |       157566
                       Number of splices: GC/AG |       1863
                       Number of splices: AT/AC |       172
               Number of splices: Non-canonical |       1386
                      Mismatch rate per base, % |       1.16%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.40
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.06
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       524670
             % of reads mapped to multiple loci |       18.04%
        Number of reads mapped to too many loci |       110398
             % of reads mapped to too many loci |       3.80%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.57%
                 % of reads unmapped: too short |       32.61%
                     % of reads unmapped: other |       0.03%

We can see that I have more mapped reads, so I wanted to keep the parameter --seedSearchStartLmax 16 allowing a better sensitivity and setting a seed as big as possible to optimize the mapping.

3) About winAnchorMultimapNmax

I test --winAnchorMultimapNmax 200 as you recommented in another post and I get theses logs (complete command line : STAR --runThreadN 12 --genomeDir /root --readFilesIn Sample2.fastq --outSAMunmapped Within --outFilterMismatchNoverLmax 0.06 --quantMode TranscriptomeSAM --seedSearchStartLmax 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.85 --winAnchorMultimapNmax 200 --outFilterMultimapNmax 50) :

Started job on |       Apr 04 15:05:11
                             Started mapping on |       Apr 04 15:06:15
                                    Finished on |       Apr 04 15:08:19
       Mapping speed, Million of reads per hour |       84.43

                          Number of input reads |       2908048
                      Average input read length |       30
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1183239
                        Uniquely mapped reads % |       40.69%
                          Average mapped length |       29.37
                       Number of splices: Total |       164066
            Number of splices: Annotated (sjdb) |       148936
                       Number of splices: GT/AG |       159316
                       Number of splices: GC/AG |       1954
                       Number of splices: AT/AC |       175
               Number of splices: Non-canonical |       2621
                      Mismatch rate per base, % |       1.11%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.38
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.05
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       542136
             % of reads mapped to multiple loci |       18.64%
        Number of reads mapped to too many loci |       233
             % of reads mapped to too many loci |       0.01%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       1.67%
                 % of reads unmapped: too short |       38.96%
                     % of reads unmapped: other |       0.03%

I tried to increased this parameter to --winAnchorMultimapNmax 1000 to see if I can increase my mappability and I use --outFilterMultimapScoreRange 0 to keep only very best alignments (complete command line : STAR --runThreadN 12 --genomeDir /root --readFilesIn S2_TestSizeSelectionUntrimmed_norRNA.fastq --outSAMunmapped Within --outFilterMismatchNoverLmax 0.06 --quantMode TranscriptomeSAM --seedSearchStartLmax 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.85 --winAnchorMultimapNmax 1000 --outFilterMultimapScoreRange 0) :

Started job on |       Apr 06 13:52:58
                             Started mapping on |       Apr 06 13:56:29
                                    Finished on |       Apr 06 13:58:43
       Mapping speed, Million of reads per hour |       78.13

                          Number of input reads |       2908048
                      Average input read length |       30
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1279328
                        Uniquely mapped reads % |       43.99%
                          Average mapped length |       29.33
                       Number of splices: Total |       213710
            Number of splices: Annotated (sjdb) |       187031
                       Number of splices: GT/AG |       204531
                       Number of splices: GC/AG |       2778
                       Number of splices: AT/AC |       245
               Number of splices: Non-canonical |       6156
                      Mismatch rate per base, % |       1.12%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.46
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.08
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       371614
             % of reads mapped to multiple loci |       12.78%
        Number of reads mapped to too many loci |       88472
             % of reads mapped to too many loci |       3.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       3.25%
                 % of reads unmapped: too short |       36.80%
                     % of reads unmapped: other |       0.14%

We can see that my mappibility increased and I want to know how to know the limit of this parameter to increase my mappability avoiding pervasive hits.

Great thanks in advance and thanks for your great tool ! :)
Alexandra

Alexander Dobin

unread,
Apr 7, 2016, 12:30:09 PM4/7/16
to rna-star
Hi Alexandra,

thanks a lot for the interesting tests and observations.

1) Soft-clipping issue
This is actually an unintended behavior in my code (a.k.a a bug).
The actual filter for single-end alignments:
the alignment is "too short"  if nMatchedBases < int(outFilterMatchNminOverLread*(Lread-1))
For your 1st read, int(outFilterMatchNminOverLread*(Lread-1))=int(0.85*(32-1))=26 which is =26=nMatchedBases, so the alignment passes the filter.
For your 2nd read, the number of matched bases is also 26, since there is one mismatch, and it passes the filter.

For PE reads, the formula is nMatchedBases < int(outFilterMatchNminOverLread*Lread)
This is an unfortunate historical issue in the code. which I do not want to fix presently since it will "silently" change the results and make them incompatible with older versions.
Basically, if you need more stringent alignments, you would have to increase --outFilterMatchNminOverLread slightly.

In general, using controlling the minimum score --outFilterScoreMinOverLread is more robust than controlling the number of matched bases.
For instance, your first read has a lower score than the 2nd (21 vs 24), because it has a very large non-canonical gap.

2) About seeding
3) About winAnchorMultimapNmax

The results with reducing --seedSearchStartLmax and increasing --winAnchorMultimapNmax both make sense showing increase in sensitivity.
Reducing --seedSearchStartLmax allows for more "dense" (in the read sequence) search of seeds.
Increasing --winAnchorMultimapNmax allows more (and shorter) seeds to be tried as "anchors", i.e. explore more genomic loci for potential alignments.
Note that in both cases the mapping speed decreases, this is the main limitation for further gains of sensitivity.
Also, at some point the storage allocated for seeds and windows may need to be increased (see parameters that control it below).

There is yet another parameter that you can try to tweak to increase sensitivity
--seedMultimapNmax (=10000 by default). It defines the max number of loci non-anchor seeds can map to. It will affect catching short overhangs of junctions and indels. 
In my hands it never gave a big sensitivity boost, and it could also lead to an increase of false junctions with large gaps. 
If you try to increase it, do it on an exponential scale, say by a factor of 10 and 100.

Cheers
Alex



alignWindowsPerReadNmax     10000
    int>0: max number of windows per read

alignTranscriptsPerWindowNmax       100
    int>0: max number of transcripts per window

alignTranscriptsPerReadNmax               10000
    int>0: max number of different alignments per read to consider

seedPerReadNmax       1000
    int>0: max number of seeds per read

seedPerWindowNmax     50
    int>0: max number of seeds per window

seedNoneLociPerWindow    10
    int>0: max number of one seed loci per window
Reply all
Reply to author
Forward
0 new messages