STAR for miRNA

9,214 views
Skip to first unread message

praful aggarwal

unread,
Mar 1, 2013, 1:48:58 PM3/1/13
to rna-...@googlegroups.com
Hi Alex,

Have you tried the alignment of miRNA-sequencing reads with STAR? I know these will be short reads (~25bp long) and splicing alignment may not be needed, but I just want to check if you have tried it and also what your thoughts/suggestions would be in trying miRNA-seq alignment with STAR. 

Thanks,
Praful

Alexander Dobin

unread,
Mar 1, 2013, 5:13:49 PM3/1/13
to rna-...@googlegroups.com
Hi Praful,

we are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project - the miRNA (mostly mature) are a major subclass of these small RNA.
We are using STAR with the following parameters:
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

You can clip 3' adapter before feeding the reads to STAR, or you can use simple built-in clipper
--clip3pAdapterSeq TGGAATTCTC      --clip3pAdapterMMp 0.1
(second parameter is the proportion of mismatches in the matched adapter length).

You would also likely want to filter out reads that STAR "genomically" trims at the 5' (see the discussion about "Soft clipping" here).
This simple awk script will filter out all alignments that are trimmed by more than 1 base from the 5'.
awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam

Cheers
Alex

praful aggarwal

unread,
Mar 4, 2013, 3:02:24 PM3/4/13
to rna-...@googlegroups.com
Alex,

This is some really useful information. I really appreciate it. 

Thanks,
Praful

German Leparc

unread,
Apr 3, 2014, 9:38:55 AM4/3/14
to

Hi Alex,

I have been getting good results with STAR and miRNA sequences. I have compared the STAR read alignment counts to bowtie read alignment counts and see very high correlations between the numbers of mapped reads per miRNA (bowtie is the most often used aligner in miRNA pipelines, for example in ncPRO-seq which I am testing). Interestingly, the bowtie approach is very susceptible to how good the trimming is done. If your trimmer misses adaptors, you will have a signficant drop in reads because there is no soft clipping. STAR seems more immune to this problem due to its softclipping feature...

You mention filtering of softclipped reads - what would the reason be for removing softclipped reads? Wouldn't you want to keep them, if say, for example your adapter trimming is not always reliable? Or perhaps you are downloading public data and have no clue what the adaptors could be? 

Just wondering, because I am liking the results. I will try a run of STAR without pre-trimming the fastqs and see how it performs compared to trimmed. I expect to see little difference.. let's see

Alexander Dobin

unread,
Apr 4, 2014, 6:51:35 PM4/4/14
to rna-...@googlegroups.com
Hi German,

the soft-clipped bases in this case might be coming from the untrimmed adapter, but could also be sequencing errors - in the latter case it will be unclear where the precise 3' of your miRNA is. The same is true for the trimming STAR may do at the 5'.
If you do not care about precise positioning of the 5'/3', then filtering of the soft-clipped reads is not needed. The latest patches of STAR have the --alignEndsType EndToEnd option that will force end-to-end alignments, making it similar to bowtie.

I would generally not recommend aligning without trimming. Without trimming you will probably see almost the same mapping rate, and even higher proportion of unique alignments. However, the latter will be mapping artifact: some reads that should be multi-mappers after adapter trimming, become unique mappers because of the few untrimmed adapter bases added at the read 3'.

Cheers
Alex


On Thursday, April 3, 2014 9:34:51 AM UTC-4, German Leparc wrote:

Hi Alex,

I have been getting good results with STAR and miRNA sequences. I have compared the STAR read alignment counts to bowtie read alignment counts and see very high correlations between the numbers of mapped reads per miRNA (bowtie is the most often used aligner in miRNA pipelines, for example in ncPRO-seq which I am testing). Interestingly, the bowtie approach is very susceptible to how good the trimming is done. If your trimmer misses adaptors, you will have a signficant drop in reads because there is no soft clipping. STAR seems more immune to this problem due to its softclipping feature...

You mention filtering of softclipped reads - what would the reason be for removing softclipped reads? Wouldn't you want to keep them, if say, for example your adapter trimming is not always reliable? Or perhaps you are downloading public data and have no clue what the adaptors could be? 

Just wondering, because I am liking the results. I will try a run of STAR without pre-trimming the fastqs and see how it performs compared to trimmed. I expect to see little difference.. let's see




On Friday, March 1, 2013 11:13:49 PM UTC+1, Alexander Dobin wrote:

anthony

unread,
Apr 4, 2014, 8:01:44 PM4/4/14
to Alexander Dobin, rna-...@googlegroups.com
hello.  does soft clipping take care of adapter contamination post trimming?  i am having a problem where after running cutadapt, i still have adapter in the insert region, so is there an option from star that can remove the missed sequences after trimming?

"creatio ex nihilio"- Leibniz

--
You received this message because you are subscribed to the Google Groups "rna-star" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.

Alexander Dobin

unread,
Apr 8, 2014, 11:49:48 AM4/8/14
to rna-...@googlegroups.com, Alexander Dobin
Hi Anthony,

in principle, soft-clipping takes care of adapter contamination. Each read will be extended as much as possible to match the genome, and the unmapped portion of the read, containing the adapter, will be soft-clipped.
As I mentioned above, the danger is that this extension may include a few first bases of the adapter that by chance match the genome bases in some loci.

Cheers
Alex
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+unsubscribe@googlegroups.com.

Erik Aronesty

unread,
Aug 18, 2014, 9:54:04 AM8/18/14
to rna-...@googlegroups.com
I use fastq-mcf for miRNA, with -p 30... i'd recommend it.

The reason is that it's better to have "too much" removed for miRNA than too little.   

Also when using small RNA, you're better off simply counting sequences with 1 mismatch, rather than aligning.. .this can be done with a hash table (we use mirna-quant... which is just google's sparsehash).   The reason is that there's very little information in small RNA for alignment, but an accurate count will produce a lot of good signal for differential expression.

miRNA's are duplicated in the genome heavily, there are many misleading homologs, and there's only 15 bases in smaller ones.  At 15 bases, alignment is certainly not the right tool.


On Thursday, April 10, 2014 11:54:30 AM UTC-4, German Leparc wrote:
Hi Alex, I can confirm that your warnings are correct - the softclipping will allow the untrimmed adapter sequence to align to the genome if there is enough sequence similarity, resulting in more "unique" alignments.

If the trimming is done well (i've tried cutadapt, alientrimmer, and reaper from kraken tools - so far the best one) then the results are comparable between STAR and bowtie. Bowtie is quite common in miRNA pipelines.. 

If the trimming is not so accurate (say it missed one or two bases of the adapter), STAR seems to be a little more robust in these cases, as when I've used Cutadapt or Alientrimmer.

Thanks Alex!

To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.

André Santos

unread,
Mar 20, 2015, 10:40:53 AM3/20/15
to rna-...@googlegroups.com
Hi,

I am also trying to use STAR to align my miRNA data. How did you build your genome? Did you pass mirbase GFF file? Or you build without it?
BTW can you only pass the GTF reference on the alignment?

I just starting with STAR and your suggestions will help a lot.

Thank you,
André

Alexander Dobin

unread,
Mar 24, 2015, 3:32:44 PM3/24/15
to rna-...@googlegroups.com
Hi André,

for mapping miRNAs (or any other unspliced RNAs) you do not need the GTF file, since STAR only uses splice junctions information for mapping.
Moreover, at the mapping stage you may want to prohibit splicing with --alignIntronMax 1

Cheers
Alex

German Leparc

unread,
Apr 10, 2014, 11:56:43 AM4/10/14
to
Hi Alex, I can confirm that your warnings are correct - the softclipping will allow the untrimmed adapter sequence to align to the genome if there is enough sequence similarity, resulting in more "unique" alignments.

If the trimming is done well (i've tried cutadapt, alientrimmer, and reaper from kraken tools - so far the best one) then the results are comparable between STAR and bowtie. Bowtie is quite common in miRNA pipelines.. 

If the trimming is not so accurate (say it missed one or two bases of the adapter), STAR seems to be a little more robust in these cases, as when I've used Cutadapt or Alientrimmer.

Thanks Alex!


On Tuesday, April 8, 2014 5:49:48 PM UTC+2, Alexander Dobin wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.
star_vs_bowtie.png

Jake Freimer

unread,
Dec 10, 2015, 8:06:42 PM12/10/15
to rna-star
Hi Alex,
I was just wondering why you recommend changing --outFilterScoreMinOverLread  --outFilterMatchNminOverLread to 0 for small RNAs compared to leaving them as default? Thanks

Alexander Dobin

unread,
Dec 11, 2015, 3:56:18 PM12/11/15
to rna-star
Hi Jake,

I recommend these parameters set to 0 to switch these filters off, but this is done in conjunction with --outFilterMatchNmin 16 which limits the mapped length to 16b.
All the --outFilter* filters work together in the AND fashion, so if you want to use just one of the filters, you need to switch off the other filters.

Cheers
Alex

Alexander Predeus

unread,
Jul 2, 2016, 11:04:40 AM7/2/16
to rna-star
Hello Alex,

I'm trying to make sense of micro-RNA alignment and so I started playing with some datasets I've found in GEO.

The one I used is from mouse: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=uhepescyjhqbjuf&acc=GSM1954400

These are 51 bp reads, with some adapters appearing starting with 21 bp along the read, according to FastQC.

So I've decided to try the built-in trimming together with the recommended settings and ran the alignment like this:

STAR --genomeDir $REF --readFilesIn $READS --runThreadN 4 --readFilesCommand zcat --clip5pNbases 30 --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1  --outSAMstrandField All --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM

And got about 43% uniquely aligned reads (which was great since default STAR settings wo clipping gave me about 2%).

But then I ran the same command with trimming at 3p and got even more (51%) unique aligners!

What am I missing here?

Thank you in advance

-- Alex Predeus


Alexander Dobin

unread,
Jul 6, 2016, 4:31:40 PM7/6/16
to rna-star
Hi Alex,

this is strange indeed - with this command you were trimming 30b from the 5' end of the reads, which should not have worked, since the adapters are supposed to appear on the 3' end.
If you run with your parameters but without any clipping, you should get the maximum for the uniquely mapped reads, since STAR will "trim" as needed if they have pieces that do not map
(provided that you use --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 which turns off the filter for the min mapped length).
If this does not agree with you observations, please post the parameters and Log.final.out results for each run.

I recommend trimming the actual adapter for the small RNA-seq - you can do this with 
--clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1 where first parameter is the adapter sequence, and second is the proportion of mismatches allowed in the adapter sequence.



Cheers
Alex

Alexander Predeus

unread,
Aug 22, 2016, 5:49:31 PM8/22/16
to rna-star
Alex,

thank you for the advice and sorry for such slow response. No clipping combined with no filtering for minimum mapped length does provide the highest percent of the unique matches, and average mapped length of 24-25, which does seem right for a miRNA library.

I guess it will remain a mystery why does it still wirk with the 5'-end clipping...

As for the trimming the actual adapter: is it possible to elucidate it from the data if you did not produce it? I'm trying FastQC but it doesn't seem very useful here, and I'm not sure why. Could it be possible that adapter sequence does not start at the same position in every read?

cheers & thank you for your help again!

-- Alex

Alexander Dobin

unread,
Aug 24, 2016, 5:26:15 PM8/24/16
to rna-star
Hi Alex,

if the sequencing quality is good, you should be able to see exactly the same sequence at the 3' ends. However, the start position of the adapter depends on the length of the small RNA molecule, so it would vary.
To figure out the adapter sequence:
1. Take only reads that map to + strand
2. Cut their sequences at 5' at the number of bases that are mapped by STAR (i.e.by n where <n>M is in the CIGAR).
3. Cut the first 20 bases from 5', sort them and look at the most frequent sequence - this should give you the first 20b of the 3' adapter.

Cheers
Alex

German Leparc

unread,
Dec 6, 2016, 10:44:16 AM12/6/16
to rna-star


On Friday, March 1, 2013 at 11:13:49 PM UTC+1, Alexander Dobin wrote:
 
This simple awk script will filter out all alignments that are trimmed by more than 1 base from the 5'.
awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam


Dear Alex, 

a colleague and I are trying to figure out what the and() operator is doing in your awk script  e.g.:  and($2,0x10)
i know this is a bitwise operator, but I'm having trouble with understanding what is being held in the $2 variable and what the 0x10 (decimal 16?) is for. I don't want to waste your time with a lesson in awk, but could you please give me a hint of what it being checked at this step?


Thank you for any help!

Kind Regards,
German


Alexander Dobin

unread,
Dec 6, 2016, 11:43:52 AM12/6/16
to rna-star
Hi German,

this is indeed bitwise operator that checks the 0x10 bit of the SAM field 2, which is the strand of the alignment. If >0, the 0x10 bit is set, and the strand is negative, so the 5' is on the right hand side.

Cheers
Alex

German Leparc

unread,
Dec 8, 2016, 12:08:01 PM12/8/16
to rna-star
Thank you Alex, that helps a lot! 

D Prat

unread,
Jan 18, 2017, 1:18:58 PM1/18/17
to rna-star
Hi Alex, 

i'm working on miRNA data, after the trimming of the adapters with Cutadapt and of the duplicated sequence with Prinseq i used STAR with the parameters you recommended but i still got a lot of reads unmapped because they are too short (~14%) is this normal ?

this is the command line i use :

STAR --runThreadN 16 --runMode alignReads --genomeDir genome_directory_GTF --readFilesIn ... --outFileNamePrefix ... --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1 --outMultimapperOrder Random --twopassMode Basic

and this is the report i got:

                                 Started job on |       Jan 17 11:58:05
                             Started mapping on |       Jan 17 12:00:11
                                    Finished on |       Jan 17 12:00:46
       Mapping speed, Million of reads per hour |       389.19
                          Number of input reads |       3783764
                      Average input read length |       46
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1142926
                        Uniquely mapped reads % |       30.21%
                          Average mapped length |       35.06
                       Number of splices: Total |       23479
            Number of splices: Annotated (sjdb) |       23479
                       Number of splices: GT/AG |       23266
                       Number of splices: GC/AG |       182
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       31
                      Mismatch rate per base, % |       1.31%
                         Deletion rate per base |       0.19%
                        Deletion average length |       1.00
                        Insertion rate per base |       0.20%
                       Insertion average length |       1.13
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1579235
             % of reads mapped to multiple loci |       41.74%
        Number of reads mapped to too many loci |       179751
             % of reads mapped to too many loci |       4.75%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.35%
                 % of reads unmapped: too short |       14.36%
                     % of reads unmapped: other |       8.59%

Thx for your help !

Alexander Dobin

unread,
Jan 18, 2017, 4:29:33 PM1/18/17
to rna-star
Hi David,

14% unmapped "too short" does sound too bad for small RNA. :)
You could investigate why there reads were unmapped:1
1. How many reads were trimmed below 16nt before mapping? These will be considered "too short" right away.
2. How many reads were not trimmed enough (e.g. due to errors int the adapter sequence)? The average length of the reads after trimming (as STAR inputs the reads) is 46, which looks too much if you are working with miRNA sequencing. What is the length distribution for the unmapped reads only (you can output them in separate file)?
3. BLAST a few of the unmapped reads.

Cheers
Alex

D Prat

unread,
Jan 19, 2017, 11:52:44 AM1/19/17
to rna-star
Hi Alex, 

Thx for you answer !

I've tried to change the option --outFilterMatchNmin 16 to 10 so i've been able to kept the little reads, is this ok ?
now i don't get any "too short reads", but still i don't get a huge improvement in the unique mapped reads.
This is the length distribution i'm getting on the unmapped reads.

About the average input reads length it's probably due to the fact that my data correspond to the sequenced miRNA and to their sequenced target
Sequence length distribution

I also did a blast on some unmapped sequence but i got 0 result.

This is the report i got now i've changed the option:

 Started job on |       Jan 19 11:03:28
                             Started mapping on |       Jan 19 11:07:15
                                    Finished on |       Jan 19 11:07:57
       Mapping speed, Million of reads per hour |       324.32

                          Number of input reads |       3783764
                      Average input read length |       46
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1206059
                        Uniquely mapped reads % |       31.87%
                          Average mapped length |       33.99
                       Number of splices: Total |       23967
            Number of splices: Annotated (sjdb) |       23967
                       Number of splices: GT/AG |       23719
                       Number of splices: GC/AG |       188
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       60
                      Mismatch rate per base, % |       1.28%
                         Deletion rate per base |       0.18%
                        Deletion average length |       1.00
                        Insertion rate per base |       0.20%
                       Insertion average length |       1.13
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       2030086
             % of reads mapped to multiple loci |       53.65%
        Number of reads mapped to too many loci |       399889
             % of reads mapped to too many loci |       10.57%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.35%
                 % of reads unmapped: too short |       0.00%
                     % of reads unmapped: other |       3.55%



Regards, 

D. 
Auto Generated Inline Image 1

Alexander Dobin

unread,
Jan 19, 2017, 4:51:25 PM1/19/17
to rna-star
Hi David,

after you reduced the min mapped length, the previously unmapped "too short" reads became multimappers.
However, this could have happened because either
(i) the reads were trimmed to <16b before mapping
or
(ii) only small portions of the reads (<16) can be mapped, while the rest of the read sequence could not be matched to the genome (e.g. untrimmed adapter)

If you are willing to work with multimappers mapping to >10 loci, you can reduce 
% of reads mapped to too many loci |       10.57%
by increasing --outFilterMultimapNmax .
Also, you can play with increasing --winAnchorMultimapNmax from the default 50, which may help to reduce the remaining unmapped reads
 % of reads unmapped: other |       3.55%

Cheers
Alex

Xavi

unread,
Aug 2, 2017, 3:35:52 PM8/2/17
to rna-star
Hi!

I'm completely new to this field but I would like to start using STAR to align miRNAs.

I have a question, how does the alignment algorithm deal with paralog sequences of miRNA that have a huge similarity? How does the algorithm know when to allow a mismatches or sequences that have tailing and trimming events? 

Such as:

>hsa-miR-30a-5p MIMAT0000087

UGUAAACAUCCUCGACUGGAAG

>hsa-miR-30e-5p MIMAT0000692

UGUAAACAUCCUUGACUGGAAG

 
Thanks!

Xavi

Laura Dijkhuizen

unread,
Aug 4, 2017, 10:00:06 AM8/4/17
to rna-star
Thanks all for the helpful discussion. I too was confronted with quite a high proportion of my sRNA-seq data not mapping: approximately 60% of my sRNAs got stuck in the mysterious category 'umapped: other'. I know the sequencing data should map better to the reference genome for bowtie 1 & 2 map +90% of the reads. However these don't report all locations. There is off course the -all flag in bowtie, but this is very time in-efficient and not suitable for upscaling. 

Here I want to confirm that the --winAnchorMultimapNmax did the trick for me, I went from ~60% 'unmapped: other' to ~7%. Admittedly, I am not quite sure what this parameter exactly does. Bumping up --seedMultimapNmax had no substantial effect. Also, note that this parameter adds quite a lot to computing time for both mapping in STAR and sorting + compressing in samtools. Still, quite an improvement over the bow ties though.

A code snippet of my script for reference:
STAR    --runMode alignReads                    \
                        --genomeDir "$starindex"                \
                        --readFilesIn "$sRNAseqdir/$s"          \
                        --readFilesCommand zcat                 \
                        --outFileNamePrefix "$basedir/star/$name."   \
                        --outStd Log                            \
                        --outReadsUnmapped Fastx                \
                        --outSAMtype BAM Unsorted               \
                        --runThreadN "$CPU"                     \
                        --alignIntronMax 1                      \
                        --outFilterMultimapNmax 100000          \
                        --outFilterMismatchNoverLmax 0.05       \
                        --outFilterMatchNmin 16                 \
                        --outFilterScoreMinOverLread 0          \
                        --outFilterMatchNminOverLread 0         \
                        --winAnchorMultimapNmax 10000 


Before I added the       --winAnchorMultimapNmax 10000  line

                                 Started job on |       Aug 03 15:54:16
                             Started mapping on |       Aug 03 15:54:18
                                    Finished on |       Aug 03 15:55:24
       Mapping speed, Million of reads per hour |       628.59

                          Number of input reads |       11524124
                      Average input read length |       25
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1165520
                        Uniquely mapped reads % |       10.11%
                          Average mapped length |       24.20
                       Number of splices: Total |       0
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       0
                       Number of splices: GC/AG |       0
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       0
                      Mismatch rate per base, % |       0.13%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.00
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.13
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       2739052
             % of reads mapped to multiple loci |       23.77%
        Number of reads mapped to too many loci |       0
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       2.41%
                     % of reads unmapped: other |       63.71%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%



And after:
                                 Started job on |       Aug 03 16:48:58
                             Started mapping on |       Aug 03 16:49:00
                                    Finished on |       Aug 03 17:14:37
       Mapping speed, Million of reads per hour |       26.99

                          Number of input reads |       11524124
                      Average input read length |       25
                                    UNIQUE READS:
                   Uniquely mapped reads number |       1064272
                        Uniquely mapped reads % |       9.24%
                          Average mapped length |       24.56
                       Number of splices: Total |       0
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       0
                       Number of splices: GC/AG |       0
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       0
                      Mismatch rate per base, % |       0.11%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.00
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.02
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       9482211
             % of reads mapped to multiple loci |       82.28%
        Number of reads mapped to too many loci |       0
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       1.21%
                     % of reads unmapped: other |       7.27%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%



Alexander Dobin

unread,
Aug 4, 2017, 12:27:00 PM8/4/17
to rna-star
Hi Xavi,

aligning short sequences (~20b) with mismatches is not easy.
STAR will find some alignments with mismatches, but finding all possible alignments is not guaranteed (i.e.the algorithm is greedy).
You may want to increase --winAnchorMultimapNmax from the default 50 to 1000 (or even more) and reduce  --seedSearchStartLmax to 10 (or even less) to allow for shorter seeds.

Cheers
Alex

Alexander Dobin

unread,
Aug 4, 2017, 12:38:37 PM8/4/17
to rna-star
Hi Laura,

this is an interesting observation, thanks for sharing it!
There are two possibilities for the extra alignments that you are getting with increasing --winAnchorMultimapNmax:
1. These reads map to >50 loci in the genome. --winAnchorMultimapNmax (=50 by default) defines the maximum number of loci a seed can map to.
If a read maps to >winAnchorMultimapNmax loci, no "good" seeds are found, and the read is classified as "unmapped other"

2. These extra alignments are not exact matches to the genome, but contain mismatches and (less likely) indels. Increasing winAnchorMultimapNmax  allows for shorter seeds and hence more sensitivity to short sequences with mismatches. However, this is probably not your case, since most of the extra alignments were multimappers.

Cheers
Alex

Xavi

unread,
Aug 7, 2017, 3:09:19 PM8/7/17
to rna-star
Hi Alex,

Thanks for your advice! I will try it

Best,

Xavi

MA1789

unread,
Mar 1, 2018, 11:02:59 AM3/1/18
to rna-star
Hi Alex,

I run STAR for our small RNA data with the option --outFilterMismatchNoverLmax 0.05   as suggested. However, when I visualize my alignment in IGV i see that there are a lot of mismatches. I am attaching a screen shot just for an example of this situation. Can you please comment on what might be the issue? FYI for the genome and index I use,  http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/GRCh38_Gencode26/  and the annotation therein. Also for now I am using the quantification built in STAR. Do you think any disadvantage of it for small RNA data (shall I use featurecounts or htseq per say?)


Thanks many in advance
Screen Shot 2018-03-01 at 10.54.57 AM.png

Alexander Dobin

unread,
Mar 9, 2018, 3:58:58 PM3/9/18
to rna-star
Hi MA1789,

these mismatches are weird, the nM=0 attribute means there should not be any mismatches.
Are you sure that you mapped to the same assembly as you use on the browser?
Please extract this read (or any other that show many mismatches) from the BAM file and send it to me.

Both featurecounts and htseq will produce exactly the same results as STAR counting.
In general, it should be fine, however, you may want to refine counting for small RNA in some ways, e.g. only count reads that match the mature miRNA, or count the multimappers, etc.

Cheers
Alex

alb...@oncobox.com

unread,
Apr 3, 2018, 9:04:46 AM4/3/18
to rna-...@googlegroups.com
Dear Alex, I'm new to STAR and would like to figure out proper options for index generation stage for miRNA reads mapping. In ENCODE pipeline the indexing uses following options:

STAR \
--runThreadN 16 \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--sjdbGTFfile gencode.comprehensive.annotation.gtf \
--sjdbOverhang 1 \
--genomeFastaFiles genome.fasta

Is the 1 a correct value for --sjdbOverhang option (probably a typo?)? And if yes could you please give a small explanation why it's not a read len -1 

Best, Eugene

PS and one additional question: what exactly is an advantage in using comprehencive genome annotation in the indexing step and separete miRNA gtf file in mapping step? Also indexing with miRNA gtf file should allow me to exclude --sjdbGTFfile option from mapping step and take an advantage of LoadAndKeep option?

Alexander Dobin

unread,
Apr 3, 2018, 1:11:49 PM4/3/18
to rna-star
Hi Eugene,

--sjdbOverhang 1 is a hack to prohibit splicing over annotated junctions in the GTF, but still use the GTF for counting reads over genes.
If you do not need counting over genes in the GTF file, you can omit the --sjdbGTFfile and --sjdbOverhang parameters altogether.

Cheers
Alex

Kyle Chang

unread,
Apr 22, 2018, 11:16:15 PM4/22/18
to rna-...@googlegroups.com

I'm following the ENCODE protocol to for miRNA seq alignment but I'm only getting 0.08% uniquely mapped reads.

Genome index:
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir ~/references/star_mirna --genomeFastaFiles GRCh37-lite.fa --sjdbGTFfile ~/references/gencode.v19.annotation.nochr.gtf --sjdbOverhang 1

For adapter trimming: i use https://raw.githubusercontent.com/rm2011/miRNA-seq-adapters/master/cutadapt.sh , after editing my own 3 prime adapter sequence

For aligning, i use -sjdbGTFfile ~/references/mirna_subset.gtf to specify a miRNA gtf subset which i grepped from the full gencode gtf file
STAR --genomeDir ~/references/star_mirna/ --readFilesIn read1.fastq.gz,read2.fastq.gz --readFilesCommand zcat --runThreadN 16 --sjdbGTFfile ~/references/mirna_subset.gtf --alignEndsType EndToEnd --outFilterMismatchNmax 1 --outFilterMultimapScoreRange 0 --quantMode TranscriptomeSAM GeneCounts --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 10 --outSAMunmapped Within --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 16 --alignSJDBoverhangMin 1000 --alignIntronMax 1 --outWigType wiggle --outWigStrand Stranded --outWigNorm RPM outFilterMatchNmin

Attached is my log file.



Log.out
Log.final.out
33-LR33_S30_L004_R1_001_trim.fastq.gz
33-LR33_S30_L003_R1_001_trim.fastq.gz

Alexander Dobin

unread,
Apr 25, 2018, 7:03:05 PM4/25/18
to rna-star
Hi Kyle,

I have tried to map these reads and they have a very low mapping rate indeed.
If you remove --alignEndsType EndToEnd requirement you can probably map a bit more.
I would recommend that you BLAST a few of the sequences to check if you are dealing with contamination.

Cheers
Alex

Kyle Chang

unread,
Apr 25, 2018, 8:35:43 PM4/25/18
to rna-...@googlegroups.com

*edit* i'm using STAR_2.5.1b_modified


I remove --alignEndsType EndToEnd and it went up a bit to 20% uniquely mapped , and 64 % multi-mapped.




I also trim by keeping the first 37 nt, and bumped it up higher to 40% uniquely mapped, but still dealing with a lot of multi-mapped reads



1) Any suggestions on handling multi-mapped reads?

2) Can you clarify what I should blast against to check for contamination?


Thank you!

Alexander Dobin

unread,
Apr 26, 2018, 1:22:16 PM4/26/18
to rna-star
Hi Kyle,

with these new results, in which most of the reads map, I think the original mapping had a very low mapping because the adapter was not trimmed properly, and with the EndToEnd alignment it results in too many mismatches. I would make sure the 3' adapter sequence is correct.
Now that you almost do not have unmapped reads, there is nothing to BLAST :).

For multimappers, the most common approach is to randomly select the primary alignment for each multimapper and count them towards the genes, thus avoiding counting multimappers multiple times. You will need map with STAR’s --outMultimapperOrder Random option. Then you can either use featureCounts --primary option (though I did not try it) or filter the BAM file to keep primary alignments only (samtools view -F0x100) and use featureCounts -M (this works well for me).

Cheers
Alex

Kyle Chang

unread,
Apr 27, 2018, 3:11:32 PM4/27/18
to rna-star
Hi Alex,

I realized the BAM is generated from the genome index that was generated with 1) gencode gtf (-sjdbGTFfile gencode.v19.annotation.gtf), 2) --sjdbOverhang 100


1) If i want to run feature count using mirbase mature miRNA gtf, would I have to regenerate the BAM file with --sjdbGTFfile mirbase.gtf first?
2) Do  I need to change sjdbOverhang to the trimmed read length of miRNA which is ~24 nt?


Kyle

Alexander Dobin

unread,
Apr 27, 2018, 5:35:22 PM4/27/18
to rna-star
Hi Kyle,

since mapping miRNA does not require splicing, if you do not want to do counting with STAR, the best thing for mapping miRNA is to generate the genome without any GTF, and only use GTF in featureCounts.
If you want STAR counting, and you want to prevent splicing, you would need to generate STAR genome with the right GTF and --sjdbOverhang 1,

Cheers
Alex

MA1789

unread,
May 1, 2018, 9:19:24 PM5/1/18
to rna-...@googlegroups.com
Hi Alex,

How does STAR handle multi mapped reads for counting? Is there any option where we can get a count matrix that has only uniquely mapped reads. Also I would like to somehow correlate the number regions a read mapped and the length of the reads is there a way i get get this info from the aligned.out.bam file.

Thanks

Alexander Dobin

unread,
May 4, 2018, 4:05:55 PM5/4/18
to rna-star
Hi MA1789,

STAR does *not* count multimapping reads with --quantMode GeneCounts, only unique mappers.

The number of loci a multimappers maps to in reported in the NH:i: tag, and the mapped length can be extracted from the CIGAR string (add all values for M-operations).

Cheers
Alex

MA1789

unread,
May 14, 2018, 11:31:51 AM5/14/18
to rna-star
Hi Alex,

Thanks. In my data I have about 10% uniquely mapped reads and 60%-70% percent multimappers. I would like to see what percentage of these multimappers align to the same place as of the 10% unique mappers. Is there an easy way to get this information from the bam file?

Alexander Dobin

unread,
May 15, 2018, 1:06:24 PM5/15/18
to rna-star
Hi @MA1789,

Usually, for multimappers people look at just one randomly selected alignment.
For instance, you can count "primary" alignment only (to make it truly random with STAR, you would need to use --outMultimapperOrder Random) with featureCounts --primary option.

If you want a more comprehesive accounting for multimappers, you would have to build a custom pipeline for this sort of analysis, e.g.
1. Define the "unique" loci - it could be annotated genes where unique mappers map, or "transfrags" made.
2. Overlap each multimapping alignment (i.e. each line in the SAM file) with the unique loci, using bedtools or featureCounts
3. For each multimapping read, decide which locus to assign it to, as it may overlap more than one.

Cheers
Alex

Marim Aouda kamel

unread,
Jun 27, 2019, 10:42:12 AM6/27/19
to rna-...@googlegroups.com
Hi Alex,
I am new to star and I would ask for best miRNA mapping should I use the reference genome following the ENCODE project pipeline or using the mature mirbase reference, however the aim of my study is to quantify the known miRNAs followed by differential expression analysis, and if I am going to use the mirbase reference do you recommend any other parameters rather --genomeSAindexNbases 6 parameter in  the indexing step, and finally if I am going to quantify the miRNA should I use the gff file in the indexing step?

Many thanks in advance.

Alexander Dobin

unread,
Jul 2, 2019, 11:59:34 AM7/2/19
to rna-star
Hi Marim,

mapping to mirBase only might be dangerous as it will force all reads to be aligned there, even if they have better alignment to the genome.
Some pipelines utilize a 2-step mapping, with alignments done to both the whole genome, and the mirBase, and the best alignment selected.
In terms of parameters, for genome generation you might also use --genomeChrBinNbits 10; for mapping --alignIntronMin 1 to prohibit splicing.

Cheers
Alex

Ralf C. Mueller

unread,
Mar 25, 2020, 2:38:26 PM3/25/20
to rna-star
Hi Alex,

I am new to STAR on small RNA. I have a PE library (2x 50 bp) of small RNA and was wondering which settings to use.
After trimming adaptors, reads are on average 22 bases long (range 10-50). Can I safely try above-mentioned settings?


Cheers,
Ralf

On Friday, 1 March 2013 23:13:49 UTC+1, Alexander Dobin wrote:
Hi Praful,

we are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project - the miRNA (mostly mature) are a major subclass of these small RNA.
We are using STAR with the following parameters:
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

You can clip 3' adapter before feeding the reads to STAR, or you can use simple built-in clipper
--clip3pAdapterSeq TGGAATTCTC      --clip3pAdapterMMp 0.1
(second parameter is the proportion of mismatches in the matched adapter length).

You would also likely want to filter out reads that STAR "genomically" trims at the 5' (see the discussion about "Soft clipping" here).
This simple awk script will filter out all alignments that are trimmed by more than 1 base from the 5'.
awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam

Cheers
Alex

Alexander Dobin

unread,
Mar 30, 2020, 11:32:20 AM3/30/20
to rna-star
Hi Ralf,

these parameters should work fine for your data, though you may need to tweak them after running a few samples and checking the mapping stats in Log.final.out.
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

Cheers
Alex

Ralf C. Mueller

unread,
Mar 31, 2020, 3:57:39 AM3/31/20
to rna-star
Thank you!
Reply all
Reply to author
Forward
0 new messages