Hi Alex,

I have been getting good results with STAR and miRNA sequences. I have compared the STAR read alignment counts to bowtie read alignment counts and see very high correlations between the numbers of mapped reads per miRNA (bowtie is the most often used aligner in miRNA pipelines, for example in ncPRO-seq, which I am testing). Interestingly, the bowtie approach is very sensitive to how well the trimming is done. If your trimmer misses adaptors, you will see a significant drop in mapped reads because there is no soft clipping. STAR seems more immune to this problem due to its soft-clipping feature...

You mention filtering of soft-clipped reads - what would the reason be for removing them? Wouldn't you want to keep them if, for example, your adapter trimming is not always reliable? Or if you are downloading public data and have no clue what the adaptors could be?

Just wondering, because I am liking the results. I will try a run of STAR without pre-trimming the fastqs and see how it performs compared to trimmed. I expect to see little difference... let's see.
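For anyone who wants to repeat the per-miRNA comparison between two aligners, here is a minimal sketch in Python. It assumes the counts are already available as dictionaries keyed by miRNA name; the function names and the example counts are made up for illustration and are not data from this thread.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def compare_counts(star_counts, bowtie_counts):
    """Correlate counts over the union of miRNA names; missing names count as 0."""
    names = sorted(set(star_counts) | set(bowtie_counts))
    xs = [star_counts.get(name, 0) for name in names]
    ys = [bowtie_counts.get(name, 0) for name in names]
    return pearson(xs, ys)

# Made-up counts, for illustration only (hypothetical miRNA names).
star = {"miR-A": 100, "miR-B": 200, "miR-C": 300}
bowtie = {"miR-A": 10, "miR-B": 20, "miR-C": 30}
r = compare_counts(star, bowtie)
print(f"Pearson r = {r:.3f}")
```

Treating a miRNA that is missing from one aligner as a zero count is a design choice; dropping such names instead would make the correlation look better than it is.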
On Friday, March 1, 2013 11:13:49 PM UTC+1, Alexander Dobin wrote:
--
You received this message because you are subscribed to the Google Groups "rna-star" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/rna-star.
Hi Alex,

I can confirm that your warnings are correct - the soft clipping will allow the untrimmed adapter sequence to align to the genome if there is enough sequence similarity, resulting in more "unique" alignments.

If the trimming is done well (I've tried cutadapt, alientrimmer, and reaper from kraken tools, which has been the best one so far), then the results are comparable between STAR and bowtie. Bowtie is quite common in miRNA pipelines.

If the trimming is not so accurate (say it missed one or two bases of the adapter), STAR seems to be a little more robust, as I saw when I used Cutadapt or Alientrimmer.

Thanks Alex!
Started job on | Jan 17 11:58:05
Started mapping on | Jan 17 12:00:11
Finished on | Jan 17 12:00:46
Mapping speed, Million of reads per hour | 389.19
Number of input reads | 3783764
Average input read length | 46
UNIQUE READS:
Uniquely mapped reads number | 1142926
Uniquely mapped reads % | 30.21%
Average mapped length | 35.06
Number of splices: Total | 23479
Number of splices: Annotated (sjdb) | 23479
Number of splices: GT/AG | 23266
Number of splices: GC/AG | 182
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 31
Mismatch rate per base, % | 1.31%
Deletion rate per base | 0.19%
Deletion average length | 1.00
Insertion rate per base | 0.20%
Insertion average length | 1.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1579235
% of reads mapped to multiple loci | 41.74%
Number of reads mapped to too many loci | 179751
% of reads mapped to too many loci | 4.75%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.35%
% of reads unmapped: too short | 14.36%
% of reads unmapped: other | 8.59%
Started job on | Jan 19 11:03:28
Started mapping on | Jan 19 11:07:15
Finished on | Jan 19 11:07:57
Mapping speed, Million of reads per hour | 324.32
Number of input reads | 3783764
Average input read length | 46
UNIQUE READS:
Uniquely mapped reads number | 1206059
Uniquely mapped reads % | 31.87%
Average mapped length | 33.99
Number of splices: Total | 23967
Number of splices: Annotated (sjdb) | 23967
Number of splices: GT/AG | 23719
Number of splices: GC/AG | 188
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 60
Mismatch rate per base, % | 1.28%
Deletion rate per base | 0.18%
Deletion average length | 1.00
Insertion rate per base | 0.20%
Insertion average length | 1.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2030086
% of reads mapped to multiple loci | 53.65%
Number of reads mapped to too many loci | 399889
% of reads mapped to too many loci | 10.57%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.35%
% of reads unmapped: too short | 0.00%
% of reads unmapped: other | 3.55%
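Comparing two of these run summaries by eye is error-prone, so here is a small Python sketch that parses "name | value" lines into a dictionary and prints the change per field. The helper names are my own, and the three fields used are copied from the two summaries above; in practice you would read the text from each run's log file instead.

```python
def parse_final_log(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats

def as_number(value):
    """Convert '30.21%' or '1142926' to a float."""
    return float(value.rstrip("%"))

# Values copied from the Jan 17 and Jan 19 summaries above.
run_a = parse_final_log("""
Number of input reads | 3783764
Uniquely mapped reads number | 1142926
Uniquely mapped reads % | 30.21%
""")
run_b = parse_final_log("""
Number of input reads | 3783764
Uniquely mapped reads number | 1206059
Uniquely mapped reads % | 31.87%
""")

for name in run_a:
    delta = as_number(run_b[name]) - as_number(run_a[name])
    print(f"{name}: {run_a[name]} -> {run_b[name]} (change {delta:+.2f})")
```

Only numeric fields are compared here; date fields such as "Started job on" would need separate handling because `as_number` cannot convert them.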
>hsa-miR-30a-5p MIMAT0000087
UGUAAACAUCCUCGACUGGAAG
>hsa-miR-30e-5p MIMAT0000692
UGUAAACAUCCUUGACUGGAAG
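The two mature sequences above have the same length and are nearly identical, which is presumably why they are listed together. A quick Python check of where they differ (the function name is my own):

```python
def mismatch_positions(a, b):
    """1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]

mir_30a_5p = "UGUAAACAUCCUCGACUGGAAG"  # hsa-miR-30a-5p, MIMAT0000087
mir_30e_5p = "UGUAAACAUCCUUGACUGGAAG"  # hsa-miR-30e-5p, MIMAT0000692

diff = mismatch_positions(mir_30a_5p, mir_30e_5p)
print(len(mir_30a_5p), diff)  # 22 [13]
```

With a single differing base in 22 nt, a read from one of these miRNAs that carries one sequencing error at that position is indistinguishable from the other, so any setting that tolerates one mismatch can blur the two.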
STAR --runMode alignReads \
Started job on | Aug 03 15:54:16
Started mapping on | Aug 03 15:54:18
Finished on | Aug 03 15:55:24
Mapping speed, Million of reads per hour | 628.59
Number of input reads | 11524124
Average input read length | 25
UNIQUE READS:
Uniquely mapped reads number | 1165520
Uniquely mapped reads % | 10.11%
Average mapped length | 24.20
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.13%
Deletion rate per base | 0.00%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2739052
% of reads mapped to multiple loci | 23.77%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 2.41%
% of reads unmapped: other | 63.71%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Started job on | Aug 03 16:48:58
Started mapping on | Aug 03 16:49:00
Finished on | Aug 03 17:14:37
Mapping speed, Million of reads per hour | 26.99
Number of input reads | 11524124
Average input read length | 25
UNIQUE READS:
Uniquely mapped reads number | 1064272
Uniquely mapped reads % | 9.24%
Average mapped length | 24.56
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.11%
Deletion rate per base | 0.00%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.02
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 9482211
% of reads mapped to multiple loci | 82.28%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.21%
% of reads unmapped: other | 7.27%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
*edit* I'm using STAR_2.5.1b_modified.
I removed --alignEndsType EndToEnd, and it went up a bit to 20% uniquely mapped and 64% multi-mapped.
I also trimmed the reads by keeping the first 37 nt, which bumped it up to 40% uniquely mapped, but I am still dealing with a lot of multi-mapped reads.
1) Any suggestions on handling multi-mapped reads?
2) Can you clarify what I should blast against to check for contamination?
Thank you!
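The message above does not say how the reads were cut to the first 37 nt, so the following is only one possible way to do it: a short Python sketch that hard-trims both the sequence and the quality line of each FASTQ record. It assumes plain four-line records; the function name and the example record are made up.

```python
def trim_fastq_records(lines, keep=37):
    """Yield FASTQ lines with sequence and quality cut to the first `keep` bases.

    Assumes plain 4-line records (no wrapped sequence lines).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # 2nd line = sequence, 4th line = quality
            line = line[:keep]
        yield line

# Made-up 48-nt record, for illustration only.
record = ["@read1", "ACGT" * 12, "+", "I" * 48]
trimmed = list(trim_fastq_records(record, keep=37))
print(len(trimmed[1]), len(trimmed[3]))  # 37 37
```

Sequence and quality must be trimmed to the same length, otherwise downstream tools will reject the record.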
Hi Praful,

we are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project - the miRNA (mostly mature) are a major subclass of these small RNA.

We are using STAR with the following parameters:
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

You can clip the 3' adapter before feeding the reads to STAR, or you can use the simple built-in clipper:
--clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1
(the second parameter is the proportion of mismatches in the matched adapter length).

You would also likely want to filter out reads that STAR "genomically" trims at the 5' (see the discussion about "Soft clipping" here).

This simple awk script will filter out all alignments that are trimmed by more than 1 base from the 5':
awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/); if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam

Cheers
Alex
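Two parts of the recipe above can be sketched in Python for readers who want to check the logic. The first function reproduces the quoted mismatch ranges (0MM for 16-19b, 1MM for 20-39b) by rounding 5% of the mapped length down; how STAR rounds internally is not shown in this thread, so treat that as an assumption. The second part mirrors the awk filter (which relies on `and()`, a gawk extension): it reads the SAM FLAG and CIGAR and measures the soft clip at the read's 5' end. All function names are my own.

```python
import re

def max_mismatches(mapped_length, percent=5):
    """Largest whole number of mismatches within `percent` % of the mapped length."""
    return mapped_length * percent // 100  # integer arithmetic avoids float error

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def five_prime_soft_clip(flag, cigar):
    """Number of bases soft-clipped at the read's 5' end.

    For reverse-strand alignments (FLAG bit 0x10) the read's 5' end is the
    last CIGAR operation; otherwise it is the first one.
    """
    ops = CIGAR_OP.findall(cigar)
    if not ops:
        return 0  # e.g. '*' for unmapped reads
    length, op = ops[-1] if flag & 0x10 else ops[0]
    return int(length) if op == "S" else 0

def keep_alignment(sam_line, max_clip=1):
    """True for header lines and for alignments clipped by <= max_clip at the 5' end."""
    if sam_line.startswith("@"):
        return True
    fields = sam_line.rstrip("\n").split("\t")
    return five_prime_soft_clip(int(fields[1]), fields[5]) <= max_clip

for length in (16, 19, 20, 39, 40):
    print(length, max_mismatches(length))
```

To filter a file, one would write out only the lines of Aligned.out.sam for which `keep_alignment` returns True. Soft clips at the 3' end are deliberately ignored, as in the awk version, since untrimmed adapter sequence is expected there.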