Dear Alex,
I am working with libraries that have different read length (35bp, 75bp, and 150bp, all of them paired-end reads, all human samples). I have generated 3 different genome index, using –sjdbOverhang 34, 74, 149 (respectively). Now I want to map the reads to the genome. I could use 3 different scripts, adapting the different parameters for each case or the same script for all of them, I don’t know what is the best strategy.
Reading at older posts, I understand that --outFilterMismatchNoverLmax could be set at 0.05, so the number of mismatches will depend on the read length <= 5% of mapped length.
For RNA <200bp you were also recommending the following parameters: --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1.
I understand that if I am being strict with --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16, I can be more relax and set the outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0. Why do you recommend alignIntronMax to 1? Would you recommend me to use all these parameters for all of my libraries (35bp, 75bp, 150bp)?
I am also very concern about the reads that will align to multiple loci. I guess that when working with 35bp, I will get a high percentage of multiple loci reads. The other day, we were discussing with our lab group what is the maximum percentage of multiple loci reads that we should allow. They were saying that 2% should be the maximum, but I don’t know if this is true. The –outFilterMultimapNmax is set to 10, but I don’t know that if I should decrease it (thinking about the 35bp read length)
samtools view -u -f 4 input.bam > unmapped.bam |