Run parameters for aligning of 35bp, 75bp and 150bp

mireia alemany

unread,

Jul 11, 2020, 1:15:37 AM7/11/20

to rna-star

Dear Alex,

I am working with libraries that have different read length (35bp, 75bp, and 150bp, all of them paired-end reads, all human samples). I have generated 3 different genome index, using –sjdbOverhang 34, 74, 149 (respectively). Now I want to map the reads to the genome. I could use 3 different scripts, adapting the different parameters for each case or the same script for all of them, I don’t know what is the best strategy.

Reading at older posts, I understand that --outFilterMismatchNoverLmax could be set at 0.05, so the number of mismatches will depend on the read length <= 5% of mapped length.

For RNA <200bp you were also recommending the following parameters: --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1.

I understand that if I am being strict with --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16, I can be more relax and set the outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0. Why do you recommend alignIntronMax to 1? Would you recommend me to use all these parameters for all of my libraries (35bp, 75bp, 150bp)?

I am also very concern about the reads that will align to multiple loci. I guess that when working with 35bp, I will get a high percentage of multiple loci reads. The other day, we were discussing with our lab group what is the maximum percentage of multiple loci reads that we should allow. They were saying that 2% should be the maximum, but I don’t know if this is true. The –outFilterMultimapNmax is set to 10, but I don’t know that if I should decrease it (thinking about the 35bp read length)

Alexander Dobin

unread,

Jul 15, 2020, 3:04:55 PM7/15/20

to rna-star

Hi Mireia,

using different --sjdbOverhang indexes is the best approach, since you have very short 35b reads.
Setting --outFilterMismatchNoverLmax could be set at 0.05 is a good idea as well, it will keep the alignment quality the same for different lengths.

These parameters

--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1

apply only to the "small RNA-seq" (miRNA, tRNA, etc, RNA length <200b), which is typically single-end sequencing.

If your samples are "long RNA-seq" (i.e. mRNA and lincRNA mostly), you do not wan to use these parameters, as they prohibit splicing and allow for very short alignments.

The % of multimappers is determined by the RNA species in the sample, you cannot change it significantly with mapping parameters.

If the % of multimappers is too high, it may indicate some problems with the wet-lab protocol (e.g. incomplete rRNA depletion).

You are right that 35b reads will have higher multimapping proportion, which may skew some comparisons between 35b and 75/150b samples.

If you need to make such comparisons, the best approach, strictly speaking, is to trim all reads to 35b.

Cheers

Alex

mireia alemany

unread,

Jul 15, 2020, 11:53:09 PM7/15/20

to rna-star

Hi,

Thanks for your answer.

Yes, I have "long-RNAseq", so I will NOT change any of the default values for the splicing parameters (sorry for the misunderstanding!)

At the moment, I just did the alignment with the default STAR parameters, and as expected for the 150- and 75-bp paired-end reads, I have obtained a good percentage of uniquely mapped reads (80-95%). For the 35-bp, I have obtained around 65% of uniquely mapped reads, while 35% are multiple loci alignments. I feel that I will lose a lot of good data if I trim all the reads to 35 bp:(

For the 150- and 75-bp paired-end reads, I have a subset of samples that have around 20-15% of unmapped reads (too short), while all the other samples only have around 2%. I was very surprised to see that number. I am working with human samples, and to me it seemed a high percentage (20%) of unmapped reads. ALL the samples were prepared together (the RNA extraction and library prep was done together), why could it happen? do I need to worry? I want to compare between them (DESeq2), so I am worried about introducing some bias. In some posts, I have seen that you recommend blasting them, and I am a little bit lost. At the moment, I have been able to extract the unmapped reads using samtools

samtools view -u -f 4 input.bam > unmapped.bam

I don't know how to examine them, what tool do you recommend?

Thanks

Mireia

Alexander Dobin

unread,

Jul 22, 2020, 12:11:40 PM7/22/20

to rna-star

Hi Mireia,

~15-20% of unmapped reads is typically not a big deal, however, it might be prudent to investigate what's causing them.

A few suggestions for these samples:

1. BLAST a few (~10) of the unmapped sequences against the "full" NCBI not database. If they all have good hits against other species, it may indicate contamination in the samples.

2. Remap Read1 and Read2 separately, and see whether the number of unmapped reads is reduced significantly

3. Trim the reads for Illumina adapters. STAR has a simple internal 3' trimmer --clip3pAdapterSeq <Seq>, where Seq is first ~10-20b of the 3' Illumina adapter.

Or you can use more sophisticated trimmers such as cutadapt.

Cheers

Alex

Reply all

Reply to author

Forward