Low alignment rate using STAR

S Perera

Jun 1, 2018, 12:43:29 PM

to rna-star

Hello,

I'm new to STAR and wanted to check whether I'm making an obvious mistake, as my alignment rate is low (I previously aligned the same samples with TopHat2 and HISAT2 and got alignment rates above 90%).

These are the commands used for generating index and aligning:

$ STAR --runMode genomeGenerate --runThreadN 64 --genomeDir /home/STAR/genome --genomeFastaFiles /home/STAR/Mus_musculus.GRCm38.dna.toplevel.fa --sjdbGTFfile /home/STAR/Mus_musculus.GRCm38.92.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM=33524399489

(The last option was added because an earlier run failed with: "SOLUTION: please specify --limitGenomeGenerateRAM not less than 33524399488 and make that much RAM available".)

$ STAR --runThreadN 12 --genomeDir /home/STAR/genome --sjdbGTFfile /home/STAR/Mus_musculus.GRCm38.92.gtf --sjdbOverhang 100 --readFilesIn /home/5_S4_R1.fastq.gz /home/5_S4_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix Star_E13_5/Star_E13_5_peripheral --outSAMtype BAM Unsorted SortedByCoordinate

This is the Log.progress.out for one of the samples.

Time             Speed      Read    Read  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped  Unmapped
                  M/hr    number  length  unique  length  MMrate   multi    multi+        MM     short     other
May 31 15:20:34   24.9    421749     158   61.9%   157.0    0.3%    7.8%      0.2%      0.0%     29.8%      0.2%
May 31 15:21:35   66.3   2247279     158   61.9%   157.0    0.3%    7.8%      0.3%      0.1%     29.8%      0.2%
May 31 15:22:36   80.1   4069270     158   61.8%   157.0    0.3%    7.9%      0.3%      0.1%     29.8%      0.2%
May 31 15:23:45   84.2   5891252     158   61.8%   157.0    0.3%    7.9%      0.3%      0.1%     29.8%      0.2%
May 31 15:24:46   87.1   7572704     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     29.9%      0.2%
May 31 15:25:50   89.7   9394245     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     29.9%      0.2%
May 31 15:26:50   92.4  11210419     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:27:50   94.3  13023396     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:28:58   94.5  14836357     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:30:01   95.4  16649264     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:31:02   95.7  18322660     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:32:04   95.9  19996309     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:33:04   97.6  21990591     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%
May 31 15:34:35   89.4  22409080     158   61.7%   157.0    0.3%    7.8%      0.3%      0.1%     30.0%      0.2%

Just for reference, here is the HISAT2 summary for the same sample. I built the HISAT2 index from the same FASTA file used for STAR and aligned with:

$ hisat2 -p 12 -x /home/HISAT2_indexing/Mus_musculus.GRCm38.dna.toplevel_hisat2 -1 /home/5_S4_R1.fastq.gz -2 /home/5_S4_R2.fastq.gz -S /home/Hisat2_E13_5/Hisat2_E13_5.sam 2>Hisat2_E13_5/summary.txt

22409080 reads; of these:
  22409080 (100.00%) were paired; of these:
    2953534 (13.18%) aligned concordantly 0 times
    17515772 (78.16%) aligned concordantly exactly 1 time
    1939774 (8.66%) aligned concordantly >1 times
    ----
    2953534 pairs aligned concordantly 0 times; of these:
      83480 (2.83%) aligned discordantly 1 time
    ----
    2870054 pairs aligned 0 times concordantly or discordantly; of these:
      5740108 mates make up the pairs; of these:
        3615214 (62.98%) aligned 0 times
        1699149 (29.60%) aligned exactly 1 time
        425745 (7.42%) aligned >1 times
91.93% overall alignment rate
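
A note on reading the summary above: the 91.93% looks like a per-mate figure rather than a per-pair one. The following is a quick check of that reading, using only the counts printed in the summary. The variable names are illustrative, and the formula is an assumption about how the overall rate is derived, although it does reproduce the reported number.

```python
# Counts copied from the HISAT2 summary above.
pairs = 22409080
concordant_once = 17515772
concordant_multi = 1939774
mates_unaligned = 3615214  # mates that aligned 0 times

total_mates = 2 * pairs

# Assumed definition: every mate that aligned in any way counts as aligned,
# including mates whose partner did not align.
overall_rate = 100 * (total_mates - mates_unaligned) / total_mates

# Pair-level view: only pairs that aligned concordantly.
concordant_rate = 100 * (concordant_once + concordant_multi) / pairs

print(f"overall (per mate): {overall_rate:.2f}%")
print(f"concordant pairs:   {concordant_rate:.2f}%")
```

Under this reading, 86.82% of pairs aligned concordantly, and the remainder of the 91.93% comes from discordant pairs and from mates that aligned on their own.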

Do you know if there's a way I could improve my alignment using star?

Thank you!

Alexander Dobin

Jun 1, 2018, 5:12:48 PM

to rna-star

Hi S Perera,

My first guess is that many of the reads reported as mapped by TopHat or HISAT are actually single-end alignments or improper pairs, which STAR does not output by default. To check this, try mapping each mate separately and see whether the mapping rate improves. If that does not explain it, please cut out ~100,000 reads (preferably from the middle of the files) and send them to me, and I will have a look.
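
One possible way to cut the same block of reads out of the middle of both files, so that the mates stay in sync, is sketched below. It assumes gzipped FASTQ with exactly four lines per record. The function name, the output file names and the 11,000,000 offset (roughly half of the 22,409,080 pairs in this sample) are illustrative choices only.

```python
import gzip
from itertools import islice


def extract_reads(in_path, out_path, skip_reads, n_reads):
    """Copy n_reads FASTQ records to out_path after skipping skip_reads records.

    Assumes a gzipped FASTQ file with exactly 4 lines per record.
    Returns the number of complete records written.
    """
    start = 4 * skip_reads
    stop = start + 4 * n_reads
    lines_written = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in islice(src, start, stop):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


# Example call (input paths from this thread, output names hypothetical).
# Using the same offset and count for both mates keeps the pairs aligned:
# for mate in ("R1", "R2"):
#     extract_reads(f"/home/5_S4_{mate}.fastq.gz",
#                   f"subset_{mate}.fastq.gz", 11_000_000, 100_000)
```

Slicing by record position rather than by read name means the subset preserves whatever pairing the original files have, which is what you want when the pairing itself is under suspicion.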

Cheers

Alex

S Perera

Jun 2, 2018, 8:00:40 AM

to rna-star

Hi Alex,

Thanks so much for the suggestion. I aligned each mate separately and there is a definite improvement. I've copied the Log.progress.out file for one of the mates.

Time             Speed      Read    Read  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped  Unmapped
                  M/hr    number  length  unique  length  MMrate   multi    multi+        MM     short     other
Jun 02 11:40:21  232.0   3930598      79   79.8%    78.8    0.4%   14.8%      0.8%      0.0%      3.9%      0.8%
Jun 02 11:41:21  258.7   8696869      79   79.7%    78.8    0.4%   14.8%      0.8%      0.0%      4.0%      0.7%
Jun 02 11:42:21  275.8  13864269      79   79.8%    78.8    0.4%   14.7%      0.8%      0.0%      4.0%      0.7%
Jun 02 11:43:22  280.9  18885252      79   79.8%    78.8    0.4%   14.7%      0.8%      0.0%      4.0%      0.7%
Jun 02 11:45:02  235.9  22409080      79   79.8%    78.8    0.4%   14.7%      0.8%      0.0%      4.0%      0.6%

ALL DONE!

Do you have any suggestions for how I should proceed with aligning the pairs? (I am hoping to do a differential expression analysis.) Should I try trimming before the alignment? I didn't trim previously because the read quality was fine.

Thanks again.

Best

Surangi

Alexander Dobin

Jun 5, 2018, 11:45:42 AM

to rna-star

Hi Surangi,

Good single-end mapping (for each mate) combined with poor paired-end mapping might indicate a problem with the pairing of the reads in the FASTQ files.

Sometimes this is caused by the trimming software, so the first thing to do is to try mapping the raw reads without trimming.

Also, you can check the pairing of the reads by comparing all read names (lines 1, 5, 9, ...) in the two FASTQ files.
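
A minimal sketch of such a read-name check is below. It assumes gzipped FASTQ files with four lines per record, and it treats two names as matching once the leading "@", anything after the first whitespace, and a trailing "/1" or "/2" are removed. That normalization and both function names are assumptions for illustration; adjust them to the header format of your files.

```python
import gzip
from itertools import islice, zip_longest


def read_name(header):
    """Return the read ID from a FASTQ header line.

    Drops the leading '@', any description after the first whitespace,
    and a trailing '/1' or '/2' mate suffix.
    """
    fields = header.split()
    name = fields[0] if fields else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_pairing_mismatch(r1_path, r2_path):
    """Compare header lines (1, 5, 9, ...) of two gzipped FASTQ files.

    Returns (record_index, name1, name2) for the first disagreement, using a
    0-based record index, or None if every name matches. A name of None means
    that file ended before the other one.
    """
    with gzip.open(r1_path, "rt") as f1, gzip.open(r2_path, "rt") as f2:
        headers1 = islice(f1, 0, None, 4)
        headers2 = islice(f2, 0, None, 4)
        for index, (h1, h2) in enumerate(zip_longest(headers1, headers2)):
            n1 = read_name(h1) if h1 is not None else None
            n2 = read_name(h2) if h2 is not None else None
            if n1 != n2:
                return index, n1, n2
    return None


# Example call with the files from this thread:
# print(first_pairing_mismatch("/home/5_S4_R1.fastq.gz", "/home/5_S4_R2.fastq.gz"))
```

The function stops at the first disagreement because a single dropped or reordered read usually shifts every pair after it, so the position of the first mismatch is the most useful piece of information.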