low alignment rate using star


S Perera

Jun 1, 2018, 12:43:29 PM
to rna-star

Hello,


I'm new to using rna-star and I wanted to check if I was making an obvious mistake as my alignment rate isn't that good (I aligned the same samples with Tophat2 and hisat2 previously and had an alignment rate of over 90%).


These are the commands used for generating index and aligning:

$ STAR --runMode genomeGenerate --runThreadN 64 --genomeDir /home/STAR/genome --genomeFastaFiles /home/STAR/Mus_musculus.GRCm38.dna.toplevel.fa --sjdbGTFfile /home/STAR/Mus_musculus.GRCm38.92.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM=33524399489

(The last option was included because of a previous error, which gave: "SOLUTION: please specify --limitGenomeGenerateRAM not less than 33524399488 and make that much RAM available".)

$ STAR --runThreadN 12 --genomeDir /home/STAR/genome --sjdbGTFfile /home/STAR/Mus_musculus.GRCm38.92.gtf --sjdbOverhang 100 --readFilesIn /home/5_S4_R1.fastq.gz /home/5_S4_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix Star_E13_5/Star_E13_5_peripheral --outSAMtype BAM Unsorted SortedByCoordinate


This is the Log.progress.out for one of the samples.  


           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
May 31 15:20:34     24.9      421749      158    61.9%    157.0     0.3%     7.8%     0.2%     0.0%    29.8%     0.2%
May 31 15:21:35     66.3     2247279      158    61.9%    157.0     0.3%     7.8%     0.3%     0.1%    29.8%     0.2%
May 31 15:22:36     80.1     4069270      158    61.8%    157.0     0.3%     7.9%     0.3%     0.1%    29.8%     0.2%
May 31 15:23:45     84.2     5891252      158    61.8%    157.0     0.3%     7.9%     0.3%     0.1%    29.8%     0.2%
May 31 15:24:46     87.1     7572704      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    29.9%     0.2%
May 31 15:25:50     89.7     9394245      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    29.9%     0.2%
May 31 15:26:50     92.4    11210419      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:27:50     94.3    13023396      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:28:58     94.5    14836357      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:30:01     95.4    16649264      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:31:02     95.7    18322660      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:32:04     95.9    19996309      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:33:04     97.6    21990591      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%
May 31 15:34:35     89.4    22409080      158    61.7%    157.0     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%



For reference, here is the summary of the hisat2 run for the same sample. I indexed the genome with the same FASTA file that I used for STAR. The alignment command was:

$ hisat2 -p 12 -x /home/HISAT2_indexing/Mus_musculus.GRCm38.dna.toplevel_hisat2 -1 /home/5_S4_R1.fastq.gz -2 /home/5_S4_R2.fastq.gz -S /home/Hisat2_E13_5/Hisat2_E13_5.sam 2>Hisat2_E13_5/summary.txt


22409080 reads; of these:
  22409080 (100.00%) were paired; of these:
    2953534 (13.18%) aligned concordantly 0 times
    17515772 (78.16%) aligned concordantly exactly 1 time
    1939774 (8.66%) aligned concordantly >1 times
    ----
    2953534 pairs aligned concordantly 0 times; of these:
      83480 (2.83%) aligned discordantly 1 time
    ----
    2870054 pairs aligned 0 times concordantly or discordantly; of these:
      5740108 mates make up the pairs; of these:
        3615214 (62.98%) aligned 0 times
        1699149 (29.60%) aligned exactly 1 time
        425745 (7.42%) aligned >1 times
91.93% overall alignment rate
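A note on reading the summary above: the "overall alignment rate" is counted per mate, so a mate that aligns on its own still counts even when its pair does not align together. The short Python sketch below redoes that arithmetic from the counts in the summary. The variable names are illustrative only; the numbers are copied from the log.

```python
# Counts copied from the HISAT2 summary above.
total_pairs = 22409080
concordant_once = 17515772
concordant_multi = 1939774
discordant_once = 83480
mates_unaligned = 3615214  # individual mates (not pairs) that aligned 0 times

# Overall rate is per mate: any mate that aligned in any way counts.
total_mates = 2 * total_pairs
overall_rate = 100.0 * (total_mates - mates_unaligned) / total_mates

# Pair-level rates: only pairs that aligned together as a pair.
concordant_rate = 100.0 * (concordant_once + concordant_multi) / total_pairs
any_pair_rate = 100.0 * (
    concordant_once + concordant_multi + discordant_once
) / total_pairs

print(f"overall, per mate:          {overall_rate:.2f}%")
print(f"concordant pairs:           {concordant_rate:.2f}%")
print(f"concordant or discordant:   {any_pair_rate:.2f}%")
```

The per-mate figure reproduces the reported 91.93%, while the concordant-pair figure is simply 78.16% + 8.66%. The pair-level number is the fairer one to set against a paired-end mapping rate from another aligner.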


Do you know if there's a way I could improve my alignment using star?


Thank you!

Alexander Dobin

Jun 1, 2018, 5:12:48 PM
to rna-star
Hi S Perera,

My first guess is that the reads reported as mapped by Tophat or HISAT are actually single-end alignments or improper pairs, which STAR does not output by default. To check this, you can map each mate separately and see whether the mappability improves. If this does not explain it, please cut out ~100,000 reads (preferably from the middle of the files) and send them to me; I will have a look.

Cheers
Alex
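For anyone who needs to cut out such a subset, here is a minimal Python sketch that copies a block of records from the middle of a gzipped FASTQ file. It assumes standard four-line FASTQ records (no wrapped sequence lines). The function names and the output file names in the usage comment are hypothetical. Running it with the same start position and count on both mate files keeps the subset paired.

```python
import gzip
from itertools import islice


def middle_start(total_reads, n_reads):
    """0-based index of the first record of a centred block of n_reads."""
    return max(0, (total_reads - n_reads) // 2)


def extract_reads(in_path, out_path, start_read, n_reads):
    """Copy n_reads FASTQ records starting at 0-based record start_read.

    Assumes each record is exactly four lines. Returns the number of
    complete records written.
    """
    first_line = 4 * start_read
    last_line = first_line + 4 * n_reads
    lines_written = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in islice(src, first_line, last_line):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


# Example usage (output names are placeholders; 22409080 is the read
# count shown in the logs of this thread):
#   start = middle_start(22409080, 100000)
#   extract_reads("/home/5_S4_R1.fastq.gz", "subset_R1.fastq.gz", start, 100000)
#   extract_reads("/home/5_S4_R2.fastq.gz", "subset_R2.fastq.gz", start, 100000)
```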

S Perera

Jun 2, 2018, 8:00:40 AM
to rna-star
Hi Alex,  

Thanks so much for the suggestion. I aligned each mate separately and there is a definite improvement. I've copied the Log.progress.out file for one of the mates.

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Jun 02 11:40:21    232.0     3930598       79    79.8%     78.8     0.4%    14.8%     0.8%     0.0%     3.9%     0.8%
Jun 02 11:41:21    258.7     8696869       79    79.7%     78.8     0.4%    14.8%     0.8%     0.0%     4.0%     0.7%
Jun 02 11:42:21    275.8    13864269       79    79.8%     78.8     0.4%    14.7%     0.8%     0.0%     4.0%     0.7%
Jun 02 11:43:22    280.9    18885252       79    79.8%     78.8     0.4%    14.7%     0.8%     0.0%     4.0%     0.7%
Jun 02 11:45:02    235.9    22409080       79    79.8%     78.8     0.4%    14.7%     0.8%     0.0%     4.0%     0.6%

ALL DONE!
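To put the two runs side by side, the final rows of the paired-end and single-mate logs can be reduced to a few numbers. The Python sketch below is only an illustration: it assumes the last eight columns of a Log.progress.out row follow the header order shown in this thread (mapped unique, mapped length, MMrate, mapped multi, mapped multi+, unmapped MM, unmapped short, unmapped other), and the function name is made up.

```python
def summarize_progress_row(row):
    """Reduce one Log.progress.out data row to a few percentages."""
    fields = row.split()
    # Assumption: the last eight columns are in the header order
    # unique, length, MMrate, multi, multi+, MM, short, other.
    unique, _length, _mm_rate, multi, multi_plus, unmapped_mm, short, other = (
        float(f.rstrip("%")) for f in fields[-8:]
    )
    return {
        "unique_plus_multi": round(unique + multi, 1),
        "mapped_multi_plus": multi_plus,
        "unmapped_short": short,
        "unmapped_total": round(unmapped_mm + short + other, 1),
    }


# Final rows copied from the two logs in this thread.
paired = ("May 31 15:34:35     89.4    22409080      158    61.7%    157.0"
          "     0.3%     7.8%     0.3%     0.1%    30.0%     0.2%")
single = ("Jun 02 11:45:02    235.9    22409080       79    79.8%     78.8"
          "     0.4%    14.7%     0.8%     0.0%     4.0%     0.6%")

for label, row in (("paired-end", paired), ("one mate", single)):
    print(label, summarize_progress_row(row))
```

The main difference is in the "Unmapped short" column, which falls from about 30% in the paired-end run to about 4% when one mate is mapped alone.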


Do you have any suggestions for how I should proceed with aligning the pairs? (I am hoping to do a differential expression analysis.) Should I try trimming before the alignment? (I didn't do any trimming beforehand, as the read quality was fine.)

Thanks again.

Best
Surangi

Alexander Dobin

Jun 5, 2018, 11:45:42 AM
to rna-star
Hi Surangi,

The good single-end (for each mate) but poor paired-end mapping might indicate a problem with the pairing of the reads in the FASTQ files.
Sometimes this is caused by the trimming software, so the first thing to do is to try mapping the raw reads without trimming.
Also, you can check the pairing of the reads by comparing all read names (lines 1, 5, 9, ...) in the two FASTQ files.

Cheers
Alex
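The read-name comparison described above could be sketched in Python as follows. This is a minimal illustration, not part of STAR: it assumes four-line FASTQ records, treats the first whitespace-delimited token of each header as the read name, and strips an optional trailing /1 or /2. All function names are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_id(header):
    """Read name from a FASTQ header: no '@', no description, no /1 or /2."""
    fields = header.split()
    name = fields[0] if fields else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def first_pairing_mismatch(r1_path, r2_path):
    """Return (record_number, id1, id2) for the first mismatch, else None.

    record_number is 1-based. If one file ends before the other, both
    IDs are reported as None.
    """
    with open_fastq(r1_path) as f1, open_fastq(r2_path) as f2:
        for line_no, (l1, l2) in enumerate(zip_longest(f1, f2)):
            if l1 is None or l2 is None:
                return (line_no // 4 + 1, None, None)
            if line_no % 4 != 0:
                continue  # only header lines (1, 5, 9, ...) carry names
            id1, id2 = read_id(l1), read_id(l2)
            if id1 != id2:
                return (line_no // 4 + 1, id1, id2)
    return None


# Example usage with the files from this thread:
#   print(first_pairing_mismatch("/home/5_S4_R1.fastq.gz",
#                                "/home/5_S4_R2.fastq.gz"))
```

A result of None means every record has the same name in both files, in the same order. Any other result gives the first record where the two files disagree.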