Regarding too many loci alignment percentage

Jeffin

Sep 1, 2015, 9:58:03 AM
to rna-star
Hi Alex,
I ran a STAR alignment of paired-end Illumina FASTQ data from a wheat sample using the command below.
STAR_2.4.2a/bin/Linux_x86_64_static/STAR --runThreadN 14 --runMode alignReads --genomeDir $GENOME_DIR --genomeFastaFiles $FASTA_FILE --sjdbGTFfile $GTF --sjdbOverhang 99 --limitGenomeGenerateRAM 130116110378 --genomeChrBinNbits 12 --readFilesIn $R1_FILE $R2_FILE --outReadsUnmapped Fastx --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate

Log.final.out obtained is as follows:

                                 Started job on |       Aug 28 16:42:15
                             Started mapping on |       Aug 28 17:14:53
                                    Finished on |       Aug 28 23:31:59
       Mapping speed, Million of reads per hour |       6.72

                          Number of input reads |       42262849
                      Average input read length |       167
                                    UNIQUE READS:
                   Uniquely mapped reads number |       10922670
                        Uniquely mapped reads % |       25.84%
                          Average mapped length |       156.75
                       Number of splices: Total |       9867
            Number of splices: Annotated (sjdb) |       2590
                       Number of splices: GT/AG |       9371
                       Number of splices: GC/AG |       267
                       Number of splices: AT/AC |       70
               Number of splices: Non-canonical |       159
                      Mismatch rate per base, % |       2.64%
                         Deletion rate per base |       0.05%
                        Deletion average length |       1.78
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.03
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       13125197
             % of reads mapped to multiple loci |       31.06%
        Number of reads mapped to too many loci |       7967283
             % of reads mapped to too many loci |       18.85%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       24.25%
                     % of reads unmapped: other |       0.00%


The "too many loci" percentage is unusually high, and 0.00% is reported for both "too many mismatches" and "other". Since other samples show similar percentages, I am unsure whether I have specified something incorrectly, or whether I need to add or modify a parameter in the STAR alignReads command given above.
(For the same input data, TopHat gave an alignment rate of 54.16%, which is close to the uniquely mapped and multi-mapped percentages from STAR taken together. Hence the concern about the "too many loci" percentage.)
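
The comparison above can be checked by parsing the summary and adding up the categories. This is only a rough sketch: the helper names `parse_log_final` and `percent` are hypothetical, and it assumes the `label | value` layout of the Log.final.out shown in this post.

```python
def parse_log_final(text):
    """Parse 'label | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats


def percent(stats, label):
    """Return a percentage field such as '25.84%' as a float."""
    return float(stats[label].rstrip("%"))


# Values copied from the Log.final.out in this post.
LOG_EXCERPT = """
                        Uniquely mapped reads % |       25.84%
             % of reads mapped to multiple loci |       31.06%
             % of reads mapped to too many loci |       18.85%
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       24.25%
                     % of reads unmapped: other |       0.00%
"""

stats = parse_log_final(LOG_EXCERPT)
unique = percent(stats, "Uniquely mapped reads %")
multi = percent(stats, "% of reads mapped to multiple loci")

# Unique + multi-mapped is the figure comparable to the TopHat rate.
reported_mapped = round(unique + multi, 2)
# All six categories together should account for every input read.
all_categories = round(sum(percent(stats, key) for key in stats), 2)
print(reported_mapped, all_categories)
```

With these values, unique plus multi-mapped reads come to 56.90%, and the six categories together add up to 100%, so the 18.85% "too many loci" reads are counted separately from the reads reported as mapped.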
Please advise.

Regards,
Jeffin Rockey

Alexander Dobin

Sep 3, 2015, 2:50:00 PM
to rna-star
Hi Jeffin,

by default, STAR only outputs reads that map to <=10 loci, others are considered "mapped to too many loci".
You can increase this threshold by increasing --outFilterMultimapNmax. I would start with a value of 50 and check the decrease of "mapped to too many loci" number (and equal increase in multi-mappers). You can make this parameter >50, but then you would also need to increase --winAnchorMultimapNmax (=50 by default).
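
As a rough illustration of how the threshold shifts reads between categories, here is a minimal sketch. It is not STAR's implementation: `classify_read` and `multimap_flags` are hypothetical helpers, the locus counts are made up, and setting --winAnchorMultimapNmax equal to the new threshold is an assumption (the advice above only says it needs to be increased).

```python
def classify_read(n_loci, multimap_nmax=10):
    """Assign a read to a Log.final.out category from its number of loci.

    Simplification: a read with no alignment is just called 'unmapped'.
    """
    if n_loci == 0:
        return "unmapped"
    if n_loci == 1:
        return "unique"
    if n_loci <= multimap_nmax:
        return "multiple loci"
    return "too many loci"


def multimap_flags(multimap_nmax, win_anchor_default=50):
    """Build the STAR command-line flags for a given multimapper threshold.

    Assumption: when the threshold exceeds the default of 50,
    --winAnchorMultimapNmax is raised to the same value.
    """
    flags = ["--outFilterMultimapNmax", str(multimap_nmax)]
    if multimap_nmax > win_anchor_default:
        flags += ["--winAnchorMultimapNmax", str(multimap_nmax)]
    return flags


# Hypothetical locus counts for five reads.
loci_per_read = [1, 4, 10, 15, 60]
for nmax in (10, 50):
    print(nmax, [classify_read(n, nmax) for n in loci_per_read])

print(" ".join(multimap_flags(50)))
print(" ".join(multimap_flags(100)))
```

In this toy example the read with 15 loci moves from "too many loci" to "multiple loci" when the threshold goes from 10 to 50, while the read with 60 loci stays in "too many loci" either way.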

It is a bit strange that only a very small number of reads are spliced: ~10k splices among 11M uniquely mapped reads. Is this expected for wheat?

Cheers
Alex

Jeffin

Sep 16, 2015, 10:30:52 AM
to rna-star
Hi Alex,
  Thank you for your advice, and sorry that I could not reply earlier.
  With --outFilterMultimapNmax 20 alone, the alignments were much better, from which I understand that quite a number of reads in the sample align to >10 but <=20 loci.

Though that issue is solved, could you please give me some hints on two questions I have about the splices?
  1) What exactly is meant by the number of splices? Is it the number of reads (pairs) with splice junctions identified in them (for example, those with N in the CIGAR string)?
  2) Does the number of splices reported depend on the GTF file used for genome generation and alignment? (I ask because, quite unusually, the Ensembl GTF I used did not readily show multiple transcripts with the same gene ID.)

Many thanks,
Jeffin

Alexander Dobin

Sep 17, 2015, 11:44:10 PM
to rna-star
Hi Jeffin,

The number of splices in the Log.final.out file is calculated as the number of all "N" operations in the CIGARs of unique mappers, so if there is more than one junction per read pair, each junction is counted separately.
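
The counting rule can be sketched as follows. This is only an illustration under stated assumptions: the CIGAR strings are invented, `count_splices` and `splices_in_pair` are hypothetical helpers, and the sketch simply adds up the N operations of both mates without addressing how a junction covered by both mates is handled.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_splices(cigar):
    """Count 'N' (skipped reference region) operations in one CIGAR string."""
    return sum(1 for _length, op in CIGAR_OP.findall(cigar) if op == "N")


def splices_in_pair(mate1_cigar, mate2_cigar):
    """Add up the N operations of both mates of a read pair."""
    return count_splices(mate1_cigar) + count_splices(mate2_cigar)


# Hypothetical uniquely mapped pair: one junction in mate 1, two in mate 2.
pair = ("60M1200N40M", "30M850N50M2300N20M")
print(splices_in_pair(*pair))  # prints 3
```

A single read pair can therefore add more than one to the total, which is why the number of splices is not the same as the number of spliced reads.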

The number of splices depends strongly on the GTF file. Splices across junctions from the GTF are called annotated. They usually constitute >95% of all junctions.
If this is not the case, it may indicate a problem with the GTF.
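
Applied to the Log.final.out posted at the top of this thread, the check looks roughly like this. The function name `annotated_fraction` is hypothetical, and the 0.95 cutoff is simply the ">95%" rule of thumb above, not a STAR setting.

```python
def annotated_fraction(total_splices, annotated_splices):
    """Fraction of splices that cross junctions annotated in the GTF (sjdb)."""
    if total_splices <= 0:
        raise ValueError("total_splices must be positive")
    return annotated_splices / total_splices


# Values copied from the Log.final.out at the top of this thread.
total = 9867          # Number of splices: Total
annotated = 2590      # Number of splices: Annotated (sjdb)
by_motif = {"GT/AG": 9371, "GC/AG": 267, "AT/AC": 70, "Non-canonical": 159}

# The per-motif counts add up to the total, so the log is self-consistent.
motif_sum = sum(by_motif.values())

frac = annotated_fraction(total, annotated)
below_rule_of_thumb = frac < 0.95
print(f"motif sum = {motif_sum}, annotated = {frac:.1%}, below 95%: {below_rule_of_thumb}")
```

For this sample only about 26% of the splices are annotated, well under the usual >95%, so the GTF is worth a closer look.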

Cheers
Alex

Jeffin

Sep 18, 2015, 9:14:38 AM
to rna-star
Hi Alex,

  Thank you very much for providing the details. That gave a lot of clarity on the splice numbers.

Regards,
Jeffin Rockey