Hi all,
I am currently using STAR to map several HiSeq mRNA runs, and I am very pleased with the run time (working with 200M+ read HiSeq runs...)
However, I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)
In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.
The unmapped bin that most of the reads fall into is "too short", which I believe Alex has pointed out to be correlated with read quality. But I've run the runs through FastQC, and the quality is pretty good up until the ~85th base out of 101.
What parameter might I be mis-specifying? These are PE 101 Illumina reads, and we have around 200M reads per sample.
What other parameter could be causing the "unmapped reads: too short" category to be so large?
Many Thanks!
Carmen
My command line is like so:
# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]
/path/to/STAR --genomeDir /path/to/Fly/ \
    --readFilesCommand 'zcat -fc' \
    --readFilesIn $1 $2 \
    --runThreadN 32 \
    --genomeLoad LoadAndRemove \
    --outFilterMultimapNmax 100 \
    --outFilterMultimapScoreRange 2 \
    --outSAMstrandField None \
    --outSAMmode Full \
    --outSAMattributes Standard \
    --outSAMunmapped None \
    --outFilterType BySJout \
    --outStd SAM | samtools view -b -o $3_STAR.bam -S -
This is the command line I used to build the STAR index.
/path/to/STAR_2.3.0e/STAR --runMode genomeGenerate \
    --genomeDir /path/to/Genomes/Fly/ \
    --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa \
    --runThreadN 16 \
    --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf \
    --sjdbGTFtagExonParentTranscript transcript_id \
    --sjdbOverhang 100
And all Final Logs look something like this:
./Log.final.out
Started job on | May 20 22:49:23
Started mapping on | May 20 22:52:08
Finished on | May 21 05:18:10
Mapping speed, Million of reads per hour | 32.74
Number of input reads | 210640950
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 29841188
Uniquely mapped reads % | 14.17%
Average mapped length | 190.07
Number of splices: Total | 4405621
Number of splices: Annotated (sjdb) | 4123661
Number of splices: GT/AG | 4348429
Number of splices: GC/AG | 25290
Number of splices: AT/AC | 664
Number of splices: Non-canonical | 31238
Mismatch rate per base, % | 1.23%
Deletion rate per base | 0.03%
Deletion average length | 1.92
Insertion rate per base | 0.02%
Insertion average length | 2.41
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 36646260
% of reads mapped to multiple loci | 17.40%
Number of reads mapped to too many loci | 229494
% of reads mapped to too many loci | 0.11%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 58.84%
% of reads unmapped: other | 9.49%
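
In case it helps anyone compare samples, here is a rough sketch of how the mapped and "too short" percentages could be tallied from a final log. It assumes the "label | value" layout shown above; summarize_star_log is just a hypothetical helper name, not something that ships with STAR.

```shell
# Hypothetical helper: sum the uniquely mapped and multi-mapped percentages
# from a STAR Log.final.out and report the "too short" unmapped percentage.
# Assumes each entry looks like "label | value", as in the log above.
summarize_star_log() {
    awk -F'|' '
        /Uniquely mapped reads %/            { uniq = $2 + 0 }      # "+ 0" drops the trailing %
        /% of reads mapped to multiple loci/ { multi = $2 + 0 }
        /% of reads unmapped: too short/     { tooshort = $2 + 0 }
        END { printf "mapped=%.2f too_short=%.2f\n", uniq + multi, tooshort }
    ' "$1"
}

# Usage: summarize_star_log ./Log.final.out
```

For the log above, the unique and multi-mapping percentages add up to 14.17 + 17.40 = 31.57% mapped, against 58.84% in the "too short" bin.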