Hi all,
I have been trying to do some alignments using STAR. The software indeed seems stellar, giving very clear outputs and with most options well explained in the documentation (and of course the ultra-high mapping speed). However I am getting quite a high “% of reads unmapped: too short”.
Having read quite extensively through similar problems both on this google group and on SeqAnswers, etc, I'm still a bit unsure as to why exactly this is happening.
For a bit more information on the issue:
I have assembled a transcriptome de novo from an RNA-seq dataset. As of yet I have no GTF file, I am mapping straight to the transcript contigs. When running STAR on my denovo assembly I don't get too many unmapped reads:
Average input read length | 182
Uniquely mapped reads % | 43.81%
Average mapped length | 179.89
% of reads mapped to multiple loci | 50.93%
% of reads mapped to too many loci | 0.90%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 4.35%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
The high rate of multi-mapping loci is expected here as the dataset is comprised of 6 individuals with high levels of heterozygosity, and the raw assembly thus contains allelic variation and alternatively spliced transcripts sharing exons.
However, when I undergo a pipeline to collapse some of this allelic variation, and scaffold a lot of my contigs together, the rate of unmapped reads being “too short” dramatically increases on the new assembly:
Average input read length | 182
Uniquely mapped reads % | 46.69%
Average mapped length | 179.19
% of reads mapped to multiple loci | 7.94%
% of reads mapped to too many loci | 0.01%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 45.36%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Having read through various forums, I have changed various parameters of the alignment to try to 'optimise' the mapping run. I am confident that sequence quality is not an issue for either forward or reverse reads – I was fairly stringent with trimming (using Trimmomatic) and adapter sequences have already been trimmed off, and from eyeballing the FastQC outputs none of my .fq files have bases with low quality, most in fact are >30. I even tried tricking STAR by icreasing '--outQSconversionAdd' to '20', but this has no effect on the results.
The STAR options which seem to have the most effect are indeed '--outFilterMatchNminOverLread' '--outFilterScoreMinOverLread'. Dropping the value of these from the defauilt leads to a decreasing amount of “reads unmapped”. Setting both the options to '0' results in 100% of the reads mapping back to the reference, with ~74% uniquely mapping. The 100% of reads mapping is a result I do not trust, especially as I know a number of contigs have been dropped from the original assembly.
I would like to ask:
1 – Why reads are classed as “unmapped too short”, as opposed to “too many mistmatches” or simply not finding any suitable alignment match in the reference?
2 – Given that I used to have high mapping rates with the original assembly, why am I suddenly finding so many are “too short”?
3 – How can I optimise the mapping but still be confident that I'm not relaxing the parameters too far so as to induce false matches? I would expect some mistmatching due to the new assembly being a tentative consensus from different contigs and alleles.
Thank you very much, and my apologies for such a long post!
Ali
./STAR --runThreadN 10 --runMode genomeGenerate --genomeDir $DIR/GenomeIndices --genomeFastaFiles $DIR/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile $DIR/Homo_sapiens.GRCh38.84.gtf --sjdbOverhang 100
./STAR --runThreadN 10 --genomeDir GenomeIndices --readFilesCommand gunzip -c $DIR/TrimmedFASTQ/$1_R1_val_1.fq.gz $DIR/TrimmedFASTQ/$1_R2_val_2.fq.gz --sjdbGTFfile /Homosapiens/Homo_sapiens.GRCh38.84.gtf --quantMode TranscriptomeSAM GeneCounts --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outWigType bedGraph read1_5p --outFileNamePrefix $DIR/Alignment/$1_