Hi Monica,
The minimum mapped length is controlled by --outFilterScoreMinOverLread and --outFilterMatchNminOverLread.
Both parameters default to 0.66, i.e. if either the number of matched bases or the alignment score (the number of matched bases with penalties applied) is below 66% of the read length, the alignment is not output and the read is reported as "too short".
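To make the rule above concrete, here is a minimal sketch of the arithmetic, assuming the cutoff is simply the fraction multiplied by the read length. The function name min_over_lread is made up for illustration, and STAR's exact rounding of the cutoff is not covered here.

```shell
# Sketch: approximate minimum score / matched bases for a read of a given
# length, assuming cutoff = fraction * read length (default fraction 0.66).
# min_over_lread is a hypothetical helper, not part of STAR.
min_over_lread() {
    # $1 = read length in bases, $2 = fraction (optional, defaults to 0.66)
    awk -v len="$1" -v frac="${2:-0.66}" 'BEGIN { printf "%.1f\n", len * frac }'
}

min_over_lread 50      # 50-base read with the default fraction
min_over_lread 50 0    # fraction 0 disables the relative-length filter
```

With the default fraction, a 50-base read needs roughly 33 matched bases (and a score of about 33) to be reported.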
In your case, it appears that 45% of your reads have a mapped length below 66% of the read length (i.e. 33 bases for 50-base reads). Is this what you expect? For example, this would happen if you have short RNA libraries and did not fully trim the adapter. If your libraries are standard long RNA-seq, this generally should not happen.
You can try setting both --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0, and check the average mapped length you get in that case.
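A rough sketch of that check follows. The function names, the STAR_BIN variable, and all paths are placeholders rather than anything from the original run, and the second helper assumes that Log.final.out contains an "Average mapped length" line with the value after the "|" separator.

```shell
# Placeholder: path to the STAR executable (adjust to your installation).
STAR_BIN="${STAR_BIN:-STAR}"

# Hypothetical helper: rerun the alignment with both relative-length
# filters set to 0, leaving every other parameter at its default.
rerun_without_length_filter() {
    # $1 = genome directory, $2 = FASTQ file, $3 = output file prefix
    "$STAR_BIN" \
        --genomeDir "$1" \
        --readFilesIn "$2" \
        --outFileNamePrefix "$3" \
        --outFilterScoreMinOverLread 0 \
        --outFilterMatchNminOverLread 0
}

# Hypothetical helper: print the "Average mapped length" value from a
# Log.final.out file (assumes a "name | value" line layout).
avg_mapped_length() {
    awk -F'|' '/Average mapped length/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}
```

For example, `rerun_without_length_filter /references/star49 reads.fastq ./nofilter_` followed by `avg_mapped_length ./nofilter_Log.final.out` would show how long the alignments are once nothing is rejected for being short (reads.fastq and the nofilter_ prefix are illustrative names).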
If you could send me a few hundred thousand of the reads, I could take a look at them.
Cheers
Alex
On Wednesday, January 2, 2013 9:03:09 PM UTC-5, mbritton wrote:
Hi folks,
I've been using STAR for a little while, and today tried running it for the first time on a small set of HiSeq 50 bp single-end sequences, aligning to hg19. I had originally generated the genome (with gencode.v14.annotation.gtf.sjdb) using --sjdbOverhang 99, but I have since re-generated the genome (in a new directory) using --sjdbOverhang 49, which would seem to be more appropriate for these reads. I also generated a genome without the sjdb file (so no --sjdbOverhang). I ran star against all these genome directories, using both the raw reads, and the same reads that had been trimmed for quality and adapter contamination. A sample command line is:
star_2.2.0c/star \
--genomeDir /references/star49 \
--readFilesIn \
../raw/C1556ACXX_721_S08_8750_CAGATC_CAGATC_L007_R1_001.fastq \
--runThreadN 14 --genomeLoad LoadAndRemove \
--outFileNamePrefix ./C1556ACXX_721_S08_8750_49_ \
--outFilterMultimapNmax 2
The results are very similar for each type of run, with a very high percentage of "too short" unmapped reads.
Here are the last few lines of a Log.final.out:
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 45.77%
% of reads unmapped: other | 0.17%
This stays at ~45% whether I input untrimmed reads (all 50bp) or trimmed reads (most 50bp, but some as short as 20).
What exactly does "too short" refer to? Does it mean that the aligned portion of the read is "too short"? Is there a parameter that I should change that would allow more reads to be mapped?
Thanks,
Monica