Hi everyone,
I have been struggling with this "high % of reads unmapped: too short" problem for almost a month. I am working with Brassica napus and mapping raw Illumina paired-end 100 bp reads to its reference genome.
The command I have been using is "STAR --genomeDir path-to-genome --readFilesIn ../1_1.fq ../1_2.fq --outSAMtype BAM SortedByCoordinate --sjdbGTFfile genome.gff3 --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic --alignIntronMax 15000 --outFilterIntronMotifs RemoveNoncanonical --runThreadN 6 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --outReadsUnmapped Fastx"
The percentage of unmapped reads ranged from 28% to 55% across my 24 libraries. (Such a low mapping rate! I was expecting at least 80%.)
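For anyone who wants to compare numbers, the per-library figure can be pulled out of STAR's Log.final.out files along these lines. This is only a minimal sketch: the function name summarize_too_short and the lib*/ directory layout are hypothetical, and it assumes the usual "key | value" layout of Log.final.out.

```shell
# Print "<log file><TAB><percent>" for the "% of reads unmapped: too short"
# line of each STAR Log.final.out given as an argument.
# summarize_too_short is a hypothetical helper name, not part of STAR.
summarize_too_short() {
    for log in "$@"; do
        awk -F '|' -v f="$log" '
            /% of reads unmapped: too short/ {
                gsub(/[ \t]/, "", $2)   # strip padding around the value
                print f "\t" $2
            }' "$log"
    done
}

# Hypothetical usage, assuming one STAR output directory per library:
# summarize_too_short lib*/Log.final.out
```

Matching on the "% of reads unmapped: too short" text keeps the sketch from picking up the other unmapped categories (too many mismatches, other).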
I have tried several things to track down the cause:
1) I mapped my reads with BWA and TopHat; both also gave a low mapping rate, so this is not a STAR-specific problem.
2) I mapped my reads to the combined B. rapa and B. oleracea genomes (the two ancestral genomes of Brassica napus) and got almost the same low mapping rate, so it is not a problem with the B. napus genome.
3) I included mitochondrial, chloroplast, and rRNA sequences in my reference and still got a similarly low mapping rate, so mitochondrial, chloroplast, or rRNA contamination is not the cause.
None of these helped. The last thing I tried was to de novo assemble my unmapped reads into transcripts using Trinity and BLAST the assembled transcripts against NCBI GenBank to see what they hit. Guess what: out of the 43749 assembled transcripts, only 1945 did not hit Brassica napus genes, and rRNA, chloroplast, and mitochondrial contamination did not seem to be a problem either.
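For reference, a count like this can be obtained roughly as follows. It is a sketch under stated assumptions, not necessarily the exact procedure used here: it assumes the BLAST hits were written as a two-column tab-separated table (for example with -outfmt "6 qseqid stitle"), that a hit to B. napus has "Brassica napus" in its title, and the function and file names are hypothetical.

```shell
# Count assembled transcripts that have no hit whose title mentions
# "Brassica napus". Transcripts with no BLAST hit at all are included,
# because the total comes from the FASTA headers, not from the hit table.
# count_without_napus_hit is a hypothetical helper name.
count_without_napus_hit() {
    fasta=$1   # Trinity assembly (FASTA)
    hits=$2    # tab-separated: query ID, subject title
    total=$(grep -c '^>' "$fasta")
    with_hit=$(awk -F '\t' '
        $2 ~ /Brassica napus/ { seen[$1] = 1 }   # one entry per query
        END { n = 0; for (q in seen) n++; print n }' "$hits")
    echo $((total - with_hit))
}

# Hypothetical usage:
# count_without_napus_hit Trinity.fasta blast_hits.tsv
```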
Advanced users or developers, should I change the advanced options for my mapping? Or should I trim my data?
confused....
I appreciate any input and suggestions.
Thank you very much!
Ruijuan Li