Hi there,
I have single-cell RNA-seq data from samples that contain a mix of four different species.
What I did was build one large reference genome by merging the four references, and then run the STAR alignment for read 1 and read 2 separately. However, it turned out that not all reads were mapped correctly: many of the aligned reads cannot be associated with a gene (based on exon sequence similarity).
Here is my alignment command:
STAR --runThreadN 30 --outSAMtype BAM SortedByCoordinate \
--genomeDir star.mix.index/ \
--outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonical --outFilterMismatchNmax 5 \
--alignSJDBoverhangMin 6 --alignSJoverhangMin 6 --outFilterType BySJout --alignIntronMin 25 \
--alignIntronMax 1000000 --outSAMstrandField intronMotif --outSAMunmapped Within --alignMatesGapMax 1000000 \
--readFilesIn R1_001.fastq.gz \
--readFilesCommand zcat --outFileNamePrefix R1/
So I am thinking of either:
1) mapping the reads against one reference genome at a time, recording four 'exact match rates' for each read, and selecting the highest one;
2) or raising --outFilterMultimapNmax to a higher number and filtering later on.
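To make option 1) concrete, here is a rough Python sketch of the per-read selection I have in mind. It assumes I have already extracted one score per read from each of the four single-species alignments. The function name `pick_species`, the species labels, and the numbers are all made up for illustration, and the score is only a placeholder for whatever 'exact match rate' turns out to be appropriate.

```python
from typing import Dict, Optional


def pick_species(scores: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the species with the strictly highest score.

    `scores` maps a species label to the read's score against that
    reference, or None if the read did not map there.
    Returns None if the read is unmapped everywhere or the top score is tied.
    """
    # Keep only the references the read actually mapped to.
    mapped = {sp: s for sp, s in scores.items() if s is not None}
    if not mapped:
        return None
    best = max(mapped.values())
    winners = [sp for sp, s in mapped.items() if s == best]
    # A tie between species is treated as ambiguous rather than guessed.
    return winners[0] if len(winners) == 1 else None


# Hypothetical scores for a single read (illustrative values only).
example = {"speciesA": 98.0, "speciesB": 91.0, "speciesC": None, "speciesD": None}
print(pick_species(example))
```

Treating ties as ambiguous is just one possible choice; the read could also be kept and flagged instead of dropped.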
My questions are:
a) Is there any measure that can be used as the 'exact match rate' in 1)?
b) How can I select the best alignment if I allow reads to be multi-mapped in 2)?
Thanks,
Sophie