Dear Alex,
Thanks for the excellent tool, which aligns RNA-seq data very fast.
I am using STAR to align our dataset. After adapter removal, the trimmed reads were aligned with STAR, and I am getting a good alignment rate of around 95% on average across all samples.
Here is the command I used:
STAR --runThreadN 32 --genomeDir STARIndex --readFilesIn test_1.fastq test_2.fastq --outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 30000000000 --twopassMode Basic --outSAMunmapped Within --outFilterMultimapNmax 10 --outTmpDir tmpdir
STAR Log.final.out file:
Mapping speed, Million of reads per hour | 107.07
Number of input reads | 23585104
Average input read length | 276
UNIQUE READS:
Uniquely mapped reads number | 15965778
Uniquely mapped reads % | 67.69%
Average mapped length | 273.29
Number of splices: Total | 10931770
Number of splices: Annotated (sjdb) | 10870946
Number of splices: GT/AG | 10809839
Number of splices: GC/AG | 73575
Number of splices: AT/AC | 3369
Number of splices: Non-canonical | 44987
Mismatch rate per base, % | 0.43%
Deletion rate per base | 0.01%
Deletion average length | 1.51
Insertion rate per base | 0.03%
Insertion average length | 1.15
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 6558742
% of reads mapped to multiple loci | 27.81%
Number of reads mapped to too many loci | 10347
% of reads mapped to too many loci | 0.04%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 4.25%
% of reads unmapped: other | 0.20%
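For reference, the roughly 95% alignment rate I mention is the sum of the uniquely mapped and multi-mapped reads. A minimal sketch of that arithmetic, using the numbers from the log above (the function name mapped_pct is my own and is not part of STAR):

```shell
#!/usr/bin/env bash
# mapped_pct is a hypothetical helper, not a STAR command.
# It reports uniquely mapped plus multi-mapped reads as a share of input reads.
mapped_pct() {
  # $1 = uniquely mapped reads, $2 = reads mapped to multiple loci, $3 = input reads
  awk -v u="$1" -v m="$2" -v n="$3" 'BEGIN { printf "%.2f\n", 100 * (u + m) / n }'
}

# Values copied from the Log.final.out above
mapped_pct 15965778 6558742 23585104   # prints 95.50
```

So of the 95.5% of reads that map, 67.69% are unique and 27.81% are multi-mappers.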
Next, I ran featureCounts, and only a few reads (20 to 30%) were assigned to features on the reverse strand (a TruSeq stranded library was used for sequencing, which is "reverse" stranded by default). I observed that reads were assigned to both the forward and reverse strands, which should not happen for a stranded library.
I also observed that %GC is high (49-55%) across all samples, both before and after trimming.
Even though all adapters were removed, we still have overrepresented sequences with "no hit" in the other read of the pair (R2). Given these observations, I am worried that the data may be contaminated. Please let me know what checks I need to perform for contamination in the data.
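One check I could run to confirm the strandedness independently of featureCounts, assuming I rerun STAR with --quantMode GeneCounts so that it writes a ReadsPerGene.out.tab file: sum the per-gene counts in column 3 (read 1 strand matches the RNA) and column 4 (read 2 strand matches the RNA, i.e. "reverse"). This is only a sketch; the helper name strand_totals and the toy table are made up for illustration and are not counts from my data.

```shell
#!/usr/bin/env bash
# strand_totals is a hypothetical helper, not part of STAR.
# Input: a ReadsPerGene.out.tab file with tab-separated columns
#   1 = gene ID, 2 = unstranded counts,
#   3 = counts with read 1 on the RNA strand, 4 = counts with read 2 on the RNA strand.
strand_totals() {
  # Ignore the N_unmapped / N_multimapping / N_noFeature / N_ambiguous summary rows
  awk -F'\t' '$1 !~ /^N_/ { fwd += $3; rev += $4 }
       END { printf "forward=%d reverse=%d\n", fwd, rev }' "$1"
}

# Toy table with invented numbers, for illustration only
tmp=$(mktemp)
printf 'N_unmapped\t100\t100\t100\n'    >  "$tmp"
printf 'N_multimapping\t50\t50\t50\n'   >> "$tmp"
printf 'N_noFeature\t10\t200\t20\n'     >> "$tmp"
printf 'N_ambiguous\t5\t1\t4\n'         >> "$tmp"
printf 'geneA\t110\t5\t105\n'           >> "$tmp"
printf 'geneB\t220\t10\t210\n'          >> "$tmp"

strand_totals "$tmp"   # prints forward=15 reverse=315
rm -f "$tmp"
```

As far as I understand, for a reverse-stranded library the column 4 total should be much larger than the column 3 total; if the two are similar, the data would look unstranded.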
I checked related posts before posting this.
Could you please suggest how I can decrease the number of multi-mapped reads so that the % of uniquely mapped reads increases?
Thanks in advance,
Fazulur Rehaman