Dear All,
This is my first post in this community, so I will try to be as straightforward as possible. I want to map paired-end FASTQ files (48bp length after trimming - Trimmomatic) to the GENCODE v42 (hg38) reference genome. Diving through this community, forums, tutorials, and the STAR manual, I came up with the following parameters setting (STAR 2.7.10a 2-pass mode):
1) Indexing genome with annotations (GENCODE v42)
STAR \
--runMode genomeGenerate \
--runThreadN 44 \
--genomeDir GEN_IDX
--sjdbGTFfile gencode.v42.annotation.gtf \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbOverhang 47
2) STAR pass1
STAR \
--runThreadN 44 \
--genomeDir GEN_IDX \
--readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
--outFileNamePrefix OUTPUT/pass1/SAMPLE_ID/ \
--readFilesCommand pigz -p44 -dc \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outMultimapperOrder Random \
--seedSearchStartLmax 25 \
--winAnchorMultimapNmax 100 \
--outFilterMismatchNoverReadLmax 0.05 \
--outFilterMatchNminOverLread 0.66 \
--outFilterScoreMinOverLread 0.66 \
--limitSjdbInsertNsj 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outSAMtype None \
--outSAMmode None \
--sjdbOverhang 47
3) Collection and merging of all SJ.out.tab files into a single SJ.filtered.tab, removing junctions as follow:
cat *.tab | awk '($1 != "chrM" && $5 > 0 && $7 > 2 && $6 == 0)' | cut -f1-6 | sort | uniq > SJ.filtered.tab
4) STAR pass2
STAR \
--runThreadN 44 \
--genomeDir star_gen_idx \
--readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
--outFileNamePrefix OUTPUT/pass2/SAMPLE_ID/ \
--genomeLoad NoSharedMemory \
--readFilesCommand pigz -p44 -dc \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--sjdbScore 2 \
--outMultimapperOrder Random \
--seedSearchStartLmax 25 \
--winAnchorMultimapNmax 100 \
--outFilterMismatchNoverReadLmax 0.05 \
--outFilterMatchNminOverLread 0.66 \
--outFilterScoreMinOverLread 0.66 \
--limitSjdbInsertNsj 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outSAMattributes NH HI NM MD AS XS \
--outSAMunmapped Within \
--outSAMtype BAM SortedByCoordinate \
--sjdbOverhang 47 \
--quantMode TranscriptomeSAM \
--quantTranscriptomeBan IndelSoftclipSingleend \
--outFilterType BySJout \
--sjdbFileChrStartEnd SJ.filtered.tab
Eventually, I will use Aligned.toTranscriptome.out.bam as input for RSEM.
I have also checked the NCI GDC RNA-Seq pipeline. However, I am still looking for an explanation for why they set --outFilterMatchNminOverLread and --outFilterScoreMinOverLread to 0.33 GDC pipeline.
Could anyone validate the commands and, eventually, provide any suggestions?
I appreciate your help.