Need help validating STAR 2-pass mode with ~50bp RNA-Seq data

220 views

Skip to first unread message

Rosario Distefano

unread,

Nov 14, 2022, 7:45:10 AM11/14/22

to rna-star

Dear All,
This is my first post in this community, so I will try to be as straightforward as possible. I want to map paired-end FASTQ files (48bp length after trimming - Trimmomatic) to the GENCODE v42 (hg38) reference genome. Diving through this community, forums, tutorials, and the STAR manual, I came up with the following parameters setting (STAR 2.7.10a 2-pass mode):

1) Indexing genome with annotations (GENCODE v42)

STAR \
--runMode genomeGenerate \
--runThreadN 44 \
--genomeDir GEN_IDX
--sjdbGTFfile gencode.v42.annotation.gtf \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbOverhang 47

2) STAR pass1

STAR \
--runThreadN 44 \
--genomeDir GEN_IDX \
--readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
--outFileNamePrefix OUTPUT/pass1/SAMPLE_ID/ \
--readFilesCommand pigz -p44 -dc \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outMultimapperOrder Random \
--seedSearchStartLmax 25 \
--winAnchorMultimapNmax 100 \
--outFilterMismatchNoverReadLmax 0.05 \
--outFilterMatchNminOverLread 0.66 \
--outFilterScoreMinOverLread 0.66 \
--limitSjdbInsertNsj 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outSAMtype None \
--outSAMmode None \
--sjdbOverhang 47

3) Collection and merging of all SJ.out.tab files into a single SJ.filtered.tab, removing junctions as follow:

cat *.tab | awk '($1 != "chrM" && $5 > 0 && $7 > 2 && $6 == 0)' | cut -f1-6 | sort | uniq > SJ.filtered.tab

4) STAR pass2

STAR \
--runThreadN 44 \
--genomeDir star_gen_idx \
--readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
--outFileNamePrefix OUTPUT/pass2/SAMPLE_ID/ \
--genomeLoad NoSharedMemory \
--readFilesCommand pigz -p44 -dc \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--sjdbScore 2 \
--outMultimapperOrder Random \
--seedSearchStartLmax 25 \
--winAnchorMultimapNmax 100 \
--outFilterMismatchNoverReadLmax 0.05 \
--outFilterMatchNminOverLread 0.66 \
--outFilterScoreMinOverLread 0.66 \
--limitSjdbInsertNsj 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outSAMstrandField intronMotif \
--outSAMattributes NH HI NM MD AS XS \
--outSAMunmapped Within \
--outSAMtype BAM SortedByCoordinate \
--sjdbOverhang 47 \
--quantMode TranscriptomeSAM \
--quantTranscriptomeBan IndelSoftclipSingleend \
--outFilterType BySJout \
--sjdbFileChrStartEnd SJ.filtered.tab

Eventually, I will use Aligned.toTranscriptome.out.bam as input for RSEM.

I have also checked the NCI GDC RNA-Seq pipeline. However, I am still looking for an explanation for why they set --outFilterMatchNminOverLread and --outFilterScoreMinOverLread to 0.33 GDC pipeline.

Could anyone validate the commands and, eventually, provide any suggestions?

I appreciate your help.

Alexander Dobin

unread,

Nov 16, 2022, 4:15:36 PM11/16/22

to rna-star

Hi,

the parameters look fine, but, as always, you will need to check the results to see if they make sense.

--outFilterScoreMinOverLread 0.33 will relax the mapping filter and allow short alignments. I generally do not recommend it as it lets in false alignments.

Reply all

Reply to author

Forward

0 new messages