Need help validating STAR 2-pass mode with ~50bp RNA-Seq data

219 views
Skip to first unread message

Rosario Distefano

unread,
Nov 14, 2022, 7:45:10 AM11/14/22
to rna-star

Dear All,
This is my first post in this community, so I will try to be as straightforward as possible. I want to map paired-end FASTQ files (48bp length after trimming - Trimmomatic) to the GENCODE v42 (hg38) reference genome. Diving through this community, forums, tutorials, and the STAR manual, I came up with the following parameters setting (STAR 2.7.10a 2-pass mode):

1) Indexing genome with annotations (GENCODE v42)

STAR \
    --runMode genomeGenerate \
    --runThreadN 44 \
    --genomeDir GEN_IDX
    --sjdbGTFfile gencode.v42.annotation.gtf \
    --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
    --sjdbOverhang 47



2) STAR pass1

STAR \
    --runThreadN 44 \
    --genomeDir GEN_IDX \
    --readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
    --outFileNamePrefix OUTPUT/pass1/SAMPLE_ID/ \
    --readFilesCommand pigz -p44 -dc \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 999 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --outMultimapperOrder Random \
    --seedSearchStartLmax 25 \
    --winAnchorMultimapNmax 100 \
    --outFilterMismatchNoverReadLmax 0.05 \
    --outFilterMatchNminOverLread 0.66 \
    --outFilterScoreMinOverLread 0.66 \
    --limitSjdbInsertNsj 1000000 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1 \
    --outSAMstrandField intronMotif \
    --outSAMtype None \
    --outSAMmode None \
    --sjdbOverhang 47



3) Collection and merging of all SJ.out.tab files into a single SJ.filtered.tab, removing junctions as follow:

cat *.tab | awk '($1 != "chrM" && $5 > 0 && $7 > 2 && $6 == 0)' | cut -f1-6 | sort | uniq > SJ.filtered.tab


4) STAR pass2

STAR \
    --runThreadN 44 \
    --genomeDir star_gen_idx \
    --readFilesIn SAMPLE_ID_R1.fq.gz SAMPLE_ID_R2.fq.gz \
    --outFileNamePrefix OUTPUT/pass2/SAMPLE_ID/ \
    --genomeLoad NoSharedMemory \
    --readFilesCommand pigz -p44 -dc \
    --outFilterMultimapScoreRange 1 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 999 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --sjdbScore 2 \
    --outMultimapperOrder Random \
    --seedSearchStartLmax 25 \
    --winAnchorMultimapNmax 100 \
    --outFilterMismatchNoverReadLmax 0.05 \
    --outFilterMatchNminOverLread 0.66 \
    --outFilterScoreMinOverLread 0.66 \
    --limitSjdbInsertNsj 1000000 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1 \
    --outSAMstrandField intronMotif \
    --outSAMattributes NH HI NM MD AS XS \
    --outSAMunmapped Within \
    --outSAMtype BAM SortedByCoordinate \
    --sjdbOverhang 47 \
    --quantMode TranscriptomeSAM \
    --quantTranscriptomeBan IndelSoftclipSingleend \
    --outFilterType BySJout \
    --sjdbFileChrStartEnd SJ.filtered.tab


Eventually, I will use Aligned.toTranscriptome.out.bam as input for RSEM.

I have also checked the NCI GDC RNA-Seq pipeline. However, I am still looking for an explanation for why they set  --outFilterMatchNminOverLread and --outFilterScoreMinOverLread to 0.33 GDC pipeline.

Could anyone validate the commands and, eventually, provide any suggestions?

I appreciate your help.

Alexander Dobin

unread,
Nov 16, 2022, 4:15:36 PM11/16/22
to rna-star
Hi,

the parameters look fine, but, as always, you will need to check the results to see if they make sense.
--outFilterScoreMinOverLread   0.33  will relax the mapping filter and allow short alignments. I generally do not recommend it as it lets in false alignments.
Reply all
Reply to author
Forward
0 new messages