Optimal detection of known splice variants

Jan Pieter Vanwelkenhuyzen

unread,

Jan 14, 2022, 8:51:43 AM1/14/22

to rna-star

Hi Alex,

Thanks for your extensive manual and continuing support to the community. Currently

I'm working on a RNA-seq pipeline where the main goal is to identify and count known splice variants. However I was wondering if my strategy is the most optimal method to detect and counts the splice variants and if you have any remarks that.

Some background info: multiple splice variants of the receptor in question are known in human samples and I have a fasta file containing their sequence. Cryptic exons are involved in most of the splice variants. A common overlap between the splice variants is that all of them contain the first 3 exons, however most of them contain a unique exon-exon junction that is used atm to identify the different splice variants.

The general idea is to implement the multi-sample 2-pass method. I start with a 1-pass of every sample to obtain the junction file that will be merged together and filtered. I have an indexed reference genome for the first pass (indexed with a gtf file).

first-pass STAR:

STAR --runThreadN {params.threads} --genomeDir {params.starindex} --readFilesIn {input.fq1} {input.fq2} --readFilesCommand zcat --outFileNamePrefix {params.prefix} --outSAMtype None --outSAMunmapped Within --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1

Some general filtering of the junction files

- Filter out junctions on chrM

- Filter out junctions supported by multi mappers only and supported by too few reads($7 > 0)

- Filter out annotated junctions since they are always included from the GTF

second-pass STAR with the new indexed genome:

STAR --genomeDir {params.genomedir} --sjdbGTFfile {params.gtf} --readFilesIn {input.fq1} {input.fq2} --runThreadN {params.threads} --outSAMtype BAM SortedByCoordinate --outFileNamePrefix {params.prefix} --readFilesCommand zcat --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignMatesGapMax 1250000 --alignIntronMax 1250000 --chimSegmentMin 12 --chimMultimapNmax 20 --chimNonchimScoreDropMin 10 --peOverlapNbasesMin 12 --peOverlapMMp 0.1 --chimOutJunctionFormat 1 --outSAMunmapped Within --quantMode TranscriptomeSAM GeneCounts

Subsequently I want to subset the bam file for the specific genomic location, convert these subsets back to fastq files and remap with STAR to the fasta file containing the sequences of the different splice variants (I also created a gtf file where the different splice variants are described as exons).

subsetting the bam file:

samtools view -q 255 {sample}_Aligned.sortedByCoord.out.bam 'X:67533988-67791954' | cut -f1 | sort | uniq > {sample}_read_names

Here I extract all the read names of the paired reads mapping to the region, i use the picard tool FilterSamReads to extract the readpairs and the bam2fastq tool of BEDtools to convert the bam files to fastq. Followed by alignment and counting with STAR.

.

identification of splice variants

STAR --genomeDir {params.AR_index} --readFilesIn {input.fq1} {input.fq2} --runThreadN {params.threads} --twopassMode Basic --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix STAR_AR_{wildcards.sample}/ --outFilterMultimapNmax 1 --alignEndsType EndToEnd --alignSoftClipAtReferenceEnds No --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --quantMode TranscriptomeSAM GeneCounts --peOverlapNbasesMin 10 --scoreGap -1000000;

I'm quite strict on multimappers here, because of the shared sequence that most of the splice variants have so I only want to count the reads that are aligned to the unique exon-exon junctions. Do you think it would be worth to perform 2-pass again?

Any suggestions or help would be greatly appreciated.

Best regards,

Jan

Alexander Dobin

unread,

Jan 25, 2022, 6:26:30 PM1/25/22

to rna-star

Hi Jan,

A few questions:

Are you using the SJ_merged_filtered.tab to generate the 2-nd pass genome?

How many junctions do you have in this file after filtering and collapsing?

How is the genome params.AR_index for the 3rd mapping is generated? Does it contain just the exon-exon junctions of interest?

Since you are prohibiting splicing with --scoreGap -1000000, you do not need to do the 2-pass mapping, since it only helps spliced reads.

Cheers

Alex

Jan Pieter Vanwelkenhuyzen

unread,

Feb 7, 2022, 3:11:58 AM2/7/22

to rna-star

Hi Alex,

I'm sorry for the late reply (I didn't receive a notification of reply via email).

I'm indeed using the SJ_merged_filtered.tab file to generate the 2-nd pass genome. The file contains 319634 junctions after filtering and collapsing.

As we know the complete sequence of the splice variants, I've created a fasta file with the complete sequence for every splice variant. So it contains more than just the exon-exon junction of interest.

Best regards,

Jan

Op woensdag 26 januari 2022 om 00:26:30 UTC+1 schreef ado...@gmail.com:

Alexander Dobin

unread,

Feb 7, 2022, 11:09:15 AM2/7/22

to rna-star

Hi Jan,

if you are creating a fast with sequences for all splices, the 2nd alignment pass will likely not affect your results substantially, since it's pretty much doing the same thing.

So you can go directly from detecting novel junctions in the 1st step to mapping to spliced sequence in the 3rd step.