Hi Alex,
Thanks for your extensive manual and continuing support to the community. Currently
I'm working on a RNA-seq pipeline where the main goal is to identify and count known splice variants. However I was wondering if my strategy is the most optimal method to detect and counts the splice variants and if you have any remarks that.
Some background info: multiple splice variants of the receptor in question are known in human samples and I have a fasta file containing their sequence. Cryptic exons are involved in most of the splice variants. A common overlap between the splice variants is that all of them contain the first 3 exons, however most of them contain a unique exon-exon junction that is used atm to identify the different splice variants.
The general idea is to implement the multi-sample 2-pass method. I start with a 1-pass of every sample to obtain the junction file that will be merged together and filtered. I have an indexed reference genome for the first pass (indexed with a gtf file).
first-pass STAR:
STAR --runThreadN {params.threads} --genomeDir {params.starindex} --readFilesIn {input.fq1} {input.fq2} --readFilesCommand zcat --outFileNamePrefix {params.prefix} --outSAMtype None --outSAMunmapped Within --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1
Some general filtering of the junction files
- Filter out junctions on chrM
- Filter out junctions supported by multi mappers only and supported by too few reads($7 > 0)
- Filter out annotated junctions since they are always included from the GTF
cat *.tab | awk '($7 > 2 && $6==0)' | cut -f1-6 | sort | uniq |wc -l > SJ_merged_filtered.tab
second-pass STAR with the new indexed genome:
STAR --genomeDir {params.genomedir} --sjdbGTFfile {params.gtf} --readFilesIn {input.fq1} {input.fq2} --runThreadN {params.threads} --outSAMtype BAM SortedByCoordinate --outFileNamePrefix {params.prefix} --readFilesCommand zcat --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignMatesGapMax 1250000 --alignIntronMax 1250000 --chimSegmentMin 12 --chimMultimapNmax 20 --chimNonchimScoreDropMin 10 --peOverlapNbasesMin 12 --peOverlapMMp 0.1 --chimOutJunctionFormat 1 --outSAMunmapped Within --quantMode TranscriptomeSAM GeneCounts
Subsequently I want to subset the bam file for the specific genomic location, convert these subsets back to fastq files and remap with STAR to the fasta file containing the sequences of the different splice variants (I also created a gtf file where the different splice variants are described as exons).
subsetting the bam file:
samtools view -q 255 {sample}_Aligned.sortedByCoord.out.bam 'X:67533988-67791954' | cut -f1 | sort | uniq > {sample}_read_names
Here I extract all the read names of the paired reads mapping to the region, i use the picard tool FilterSamReads to extract the readpairs and the bam2fastq tool of BEDtools to convert the bam files to fastq. Followed by alignment and counting with STAR.
.
identification of splice variants
STAR --genomeDir {params.AR_index} --readFilesIn {input.fq1} {input.fq2} --runThreadN {params.threads} --twopassMode Basic --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix STAR_AR_{wildcards.sample}/ --outFilterMultimapNmax 1 --alignEndsType EndToEnd --alignSoftClipAtReferenceEnds No --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --quantMode TranscriptomeSAM GeneCounts --peOverlapNbasesMin 10 --scoreGap -1000000;
I'm quite strict on multimappers here, because of the shared sequence that most of the splice variants have so I only want to count the reads that are aligned to the unique exon-exon junctions. Do you think it would be worth to perform 2-pass again?
Any suggestions or help would be greatly appreciated.
Best regards,
Jan