Here are the steps that I follow:
1) Preparing the Reference with RSEM and making use of STAR
rsem-prepare-reference
--gtf GENOME_data/Homo_sapiens.GRCh38.90.gtf --star
GENOME_data/Homo_sapiens.GRCh38.dna.genome.fa --star-path
/data/NeoSELECT_prob/NeoCODE/applications/STAR-2.53a/bin -p28 GENOME_data/rsem
2> GENOME_data/error.txt 1> GENOME_data/output.txt &
2) Mapping with STAR
STAR --genomeDir GENOME_data --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 28 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted SortedByCoordinate --quantMode TranscriptomeSAM --outSAMheaderHD \@HD VN:1.4 SO:unsorted --outFileNamePrefix RNASEQ_data/R1.temp/R1 --readFilesCommand zcat --readFilesIn ../R1.read1.cut.fastq.gz ../R1.read2.cut.fastq.gz
3) Before
running RSEM, try to sort the R1Aligned.toTranscriptome.out.bam and use this Sorted.bam as an input to rsem
Command:
cat <( samtools view -H R1Aligned.toTranscriptome.out.bam ) <( samtools view R1Aligned.toTranscriptome.out.bam | awk '{printf "%s", $0 "~"; getline; print}' | sort -S 30G -T ./ | tr '~' '\n' ) | samtools view -bS - > Sorted.bam
(Such a sorted file should not change from run to run, and RSEM results should be exactly the same.) This was mentioned in: https://groups.google.com/g/rna-star/c/kQGfbQhezsU?pli=1
To be sure I check the statistics of the Sorted.bam files from two independent runs. They are identical. So till thins step, everything seems to be deterministic and reproducible.
4) Running rsem-calculate expression using the Sorted.bam files
Command:
rsem-calculate-expression --p 28 --bam --output-genome-bam --paired-end --forward-prob 0 --sort-bam-by-coordinate RNASEQ_data/R1.temp/Sorted.bam --seed 0 GENOME_data/rsem RNASEQ_data/R1.temp/R1 2> RNASEQ_data/error_rsem.txt 1> RNASEQ_data/output_rsem.txt &
5) After the above steps, I check the statistics for the BAM files generated from RSEM run using samtools stats.
Problem:
I find the that R1.transcript.bam (from two independent runs) and R1.transcript.sorted bam (from two independent runs) are identical and deterministic, while R1.genome.bam and R1.genome.sorted.bam vary slightly between independent runs.
Additionally, I also tried to run rsem-calculate-expression using STAR to aligned reads with the parameter (--sort-bam-by-read-name). However i get an error (listed below).
the command that i used was:
for preparing the reference
rsem-prepare-reference --gtf GENOME_data/Homo_sapiens.GRCh38.90.gtf --star GENOME_data/Homo_sapiens.GRCh38.dna.genome.fa --star-path /users/data/applications/STAR-2.5.3a/bin/ -p 28 GENOME_data/rsem_ref
for Quantification with RSEM
rsem-calculate-expression --p 28 --star --star-gzipped-file --star-path /users/data/applications/STAR-2.5.3a/bin --paired-end --seed 5 --output-genome-bam --sort-bam-by-read-name --forward-prob=1.0 ../R1.read1.fastq.gz ../R1.read2.fastq.gz GENOME_data/rsem_ref/ RNASEQ_data/R1
Now i get this error:Read A00182:360:HJM25DRXX:1:1102:8703:35728: the two mates do not align to the same transcript. RSEM does not support discordant alignments.
"rsem-parse-alignments GENOME_data/rsem-ref RNASEQ_data/R1.temp/R1 RNASEQ_data/R1.stat RNASEQ_data/R1.temp/R1.sorted.bam 3 -t ag XM" failed. Please check if you provide correct parameters/options for the pipeline
Any help on this would be highly appreciated. Thank you very much.
Best Regards,
Abhilash Kannan