Discrepancies when analyzing same sample with same FASTA but different methods

25 views

Skip to first unread message

Florencia Mascardi

unread,

Dec 18, 2021, 10:28:58 PM12/18/21

to rna-star

Hi Alex,

I'm currently trying to analyze whole RNA-sequencing data to reach to the identification of gut transcriptome-derived biomarkers. Together with my PhD director, we have already harbored the differential expression analysis from human, bacteria and fungi transcriptome. Now, we are trying to do the same but with viruses.

As it is well known, viral databases are mostly incomplete. We have downloaded reference genomes from NCBI, getting FASTA and GTF files, but we also want to try to map our samples to other reference genomes, for which there is no annotation file available.

1) To be sure we are making things the right way, first we applied the workflow we've used before to make an analysis relying on FASTA and GTF files:

nohup STAR --runThreadN 20 --runMode genomeGenerate \

--genomeDir /data/nash2020/STAR/trialC_STARvirusout/index --genomeFastaFiles /data/nash2020/STAR/STARgenomes/viral_mergedall.fna \

--sjdbGTFfile /data/nash2020/STAR/STARgenomes/viral_mergedall.gtf \

--outFileNamePrefix /data/nash2020/STAR/trialC_STARvirusout/index &

nohup STAR --genomeDir index --readFilesIn /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-1.fastq /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-2.fastq \

--outFilterType BySJout --outSAMunmapped Within \

--outSAMtype BAM SortedByCoordinate --outSAMattrIHstart 0 \

--outFilterIntronMotifs RemoveNoncanonical \

--runThreadN 12 \

--quantMode TranscriptomeSAM GeneCounts --outWigType bedGraph --outWigStrand Stranded --outFileNamePrefix HV03A3 &

We use the ReadsPerGene output to make the differential expression analysis with DESeq2, after mapping all of our samples.

2) Then, using the same FASTA file, we are trying to use the 2-pass mode mapping, hopping to use the sortedByCoord BAM file to use as input in SAMTools and reach a gene count. But our results do not match the previous ones, and we don't know anymore what we are missing. This is our script:

nohup STAR --runThreadN 20 --runMode genomeGenerate \

--genomeDir /data/nash2020/STAR/STARvirusA2/index --genomeFastaFiles /data/nash2020/STAR/STARgenomes/viral_mergedall.fna \

--outFileNamePrefix /data/nash2020/STAR/STARvirusA2/index --genomeChrBinNbits 15 --limitGenomeGenerateRAM 900000000000 --limitSjdbInsertNsj 2000000 &

nohup STAR --genomeDir /data/nash2020/STAR/STARvirusA2/index --readFilesIn /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-1.fastq /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-2.fastq \

--outFilterType BySJout --outSAMunmapped Within \

--outSAMtype BAM SortedByCoordinate \

--runThreadN 12 \

--twopassMode Basic --outFileNamePrefix HV03A4 &

nohup STAR --genomeDir index --readFilesIn /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-1.fastq /data/nash2020/transcriptomics_nafld/Trial/seqs-trial/HV03-2.fastq \

--outFilterType BySJout --outSAMunmapped Within \

--outSAMtype BAM SortedByCoordinate \

--outFilterIntronMotifs RemoveNoncanonical \

--runThreadN 12 \

--twopassMode Basic --outSJfilterReads Unique --outFileNamePrefix HV03A2b &

samtools index -b HV03A2bAligned.sortedByCoord.out.bam

samtools idxstats HV03A2bAligned.sortedByCoord.out.bam > HV03A2bcounts.tsv

The gene count in the HV03A2b is much higher than in the HV03A3 file, and includes a greater number of detected genes.

Is there any filter we are missing? Or maybe it is not right to compare these methods? Any help would be very helpful to us.

Hope you have very nice Holidays! Cheers from Argentina,

Florencia

Reply all

Reply to author

Forward

0 new messages