Storing sequences instead of BSgenome

183 views
Skip to first unread message

mirzag...@gmail.com

unread,
Oct 4, 2020, 9:55:59 AM10/4/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
I am using IsoformSwitchAnalyzeR and imported wheat RNASeqanalysis output from  a diff_out folder.

I executed the following commands:

library(IsoformSwitchAnalyzeR)
#update.packages("IsoformSwitchAnalyzeR")
#packageVersion('IsoformSwitchAnalyzeR')

testSwitchList <- importCufflinksFiles(
  pathToGTF = 'D:/wheat/SRP043554-cold/merged_asm/merged.gtf',
  pathToGeneDEanalysis = 'D:/wheat/SRP043554-cold/diff_out/gene_exp.diff',
  pathToIsoformDEanalysis = 'D:/wheat/SRP043554-cold/diff_out/isoform_exp.diff',
  pathToGeneFPKMtracking = 'D:/wheat/SRP043554-cold/diff_out/genes.fpkm_tracking',
  pathToIsoformFPKMtracking = 'D:/wheat/SRP043554-cold/diff_out/isoforms.fpkm_tracking',
  pathToIsoformReadGroupTracking = 'D:/wheat/SRP043554-cold/diff_out/isoforms.read_group_tracking',
  pathToReadGroups = 'D:/wheat/SRP043554-cold/diff_out/read_groups.info',
  pathToRunInfo = 'D:/wheat/SRP043554-cold/diff_out/run.info',
  pathToSplicingAnalysis = 'D:/wheat/SRP043554-cold/diff_out/splicing.diff',
  fixCufflinksAnnotationProblem = T,
  quiet = F
)
testSwitchList
names(testSwitchList)
head(testSwitchList,2)
head(testSwitchList$conditions,2)

exampleSwitchListFiltered <- preFilter(
  switchAnalyzeRlist = testSwitchList,
  geneExpressionCutoff = 1,
  isoformExpressionCutoff = 0,
  removeSingleIsoformGenes = TRUE
)

exampleSwitchListAnalyzed <- isoformSwitchTestDEXSeq(
  switchAnalyzeRlist = testSwitchList,
  reduceToSwitchingGenes=TRUE
)

extractSwitchSummary(exampleSwitchListAnalyzed)

exampleSwitchListAnalyzed <- analyzeORF(
  exampleSwitchListAnalyzed,
  orfMethod = "longest",
  # genomeObject = Hsapiens, # not necessary since sequences are already stored in the switchAnalyzeRlist
  showProgress=FALSE
)


but encounter an error saying:

Error in analyzeORF(exampleSwitchListAnalyzed, orfMethod = "longest",  :
  The genomeObject argument must be a BSgenome.

I greatly appreciate it if you kindly let me know how I can store the said required sequences in the switchAnalyzeRlist?

With best regards,

Ghader

Kristoffer Vitting-Seerup

unread,
Oct 5, 2020, 4:09:46 AM10/5/20
to IsoformSwitchAnalyzeR
Hi Ghader

The comment is valid for the example in the vignette where the isoformNtFasta argument in importRdata() was used. Either you will have to provide a BSgenome object matching your species or else you can use the isoformNtFasta argument in the importCufflinksFiles() function.

For more info on BSgenome please see the documentation for the analyzeORF or the vignette.

Cheers
Kristoffer

mirzag...@gmail.com

unread,
Oct 5, 2020, 3:13:46 PM10/5/20
to IsoformSwitchAnalyzeR
  Dear Kristoffer,
Many thanks for your reply.
Could you please let me know what kind of fasta file do I should provide for isoformNtFasta? Actually, I used the CDS fasta file which was downloaded from EnsemblPlant.

May data in ocly from two conditions (C04 and C23)> I am trying to plot isoforms using  the command: switchPlot(exampleSwitchListAnalyzed, gene = 'TraesCS3A02G273500') but I get the following error:
> switchPlot(exampleSwitchListAnalyzed, gene = 'TraesCS3A02G273500')
Error in if (nrow(orfInfo) == 0) { : argument is of length zero
In addition: Warning message:
In switchPlotTranscript(switchAnalyzeRlist = tArgList$switchAnalyzeRlist,  :
  ORFs have not to have been annoated. If ORF should be visualized it can be annoated with the 'annotatePTC()' function
>
Your help in this regard is highly appreciated.

With best regards,
Ghader  

Kristoffer Vitting-Seerup

unread,
Oct 6, 2020, 3:05:39 AM10/6/20
to IsoformSwitchAnalyzeR
The nucleotide (Nt) sequences of the isoforms.

For more information please see the documentation by typing ?importCufflinksFiles or the vignette.

Cheers
Kristoffer

mirzag...@gmail.com

unread,
Oct 6, 2020, 1:07:52 PM10/6/20
to IsoformSwitchAnalyzeR
Many thanks, Kristoffer,
I read the help but still, I have stuck in here. I greatly appreciate it if you let me know how I can prepare the nucleotide sequences of isoforms.
With best regards,
Ghader

mirzag...@gmail.com

unread,
Oct 6, 2020, 4:05:49 PM10/6/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
I extracted the sequences using this code:
gffread -w htsatindex/transcripts.fa -g htsatindex/Triticum_aestivum.IWGSC.dna.toplevel.fa diff_out/merged.gtf
and supplied the path to importCufflinksFiles as isoformNtFasta. 
I hope it is correct this time.
Regards,
Ghader

Kristoffer Vitting-Seerup

unread,
Oct 7, 2020, 3:04:40 AM10/7/20
to IsoformSwitchAnalyzeR
Hi Ghader

That looks right. Just one thing: Does the "htsatindex/transcripts.fa" file contain the sequences of a transcriptome or the genome? It should be the genome (then gffread will extract the sequences corresponding to the transcripts identified by cufflinks/cuffdiff.

Cheers
Kristoffer

mirzag...@gmail.com

unread,
Oct 7, 2020, 5:14:01 AM10/7/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
Actually, transcripts.fa is the name of the output file. The genomic DNA (here htsatindex/Triticum_aestivum.IWGSC.dna.toplevel.fa) is used to build the isoform nucleotide sequences ( transcripts.fa).
I have another question please: 
I think the ORF annotation such as Signal peptide and Pfam results visualization on genes are applied to the significantly identified isoforms. I appreciate it if you let me know how can I do this for a specific group of genes of interest such as a gene family.
Many thanks,
Kind regards,
Ghader

Kristoffer Vitting-Seerup

unread,
Oct 7, 2020, 6:56:52 AM10/7/20
to IsoformSwitchAnalyzeR
Ahh - sorry I was to fast then.

You are right the IsoformSwitchAnalyzeR workflow is designed to only analyze genes with isoform switches. 

If you want to run on other data you would need to follow the workflow described here except you should subset to the genes of interest using the subsetSwitchAnalyzeRlist() function.

mirzag...@gmail.com

unread,
Oct 8, 2020, 1:23:03 AM10/8/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
I shifted from cufflinks to salmon and prepared the RNASeq analysis results. But here I don't know how to prepare isoform nucleotide sequences for the isoformNtFasta argument:

aSwitchList <- importRdata(
  isoformCountMatrix   = salmonQuant$counts,
  isoformRepExpression = salmonQuant$abundance,
  designMatrix         = myDesign,
  isoformExonAnnoation = 'D:/wheat/SRP043554-cold/htsatindex/Triticum_aestivum.IWGSC.46.gtf    ????',
  isoformNtFasta       = '?',
  showProgress = T
)

the one which I made with gffread for cufflinks result has different sequence names and is not accepted here. I appreciate it if you let me know how I can prepare the isoform nucleotide sequence and isoformExonAnnoation files.  Are isoform nucleotide sequences downloadable from Ensembl?
Many thanks and best regards,
Ghader

mirzag...@gmail.com

unread,
Oct 8, 2020, 4:28:39 AM10/8/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
I shifted from cufflinks to salmon and prepared the RNASeq analysis results. I used the following codes to import the data:

aSwitchList <- importRdata(
  isoformCountMatrix   = salmonQuant$counts,
  isoformRepExpression = salmonQuant$abundance,
  designMatrix         = myDesign,
  isoformExonAnnoation = 'D:/wheat/SRP043554-cold/stringtieout/stringtie_merged.gtf',
  isoformNtFasta       = 'D:/wheat/SRP043554-cold/htsatindex/stringtietranscripts.fa',
  ignoreAfterSpace = TRUE,
  ignoreAfterBar = T,
  showProgress = T
)

but I encountered the following error:

Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
    importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance,  : 
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925). 
Either isforoms found in the annotation are not quantifed or vise versa. 
Specifically:
 133231 isoforms were quantified.
 145841 isoforms are annotated.
 Only 133224 overlap.
 7 isoforms quantifed had no corresponding annoation
 
This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.
 
If there is no overlap (as in zero or close) there are two options:
 1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
 2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
     Examples from expression matrix are : TraesCS6B02G174800.1, TraesCS3A02G146300.1, TraesCS3D02G389700.1 
     Examples of annoation are : TraesCS5B02G232500.1, TraesCS2B02G382600.1, TraesCS6A02G032700.1 
     Examples of isoforms which were only found im the quantification are  : TraesCS7D02G156100.1, TraesCS2A02G348700.1, TraesCS7D02G338800.3 
 
If there is a large overlap but still far from complete there are 3 possibilites:
 1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
 2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
 3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
 4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod'
In addition: Warning message:
In importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance,  :
     No CDS annotation was found in the GTF files meaning ORFs could not be annotated.
     (But ORFs can still be predicted with the analyzeORF() function)

Any help in this regard is greatly appreciated.
I am using packageVersion: ‘1.11.3’
Many thanks and best regards,
Ghader

Kristoffer Vitting-Seerup

unread,
Oct 8, 2020, 6:14:34 AM10/8/20
to IsoformSwitchAnalyzeR
Hi Ghader

I'm sorry but I do not know how the data you want to analyze have been processed. Is it Cufflinks, Salmon or StringTie?

No matter how you analyze it the 4 files (count, abundances, gtf and fasta) need to contain the information about the same transcipts. So providing a StringTie GTF to salmon quantifications of the Ensemble transcriptome will not work.

If you want to read more about the different quantification approaches you can find more ino here.

mirzag...@gmail.com

unread,
Oct 18, 2020, 1:08:31 AM10/18/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
Thanks for your advice. The Importing issue was resolved.
I used StringTie output and got some outputs. It seems that Unlike the output for cuffdiff data, there is no significance test for stringtie data
for stringtie data, is there a function to statistically test the expression of each isoform between the conditions like what has been done for switches?
Many thanks
Regards,
Ghader
Rplot.png

mirzag...@gmail.com

unread,
Oct 18, 2020, 1:14:01 AM10/18/20
to IsoformSwitchAnalyzeR
Dear Kristoffer,
Sorry for repeated posts!

Thanks for your advice. The Importing issue was resolved.
I used StringTie output and got some outputs. It seems that unlike the output for cuffdiff data, there is no significance test for Isoform Expression when using StringTie data.
for stringtie data, is there a function to statistically test the expression of each isoform between the conditions like what has been done for usages in the attached output graph?

Many thanks
Regards,
GhaderRplot.png

Kristoffer Vitting-Seerup

unread,
Oct 21, 2020, 4:16:54 AM10/21/20
to IsoformSwitchAnalyzeR
Take a look at the FAQ in the vignette :-)
Reply all
Reply to author
Forward
0 new messages