Hello, I have been struggling with adding annotated ORFs to my switchAnalyzeRlist. The annotations I'm using are for Bovine Herpesvirus 1.1 (JX898220.1), and my transcriptome assembly was generated with StringTie.
I have no problem with importRdata:
****************************************************
> data <- importRdata(
+ isoformCountMatrix = expression$counts,
+ isoformRepExpression = expression$abundance,
+ designMatrix = myDesign,
+ isoformExonAnnoation = "merged.gtf",
+ isoformNtFasta = "transcripts.fasta",
+ showProgress = TRUE,
+ )
Step 1 of 10: Checking data...
Please note that some condition names were changed due to names not suited for modeling in R.
Step 2 of 10: Obtaining annotation...
importing GTF (this may take a while)...
1 ( 0.51%) isoforms were removed since they were not expressed in any samples.
Step 3 of 10: Fixing StringTie gene annoation problems...
81 isoforms were assigned the ref_gene_id and gene_name of the most similar
annotated isoform (defined via overlap in genomic exon coordinates).
This was only done if the overlap met the requriements
indicated by the three fixStringTieViaOverlap* arguments.
We were unable to assign 39 isoforms (located within annotated genes) to a known ref_gene_id/gene_name.
These were removed to enable analysis of the rest of the isoform from within the merged genes.
2 gene_ids which were associated with multiple ref_gene_id/gene_names
were split into mutliple genes via their ref_gene_id/gene_names.
74 genes_id were assigned their original gene_id instead of the StringTie gene_id.
This was only done when it could be done unambiguous.
Step 4 of 10: Calculating expression estimates from count data...
Skipped as user supplied expression via the "isoformRepExpression" argument...
Step 5 of 10: Testing for unwanted effects...
SVA analysis failed. No unwanted effects were added.
Step 6 of 10: Batch correcting expression estimates...
Step 7 of 10: Extracting data from each condition...
|================================================================================================| 100%
Step 8 of 10: Making comparisons...
|================================================================================================| 100%
Step 9 of 10: Making switchAnalyzeRlist object...
Step 10 of 10: Guestimating differential usage...
The GUESSTIMATED number of genes with differential isoform usage are:
comparison estimated_genes_with_dtu
1 X1 vs X2 2 - 3
2 X4 vs X6 0 - 0
3 X6 vs X8 0 - 0
Done
Warning messages:
1: In importRdata(isoformCountMatrix = expression$counts, isoformRepExpression = expression$abundance, :
The number of comparisons (n=15) is unusually high.
- If this intended please note that with a large number of comparisons IsoformSwitchAnalyzeR might use quite a lot of memmory (aka running on a small computer might be problematic).
- If this was not intended please check the supplied design matrixt to make sure no mistakes were made.
2: In importRdata(isoformCountMatrix = expression$counts, isoformRepExpression = expression$abundance, :
No CDS annotation was found in the GTF files meaning ORFs could not be annotated.
(But ORFs can still be predicted with the analyzeORF() function)
3: In importRdata(isoformCountMatrix = expression$counts, isoformRepExpression = expression$abundance, :
There were estimated unwanted effects in your dataset but the automatic sva run failed.
We highly reccomend you run sva yourself, add the nessesary surrogate variables
as extra columns in the "designMatrix" and re-run this function
****************************************************
But here is what I'm getting trying to use the GTF file to add in the ORFs (this is the exact file I used with StringTie):
****************************************************
> data <- addORFfromGTF(data, "bohv_genome.gtf")
Step 1 of 2: importing GTF (this may take a while)...
Error in `$<-.data.frame`(`*tmp*`, "geneType", value = NA) :
replacement has 1 row, data has 0
****************************************************
Here is what my GTF file looks like:
****************************************************
> import("bohv_genome.gtf")
GRanges object with 297 ranges and 15 metadata columns:
seqnames ranges strand | source type score phase gene_id
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
[1] JX898220.1 487-1262 + | RefSeq gene NA <NA> M8J62_gp01
[2] JX898220.1 487-1224 + | RefSeq CDS NA 0 M8J62_gp01
[3] JX898220.1 487-489 + | RefSeq start_codon NA 0 M8J62_gp01
[4] JX898220.1 1225-1227 + | RefSeq stop_codon NA 0 M8J62_gp01
[5] JX898220.1 1639-7253 - | RefSeq gene NA <NA> M8J62_gp03
... ... ... ... . ... ... ... ... ...
[293] JX898220.1 124272-124274 - | RefSeq stop_codon NA 0 M8J62_gp73
[294] JX898220.1 130121-134201 + | RefSeq gene NA <NA> M8J62_gp74
[295] JX898220.1 130121-134152 + | RefSeq CDS NA 0 M8J62_gp74
[296] JX898220.1 130121-130123 + | RefSeq start_codon NA 0 M8J62_gp74
[297] JX898220.1 134153-134155 + | RefSeq stop_codon NA 0 M8J62_gp74
transcript_id db_xref gbkey gene gene_biotype locus_tag
<character> <character> <character> <character> <character> <character>
[1] GeneID:72301842 Gene circ protein_coding M8J62_gp01
[2] unassigned_transcrip.. <NA> CDS circ <NA> M8J62_gp01
[3] unassigned_transcrip.. <NA> CDS circ <NA> M8J62_gp01
[4] unassigned_transcrip.. <NA> CDS circ <NA> M8J62_gp01
[5] GeneID:72301769 Gene UL52 protein_coding M8J62_gp03
... ... ... ... ... ... ...
[293] unassigned_transcrip.. <NA> CDS BICP22 <NA> M8J62_gp73
[294] GeneID:72301841 Gene BICP4 protein_coding M8J62_gp74
[295] unassigned_transcrip.. <NA> CDS BICP4 <NA> M8J62_gp74
[296] unassigned_transcrip.. <NA> CDS BICP4 <NA> M8J62_gp74
[297] unassigned_transcrip.. <NA> CDS BICP4 <NA> M8J62_gp74
note product protein_id exon_number
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] immediate-early and .. circ protein YP_010374192.1 1
[3] immediate-early and .. circ protein YP_010374192.1 1
[4] immediate-early and .. circ protein YP_010374192.1 1
[5] <NA> <NA> <NA> <NA>
... ... ... ... ...
[293] <NA> immediate-early and .. YP_010374264.1 1
[294] <NA> <NA> <NA> <NA>
[295] <NA> immediate-early tran.. YP_010374265.1 1
[296] <NA> immediate-early tran.. YP_010374265.1 1
[297] <NA> immediate-early tran.. YP_010374265.1 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
****************************************************
Here I try to use the GFF3 file, which the package has trouble with, of course:
****************************************************
> data <- addORFfromGTF(data, "bohv_genome.gff3")
Step 1 of 2: importing GTF (this may take a while)...
Error in importGTF(pathToGTF = pathToGTF, addAnnotatedORFs = TRUE, onlyConsiderFullORF = onlyConsiderFullORF, :
The GFF file must contain the folliwing collumns 'transcript_id' and 'gene'. transcript_id is missing.
****************************************************
So, I've tried extracted and renamed the needed columns from the GFF3 file:
****************************************************
> data <- addORFfromGTF(data, localAnnotation2)
Step 1 of 2: importing GTF (this may take a while)...
Error in asMethod(object, strict = FALSE) :
The character vector to convert to a GRanges object must contain strings of the form
"chr:start-end" or "chr:start-end:strand", with end >= start - 1, or "chr:pos" or
"chr:pos:strand". For example: "chr1:2501-2900", "chr1:2501-2900:+", or "chr1:740". Note that
".." is a valid alternate start/end separator. Strand can be "+", "-", "*", or missing.
****************************************************
I've exported the GFF3 columns that I've edited *back* into a GTF file and tried that:
****************************************************
> colnames(localAnnotation2@elementMetadata)[1] <- 'transcript_id'
> localAnnotation2
GRanges object with 212 ranges and 3 metadata columns:
seqnames ranges strand | transcript_id gene_id gene_name
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] JX898220.1 1-134896 + | JX898220.1:1..134896 <NA> <NA>
[2] JX898220.1 1-103165 + | id-JX898220.1:1..103.. <NA> <NA>
[3] JX898220.1 1-473 + | id-JX898220.1:1..473 <NA> <NA>
[4] JX898220.1 236-431 + | id-JX898220.1:236..431 <NA> <NA>
[5] JX898220.1 487-1262 + | gene-circ circ circ
... ... ... ... . ... ... ...
[208] JX898220.1 129625-130094 + | id-JX898220.1:129625.. <NA> <NA>
[209] JX898220.1 130121-134201 + | gene-BICP4-2 BICP4 BICP4
[210] JX898220.1 130121-134155 + | cds-AFV53435.1 BICP4 AFV53435.1
[211] JX898220.1 134196-134201 + | id-BICP4-2 BICP4 <NA>
[212] JX898220.1 134896 + | id-JX898220.1:134896.. <NA> <NA>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> export(localAnnotation2,"test.gtf","gtf")
> data <- addORFfromGTF(data, "test.gtf")
Step 1 of 2: importing GTF (this may take a while)...
Error in importGTF(pathToGTF = pathToGTF, addAnnotatedORFs = TRUE, onlyConsiderFullORF = onlyConsiderFullORF, :
The GTF file must contain the folliwing collumns 'transcript_id' and 'gene_id'. transcript_id, gene_id is missing.
****************************************************
I don't even understand this error, since it's literally not true. Another work around I tried was using the isoformToGeneExp() function, but I get this:
**********************************************************
> isoformToGeneExp(data$isoformRepExpression, isoformGeneAnnotation = localAnnotation2)
Error in isoformToGeneExp(data$isoformRepExpression, isoformGeneAnnotation = localAnnotation2) :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different.
Specifically:
155 isoforms were quantified.
76 isoforms are annotated.
Only 76 overlap.
79 isoforms quantifed isoforms had no corresponding annoation
This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing the analysis.
If there is no overlap (as in zero or close) there are two options:
1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
Examples from expression matrix are : MSTRG.1.78, MSTRG.1.8, gene-UL12
Examples of annoation are : gene-UL13, gene-UL52, gene-UL3
Examples of isoforms which were only found im the quantification are : MSTRG.1.82, MSTRG.1.95, MSTRG.1.40
If there is a large overlap but still far from complete there are 3 possibilites:
1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
For more info see the FAQ in the vignette.
****************************************************
Not sure what else to try at this point.