Error : The annotation (count matrix and isoform annotation) seems to be different (jacard similarity < 0.95).


Parastou Kohvaei

Jul 9, 2018, 8:12:27 AM
to IsoformSwitchAnalyzeR
Hey All,

I produced kallisto .tsv files using the following cDNA FASTA file:

Homo_sapiens.GRCh37.cdna.all.fa

The problem is that IsoformSwitchAnalyzeR now throws lots of different errors when processing these files, whichever .gtf version and release we use.
We tried dropping the release information from the isoform names, and this is what we get:

Error in importRdata(isoformCountMatrix = kallistoIsoformExpression$counts, :
  The annotation (count matrix and isoform annotation) seems to be different (jacard similarity < 0.95). Either isforoms found in the annotation are not quantifed or vise versa.


I searched this forum and it seems that we need to use the exact annotation release that matches these files. The problem is that we do not know which .gtf from Ensembl actually goes with this release of the cDNA file.
Does anyone know?
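
One quick way to tell a release mismatch apart from a mere version-suffix difference is to compare the two sets of transcript IDs directly. The following is a minimal sketch with made-up IDs: in practice `quant_ids` would come from the target_id column of a kallisto abundance.tsv and `gtf_ids` from the transcript_id attribute of the GTF. Both names, and the two helper functions, are placeholders and not part of IsoformSwitchAnalyzeR.

```r
# Hypothetical example IDs; replace with the target_id column of a kallisto
# abundance.tsv and the transcript_id values from the GTF.
quant_ids <- c("ENST00000456328.2", "ENST00000450305.2", "ENST00000488147.1")
gtf_ids   <- c("ENST00000456328", "ENST00000450305", "ENST00000473358")

# Remove a trailing ".<digits>" version suffix, if present.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Fraction of quantified IDs that are also found in the annotation.
overlap_fraction <- function(a, b) mean(a %in% b)

raw_overlap      <- overlap_fraction(quant_ids, gtf_ids)
stripped_overlap <- overlap_fraction(strip_version(quant_ids),
                                     strip_version(gtf_ids))

# IDs still unmatched after stripping point to a genuine release mismatch.
unmatched <- setdiff(strip_version(quant_ids), strip_version(gtf_ids))

print(c(raw = raw_overlap, stripped = stripped_overlap))
print(unmatched)
```

If the overlap only becomes high after stripping, the version suffix is the issue; if many IDs remain unmatched, the FASTA and GTF most likely come from different releases.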

Thank you all,

Kristoffer Vitting-Seerup

Jul 23, 2018, 4:33:03 AM
to IsoformSwitchAnalyzeR
Hi Parastou

(Sorry for the late response; I was on holiday.)

I must admit I do not know - the fasta file used to generate the Kallisto index seems to originate from here. Have you tried the associated GTF files also found there?

Cheers
Kristoffer

e.c...@protogen.io

Oct 2, 2018, 8:08:54 AM
to IsoformSwitchAnalyzeR
I am having this issue as well with the GTF provided by Ensembl.

Kristoffer Vitting-Seerup

Oct 2, 2018, 8:11:41 AM
to IsoformSwitchAnalyzeR
Unfortunately that is a known issue. I've added a section to the updated vignette FAQ that should help you out.

Let me know if it works.

Cheers
Kristoffer

e.c...@protogen.io

Oct 2, 2018, 9:57:44 AM
to IsoformSwitchAnalyzeR
I followed the guide and used the same files; however, I am still getting this error.

Kristoffer Vitting-Seerup

Oct 2, 2018, 10:02:45 AM
to IsoformSwitchAnalyzeR
Could I get you to describe in detail what you did in the second try?

Evan Clark

Oct 2, 2018, 10:54:36 AM
to IsoformSwitchAnalyzeR
I downloaded the ncRNA and cDNA FASTA files from Ensembl, release 93. I unzipped the files and merged them using:

cat cDNA.fa ncrna.fa > grchcdna-ncrna.fa

I also downloaded the chrm scaff GTF from release 93.

I ran:

kallisto index -i out.idx grchcdna-ncrna.fa

I then quantified using:

kallisto quant -i grchcdna-ncrna.fa -o out/ fq1 fq2

I then ran the pipeline, using the output directory to load the expression data and the GTF file in importRdata.

e.c...@protogen.io

Oct 2, 2018, 12:02:48 PM
to IsoformSwitchAnalyzeR
Also if I were to use gencode, which of the 3 comprehensive gene lists would be the one to use?

e.c...@protogen.io

Oct 2, 2018, 1:45:45 PM
to IsoformSwitchAnalyzeR
So I attempted to use the GENCODE files and ran my kallisto quantifications again with them; however, it still causes the same issue. None of the GTF files from GENCODE work either; they all produce the same error.

converting annotated CDSs
Error in importRdata(isoformCountMatrix = kallistoQuant$counts, isoformRepExpression = kallistoQuant$abundance,  :

Kristoffer Vitting-Seerup

Oct 5, 2018, 4:18:46 AM
to IsoformSwitchAnalyzeR
Hmm that sounds strange - which version of IsoformSwitchAnalyzeR are you using?

You can check with:

packageVersion('IsoformSwitchAnalyzeR')

e.c...@protogen.io

Oct 5, 2018, 7:27:23 AM
to IsoformSwitchAnalyzeR
I am using version 1.2.0.

Kristoffer Vitting-Seerup

Oct 5, 2018, 8:46:41 AM
to IsoformSwitchAnalyzeR
Could you try updating to version 1.3.9 from here and see whether the new versions work?

e.c...@protogen.io

Oct 7, 2018, 7:24:48 AM
to IsoformSwitchAnalyzeR
That seems to have fixed the problem. One more note: when it prints the switching features table to the terminal, it adds a period to one of the groups.

  Comparison switchingIsoforms switchingGenes
1     A vs I               700            604
2   NA. vs A              2799           2137
3   combined              3334           2531

I am not sure if this is just an error in the printing or if it is being included in the analysis, which may cause unwanted results as the group wouldn't exist.

Kristoffer Vitting-Seerup

Oct 8, 2018, 3:29:01 AM
to IsoformSwitchAnalyzeR
Good that it fixed the problem.

Actually that is on purpose, and it is changed throughout the entire switchAnalyzeRlist. The problem is that NA is used in computer science as a placeholder to specifically indicate values that are "Not Available", i.e. missing. Therefore the group name "NA" cannot be used (as that would cause everything to crash), which is why I change it.
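
A small base-R illustration of why a condition literally named "NA" is awkward. This is only a sketch of general R behaviour with a made-up design table; whether IsoformSwitchAnalyzeR performs the renaming via make.names() or by other means is an assumption that is not confirmed here.

```r
# With default settings, read.csv() treats the string "NA" as a missing value.
design <- read.csv(text = "sampleID,condition\ns1,NA\ns2,A",
                   stringsAsFactors = FALSE)
print(is.na(design$condition))   # the "NA" group is read as missing

# Base R's make.names() makes reserved words such as NA syntactically valid
# by appending a dot, which gives the same "NA." label seen in the table above.
safe_names <- make.names(c("A", "I", "NA"))
print(safe_names)
```

Renaming the condition to something unambiguous before building the design matrix avoids the issue altogether.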

Cheers
Kristoffer

Muzz

Oct 16, 2018, 10:55:33 AM
to IsoformSwitchAnalyzeR
I am using version 1.3.9 with data from Kallisto and still get this same error. I wonder if there is another potential solution.

e.c...@protogen.io

Oct 17, 2018, 8:37:56 AM
to IsoformSwitchAnalyzeR
Have you tried using the GRCh38 annotation as stated in the guide? That was the one that worked best for me. I haven't tried GENCODE, but Ensembl works well.

Kristoffer Vitting-Seerup

Oct 17, 2018, 11:30:40 AM
to IsoformSwitchAnalyzeR
Hi Muzz

Thanks for reaching out. Have you read the two relevant FAQs in the vignette (which also describe how to use Ensembl data)?

If none of that worked you will need to show me the R code you use to import the Kallisto data and create the switchAnalyzeRlist.

Cheers
Kristoffer

Muad Abd El Hay

Nov 4, 2018, 5:48:34 AM
to IsoformSwitchAnalyzeR
I solved my issue, and others may well have the same problem.

I used the GENCODE GTF file (the first one) and the cDNA file for the Salmon quantification.
The error seemed to come from the following code in import_data.R:

### Test overlap with expression data
if( countsSuppled ) {
    j1 <- jaccardSimilarity(
        isoformCountMatrix$isoform_id,
        isoformAnnotation$isoform_id
    )

Here, the isoform_id column of the matrices created by importIsoformExpression (such as counts) is compared with the isoform_id values taken from the GTF file.
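
For readers unfamiliar with the measure: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following is a minimal sketch with made-up IDs; `jaccard_sketch` is an illustrative re-implementation, and the package's internal jaccardSimilarity() may differ in detail.

```r
# Illustrative re-implementation; the package's internal jaccardSimilarity()
# may differ in detail.
jaccard_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical IDs: three transcripts are shared, one is unique to each set.
count_ids <- c("tx1", "tx2", "tx3", "tx4")
anno_ids  <- c("tx1", "tx2", "tx3", "tx5")

j1 <- jaccard_sketch(count_ids, anno_ids)
print(j1)  # 3 shared / 5 in total = 0.6, well below the 0.95 threshold
```

A value of 1 means the two ID sets are identical, so even a modest number of unmatched IDs is enough to fall below the threshold.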

When I checked both, I noticed that the GTF file was imported correctly, but the isoform_id in the count matrix had all the possible IDs jammed together in one column, like this:

ENST00000467408.6|ENSG00000127054.20|OTTHUMG00000003330.13|OTTHUMT00000009378.2|INTS11-213|INTS11|639|retained_intron

instead of just the first value, ENST00000467408.6.
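
A quick way to check whether a set of IDs is affected is to look for the "|" character. This is a minimal sketch: the first ID is the one quoted in this post, the second is made up, and `ids` stands in for the isoform_id column of the imported count matrix.

```r
# Example IDs: the first is quoted in this thread, the second is hypothetical.
ids <- c(
  "ENST00000467408.6|ENSG00000127054.20|OTTHUMG00000003330.13|OTTHUMT00000009378.2|INTS11-213|INTS11|639|retained_intron",
  "ENST00000000001.1"
)

# fixed = TRUE makes "|" a literal character rather than regex alternation.
has_bar <- grepl("|", ids, fixed = TRUE)

# Keep only the text before the first "|".
clean_ids <- sub("\\|.*", "", ids)

print(has_bar)
print(clean_ids)
```

If `has_bar` is TRUE for most IDs, the quantification was run against a FASTA with GENCODE-style headers, and the IDs need trimming before they can match the GTF.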

Looking around GitHub, I found this post in the issues: https://github.com/kvittingseerup/IsoformSwitchAnalyzeR/issues/11

User csijcs posted a solution to the problem there. It may already be implemented in the new devel version, but if you are stuck with the 1.5 version, you should do the following:

mySwitchList <- importIsoformExpression(dir,
                                        calculateCountsFromAbundance=TRUE,
                                        interLibNormTxPM=TRUE,
                                        normalizationMethod='TMM',
                                        pattern='',
                                        invertPattern=FALSE,
                                        ignore.case=FALSE,
                                        showProgress = TRUE,
                                        quiet = TRUE
                                        )


## keep only the transcript ID (everything before the first "|")
isoformCountMatrix <- mySwitchList$counts
isoformCountMatrix$isoform_id <- sub("\\|.*", "", isoformCountMatrix$isoform_id)
rownames(isoformCountMatrix) <- isoformCountMatrix$isoform_id

isoformAbundanceMatrix <- mySwitchList$abundance
isoformAbundanceMatrix$isoform_id <- sub("\\|.*", "", isoformAbundanceMatrix$isoform_id)
rownames(isoformAbundanceMatrix) <- isoformAbundanceMatrix$isoform_id

## path to your gtf file here
gtf <- "/Users/csijcs/Documents/Work/CSI/AML/RNA_seq/gencode.v28.annotation.gtf"

## 'samples' is the design matrix, which must be defined beforehand
switchAnalyzeRlist <- importRdata(
    isoformCountMatrix,
    isoformAbundanceMatrix,
    samples,
    gtf,
    comparisonsToMake=NULL,
    addAnnotatedORFs=TRUE,
    onlyConsiderFullORF=TRUE,
    removeNonConvensionalChr=TRUE,
    PTCDistance=50,
    foldChangePseudoCount=0.01,
    showProgress=TRUE,
    quiet=FALSE
)

This worked for me; I hope it helps others.




Kristoffer Vitting-Seerup

Nov 5, 2018, 3:55:46 AM
to IsoformSwitchAnalyzeR
Hi Muad

Although the proposed solution works, it's probably easier to just update to the newest version of IsoformSwitchAnalyzeR, since this issue is then solved automatically (when using importIsoformExpression() to import the quantification).

By the way, I recommend this update to everyone.

Cheers
Kristoffer