Using PacBio Isoseq outputs for MAJIQ-L

Tayvia Brownmiller

Apr 1, 2024, 10:09:04 AM

to Biociphers

Hello all,

I am wondering if anyone here has any experience trying to use Isoseq outputs with MAJIQ-L.

Isoseq is the standard pipeline, built and maintained by PacBio, for processing their long read data. I cannot seem to get MAJIQ-L to parse the information correctly compared to outputs I generated using the Isoquant pipeline. When running the prerequisite commands prior to running the viewer, everything runs smoothly with no errors. However, once I run the view command, the view opens as if long read data has been parsed but no reads can be seen.

I understand there will be differences, since Isoseq and Isoquant are two different algorithms, and I've done the necessary comparisons to ensure that I am not just looking at a gene that's detected by one pipeline and not the other.

I did have to do some manipulation of the needed Isoseq files: namely, converting the gff to a gtf and extracting the isoform ID and counts columns to their own csv file.

Any feedback or thoughts would be appreciated. Yes, I could simply use Isoquant for all of my analyses moving forward, but Isoseq is considered more robust and accurate at this time so I would love to be able to use data generated from that pipeline over all else.

Thank you in advance for your help and to the developers for this tool!

Tayvia Brownmiller

San Jewell

Apr 1, 2024, 12:35:36 PM

to Biociphers

Hi Tayvia,

I don't know the new Isoseq output format in depth offhand. However, if the transcript information is still available in a relatively similar format, I don't imagine that your inputs would be incompatible with our software. To be completely definitive about it, could you provide a toy dataset or example output from this tool, or tell us exactly how you are running it on the input data (i.e. which command line chain you are using to make the MAJIQ-L inputs)? That way we can reproduce everything and verify there are no oddities or bugs in MAJIQ-L that would prevent it from running properly.

Thanks!

-San

Seong Woo Han

Apr 4, 2024, 1:59:30 PM

to Biociphers

This is a test.

Tayvia Brownmiller

Apr 11, 2024, 11:24:06 AM

to Biociphers

Hello!

My PI granted me permission to share one of the control data sets. I have emailed a zip folder with the final output files from the isoseq pipeline. Please let me know if there is an issue receiving the email.

What I had attempted in terms of modifying the isoseq outputs is the following:

Converted gff to gtf using cufflinks: gffread file.gff -T -o file.gtf
Extracted the required "ID" and "fl_assoc" columns from the corresponding classification.txt file to a tsv to satisfy the id and counts inputs for voila lr
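The column extraction in the second step could be sketched in Python roughly as follows. This is only an illustration: the column names "ID" and "fl_assoc" are taken from the description above, the function name `extract_counts` and the demo rows are made up, and whether voila lr expects a header row in the counts file is not confirmed here.

```python
import csv
import io

def extract_counts(classification, out, id_col="ID", count_col="fl_assoc"):
    """Copy the isoform ID and count columns from a tab-delimited table.

    `classification` and `out` are open text file objects. The default
    column names are assumptions based on the description in this thread.
    """
    reader = csv.DictReader(classification, delimiter="\t")
    missing = {id_col, count_col} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"columns not found: {sorted(missing)}")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_col, count_col])
    rows = 0
    for row in reader:
        writer.writerow([row[id_col], row[count_col]])
        rows += 1
    return rows

# Tiny in-memory example with made-up rows (not real Iso-Seq output).
demo = io.StringIO("ID\tlength\tfl_assoc\nPB.1.1\t1500\t12\nPB.1.2\t900\t3\n")
result = io.StringIO()
n_rows = extract_counts(demo, result)
print(result.getvalue())
```

For real files, the two `StringIO` objects would be replaced by `open(...)` handles on the classification file and the output tsv.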

To generate the long-read voila file:

voila lr --lr-gtf-file /IsoSeq_Analysis_2024/classify/HEK293T_siNeg.sorted.filtered_lite.convert.gtf --lr-tsv-file /IsoSeq_Analysis_2024/classify/HEK293T_siNeg.counts.tsv --voila-file HEK293T/psi_siNeg/siNeg.psi.voila -sg HEK293T/build_siNeg/splicegraph.sql -o siNeg_isoseq.lr.voila

To run voila view: (port number is to connect to a ssh tunnel for opening the web browser from HPC)

voila view -p #### HEK293T/build_siNeg/splicegraph.sql HEK293T/psi_siNeg/siNeg.psi.voila --long-read-file siNeg_isoseq.lr.voila

Tayvia

San Jewell

Apr 16, 2024, 11:29:42 AM

to Biociphers

Hello Tayvia,

I've taken a look at the files and I think I understand the issue.

First of all, there is a separate problem that we have discovered, which was noted in the gtf output from the "bambu" tool and might affect other algorithms as well: the ordering of the gtf rows was incorrect, which disrupted the algorithm. I have written a patch for it which I will push out and release, but first I want to verify that your data works properly with the tool, since I believe your problem is caused by a different issue.

In the data that you have sent over, there was no splicegraph.sql or voila file included, so I cannot fully test and reproduce the run you were trying to do. However, looking at the other files, I see that although HEK293T is a human cell line, all of the gene names in the output have generic gene ids like "PB.15077.7", which I don't think will match the ones in your build / splicegraph. This is probably why there is no data aligned to the build when you run voila LR. Can you confirm the IDs of the genes in the majiq build?
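One quick way to check for this kind of mismatch is to collect the gene_id values from the long-read gtf and compare them with the gene IDs used in the build. The sketch below assumes standard gtf attribute syntax (`gene_id "..."` in the ninth column); the helper names, the demo rows, and the example build ID are all hypothetical, and how the build gene IDs are obtained (for example, from the annotation given to majiq build) is left to the reader.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def gtf_gene_ids(lines):
    """Return the set of gene_id values found in GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids

def id_overlap(lr_ids, build_ids):
    """Count shared IDs and list long-read IDs absent from the build."""
    shared = lr_ids & build_ids
    return len(shared), sorted(lr_ids - build_ids)

# Made-up GTF rows and a made-up build ID, for illustration only.
demo_gtf = [
    'chr1\tPacBio\ttranscript\t100\t900\t.\t+\t.\t'
    'transcript_id "PB.15077.7"; gene_id "PB.15077";\n',
    'chr1\tPacBio\texon\t100\t300\t.\t+\t.\t'
    'transcript_id "PB.15077.7"; gene_id "PB.15077";\n',
]
build_gene_ids = {"ENSG00000000003"}
n_shared, unmatched = id_overlap(gtf_gene_ids(demo_gtf), build_gene_ids)
print(n_shared, unmatched)
```

If the shared count is zero, as in this toy case, no long-read transcript can be matched to a gene in the build.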

Thanks,

-San

Tayvia Brownmiller

Apr 22, 2024, 2:41:56 PM

to Biociphers

Hi San,

Our bioinformatics specialist and I did notice that as well. A caveat with the current version of Iso-Seq is that the gff which is output from pigeon only has the pipeline assigned identifier, PB.###.#. The gene and transcript id information is output to a separate txt file. We attempted to amend this information to the converted gtf. Essentially, we have been trying to determine if there is a way we can consolidate/modify the Iso-Seq outputs to best recapitulate the isoquant output.
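As a rough illustration of that kind of consolidation (not necessarily what we actually ran), the gene_id attribute in the converted gtf can be rewritten from a lookup of PB isoform ID to annotated gene ID. The lookup would be built from the separate txt file; the function name, the demo row, and the Ensembl-style ID below are hypothetical, and the right ID style depends on the annotation used for the majiq build.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "[^"]*"')
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')

def relabel_gene_ids(gtf_lines, isoform_to_gene):
    """Replace gene_id in each GTF line using a transcript-to-gene lookup.

    Lines whose transcript_id is not in the lookup are returned unchanged.
    """
    out = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID_RE.search(line)
        gene = isoform_to_gene.get(match.group(1)) if match else None
        if gene is not None:
            new_attr = f'gene_id "{gene}"'
            if GENE_ID_RE.search(line):
                # A function replacement avoids escape handling in `gene`.
                line = GENE_ID_RE.sub(lambda _: new_attr, line, count=1)
            else:
                line = line.rstrip() + f" {new_attr};\n"
        out.append(line)
    return out

# Made-up lookup and GTF row, for illustration only.
mapping = {"PB.15077.7": "ENSG00000000003"}
lines = [
    'chr1\tPacBio\texon\t100\t300\t.\t+\t.\t'
    'transcript_id "PB.15077.7"; gene_id "PB.15077";\n',
]
relabeled = relabel_gene_ids(lines, mapping)
print(relabeled[0], end="")
```

The transcript_id is left as the PB identifier here so that it still matches the ID column of the counts file.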

I have sent you another set of files. Included are the files you requested, as well as what we have attempted to do regarding modifying the Iso-Seq outputs.

Also, I had sent the previous batch of files to you AND Seong, but received an error saying it could not be sent to Seong's email address. Not sure if that's an issue on my end or yours. As you both have been helping with this I wanted to make sure you were both getting the files.