gff3 file error when running majiq input


Maoting Chen

Feb 25, 2026, 1:08:17 AM
to Biociphers
Dear Majiq team,

I am running AS analysis on my data using MAJIQ v3; I previously used v2 on the same dataset.
With the same gff3 file, I got an error that did not show up when I used v2:
Screenshot 2026-02-25 at 1.04.52 AM.png
I got the gff3 file directly from Ensembl, and it is an old release. I don't want to switch to a newer release because I used the gtf file of the same release for mapping. Is there any way I can bypass such errors?

Thank you,
Maoting

San Jewell

Feb 25, 2026, 11:08:25 AM
to Biociphers
Hi Maoting, 

I have seen this error before, and in the file I was looking at there were indeed one or two definitions which were actually zero-length, seemingly a mistake made when the annotation was created. It may be a similar case here. I would check the coordinates indicated in the error and remove that specific transcript from the annotation. If it's just one or two transcripts, it should not have a significant effect on your analysis (and v2 was just silently ignoring it anyway).
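One way to do that removal is sketched below in Python. It is only a sketch under stated assumptions, not part of MAJIQ: the function names `find_feature_ids` and `drop_transcripts` are hypothetical, and it assumes Ensembl-style attributes (`ID=transcript:...` on the transcript line, `Parent=transcript:...` on its exons/CDS) with each child feature having a single parent.

```python
def parse_attributes(column9):
    """Turn the GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(f.split("=", 1) for f in column9.split(";") if "=" in f)


def find_feature_ids(gff3_lines, seqid, start, end):
    """Return the IDs of features located at exactly seqid:start-end."""
    hits = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        if cols[0] == seqid and int(cols[3]) == start and int(cols[4]) == end:
            feature_id = parse_attributes(cols[8]).get("ID")
            if feature_id:
                hits.append(feature_id)
    return hits


def drop_transcripts(gff3_lines, transcript_ids):
    """Yield GFF3 lines, skipping the given transcripts and their direct children.

    Assumption: children (exon, CDS, ...) point to exactly one transcript
    through their Parent attribute, as in Ensembl GFF3 files.
    """
    bad = set(transcript_ids)
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            yield line  # keep directives, comments, and blank lines untouched
            continue
        attrs = parse_attributes(cols[8])
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") in bad or any(p in bad for p in parents):
            continue
        yield line
```

To use it, read the annotation with `open(...)`, look up the coordinates reported in the error with `find_feature_ids`, confirm the hits are transcripts rather than genes, and write the output of `drop_transcripts` to a new gff3 file so the original stays intact.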

Let me know if it makes sense. 
Thanks, 
-San

Maoting Chen

Feb 25, 2026, 5:13:47 PM
to Biociphers
Hi San,

Thank you for your reply.
As I proceeded with my analysis, I noticed that the tsv file from majiq psi doesn't contain a column for LSVs. As you can see from the screenshot of the tsv output, every entry is a junction. Is this expected? How can I get LSV-based output like in v2?
Screenshot 2026-02-25 at 5.12.25 PM.png
Thank you,
Maoting

Maoting Chen

Feb 25, 2026, 5:20:14 PM
to Biociphers
Also, I was surprised to see that ref_exon_start is -1 for some junctions, as in the screenshot below. Is there anything wrong with the build step?
Screenshot 2026-02-25 at 5.18.42 PM.png

Thank you,
Maoting

San Jewell

Feb 26, 2026, 11:02:46 AM
to Biociphers
Hi Maoting, 

You are correct that the majiq commands now output junction-per-row data, though all of the same information is available as before. The LSV id is just the gene_id, event type, and reference exon joined together. The old format is still available if you run voila tsv mode, however. As for one exon coordinate being negative, this is usually indicative of a 'half exon' event (only one side of a denovo exon could be established within our distance threshold). It may help to open the gene in voila view mode to get a better idea of the structure of the detected denovo events.

-San

Maoting Chen

Feb 26, 2026, 11:24:53 AM
to Biociphers
Hi San,

Thank you for your explanation. 

I feel the documentation on how to use voila tsv in v3 is not clear, so I want to ask about some details.
To run voila tsv, I understand three files are needed: splicegraph.zarr, .deltapsicov from the majiq deltapsi command or .psicov from majiq psi-coverage, and .sgc from majiq-v3 sg-coverage.

1. Based on what I understand from the documentation, the .psicov or .sgc file can be generated for individual experiments separately or for all experiments together with prefixes (and I assume there is no difference between these two approaches except for the number of files). In this case, if I want to get the PSI tsv file (old format) for individual experiments (like PSI in each replicate), the only way to do it is to use .psicov and .sgc files that were generated individually, because there is no --select-prefixes argument in voila tsv, correct?
2. For voila tsv on .deltapsicov, do I need to provide two .sgc files for the two groups being compared, or is one .sgc file sufficient?
3. Additionally, for the junction-based data in v3, is there any documentation on what each column means? What is the corresponding information in this new format for "junction_coords", "exon_coords", and "IR_coords" from the old format?

Really appreciate your time and support for answering my questions!

Thank you,
Maoting

San Jewell

Feb 26, 2026, 12:03:08 PM
to Biociphers
Hi Maoting, 

No worries!

1) You are correct that there will be no difference in calculated values besides file/directory count. However, as you note, there is no --select-prefixes at this time for voila tsv. Currently the lab and I have not decided if this will be a legacy format, superseded by the majiq tsv commands in some way, or kept longer term. I'm going to post some internal messages soon to get an answer on this. If we do keep it, I will consider adding more flags such as --select-prefixes. Until then, as you suggest, voila shows groups by default, and to show individual experiments there should be one file per experiment.
2) The only relevant data in sgc files is read counts. From a cursory look over the voila tsv output columns, I believe these files are only necessary when using the --show-read-counts switch, so I will verify this and provide an updated version soon which only requires sgc files if this switch is specified. To answer your question directly though, sgc files should be provided for all groups/experiments for which you are providing psicov/dpsicov files, so that the reads output columns can be processed successfully.
3) Internally, we have had a group discussion scheduled to work out a sustainable way to keep output header documentation in line with internal development. This meeting has been delayed multiple times; I will bring it up again as a renewed priority, citing your confusion as a cause. In the meantime, please feel free to ask here about any other columns you may not understand. In the old format, the columns "junction_coords", "exon_coords", and "IR_coords" usually held a semicolon-separated list of data, which in the new junction-per-row format of v3 is broken up across rows:

junction_coords: one start-end pair for each junction; in the new format these are the "start" and "end" columns
exon_coords: one start-end pair for each exon; in the new format this is ref_exon_start-ref_exon_end, plus other_exon_start-other_exon_end for each row
IR_coords: one start-end pair for an intron, if the LSV contained an intron; in the new format these are the "start" and "end" columns on rows where the "is_intron" column is True

For all of these, the criterion that delineates an LSV is a change in the set of (gene_id, event_type, ref_exon_start, ref_exon_end).
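As a rough illustration of that grouping, the Python sketch below collects junction-per-row records back into per-LSV coordinate lists. It is not MAJIQ or voila code. The column names are the ones mentioned in this thread; the function names, the skipping of `#` comment lines, the text value "True" for is_intron, and the `gene:type:start-end` label are assumptions made for the example.

```python
import csv

# Columns that together identify one LSV, as described above.
LSV_KEY = ("gene_id", "event_type", "ref_exon_start", "ref_exon_end")


def read_rows(lines):
    """Parse tab-separated lines into dicts, skipping '#' comment lines (assumed)."""
    data = (line for line in lines if not line.startswith("#"))
    return csv.DictReader(data, delimiter="\t")


def group_rows_by_lsv(rows):
    """Map each LSV key to old-style junction/exon/intron coordinate lists."""
    lsvs = {}
    for row in rows:
        key = tuple(row[col] for col in LSV_KEY)
        if key not in lsvs:
            ref_exon = f"{row['ref_exon_start']}-{row['ref_exon_end']}"
            lsvs[key] = {
                "junction_coords": [],
                "exon_coords": [ref_exon],
                "ir_coords": [],
            }
        lsv = lsvs[key]
        span = f"{row['start']}-{row['end']}"
        # Assumption: is_intron is written as the text "True" or "False".
        if row["is_intron"].strip().lower() == "true":
            lsv["ir_coords"].append(span)
        else:
            lsv["junction_coords"].append(span)
        other_exon = f"{row['other_exon_start']}-{row['other_exon_end']}"
        if other_exon not in lsv["exon_coords"]:
            lsv["exon_coords"].append(other_exon)
    return lsvs


def lsv_label(key):
    """Build a readable label; this exact format is an assumption."""
    gene_id, event_type, ref_start, ref_end = key
    return f"{gene_id}:{event_type}:{ref_start}-{ref_end}"
```

For example, `group_rows_by_lsv(read_rows(open("psi.tsv")))` (with a hypothetical file name) returns one entry per LSV, and `";".join(...)` over each list gives semicolon-separated strings similar to the old columns. A coordinate of -1 from a 'half exon' is passed through unchanged.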

Let me know if this makes sense.
-San

Maoting Chen

Mar 5, 2026, 12:48:41 PM
to Biociphers
Hi San,

Thank you for your detailed clarification. 

I have a follow-up to the first question in my previous email.
The way I used majiq psi-coverage was to pool all samples (all conditions in the same psicov folder, with prefixes provided). Again, I assume the results will be the same as running majiq psi-coverage on separate conditions one by one, except for the number of files, correct? In this case, if I want to run voila modulize to see the AS types in one condition, the only way to do it is to use .psicov and .sgc files that were generated per condition, not the version pooled across all samples, right?

Thank you,
Maoting

San Jewell

Mar 5, 2026, 3:10:14 PM
to Biociphers
Hi Maoting, 

Yes, that is correct. As of the current version, the argument --select-prefixes isn't generally available in any of the voila-* commands, so to limit the input data it is necessary to specify only files containing the desired data on the command line. If you pass the pooled file to any voila-* command, voila will use all of the data in that file.

Thanks, 
-San

Maoting Chen

Mar 5, 2026, 10:36:49 PM
to Biociphers
Hi San,

Thank you for your reply!
I encountered an error when I ran voila modulize on the .psicov generated by majiq psi-coverage. I'll open another thread for it.

Best,
Maoting
