splicegraph issues with C elegans ensembl: ValueError: Full intervals do not have minimum length


Margaret M

Jul 29, 2025, 7:59:36 PM
to Biociphers
TL;DR: Have any worm people used MAJIQ successfully, and if so, how did you get the annotation working?

I'm starting with the official ensembl build WBcel235 with a GTF downloaded from Ensembl and converted to GFF3.

I keep getting "ValueError: Full intervals do not have minimum length" during the majiq-v3 gff3 step.

I have tried:

Reading my gff3 into R and converting it into what seems to be MAJIQ's preferred gene/transcript/exon hierarchy:
  1. gene entries with ID=gene:WBGeneID
  2. mRNA/transcript entries with ID=transcript:XYZ;Parent=gene:WBGeneID
  3. exon/CDS/UTR entries with Parent=transcript:XYZ and unique incremental IDs.
That failed, so I switched to filtering via awk on the command line:
  1. Removed any features on contigs not present in BAMs.
  2. Dropped zero-length or inverted spans (start >= end).
  3. Excluded any features shorter than 20bp (to avoid edge cases with de novo junction calling).
  4. Validated that all Parent IDs match existing transcript IDs.
  5. Normalized feature case (five_prime_UTR, etc.).
The resulting gff3 still throws "ValueError: Full intervals do not have minimum length" when I run the full file.
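
For anyone who wants to repeat the Parent check from step 4 without awk, here is a minimal Python sketch. It is not part of MAJIQ or AGAT; `parse_attributes` and `find_orphans` are made-up helper names, and it assumes a standard tab-separated, 9-column GFF3 with `ID=` and `Parent=` attributes.

```python
def parse_attributes(field):
    """Split a GFF3 column-9 string such as 'ID=a;Parent=b' into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_orphans(gff3_lines):
    """Return (line_number, missing_parent) pairs for features whose Parent
    does not match any ID defined anywhere in the file."""
    ids = set()
    children = []
    for number, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            ids.add(attrs["ID"])
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children.append((number, parent))
    # Compare only after the whole file is read, so child-before-parent order is fine.
    return [(number, parent) for number, parent in children if parent not in ids]


# Example use (the path is hypothetical):
# with open("annotation.gff3") as handle:
#     for number, parent in find_orphans(handle):
#         print(f"line {number}: Parent={parent} is not defined")
```

An empty result only means the links resolve; it says nothing about whether the coordinates of the linked features are sensible.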

I tested a minimal single gene build:
V     WormBase    gene  9244402     9246360     .     -     .     ID=WBGene00000003;Name=WBGene00000003
V     WormBase    mRNA  9244402     9246360     .     -     .     ID=F07C3.7.1;Parent=WBGene00000003
V     WormBase    exon  9246080     9246360     .     -     .     ID=exon:F07C3.7.1:1;Parent=F07C3.7.1
V     WormBase    exon  9245588     9246033     .     -     .     ID=exon:F07C3.7.1:2;Parent=F07C3.7.1
V     WormBase    exon  9245443     9245539     .     -     .     ID=exon:F07C3.7.1:3;Parent=F07C3.7.1
V     WormBase    exon  9244402     9245315     .     -     .     ID=exon:F07C3.7.1:4;Parent=F07C3.7.1

And that allowed a splicegraph to be built. So my basic structure seems to be okay, but running the full file again still throws that "ValueError: Full intervals do not have minimum length"

Any insights would be very appreciated. 

San Jewell

Jul 30, 2025, 1:26:40 PM
to Biociphers
Hi Margaret,

This error sounds like a problem with some specific lines in the gff3 file, rather than the entire file. However, as I've never seen this particular problem come up, I can only guess at the specific cause. I think it would be best if you could share the problematic gff3 file with me, so that I can look deeper into it and see which lines are causing the issue. Do you think you would be able to share it with me?

Thanks!
-San

Margaret M

Jul 30, 2025, 11:58:33 PM
to Biociphers
Hi, so I did sort of resolve this issue using AGAT. It turned out that the gff3 had orphaned parent links and was missing UTRs (which MAJIQ shouldn't care about but did?).

So I used the gtf hosted on WormBase ParaSite, converted GFF3 → GTF → GFF3 using AGAT, and then validated that all exon features are present, that the parent-child structure is gene → mRNA → exon, and that there are no zero-length or malformed entries.

From this I was able to build my annotation splicegraph. But when I did, I noticed that it ran suspiciously fast and when I inspected:
 SpliceGraph[7 contigs, 46926 genes, 160350/1539/116630 exons/introns/junctions]

I found a very small number of introns

From there I troubleshooted (troubleshot?):

Are there genes and features present in the gff3? Yes.
Can introns be inferred from the gff3? Yes, via a Python script I found ~200,000 inferred introns from exon chaining.
Substituted transcript → mRNA. This had no effect.
Validated the exon-mRNA-gene hierarchy via script; all IDs and Parents are linked.
Purged the old splicegraph just in case? Still [7 contigs, 46926 genes, 160350/1539/116630 exons/introns/junctions].
Ran with build_debug.log, but there are no errors or warnings about skipped transcripts.
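
An exon-chaining count along these lines can be sketched as follows. This is not the script used above, only an illustration of the idea: `infer_introns` is a hypothetical name, and it assumes exon features carry a `Parent=` attribute naming their transcript and use 1-based inclusive coordinates.

```python
from collections import defaultdict


def infer_introns(gff3_lines):
    """Return the set of unique (contig, strand, start, end) introns implied by
    consecutive exons of each transcript (1-based, inclusive coordinates)."""
    exons = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(cols[0], cols[6], parent)].append((int(cols[3]), int(cols[4])))

    introns = set()
    for (contig, strand, _transcript), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            # Only a real gap (at least 1 bp) between exons counts as an intron.
            if next_start > prev_end + 1:
                introns.add((contig, strand, prev_end + 1, next_start - 1))
    return introns
```

Run on the four exons of F07C3.7.1 from the single-gene test earlier in the thread, this gives three introns. A count produced this way covers every gap between chained exons, so it measures something different from the intron figure in the SpliceGraph summary line.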

Here's a link to the gff3 that I've been using. Let me know if I can provide further details or if you have any suggestions for a conversion?

Thanks very very much in advance!

https://www.dropbox.com/scl/fi/qo6knpyb0o4dg3gaagiex/majiq_ready.tar?rlkey=j2vart50imvabmvf8vhgl7nfd&st=rbm1wmaw&dl=0

San Jewell

Jul 31, 2025, 11:03:46 AM
to Biociphers
Hi Margaret,

I do appreciate you describing things in detail, but I am actually still a bit uncertain about the nature of your question.

It sounds like your main concern is that you expected more introns to be present in the splicegraph? You mention "200k inferred introns from exon chaining" -- I think you may be assuming that the number listed for splicegraph introns counts every intronic space between detected exons (that is, each space skipped by a junction). However, the number of introns listed is the number of retained introns, as these are what is relevant for downstream analysis. The annotation doesn't explicitly specify retained introns, but we can infer them by (simplified description) looking for multiple transcripts per gene where one transcript shows a splice junction and another shows an exon over the entire space.

[Attached image: Screenshot from 2025-07-31 11-01-05.png]
So, to repeat: the number you see for introns when looking at the SpliceGraph API object is the number of annotated retained introns in this case, not all introns.
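
To make the simplified description concrete, here is a toy Python sketch of that inference for a single gene. It is an illustration only and not MAJIQ's actual implementation; `annotated_retained_introns` is a hypothetical name, and the input is assumed to be a dict mapping each transcript ID to its list of (start, end) exons in 1-based inclusive coordinates.

```python
def annotated_retained_introns(transcripts):
    """Return introns of one transcript that another transcript of the same
    gene covers entirely with a single exon.

    transcripts: dict of transcript ID -> list of (start, end) exon tuples.
    """
    # Every exon annotated for the gene, across all transcripts.
    all_exons = {exon for spans in transcripts.values() for exon in spans}

    # Every gap between consecutive exons within a transcript.
    introns = set()
    for spans in transcripts.values():
        ordered = sorted(spans)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
            if next_start > prev_end + 1:
                introns.add((prev_end + 1, next_start - 1))

    # Keep only the gaps that some exon spans from end to end.
    return sorted(
        (i_start, i_end)
        for i_start, i_end in introns
        if any(e_start <= i_start and i_end <= e_end for e_start, e_end in all_exons)
    )
```

With one transcript that splices from 200 to 301 and a second transcript whose single exon runs 100-400, the gap 201-300 is reported. With only the first transcript nothing is reported, which is why this count is far smaller than the number of gaps between exons.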

Let me know if that was the information you were looking for? Happy to answer more.

Thanks
-San

Vanessa Roy

Aug 5, 2025, 7:38:32 AM
to Biociphers
Hi Margaret and San,

I'm facing the same issue, specifically with the C. elegans GFF3 file. In my case, I use a custom script to convert the GTF to GFF3. It worked fine for 4 other species, where I was able to run the majiq-v3 gff3 step without problems.

In case you have any solution, I would be very grateful to know how you resolved this issue.

Thanks a lot!

Best,
Vanessa

San Jewell

Aug 5, 2025, 12:18:48 PM
to Biociphers
Hi Vanessa,

It seems that Margaret has run a process to fix her gff3 file but had some concerns about the results of processing with it, while you are still running into the parsing error "Full intervals do not have minimum length" with your own gff3 file. Is that correct?

- Can you post the problem gff3 file with the error so that I can look into it? I never actually received the problem file from Margaret, so I can't be sure what is causing the error.
- Margaret posted an overview of what was done to clean up the file, and also a working version of the file, above. I believe you should also be able to download it from the Dropbox link if you'd like to use it.

Please correct me if my assumptions are wrong about anything. Thanks!

-San

Vanessa Roy

Aug 6, 2025, 3:59:11 AM
to Biociphers
Hi San,

Thanks for your reply. Sure, I can send you the C. elegans GFF3 file via email, if that works for you?

Thanks again,
Vanessa

San Jewell

Aug 6, 2025, 1:07:58 PM
to Biociphers
Hi Vanessa,

I received the file you sent and went through it. The error is cryptic; I will look to improve it.

I determined the problem comes from lines 267275/267276 of the file. Here there are exons defined for gene:WBGene00002221; transcript:C33H5.4. It seems exons 2 and 3 have coordinates such that there is no interval between them (7798301-7798302). To resolve the issue, you could remove this transcript from the annotation. I would also recommend reaching out to the annotation provider, as I believe this may be a mistake. The problem doesn't seem to exist for any other transcripts in the file.
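
A check along these lines could locate such transcripts before running the build. It is only a sketch with a hypothetical function name, and it assumes the error is triggered whenever two consecutive exons of one transcript leave no base between them, which is what this case suggests but has not been confirmed against the MAJIQ source.

```python
from collections import defaultdict


def find_abutting_exons(gff3_lines):
    """Return (transcript, line_a, line_b) for consecutive exons of the same
    transcript that abut or overlap, i.e. leave no intron between them."""
    exons = defaultdict(list)
    for number, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                # Keep the line number so the offending rows are easy to find.
                exons[parent].append((int(cols[3]), int(cols[4]), number))

    problems = []
    for transcript, spans in exons.items():
        spans.sort()
        for (_, prev_end, prev_line), (next_start, _, next_line) in zip(spans, spans[1:]):
            # 1-based inclusive coordinates: a gap needs next_start >= prev_end + 2.
            if next_start <= prev_end + 1:
                problems.append((transcript, prev_line, next_line))
    return problems


# Example use (the path is hypothetical):
# with open("annotation.gff3") as handle:
#     for transcript, a, b in find_abutting_exons(handle):
#         print(f"{transcript}: exons on lines {a} and {b} have no gap")
```

Reporting line numbers rather than only transcript IDs makes it easier to send the exact rows to the annotation provider.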

Let me know if it helps,

Thanks,
-San

Vanessa Roy

Aug 7, 2025, 10:55:25 AM
to Biociphers
Hi San,

Thanks a lot for your message. Yes, this makes sense. I removed the problematic transcript and gene, and then this step ran smoothly.
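
For others hitting the same error, the removal can be scripted rather than done by hand. The sketch below is one possible approach, not the exact procedure used here: `drop_features` is a hypothetical name, and it assumes the IDs in the file look like `transcript:C33H5.4` and `gene:WBGene00002221`.

```python
def drop_features(gff3_lines, bad_ids):
    """Return the lines with every feature in bad_ids, and all of their
    descendants (via Parent links), removed."""
    parsed = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            parsed.append((line, None, ()))  # comments and directives are kept
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        parents = tuple(p for p in attrs.get("Parent", "").split(",") if p)
        parsed.append((line, attrs.get("ID"), parents))

    removed = set(bad_ids)
    changed = True
    while changed:  # repeat so that children listed before their parents are caught
        changed = False
        for _, feature_id, parents in parsed:
            if feature_id and feature_id not in removed and any(p in removed for p in parents):
                removed.add(feature_id)
                changed = True

    # Note: a feature with several parents is dropped if ANY parent was removed.
    return [
        line
        for line, feature_id, parents in parsed
        if feature_id not in removed and not any(p in removed for p in parents)
    ]


# Example use (the path and the ID format are assumptions):
# with open("annotation.gff3") as handle:
#     kept = drop_features(handle.readlines(), {"transcript:C33H5.4"})
```

Dropping only the transcript leaves the gene and its other transcripts in place. Passing the gene ID instead removes the whole gene.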

Thanks again for your help,
Vanessa

Margaret M

Aug 22, 2025, 10:38:09 PM
to Biociphers
Hi, sorry to have gone silent on this one.

I think I was misunderstanding the way MAJIQ works. If I understand correctly now, MAJIQ will detect novel intron retention/junction usage from my BAM files, but when I am building the splicegraph, it is only going to tell me information about annotated events.

As for the issue with the gff3, glad it has been resolved for the other user. But I wonder if you could provide any tips and best practices for creating a C. elegans gff3 that MAJIQ will use?
I believe mine is appropriate (downloaded from wormbase then converted and cleaned with AGAT) but I'd love to feel more secure in this. 

San Jewell

Aug 26, 2025, 10:33:55 AM
to Biociphers
Hi Margaret, I'd recommend running voila view to check a gene of interest in the output, uncheck simplified (if applicable) and get a feel for all of the de-novo introns and junctions detected. I think all assumptions about the data should come from downstream tsv or view analysis and not from the log messages output by the build steps.

For the gff3, I don't know the exact process that these providers go through to create the file themselves. As you saw above, the problem was that a single transcript had two exons with no gap between them, which is invalid. I've added a better error message so that this specific type of error is easier to identify, but I'm not sure there is, in general, a better method than removing the invalid transcripts by hand. I will pose the question to my lab and have them add additional advice if they have any.

Thanks,
-San