majiq build -- GFF format issue?


Patrick T

Nov 11, 2021, 2:47:11 PM
to majiq_voila
Hi,

I'm trying to test out majiq 2.2, and unfortunately I can only get it to run with the hg19 .GFF provided in your documentation. It runs fine with that, but I've built my pipeline around another annotation file and would really rather not run the entire process again.

Ensembl and NCBI database .GFFs do not work and give thousands of errors such as:

(PID:1155) - WARNING - Error, incorrect gff. exon doesn't have valid mRNA b'rna-NR_106918.1

(PID:1155) - WARNING - Error, incorrect gff. exon doesn't have valid mRNA b'rna-MIR7846'

I'm wondering what's going on. NCBI's RefSeq FTP .GFF files use accession numbers instead of chr numbers, but that shouldn't matter if I built and mapped my genome with the accessions, should it? Here's an example from the GRCh38.p13 RefSeq .GFF:

NC_000001.11    Gnomon  exon    168100  168165  .       -       .       ID=exon-XR_001737579.2-4;Parent=rna-XR_001737579.2;Dbxref=GeneID:100996442,Genbank:XR_001737579.2;gbkey=ncRNA;gene=LOC100996442;product=uncharacterized LOC100996442%2C transcript variant X5;transcript_id=XR_001737579.2
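One way to narrow down the warning is to check what feature type each exon's Parent actually has in the file. The following is a minimal sketch, not MAJIQ's own parser: it assumes standard tab-separated GFF3 columns with `ID` and `Parent` attributes, and the function names are made up for illustration. Whether MAJIQ accepts only mRNA-type parents is also an assumption here, suggested by the wording of the warning.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def exon_parent_types(lines):
    """Count the feature type of every exon's Parent (hypothetical helper).

    Parents that have no record of their own are counted as '<missing>'.
    """
    id_to_type = {}
    exon_parents = []
    for line in lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            id_to_type[attrs["ID"]] = cols[2]
        if cols[2] == "exon" and "Parent" in attrs:
            # An exon may list several parents separated by commas.
            exon_parents.extend(attrs["Parent"].split(","))
    # Resolve parents only after the whole file has been read, so the
    # order of records in the file does not matter.
    return Counter(id_to_type.get(p, "<missing>") for p in exon_parents)
```

Passing an open file handle for the annotation to `exon_parent_types` returns a tally such as how many exons hang off mRNA records versus other transcript types. If most exon parents are something other than mRNA (for instance non-coding RNA records like the `rna-XR_...` parent above), that would be consistent with the warnings.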

I've tried with Ensembl as well, here's an example of GRCh37.87 from Ensembl (which seems to be formatted correctly, except for "chr"?): 

1       ensembl_havana  mRNA    47264718        47285085        .       +       .       ID=transcript:ENST00000271153;Parent=gene:ENSG00000142973;Name=CYP4B1-001;biotype=protein_coding;ccdsid=CCDS542.1;havana_transcript=OTTHUMT00000021911;havana_version=1;tag=basic;transcript_id=ENST00000271153;version=4
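The "chr" question can be checked directly by comparing the sequence names in column 1 of the GFF with the reference names in the BAM header. This is a minimal sketch, assuming the BAM reference names have already been pulled into a list (for example from the `@SQ` lines printed by `samtools view -H`); the function names are hypothetical.

```python
def gff_seqids(lines):
    """Collect the distinct sequence names (column 1) from GFF lines."""
    seqids = set()
    for line in lines:
        # An embedded FASTA section, if present, ends the annotation part.
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def compare_seqids(gff_names, bam_names):
    """Return (shared, only_in_gff, only_in_bam) as sorted lists."""
    gff_names, bam_names = set(gff_names), set(bam_names)
    return (
        sorted(gff_names & bam_names),
        sorted(gff_names - bam_names),
        sorted(bam_names - gff_names),
    )
```

An empty shared list would mean the annotation and the alignments use different naming schemes (e.g. `1` vs `chr1` vs `NC_000001.11`), which would presumably leave no reads overlapping annotated genes regardless of the exon/mRNA warnings.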

Is there a place to download .GFFs or .GTFs that are reliably formatted for this program? Also, if my genome was built and the .bam files aligned without a "properly" formatted .GFF, will majiq find no LSVs even if the .GFF given to majiq is correct? Any help would be appreciated. Thanks!

Patrick


Patrick T

Nov 11, 2021, 6:33:59 PM
to majiq_voila
So, evidently the GFF files taken directly from GENCODE (i.e. https://www.gencodegenes.org/human/) work fine, but the Ensembl/NCBI ones do not. Conversion from GTF with tools other than gtf2gff3 gives errors as well. I can't even get the gtf2gff3 script working, as the config file is missing from git (which others have clearly had issues with as well), and I have spent enough time debugging ones from other sources.

I'm going to rerun the entire alignment pipeline with the majiq.hg19 annotation and then again with another annotation over the weekend (probably GRCh38). I'll update again if anything fails horribly, and also when I get it working, as this may be helpful to others who want to use different annotations. Thanks

Patrick 
