gtf2bed issue

609 views
Skip to first unread message

mehr...@broadinstitute.org

unread,
Jul 25, 2017, 1:42:21 PM7/25/17
to rseqc-discuss
Hello,
To use junction_saturation.py and other scripts a bed12 file is needed. 

I tried to the gtf2bed.py utility conveniently provided, however, the converted file is not in valid format.

Steps:
  1. gtf2bed --do-not-sort < gencode.v19.annotation.gtf > unsorted-gencode.v19.annotation.gtf.bed  
    1. Note gencode.v19.annotation.gtf is download from gencode webpage for GRCH37 build
  2. junction_annotation.py -i /star/Sample_EG10.Aligned.sortedByCoord.out.bam -r unsorted-gencode.v19.annotation.gtf.bed
Error is shown:
Reading reference bed file:  unsorted-gencode.v19.annotation.gtf.bed  ... Traceback (most recent call last):
  File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 125, in <module>
    main()
  File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 109, in main
    obj.annotate_junction(outfile=options.output_prefix,refgene=options.ref_gene_model,min_intron=options.min_intron, q_cut = options.map_qual)
  File "/software/RSeQC/2.6.4/lib/python2.7/site-packages/qcmodule/SAM.py", line 3762, in annotate_junction
    exon_starts = map( int, fields[11].rstrip( ',\n' ).split( ',' ) )
ValueError: invalid literal for int() with base 10: 'transcript_id'

the bed file does not seems to be adapted to bed12 format:

head unsorted-gencode.v19.annotation.gtf.bed 

1       11868   14412   ENSG00000223972.4       .       +       HAVANA  gene    .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
1       11868   14409   ENSG00000223972.4       .       +       HAVANA  transcript      .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
1       11868   12227   ENSG00000223972.4       .       +       HAVANA  exon    .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

This seems to be issue of gtf2bed.py. 

Can this be corrected? 

Thanks,
Sudeep 

Liguo Wang

unread,
Jul 25, 2017, 1:57:39 PM7/25/17
to rseqc-...@googlegroups.com
Please keep in mind BED file can be directly downloaded from the UCSC Table browser for many well annotated genomes (like human, mouse, etc)
https://genome.ucsc.edu/cgi-bin/hgTables

Liguo


--
You received this message because you are subscribed to the Google Groups "rseqc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mehr...@broadinstitute.org

unread,
Jul 25, 2017, 2:02:49 PM7/25/17
to rseqc-discuss
Thanks Liguo from a prompt response.

Turns out that is what I was using:

/Programs/BedOPS/bin/gtf2bed

gtf2bed -v
convert2bed -i gtf
  version:  2.4.27 (typical)
  author:   Alex Reynolds

so does this means that bed12 format created is NOT compatible for junction_saturations and other script within RSeQC or its an issue with bedops. 

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discus...@googlegroups.com.

mehr...@broadinstitute.org

unread,
Jul 25, 2017, 2:07:20 PM7/25/17
to rseqc-discuss
Regarding your other pointer (Please keep in mind BED file can be directly downloaded from the UCSC Table browser for many well annotated genomes (like human, mouse, etc)
https://genome.ucsc.edu/cgi-bin/hgTables)

To be very sure that I am not mixing diff gene models, I prefer to use gencode.gtf file. 

While I can download from UCSC, next i would need to compare the bed with the GTF that the coordinates are indeed as expected. 

Thanks.
Sudeep 

Liguo Wang

unread,
Jul 25, 2017, 2:21:02 PM7/25/17
to rseqc-...@googlegroups.com
the "unsorted-gencode.v19.annotation.gtf.bed " is obviously not in bed12 format. I am not sure what happened during your conversion. the BED file RSeQC used strictly follows its specification (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)

Why not directly download BED file for GENCODE.v19 from UCSC table browser? 

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

Liguo Wang

unread,
Jul 25, 2017, 2:23:38 PM7/25/17
to rseqc-...@googlegroups.com
For the same genome version, GENCODE version, GTF file and BED file should 100% match each other.

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

mehr...@broadinstitute.org

unread,
Jul 25, 2017, 2:34:26 PM7/25/17
to rseqc-discuss
Thanks so much for continued support about this.

So here is an entry in the gencode:
1       HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

and here is entry from UCSC:

Track is: wgEncodeGencodeV19
table: comprehensive
assembly: GRCh37

chr1    11868   14409   ENST00000456328.2       0       +       11868   11868   0       3       359,109,1189,   0,744,1352,
Notice the difference one base pair in the 'start'. There are other such discrepancies. This is just one example.

Total entires in the UCSC gene model file: 189020
Total entires in the gencode (excluding header etc.): 2619444

--
Sudeep 

Liguo Wang

unread,
Jul 25, 2017, 3:55:42 PM7/25/17
to rseqc-...@googlegroups.com
The output is totally correct. GTF coordinates start from "1", while BED starts from "0".  They 100% match if you visualize them using UCSC or IGV.

Liguo

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

mehr...@broadinstitute.org

unread,
Jul 26, 2017, 8:50:20 AM7/26/17
to rseqc-discuss
That is true, the coordinate is Base1 vs Base 0. 
Thank you !
Reply all
Reply to author
Forward
0 new messages