gtf2bed issue

mehr...@broadinstitute.org

unread,

Jul 25, 2017, 1:42:21 PM7/25/17

to rseqc-discuss

Hello,

To use junction_saturation.py and other scripts a bed12 file is needed.

I tried to the gtf2bed.py utility conveniently provided, however, the converted file is not in valid format.

Steps:

gtf2bed --do-not-sort < gencode.v19.annotation.gtf > unsorted-gencode.v19.annotation.gtf.bed

Note gencode.v19.annotation.gtf is download from gencode webpage for GRCH37 build

junction_annotation.py -i /star/Sample_EG10.Aligned.sortedByCoord.out.bam -r unsorted-gencode.v19.annotation.gtf.bed

Error is shown:

Reading reference bed file: unsorted-gencode.v19.annotation.gtf.bed ... Traceback (most recent call last):

File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 125, in <module>

main()

File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 109, in main

obj.annotate_junction(outfile=options.output_prefix,refgene=options.ref_gene_model,min_intron=options.min_intron, q_cut = options.map_qual)

File "/software/RSeQC/2.6.4/lib/python2.7/site-packages/qcmodule/SAM.py", line 3762, in annotate_junction

exon_starts = map( int, fields[11].rstrip( ',\n' ).split( ',' ) )

ValueError: invalid literal for int() with base 10: 'transcript_id'

the bed file does not seems to be adapted to bed12 format:

head unsorted-gencode.v19.annotation.gtf.bed

1 11868 14412 ENSG00000223972.4 . + HAVANA gene . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

1 11868 14409 ENSG00000223972.4 . + HAVANA transcript . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

1 11868 12227 ENSG00000223972.4 . + HAVANA exon . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

This seems to be issue of gtf2bed.py.

Can this be corrected?

Thanks,

Sudeep

Liguo Wang

unread,

Jul 25, 2017, 1:57:39 PM7/25/17

to rseqc-...@googlegroups.com

Hello,

Please try BEDOPS (http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/gtf2bed.html).

Please keep in mind BED file can be directly downloaded from the UCSC Table browser for many well annotated genomes (like human, mouse, etc)
https://genome.ucsc.edu/cgi-bin/hgTables

Liguo

--
You received this message because you are subscribed to the Google Groups "rseqc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mehr...@broadinstitute.org

unread,

Jul 25, 2017, 2:02:49 PM7/25/17

to rseqc-discuss

Thanks Liguo from a prompt response.

Turns out that is what I was using:

/Programs/BedOPS/bin/gtf2bed

gtf2bed -v

convert2bed -i gtf

version: 2.4.27 (typical)

author: Alex Reynolds

so does this means that bed12 format created is NOT compatible for junction_saturations and other script within RSeQC or its an issue with bedops.

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discus...@googlegroups.com.

mehr...@broadinstitute.org

unread,

Jul 25, 2017, 2:07:20 PM7/25/17

to rseqc-discuss

Regarding your other pointer (Please keep in mind BED file can be directly downloaded from the UCSC Table browser for many well annotated genomes (like human, mouse, etc)
https://genome.ucsc.edu/cgi-bin/hgTables)

To be very sure that I am not mixing diff gene models, I prefer to use gencode.gtf file.

While I can download from UCSC, next i would need to compare the bed with the GTF that the coordinates are indeed as expected.

Thanks.

Sudeep

Liguo Wang

unread,

Jul 25, 2017, 2:21:02 PM7/25/17

to rseqc-...@googlegroups.com

the "unsorted-gencode.v19.annotation.gtf.bed " is obviously not in bed12 format. I am not sure what happened during your conversion. the BED file RSeQC used strictly follows its specification (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)

Why not directly download BED file for GENCODE.v19 from UCSC table browser?

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

Liguo Wang

unread,

Jul 25, 2017, 2:23:38 PM7/25/17

to rseqc-...@googlegroups.com

For the same genome version, GENCODE version, GTF file and BED file should 100% match each other.

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

mehr...@broadinstitute.org

unread,

Jul 25, 2017, 2:34:26 PM7/25/17

to rseqc-discuss

Thanks so much for continued support about this.

So here is an entry in the gencode:

1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

and here is entry from UCSC:

(https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=601122691_9v09BvIciZnOdWtnebNTFOXHLkFX&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=wgEncodeGencodeV19&hgta_table=wgEncodeGencodeCompV19&hgta_regionType=genome&position=chr1%3A11102837-11267747&hgta_outputType=bed&hgta_outFileName=Human_GRCh37_Gencode_GeneModel)

Track is: wgEncodeGencodeV19

table: comprehensive

assembly: GRCh37

chr1 11868 14409 ENST00000456328.2 0 + 11868 11868 0 3 359,109,1189, 0,744,1352,

Notice the difference one base pair in the 'start'. There are other such discrepancies. This is just one example.

Total entires in the UCSC gene model file: 189020

Total entires in the gencode (excluding header etc.): 2619444

--

Sudeep

Liguo Wang

unread,

Jul 25, 2017, 3:55:42 PM7/25/17

to rseqc-...@googlegroups.com

The output is totally correct. GTF coordinates start from "1", while BED starts from "0". They 100% match if you visualize them using UCSC or IGV.

Liguo

To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discuss+unsubscribe@googlegroups.com.

mehr...@broadinstitute.org

unread,

Jul 26, 2017, 8:50:20 AM7/26/17

to rseqc-discuss

That is true, the coordinate is Base1 vs Base 0.

Thank you !

Reply all

Reply to author

Forward