Problem with the BED file, junction

baptiste mossotti

unread,

Mar 20, 2016, 10:51:53 PM3/20/16

to rseqc-discuss

Hi everyone,

I downloaded the bed file from the RSeQC website but the format is really different from the BED format UCSC. --> no 12 columns
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

this the file that I downloaded
bin     name    chrom   strand txStart txEnd   cdsStart        cdsEnd exonCount       exonStarts      exonEnds        score   name2   cdsStartStat    cdsEndStat      exonFrames
0       ENST00000371007.6       chr1    -       67092164        67231852        67093004        67127240        8       67092164,67095234,67096251,67115351,67125751,67127165,67131141,67231845,        67093604,67095421,67096321,67115464,67125909,67127257,67131227,67231852,        0       C1orf141        cmpl       cmpl    0,2,1,2,0,0,-1,-1,

So I decided to downloaded the GTF file and to convert it into the bed file with gtf2bed
gtf2bed < gencode.v23.annotation.gtf > gencode.v23.annotation.bed
chr1    11868   12227   ENSG00000223972.5       .       +       HAVANA exon    .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    11868   14409   ENSG00000223972.5       .       +       HAVANA gene    .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

But when I run the function junction_annotation.py, I got an error:
Reading reference bed file: /home/baptiste/Documents/Genomes/FileGTF/gencode.v23.annotation.bed ... Traceback (most recent call last):
File "/home/baptiste/Documents/Biotools/RSeQC-2.6.3/scripts/junction_annotation.py", line 125, in <module>
    main()
File "/home/baptiste/Documents/Biotools/RSeQC-2.6.3/scripts/junction_annotation.py", line 109, in main
    obj.annotate_junction(outfile=options.output_prefix,refgene=options.ref_gene_model,min_intron=options.min_intron, q_cut = options.map_qual)
File "/usr/local/lib/python2.7/dist-packages/RSeQC-2.6.3-py2.7-linux-x86_64.egg/qcmodule/SAM.py", line 3762, in annotate_junction
    exon_starts = map( int, fields[11].rstrip( ',\n' ).split( ',' ) )
ValueError: invalid literal for int() with base 10: 'transcript_id'

So I was thinking that maybe the bed file is still not in a good format...
Thank you for your help,

Baptiste

Liguo Wang

unread,

Mar 21, 2016, 12:35:10 PM3/21/16

to rseqc-...@googlegroups.com

You are right. the BED file you used is not in good format. Could you please let me know which file you downloaded so I can replace it.

Thanks

Liguo

--
You received this message because you are subscribed to the Google Groups "rseqc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rseqc-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

baptiste mossotti

unread,

Mar 21, 2016, 11:21:49 PM3/21/16

to rseqc-discuss

Hi Liguo,

Thank you for your reply. I just resolved my problem.
The file that I downloaded is the hg38_GENCODE_v23.bed.gz
But I transformed this file with R into something like this:
chr1 67092164 67231852 ENST00000371007.6 0 - 67092164 67231852 0 8 1440,187,70,113,158,92,86,7 0,3070,4087,23187,33587,35001,38977,139681
chr1 67092175 67127261 ENST00000371006.5 0 - 67092175 67127261 0 6 1429,187,70,113,158,96 0,3059,4076,23176,33576,34990

And it works.
Best

Baptiste

Reply all

Reply to author

Forward

Problem with the BED file, junction_annotation

baptiste mossotti

Liguo Wang

baptiste mossotti