Hi everyone,
I downloaded the BED file from the RSeQC website, but its format is very different from the UCSC BED format: it does not have the 12 BED columns.
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
This is the file that I downloaded:
bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
0 ENST00000371007.6 chr1 - 67092164 67231852 67093004 67127240 8 67092164,67095234,67096251,67115351,67125751,67127165,67131141,67231845, 67093604,67095421,67096321,67115464,67125909,67127257,67131227,67231852, 0 C1orf141 cmpl cmpl 0,2,1,2,0,0,-1,-1,
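The column names above (bin, name, chrom, strand, txStart, ...) look like a UCSC genePred-style table rather than BED. A minimal sketch of how one such row could be rearranged into the 12 BED columns is given here, assuming the column order shown in the header and whitespace-separated fields. The function name `genepred_ext_to_bed12` is hypothetical, and the score and itemRgb values are simply set to 0.

```python
def genepred_ext_to_bed12(line):
    """Rearrange one genePred-style row (with a leading bin column) into BED12.

    Assumes the column order: bin, name, chrom, strand, txStart, txEnd,
    cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...
    """
    f = line.split()
    name, chrom, strand = f[1], f[2], f[3]
    tx_start, tx_end = int(f[4]), int(f[5])
    cds_start, cds_end = f[6], f[7]

    # exonStarts / exonEnds are comma-separated lists with a trailing comma
    starts = [int(x) for x in f[9].rstrip(",").split(",")]
    ends = [int(x) for x in f[10].rstrip(",").split(",")]

    # BED12 stores block sizes, and block starts relative to chromStart
    block_sizes = [e - s for s, e in zip(starts, ends)]
    block_starts = [s - tx_start for s in starts]

    return "\t".join([
        chrom, str(tx_start), str(tx_end), name,
        "0",                    # score (placeholder)
        strand,
        cds_start, cds_end,     # thickStart, thickEnd
        "0",                    # itemRgb (placeholder)
        str(len(starts)),       # blockCount
        ",".join(map(str, block_sizes)) + ",",
        ",".join(map(str, block_starts)) + ",",
    ])


# The record shown above, copied from the downloaded file
example = (
    "0 ENST00000371007.6 chr1 - 67092164 67231852 67093004 67127240 8 "
    "67092164,67095234,67096251,67115351,67125751,67127165,67131141,67231845, "
    "67093604,67095421,67096321,67115464,67125909,67127257,67131227,67231852, "
    "0 C1orf141 cmpl cmpl 0,2,1,2,0,0,-1,-1,"
)
bed_line = genepred_ext_to_bed12(example)
print(bed_line)
```

This is only a sketch for a single row; a header line, if present in the file, would need to be skipped before calling it.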
So I decided to download the GTF file and convert it to a BED file with gtf2bed:
gtf2bed < gencode.v23.annotation.gtf > gencode.v23.annotation.bed
chr1 11868 12227 ENSG00000223972.5 . + HAVANA exon . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 11868 14409 ENSG00000223972.5 . + HAVANA gene . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
But when I run the script junction_annotation.py, I get an error:
Reading reference bed file: /home/baptiste/Documents/Genomes/FileGTF/gencode.v23.annotation.bed ...
Traceback (most recent call last):
  File "/home/baptiste/Documents/Biotools/RSeQC-2.6.3/scripts/junction_annotation.py", line 125, in <module>
    main()
  File "/home/baptiste/Documents/Biotools/RSeQC-2.6.3/scripts/junction_annotation.py", line 109, in main
    obj.annotate_junction(outfile=options.output_prefix,refgene=options.ref_gene_model,min_intron=options.min_intron, q_cut = options.map_qual)
  File "/usr/local/lib/python2.7/dist-packages/RSeQC-2.6.3-py2.7-linux-x86_64.egg/qcmodule/SAM.py", line 3762, in annotate_junction
    exon_starts = map( int, fields[11].rstrip( ',\n' ).split( ',' ) )
ValueError: invalid literal for int() with base 10: 'transcript_id'
So I suspect that the converted BED file is still not in the right format...
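The traceback shows the script converting `fields[11]` to integers, and the 12th whitespace-separated token of the gtf2bed output above is indeed `transcript_id`. A small sketch of a check that mirrors that parsing step is given here, assuming the reference file is split on whitespace as the error suggests. The function name `twelfth_field_is_int_list` is hypothetical, the first test line is a shortened copy of the gtf2bed output, and the second is a BED12-shaped line built by hand from the genePred record shown earlier.

```python
def twelfth_field_is_int_list(line):
    """Return True if the 12th whitespace-separated field parses as a
    comma-separated list of integers (as BED12 blockStarts would)."""
    fields = line.split()
    if len(fields) < 12:
        return False
    try:
        # Same conversion as the line shown in the traceback
        [int(x) for x in fields[11].rstrip(",\n").split(",")]
    except ValueError:
        return False
    return True


# Shortened copy of the first gtf2bed output line
gtf2bed_line = (
    'chr1 11868 12227 ENSG00000223972.5 . + HAVANA exon . '
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";'
)

# BED12-shaped line built by hand from the genePred record (illustration only)
bed12_line = (
    "chr1\t67092164\t67231852\tENST00000371007.6\t0\t-\t67093004\t67127240\t0\t8\t"
    "1440,187,70,113,158,92,86,7,\t0,3070,4087,23187,33587,35001,38977,139681,"
)

print(gtf2bed_line.split()[11])                 # the token that int() is given
print(twelfth_field_is_int_list(gtf2bed_line))
print(twelfth_field_is_int_list(bed12_line))
```

Running the check on the first few lines of a reference file before passing it to the script would show quickly whether column 12 holds block starts or something else.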
Thank you for your help,
Baptiste