Hello,
To use junction_saturation.py and other scripts a bed12 file is needed.
I tried to the gtf2bed.py utility conveniently provided, however, the converted file is not in valid format.
Steps:
- gtf2bed --do-not-sort < gencode.v19.annotation.gtf > unsorted-gencode.v19.annotation.gtf.bed
- Note gencode.v19.annotation.gtf is download from gencode webpage for GRCH37 build
- junction_annotation.py -i /star/Sample_EG10.Aligned.sortedByCoord.out.bam -r unsorted-gencode.v19.annotation.gtf.bed
Error is shown:
Reading reference bed file: unsorted-gencode.v19.annotation.gtf.bed ... Traceback (most recent call last):
File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 125, in <module>
main()
File "/software/RSeQC/2.6.4/bin/junction_annotation.py", line 109, in main
obj.annotate_junction(outfile=options.output_prefix,refgene=options.ref_gene_model,min_intron=options.min_intron, q_cut = options.map_qual)
File "/software/RSeQC/2.6.4/lib/python2.7/site-packages/qcmodule/SAM.py", line 3762, in annotate_junction
exon_starts = map( int, fields[11].rstrip( ',\n' ).split( ',' ) )
ValueError: invalid literal for int() with base 10: 'transcript_id'
the bed file does not seems to be adapted to bed12 format:
head unsorted-gencode.v19.annotation.gtf.bed
1 11868 14412 ENSG00000223972.4 . + HAVANA gene . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
1 11868 14409 ENSG00000223972.4 . + HAVANA transcript . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
1 11868 12227 ENSG00000223972.4 . + HAVANA exon . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
This seems to be issue of gtf2bed.py.
Can this be corrected?
Thanks,
Sudeep