How to Allow for Search of Gene and ID in TrackHub on UCSC Genome Browser


Heater, Blair Delane

Sep 2, 2016, 4:02:06 PM
to hi...@soe.ucsc.edu, gen...@soe.ucsc.edu, msp...@soe.ucsc.edu, Hendrickson, Curtis (Campus)

Dear Hiram,

 

I am the student intern in Biomedical Informatics at the Center for Clinical and Translational Science at UAB who worked with Curtis Hendrickson to create track hubs for new genomes on the UCSC Genome Browser. I was away over the summer but returned to work this week.

 

Curtis noticed that any search on gene name (e.g. UL46) or ID (e.g. YP_081523.1) results in an error on our track hub. To determine how your track hub supports search on both gene name and ID, I converted your bigBed file, GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bb, to bed format (the attached file) and am puzzled, since it appears to have 18 columns. At first I thought it was bed detail format, but that does not match the description of bed detail in the UCSC Genome Browser file format documentation.

1. Do the final, additional columns of the bed file facilitate the search for ID or gene name? In the bigBed file they correspond to: string type "Transcript type"; string geneName "Primary identifier for gene"; string geneName2 "Alternative/human readable gene name"; string geneType "Gene type".

2. How did you get this working when the format appears to contradict the UCSC documentation for the bed and bed detail formats?

I appreciate your time and help.

 

Thank you,

Blair Heater

 

GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bed

Hiram Clawson

Sep 2, 2016, 4:16:53 PM
to Heater, Blair Delane, gen...@soe.ucsc.edu, msp...@soe.ucsc.edu, Hendrickson, Curtis (Campus)
Good Afternoon Blair:

This track in the assembly hub is actually a 'bigGenePred' data type; it is not a bed file.
There is an index in the bigGenePred file on the 'name' column, and there is an additional
index of alias names for the genes in the GCF_000845245.1_ViralProj14559.ncbiGene.ix
and GCF_000845245.1_ViralProj14559.ncbiGene.ixx files. Note the trackDb entry that specifies
all of this:

track ncbiGene
longLabel ncbiGene - gene predictions delivered with assembly from NCBI
shortLabel ncbiGene
priority 12
visibility pack
color 0,80,150
altColor 150,80,0
colorByStrand 0,80,150 150,80,0
bigDataUrl bbi/GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bb
type bigGenePred
html GCF_000845245.1_ViralProj14559.ncbiGene
searchIndex name
searchTrix GCF_000845245.1_ViralProj14559.ncbiGene.ix
url http://www.ncbi.nlm.nih.gov/nuccore/$$
urlLabel NCBI Nucleotide database
group genes
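
For readers checking their own hub, here is a minimal Python sketch that reads a trackDb stanza and reports which search settings it declares. It uses an abbreviated copy of the stanza above. The helper names parse_trackdb_stanza and search_settings are hypothetical, and the parser assumes the simple "setting value" line layout shown here rather than the full trackDb syntax.

```python
def parse_trackdb_stanza(text):
    """Parse one trackDb stanza into a dict of setting -> value.

    Assumes one 'setting value' pair per line; blank lines and
    '#' comment lines are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def search_settings(settings):
    """Return only the search-related settings found in a parsed stanza."""
    return {k: settings[k] for k in ("searchIndex", "searchTrix") if k in settings}


# Abbreviated copy of the ncbiGene stanza from this message.
stanza = """\
track ncbiGene
bigDataUrl bbi/GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bb
type bigGenePred
searchIndex name
searchTrix GCF_000845245.1_ViralProj14559.ncbiGene.ix
"""

parsed = parse_trackdb_stanza(stanza)
print(search_settings(parsed))
```

A stanza that returns an empty dict here declares neither setting, which would be one thing to check when a hub search fails.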

You can see all these files in the assembly hub directory:
http://genome-test.cse.ucsc.edu/gbdb/hubs/refseq/viral/02/GCF_000845245.1_ViralProj14559/

The processing script that converted the NCBI GFF3 file into the genePred can
be found in:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/genbank/ncbiGene.sh

That script uses gpToIx.pl to extract gene name aliases from the extra columns
in the genePred file:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/genbank/gpToIx.pl
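
As a rough illustration of the alias extraction described above (not the logic of gpToIx.pl itself, which is not reproduced here), the following Python sketch turns tab-separated bigGenePred rows into the "id, then searchable words" lines that ixIxx takes as input. The function name trix_input_lines is hypothetical. The column positions (name at 3, name2 at 12, geneName at 17, geneName2 at 18, all 0-based) assume the standard 20-column bigGenePred layout and may need adjusting for a given file. The example row is made up, reusing the UL46 and YP_081523.1 identifiers from this thread.

```python
def trix_input_lines(bed_lines, name_col=3, alias_cols=(12, 17, 18)):
    """Yield 'name<TAB>word word ...' lines, one per input row.

    Column indices are 0-based and are assumptions about the file layout.
    """
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue  # skip short or malformed rows
        name = fields[name_col]
        aliases = []
        for col in alias_cols:
            if col < len(fields):
                alias = fields[col].strip()
                # Keep each alias once, and do not repeat the name itself.
                if alias and alias != name and alias not in aliases:
                    aliases.append(alias)
        # The name is listed among the words so it is also searchable via the trix index.
        yield name + "\t" + " ".join([name] + aliases)


# Hypothetical row: coordinates and chromosome name are placeholders.
example_row = "\t".join([
    "chrExample", "100", "400", "YP_081523.1", "0", "+", "100", "400", "0",
    "1", "300,", "0,", "UL46", "cmpl", "cmpl", "0,",
    "protein_coding", "UL46", "UL46", "protein_coding",
])

for out_line in trix_input_lines([example_row]):
    print(out_line)
```

The resulting text file would then be indexed with ixIxx to produce the .ix and .ixx files named in the trackDb entry.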

--Hiram

On 9/2/16 12:50 PM, Heater, Blair Delane wrote:
> Dear Hiram,
>
> I am the student intern in Biomedical Informatics at the Center for Clinical and Translational Science at UAB that worked with Curtis Hendrickson creating track hubs for new genomes on the UCSC Genome Browser. I have been absent over the summer but have returned to work this week.
>
> Curtis noticed that any search on gene name (i.e. UL46) or ID (i.e. YP_081523.1) would result in error on our track hub. In an effort to determine how your track hub supported search on both gene name and ID, I converted your bigbed file, GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bb<http://genome-preview.cse.ucsc.edu/gbdb/hubs/refseq/viral/02/GCF_000845245.1_ViralProj14559/bbi/GCF_000845245.1_ViralProj14559.ncbiGene.ncbi.bb>, to bed format (the attached file) and am puzzled since it appears to have 18 columns. First I thought it was bed detail format, but that doesn’t make sense, given the description of bed detail format on the UCSC genome format documentation.

Heater, Blair Delane

Sep 9, 2016, 1:52:21 PM
to Hiram Clawson, gen...@soe.ucsc.edu, msp...@soe.ucsc.edu, Hendrickson, Curtis (Campus)

Dear Hiram,

 

Thank you for all the details and references. I will continue to look into the ‘bigGenePred’ data type and your indexing methods to streamline conversions for viewing in the UCSC Genome Browser. Based on your example and other documentation, I indexed the gene name column of our bed12 files when converting from bed to bigBed, which now supports search by gene name for each strain on our track hub in the genome browser. I appreciate your time and help.

 

Thank you,

Blair Heater
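
The bed12 approach described in this last message could look roughly like the Python sketch below, which only assembles a bedToBigBed command line with the name column indexed through the -extraIndex option. The exact command used in the thread is not shown, so this is an assumption. The function name bed_to_bigbed_command and the file names are hypothetical placeholders. The trackDb stanza for the track would also need a "searchIndex name" line, as in Hiram's example.

```python
import shlex


def bed_to_bigbed_command(bed_path, chrom_sizes_path, out_path, index_fields=("name",)):
    """Build (but do not run) a bedToBigBed command that indexes the given fields."""
    return [
        "bedToBigBed",
        "-type=bed12",
        "-extraIndex=" + ",".join(index_fields),
        bed_path,           # sorted bed12 input (placeholder name)
        chrom_sizes_path,   # chrom.sizes file for the assembly (placeholder name)
        out_path,           # bigBed output (placeholder name)
    ]


cmd = bed_to_bigbed_command("genes.sorted.bed", "chrom.sizes", "genes.bb")
print(shlex.join(cmd))
# To run it where the UCSC tools are installed:
#   import subprocess; subprocess.run(cmd, check=True)
```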
