fuzzy search of bigBed name field?

ve...@genomics.fsu.edu

Mar 29, 2016, 12:01:42 PM
to genome...@soe.ucsc.edu
Hello,

I can't seem to get fuzzy search working for bigBed files in a mirror. It works with standard bed files. Are there additional considerations for bigBed files? Setting searchType fuzzy in trackDb.ra did not automatically generate a query in hgFindSpec, so I manually defined it in trackDb.ra.

# head ensemblGenes.bed
1 4853 9652 GRMZM2G059865_T01 0 - 5126 9519 0 9 335,66,119,158,156,159,203,310,460, 0,488,1003,1254,1508,1785,2064,2740,4339,
1 4856 6355 GRMZM2G059865_T03 0 - 5839 6241 0 3 332,634,248, 0,485,1251,
1 4856 9652 GRMZM2G059865_T02 0 - 5839 9519 0 8 332,634,158,156,159,203,310,460, 0,485,1251,1505,1782,2061,2737,4336,
1 9881 10387 GRMZM5G888250_T01 0 - 9886 10117 0 1 506, 0,
1 109518 111769 GRMZM2G093344_T01 0 - 111110 111727 0 4 157,177,374,304, 0,240,1250,1947,
1 136306 138929 GRMZM2G093399_T01 0 + 138578 138899 0 4 328,295,86,378, 0,412,898,2245,
1 144360 144657 GRMZM5G809743_T01 0 + 144360 144657 0 1 297, 0,
1 144956 145646 GRMZM5G833153_T01 0 + 144956 145646 0 1 690, 0,
1 146264 147500 GRMZM2G306216_T01 0 - 146264 147500 0 4 8,261,58,78, 0,109,432,1158,
1 161142 161925 AC177838.2_FGT015 0 + 161142 161925 0 1 783, 0,

# bedToBigBed -extraIndex=name ensemblGenes.bed ../genome.chrom.sizes ensemblGenesGrmzm.bb
pass1 - making usageList (132 chroms): 26 millis
pass2 - checking and writing primary data (63391 records, 12 fields): 630 millis
Sorting and writing extra index 0: 26 millis

# head -n 17 ../trackDb.ra
track ensemblGenesGrmzm
shortLabel Genes
longLabel Genes from Ensembl with GRMZM IDs
type bigBed 12 .
group map
visibility pack
priority 50
html html/ensemblGenesGrmzm 

searchName ensemblGenesGrmzm
searchTable ensemblGenesGrmzm
searchType bigBed
searchPriority 1
searchIndex name
searchMethod fuzzy
query select chrom,chromStart,chromEnd,name from %s where name like '%%%s%%'

# hgTrackDb $GBDIR/$name $name trackDb $SWDIR/kent/src/hg/lib/trackDb.sql $GBDIR/$name
Loaded 16 track descriptions total
Loaded database zeaMayB73_v3

# hgFindSpec $GBDIR/$name $name hgFindSpec $SWDIR/kent/src/hg/lib/hgFindSpec.sql $GBDIR/$name
Loaded 1 search specs total
Loaded database zeaMayB73_v3

# hgsql -e "select * from zeaMayB73_v3.hgFindSpec"
+----------------------+----------------------+--------------+------------+--------------+-----------+------------------------------------------------------------------------+-----------+-----------+----------------+---------------------------------------+-------------------+
| searchName           | searchTable          | searchMethod | searchType | shortCircuit | termRegex | query                                                                  | xrefTable | xrefQuery | searchPriority | searchDescription                     | searchSettings    |
+----------------------+----------------------+--------------+------------+--------------+-----------+------------------------------------------------------------------------+-----------+-----------+----------------+---------------------------------------+-------------------+
| ensemblGenesGrmzm    | ensemblGenesGrmzm    | fuzzy        | bigBed     |            0 |           | select chrom,chromStart,chromEnd,name from %s where name like '%%%s%%' |           |           |              1 | Genes from Ensembl with GRMZM IDs     | searchIndex name  |
+----------------------+----------------------+--------------+------------+--------------+-----------+------------------------------------------------------------------------+-----------+-----------+----------------+---------------------------------------+-------------------+

# hgsql -e "select * from zeaMayB73_v3.ensemblGenesGrmzm"
+---------------------------------------------------+
| fileName                                          |
+---------------------------------------------------+
| /vault/gbdb/zeaMayB73_v3/bbi/ensemblGenesGrmzm.bb |
+---------------------------------------------------+
 

Cath Tyner

Mar 30, 2016, 12:35:09 PM
to ve...@genomics.fsu.edu, genome...@soe.ucsc.edu
Hello Dan,

Thank you for submitting your question regarding the use of fuzzy search with bigBed files on a UCSC Genome Browser mirror. To accomplish this, you will need both the "searchIndex" and "searchTrix" settings. You can find information about these settings under the "bigBed - Item or region track settings" section here: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbHub.html.

The searchTrix file is used to map free-text to IDs, which are then searched for in the searchIndex of the corresponding data file - you will need to define the searchTrix file in your trackDb and also define a corresponding searchIndex.

Here is a previously answered mailing-list question on this topic:

Thank you again for your inquiry and for using the UCSC Genome Browser. Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private data help: Email genom...@soe.ucsc.edu

Enjoy,

Cath Tyner

Mar 31, 2016, 12:37:42 PM
to ve...@genomics.fsu.edu, genome...@soe.ucsc.edu
Hello again Dan,

Here is some additional information that may be helpful to you, based on input from one of our engineers:


hgFindSpec searches apply to tracks based on SQL database tables (e.g. loaded from regular BED), but not file formats such as bigBed. Sorry that our documentation didn't make that clear. Adding a SQL query to the search definition doesn't work because we don't query bigBed files using SQL.

For bigBed files, searches are actually defined using the trackDb settings referred to earlier, searchIndex and optionally searchTrix, which go in the "track ..." stanza, not in the separate "searchName ..." stanza. Unfortunately there is no support for truly fuzzy matching, but it's possible to use the searchTrix setting to get a very limited kind of fuzzy matching, or to define your own fuzzy matches.

For starters, to get exact matches on names like "GRMZM2G059865_T01", you can add a new setting to your "track ensemblGenesGrmzm" stanza:

searchIndex name

To support fuzzy matching, you will need to create a text file with desired matches and then run our ixIxx program as described on the trix help page. The simplest kind of input file would simply map names to themselves like this:

GRMZM2G059865_T01 GRMZM2G059865_T01

If you enter a search term "GRMZM2G059865_T" or "GRMZM2G059865_T0" then trix would match it with the name "GRMZM2G059865_T01", because trix's idea of "fuzzy" is that all characters of the search term are matched, and the name has only one or two additional characters. That's because trix was designed for matching keywords like "kinase" / "kinases". If you are looking to match "GRMZM2G059865" then you'll need to include that in your input file like this:

GRMZM2G059865_T01 GRMZM2G059865_T01 GRMZM2G059865
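
To make the matching rule above concrete, here is a small shell illustration. It is not trix's actual code: trix_like_match is a hypothetical helper that only mimics the rule as described in this message (the search term must match from the start of an indexed word, and the word may have at most two extra characters).

```shell
# Hypothetical helper, NOT part of the kent tools. It only mimics the rule
# described above: succeed when $1 (search term) is a prefix of $2 (indexed
# word) and the word is at most two characters longer than the term.
trix_like_match() {
    term=$1
    word=$2
    case "$word" in
        "$term"*)
            extra=$(( ${#word} - ${#term} ))
            [ "$extra" -le 2 ]
            ;;
        *)
            return 1
            ;;
    esac
}

# Try three search terms against a line that indexes only the full name.
for term in GRMZM2G059865_T GRMZM2G059865_T0 GRMZM2G059865; do
    if trix_like_match "$term" GRMZM2G059865_T01; then
        echo "$term: match"
    else
        echo "$term: no match"
    fi
done
# GRMZM2G059865_T: match
# GRMZM2G059865_T0: match
# GRMZM2G059865: no match
```

The last term fails because the indexed word is four characters longer, which is why the gene-level ID has to be added to the input line as its own word.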

This awk command should work to make such a file from your bed file:

awk '{abbrev = $4;  sub(/_.*$/, "", abbrev); print $4, $4, abbrev;}' ensemblGenes.bed > ensemblGenes.ixTerms.txt
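
As a quick sanity check, the same awk program can be run on a couple of rows before processing the whole file. The sketch below uses two rows from the BED excerpt in the original question, truncated to six columns for brevity (only column 4 matters here); the sample file names and the temporary directory are made up for the illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Two rows from the question's BED excerpt, truncated to the first six columns.
cat > "$workdir/sample.bed" <<'EOF'
1 4853 9652 GRMZM2G059865_T01 0 -
1 161142 161925 AC177838.2_FGT015 0 +
EOF

# Same awk program as above: ID, full name, and name with the "_..." suffix removed.
awk '{abbrev = $4; sub(/_.*$/, "", abbrev); print $4, $4, abbrev;}' \
    "$workdir/sample.bed" > "$workdir/sample.ixTerms.txt"

cat "$workdir/sample.ixTerms.txt"
# GRMZM2G059865_T01 GRMZM2G059865_T01 GRMZM2G059865
# AC177838.2_FGT015 AC177838.2_FGT015 AC177838.2
```

The first word on each line is the ID that gets looked up in the bigBed name index; the remaining words are the terms that lead to it.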

Then run our ixIxx program on that file as described above, move the generated .ix and .ixx files to the same directory as your bigBed file, and add a searchTrix setting like this:

searchTrix /vault/gbdb/zeaMayB73_v3/bbi/ensemblGenesGrmzm.ix
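
For reference, the indexing and move steps might look roughly like the sketch below. It assumes ixIxx is on your PATH with its usual three-argument form (input text, output .ix, output .ixx) and that the terms file from the awk step is in the current directory; the guard and the status variable are only there so the snippet does nothing harmful where those assumptions do not hold.

```shell
# Sketch only: file names and the target directory are taken from this thread.
terms=ensemblGenes.ixTerms.txt
bbi_dir=/vault/gbdb/zeaMayB73_v3/bbi

if command -v ixIxx >/dev/null 2>&1 && [ -s "$terms" ]; then
    # Build the trix index pair from the terms file.
    ixIxx "$terms" ensemblGenesGrmzm.ix ensemblGenesGrmzm.ixx
    # Keep both index files next to the bigBed file.
    mv ensemblGenesGrmzm.ix ensemblGenesGrmzm.ixx "$bbi_dir"/
    status=indexed
else
    # ixIxx or the terms file is missing, so nothing is built.
    status=skipped
fi
echo "trix index: $status"
```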

You will need to keep the "searchIndex name" setting if you use searchTrix, so we know where to look for the IDs from the trix search. If you happen to have keywords associated with your IDs, you can add those words to the .ixTerms.txt file to get even better search results.
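
Putting the two settings together, the track stanza from the original question might end up looking like the sketch below. Every line is taken from this thread; the assumption here is that the separate "searchName ..." stanza and its SQL query are simply dropped, since hgFindSpec searches do not apply to bigBed. This has not been tested against a mirror.

```
track ensemblGenesGrmzm
shortLabel Genes
longLabel Genes from Ensembl with GRMZM IDs
type bigBed 12 .
group map
visibility pack
priority 50
html html/ensemblGenesGrmzm
searchIndex name
searchTrix /vault/gbdb/zeaMayB73_v3/bbi/ensemblGenesGrmzm.ix
```

After editing trackDb.ra, rerun hgTrackDb as in the original question so the new settings are loaded.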

Thanks for including so many details in your question!
