search hygiene


wm higgins

Aug 3, 2014, 11:46:53 AM
to panda-pro...@googlegroups.com
Panda's great advantage of searching across datasets can be a liability for common terms.
I've discovered a couple of ways to help searches in common datasets, so I thought I'd pass them along.

1) Race references
Our voter source data includes these descriptions of race/ethnicity: White (not Hispanic), Black (not Hispanic).
Although the wording makes the distinctions clear, these terms pollute any search for people with the last name White or Black, as well as any search for "Hispanic," which matches every record marked white, black or Hispanic.
Rather than enable a bunch of column-specific searching, it was simpler to change white and black designations in the source data to WH and BL. That fixed all three search cases.
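
For anyone who wants to script the same change before importing, here is a rough Python sketch. The column name "race", the sample rows and the exact mapping are assumptions for illustration; the only detail taken from our data is that the white and black designations become WH and BL.

```python
import csv
import io

# Assumed mapping: the full source descriptions are swapped for short codes.
RACE_CODES = {
    "White (not Hispanic)": "WH",
    "Black (not Hispanic)": "BL",
}

def recode_race(rows, column="race"):
    """Yield copies of rows with the long race descriptions replaced by codes."""
    for row in rows:
        updated = dict(row)
        value = updated.get(column) or ""
        # Values with no entry in the mapping pass through unchanged.
        updated[column] = RACE_CODES.get(value.strip(), value)
        yield updated

# Made-up sample data standing in for the voter file.
sample = io.StringIO(
    "last_name,race\n"
    "White,Black (not Hispanic)\n"
    "Lopez,Hispanic\n"
)
recoded = list(recode_race(csv.DictReader(sample)))
```

Only the one column is touched, so a voter whose last name is White keeps that name and stays findable.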

2) Hyperlinks
THE PROBLEM:
We're adding a lot of hyperlinks to some datasets, to let users jump to an original story or detail page. The visible part of the 'a' tag sometimes failed to show up in Panda's search, specifically when it contained camelcase or hyphenated names like McCain or Parker-Jones. The problem was that the underlying text was one continuous string:
<anchortag+href>McCain<closetag>
Without spaces, Solr could extract word tokens only by breaking at camelcase boundaries or non-alpha characters, so searching for "CAIN" would find McCain, but searching for "MCCAIN" would not.
THE FIX:
Add a space before and after the visible reference, so that Solr can tokenize on the spaces. The ref then becomes:
<anchortag+href> McCain <closetag>
It renders the same, and search works again.
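
If the links are generated or cleaned up by a script, the padding can be added automatically. This is a minimal Python sketch, assuming simple anchors with no nested tags; the function name and the example URL are made up, and a real HTML parser would be safer for messier markup.

```python
import re

# Captures the opening <a ...> tag, the visible text, and the closing </a>.
ANCHOR_TEXT = re.compile(r"(<a\b[^>]*>)(.*?)(</a>)", re.IGNORECASE | re.DOTALL)

def pad_anchor_text(html):
    """Put exactly one space on each side of every anchor's visible text."""
    def _pad(match):
        open_tag, text, close_tag = match.groups()
        # Strip first, so running the function twice does not add more spaces.
        return f"{open_tag} {text.strip()} {close_tag}"
    return ANCHOR_TEXT.sub(_pad, html)

padded = pad_anchor_text('<a href="http://example.com/story">McCain</a>')
```

Here `padded` comes out as `<a href="http://example.com/story"> McCain </a>`, the same form as the fixed reference above.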
