Database for search query classification?

JimB

Jun 1, 2010, 5:46:43 PM

to get.theinfo

I'm looking for a dataset that will allow me to submit individual
search queries (like you would put into Google) and return a high
level categorization.

For example:

"Ford Explorer" might return "Automobiles"
"Roth IRA" might return "Financial Planning"

Any ideas?

David Salamon

Jun 2, 2010, 6:11:49 AM

to get-t...@googlegroups.com

Hey Jim,

One interesting approach might be a sort of three-step process:

1. Run the keyword through Google, and get the URL for each of the top 20 hits
2. Run the domain for each of those through quantcast.com's lifestyle
affinity list (e.g. http://www.quantcast.com/ford.com/lifestyle
assuming ford.com was a top hit for "Ford Explorer")
3. Select the lifestyle keywords that seem to appear a lot :)

This should basically get you there.
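
Step 3 can be sketched roughly as below. This is only a minimal illustration: fetching the Google hits and the Quantcast pages (steps 1 and 2) is not shown, and the function names, the sample URLs, and the affinity labels are all made up for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_affinities(hit_urls, affinities_by_domain, n=3):
    """Count how many distinct hit domains list each affinity keyword."""
    counts = Counter()
    for domain in {domain_of(u) for u in hit_urls}:
        # Domains with no known affinities simply contribute nothing.
        counts.update(set(affinities_by_domain.get(domain, [])))
    return counts.most_common(n)


# Hypothetical sample data standing in for the output of steps 1 and 2.
hits = [
    "http://www.ford.com/suvs/explorer/",
    "http://www.edmunds.com/ford/explorer/",
    "http://en.wikipedia.org/wiki/Ford_Explorer",
]
affinities = {
    "ford.com": ["Auto Enthusiasts", "Truck Buyers"],
    "edmunds.com": ["Auto Enthusiasts", "Car Shoppers"],
}
print(top_affinities(hits, affinities, n=1))  # [('Auto Enthusiasts', 2)]
```

Counting each keyword once per domain keeps a single site with many hits from dominating the vote.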

Cheers,
David

> --
> [from the http://groups.google.com/group/get-theinfo mailing list]

Walter

Jun 2, 2010, 5:24:20 AM

to get.theinfo

Hi,

I am also interested in your problem.

For a quick answer, there is a paper:

Keyword Search over Dynamic Categorized Information. Published in ICDE
2009.

Abstract
Consider an information repository whose content is categorized. A
data item (in the repository) can belong to multiple categories and
new data is continuously added to the system. In this paper, we
describe a system, CS*, which takes a keyword query and returns the
relevant top-K categories. In contrast, traditional keyword search
returns the top-K documents (i.e., data items) relevant to a user
query. The need to dynamically categorize new data and also update the
meta-data required for fast responses to user queries poses
interesting challenges. The brute force approach of updating the meta-
data by comparing each new data item with all the categories is
impractical due to (i) the large cost involved in finding the
categories associated with a data item and (ii) the high rate of
arrival of new data items. We show that a sampling based approach
which provides statistical guarantees on the reported results is also
impracticable. We hence develop the CS* approach whose effectiveness
results from its ability to focus on a strategically chosen subset of
categories on the one hand and a subset of new data on the other.
Given a query, CS* finds the top-K categories with high accuracy even
in time-constrained situations. An experimental evaluation of the CS*
system using real world data shows that it can easily achieve accuracy
in excess of 90%, whereas other approaches demand at least 57% more
resources (i.e., processing power), for providing similar results. Our
experimental results also show that, contrary to expectations, if the
rate of arrival of data items doubles, whereas CS* continues to
provide high accuracy without a significant increase in resources,
other approaches require more than double the number of resources.

For details, we can discuss privately. My email is
rucmas...@gmail.com

Thanks,
Walter

Paul Butler

Jun 2, 2010, 11:19:15 AM

to get-t...@googlegroups.com

Consider using dbpedia's "Article Categories" dataset. It is scraped
from Wikipedia's category lists. It may not be high-level enough
though.

Ford_Explorer gives:
Motor_vehicles_manufactured_in_the_United_States
Vehicles_introduced_in_1991
2010s_automobiles
2000s_automobiles
1990s_automobiles
Flexible-fuel_vehicles
Rear_wheel_drive_vehicles
Crossover_SUVs
All_wheel_drive_vehicles
SUVs
Ford_vehicles

Roth_IRA gives:
Individual_Retirement_Accounts

You can download the dataset here:
http://wiki.dbpedia.org/Downloads351 . If you'd prefer the dataset as
a text file of space-separated (article, category) pairs, let me know;
I have that.
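
Looking up a query in such a text file might go roughly like this. It is a sketch only: the one-pair-per-line layout, the function names, and the three sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def load_pairs(lines):
    """Build {article: [categories]} from 'article category' lines."""
    categories = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        article, category = parts
        categories[article].append(category)
    return categories


def query_to_title(query):
    """Turn 'Ford Explorer' into 'Ford_Explorer' (assumed title style)."""
    return "_".join(query.split())


# A few sample lines; in practice, pass an open file object instead.
sample = [
    "Ford_Explorer SUVs",
    "Ford_Explorer Ford_vehicles",
    "Roth_IRA Individual_Retirement_Accounts",
]
index = load_pairs(sample)
print(index[query_to_title("Ford Explorer")])  # ['SUVs', 'Ford_vehicles']
```

This only matches queries that are exact article titles, including capitalization, so real search queries would need some normalization first.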

Unfortunately, Wikipedia category membership isn't transitive. For example,
All_wheel_drive_vehicles has the category Off-road_vehicles, which has
the category Vehicles, but you'd have to traverse the tree to find out
that Ford_Explorer is a Vehicle.
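
That traversal could be sketched as a breadth-first walk up the parent links. The `parents` map here is an assumption: it contains only the two links from the example, and a real one would have to be built from category-to-category data, which this sketch assumes is available separately.

```python
from collections import deque


def ancestors(start_categories, parents, max_depth=5):
    """Breadth-first walk up the category graph; returns {category: depth}."""
    seen = {c: 0 for c in start_categories}
    queue = deque(start_categories)
    while queue:
        cat = queue.popleft()
        depth = seen[cat]
        if depth >= max_depth:
            continue  # do not climb any further from this category
        for parent in parents.get(cat, []):
            if parent not in seen:  # guards against cycles
                seen[parent] = depth + 1
                queue.append(parent)
    return seen


# Hypothetical parent map holding only the two links from the example.
parents = {
    "All_wheel_drive_vehicles": ["Off-road_vehicles"],
    "Off-road_vehicles": ["Vehicles"],
}
found = ancestors(["All_wheel_drive_vehicles", "SUVs"], parents)
print(found["Vehicles"])  # 2
```

The depth limit matters because Wikipedia's category graph contains cycles and very broad top-level categories, so an unbounded walk tends to reach almost everything.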

-- Paul

JimB

Jun 2, 2010, 3:31:44 PM

to get.theinfo

Thanks to everybody on this thread -- you're awesome! I'm going to try
a few of these. In addition, it was suggested to me to use Alexa from
AWS. Will let you know how it goes.

Tom Morris

Jun 9, 2010, 2:33:10 PM

to get.theinfo

Another thing to try would be Freebase's search API. It'll give you a
list of topics that it thinks are most relevant and you can see what
types have been assigned to them as well as the domains that those
types belong to.
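
Once you have the types for each returned topic, picking a category could be a simple vote over their domains. The sketch below does not call the search API at all. It assumes type ids of the form /domain/type, and the result structure and sample topics are made up for illustration.

```python
from collections import Counter


def domain_of_type(type_id):
    """'/automotive/model' -> '/automotive' (assumes /domain/type ids)."""
    return type_id.rsplit("/", 1)[0]


def vote_domains(topics, ignore=("/common",)):
    """Count, per domain, how many topics have at least one type in it."""
    counts = Counter()
    for topic in topics:
        domains = {domain_of_type(t) for t in topic["types"]}
        counts.update(d for d in domains if d not in ignore)
    return counts.most_common()


# Hypothetical search results: a topic name plus its assigned type ids.
topics = [
    {"name": "Ford Explorer",
     "types": ["/common/topic", "/automotive/model"]},
    {"name": "Ford Explorer Sport Trac",
     "types": ["/common/topic", "/automotive/model", "/automotive/trim_level"]},
]
print(vote_domains(topics))  # [('/automotive', 2)]
```

Ignoring a catch-all domain such as /common keeps types that nearly every topic carries from winning the vote.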

Tom
