One interesting approach might be a three-step process:
1. Run the keyword through Google, and get the URL for each of the top 20 hits
2. Run the domain for each of those through quantcast.com's lifestyle
affinity list (e.g. http://www.quantcast.com/ford.com/lifestyle
assuming ford.com was a top hit for "Ford Explorer")
3. Select the lifestyle keywords that seem to appear a lot :)
This should basically get you there.
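Step 3 could be sketched roughly as below, in Python. This covers only the counting: it assumes you have already collected the result URLs and, for each domain, its list of lifestyle keywords by whatever means you like. The function names, the `min_count` threshold, and the sample data in the usage note are all made up for illustration; nothing here calls Google or Quantcast.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def frequent_affinities(urls, affinities_by_domain, min_count=2):
    """Return lifestyle keywords shared by at least min_count distinct domains.

    urls: search-result URLs (step 1).
    affinities_by_domain: hypothetical dict of domain -> keyword list (step 2).
    """
    domains = {domain_of(u) for u in urls}
    counts = Counter()
    for d in domains:
        # Count each keyword at most once per domain.
        counts.update(set(affinities_by_domain.get(d, [])))
    common = [kw for kw, n in counts.items() if n >= min_count]
    # Most frequent first, ties broken alphabetically.
    return sorted(common, key=lambda kw: (-counts[kw], kw))
```

For example, with `urls = ["http://www.ford.com/a", "http://www.edmunds.com/b"]` and invented affinities `{"ford.com": ["Auto Enthusiasts", "Outdoors"], "edmunds.com": ["Auto Enthusiasts"]}`, only "Auto Enthusiasts" appears for two domains, so it is the only keyword kept.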
Cheers,
David
> --
> [from the http://groups.google.com/group/get-theinfo mailing list]
Consider an information repository whose content is categorized. A data item (in the repository) can belong to multiple categories and new data is continuously added to the system. In this paper, we describe a system, CS*, which takes a keyword query and returns the relevant top-K categories. In contrast, traditional keyword search returns the top-K documents (i.e., data items) relevant to a user query. The need to dynamically categorize new data and also update the meta-data required for fast responses to user queries poses interesting challenges. The brute-force approach of updating the meta-data by comparing each new data item with all the categories is impractical due to (i) the large cost involved in finding the categories associated with a data item and (ii) the high rate of arrival of new data items. We show that a sampling-based approach which provides statistical guarantees on the reported results is also impracticable. We hence develop the CS* approach whose effectiveness results from its ability to focus on a strategically chosen subset of categories on the one hand and a subset of new data on the other. Given a query, CS* finds the top-K categories with high accuracy even in time-constrained situations. An experimental evaluation of the CS* system using real-world data shows that it can easily achieve accuracy in excess of 90%, whereas other approaches demand at least 57% more resources (i.e., processing power) to provide similar results. Our experimental results also show that, contrary to expectations, if the rate of arrival of data items doubles, CS* continues to provide high accuracy without a significant increase in resources, whereas other approaches require more than double the resources.
Ford_Explorer gives:
Motor_vehicles_manufactured_in_the_United_States
Vehicles_introduced_in_1991
2010s_automobiles
2000s_automobiles
1990s_automobiles
Flexible-fuel_vehicles
Rear_wheel_drive_vehicles
Crossover_SUVs
All_wheel_drive_vehicles
SUVs
Ford_vehicles
Roth_IRA gives:
Individual_Retirement_Accounts
You can download the dataset here:
http://wiki.dbpedia.org/Downloads351 . If you'd prefer the dataset as
a text file of space-separated (article, category) pairs, let me know;
I have that.
Unfortunately Wikipedia categories aren't transitive: an article is listed
only under its direct categories. So for example
All_wheel_drive_vehicles has the category Off-road_vehicles, which has
the category Vehicles, but you'd have to traverse the category hierarchy
to find out that Ford_Explorer is a Vehicle.
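The traversal could look something like this in Python. It's a rough sketch that assumes the space-separated pair format mentioned above, and also assumes the category-to-parent-category links are available in that same format (that part is a guess; they may live in a separate file). The function names are made up. It keeps a visited set because the category links can contain loops.

```python
from collections import deque


def parse_pairs(lines):
    """Parse 'child category' lines into a dict of child -> set of categories."""
    parents = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        child, category = parts
        parents.setdefault(child, set()).add(category)
    return parents


def all_categories(item, parents):
    """Breadth-first walk upward from item; return every reachable category."""
    seen = set()
    queue = deque([item])
    while queue:
        node = queue.popleft()
        for category in parents.get(node, ()):
            if category not in seen:  # guards against loops in the links
                seen.add(category)
                queue.append(category)
    return seen
```

Given the lines `Ford_Explorer All_wheel_drive_vehicles`, `All_wheel_drive_vehicles Off-road_vehicles` and `Off-road_vehicles Vehicles`, calling `all_categories("Ford_Explorer", parse_pairs(lines))` returns a set that includes `Vehicles`.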
-- Paul