I liked Glossary, however. Although it has some gaps with the latest
technologies, its cluster-similarity algorithm produces good
heterogeneous descriptions around a target term.
Interesting stuff. Anyone care to speculate on the technology Google's
using to make the sets?
My bet is something like Hyperspace Analog to Language
(http://locutus.ucr.edu/hds.html) or Latent Semantic Analysis
(http://lsa.colorado.edu). Both require the computer to scan through
a great deal of text to "get the gist" of a word or concept, but once
it does, concepts can be compared very easily (they're represented
as vectors whose semantic similarity is usually in keeping with their
distance from each other according to some simple metric--like
Euclidean distance). Neat, neat stuff, and really easy to hack as well--
the enthusiast can just download the Project Gutenberg corpus or some
other large amount of text and write a Perl script to start scanning it.
Of course Google has a huge advantage--it can scan a cache of the entire
web!
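For anyone who wants to try this at home, here's a toy sketch of the HAL-style idea (in Python rather than Perl; the corpus and window size are made up for illustration): count which words co-occur within a small window, then compare words by the Euclidean distance between their co-occurrence vectors.

```python
from collections import defaultdict
import math

# Tiny stand-in corpus; a real run would scan Project Gutenberg or similar.
corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat chased the dog across the mat").split()

WINDOW = 2  # how many words on each side count as "context"
vectors = defaultdict(lambda: defaultdict(int))
for i, word in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if i != j:
            vectors[word][corpus[j]] += 1

def euclidean(w1, w2):
    # Distance between two words' co-occurrence vectors.
    keys = set(vectors[w1]) | set(vectors[w2])
    return math.sqrt(sum((vectors[w1][k] - vectors[w2][k]) ** 2
                         for k in keys))

# Words used in similar contexts end up closer together:
print(euclidean("cat", "dog"), euclidean("cat", "on"))
```

Even on this toy corpus, "cat" lands closer to "dog" than to "on", which is the gist-getting effect in miniature.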
If Google really is using HAL or LSA technology, I'm happy to see it!
I believe it represents a means of searching for ideas instead of strings
in texts and that it has the potential to make web searches far more
effective and precise than they are now. Unfortunately, search
applications with LSA at least are already well patented--not that this
would bother a company like Google.
Now, as far as your point goes: does it really make sense for Google to
try and intuit formal sets from just five examples? Wouldn't there be
many possible interpretations in all but the most trivial cases? (I'm
asking--I don't know myself.) Also, how would an algorithm designed to
construct such sets retain the ability to also group marching band
instruments, cheeses, auto races, and so on?
--Tom
Then I tried "anne" and "diana", and "anne,diana,gilbert" was returned.
So it seems google is learning how to build lists from people trying
lists...
so i assume if i try "red,blue,green" and someone else tries
"red,blue,pink" then it associates those collections...
just my 2p.
richard.
se...@sprintmail.com (SB) wrote in message news:<19620026.0205...@posting.google.com>...
No. The sequence '1,2,3,4,5' is evidently not treated as a special
case: every sequence you put in is handled by the same algorithm.
More likely, Google Sets is an extension of Google's 'Similar Pages'
algorithm.
Here is my guess at the algorithm:
- for each term (up to 5) entered, google does a standard search to
return the top ten or so pages
- a Similar Pages search is then done on each of these pages which
returns 100 pages altogether, although many of these would be
duplicates.
- all the words in all of these (up to 100) pages are analyzed to
produce a ranked frequency distribution of words (weighted according to
each word's distance in the text from the initial terms)
- the top 15 (or whatever) words are then displayed as predicted
terms.
In actuality, the algorithm is probably much more complicated than
this, and possibly involves intersections of google indexes which
might be faster than producing ranked frequency distributions.
But what would I know, I'm just guessing. :-)
Google themselves are being very quiet about the algorithm. Maybe
they want to get people's comments about the results of Google Sets,
rather than about how it gets those results.
goo...@richardjhall.com (Richard) wrote in message
> ... So it seems google is learning how to build lists from people
> trying lists...
>
> so i assume if i try "red,blue,green" and someone else tries
> "red,blue,pink" then it associates those collections...
I don't think so. It works on any set of words you put in, even if
these have never been entered by anyone before. Remember that this
only came out a couple of days ago.
Anyway, good job Google. You keep on amazing us. You now have 11+
very useful tools:
Google Web Search (with Similar Pages, Cached Pages, and Links to a
Page)
Google Image Search
Google Groups
Google Directory (a better ranked version of the Open Directory
Project)
Google Topic Specific Searches (Apple, BSD Unix, Linux, Microsoft,
Government, Universities)
Google Catalog (beta)
Google News (beta)
Google Answers (beta)
Google Labs (Glossary, Sets, Voice Search, Keyboard Shortcuts, and
hopefully more to come)
Google Zeitgeist
Google Toolbar
I can't wait to see what you come up with next.
Yup. I suspect they're using faceted classification.
In the same way that a web page can be indexed by a set of
words and phrases, words or phrases can themselves be indexed.
For example, "Elvis Presley" may be indexed as male, American,
dead, white, singer, actor, etc.
The algorithm would first create a list of the index terms that
the entered words have in common (the intersection) and then
search the master index for other words which include the same
set of indexing terms.
The more starting words that are entered, the smaller and more
specific the set of common indexing terms, and the better the
resulting set.
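A toy version of this guess, with invented facet data, might look like this: intersect the facets of the entered terms, then return every other term whose facets include all of the common ones.

```python
# Invented facet index for illustration only.
index = {
    "elvis presley": {"male", "american", "singer", "actor", "dead"},
    "frank sinatra": {"male", "american", "singer", "actor", "dead"},
    "dean martin":   {"male", "american", "singer", "actor", "dead"},
    "madonna":       {"female", "american", "singer", "actor"},
    "john wayne":    {"male", "american", "actor", "dead"},
}

def build_set(*terms):
    # Facets shared by every entered term...
    common = set.intersection(*(index[t] for t in terms))
    # ...select the other terms that carry all of them.
    return sorted(t for t, facets in index.items()
                  if t not in terms and common <= facets)

print(build_set("elvis presley", "frank sinatra"))
```

Note how adding terms narrows the result: with only "elvis presley" entered, both "dean martin" and "frank sinatra" qualify; adding "frank sinatra" leaves just "dean martin".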
As for how they'd create such a classification, the easiest
thing I can think of is to index dictionary definitions.
That'd be a brute-force way of getting a starting list of
classifying words and phrases for any other word or phrase.
A second (and expensive) pass would be to get people to
go through and tweak the indexes by hand.
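As a crude sketch of the dictionary idea (definitions and stopword list invented for illustration): the content words of a definition become the headword's classification terms.

```python
# Throwaway stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "who", "was", "and", "in", "is"}

# Invented stand-ins for dictionary entries.
definitions = {
    "elvis presley": "an american singer and actor",
    "cheddar": "a hard cheese of english origin",
}

def classify(headword):
    # Every non-stopword in the definition becomes an index term.
    return {w for w in definitions[headword].split()
            if w not in STOPWORDS}

print(classify("elvis presley"))
```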
This approach would explain why a previous poster's example
of a book and its author wasn't found in a set - the quality
of the index depends on the number and type of indexable
sources they've used (e.g., perhaps it doesn't include a list
of published works).
To test the dictionary theory, I tried creating sets starting
with a set of things which only have one thing in common that
wouldn't be in their dictionary definition. I tried things
that were the same colour but which were different types of
things. In most cases, I got zero results.
A quick search finds this intro to faceted classification:
http://www.peterme.com/archives/00000063.html
Paola
Yep, this is a nice guess. I'd also look for words with the same
**distributional patterns** as those entered. For example, words X and Y
might rarely appear in the **same** context (so that measuring the distance
between them in text would not be of much use), but they may appear in
**similar contexts** (that is, cooccur with the same words). This is the
idea behind statistical techniques for estimating word similarity. Using
other words as "intermediaries" to assess the similarity of X and Y makes
these techniques more robust, as well as applicable to many more word pairs.
In general, many IR algorithms represent words and documents as vectors,
so it's possible to look for words whose vectors are closest to those
of the words entered. Latent Semantic Analysis (aka Latent Semantic Indexing)
is one way to do this.
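Here's a tiny illustration of the "similar contexts" point, with invented co-occurrence counts: "doctor" and "physician" never co-occur in this data, yet their context vectors are nearly parallel, while an unrelated word's vector is orthogonal. Cosine similarity over such vectors is one standard measure.

```python
import math

# Invented co-occurrence counts (word -> {context word: count}).
cooc = {
    "doctor":    {"hospital": 5, "patient": 7, "nurse": 3},
    "physician": {"hospital": 4, "patient": 6, "nurse": 2},
    "guitar":    {"band": 6, "chord": 5, "amplifier": 2},
}

def cosine(a, b):
    # Cosine of the angle between two words' context vectors.
    keys = set(cooc[a]) | set(cooc[b])
    va = [cooc[a].get(k, 0) for k in keys]
    vb = [cooc[b].get(k, 0) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb)

# High for words that keep the same company, zero for unrelated ones:
print(cosine("doctor", "physician"), cosine("doctor", "guitar"))
```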
> As for how they'd create such a classification, the easiest
> thing I can think of is to index dictionary definitions.
> That'd be a brute-force way of getting a starting list of
> classifying words and phrases for any other word or phrase.
> A second (and expensive) pass would be to get people to
> go through and tweak the indexes by hand.
Some thesauri (notably, Roget's thesaurus, also available in a machine-readable
form) offer extensive classifications of words, so it's possible to use this
kind of info. It's actually used for word sense disambiguation, where the
challenge is to resolve the sense of an ambiguous word given its context.
Evgeniy.
--
Evgeniy Gabrilovich
Ph.D. student in Computer Science
Department of Computer Science, Technion - Israel Institute of Technology
Technion City, Haifa 32000, Israel
WWW: http://www.cs.technion.ac.il/~gabr
ga...@cs.technion.ac.il (Evgeniy Gabrilovich) wrote in message news:<c1b3f499.02052...@posting.google.com>...
tempa...@mailru.com (Vasya Poopkin) wrote in message news:<7bcb8a6d.0205...@posting.google.com>...
Considering that Google Sets works for any words, not just those in
the dictionary (e.g. people's names, places etc), I doubt they set up
any sort of classification beforehand. They just let the web classify
itself (of course they had some rather comprehensive indexes to start
with).
Having now seen the Glossary demo, it seems clear that Google
now have a process to identify web pages which contain glossary
or definition-type information. To create the classification
source for the Sets demo, they just apply their existing indexing
technology to these glossary web pages; Google get a new resource
without the need to obtain licenses to use commercially-produced
dictionaries.
If this is what they are indeed doing for the Sets tool, I hope
that they have permission to use the content from the informational
web pages in this way. I'd be surprised if reusing other people's
content like this doesn't infringe copyright.
If they haven't already, I suggest that they include the
information source (i.e. the page's URL) as a classification
term and show common sources on the Sets results page. If set
members mostly originate from the same source, it's likely
that the person who generated the set would find that site
useful (that is, Sets then becomes a glorified search engine).
Paola
pa...@limov.com (Paola Kathuria) wrote in message news:<7e0e1e1d.0205...@posting.google.com>...