
Google algos


SB

May 21, 2002, 1:38:32 PM
Google's similarity algorithms (neural-based? alpha-beta search?)
demonstrate quite limited capability on precise sets. The sequence
`1,2,3,4,5` is handled properly (it is obviously handled as an
explicit query), but `1,3,5,7,9` already generates gibberish. The same
is true for other (formal) combinations. Thus, Google should explain
the "things" they are "predicting" in a more formal way.

I liked the Glossary, however. Although it has some gaps with the
latest technologies, its cluster similarity algorithm produces good
heterogeneous descriptions around the target term.

Tom Stepleton

May 21, 2002, 7:21:04 PM
se...@sprintmail.com (SB) wrote in message news:<19620026.0205...@posting.google.com>...

> Google's similarity algorithms (neural-based? alpha-beta search?)
> demonstrate quite limited capability on precise sets. The sequence
> `1,2,3,4,5` is handled properly (it is obviously handled as an
> explicit query), but `1,3,5,7,9` already generates gibberish. The same
> is true for other (formal) combinations. Thus, Google should explain
> the "things" they are "predicting" in a more formal way.

Interesting stuff. Anyone care to speculate on the technology Google's
using to make the sets?

My bet is something like Hyperspace Analogue to Language
(http://locutus.ucr.edu/hds.html) or Latent Semantic Analysis
(http://lsa.colorado.edu). Both require the computer to scan through
a great deal of text to "get the gist" of a word or concept, but once
it does, concepts can be compared very easily (they're represented
as vectors whose semantic similarity is usually in keeping with their
distance from each other according to some simple metric--like
Euclidean distance). Neat, neat stuff, and really easy to hack as well--
the enthusiast can just download the Project Gutenberg corpus or some
other large amount of text and write a Perl script to start scanning it.
Of course Google has a huge advantage--it can scan a cache of the entire
web!
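
For anyone who wants to try that hack, here is a minimal sketch of the
idea (in Python rather than Perl, with a toy corpus standing in for the
Gutenberg text): count co-occurrences inside a sliding window and then
compare words by Euclidean distance.

# HAL-style word vectors: count co-occurrences within a sliding window,
# then compare words by Euclidean distance. The tiny "corpus" is a
# stand-in for a real body of text such as Project Gutenberg.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the dog sat on the rug".split()
WINDOW = 2  # how many neighbouring words count as "context"

vectors = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if i != j:
            vectors[word][corpus[j]] += 1

def euclidean(a, b):
    keys = set(vectors[a]) | set(vectors[b])
    return math.sqrt(sum((vectors[a][k] - vectors[b][k]) ** 2 for k in keys))

# Words that occur in similar contexts end up close together.
print(euclidean("cat", "dog"), euclidean("cat", "rug"))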

If Google really is using HAL or LSA technology, I'm happy to see it!
I believe it represents a means of searching for ideas instead of strings
in texts and that it has the potential to make web searches far more
effective and precise than they are now. Unfortunately, search
applications with LSA at least are already well patented--not that this
would bother a company like Google.

Now, as far as your point goes: does it really make sense for Google to
try and intuit formal sets from just five examples? Wouldn't there be
many possible interpretations in all but the most trivial cases? (I'm
asking--I don't know myself.) Also, how would an algorithm designed to
construct such sets retain the ability to also group marching band
instruments, cheeses, auto races, and so on?

--Tom

Richard

May 22, 2002, 6:00:14 PM
Interestingly...
Tried "anne", "gilbert", "diana" (all characters from the Anne of Green
Gables series). No results.

Then tried "anne", "diana" and "anne,diana,gilbert" was returned. So
it seems Google is learning how to build lists from people trying
lists...

So I assume if I try "red,blue,green" and someone else tries
"red,blue,pink" then it associates those collections...

just my 2p.

richard.

Michael Richards

May 23, 2002, 10:48:50 AM
se...@sprintmail.com (SB) wrote in message
> Google's similarity algorithms (neural-based? alpha-beta search?)
> demonstrate quite limited capability on precise sets. The sequence
> `1,2,3,4,5` is handled properly (it is obviously handled as an
> explicit query), but `1,3,5,7,9` already generates gibberish.

No. Sequence '1,2,3,4,5' is not obviously handled as an explicit
query. Every sequence you put in is handled by the same algorithm.

More likely, Google Sets is an extension of Google's 'Similar Pages'
algorithm.

Here is my guess at the algorithm:
- for each term (up to 5) entered, google does a standard search to
return the top ten or so pages
- a Similar Pages search is then done on each of these pages which
returns 100 pages altogether, although many of these would be
duplicates.
- all the words in all the <100 pages are analyzed to produce a ranked
frequency distribution of words (weighted according to the distance in
the text from the initial terms)
- the top 15 (or whatever) words are then displayed as predicted
terms.
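
To make that guess concrete, here is a rough sketch of the pipeline in
Python. google_search() and similar_pages() are purely hypothetical
stand-ins (they return canned text below); a real version would hit the
index and the Similar Pages machinery, and the distance weighting here
is simplified.

from collections import Counter

CANNED = {
    "red":  ["red blue green colour chart", "red pink crimson shades"],
    "blue": ["blue green cyan colour chart", "blue navy azure shades"],
}

def google_search(term, n=10):
    # Hypothetical: the text of the top-n pages for a term.
    return CANNED.get(term, [])[:n]

def similar_pages(page_text, n=10):
    # Hypothetical: pages "similar" to the given page; here, the page itself.
    return [page_text][:n]

def predict_set(terms, top_k=15):
    terms = [t.lower() for t in terms[:5]]
    pages = []
    for term in terms:
        for page in google_search(term):
            pages.extend(similar_pages(page))
    counts = Counter()
    for text in pages:
        words = text.lower().split()
        anchors = [i for i, w in enumerate(words) if w in terms]
        for i, w in enumerate(words):
            if w in terms or not anchors:
                continue
            # Weight a word more heavily the closer it sits to an input term.
            counts[w] += 1.0 / (1 + min(abs(i - a) for a in anchors))
    return [w for w, _ in counts.most_common(top_k)]

print(predict_set(["red", "blue"]))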

In actuality, the algorithm is probably much more complicated than
this, and possibly involves intersections of Google's indexes, which
might be faster than producing ranked frequency distributions.

But what would I know, I'm just guessing. :-)

Google themselves are being very quiet about the algorithm. Maybe
they want to get people's comments about the results of Google Sets,
rather than about how it gets those results.


goo...@richardjhall.com (Richard) wrote in message
> ... So it seems Google is learning how to build lists from people
> trying lists...
>
> So I assume if I try "red,blue,green" and someone else tries
> "red,blue,pink" then it associates those collections...

I don't think so. It works on any set of words you put in, even if
these have never been entered by anyone before. Remember that this
only came out a couple of days ago.


Anyway, good job Google. You keep on amazing us. You now have 11+
very useful tools:

Google Web Search (with Similar Pages, Cached Pages, and Links to a
Page)
Google Image Search
Google Groups
Google Directory (a better ranked version of the Open Directory
Project)
Google Topic Specific Searches (Apple, BSD Unix, Linux, Microsoft,
Government, Universities)
Google Catalog (beta)
Google News (beta)
Google Answers (beta)
Google Labs (Glossary, Sets, Voice Search, Keyboard Shortcuts, and
hopefully more to come)
Google Zeitgeist
Google Toolbar

I can't wait to see what you come up with next.

Paola Kathuria

May 24, 2002, 7:42:50 AM
Michael....@gartner.com (Michael Richards) wrote in message news:<3dde1d55.02052...@posting.google.com>...

> In actuality, the algorithm is probably much more complicated than
> this, and possibly involves intersections of google indexes which
> might be faster than producing ranked frequency distributions.

Yup. I suspect they're using faceted classification.

In the same way that a web page can be indexed by a set of
words and phrases, words or phrases can themselves be indexed.
For example, "Elvis Presley" may be indexed as male, American,
dead, white, singer, actor, etc.

The algorithm would first create a list of the index terms that
the entered words have in common (the intersection) and then
search the master index for other words which include the same
set of indexing terms.

The more starting words that are entered, the more tightly
constrained the set of common indexing terms and the better the
resulting set.
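
A toy sketch of what that intersection step might look like (the facet
data here is entirely made up):

facets = {
    "elvis presley": {"male", "american", "dead", "singer", "actor"},
    "frank sinatra": {"male", "american", "dead", "singer"},
    "johnny cash":   {"male", "american", "dead", "singer"},
    "madonna":       {"female", "american", "singer"},
}

def expand_set(seed_terms):
    # Intersect the seeds' facets, then return every other term that
    # carries all of the shared facets.
    common = set.intersection(*(facets[t] for t in seed_terms))
    return [t for t, f in facets.items()
            if common <= f and t not in seed_terms]

print(expand_set(["elvis presley", "frank sinatra"]))   # -> ['johnny cash']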

As for how they'd create such a classification, the easiest
thing I can think of is to index dictionary definitions.
That'd be a brute-force way of getting a starting list of
classifying words and phrases for any other word or phrase.
A second (and expensive) pass would be to get people to
go through and tweak the indexes by hand.

This approach would explain why a previous poster's example
of a book and its author weren't found in a set - the quality
of the index depends on the number and type of indexable
sources they've used (e.g., perhaps it doesn't include a list
of published works).

To test the dictionary theory, I tried creating sets starting
with a set of things which only have one thing in common that
wouldn't be in their dictionary definition. I tried things
that were the same colour but which were different types of
things. In most cases, I got zero results.

A quick search finds this intro to faceted classification:
http://www.peterme.com/archives/00000063.html


Paola

Evgeniy Gabrilovich

May 27, 2002, 11:15:41 AM
>Here is my guess at the algorithm:
>- for each term (up to 5) entered, google does a standard search to
>return the top ten or so pages
>- a Similar Pages search is then done on each of these pages which
>returns 100 pages altogether, although many of these would be
>duplicates.
>- all the words in all the <100 pages are analyzed to produce a ranked
>frequency distribution of words (weighted according to the distance in
>the text from the initial terms)

Yep, this is a nice guess. I'd also look for words with the same
**distributional patterns** as those entered. For example, words X and Y
might rarely appear in the **same** context (so that measuring the distance
between them in text would not be of much use), but they may appear in
**similar contexts** (that is, cooccur with the same words). This is the
idea behind statistical techniques for estimating word similarity. Using
other words as "intermediaries" to assess the similarity of X and Y makes
these techniques more robust, as well as applicable to many more word pairs.
In general, many IR algorithms represent words and documents as vectors,
so it's possible to look for words whose vectors are closest to those
of the words entered. Latent Semantic Analysis (aka Latent Semantic Indexing)
is one way to do it.
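
As a rough illustration of the LSA route, here is a small sketch that
builds a term-document matrix, reduces it with an SVD, and compares
words by cosine similarity in the latent space; the example documents
are invented.

import numpy as np

docs = [
    "doctor nurse hospital patient",
    "nurse hospital surgeon patient",
    "guitar drums bass concert",
    "guitar concert singer stage",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                          # keep the top-k latent dimensions
word_vecs = U[:, :k] * s[:k]   # each row is one word in the latent space

def cosine(w1, w2):
    a, b = word_vecs[vocab.index(w1)], word_vecs[vocab.index(w2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "doctor" and "surgeon" never share a document, yet come out as similar
# because they occur in similar contexts.
print(cosine("doctor", "surgeon"), cosine("doctor", "guitar"))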

> As for how they'd create such a classification, the easiest
> thing I can think of is to index dictionary definitions.
> That'd be a brute-force way of getting a starting list of
> classifying words and phrases for any other word or phrase.
> A second (and expensive) pass would be to get people to
> go through and tweak the indexes by hand.

Some thesauri (notably, Roget's thesaurus, also available in a machine-readable
form) offer extensive classifications of words, so it's possible to use this
kind of info. It's actually used for word sense disambiguation, where the
challenge is to resolve the sense of an ambiguous word given its context.

Evgeniy.

--
Evgeniy Gabrilovich
Ph.D. student in Computer Science
Department of Computer Science, Technion - Israel Institute of Technology
Technion City, Haifa 32000, Israel
WWW: http://www.cs.technion.ac.il/~gabr

Vasya Poopkin

May 28, 2002, 9:46:08 AM
This is much simpler - it's all done on the fly; no one would sit
there and do the classification for you.
It can be done with three nested SQL SELECT statements plus filtering
against a nouns list. Can you reproduce it here? Quite a challenge,
though :)
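
Something along these lines, sketched with made-up table names and toy
data (one possible shape for the nested selects, not a claim about
Google's actual schema):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE list_terms (list_id INTEGER, term TEXT);
CREATE TABLE nouns (term TEXT PRIMARY KEY);
INSERT INTO list_terms VALUES (1,'red'),(1,'blue'),(1,'green'),(1,'very'),
                              (2,'red'),(2,'blue'),(2,'pink'),
                              (3,'cat'),(3,'dog');
INSERT INTO nouns VALUES ('red'),('blue'),('green'),('pink'),('cat'),('dog');
""")

seed = ("red", "blue")
rows = con.execute("""
SELECT term, COUNT(*) AS freq
FROM list_terms
WHERE list_id IN (SELECT list_id FROM list_terms WHERE term IN (?, ?))
  AND term NOT IN (?, ?)
  AND term IN (SELECT term FROM nouns)   -- the "nouns list" filter
GROUP BY term
ORDER BY freq DESC
""", seed + seed).fetchall()

print(rows)   # e.g. [('green', 1), ('pink', 1)] for the toy data above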

ga...@cs.technion.ac.il (Evgeniy Gabrilovich) wrote in message news:<c1b3f499.02052...@posting.google.com>...

Vasya Poopkin

May 28, 2002, 3:57:55 PM
Nope, it's a bit different.
It seems they ran a massive pass over all the pages and built an index
of "sets" from "words" appearing in the same <li> (maybe <table>)
elements, starting from the beginning of items or cells.

tempa...@mailru.com (Vasya Poopkin) wrote in message news:<7bcb8a6d.0205...@posting.google.com>...

Michael Richards

May 28, 2002, 10:30:52 PM
ga...@cs.technion.ac.il (Evgeniy Gabrilovich) wrote in message
> ...

> > As for how they'd create such a classification, the easiest
> > thing I can think of is to index dictionary definitions.
> > That'd be a brute-force way of getting a starting list of
> > classifying words and phrases for any other word or phrase.
> > A second (and expensive) pass would be to get people to
> > go through and tweak the indexes by hand.
>
> Some thesauri (notably, Roget's thesaurus, also available in a machine-readable
> form) offer extensive classifications of words, so it's possible to use this
> kind of info. It's actually used for word sense disambiguation, where the
> challenge is to resolve the sense of an ambiguous word given its context.

Considering that Google Sets works for any words, not just those in
the dictionary (e.g. people's names, places etc), I doubt they set up
any sort of classification beforehand. They just let the web classify
itself (of course they had some rather comprehensive indexes to start
with).

Vasya Poopkin

May 29, 2002, 12:19:57 PM
In other words, they probably processed all the pages and made a
couple of additional tables in the database.
For the first table, they picked up all the words/phrases that appear
inside the same <OL> or <UL> HTML list tags - which means they are
presented as lists on pages (their preprocessed page format allows
them to do so; see the http://google.com/programming-contest source
code). The second table links the entries in the first table into
sets.
So when you search on words/phrases, it just picks up the other
words/phrases that appeared in the same lists somewhere on the pages
on the net.
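
Roughly, the extraction and lookup could look like this sketch (the
sample HTML and the in-memory index are stand-ins for the real crawl
and database):

from html.parser import HTMLParser
from collections import defaultdict

class ListExtractor(HTMLParser):
    """Collect the items of every top-level <ul>/<ol> on a page."""
    def __init__(self):
        super().__init__()
        self.lists, self.depth, self.current = [], 0, None
    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
            if self.depth == 1:
                self.current = []
        elif tag == "li" and self.current is not None:
            self.current.append("")
    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1
            if self.depth == 0 and self.current:
                self.lists.append([i.strip().lower()
                                   for i in self.current if i.strip()])
                self.current = None
    def handle_data(self, data):
        if self.current:
            self.current[-1] += data

page = ("<ul><li>red</li><li>blue</li><li>green</li></ul>"
        "<ol><li>cat</li><li>dog</li></ol>")
parser = ListExtractor()
parser.feed(page)

index = defaultdict(set)              # term -> ids of the lists it appears in
for list_id, items in enumerate(parser.lists):
    for item in items:
        index[item].add(list_id)

def expand(terms):
    # Return the other members of every list that contains a seed term.
    hits = set.union(*(index[t] for t in terms))
    return {item for lid in hits for item in parser.lists[lid]} - set(terms)

print(expand({"red", "blue"}))   # -> {'green'}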

Paola Kathuria

May 30, 2002, 9:46:22 AM
ga...@cs.technion.ac.il (Evgeniy Gabrilovich) wrote in message news:<c1b3f499.02052...@posting.google.com>...

Paola wrote:
> > As for how they'd create such a classification, the easiest
> > thing I can think of is to index dictionary definitions.
> > That'd be a brute-force way of getting a starting list of
> > classifying words and phrases for any other word or phrase.
> > A second (and expensive) pass would be to get people to
> > go through and tweak the indexes by hand.
>
> Some thesauri (notably, Roget's thesaurus, also available in a machine-readable
> form) offer extensive classifications of words, so it's possible to use this
> kind of info. It's actually used for word sense disambiguation, where the
> challenge is to resolve the sense of an ambiguous word given its context.

Having now seen the Glossary demo, it seems clear that Google
now have a process to identify web pages which contain glossary
or definition-type information. To create the classification
source for the Sets demo, they just apply their existing indexing
technology to these glossary web pages; Google get a new resource
without the need to obtain licenses to use commercially-produced
dictionaries.

If this is what they are indeed doing for the Sets tool, I hope
that they have permission to use the content from the informational
web pages in this way. I'd be surprised if reusing other people's
content like this doesn't infringe copyright.

If they haven't already, I suggest that they include the
information source (i.e. the page's URL) as a classification
term and show common sources on the Sets results page. If set
members mostly originate from the same source, it's likely
that the person who generated the set would find that site
useful (that is, Sets then becomes a glorified search engine).


Paola

Raj Moodaley

Jun 27, 2002, 4:14:52 PM
Another possible method is to take a database of all search terms
(used and unused) and user IDs (I'm sure they implicitly collect who
their users are and what they search for) and build up a sparse matrix
of search_terms by users. The matrix entries would most probably be
some rating applicable to each term (possibly based on its frequency
of use, or just a 1 if it was ever used and 0 if it was never used).
Then all you need to do is use the matrix to find a similarity rating
between each of the search terms, or between the various users
themselves (though using users poses speed/memory issues and cannot
realistically be capacity-planned, since the number of users is always
increasing). They would do this similarity calculation using some
heuristic algorithm, directed graphs, a Bayesian network, or some
other mathematical technique. They would then use this similarity
rating to find the search terms with the closest ratings to the ones
you entered initially. Very much like most recommendation engines and
nearest-neighbour systems work.
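
As a toy sketch of that approach (with an invented query log, and plain
cosine similarity standing in for whatever Google might actually use):

from collections import defaultdict
import math

query_log = [              # (user_id, term) pairs, a stand-in for real logs
    ("u1", "red"), ("u1", "blue"), ("u1", "green"),
    ("u2", "red"), ("u2", "blue"), ("u2", "pink"),
    ("u3", "cat"), ("u3", "dog"),
]

# Sparse term-by-user matrix, stored as term -> {user: count}
matrix = defaultdict(lambda: defaultdict(int))
for user, term in query_log:
    matrix[term][user] += 1

def cosine(t1, t2):
    a, b = matrix[t1], matrix[t2]
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(seeds, k=3):
    # Rank every other term by its total similarity to the seed terms.
    scores = {t: sum(cosine(t, s) for s in seeds)
              for t in list(matrix) if t not in seeds}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(nearest({"red", "blue"}))   # 'green' and 'pink' rank above 'cat'/'dog'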


pa...@limov.com (Paola Kathuria) wrote in message news:<7e0e1e1d.0205...@posting.google.com>...
