Enhancing Google Sets
From: Daniel Vale <danielv...@gmail.com>
Date: Mon, 28 Apr 2008 09:50:34 -0700 (PDT)
Local: Mon, Apr 28 2008 12:50 pm
Subject: Enhancing Google Sets
The token pairs "is a" and "is an" are used as a resource for
construing ontologies in English. Two of the patterns that occur with
them are quite useful for Google Sets: the first is the instance-
category pattern and the second, the hyponym-hyperonym one. We could
use these patterns to enhance Google Sets by making the following
corpus-based searches:

Instance-Category

The instance-category pattern is realized in text by relating an
instance to a category. For example, one can use a proper noun (a
token or a sequence in title case) as the instance and a common noun
as the category. Look at the following sample search for "is a movie
actor", considering only what comes before the searched string in the
first ten results of the Google search engine:

Ice-T is a movie actor
he is a movie actor
Johnny Depp is a movie actor
a “Super Star” is a movie actor
Ben Stein is a movie actor
Vince Vaughn is a movie actor
The Chris Potter is a movie actor
there is a movie actor
the other is a movie actor
A “star” is a movie actor

By selecting only the title-case tokens that follow neither "a" nor
"the", we find the following list:

movie actor => Ice-T, Johnny Depp, Ben Stein, Vince Vaughn
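
The selection above can be sketched in a few lines of Python. This is only a rough illustration, assuming the ten snippets are already available as strings; the names `extract_instances` and `is_title_case` are made up for this sketch and are not part of any Google API.

```python
ARTICLES = {"a", "an", "the"}
SUFFIX = " is a movie actor"

# The ten snippets quoted above (straight quotes used for simplicity).
snippets = [
    "Ice-T is a movie actor",
    "he is a movie actor",
    "Johnny Depp is a movie actor",
    'a "Super Star" is a movie actor',
    "Ben Stein is a movie actor",
    "Vince Vaughn is a movie actor",
    "The Chris Potter is a movie actor",
    "there is a movie actor",
    "the other is a movie actor",
    'A "star" is a movie actor',
]

def is_title_case(token):
    # Assumption: "title case" simply means the first character is upper-case.
    return token[:1].isupper()

def extract_instances(snippets, suffix=SUFFIX):
    instances = []
    for snippet in snippets:
        if not snippet.endswith(suffix):
            continue
        tokens = snippet[: -len(suffix)].split()
        # Reject empty prefixes and anything introduced by an article.
        if not tokens or tokens[0].lower() in ARTICLES:
            continue
        if all(is_title_case(t) for t in tokens):
            instances.append(" ".join(tokens))
    return instances

print(extract_instances(snippets))
```

On these ten snippets the sketch keeps Ice-T, Johnny Depp, Ben Stein and Vince Vaughn, the same list as above.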

Hyponym-Hyperonym

The hyponym-hyperonym pattern is realized in text by relating one
category to another. For example, one can use a common noun (a token
or a sequence in any case*) as the hyponym and another common noun as
the hyperonym. Considering the same sample search, we can find the
following list of hyponyms by selecting only the tokens that follow
"a" or "an":

movie actor => “Super Star”, “star”

* For common nouns, we could apply decreasing weights, respectively,
to lower-case, upper-case and title-case tokens.
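
A possible sketch of the hyponym selection and of the case weighting from the footnote follows. The function names and the weight values (1.0, 0.6, 0.3) are assumptions chosen only to show "decreasing weights"; nothing here is a tuned or official setting.

```python
SUFFIX = " is a movie actor"

# Two of the snippets from the sample search, plus two that should be ignored.
snippets = [
    'a "Super Star" is a movie actor',
    'A "star" is a movie actor',
    "The Chris Potter is a movie actor",
    "Johnny Depp is a movie actor",
]

def extract_hyponyms(snippets, suffix=SUFFIX):
    hyponyms = []
    for snippet in snippets:
        if not snippet.endswith(suffix):
            continue
        tokens = snippet[: -len(suffix)].split()
        # Keep only what follows a leading "a" or "an".
        if len(tokens) >= 2 and tokens[0].lower() in {"a", "an"}:
            hyponyms.append(" ".join(tokens[1:]).strip('"'))
    return hyponyms

def case_weight(term):
    # Hypothetical weights: lower-case > upper-case > title-case.
    if term.islower():
        return 1.0
    if term.isupper():
        return 0.6
    return 0.3

for term in extract_hyponyms(snippets):
    print(term, case_weight(term))
```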

Other Patterns

The other patterns might be disregarded because their processing is
too expensive for large volumes of data. Though this is true, we could
try using these in small clusters of pages (Wikipedia, for instance).

Considerations:

A sequence of at least 7 tokens (a 7-gram) is needed in order to find a
set for a two-token category. As two-token categories are possibly the
most promising ones (we must check if that is true for a corpus the
size of the Internet), that's about all we need. The compound category
issue ("a part of speech" and "a part of speech tagger")  can be
solved by a set subtraction.
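
The set subtraction could look roughly like the sketch below. It assumes we already have one raw set of matches per category string; the member names are invented sample data, and `subtract_compounds` is a hypothetical helper.

```python
def subtract_compounds(sets_by_category):
    """Remove from each category the members that really belong to a
    longer category starting with the same tokens."""
    resolved = {}
    for category, members in sets_by_category.items():
        remaining = set(members)
        for other, other_members in sets_by_category.items():
            if other != category and other.startswith(category + " "):
                remaining -= other_members
        resolved[category] = remaining
    return resolved

# Invented sample data: "ExampleTagger is a part of speech tagger" also
# matches the shorter string "is a part of speech".
raw = {
    "part of speech": {"noun", "verb", "ExampleTagger"},
    "part of speech tagger": {"ExampleTagger"},
}

print(subtract_compounds(raw))
```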


From: bpchesney <bpches...@gmail.com>
Date: Mon, 28 Apr 2008 20:21:32 -0700 (PDT)
Local: Mon, Apr 28 2008 11:21 pm
Subject: Re: Enhancing Google Sets
Hi, Daniel,

I like the idea of parsing "is-a" constructs to come up with members
of a set.  These specific instances of a more generic class could also
be pulled out of a set together, since they are considered to be
examples of the same thing.

I think, in general, Google Sets could greatly benefit by doing some
browsing for common sentence constructs.  Simply returning items from
lists is a good start but there's a lot that can be done to extend it
to get more meaningful results.  'is-a(n)' is an example of this...

One question: what did you mean by "a set of 7 tokens is needed for a
set for a 2 token category"?

Brian

On Apr 28, 12:50 pm, Daniel Vale <danielv...@gmail.com> wrote:


From: David_Santana <GAUDELIA1...@yahoo.es>
Date: Tue, 29 Apr 2008 11:48:26 -0700 (PDT)
Local: Tues, Apr 29 2008 2:48 pm
Subject: Re: Enhancing Google Sets
Some strange results are returned when using "* es *" in Spanish,
but amazing results are returned using "Que es un *",
or in English "What is *".

Best regards, David. Perhaps, after all, the best way is to query the
Google database to order the sets.
On Apr 29, 05:21, bpchesney <bpches...@gmail.com> wrote:


From: David_Santana <GAUDELIA1...@yahoo.es>
Date: Tue, 29 Apr 2008 11:52:52 -0700 (PDT)
Local: Tues, Apr 29 2008 2:52 pm
Subject: Re: Enhancing Google Sets
Most of the valid results come with a definition of the form "What is a
carrillon? A carrillon is a huge ...": the word appears once in the
question and is then repeated at the start of the definition.

On Apr 29, 20:48, David_Santana <GAUDELIA1...@yahoo.es> wrote:


From: Daniel Vale <danielv...@gmail.com>
Date: Tue, 29 Apr 2008 13:54:46 -0700 (PDT)
Local: Tues, Apr 29 2008 4:54 pm
Subject: Re: Enhancing Google Sets
Brian,

     A simple, naïve and fast way of parsing an English sentence is to
tokenize it by splitting on the spaces between words. If we parsed the
results for "movie actor" this way, we would get the following arrays
of tokens:

{Ice-T, is, a, movie, actor}
{he, is, a, movie, actor}
{Johnny, Depp, is, a, movie, actor}
{a, “Super Star”, is, a, movie, actor}
{Ben, Stein, is, a, movie, actor}
{Vince, Vaughn, is, a, movie, actor}
...

     As English anthroponyms (names of people) are usually made up of
two tokens, we must have sequences of at least 7 tokens in order to
use this algorithm for two-token categories: 1 token for checking the
presence of "a", "an" and "the" before the anthroponym, two for the
anthroponym, two for "is a" or "is an" and two for the category. This
is a limiting issue since most statistical processing is being done
nowadays with n-grams up to N=5 (up to 5 tokens). But this is Google
and I expect [7-9]-grams to be possible by now.
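
The token budget can be written out as a small calculation. The function name and defaults below are assumptions for illustration; reading "[7-9]-grams" as categories of two to four tokens is my own interpretation of the arithmetic.

```python
def required_ngram_length(category_tokens,
                          instance_tokens=2,   # e.g. "Johnny Depp"
                          copula_tokens=2,     # "is a" or "is an"
                          lookbehind_tokens=1  # checks for "a", "an", "the"
                          ):
    """Minimum n-gram length needed to test one instance-category match."""
    return lookbehind_tokens + instance_tokens + copula_tokens + category_tokens

# Two-token categories need 7-grams; three and four tokens need 8 and 9.
print([required_ngram_length(n) for n in (2, 3, 4)])
```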

Daniel Vale

On Apr 29, 12:21 am, bpchesney <bpches...@gmail.com> wrote:


From: Daniel Vale <danielv...@gmail.com>
Date: Tue, 29 Apr 2008 14:30:57 -0700 (PDT)
Local: Tues, Apr 29 2008 5:30 pm
Subject: Re: Enhancing Google Sets
David Santana,

The poor results you observed in your Spanish search (the same happens
for Portuguese) are due to the fact that in our languages the
indefinite article does not determine categories as it does in English.
In our languages, it determines the role of the complements instead. This
causes no problem for communication, though it makes it impossible to
parse a string with a context- and lexis-free pattern. I'll give
examples:

Instance-Category
Mario Gómez es actor // Mario Gómez is [an] actor
Miguel Ángel Solá es actor // Miguel Ángel Solá is [an] actor

Both in Spanish and in Portuguese, no article is used before the
category when we categorize instances*. This turns out to be a problem
for parsing, since we have no way to isolate the instance-category
pattern without knowing and eliminating all the other patterns that
have the token "es".

* we do use the indefinite article when we evaluate people: "Nicolas
Cage es un actor excelente" // Nicolas Cage is an excellent actor

Hyponym-Hyperonym
el hombre es un animal racional. // [a] man is a rational animal.

In both languages, the hyponym of a hyponym-hyperonym pattern is
marked off by a definite article and the hyperonym by an indefinite
one. This pattern turns out to be as easy to mine from a large
corpus as the English one. We must only keep in mind that the first
article is typically a definite one in original texts and good
translations and that searching "un * es un *" will give poor results.
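
As a rough illustration of the Spanish hyponym-hyperonym pattern, here is a regex sketch. It assumes a single-token hyponym after "el" or "la" and covers only the singular articles; the pattern and the name `match_hyponym` are mine, not an established rule set.

```python
import re

# Definite article + hyponym + "es" + indefinite article + hyperonym.
PATTERN = re.compile(
    r"^(?:el|la)\s+(\w+)\s+es\s+(?:un|una)\s+([\w\s]+)",
    re.IGNORECASE,
)

def match_hyponym(sentence):
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()

print(match_hyponym("el hombre es un animal racional."))
print(match_hyponym("Nicolas Cage es un actor excelente"))  # evaluation, no match
print(match_hyponym("Mario Gómez es actor"))                # instance, no match
```

Requiring the definite article at the start is what keeps the evaluation and instance-category sentences out of the results.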

On Apr 29, 3:48 pm, David_Santana <GAUDELIA1...@yahoo.es> wrote:


From: bpchesney <bpches...@gmail.com>
Date: Tue, 29 Apr 2008 18:13:11 -0700 (PDT)
Local: Tues, Apr 29 2008 9:13 pm
Subject: Re: Enhancing Google Sets
Thanks for the explanation, Daniel.

You're right, simple parsing based on spaces could be improved.  It
seems the way Google approached Sets is to decide that there are a few
simple constructs that they'll search for to make a set.  These
constructs are punctuation (standard English, or HTML) that indicate
that a list is being processed, and they've probably optimized based
on that.

We could, say, take a similar approach and optimize for 'is-a' parsing
in addition to the list-punctuation method.  So, instead of delimiting
on spaces, we could delimit around an " is a(n) " token and preserve
the tokens on either side.  But how do we delimit on the other ends of
the tokens?  I think that part could get to be tricky.
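
To make the idea concrete, here is a minimal sketch of delimiting around " is a(n) " with a regular expression. The helper name `split_on_is_a` is made up; note that it leaves the outer boundaries untouched, which is exactly the tricky part.

```python
import re

# Matches " is a " or " is an " surrounded by whitespace.
IS_A = re.compile(r"\s+is\s+an?\s+")

def split_on_is_a(sentence):
    parts = IS_A.split(sentence, maxsplit=1)
    if len(parts) != 2:
        return None  # no "is a(n)" construct found
    left, right = parts
    return left.strip(), right.strip()

print(split_on_is_a("Johnny Depp is a movie actor who"))
print(split_on_is_a("He was an actor"))
```

The first call returns the left side "Johnny Depp" and the right side "movie actor who", so the trailing "who" still has to be trimmed by some other rule.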

On Apr 29, 4:54 pm, Daniel Vale <danielv...@gmail.com> wrote:


From: Daniel Vale <danielv...@gmail.com>
Date: Tue, 29 Apr 2008 20:18:40 -0700 (PDT)
Local: Tues, Apr 29 2008 11:18 pm
Subject: Re: Enhancing Google Sets
Brian,

For parsing English, Portuguese, Spanish, French and German (but not
Japanese and Chinese), we could use an improved space-based tokenizer*
for a first-rank parsing and then move forward to a second-rank one.
The first-rank parsing would give us words separated by spaces and
would handle punctuation properly. So the input for the second-rank
parsing is a token sequence (array of strings) and not a character
sequence (string).

In the second-rank English parsing, we could run an algorithm that
keeps the last few tokens (say 5 to 7) and triggers a procedure when it finds
the token pairs "is a" or "is an". At this point, the previously read
tokens could be checked for hyponym-hyperonym or instance-category
patterns. Should they match the patterns, we could get the tokens that
follow the "is a"/"is an" token pair to be the category.

Then we run into the compound concept issue. How many tokens should be
read after the "is a"/"is an" token pair? Look at these options:

1) Johnny Depp is a movie
2) Johnny Depp is a movie actor
3) Johnny Depp is a movie actor who

This issue can be solved by executing the following procedure: 1)
eliminate the third option because it has a grammatical item** and 2)
classify Johnny Depp as a "movie actor" because this is the longest
structure possible.

* improved space-based tokenizers should be a topic of their own
** we should have a list of grammatical items that end the match
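
A minimal sketch of the second-rank procedure, under several assumptions: the list of grammatical items is a tiny hypothetical sample, the instance is taken to be the run of title-case tokens right before "is a"/"is an", and the category is capped at `max_category` tokens. The function name `parse_is_a` is invented for this sketch.

```python
from collections import deque

ARTICLES = {"a", "an", "the"}
# Hypothetical sample of grammatical items that end the match.
STOP_ITEMS = {"who", "that", "which", "and", "or", "in", "from", "with"}

def parse_is_a(tokens, window=5, max_category=3):
    recent = deque(maxlen=window)  # the last few tokens read so far
    results = []
    for i, token in enumerate(tokens):
        if token == "is" and i + 1 < len(tokens) and tokens[i + 1] in ("a", "an"):
            # Walk back over the window, collecting title-case tokens.
            instance, preceding = [], None
            for tok in reversed(recent):
                if tok[:1].isupper() and tok.lower() not in ARTICLES:
                    instance.insert(0, tok)
                else:
                    preceding = tok
                    break
            # Read forward until a grammatical item; keep the longest run.
            category = []
            for tok in tokens[i + 2 : i + 2 + max_category]:
                if tok.lower() in STOP_ITEMS:
                    break
                category.append(tok)
            article_before = preceding is not None and preceding.lower() in ARTICLES
            if instance and category and not article_before:
                results.append((" ".join(instance), " ".join(category)))
        recent.append(token)
    return results

print(parse_is_a("Johnny Depp is a movie actor who".split()))
```

For option 3 this stops at "who" and classifies Johnny Depp as a "movie actor". A sentence that continues without any listed grammatical item would still be over-read up to `max_category` tokens, so the quality of the list matters.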

On Apr 29, 10:13 pm, bpchesney <bpches...@gmail.com> wrote: