The token pairs "is a" and "is an" are a resource for construing
ontologies in English. Two of the patterns that occur with them are
quite useful for Google Sets: the first is the instance-category
pattern and the second is the hyponym-hyperonym pattern. We could use
these patterns to enhance Google Sets by making the following
corpus-based searches:
Instance-Category
The instance-category pattern is realized in text by relating an
instance to a category. For example, one can use a proper noun (a
token or a sequence in title case) as the instance and a common noun
as the category. Look at the following sample search for "is a movie
actor", considering only what comes before the searched string in the
first ten results from the Google search engine:
Ice-T is a movie actor
he is a movie actor
Johnny Depp is a movie actor
a “Super Star” is a movie actor
Ben Stein is a movie actor
Vince Vaughn is a movie actor
The Chris Potter is a movie actor
there is a movie actor
the other is a movie actor
A “star” is a movie actor
By selecting only the title-case tokens that follow neither "a" nor
"the", we find the following list:
movie actor => Ice-T, Johnny Depp, Ben Stein, Vince Vaughn
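The selection rule above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `extract_instance`, the `ARTICLES` set and the quote-stripping step are hypothetical, and the snippets are simply the ten results listed above.

```python
ARTICLES = {"a", "an", "the"}
QUOTES = "“”\"'"  # quote marks stripped from tokens before the case check

SNIPPETS = [
    "Ice-T is a movie actor",
    "he is a movie actor",
    "Johnny Depp is a movie actor",
    "a “Super Star” is a movie actor",
    "Ben Stein is a movie actor",
    "Vince Vaughn is a movie actor",
    "The Chris Potter is a movie actor",
    "there is a movie actor",
    "the other is a movie actor",
    "A “star” is a movie actor",
]

def extract_instance(snippet, query="is a movie actor"):
    """Return the title-case sequence right before `query`, or None."""
    idx = snippet.find(" " + query)
    if idx == -1:
        return None
    tokens = [t.strip(QUOTES) for t in snippet[:idx].split()]
    name = []
    # Collect the run of capitalised tokens that ends right before the query.
    while tokens and tokens[-1][:1].isupper():
        name.insert(0, tokens.pop())
    if not name:
        return None
    # Reject the run if an article precedes it or opens it ("The Chris Potter").
    if (tokens and tokens[-1].lower() in ARTICLES) or name[0].lower() in ARTICLES:
        return None
    return " ".join(name)

instances = [name for name in map(extract_instance, SNIPPETS) if name]
```

On these ten snippets, `instances` holds the same four names as the list above.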
Hyponym-Hyperonym
The hyponym-hyperonym pattern is realized in text by relating one
category to another. For example, one can use a common noun (a token
or a sequence in any case*) as the hyponym and another common noun as
the hyperonym. Considering the same sample search, we can find the
following list of hyponyms by selecting only the tokens that follow
"a" or "an":
movie actor => “Super Star”, “star”
* For common nouns, we could apply decreasing weights, respectively,
to lower-case, upper-case and title-case tokens.
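A rough sketch of the hyponym selection together with the case weighting from the footnote. The weight values (1.0, 0.6, 0.3) are made up for illustration, and `extract_hyponym` and `case_weight` are hypothetical names; only the ordering lower-case > upper-case > title-case comes from the text.

```python
QUOTES = "“”\"'"  # quote marks stripped from tokens

# Hypothetical weights, decreasing from lower case to title case.
CASE_WEIGHTS = {"lower": 1.0, "upper": 0.6, "title": 0.3}

def case_weight(term):
    """Weight a candidate common noun by its letter case."""
    if term.islower():
        return CASE_WEIGHTS["lower"]
    if term.isupper():
        return CASE_WEIGHTS["upper"]
    return CASE_WEIGHTS["title"]  # anything else is treated as title case

def extract_hyponym(snippet, query="is a movie actor"):
    """Return (hyponym, weight) when the text before `query` opens with a/an."""
    idx = snippet.find(" " + query)
    if idx == -1:
        return None
    tokens = [t.strip(QUOTES) for t in snippet[:idx].split()]
    if len(tokens) < 2 or tokens[0].lower() not in ("a", "an"):
        return None
    term = " ".join(tokens[1:])
    return term, case_weight(term)
```

With these assumed weights, "star" would rank above "Super Star" as a hyponym of "movie actor".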
Other Patterns
The other patterns might be disregarded because processing them is
too expensive for large volumes of data. Even so, we could
try using them on small clusters of pages (Wikipedia, for instance).
Considerations:
A sequence of at least 7 tokens (a 7-gram) is needed in order to find a
set for a two-token category. As two-token categories are possibly the
most promising ones (we must check whether that is true for a corpus the
size of the Internet), that's about all we need. The compound category
issue ("a part of speech" and "a part of speech tagger") can be
solved by set subtraction.
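To make the set subtraction concrete: a search for "is a part of speech" also matches every text that says "is a part of speech tagger", so the members found for the longer category can be removed from those found for the shorter one. A minimal sketch, with invented sample members and a hypothetical helper name:

```python
def resolve_compound(short_category_hits, long_category_hits):
    """Members of the short category that do not really belong to the long one."""
    return set(short_category_hits) - set(long_category_hits)

# Invented sample data, for illustration only.
hits_part_of_speech = {"noun", "verb", "adjective", "TreeTagger"}
hits_part_of_speech_tagger = {"TreeTagger"}

parts_of_speech = resolve_compound(hits_part_of_speech, hits_part_of_speech_tagger)
```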
I like the idea of parsing "is-a" constructs to come up with members
of a set. These specific instances of a more generic class could also
be pulled out of a set together, since they are considered to be
examples of the same thing.
I think, in general, Google Sets could greatly benefit by doing some
browsing for common sentence constructs. Simply returning items from
lists is a good start but there's a lot that can be done to extend it
to get more meaningful results. 'is-a(n)' is an example of this...
One question: what did you mean by "a set of 7 tokens is needed for a
set for a 2 token category"?
Brian
On Apr 28, 12:50 pm, Daniel Vale <danielv...@gmail.com> wrote:
> The token pairs "is a" and "is an" are used as a resource for
> construing ontologies in English. [...]
Some strange results are returned when searching in Spanish for "* es *",
but amazing results are returned using "Que es un *"
or, in English, "What is *".
Perhaps, after all, the best way is to query the Google database to
order the sets.
Best regards, David.
On Apr 29, 05:21, bpchesney <bpches...@gmail.com> wrote:
> I like the idea of parsing "is-a" constructs to come up with members
> of a set. [...]
A simple, naïve and fast way of parsing an English sentence is to
tokenize it by splitting on the spaces between words. If we parsed the
results for "movie actor" this way, we would get the following arrays
of tokens:
{Ice-T, is, a, movie, actor}
{he, is, a, movie, actor}
{Johnny, Depp, is, a, movie, actor}
{a, “Super Star”, is, a, movie, actor}
{Ben, Stein, is, a, movie, actor}
{Vince, Vaughn, is, a, movie, actor}
...
As English anthroponyms (names of people) are usually made up of
two tokens, we need sequences of at least 7 tokens in order to
use this algorithm for two-token categories: one token for checking the
presence of "a", "an" or "the" before the anthroponym, two for the
anthroponym, two for "is a" or "is an" and two for the category. This
is a limiting issue, since most statistical processing nowadays is done
with n-grams up to N=5 (up to 5 tokens). But this is Google,
and I expect [7-9]-grams to be possible by now.
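The token budget above can be written down as a small calculation, followed by a plain n-gram generator. Both helpers (`required_ngram_size`, `ngrams`) are hypothetical names used only for this sketch; the default of two tokens per name is the assumption stated above.

```python
def required_ngram_size(category_tokens, name_tokens=2,
                        article_slot=1, copula_tokens=2):
    """Tokens needed: article check + name + "is a"/"is an" + category."""
    return article_slot + name_tokens + copula_tokens + category_tokens

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

size = required_ngram_size(category_tokens=2)  # 1 + 2 + 2 + 2
windows = ngrams("The famous Johnny Depp is a movie actor".split(), size)
```

The second window starts at "famous", which is the slot that lets us check that no article precedes "Johnny Depp".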
Daniel Vale
On Apr 29, 12:21 am, bpchesney <bpches...@gmail.com> wrote:
> I like the idea of parsing "is-a" constructs to come up with members
> of a set. [...]
The poor results you observed in your Spanish search (the same happens
in Portuguese) are due to the fact that in our languages the
indefinite article does not mark categories as it does in English. In our
languages, it determines the role of the complements instead. This
causes no problem for communication, though it makes it impossible to
parse a string with a context-free and lexis-free pattern. I'll give
examples:
Instance-Category
Mario Gómez es actor // Mario Gómez is [an] actor
Miguel Ángel Solá es actor // Miguel Ángel Solá is [an] actor
Both in Spanish and in Portuguese, no article is used before the
category when we categorize instances*. This turns out to be a problem
for parsing, since we have no way to isolate the instance-category
pattern without knowing and eliminating all the other patterns that
contain the token "es".
* We do use the indefinite article when we evaluate people: "Nicolas
Cage es un actor excelente".
Hyponym-Hyperonym
el hombre es un animal racional. // [a] man is a rational animal.
In both languages, the hyponym of a hyponym-hyperonym pattern is
marked by a definite article and the hyperonym by an indefinite
one. This pattern turns out to be as easy to mine from a large corpus
as the English one. We must only keep in mind that the first
article is typically a definite one in original texts and good
translations, and that searching "un * es un *" will give poor results.
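A possible regular expression for the Spanish hyponym-hyperonym pattern (definite article, hyponym, "es", indefinite article, hyperonym). This is only a sketch: it covers singular "el"/"la" and "un"/"una", stops the hyperonym at the first ".", "," or ";", and the name `match_es` is hypothetical.

```python
import re

ES_HYPONYM = re.compile(
    r"\b(?:el|la)\s+(\w+(?:\s+\w+)*?)\s+es\s+(?:un|una)\s+([^.,;]+)",
    re.IGNORECASE,
)

def match_es(sentence):
    """Return (hyponym, hyperonym) for "el X es un Y" sentences, else None."""
    m = ES_HYPONYM.search(sentence)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()
```

The instance-category sentences above ("Mario Gómez es actor") do not match, since they carry neither article.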
On Apr 29, 3:48 pm, David_Santana <GAUDELIA1...@yahoo.es> wrote:
> Some strange results are returned when searching in Spanish for "* es *",
> but amazing results are returned using "Que es un *"
> or, in English, "What is *".
You're right, simple parsing based on spaces could be improved. It
seems the way Google approached Sets is to decide that there are a few
simple constructs that they'll search for to make a set. These
constructs are punctuation (standard English, or HTML) that indicate
that a list is being processed, and they've probably optimized based
on that.
We could, say, take a similar approach and optimize for 'is-a' parsing
in addition to the list-punctuation method. So, instead of delimiting
on spaces, we could delimit around an " is a(n) " token and preserve
the tokens on either side. But how do we delimit on the other ends of
the tokens? I think that part could get to be tricky.
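Delimiting around " is a(n) " might look like the sketch below (the helper name `split_on_is_a` is made up). It shows the tricky part too: the right-hand side keeps everything up to the end of the sentence, so the outer boundaries still have to be found some other way.

```python
import re

# " is a " or " is an " as a single delimiter; "is another" does not match.
IS_A = re.compile(r"\s+is\s+an?\s+")

def split_on_is_a(sentence):
    """Split once around the first "is a(n)"; return (left, right) or None."""
    parts = IS_A.split(sentence, maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0], parts[1]
```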
On Apr 29, 4:54 pm, Daniel Vale <danielv...@gmail.com> wrote:
> a simple, naïve and fast way of parsing an English sentence is to
> tokenize it discarding the spaces between words. [...]
For parsing English, Portuguese, Spanish, French and German (but not
Japanese and Chinese), we could use an improved space-based tokenizer*
for a first-rank parsing and then move on to a second-rank one.
The first-rank parsing would give us words separated by spaces and
would handle punctuation properly. So the input for the second-rank
parsing is a token sequence (array of strings) and not a character
sequence (string).
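One way a first-rank tokenizer could turn a string into an array of strings while keeping punctuation apart from words. The regular expression is my own guess at "handling punctuation properly": it keeps hyphenated or apostrophized words together and emits every other non-space symbol as its own token.

```python
import re

# A word (optionally joined by "-", "'" or "’"), or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text):
    """First-rank parsing: character sequence in, token sequence out."""
    return TOKEN_RE.findall(text)
```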
In the second-rank English parsing, we could run an algorithm that
keeps the last few tokens (say, 5 to 7) and triggers a procedure when it finds
the token pair "is a" or "is an". At this point, the previously read
tokens could be checked for hyponym-hyperonym or instance-category
patterns. Should they match the patterns, we could take the tokens that
follow the "is a"/"is an" token pair to be the category.
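The second-rank pass could be sketched with a bounded buffer of recent tokens. The window size, the number of tokens read after the pair, and the name `scan_is_a` are all assumptions made for the example; checking the left context against the two patterns is left out.

```python
from collections import deque

def scan_is_a(tokens, window=5, lookahead=2):
    """Return (left_context, following_tokens) for each "is a"/"is an" pair."""
    history = deque(maxlen=window)  # only the last `window` tokens are kept
    hits = []
    for i, tok in enumerate(tokens):
        if tok in ("a", "an") and history and history[-1] == "is":
            left = list(history)[:-1]  # tokens read before "is"
            following = tokens[i + 1:i + 1 + lookahead]
            hits.append((left, following))
        history.append(tok)
    return hits

tokens = "Yesterday I read that Johnny Depp is a movie actor".split()
hits = scan_is_a(tokens)
```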
Then we run into the compound concept issue. How many tokens should be
read after the "is a"/"is an" token pair? Look at these options:
1) Johnny Depp is a movie
2) Johnny Depp is a movie actor
3) Johnny Depp is a movie actor who
This issue can be solved by executing the following procedure: 1)
eliminate the third option because it contains a grammatical item** and 2)
classify Johnny Depp as a "movie actor" because this is the longest
structure possible.
* improved space-based tokenizers should be a topic of their own
** we should have a list of grammatical items that end the match
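The two-step procedure could look like this in code. The contents of `GRAMMATICAL_ITEMS` are only a guess at what such a list might hold, and `max_len=3` is an arbitrary cap chosen to mirror the three options above.

```python
# Hypothetical list of grammatical items that end the match.
GRAMMATICAL_ITEMS = {"who", "whom", "whose", "that", "which", "and", "but", "or"}

def longest_category(following_tokens, max_len=3):
    """Pick the longest candidate category that has no grammatical item."""
    limit = min(max_len, len(following_tokens))
    candidates = [following_tokens[:n] for n in range(1, limit + 1)]
    # Step 1: eliminate candidates that contain a grammatical item.
    valid = [c for c in candidates
             if not any(tok.lower() in GRAMMATICAL_ITEMS for tok in c)]
    if not valid:
        return None
    # Step 2: keep the longest remaining structure.
    return " ".join(max(valid, key=len))
```

For the tokens after "Johnny Depp is a", the candidates are "movie", "movie actor" and "movie actor who"; the last is dropped and "movie actor" wins.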
On Apr 29, 10:13 pm, bpchesney <bpches...@gmail.com> wrote:
> You're right, simple parsing based on spaces could be improved. [...]