Genifer parsing English


YKY (Yan King Yin, 甄景贤)

May 3, 2012, 1:48:33 AM
to general-in...@googlegroups.com
For example, the KB would have the following facts:

A tokenized sentence represented as a list in Clojure:
1.    input = ["john", "loves", "mary"]

2.  "love" is a verb
3.  "john" is a noun
4.  "mary" is a noun

5. A verb followed by a noun is a verb phrase.
6. A subject can be a noun.
7. A subject followed by a verb phrase forms a sentence.

8. The meaning of a noun is <noun>.
9. The meaning of a verb phrase is <verb ◦ noun>.
10. In general, the meaning of a grammatical constituent is its sub-constituents' meanings composed together.

Desired conclusion:
    "john loves mary" is a sentence, and its meaning is <john ◦ loves ◦ mary>.

There are of course some missing pieces of knowledge.  What I want to do is to fill in the knowledge so that the above translation works (via forward chaining).

Maybe this can be done using ~50-100 formulas...

A simple example of forward chaining that already works in the code:

    "john loves mary"
    "mary loves john"
    "X loves Y", "Y loves X" -> "X and Y are happy"
    ==========================
    "john and mary are happy"

KY

Matt Mahoney

May 3, 2012, 1:53:42 PM
to general-in...@googlegroups.com
On Thu, May 3, 2012 at 1:48 AM, YKY (Yan King Yin, 甄景贤)
<generic.in...@gmail.com> wrote:
> For example, the KB would have the following facts:
> [...]
> Maybe this can be done using ~50-100 formulas...

There are millions of formulas. Allow me to demonstrate. Complete the following:

"p_____ and salt"
"salt and p_____"

The first one is hard. The second one you probably guessed "pepper".
That is because there is a grammar rule in English that makes "salt
and pepper" more correct than "pepper and salt". Not that "pepper and
salt" is wrong, just less likely. If you needed to solve this problem
in a speech recognition system, where the _____ was inaudible, you
would need this knowledge. Likewise if you were reading text and some
of the text was blurry, or you were translating from another language
and the word for "pepper" could be translated in more than one way, or
you were correcting a document and "pepper" was misspelled. Humans can
solve all of these problems, so our AI should also have this
capability.

You could code this rule by hand, but I don't think you would want to
do this millions of times. And you don't have to. Counting Google
hits:

"salt and pepper", 42,300,000
"pepper and salt", 4,830,000.

So I think we need to think about:

- How do we represent these kinds of rules?
- How do we induce this knowledge from raw text?
- What do we use for training data?
- How much computing power do we need?
- How do we measure success?


-- Matt Mahoney, mattma...@gmail.com

Linas Vepstas

May 3, 2012, 6:08:48 PM
to general-in...@googlegroups.com
Hi Matt, YKY,

On 3 May 2012 12:53, Matt Mahoney <mattma...@gmail.com> wrote:
> [...]
> There are millions of formulas.

Yes, indeed, as Matt demonstrates. I did not want to state the obvious, but what the heck... let me state the obvious.  Anyone who ever tries to write an English-language parser promptly discovers this.  Try it.  You'll find out.  The first 100 rules are easy. The next 900 start getting tedious. After that, you realize that there is no end in sight; humans are not constrained by rules, and they build all sorts of crazy sentences.  FWIW, the current link-grammar has approx 2K rules in it.

This explosion of rules was first noted in the 1960s, and for the next three decades, linguists hoped that a "better theory" would allow for a much smaller, limited number of rules.  That's why we have dozens of different grammar theories.  By the mid-90s, it was realized that hand-writing parsers is really a crazy idea, and that perhaps using machine learning to automatically learn a grammar is a much better idea.  Thus, most recent work in the last 10-15 years has been on machine-learning grammars.  There's been some success.  I am not convinced that there are any machine-generated grammars that are more accurate and provide better coverage than link-grammar, with its 2K rules. I dunno, haven't seen any yet.  But I think the dominance by machines is inevitable.

BTW, syntax isn't everything. Chinese and Japanese have a simple, straightforward syntax (as compared to English).  That hasn't meant that a Chinese/Japanese NLP system is somehow easier or more accurate.

Anyway, ignorance of the past is a recipe for failure.   Don't assume all those silly linguists have been fucking around for the last six decades for no reason, or because they're stupid.

--linas

swkane

May 5, 2012, 9:36:44 AM
to general-in...@googlegroups.com
Case in point? How do you handle valid sentences like the following?

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
http://en.wikipedia.org/wiki/Buffalo_buffalo_buffalo_buffalo

Steven

William Taysom

May 5, 2012, 9:53:59 AM
to general-in...@googlegroups.com
> Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

I don't know about you. But I can tell you how I handle that sentence:

"Pardon me? Did you just say, 'buffalo' a half dozen times? Do you like buffalo? Would you like to go to Buffalo? Are you studying linguistics?"

When a dialog system gives me something like that as the reply, I will gladly pin a Turing badge on it.

Mike Dougherty

May 5, 2012, 9:53:59 AM
to general-in...@googlegroups.com
On Sat, May 5, 2012 at 9:36 AM, swkane <diss...@gmail.com> wrote:
> Case in point? How do you handle valid sentences like the following?
>
> Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
> http://en.wikipedia.org/wiki/Buffalo_buffalo_buffalo_buffalo

People generally fail to parse that one.

IvanV

Jul 16, 2013, 7:05:31 AM
to general-in...@googlegroups.com
I finally made it. Here is a universal parser in JavaScript: http://synth.wink.ws/moonyparser/

I've been to hell and back to make it faster, but that's it; shift-reduce can't get much faster. Unfortunately, finding errors in text makes it three times slower. At least it has linear parsing time.
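
For anyone unfamiliar with the technique, here is a toy shift-reduce loop in Clojure (a sketch of the general idea only, not the Moony parser's actual code): shift each token onto a stack, and reduce whenever the top of the stack matches a rule's right-hand side.

    ;; Grammar: right-hand side (top two stack symbols) -> left-hand side.
    (def grammar {[:verb :noun]        :verb-phrase
                  [:noun :verb-phrase] :sentence})

    (defn reduce-stack
      "Rewrite the top of the stack while a rule's right-hand side matches."
      [stack]
      (if-let [lhs (and (>= (count stack) 2)
                        (grammar (vec (take-last 2 stack))))]
        (recur (conj (vec (drop-last 2 stack)) lhs))
        stack))

    (defn shift-reduce
      "Shift each token onto the stack, reducing greedily after each shift."
      [tokens]
      (reduce (fn [stack tok] (reduce-stack (conj stack tok))) [] tokens))

    (shift-reduce [:noun :verb :noun])  ;; => [:sentence]

Each token is shifted once and every reduction shrinks the stack, which is where the linear parsing time comes from (for this toy deterministic grammar).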

IvanV

Jul 16, 2013, 7:08:35 AM
to general-in...@googlegroups.com
YKY, where are you? I've enjoyed reading about progress in your work.

SeH

Jul 16, 2013, 3:49:12 PM
to general-in...@googlegroups.com
ivan: wow this is cool.  what's the next step?

On Tue, Jul 16, 2013 at 7:08 AM, IvanV <ivan....@gmail.com> wrote:
YKY, where are you? I've enjoyed reading about progress in your work.

Ivan Vodišek

Jul 16, 2013, 4:28:26 PM
to general-in...@googlegroups.com
hi SeH, tx for asking. i hope this is not hijacking. allow me to keep this post short.

1. to make a new rotating filter for data
2. maybe to implement it in netention (i'm waiting for a new computer that has more than 4gigs sdd to install node.js)
3. to make an all-in-one html-php "web 3.0 operating system / database CASE tool / filesystem / programming language" with that filter, all based on the grammar from the given parser; i'll call it "Synth"
4. to write a few apps in Synth, like a project manager, small office apps, and some other show-off for earning money from donations. i hate writing bills :(
5. to feed hungry Africans from the earned money. i'm keeping an eye on water-pumping windmills; fingers crossed for fortune money.

and then comes the most interesting part:
6. to invest a couple of years in a Synth-based intelligent encyclopedia capable of parsing and classifying NL text, solving math problems, inductively discovering new formulas in physics and chemistry, automatically composing processors from a given set of assembly functions, and whatever else scientists would come up with. I still don't know what to call this futuristic encyclopedia.



Ivan Vodišek

Jul 16, 2013, 5:26:13 PM
to general-in...@googlegroups.com
oh, Netention comes with a big N, right?



YKY (Yan King Yin, 甄景贤)

Jul 17, 2013, 3:27:01 AM
to general-in...@googlegroups.com
On Tue, Jul 16, 2013 at 7:05 PM, IvanV <ivan....@gmail.com> wrote:
I finally made it. Here is a universal parser in JavaScript: http://synth.wink.ws/moonyparser/

I've been to hell and back to make it faster, but that's it; shift-reduce can't get much faster. Unfortunately, finding errors in text makes it three times slower. At least it has linear parsing time.


Ivan, of course you're welcome to use this list =)

I think we have discussed this a bit before, and I am skeptical of purely syntactic parsing.  In fact, it may be the wrong direction to take, and a syntax parser may not fit into my agenda for AGI.  Though, of course, you may disagree =)

My latest thinking is to create a logic with "nice" semantics in mind, and not pay so much attention to natural-language syntax (the latter had led me in a wrong direction, I think).

You may read some books on linguistics, especially semantics, to see my point...

=)
YKY

YKY (Yan King Yin, 甄景贤)

Jul 17, 2013, 7:48:16 AM
to general-in...@googlegroups.com
Well, to explain with a simple example:
"That night, his father told him the story of his life."
In this sentence, "his life" can mean either the father's life or the son's life.  This is the problem of anaphora (pronoun) resolution.

If your AGI contains the above sentence as knowledge, then it has the problem of deciding whom that "his" refers to.  This is one of the gaps that have to be filled if you want to use natural language to represent knowledge in an AGI.
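
To see the gap concretely, here are the two readings written as candidate logical forms (the notation is only an illustrative sketch, not the actual logic):

    ;; Two candidate readings of "his father told him the story of his life":
    (def readings
      ['(told father son (story-of (life-of father)))   ; "his" = the father's
       '(told father son (story-of (life-of son)))])    ; "his" = the son's
    ;; Anaphora resolution = choosing (or weighting) one of these candidates.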

I discovered that such an approach would have many "gaps" that need to be mended.  It may be better to start from the "other end", i.e., try to design a logic that is ideal for expressing semantics, and upon that, build parsers for natural languages.

I have tried to design a logic following the structure of natural languages, and found that to be the wrong direction -- that's my view so far.

=) 
YKY

Ivan Vodišek

Jul 17, 2013, 8:50:45 AM
to general-in...@googlegroups.com
I always thought of having one grammar for parsing text, after which the parsed tokens would be aligned to a completely different knowledge-base structure.

In the above example, "story of his life" would parse as an object complement, so additional reasoning would be necessary before adding the data to the knowledge base, probably based on previous or incoming text.

People made a complication with natural language, and then with Esperanto mimicking natural-language constructs. A better solution would be if we could speak in some programming-like language, such as the Web Ontology Language or Java. Predicates would have one, two, or three parameters for subject, object, and complement, and that is all. I'd make subjects and objects predicates too, but with zero parameters; adjuncts would be predicates with one parameter for a subject or object.

I can't believe we can't think of something better than the millennia-old constructions we use to think inside of. Maybe we are trapped in millennia-old natural language and can't get out.
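
As a sketch of that representation (the predicate names are made up):

    (def kb
      '[(john)                     ; zero parameters: a subject/object entity
        (mary)
        (happy john)               ; one parameter: an adjunct/property
        (loves john mary)          ; two parameters: subject, object
        (told father son story)])  ; three parameters: subject, object, complement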

Ivan Vodišek

Jul 17, 2013, 8:56:36 AM
to general-in...@googlegroups.com
I think that people just follow learned patterns when they think. God knows what beasts we could be with some other, artificial languages in our minds.


2013/7/17 Ivan Vodišek <ivan....@gmail.com>

Matt Mahoney

Jul 17, 2013, 9:12:02 AM
to general-intelligence
Natural language processing would be a lot simpler if we all spoke
semantically precise and grammatically simple languages like Lojban or
Attempto. But we don't. An AI has to be able to understand the
petabytes of ambiguous, sloppy text on the internet if it is going to
build a database. Unfortunately it takes enormous amounts of computing
power to do even rudimentary language processing like Google or
Watson.

-- Matt Mahoney, mattma...@gmail.com

SeH

Jul 17, 2013, 12:35:49 PM
to general-in...@googlegroups.com
i see how we can semi-automatically transform at least the most important online texts to a semantically precise format - it's mostly a user-interface solution.

YKY (Yan King Yin, 甄景贤)

Jul 17, 2013, 1:15:59 PM
to general-in...@googlegroups.com
On Thu, Jul 18, 2013 at 12:35 AM, SeH <seh...@gmail.com> wrote:
i see how we can semi-automatically transform at least the most important online texts to a semantically precise format - it's mostly a user-interface solution.


Well, you can continue to try, but my guess is that the fixing may get complicated and inelegant.

But my logic has not been designed yet; maybe when it is ready, we can see how it can be applied to your stuff...

=)

SeH

Jul 17, 2013, 1:21:19 PM
to general-in...@googlegroups.com
On Tue, Jul 16, 2013 at 5:26 PM, Ivan Vodišek <ivan....@gmail.com> wrote:
oh, Netention comes with big N, right?

capitalization doesn't matter

can you add more RAM to your existing computer?  i've run netention + node.js + mongodb ok on a 512mb VPS (with ubuntu server)

YKY (Yan King Yin, 甄景贤)

Jul 17, 2013, 1:38:20 PM
to general-in...@googlegroups.com
On Wed, Jul 17, 2013 at 4:28 AM, Ivan Vodišek <ivan....@gmail.com> wrote:
2. maybe to implement it in netention (i'm waiting for new computer that has more than 4gigs sdd to install node.js)

What is SDD?  You mean an SSD drive?  Why not use an SSD drive?

4GB main memory is also pretty easy to get, no?

Ivan Vodišek

Jul 17, 2013, 1:58:08 PM
to general-in...@googlegroups.com
2013/7/17 SeH <seh...@gmail.com>

> can you add more RAM to your existing computer?  i've run netention + node.js + mongodb ok on a 512mb VPS 
> (with ubuntu server)

well, i have 2gigs of ram, but 4gigs of hdd (sdd in fact). when i install Ubuntu, i am left with 1.2gigs of free space if i'm lucky. currently i have 472.6megs of free hdd.

but in 10 days i'll get some payback from tourism, so finally a 350gigs hdd is coming to me. i'll open a champagne on that occasion; these five years on 4gigs sdd were quite an exhibition with modern apps.

On Thu, Jul 18, 2013 at 12:35 AM, SeH <seh...@gmail.com> wrote:
i see how we can semi-automatically transform at least the most important online texts to a semantically precise format - it's mostly a user-interface solution.

clumsy web English is giving me the creeps. i'm hoping that Wikipedia is more or less grammatically correct. maybe a lot of interaction, once the parser is extended with case-insensitive matching and some binary search over matching tokens (imagine a 260,000-word base, with who knows how many clumsy variants, so i have to add some binary search too). i'm counting on the Synth grammar to hold the knowledge base in some logic format, to which the subject-predicate-object-complement pattern has to be converted after parsing.

Ivan Vodišek

Jul 17, 2013, 2:02:16 PM
to general-in...@googlegroups.com
SDD = solid state drive - an expensive replacement for HDD, a little bit faster than HDD.



Ivan Vodišek

Jul 17, 2013, 2:10:21 PM
to general-in...@googlegroups.com
i have a 500-page-thick book of english grammar here. pretty complicated stuff. i imagine a lot of work.



YKY (Yan King Yin, 甄景贤)

Jul 17, 2013, 2:39:10 PM
to general-in...@googlegroups.com
But how can you have only 4GB of SSD?  So small?  I now have 2 SSD drives, 128GB each.

On Thu, Jul 18, 2013 at 2:10 AM, Ivan Vodišek <ivan....@gmail.com> wrote:
i have a 500-page-thick book of english grammar here. pretty complicated stuff. i imagine a lot of work.


I am aware of various grammar formulations, both machine and human efforts.  I think the problem is not how detailed the grammar is, but something deeper, at the syntax-semantics interface.

Ivan Vodišek

Jul 17, 2013, 2:58:18 PM
to general-in...@googlegroups.com
2 sdd-s? hehe, send one right here :)


2013/7/17 YKY (Yan King Yin, 甄景贤) <generic.in...@gmail.com>
But how can you have only 4GB of SSD?  So small?  I now have 2 SSD drives, 128GB each.


Ivan Vodišek

Jul 17, 2013, 3:59:04 PM
to general-in...@googlegroups.com
did i mention that the code is free for any use? just don't sue me, that's all ;)



Mike Dougherty

Jul 17, 2013, 8:57:13 PM
to general-in...@googlegroups.com
On Wed, Jul 17, 2013 at 12:35 PM, SeH <seh...@gmail.com> wrote:
> i see how we can semi-automatically transform at least the most important
> online texts to a semantically precise format - it's mostly a user-interface
> solution.

gamification and crowdsourcing? Something like Mechanical Turk that gets you gameplay gold... if you can't just engineer the game in such a way that you're training people to think in a logical language at the same time you're teaching the machine to understand our effectively-chaotic meat-sounds.


http://www.youtube.com/watch?v=gaFZTAOb7IE
(original text) http://www.terrybisson.com/page6/page6.html

SeH

Jul 17, 2013, 9:00:55 PM
to general-in...@googlegroups.com
thanks, yes gamification would be a great way to crowdsource knowledge.

here are some blog posts that might help:




YKY (Yan King Yin, 甄景贤)

Jul 18, 2013, 11:33:53 PM
to general-in...@googlegroups.com
On Thu, Jul 18, 2013 at 8:57 AM, Mike Dougherty <msd...@gmail.com> wrote:
gamification and crowdsourcing? Something like Mechanical Turk that gets you gameplay gold... if you can't just engineer the game in such a way that you're training people to think in a logical language at the same time you're teaching the machine to understand our effectively-chaotic meat-sounds.


Mike,

Can you tell me what is the relevance of that video, and is it an attempt to make me look stupid?

YKY

Mike Dougherty

Jul 19, 2013, 8:36:41 AM
to general-in...@googlegroups.com
No it is not an attempt to make you look stupid.

The "aliens" in the video would have no problem conversing with
logical AGI. It is their disbelief that humans communicate via the
flapping of "meat" (tongues, etc.) and that even with radio
technology the radio-waves broadcast the sounds of meat. I think it's
a great illustration of the absurdity of something we take for
granted. "Natural" Language is an ironic misnomer. :)

YKY (Yan King Yin, 甄景贤)

Jul 20, 2013, 6:01:44 AM
to general-in...@googlegroups.com
Oh, thanks for explaining, it makes sense now =)

Sandeep Pai

Jul 20, 2013, 8:20:14 AM
to general-in...@googlegroups.com
Just want to get some feedback on this idea:

We have dbpedia and freebase, where you can get the info-boxes of Wikipedia and other wikis in a structured format (RDF), ready to query. But the languages used for querying such databases (e.g. SPARQL) are not very user-friendly for normal users. My idea is to create a website which would allow users to create functions for querying. For example, one user can create a function "Mother of" and someone else might choose to create another function "Is son of", and so on.

Once functions are created, they can be used in creating newer functions (in the function body) or for querying the graph db. The initial graph db would consist only of dbpedia/freebase, but I guess we can always add in more of the semantic web databases out there.

So in the end, a query looks something like this: Mother of(TV character(In(GoT(Sansa Stark)))). The user interface would have a very good auto-completion feature to list available functions as soon as the user starts typing something.

I know it's kind of stupid to brute-force NLP, and this is not really an NLP project, just something which could be useful for querying large databases. That said, we can always use NLP to automate function creation or to interpret a natural-language query as a chain of functions.

What do you guys think?
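
A toy Clojure version of the idea (the triples and property names are hypothetical stand-ins for dbpedia/freebase data; a real version would query a SPARQL endpoint instead):

    ;; A tiny in-memory "triplestore".
    (def triples #{["Sansa_Stark"   :mother "Catelyn_Stark"]
                   ["Catelyn_Stark" :spouse "Eddard_Stark"]})

    (defn prop-fn
      "Build a user-created query function: entity -> object of the property."
      [prop]
      (fn [entity]
        (some (fn [[s p o]] (when (and (= s entity) (= p prop)) o))
              triples)))

    (def mother-of (prop-fn :mother))
    (def spouse-of (prop-fn :spouse))

    ;; User functions compose into new functions, as in "Mother of(...(...))":
    ((comp spouse-of mother-of) "Sansa_Stark")  ;; => "Eddard_Stark"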


On Sat, Jul 20, 2013 at 3:31 PM, YKY (Yan King Yin, 甄景贤) <generic.in...@gmail.com> wrote:
Oh, thanks for explaining, it makes sense now =)

Ivan Vodišek

Jul 20, 2013, 10:27:11 AM
to general-in...@googlegroups.com
freakin' cool info and ideas.
to brute-force NLP? that is the only way, i think.
i've googled dbpedia and freebase. very usable, if only they could be accessed in realtime, without downloading.

Sandeep Pai

Jul 20, 2013, 10:49:55 AM
to general-in...@googlegroups.com
Ivan, we could set up a local triplestore with all of freebase/dbpedia and sync the data every week or so. Additionally, we could also allow users to add to or modify the existing graph using a simple web interface.



Ivan Vodišek

Jul 20, 2013, 10:59:04 AM
to general-in...@googlegroups.com
Hi there, Sandeep, nice to meet you again (remember AI Dreams, and my disgrace there?)

Unfortunately I'm kinda busy now, still unemployed in my mid-30s, and I gotta work hard on that, so I don't have much time now. What if I hop in later with some induction and deduction algos? Right now, I can only add some binary search and case-insensitive parsing to that parser, and that's it. I'm sure that you guys will manage without me.

SeH

Jul 21, 2013, 12:19:59 AM
to general-in...@googlegroups.com, netent...@googlegroups.com
seems doable.  basically it learns a mapping from English to SPARQL, and that results in a subgraph of dbpedia or whatever other knowledge base is loaded (including realtime sensors and social feedback data from www.netention.org, available soon).  all of that can be made realtime with disambiguation menu popups for word completion.