Word similarity database report

Linas Vepstas

unread,
May 10, 2017, 6:34:27 PM5/10/17
to opencog, link-grammar, Ruiting Lian, Ben Goertzel
Attached PDF reports on a small, early snapshot of what the database looks like. Basically, it looks promising.  I'm moving on to the next step, which is to reparse with the clusters. There are various parts of the theory I don't understand, as well as a lot of code to write, to build a pipeline from the atomspace back into the link-grammar parser.

--linas
connector-sets.pdf

Ben Goertzel

unread,
May 10, 2017, 10:21:17 PM5/10/17
to Linas Vepstas, opencog, link-grammar, Ruiting Lian
Very cool stuff !!

So there are two things I'm thinking Ruiting can do in this regard, in
the near term...

1) run this on a larger corpus and see what happens

2) try various clustering approaches on the "feature structures" for
words implicit in the parse-trees you've gotten from this first phase

For this we would need the following...

For 1), we'd need some good instructions on how to replicate the
experiments you've just run (potentially on an additional text corpus)

For 2), we'd need an Atomspace (Scheme file, postgres dump, whatever)
containing the first-pass parses you've obtained for the sentences in
your test corpus

Can you share these w/ Ruiting sometime soon?

thanks!
--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

Linas Vepstas

unread,
May 10, 2017, 11:28:05 PM5/10/17
to Ben Goertzel, opencog, link-grammar, Ruiting Lian
OK, yes,

for part 1) I think the README file explains almost all but the newest and greatest steps in detail.  I'll update it shortly to add the newest steps.  If there's confusion, ask.  I keep detailed notes mostly because I can't remember how to replicate any of this stuff, myself.

It would really, really help if someone could find & prepare some clean text of some kind of adventure novels or young-adult lit, or any kind of narrative literature.  Maybe from Project Gutenberg. I've discovered that Wikipedia has 3 major faults:
* it has very few action verbs
* it's got lots of lists and tables and names and dates, and weird punctuation.
* it's got lots of articles about movies and rock bands with stupid names, all of which glom up the statistics with garbage. That, and lots of geographical and product names and model numbers, which add lots of obscure or nonsense words, and don't help with grammar at all.  Again: lists of names, dates, football leagues, awards, recording contracts, run-times, publisher names.

This is almost my new #1 priority, I think.  Any help would help.

for part 2) it would be a database dump.  To figure out what to do with it, working at least partway through part 1) would clarify what's in there.

Some versions of some of these databases are getting huge -- 50 million or 100 million atoms, which take around a kbyte of RAM each, so 50 or more GB to load it into RAM.  So typically, you don't want to actually load it all... except during MST parsing.
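For concreteness, the RAM estimate above can be sketched as a rough back-of-the-envelope calculation, using the ~1 kbyte/atom figure quoted in the message:

```python
# Rough RAM estimate for loading a full atomspace, assuming ~1 kbyte per
# atom (the figure quoted above; actual size varies by atom type).
def atomspace_ram_gb(n_atoms, bytes_per_atom=1024):
    return n_atoms * bytes_per_atom / 2**30

print(round(atomspace_ram_gb(50_000_000)))   # ~48 GB for 50M atoms
print(round(atomspace_ram_gb(100_000_000)))  # ~95 GB for 100M atoms
```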

There are two hard parts to clustering. One is writing all the code to get the clusters working in the pipeline.  I guess I'll have to do that.  The other is dealing with words with multiple meanings: "I saw the man with the saw", and clustering really needs to distinguish "saw" the verb from "saw" the noun.  I'm not yet clear about the details of this; I've a glimmer of the general idea...

--linas

Ben Goertzel

unread,
May 10, 2017, 11:29:41 PM5/10/17
to link-grammar, opencog, Ruiting Lian
On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> The other is dealing with words with multiple meanings: "I saw the man with
> the saw" and clustering really needs to distinguish saw the verb from saw
> the noun. Not yet clear about the details of this. i've a glimmer of the
> general idea, ...


yeah, Ruiting and i have our own ideas about that too, and would like
to perhaps explore those
through experimentation w/ some of the data generated by your first
phase of processing ...

Ben Goertzel

unread,
May 10, 2017, 11:30:55 PM5/10/17
to link-grammar, bitseat tadesse, opencog, Ruiting Lian
Hi Linas,

> It would really really help if someone could find & prepare some clean text
> of some kind of adventure novels or young-adult lit, or any kind of
> narrative literature. Maybe from project gutenberg.

I think Bitseat can help with that, with some guidance from Ruiting...

ben

Ben Goertzel

unread,
May 11, 2017, 4:39:38 AM5/11/17
to link-grammar, bitseat tadesse, zelalem fantahun, opencog, Ruiting Lian
>
> I think Bitseat can help with that, with some guidance from Ruiting...
>
> ben


Actually Bitseat is probably too busy with pattern mining on the English/Lojban
corpus, but Zelalem may be able to handle this... I'll ask him

Ben Goertzel

unread,
May 11, 2017, 5:41:01 AM5/11/17
to link-grammar, opencog, Ruiting Lian
On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <linasv...@gmail.com> wrote:
> There are two hard parts to clustering. One is writing all the code to get
> the clusters working in the pipeline. I guess I'll have to do that. The
> other is dealing with words with multiple meanings: "I saw the man with the
> saw" and clustering really needs to distinguish saw the verb from saw the
> noun. Not yet clear about the details of this. i've a glimmer of the
> general idea,

I was thinking to explore addressing this with (fairly shallow) neural
networks ...

This paper

https://nlp.stanford.edu/pubs/HuangACL12.pdf

which I've pointed out before, does unsupervised construction of
word2vec type vectors for word senses (thus, doing sense
disambiguation sorta mixed up with the dimension-reduction process)

Now that algorithm takes sentences as inputs, not parse trees. But I
think you could modify the approach to apply to our context, in an
interesting way...

The following describes one way to do this. I'm sure there are others.

1) A first step would be to use the OpenCog pattern miner to mine the
surprising patterns from the set of parse trees produced by MST
parsing.

2) Then, one could associate with each word-instance W a set of
instance-pattern-vectors. Each instance vector is very sparse, and
contains an entry for each of the patterns (among the surprising
patterns found in step 1) that W is involved in. Given these
instance-pattern-vectors, one can also calculate word-pattern-vectors
or word-sense-pattern-vectors (via averaging the instance-vectors for
all instances of the word or word-sense)
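A minimal sketch of step 2, under stated assumptions: instance vectors are sparse dicts keyed by pattern, and a word-pattern-vector is the average of the instance vectors. The pattern names here are made up for illustration; they stand in for whatever surprising patterns the miner finds in step 1.

```python
from collections import defaultdict

# Hypothetical sketch: each word-instance gets a sparse vector over the
# mined patterns it participates in; the word vector is the average of
# its instance vectors.  Pattern ids below are invented.
def word_pattern_vector(instance_vectors):
    """Average a list of sparse instance-pattern-vectors (dicts)."""
    sums = defaultdict(float)
    for vec in instance_vectors:
        for pattern, count in vec.items():
            sums[pattern] += count
    n = len(instance_vectors)
    return {p: s / n for p, s in sums.items()}

# Two instances of the word "saw", each involved in a few mined patterns:
inst1 = {"pat:subj-verb": 1.0, "pat:verb-obj": 1.0}
inst2 = {"pat:verb-obj": 1.0, "pat:det-noun": 1.0}
print(word_pattern_vector([inst1, inst2]))
# {'pat:subj-verb': 0.5, 'pat:verb-obj': 1.0, 'pat:det-noun': 0.5}
```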

3) Their algorithm involves an embedding matrix L that maps: a binary
vector with a 1 in position i representing the i'th word in the
dictionary, into a much smaller dense vector. I would suggest
instead having an embedding matrix L that maps the pattern-vectors
representing words or senses (constructed in step 2) into a much
smaller dense vector. This is word2vec-ish, but the data it's drawing
on is the set of patterns observed in a corpus of parse trees...

4) Their algorithm involves, in the local score function, using a
sequence [x1, ..., xm], where xi is the embedding vector assigned to
word i in the sequence being looked at. Instead, we could use a
structure like the following, where w is the word being predicted and
S is the sentence containing w,

[ avg. embedding vector of words one link to the left of w in the
parse tree of S, avg. embedding vector of words one link to the right
of w in the parse tree of S, avg. embedding vector of words two links
to the left of w in the parse tree of S, avg. embedding vector of
words two links to the right of w in the parse tree of S]

This context-matrix is a way of capturing "the embedding vectors of
the words constituting the context of w in parsed sentence S" as a
linear vector... Stopping at "two links away" is arbitrary, probably
we want to go 4-5 links away (yielding a vector of length 8-10); this
would have to be experimented with...
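The context-matrix construction in step 4 can be sketched as follows. This is a hedged toy version: the parse links, embeddings, and word positions are all invented data, and "left/right" is read as earlier/later in the sentence. Link-distance is computed by breadth-first search over the (undirected) parse links.

```python
from collections import deque

# Hedged sketch of step 4: the context vector for word w, averaging the
# embedding vectors of words 1 and 2 links away in the parse tree, split
# into left (earlier) and right (later) words.  All data below is toy.
def context_vector(w, links, embed, positions, max_dist=2):
    dim = len(next(iter(embed.values())))
    # BFS over undirected parse links: link-distance of each word from w
    dist = {w: 0}
    queue = deque([w])
    while queue:
        u = queue.popleft()
        for a, b in links:
            for v in [x for x in (a, b) if u in (a, b) and x != u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    ctx = []
    for d in range(1, max_dist + 1):
        for side_is_left in (True, False):
            words = [v for v, dv in dist.items()
                     if dv == d and (positions[v] < positions[w]) == side_is_left]
            avg = ([sum(embed[v][i] for v in words) / len(words)
                    for i in range(dim)] if words else [0.0] * dim)
            ctx.extend(avg)
    return ctx

# Toy MST parse of "the dog chased a cat":
links = [("chased", "dog"), ("dog", "the"), ("chased", "cat"), ("cat", "a")]
embed = {"the": [0, 1], "dog": [1, 0], "chased": [0, 0],
         "a": [0, 1], "cat": [1, 1]}
positions = {"the": 0, "dog": 1, "chased": 2, "a": 3, "cat": 4}
print(context_vector("chased", links, embed, positions))
# [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Going out to 4-5 links, as suggested, just means raising `max_dist`, which lengthens the vector proportionally.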

...

Given these changes, one could apply the algorithm in the paper for
sense disambiguation and clustering...

Of course, there would also be a lot of other ways to mix up the same
ingredients mentioned in the above ... the two unique ingredients I
have introduced are

* creating dense vectors for words or senses from pattern-vectors

* creating context-matrices partly capturing the context of a
word-instance (or word or sense) based on a corpus of parse trees...

...and one could play with these in many different ways.

To put it more precisely, there are a lot of ways that one could iteratively

-- cluster word-instances based on their context-matrices (thus
generating word labels)

-- learn an embedding matrix (starting from pattern-vectors) that
enables accurate skip-gram prediction based on knowing the labels of
the words produced by the clustering done in the preceding step

Mimicking the algorithm from the above paper (with the changes I've
suggested) is one way to do this but there are lots of other ways one
could try...
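The cluster-instances-by-context half of that iteration can be illustrated with a toy sketch, under heavy assumptions: hand-made 2-d "context vectors", and a plain k-means loop standing in for whatever clustering is actually used. This only illustrates the assignment step that generates word-sense labels, not the paper's pipeline.

```python
# Toy k-means: cluster word-instance context vectors, so that instances
# of an ambiguous word like "saw" split into sense groups.  The vectors
# and the k=2 choice are illustrative assumptions.
def kmeans(vectors, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assign each vector to its nearest centroid
        clusters = [[] for _ in centroids]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[d.index(min(d))].append(v)
        # recompute centroids as cluster means
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Context vectors for four instances of "saw": verb-like vs noun-like.
instances = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]
cents, cls = kmeans(instances, centroids=[[1.0, 0.2], [0.2, 1.0]])
print(cls)  # [[[0.9, 0.1], [1.0, 0.0]], [[0.1, 0.9], [0.0, 1.0]]]
```

The second half of the iteration (re-learning the embedding matrix from the new labels) is the expensive part; this only shows the labeling step.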

-- Ben

Andi

unread,
May 11, 2017, 5:48:27 AM5/11/17
to opencog, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
does something like this help?
Tom Sawyer -clean.txt

Andi

unread,
May 11, 2017, 7:16:23 AM5/11/17
to opencog, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
War and Peace -clean.txt

Kirk Reiser

unread,
May 11, 2017, 3:40:47 PM5/11/17
to opencog, link-grammar, bitseat tadesse, zelalem fantahun, Ruiting Lian
I have a fairly large collection of books in txt format mostly in the
science-fiction genre if that would be useful. If so let me know what
would be helpful and I'll try to accommodate you.

Kirk
--
Well that's it then, colour me secure!


Linas Vepstas

unread,
May 11, 2017, 6:28:31 PM5/11/17
to Andi, opencog, Ben Goertzel, link-grammar, Ruiting Lian
Hi Andi,

Yeah, that's ideal.  Did you do this with a script, or by hand?  In my ideal world, there's some script that downloads a bunch of these from Project Gutenberg, strips out the license boilerplate, and puts them into some directory. Busting them up into chapters would be nice, too, so that if the cogserver chokes and dies, or I have to kill it, it can pick up where it left off, more or less.
 

Linas Vepstas

unread,
May 11, 2017, 6:30:32 PM5/11/17
to opencog, link-grammar, bitseat tadesse, zelalem fantahun, Ruiting Lian
On Thu, May 11, 2017 at 2:40 PM, Kirk Reiser <ki...@reisers.ca> wrote:
I have a fairly large collection of books in txt format mostly in the
science-fiction genre if that would be useful. If so let me know what
would be helpful and I'll try to accommodate you.

That would be great!  The way you word this hints that they're not all in the public domain.  I guess that's OK.

--linas
 
--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+unsubscribe@googlegroups.com.
To post to this group, send email to ope...@googlegroups.com.
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/alpine.DEB.2.20.1705111538480.25486%40befuddled.reisers.ca.

For more options, visit https://groups.google.com/d/optout.

Linas Vepstas

unread,
May 11, 2017, 8:35:22 PM5/11/17
to Andi, opencog, Ben Goertzel, link-grammar, Ruiting Lian
Hi Andi,

I just created a really ugly shell script to split Project Gutenberg files into little bite-size pieces.  It's user-unfriendly, but it works.

--linas

#! /bin/bash

# Split big Project Gutenberg files into parts.
# Takes two arguments: the first argument is the filename to split,
# the second is the prefix for the filenames to generate.

# Take the file in argument 1, and replace all double-newlines
# by the control-K character.
cat "$1" | sed ':a;N;$!ba;s/\n/xxx-foo-xxx/g' > xxx
cat xxx | sed 's/xxx-foo-xxx\rxxx-foo-xxx/\n\x0b\n/g' > yyy
cat yyy | sed 's/\rxxx-foo-xxx/\n/g' > zzz

# split the file along control-K into parts with 50 paragraphs each.
# split -t ' ' zzz poop-
split -l 50 -t '
                ' --filter=' sed "s/
                                    //g" > $FILE' zzz $2

# remove temps
rm xxx yyy zzz

Kirk Reiser

unread,
May 11, 2017, 10:59:24 PM5/11/17
to opencog, link-grammar, bitseat tadesse, zelalem fantahun, Ruiting Lian, Kirk Reiser
Right. I think, though, that because we wouldn't be distributing them,
they should be just fine for research.

They are pretty well contained in one file for one book. There is
publisher information at the top and all the extra blank lines greater
than two have been stripped out.

I'll send you a sample half dozen or so off list to have a look at.

Kirk

er: Your private mail address isn't on the To/Cc list above, so send me
a note to my address and I'll return a link or attachment. Your
choice.

Ben Goertzel

unread,
May 12, 2017, 4:40:20 AM5/12/17
to link-grammar, opencog, Ruiting Lian
Digging deeper, I note that for learning the dimension-reduction
matrix the folks in that paper I referred to are using L-BFGS which is
basically a memory-optimized Newton's Method...

I was thinking about recent work by OpenAI using CMA-ES for neural net
weight learning (in a RL context) and then I remembered a paper about
hybridizing CMA-ES with Newton's Method,

http://www.dem.ist.utl.pt/engopt2010/Book_and_CD/Papers_CD_Final_Version/pdf/08/01534-01.pdf

I suspect this sort of hybridization can be very useful for NN
learning, combining the best of evolutionary learning and gradient
descent (as has been done previously in other contexts quite
frequently).... I imagine implementing this in Tensorflow would not
be extremely challenging for someone who knows the framework, as these
are all just pretty basic operations (matrix operations, radial basis
functions, etc.) ...

-- Ben

Ruiting Lian

unread,
May 12, 2017, 6:58:52 AM5/12/17
to ope...@googlegroups.com, Ben Goertzel, link-grammar, Ruiting Lian
Hi Linas,

On Thu, May 11, 2017 at 11:27 AM, Linas Vepstas <linasv...@gmail.com> wrote:
OK, yes,

for part 1) I think the README file explains almost all but the newest and greatest steps in detail.  I'll update it shortly to add the newest steps.  If there's confusion, ask.  I keep detailed notes mostly because I can't remember how to replicate any of this stuff, myself.

From the README file,

*********************************
8) Choose the corresponding `learn-pairs-??.scm` file, copy it from the
`run` directory to your working directory. Review and edit the
configuration as necessary. It contains the database credentials --
that's probably the only thing you need to change.

9) Start the various servers. Eventually, you can use the
`run-all-servers.sh` file in the `run` directory to do this; it creates
a byobu session with different servers in different terminals, where
you can keep an eye on them. However, the first time through, it is
better to do all this by hand. So:

9a) Start `run/relex-server-any.sh` in a terminal.

9b) Start the cogserver, in another terminal, as follows:
```
guile -l learn-pairs-??.scm

**********************************

by "learn-pairs-??.scm", I guess you meant "pair-count-??.scm", because I didn't find any file in the 'run' directory whose name starts with "learn-pairs"

So when I run "guile -l pair-count-en.scm", I got the following errors:

;;; ERROR: Syntax error:
;;; opencog/nlp/learn/pseudo-csets.scm:309:8: definition in expression context, where definitions are not allowed, in form (define n-tot (get-stashed-count))
;;; WARNING: compilation of /home/ruiting/hansonrobotics/opencog/opencog/opencog/nlp/learn/run/pair-count-en.scm failed:
;;; ERROR: Syntax error:
;;; opencog/nlp/learn/pseudo-csets.scm:309:8: definition in expression context, where definitions are not allowed, in form (define n-tot (get-stashed-count))
Backtrace:
           1 (primitive-load "/home/ruiting/hansonrobotics/opencog/o…")
           0 Exception thrown while printing backtrace:
ERROR: Wrong type (expecting output port): #<output: string 1593620>

ERROR: In procedure opencog-extension:
ERROR: Wrong type (expecting output port): #<output: string 1593e70>

Some deprecated features have been used.  Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information.  Set it to "no" to suppress
this message.

Then I tried to load the pair-count-en.scm file from the guile shell after starting the cogserver, and got the same error.

I had installed guile 2.2.2 before trying this.

Any idea how to fix it? Thanks a lot. 


--
Ruiting
   

Ruiting Lian

unread,
May 12, 2017, 7:24:35 AM5/12/17
to ope...@googlegroups.com, Ben Goertzel, link-grammar, Ruiting Lian
Here's the error message from the cogserver guile shell:  

========
Backtrace:
In ice-9/boot-9.scm:
 157: 15 [catch #t #<catch-closure 272ac60> ...]
In unknown file:
   ?: 14 [apply-smob/1 #<catch-closure 272ac60>]
In ice-9/boot-9.scm:
 157: 13 [catch #t #<catch-closure 272ab40> ...]
In unknown file:
   ?: 12 [apply-smob/1 #<catch-closure 272ab40>]
   ?: 11 [call-with-input-string "(load \"/home/ruiting/hansonrobotics/opencog/opencog/build/lib/pair-count-en.scm\")\n" ...]
In ice-9/boot-9.scm:
2320: 10 [save-module-excursion #<procedure 31e72d0 at ice-9/eval-string.scm:65:9 ()>]
In ice-9/eval-string.scm:
  44: 9 [read-and-eval #<input: string 320c1a0> #:lang ...]
  37: 8 [lp #]
In ice-9/boot-9.scm:
2320: 7 [save-module-excursion #<procedure 2598180 at ice-9/boot-9.scm:3961:3 ()>]
3966: 6 [#<procedure 2598180 at ice-9/boot-9.scm:3961:3 ()>]
1645: 5 [%start-stack load-stack ...]
1650: 4 [#<procedure 28c9bd0 ()>]
In unknown file:
   ?: 3 [primitive-load "/home/ruiting/hansonrobotics/opencog/opencog/build/lib/pair-count-en.scm"]
In opencog/nlp/relex-utils.scm:
 473: 2 [use-relex-server "127.0.0.1" 4445]
In ice-9/boot-9.scm:
 102: 1 [#<procedure 2598280 at ice-9/boot-9.scm:97:6 (thrown-k . args)> unbound-variable ...]
In unknown file:
   ?: 0 [apply-smob/1 #<catch-closure 272ab00> unbound-variable ...]

ERROR: In procedure apply-smob/1:
ERROR: In procedure module-lookup: Unbound variable: exact-integer?
ABORT: unbound-variable


Ruiting Lian

Andi

unread,
May 12, 2017, 9:03:05 AM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
This I did by hand.
You can have more of them...

Of course the Cogus should do all of this on his own - and he will!
But not today :).....

This is ANSI txt. Is that O.K., or do you prefer utf-8 or something else?

Andi

unread,
May 12, 2017, 9:05:32 AM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
Tarzan the Untamed.txt

Andi

unread,
May 12, 2017, 9:09:24 AM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
The upload with ANSI did not work.
Now all these files are Unicode.

What about German books?
Tarzan of the Apes.txt
Tarzan the Terrible.txt

Andi

unread,
May 12, 2017, 9:41:07 AM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
utf-8 seems to be the right format,
so further books will be utf-8.

Linas Vepstas

unread,
May 12, 2017, 2:53:45 PM5/12/17
to opencog, Ben Goertzel, link-grammar, Ruiting Lian
On Fri, May 12, 2017 at 5:58 AM, Ruiting Lian <lianli...@gmail.com> wrote:
Hi Linas,
*
8) Chose the corresponding `learn-pairs-??.scm` file, copy it from
   the `run` directory to your working directory. 

by "learn-pairs-??.scm", I guess you meant "pair-count-??.scm",
 
Yes.
 
because I didn't find any file in the 'run' directory whose name starts with "learn-pairs"

So when I run "guile -l pair-count-en.scm", I got the following errors:

;;; ERROR: Syntax error:
;;; opencog/nlp/learn/pseudo-csets.scm:309:8: definition in expression context, where definitions are not allowed, in form (define n-tot (get-stashed-count))

I think you need to do a git pull. For me, line 309 is something completely different:
 https://github.com/opencog/opencog/blob/master/opencog/nlp/learn/pseudo-csets.scm#L309


Any idea how to fix it? Thanks a lot. 

I also created an LXC container with everything in it, ready to go.  This morning, the container doesn't boot.  This is due to a brand-new systemd bug.  I cannot even begin to express how much I absolutely hate systemd. It goes way out of its way to stab you in the back, repeatedly, every few months, on purpose. It is the worst piece of software devised by humankind. It is actually worse than Windows.  I have lost vast amounts of time and productivity trying to work around systemd bugs.  Like right now: I just lost about 12 hours of work due to systemd, again, and I will lose another 12+ hours trying to debug it and work around it.  Stoopid POS.

I want an operating system that is stable, that works, that boots when I boot it. I do not want an operating system where I have to be a rocket surgeon to make it work. systemd sucks.

--linas

Linas Vepstas

unread,
May 12, 2017, 3:08:46 PM5/12/17
to Andi, opencog, Ben Goertzel, link-grammar, Ruiting Lian
Andi, thanks. UTF8 is the best.  It works for all languages, which is why I like it.

--linas

Linas Vepstas

unread,
May 12, 2017, 3:24:51 PM5/12/17
to Andi, opencog, Ben Goertzel, link-grammar, Ruiting Lian
On Fri, May 12, 2017 at 8:03 AM, Andi <gabil...@gmail.com> wrote:
This I did by hand.
You can have more of them...

Of course the Cogus should do all of this on his own - and he will!
But not today :).....

Doing things like this by hand is OK for a little while, but then it gets tedious. For others, it would be useful to have a shell script that downloads books from e.g. Gutenberg, using e.g. wget, then removes the boilerplate, converts to utf8 (using iconv), and, in my case, splits them into chunks holding a few hundred sentences each.
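The boilerplate-stripping and chunking steps could look roughly like this. A sketch under stated assumptions: the `*** START OF` / `*** END OF` marker lines are the usual Project Gutenberg convention, but their exact wording varies between books, so both the markers and the chunk size are assumptions to adjust.

```python
# Sketch: strip Project Gutenberg license boilerplate, then split the
# body into chunks of N paragraphs (paragraphs = blank-line-separated).
# The "*** START/END OF" markers vary per book -- treat as an assumption.
def strip_boilerplate(text):
    lines = text.splitlines()
    start = next((i + 1 for i, l in enumerate(lines)
                  if l.startswith("*** START OF")), 0)
    end = next((i for i, l in enumerate(lines)
                if l.startswith("*** END OF")), len(lines))
    return "\n".join(lines[start:end]).strip()

def chunk_paragraphs(text, per_chunk=50):
    paras = [p for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i + per_chunk])
            for i in range(0, len(paras), per_chunk)]

toy = ("header junk\n*** START OF THIS EBOOK ***\n"
       "Para one.\n\nPara two.\n\nPara three.\n"
       "*** END OF THIS EBOOK ***\nlicense junk")
body = strip_boilerplate(toy)
print(len(chunk_paragraphs(body, per_chunk=2)))  # 2
```

The downloading (wget) and charset conversion (iconv) steps would wrap around this; they're left out here since the URLs and encodings differ per book.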

Another one of my problems is that I also need good sources in other languages.

--linas
 

Andi

unread,
May 12, 2017, 4:01:06 PM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
The rest of Tarzan...
The Son Of Tarzan.txt
Tarzan and the Jewels of Opar.txt
Jungle Tales of Tarzan.txt
The Beasts of Tarzan.txt

Andi

unread,
May 12, 2017, 4:28:39 PM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
some German
Die Leiden des jungen Werther 1.txt
Die Leiden des jungen Werther 2.txt
BETRACHTUNG.txt
Buddenbrooks.txt
DER goldene Topf.txt

Andi

unread,
May 12, 2017, 4:30:05 PM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
some French
Le Nez d’un.txt
À SE TORDRE.txt
Deux et deux.txt
POUR CAUSE DE FIN DE BAIL.txt
CONTES HUMORISTIQUES.txt

Andi

unread,
May 12, 2017, 4:43:24 PM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
some Italian
SCRITTI.txt
Il trampolino per le stelle.txt
LA MIA PADRONA DI CASA.txt
Verso la cuna del mondo.txt
BRANDELLI.txt

Linas Vepstas

unread,
May 12, 2017, 4:50:48 PM5/12/17
to opencog, Ben Goertzel, link-grammar, Ruiting Lian
On Fri, May 12, 2017 at 1:53 PM, Linas Vepstas <linasv...@gmail.com> wrote:

I also created an LXC container with everything in it, ready to go.  

LXC container available at https://linas.org/lxc-nlp-containers/

--linas

Andi

unread,
May 12, 2017, 5:08:17 PM5/12/17
to opencog, gabil...@gmail.com, b...@goertzel.org, link-g...@googlegroups.com, rui...@hansonrobotics.com, linasv...@gmail.com
some Portuguese
---

Please notice: I am German and understand German+English texts.

I do not understand French, Italian or Portuguese.

Since the books are formatted very differently, it will not be easy to find the right text blocks with a script......

So much so far...
no more books from me.

except if really necessary :)

--Andi
AGULHA EM PALHEIRO.txt
AMOR DE PERDIÇÃO.txt
A FILHA DO CABIDA.txt
OS FIDALGOS DA CASA MOURISCA.txt
OS BRAVOS DO MINDELLO.txt
TRANSVIADO.txt

Linas Vepstas

unread,
May 12, 2017, 7:00:01 PM5/12/17
to Andi, opencog, Ben Goertzel, link-grammar, Ruiting Lian
OK, Thanks, Andi,

I speak French, and am likely to do that next. For that, I need to first solve the morphology problem, and before that I have some other things to get out of the way.

--linas

Ben Goertzel

unread,
May 13, 2017, 12:01:06 AM5/13/17
to link-grammar, opencog, Ruiting Lian
Thanks Linas! Ruiting has this stuff running on her PC at the Hanson
office, so she will try this out again on Monday I think...



Ruiting Lian

unread,
May 15, 2017, 6:55:26 AM5/15/17
to ope...@googlegroups.com, link-grammar, Ruiting Lian
Hi Linas,

FYI, I got through Step 00 to Step 10; now I have to wait for a few hours' downloading, then a few hours' unpacking, before trying the remaining steps...

--
Ruiting

On Sat, May 13, 2017 at 12:01 PM, Ben Goertzel <b...@goertzel.org> wrote:
Thanks Linas!  Ruiting has this stuff running on her PC at the Hanson
office, so she will try this out again on Monday I think...

On Sat, May 13, 2017 at 4:50 AM, Linas Vepstas <linasv...@gmail.com> wrote:
>
>
> On Fri, May 12, 2017 at 1:53 PM, Linas Vepstas <linasv...@gmail.com>
> wrote:
>>
>>
>> I also created an LXC container with everything in it, ready to go.
>
>
> LXC onctainer available on https://linas.org/lxc-nlp-containers/
>
> --linas
>

Ruiting Lian

unread,
May 17, 2017, 6:00:55 AM5/17/17
to ope...@googlegroups.com, link-grammar
Hi Linas,

It seems the attached file caused the cogserver to segfault and crash while running wiki-ss-en.sh. I removed the file and tried it again; hopefully there won't be more problems after many hours of running ;p I ran it on some Simple English Wikipedia corpus, around 300MB in size.

=====
terminate called recursively
terminate called recursively
terminate called recursively
[... "terminate called recursively" repeated ~90 more times ...]
Segmentation fault (core dumped)

================

Ruiting Lian
List of The Real Ghostbusters episodes

Ruiting Lian

unread,
May 17, 2017, 6:08:40 AM5/17/17
to ope...@googlegroups.com, link-grammar
I wonder if it's caused by one-word sentences, which have no word pairs? I mean those numbers in the list. 

Ruiting Lian

Linas Vepstas

unread,
May 17, 2017, 9:00:04 AM5/17/17
to link-grammar, ope...@googlegroups.com
I've never seen an error like that, during the parsing runs.

I have seen an error like that, when the cogserver has not been correctly linked with new libraries (or is linked with inconsistent versions of libraries.) So, for example: if you compile with new header files, (e.g. because a new method was added to some class) but then run the cogserver, linking to the old libraries (because the new libraries were not yet installed.)

The fix for this is to go back to cogutils, do git pull; make; make install, and then do the same for the atomspace and for opencog.  It's possible that you are linking to old junk in your build directory, so removing that first can also help.

--linas

Ruiting Lian

unread,
May 17, 2017, 9:12:39 AM5/17/17
to link-grammar, ope...@googlegroups.com
Hmmm, I pulled the latest version, cleaned the old installed files, and rebuilt everything (cogutils, atomspace and opencog) two days ago. I can try doing that again if I find the error still exists tomorrow. It's been running for 5 hours now, so I don't want to stop it.


Ruiting Lian


> To post to this group, send email to link-g...@googlegroups.com.
> Visit this group at https://groups.google.com/group/link-grammar.
> For more options, visit https://groups.google.com/d/optout.



--
Ruiting Lian

Linas Vepstas

unread,
May 17, 2017, 12:03:42 PM5/17/17
to link-grammar, opencog, Ruiting Lian
Hi Ben,

I'm confused by this email.

On Thu, May 11, 2017 at 4:40 AM, Ben Goertzel <b...@goertzel.org> wrote:
,

I was thinking to explore addressing this with (fairly shallow) neural
networks ...

This paper

https://nlp.stanford.edu/pubs/HuangACL12.pdf

which I've pointed out before, does unsupervised construction of
word2vec type vectors for word senses (thus, doing sense
disambiguation sorta mixed up with the dimension-reduction process)

I'm skimming that paper, but it makes my eyes glaze over. We are already getting better results than they get, so WTF?
 

1) A first step would be to use the OpenCog pattern miner to mine the
surprising patterns from the set of parse trees produced by MST
parsing.

But that is exactly what the disjuncts are. Do you not like the metric? Do you want a different one?

2) Then, one could associate with each word-instance W a set of
instance-pattern-vectors.

Well, but I've already got at least 3 different types of sparse vectors per word instance, and all of them give OK results.  I think the disjunct-based one gives the best results, but I haven't proved that yet.

We can add yet another vector to the mix, but honestly (see other email) baby-sitting the CPU while it crunches data takes about half my time, and writing code to do data analysis takes about another half. In between that, I get some scattered hours to actually do some data analysis, and read some email.

So I need to be very protective of where I spend my time.... I still find that this work is 1% inspiration and 99% mindless, thoughtless perspiration ...
 
 

3) Their algorithm involves an embedding matrix L that maps: a binary
vector with a 1 in position i representing the i'th word in the
dictionary, into a much smaller dense vector. 

Yes, this is called "clustering". This is the next step.
 
 I would suggest
instead having an embedding matrix L that maps the pattern-vectors
representing words or senses (constructed in step 2) into a much
smaller dense vector. 

Why do you think that some kind of linear transform is the best way to do clustering? Clustering usually works better when you allow it to do whatever, instead of forcing it to be linear (e.g. PCA, LSA)

Recall that we already know that we want to have hundreds of clusters. It's not obvious to me that PCA is effective at this size.  I've been mentally envisioning some sort of agglomerative clustering for the dimensional reduction step, rather than a linear transform of some kind ...
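To make the comparison concrete, a naive average-linkage agglomerative clusterer fits in a few lines of pure Python. The vectors and word labels below are toy data for illustration only (not from the pipeline), and the brute-force O(n^3) merge search would of course have to be replaced with something smarter for a real vocabulary:

```python
import math

def cosine_dist(u, v):
    """Cosine distance between two dense vectors (toy data, no zero-norm guard)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def agglomerate(points, n_clusters):
    """Naive average-linkage agglomerative clustering down to n_clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                d = sum(cosine_dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

words = ["dog", "cat", "run", "jump"]
vecs = [[10, 4, 0], [8, 5, 1], [0, 1, 9], [1, 0, 8]]
for c in agglomerate(vecs, 2):
    print(sorted(words[i] for i in c))
```

With these toy vectors, the two noun-like rows merge together and the two verb-like rows merge together.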

 

4) Their algorithm involves, in the local score function, using a
sequence [x1, ..., xm], where xi is the embedding vector assigned to
word i in the sequence being looked at.   

Ehh? We've got scoring functions out the wazoo. So far cosine similarity seems to be the best, from my poking around, I'm still planning on exploring some others.
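For reference, cosine similarity over sparse count vectors is just the following sketch, with each word stored as a {feature: count} dict; the disjunct labels below are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {feature: count} dicts."""
    # The dot product only ranges over shared features; everything else is zero.
    dot = sum(c * v[f] for f, c in u.items() if f in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy disjunct-count vectors for two words (labels are illustrative).
dog = {"Ds- Ss+": 10, "Ds- Os-": 4}
cat = {"Ds- Ss+": 8, "Ds- Os-": 5, "Wd- Ss+": 1}
print(cosine(dog, cat))
```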
 

This context-matrix is a way of capturing "the embedding vectors of
the words constituting the context of w in parsed sentence S" as a
linear vector...   Stopping at "two links away" is arbitrary, probably
we want to go 4-5 links away (yielding a vector of length 8-10); this
would have to be experimented with...

WTF? link-distances are all about what MST is doing. We already know, from psychology studies, from link-grammar, from published MST results, what the appropriate link lengths are. Viz, yes, most links are 1-2 words long, some are much much longer.  We even know these for various languages: e.g. link lengths for English have been decreasing for over 400 years -- link lengths for Old English are almost twice as long as modern English.  This all seems like a red herring .. we've got the technology for dealing with this.

Anyway, I don't see anything in that paper that is worth saving. It's old crap, we've been doing better for years, Rohit demonstrated that.

The missing next step is the dimensional reduction, and you suggest using linear matrix algos, but I don't see why these would be better than agglomerative clustering.  They seem to be harder to control, and gut instinct says they won't give good results.

--linas








Ben Goertzel

unread,
May 17, 2017, 12:50:25 PM5/17/17
to link-grammar, opencog, Ruiting Lian
> Anyway, I don't see anything in that paper that is worth saving. It's old
> crap, we've been doing better for years, Rohit demonstrated that.

Well, we have actually not demonstrated better results than those
Stanford guys on word sense disambiguation or unsupervised
part-of-speech learning.... Maybe we can get better results than them
using the stuff you and Rohit were doing, I dunno... i kinda doubt
it, but that's an empirical question...

> The missing next step is the dimensional reduction, and you suggest using
> linear matrix algos, but I don't see why these would be better than
> agglomerative clustering. They seem to be harder to control, and gut
> instinct says they won't give good results.

Our instincts/intuitions seem to differ here. But fortunately this is
a relatively straightforward empirical question...

My own experience and intuition is that agglomerative clustering is
crude and works pretty badly, and I think these NN techniques can do
better...

But we don't need to argue about this stuff.... I mean, the beauty of
this sort of work is that one has data and one can try different
algorithms and see what the results are like. You've done this
excellent work building the first-phase MST parses, so now we can take
the data from these first-phase MST parses and apply various
category-learning and disambiguation techniques to the data, and see
what kind of results come out...

I'm not trying to convince you to try any category-learning or
disambiguation methods that don't match your intuition. You can try
some methods based on your own instinct, and Ruiting and I will play
with some methods based on our own instincts, and we can compare
results.... If your instinct is right then Ruiting and I will have
wasted a little time experimenting with inferior techniques but we
will also have learned a bunch...

-- Ben

Linas Vepstas

unread,
May 17, 2017, 3:41:15 PM5/17/17
to link-grammar, opencog, Ruiting Lian
On Wed, May 17, 2017 at 11:50 AM, Ben Goertzel <b...@goertzel.org> wrote:
> Anyway, I don't see anything in that paper that is worth saving. It's old
> crap, we've been doing better for years, Rohit demonstrated that.

Well, we have actually not demonstrated better results than those
Stanford guys on word sense disambiguation or unsupervised
part-of-speech learning....  Maybe we can get better results than them
using the stuff you and Rohit were doing, I dunno...   i kinda doubt
it, but that's an empirical question...

If by "part of speech" you mean "average vertex degree", then yes ... but we've figured out that one reason for the bad data is that wikipedia doesn't have any verbs in it.  I'm hoping that parsing project-gutenberg adventure novels will fix this .. except that I just experienced big data loss, see other email.

I'm vaguely thinking of buying a pair of terabyte SSDs, because processing is definitely bottlenecked on disk I/O, but the prices for those disks remain expensive. I'm also concerned that the very high write volume might burn them out in a year.

My own experience and intuition is that agglomerative clustering is
crude and works pretty badly, and I think these NN techniques can do
better...
OK.

Have you done agglomerative clustering in these super-sparse, high-dimensional spaces?
 

But we don't need to argue about this stuff....  I mean, the beauty of
this sort of work is that one has data and one can try different
algorithms and see what the results are like. 

Yes OK.
 
 You've done this
excellent work building the first-phase MST parses,

Caution about terminology: the MST parses are discarded immediately after they are created; the only things that are saved are the counts of how often the word-disjunct pairs occur.
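Roughly, that counting step can be sketched as follows; the connector notation here is simplified and illustrative (the real pipeline stores the counts as atoms in the atomspace, not in Python dicts):

```python
from collections import Counter

def word_disjunct_pairs(words, links):
    """From one MST parse (a list of (left_idx, right_idx) links), emit
    (word, disjunct) pairs.  A disjunct is the word's set of connectors:
    the attached word plus '-' for a link to the left, '+' to the right."""
    conns = {i: [] for i in range(len(words))}
    for l, r in links:
        conns[l].append((r, words[r] + '+'))
        conns[r].append((l, words[l] + '-'))
    for i, cs in conns.items():
        if cs:
            # Order connectors by the position of the word they attach to.
            dj = ' & '.join(c for _, c in sorted(cs))
            yield words[i], dj

counts = Counter()
# Toy parse of "the cat sat", linking the-cat and cat-sat.
for w, dj in word_disjunct_pairs(["the", "cat", "sat"], [(0, 1), (1, 2)]):
    counts[(w, dj)] += 1      # only the counts survive; the parse is discarded
for pair, n in sorted(counts.items()):
    print(pair, n)
```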
 
--linas

Ruiting Lian

unread,
May 22, 2017, 7:07:57 AM5/22/17
to ope...@googlegroups.com, Ben Goertzel, link-grammar, Ruiting Lian
Hi Linas,
```
(use-modules (opencog) (opencog persist) (opencog persist-sql))
(use-modules (opencog nlp) (opencog nlp learn))
(sql-open "postgres:///en_pairs?user=ubuntu&password=asdf")
(batch-all-pairs)
```
I assume line 408 of the README file is supposed to be (batch-any-pairs)? Got the following error while using (batch-all-pairs):

ERROR: In procedure apply-smob/1: 
ERROR: Unbound variable: batch-all-pairs 
ABORT: unbound-variable


FYI, the wiki-ss-en.sh didn't finish processing all the articles (from the 300MB simple english wikipedia corpus) to atom tables before my computer crashed. It submitted around 20,000 articles (about 22MB in size, according to the "submitted-articles" folder it generated) to the database.  It has 16611283 atoms so far. 

learn-pairs=# SELECT count(uuid) FROM Atoms;

  count   
----------
 16611283
(1 row)

The above process has already taken around 28 hours, so I decided to move on with the small corpus first. 

--
Ruiting Lian


Ruiting Lian

unread,
May 22, 2017, 9:10:27 AM5/22/17
to ope...@googlegroups.com, Ben Goertzel, Ruiting Lian, link-grammar
Also, it seems that "common.scm" should be loaded in addition to "compute-mi.scm" and "batch-word-pair.scm" before running batch-any-pairs; otherwise the following error will pop up after an hour of computing... 

======
guile-en> (batch-any-pairs)
Start loading words ...
Elapsed time to load words: 3 secs
Done loading words, now loading pairs
Elapsed time to load ANY-link pairs: 53 secs
Finished loading any-word-pairs
Support: num left=67735 num right=67735
Done with wild-card count N(*,w) and N(w,*) in 28 secs
Done computing N(*,*) total-count=17704056.0 in 2 secs
Start computing log P(*,w)
Done computing 67735 left-wilds in 3 secs
Done storing 67735 left-wilds in 621 secs
Done with -log P(*,w), start -log P(w,*)
Done computing 67735 right-wilds in 2 secs
Done storing 67735 right-wilds in 701 secs
Done computing -log P(w,*) and <-->
Going to do individual word-pair MI
Backtrace:
           6 (apply-smob/1 #<catch-closure 2d71420>)
           5 (apply-smob/1 #<catch-closure 2d71380>)
In ice-9/boot-9.scm:
   2316:4  4 (save-module-excursion _)
In ice-9/eval-string.scm:
     38:6  3 (read-and-eval #<input: string 44cdee0> #:lang _)
In /home/ruiting/hansonrobotics/opencog/opencog/build/../opencog/nlp/learn/compute-mi.scm:
   513:33  2 (batch-all-pair-mi _)
In ice-9/boot-9.scm:
   759:25  1 (dispatch-exception 0 unbound-variable ("module-look…" …))
In unknown file:
           0 (apply-smob/1 #<catch-closure 2d71340> unbound-variable …)

ERROR: In procedure apply-smob/1:
ERROR: In procedure module-lookup: Unbound variable: make-progress-rpt
ABORT: unbound-variable
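For reference, the wild-card counts and pairwise MI that this log is working through amount to MI(l,r) = log2 [ N(l,r) N(*,*) / (N(l,*) N(*,r)) ]. A minimal sketch on toy pair counts (not the actual Scheme code):

```python
import math
from collections import Counter

# Toy word-pair counts N(l, r); the pipeline accumulates these first.
N = Counter({("the", "cat"): 6, ("the", "dog"): 3, ("a", "cat"): 1})

total = sum(N.values())          # N(*,*)
left = Counter()                 # N(l,*), the right-wild-card counts
right = Counter()                # N(*,r), the left-wild-card counts
for (l, r), n in N.items():
    left[l] += n
    right[r] += n

def mi(l, r):
    """Pointwise MI: log2 [ P(l,r) / (P(l,*) P(*,r)) ]."""
    return math.log2(N[(l, r)] * total / (left[l] * right[r]))

print(mi("the", "cat"))
```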




--
Ruiting Lian

Andi

unread,
May 22, 2017, 4:54:01 PM5/22/17
to opencog, b...@goertzel.org, rui...@hansonrobotics.com, link-g...@googlegroups.com
Just a vague intuition: could it be (match-any-pairs)?