Preliminary results of C++ observe-text


Curtis Faith
Jun 8, 2017, 10:10:18 AM
to Ben Goertzel, Ruiting Lian, ope...@googlegroups.com
I got the sentence word extractor and plumbing working to replace the Scheme observe-text function with C++ equivalents, and it now creates atoms. I set it up as a CogServer command so that, in the future, we can have multi-threaded input with many threads adding atoms at the same time. The first tests, though, are not multi-threaded, and the initial runs were on a single core of a MacBook Air.

There are several things I am not doing, because I didn't think you needed all the atoms that observe-text creates: no connection to the RelEx server, no Link Grammar parses, etc.

I am creating all the possible ordered pairs and counting them in a manner analogous to what update-clique-pair-counts does in link-pipeline.scm. But I can easily adjust the algorithm so it doesn't output pairs that are farther apart than some distance N, if you desire.
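The counting loop with an optional distance cutoff could be sketched roughly like this (names are illustrative, not the actual observe-text code; a limit of 0 is assumed to mean "no limit"):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count every ordered pair (words[i], words[j]) with i < j, optionally
// skipping pairs whose distance j - i exceeds max_dist (0 = no limit).
using PairCounts = std::map<std::pair<std::string, std::string>, int>;

PairCounts count_pairs(const std::vector<std::string>& words, size_t max_dist)
{
    PairCounts counts;
    for (size_t i = 0; i < words.size(); i++) {
        for (size_t j = i + 1; j < words.size(); j++) {
            if (max_dist != 0 && j - i > max_dist)
                break;  // every later j is even farther away
            counts[{words[i], words[j]}]++;
        }
    }
    return counts;
}
```

Breaking out of the inner loop (rather than continuing) is what makes the cutoff cheap: once a pair is too far apart, all later right-hand words are farther still.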

The atoms it generates are of this form, as per link-pipeline.scm.

;     EvaluationLink
;         PredicateNode "*-Sentence Word Pair-*"
;         ListLink
;             WordNode "lefty"  -- or whatever words these are.
;             WordNode "righty"
;
;     ExecutionLink
;         SchemaNode "*-Pair Distance-*"
;         ListLink
;             WordNode "lefty"
;             WordNode "righty"
;         NumberNode 3

I have attached a sample output for one sentence: "Ben likes his ice cream, and Jerry does too." Please let me know if you need any other types of atoms or counts.

I was thinking it might be good to output adjacent pairs, and adjacent triplets, separately from the other pairs, using different PredicateNodes, since adjacency carries more information than simple co-occurrence in a sentence. It might also be good to have separate sentence predicates containing all the words in each sentence, since there is no link-grammar parse and the sentences can't be reconstructed from the information in this step.


Speed

The interesting part is how much faster it is at processing the full Pride and Prejudice test file.

On my 1.7 GHz Intel Core i7 MacBook Air, the test that took 3 or 4 hours on the 6-core Dell with observe-text in Scheme generates 2,690,934 atoms in only 81 seconds using C++ on a single core. At this speed, the Wikipedia file that Ruiting was using should take only a few hours to process.

For Pride and Prejudice, sentences were averaging 2.2 seconds at 1100% CPU on the Dell. On my Mac it hits about 92% of one CPU and averages 0.0137 seconds per sentence.

The Dell running Guile used 4.2G of RAM. The C++ code uses 1.8G of RAM.

As mentioned above, there is also a lot of work I am not doing: no RelEx server parses, no link-grammar, etc. But that work accounted for only 0.15 seconds of the 2.2 seconds per sentence the last time I measured.

- Curtis

test_output_atoms.txt

Linas Vepstas
Jun 8, 2017, 5:09:47 PM
to opencog, Ben Goertzel, Ruiting Lian
Just to make you happy, I tuned down the GC from running every sentence to running every 20 sentences. Perhaps this will make you happier. I also turned off the collection of the kind of data that you are not interested in; this will cut down the number of database writes by more than half. I also made the clique pair-counting code easier to adjust.

All of this is in the latest pull req. Of course, you too could have turned down, or turned off, GC if you had wanted to; then you would not have gotten the awful numbers that you so badly did not like.

But when you collect more information than you need or want, and you perform more GC than you need or want, then, yes, you will get poor performance.

--linas


Curtis Faith
Jun 8, 2017, 6:55:43 PM
to ope...@googlegroups.com, Ben Goertzel, Ruiting Lian
> Of course, you too could have turned down, or turned off GC if you wanted
> to, then you would not have gotten the awful numbers that you so badly did
> not like.

I could have and did. It made little difference. It was one of the first things I tried.

The 2.2-seconds-per-sentence times I got with Scheme/Guile already had:

1) the (gc) call in observe-text removed; this made no difference in timing, but did cause memory to leak at a somewhat higher rate, which led me to believe that (gc) was already being triggered at least once per sentence by the sheer volume of work.

2) no (store-atom h) or (fetch-atom h) calls in count-one-atom, so these times included absolutely no SQL reads or writes whatsoever.



Ben Goertzel
Jun 8, 2017, 10:24:01 PM
to Curtis Faith, Ruiting Lian, ope...@googlegroups.com
Hi Curtis,

> But I can easily adjust the algorithm so it doesn't output pairs that are
> further than some N distance, if you desire.

I am pretty sure we are going to want to experiment with that....
What will make sense is to do some experiments on a modest-sized
corpus to try to find a methodology that "seems to work", and then
repeat that methodology on a larger corpus... So as part of the
initial experimentation phase, we will probably want to try the "N
distance limitation" with a few values of N...

Also note that, for later phases in the process, we will want to
load "tagged sentences" rather than just sentences. In a tagged
sentence, each word will be associated with one or more category
labels. [But I don't envision more than, say, 10 labels associated
with an individual word... often it will be 1 or 2] We will then
want to update counts based on category labels as well as words.

(The mechanism described in the previous paragraph is relevant for
workflow in which one is doing the statistical learning process for
clustering/disambiguation partly outside the Atomspace, which is what
Ruiting and I aim to try first.... It's not relevant if one is doing
this statistical learning process within the Atomspace, which is what
Linas is about to try...)

So a couple next steps will be

1) make code to export sparse feature vectors for each word, where the
vector for word W has two entries for each other word V: one entry for V
on the left of W, and one entry for V on the right of W. For
instance: the entry in W's feature vector corresponding to "V on the
left of W" is based on the total weight of links pointing from V to W
(with V on the left) in the spanning-tree parses that Linas's code
finds...

2) try to run Shujing's pattern miner on the collection of MST parses
in the Atomspace. This typically requires some fiddling with the
templates and parameters for the pattern miner ... Shujing is good at
giving practical guidance on this, Bitseat and Tensae are not bad at
it either by this point...

The patterns mined by the pattern miner can be used to make more
sophisticated feature vectors to export, which can be tried as an
alternative to the simpler ones described in step (1) above. In
these more sophisticated feature vectors, a library of significant
patterns in the collection of MST parses will be found, and W's
feature vector will have an entry corresponding to "W occurs in
position k of pattern i in library" (for each i and k)
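The simpler left/right feature vectors of step (1) could be sketched like this (names are hypothetical; real weights would come from the spanning-tree parse links rather than being passed in directly):

```cpp
#include <map>
#include <string>
#include <utility>

// A sparse feature vector for word W: one entry per (direction, V)
// feature, i.e. "V on the left of W" and "V on the right of W".
enum class Side { Left, Right };
using Feature = std::pair<Side, std::string>;           // (direction, V)
using SparseVec = std::map<Feature, double>;
using FeatureTable = std::map<std::string, SparseVec>;  // W -> its vector

// One observed ordered pair (left, right) contributes a "right-of"
// feature to left's vector and a "left-of" feature to right's vector.
void add_pair(FeatureTable& table, const std::string& left,
              const std::string& right, double weight)
{
    table[left][{Side::Right, right}] += weight;
    table[right][{Side::Left, left}] += weight;
}
```

A std::map keyed on (direction, word) keeps the vectors sparse: only features that were actually observed take up space, which matters when the implicit dimension is twice the vocabulary size.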

We can discuss all this on Monday when Ruiting and I are back in HK,
I'm just putting it down in an email while it's fresh in my mind...

-- Ben


thanks
ben


--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

Curtis Faith
Jun 10, 2017, 6:52:30 AM
to Ben Goertzel, Ruiting Lian, ope...@googlegroups.com
Hey Ben, you wrote:

> So as part of the initial experimentation phase, we will probably want
> to try the "N distance limitation" with a few values of N...

This now works, and I did some testing to check both performance and resource usage for various values of N (which I am calling the Pair Distance Limit below, following the notes in opencog/nlp/learn/observe_cc_notes.txt):

Pair Distance           Atoms                         Total    Observe   Ops per
Limit                   Added             RAM          Time       Time    Second
-------------           -----           -----         -----    -------   -------
1                     164,574          0.167G           13s         4s     1,483
2                     334,716          0.297G           16s         7s       848
3                     493,364          0.410G           20s        11s       539
6                     896,482          0.715G           35s        26s       228
12                  1,473,987          1.084G           47s        38s       156
All pairs           2,690,934          1.949G           87s        78s        76

Noop (just send text)       0          0.044G            9s         0s       N/A

NOTE:
Observe Time = Total Time - Noop Time

I submitted pull request #2766 for this in order to make it easier for Ruiting to pull down my changes to try this tomorrow. Now that the plumbing is in place, it's pretty easy to add anything else you'd need for this.
