Performance improvement suggestion for the LALR parser generator used by GHOST


Xabush Semrie

29 Jun 2019, 11:04:48
to opencog
Hi,

I have recently been working on an LALR parser that converts Atomese to JSON (code can be found here). I initially used the same LALR parser generator that GHOST uses, found in the (system base lalr) module, with a similar lexer (in my case I precompiled the regex patterns for a performance gain). However, performance was very bad: it took far too long to parse moderately sized Atomese files. It didn't help that the module doesn't provide its own lexer generator, and in the GHOST code the regex patterns are not precompiled, which degrades performance further. As a result, I started looking at alternatives and found the nyacc project.
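The point about precompiling can be illustrated with a small tokenizer. This is a minimal sketch in Python rather than Guile, and it is not the lexer from the linked code or from GHOST; the token names and the `tokenize` function are assumptions made for illustration. The pattern is compiled once at module load and reused for every token. In Guile, the analogous move is to build the pattern once with `make-regexp` and reuse it with `regexp-exec`, rather than handing a pattern string to `string-match` on every call.

```python
import re

# One master pattern, compiled a single time when the module is loaded.
# The token names (LPAREN, STRING, ...) are illustrative, not taken from GHOST.
TOKEN_RE = re.compile(r"""
    (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<SYMBOL>[^\s()"]+)
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs for an Atomese s-expression."""
    pos = 0
    while pos < len(text):
        # Reuse the precompiled pattern; nothing is recompiled per token.
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":  # drop whitespace between tokens
            yield match.lastgroup, match.group()


if __name__ == "__main__":
    for token in tokenize('(GeneNode "MAP2K4")'):
        print(token)
```

Python's `re` module keeps a cache of recently compiled patterns, so the gain from precompiling is likely smaller in Python than in a lexer that rebuilds its patterns on every call; the sketch only shows the structure.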

After rewriting the code using nyacc, I found that the nyacc parser generator is on average 5-6X faster than the previous parser generator (the one used by GHOST) for the same file. In addition to the performance improvement, it removes the need to provide a manually written lexer, supports mid-rule context actions for complicated production rules, has better debugging and "logging" capabilities, and (although minor) doesn't require listing all the terminal symbols. The project is also being actively developed.

Hence, I figured the GHOST parser could benefit from the same performance improvements and thought I'd share this here. I am happy to work on porting the LALR parser from the current generator to nyacc if this gets traction.

Linas Vepstas

29 Jun 2019, 16:43:55
to opencog
Dumb question: why the heck would you need to "parse" atomese? Especially since it already comes with a built-in parser?  What are you actually trying to do? --linas

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To post to this group, send email to ope...@googlegroups.com.
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/f3d23857-71b2-40a8-b99d-86249f9bd71a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
cassette tapes - analog TV - film cameras - you

Xabush Semrie

29 Jun 2019, 16:54:44
to opencog

> why the heck would you need to "parse" atomese?
> What are you actually trying to do?

I am converting it to JSON for graph visualization with Cytoscape.js for an annotation service. For example,
(EvaluationLink
 (PredicateNode "expresses")
 (ListLink
    (GeneNode "MAP2K4")
    (MoleculeNode "Uniprot:Q5U0B8")))

The above will be "parsed" into the following JSON:
{
  "data": {"source": "MAP2K4", "target": "Uniprot:Q5U0B8", "name": "expresses", "group": "edges"}
}
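A minimal sketch of this conversion, written in Python for illustration. It is not the code linked above and it is not an LALR parser: it uses a small hand-written recursive reader instead, and the names `read_sexpr` and `evaluation_to_edge` are hypothetical. It assumes the input is a single EvaluationLink whose ListLink holds exactly two nodes, as in the example.

```python
import json
import re

# Tokens: parentheses, double-quoted strings, or bare symbols.
# Characters that match none of these (e.g. a stray quote) are silently skipped.
TOKEN_RE = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def read_sexpr(text):
    """Parse one s-expression into nested lists of strings."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing closing parenthesis")
            pos += 1  # consume the ")"
            return items
        if tok == ")":
            raise SyntaxError("unexpected closing parenthesis")
        if tok.startswith('"'):
            return tok[1:-1]  # strip the quotes; escapes are left as-is in this sketch
        return tok

    return read()


def evaluation_to_edge(expr):
    """Turn a parsed binary EvaluationLink into the edge format shown in the thread."""
    link_type, predicate, members = expr
    list_type, source, target = members
    if link_type != "EvaluationLink" or list_type != "ListLink":
        raise ValueError("expected (EvaluationLink (PredicateNode ...) (ListLink a b))")
    return {"data": {"source": source[1], "target": target[1],
                     "name": predicate[1], "group": "edges"}}


ATOMESE = '''(EvaluationLink
 (PredicateNode "expresses")
 (ListLink
    (GeneNode "MAP2K4")
    (MoleculeNode "Uniprot:Q5U0B8")))'''

print(json.dumps(evaluation_to_edge(read_sexpr(ATOMESE))))
```

Building the dictionary first and serializing it with `json.dumps` avoids escaping quotes by hand in a format string.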


> Especially since it already comes with a built-in parser?

Maybe I am confusing something here, but I didn't know any parser existed for my use case.

 

Linas Vepstas

29 Jun 2019, 16:59:01
to opencog
On Sat, Jun 29, 2019 at 3:54 PM Xabush Semrie <hsam...@gmail.com> wrote:

> I am converting it to JSON for graph visualization with Cytoscape.js for an annotation service.


Why not just dump directly from the atomspace?


> > Especially since it already comes with a built-in parser?
>
> Maybe I am confusing something here, but I didn't know any parser existed for my use case.

? Of course there is. It's called "the atomspace".

--linas

 

Linas Vepstas

29 Jun 2019, 17:16:24
to opencog
I  mean, one very low-brow, trivial way to do it would be to write:

(BindLink
   (VariableList
       (TypedVariable (Variable "SRC") (Type 'GeneNode))
       (TypedVariable (Variable "TGT") (Type 'MoleculeNode))
       (TypedVariable (Variable "XPS") (Type 'PredicateNode)))
   ; what you are looking for
   (Evaluation (Variable "XPS") (List (Variable "SRC") (Variable "TGT")))
   ; what to do when you find it
   (ExecutionOutput
        (GroundedSchema "scm:print-stuff")
        (List (Variable "SRC") (Variable "TGT") (Variable "XPS"))))

; and then define the printer:

(define (print-stuff src tgt xps)
   (format #t "{ \"data\": {\"source\": \"~A\", \"target\": \"~A\", \"name\": \"~A\", \"group\": \"edges\"}}"
       (cog-name src) (cog-name tgt) (cog-name xps))
   ; a return value
   xps)

I mean -- this is low-brow, simple, bordering on trite, but does what you want to do, for your example.  There are other ways of doing this that are even simpler, but the above is a good demo.  Maybe you need more sophisticated features, but the above is lots easier than trying to figure out LALR.   I mean, knowing what LALR is and having experience with it is a "good thing", but it's overkill for this particular problem.

--linas

Xabush Semrie

29 Jun 2019, 17:17:32
to opencog
> Why not just dump directly from the atomspace?

The atomspace has a lot of other atoms and links that might not be what the user is asking for; I'm parsing the output of a pattern matcher function.

Xabush Semrie

29 Jun 2019, 17:29:07
to opencog
I see your point. And for a simple use-case your method works. But in my case, I have the following requirements:

1. I have to return both the atomese and the parsed JSON to the user.

2. I am running several different pattern matching functions, aggregating their outputs, and parsing the result as a whole. That's why I am using the parser.

3. I have to create some links based on discovered patterns instead of directly returning a JSON string. For example, we have this kind of function:

(define (outputInteraction gene)
   (cog-execute!
      (BindLink
         (VariableList
            (TypedVariable (VariableNode "$a") (Type 'GeneNode))
            (TypedVariable (VariableNode "$b") (Type 'GeneNode)))
         ; the triangle of interactions to look for
         (And
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink gene (VariableNode "$a")))
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink (VariableNode "$a") (VariableNode "$b")))
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink gene (VariableNode "$b"))))
         ; what to build for each match
         (ExecutionOutputLink
            (GroundedSchemaNode "scm: generate-result")
            (ListLink (VariableNode "$a") (VariableNode "$b"))))))

And generate-result is something like:

(define (generate-result gene-a gene-b)
   (ListLink
      (EvaluationLink
         (PredicateNode "interacts_with")
         (ListLink gene-a gene-b))
      (node-info gene-b)
      (node-info gene-a)))


So based on the above points, I decided to write a custom parser.
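Requirement 2 (aggregating several pattern-matcher outputs and converting the result as a whole) could be sketched as a tree walk over already-parsed results. This is only an illustration under stated assumptions: the input is a nested-list tree such as a parser might produce, the `SetLink` wrapper in the usage example is illustrative, the function name `collect_elements` is hypothetical, and the node fields `id` and `type` are assumptions modeled on the edge format shown earlier in the thread. The output of `node-info` is not modeled, since its shape is not given here.

```python
import json


def collect_elements(tree):
    """Walk a nested-list Atomese tree and gather Cytoscape-style nodes and edges.

    Every binary EvaluationLink found anywhere in the tree becomes one edge,
    and each of its two endpoints becomes a node (deduplicated by name).
    """
    nodes = {}
    edges = []

    def visit(item):
        if not isinstance(item, list) or not item:
            return  # a bare name or type string: nothing to descend into
        if item[0] == "EvaluationLink":
            predicate, (_, source, target) = item[1], item[2]
            for atom_type, name in (source, target):
                # Hypothetical node shape; only the edge shape comes from the thread.
                nodes.setdefault(name, {"data": {"id": name, "type": atom_type,
                                                 "group": "nodes"}})
            edges.append({"data": {"source": source[1], "target": target[1],
                                   "name": predicate[1], "group": "edges"}})
            return
        for child in item[1:]:  # skip the link type, recurse into the members
            visit(child)

    visit(tree)
    return list(nodes.values()) + edges


# Two matches wrapped in one result set, as an aggregated query might return.
RESULT = ["SetLink",
          ["ListLink",
           ["EvaluationLink", ["PredicateNode", "interacts_with"],
            ["ListLink", ["GeneNode", "A"], ["GeneNode", "B"]]]],
          ["ListLink",
           ["EvaluationLink", ["PredicateNode", "interacts_with"],
            ["ListLink", ["GeneNode", "B"], ["GeneNode", "C"]]]]]

print(json.dumps(collect_elements(RESULT), indent=2))
```

Keying nodes by name means a gene that appears in several matches is emitted once, while every matched interaction still produces its own edge.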



Linas Vepstas

29 Jun 2019, 18:21:24
to opencog
Well, you are free to do whatever you want to do, but one of the points of having Atomese in the first place is to avoid having to go through such contortions. Why do you need the outputInteraction function?  ... the only reason I can see for "(GroundedSchemaNode "scm: generate-result")" is because you are trying to wrap "node-info" -- what does node-info do? Can you do it directly in the atomspace? Why write code in scheme?

I mean, yes, I write bucket-loads of scheme all the time, to get things done, but there's always the meta-question -- why, and how can it be made simpler?  The long-run goal is to eventually replace all scheme and python code with declarative Atomese that "does the same thing" - this is impossible in the short-run, but, in the back of your mind, always think "how can this be coded in a declarative manner?" instead of thinking of "how can I code this in a functional manner" or "a procedural manner" or "an OO style"?

--linas


Leung Man Hin

2 Jul 2019, 00:06:24
to ope...@googlegroups.com
Hi Xabush,

Performance-wise, I think it would be really nice to replace the LALR parser in Ghost with the nyacc parser for the 5-6X performance gain + better debugging and logging capabilities, if it's not too much work :)
