On 10 April 2011 06:56, Joel Pitt <jo...@opencog.org> wrote:
> Hi Tim -
>
> There are a number of files already in the RelEx compact output. But
> I'm not sure if these were generated from RelEx directly or not.
They are generated directly by RelEx.
> If
> they were, then converting to YAML would require updating this code.
Or rather, writing a converter. There's in excess of 10 cpu-years of
parsed wikipedia entries in there ...
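A converter along these lines could be fairly small. The sketch below is only an illustration, not part of RelEx: `to_yaml_block` and the `sections` dict are made-up names, and it assumes the section bodies have already been pulled out of the existing files as plain strings. It writes each section as a YAML literal block scalar (`|`), so the body text is copied through unchanged.

```python
def to_yaml_block(sections, indent=2):
    """Render {section_name: multi-line text} as YAML literal block scalars.

    Hypothetical helper: it handles only flat string sections, which is
    all that bodies like <features>, <relations> and <links> would need.
    """
    pad = " " * indent
    out = []
    for name, body in sections.items():
        out.append(f"{name}: |")
        for line in body.strip().splitlines():
            # Block-scalar content must be indented deeper than its key.
            out.append(pad + line.strip())
    return "\n".join(out) + "\n"


sections = {
    "relations": "_advmod(really[9], occur[10])\nin(record[5], book[8])",
    "links": "S(1, 10)\nMp(1, 2)",
}
print(to_yaml_block(sections))
```

Nested keys, quoting and document headers are deliberately left out; a real converter would have to deal with those as well.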
> I'm not a huge fan of XML,
Me neither, but the viable alternatives are rare. Never heard of YAML before.
> However, this was mostly designed (and used) by Linas, so it's up to him.
The current format is trivial to handle in perl, and I use perl scripts
to load these files into opencog.
(Note I said "handle", I almost said "parse" but thought better
of it: the perl utility scripts are not parsers in the comp-sci
sense of the word "parse"; they're just ad-hoc file rippers.
They just string-match the angle-bracket-xml headers and
rip out everything in between.)
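For illustration, this is roughly what that kind of ripping looks like, written in Python rather than perl. It is a sketch and not the actual utility scripts: the name `rip_sections` is made up, and the tag list is simply taken from the example quoted further down.

```python
import re

# Section tags as they appear in the compact-format example; assumed list.
SECTION_TAGS = ("constituents", "features", "relations", "links")


def rip_sections(text):
    """Return {tag: body} by string-matching <tag> ... </tag> pairs.

    Not an XML parser: no attribute handling, no nesting, no escaping.
    Only the first occurrence of each tag is picked up.
    """
    found = {}
    for tag in SECTION_TAGS:
        # Non-greedy match of everything between the open and close header.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m:
            found[tag] = m.group(1).strip()
    return found


sample = """<parse id="1">
<relations>
_advmod(really[9], occur[10])
_subj(occur[10], most[1])
</relations>
<links>
S(1, 10)
Em(9, 10)
</links>
</parse>"""

for tag, body in rip_sections(sample).items():
    print(tag, "->", body.splitlines())
```

This only works because the angle-bracket headers cannot be confused with the section contents.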
The YAML file format looks harder to handle, because
it's not obvious where sections start and end: for example,
colons and dashes occur in natural-language text; by contrast,
xml-like angle-brackets never do. This makes error-free
"ripping" of the YAML files more complicated.
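To make the concern concrete, here is a sketch that assumes a ripper which spots YAML keys by looking for `name:` at the start of a line, much as the perl scripts look for angle-bracket headers. The sample text and the name `naive_keys` are invented for the example.

```python
import re

# Naive rule: a line that starts with "name:" opens a new section.
KEY_RE = re.compile(r"^\s*([\w-]+):(\s|$)")


def naive_keys(text):
    """List every line prefix that the naive rule takes for a YAML key."""
    return [m.group(1) for line in text.splitlines()
            if (m := KEY_RE.match(line))]


yaml_text = """text: |
  He gave one warning: stay away.
  Note: the door was locked.
relations: |
  _subj(lock[5], door[3])
"""

# "Note" is sentence text inside the block scalar, not a key.
print(naive_keys(yaml_text))
```

A real YAML parser tracks indentation and block scalars and would not be fooled by the "Note:" line, but a string-matching ripper would have to reimplement that bookkeeping itself.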
> Also, I just finished writing a parser for generating Framenet
> statistics from the RelEx compact format, so probably won't be eager
> to change until after I've finished the statistics gathering ;-)
>
> Joel Pitt, PhD | http://ferrouswheel.me | +852 6683 9980
> M-Lab AI Project and OpenCog Developer | http://opencog.org
>
> On 10 April 2011 18:29, Tim McNamara <mcnama...@gmail.com> wrote:
>> I have come across the RelEx compact output wiki page[1], which discusses a
>> number of alternatives to expressing RelXML. From the state of the wiki, it
>> appears that a decision has not yet been made.
>> The requirements of the compact format are:
>>
>> compactness (easy to compress)
>> easily human readable
>> machine readable
>> provide metadata
>>
>> I believe that the YAML[2] file format matches the stated goals more
>> effectively than the current hybrid XML, primarily because YAML is much more
>> readable. It will also tend to produce smaller files, as it doesn't
>> require markup.
RelEx currently supports a half-dozen different printing formats; the directory
"output" in the source tree holds these. It should be relatively straightforward
to add another format. Just cut-n-paste one of the existing modules (the "cff"
is probably the best place to start) and tweak as desired.
I won't turn down submissions ...
>> Here is an example in XML:
>> <?xml version="1.0" encoding="UTF-8"?>
>>
>> <nlparse xmlns="http://opencog.org/RelEx/0.1.1">
>>
>> <parser>link-grammar-4.3.5\trelex-0.9.0</parser>
>>
>> <date>2008-06-27 23:47Z</date>
>> <source url="http://www.gutenberg.org/extext/74"/>
>>
>> <sentence index="1" parses="4">
>> Most of the adventures recorded in this book really occurred.
>>
>> <parse id="1">
>>
>> <lg-rank num_skipped_words="0" disjunct_cost="0" and_cost="0"
>> link_cost="20" />
>> <constituents>
>> (S (NP (NP Most) (PP of (NP (NP the adventures) (VP recorded (PP in (NP this
>> book)))))) (VP (ADVP really) occurred) .)
>> </constituents>
>> <features>
>> 1 most most noun
>> 2 of of prep
>> 3 the the det
>> 4 adventures adventure noun plural|definite
>> 5 recorded record verb past
>> 6 in in prep
>> 7 this this det
>> 8 book book noun singular|definite
>> 9 really really adv
>> 10 occurred occur verb past
>> 11 . . punctuation
>> </features>
>> <relations>
>> _advmod(really[9], occur[10])
>> in(record[5], book[8])
>> _obj(record[5], adventure[4])
>> of(most[1], adventure[4])
>> _subj(occur[10], most[1])
>> </relations>
>> <links>
>> S(1, 10)
>> Mp(1, 2)
>> Jp(2, 4)
>> Mv(4, 5)
>> MVp(5, 6)
>> Js(6, 8)
>> Dsu(7, 8)
>> Em(9, 10)
>> Dmc(3, 4)
>> Wd(0, 1)
>> Xp(0, 11)
>> </links>
>> </parse>
>> </sentence>
>> </nlparse>
>> The same in YAML:
>> ---
>>
>> nlparse:
>>   parser:
>>     - link-grammar-4.3.5
>>     - relex-0.9.0
>>   date: 2008-06-27 23:47Z
>>   source: http://www.gutenberg.org/extext/74
>>
>>   sentences:
>>     -
>>       index: 1
>>       parses: 4
>>       text: Most of the adventures recorded in this book really occurred.
>>       parse:
>>         id: 1
>>         lg-rank:
>>           num_skipped_words: 0
>>           disjunct_cost: 0
>>           and_cost: 0
>>           link_cost: 20
>>         constituents: |
>>           (S (NP (NP Most) (PP of (NP (NP the adventures) (VP recorded (PP in (NP this
>>           book)))))) (VP (ADVP really) occurred) .)
>>         features: |
>>           1 most most noun
>>           2 of of prep
>>           3 the the det
>>           4 adventures adventure noun plural|definite
>>           5 recorded record verb past
>>           6 in in prep
>>           7 this this det
>>           8 book book noun singular|definite
>>           9 really really adv
>>           10 occurred occur verb past
>>           11 . . punctuation
>>         relations: |
>>           _advmod(really[9], occur[10])
>>           in(record[5], book[8])
>>           _obj(record[5], adventure[4])
>>           of(most[1], adventure[4])
>>           _subj(occur[10], most[1])
>>         links: |
>>           S(1, 10)
>>           Mp(1, 2)
>>           Jp(2, 4)
>>           Mv(4, 5)
>>           MVp(5, 6)
>>           Js(6, 8)
>>           Dsu(7, 8)
>>           Em(9, 10)
>>           Dmc(3, 4)
>>           Wd(0, 1)
>>           Xp(0, 11)
>>
>> Regards, Tim
>> [1] http://wiki.opencog.org/w/RelEx_compact_output
>> [2] http://www.yaml.org
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "opencog" group.
>> To post to this group, send email to ope...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> opencog+u...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/opencog?hl=en.
>>
>