Re: Recommended way to parse vislcg3 output


Tino Didriksen

Jan 22, 2018, 6:35:01 AM
to Niko Partanen, constrain...@googlegroups.com, Trosterud Trond, tommi....@iki.fi, Francis Morton Tyers, Eckhard Bick
(CC'ing the Constraint Grammar mailing list)

So, from my point of view that's a very simple topic, but unfortunately not in a way that helps you. CG-3 doesn't care about most things in the stream.

As you identified, http://visl.sdu.dk/cg3/chunked/streamformats.html#stream-vislcg is the documentation for the stream format, and that's really it. I've added stream static tags just now, but they don't alter much.

CG-3 mostly does not care what order tags are in or how those tags look. As long as there's a baseform first, the remaining tags are a random bag. Your example "кар" Hom1 N Sg Ine @HNOUN #1->0 is to CG-3 the same as "кар" @HNOUN #1->0 Sg N Ine Hom1, so what you consider validation is far beyond what I consider validation. Each parsing system has its own tags and tag order, and CG-3 tries to maintain those tags and their order but doesn't really care about them.
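To make the "random bag" point concrete, here is a minimal Python sketch. It is not part of CG-3: `parse_reading` is a hypothetical helper, and it assumes tags are whitespace-separated and contain no spaces themselves.

```python
import re


def parse_reading(line):
    """Split one reading line into (baseform, unordered set of tags)."""
    match = re.match(r'\s*"(.*?)"\s*(.*)', line)
    if match is None:
        raise ValueError(f"not a reading line: {line!r}")
    baseform, rest = match.groups()
    # Assumption: tags are whitespace-separated and contain no spaces.
    return baseform, frozenset(rest.split())


a = parse_reading('"кар" Hom1 N Sg Ine @HNOUN #1->0')
b = parse_reading('"кар" @HNOUN #1->0 Sg N Ine Hom1')
print(a == b)  # the two orderings compare equal
```

Comparing readings as a baseform plus a set of tags is roughly the level at which the two lines above are "the same"; anything stricter has to come from the parsing system's own conventions.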

This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U mandates many tag patterns and orders that CG-3 simply doesn't care about or even know about - I can't make a general-purpose converter, because each parsing system wants it differently.

I could quite easily convert to XML or JSON, but I think that would be of limited help. It'd be something like

XML:
<cohort id="1" parent="0">
  <wordform>...</wordform>
  <static-tags><tag>...</tag><tag>...</tag></static-tags>
  <readings>
    <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
    <reading><baseform>...</baseform><tag>...</tag><tag>...</tag><mapping>...</mapping></reading>
  </readings>
</cohort>

JSON:
{
  "wordform": "...",
  "static_tags": ["...", "..."],
  "readings": [
    {"baseform": "...", "tags": ["...", "...", "..."], "mapping": "..."},
    {"baseform": "...", "tags": ["...", "..."], "mapping": "..."}
  ]
}

(With everything abbreviated to not waste bytes, and non-cohort input put somewhere in CDATA or a string.)
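For anyone who wants roughly that JSON shape today, the following is a rough Python sketch, not an official converter. `parse_stream` is a hypothetical name, and the sketch assumes reading lines are indented, tags contain no spaces, and mapping tags use the default @ prefix. It also stores mapping tags as a list rather than a single string, and silently drops non-cohort text.

```python
import json
import re

COHORT_RE = re.compile(r'^"<(.*?)>"\s*(.*)$')    # "<wordform>" plus optional static tags
READING_RE = re.compile(r'^\s+"(.*?)"\s*(.*)$')  # indented "baseform" plus tags


def parse_stream(text):
    """Turn VISL CG-3 stream text into a list of cohort dictionaries."""
    cohorts = []
    for line in text.splitlines():
        match = COHORT_RE.match(line)
        if match:
            cohorts.append({
                "wordform": match.group(1),
                "static_tags": match.group(2).split(),
                "readings": [],
            })
            continue
        match = READING_RE.match(line)
        if match and cohorts:
            tags = match.group(2).split()
            cohorts[-1]["readings"].append({
                "baseform": match.group(1),
                "tags": [t for t in tags if not t.startswith("@")],
                "mapping": [t for t in tags if t.startswith("@")],
            })
        # Anything else is non-cohort text; this sketch ignores it.
    return cohorts


if __name__ == "__main__":
    sample = '"<карын>"\n\t"кар" Hom1 N Sg Ine @HNOUN #1->0\n'
    print(json.dumps(parse_stream(sample), ensure_ascii=False, indent=2))
```

Subreadings, deleted readings and quoted tags with spaces are not handled here; those are the corner cases where a shared parser would earn its keep.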

CG-3 knows what a baseform is and what mapping tags are, but which tags are POS or secondary or semantic and so on is simply not part of the model. It could be something people write into their grammars, but even that is messy.

So in conclusion, from my point of view, stream validators need to be part of the parsing system they work in, because CG-3 is mostly agnostic. I'm happy to be proven wrong, if someone can come up with a clean way to make it work in general.

-- Tino Didriksen


On 19 January 2018 at 10:55, Niko Partanen <nikotapi...@gmail.com> wrote:
Hi Tino,

I was asking about this last week at the IWCLUL 2018 conference, and was advised to contact you. I have added Trond, Francis and Tommi in CC, as I was discussing this with them.

My question was whether there is any obvious, well-documented way to parse vislcg3 output, or to validate that everything is in order with it. I found some documentation of the format here:


I see there are various output formats with cg-conv, but some of those give warnings about information being lost, so I assume parsing the default output is the best way to go. The project I'm involved with is currently working on CG rules for Komi-Zyrian, so we would be interested in analysing the output, and how it changes, a bit better.

I've often been using Francis's ud-scripts to convert output into CoNLL-U, which works fine, but this also demands that the output is disambiguated.


On the other hand, the format seems simple, and it is clear that parsing it in any programming language is not that hard. Everyone says they have just come up with their own methods, but there are quite a few corner cases in the way the output varies, so reinventing how to parse this format yet again seems a bit unnecessary. I would normally work further with the results in R and Python, so getting the output into either of these without information loss would do. Having the output in XML or JSON could also be an easy way to get onward from there.

Just to give an illustration of a random problem, at times additional tags get marked like this in Komi-Zyrian output:

"<карын>"
"кар" Hom1 N Sg Ine @HNOUN #1->0

Now the homonymy tag comes first, so a script that assumes the POS tag to be first will fail (e.g. in ud-annotatrix). Of course the problem may be in the Komi analyser, and the output should not look like this to start with, so maybe there are some tools to validate that the output is not breaking any rules?

I was told you might be able to help or advise with this issue. All help is very much appreciated! The Komi disambiguation is working pretty nicely now, so I would be very interested to work further with the results.

Best wishes,

Niko Partanen

Kevin Brubeck Unhammer

Jan 22, 2018, 7:11:27 AM
to Tino Didriksen, Niko Partanen, constrain...@googlegroups.com, Trosterud Trond, tommi....@iki.fi, Francis Morton Tyers, Eckhard Bick
Tino Didriksen <ma...@tinodidriksen.com> wrote:

>> On the other hand the format seems simple and it is clear parsing it with
>> any programming language is not that hard. Everyone says they have just
>> come up with some of their own methods, but then there are quite many
>> corner cases with the way output varies, so reinventing how to parse this
>> format again seems a bit unnecessary. I would normally work further with
>> the results in R and Python, so getting the output without information loss
>> into any of these would do.

If you want to use some ready-made tools to get things into Python, you
could use cg-conv and Apertium's streamparser.py

Here's a small session showing its usage:

```
$ git clone https://github.com/goavki/streamparser
Cloning into 'streamparser'...
remote: Counting objects: 142, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 142 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (142/142), 33.99 KiB | 280.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
$ cd streamparser/
$ cat /tmp/kom
"<карын>"
"кар" Hom1 N Sg Ine @HNOUN #1->0
"<and>"
"and" CC @Conj

"<so>"
"so" Adv <guess>
"so" PreAdv @Thing

"<on>"
"on" Adv @Other
"on" Pr @Meh
$ cat /tmp/kom | cg-conv -A
^карын/кар<Hom1><N><Sg><Ine><#1->0><@HNOUN>$^and/and<CC><@Conj>$^so/so<Adv><<guess>>/so<PreAdv><@Thing>$^on/on<Adv><@Other>/on<Pr><@Meh>$$
$ # And now to transform into whatever structure we want in Python, say "form\tsyntags\tmain-pos":
$ cat /tmp/kom | cg-conv -A | python3 -c 'import streamparser
import sys
for blank, lu in streamparser.parse_file(sys.stdin, withText=True):
    print(blank+lu.wordform,end="\t")
    tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
    print([t for t in tags if t.startswith("@")], end="\t")
    print([t for t in tags if t in ["N", "Adv", "Pr", "PreAdv"]], end="\n")
'
карын ['@HNOUN'] ['N']
and ['@Conj'] []
so ['@Thing'] ['Adv', 'PreAdv']
on ['@Other', '@Meh'] ['Adv', 'Pr']
```


I don't know what information "cg-conv -A" loses, but it does keep the
important stuff, e.g. lemma, wordform, readings, subreadings and even
blanks/formatting in between cohorts.


best regards,
Kevin Brubeck Unhammer

Niko Partanen

Jan 22, 2018, 7:22:06 AM
to Tino Didriksen, constrain...@googlegroups.com, Trosterud Trond, tommi....@iki.fi, Francis Morton Tyers, Eckhard Bick
Hi Tino,

Thank you for the clarifying reply and for forwarding it to the list! I was also thinking that the tags are probably a bit unrelated to my actual question about the format itself, and I see your point about validation. In my opinion XML output could be a welcome addition, since everyone in the research group I work with is already familiar with it. Of course, if there is no wider need for this, then I guess I'll just deal with the output as it is now. I had also assumed there is a bit more structure in the tags from CG-3's point of view than there actually is, so this also changes where I have to look for my problem.

I'm also happy to hear other opinions about this! 

Best wishes,

Niko 

Kevin Brubeck Unhammer

Jan 22, 2018, 7:51:35 AM
to Tino Didriksen, constrain...@googlegroups.com, Niko Partanen, Trosterud Trond, tommi....@iki.fi, Francis Morton Tyers, Eckhard Bick
Kevin Brubeck Unhammer <unhammer...@mm.st> wrote:

> I don't know what information "cg-conv -A" loses, but it does keep the
> important stuff, e.g. lemma, wordform, readings, subreadings and even
> blanks/formatting in between cohorts.

Actually, now I see something it doesn't convert.
The characters []{}<>/$^+ are reserved in Apertium stream format, and
should be escaped when they appear in tags or lemmas. I've updated
streamparser to handle this, but cg-conv would need a change too
if you have tags like #1->0 or <guess>.
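
As a stopgap, the escaping could be sketched in Python along these lines. This is only an illustration and not what cg-conv or streamparser actually do: `escape_apertium` is a hypothetical helper, it assumes the usual Apertium convention of prefixing reserved characters with a backslash, and escaping the backslash itself is an extra assumption on top of the list above.

```python
# Reserved characters as listed above; the backslash is added as an assumption.
RESERVED = set("[]{}<>/$^+") | {"\\"}


def escape_apertium(text):
    """Prefix every reserved character in a tag or lemma with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


for tag in ["#1->0", "<guess>", "N"]:
    print(tag, "=>", escape_apertium(tag))
```

This would have to be applied to each tag and lemma before they are wrapped in the stream's own <...> delimiters, not to the finished stream.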

Edward Garrett

Jan 24, 2018, 9:07:10 AM
to constrain...@googlegroups.com

> This is also why there is no CoNLL-U converter directly in CG-3. CoNLL-U mandates many tag patterns and orders that CG-3 simply doesn't care about or even know about - I can't make a general-purpose converter, because each parsing system wants it differently.


can you elaborate on this? while everything you said from the CG-3 side makes sense to me, just looking at the CoNLL-U format (http://universaldependencies.org/format.html), it seems like one could make a handy converter which just made some choices about how to map the CoNLL-U columns into CG-3 tags. for example, features could be mapped to Name=Value, i.e. Case=Num, and so on. i don't know whether this would be a "general purpose converter" but if it took any CoNLL-U input and produced a plausible CG-3 output, then it would be useful in my book.

as for going in the reverse direction, i agree it would be problematic. i suppose since the order of the tags and their internal composition is irrelevant in CG-3, the tags might easily get swapped around and screw things up. that said, perhaps there could be some guidelines: if you name your tags in such and such a way, and choose from this inventory of tag naming conventions, then you can convert your CG-3 file to CoNLL-U using this script.


Tino Didriksen

Jan 24, 2018, 9:45:57 AM
to constrain...@googlegroups.com
Replied inline...

On 24 January 2018 at 15:07, Edward Garrett <e.ga...@soas.ac.uk> wrote:
> can you elaborate on this? while everything you said from the CG-3 side makes sense to me, just looking at the CoNLL-U format (http://universaldependencies.org/format.html), it seems like one could make a handy converter which just made some choices about how to map the CoNLL-U columns into CG-3 tags. for example, features could be mapped to Name=Value, i.e. Case=Num, and so on. i don't know whether this would be a "general purpose converter" but if it took any CoNLL-U input and produced a plausible CG-3 output, then it would be useful in my book.

Sure, from CoNLL-U to CG-3 is easy, though yielding an ugly result for the general case. I assumed the original question was the other way.
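
A rough Python sketch of that easy direction, under stated assumptions: `conllu_token_to_cg` is a hypothetical helper, each feature is emitted as its own Name=Value tag, DEPREL becomes an @-prefixed tag, ID and HEAD become a #id->head dependency tag, and XPOS, DEPS and MISC are simply dropped. Multiword tokens and empty nodes are not handled.

```python
def conllu_token_to_cg(line):
    """Convert one 10-column CoNLL-U token line into a one-reading CG-3 cohort."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    tok_id, form, lemma, upos, _xpos, feats, head, deprel = cols[:8]

    tags = []
    if upos != "_":
        tags.append(upos)
    if feats != "_":
        tags.extend(feats.split("|"))         # Name=Value, one tag per feature
    if deprel != "_":
        tags.append("@" + deprel)             # assumption: relation as mapping tag
    if head != "_":
        tags.append(f"#{tok_id}->{head}")     # dependency as #self->parent
    return f'"<{form}>"\n\t"{lemma}" ' + " ".join(tags)


print(conllu_token_to_cg(
    "1\tкарын\tкар\tNOUN\t_\tCase=Ine|Number=Sing\t0\troot\t_\t_"
))
```

The result is valid input for CG-3 but carries UD's tag names rather than any existing grammar's, which is the ugliness in the general case.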

 
> as for going in the reverse direction, i agree it would be problematic. i suppose since the order of the tags and their internal composition is irrelevant in CG-3, the tags might easily get swapped around and screw things up. that said, perhaps there could be some guidelines: if you name your tags in such and such a way, and choose from this inventory of tag naming conventions, then you can convert your CG-3 file to CoNLL-U using this script.

This is where nobody can agree. VISL uses ALLUPPER for part of speech, <> for secondary. Apertium has no patterns. Giellatekno has several prefixes that denote secondary tags, with the remainder being primary. Etc. The only thing everyone agrees on is using prefix @ for primary mapping tags, because that was enforced by older CG implementations.

So naming conventions are out the window - has to be list based. Those lists could be in the grammar, or supplied to cg-conv somehow.
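
A list-based approach might look something like this Python sketch. Nothing here exists in CG-3 or cg-conv: `classify_tags` and the `POS_TAGS` list are made up for illustration, and the only conventions assumed are the @ prefix for mapping tags and the #n->m shape of dependency tags.

```python
import re

# Hypothetical list; each parsing system would have to supply its own.
POS_TAGS = {"N", "V", "A", "Adv", "Pr", "PreAdv", "CC"}

DEPENDENCY_RE = re.compile(r"#\d+->\d+")


def classify_tags(tags, pos_tags=POS_TAGS):
    """Sort a reading's tags into categories by list membership, not by position."""
    result = {"pos": [], "mapping": [], "dependency": [], "other": []}
    for tag in tags:
        if tag in pos_tags:
            result["pos"].append(tag)
        elif tag.startswith("@"):
            result["mapping"].append(tag)
        elif DEPENDENCY_RE.fullmatch(tag):
            result["dependency"].append(tag)
        else:
            result["other"].append(tag)
    return result


print(classify_tags(["Hom1", "N", "Sg", "Ine", "@HNOUN", "#1->0"]))
```

With a lookup like this, a leading Hom1 no longer matters, because the part of speech is found by membership in the list rather than by being the first tag.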

-- Tino Didriksen
