Tino Didriksen <ma...@tinodidriksen.com> wrote:
>> On the other hand the format seems simple and it is clear parsing it with
>> any programming language is not that hard. Everyone says they have just
>> come up with some of their own methods, but then there are quite many
>> corner cases with the way output varies, so reinventing how to parse this
>> format again seems a bit unnecessary. I would normally work further with
>> the results in R and Python, so getting the output without information loss
>> into any of these would do.
If you want to use some ready-made tools to get things into Python, you
could use cg-conv together with Apertium's streamparser.py.
Here's a small session showing its usage:
```
$ git clone https://github.com/goavki/streamparser
Cloning into 'streamparser'...
remote: Counting objects: 142, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 142 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (142/142), 33.99 KiB | 280.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
$ cd streamparser/
$ cat /tmp/kom
"<карын>"
"кар" Hom1 N Sg Ine @HNOUN #1->0
"<and>"
"and" CC @Conj
"<so>"
"so" Adv <guess>
"so" PreAdv @Thing
"<on>"
"on" Adv @Other
"on" Pr @Meh
$ cat /tmp/kom | cg-conv -A
^карын/кар<Hom1><N><Sg><Ine><#1->0><@HNOUN>$^and/and<CC><@Conj>$^so/so<Adv><<guess>>/so<PreAdv><@Thing>$^on/on<Adv><@Other>/on<Pr><@Meh>$$
$ # And now to transform into whatever structure we want in Python, say "form\tsyntags\tmain-pos":
$ cat /tmp/kom | cg-conv -A | python3 -c 'import streamparser
import sys
for blank, lu in streamparser.parse_file(sys.stdin, withText=True):
    print(blank+lu.wordform,end="\t")
    tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
    print([t for t in tags if t.startswith("@")], end="\t")
    print([t for t in tags if t in ["N", "Adv", "Pr", "PreAdv"]], end="\n")
'
карын ['@HNOUN'] ['N']
and ['@Conj'] []
so ['@Thing'] ['Adv', 'PreAdv']
on ['@Other', '@Meh'] ['Adv', 'Pr']
```
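If you also want to continue in R, one option is to write those same three columns out as TSV. Here's a minimal sketch using only the Python standard library. Note the assumptions: the `rows` list is hand-copied from the output above rather than produced by streamparser, the function name `write_tsv` and the column names are made up for illustration, and joining multiple values with `|` is just one arbitrary choice.

```python
import csv
import io

# Rows hand-copied from the session output above: (form, syntags, main-pos).
# In a real pipeline these would come from the streamparser loop instead.
rows = [
    ("карын", ["@HNOUN"], ["N"]),
    ("and", ["@Conj"], []),
    ("so", ["@Thing"], ["Adv", "PreAdv"]),
    ("on", ["@Other", "@Meh"], ["Adv", "Pr"]),
]


def write_tsv(rows, out):
    """Write a header plus one line per cohort; multi-valued cells are joined with '|'."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["form", "syntags", "pos"])
    for form, syntags, pos in rows:
        writer.writerow([form, "|".join(syntags), "|".join(pos)])


# Write to an in-memory buffer here; a file opened with encoding="utf-8" works the same way.
buf = io.StringIO()
write_tsv(rows, buf)
tsv_text = buf.getvalue()
print(tsv_text, end="")
```

A file written this way should be readable with R's read.delim, though I haven't tested that side here.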
I don't know what information "cg-conv -A" loses, but it does keep the
important stuff, e.g. lemma, wordform, readings, subreadings and even
"blanks"/formatting in between cohorts.
best regards,
Kevin Brubeck Unhammer