Tino Didriksen <ma...@tinodidriksen.com
>> On the other hand the format seems simple and it is clear parsing it with
>> any programming language is not that hard. Everyone says they have just
>> come up with some of their own methods, but then there are quite many
>> corner cases with the way output varies, so reinventing how to parse this
>> format again seems a bit unnecessary. I would normally work further with
>> the results in R and Python, so getting the output without information loss
>> into any of these would do.
If you want to use some ready-made tools to get things into Python, you
could use cg-conv and Apertium's streamparser.py.
Here's a small session showing its usage:
$ git clone https://github.com/goavki/streamparser
Cloning into 'streamparser'...
remote: Counting objects: 142, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 142 (delta 0), reused 0 (delta 0), pack-reused 139
Receiving objects: 100% (142/142), 33.99 KiB | 280.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
$ cd streamparser/
$ cat /tmp/kom
"<карын>"
	"кар" Hom1 N Sg Ine @HNOUN #1->0
"<and>"
	"and" CC @Conj
"<so>"
	"so" Adv <guess>
	"so" PreAdv @Thing
"<on>"
	"on" Adv @Other
	"on" Pr @Meh
$ cat /tmp/kom | cg-conv -A
$ # And now to transform into whatever structure we want in Python, say "form\tsyntags\tmain-pos":
$ cat /tmp/kom | cg-conv -A | python3 -c 'import sys, streamparser
for blank, lu in streamparser.parse_file(sys.stdin, withText=True):
    # flatten the tags of every reading and subreading of this cohort
    tags = [tag for reading in lu.readings for sub in reading for tag in sub.tags]
    print(lu.wordform, end="\t")
    print([t for t in tags if t.startswith("@")], end="\t")
    print([t for t in tags if t in ["N", "Adv", "Pr", "PreAdv"]], end="\n")
'
карын ['@HNOUN'] ['N']
and ['@Conj'] []
so ['@Thing'] ['Adv', 'PreAdv']
on ['@Other', '@Meh'] ['Adv', 'Pr']
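If the goal is a table to load into R or pandas, the same loop could emit one row per reading instead of one line per cohort. Here is a rough sketch of that idea. The namedtuples are hypothetical stand-ins that only mirror the attributes used in the session above (lu.wordform, lu.readings, sub.baseform, sub.tags), so the snippet runs without streamparser installed; the function names, the column layout and the "+" used to join subreading lemmas are my own choices, not anything streamparser prescribes.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-ins for the objects streamparser yields; with the real
# library these would come from streamparser.parse_file(sys.stdin).
SReading = namedtuple("SReading", ["baseform", "tags"])
LexicalUnit = namedtuple("LexicalUnit", ["wordform", "readings"])


def reading_rows(units):
    """Yield one (form, lemma, tags, syntags) tuple per reading."""
    for lu in units:
        for reading in lu.readings:
            # A reading is a list of subreadings; join their lemmas with "+"
            # (an arbitrary separator chosen for this sketch).
            lemma = "+".join(sub.baseform for sub in reading)
            tags = [tag for sub in reading for tag in sub.tags]
            syntags = [t for t in tags if t.startswith("@")]
            other = [t for t in tags if not t.startswith("@")]
            yield (lu.wordform, lemma, " ".join(other), " ".join(syntags))


def write_tsv(units, out):
    """Write a header line plus one tab-separated line per reading."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["form", "lemma", "tags", "syntags"])
    writer.writerows(reading_rows(units))


# Hand-built sample corresponding to two of the cohorts in /tmp/kom.
sample = [
    LexicalUnit("and", [[SReading("and", ["CC", "@Conj"])]]),
    LexicalUnit("on", [[SReading("on", ["Adv", "@Other"])],
                       [SReading("on", ["Pr", "@Meh"])]]),
]

buf = io.StringIO()
write_tsv(sample, buf)
print(buf.getvalue(), end="")
```

Keeping ambiguous cohorts as several rows, rather than merging their tags into one list as in the one-liner, means nothing about which tag belongs to which reading is thrown away before the data reaches R or Python.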
I don't know what information "cg-conv -A" loses, but it does keep the
important stuff, e.g. lemma, wordform, readings, subreadings and even
"blanks" (the formatting in between cohorts).
Kevin Brubeck Unhammer