Re: HunPOS part-of-speech tagger query.

Arne Köhn

Jul 29, 2010, 4:44:05 AM
to J C, hun...@googlegroups.com, pe...@halacsy.com, cb...@acm.org, johnne...@gmail.com, blauere...@googlemail.com, hal...@gmail.com, kgra...@gmail.com, marko.s...@gmail.com, mmih...@gmail.com, zeljk...@gmail.com
Hi John, hi Rachel

sorry for the late reply.
J C <cha...@hotmail.com> writes:

> I am emailing you in regard to HunPOS, which does not
> seem to have any active support.

There is the hunpos mailing list:

hun...@googlegroups.com

However, there's not much traffic either (read: it's safe to
subscribe ;-). I propose that we discuss this further on that list,
because Rachel asked the same question there and it seems to be the
appropriate place.

> I notice each of you has used HunPOS
> in the past, and I would like to know if you had any trouble building
> your own model using hunpos-train.

I'm still using HunPOS and have developed an incremental mode for my
bachelor thesis. I haven't pushed this upstream yet, though :-/

> However, to train a model using hunpos-train, an input file is
> required containing one word and its part-of-speech tag per line,
> separated by a tab. Furthermore, as the specification states,
> sentences are separated by empty lines.

That's correct.
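
For concreteness, a minimal sketch of that input format (made-up
tokens and a toy tagset; the actual tag inventory is up to you, and
the two columns must be separated by a real tab character):

  The     DT
  dog     NN
  barks   VBZ
  .       SENT

  A       DT
  cat     NN
  sleeps  VBZ
  .       SENT

Each line is word<TAB>tag, and an empty line ends a sentence.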

> [snip]
> hunpos-train model.model < input.txt

> reading training corpus
> compiling probabilities
> Fatal error: exception Failure("empty context_trie")

I didn't come across this until I wanted to answer this mail with
"worksforme". Now I tried to train a model by hand and got the same
error. The problem seems to be (at least here) that the input size is
too small:

This works:

sed 's/ /\t/' somefile | head -n 35 | ./trainer.native models/test

This doesn't, yielding the same error as mentioned above:

sed 's/ /\t/' somefile | head -n 20 | ./trainer.native models/test

Does this information help? I'll try to dig deeper into this if you
(Rachel and/or John) still have the problem and provide me with your
training corpus.
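
In case it helps as a stopgap, here is a sketch of a full round trip
with a padded corpus (file names are examples, and the 35-line
threshold above is just what I observed on my data, not a documented
limit):

  # convert a space-separated corpus to the tab-separated format
  sed 's/ /\t/' corpus.txt > train.tsv

  # pad a too-small corpus by repeating it; make sure the file ends
  # with an empty line so sentences don't merge at the seams
  cat train.tsv train.tsv train.tsv > padded.tsv

  # train, then tag plain one-token-per-line text
  hunpos-train models/test < padded.tsv
  hunpos-tag models/test < sentences.txt > tagged.tsv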


Greetings,
Arne

Oravecz Csaba

Jul 29, 2010, 6:55:11 AM
to hun...@googlegroups.com
On Thu Jul 29 10:44:05 2010 Arne Köhn wrote:
>
> Hi John, hi Rachel
>
> sorry for the late reply.
> J C <cha...@hotmail.com> writes:
>
> > I am emailing you in regard to HunPOS, which does not
> > seem to have any active support.
>

Well, the main developer has practically abandoned the project and
moved on to a different full-time job, unfortunately.

>
> > [snip]
> > hunpos-train model.model < input.txt
>
> > reading training corpus
> > compiling probabilities
> > Fatal error: exception Failure("empty context_trie")

This came up some time ago already; you will find Peter's answer from
back then below.
Best,
Csaba Oravecz

----------------------------------------------------------------------

From: Peter Halacsy <pe...@halacsy.com>
To: hun...@googlegroups.com
In-Reply-To: <e4ef9313-55e2-4c81...@e25g2000prg.googlegroups.com>
Subject: Re: Curious about hunpos
Date: Wed, 30 Jan 2008 19:48:43 +0100

On Jan 30, 2008, at 2:03 PM, zeljk...@gmail.com wrote:

>
> Peter Halacsy wrote:
>
>> <cut />
>
> BTW, Peter, on many training files that I have available, the training
> procedure breaks down stating


>
> reading training corpus
> compiling probabilities
> Fatal error: exception Failure("empty context_trie")
>

> However, I was not able to determine why this happens on some files
> and yet does not on others. I use input files in the format
>
> wordform1[SPACES or TABS]tag1
> ...
>
> with newlines as sentence delimiters. Do you know what causes this
> kind of error?
>

This is a bug.

In your training file there are not enough tokens matching the
hard-wired regular expressions that define cardinals.

For some regular expressions hunpos learns the tag distribution of the
training corpus separately, to give more reliable estimates for
open-class items like numbers unseen during training (see
http://mokk.bme.hu/archive/halacsy07acl).

What can we do?

1. factor out the hard-wired regular expressions and make them
configurable (a test version is done on my computer)

2. check whether there are enough samples for each of the open token
classes (a quick counting sketch follows this list)

3. add some dummy data to the training corpus
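
For step 2, one rough way to count cardinal-looking tokens in a
tab-separated training file (train.tsv here is a placeholder, and the
regular expression is only a guess at the built-in patterns, which
aren't shown in this thread):

  # count first-column tokens that look like integers or decimals
  cut -f1 train.tsv | grep -cE '^[0-9]+([.,][0-9]+)?$'

If the count is near zero, step 3 (appending a few dummy number/tag
lines as their own sentences) should get training past the error.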

I hope this helps

peter
