Multichar multisep kills manatee


Tomaž Erjavec

Mar 4, 2023, 7:10:45 AM
to NoSketch Engine
Hi,

I was just transferring a corpus from our old, Bonito-based NoSketch Engine
(https://www.clarin.si/noske/, where it compiles just fine) to the new
Crystal-based one (https://www.clarin.si/ske/), where the corpus compilation
crashed with:

[20230304-12:20:35] mktokencov: writing token coverage for text
[20230304-12:20:35] mktokencov: finished writing token coverage for text
free(): invalid size
/usr/local/bin/compilecorp: line 73: 3397100 Aborted                 (core dumped) mktokencov "$CORPUS"
<---
ERROR: line 445 - command 'mktokencov "$CORPUS"' exited with status: 134
   ... Error at ::main::main called at line 924
--->

After quite some testing (because <text> has 10 attributes), I found out
that the guilty definition is:

    ATTRIBUTE first_lang {
      LABEL "Prvi jezik"
      MULTIVALUE yes
      MULTISEP ", "
    }

in particular the MULTISEP, which contains ", ". The fact that it has
two characters makes manatee lose its cool.
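
A possible workaround, as a rough sketch only: if the multivalues in the vertical are joined with ", ", they can be rewritten to use a single-character separator, with MULTISEP changed to match (here ","). The function name, the sample tag, and the assumption that no individual value itself contains a comma are all hypothetical and not taken from the actual corpus.

```python
import re


def normalize_multisep(line, attr="first_lang", old_sep=", ", new_sep=","):
    """Replace old_sep with new_sep inside attr="..." on a structure tag line.

    Assumes structure tags start with "<" and attribute values are
    double-quoted; lines not starting with "<" are returned unchanged.
    """
    if not line.startswith("<"):
        return line  # treat as a token line and leave it untouched
    pattern = re.compile(r'(\b' + re.escape(attr) + r'=")([^"]*)(")')
    return pattern.sub(
        lambda m: m.group(1) + m.group(2).replace(old_sep, new_sep) + m.group(3),
        line,
    )


# Hypothetical structure tag, for illustration only
print(normalize_multisep('<text id="1" first_lang="sl, en">'))
# prints: <text id="1" first_lang="sl,en">
```

After rewriting the vertical this way, the attribute definition would use MULTISEP "," instead of ", ".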

I'm not saying that the limitation to 1 char for MULTISEP is
unreasonable (although, it would be nice to be able to have a proper
string for the separator), it's just that

- manatee gives no indication of what the problem is, and neither does any
of the corpus check commands

- the documentation at
https://www.sketchengine.eu/documentation/corpus-configuration-file-all-features/
does not make it clear that only one char is allowed: "MULTISEP defines
multivalue separator".

Because I now know what the problem was, this is just to make others
aware of the limitation, and maybe for Lexicals to add this limitation
to the documentation ("MULTISEP defines multivalue separator
character"), and very, very maybe to have manatee report the
problem as well, rather than just dumping core.

Best,

Tomaž

Ondřej Herman

Mar 6, 2023, 7:46:48 AM
to Tomaž Erjavec, NoSketch Engine
Hello Tomaž,

Thanks for this -- I updated the documentation and changed corpcheck, so that it complains when MULTISEP is longer than a single byte.
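
For configs on older installations, a similar check can be run by hand before compiling. The following is a minimal sketch, not the actual corpcheck code: it assumes the config is UTF-8 text and that each MULTISEP value is double-quoted on a single line; the function name and the regular expressions are illustrative.

```python
import re

# Matches e.g.:  MULTISEP ", "   (double-quoted value on one line)
MULTISEP_RE = re.compile(r'^\s*MULTISEP\s+"(.*)"\s*$')
# Matches e.g.:  ATTRIBUTE first_lang {
ATTR_RE = re.compile(r'^\s*ATTRIBUTE\s+"?([^\s"{]+)"?')


def find_long_multiseps(config_text, encoding="utf-8"):
    """Return (line_no, attribute, separator) for each MULTISEP longer than one byte."""
    problems = []
    current_attr = None
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        m = ATTR_RE.match(line)
        if m:
            current_attr = m.group(1)  # remember which attribute we are inside
            continue
        m = MULTISEP_RE.match(line)
        if m and len(m.group(1).encode(encoding)) > 1:
            problems.append((line_no, current_attr, m.group(1)))
    return problems


sample = '''ATTRIBUTE first_lang {
  LABEL "Prvi jezik"
  MULTIVALUE yes
  MULTISEP ", "
}
'''
print(find_long_multiseps(sample))
# prints: [(4, 'first_lang', ', ')]
```

Because the length is counted in bytes, a single non-ASCII character such as "·" would also be flagged.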

However, I can't reproduce the crash in mktokencov. Could you send me the contents of the first_lang attribute from the vertical? Or a simple grep of all the <doc> structures from the vertical would be most helpful for me to be able to trace this down.

Best,

Ondrej


Tomaž Erjavec

Mar 6, 2023, 12:33:49 PM
to Ondřej Herman, NoSketch Engine

Hi Ondřej,

On 6. 03. 2023 13:46, Ondřej Herman wrote:
> Thanks for this -- I updated the documentation and changed corpcheck, so that it complains when MULTISEP is longer than a single byte.

Thanks! Also on behalf of all the future users :)

> However, I can't reproduce the crash in mktokencov. Could you send me the contents of the first_lang attribute from the vertical? Or a simple grep of all the <doc> structures from the vertical would be most helpful for me to be able to trace this down.

Alas, I can't, because I fixed the corpus in the meantime and didn't keep the old one.

But I made a little test corpus which gives the same error as below, and you can find it at https://nl.ijs.si/et/tmp/noske/test/.

Strange that you couldn't reproduce it - it just might be that this problem has been fixed in the latest manatee, as we are still using open-2.167.8.

All the best, thanks for your help & greetings to Brno!

Tomaž

[20230306-18:26:42] mktokencov: updating multivalue coverage for first_lang
[20230306-18:26:42] mktokencov: writing token coverage for text
[20230306-18:26:42] mktokencov: finished writing token coverage for text
/usr/local/bin/compilecorp: line 73: 3566017 Segmentation fault      (core dumped) mktokencov "$CORPUS"
<---
ERROR: line 445 - command 'mktokencov "$CORPUS"' exited with status: 139
   ... Error at ::main::main called at line 924
--->

Writing log to /data/manatee-data/kost_test/log/compilecorp_2023-03-06_1826.log
<---
ERROR: line 924 - command 'tee "$TMPLOGFILE"' exited with status: 139
   ... internal debug info from function traperror (line 0)
--->
make: *** [Makefile:4: test] Error 139

Ondřej Herman

Jul 17, 2023, 11:55:50 AM
to Tomaž Erjavec, NoSketch Engine
Hello Tomaž,

Sorry for the delay and thanks for the data -- it helped me in tracking down the crash.

In the next manatee release (version >= 2.224) MULTISEP length will be validated properly.

Best regards,

Ondrej