Problem with token numbers in Text type analysis of multivalued attributes

Tomaž Erjavec

Jul 19, 2023, 7:38:42 AM
to NoSketch Engine, Cyprian Laskowski, Pahor de Maiti, Kristina, Ganna Kryvenko
Hi,

We noticed that the token count reported for ParlaMint 3.0 corpora is
wrong in the Text type analysis of "Parliamentary body", i.e. speech.body
(it works OK for all other document attributes). The problem is the
same on both our old and new versions of noSkE.

For example, for ParlaMint-GB
(https://www.clarin.si/ske/#dashboard?corpname=parlamint30_gb), the Info page says:

Tokens     139,686,402

but if I do Text type analysis of Parliamentary body
(https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_gb&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.body),
it says for Token coverage:

Items:  2,  Total frequency: 279,372,804

so, exactly twice what it should be.
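
As a quick sanity check on the "exactly twice" observation, here is a minimal Python sketch that uses only the two figures quoted above. The variable names are just illustrative.

```python
# Figures copied from the report above.
corpus_tokens = 139_686_402    # "Tokens" on the corpus Info page
reported_total = 279_372_804   # "Total frequency" in Text type analysis of speech.body

# If the total is an exact multiple of the corpus size, the remainder is 0.
ratio, remainder = divmod(reported_total, corpus_tokens)
print(ratio, remainder)
```

A remainder of 0 with a ratio of 2 means every token is counted exactly twice, rather than the total being inflated by some subset of speeches.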

The only difference between this attribute and the others is that it is
multivalued:

STRUCTURE speech {
    ATTRIBUTE text_id {
        LABEL "Text ID"
    }
    ATTRIBUTE title
    ATTRIBUTE subcorpus
    ATTRIBUTE body {
        LABEL "Parliamentary body"
        MULTIVALUE yes
        MULTISEP "|"
    }
    ATTRIBUTE term

...

so my guess is that this causes the problem.

Note that in most of the corpora, and in ParlaMint-GB in particular, this
attribute never actually holds multiple values; it is always a single value.

Best,

Tomaž

Ondřej Herman

Jul 19, 2023, 7:59:26 AM
to Tomaž Erjavec, NoSketch Engine, Cyprian Laskowski, Pahor de Maiti, Kristina, Ganna Kryvenko
Hello Tomaž,

Your assessment is correct -- this issue is present in manatee from 2.199 to 2.211.

To fix this with a recent manatee, either a) recompile the corpus or b) just run mktokencov CORPNAME.
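
If several corpora are affected, option b) could be scripted. The following is only a rough sketch: it assumes `mktokencov` from a fixed manatee is on the PATH and takes the corpus name as its single argument, as in the command above. The helper names are hypothetical, and the corpus name `parlamint30_gb` is taken from the `corpname` parameter in the URLs earlier in the thread, so it may differ from the registry name on your server.

```python
import subprocess


def mktokencov_command(corpname: str) -> list[str]:
    """Build the command line quoted in the thread: mktokencov CORPNAME."""
    return ["mktokencov", corpname]


def refresh_token_coverage(corpnames, dry_run=True):
    """Print (dry run) or execute mktokencov for each corpus; return the commands."""
    commands = [mktokencov_command(name) for name in corpnames]
    for cmd in commands:
        if dry_run:
            # Only show what would be run, without touching any corpus data.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if mktokencov exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Assumed corpus name; replace with the names from your own registry.
    refresh_token_coverage(["parlamint30_gb"])
```

The dry run is the default so that the list of commands can be inspected before anything is changed; pass `dry_run=False` to actually run them.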

Best,

Ondrej

Tomaž Erjavec

Jul 20, 2023, 9:38:21 AM
to no...@sketchengine.co.uk, Cyprian Laskowski

Hi Ondrej,

I recompiled one corpus on our latest noSkE and, indeed, the problem goes away.

Thanks!

Tomaž
