Tomaž Erjavec
Jul 19, 2023, 7:38:42 AM
to NoSketch Engine, Cyprian Laskowski, Pahor de Maiti, Kristina, Ganna Kryvenko
Hi,
we noticed that the token count reported for the ParlaMint 3.0 corpora
is wrong in the Text type analysis of "Parliamentary body", i.e. speech.body
(it works fine for all other document attributes). The problem is the
same on both our old and new versions of noSkE.
For example, for ParlaMint-GB the Info page
(https://www.clarin.si/ske/#dashboard?corpname=parlamint30_gb) says:
Tokens 139,686,402
but if I run Text type analysis on Parliamentary body
(https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_gb&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.body),
it says for Token coverage:
Items: 2, Total frequency: 279,372,804
so, exactly twice what it should be.
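The factor can be checked directly from the two figures quoted above. A quick sketch in Python (the variable names are only illustrative):

```python
# Token count shown on the Info page for ParlaMint-GB
info_tokens = 139_686_402
# Total frequency shown by Text type analysis of speech.body
text_type_total = 279_372_804

# Ratio between the two reported figures
ratio = text_type_total / info_tokens
# How many tokens too many the Text type analysis reports
surplus = text_type_total - info_tokens

print(f"ratio = {ratio}, surplus = {surplus:,}")
```

The surplus equals the Info token count itself, so every token appears to be counted exactly one extra time.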
The only difference between this attribute and the others is that it is
multivalued:
STRUCTURE speech {
    ATTRIBUTE text_id {
        LABEL "Text ID"
    }
    ATTRIBUTE title
    ATTRIBUTE subcorpus
    ATTRIBUTE body {
        LABEL "Parliamentary body"
        MULTIVALUE yes
        MULTISEP "|"
    }
    ATTRIBUTE term
    ...
so my guess is that this causes the problem.
Note that in most of the corpora the value is in fact never multivalued
but always a single value; this holds in particular for ParlaMint-GB.
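To illustrate the guess, here is a toy model in Python of how multivalue handling could double the total even when every value is single. This is purely hypothetical: the function, the `double_index` switch and the sample data are made up for illustration, and I do not know whether noSkE actually works this way internally.

```python
from collections import Counter

MULTISEP = "|"  # separator taken from the registry entry above


def count_tokens(spans, double_index):
    """Sum token counts per attribute value.

    spans: list of (raw_value, n_tokens) pairs, one per speech structure.
    double_index: hypothetical switch; when True, each span is counted
    under its raw value and again under every split component.
    """
    totals = Counter()
    for raw, n_tokens in spans:
        keys = raw.split(MULTISEP)
        if double_index:
            # Assumption for illustration: the unsplit value is counted too
            keys = [raw] + keys
        for key in keys:
            totals[key] += n_tokens
    return totals


# Made-up spans that only ever carry a single value, as in ParlaMint-GB
spans = [("A", 100), ("B", 50), ("A", 25)]

once = count_tokens(spans, double_index=False)
twice = count_tokens(spans, double_index=True)

print(len(once), sum(once.values()))
print(len(twice), sum(twice.values()))
```

In this toy model the number of items stays the same while the total frequency doubles, which matches the pattern we see, but it does not show that this is the real cause.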
Best,
Tomaž