Word list and Keywords report twice as high frequencies in some cases

Tomáš Machálek

May 2, 2024, 9:17:03 AM
to NoSketch Engine
Hello everyone,

We've noticed that Word List and Keywords report doubled absolute frequencies when a subcorpus is involved. Strangely, this only affects some attributes. For example, the frequencies can be correct for "word" but doubled for "lemma" or "tag" when compared with the corresponding concordance results.

We have reproduced the problem with manatee-open-2.223.6 and manatee-open-2.225.8, together with bonito-open-5.58.1 and crystal-open-2.130.1.

Best regards,
Tomas Machalek
Institute of the Czech National Corpus


Michal Cukr | Sketch Engine Support

May 7, 2024, 9:52:35 AM
to no...@sketchengine.co.uk, tomas.m...@gmail.com
Dear Tomáš,

Thank you for your email and the details you provided to my colleague František in the private email.

We are not able to reproduce the issue you described in the current up-to-date version of Sketch Engine.

However, I can see that you are using obsolete versions of both Bonito and Crystal. Please update these components and then run your query again to see whether the issue persists.

Best regards,

Michal Cukr 



--
Sketch Engine Team
Email: sup...@sketchengine.eu
Boot Camp Online – a course in mastering Sketch Engine https://www.sketchengine.eu/bootcamp/

Tomáš Machálek

May 15, 2024, 5:46:56 AM
to Michal Cukr | Sketch Engine Support, no...@sketchengine.co.uk
I have updated to bonito-open-5.71.15, crystal-open-2.166.4 (Manatee is already 2.225.8) and the problem persists.
 
My observations suggest that this is related to the fact that the attributes in question are multivalue attributes. I've noticed that the reported values are far from always exactly twice the correct value. Rather, the difference depends on how often multivalues occur within the given attribute. To be clear: this only applies to non-word attributes combined with subcorpora, in which case Bonito has to compute intermediate data.
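
A toy illustration of why the inflation does not have to be exactly 2x. This is only a sketch under assumptions: the separator `|`, the tag values, and the `expand_twice` switch are made up for the example and do not reproduce the actual Manatee code path. It simply counts every multivalue token twice and shows that the resulting ratio depends on the share of multivalue tokens for each value.

```python
from collections import Counter

SEP = "|"  # hypothetical multivalue separator, not taken from the real registry


def count_values(tokens, expand_twice=False):
    """Count attribute values; one token may carry several values joined by SEP.

    With expand_twice=True, tokens holding more than one value are counted
    twice, mimicking a (hypothetical) double expansion of multivalues.
    """
    freqs = Counter()
    for raw in tokens:
        values = raw.split(SEP)
        weight = 2 if expand_twice and len(values) > 1 else 1
        for value in values:
            freqs[value] += weight
    return freqs


# Made-up attribute values for seven tokens.
tokens = ["NN", "NN", "NN|VB", "VB|NN", "JJ", "VB|NN", "NN|JJ"]

correct = count_values(tokens)
inflated = count_values(tokens, expand_twice=True)

# Ratio of inflated to correct frequency per value.
ratios = {value: inflated[value] / correct[value] for value in correct}

for value in sorted(ratios):
    print(value, correct[value], inflated[value], round(ratios[value], 2))
```

In this model a value that only ever occurs inside multivalue tokens ends up exactly doubled, while a value that mostly occurs alone is inflated much less.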

Tomáš Machálek

Michal Cukr | Sketch Engine Support

May 31, 2024, 9:00:03 AM
to tomas.m...@gmail.com, no...@sketchengine.co.uk
Dear Tomáš,

Thank you for updating Bonito and providing more details about the issue.

I have tried to trigger the error by following your instructions, but unfortunately I have not been able to reproduce it. I have discussed it with my colleagues, and we would need a minimal reproducible example to inspect the error in more detail.

In short, we see two ways to obtain one:

  1. Create a small data set (e.g. a 100-token corpus) that triggers the error and send it to us together with the corpus configuration (registry), so that we can compile the data on our servers.
  2. Or grant us access to the corpus where the error occurred.
If you wish to keep your data private, please share it with us via our standard support channel sup...@sketchengine.eu

Best regards,

Michal Cukr 





Ondřej Herman | Sketch Engine Support

Jul 15, 2024, 8:11:39 AM
to no...@sketchengine.co.uk
---Begin forwarded message:---

Subject: Word list and Keywords report twice as high frequencies in some cases
Date: 07/15/2024 2:08 pm
From: Ondřej Herman
To: Tomáš Machálek <tomas.m...@gmail.com>

Dear Tomas,

We have resolved the issue; it was indeed caused by the handling of MULTIVALUE attributes within subcorpora. The fix will be available in the next release of Manatee.

In the meantime, you can use the attached patch.

Thank you for reporting the problem.

Best,

Ondrej


On Friday, June 7, 2024 at 5:53:04 PM, Tomáš Machálek wrote:

Hello all,

I've prepared a small vertical file from one of our corpora plus a corresponding configuration file (PATH will probably be invalid in your environment).
I've tested it in both NoSkE and our KonText with the same results.
The simplest steps to replicate are as follows:
1) select net_v2_sample corpus
2) create a subcorpus (e.g. with the condition s.doc_type="blog")
3) prepare the wordlist function
3.1) use "find: tags" starting with "N"
3.2) select the subcorpus above
4) perform the calculation (press "Go")
- the UI should report that it needs to prepare data
5) look at the results and compare the frequency of each item with the number of hits in the corresponding concordance for that item.
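
The comparison in step 5 can be sketched as a small script. This is a minimal sketch, assuming the word list frequencies and the concordance hit counts have already been collected by hand into two dictionaries; the function name `find_mismatches`, the item labels and all numbers are hypothetical and only serve the illustration.

```python
def find_mismatches(wordlist_freqs, concordance_hits):
    """Return {item: (wordlist_freq, hits, ratio)} for items whose word list
    frequency differs from the number of concordance hits."""
    mismatches = {}
    for item, freq in wordlist_freqs.items():
        hits = concordance_hits.get(item)
        if not hits:
            continue  # no concordance count to compare against
        if freq != hits:
            mismatches[item] = (freq, hits, freq / hits)
    return mismatches


# Placeholder item labels and made-up counts (not real corpus data).
wordlist_freqs = {"N1": 240, "N2": 150, "N3": 90}
concordance_hits = {"N1": 120, "N2": 100, "N3": 90}

report = find_mismatches(wordlist_freqs, concordance_hits)
for item, (freq, hits, ratio) in sorted(report.items()):
    print(f"{item}: wordlist={freq} concordance={hits} ratio={ratio:.2f}")
```

Items with a ratio of 1 are not listed; anything above 1 points to an inflated word list frequency for that item.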

If you have any questions, please let me know.

Best regards,
Tomas Machalek

On Tue, Jun 4, 2024 at 11:18 AM Tomáš Machálek <tomas.m...@gmail.com> wrote:
Thanks for the information. I will prepare a small corpus and send it to you (or a download link). Hopefully I can do this by the end of the week.

Thank you,
Tomas Machalek

0001-MULTIVALUE-don-t-expand-vaules-with-no-MULTISEP-twic.patch