Text Type Analysis: problem with counting tokens

Pahor de Maiti, Kristina

Nov 27, 2024, 8:59:17 AM

to NoSketch Engine

Dear noSkE team,

I ran into a problem while trying to compute the number of tokens for three built-in subcorpora in the ParlMint-en corpus.

There are 3 built-in subcorpora, as shown here. By clicking the three dots, however, I only get info for two (probably because the third is much smaller). I understand that in this case I could just sum the two and subtract the result from the whole, but I wanted to find a way to show it in the concordancer (since I'll also need this info for other text types). However, I am unable to count the tokens; all I can get is the number of speeches per built-in subcorpus through Analyze multiple text types. I tried searching (through the concordance option) for all tokens in each of the subcorpora, but since there are more than 10 million, I cannot get the number of tokens per subcorpus that way either.

Is there a workaround for this issue?

Best regards,

Kristina

Michal Cukr | Sketch Engine Support

Dec 5, 2024, 3:12:01 AM

to no...@sketchengine.co.uk, kristina.p...@ff.uni-lj.si

Dear Kristina,

Please first pay attention to the terminology. You speak of subcorpora, which confused me at first: as far as I can see, this corpus contains no subcorpora in the sense in which we use the term in Sketch Engine. Am I correct? https://www.sketchengine.eu/glossary/subcorpus/

However, you seem to mean metadata (text type), specifically speech.subcorpus. The difference between subcorpora themselves and text types can be seen when you count relative frequency. https://www.sketchengine.eu/glossary/freqmill/

Regarding the first issue - not showing the third subcorpus - we will need access to the log file of the corpus compilation. Also, the registry file and a sample of the source vertical file might be useful.

As for the latter question, you can switch between Structure frequency and Token coverage in the Text type analysis. Please find the attached screenshot. You can find this information in our documentation https://www.sketchengine.eu/guide/text-type-analysis/

I strongly recommend checking the Sketch Engine documentation first at https://www.sketchengine.eu/guide/. The noSkE group is better suited for technical issues or similar peculiarities. Standard questions about using the interface should be addressed using our website documentation or through the support channel sup...@sketchengine.eu which is available to all Sketch Engine users.

Thank you for your understanding.

Best regards,

Michal Cukr

--
Sketch Engine Team
Email: sup...@sketchengine.eu

Web: https://www.sketchengine.eu/guide/

YouTube tutorials: https://youtube.com/c/SketchEngine

Boot Camp Online – a course in mastering Sketch Engine https://www.sketchengine.eu/bootcamp/

text-type-analysis-token-coverage.png

Pahor de Maiti, Kristina

Dec 5, 2024, 8:40:19 AM

to Michal Cukr | Sketch Engine Support, no...@sketchengine.co.uk, Tomaz Erjavec

Dear Michal,

thank you for your answer. I apologize for not using the right terminology and thus making my message confusing. Please find the requested files at this link: https://nl.ijs.si/et/tmp/noske-kristina/

I was of course aware of the "Show" option that allows toggling between Structure and Token frequencies. However, because of the technical problem with the display of all text type values, I was wondering whether there is an alternative way to count the tokens for text type values that exceed 10k tokens (and for which I therefore cannot get this info through a simple search for all tokens). Such an alternative way of computing the number of tokens would also be helpful because the "Show" option to toggle token/structure frequency does not work if a subcorpus is selected (noSkE definition of subcorpus). Anyway, this last issue was already reported to your team in the summer and I believe it is in the pipeline. Sure, I'll seek advice from @sketchengine.eu in the future.
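One possible workaround, when the vertical file is at hand, is to count tokens per text type value outside the concordancer. The following is a minimal sketch, not an official NoSketch Engine tool. It assumes the usual vertical layout (one token per line, structures as XML-like tags on their own lines) and that each speech is opened by a `<speech ...>` tag carrying a `subcorpus` attribute. The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Any structure tag on its own line, e.g. <s>, </s>, <g/>.
# Caveat: a token that itself looks like a tag would be skipped.
STRUCT_TAG = re.compile(r"^</?[A-Za-z][^>]*>$")
# Opening <speech ...> tag; group 1 holds the attribute string.
SPEECH_OPEN = re.compile(r"^<speech(\s[^>]*)?>$")
SUBCORPUS = re.compile(r'\ssubcorpus="([^"]*)"')


def count_tokens_by_subcorpus(lines):
    """Count token lines per value of the speech 'subcorpus' attribute."""
    counts = Counter()
    current = None  # subcorpus value of the speech we are inside, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        opened = SPEECH_OPEN.match(line)
        if opened:
            attr = SUBCORPUS.search(opened.group(1) or "")
            current = attr.group(1) if attr else "(no value)"
        elif line == "</speech>":
            current = None
        elif STRUCT_TAG.match(line):
            continue  # other structures carry no tokens themselves
        elif current is not None:
            counts[current] += 1  # an ordinary token line
    return counts


# Hypothetical sample; the token columns are invented.
SAMPLE = [
    '<speech id="a" subcorpus="COVID,War">',
    "<s>",
    "We\tPRON",
    "talk\tVERB",
    "<g/>",
    ".\tPUNCT",
    "</s>",
    "</speech>",
    '<speech id="b" subcorpus="Reference">',
    "<s>",
    "Cows\tNOUN",
    "</s>",
    "</speech>",
]

for value, tokens in sorted(count_tokens_by_subcorpus(SAMPLE).items()):
    print(value, tokens)
```

For a real corpus the function can be fed an open file object instead of the sample list, since it only iterates over lines. The whole value "COVID,War" is kept as one key, which matches the non-multivalue reading of the attribute.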

Best regards,

Kristina

From: Michal Cukr | Sketch Engine Support <sup...@sketchengine.eu>
Sent: Thursday, December 5, 2024 09:11
To: no...@sketchengine.co.uk <no...@sketchengine.co.uk>; Pahor de Maiti, Kristina <kristina.p...@ff.uni-lj.si>
Subject: RE: Text Type Analysis: problem with counting tokens [Ticket#7813070]

Michal Cukr | Sketch Engine Support

Dec 13, 2024, 9:23:28 AM

to no...@sketchengine.co.uk, kristina.p...@ff.uni-lj.si, tomaz....@ijs.si

Dear Kristina,

Thank you for the provided files.

It appears the data on Clarin NoSkE doesn’t match the provided vertical sample. This may be due to using a different registry file during corpus compilation. The registry you provided lacks MULTIVALUE and MULTISEP options, yet the results suggest that these configurations were used.

Could you verify if a different registry was involved?

We compiled the sample using your provided file, and the “Covid,War” value in the speech subcorpus displayed correctly. Please see the attached screenshot.

Best regards,

Michal Cukr

screenshot-app_sketchengine_eu-2024_12_13-15_22_24.png

Tomaž Erjavec

Dec 14, 2024, 7:32:28 AM

to Michal Cukr | Sketch Engine Support, no...@sketchengine.co.uk, kristina.p...@ff.uni-lj.si

Dear Michal,

thanks for looking into this. As I was the one to prepare the files and mount them on the concordancers, I can try and take it from here:

I think that the registry and vertical files haven't changed since the compilation of the corpus, but I admit it can be confusing that the text (i.e. speech) type "subcorpus" can have the value "COVID,War" while the registry file does not specify that this is a multivalued attribute. However, this is on purpose: whatever is in the War subcorpus is also in the COVID subcorpus, so the possible values are "Reference", "COVID" and "COVID,War". This makes it easier to choose the subcorpus one wants than a multivalued attribute would.

We come now to the strange part: if we look at the Text type analysis of this corpus (as Kristina wrote), i.e. at
https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint41_xx_en&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.subcorpus

we get two values only:

So, the value "COVID,War" does not show up, even though it is in the vertical file for the corpus, as you saw in the sample.

Also, the structure frequency shown here is 7,650,267, while the complete corpus has 8,081,124 speeches.
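The size of the gap follows directly from the two figures above. As a quick check, assuming every speech carries exactly one speech.subcorpus value (the variable names are illustrative only):

```python
# Figures quoted in this message.
total_speeches = 8_081_124          # speeches in the complete corpus
shown_in_text_type_analysis = 7_650_267  # sum over the two displayed values

# Speeches not covered by any displayed value.
missing = total_speeches - shown_in_text_type_analysis
print(f"{missing:,} speeches are not covered by the two displayed values")
```

If the assumption holds, these 430,857 speeches would be the ones whose value is "COVID,War", although that is an inference from the numbers and not something the interface confirms.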

And, in fact, if I do a concordance, say "cow", i.e.
https://www.clarin.si/ske/#concordance?corpname=parlamint41_xx_en&tab=basic&keyword=cow&attrs=word&viewmode=kwic&attr_allpos=all&refs_up=0&shorten_refs=1&glue=1&gdexcnt=300&show_gdex_scores=0&itemsPerPage=20&structs=s%2Cg&refs=%3Dspeech.corpus%2C%3Dspeech.date&showresults=1&showTBL=0&tbl_template=&gdexconf=&f_tab=basic&f_showrelfrq=1&f_showperc=0&f_showreldens=0&f_showreltt=0&c_customrange=0&t_attr=&t_absfrq=0&t_trimempty=1&t_threshold=5&operations=%5B%7B%22name%22%3A%22iquery%22%2C%22arg%22%3A%22cow%22%2C%22query%22%3A%7B%22queryselector%22%3A%22iqueryrow%22%2C%22iquery%22%3A%22cow%22%7D%2C%22id%22%3A4701%7D%5D

and click to get the metadata of the first hit, I get

So, here we suddenly do have "COVID,War" as speech.subcorpus, even though it was not present in the Text type analysis page.

I don't see how this is possible unless something went wrong in the compilation, but I did check the log files and don't remember anything strange.

Or, of course, if there is a bug in noSkE.

If you have the patience for this, I've now put the complete vertical file on https://nl.ijs.si/et/tmp/noske-kristina/ so you have exactly the registry and vertical file used to produce
https://www.clarin.si/ske/#dashboard?corpname=parlamint41_xx_en

and you could see if it compiles correctly at your end, i.e. so that the Text type analysis would give 3 values rather than 2.

If it does then I guess we have to figure out here what went wrong. If it doesn't, then you do :)

And, of course, sorry for this complex bother!

Best,

Tomaž

Ondřej Herman

Dec 14, 2024, 9:19:55 AM

to Tomaž Erjavec, Michal Cukr | Sketch Engine Support, NoSketch Engine, Pahor de Maiti, Kristina

Hello Tomaž,

We fixed this issue in our codebase - the next release will contain the fix.

The problem is that values containing MULTISEP are always discarded, without checking whether MULTIVALUE is actually set.
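To illustrate the behaviour described, here is a toy model in Python. It is not the actual NoSketch Engine code: the function names are invented, the splitting in the multivalue branch is a simplification, and a comma is assumed as the default separator because that is what the symptoms in this thread suggest.

```python
from collections import Counter


def counts_with_bug(values, multisep=","):
    """Toy model of the reported bug: any value containing the
    separator is dropped, whether or not MULTIVALUE is set."""
    counts = Counter()
    for value in values:
        if multisep in value:
            continue  # value silently discarded
        counts[value] += 1
    return counts


def counts_fixed(values, multivalue=False, multisep=","):
    """Toy model of the intended behaviour: the separator only
    matters when the attribute is declared multivalued."""
    counts = Counter()
    for value in values:
        if multivalue:
            counts.update(value.split(multisep))  # simplified splitting
        else:
            counts[value] += 1  # value kept whole, separator ignored
    return counts


# Hypothetical attribute values, one per speech.
speeches = ["Reference", "COVID", "COVID,War", "COVID"]

print(dict(counts_with_bug(speeches)))
print(dict(counts_fixed(speeches)))
print(dict(counts_with_bug(speeches, multisep="÷")))
```

In this model the first call loses "COVID,War", the second keeps it, and the third shows why changing the separator to a character absent from the data sidesteps the problem even without the fix.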

In the meantime, you could set MULTISEP to a byte which never appears within your data.
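Before picking a replacement separator, it may be worth confirming that it really is absent from the attribute values. A minimal sketch of such a check, comparing UTF-8 byte sequences; the function name and the candidate list are assumptions for illustration, not part of any Sketch Engine tool:

```python
def safe_separators(values, candidates):
    """Return the candidates whose UTF-8 encoding occurs in none
    of the given attribute values."""
    encoded_values = [v.encode("utf-8") for v in values]
    return [
        c
        for c in candidates
        if not any(c.encode("utf-8") in v for v in encoded_values)
    ]


# Hypothetical list of distinct values of the attribute.
values = ["Reference", "COVID", "COVID,War"]
print(safe_separators(values, [",", "÷", "|"]))
```

Any candidate that survives the check can then be written into the registry as the MULTISEP value for that attribute.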

Sorry for the inconvenience.

Best regards,

Ondrej

Tomaž Erjavec

Dec 15, 2024, 5:36:23 AM

to Ondřej Herman, Michal Cukr | Sketch Engine Support, NoSketch Engine, Pahor de Maiti, Kristina

Hi Ondrej,

thanks, looking forward to the next release!

And, just to let you know, I now put

    ATTRIBUTE subcorpus {
      TYPE "MD_MGD"
      MULTISEP "÷"
    }

in the registry file, and, indeed, it fixes the problem, cf.
https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint41_xx_en&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.subcorpus