lsslex - functionality via API?

H Pirker

Jul 20, 2021, 11:47:42 AM
to NoSketch Engine
We frequently require the number of tokens per structure-attribute value for performing normalisation.
E.g. we have an attribute doc.year and we need the number of all tokens per year in order to transform absolute hits per year into the "classical" hits-per-million counts.

Currently we collect this information via an API call using wordlist&wlattr=doc.year
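
Concretely, we do something like the following (just a rough sketch -- the base URL, the corpus name and the JSON field name below are placeholders, not our actual setup):

import requests

BASE = "https://example.org/bonito/run.cgi"   # placeholder endpoint

resp = requests.get(BASE + "/wordlist", params={
    "corpname": "our_corpus",   # placeholder corpus name
    "wlattr": "doc.year",
    "format": "json",
})
resp.raise_for_status()
# "Items" is a guess at the field name; inspect the JSON your server returns
for item in resp.json().get("Items", []):
    print(item)   # one entry per doc.year value with its frequency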

But for our 11-billion-token corpus this takes minutes(!) AND is quite RAM-intensive.

Is there a more economical/clever way to get these numbers?

* I am aware of lsslex and I am under the impression that it is much faster: is there a way to get lsslex-like performance via the API?
* On the other hand, these "counts of tokens per text-type value" are such a fundamental necessity for performing normalisations that I wonder why they are not already computed just once at compile time and then made available via the API in "no time". Maybe they are, and I am just missing some information?
* If they are not: are there plans to do so -- or at least to add lsslex and lsclex as API methods?

yours hopefully :-) 

Hannes

Ondřej Herman

Aug 3, 2021, 6:24:06 AM
to H Pirker, NoSketch Engine
Dear Hannes,

We do precompute the token coverage and store it in the STRUCTURE.ATTRIBUTE.token files, or in STRUCTURE.ATTRIBUTE.norm files, generated using the mknorms script.

Current manatee stores raw token counts in the .token files, while for the .norm files only words are considered (a word is a token _not_ matching NONWORDRE from the corpus configuration, which is [^[:alpha:]].* by default). Older versions use .norm files only, and it is not obvious whether those contain token or word counts.
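
To illustrate the word/token distinction, it roughly corresponds to the following Python sketch (the translation of the POSIX [:alpha:] class into Python's re syntax is approximate):

import re

# Approximation of the default NONWORDRE, [^[:alpha:]].* :
# it matches any token whose first character is not a letter.
NONWORDRE = re.compile(r"[\W\d_].*")

def is_word(token):
    # a "word" is a token that does NOT match NONWORDRE
    return NONWORDRE.fullmatch(token) is None

print(is_word("year"))   # True  -> counted as a word
print(is_word("2021"))   # False -> counted as a token only
print(is_word(","))      # False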

The counts are calculated by mknorms: a .token file is generated by setting NORM_STRUCTATTR to "-", while a .norm file is generated when NORM_STRUCTATTR is set to the name of an attribute of the structure which contains the word count of each structure occurrence encoded as a string. We generate that attribute through compilecorp and addwcattr. The files themselves store the counts as 8-byte little-endian integers, one for each structure attribute id.
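
If you want to read these files directly, something like the following Python sketch should work (the path is only an example, and mapping positions in the file to attribute value strings has to be done separately, e.g. via the structure attribute's lexicon in manatee):

import struct

# example path; the real location is under the corpus data directory
path = "/corpora/data/mycorp/doc.year.token"   # or doc.year.norm

counts = []
with open(path, "rb") as f:
    while True:
        chunk = f.read(8)
        if len(chunk) < 8:
            break
        # one 8-byte little-endian integer per structure attribute value id
        counts.append(struct.unpack("<q", chunk)[0])

print(len(counts), "values,", sum(counts), "tokens in total")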

You can get the precomputed values through the API by setting the wlsort parameter of the wordlist endpoint to "token:l" for token counts per structure attribute value or to "norm:l" for word counts per structure attribute value, so

wordlist&wlattr=doc.year&wlsort=token:l

or

wordlist&wlattr=doc.year&wlsort=norm:l

in your case. You can access the same information from the "text type analysis" screen accessible from the corpus info page in the Web interface.

The wlnums parameter can also be used in the same way to retrieve these quantities at the same time, alongside the "primary" wlsort value.
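
Put together, a request like the following should return the precomputed counts without scanning the whole corpus (a sketch only; the base URL, corpus name and response field names depend on your installation):

import requests

BASE = "https://example.org/bonito/run.cgi"   # placeholder endpoint

resp = requests.get(BASE + "/wordlist", params={
    "corpname": "mycorp",        # placeholder corpus name
    "wlattr": "doc.year",
    "wlsort": "token:l",         # precomputed token counts per value
    "wlnums": "norm:l",          # word counts returned alongside
    "format": "json",
})
resp.raise_for_status()
print(resp.json())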

When you call the wordlist endpoint without specifying this, the counts are calculated online in a generic (and seek-intensive) way across all positions in the corpus, so the calculation is not cheap.

Best regards,

Ondrej
