lsslex - functionality via API?

H Pirker

Jul 20, 2021, 11:47:42 AM
to NoSketch Engine
We frequently require the number of tokens per structure-attribute value for performing normalisation.
E.g. we have an attribute doc.year and we need the number of all tokens per year in order to transform absolute hits per year into the "classical" hits-per-million counts.

Currently we collect this information via an API call using wordlist&wlattr=doc.year
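
Concretely, we do something like the following (just a rough sketch -- the base URL, the corpus name and the JSON field name below are placeholders, not our actual setup):

import requests

BASE = "https://example.org/bonito/run.cgi"   # placeholder endpoint

resp = requests.get(BASE + "/wordlist", params={
    "corpname": "our_corpus",   # placeholder corpus name
    "wlattr": "doc.year",
    "format": "json",
})
resp.raise_for_status()
# "Items" is a guess at the field name; inspect the JSON your server returns
for item in resp.json().get("Items", []):
    print(item)   # one entry per doc.year value with its frequency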

But for our 11-billion-token corpus this takes minutes(!) AND is quite RAM-intensive.

Is there a more economical/clever way to get these numbers?

* I am aware of lsslex and I am under the impression that it is much faster: is there a way to get lsslex-like performance via the API?
* On the other hand, these "counts of tokens per text-type value" are such a fundamental necessity for performing normalisations that I wonder why they are not already computed just once at compile time and then made available via the API in "no time". Maybe they are, and I am just missing some information?
* If they are not: are there plans to do so -- or at least to add lsslex and lsclex as API methods?

yours hopefully :-) 

Hannes

Ondřej Herman

Aug 3, 2021, 6:24:06 AM
to H Pirker, NoSketch Engine
Dear Hannes,

We do precompute the token coverage and store it in the STRUCTURE.ATTRIBUTE.token files, or in STRUCTURE.ATTRIBUTE.norm files, generated using the mknorms script.

Current manatee stores raw token counts in the .token files, while for the .norm files only words are considered (a word is a token _not_ matching NONWORDRE from the corpus configuration, which is [^[:alpha:]].* by default). Older versions use .norm files only, and it is not obvious whether those contain token or word counts.
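
To illustrate the word/token distinction, it roughly corresponds to the following Python sketch (the translation of the POSIX [:alpha:] class into Python's re syntax is approximate):

import re

# Approximation of the default NONWORDRE, [^[:alpha:]].* :
# it matches any token whose first character is not a letter.
NONWORDRE = re.compile(r"[\W\d_].*")

def is_word(token):
    # a "word" is a token that does NOT match NONWORDRE
    return NONWORDRE.fullmatch(token) is None

print(is_word("year"))   # True  -> counted as a word
print(is_word("2021"))   # False -> counted as a token only
print(is_word(","))      # False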

The counts are calculated by mknorms: a .token file is generated by setting NORM_STRUCTATTR to "-", while a .norm file is generated when NORM_STRUCTATTR is set to the name of an attribute of the structure which contains the word count of each structure occurrence encoded as a string. We generate that attribute through compilecorp and addwcattr. The files themselves store the counts as 8-byte little-endian integers, one for each structure attribute id.
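
If you want to read these files directly, something like the following Python sketch should work (the path is only an example, and mapping positions in the file to attribute value strings has to be done separately, e.g. via the structure attribute's lexicon in manatee):

import struct

# example path; the real location is under the corpus data directory
path = "/corpora/data/mycorp/doc.year.token"   # or doc.year.norm

counts = []
with open(path, "rb") as f:
    while True:
        chunk = f.read(8)
        if len(chunk) < 8:
            break
        # one 8-byte little-endian integer per structure attribute value id
        counts.append(struct.unpack("<q", chunk)[0])

print(len(counts), "values,", sum(counts), "tokens in total")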

You can get the precomputed values through the API by setting the wlsort parameter of the wordlist endpoint to "token:l" for token counts per structure attribute value or to "norm:l" for word counts per structure attribute value, so

wordlist&wlattr=doc.year&wlsort=token:l

or

wordlist&wlattr=doc.year&wlsort=norm:l

in your case. You can access the same information from the "text type analysis" screen accessible from the corpus info page in the Web interface.

The wlnums parameter can also be used in the same way to retrieve these quantities at the same time, alongside the "primary" wlsort value.
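
Put together, a request like the following should return the precomputed counts without scanning the whole corpus (a sketch only; the base URL, corpus name and response field names depend on your installation):

import requests

BASE = "https://example.org/bonito/run.cgi"   # placeholder endpoint

resp = requests.get(BASE + "/wordlist", params={
    "corpname": "mycorp",        # placeholder corpus name
    "wlattr": "doc.year",
    "wlsort": "token:l",         # precomputed token counts per value
    "wlnums": "norm:l",          # word counts returned alongside
    "format": "json",
})
resp.raise_for_status()
print(resp.json())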

When you call the wordlist endpoint without specifying this, the counts are calculated online in a generic (and seek-intensive) way across all positions in the corpus, so the calculation is not cheap.

Best regards,

Ondrej
