Dear all,
we are running both a local SKE installation and a NoSke instance for hosting our newspaper corpus of 10 billion tokens.
For a long time we have been observing severe performance issues with the "Text type frequency distribution".
When I say "severe" I basically mean: it becomes unusable for higher-frequency words.
Frequency distributions are a major use case for researchers: they look up a word and inspect its distribution over time or its distribution according to geographic region.
Year and region are encoded as structure attributes of the <doc> element, where a <doc> corresponds to a single newspaper article; we have 45 million <doc>s in our corpus.
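For illustration, a few lines of our vertical input look roughly like this (the attribute values are made up, and the word/lemma/pos column layout is just an example of our token attributes):

```
<doc id="a1" year="2020" region="Vienna">
Das	der	ART
Haus	Haus	NN
</doc>
```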
When such a frequency analysis is run on a high(er)-frequency word - e.g. "Haus" with 3 million hits - it basically becomes unusable: processing is stopped by a timeout after 15 minutes!
I mentioned this issue a long time ago in direct emails to the SKE helpdesk and was informed then that you were aware of it and that - in the long run - you were thinking of "parallelizing" some processes in the SKE as a solution.
So I got the impression that you regarded this performance issue not as a "bug" but as a (not so well performing) "feature" which could (only) be improved by throwing more computing power at it?
But rethinking the issue, I am starting to wonder whether I understood you correctly, and whether we really have to live with this sorry situation.
As far as I understand, it is the performance of the lookup of structure attributes for a given token position which poses the problem. And I do not understand why nothing can be done to speed this up.
Running a "normal" frequency distribution - i.e. one based on a token attribute (e.g. pos) - for the same high-frequency words takes between 10 and 20 seconds!
So my obvious next attempt will be to change my verticals and "simply" bring all the interesting metadata from the <doc> level down to the token level, i.e. encode "year" or "region" (also) as token attributes. This will bloat the size of my indices, but it should remedy my immediate performance issues, right?
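The rewriting of the verticals itself should be simple; here is a minimal Python sketch of what I have in mind (the attribute names "year"/"region" and the tab-separated token layout are of course specific to my setup, not anything prescribed by the SKE):

```python
# Sketch: copy selected <doc> structure attributes down to every token
# line as extra tab-separated columns, so they can be indexed as token
# attributes. Intended usage: promote.py < in.vert > out.vert
import re

DOC_RE = re.compile(r'<doc\b([^>]*)>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
PROMOTE = ["year", "region"]   # structure attributes to push down (my choice)

def promote(lines):
    """Yield vertical lines with the current <doc>'s attrs appended to tokens."""
    current = {a: "" for a in PROMOTE}
    for line in lines:
        line = line.rstrip("\n")
        m = DOC_RE.match(line)
        if m:
            # new <doc>: remember its attribute values for the tokens inside
            attrs = dict(ATTR_RE.findall(m.group(1)))
            current = {a: attrs.get(a, "") for a in PROMOTE}
            yield line
        elif line.startswith("<"):
            # other structure tags (</doc>, <s>, ...) pass through unchanged
            yield line
        else:
            # token line: append the promoted attributes as extra columns
            yield line + "\t" + "\t".join(current[a] for a in PROMOTE)
```

The extra columns would then simply be declared as additional ATTRIBUTEs in the corpus registry.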
But if this works, then I wonder whether this sort of "workaround" could be integrated into NoSke automatically.
I.e. why not offer - say - an option in encodevert that automatically turns (selected?) structure attributes into token attributes? E.g. a <doc year="2020"/> becomes (as well) a token attribute "doc.year" for all tokens in that <doc/>?
Sorry for this overly verbose post - but please view it as a mixture of:
* a bug/feature clarification (did I correctly understand that this is a commonly known problem, and not just a misconfiguration on my side?)
* my idea for a local solution (by modifying my verticals)
* a proposal to possibly integrate this solution into encodevert et al.
Curiously :-)
Hannes
--