performance issues: "Text type frequency distribution"

H Pirker

Dec 18, 2020, 5:36:46 AM
to NoSketch Engine
Dear all,

we are running both a local SKE installation and a NoSke instance to host our newspaper corpus of 10 billion tokens.
For a long time we have been observing severe performance issues with "Text type frequency distribution".
By "severe" I basically mean: it becomes unusable for higher-frequency words.

Frequency distributions are a major use case for researchers: they look up a word and inspect its distribution over time or by geographic region.
Year and region are encoded as structure attributes of the <doc> element, where a <doc> corresponds to a single newspaper article, and we have 45 million <doc> elements in our corpus.

When such a frequency analysis is run on a high(er)-frequency word, e.g. "Haus" with 3 million hits, it basically becomes unavailable: processing is stopped by a timeout after 15 minutes!

I mentioned this issue a long time ago in direct emails to the SKE helpdesk, and I was told then that you were aware of it and that, in the long run, you were thinking of "parallelizing" some processes in SKE as a solution.

So I got the impression that you regarded this performance issue not as a "bug" but as a (not so well performing) "feature" which could (only) be improved by throwing more computing power at it?

But rethinking the issue, I start to wonder whether I understood you correctly, and whether we really have to live with this sorry situation.
As far as I understand, it is the lookup of structure attributes for a given token number that poses the problem. And I do not understand why nothing can be done to speed this up.

Making a "normal" frequency distribution -- i.e. one based on a token attribute (e.g. pos) -- for the same high-frequency words takes between 10 and 20 seconds!

So my obvious next attempt will be to change my verticals and "simply" bring all the interesting metadata from the <doc> level down to the token level. I.e. encode "year" or "region" (also) as token attributes. This will bloat the size of my indices, but should remedy my immediate performance issues, right?
 
But if this works, then I wonder whether this sort of "workaround" could be automatically integrated into NoSke.
I.e. why not offer, say, an option in encodevert to automatically turn (selected?) structure attributes into token attributes? E.g. a <doc year="2020"/> would (also) become a token attribute "doc.year" for all tokens in that <doc/>.
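
The workaround described above (copying <doc> attributes down to the token level) could be sketched as a preprocessing step over the vertical file before it is compiled. This is only a minimal sketch under stated assumptions: the function name push_down, the placeholder value "UNKNOWN" and the tab-separated layout are hypothetical, and it assumes that every line starting with "<" is a structure tag rather than a token. It is not an existing encodevert feature.

```python
import re

# Matches name="value" pairs inside a structure tag such as <doc year="2020">.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def push_down(lines, struct="doc", attrs=("year", "region"), missing="UNKNOWN"):
    """Yield vertical lines with selected structure attributes appended
    as extra tab-separated columns on every token line.

    Hypothetical helper: names, defaults and the placeholder value are
    illustrative assumptions, not part of any Manatee/NoSke tool.
    """
    open_re = re.compile(r"<%s(\s|>)" % re.escape(struct))
    close_tag = "</%s>" % struct
    current = {}  # attributes of the currently open structure
    for line in lines:
        line = line.rstrip("\n")
        if open_re.match(line):
            # Remember the attributes of the structure we just entered.
            current = dict(ATTR_RE.findall(line))
            yield line
        elif line.startswith(close_tag):
            current = {}
            yield line
        elif line.startswith("<"):
            # Assumption: any other line starting with "<" is a structure tag.
            yield line
        else:
            # Token line: append one column per requested attribute.
            extra = [current.get(a, missing) for a in attrs]
            yield "\t".join([line] + extra)
```

Applied to a <doc year="2020" region="east"> containing the token line "Haus<TAB>NN", this would emit "Haus<TAB>NN<TAB>2020<TAB>east". The new columns would also have to be declared as positional attributes in the corpus configuration, and, as noted above, they enlarge the indices.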

Sorry for this overly verbose post, but please view it as a mixture of
* a bug/feature clarification (did I understand correctly that this is a commonly known problem, and not just a misconfiguration on my side?)
* my idea for a local solution (modifying my verticals)
* a proposal to possibly integrate this solution into encodevert et al.
 
Curiously :-)

Hannes 
--

Miloš Jakubíček

Jan 3, 2021, 6:30:04 PM
to H Pirker, NoSketch Engine
Hi Hannes,

in principle yes, frequency distributions on large corpora are a standing issue that cannot be significantly improved without parallelization (both CPU and IO).
However, this particular discrepancy between positional and structure attributes can easily be resolved by changing the index type for the related structure (here "doc") from "file64" to "map64", which causes the index to be memory-mapped.
A simple comparison: a frequency distribution over the concordance for [lemma="Haus"] on doc.tld (top-level domains: .de, .at, etc.) on detenten13 takes 13 minutes with file64 and only about three seconds with map64 (all with a hot IO cache; a cold cache is of course considerably slower, but I assume that is not an issue that would bother you much).
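
To illustrate the general difference between the two access styles, here is a minimal Python sketch. It is not Manatee's implementation: the record layout (one 64-bit integer per position) and all function names are assumptions made for the example. It only shows that the same lookup can go through explicit seek/read calls on a file or through direct indexing into a memory-mapped view of it.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout for the illustration: one little-endian 64-bit integer per record.
REC = struct.Struct("<q")


def build_index(path, values):
    """Write the values as fixed-size records to a binary file."""
    with open(path, "wb") as f:
        for v in values:
            f.write(REC.pack(v))


def lookup_file(f, i):
    """File-style lookup: an explicit seek and read for every record."""
    f.seek(i * REC.size)
    return REC.unpack(f.read(REC.size))[0]


def lookup_mmap(mm, i):
    """Memory-mapped lookup: decode the record straight from the mapping."""
    return REC.unpack_from(mm, i * REC.size)[0]


def demo():
    """Run the same three lookups through both access styles."""
    values = [n * n for n in range(1000)]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        build_index(path, values)
        with open(path, "rb") as f:
            via_file = [lookup_file(f, i) for i in (0, 7, 999)]
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                via_mmap = [lookup_mmap(mm, i) for i in (0, 7, 999)]
            finally:
                mm.close()
    finally:
        os.remove(path)
    return via_file, via_mmap
```

Both variants return the same values; with millions of random lookups per query, the per-lookup overhead of the file-style path is what adds up, while the mapped variant is served from memory once the pages are cached.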

And yes, if you are now asking yourself why that is not the default, the answer is simply: it should be, and we ought to change it. The non-memory-mapped variant pretty much stopped making any sense on a 64-bit system (it was only required on a 32-bit system when the index was too big to be mapped).


Best,
Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK

