Dear all,
we are running both a local SKE installation and a NoSke instance for hosting our newspaper corpus of 10 billion tokens.
For a long time we have been observing severe performance issues with the "Text type frequency distribution".
When I say "severe" I basically mean: it becomes unusable for higher-frequency words.
Frequency distributions are a major use case for researchers: they look up a word and inspect its distribution over time or its distribution according to geographic region.
Year and region are encoded as structure attributes of the <doc> element, where a <doc> corresponds to a single newspaper article; we have 45 million <doc>s in our corpus.
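For illustration, a few lines of our vertical input look roughly like this (the attribute values are made up, and the word/lemma/pos column layout is just an example of our token attributes):

```
<doc id="a1" year="2020" region="Vienna">
Das	der	ART
Haus	Haus	NN
</doc>
```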
When such a frequency analysis is run on a high(er)-frequency word - e.g. "Haus" with 3 million hits - it basically becomes unusable: processing is stopped by a timeout after 15 minutes!
I mentioned this issue a long time ago in direct emails to the SKE helpdesk and was informed then that you were aware of it and that - in the long run - you were thinking of "parallelizing" some processes in the SKE as a solution.
So I got the impression that you regarded this performance issue not as a "bug" but as a (not so well performing) "feature" which could (only) be improved by throwing more computing power at it?
But rethinking the issue, I am starting to wonder whether I understood you correctly, and whether we really have to live with this sorry situation.
As far as I understand, it is the performance of the lookup of structure attributes for a given token position which poses the problem. And I do not understand why nothing can be done to speed this up.
Running a "normal" frequency distribution - i.e. one based on a token attribute (e.g. pos) - for the same high-frequency words takes between 10 and 20 seconds!
So my obvious next attempt will be to change my verticals and "simply" bring all the interesting metadata from the <doc> level down to the token level, i.e. encode "year" or "region" (also) as token attributes. This will bloat the size of my indices, but it should remedy my immediate performance issues, right?
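The rewriting of the verticals itself should be simple; here is a minimal Python sketch of what I have in mind (the attribute names "year"/"region" and the tab-separated token layout are of course specific to my setup, not anything prescribed by the SKE):

```python
# Sketch: copy selected <doc> structure attributes down to every token
# line as extra tab-separated columns, so they can be indexed as token
# attributes. Intended usage: promote.py < in.vert > out.vert
import re

DOC_RE = re.compile(r'<doc\b([^>]*)>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
PROMOTE = ["year", "region"]   # structure attributes to push down (my choice)

def promote(lines):
    """Yield vertical lines with the current <doc>'s attrs appended to tokens."""
    current = {a: "" for a in PROMOTE}
    for line in lines:
        line = line.rstrip("\n")
        m = DOC_RE.match(line)
        if m:
            # new <doc>: remember its attribute values for the tokens inside
            attrs = dict(ATTR_RE.findall(m.group(1)))
            current = {a: attrs.get(a, "") for a in PROMOTE}
            yield line
        elif line.startswith("<"):
            # other structure tags (</doc>, <s>, ...) pass through unchanged
            yield line
        else:
            # token line: append the promoted attributes as extra columns
            yield line + "\t" + "\t".join(current[a] for a in PROMOTE)
```

The extra columns would then simply be declared as additional ATTRIBUTEs in the corpus registry.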
But if this works, then I wonder whether this sort of "workaround" could be integrated into NoSke automatically.
I.e. why not offer - say - an option in encodevert that automatically turns (selected?) structure attributes into token attributes? E.g. a <doc year="2020"/> becomes (as well) a token attribute "doc.year" for all tokens in that <doc/>?
Sorry for this overly verbose post - but please view it as a mixture of:
* a bug/feature clarification (did I correctly understand that this is a commonly known problem, and not just a misconfiguration on my side?)
* my idea for a local solution (by modifying my verticals)
* a proposal to possibly integrate this solution into encodevert et al.
Curiously :-)
Hannes
--