performance issues when querying parallel corpora

Tomáš Machálek

Nov 25, 2020, 5:28:06 AM

to NoSketch Engine

Hello,

I have a question regarding querying parallel corpora. There seems to be a huge performance drop when a query is passed only to the primary corpus. Once you enter anything for the second corpus, even quite a general pattern such as [word=".+"] or <s />, the calculation time shortens by an order of magnitude. I can understand that it takes longer to collect a large result set than a smaller one (i.e. one produced by a more specific search pattern). But the examples above are basically "match all" patterns, and the difference in processing times is enormous.

Is this a known issue and if so, is there a chance to resolve this in the future?

Thank you,

Tomas Machalek
(Institute of the Czech National Corpus)

Miloš Jakubíček

Nov 25, 2020, 12:34:36 PM

to Tomáš Machálek, NoSketch Engine

Hi Tomáš,

that's weird -- no idea why this should be happening.

Is the corpus somewhere where we can have a look at it?

Best

Milos

Milos Jakubicek

CEO, Lexical Computing

Brno, CZ | Brighton UK

http://www.lexicalcomputing.com

http://www.sketchengine.co.uk

Tomáš Machálek

Nov 26, 2020, 6:57:54 AM

to NoSketch Engine, Miloš Jakubíček, NoSketch Engine, Tomáš Machálek

I've been analysing the problem further and it looks like there were two issues, neither of them Manatee-related.
So first of all, sorry for the false alarm.

The thing that confused me was that I hadn't noticed that the queries produced by NoSkE (and also by our KonText) differ considerably depending on whether the second corpus query is filled in or not.

In the former case (second corpus query filled in), the query is a single-item list based on the "within other_corp:[...]" pattern. Such a query is run
asynchronously (conclib.py, get_conc function), so the first page is typically delivered quickly.

In the latter case (second corpus query left empty), the query is a two-item list where the first item is the primary query and the second one is "Xother_corp".
In this case, asynchronous processing is turned off, which means the client must wait until everything is done.
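
To make the difference concrete, here is a minimal sketch of the two query shapes. It is not the actual conclib.py code: build_query and runs_async are hypothetical names, the example patterns are placeholders, and the rule "only single-item lists run asynchronously" is an assumption used purely for illustration.

```python
from typing import List, Optional


def build_query(primary: str, other_corp: str, other_query: Optional[str]) -> List[str]:
    """Hypothetical helper mimicking the two query shapes described above."""
    if other_query:
        # Second corpus query filled in: one item using the "within" pattern.
        return [f"{primary} within {other_corp}:{other_query}"]
    # Second corpus query left empty: primary query plus a separate second item.
    return [primary, f"X{other_corp}"]


def runs_async(query: List[str]) -> bool:
    """Assumption for illustration only: single-item queries run asynchronously."""
    return len(query) == 1


# Placeholder patterns; "other_corp" stands for the aligned corpus name.
filled = build_query('[word="dog"]', "other_corp", '[word=".+"]')
empty = build_query('[word="dog"]', "other_corp", None)

print(filled, runs_async(filled))
print(empty, runs_async(empty))
```

Under these assumptions the filled-in variant can deliver its first page early, while the empty variant blocks until the whole result is computed, which would match the timings I saw.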

The second problem, which confused me further, is probably data-related. One of our parallel corpora seems to perform
poorly for its size. But this is most likely a problem on our side rather than in Manatee.

Best regards,
Tomas
