--
You received this message because you are subscribed to the Google Groups "NoSketch Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to noske+un...@sketchengine.co.uk.
To view this discussion on the web visit https://groups.google.com/a/sketchengine.co.uk/d/msgid/noske/8e0ae692-dbc6-45d6-aa72-49ad83a045e3n%40sketchengine.co.uk.
Vítek, is anything blocking the release of the Onion version that uses Google sparse? Cheers, M.
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK
Vítku,
thank you very much for reporting this!
Onion is using Google sparse hashset (https://github.com/sparsehash/sparsehash) instead of Judy now. I have just updated the page where you can find the most recent version 1.4: http://corpus.tools/wiki/Onion.
Slovak Academy of Sciences
Ľ. Štúr Institute of Linguistics
Panská 26, SK-81101 Bratislava
Tel +421-2-54431762 Fax -54431756
http://aranea.juls.savba.sk/guest/
https://www.facebook.com/araneawebcorpora/
I have just uploaded the right files.
I tested the new Onion last evening on a 4.3 Gigatoken corpus, making use of the fact that it had been deduplicated by the older version of Onion two days ago. Here are the results:
Time      | from     | to       | elapsed | seconds
Onion 1.4 | 19:27:25 | 23:01:45 | 3:34:20 |   12860
Onion 1.2 | 15:41:53 | 17:13:11 | 1:31:18 |    5478
ratio     |          |          |         |    2.35

RAM       | initial | total
Onion 1.4 |    4770 | 11641
Onion 1.2 |    3317 |  7975
ratio     |    1.44 |  1.46
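For anyone who wants to re-check the elapsed times and the ratio in the timing table, here is a minimal Python sketch. The helper name `elapsed_seconds` is made up for illustration, and it assumes each run started and finished on the same calendar day, as the reported times suggest.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS wall-clock times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Start and end times copied from the timing table above.
new = elapsed_seconds("19:27:25", "23:01:45")  # Onion 1.4
old = elapsed_seconds("15:41:53", "17:13:11")  # Onion 1.2

print(new, old, round(new / old, 2))  # prints: 12860 5478 2.35
```

The same `round(a / b, 2)` on the RAM figures gives the 1.44 and 1.46 reported in the second table.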
That's quite weird. Vitek, could it be caused by some other changes?
The code for using Google sparse is actually quite old (I did that in 2016). I remember (hopefully correctly) that it required more memory but was faster, not slower, than Judy. Which Judy version were you using? Could it be that the machine started swapping because of insufficient memory? What was the memory peak (grep VmPeak /proc/$PID/status)?
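The memory-peak check mentioned above can also be scripted. The following is only a sketch, assuming a Linux machine with /proc mounted; the function names `parse_status` and `memory_report` are hypothetical. Besides VmPeak (peak virtual memory), the status file also lists VmHWM (peak resident set size) and VmSwap (swapped-out memory), all in kB, which may help to tell whether swapping took place.

```python
def parse_status(text: str) -> dict:
    """Extract the Vm* fields (values in kB) from /proc/<pid>/status content."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Vm"):
            # The value looks like "   12345 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields


def memory_report(pid: int) -> dict:
    """Read the Vm* fields of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_status(fh.read())


# Illustrative input with made-up numbers, not measurements from this thread.
sample = "Name:\tonion\nVmPeak:\t    2048 kB\nVmHWM:\t    1024 kB\nVmSwap:\t       0 kB\n"
report = parse_status(sample)
print(report["VmPeak"], report["VmHWM"], report["VmSwap"])
```

Calling `memory_report(pid)` shortly before the Onion process finishes would give the peaks for the whole run; a non-zero VmSwap would suggest that swapping did occur.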
I doubt it could be caused by swapping -- no other activity was present on that machine at the same time...
I am going to perform another deduplication tonight and may
compare the results obtained by both versions in the morning. I have
noticed, however, that unlike 1.2, Onion 1.4 is compiled with
g++. Could that make the difference?
Best,
Vlado B, 21:55
Only if some build flags are erroneously omitted -- did you compile with -O2?
I simply invoked the standard make command. It looks like it is compiled with -O3 in both cases.
V, 22:25
Vítku,
my newest experiment gave results similar to yours (different machine, Slovak corpus, 6.59 Gigatoken before and 4.99 Gigatoken after document-level deduplication).
time                   | from     | to       | elapsed | seconds
Onion 1.4 (Sparsehash) |  8:57:07 | 11:32:10 | 2:35:03 |    9303
Onion 1.2 (Judy)       | 12:46:42 | 14:31:28 | 1:44:46 |    6286
ratio                  |          |          |         |    1.48

RAM (MB)               | initial | total
Onion 1.4 (Sparsehash) |    9518 | 18637
Onion 1.2 (Judy)       |   18540 | 23279
ratio                  |    0.51 |  0.80
I.e., the new Onion needed less memory at the price of processing time. I'll try to do more comparisons with new corpora to be processed soon.
Best,
Vlado B, 17:45