
Compiling corpora with multiple threads using compilecorp


Арсеній Максимович Лукашевський

Oct 18, 2024, 12:46:14 PM
to no...@sketchengine.co.uk
Dear NoSketch Engine team,
I hope you're doing well. I am working with the General Regionally Annotated Corpus of Ukrainian, and I have a question about compiling corpora (manatee-open-2.223.6).
Are there any particular considerations when compiling corpora with compilecorp using many threads? I've noticed that compilecorp occasionally loses part of the corpus during the process. While recompiling sometimes solves this issue, there might be better approaches. Currently, I'm using GNU Parallel for this task.
Any insights or best practices you can share would be greatly appreciated.
Thank you for your time.
Best regards, 
Arsenij

Vlasta Ohlídalová

Oct 29, 2024, 7:43:12 AM
to NoSketch Engine
Hello, 

Just for the record, I am sending the main part of Vojta's answer here: 

I would need to know a few more details about how exactly you run compilecorp (e.g. do you divide the corpus into parts and then put it back together? how exactly?) -- but generally I believe that running compilecorp in parallel is *not* safe, apart from the options that are already built into compilecorp itself, namely
 
--parallel=N
 
(see compilecorp --help). This runs the corpus compilation, plus the compilation of frequency statistics, in parallel and it is the recommended way. 
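
As a rough illustration of the recommendation above, here is a minimal Python sketch that builds a single compilecorp invocation using the built-in --parallel=N option, instead of launching several compilecorp processes side by side. The helper name build_compilecorp_command, the corpus name "my_corpus", and the placement of the corpus name as the last argument are assumptions made for illustration; please check compilecorp --help for the exact usage on your installation.

```python
import shlex


def build_compilecorp_command(corpus_name, workers):
    """Build the argument list for one compilecorp run.

    Hypothetical helper: only the --parallel=N option comes from the
    answer above; passing the corpus name as the final positional
    argument is an assumption to verify with `compilecorp --help`.
    """
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    return ["compilecorp", f"--parallel={workers}", corpus_name]


if __name__ == "__main__":
    # "my_corpus" is a placeholder corpus name, not a real configuration.
    cmd = build_compilecorp_command("my_corpus", 4)
    # Print the command rather than running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

The sketch only prints the command; passing the same list to subprocess.run(cmd, check=True) would execute it as one process and leave the parallelism to compilecorp itself.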

Best regards,
Vlasta Ohlídalová
