Dear Ondrej,

I used this command:

nohup docker run --rm --mount type=bind,src=/home/bolfn/corpus/NoSketch-Engine-Docker/corpora,dst=/corpora eltedh/nosketch-engine:latest "compilecorp --no-ske --recompile-corpus final" &

How can I tell whether this process has finished? And is this command okay?

Best,
Nándor

Ondřej Herman <ondrej...@sketchengine.eu> wrote (on Mon, May 29, 2023, 21:30):

Dear Nándor,

I think it is possible that you accidentally executed compilecorp several times concurrently, or something similar:

-rw-r--r-- 1 root root 388M máj 28 22:01 word.arf
-rw-r--r-- 1 root root 92G máj 29 18:39 word.text

The word.arf file is calculated from the information contained in word.text, but the timestamps are not in the correct order.

You can try to run just

compilecorp CORPNAME

and see what happens. It might even finish properly, but I'm afraid the existing files might already be corrupt if stale files were present AND the --recompile-corpus argument wasn't used. So you might need to recompile the corpus again, either deleting the data directory or using compilecorp with the --recompile-corpus flag.

Some size decrease is normal, as some of the indices are created by joining partial results with redundant information; however, this does seem quite extreme. Sorry, I don't remember the usual numbers.

Best,
Ondrej

On Mon, May 29, 2023 at 9:14 PM Nándor Bolf <nando...@gmail.com> wrote:

Dear Ondrej,

Thanks for the quick response. I'd like to be able to search by two file categories: hu and nem_hu. It worked on a smaller corpus, but not on the current one.

An interesting thing about the index directory is that its size decreased from 1 TB to 345 GB by the end of the compilation process (maybe this is useful information).

Here is its listing:

-rw-r--r-- 1 root root 726M máj 29 18:36 div.rng
-rw-r--r-- 1 root root 38 máj 29 18:36 doc.file.fsa
-rw-r--r-- 1 root root 10 máj 29 18:36 doc.file.lex
-rw-r--r-- 1 root root 8 máj 29 18:36 doc.file.lex.idx
-rw-r--r-- 1 root root 8 máj 29 18:36 doc.file.lex.isrt
-rw-r--r-- 1 root root 8 máj 29 18:36 doc.file.lex.srt
-rw-r--r-- 1 root root 2,4K máj 29 18:36 doc.file.rev
-rw-r--r-- 1 root root 8 máj 29 18:36 doc.file.rev.idx0
-rw-r--r-- 1 root root 11 máj 29 18:36 doc.file.rev.idx1
-rw-r--r-- 1 root root 76K máj 29 18:36 doc.file.text
-rw-r--r-- 1 root root 16 máj 29 18:36 doc.file.token
-rw-r--r-- 1 root root 301K máj 29 18:36 doc.rng
-rw-r--r-- 1 root root 108G máj 29 18:36 g.rng
-rw-r--r-- 1 root root 351M máj 29 21:03 lc.lex.idx.tmp
-rw-r--r-- 1 root root 388M máj 29 21:04 lc.lex.ridx
-rw-r--r-- 1 root root 151M máj 29 21:05 lc.lex.srt.tmp
-rw-r--r-- 1 root root 1,4G máj 29 21:03 lc.lex.tmp
-rw-r--r-- 1 root root 460M máj 29 21:04 lc.rev
-rw-r--r-- 1 root root 351M máj 29 21:04 lc.rev.cnt
-rw-r--r-- 1 root root 0 máj 29 21:03 lc.rev.cnt64
-rw-r--r-- 1 root root 351M máj 29 21:04 lc.rev.idx
drwxr-xr-x 2 root root 96 máj 28 22:01 log
-rw-r--r-- 1 root root 19G máj 29 18:36 p.rng
-rw-r--r-- 1 root root 41G máj 29 18:36 s.rng
-rw-r--r-- 1 root root 388M máj 28 22:01 word.arf
-rw-r--r-- 1 root root 1,1G máj 29 18:39 word.fsa
-rw-r--r-- 1 root root 1,5G máj 29 18:39 word.lex
-rw-r--r-- 1 root root 388M máj 29 18:39 word.lex.idx
-rw-r--r-- 1 root root 388M máj 29 18:39 word.lex.isrt
-rw-r--r-- 1 root root 388M máj 29 18:39 word.lex.srt
-rw-r--r-- 1 root root 5,1M máj 29 21:00 word.regex.fsa
-rw-r--r-- 1 root root 5,0M máj 29 21:00 word.regex.lex
-rw-r--r-- 1 root root 4,2M máj 29 21:00 word.regex.lex.idx
-rw-r--r-- 1 root root 4,2M máj 29 21:00 word.regex.lex.isrt
-rw-r--r-- 1 root root 4,2M máj 29 21:00 word.regex.lex.srt
-rw-r--r-- 1 root root 4,0G máj 29 21:00 word.regex.rev
-rw-r--r-- 1 root root 4,2M máj 29 21:00 word.regex.rev.cnt
-rw-r--r-- 1 root root 0 máj 29 20:56 word.regex.rev.cnt64
-rw-r--r-- 1 root root 4,2M máj 29 21:00 word.regex.rev.idx
-rw-r--r-- 1 root root 84G máj 29 20:31 word.rev
-rw-r--r-- 1 root root 6,1M máj 29 20:31 word.rev.idx0
-rw-r--r-- 1 root root 169M máj 29 20:31 word.rev.idx1
-rw-r--r-- 1 root root 92G máj 29 18:39 word.text
-rw-r--r-- 1 root root 1,3G máj 29 18:39 word.text.off
-rw-r--r-- 1 root root 166M máj 29 18:39 word.text.seg

Best,
Nándor
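
The timestamp check Ondrej describes (word.arf is computed from word.text, so it should not be the older of the two) can be scripted. This is only a minimal sketch, assuming a shell whose `[` supports the `-ot` ("older than") test, as bash and dash do. The helper name `is_stale` and the directory argument are hypothetical; the only dependency taken from the thread is word.arf on word.text.

```shell
#!/usr/bin/env bash
# is_stale DERIVED SOURCE
# Succeeds (exit status 0) when DERIVED is older than SOURCE, i.e. the
# derived file predates the data it is supposed to be computed from.
is_stale() {
    [ "$1" -ot "$2" ]
}

# Hypothetical location of the compiled corpus data; defaults to the
# current directory when no argument is given.
dir="${1:-.}"

# Per the thread, word.arf is calculated from word.text, so it should be newer.
if is_stale "$dir/word.arf" "$dir/word.text"; then
    echo "word.arf is older than word.text: possibly stale"
fi
```

Run against the listing above, this would flag word.arf (máj 28 22:01) as older than word.text (máj 29 18:39). A stale file is only a hint that something went wrong, not proof of corruption.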
Dear Ondrej,

The output of the docker ps command is:

CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS                                     NAMES
e37a9bc63ac5   eltedh/nosketch-engine:latest   "/usr/local/bin/entr…"   42 hours ago   Up 42 hours   80/tcp                                    adoring_kowalevski
2c7e9c120e60   eltedh/nosketch-engine:latest   "/usr/local/bin/entr…"   42 hours ago   Up 42 hours   0.0.0.0:10070->80/tcp, :::10070->80/tcp   noske

Do you think I should rerun the corpus compilation command now? From the output above I assume that no corpus compilation is running, but please tell me if I'm wrong. I'm not sure whether I started multiple processes at once, but I can try again.

(It seemed to me that the corpus was too big and that this is why it crashed, but maybe that's not true.)

And did you see the error log in my first message? Do you think it was caused by multiple processes started at once?

Best,
Nándor
Dear Ondrej,

Is it possible to print the full COMMAND column? If not, do you suggest recompiling the corpus to see whether the error happens again?

Best,
Nándor
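
On the COMMAND column: `docker ps` truncates it by default, and its `--no-trunc` flag prints it in full. Below is a minimal sketch for checking whether a compilecorp container is still running. It assumes that the full command string of such a container contains the word `compilecorp`, as in the `docker run` line earlier in the thread; the helper name `has_compilecorp` is hypothetical.

```shell
#!/usr/bin/env bash
# has_compilecorp: reads `docker ps` output on stdin and succeeds
# (exit status 0) if any line mentions compilecorp.
has_compilecorp() {
    grep -q 'compilecorp'
}

# Query Docker only when the docker client is installed.
if command -v docker >/dev/null 2>&1; then
    # --no-trunc prints the untruncated COMMAND; --format limits the
    # output to the container ID and its command.
    if docker ps --no-trunc --format '{{.ID}} {{.Command}}' | has_compilecorp; then
        echo "a compilecorp container is still running"
    else
        echo "no running compilecorp container found"
    fi
fi
```

By default `docker ps` lists only running containers, and a container started with `--rm` is removed as soon as it exits, so a finished compile run would no longer appear in this output.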