Corpus size limit

Nándor Bolf

May 29, 2023, 6:42:13 PM
to no...@sketchengine.co.uk
Dear support,

I would like to know whether there is any limit on the size of corpora. My source file is 520 GB. I used the same registry file to compile a much smaller corpus, which worked fine, but at this size the compilation fails. I would really appreciate your help.

I attach my registry file and the error log from the compilation.

Kind regards,
Nándor Bolf
final
compilecorp_2023-05-27_0615.log

Ondřej Herman

May 29, 2023, 7:02:41 PM
to Nándor Bolf, no...@sketchengine.co.uk
Dear Nándor,

Can you send me a listing of the files which are located in the /corpora/final/indexed directory, along with their sizes?

Best,

Ondrej

Ondřej Herman

May 29, 2023, 7:52:13 PM
to Nándor Bolf, NoSketch Engine
Dear Nándor,

Sorry, I'm not well versed in the usage of the container -- the command looks fine, but you still need to ensure that compilecorp is executed exactly once and wait until the command finishes.

Processes running inside containers are still visible to standard Linux process accounting, so I'd check there first. You can also use docker ps.
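
A minimal sketch of that process check, assuming a Linux host with pgrep from procps installed. The function name running_matching is made up for illustration:

```bash
# Print the PID and full command line of every process whose command line
# matches PATTERN (an extended regular expression).
# Exit status is non-zero when nothing matches; pgrep never lists itself.
running_matching() {
    pgrep -af -- "$1"
}

# Usage:
#   running_matching compilecorp || echo "no compilecorp process found"
```

More than one matching line would suggest that several compilations are running at the same time.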

Best,

Ondrej

On Mon, May 29, 2023 at 9:41 PM Nándor Bolf <nando...@gmail.com> wrote:
Dear Ondrej,

I used this command:
nohup docker run --rm --mount type=bind,src=/home/bolfn/corpus/NoSketch-Engine-Docker/corpora,dst=/corpora eltedh/nosketch-engine:latest "compilecorp --no-ske --recompile-corpus final" &

How can I be sure that this process has ended? And is this command okay?

Best,
Nándor

On Mon, May 29, 2023 at 21:30, Ondřej Herman <ondrej...@sketchengine.eu> wrote:
Dear Nándor,

I think you may have accidentally executed compilecorp several times concurrently, or something similar:

-rw-r--r-- 1 root root 388M máj   28 22:01 word.arf
-rw-r--r-- 1 root root  92G máj   29 18:39 word.text

The word.arf file is calculated from the information contained in word.text, so it should be the newer of the two, but here its timestamp is older.
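
The same timestamp comparison can be left to the shell. A minimal sketch, assuming bash; check_stale is a hypothetical helper name, and the only dependency it relies on is the one stated above (word.arf is derived from word.text):

```bash
# Report whether a derived index file is older than the file it is built from.
# Usage: check_stale SOURCE DERIVED
check_stale() {
    local source_file=$1 derived_file=$2
    if [ "$derived_file" -ot "$source_file" ]; then
        echo "STALE: $derived_file is older than $source_file"
        return 1
    fi
    echo "OK: $derived_file is not older than $source_file"
}

# Usage, with the directory named earlier in this thread:
#   cd /corpora/final/indexed && check_stale word.text word.arf
```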

You can try to run just

compilecorp CORPNAME

and see what happens. It might even finish properly, but I'm afraid that the existing files may already be corrupt if stale files were present and the --recompile-corpus argument wasn't used. In that case you would need to recompile the corpus from scratch, either by deleting the data directory or by running compilecorp with the --recompile-corpus flag.

Some size decrease is normal, as some of the indices are created by joining partial results with redundant information; however, this decrease does seem quite extreme. Sorry, I don't remember the usual numbers.

Best,

Ondrej 

On Mon, May 29, 2023 at 9:14 PM Nándor Bolf <nando...@gmail.com> wrote:
Dear Ondrej,
 
Thanks for the quick response. I'd like to be able to search by two file categories: hu and nem_hu. It worked on a smaller corpus, but not on the current one.
An interesting detail about the index directory is that its size dropped from 1 TB to 345 GB by the end of the compilation (maybe this is useful information).
Here is its listing:
-rw-r--r-- 1 root root 726M máj   29 18:36 div.rng
-rw-r--r-- 1 root root   38 máj   29 18:36 doc.file.fsa
-rw-r--r-- 1 root root   10 máj   29 18:36 doc.file.lex
-rw-r--r-- 1 root root    8 máj   29 18:36 doc.file.lex.idx
-rw-r--r-- 1 root root    8 máj   29 18:36 doc.file.lex.isrt
-rw-r--r-- 1 root root    8 máj   29 18:36 doc.file.lex.srt
-rw-r--r-- 1 root root 2,4K máj   29 18:36 doc.file.rev
-rw-r--r-- 1 root root    8 máj   29 18:36 doc.file.rev.idx0
-rw-r--r-- 1 root root   11 máj   29 18:36 doc.file.rev.idx1
-rw-r--r-- 1 root root  76K máj   29 18:36 doc.file.text
-rw-r--r-- 1 root root   16 máj   29 18:36 doc.file.token
-rw-r--r-- 1 root root 301K máj   29 18:36 doc.rng
-rw-r--r-- 1 root root 108G máj   29 18:36 g.rng
-rw-r--r-- 1 root root 351M máj   29 21:03 lc.lex.idx.tmp
-rw-r--r-- 1 root root 388M máj   29 21:04 lc.lex.ridx
-rw-r--r-- 1 root root 151M máj   29 21:05 lc.lex.srt.tmp
-rw-r--r-- 1 root root 1,4G máj   29 21:03 lc.lex.tmp
-rw-r--r-- 1 root root 460M máj   29 21:04 lc.rev
-rw-r--r-- 1 root root 351M máj   29 21:04 lc.rev.cnt
-rw-r--r-- 1 root root    0 máj   29 21:03 lc.rev.cnt64
-rw-r--r-- 1 root root 351M máj   29 21:04 lc.rev.idx
drwxr-xr-x 2 root root   96 máj   28 22:01 log
-rw-r--r-- 1 root root  19G máj   29 18:36 p.rng
-rw-r--r-- 1 root root  41G máj   29 18:36 s.rng
-rw-r--r-- 1 root root 388M máj   28 22:01 word.arf
-rw-r--r-- 1 root root 1,1G máj   29 18:39 word.fsa
-rw-r--r-- 1 root root 1,5G máj   29 18:39 word.lex
-rw-r--r-- 1 root root 388M máj   29 18:39 word.lex.idx
-rw-r--r-- 1 root root 388M máj   29 18:39 word.lex.isrt
-rw-r--r-- 1 root root 388M máj   29 18:39 word.lex.srt
-rw-r--r-- 1 root root 5,1M máj   29 21:00 word.regex.fsa
-rw-r--r-- 1 root root 5,0M máj   29 21:00 word.regex.lex
-rw-r--r-- 1 root root 4,2M máj   29 21:00 word.regex.lex.idx
-rw-r--r-- 1 root root 4,2M máj   29 21:00 word.regex.lex.isrt
-rw-r--r-- 1 root root 4,2M máj   29 21:00 word.regex.lex.srt
-rw-r--r-- 1 root root 4,0G máj   29 21:00 word.regex.rev
-rw-r--r-- 1 root root 4,2M máj   29 21:00 word.regex.rev.cnt
-rw-r--r-- 1 root root    0 máj   29 20:56 word.regex.rev.cnt64
-rw-r--r-- 1 root root 4,2M máj   29 21:00 word.regex.rev.idx
-rw-r--r-- 1 root root  84G máj   29 20:31 word.rev
-rw-r--r-- 1 root root 6,1M máj   29 20:31 word.rev.idx0
-rw-r--r-- 1 root root 169M máj   29 20:31 word.rev.idx1
-rw-r--r-- 1 root root  92G máj   29 18:39 word.text
-rw-r--r-- 1 root root 1,3G máj   29 18:39 word.text.off
-rw-r--r-- 1 root root 166M máj   29 18:39 word.text.seg

Best,
Nándor

Ondřej Herman

May 30, 2023, 2:15:28 PM
to Nándor Bolf, NoSketch Engine
Dear Nándor,

I'd say that nothing is being compiled, but it's hard to say without seeing the full COMMAND column.

The corpus size should be just fine -- the error does look as if several things were happening at the same time (the corpus text had been compiled, but by the next step it no longer existed).

Best,

Ondrej

On Tue, May 30, 2023 at 10:54 AM Nándor Bolf <nando...@gmail.com> wrote:
Dear Ondrej,

This is the output of the docker ps command:
CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS                                     NAMES
e37a9bc63ac5   eltedh/nosketch-engine:latest   "/usr/local/bin/entr…"   42 hours ago   Up 42 hours   80/tcp                                    adoring_kowalevski
2c7e9c120e60   eltedh/nosketch-engine:latest   "/usr/local/bin/entr…"   42 hours ago   Up 42 hours   0.0.0.0:10070->80/tcp, :::10070->80/tcp   noske

Do you think I should rerun the corpus compilation now? From the output above I assume that no compilation is running, but please tell me if I'm wrong. I'm not sure that I started multiple processes at once, but I can try again. (My impression was that the corpus was too big and that this is why it crashed, but maybe that's not the case.)

Did you see the error log in my first message? Do you think the error was caused by multiple processes started at once?

Best,
Nándor

Ondřej Herman

May 30, 2023, 4:37:33 PM
to Nándor Bolf, NoSketch Engine
Dear Nándor,

On Tue, May 30, 2023, 16:59 Nándor Bolf <nando...@gmail.com> wrote:
Dear Ondrej,

Is it possible to print the full COMMAND column? If not, do you suggest recompiling the corpus and seeing whether the error happens again?

Sorry, I don't know. Some of the authors of the container frequent this mailing list, perhaps they know?

Yes, I'd try to run the compilation again.

(You can always restart the machine to be completely sure that nothing is interfering.;)
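
On the full COMMAND column: docker ps shortens long values by default, and its --no-trunc flag turns that off. A small sketch, assuming the docker CLI can reach the daemon; filter_compilecorp is a hypothetical helper that only keeps the listing lines mentioning compilecorp:

```bash
# Keep only the listing lines that mention compilecorp.
# Reads lines on stdin; exit status is 1 when nothing matches.
filter_compilecorp() {
    grep -F -- 'compilecorp'
}

# Usage:
#   docker ps --no-trunc --format '{{.ID}}: {{.Command}}' | filter_compilecorp \
#       || echo "no running container mentions compilecorp"
```

docker ps lists running containers only, so a container started with --rm disappears from this listing as soon as its command exits.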

Best,

Ondrej

Best,
Nándor