Increase parallelism with index building

Michael Ruepp

Jul 14, 2015, 7:07:53 PM
to semanti...@googlegroups.com
Hi, 

is there any option to increase parallelism when creating the indexes? I'm running a 16-core machine, and while building the index (by calling the SV BuildIndex class's main method with an array of args, as sketched below), I barely get more than 200-400% CPU utilization, where up to 1600% would be possible.
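For reference, this is roughly how I invoke it (a sketch; the flag names are the ones from the SV wiki, and the paths and values are placeholders for my actual setup):

    import pitt.search.semanticvectors.BuildIndex;

    public class RunBuildIndex {
        public static void main(String[] args) throws Exception {
            // Calling the SV main method directly with an args array,
            // as described above. Paths and values are placeholders.
            BuildIndex.main(new String[] {
                "-luceneindexpath", "/path/to/lucene/index",
                "-trainingcycles", "10"
            });
        }
    }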

I'm thinking about implementing some kind of caching system like JCS or Ehcache to get the Lucene index into main memory, but I'm not sure whether it would have any impact on creating the SV indexes.
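For example, something like this would copy the whole index into heap up front (a sketch assuming the Lucene 4.x API; whether SV can be pointed at a Directory instance instead of a filesystem path is exactly what I'm unsure about):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.RAMDirectory;

    public class LoadIndexIntoRam {
        // Copies the on-disk Lucene index into an in-heap RAMDirectory.
        public static RAMDirectory load(String path) throws IOException {
            return new RAMDirectory(FSDirectory.open(new File(path)), IOContext.READ);
        }
    }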

Also, barely knowing the underlying maths: could there be any improvement from using OpenCL to offload work to the GPU, or is the memory requirement per thread/task too high?

So far, creating the index of 4 million terms and 4.4 million docs uses about 17GB of real memory.

I also added -Xms1G -Xmx30G -XX:+UseG1GC to the build settings in IntelliJ.

Thankful for every input,

Best regards,

Michael

Dominic Widdows

Jul 15, 2015, 12:27:58 AM
to semanti...@googlegroups.com
Hi Michael,

Darn it, you're right that multithreading, during the learning process at least, should be implemented but isn't. Feel free to try something.
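Off the top of my head, the shape would be something like this (purely a sketch, not code from the package; trainTermBatch stands in for whatever slice of the learning loop turns out to be safe to run concurrently):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelTraining {
        // Hypothetical hook: trains one partition of the term list.
        interface TermBatchTrainer {
            void trainTermBatch(List<String> terms);
        }

        public static void trainInParallel(List<String> allTerms, TermBatchTrainer trainer)
                throws InterruptedException {
            int nThreads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            // Split the term list into one contiguous slice per thread.
            int batch = (allTerms.size() + nThreads - 1) / nThreads;
            for (int i = 0; i < allTerms.size(); i += batch) {
                List<String> slice = allTerms.subList(i, Math.min(i + batch, allTerms.size()));
                pool.submit(() -> trainer.trainTermBatch(slice));
            }
            pool.shutdown();
            pool.awaitTermination(7, TimeUnit.DAYS);
        }
    }

The catch is that superposing onto shared vectors isn't thread-safe as it stands, so each batch would need its own accumulators (merged at the end) or some locking.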

How long is your build index command taking at the moment?

Best wishes,
Dominic


Michael Ruepp

Jul 15, 2015, 4:20:48 AM
to semanti...@googlegroups.com
Hi,

I started it yesterday, 13 hours ago. I used 10 training cycles and it is now in the middle of the third one.

Also, memory usage has gotten much worse: the computer is swapping and the app uses 27GB of RAM, but the CPU is barely doing anything. See attached screenshots.

What is going on?

Michael

Dominic Widdows

Jul 15, 2015, 4:29:33 AM
to semanti...@googlegroups.com
If you're using a huge amount of memory and CPU is barely working, my guess is that you might be saturating RAM and going into swap space. (Can you hear a disk spinning? ;)

But here's the thing: I would suspect that with 10 training cycles, your vectors will almost all be the same by the time you're done - they tend to converge after 3 or 4 cycles. I would strongly recommend trying 2 or 3 to begin with; otherwise you're likely to be disappointed.
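Concretely, something like this (using the -trainingcycles flag; your paths will differ):

    java pitt.search.semanticvectors.BuildIndex -luceneindexpath /path/to/lucene/index -trainingcycles 2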

Best wishes,
Dominic

Michael Ruepp

Jul 15, 2015, 4:29:37 AM
to semanti...@googlegroups.com
What I forgot to mention: the Lucene index is 10GB. How much memory would be recommended to run BuildIndex on a Lucene index this large?

Should I try setting -docindexing incremental? What does that do, anyhow?


Thanks, Michael

Dominic Widdows

Jul 15, 2015, 4:38:11 AM
to semanti...@googlegroups.com
The size on disk of the Lucene index isn't the key variable; what matters for memory consumption is the number of terms and documents.
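As a rough back-of-envelope for your numbers, assuming the default real vectors at dimension 200: (4,051,896 terms + 4,396,040 docs) x 200 floats x 4 bytes comes to about 6.8GB for the vectors alone, before Lucene's own structures and JVM overhead.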

-docindexing incremental writes out document vectors one at a time, making them eligible for garbage collection.
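In code terms, the pattern is roughly this (a sketch of the idea, not the actual SV implementation; buildDocVector and VectorWriter are hypothetical stand-ins):

    public class IncrementalDocIndexing {
        // Stand-in for whatever persists a single vector to the output file.
        interface VectorWriter {
            void write(int docId, float[] vector);
        }

        // Stand-in for the superposition step that produces one document vector.
        static float[] buildDocVector(int docId) {
            return new float[200];
        }

        static void indexIncrementally(int numDocs, VectorWriter writer) {
            for (int doc = 0; doc < numDocs; doc++) {
                float[] docVector = buildDocVector(doc);
                writer.write(doc, docVector);
                // docVector is unreachable after this iteration, so the GC can
                // reclaim it; batch mode would instead keep every vector live
                // until the final write, which is where the memory goes.
            }
        }
    }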

Best wishes,
Dominic

Michael Ruepp

Jul 15, 2015, 7:40:23 AM
to semanti...@googlegroups.com
This is the timing with 2 training cycles:

2015-07-15 10:37:41 INFO  MainGui:415 - Creating term vectors as superpositions of elemental document vectors … 

There are 4051896 terms (and 4396040 docs).
Training term vectors for field contents
2015-07-15 13:38:01 INFO  MainGui:1124 - success semantic

Now testing LSA :-)



Best regards,


Michael




Michael Ruepp

Jul 15, 2015, 7:56:22 AM
to semanti...@googlegroups.com
Also, what's interesting: if I use -docindexing incremental, I end up with a second file named:

Termvectors.bin2.bin 

together with termvectors.bin

How come?



Best regards,

Michael Ruepp