Problem in LSA with 15000 Dimensions

siamak....@insight-centre.org

Oct 22, 2014, 7:23:02 AM
to s-space-re...@googlegroups.com
Hi guys,

  I want to run LSA with 15000 dimensions on the ukWaC corpus, but I couldn't get it to run. Even with the -Xmx110g flag, my job ran out of memory. Java programs throw java.lang.OutOfMemoryError at the JVM level when they run out of memory, rather than triggering an OOM error at the OS level, so I found that it was not my jar running out of memory but the SVD program. It required about 82GB of memory.
  I expected the run to take 72 hours. When I tested the application, it took much longer (more than 5 days, i.e. 120 hours), and even then it failed with an internal error (NullPointerException).
  In the end, I concluded that the S-Space application might not have been designed to process such a huge input file (~12GB). The matrix it generates is huge (rows: 4755577, columns: 2727402). It then transposes the matrix and passes it to the SVD stage.

Can you suggest an alternative way to handle such a huge input file?


With Best Wishes,
Siamak

siamak....@insight-centre.org

Nov 3, 2014, 1:40:23 PM
to s-space-re...@googlegroups.com
Sorry, I meant 1500 dimensions.

Isn't there any help?

David Jurgens

Nov 3, 2014, 2:07:34 PM
to s-space-re...@googlegroups.com
Hi Siamak,

Sorry for the delayed response. I was traveling for a conference when I saw your email and didn't have a chance to respond properly.
 
> I want to run LSA with 15000 dimensions on the ukWaC corpus, but I couldn't get it to run. Even with the -Xmx110g flag, my job ran out of memory. Java programs throw java.lang.OutOfMemoryError at the JVM level when they run out of memory, rather than triggering an OOM error at the OS level, so I found that it was not my jar running out of memory but the SVD program. It required about 82GB of memory.

Were you performing any token filtering on the ukWaC?  We've used that corpus quite a bit before and it has a huge number of unique terms due to the noisy text extraction process.  If you're not filtering out low-frequency terms (e.g., terms with fewer than 25 occurrences), then the resulting input matrix to the SVD is huge and will incur a large memory overhead.
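
As a rough illustration of that kind of low-frequency filtering, here is a minimal Python sketch. It is a standalone preprocessing step, not part of the S-Space API: the function names, the whitespace tokenization, and the two-pass layout are all assumptions made for the example, and the threshold of 25 is simply the figure mentioned above.

```python
from collections import Counter
from typing import Iterable, Iterator

MIN_COUNT = 25  # example threshold from the discussion; tune for your corpus


def count_tokens(docs: Iterable[str]) -> Counter:
    """First pass: count occurrences of each whitespace-delimited token."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def filter_docs(docs: Iterable[str], counts: Counter,
                min_count: int = MIN_COUNT) -> Iterator[str]:
    """Second pass: keep only tokens seen at least min_count times."""
    for doc in docs:
        yield " ".join(tok for tok in doc.split() if counts[tok] >= min_count)


if __name__ == "__main__":
    # Tiny hypothetical corpus; a real run would stream documents from disk.
    corpus = ["the cat sat", "the dog sat", "the xqzv sat"]
    counts = count_tokens(corpus)
    for line in filter_docs(corpus, counts, min_count=2):
        print(line)
```

Dropping rare tokens before building the term-document matrix shrinks one dimension of the SVD input, which is where most of the memory goes.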
 
> I expected the run to take 72 hours. When I tested the application, it took much longer (more than 5 days, i.e. 120 hours), and even then it failed with an internal error (NullPointerException).

Do you happen to have the stack trace for the NullPointerException?  That would help us track down where the program is failing and why.  
 
> In the end, I concluded that the S-Space application might not have been designed to process such a huge input file (~12GB). The matrix it generates is huge (rows: 4755577, columns: 2727402). It then transposes the matrix and passes it to the SVD stage.
>
> Can you suggest an alternative way to handle such a huge input file?

That matrix is still quite big, but it should be feasible with our methods, which are designed for on-disk operations when necessary.  Even with a thin SVD, you would still need to retain the U, S, and V matrices; for example, according to the dimensions you gave, your U matrix is about 4.76M * 1.5K, which at 8 bytes per value is around 57GB.
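
The back-of-the-envelope estimate can be reproduced with a short Python sketch. It uses the exact row and column counts reported in the original message and assumes dense storage at 8 bytes per value, with S kept as a vector of singular values; the function names are hypothetical and this is not how S-Space itself accounts for memory.

```python
BYTES_PER_VALUE = 8  # assumption: double-precision values


def dense_bytes(rows: int, cols: int,
                bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes needed to hold a dense rows x cols matrix."""
    return rows * cols * bytes_per_value


def thin_svd_bytes(rows: int, cols: int, k: int) -> dict:
    """Dense sizes of the factors of a rank-k thin SVD."""
    return {
        "U": dense_bytes(rows, k),   # rows x k
        "S": dense_bytes(k, 1),      # k singular values (stored as a vector)
        "V": dense_bytes(cols, k),   # cols x k
    }


if __name__ == "__main__":
    rows, cols, k = 4_755_577, 2_727_402, 1_500
    for name, size in thin_svd_bytes(rows, cols, k).items():
        print(f"{name}: {size / 1e9:.2f} GB")
```

These figures cover only the dense output factors; the sparse input matrix and any working memory used by the SVD routine come on top.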

Do you happen to have more details on where the program might be crashing?

  Thanks,
  David


--
You received this message because you are subscribed to the Google Groups "Semantic Space Research - Development" group.
Visit this group at http://groups.google.com/group/s-space-research-dev.