FYI: I indexed all the long abstracts of DBpedia, stored in a Lucene
index, with:
java -Xmx12024M pitt.search.semanticvectors.BuildIndex -trainingcycles 2 index
Best of luck,
Clive
On May 11, 6:06 pm, Reinald Kim Amplayo <kinsak...@gmail.com> wrote:
> I'm afraid I can't reduce the number of documents from 4M to 1M. :( I don't
> know which documents can be combined. These 4M documents are actually
> abstracts from Wikipedia (from the 1.1M documents back in 2008 [see here:
> http://groups.google.com/group/semanticvectors/browse_thread/thread/d...],
> >> command: java -Xmx1024m pitt.search.semanticvectors.BuildIndex -docindexing incremental -minfrequency 10 index/
>
> >> result:
> >> Seedlength: 10, Dimension: 200, Vector type: real, Minimum frequency: 10,
> >> Maximum frequency: 2147483647, Number non-alphabet characters: 0,
> >> Contents fields are: [contents]
> >> Creating elemental document vectors ...
> >> Populating basic sparse doc vector store, number of vectors: 3977274
> >> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> >> Sadly, it still brought me to this error. Any other methods? :(
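A rough back-of-envelope check on why a 1 GB heap is tight for this corpus, sketched in Python. It assumes each real-valued vector component takes 4 bytes (a Java float) and that every document vector is held densely in memory at once. Neither assumption is confirmed by the log above, which fails while populating a sparse store, so treat the figure as an order-of-magnitude guide only. The function name `dense_store_bytes` is made up for this illustration.

```python
def dense_store_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw components of num_vectors dense vectors.

    Ignores JVM object headers, document path strings and Lucene's own
    memory use, so the real footprint would be higher.
    """
    return num_vectors * dimension * bytes_per_value


NUM_DOCS = 3977274               # "number of vectors" reported in the log
DIMENSION = 200                  # "Dimension: 200" reported in the log
HEAP_BYTES = 1024 * 1024 * 1024  # heap size requested with -Xmx1024m

needed = dense_store_bytes(NUM_DOCS, DIMENSION)
print(f"dense document vectors: {needed} bytes ({needed / 2**30:.2f} GiB)")
print(f"larger than the 1024m heap: {needed > HEAP_BYTES}")
```

Under these assumptions the raw vector components alone come to 3,181,819,200 bytes, roughly three times the 1024m heap, which is consistent with the much larger -Xmx12024M setting used in the run reported at the top of this thread.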
>
> >> On Wed, May 9, 2012 at 10:39 PM, Reinald Kim Amplayo <kinsak...@gmail.com> wrote:
>
> >>> Oh. Okay, I'll try something and get back to this thread. Thanks!
>
> >>> On Wed, May 9, 2012 at 10:34 PM, Trevor Cohen <trever...@gmail.com> wrote:
>
> >>>> Hi Reinald,
> >>>> It looks as though you are using a single large document (as the output
> >>>> reads "and 1 docs"). Is this the case? If so, I wouldn't expect the process
> >>>> to generate meaningful results even if we did get around the memory issue,
> >>>> as every term will have an identical vector on account of occurring in the
> >>>> same context only. So it would be worth subdividing your corpus into
> >>>> meaningful units.
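A minimal sketch of the subdividing step suggested above, in Python, assuming the corpus is a single plain-text file whose natural units (paragraphs, abstracts) are separated by blank lines. The function names, the `doc_NNNNNNN.txt` naming scheme and the `min_chars` cutoff are illustrative assumptions, not part of SemanticVectors or Lucene. The resulting files would still have to be indexed into Lucene before running BuildIndex.

```python
import re
from pathlib import Path
from typing import List


def split_into_units(text: str, min_chars: int = 1) -> List[str]:
    """Split raw text into units wherever one or more blank lines occur."""
    # A newline, optional whitespace, then another newline marks a boundary.
    chunks = re.split(r"\n\s*\n", text)
    units = [chunk.strip() for chunk in chunks]
    # Drop empty fragments and anything shorter than the cutoff.
    return [unit for unit in units if len(unit) >= min_chars]


def write_units(units: List[str], out_dir: str) -> int:
    """Write each unit to its own numbered .txt file and return the count."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for i, unit in enumerate(units):
        (target / f"doc_{i:07d}.txt").write_text(unit, encoding="utf-8")
    return len(units)
```

For example, `write_units(split_into_units(Path("corpus.txt").read_text(encoding="utf-8")), "docs/")` would turn a hypothetical `corpus.txt` into one file per paragraph, so that terms occur in many distinct contexts rather than a single one.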
>
> >>>> Memory-wise, you could try the following:
> >>>> (1) use a minimum term frequency > 0, e.g. 10
> >>>> (2) use the -docindexing incremental flag
> >>>> (3) increase the -Xmx to above 1G if this is an option
>
> >>>> Regards,
> >>>> Trevor
>
> >>>>> On Wed, May 9, 2012 at 9:00 AM, Reinald Kim Amplayo <kinsak...@gmail.com> wrote:
>
> >>>>> Hi.
>
> >>>>> I have problems in building the index.
> >>>>> I executed this:
> >>>>> java -Xmx1024m pitt.search.semanticvectors.BuildIndex index/
>
> >>>>> ... and, after waiting a while, got this output:
> >>>>> Seedlength: 10, Dimension: 200, Vector type: real, Minimum frequency:
> >>>>> 0, Maximum frequency: 2147483647, Number non-alphabet characters: 0,
> >>>>> Contents fields are: [contents]
> >>>>> Creating elemental document vectors ...
> >>>>> Populating basic sparse doc vector store, number of vectors: 1
> >>>>> Creating term vectors ...There are 1586253 terms (and 1 docs).
> >>>>> Processed 1000 terms ... Processed 2000 terms ... Processed 3000
> >>>>> terms ... Processed 4000 terms ... Processed 5000 terms ...
> >>>>> ...
> >>>>> ... Processed 1200000 terms ... Processed 1210000 terms ... Processed
> >>>>> 1220000 terms ... Processed 1230000 terms ... Processed 1240000
> >>>>> terms ... Processed 1250000 terms ... Exception in thread "main"
> >>>>> java.lang.OutOfMemoryError: Java heap space
>
> >>>>> Is there any way to solve this problem? Thanks!
>
> >>>>> --
> >>>>> You received this message because you are subscribed to the Google
> >>>>> Groups "Semantic Vectors" group.
> >>>>> To post to this group, send email to semanticvectors@googlegroups.com.
> >>>>> To unsubscribe from this group, send email to
> >>>>> semanticvectors+unsubscribe@googlegroups.com.
> >>>>> For more options, visit this group at
> >>>>> http://groups.google.com/group/semanticvectors?hl=en.
>
> >>> --
> >>> 1456145614561612612612612654656026126
>
> >> --
> >> 1456145614561612612612612654656026126