Out of Memory Error


Reinald Kim Amplayo

May 9, 2012, 10:00:09 AM
to Semantic Vectors
Hi.

I have problems in building the index.
I executed this:
java -Xmx1024m pitt.search.semanticvectors.BuildIndex index/

... and after waiting a while, I got:
Seedlength: 10, Dimension: 200, Vector type: real, Minimum frequency:
0, Maximum frequency: 2147483647, Number non-alphabet characters: 0,
Contents fields are: [contents]
Creating elemental document vectors ...
Populating basic sparse doc vector store, number of vectors: 1
Creating term vectors ...There are 1586253 terms (and 1 docs).
Processed 1000 terms ... Processed 2000 terms ... Processed 3000
terms ... Processed 4000 terms ... Processed 5000 terms ...
...
... Processed 1200000 terms ... Processed 1210000 terms ... Processed
1220000 terms ... Processed 1230000 terms ... Processed 1240000
terms ... Processed 1250000 terms ... Exception in thread "main"
java.lang.OutOfMemoryError: Java heap space

Is there any way to solve this problem? Thanks!

Trevor Cohen

May 9, 2012, 10:34:23 AM
to semanti...@googlegroups.com
Hi Reinald,
It looks as though you are using a single large document (as the output reads "and 1 docs"). Is this the case? If so, I wouldn't expect the process to generate meaningful results even if we did get around the memory issue, as every term will have an identical vector on account of occurring in the same context only. So it would be worth subdividing your corpus into meaningful units.

Memory-wise, you could try the following (combined example below):
(1) use a minimum term frequency > 0, e.g. 10
(2) use the -docindexing incremental flag
(3) increase the -Xmx to above 1G if this is an option
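Put together, (1)-(3) might look something like this (the 2g heap here is just an illustration of (3)):

java -Xmx2g pitt.search.semanticvectors.BuildIndex -docindexing incremental -minfrequency 10 index/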

Regards,
Trevor




Reinald Kim Amplayo

May 9, 2012, 10:39:18 AM
to semanti...@googlegroups.com
Oh. Okay, I'll try something and get back to this thread. Thanks!

Reinald Kim Amplayo

May 9, 2012, 2:10:55 PM
to semanti...@googlegroups.com
Hello again.

I've tried everything that you've said, Trevor. Here's what I've got:

command: java -Xmx1024m pitt.search.semanticvectors.BuildIndex -docindexing incremental -minfrequency 10 index/

result:
Seedlength: 10, Dimension: 200, Vector type: real, Minimum frequency: 10, Maximum frequency: 2147483647, Number non-alphabet characters: 0, Contents fields are: [contents]

Creating elemental document vectors ...
Populating basic sparse doc vector store, number of vectors: 3977274

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Sadly, it still ended with the same error. Any other methods? :(

Dominic

May 11, 2012, 12:28:27 PM
to semanti...@googlegroups.com
Sorry for the delayed reply.

It looks like you've gone from 1 huge document to 3977274 tiny ones. Is this correct? If so, 4M is a very large number - is there any way to get it down to below 1M?

Even so, we shouldn't be breaking, since each elemental doc vector should only take seedlength * 2 * sizeof(short), which works out to 40 bytes, plus some room for identifiers and padding. For ~4M documents that is on the order of 150MB (well within a 1GB heap) unless your pathnames are very long. I wonder if there's a bug that casts the elemental vectors to dense ones, and this is the first corpus big enough to expose it?
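To make the arithmetic concrete, here is a back-of-envelope sketch in plain Java (just the numbers from this thread, not SemanticVectors internals; the dense case is the hypothetical bug scenario):

public class HeapEstimate {
    public static void main(String[] args) {
        long numDocs = 3977274L; // "number of vectors" from the log above
        int seedLength = 10;     // Seedlength from the log
        int dimension = 200;     // Dimension from the log

        // Sparse real vector: seedlength * 2 offsets stored as 2-byte shorts.
        long sparseBytes = numDocs * seedLength * 2L * 2L;  // ~151 MB total
        // Dense real vector: one 4-byte float per dimension.
        long denseBytes = numDocs * dimension * 4L;         // ~3034 MB total

        System.out.println("sparse payload: " + sparseBytes / (1024 * 1024) + " MB");
        System.out.println("dense payload:  " + denseBytes / (1024 * 1024) + " MB");
    }
}

If the elemental vectors were being stored densely, ~3GB of floats alone would explain the crash under -Xmx1024m.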

Has anyone else used so many documents, and if so have you run into problems?

Best wishes,
Dominic

Reinald Kim Amplayo

May 11, 2012, 1:06:10 PM
to semanti...@googlegroups.com
I'm afraid I can't reduce the number of documents from 4M to 1M. :( I don't know which documents can be combined. These 4M documents are actually abstracts from Wikipedia (up from the 1.1M documents back in 2008 [see here: http://groups.google.com/group/semanticvectors/browse_thread/thread/d024bff9122fd858/ee02557e81495a50] to almost 4M documents this year).

Are there any other ways to index those documents? Should I increase -Xmx1024m to -Xmx6G, or something?

Some people here have used a lot more than 4 million documents and still been able to index them. Maybe they used enough memory and I didn't.

Thanks for the reply, Dominic.


Clive Cox

May 11, 2012, 1:14:37 PM
to Semantic Vectors

FYI: I indexed all the long abstracts of DBpedia placed in a Lucene index with:

java -Xmx12024M pitt.search.semanticvectors.BuildIndex -trainingcycles 2 index

Best of luck,

Clive


Reinald Kim Amplayo

May 11, 2012, 1:28:22 PM
to semanti...@googlegroups.com
Hi Clive,

How many documents and terms did you have that required 12024m as your heap size? I hope "12024" is just a typo for "1024". I will also try to index DBpedia's long abstracts. Thanks!

Clive Cox

May 11, 2012, 1:38:28 PM
to Semantic Vectors

Not a typo: 12G of memory was needed. There were around 3.5 million documents and 10.9 million terms.

Clive

Clive Cox

May 11, 2012, 1:46:47 PM
to Semantic Vectors

Actually, I think the "contents" field had around 7 million terms. I didn't do much cleaning of the terms, so I assume there were lots of junk terms.

Glen Newton

May 11, 2012, 1:49:53 PM
to semanti...@googlegroups.com
From my experience this heap number sounds right. My work involving 21m terms and a 43GB Lucene index needed 28GB of heap to run SV. [Note: older versions of both Lucene and SV.]
http://gnewton.ca/u/gn/2009/ecdl2009Newton.pdf

-Glen
http://zzzoot.blogspot.com/

Trevor Cohen

May 16, 2012, 12:19:33 PM
to Semantic Vectors
Hi Reinald,
One thing that might be worth trying is breaking the procedure up into
steps, along the lines of:

(1) java -Xmx1024m pitt.search.semanticvectors.BuildIndex -initialtermvectors random -docindexing incremental -minfrequency 10 -termweight logentropy index
(2) java -Xmx1024M pitt.search.semanticvectors.IncrementalTermVectors -minfrequency 10 incremental_docvectors.bin index

The first step should produce a set of document vectors based on
random term vectors, without constructing the document vectors in RAM.
The second should use these document vectors to generate a set of term
vectors, without loading them into RAM.

I may not have the command line arguments entirely right here, so let
us know how it goes.
Regards,
Trevor

PS - you may also wish to consider a maximum frequency threshold or a stopword list.

Trevor Cohen

May 16, 2012, 12:24:31 PM
to Semantic Vectors
PPS - it would also be worth adding the flags "-docindexing incremental -minfrequency 10 -termweight logentropy -trainingcycles 2 -initialtermvectors random" to the last command you tried, which should achieve the same results as the 2-step approach above.

Reinald Kim Amplayo

May 17, 2012, 3:22:03 AM
to semanti...@googlegroups.com
Hi Trevor.

Thanks for that tip! I tried:

java -Xmx6500m pitt.search.semanticvectors.BuildIndex -docindexing incremental -minfrequency 10 -termweight logentropy -trainingcycles 2 -initialtermvectors random index/

but I got this error:

 Processed 3530000 documents ... Processed 3540000 documents ... Processed 3550000 documents ... Finished writing vectors.
java.io.FileNotFoundException: /media/New Volume/thesis/docvectors (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:214)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:345)
    at pitt.search.semanticvectors.IncrementalTermVectors.createIncrementalTermVectorsFromLucene(IncrementalTermVectors.java:122)
    at pitt.search.semanticvectors.IncrementalTermVectors.<init>(IncrementalTermVectors.java:111)
    at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:125)

What is that docvectors? Is it a directory, and if so, what should I put in it? Or is it a file, i.e. docvectors.bin? I did actually see docvectors.bin in my directory.

Trevor Cohen

May 17, 2012, 9:53:19 AM
to semanti...@googlegroups.com
Hi Reinald,
I think you've run into a bug that was introduced with the 3.4 release, and I submitted a fix a little while back. If so, it would be worth downloading the latest revision using svn, building this, and trying again. Could you please confirm that you are using the latest release (rather than the latest revision)?
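For reference, a checkout-and-build would look something along these lines (the repository URL follows the usual Google Code pattern, and the build step is an assumption; use whatever build file the checkout actually contains):

svn checkout http://semanticvectors.googlecode.com/svn/trunk/ semanticvectors-trunk
cd semanticvectors-trunk
ant   # or "mvn package", depending on the build file present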
Thanks,
Trevor

Reinald Kim Amplayo

May 17, 2012, 1:22:36 PM
to semanti...@googlegroups.com
I am using the latest release. Okay, I'll get back to this later. Thanks Trevor!

Reinald Kim Amplayo

May 17, 2012, 2:57:50 PM
to semanti...@googlegroups.com
Hi Trevor.

The error was still there, just with a different file name:

 ... Processed 3530000 documents ... Processed 3540000 documents ... Processed 3550000 documents ... Finished writing vectors.
java.io.FileNotFoundException: /media/New Volume/thesis/incremental_docvectors (No such file or directory)

    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:214)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:345)
    at pitt.search.semanticvectors.IncrementalTermVectors.createIncrementalTermVectorsFromLucene(IncrementalTermVectors.java:122)
    at pitt.search.semanticvectors.IncrementalTermVectors.<init>(IncrementalTermVectors.java:111)
    at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:125)

I think the problem is that incremental_docvectors is not actually the same as incremental_docvectors.bin ... don't you think?

Reinald Kim Amplayo

unread,
May 17, 2012, 3:00:58 PM5/17/12
to semanti...@googlegroups.com
I opened up the source, and the error comes from this line:

itermVectors = new IncrementalTermVectors(
              luceneIndex, VectorType.valueOf(Flags.vectortype.toUpperCase()),
              Flags.dimension, Flags.contentsfields, "incremental_"+docFile);

The file name is just "incremental_" + docFile; there is no file extension.
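A minimal sketch of the one-line workaround this suggests (assuming the writer side appends ".bin" when it saves the file, so the reader should too):

itermVectors = new IncrementalTermVectors(
              luceneIndex, VectorType.valueOf(Flags.vectortype.toUpperCase()),
              Flags.dimension, Flags.contentsfields,
              // append ".bin" so the name matches the file actually written
              "incremental_" + docFile + ".bin");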

Trevor Cohen

May 17, 2012, 3:26:56 PM
to Semantic Vectors
Thanks Reinald. How about now (with the latest revision)?


Dominic Widdows

May 17, 2012, 6:12:20 PM
to semanti...@googlegroups.com
Thanks for iterating on this, guys, sorry it's been a slog.

If this fixes your problems, Reinald, please let us know on the list
and I'll definitely push a new release.

Best wishes,
Dominic

Reinald Kim Amplayo

May 18, 2012, 6:28:03 AM
to semanti...@googlegroups.com
Hello guys.

I tried appending ".bin" to the docFile. Thankfully, the error did not appear again.

Trevor, I tried the command you've given. I searched for 'baseball'. Out of 20 results, I think only two or three were related to the word baseball. Almost all of the results were proper nouns, such as names of cities, people, etc., that are not really related to baseball. How do I improve the results? Do the file names also matter?

What I actually did was extract the extended abstracts from the XML file available on the DBpedia website. I saved them so that each file contains one abstract, with all the files in one folder.

I also tried these commands:
1.) -dimension 150 -minfrequency 10
2.) -trainingcycles 2 -docindexing incremental -minfrequency 10

but still, proper nouns dominated every set of search results. :(

Trevor Cohen

May 18, 2012, 9:17:23 AM
to semanti...@googlegroups.com
Hi Reinald,
- could you post an example of a relevant extended abstract?
- some things to consider trying (example command after this list):
(1) a stopword list or a -maxfrequency threshold
(2) adding the flags -initialtermvectors random and -termweight logentropy
(3) one of the sliding-window based approaches (using BuildPositionalIndex)
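For instance, (3) might look something like this (a sketch only: the flags shown are ones already used in this thread, and the -maxfrequency value is an arbitrary illustration of (1)):

java -Xmx6500m pitt.search.semanticvectors.BuildPositionalIndex -minfrequency 10 -maxfrequency 100000 index/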

- Trevor

Reinald Kim Amplayo

May 18, 2012, 1:32:20 PM
to semanti...@googlegroups.com
This is the extended abstract of 'baseball':
Baseball is a bat-and-ball sport played between two teams of nine players each. The goal is to score runs by hitting a thrown ball with a bat and touching a series of four bases arranged at the corners of a ninety-foot square, or diamond. Players on one team take turns hitting against the pitcher of the other team, which tries to stop them from scoring runs by getting hitters out in any of several ways. A player on the batting team can stop at any of the bases and later advance via a teammate's hit or other means. The teams switch between batting and fielding whenever the fielding team records three outs. One turn at bat for each team constitutes an inning and nine innings make up a professional game. The team with the most runs at the end of the game wins. Evolving from older bat-and-ball games, an early form of baseball was being played in England by the mid-eighteenth century. This game and the related rounders were brought by British and Irish immigrants to North America, where the modern version of baseball developed. By the late nineteenth century, baseball was widely recognized as the national sport of the United States. Baseball on the professional, amateur, and youth levels is now popular in North America, parts of Central and South America and the Caribbean, and parts of East Asia. The game is sometimes referred to as hardball, in contrast to the derivative game of softball. In North America, professional Major League Baseball (MLB) teams are divided into the National League (NL) and American League (AL). Each league has three divisions: East, West, and Central. Every year, the major league champion is determined by playoffs that culminate in the World Series. Four teams make the playoffs from each league: the three regular season division winners, plus one wild card team. Baseball is the leading team sport in both Japan and Cuba, and the top level of play is similarly split between two leagues: Japan's Central League and Pacific League; Cuba's West League and East League. In the National and Central leagues, the pitcher is required to bat, per the traditional rules. In the American, Pacific, and both Cuban leagues, there is a tenth player, a designated hitter, who bats for the pitcher. Each top-level team has a farm system of one or more minor league teams. These teams allow younger players to develop as they gain on-field experience against opponents with similar levels of skill.

the first few abstracts are here: http://downloads.dbpedia.org/preview.php?file=3.7_sl_en_sl_short_abstracts_en.nt.bz2

As I said, I tried three commands; here are the results:

-dimension 150 -minfrequency 10

1.0:baseball
0.3594425355702205:lindens
0.3505603283108587:asn
0.3501884959427871:aagpbl
0.3465998563536757:annenkov
0.34020589031120574:feiner
0.3385464599365474:bilinear
0.3255934720834881:itpb
0.32099040800884243:vizhnitz
0.3188992897695855:gorgythion
0.3187742073317976:dieterle
0.3184805031840783:hazlet
0.316281242674089:blesses
0.31348234453951546:ajlun
0.31253767499193963:nitroimidazole
0.3119976735939363:pitching <--
0.31022878988669117:svarte
0.30954682872136663:phosphonium
0.3074137363382049:macalasdair
0.3066976865513943:vishniac


-trainingcycles 2 -docindexing incremental -minfrequency 10

1.0:baseball
0.34199171276186924:canzoneri
0.3265345326140684:riadh
0.3249476010825526:hitter <--
0.31647576716765946:league <--
0.3135966585094386:outlaws
0.309115444386895:sox
0.301628655642545:major <--
0.2949917705576994:amiternum
0.2904800782670046:raceway
0.2896802536242947:njabl
0.2877486136664044:threw <--
0.285622180568593:ansell
0.28480204616694044:hitters <--
0.2845309353725343:lxr
0.28437611415668435:zeng
0.2813869271832661:sprout
0.28041938820812284:batting <--
0.28022722918369075:player <--
0.27917523035505487:cockerelli


-docindexing incremental -minfrequency 10 -termweight logentropy -trainingcycles 2 -initialtermvectors random

0.9999999858276257:baseball
0.39999999433105027:nationalteatern
0.39999999433105027:skaterade
0.39999999433105027:bharadwaja
0.39999999433105027:disrupted
0.39999999433105027:expectation
0.39999999433105027:manoor
0.39999999433105027:hypselodoris
0.39999999433105027:unorthodox
0.39999999433105027:chindasuinth
0.39999999433105027:incumbencies
0.39999999433105027:lolo
0.39999999433105027:mertensia
0.29999999574828773:uljanik
0.29999999574828773:rivia
0.29999999574828773:editing
0.29999999574828773:fregattenkapit
0.29999999574828773:ryther
0.29999999574828773:persevered
0.29999999574828773:gitksan

(I swear the last one was really different last time.)

Is there something wrong with the documents I'm indexing? Or are the commands not suited to a corpus this big? Thanks all!

Trevor Cohen

May 18, 2012, 1:46:38 PM
to semanti...@googlegroups.com

Are you using the -queryvectorfile flag when searching to direct the search toward the latest set of term vectors? Increasing dimensionality would be a good idea I think.
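For example (the file name here is a guess at what the training run writes; substitute whichever termvectors file your last cycle actually produced):

java pitt.search.semanticvectors.Search -queryvectorfile incremental_termvectors.bin baseball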

Reinald Kim Amplayo

May 18, 2012, 2:33:30 PM
to semanti...@googlegroups.com
-queryvectorfile termvectors.bin <-- I think this is the default, right? I tried it and the results were just the same.

Anyway, I think you are right about increasing the dimensionality. I tried -dimension 800 and it gave me this:

1.000000001007724:baseball
0.3502423844756624:league
0.2623932442475588:major
0.23909328600327803:sox
0.236731546843174:mlb
0.23173086507612586:pitcher
0.21999097374671436:batted
0.2064103379628778:cwr
0.20261360830973377:pitched
0.20229413298704815:player
0.20135434852833697:batting
0.2008570977354619:hitter
0.20062877432859985:baseman
0.198631944671675:leagues
0.19780904710140015:professional
0.19180269732475333:yankees
0.18796778518860302:ballpark
0.18706667160469326:seasons
0.18283285938792923:umpire
0.17809329637613897:expos

But I also tried something like 'politics':

0.9999999977474126:politics
0.24435313912062262:political
0.22149508301850312:portalview
0.16594482641080738:byjoseph
0.16564706001472004:softly
0.161055213100641:member
0.1550003126510641:costebelle
0.1541305945813258:datejuly
0.1531072345362266:outfit
0.1516947526745547:lese
0.1513906404609545:disconnects
0.14991662302722822:he
0.14790919998978846:verginius
0.14657750202504985:caregiver
0.14572453715696265:tian
0.14464096839268728:commarin
0.14433270535948484:points
0.1435354140849732:laigh
0.1425574539863546:gestroi
0.14243256881463584:bagert

which has really weird results.

I used the -trainingcycles 2 -docindexing incremental -minfrequency 10 -dimension 800 command. I will keep increasing the dimensionality until it burns through my memory!

Later, I'll try -docindexing incremental -minfrequency 10 -termweight logentropy -trainingcycles 2 -initialtermvectors random with -dimension 1000. I'll post the results here. :) Thanks, Trevor!

Reinald Kim Amplayo

May 18, 2012, 3:05:54 PM
to semanti...@googlegroups.com
Hi, Trevor.

I tried java -Xmx6500m pitt.search.semanticvectors.BuildIndex -docindexing incremental -minfrequency 10 -termweight logentropy -trainingcycles 2 -initialtermvectors random -dimension 1500 index/

but still the results were as bad as last time:

for baseball:

0.9999999858276257:baseball
0.29999999574828773:burins
0.29999999574828773:camerano
0.29999999574828773:omac
0.19999999716552513:sedol
0.19999999716552513:legazpi
0.19999999716552513:zaragoza
0.19999999716552513:muller
0.19999999716552513:sacaton
0.19999999716552513:connes
0.19999999716552513:purandara
0.19999999716552513:criminally
0.19999999716552513:fishtown
0.19999999716552513:liege
0.19999999716552513:cunderdin
0.19999999716552513:forestport
0.19999999716552513:brizio
0.19999999716552513:dimorphus
0.19999999716552513:ossulston
0.19999999716552513:squanto

for politics:

0.9999999858276257:politics
0.29999999574828773:erca
0.19999999716552513:hevelli
0.19999999716552513:namer
0.19999999716552513:semenovi
0.19999999716552513:zeitlin
0.19999999716552513:eyez
0.19999999716552513:glenamoy
0.19999999716552513:ambacht
0.19999999716552513:kwiatkowska
0.19999999716552513:stockaded
0.19999999716552513:stainby
0.19999999716552513:archchancellor
0.19999999716552513:zimbabwean
0.19999999716552513:jmatthew
0.19999999716552513:barch
0.19999999716552513:electus
0.19999999716552513:moulsford
0.19999999716552513:baloi
0.19999999716552513:vergleichenden
0.19999999716552513:pillement

Trevor Cohen

unread,
May 18, 2012, 4:54:00 PM5/18/12
to semanti...@googlegroups.com
Right, I'm concerned that you may be searching the termvectors.bin file, which contains the random term vectors, not the trained semantic vectors (probably incremental_termvectors2.bin or some such thing). The results suggest incidental overlap between random vectors may be responsible.