Query related to incorrect output for get-neighbors for RandomIndexing

rati...@gmail.com

Jun 27, 2014, 2:37:43 AM
to s-space-re...@googlegroups.com
Hi,
I want to find the semantically similar neighbors of a term using RandomIndexing.
The command I run is:
java -cp classes edu.ucla.sspace.mains.RandomIndexingMain -d <inputfile> --tokenFilter=exclude=english-stop-words-large.txt --verbose ~/Project/sspace/sspaceoutput/

The input corpus is English news text from http://www.statmt.org/wmt14/translation-task.html
It contains news headlines and similar text, one document per line.
I train on around 2 million lines of text.

I preprocess the text by tokenizing it and applying WordNet stemming.

For the word "begin" I get the following output, which makes sense:
> get-neighbors begin
left 0.7131183273100001
die 0.7137912698526857
remain 0.7176777557552262
receive 0.7190962118578188
serve 0.7191618319236693
launch 0.71928316727257
complete 0.748564586407225
0.7565354741099785
turn 0.7669144379867078
start 0.8847017108603537

But for the word "tiger":

> get-neighbors tiger

piney 0.5286385694809764
eighty-pound 0.5292671527978384
bisri 0.5298957361147005
flower-strewn 0.5305243194315625
templeton 0.6600891439647897
bretton 0.6678980395401771
twinberrow 0.7493857366570842
mid-tone 0.7493857366570842
106,200 0.7493857366570842
bashea 0.749385736657084

The word "tiger" occurs around 1000 times in the corpus, whereas "bashea" occurs only once in the whole corpus. I do not understand why it is ranked so highly.
The other results are not closely related either.
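
To make the question concrete, here is a minimal Python sketch of the kind of ranking I assume get-neighbors performs. It is not S-Space code: the assumption is that neighbors are ranked by cosine similarity, the names `cosine`, `neighbors` and `min_count` are hypothetical, and the toy vectors and most of the counts are invented for illustration rather than taken from my corpus. It shows that cosine similarity ignores vector length, so a word seen once can still score highly, and how a frequency cutoff would hide such words.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbors(term, vectors, counts, k=10, min_count=1):
    """Return up to k (word, score) pairs, most similar first.

    Words whose corpus frequency is below min_count are skipped.
    """
    target = vectors[term]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term and counts.get(other, 0) >= min_count
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented for illustration only.
vectors = {
    "tiger": [4.0, 3.0, 0.0],
    "lion": [3.0, 4.0, 1.0],
    "bashea": [1.0, 1.0, 0.0],  # short vector, as for a word seen once
    "start": [0.0, 1.0, 5.0],
}
counts = {"tiger": 1000, "lion": 800, "bashea": 1, "start": 5000}

# No cutoff: the rare word can outrank everything else.
print(neighbors("tiger", vectors, counts, k=3))
# With a cutoff of 5 occurrences the rare word is excluded.
print(neighbors("tiger", vectors, counts, k=3, min_count=5))
```

If the real ranking works like this, filtering candidates by frequency would be a workaround, but it would not explain why the rare words score so highly in the first place.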

Please help me resolve this issue.

Thanks,
Ratish Puduppully