Query related to incorrect output for get-neighbors for RandomIndexing

rati...@gmail.com

Jun 27, 2014, 2:37:43 AM

to s-space-re...@googlegroups.com

Hi,

I want to get the semantically similar neighbors of a term. I am using RandomIndexing.

Command:

java -cp classes edu.ucla.sspace.mains.RandomIndexingMain -d <inputfile> --tokenFilter=exclude=english-stop-words-large.txt --verbose ~/Project/sspace/sspaceoutput/

The input corpus is English news text from http://www.statmt.org/wmt14/translation-task.html

It basically contains news headlines and other such text, one document per line.

I train on around 2 million lines of text.

I preprocess the text to tokenize it and stem it with WordNet.

For the word "begin" I get the following output, which makes sense:

> get-neighbors begin

left 0.7131183273100001

die 0.7137912698526857

remain 0.7176777557552262

receive 0.7190962118578188

serve 0.7191618319236693

launch 0.71928316727257

complete 0.748564586407225

0.7565354741099785

turn 0.7669144379867078

start 0.8847017108603537

But for the word "tiger" I get:

> get-neighbors tiger

piney 0.5286385694809764

eighty-pound 0.5292671527978384

bisri 0.5298957361147005

flower-strewn 0.5305243194315625

templeton 0.6600891439647897

bretton 0.6678980395401771

twinberrow 0.7493857366570842

mid-tone 0.7493857366570842

106,200 0.7493857366570842

bashea 0.749385736657084

The word "tiger" occurs around 1000 times in the corpus.

The word "bashea" occurs only once in the whole corpus. I am not able to understand why it is ranked so highly.
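
For reference, the frequency counts above can be checked along these lines. This is only a minimal sketch: it assumes plain whitespace tokenization of the already preprocessed corpus with one document per line, and `term_frequencies` and the sample lines are made-up names and data for illustration, not part of S-Space.

```python
from collections import Counter
from typing import Iterable


def term_frequencies(lines: Iterable[str]) -> Counter:
    """Count lowercased, whitespace-separated tokens over documents (one per line)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real corpus file (hypothetical data).
    sample = [
        "the tiger begin to run",
        "a tiger start to hunt",
        "bashea report on the tiger",
    ]
    counts = term_frequencies(sample)
    for term in ("tiger", "bashea"):
        print(term, counts[term])
```

For the real corpus, an open file handle can be passed in place of the sample list, since iterating over a file yields one line at a time.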

The other outputs are not very related either.

Please help me resolve this issue.

Thanks,

Ratish Puduppully
