Hi,
I want to find the semantically similar neighbors of a term. I am using RandomIndexing with the following command:
java -cp classes edu.ucla.sspace.mains.RandomIndexingMain -d <inputfile> --tokenFilter=exclude=english-stop-words-large.txt --verbose ~/Project/sspace/sspaceoutput/
The input file basically contains news headlines and similar text, one document per line.
I train with around 2 million lines of text.
I preprocess the text by tokenizing it and stemming it with WordNet.
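To make the preprocessing step concrete, here is a rough sketch of its shape. It is not my exact pipeline: the token pattern, the `preprocess` function, and the `toy_stems` lookup standing in for the WordNet stemmer are all illustrative assumptions.

```python
import re

# Assumed token pattern: keeps forms such as "mid-tone" and "106,200" intact.
TOKEN = re.compile(r"[a-z0-9]+(?:[-,'][a-z0-9]+)*")

def preprocess(line, stem=lambda token: token):
    """Lowercase and tokenize one document, then apply `stem` to each token."""
    tokens = TOKEN.findall(line.lower())
    return " ".join(stem(t) for t in tokens)

# Toy lookup standing in for a real WordNet stemmer (hypothetical).
toy_stems = {"tigers": "tiger", "began": "begin"}
print(preprocess("Tigers began a mid-tone run!", stem=lambda t: toy_stems.get(t, t)))
```

Each input line is turned into one line of space-separated tokens, so the result can be written back out one document per line.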
For the word "begin" I get the following output, which makes sense:
>get-neighbors begin
left 0.7131183273100001
die 0.7137912698526857
remain 0.7176777557552262
receive 0.7190962118578188
serve 0.7191618319236693
launch 0.71928316727257
complete 0.748564586407225
0.7565354741099785
turn 0.7669144379867078
start 0.8847017108603537
But for the word "tiger" I get:
> get-neighbors tiger
piney 0.5286385694809764
eighty-pound 0.5292671527978384
bisri 0.5298957361147005
flower-strewn 0.5305243194315625
templeton 0.6600891439647897
bretton 0.6678980395401771
twinberrow 0.7493857366570842
mid-tone 0.7493857366570842
106,200 0.7493857366570842
bashea 0.749385736657084
The word "tiger" occurs around 1000 times in the corpus.
The word "bashea" occurs only once in the whole corpus, so I do not understand why it is ranked so highly.
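Occurrence counts like these can be checked with a sketch along the following lines, assuming whitespace-separated tokens and one document per line. The `count_terms` function and the sample lines are illustrative only and are not taken from my actual corpus.

```python
from collections import Counter

def count_terms(lines):
    """Count whitespace-separated tokens across documents (one per line)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Illustrative sample, not the real corpus.
sample = [
    "tiger spot near village",
    "bashea tiger reserve open",
    "tiger population rise",
]
counts = count_terms(sample)
print(counts["tiger"], counts["bashea"])
```

On the real data, the same function can be given an open file handle instead of the sample list, since iterating over a file yields one line at a time.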
The output for other words is not very well correlated either.
Please help me resolve this issue.
Thanks,
Ratish Puduppully