I'm sorry if this is going to be a really dumb question but I'm just a
cognitive scientist trying to complete a term project in computational
semantics...
The first aim of the project was to replicate a study by Otis and Sagi,
"Phonaesthemes: A corpus-based analysis". It's about certain sub-morphemic
units like gl-, tw-, wr-, -ack and so on that apparently have a
predictable effect on the meaning of the words that contain them, with
gl- words, for example, being associated with light and vision.
Otis and Sagi used the Gutenberg corpus and Infomap, and we wanted to use
the BNC and Semantic Vectors. I've been trying to figure out how to do it
for a while now and I still have no clue.
The goal is to find out whether semantic relatedness between words
that share a given phonaestheme is significantly higher than semantic
relatedness between words chosen at random.
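One way to make that comparison concrete is a Monte Carlo test: compute the mean pairwise cosine similarity of the phonaestheme words, then see how often equally sized random word sets reach the same value. The sketch below is only an illustration under assumptions. It is not necessarily the statistic Otis and Sagi used, the function names are made up, and `TOY_VECTORS` is invented data standing in for term vectors exported from whatever package you end up using.

```python
import itertools
import math
import random

# Invented toy vectors; real ones would come from your semantic space.
TOY_VECTORS = {
    "glow": [1.0, 0.1, 0.0], "gleam": [0.9, 0.2, 0.0], "glint": [1.0, 0.0, 0.1],
    "table": [0.0, 1.0, 0.0], "run": [0.0, 0.0, 1.0], "tax": [0.0, 1.0, 1.0],
    "idea": [0.1, 0.0, 1.0], "door": [0.0, 1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pairwise_similarity(words, vectors):
    """Average cosine similarity over all unordered pairs of words."""
    pairs = list(itertools.combinations(words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def monte_carlo_p_value(target_words, vectors, n_samples=1000, seed=0):
    """Compare the target set against random word sets of the same size.

    Returns (observed mean similarity, estimated one-sided p-value).
    """
    rng = random.Random(seed)
    vocabulary = sorted(vectors)
    observed = mean_pairwise_similarity(target_words, vectors)
    at_least_as_high = 0
    for _ in range(n_samples):
        sample = rng.sample(vocabulary, len(target_words))
        if mean_pairwise_similarity(sample, vectors) >= observed:
            at_least_as_high += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return observed, (at_least_as_high + 1) / (n_samples + 1)

if __name__ == "__main__":
    print(monte_carlo_p_value(["glow", "gleam", "glint"], TOY_VECTORS))
```

With real data the random sets would be drawn from the same 20 000-word vocabulary as the phonaestheme words, so that both groups live in the same space.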
Otis and Sagi say they used the default Infomap settings, i.e. a
co-occurrence window of 15 words and the 20 000 most frequent
content words for the analysis, but as far as I understand, Semantic
Vectors by default takes documents to be the context for terms, not
windows of a certain size. So my teacher suggested the following
procedure:
1) for each term A, find out how often it appears within 15 words to
the left or to the right of a context word B
2) get the 20 000 most frequent content words in the corpus
3) for each word B within this set of 20 000 words, create a document
(a file that can be named B.txt) which consists of all the instances
of B in the corpus plus all the words around each instance within a
15-word window of text to the left and right of B
4) this collection of 20 000 files gives the documents or contexts (the
columns in the termvectors matrix) that can then be fed to Lucene
indexing
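For what it's worth, steps 2 to 4 can be sketched in a few lines. This is a rough illustration only: it assumes the corpus is already tokenised into a flat list of lowercase strings and that you supply your own stopword list to approximate "content words". The function names are hypothetical and none of this is part of Semantic Vectors or Lucene.

```python
from collections import Counter
from pathlib import Path

def top_content_words(tokens, stopwords, n):
    """Step 2: the n most frequent tokens that are not in the stopword list."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(n)]

def window_documents(tokens, context_words, radius=15):
    """Step 3: for each context word B, collect the window around every
    occurrence of B (up to `radius` tokens on each side, B included)."""
    wanted = set(context_words)
    docs = {word: [] for word in wanted}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            start = max(0, i - radius)
            docs[tok].append(tokens[start:i + radius + 1])
    return docs

def write_documents(docs, out_dir):
    """Step 4: write one file per context word, one window per line,
    so the directory can be handed to an indexer."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for word, windows in docs.items():
        lines = [" ".join(window) for window in windows]
        (out / f"{word}.txt").write_text("\n".join(lines), encoding="utf-8")

if __name__ == "__main__":
    toy = "the glow of the lamp made the glass gleam".split()
    words = top_content_words(toy, stopwords={"the", "of"}, n=3)
    write_documents(window_documents(toy, words, radius=2), "pseudo_docs")
```

Note that windows around nearby occurrences overlap, so the same token can end up in a file several times. Whether that matches what Infomap does internally is something I have not checked.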
Now, this seems like a very roundabout way of doing it. I thought that
positional indexing could do the trick, since its documentation mentions
the sliding-window approach, but everything I have tried with it gives
similarity scores above .95 for any pair of terms whatsoever, so I'm
clearly doing something wrong.
I thought that if I could work this out and write a "for dummies"
tutorial, the students in next year's edition of the course could
use the Semantic Vectors package for their own cool term projects, but
now I'm getting more and more frustrated and am about to give up :(
Could someone please advise how to go on?
Thank you in advance!
Katja
(University of Amsterdam)
I think you will be able to accomplish what you want to do without generating the 20 000 files that you mention, which isn't to say that generating the 20 000 files wouldn't be useful.
If you build a Lucene index with positional information (IndexFilePositions.java) and then use TermTermVectorsFromLucene.java to build a positional semantic vectors index, you should be able to perform the experiments that you want to do. You have probably already done this from the command line.
Also, using a smaller window radius of 2, 3 or 4, instead of 15, will probably give you better results, although that would no longer match the settings of the study you are trying to replicate.
Basically you should be able to do what you want.
You have probably seen this already:
http://code.google.com/p/semanticvectors/wiki/PositionalIndexes
If all terms are delivering similarities above .95, that may indicate that the vectors are being diluted by information from frequently occurring terms.
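To illustrate what that kind of dilution looks like, here is a small sketch with invented co-occurrence counts. It is only a toy and says nothing about how Semantic Vectors builds its vectors internally; the words, the counts and the helper names are all made up. Two unrelated words look almost identical under cosine similarity as long as the counts for very frequent function words are left in, and the similarity drops once those dimensions are removed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(dim, 0.0) for dim, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def drop_dimensions(vector, unwanted):
    """Return a copy of the vector without the listed dimensions."""
    return {dim: w for dim, w in vector.items() if dim not in unwanted}

# Invented counts: how often each context word occurs near the target word.
GLOW = {"the": 500, "of": 300, "light": 12, "lamp": 8}
TAX = {"the": 520, "of": 310, "income": 15, "pay": 9}
FUNCTION_WORDS = {"the", "of"}

if __name__ == "__main__":
    raw = cosine(GLOW, TAX)
    filtered = cosine(drop_dimensions(GLOW, FUNCTION_WORDS),
                      drop_dimensions(TAX, FUNCTION_WORDS))
    print(f"with function words: {raw:.4f}, without: {filtered:.4f}")
```

If something like this is the cause, a stopword list or a frequency cut-off at indexing time would be the first thing to try.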
Regards,
Lance
________________________________________
From: semanti...@googlegroups.com [semanti...@googlegroups.com] On Behalf Of Katja [lumina...@gmail.com]
Sent: 09 December 2011 20:22
To: Semantic Vectors
Subject: Replicating Infomap study