Thank you for pointing me in the right direction, Steven! Sorry for
taking so long to respond. Now that this semester is over, I finally
have more time to work on this problem! :)
I came across Ted's post recently:
http://tech.groups.yahoo.com/group/wn-similarity/message/67
It explains the format of the IC files. For example, let's look at the
following line extracted from
~/nltk_data/corpora/wordnet_ic/ic-brown.dat:
2137n 1055337
* 2137 is the offset of this concept in WordNet (not in the Brown
corpus).
* n is the POS; in this case, a noun.
* 1055337 is the number of times this concept appears in the Brown
corpus.
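To make the three fields concrete, here is a minimal sketch of how such a line could be split apart in Python. The names ICEntry and parse_ic_line are my own (hypothetical, not part of NLTK), and I'm assuming every data line looks like the example above, with an integer count.

```python
from typing import NamedTuple


class ICEntry(NamedTuple):
    offset: int  # WordNet offset of the concept
    pos: str     # part-of-speech tag, e.g. "n" for noun
    count: int   # frequency number for this concept


def parse_ic_line(line: str) -> ICEntry:
    """Split a line such as '2137n 1055337' into offset, POS and count.

    Hypothetical helper: assumes the '<offset><pos> <count>' layout
    described above and an integer count.
    """
    fields = line.split()
    key, count = fields[0], fields[1]
    # The last character of the first field is the POS tag;
    # everything before it is the offset.
    return ICEntry(offset=int(key[:-1]), pos=key[-1], count=int(count))


entry = parse_ic_line("2137n 1055337")
print(entry.offset, entry.pos, entry.count)
```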
Note that the offset in each line is unique, i.e., each line in the IC
file corresponds to a unique concept. Each word may appear more than
once in the Brown corpus. This explains why there are lots of
frequency numbers that are larger than len(brown.words()).
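Both observations can be checked with a few lines of Python. This is only a rough sketch: summarize_ic_lines is a name I made up, the sample lines other than the first are invented for illustration, and corpus_size is a placeholder for whatever len(brown.words()) returns on your machine.

```python
def summarize_ic_lines(lines, corpus_size):
    """Return (all_keys_unique, number_of_counts_above_corpus_size).

    Hypothetical helper: assumes data lines look like '<offset><pos> <count>'
    and skips anything else (for example, a header line).
    """
    seen = set()
    all_unique = True
    above = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0][:-1].isdigit():
            continue  # not a data line
        key = fields[0]  # offset plus POS identifies the concept
        if key in seen:
            all_unique = False
        seen.add(key)
        # float() is used so that non-integer counts would also be accepted.
        if float(fields[1]) > corpus_size:
            above += 1
    return all_unique, above


# Only the first line comes from ic-brown.dat; the others are made up.
sample = ["2137n 1055337", "1234n 10", "5678v 2000000"]
print(summarize_ic_lines(sample, corpus_size=1_000_000))
```

To run it on the real file, pass the open file object instead of the sample list and replace the placeholder corpus size with len(brown.words()).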
Hope this helps.
--Beibei
On Nov 23, 8:03 pm, Steven Bird <stevenbi...@gmail.com> wrote:
> Hi Beibei,
>
> These information content files come from the Wordnet Similarity
> project, and I suggest you contact them for information. Please let
> us know what you learn!
>
>
> http://tech.groups.yahoo.com/group/wn-similarity/
> http://www.d.umn.edu/~tpederse/Data/README-WN-IC-30.txt
>
> On 23 November 2010 08:56, Beibei (Betty) Yang
>