Newbie question on the format of ic-brown.dat


Beibei (Betty) Yang

Nov 22, 2010, 4:56:09 PM
to nltk-...@googlegroups.com
Hello all,

As part of my research, I plan to use ic-brown.dat to save some time calculating frequencies in the Brown corpus, but I'm having trouble understanding its format:

beibei@ubuntu:~/nltk_data/corpora/wordnet_ic$ cat ic-brown.dat|more
wnver::eOS9lXC6GvMWznF1wkZofDdtbBU
1740n 1915712 ROOT
1930n 859272
2137n 1055337
2452n 36243
2684n 527326
3553n 414060
3993n 0
4258n 208268
...

The source code of WordNetICCorpusReader.ic() says the first column is offset, the last letter of the first column is POS, and the second column is the value. I assume each row corresponds to a word in the Brown corpus, but how can I find out the corresponding word? I'm seeing lots of offset numbers that are larger than len(brown.words()). Is there a tutorial explaining the format of these IC files?

Thank you!

Beibei (Betty) Yang
Department of Computer Science
University of Massachusetts Lowell
Lowell MA 01854

Steven Bird

Nov 23, 2010, 8:03:57 PM
to nltk-...@googlegroups.com
Hi Beibei,

These information content files come from the Wordnet Similarity
project, and I suggest you contact them for information. Please let
us know what you learn!

http://tech.groups.yahoo.com/group/wn-similarity/
http://www.d.umn.edu/~tpederse/Data/README-WN-IC-30.txt



Beibei Yang

Dec 22, 2010, 6:14:26 PM
to nltk-users
Thank you for pointing me in the right direction, Steven! Sorry for
taking so long to respond. Now that the semester is over, I finally
have more time to work on this problem! :)

I came across Ted's post recently: http://tech.groups.yahoo.com/group/wn-similarity/message/67

It explains the format of the IC files. For example, let's look at the
following line extracted from
~/nltk_data/corpora/wordnet_ic/ic-brown.dat:

2137n 1055337

* 2137 is the offset of this concept in WordNet (not in the Brown
corpus).
* n is the POS; in this case, a noun.
* 1055337 is the frequency count of this concept in the Brown
corpus.
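
As a rough sketch of the three fields listed above, here is one way to split a data line in Python. The names ICEntry, parse_ic_line and lookup_synset are made up for illustration and are not part of NLTK. The lookup helper assumes a recent NLTK release that provides wn.synset_from_pos_and_offset, and it imports NLTK lazily so the parsing part runs without it.

```python
from typing import NamedTuple


class ICEntry(NamedTuple):
    offset: int    # WordNet synset offset (not a position in the Brown corpus)
    pos: str       # part of speech: 'n' for noun, 'v' for verb
    count: float   # frequency count stored in the IC file
    is_root: bool  # True if the line carries the ROOT marker


def parse_ic_line(line: str) -> ICEntry:
    """Split one data line such as '2137n 1055337' into its fields."""
    fields = line.split()
    key = fields[0]
    return ICEntry(
        offset=int(key[:-1]),  # everything except the last character
        pos=key[-1],           # the last character is the POS letter
        count=float(fields[1]),
        is_root=len(fields) == 3 and fields[2] == "ROOT",
    )


def lookup_synset(entry: ICEntry):
    """Map an entry back to its WordNet synset (needs NLTK plus WordNet data)."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    return wn.synset_from_pos_and_offset(entry.pos, entry.offset)


entry = parse_ic_line("2137n 1055337")
print(entry)
```

Calling lookup_synset(entry) only gives a meaningful answer if the installed WordNet is the same version the IC file was built against, since offsets change between WordNet releases.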

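Note that the offset in each line is unique, i.e., each line in the IC
file corresponds to a unique WordNet concept rather than to a word in
the Brown corpus. The offsets index into WordNet, so they have nothing
to do with len(brown.words()) and can easily be larger. The frequency
numbers can exceed it as well, since a word may occur many times in the
corpus and, as far as I understand, one occurrence can count toward
more than one concept.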
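
To make the use of these counts concrete, here is a small sketch that reads a few lines copied from the listing in the first message and turns a count into an information content value. The formula IC = -log(count / root count) and the treatment of the ROOT line as the per-POS total are my understanding of how such files are normally used, not something stated in Ted's post. load_counts and information_content are illustrative names.

```python
import math

# A few lines copied from the ic-brown.dat listing quoted earlier in the thread.
SAMPLE = """\
wnver::eOS9lXC6GvMWznF1wkZofDdtbBU
1740n 1915712 ROOT
1930n 859272
2137n 1055337
3993n 0
"""


def load_counts(text):
    """Return ({(pos, offset): count}, {pos: root total}) from IC file text."""
    counts = {}
    root_total = {}
    for num, line in enumerate(text.splitlines()):
        if num == 0 or not line.strip():
            continue  # skip the wnver header and blank lines
        fields = line.split()
        key = fields[0]
        offset, pos = int(key[:-1]), key[-1]
        value = float(fields[1])
        if len(fields) == 3 and fields[2] == "ROOT":
            # Assumption: ROOT lines carry the total count for that POS.
            root_total[pos] = root_total.get(pos, 0.0) + value
        if value != 0:
            counts[(pos, offset)] = value
    return counts, root_total


def information_content(counts, root_total, pos, offset):
    """IC = -log(count / root total); infinite for unseen concepts."""
    freq = counts.get((pos, offset), 0.0)
    if freq == 0:
        return math.inf
    return -math.log(freq / root_total[pos])


counts, root_total = load_counts(SAMPLE)
ic_2137 = information_content(counts, root_total, "n", 2137)
print(round(ic_2137, 3))
```

Concepts with a large count relative to the root get a low information content, and a concept with a zero count (such as 3993n above) is treated here as infinitely informative.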

Hope this helps.

--Beibei
