Contributing a new corpus

SC09JBG

Feb 15, 2012, 7:10:08 AM

to nltk-dev

Hi,
I am a final year student at the University of Leeds and I am
currently carrying out the background research for my project:
"Integrating the LOB corpus into NLTK".
As the title makes clear, the aim of the project is to make LOB, a
British English "version" of the Brown corpus, compatible with NLTK's
tools and demos. I also aim to create documentation to support LOB. I
am aiming to convert LOB into exactly the same format that Brown
currently uses within NLTK.
I had a few questions regarding some of the work that needs to be
done.
Firstly, are there any guides on what exactly is required to form a
contribution to NLTK (I looked but couldn't find any, and the main
site advised writing to nltk-dev)?
Am I correct in thinking that if I succeed in getting LOB into the
same format as Brown, it will work with the corpus reader that is
already in place (CategorizedTaggedCorpusReader, that Brown uses)?
Finally, if there is anything else that I should know, I would be
grateful to hear it.
Thanks

Morten Minde Neergaard

Feb 18, 2012, 6:35:57 PM

to SC09JBG, nltk-dev

At 04:10, Wed 2012-02-15, SC09JBG wrote:
> Hi,
> I am a final year student at the University of Leeds and I am
> currently carrying out the background research for my project:
> "Integrating the LOB corpus into NLTK".

Cool! Sorry for not replying earlier =)

[…]

> Firstly, are there any guides on what exactly is required to form a
> contribution to NLTK (I looked but couldn't find any, and the main
> site advised writing to nltk-dev)?

That's the best way of doing it! The technicalities I'll leave for
Steven or someone else who knows about such things. Is the corpus
released under a license that permits us to distribute it?

> Am I correct in thinking that if I succeed in getting LOB into the
> same format as Brown, it will work with the corpus reader that is
> already in place (CategorizedTaggedCorpusReader, that Brown uses)?
> Finally, if there is anything else that I should know, I would be
> grateful to hear it.

There are some good hints in how the corpus is loaded in
nltk/corpus/__init__.py:

brown = LazyCorpusLoader(
    'brown', CategorizedTaggedCorpusReader, r'c[a-z]\d\d',
    cat_file='cats.txt', tag_mapping_function=simplify_brown_tag)

This basically just defines how the files are named and where the
categories are defined. The tag_mapping_function parameter is used for
extracting the simplified tags, as described in the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html#a-simplified-part-of-speech-tagset
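
To make that a bit more concrete, here is a rough sketch in plain Python
(not NLTK's actual implementation) of what the pattern and the category
file in the loader call above amount to: the regular expression picks out
the corpus files, and the category file maps each file to its categories.
The file names and category lines are made-up samples, and the sketch
assumes one whitespace-separated "fileid category ..." entry per line and
that the pattern must match the whole file name.

```python
import re

# Same pattern as in the loader call: 'c', one lowercase letter, two digits.
FILEID_PATTERN = re.compile(r'c[a-z]\d\d')


def select_fileids(names):
    """Keep only the names that the pattern matches in full (an assumption)."""
    return [name for name in names if FILEID_PATTERN.fullmatch(name)]


def parse_cat_file(text):
    """Map each file id to its list of categories.

    Assumes each non-empty line looks like 'fileid cat1 cat2 ...'.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            mapping[parts[0]] = parts[1:]
    return mapping


# Hypothetical directory listing and category file contents.
names = ['ca01', 'cb02', 'cats.txt', 'README']
fileids = select_fileids(names)
cats = parse_cat_file("ca01 news\ncb02 editorial\n")
```

If the LOB files are renamed to fit the same pattern and given a matching
category file, the same kind of loader definition should apply to them.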

So yes, using the word/tag format of the Brown corpus should be
a perfectly viable approach and not too much work! ^_^
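
For reference, a minimal sketch of what reading the word/tag format
involves, in plain Python rather than through NLTK's own reader. The
function name and the sample sentence are illustrative only, and splitting
on the last slash is an assumption made so that words which themselves
contain a slash stay intact.

```python
def parse_tagged_line(line, sep='/'):
    """Split a line of word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST separator so a word such as '1/2' is kept whole.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: treat the token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Illustrative sample in the word/tag style.
sample = "The/at jury/nn said/vbd 1/2/cd ./."
pairs = parse_tagged_line(sample)
```

A converted LOB file can be spot-checked this way before it is handed to
the corpus reader, for example by looking for tokens that come back with
no tag.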

Cheers,
--
Morten Minde Neergaard

SC09JBG

Apr 17, 2012, 9:13:03 AM

to nltk-dev

Hello again,
I apologise for not having kept my post up to date; it has been a busy
few months!
I have managed to convert the LOB corpus into a format compatible with
NLTK (I aimed to mirror the format used by Brown, and have put it
together in the same way).
I have experimented with using it within the toolkit by editing the
files mentioned above, and it seems to be working well. I did some
testing using the Brown corpus examples from the Google Code pages, but
with the LOB files in place of Brown, and it seemed to give the
correct output.
I will be carrying out some tests in my evaluation to compare its
performance to the Brown corpus and hope to get good results.

I am not really sure where to go from here in terms of making it a
"contribution" (if that is feasible), and I am mainly looking for
some validation that what I have done is correct.

Regarding the question above about licences: yes, I have been granted
permission to do this work and to provide it as a contribution to NLTK.

Again, I'm sorry for this late post; everything is very rushed at the
moment, with only three weeks to go until hand-in.

Thanks
