Cool! Sorry for not replying earlier =)
[…]
> Firstly, are there any guides on what exactly is required to form a
> contribution to NLTK (I looked but couldn't find any, and the main
> site advised writing to nltk-dev)?
Writing to nltk-dev is the best way of doing it! The technicalities
I'll leave for Steven or someone else who knows about such things. Is
the corpus released under a license that permits us to distribute it?
> Am I correct in thinking that if I succeed in getting LOB into the
> same format as Brown, it will work with the corpus reader that is
> already in place (CategorizedTaggedCorpusReader, that Brown uses)?
> Finally, if there is anything else that I should know, I would be
> grateful to hear it.
There are some good hints in how the Brown corpus is loaded in
nltk/corpus/__init__.py:
    brown = LazyCorpusLoader(
        'brown', CategorizedTaggedCorpusReader, r'c[a-z]\d\d',
        cat_file='cats.txt', tag_mapping_function=simplify_brown_tag)
This basically just defines how the files are named and where the
categories are defined. The tag_mapping_function parameter is used for
extracting the simplified tags, as described in the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html#a-simplified-part-of-speech-tagset
So yes, converting LOB to the word/tag format used by the Brown corpus
should be a perfectly viable approach and not too much work! ^_^
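
As a rough illustration of what "the same format as Brown" could mean in
practice, here is a minimal sketch using only the standard library. It
assumes the target layout is one file per document named to match
r'c[a-z]\d\d', whitespace-separated word/tag tokens, and a cats.txt with
one "fileid category" pair per line. The helper names (to_brown_line,
from_brown_line, write_corpus) and the sample sentence are made up for
the example; they are not part of NLTK, and nothing here reads real LOB
data.

```python
import re
import tempfile
from pathlib import Path

# Same file-name pattern the Brown loader passes to the corpus reader;
# LOB files would need names that match it (or a pattern of their own).
FILEID_PATTERN = re.compile(r'c[a-z]\d\d')


def to_brown_line(tagged_sentence):
    """Join (word, tag) pairs into one line: word/tag word/tag ..."""
    return ' '.join(f'{word}/{tag}' for word, tag in tagged_sentence)


def from_brown_line(line):
    """Split a word/tag line back into (word, tag) pairs.

    Splitting on the last '/' keeps words that contain a slash intact.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs


def write_corpus(root, documents, categories):
    """Write one file per document plus a cats.txt category mapping."""
    root = Path(root)
    for fileid, sentences in documents.items():
        if not FILEID_PATTERN.fullmatch(fileid):
            raise ValueError(f'file id {fileid!r} does not match the pattern')
        text = '\n'.join(to_brown_line(s) for s in sentences) + '\n'
        (root / fileid).write_text(text, encoding='utf-8')
    cats = ''.join(f'{fileid} {categories[fileid]}\n' for fileid in documents)
    (root / 'cats.txt').write_text(cats, encoding='utf-8')


# Invented sample data, only to exercise the functions above.
sample = {'ca01': [[('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')]]}
sample_categories = {'ca01': 'news'}

with tempfile.TemporaryDirectory() as tmp:
    write_corpus(tmp, sample, sample_categories)
    written_line = (Path(tmp) / 'ca01').read_text(encoding='utf-8').strip()
    written_cats = (Path(tmp) / 'cats.txt').read_text(encoding='utf-8')

# Reading the written line back should give the original pairs.
round_trip = from_brown_line(written_line)
```

A directory written this way is the kind of input the loader definition
above describes (file-name pattern plus cat_file), so the next step would
be pointing CategorizedTaggedCorpusReader at it; that part is not shown
here and would need checking against the installed NLTK version.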
Cheers,
--
Morten Minde Neergaard