I'm glad to let you know that the translation of NLP w/ Python has just
been completed and is now in the process of publication. The book
is scheduled to go on sale on Nov. 8th. I'd like to thank you for your
advice and support throughout the translation process.
As we discussed when the translation began, I wrote one additional
chapter from scratch, dedicated to Japanese NLP, and we'd like to make
this chapter freely accessible online.
Could you tell me how we can make this happen? Specifically, in which
data format and language should I provide the chapter? (I currently have
the final PDF file, written in Japanese.) Would it be better if I
translated the whole chapter into English? (I can do this myself, but I
may need a native speaker to double-check the translation.)
If possible, we'd like to make the chapter available by Nov 4th, just
a few days before the publication.
Thank you in advance for your advice.
Thanks,
Masato Hagiwara
--
Masato HAGIWARA
http://lilyx.net/
Congratulations on completing the Japanese translation of the NLTK book!
Note that we need to resolve an outstanding issue with the Japanese
support in NLTK, and release a new version:
http://code.google.com/p/nltk/issues/detail?id=587
I would be pleased to host the extra chapter on the NLTK website, both
in Japanese and English versions. I can check the English. Let's
talk offline about how to do the hosting and translation checking.
-Steven Bird
Thank you for your quick reply.
> Note that we need to resolve an outstanding issue with the Japanese
> support in NLTK, and release a new version:
> http://code.google.com/p/nltk/issues/detail?id=587
Sorry for leaving these issues unsolved. I've updated the knbc.py
corpus reader accordingly. Please have a look here:
http://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/knbc.py
I have two questions regarding corpus readers:
- Currently the return type of knbc.words()[0] is unicode, not str. Is
this acceptable, or should it be encoded to str (in UTF-8)?
- I designed ChasenCorpusReader so that the tags are a complex type
(tuple), not just str, because Chasen's (and almost all Japanese
morphological analyzers') output tags are structured (in levels). In
practice, it's much better to keep the tags as tuples. Should I
still represent them as strings?
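
To make the second question concrete, here is a minimal sketch of the two
representations. It is written in present-day Python 3 rather than the
Python 2 in use at the time. The helper names split_tag and join_tag, the
sample tag, and the assumption that tag levels are joined with a hyphen
are illustrative only; none of this is taken from knbc.py or
ChasenCorpusReader.

```python
# Hypothetical helpers for illustration; not part of NLTK or knbc.py.
# Assumption: a hierarchical tag is written with its levels joined by "-".

def split_tag(tag, sep="-"):
    """Turn a hierarchical tag string into a tuple of levels."""
    return tuple(tag.split(sep))


def join_tag(levels, sep="-"):
    """Flatten a tuple of levels back into a single tag string."""
    return sep.join(levels)


tag = "名詞-固有名詞-人名"          # sample tag, chosen for illustration
levels = split_tag(tag)

print(levels)                      # the structured (tuple) form
print(levels[0])                   # the first level only
print(join_tag(levels) == tag)     # the string form can be recovered
```

With the tuple form, selecting a coarser tag is an index or slice; with
the string form, every caller has to know the separator and re-split.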
> I would be pleased to host the extra chapter on the NLTK website, both
> in Japanese and English versions. I can check the English. Let's
> talk offline about how to do the hosting and translation checking.
Thanks! I'll reply to you offline for the details.
Regards,
Masato Hagiwara
2010/10/23 Steven Bird <steve...@gmail.com>:
I've adopted these changes, thanks. Note that the NLTK version of
this file identifies you as author, and includes the standard NLTK
copyright statement and license.
We still need to get nltk/test/japanese.doctest sorted out...
> - Currently the return type of knbc.words()[0] is unicode, not str. Is
> this acceptable, or should it be encoded to str (in UTF-8)?
Unicode is fine.
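
For reference, a minimal sketch of the distinction in question, in
Python 3 terms, where the text type is called str and the encoded form
is bytes (in the Python 2 of this thread those were unicode and str
respectively). The sample word is an arbitrary illustration, not corpus
data.

```python
# Illustration only: text string versus its UTF-8 encoded bytes.
word = "自然言語"                    # text (Python 2: unicode, Python 3: str)
encoded = word.encode("utf-8")       # bytes (Python 2: str)

print(len(word))                     # counts characters
print(len(encoded))                  # counts bytes, not characters
print(encoded.decode("utf-8") == word)
```

Returning the text type means len() and slicing operate on characters,
which is usually what corpus code expects.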
> - I designed ChasenCorpusReader so that the tags are a complex type
> (tuple), not just str, because Chasen's (and almost all Japanese
> morphological analyzers') output tags are structured (in levels). In
> practice, it's much better to keep the tags as tuples. Should I
> still represent them as strings?
The issue here is to have a consistent API across languages. The
return type of the standard corpus access methods should be the same
across languages, so that people can run the same program on a
different corpus just by changing the name of the corpus they are
loading (and not editing all the later code).
How about we support different return types with different method
names? E.g. words() returns a list of unicode strings, and
word_tuples() (or some better, agreed name) returns the tuples you
want.
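
A rough sketch of how the two-method idea might look, in Python 3.
ToyTaggedReader, its in-memory data, and vocabulary_size are hypothetical
stand-ins, not NLTK code, and word_tuples() is only the provisional name
floated above. That it returns (word, tag levels) pairs is likewise an
assumption.

```python
class ToyTaggedReader:
    """Hypothetical in-memory stand-in for a corpus reader (not NLTK)."""

    def __init__(self, entries):
        # entries: iterable of (word, tag_levels) pairs
        self._entries = list(entries)

    def words(self):
        """Standard accessor: plain text strings, same type for every corpus."""
        return [word for word, _ in self._entries]

    def word_tuples(self):
        """Corpus-specific accessor: words paired with full tag tuples."""
        return list(self._entries)


def vocabulary_size(reader):
    """Client code that relies only on the shared words() contract."""
    return len(set(reader.words()))


japanese = ToyTaggedReader([
    ("太郎", ("名詞", "固有名詞", "人名")),
    ("は", ("助詞", "係助詞")),
    ("走る", ("動詞", "自立")),
])
english = ToyTaggedReader([("Taro", ("NNP",)), ("runs", ("VBZ",))])

# The same function runs on either corpus without modification.
print(vocabulary_size(japanese), vocabulary_size(english))
# Callers that need the full structure opt in explicitly.
print(japanese.word_tuples()[0])
```

Code written against words() does not need to change when the corpus
does; only callers that ask for word_tuples() see the richer structure.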
An alternative is to make a clear case (on the nltk-dev list) why we
should not have a consistent API for the corpus readers.
I hope we can resolve this quickly, since I expect it impacts the book.
-Steven
2010/10/24 Steven Bird <steve...@gmail.com>:
> On 23 October 2010 17:16, Masato Hagiwara <hag...@gmail.com> wrote:
>> Sorry for leaving these issues unsolved. I've updated the knbc.py
>> corpus reader accordingly. Please have a look here:
>>
>> http://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/knbc.py
>
> I've adopted these changes, thanks. Note that the NLTK version of
> this file identifies you as author, and includes the standard NLTK
> copyright statement and license.
>
> We still need to get nltk/test/japanese.doctest sorted out...
>
OK, thanks! I expect all of these errors will be sorted out once the
following problems are fixed.
>> - Currently the return type of knbc.words()[0] is unicode, not str. Is
>> this acceptable, or should it be encoded to str (in UTF-8)?
>
> Unicode is fine.
>
>> - I designed ChasenCorpusReader so that the tags are a complex type
>> (tuple), not just str, because Chasen's (and almost all Japanese
>> morphological analyzers') output tags are structured (in levels). In
>> practice, it's much better to keep the tags as tuples. Should I
>> still represent them as strings?
>
> The issue here is to have a consistent API across languages. The
> return type of the standard corpus access methods should be the same
> across languages, so that people can run the same program on a
> different corpus just by changing the name of the corpus they are
> loading (and not editing all the later code).
>
OK, I understand the concerns here.
> How about we support different return types with different method
> names? E.g. words() returns a list of unicode strings, and
> word_tuples() (or some better, agreed name) returns the tuples you
> want.
>
I think this is a good design. words() can just return the first-level
PoS as str, while word_tuples() gives the complete information.
> An alternative is to make a clear case (on the nltk-dev list) why we
> should not have a consistent API for the corpus readers.
>
> I hope we can resolve this quickly, since I expect it impacts the book.
I'm afraid the book is already being printed, but we can reflect these
changes in the next print run and the online version.
Thanks,
Masato Hagiwara
Unfortunately I have not managed to release a new version of NLTK
prior to my forthcoming field trip. I hope to release it by 20
November. I apologise for any inconvenience this causes.
Thanks,
-Steven
No worries, the update of the related corpus readers in the repository
was just in time for the publication.
Thank you again for your cooperation. Looking forward to the next release.
Best,
Masato
2010/11/6 Steven Bird <steve...@gmail.com>: