Data distribution


liling tan

Aug 16, 2018, 11:00:05 PM
to nltk-dev
Dear NLTK contributors and devs, 

Following up on the issue https://github.com/nltk/nltk/issues/2079, my proposal is to create a new NLTK data distribution so that users don't rely solely on github.com to pull nltk_data. To quote the issue:


I've explored Kaggle Datasets, Dropbox, Zenodo, and even distributing the data as PyPI packages. But there are always limitations around:

  • how available the data can be. I.e., does the user need to sign up for an account? How many hops/steps does a user have to take before the data can be read by nltk.corpus? So far, nothing beats the simplicity of pulling zip files from GitHub.

  • how to track data provenance. I.e., when the data is updated, is it versioned? How do we go back and track changes, with some sort of git blame mechanism to debug what went wrong if something breaks?

  • how much support the CDN will give. There's always a bandwidth limit on uploading/downloading files, as well as a storage size limit. I think the latter is cheap, but the former is hard.
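For context, the "pull a zip from GitHub" distribution boils down to fetching files from the nltk_data repository's gh-pages branch. A minimal sketch of how such a URL can be built, assuming the `packages/<category>/<package>.zip` layout (the helper name here is hypothetical, not NLTK API):

```python
# Hedged sketch: build the raw-GitHub URL for an nltk_data package zip.
# The "packages/<category>/<package>.zip" path mirrors the nltk_data
# gh-pages layout; this helper is illustrative, not part of NLTK.
def nltk_data_zip_url(category: str, package: str) -> str:
    """Return the raw-GitHub URL for a given nltk_data package zip."""
    base = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages"
    return f"{base}/packages/{category}/{package}.zip"

print(nltk_data_zip_url("corpora", "wordnet"))
```

The appeal is exactly what the bullet above says: a user needs no account and only one HTTP GET, at the cost of depending on GitHub's availability and bandwidth limits.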


Through a personal contact, it may be possible to get the NLTK datasets onto Amazon S3 with free hosting, but it will take some time to port the data there. I'm also not sure how the data would be accessed, and I'm not personally used to tracking data changes on S3. Additionally, if a contributor wants to add a new dataset or edit an existing one, how would that be done on S3 without the bottleneck of someone with admin access uploading it manually?


Regards,
Liling

liling tan

Oct 15, 2018, 8:38:07 PM
to nltk-dev
Heads up: I'm going to try https://arrow.apache.org/docs/python/ and see whether we can replace all the datatypes in nltk.corpus.readers.*.

Might take a while, but I think it'll be worth it.



