I've explored Kaggle Datasets, Dropbox, Zenodo, and even distributing the data as PyPI packages. But there's always the question of
how available the data can be. I.e. does it require the user to sign up for an account? How many hops/steps does the user have to take before the data can be read by nltk.corpus?
Up till now, nothing beats the simplicity of pulling zip files from GitHub.
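As a minimal sketch of that "pull a zip from GitHub" flow (the function name and layout are my own; the demo uses a local `file://` zip as a stand-in so it runs without network access):

```python
import io
import os
import tempfile
import zipfile
from urllib.request import urlopen

def fetch_corpus_zip(url, dest_dir):
    """Download a corpus zip (e.g. a GitHub archive/release URL) and extract it."""
    with urlopen(url) as resp:
        data = resp.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest_dir)
    return dest_dir

# Demo: build a tiny local zip standing in for a GitHub-hosted archive.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "corpus.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mycorpus/readme.txt", "a tiny corpus file\n")

out_dir = fetch_corpus_zip("file://" + zip_path, os.path.join(tmp, "data"))
print(os.path.exists(os.path.join(out_dir, "mycorpus", "readme.txt")))  # → True
```

Once extracted, the directory can be handed to something like `nltk.corpus.reader.PlaintextCorpusReader` — the single-URL, zero-signup download is what keeps the hop count low.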
How do we track data provenance? I.e. when the data is updated, is it versioned? How do we go back to track changes, and possibly have some sort of git blame mechanism to debug what went wrong if something breaks?
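One lightweight piece of that provenance story, independent of where the data is hosted, is recording a checksum per release so a silent change is at least detectable (a sketch, not any existing nltk_data mechanism):

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Hash a data file in chunks so large corpora don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: the recorded digest changes when the corpus file changes.
tmp = tempfile.mkdtemp()
corpus = os.path.join(tmp, "corpus.txt")

with open(corpus, "w") as f:
    f.write("version 1 of the corpus\n")
digest_v1 = sha256_of(corpus)

with open(corpus, "a") as f:
    f.write("an update nobody announced\n")
digest_v2 = sha256_of(corpus)

print(digest_v1 != digest_v2)  # → True
```

Pinning downloads to a tagged release URL plus a stored digest like this gives a crude "which version did I train on?" answer even without full git-blame granularity.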
How much support will the CDN give? There's always a bandwidth limit on file uploads/downloads, and also a storage size limit. I think the latter is cheap but the former is hard.