I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes and a couple of major new features. You can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent
What are the new features?
The biggest is the integration of high-resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
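As a rough illustration of that kind of normalization, here is a minimal Python sketch. The URL layout for coordinates2statistics (coordinates in the path, a `statistics` query parameter) and the default server address are assumptions on my part, so check them against your own DSTK instance. The helper names `statistics_url` and `normalize_by_density` are hypothetical, and dividing by density is just one simple way to correct for population.

```python
from urllib.parse import quote

# Assumed address of a DSTK server; point this at your own instance.
DSTK_BASE = "http://www.datasciencetoolkit.org"


def statistics_url(lat, lon, statistic="population_density", base=DSTK_BASE):
    """Build a coordinates2statistics URL.

    The path layout and the 'statistics' parameter name are assumptions,
    not something confirmed by this post.
    """
    coords = quote(f"{lat},{lon}", safe="")  # percent-encode the comma
    return f"{base}/coordinates2statistics/{coords}?statistics={statistic}"


def normalize_by_density(counts, densities):
    """Divide each place's raw count by its population density.

    Places with a missing or zero density are skipped rather than guessed.
    """
    rates = {}
    for place, count in counts.items():
        density = densities.get(place)
        if density:
            rates[place] = count / density
    return rates


# Two places with the same raw count look very different once the
# number of people living there is taken into account.
raw_counts = {"city": 100, "village": 100}
density = {"city": 1000.0, "village": 10.0}
print(normalize_by_density(raw_counts, density))
```

The density values in the usage example are made up for illustration; in practice they would come from the API response for each location.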
I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.
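For the comment categorization mentioned above, something like the following sketch could work. It assumes the method is reachable as a POST to `/text2sentiment` and returns JSON with a numeric `score` field; both of those, plus the server address, the 0.5 threshold, and the helper names, are assumptions for illustration only.

```python
import json
from urllib.request import Request, urlopen


def fetch_sentiment_score(text, base="http://www.datasciencetoolkit.org"):
    """POST text to text2sentiment and return its score.

    The endpoint path and the 'score' response field are assumptions.
    """
    request = Request(f"{base}/text2sentiment", data=text.encode("utf-8"))
    with urlopen(request, timeout=30) as response:
        return float(json.load(response)["score"])


def label_comment(score, threshold=0.5):
    """Bucket a numeric score into positive, negative, or neutral.

    The threshold is an arbitrary illustrative cut-off.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

Keeping the labeling separate from the network call makes it easy to tune the threshold against a hand-labeled sample of comments without hitting the server again.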
text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.
What's still to be done?
The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.
The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.
Unit testing has shown that text2sentences isn't working at all!
Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.