A major new release of the Data Science Toolkit!

Pete Warden

May 19, 2013, 9:49:19 PM
to dstk-...@googlegroups.com
I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes and a couple of major new features. You can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high-resolution (sub-km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places really have an unusually high occurrence of X, rather than just having more people.
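To make the normalization idea concrete, here is a minimal Python sketch. The URL layout (coordinates in the path, a `statistics` query parameter), the base URL, and both helper names are assumptions for illustration rather than the documented API, so check them against your own DSTK server before relying on them. The sketch only builds the request URL and does the arithmetic; it does not contact a server.

```python
from urllib.parse import quote

# Assumed base URL; point this at your own DSTK server or VM.
DSTK_BASE = "http://www.datasciencetoolkit.org"


def coordinates2statistics_url(lat, lon, statistics=("population_density",),
                               base=DSTK_BASE):
    """Build a coordinates2statistics URL (the path layout is an assumption)."""
    coords = quote(f"{lat},{lon}", safe="")  # percent-encode the comma
    return f"{base}/coordinates2statistics/{coords}?statistics={','.join(statistics)}"


def normalize_by_population(count, population_density):
    """Divide a raw count by population density.

    Returns None where the density is missing or not positive, so that
    unpopulated cells do not produce a division by zero.
    """
    if population_density is None or population_density <= 0:
        return None
    return count / population_density


if __name__ == "__main__":
    print(coordinates2statistics_url(37.7, -122.4))
    # Two places with the same raw count but very different densities.
    print(normalize_by_population(50, 10000))  # dense city
    print(normalize_by_population(50, 100))    # sparse town
```

With equal raw counts, the sparse town ends up with a rate 100 times higher than the dense city, which is the signal a plain heatmap of counts would hide.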

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.
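As a rough sketch of that kind of categorization in Python: the code below assumes the method accepts the raw text as a POST body at `/text2sentiment` and returns JSON with a numeric `score` field. Both of those, along with the helper names and the zero threshold, are assumptions for illustration, so verify them against your server. The request is only constructed here, never sent.

```python
import json
import urllib.request


def build_text2sentiment_request(text, base="http://www.datasciencetoolkit.org"):
    """Build (but do not send) a POST request; the endpoint path is an assumption."""
    return urllib.request.Request(
        f"{base}/text2sentiment",
        data=text.encode("utf-8"),
        method="POST",
    )


def categorize(response_body, threshold=0.0):
    """Bucket a response into positive/negative/neutral.

    Assumes the body is JSON with a numeric "score" field; returns
    "unknown" if that field is absent.
    """
    score = json.loads(response_body).get("score")
    if score is None:
        return "unknown"
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    req = build_text2sentiment_request("I love this toolkit")
    print(req.get_method(), req.full_url)
    # Hypothetical response body, used only to exercise the bucketing.
    print(categorize('{"score": 2.5}'))
```

To send the request for real, pass it to `urllib.request.urlopen` and feed the decoded response body to `categorize`.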

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and JavaScript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova (currently uploading, should be available by midnight Pacific tonight)

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.


Sourabh Antani

Sep 15, 2015, 3:41:38 PM
to dstk-users
Hello Pete, 

This is a wonderful toolkit, but it seems to be down almost whenever I try to visit it. I have only been able to successfully browse the site and hit the APIs 3-4 times so far. Could I be doing something wrong? Or is there a time of day when the site is down for maintenance?


Hack G

Jan 12, 2016, 1:32:09 PM
to dstk-users

Any way someone can post a working torrent link? The link below doesn't seem to work.

Wayne Seguin

Feb 1, 2016, 2:09:04 PM
to dstk-users

Nothing but broken links. Love the product, would love the update even more...



Mark Wittkowski

Jan 31, 2017, 6:40:07 AM
to dstk-users

Hi Pete!

Please let me know if anything has changed with text2people. We need to find the most likely ethnicity of contacts in a file, with ongoing updates, and this seems like the perfect API for it. Let me know as soon as you can.

Thanks so much!

On Sunday, May 19, 2013 at 8:49:19 PM UTC-5, Pete Warden wrote: