Is there a Scala equivalent for Pandas and StatsModels?

boris....@gmail.com

Jan 2, 2016, 5:05:03 PM
to scala-user
Pandas is a data analysis library for Python http://pandas.pydata.org/ and StatsModels is a statistical package for Python http://statsmodels.sourceforge.net/.
Is there something similar for Scala or even Java?

Simon Ochsenreither

Jan 3, 2016, 7:48:06 AM
to scala-user
Haven't used it, but maybe this helps? https://saddle.github.io/

> Saddle evolved from earlier prototypes developed by Chris Lewis, Cheng Peng, and David Cru, and draws on Adam's prior experience developing the pandas Python library.

boris....@gmail.com

Jan 3, 2016, 10:54:12 AM
to scala-user
It does offer some of the functionality of Pandas, but the maintainer abandoned the project a year ago because his new company works with F#. So it's unclear to me whether Saddle will work with future versions of Scala.

Darren Wilkinson

Jan 4, 2016, 9:15:35 AM
to scala-user
On Saturday, January 2, 2016 at 10:05:03 PM UTC, boris....@gmail.com wrote:
> Pandas is a data analysis library for Python http://pandas.pydata.org/ and StatsModels is a statistical package for Python http://statsmodels.sourceforge.net/.
> Is there something similar for Scala or even Java?

The short answer is "no", I fear. I try to review the "pandas equivalent" situation here: https://darrenjw.wordpress.com/2015/08/21/data-frames-and-tables-in-scala/

I don't know of anything similar to StatsModels. I have considered starting a project to develop something similar, but it's difficult to build something nice without first deciding on a data frame implementation to build on top of.


boris....@gmail.com

Jan 6, 2016, 10:55:41 AM
to scala-user
Unfortunately, all the libraries you review except Spark have been abandoned, as is to be expected for one-man hobby projects. The situation isn't any better with Haskell and F#. I think it's easiest to just use Python, which is about to become a more versatile competitor to R.

In a post on your blog from 2013, you argue that Scala is a good choice for data analysis. Do you still think so? It seems that Breeze and Factorie are the only useful libraries for data analysis, and both packages have little to no documentation.

Dean Wampler

Jan 6, 2016, 11:25:21 AM
to boris....@gmail.com, scala-user
The best numerics library (not quite what you asked for...) is Spire: https://github.com/non/spire

On Wed, Jan 6, 2016 at 9:55 AM, <boris....@gmail.com> wrote:
> Unfortunately, all the libraries you review except Spark have been abandoned, as to be expected for one-man hobby-projects. The situation isn't any better with Haskell and F#. I think it's easiest to just use Python, which is about to become a more versatile competitor for R.
>
> In a post on your blog from 2013, you argue that Scala is a good choice for data analysis. Do you still think so? It seems that Breeze and Factorie are the only useful libraries for data analysis, and both packages have little to no documentation.


Darren Wilkinson

Feb 11, 2016, 2:13:28 PM
to scala-user
On Wednesday, January 6, 2016 at 3:55:41 PM UTC, boris....@gmail.com wrote:
> In a post on your blog from 2013, you argue that Scala is a good choice for data analysis. Do you still think so? It seems that Breeze and Factorie are the only useful libraries for data analysis, and both packages have little to no documentation.

Apologies for the delayed response - my use of Google Groups is a bit unpredictable... The point of that blog post was to argue that Scala could and should be a good choice for data analysis. I don't think I was arguing that it already is.

Breeze is very good and getting better, despite the sketchy documentation, and it provides a lot of the things that are most difficult to implement. As mentioned above, I think the main thing holding us back is the lack of a good data frame implementation that the community settles on. There has been discussion on the Breeze GitHub about developing a new Breeze data frame - that is one possibility. Given a data frame, adding some "statsmodels"-like functionality would be relatively straightforward. A good viz library would also be useful, though again, that shouldn't be fantastically difficult to develop. There is a lot of good Scala ML stuff in Spark MLlib, but that isn't possible to use from a regular Scala project, which is rather frustrating...

I must admit that for routine data analysis I still mainly use R, but for algorithm development and for "big data" I'm still using Scala and still happy with that choice.
