[ANN] tech.ml.dataset - 2.0

Chris Nuernberger

Jun 15, 2020, 12:50:52 PM
to clo...@googlegroups.com

Good morning Clojurians :-)

It is with much pride that I announce version 2.0 of tech.ml.dataset, our library that maps powerful concepts from libraries like Pandas and data.table into Clojure using functional paradigms. This data frame library offers unified loading from CSV, TSV, XLSX, XLS, Apache Parquet, Apache Arrow (.feather), SQL, JSON, and sequences of maps, along with efficient CPU and memory performance. Finally, because the dataset knows the datatype of each column, you can interoperate with schema-ful systems like SQL without writing down the schema.


user> (require '[tech.ml.dataset :as ds])
nil
user> (-> (ds/->dataset "https://vega.github.io/vega/data/stocks.csv")
          (ds/descriptive-stats))
https://vega.github.io/vega/data/stocks.csv: descriptive-stats [3 10]:

| :col-name |          :datatype | :n-valid | :n-missing |       :min |      :mean | :mode |       :max | :standard-deviation | :skew |
|-----------|--------------------|----------|------------|------------|------------|-------|------------|---------------------|-------|
|      date | :packed-local-date |      560 |          0 | 2000-01-01 | 2005-05-12 |       | 2010-03-01 |                     |       |
|     price |           :float32 |      560 |          0 |      5.970 |      100.7 |       |      707.0 |               132.6 | 2.413 |
|    symbol |            :string |      560 |          0 |            |            |  MSFT |            |                     |       |
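
The same entry point should also cover the "sequences of maps" case mentioned above. A minimal sketch, assuming `ds/->dataset` accepts an in-memory sequence of maps as the announcement describes; the rows below are made up for illustration and are not taken from stocks.csv:

```clojure
(require '[tech.ml.dataset :as ds])

;; Hypothetical in-memory rows; the keys and values are illustrative only.
(def rows
  [{:symbol "MSFT" :price 39.81}
   {:symbol "MSFT" :price 36.35}
   {:symbol "AAPL" :price 25.94}])

;; Assumption: ->dataset takes a sequence of maps, one column per key.
(def stocks (ds/->dataset rows))

;; The :datatype column of the summary shows the type detected for each column.
(ds/descriptive-stats stocks)
```

No schema is written down anywhere here; the column datatypes come from the values themselves.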

Data science is (still) alive and well in Clojure and on the JVM. Stepping back and considering the Python bindings, the R bindings, Smile, the next-gen BLAS/numerics library Neanderthal, and the exceptionally powerful Saite science platform, we have really come a long way in the last year!

Thanks and enjoy :-)

Alexandre Almosni

Jun 15, 2020, 3:48:40 PM
to Clojure
Congratulations. This is really a great effort and something we really needed. I hope the community takes this as the base layer for data science and we can build on your efforts, expand the documentation, etc.

Chris Nuernberger

Jun 15, 2020, 6:52:32 PM
to clo...@googlegroups.com
Thank you, Alexandre! I have to admit it is a *ton* of work. I think there are lots of good pathways in literally every direction, such as simplifying the numerics layer (tech.datatype), potentially getting a subset working on graalvm-native, and zero-copy conversion for parquet and arrow where possible (totally possible in lots of cases); it just depends on what seems to provide the most value to everyone.

Plus, learning exactly how to use this system is a thing; it is complex, as are numpy, pandas, and data.table. Bridging between Clojure, APL, and C puts this in a unique position.

That being said, Thomaz has released tablecloth, which has a more advanced dataset API based on the primitives in tech.ml.dataset, with some great documentation.

