Cascading 4.0 WIP

Chris K Wensel

Jun 12, 2017, 11:42:56 PM
to cascadi...@googlegroups.com
Hey all,

Just a heads up on a new WIP release of Cascading 4.0.


Currently nothing structural has changed, but there are two new additions to the API.

- Java 8 Stream API methods on the Tap interface, along with c.t.TupleEntryStream and c.t.TupleStream helper classes

- DirTap to allow for sourcing and sinking local directory trees. DirTap can be used as a replacement for FileTap
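To make the directory-tree idea concrete, here is a rough plain-JDK sketch of what sourcing a local directory tree as a Java 8 Stream can look like. It deliberately does not use the Cascading API: `DirTreeStreamSketch` and `lines` are hypothetical names chosen for illustration, and the glob filtering is an assumption. The actual DirTap and Tap stream methods may behave differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DirTreeStreamSketch {

    // Hypothetical helper, not part of Cascading: streams every line of every
    // regular file under root whose file name matches the given glob.
    // The caller should close the returned stream to release file handles.
    public static Stream<String> lines(Path root, String glob) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return Files.walk(root)
            .filter(Files::isRegularFile)
            .filter(p -> matcher.matches(p.getFileName()))
            .sorted() // stable file order so results are repeatable
            .flatMap(DirTreeStreamSketch::readLines);
    }

    // Files.lines throws a checked exception, so wrap it for use in flatMap.
    private static Stream<String> readLines(Path file) {
        try {
            return Files.lines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<String> stream = lines(root, "*.txt")) {
            System.out.println(stream.count() + " lines found");
        }
    }
}
```

The try-with-resources in `main` matters: closing the outer stream also closes the underlying directory walk.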


I would love some feedback on the semantics of DirTap. Its usage seems to make sense, but I would love to squash any awkwardness early.

FWIW 4.0 will only support Java 8, and has dropped Hadoop 1.x support entirely. I also hope to leverage more Java 8 functionality in local mode to improve utilization and robustness for heavier workloads (this may mean a native local mode serialization API).

We haven’t updated to the latest Apache Tez due to an outstanding issue with one of their internal APIs not being terribly friendly. 

We are also hoping to see some high-performance Tez planner rules from our colleagues at Twitter soon (no pressure!).

In the next couple of weeks we hope to add native JSON support through a JSONCoercibleType, JSONScheme, and a handful of operations that allow for manipulating and creating complex nested documents. The code we have now is working great; it just needs a little more bake time and documentation. Let me know if this is of interest and we can speed up the release.

In tandem, we are considering moving XML support to an external sub-project. It hasn’t seen much love, and could stand a rewrite in the model of the new JSON primitives we are developing. If this is a bad idea, let me know.

If there are any suggestions or other improvements, please reply to this thread.

As an aside, modern versions of Elasticsearch have a patch to their ES-Hadoop library (https://www.elastic.co/downloads/hadoop) that works better with Cascading local mode: https://github.com/elastic/elasticsearch-hadoop/pull/937. The overall experience could be much improved, but I haven’t had time to sort that out.

cheers,
chris

