Plunger: A unit testing framework for Cascading


Elliot West

Nov 3, 2014, 7:31:08 AM11/3/14
to cascadi...@googlegroups.com
Greetings,
Hotels.com are pleased to announce the contribution of a project to the Cascading open source community. ‘Plunger’ is a unit testing framework for Cascading applications whose primary aim is to simplify the creation of automated tests for cascades, flows, assemblies and operations.
At Hotels.com, Cascading is the basis for numerous large-scale ETL processing jobs. Cascading has many virtues for us, but we were particularly attracted by its amenability to automated testing. We rely heavily on the suites of tests that we've developed for our applications and are therefore always keen to lower the effort required to implement them. With this in mind we developed Plunger to streamline the development of Cascading tests. Plunger reduces boilerplate code and provides a concise API for exercising all aspects of Cascading applications. Key features include:
  • A fluent API for declaring test data.
  • A harness for rapidly connecting, exercising, and verifying assemblies.
  • Sourcing and sinking test data from and to taps.
  • Assertions for common Cascading record types.
  • Stub builders for exercising operation implementations.
  • Component serialization verification.
The project can be found on GitHub and is available under the Apache 2.0 license: https://github.com/HotelsDotCom/plunger
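To give a flavour of the API, here is a minimal sketch of declaring test data and writing it to a tap. The DataBuilder and Plunger.writeData(...) calls mirror usage shown later in this thread; the package names and the local FileTap/TextDelimited scheme are illustrative assumptions rather than a definitive example.

import cascading.scheme.local.TextDelimited;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;
import com.hotels.plunger.Data;
import com.hotels.plunger.DataBuilder;
import com.hotels.plunger.Plunger;

public class PlungerSketch {
  public static void main(String[] args) {
    Fields fields = new Fields("id", "name");

    // Declare test data with the fluent API.
    Data corpus = new DataBuilder(fields)
        .addTuple("1", "first")
        .addTuple("2", "second")
        .build();

    // Write the declared data to any Tap implementation.
    FileTap tap = new FileTap(new TextDelimited(fields, "\t"), "target/test-data/input.tsv");
    Plunger.writeData(corpus).toTap(tap);
  }
}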
We hope that you find Plunger useful and welcome any feedback or contributions that you may have.
Many thanks - Elliot.
Elliot West
Software Dev Engineer II
Hotels.com

Chris K Wensel

Nov 3, 2014, 12:01:32 PM11/3/14
to cascadi...@googlegroups.com
This is great. Will add it to our .org site.

Anything we can do in Cascading core and the test APIs to help improve things? We are working on 2.7 (in tandem with 3.0), so any suggestions now would, we hope, help people with a 2.x -> 3.x migration.

ckw



Elliot West

Dec 16, 2014, 5:36:36 AM12/16/14
to cascadi...@googlegroups.com
Hi Chris,

Sorry for the delay in getting back to you on this. I do in fact have a suggestion that stems from our experiences with unit testing complex flows and cascades. For the most part we can effectively leverage existing Java tooling to build suites of tests. We can also practice good unit testing behaviours by composing our flows from modular assemblies and, of course, operations. These components are readily testable. However, when it comes to assemblies and larger-scale tests of flows and cascades, we find that our Cascading development short-circuits a fundamental piece of automated testing practice: the measurement and reporting of code coverage with tools such as Cobertura.

As it stands, it is simple to attain 100% test coverage of assemblies and flows because, in the true Java sense, we are simply exercising the construction logic and not the data processing logic that results from said construction. However, to truly measure test coverage in these instances, what we really need to be able to do is check that every vertex of the process's corresponding DAG has been exercised. I imagine that this would be as simple as measuring whether or not a vertex (pipe) has transported one or more Tuples. As it stands this is a mental exercise: we can imagine the graph and consider appropriate test scenarios to attain full coverage. However, this is prone to error - especially if the DAG differs from the one we think we've constructed (human error).

As a solution to this, it'd be great if there were some generic hooks into our Flows, Cascades, and Assemblies onto which we could build some tooling. I imagine such tools would interrogate the DAG after a test execution and report the names of pipes that did not transport any Tuples.
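To make this concrete, a purely hypothetical sketch of the kind of hook and report being described; none of these types exist in Cascading today, and the names are invented for illustration only.

import java.util.Set;

// Purely hypothetical: a post-execution hook that reports every pipe (DAG vertex)
// that did not transport a single Tuple during a test run.
public interface DagCoverage {

  /** Names of pipes that transported zero Tuples. */
  Set<String> untraversedPipes();
}

// Imagined use in a test, after flow.complete():
//   DagCoverage coverage = someCoverageHook.analyse(flow);   // hypothetical factory
//   assertTrue(coverage.untraversedPipes().isEmpty());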

I'd be keen to hear your thoughts on this.

Cheers - Elliot.

Elliot West

Dec 16, 2014, 9:41:57 AM12/16/14
to cascadi...@googlegroups.com
Hi Chris,

I've tried moving Plunger to cascading-3.0-wip-61 but have encountered a few issues with our code that writes to taps. We've found this feature very useful in practice: it helps us get around subtle differences between the local/Hadoop tap implementations, lets us build tests that use our own Tap implementations, and allows us to keep test data in easily maintainable Java code instead of having it baked into non-human-readable files. To be fair, I was always aware that I was creating some brittle implementations, as I was having to dig down into Cascading internals to obtain the behaviour I wanted. The methods in question are located here:
Example usage scenarios are visible in the tests:
The motivation behind this class is to enable the creation of truly representative data in any format described by a Tap instance. This in turn allows the creation of integration tests that are as close to a production environment as possible. While this is achievable with a sink, an identity pipe, and the respective Tap, such an approach introduces the overhead of executing additional Hadoop/local processes just to write data. Plunger's implementation constructs the minimum amount of scaffolding required around the tap and exercises it directly.

However, with Cascading 3 I see that the scaffolding I create may now need to become more complex. In Cascading 2 we needed to create a HadoopFlowStep to clean up the '_temporary' folder, and fortunately this was simple to construct. In the 3.0.0 version, however, HadoopFlowStep requires more complex initialisation, needing an ElementGraph and a FlowNodeGraph; these in turn also require complex initialisation values.

Now at this point my experience is telling me that I'm trying to do something that I shouldn't. But in practice we've found this feature very useful so I'm keen to persevere. Would it be possible to structure Taps in such a way that they can be used to write data outside of a flow with minimal dependencies?

Thanks - Elliot.



Chris K Wensel

Jan 3, 2015, 4:36:13 PM1/3/15
to cascadi...@googlegroups.com
Sorry for the delay, holidays and all.

This is totally reasonable, but I don't have any comments really, other than that I'm open to suggestions and that Cascading 3 would be the place to introduce such changes.

One way to do this could be to write a simple rule that injects a Counter operation at the head of every branch. But then you would need logic that can unwind the counters and work in such a way that you can reconcile the counts across multiple topologies (MR, DAG, local, etc.).
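For illustration, a rough sketch of what such an injected operation might look like, using the standard Filter and FlowProcess counter APIs; the rule that injects it and the logic that unwinds the counts are left out, and the counter group name is made up.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.operation.FilterCall;

// Pass-through Filter that counts every Tuple seen on a branch.
public class BranchCounter extends BaseOperation implements Filter {

  private final String branchName;

  public BranchCounter(String branchName) {
    this.branchName = branchName;
  }

  @Override
  public boolean isRemove(FlowProcess flowProcess, FilterCall filterCall) {
    // Record the Tuple against a per-branch counter; never actually remove anything.
    flowProcess.increment("dag-coverage", branchName, 1);
    return false;
  }
}

// A rule (or plain test code) could wrap each branch head:
//   pipe = new Each(pipe, new BranchCounter(pipe.getName()));
// and afterwards read the "dag-coverage" counter group from the flow stats
// to find branches with a zero count.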

ckw


Chris K Wensel

Jan 3, 2015, 4:42:42 PM1/3/15
to cascadi...@googlegroups.com
To open a tap for writing, you should only need:

new Hfs(…).openForWrite( new HadoopFlowProcess() )

See CascadingTestCase for lots of test helpers.
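Spelled out a little further, a minimal sketch of that pattern; the scheme, fields, and path here are placeholders, not taken from the thread.

import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntryCollector;

public class TapWriteSketch {
  public static void main(String[] args) throws Exception {
    Fields fields = new Fields("id", "name");
    Hfs tap = new Hfs(new TextDelimited(fields, "\t"), "/tmp/test-data");

    // Open the tap for writing outside of any flow and add some tuples.
    TupleEntryCollector collector = tap.openForWrite(new HadoopFlowProcess());
    try {
      collector.add(new Tuple("1", "first"));
      collector.add(new Tuple("2", "second"));
    } finally {
      collector.close();
    }
  }
}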

ckw




Elliot West

Mar 6, 2015, 9:32:22 AM3/6/15
to cascadi...@googlegroups.com
Hi Chris,

Sorry that it's taken me a while to get around to this, but I now have a test case to illustrate the problem with writing to certain types of Taps outside of a flow, namely the PartitionTap and the TemplateTap. Your suggested approach of using 'new TapType(…).openForWrite( new HadoopFlowProcess() )' works well until we write more than one partition, at which point the data does not appear to get moved into place from the temporary location.

Here is a Gist containing the test case: https://gist.github.com/teabot/c77b18d56526d6882f04

We do have workarounds in place but the code is horribly brittle and requires separate implementations for Cascading 2.x and 3.x. Therefore we'd be keen to have a working, simple, consistent approach if possible.
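For reference, an illustrative sketch of the shape of the failing pattern; this is not the Gist code itself, and the field names, partition layout, and the field ordering expected by the collector are assumptions.

import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.hadoop.Hfs;
import cascading.tap.hadoop.PartitionTap;
import cascading.tap.partition.DelimitedPartition;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.tuple.TupleEntryCollector;

public class PartitionTapWriteSketch {
  public static void main(String[] args) throws Exception {
    Fields value = new Fields("name");
    Fields partitionFields = new Fields("country");

    Hfs parent = new Hfs(new TextDelimited(value, "\t"), "/tmp/partitioned");
    PartitionTap tap = new PartitionTap(parent, new DelimitedPartition(partitionFields));

    // Write records that land in two different partitions, outside of any flow;
    // with more than one partition the data does not get moved out of the
    // temporary location.
    TupleEntryCollector collector = tap.openForWrite(new HadoopFlowProcess());
    try {
      Fields declared = partitionFields.append(value);
      collector.add(new TupleEntry(declared, new Tuple("UK", "first")));
      collector.add(new TupleEntry(declared, new Tuple("US", "second")));
    } finally {
      collector.close();
    }
  }
}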

Thanks - Elliot.


narain

May 26, 2015, 2:27:00 PM5/26/15
to cascadi...@googlegroups.com
Hello All,

Is it possible to use cascading.scheme.Scheme to define the input fields in DataBuilder? My test data will have at least 100 fields and I need to enforce a few constraints like "not null" on some fields while creating the test data.

Before using Plunger, I would create an Lfs tap (see the constructor below), pass in my scheme, and write my test data to a content file, which worked fairly well. I am trying to do the same using Plunger; please let me know if I am hitting a dead end on this one.

Lfs

@ConstructorProperties(value={"scheme", "stringPath", "sinkMode"})
public Lfs(Scheme scheme, String stringPath, SinkMode sinkMode)

Constructor Lfs creates a new Lfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String
sinkMode - of type SinkMode

Patrick Duin

Jun 5, 2015, 4:26:49 AM6/5/15
to cascadi...@googlegroups.com
Hi Narain,

Sorry for the late reply.
Plunger allows you to define masks so you don't have to specify data for all 100 fields in your test code.

Example:

Lfs lfs = new Lfs(scheme, stringPath, sinkMode);

// fieldsThatShouldNotBeNull is a Fields object that acts as a mask: only those
// fields need to be set in addTuple(); all other fields will get null values.
Data data = new DataBuilder(scheme.getSinkFields())
    .withFields(fieldsThatShouldNotBeNull)
    .addTuple("value1", "value2")
    .build();

Plunger.writeData(data).toTap(lfs);


I hope this helps :).

Cheers,
 Patrick

PS: You can open an issue directly on GitHub at https://github.com/HotelsDotCom/plunger/issues; it will be easier for us to spot.

narain

Jul 17, 2015, 11:16:22 AM7/17/15
to cascadi...@googlegroups.com
Thanks Patrick. I just noticed your response.

It really helped.