Testing Scalding when using Parquet/Avro


Serega Sheypak

Aug 19, 2014, 8:16:15 AM
to cascadi...@googlegroups.com
Hi, we are heavy Pig users and are currently doing a PoC on Scalding.
I need to create or find a simple utility that can help me translate:
1. human-readable input to Parquet/Avro
2. Parquet/Avro results to a human-readable format

The idea is:
1. create human-readable testing input for scalding job
2. declare desired target format (avro/parquet)
3. feed input to job
4. convert result to human readable format and verify.

We threw away PigUnit since it was useless for us and created our own Pig testing utility, which automatically converts human-readable input to the target format and feeds it to the script during the test. We also have an automatic tool that parses Avro/Parquet results to JSON and lets us write clear output verifications.
Is there something similar in Scalding?
What are other approaches to integration testing a Scalding job using readers and writers?

William Briggs

Aug 19, 2014, 9:15:07 AM
to cascadi...@googlegroups.com
You should consider using the JobTest class; it lets you mock the sources and sinks on any Scalding job, and automate the result validation. A very simple example can be found here: https://github.com/snowplow/scalding-example-project/blob/master/src/test/scala/com/snowplowanalytics/hadoop/scalding/WordCountTest.scala

-Will
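
For illustration, here is a minimal sketch of a JobTest-based test using Scalding's fields API. WordCountJob, the argument names "input" and "output", and the sample lines are made up for this sketch and are not taken from the linked project. It assumes a Scalding version in which JobTest accepts a job constructor function.

```scala
import com.twitter.scalding._

// Hypothetical job, for illustration only: counts words in a text file.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

// Hypothetical wrapper so the check can be invoked from any test framework.
object WordCountJobCheck {
  def run(): Unit = {
    JobTest(new WordCountJob(_))
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // TextLine yields (offset, line) pairs, so the mocked input has that shape.
      .source(TextLine("inputFile"), List((0, "hello world"), (1, "hello scalding")))
      // The sink must be the same source type and path that the job writes to.
      .sink[(String, Long)](Tsv("outputFile")) { buffer =>
        val counts = buffer.toMap
        assert(counts("hello") == 2L)
        assert(counts("world") == 1L)
        assert(counts("scalding") == 1L)
      }
      .run
      .finish
  }
}
```

No files are read or written here: the sources and sinks named in the test replace the real ones, and the input is an in-memory list.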


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/d44f174c-684b-4332-9e3d-9b45479695d0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Serega Sheypak

Aug 19, 2014, 11:30:03 AM
to cascadi...@googlegroups.com
Thanks. So the general idea is to rely on unit testing only, and we don't need to test the whole flow from a 'real' source to a 'real' sink?

On Tuesday, August 19, 2014, 17:15:07 UTC+4, Will Briggs wrote:

Jonathan Coveney

Aug 19, 2014, 10:12:21 PM
to cascadi...@googlegroups.com
That is correct, though Scalding also has LocalCluster for platform tests where you need to test framework boundaries.


Serega Sheypak

Sep 24, 2014, 8:56:26 AM
to cascadi...@googlegroups.com
Hi, I can't make the mock work. Please help.

The code:

val mnp = UnpackedAvroSource(args("mnpDict")).read
  .project(f1, f2, f3)

And the test:

JobTest(classOf[CompetitorNetworkJob].getName)
    .arg("mnpDict", "mnpDictMock")
    .source(Tsv("mnpDictMock"), mnp) 
    .sink(Csv("outputMock")) {
      buffer: mutable.Buffer[(Long)] =>
        println(buffer.toList)
    }.run

And the exception:
Cause: java.lang.IllegalArgumentException: requirement failed: Source com.twitter.scalding.avro.UnpackedAvroSourceList(mnpDictMock) does not appear in your test sources.  Make sure each source in your job has a corresponding source in the test sources that is EXACTLY equal.  Call the '.source' or '.sink' methods as appropriate on your JobTest to add test buffers for each source or sink.
  at scala.Predef$.require(Predef.scala:233)
  at com.twitter.scalding.TestTapFactory.createTap(TestTapFactory.scala:70)
  at com.twitter.scalding.FileSource$$anonfun$createTap$2.apply(FileSource.scala:157)
  at com.twitter.scalding.FileSource$$anonfun$createTap$2.apply(FileSource.scala:158)
  at scala.Option.map(Option.scala:145)
  at com.twitter.scalding.FileSource.createTap(FileSource.scala:156)
  at com.twitter.scalding.Source.read(Source.scala:99)


On Wednesday, August 20, 2014, 6:12:21 UTC+4, Jonathan Coveney wrote:

Jonathan Coveney

Sep 24, 2014, 9:26:52 AM
to cascadi...@googlegroups.com
What do you make of the error you got?

Serega Sheypak

Sep 24, 2014, 11:14:28 AM
to cascadi...@googlegroups.com
I got it :)
It's fixed now. I didn't understand how the mock actually works; now I do.
The job code was:
 UnpackedAvroSource(args("mnpDict")).read
And the mock was:
 .arg("mnpDict", "mnpDictMock")
 .source(Tsv("mnpDictMock"), mnp)

I used Tsv instead of UnpackedAvroSource in the mock, so the test source was not equal to the source the job reads.
That was the problem. Now it works.
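
Spelled out, the fix is to declare the mocked source with the same source type and path that the job reads. The sketch below is based only on the snippets posted earlier in this thread: CompetitorNetworkJob, mnp (the test input) and the Csv sink are not defined here, so it is a fragment rather than a complete test. It also assumes the job writes to a Csv sink equal to the one named in the test.

```scala
import com.twitter.scalding._
import com.twitter.scalding.avro.UnpackedAvroSource
import scala.collection.mutable

JobTest(classOf[CompetitorNetworkJob].getName)
  .arg("mnpDict", "mnpDictMock")
  // Same source type and path as the job's UnpackedAvroSource(args("mnpDict")),
  // so JobTest can match the job's source against this test source.
  .source(UnpackedAvroSource("mnpDictMock"), mnp)
  .sink(Csv("outputMock")) {
    buffer: mutable.Buffer[(Long)] =>
      println(buffer.toList)
  }.run
```

The only change from the failing test is the .source line, where Tsv is replaced by UnpackedAvroSource.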

On Tuesday, August 19, 2014, 16:16:15 UTC+4, Serega Sheypak wrote: