Performance of multiple map/filters vs doing everything in a flatMap


Ian Hummel

unread,
Nov 8, 2013, 5:34:14 PM11/8/13
to alge...@googlegroups.com
Hey gang,

I was wondering if there is any real difference in runtime on Hadoop between code such as the following

val startDate     = stringToRichDate(args.required("start-date"))
val endDate       = stringToRichDate(args.required("end-date"))

MySource
  .map('timestamp → 'timestamp) { ts: String ⇒ stringToRichDate(ts) }
  .filter('timestamp) { ts: RichDate ⇒ (ts >= startDate) && (ts < endDate) }
  .filter('userId) { userId: String ⇒ userId != null }


and this

val startDate     = stringToRichDate(args.required("start-date"))
val endDate       = stringToRichDate(args.required("end-date"))

MySource
  .flatMap(('timestamp, 'userId) → ('timestamp, 'userId)) { pair: (String, String) ⇒
    val (ts, userId) = pair
    stringToRichDate(ts) match {
      case richTs if (richTs >= startDate) && (richTs < endDate) && (userId != null) ⇒ Seq((richTs, userId))
      case _ ⇒ Seq()
    }
  }


Forgive me if that code isn't exactly right — what I really mean is: is there a tradeoff between chaining lots of maps/filters that each do one specific thing, versus shoehorning all the logic into a single "pipeline" operation? What is the overhead in Cascading of connecting multiple filters/maps?
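As a plain-Scala sketch of what both styles compute (collection operations standing in for the pipe operations; ISO dates compared lexicographically instead of via RichDate, and all names/data here are purely illustrative):

```scala
// Illustrative input: (timestamp, userId) rows.
val rows = Seq(("2013-11-01", "alice"), ("2013-12-01", "bob"), ("2013-11-15", null))
val start = "2013-11-01"
val end   = "2013-12-01"

// Style 1: separate map and filters, one concern per step.
val chained = rows
  .map { case (ts, user) => (ts, user) }                // stand-in for stringToRichDate
  .filter { case (ts, _)   => ts >= start && ts < end } // date-range filter
  .filter { case (_, user) => user != null }            // null-user filter

// Style 2: one flatMap that does everything at once.
val fused = rows.flatMap { case (ts, user) =>
  if (ts >= start && ts < end && user != null) Seq((ts, user)) else Seq()
}
// Both styles produce the same rows; the question is only about overhead.
```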

Hope that makes sense!

Cheers,

- Ian.



Hugo Gävert

unread,
Nov 30, 2013, 12:21:08 AM11/30/13
to alge...@googlegroups.com
Hi!

I've wondered about a similar question many times, as I've been fighting a lot with out-of-memory errors. For debugging I had to insert .forceToDisk in many places; sometimes the difference came down to something like what you're describing, but it was never really conclusive.

However, performance-wise, I think what you have written has very little impact compared to everything else happening on the cluster and in the rest of your pipeline.

-- 
HG

Koert Kuipers

unread,
Nov 30, 2013, 12:34:31 PM11/30/13
to Hugo Gävert, alge...@googlegroups.com
Both run entirely inside a single Hadoop map task, so I suspect it makes very little difference.

* a lot of maps and filters means a lot of conversions from Cascading tuples to Scalding tuples and back, which could perhaps cause a lot of garbage collection.
* a flatMap causes the creation of intermediate collections.

I would choose based entirely on what is most readable.

When you do operations inside a groupBy, picking the right one can have a huge impact on performance. For "map-side" operations I wouldn't worry about it.
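A plain-Scala sketch of why the choice inside a groupBy matters (names and data are illustrative; in Scalding, I believe something like `_.sum` can be partially combined on the map side before the shuffle, while `_.toList` forces every value for a key onto a single reducer):

```scala
// Illustrative input: (userId, count) events.
val events = Seq(("alice", 1), ("bob", 2), ("alice", 3))

// Like groupBy(...) { _.sum }: values combine incrementally, so a
// combiner can shrink the data before it is shuffled to reducers.
val summed = events
  .groupBy(_._1)
  .map { case (user, pairs) => user -> pairs.map(_._2).sum }

// Like groupBy(...) { _.toList }: all values for a key must be
// materialized together, which is where memory trouble tends to start.
val collected = events
  .groupBy(_._1)
  .map { case (user, pairs) => user -> pairs.map(_._2).toList }
```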


--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
