Performance of multiple map/filters vs doing everything in a flatMap


Ian Hummel

unread,
Nov 8, 2013, 5:34:14 PM11/8/13
to alge...@googlegroups.com
Hey gang,

I was wondering if there is any real difference in runtime on Hadoop between code such as the following

val startDate     = stringToRichDate(args.required("start-date"))
val endDate       = stringToRichDate(args.required("end-date"))

MySource
  .map('timestamp → 'timestamp) { ts: String ⇒ stringToRichDate(ts) }
  .filter('timestamp) { ts: RichDate ⇒ (ts >= startDate) && (ts < endDate) }
  .filter('userId) { userId: String ⇒ userId != null }


and this

val startDate     = stringToRichDate(args.required("start-date"))
val endDate       = stringToRichDate(args.required("end-date"))

MySource
  .flatMap(('timestamp, 'userId) → ('timestamp, 'userId)) { pair: (String, String) ⇒
    val (ts, userId) = pair
    stringToRichDate(ts) match {
      case richTs if (richTs >= startDate) && (richTs < endDate) && (userId != null) ⇒ Seq((richTs, userId))
      case _ ⇒ Seq()
    }
  }


Forgive me if that code isn't exactly right — what I really mean is: is there a tradeoff between chaining lots of maps/filters that each do one specific thing, versus shoehorning all the logic into a single "pipeline" operation? What is the overhead in Cascading of connecting multiple filters/maps?
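As a plain-Scala sketch of what both styles compute (collection operations standing in for the pipe operations; ISO dates compared lexicographically instead of via RichDate, and all names/data here are purely illustrative):

```scala
// Illustrative input: (timestamp, userId) rows.
val rows = Seq(("2013-11-01", "alice"), ("2013-12-01", "bob"), ("2013-11-15", null))
val start = "2013-11-01"
val end   = "2013-12-01"

// Style 1: separate map and filters, one concern per step.
val chained = rows
  .map { case (ts, user) => (ts, user) }                // stand-in for stringToRichDate
  .filter { case (ts, _)   => ts >= start && ts < end } // date-range filter
  .filter { case (_, user) => user != null }            // null-user filter

// Style 2: one flatMap that does everything at once.
val fused = rows.flatMap { case (ts, user) =>
  if (ts >= start && ts < end && user != null) Seq((ts, user)) else Seq()
}
// Both styles produce the same rows; the question is only about overhead.
```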

Hope that makes sense!

Cheers,

- Ian.



Hugo Gävert

unread,
Nov 30, 2013, 12:21:08 AM11/30/13
to alge...@googlegroups.com
Hi!

I've wondered about a similar question many times, as I've been fighting a lot with out-of-memory errors. For debugging I had to insert .forceToDisk in many places; sometimes the difference came down to something like what you're describing, but it was never really conclusive.

However, performance-wise, I think what you have written has very little impact compared to everything else happening on the cluster and in the rest of your pipeline.

-- 
HG

Koert Kuipers

unread,
Nov 30, 2013, 12:34:31 PM11/30/13
to Hugo Gävert, alge...@googlegroups.com
Both run entirely inside a single Hadoop map task, so I suspect it makes very little difference.

* a lot of maps and filters means a lot of conversions from Cascading tuples to Scalding tuples and back, which could perhaps cause a lot of garbage collection.
* a flatMap causes the creation of intermediate collections.

I would choose based entirely on what is most readable.

When you do operations inside a groupBy, picking the right one can have a huge impact on performance. For "map-side" operations I wouldn't worry about it.
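A plain-Scala sketch of why the choice inside a groupBy matters (names and data are illustrative; in Scalding, I believe something like `_.sum` can be partially combined on the map side before the shuffle, while `_.toList` forces every value for a key onto a single reducer):

```scala
// Illustrative input: (userId, count) events.
val events = Seq(("alice", 1), ("bob", 2), ("alice", 3))

// Like groupBy(...) { _.sum }: values combine incrementally, so a
// combiner can shrink the data before it is shuffled to reducers.
val summed = events
  .groupBy(_._1)
  .map { case (user, pairs) => user -> pairs.map(_._2).sum }

// Like groupBy(...) { _.toList }: all values for a key must be
// materialized together, which is where memory trouble tends to start.
val collected = events
  .groupBy(_._1)
  .map { case (user, pairs) => user -> pairs.map(_._2).toList }
```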


--
You received this message because you are subscribed to the Google Groups "algebird" group.
To unsubscribe from this group and stop receiving emails from it, send an email to algebird+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
