Hey gang,
I was wondering if there is any real difference in runtime on Hadoop between code such as the following:
val startDate = stringToRichDate(args.required("start-date"))
val endDate = stringToRichDate(args.required("end-date"))
MySource
  .map('timestamp -> 'timestamp) { ts: String => stringToRichDate(ts) }
  .filter('timestamp) { ts: RichDate => (ts >= startDate) && (ts < endDate) }
  .filter('userId) { userId: String => userId != null }
and this:
val startDate = stringToRichDate(args.required("start-date"))
val endDate = stringToRichDate(args.required("end-date"))
MySource
  .flatMap(('timestamp, 'userId) -> ('timestamp, 'userId)) { pair: (String, String) =>
    val (ts, userId) = pair
    stringToRichDate(ts) match {
      case t if (t >= startDate) && (t < endDate) && (userId != null) => Seq((t, userId))
      case _ => Seq()
    }
  }
Forgive me if that code isn't exactly right, but what I really mean is: is there a tradeoff between chaining lots of small, single-purpose maps/filters versus shoehorning all the logic into one "pipeline" operation? What is the overhead in Cascading of connecting multiple filters/maps?
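In case it helps, here's a plain-Scala sketch (ordinary collections, not Scalding/Cascading) of the two shapes I mean, just to pin down that they're semantically equivalent. Dates are simplified to epoch-millisecond Longs, and `parse` stands in for `stringToRichDate`; the data is made up.

```scala
// Plain-Scala stand-in for the pipe, to show the two shapes compute the
// same result. Not Cascading: just List operations over (ts, userId) rows.
object PipelineShapes {
  val startDate = 1000L
  val endDate   = 2000L

  // Stand-in for stringToRichDate: parse the timestamp string to a Long.
  def parse(ts: String): Long = ts.toLong

  val rows: List[(String, String)] =
    List(("500", "a"), ("1500", "b"), ("1500", null), ("2500", "c"))

  // Shape 1: separate map plus two filters, each doing one specific thing.
  val chained: List[(Long, String)] =
    rows
      .map { case (ts, userId) => (parse(ts), userId) }
      .filter { case (ts, _) => ts >= startDate && ts < endDate }
      .filter { case (_, userId) => userId != null }

  // Shape 2: all the logic fused into a single flatMap.
  val fused: List[(Long, String)] =
    rows.flatMap { case (ts, userId) =>
      val t = parse(ts)
      if (t >= startDate && t < endDate && userId != null) Seq((t, userId))
      else Seq()
    }

  def main(args: Array[String]): Unit = {
    assert(chained == fused) // both keep only the (1500, "b") row
    println(chained)
  }
}
```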
Hope that makes sense!
Cheers,
- Ian.