DB.flatMap { info => fields.map(fieldName => (fieldName, getKey(fieldName))) }
  .distinct
  .map { case (fieldName, distinctValue) => (fieldName, 1) }
  .group
  .sum ?
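To make the shape of that pipeline concrete, here is a minimal sketch of the same idea using plain Scala collections rather than Scalding. The `Info` case class, the two-argument `getKey`, and the `DistinctCounts` object are assumptions for illustration only; the thread does not show how `Info` or `getKey` are actually defined.

```scala
// Hypothetical stand-in for the poster's Info type: one record's field values.
final case class Info(values: Map[String, String])

object DistinctCounts {
  val fields = List("blue", "yellow", "red")

  // Hypothetical helper: unlike getKey in the thread, it takes the record explicitly.
  def getKey(info: Info, fieldName: String): String =
    info.values.getOrElse(fieldName, "")

  // One pass: emit (field, value) pairs, drop duplicate pairs,
  // then count the remaining pairs per field.
  def distinctCounts(db: Seq[Info]): Map[String, Int] =
    db.flatMap(info => fields.map(f => (f, getKey(info, f))))
      .distinct
      .groupBy(_._1)
      .map { case (f, pairs) => (f, pairs.size) }
}
```

Because the field name travels with each value, all fields are deduplicated and counted together, so the result is one (field, count) collection that can go to a single output instead of one job and one location per field.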
Consider the following snippet of Scalding code:
val fields = List[String]("blue", "yellow", "red")

def getDistinctCount(DB: TypedPipe[Info])(implicit flowDef: cascading.flow.FlowDef, mode: com.twitter.scalding.Mode) = {
  val jsonValue: TypedPipe[String] = DB.map { info => getKey("blue") } // Returns a set of values (String) for the input
  val distinctCount: ValuePipe[Int] = jsonValue.distinct.map { x => 1 }.sum // Returns the distinct count
  distinctCount.write(TypedTsv("HDFS Location")) // Writes the value to an HDFS location
}
I have to calculate the distinct count for every field in the list. One way is to iterate through the list, calculate the distinct count for each string, write each result to a different HDFS location, and merge them into a local file. In reality I have to do this for at least twenty fields, which makes the process really slow.
Is there a better way to calculate the distinct count for every field and write the results to a single HDFS location in one go? Thank you!