Re: calculate the Distinct count for every field in the List at once


Cyrille Chépélov

Dec 3, 2017, 4:29:07 AM
to scaldi...@googlegroups.com
Hi,

how about
    DB.flatMap { info =>
      fields.map(fieldName => (fieldName, getKey(fieldName)))
    }.distinct
      .map { case (fieldName, distinctValue) => (fieldName, 1) }
      .group.sum
?
    -- Cyrille
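
A minimal sketch of the same pipeline using plain Scala collections instead of a TypedPipe, so it runs without a Hadoop or Scalding setup. The `Info` type (a map from field name to value), the two-argument `getKey(info, fieldName)`, and the sample records are assumptions made for illustration; they are not taken from the thread. `groupBy` followed by a per-key sum stands in for `.group.sum`.

```scala
object DistinctCountSketch {
  // Assumption: a record is modeled as a map from field name to value.
  type Info = Map[String, String]

  val fields = List("blue", "yellow", "red")

  // Hypothetical stand-in for getKey: extracts one field's value from a record.
  def getKey(info: Info, fieldName: String): String =
    info.getOrElse(fieldName, "")

  def distinctCounts(db: List[Info]): Map[String, Int] =
    db.flatMap { info =>
        // Emit one (fieldName, value) pair per field for every record.
        fields.map(fieldName => (fieldName, getKey(info, fieldName)))
      }
      .distinct                                        // keep each (fieldName, value) pair once
      .map { case (fieldName, _) => (fieldName, 1) }   // one unit per distinct value
      .groupBy(_._1)                                   // group the units by field name
      .map { case (fieldName, ones) => (fieldName, ones.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val db = List(
      Map("blue" -> "a", "yellow" -> "x", "red" -> "p"),
      Map("blue" -> "b", "yellow" -> "x", "red" -> "p"),
      Map("blue" -> "a", "yellow" -> "y", "red" -> "p")
    )
    println(distinctCounts(db))
  }
}
```

All fields are handled in a single pass over the data, and the result is one collection keyed by field name, which is what allows a single output location.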

On 03/12/2017 at 07:44, charani...@gmail.com wrote:
     
Consider the following snippet of Scalding code:

      val fields = List[String]("blue", "yellow", "red")

      def getDistinctCount(DB: TypedPipe[Info])(implicit flowDef: cascading.flow.FlowDef, mode: com.twitter.scalding.Mode) = {

        val jsonValue: TypedPipe[String] = DB.map { info => getKey("blue") } // Returns a set of values (String) for the input

        val distinctCount: ValuePipe[Int] = jsonValue.distinct.map { x => 1 }.sum // Returns the distinct count

        distinctCount.write(TypedTsv("HDFS Location")) // Writes the value to an HDFS location

      }

I have to calculate the distinct count for every field in the list. One way is to iterate through the list, calculate the distinct count for each string, write each result to a different HDFS location, and merge them into a local file.
In reality I have to do this for at least twenty fields, which makes the process really slow.

Is there a better way to calculate the distinct count for every field and write the results to a single HDFS location in one go? Thank you!!


charani...@gmail.com

Dec 4, 2017, 1:43:31 PM
to Scalding Development

Hey Cyrille,

It is a perfect solution. I gave it a try and it worked perfectly fine. Thank you!!

Charan

charani...@gmail.com

Dec 4, 2017, 1:45:36 PM
to Scalding Development
Would you mind explaining why you opted for flatMap instead of map in "DB.flatMap"? Thank you!!
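
For reference, the difference can be illustrated with plain Scala collections. This is only a sketch: the `records` list and its string elements are made-up placeholders, not data from the thread. With `map`, each record produces exactly one output element, which here is a whole list of pairs; with `flatMap`, the per-record lists are concatenated into one flat sequence of (fieldName, value) pairs, which is the shape needed to deduplicate and count pairs individually.

```scala
object MapVsFlatMap {
  val fields = List("blue", "yellow", "red")

  // Hypothetical placeholder records.
  val records = List("r1", "r2")

  // map: one output element per record, so each element is itself a List of pairs.
  val nested: List[List[(String, String)]] =
    records.map(r => fields.map(f => (f, r)))

  // flatMap: the per-record lists are flattened into a single List of pairs.
  val flat: List[(String, String)] =
    records.flatMap(r => fields.map(f => (f, r)))

  def main(args: Array[String]): Unit = {
    println(s"map gives ${nested.size} elements, flatMap gives ${flat.size}")
  }
}
```

With `map`, a later `distinct` would compare whole per-record lists rather than individual pairs, so the per-field counts could not be derived from it directly.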