Counters per subassembly


ANIKET MORE

May 5, 2016, 10:47:14 AM
to cascading-user
Hi,
In my project, I am using a number of SubAssemblies to process data, and I need to find the number of tuples processed by each SubAssembly. I am trying to use Cascading counters (not Hadoop/MR counters specifically, since I will be using both HadoopFlowConnector and TezFlowConnector) to count the tuples flowing through each SubAssembly.

I have tried to count the tuples using the FlowStats API as shown below, but I only get flow-level statistics at the end of the flow. To be more specific, I don't want a count per flow; I need a count per SubAssembly.

    flow.complete();
    for (String counterGroups : flow.getFlowStats().getCounterGroups()) {
        for (String counter : flow.getFlowStats().getCountersFor(counterGroups)) {
            System.out.println("------------------------------Counter Group : " + counterGroups
                + "\tCounter : " + counter
                + "\tCounter Value : " + flow.getFlowStats().getCounterValue(counterGroups, counter));
        }
    }

We have an internal use case where we need to calculate the number of tuples per SubAssembly, and those counts will later be used for further processing. We don't want to use Driven for that.

Can someone please suggest a solution for this use case?

Thanks!!

PaulON

May 18, 2016, 2:16:13 PM
to cascading-user
Can you not just use a different counter in each SubAssembly?
I'm not following whether you are using different SubAssemblies or reusing the same one.

If it is the same one, could you name the counter dynamically whenever you instantiate it?
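
One way to picture the dynamic naming suggested here is a small helper that hands out a distinct counter group name for each instance of a reused SubAssembly. This is only a sketch: `CounterGroupNamer`, `nextGroupFor`, and the `Name#index` format are hypothetical and not part of Cascading, and it assumes all assemblies are built in one place before the flow is planned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Cascading class): produces a distinct counter
// group name for every instance of a SubAssembly that is created.
final class CounterGroupNamer {
    // How many instances of each SubAssembly name have been handed a group so far.
    private final Map<String, Integer> instancesSeen = new HashMap<>();

    // Returns "<name>#<index>", where index counts earlier instances of the same name.
    String nextGroupFor(String subAssemblyName) {
        int index = instancesSeen.merge(subAssemblyName, 1, Integer::sum) - 1;
        return subAssemblyName + "#" + index;
    }

    public static void main(String[] args) {
        CounterGroupNamer namer = new CounterGroupNamer();
        // Each returned name would be used as the counter group for one instance.
        System.out.println(namer.nextGroupFor("Dedupe"));
        System.out.println(namer.nextGroupFor("Dedupe"));
        System.out.println(namer.nextGroupFor("Join"));
    }
}
```

The returned string would then be used as the counter group when the SubAssembly instance is constructed, so that two instances of the same SubAssembly do not increment the same counter.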

Chris K Wensel

May 18, 2016, 3:28:56 PM
to cascadi...@googlegroups.com

Counters are grouped by physical process type (Flow, Step, Node, Slice).

A SubAssembly is logical. It will likely span Nodes if there is a grouping operation, and could span Steps (that is, run across Hadoop jobs).

That said, prefixing your counter group with the SubAssembly name (as implied below), and retrieving the counter for the SubAssembly in question from the Flow should work.

ckw
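
As a rough illustration of retrieving per-SubAssembly counts by group prefix, the sketch below filters a snapshot of counters down to the groups that carry a chosen prefix. It is plain Java rather than Cascading code: the nested map stands in for values read from the Flow with the `getCounterGroups()`, `getCountersFor(...)`, and `getCounterValue(...)` calls shown earlier in the thread, and the class name, the `"subassembly."` prefix, and the `"tuples"` counter name are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Cascading class): picks per-SubAssembly counts
// out of a snapshot of flow-level counters, using a group-name prefix.
final class SubAssemblyCounts {
    private SubAssemblyCounts() {}

    // counters: counter group -> (counter name -> value), copied from the flow's stats.
    // Returns SubAssembly name (the group name minus the prefix) -> value of counterName.
    static Map<String, Long> tuplesBySubAssembly(
            Map<String, Map<String, Long>> counters, String groupPrefix, String counterName) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Long>> group : counters.entrySet()) {
            String groupName = group.getKey();
            if (!groupName.startsWith(groupPrefix)) {
                continue; // not one of our groups, e.g. a platform counter group
            }
            Long value = group.getValue().get(counterName);
            if (value != null) {
                result.put(groupName.substring(groupPrefix.length()), value);
            }
        }
        return result;
    }
}
```

Because a SubAssembly can span Nodes or Steps, a value read this way is whatever the Flow reports for that group and counter as a whole, not a figure tied to one physical process.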

ANIKET MORE

May 19, 2016, 10:44:08 AM
to cascading-user

Thanks Chris for your response!

 

Do I need to add an Each pipe to every SubAssembly to collect the counters per SubAssembly? Is that what you mean by "prefixing your counter group with the SubAssembly name (as implied below), and retrieving the counter for the SubAssembly" in your reply?

 

I have tried adding a Counter to each SubAssembly, as below:

pipe = new Each(pipe, new Counter("SubAssemblyName", "SubAssemblyId", 1));

 

But adding an Each pipe to every SubAssembly makes the job take longer to run, and I will always have a large amount of data to process. I ran the job on sample data (~1 GB) with 8-10 SubAssemblies used for processing, and the difference in execution time was noticeable.
