DataSet size

m.neuma...@gmail.com

May 28, 2014, 4:03:27 AM
to stratosp...@googlegroups.com
Hej,

Is there a simple way to find out the size of a DataSet? I need it for several graph algorithms I'm implementing at the moment.

I tried to write a reducer that sums things up, but then I end up with a new DataSet and I have no idea how to get the actual number out of it.

cheers Martin

Fabian Hueske

May 28, 2014, 4:44:56 AM
to stratosp...@googlegroups.com
Hi,

So I assume you have a DataSet with a single number (the element count).
You can use a broadcast variable to feed this (small!) DataSet into any other operator, read it there in the open() method, and configure your operator accordingly. See http://stratosphere.eu/docs/0.5/programming_guides/java.html#broadcast_variables

Is this what you are looking for?

Best, Fabian



--
You received this message because you are subscribed to the Google Groups "stratosphere-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stratosphere-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/stratosphere-dev.
For more options, visit https://groups.google.com/d/optout.

Martin Neumann

May 28, 2014, 5:39:34 AM
to stratosp...@googlegroups.com
Hej,

Thanks, I will try it like that.

But I think this is incredibly clunky and complicated, so I was hoping for something more elegant.
Now I have to write a job to count the items in the DataSet and then create a broadcast variable out of it, just to find out the number of elements.

cheers Martin



Fabian Hueske

May 28, 2014, 5:48:36 AM
to stratosp...@googlegroups.com
Hej,

There will be no separate counting job. Instead, the output of the counting Reduce operator will be forwarded directly to the other operator(s). The whole data flow will be executed at once.
However, you're right. There should be a better way to do both the counting and the passing of the variable.

Cheers, Fabian

Martin Neumann

May 28, 2014, 9:17:51 AM
to stratosp...@googlegroups.com
Hej,

I was not referring to the actual execution (I'm counting on the optimizer for that) but to the amount of code that I have to write.
I think DataSet should have a .size() function, even if that leads to a distributed operation when executing.
That's also how it's solved in Apache Crunch, and it feels natural when coming from a Java background.

cheers Martin

Robert Metzger

May 28, 2014, 9:32:20 AM
to stratosp...@googlegroups.com
Hey,

Does this pull request solve the issue? https://github.com/stratosphere/stratosphere/pull/758
It introduces a DataSet.count() method.

Robert

Martin Neumann

May 28, 2014, 10:48:37 AM
to stratosp...@googlegroups.com
Hej,

That would be exactly what I need.

My implementation of counting does not work, and I'm stumped at the moment as to how this is done properly. Can someone point me to an example?

cheers Martin

fhu...@gmail.com

May 29, 2014, 5:10:08 AM
to stratosp...@googlegroups.com
Hi Martin,

I would count the elements in a DataSet like this:

DataSet<YourType> yourData = …
DataSet<Integer> count = yourData.map(new ToInt()).reduce(new Counter());

public class ToInt extends MapFunction<YourType, Integer> {
  public Integer map(YourType in) { return 1; }
}

public class Counter extends ReduceFunction<Integer> {
  public Integer reduce(Integer first, Integer second) { return first + second; }
}
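
The same map-to-one, then sum pattern can be tried out locally without a cluster. The following is only a plain-Java sketch using java.util.stream, not Stratosphere code; the class name CountSketch and the method count are made up for illustration, and the lambdas merely play the roles of ToInt and Counter above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; no Stratosphere classes are used here.
// CountSketch and count are hypothetical names chosen for this sketch.
class CountSketch {

    // Map every element to 1L, then reduce the ones by addition.
    static <T> long count(List<T> items) {
        return items.stream()
                .map(item -> 1L)         // plays the role of ToInt
                .reduce(0L, Long::sum);  // plays the role of Counter
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(count(words)); // prints 3
    }
}
```

The sketch uses Long rather than Integer, which leaves more headroom for large inputs.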

--
Fabian Hueske

Stephan Ewen

May 30, 2014, 12:02:24 PM
to stratosp...@googlegroups.com
Hi Martin!

Not that the 0.5 code is finalized, we will begin merging new features for the next release. There is a pending pull request for the count() utility, which we will merge into the 0.6 master soon.

Greetings,
Stephan

Stephan Ewen

May 30, 2014, 12:20:42 PM
to stratosp...@googlegroups.com
It should be "NOW that the 0.5 code is finalized" ;-)

Martin Neumann

May 30, 2014, 4:58:15 PM
to stratosp...@googlegroups.com
Hej,

I tried to write a more generic version of the count code, since I might need to reuse it. Unfortunately, I have some problems with the generics.

Here is the code:
public static class CountingMap extends MapFunction<Object, Long> {
  private static final long serialVersionUID = 1L;

  @Override
  public Long map(Object value) throws Exception {
    return 1l;
  }
}

public static class CountingRed extends ReduceFunction<Long> {
  private static final long serialVersionUID = 1L;

  @Override
  public Long reduce(Long value1, Long value2) throws Exception {
    return value1+value2;
  }
}

public static DataSet<Long> countItems(DataSet set) {
  return set.map(new CountingMap()).reduce(new CountingRed());
}

If I run this, handing it a DataSet<String>, I get an exception:
Exception in thread "main" eu.stratosphere.api.java.functions.InvalidTypesException: Input mismatch: Basic type expected.
With DataSet<Tuple2<String,String>> I get:
Exception in thread "main" eu.stratosphere.api.java.functions.InvalidTypesException: Input mismatch: Tuple type expected
I also tried to use Tuple instead of Object in the function, but then I get:
Exception in thread "main" eu.stratosphere.api.java.functions.InvalidTypesException: Input mismatch: Concrete subclass of Tuple expected.

Is there a way to write this code more generically? I'm basically looking for a superclass that can contain anything, since I will never read the actual values anyway.

cheers Martin




Stephan Ewen

May 31, 2014, 8:22:51 AM
to stratosp...@googlegroups.com
Hi!

The Java API does pretty strong type checking, stronger than the Java compiler, which throws away the generic type information. You can do it the following way (strongly typed generics):


public static class CountingMap<T> extends MapFunction<T, Long> {
  @Override
  public Long map(T value) {
    return 1L;
  }
}

public static class CountingRed extends ReduceFunction<Long> {
  @Override
  public Long reduce(Long value1, Long value2) throws Exception {
    return value1 + value2;
  }
}

public static <T> DataSet<Long> countItems(DataSet<T> set) {
  // Pass the type parameter along instead of using the raw type.
  return set.map(new CountingMap<T>()).reduce(new CountingRed());
}
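
To see why declaring Object as the input type can be rejected while a type parameter is accepted, it may help to look at what reflection reports for each variant. The sketch below does not reproduce Stratosphere's actual type extraction, which is not shown in this thread; it only assumes that a type checker can read the declared type arguments of a function's superclass. Fn, ObjectCounter, GenericCounter, and declaredInputType are hypothetical names.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Plain-Java illustration only; Fn is a hypothetical stand-in,
// not the Stratosphere MapFunction.
class ErasureSketch {

    static abstract class Fn<IN, OUT> {
        abstract OUT apply(IN value);
    }

    // Fixes the input type to Object, like the first CountingMap attempt.
    static class ObjectCounter extends Fn<Object, Long> {
        @Override
        Long apply(Object value) { return 1L; }
    }

    // Leaves the input type open as a type variable.
    static class GenericCounter<T> extends Fn<T, Long> {
        @Override
        Long apply(T value) { return 1L; }
    }

    // Reads the first type argument declared on the function's superclass.
    static Type declaredInputType(Fn<?, ?> fn) {
        ParameterizedType superType =
                (ParameterizedType) fn.getClass().getGenericSuperclass();
        return superType.getActualTypeArguments()[0];
    }

    public static void main(String[] args) {
        // The declared input type is the concrete class Object.
        System.out.println(declaredInputType(new ObjectCounter()));
        // The declared input type is the unresolved type variable T.
        System.out.println(declaredInputType(new GenericCounter<String>()));
    }
}
```

In the first case a checker sees a concrete Object that does not match String or Tuple2; in the second it sees an open type variable that it is free to bind to the type of the incoming DataSet.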



