Re: Avro example


Robert Metzger

Jun 4, 2014, 4:50:00 PM
to stratosp...@googlegroups.com
Hi Slava,

thanks for allowing me to post my answer to the public mailing list.

We are planning to put out the next release (0.6) as soon as we have moved our code base to Apache. I hope that we will have this set up within the next week. (Bringing out a first Apache release will probably take us another month or so.)
I'll drop you a line here once we've implemented the AvroOutputFormat.

2) Yes, the UDFs only provide iterators. I think there is no fundamental limitation behind it. I guess the authors could not imagine a use case where this is really required. In addition to that, it saves you from writing an additional line of code, where you create a local iterator. It is probably not very difficult to resolve the limitation. I'm just not sure if it is really required. (In general, I'm not 100% sure about this answer)

3) Yes. In order to maintain a list of all incoming objects inside a UDF, you have to copy them because we are re-using the objects internally.

4) Well, that's a good question. I personally ignore the warnings ;)
We cannot really do anything about it, since we want to have Serializable classes and you cannot inherit the serialVersionUID. So it's basically a limitation of the programming language / runtime we are using. If you are not going to "maintain" the serialVersionUID (change it with each change), it's actually pointless to have it at all.
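
For reference, a minimal sketch of the two options Slava mentions in his question (declaring the field or suppressing the warning). `MyFunction`, `Doubler` and `Incrementer` are hypothetical stand-ins, not Stratosphere classes; the only assumption is that the UDF base type is `java.io.Serializable`. `@SuppressWarnings("serial")` is the standard javac lint key, and IntelliJ IDEA should honor it as well, though that is worth checking against your inspection settings.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialWarningDemo {
    // Hypothetical stand-in for a Serializable UDF base type.
    interface MyFunction<I, O> extends Serializable {
        O apply(I in);
    }

    // Option A: declare the field explicitly in every UDF class.
    static final class Doubler implements MyFunction<Integer, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer apply(Integer in) {
            return in * 2;
        }
    }

    // Option B: leave the field out and silence the warning per class.
    @SuppressWarnings("serial")
    static final class Incrementer implements MyFunction<Integer, Integer> {
        @Override
        public Integer apply(Integer in) {
            return in + 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Doubler().apply(21));
        System.out.println(new Incrementer().apply(41));
        // The declared value is what Java serialization reports for Doubler.
        System.out.println(ObjectStreamClass.lookup(Doubler.class).getSerialVersionUID());
    }
}
```

Either way the field is not inherited, so each concrete UDF class needs its own declaration or its own annotation.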


Best,
Robert


On Wed, Jun 4, 2014 at 6:53 PM, Viacheslav Zholudev <vyachesla...@gmail.com> wrote:
Hi,

I successfully migrated my test job to Stratosphere and ran some tests. I like the new API a lot.

I have a couple of things in mind:
1) There is no AvroOutputFormat, right?
2) I noticed that UDFs are supplied with iterators, not iterables. Is there any reason for it? I thought you wanted to switch to iterables at some point.
3) Am I correct that I need to clone objects in UDFs that provide an iterator if I want to keep all received objects in a list, for instance?
4) When implementing UDFs, IntelliJ IDEA complains that a serializable class does not define "serialVersionUID", and my code is full of warnings. Of course, I can disable this warning, or indeed put a serialVersionUID everywhere. In the latter case the code becomes more cluttered. What would you suggest?


Thanks,
Slava


Ufuk Celebi

Jun 4, 2014, 5:16:44 PM
to stratosp...@googlegroups.com

On 04 Jun 2014, at 22:49, Robert Metzger <rmet...@apache.org> wrote:
> 2) Yes, the UDFs only provide iterators. I think there is no fundamental limitation behind it. I guess the authors could not imagine a use case where this is really required. In addition to that, it saves you from writing an additional line of code, where you create a local iterator. It is probably not very difficult to resolve the limitation. I'm just not sure if it is really required. (In general, I'm not 100% sure about this answer)

There is no technical reason.

We had this discussion some time ago. I am strongly in favour of returning an Iterable (or IterableIterator) instead of an Iterator. There are also (closed) issues related to this for Spargel:

https://github.com/stratosphere/stratosphere/issues/425
https://github.com/stratosphere/stratosphere/pull/433

A reason against Iterable is that users could request multiple iterators via `iterator()`, which wouldn't work with our runtime. But we could make sure that it's only allowed to call this method a single time and throw an Exception if it's called more often.
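
A rough sketch of that single-call guard, in case it helps the discussion. `SingleUseIterable` is a hypothetical wrapper written for illustration only; it is not part of the Stratosphere API, and the choice of `IllegalStateException` is an assumption.

```java
import java.util.Arrays;
import java.util.Iterator;

class SingleUseIterableDemo {
    // Consumes the iterable with a plain for-each loop, as a UDF body would.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
                new SingleUseIterable<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(sum(values)); // first traversal works
        try {
            values.iterator(); // second request is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

// Hypothetical wrapper: exposes a one-shot Iterator as an Iterable
// and refuses to hand out the iterator more than once.
final class SingleUseIterable<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    SingleUseIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            throw new IllegalStateException("iterator() may only be called once");
        }
        consumed = true;
        return source;
    }
}
```

The for-each loop calls `iterator()` exactly once, so ordinary UDF code would not notice the restriction; only code that tries to traverse the group twice would hit the exception.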

The new API would have been the perfect time to introduce Iterable. :-( We hesitated before, because we didn't want people to have to change their programs. Still, I think we should go for it as it makes the UDFs less verbose, which is a good thing and people can still use the iterator if they need to via `iterator()`. Then again it might be just a matter of taste. ;-)

> 3) Yes. In order to maintain a list of all incoming objects inside a UDF, you have to copy them because we are re-using the objects internally.

There is also a plan to provide the option to turn re-using objects off (which will impact performance). I can't find the issue, but someone else ran into a similar problem a short while ago. Is there a separate issue to allow this?
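
To illustrate why the copy is needed while objects are re-used, here is a small self-contained sketch. It does not use the Stratosphere runtime: `reusingIterator` merely imitates an iterator that returns the same mutable instance on every `next()` call, and `Point` with its `copy()` method is a hypothetical user type.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ObjectReuseDemo {
    // Hypothetical mutable record type standing in for a user's data type.
    static final class Point {
        double x;
        double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }

        Point copy() {
            return new Point(x, y);
        }
    }

    // Imitates a runtime that overwrites and returns one shared instance.
    static Iterator<Point> reusingIterator(final double[][] coords) {
        return new Iterator<Point>() {
            private final Point reused = new Point(0, 0);
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < coords.length;
            }

            @Override
            public Point next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                reused.x = coords[i][0];
                reused.y = coords[i][1];
                i++;
                return reused;
            }
        };
    }

    // Wrong: every list entry is the same object, holding the last values.
    static List<Point> collectWithoutCopy(Iterator<Point> points) {
        List<Point> result = new ArrayList<>();
        while (points.hasNext()) {
            result.add(points.next());
        }
        return result;
    }

    // Right: copy each element before storing it.
    static List<Point> collectWithCopy(Iterator<Point> points) {
        List<Point> result = new ArrayList<>();
        while (points.hasNext()) {
            result.add(points.next().copy());
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] coords = {{1, 1}, {2, 2}, {3, 3}};
        System.out.println(collectWithoutCopy(reusingIterator(coords)).get(0).x);
        System.out.println(collectWithCopy(reusingIterator(coords)).get(0).x);
    }
}
```

Without the copy, the first list entry reports the coordinates of the last element, because all entries alias the one re-used instance.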

> 4) Well, thats a good question. I personally ignore the warnings ;)
> We can not really do anything about it, since we want to have Serializable classes and you can not inherit the serial version id. So its basically a limitation of the programming language / runtime we are using. If you are not going to "maintain" the serialVersionUID (change it with each change) its actually pointless to have it at all.

I see the technical reasons for it (http://stackoverflow.com/questions/285793/what-is-a-serialversionuid-and-why-should-i-use-it), but also find it "inconvenient" (for lack of a better word) to either suppress the warning or provide a serialVersionUID. As a user, I don't want to think about serialization.

Stephan Ewen

Jun 4, 2014, 5:23:48 PM
to stratosp...@googlegroups.com
I agree with Ufuk.

For 2: I think we inherited that in a way. Let us think about whether we can add a second variant with Iterable instead of Iterator, so that both are supported.

For 4: Right now, that is unfortunately required. It is what Java serialization imposes. I would like to switch to a different serialization mechanism and then remove the java.io.Serializable interface where possible.

Stephan

Vyacheslav Zholudev

Jun 5, 2014, 5:14:46 AM
to stratosp...@googlegroups.com, rmet...@apache.org
Hi Robert, 

thanks for the reply.


> 2) Yes, the UDFs only provide iterators. I think there is no fundamental limitation behind it. I guess the authors could not imagine a use case where this is really required. In addition to that, it saves you from writing an additional line of code, where you create a local iterator. It is probably not very difficult to resolve the limitation. I'm just not sure if it is really required. (In general, I'm not 100% sure about this answer)


What I mean is just a convenience for the programmer. IMO, it's nicer to have:
udf(Iterable<Point> points, ...) {
  for (Point p : points) {
    // do sth with p
  }
}

rather than
udf(Iterator<Point> points, ...) {
  while (points.hasNext()) {
    Point p = points.next();
    // do sth with p
  }
}

It might just be personal taste, but I like the fact that you can feed an iterable directly into a 'for' loop. If one needs an iterator, it's not hard to obtain it from an iterable.


> 4) Well, thats a good question. I personally ignore the warnings ;)
> We can not really do anything about it, since we want to have Serializable classes and you can not inherit the serial version id. So its basically a limitation of the programming language / runtime we are using. If you are not going to "maintain" the serialVersionUID (change it with each change) its actually pointless to have it at all.


Just out of curiosity, why do you want UDFs to be serializable? For that to make sense, UDFs would need to have state to share across machines, which I'm not sure is a good idea. If you really need to serialize UDFs, could you use Kryo for that?
Regarding maintaining a serialVersionUID, I don't think I need to do it in any case, since my code does not serialize those classes, and on your side you don't persist the serialized versions of objects across different program executions (when the code of UDFs may have changed from one execution to another).
That said, if you really need to serialize UDFs, Kryo could help keep the code concise and free of warnings. What do you think?

Thanks!!!

Vyacheslav Zholudev

Jun 5, 2014, 5:17:50 AM
to stratosp...@googlegroups.com, rmet...@apache.org
Sorry for duplicating some of the thoughts you already expressed (I only saw Robert's post while writing my reply).

Fabian Hueske

Jun 5, 2014, 5:23:11 AM
to stratosp...@googlegroups.com
Hi Slava,

we serialize the UDF objects to transfer configuration state. So you can configure your UDF at program construction (e.g., set a filter literal in the UDF's constructor) and don't need to pass a Configuration object which is read in the open method.

So we only transfer UDF objects from the plan to the executing task manager. We do not share the state of UDFs across parallel instances.
This could of course be done with Kryo, but right now we are using Java Serialization for that.
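
A minimal sketch of that idea using plain Java serialization, outside of any Stratosphere classes. `ThresholdFilter` is a hypothetical UDF whose filter literal is set in the constructor, and `serialize`/`deserialize` are illustrative helpers that stand in for shipping the object from the plan to a task manager; they are not the runtime's actual code path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class UdfShippingDemo {
    // Hypothetical filter UDF: the literal is configured at construction time.
    static final class ThresholdFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int threshold;

        ThresholdFilter(int threshold) {
            this.threshold = threshold;
        }

        boolean accept(int value) {
            return value > threshold;
        }
    }

    // Stand-in for the client side: turn the configured UDF into bytes.
    static byte[] serialize(Serializable udf) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(udf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Stand-in for the worker side: rebuild the UDF, including its field values.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] shipped = serialize(new ThresholdFilter(10));
        ThresholdFilter onWorker = (ThresholdFilter) deserialize(shipped);
        System.out.println(onWorker.accept(11));
        System.out.println(onWorker.accept(10));
    }
}
```

The deserialized copy carries the threshold set in the constructor, so no separate Configuration object is needed to pass it along.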

Best, Fabian



Robert Metzger

Jun 13, 2014, 3:18:30 PM
to stratosp...@googlegroups.com
Just as an update: We now have an AvroOutputFormat in the 0.6-SNAPSHOT.
But you have to build Flink/Stratosphere yourself. The automatic distribution of SNAPSHOT versions does not work due to our move to Apache.
