Would interning of schema and field names improve performance?

Alexei Perelighin

Sep 9, 2013, 12:10:39 PM

to pangoo...@googlegroups.com

Hi,

Pangool is great: it adds structure (schemas and tuples) and simplifies group-by and join implementations. It also makes code more readable, since we can use explicit schema and field names. But it also forces a lot of String.equals(String s) calls in the reducer to distinguish between the different intermediate schemas. I compared object references in the debugger and found that the schema name hard-coded in my code is not the same object as the one returned by Schema.getName(). So String.equals cannot return early on the reference check and has to loop over the characters on every call.
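A minimal sketch of the reference mismatch described above, in plain Java with no Pangool classes. The class InternDemo, the helper buildName and the name "user_schema" are made up for illustration. It shows that a name built at run time is a different object from an equal literal, and that intern() maps it back to the pooled literal.

```java
final class InternDemo {
    /** Builds a name at run time, so the result is a fresh String, not the pooled literal. */
    static String buildName(String prefix, String suffix) {
        return new StringBuilder(prefix).append(suffix).toString();
    }

    public static void main(String[] args) {
        String literal = "user_schema";                 // string literal, interned by the JVM
        String dynamic = buildName("user_", "schema");  // equal content, different object

        System.out.println(literal == dynamic);          // false: two distinct objects
        System.out.println(literal.equals(dynamic));     // true: same characters
        System.out.println(literal == dynamic.intern()); // true: intern() returns the pooled object
    }
}
```

If every dynamically created name went through intern(), the reference check at the start of String.equals would succeed and the character loop would be skipped. Whether that matters in practice is a separate question for a profiler.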

Have you considered interning the schema and field names? And if you did, did it make any noticeable difference to the performance of the data workflows?

Thanks,

Alexei

Pere Ferrera

Sep 10, 2013, 4:51:56 AM

to pangoo...@googlegroups.com

Hello Alexei,

Interesting idea. Have you noticed that your code is spending too much time in that equals() method? Did you run it through a profiler?

I don't think it would make a difference myself. The schema name comparison in a reduce-side join is executed once per input Tuple. Let's suppose the job is CPU-bound (if it were IO-bound there would be no gain at all). Let's also suppose the user code itself is not very CPU-intensive (otherwise it won't make a difference either). Many more operations happen for each Tuple, so the bottleneck would typically be serializing and deserializing it.

Also, I suppose interning dynamic strings is not really standard practice (I haven't seen it anywhere). Any new committer to Pangool would need to take a lot of care with it: if you forget to intern() even one dynamic reference, it doesn't work.

It could be done with care if it offered a lot of advantages, but as I said, I doubt it.

Alexei Perelighin

Sep 10, 2013, 5:29:40 AM

to pangoo...@googlegroups.com

Hi Pere,

I have not run it through a profiler yet. Do you have a standard load test?

Thanks,

Alexei

Pere Ferrera

Sep 10, 2013, 5:35:32 AM

to pangoo...@googlegroups.com

Just a profiler over a local execution of your Job would do.

In any case, String.equals() is called in a lot of places in Pangool's code. If you think about it, any HashMap get() will perform some of them. So even if a profiler says that a lot of time is spent in equals(), we can't conclude that the culprit is the single schema name comparison.
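To illustrate the point about HashMap get(), here is a small sketch that counts equals() calls during a lookup. The Name wrapper and the counter are hypothetical helpers, used only because String itself cannot be instrumented. The behaviour shown is that of OpenJDK's HashMap, which compares references before calling equals().

```java
import java.util.HashMap;
import java.util.Map;

final class EqualsCountDemo {
    /** Number of equals() calls observed since the last reset. */
    static int equalsCalls = 0;

    /** Hypothetical key type that wraps a name and counts equals() calls. */
    static final class Name {
        final String value;

        Name(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Name && ((Name) other).value.equals(value);
        }
    }

    /** Returns how many equals() calls a single map.get(key) triggered. */
    static int equalsCallsForGet(Map<Name, Integer> map, Name key) {
        equalsCalls = 0;
        map.get(key);
        return equalsCalls;
    }

    public static void main(String[] args) {
        Map<Name, Integer> fieldIndex = new HashMap<>();
        Name stored = new Name("column_name");
        fieldIndex.put(stored, 0);

        // Same object as the stored key: the reference check answers alone.
        System.out.println(equalsCallsForGet(fieldIndex, stored));
        // Equal content, different object: equals() has to run.
        System.out.println(equalsCallsForGet(fieldIndex, new Name("column_name")));
    }
}
```

The same applies to String keys: a lookup with the very same String object never reaches String.equals(), while a lookup with an equal but distinct String does.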

And if it were, a cleaner solution would be to identify schemas with an integer, so that the user can still use enums, etc.
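A rough sketch of what integer schema identifiers backed by an enum might look like from the user's side. This is not an existing Pangool API: SourceSchema, fromId and joinSide are hypothetical names, and the integer id is simply the enum ordinal.

```java
final class SchemaIdDemo {
    /** Hypothetical identifiers for the intermediate schemas of a join. */
    enum SourceSchema { USERS, ORDERS }

    /** Maps an integer id (assumed to be the enum ordinal) back to the enum constant. */
    static SourceSchema fromId(int id) {
        return SourceSchema.values()[id];
    }

    /** Dispatches on the schema without any string comparison. */
    static String joinSide(SourceSchema schema) {
        switch (schema) {
            case USERS:
                return "left";
            case ORDERS:
                return "right";
            default:
                throw new IllegalArgumentException("Unknown schema: " + schema);
        }
    }

    public static void main(String[] args) {
        int id = SourceSchema.ORDERS.ordinal(); // the integer that would travel with the tuple
        System.out.println(joinSide(fromId(id))); // prints "right"
    }
}
```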

Alexei Perelighin

Sep 10, 2013, 5:44:54 AM

to pangool-user

Integers for schemas would be nice, as then the ordering of the intermediate schemas in a group-by would not depend on their names. Schema objects could then become universal for the project, since different workflows might want to mix input sources in a different order.

Besides schema names, I was concerned about Tuple.get and Tuple.set, as this could be even more widespread: tuple.get("column_name") calls are quite common, and hash tables use String.equals(...) when you look up the index. The JVM interns all string literals like "column_name" by default.
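One way to sidestep the per-call name lookup, independent of interning, is to resolve the field position once and then read by position. The sketch below uses hypothetical stand-ins (MiniSchema, MiniTuple), not Pangool's own Schema and Tuple classes, and assumes the field holds Integer values.

```java
import java.util.HashMap;
import java.util.Map;

final class FieldIndexDemo {
    /** Hypothetical stand-in for a schema: maps field names to positions. */
    static final class MiniSchema {
        private final Map<String, Integer> positions = new HashMap<>();

        MiniSchema(String... names) {
            for (int i = 0; i < names.length; i++) {
                positions.put(names[i], i);
            }
        }

        int indexOf(String name) {
            Integer pos = positions.get(name);
            if (pos == null) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            return pos;
        }
    }

    /** Hypothetical stand-in for a tuple with access by name and by position. */
    static final class MiniTuple {
        private final MiniSchema schema;
        private final Object[] values;

        MiniTuple(MiniSchema schema, Object... values) {
            this.schema = schema;
            this.values = values;
        }

        Object get(String name) { return values[schema.indexOf(name)]; } // hash lookup per call
        Object get(int pos) { return values[pos]; }                      // plain array access
    }

    /** Sums an Integer field, resolving its position once instead of once per tuple. */
    static long sumByIndex(MiniSchema schema, MiniTuple[] tuples, String field) {
        int pos = schema.indexOf(field); // the only name lookup
        long sum = 0;
        for (MiniTuple t : tuples) {
            sum += (Integer) t.get(pos);
        }
        return sum;
    }

    public static void main(String[] args) {
        MiniSchema schema = new MiniSchema("name", "count");
        MiniTuple[] tuples = {
            new MiniTuple(schema, "a", 2),
            new MiniTuple(schema, "b", 3),
        };
        System.out.println(sumByIndex(schema, tuples, "count")); // prints 5
    }
}
```

This keeps names in the user-facing code while moving the hash lookup out of the inner loop.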

When I have time I will play with it:

1) Run the profiler many times with standard Pangool

2) Run the profiler many times with modified constructors of Schema and Field, and see if the total time spent in String.equals(..) drops :D.

Thanks,
Alexei


Pere Ferrera Bertran

Sep 10, 2013, 5:49:42 AM

to pangoo...@googlegroups.com

Actually this is an interesting discussion, and it reminds me of something.

When we were about to release the first version of Pangool, I ran a test with an API that didn't use names for fields, etc., to save CPU time, just like we are discussing now. I don't remember it exactly, but the savings were not very significant. In the end we decided that, even if there was some saving, it was not worth it, as the API should also be "nice to use".

It is hard to find a good tradeoff between API usability and high efficiency.

In any case if you run these tests please let us know the conclusions.

--
Pere Ferrera

CTO & Co-founder

www.datasalt.com

Alexei Perelighin

Sep 10, 2013, 9:28:30 AM

to pangoo...@googlegroups.com

Hi Pere,

I ran a couple of processes through the profiler. You are right: String.equals does not even appear in the report. The only java.lang.String methods that do are split, replace, and valueOf.

It was an interesting experiment, thanks for the insights.

Thanks,

Alexei

Pere Ferrera

Sep 10, 2013, 10:29:53 AM

to pangoo...@googlegroups.com

Thanks for the info Alexei!
