Pipeline becomes very slow after I join a small data set using joinWithSmaller/Tiny


Jing Lu

Jul 28, 2019, 2:50:30 AM7/28/19
to cascading-user
Hi,

My Cascading pipeline takes about 2 hours to finish. However, after I join another pipe (about 50 MB), it becomes extremely slow (more than 20 hours). How can I debug this? Could it be because of the file format of my data set?


Thanks,

Ken Krugler

Jul 28, 2019, 1:04:06 PM7/28/19
to cascadi...@googlegroups.com
What kind of join are you doing? Some code would be helpful…

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra


Jing Lu

Jul 28, 2019, 11:05:00 PM7/28/19
to cascading-user

The code looks like this:

    val outlier = pipe1
      .insert('sid, 2)
      .rename(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o))
      .project(('id_o, 'sid_o, 'mobile_o, 'revenue))

    pip.joinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o),
          outlier)
      .map(('revenue1, 'revenue) -> 'revenue1) { r: (Double, Double) =>
        val (rev1, rev2) = r
        rev1 - rev2
      }


The outlier pipe here is a much smaller data set. After the join, many rows in the revenue column can be empty. Is this the right way to do it?


Thanks!




Ken Krugler

Jul 29, 2019, 2:38:00 AM7/29/19
to cascadi...@googlegroups.com
I’m not a Scalding user, but I think you get an inner join by default. So I wouldn’t expect to see lots of empty revenue values, as you should only get results where the id exists on both sides of the join.

But I’d try joinWithSmaller first, as joinWithTiny replicates the right side to every mapper, so if you have a lot of mappers it might run slower.
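Roughly (untested, since I'm not a Scalding user, and reusing the pipe and field names from your snippet), the variants would look something like this:

    import cascading.pipe.joiner.LeftJoin

    // Map-side replicated join: the whole outlier pipe is copied to every
    // mapper, which can get slow when there are many mappers.
    val replicated = pip
      .joinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)

    // Reduce-side join, with the smaller pipe on the right-hand side.
    val reduceSide = pip
      .joinWithSmaller(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)

    // If unmatched rows should be kept (empty revenue), ask for a left join
    // explicitly instead of the default inner join.
    val leftJoined = pip
      .joinWithSmaller(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o),
        outlier, joiner = new LeftJoin)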

— Ken


Jing Lu

Jul 29, 2019, 1:16:11 PM7/29/19
to cascading-user
I tried joinWithSmaller, and it also didn't terminate, so I was looking for something more efficient. You are right that the default is an inner join; since I want to keep rows even when there's no match (empty revenue), I should use leftJoinWithTiny here.
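Something like this (untested; assuming leftJoinWithTiny from Scalding's JoinAlgorithms, with the same field names as before):

    val joined = pip
      .leftJoinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)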

Thanks