Pipeline becomes very slow after I join a small data set using joinWithSmaller/Tiny


Jing Lu

Jul 28, 2019, 2:50:30 AM7/28/19
to cascading-user
Hi,

My Cascading pipeline takes about 2 hours to finish. However, after I join another pipe (about 50 MB), it becomes extremely slow (more than 20 hours). How can I debug this? Could it be because of the file format of my data set?


Thanks,

Ken Krugler

Jul 28, 2019, 1:04:06 PM7/28/19
to cascadi...@googlegroups.com
What kind of join are you doing? Some code would be helpful…

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra


Jing Lu

Jul 28, 2019, 11:05:00 PM7/28/19
to cascading-user

The code looks like this:

    val outlier = pipe1
      .insert('sid, 2)
      .rename(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o))
      .project(('id_o, 'sid_o, 'mobile_o, 'revenue))

    pip.joinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o),
          outlier)
      .map(('revenue1, 'revenue) -> 'revenue1) { r: (Double, Double) =>
        val (rev1, rev2) = r
        rev1 - rev2
      }


The outlier pipe here is a much smaller data set. After the join, many rows in the revenue column can be empty. Is this the right way to do it?


Thanks!




Ken Krugler

Jul 29, 2019, 2:38:00 AM7/29/19
to cascadi...@googlegroups.com
I’m not a Scalding user, but I think you get an inner join by default. So I wouldn’t expect to see lots of empty revenue values, as you should only get results where the id exists on both sides of the join.

But I’d try joinWithSmaller first, as joinWithTiny replicates the right side to every mapper, so if you have a lot of mappers it might run slower.
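Roughly (untested, since I'm not a Scalding user, and reusing the pipe and field names from your snippet), the variants would look something like this:

    import cascading.pipe.joiner.LeftJoin

    // Map-side replicated join: the whole outlier pipe is copied to every
    // mapper, which can get slow when there are many mappers.
    val replicated = pip
      .joinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)

    // Reduce-side join, with the smaller pipe on the right-hand side.
    val reduceSide = pip
      .joinWithSmaller(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)

    // If unmatched rows should be kept (empty revenue), ask for a left join
    // explicitly instead of the default inner join.
    val leftJoined = pip
      .joinWithSmaller(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o),
        outlier, joiner = new LeftJoin)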

— Ken


Jing Lu

Jul 29, 2019, 1:16:11 PM7/29/19
to cascading-user
I tried joinWithSmaller, and it also didn't terminate, so I was looking for something more efficient. You are right that the default is an inner join; since I want to keep rows even when there's no match (empty revenue), I should use leftJoinWithTiny here.
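Something like this (untested; assuming leftJoinWithTiny from Scalding's JoinAlgorithms, with the same field names as before):

    val joined = pip
      .leftJoinWithTiny(('id, 'sid, 'mobile) -> ('id_o, 'sid_o, 'mobile_o), outlier)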

Thanks