how to avoid cross-product of two pipes


Sunny Malik

Apr 24, 2014, 3:39:46 AM
to cascadi...@googlegroups.com

Problem statement: 

pipeOne has tuples of (integer, List[integer])

        kId,  kList
e.g.:   1,    [3, 4, 9]
        2,    [6, 7, 8]

pipeTwo has tuples of (integer, List[integer])

        cId,  cList
e.g.:   101,  [10, 6, 3]
        102,  [6, 8, 10]

I need to intersect every kList with every cList and count each pair with a non-empty intersection as one,
i.e. check the list of kId=1 against all available cLists in pipeTwo,
     check the list of kId=2 against all available cLists in pipeTwo,
     and so on,

like two nested "for" loops in Java.

FYI: I cannot join pipeOne with pipeTwo on kId = cId; they are two different columns.

One approach would be:
1) take the cross-product of the two pipes

     kId, kList                                     cId, cList
     1,   [3, 4, 9]                                 101, [10, 6, 3]
     2,   [6, 7, 8]      cross-product with         102, [6, 8, 10]

The cross-product will be

 kId   kList         cId    cList
 1     [3, 4, 9]     101    [10, 6, 3]
 1     [3, 4, 9]     102    [6, 8, 10]
 2     [6, 7, 8]     101    [10, 6, 3]
 2     [6, 7, 8]     102    [6, 8, 10]

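2) map each row of the cross-product to 1 if the two lists intersect, otherwise 0

.map(('kList, 'cList) -> 'counts) { lists: (List[Int], List[Int]) =>
        val (kList, cList) = lists
        // 1 if the two lists share at least one element, otherwise 0
        if (kList.intersect(cList).nonEmpty) 1 else 0
}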


The above relies on a cross-product, which is a heavy operation, especially for huge pipes.
Is there a better way of doing this, maybe using a matrix or something?

I would like to avoid the cross-product.
Any help is appreciated.

Thanks in advance for the help.

-Sunny


Ken Krugler

Apr 24, 2014, 9:05:45 AM
to cascadi...@googlegroups.com
In regular Cascading code, you'd generate an inverse mapping first, then use that to group, and count uniques.

E.g.

kListItem  kId
3          1
4          1
9          1
6          2
7          2
8          2

cListItem  cId
10         101
6          101
3          101
6          102
8          102
10         102

Then join on kListItem = cListItem. You'll get duplicate (kId, cId) matches whenever two lists share more than one item, so do a unique on (kId, cId) before counting.
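
For illustration, here is a minimal in-memory sketch of those steps using plain Scala collections rather than Cascading pipes. It is not the code from this thread: the object name InverseMappingSketch and the helpers invert and countMatches are hypothetical, and it assumes the goal is, for each kId, the number of distinct cIds whose list shares at least one item with it.

```scala
// In-memory sketch of the inverse-mapping join; all names are illustrative only.
object InverseMappingSketch {
  // Flatten (id, list) rows into (item, id) pairs, one pair per list element.
  def invert(rows: Seq[(Int, List[Int])]): Seq[(Int, Int)] =
    for ((id, items) <- rows; item <- items) yield (item, id)

  // For each kId, count the distinct cIds sharing at least one item with it.
  def countMatches(pipeOne: Seq[(Int, List[Int])],
                   pipeTwo: Seq[(Int, List[Int])]): Map[Int, Int] = {
    // Group the second side by item so the join is a lookup, not a cross-product.
    val cIdsByItem: Map[Int, Seq[Int]] =
      invert(pipeTwo).groupBy(_._1).map { case (item, pairs) => item -> pairs.map(_._2) }

    // Join on item; a (kId, cId) pair appears once per shared item.
    val joined: Seq[(Int, Int)] =
      for ((item, kId) <- invert(pipeOne); cId <- cIdsByItem.getOrElse(item, Seq.empty))
        yield (kId, cId)

    // Deduplicate the pairs, then count the remaining cIds per kId.
    joined.distinct.groupBy(_._1).map { case (kId, pairs) => kId -> pairs.size }
  }

  def main(args: Array[String]): Unit = {
    val pipeOne = Seq(1 -> List(3, 4, 9), 2 -> List(6, 7, 8))
    val pipeTwo = Seq(101 -> List(10, 6, 3), 102 -> List(6, 8, 10))
    println(countMatches(pipeOne, pipeTwo))
  }
}
```

On the sample data from the question, the join yields (1,101), (2,101) and (2,102) twice (items 6 and 8), so after the unique step kId 1 counts 1 and kId 2 counts 2. A kId with no matching cList does not appear in the result at all; if zero counts are needed, they would have to be filled in separately.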

-- Ken
--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




