vectorize intersect with cells?

HenryW

unread,

Mar 8, 2009, 9:04:42 PM3/8/09

to

I have a cell cliques=cell(n,1) where each cell
cliques{i} is a set of integers e.g. cliques{1}=[1 3 9 29]
I would like to vectorize this for loop ---- which takes very long:

for ic=1:n-1,
for jc=ic+1:n,
intcliques(ic,jc)=length(intersect(cliques{ic},cliques{jc}));
end
end

any suggestions?

Bruno Luong

unread,

Mar 9, 2009, 3:56:01 AM3/9/09

to

HenryW <hwolk...@uwaterloo.ca> wrote in message <18230588.1236560716...@nitrogen.mathforum.org>...

Could you tell us the typical values of n and length(cliques{i})?

Thanks,

Bruno

us

unread,

Mar 9, 2009, 5:28:30 AM3/9/09

to

HenryW wrote:
> I would like to vectorize this for loop ---- which takes very long:

> any suggestions...

this loop cannot be vectorized (in the true sense of vectorization)...
it only can be optimized...

two of the possible solutions including simple profiling
- copy/paste

% the data
clear c* nc* r*; % <- save old stuff!
nt=10; % <- #runs/solution
nc=100; % <- max #data/cell
cc=cell(nc,1);
for i=1:nc
nx=ceil(10*rand(1,nc));
ne=ceil(nc*rand);
cc{i,1}=nx(1:ne);
end

% the engine
r1=zeros(nc,nc);
r2=zeros(nc,nc);
% - sol #1 = OP
tic;
for i=1:nt
for ic=1:nc-1,
for jc=ic+1:nc,
r1(ic,jc)=length(intersect(cc{ic},cc{jc}));
end
end
end
toc
% - sol #2
tic;
for i=1:nt
for ic=1:nc-1,
for jc=ic+1:nc
r2(ic,jc)=sum(ismembc(unique(cc{ic}),sort(cc{jc})));
end
end
end
toc
% - sol #3
tic;
for i=1:nt
ix=nchoosek(1:nc,2);
r3=cellfun(@(x,y) sum(ismembc(unique(x),sort(y))),cc(ix(:,1)),cc
(ix(:,2)));
r3=accumarray(ix,r3,[nc,nc]);
end
toc
disp(isequal(r1,r2,r3));
%{
% wintel system
% ic2/2*2.6ghz/2gb/winxp.sp3/r2008b
Elapsed time is 5.368212 seconds. % <- sol #1 = OP
Elapsed time is 2.040264 seconds. % <- sol #2 = ~50%
Elapsed time is 2.102942 seconds. % <- sol #3 = ~50%
isequal = 1
%}

us

unread,

Mar 9, 2009, 5:33:50 AM3/9/09

to

HenryW wrote:
> I would like to vectorize this for loop ---- which takes very long:
> any suggestions...

this loop cannot be vectorized (in the true sense of
vectorization)...
it only can be optimized...

two of the possible solutions including simple profiling
- copy/paste

- second post via google...

r3=cellfun(@(x,y) sum(ismembc(unique(x),sort(y))),...
cc(ix(:,1)),cc(ix(:,2)));

r3=accumarray(ix,r3,[nc,nc]);
end
toc
disp(isequal(r1,r2,r3));
%{
% wintel system
% ic2/2*2.6ghz/2gb/winxp.sp3/r2008b
Elapsed time is 5.368212 seconds. % <- sol #1 = OP

Elapsed time is 2.240264 seconds. % <- sol #2
Elapsed time is 2.302942 seconds. % <- sol #3

Bruno Luong

unread,

Mar 9, 2009, 6:33:30 AM3/9/09

to

Here is one way, complicated but should be much faster at the run
time:

function testintersect

n=200; % number of sets
m=100; % element value are from 0 to m-1
meannumel=50; % average number of elements in lists
s = 20; % standard deviation of number of elements
cliques = gendata(n, m, meannumel, s);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%
% Engine 1 (Henry)
tic
intcliques1=Engine1(cliques);
toc % 2.374056 seconds.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%
% Engine 3 (Bruno)
tic
intcliques3=Engine3(cliques);
toc % 0.059586 seconds.

% Check
isequal(intcliques1,intcliques3)
subplot(1,2,1);
imagesc(intcliques1);
subplot(1,2,2);
imagesc(intcliques3);
end

% Generate data
function cliques = gendata(n, m, meannumel, s)
% Data

cliques=cell(1,n);
for k=1:n
nelk = meannumel+round(s*randn);
nelk = min(max(nelk,1),m);
cliques{k} = floor(m*rand(1,nelk));
end
end

function intcliques=Engine1(cliques)
n=length(cliques);
intcliques=zeros(n,n);

for ic=1:n-1,
for jc=ic+1:n,

intcliques(ic,jc)=length(intersect(cliques{ic},cliques{jc}));
end
end
end

function intcliques=Engine3(cliques)
n=length(cliques);
% Transform to long column array
c = cellfun(@(x) unique(x(:)), cliques, 'uni', false);
allc = cat(1,c{:});
% Take a unique
[u I J]=unique(allc);
% map c to index integer set: 1,2,...
cmap = mat2cell(J,cellfun(@length,c),1);
% Set id, same size as cmap
setid = arrayfun(@(n) n+zeros(size(cmap{n})), 1:length(cmap), ...
'uni', false);
% Reshape as long vectors
cmap = cat(1,cmap{:});
setid = cat(1,setid{:});

% For each element, return the list of set that contain it
contained=accumarray(cmap,setid,[],@(id) {sort(id)});

% Build the list of all couples
couples = cellfun(@allcouples, contained, 'uni', false);
couples = cat(1,couples{:});
% Count
%intcliques=sparse(couples(:,1),couples(:,2),1,n,n); % < sparse
intcliques=accumarray(couples,1,[n,n]);

function res=allcouples(k) % Return all couples of a list

k=k(:);
[I J]=ndgrid(1:numel(k),1:numel(k));
[I J]=deal(I(I<J),J(I<J));
res=[k(I(:)) k(J(:))];
end

end

Bruno Luong

unread,

Mar 9, 2009, 11:26:02 AM3/9/09

to

% More compact code:

function testintersect

n=200; % number of sets
m=100; % element value are from 0 to m-1
meannumel=50; % average number of elements in lists

s = 20; % standard deviation of number of elements, 0 to m-1

cliques = gendata(n, m, meannumel, s);

% Engine 1 (Henry)
tic
intcliques1=Engine1(cliques);
toc % 2.373781 seconds.

% Engine 3 (Bruno)
tic
intcliques3=Engine3(cliques);

toc % 0.048709 seconds.

% Check
isequal(intcliques1,intcliques3)
subplot(1,2,1);
imagesc(intcliques1);
subplot(1,2,2);
imagesc(intcliques3);

end % testintersect

% Generate data
function cliques = gendata(n, m, meannumel, s)
% Data

cliques=cell(1,n); % preallocate
for k=1:n
% number of elements in set #k
nelk = meannumel+round(s*randn);
nelk = min(max(nelk,0),m); % clip it
% random set of length nelk

cliques{k} = floor(m*rand(1,nelk));
end

end % gendata

function intcliques=Engine1(cliques)
n=length(cliques);
intcliques=zeros(n,n);
for ic=1:n-1,
for jc=ic+1:n,
intcliques(ic,jc)=length(intersect(cliques{ic},cliques{jc}));
end
end

end % Engine 1

function intcliques=Engine3(cliques)
% Number of sets

n=length(cliques);
% Transform to long column array
c = cellfun(@(x) unique(x(:)), cliques, 'uni', false);

% Set id, same size as c

setid = arrayfun(@(id) id+zeros(size(c{id})), 1:n, ...

'uni', false);
% Reshape as long vectors

setid = cat(1,setid{:});
% Take a unique of all elements

% map c to index integer set: 1,2,...

[dummy dummy cmap]=unique(cat(1,c{:}));
clear dummy

% For each element, return the list of set that contain it

contained=accumarray(cmap(:),setid,[],@(id) {id});
clear cmap setid

% Build the list of all couples
couples = cellfun(@allcouples, contained, 'uni', false);

couples = cat(1,couples{:}); % Put in a long (p x 2)-array

% Count
%intcliques=sparse(couples(:,1),couples(:,2),1,n,n); % < sparse

intcliques=accumarray(couples,1,[n n]); % <- full

function res=allcouples(k) % Return all couples of a list

[K1 K2]=ndgrid(k(:),k(:));
filter=K1<K2;
res=[K1(filter) K2(filter)];
end % allcouples

end % Engine3

% Bruno

Roger Stafford

unread,

Mar 9, 2009, 12:11:01 PM3/9/09

to

HenryW <hwolk...@uwaterloo.ca> wrote in message <18230588.1236560716...@nitrogen.mathforum.org>...

The reason your nested for-loops take so long is that there are n*(n-1)/2 calls on the 'intersect' function and each time this function must do some sorting to find shared elements. That's a lot of sorting for large n. The following code does just one massive sort on the assumption that that will save unnecessary repetition in the sorting process.

Call your cell array 'cliques' just 'c' here and assume that each element of 'c' is a column vector and that there are n such vectors in the cell array.

s = ones(n+1,1);
for ic = 1:n, s(ic+1) = s(ic) + length(c{ic}); end
m = s(n+1)-1;
x = zeros(m,1);
for ic = 1:n, x(s(ic):s(ic+1)-1) = c{ic}; end
[x,p] = sort(x); q = (1:m)'; q(p) = q;
u = cumsum([1;diff(x)~=0]);
t = cumsum(accumarray(s,1));
A = sparse(u(q),t(1:m),1,u(m),n);
A = full(A'*A);

Then A is your desired n x n matrix (which you called 'intcliques'.)

You will note that the diagonal and the lower triangular part are also filled. I hope that is all right with you.

I used the 'sparse' function on the assumption that at this point A would be very sparsely filled. If that is not so, you can replace it with an equivalent call to 'accumarray' and drop the 'full' in the next line.

I have also assumed that there are no repeated elements in each individual column vector of 'c'.

Roger Stafford

HenryW

unread,

Mar 9, 2009, 1:03:56 PM3/9/09

to

The values of n are typically from 2000 upwards; often 10,000. But, we are hoping to solve some very large problems and this is a preliminary preprocessing step.

Bruno Luong

unread,

Mar 9, 2009, 1:50:02 PM3/9/09

to

Great algo from Roger, very quick using binary sparse matrix to store set-element relationship. Here is the algo, slightly modified to give the same result as the original one, and vectorized two for-oops:

function A = Engine4(c) % Roger
n=length(c);
c = cellfun(@(x) unique(x(:)), c, 'uni', false);
s = cumsum([1 cellfun(@length,c)]).';
x=cat(1,c{:});
m = s(n+1)-1;

[x,p] = sort(x);
q = (1:m)';
q(p) = q;
u = cumsum([1;diff(x)~=0]);
t = cumsum(accumarray(s,1));
A = sparse(u(q),t(1:m),1,u(m),n);

A = triu(full(A'*A),1);
end % Engine 4

Bruno Luong

unread,

Mar 9, 2009, 4:09:01 PM3/9/09

to

% Somehow shorter

function A = Engine5(c) % Roger & Bruno

c = cellfun(@(x) unique(x(:)), c, 'uni', false);

l = cellfun(@length,c(:));
setid = cumsum(accumarray(cumsum([1; l]),1));

[dummy dummy cmap]=unique(cat(1,c{:}));

A = sparse(setid(1:end-1),cmap,1,length(c),length(dummy));
A = triu(full(A*A.'),1);

end % Engine 5

% Bruno

Roger Stafford

unread,

Mar 9, 2009, 4:36:01 PM3/9/09

to

"Bruno Luong" <b.l...@fogale.findmycountry> wrote in message <gp3ksa$aak$1...@fred.mathworks.com>...

Thanks, Bruno, for fixing up that code for generating x and s. I am a "babe in the woods" when it comes to cell arrays. Also thanks for the correction with 'triu'.

As it turns out, neither that inverse permutation 'q' nor the quantity 'm' were actually needed, so here is how your modification would look without them:

n=length(c);
c = cellfun(@(x) unique(x(:)), c, 'uni', false);
s = cumsum([1 cellfun(@length,c)]).';
x=cat(1,c{:});

[x,p] = sort(x);

u = cumsum([1;diff(x)~=0]);
t = cumsum(accumarray(s,1));

A = sparse(u,t(p),1,u(end),n);
A = triu(full(A'*A),1);

I forgot to mention that this code worked with my tests even with some of the cell vectors empty.

It is possible the 'unique' operation in your second line might be avoided by suitably altering the subsequent algorithm so as to achieve the same effect without the extra expense of the sorting involved in 'unique'. Ideally, the one massive sort ought to do the whole job with the right procedure. Any duplicate copies within a cell vector will appear in a contiguous string within x and there ought to be a way of excluding them from the count if they are treated the right way. However, I haven't had time to look into that aspect of it.

Roger Stafford

HenryW

unread,

Mar 9, 2009, 4:40:17 PM3/9/09

to

Rogers algorithm was really fast --- thanks!!
But I could not get Bruno's last modification to work.

Bruno Luong

unread,

Mar 9, 2009, 5:18:01 PM3/9/09

to

"Roger Stafford" <ellieandr...@mindspring.com.invalid> wrote in message <gp3ujh$9k3$1...@fred.mathworks.com>...

>
> It is possible the 'unique' operation in your second line might be avoided by suitably altering the subsequent algorithm so as to achieve the same effect without the extra expense of the sorting involved in 'unique'. Ideally, the one massive sort ought to do the whole job with the right procedure. Any duplicate copies within a cell vector will appear in a contiguous string within x and there ought to be a way of excluding them from the count if they are treated the right way. However, I haven't had time to look into that aspect of it.
>

Roger,

Do you think something along this line? Now we have just a unique call of unique!

function A = Engine6(c) % Roger & Bruno
% c must be a list of *row* vectors
[dummy dummy cmap]=unique([c{:}]);
nele = cellfun(@length,c(:));
setid = cumsum(accumarray(cumsum([1; nele]),1));
A = sparse(setid(1:end-1),cmap(:),1,length(c),length(dummy));
A = spones(A);
A = triu(full(A*A.'),1);
end % Engine 6

It's run about 10% faster on my computer for n=1000, and average 50 elements for each set.

Bruno