RE: Re: distance function

4 views
Skip to first unread message

Coleman, Mark

unread,
Sep 30, 2006, 5:37:27 AM9/30/06
to

Out of curiosity, I tested these two approaches on a number of data sets
for which I make frequent use. For some reason Jens code is running
slower! I've been testing it out some lists of reals of size
n=500,1000,2500, and 5000. Is it possible that the time for the
conditional compares is exceeding the computational time of redundant
calculations? Could someone try this out?

(note: I'm working on some code for identifying outliers in large data
sets. The efficient calculation of L-1 and L-2 distance matrices are
important.)

Thanks

Mark


-----Original Message-----
From: Murray Eisenberg [mailto:mur...@math.umass.edu]
Subject: Re: distance function

Yes, I KNOW that I'm computing the distances twice in my solution:
that's why I said it's an "extravagant" solution!

Jens-Peer Kuska wrote:
> Hi Murray,
>
> at least you should compute the distances not twice because the matrix

> is symmetric with zero diagonal ...
>
> d[{p_,p_}]:=0.0
> d[{q_,p_}]/; OrderedQ[{q,p}]:=d[{q,p}]= Norm[p - q]
> d[{q_,p_}]:=d[{p,q}]
>
> Regards
> Jens
>
>
> Murray Eisenberg wrote:
>> If you don't mind an "extravagant" solution -- one that is
>> conceptually simple and short but is probably inefficient due to
>> redundant calculations -- then this works, I believe:
>>
>> d[{p_, q_}] := Norm[p - q]
>> allDistances[pts_] := Union[Flatten[Outer[d, pts, pts]]]
>>
>>
>>
>> dimm...@yahoo.com wrote:
>>> In the book of Gaylord et al. (1996) there is one exercise which
>>> asks (see page 113)
>>>
>>> "Given a list of points in the plane, write a function that finds
>>> the set of all distances between the points."
>>>
>>> Although there is one solution, that solution makes use of the Table

>>> and Length commands.
>>>
>>> Is it a way to define the same function using Higher-Order functions

>>> like Outer, MapThread etc?
>>>
>>> Thanks in advance for any help.
>>>
>>>
>
>

--
Murray Eisenberg mur...@math.umass.edu
Mathematics & Statistics Dept.
Lederle Graduate Research Tower phone 413 549-1020 (H)
University of Massachusetts 413 545-2859 (W)
710 North Pleasant Street fax 413 545-1801
Amherst, MA 01003-9305

Peter Pein

unread,
Oct 1, 2006, 4:32:10 AM10/1/06
to
Coleman, Mark schrieb:

Hi Mark,

I think the most efficient proposal came from Bob Hanlon. I tested
some alternatives and the fastest is d @@@S ubsets[pts, {2}] (2.98s for
1000 points) followed by d @@@ Flatten[MapIndexed[Drop[#1, #2[[1]]-1] &,
Outer[List, Most@pts, Rest@pts, 1]], 1] with 3.31 s.

Peter

Chris Chiasson

unread,
Oct 1, 2006, 4:43:32 AM10/1/06
to
Why not use caching? Also, Maxim said Norm is slow.

d[{p_,p_}]:=0
d[{p_,q_}]:=d[{p,q}]=d[{q,p}]=Sqrt[#.#]&[p-q]

If memory usage becomes a problem, you could always Clear d.
Alternatively, you could make a Module function that operates on an
entire matrix, using d as one of the temporary "variables" inside of
the Module. As long as you code it correctly (so that d doesn't appear
in the result expression), then d will automatically be Removed.

> --
> Murray Eisenberg mur...@math.umass.edu
> Mathematics & Statistics Dept.
> Lederle Graduate Research Tower phone 413 549-1020 (H)
> University of Massachusetts 413 545-2859 (W)
> 710 North Pleasant Street fax 413 545-1801
> Amherst, MA 01003-9305
>
>


--
http://chris.chiasson.name/

Chris Chiasson

unread,
Oct 2, 2006, 12:44:21 AM10/2/06
to
Could someone please explain why Mathematica seems to be expecting the
output of cc to be a vector?

pts[1]=Table[{Random[],Random[]},{3}];
cc[p_,p_]=0;
cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
Outer[ccC,pts[1],pts[1],1]

Carl Woll

unread,
Oct 2, 2006, 12:57:36 AM10/2/06
to
Coleman, Mark wrote:

>
>Out of curiosity, I tested these two approaches on a number of data sets
>for which I make frequent use. For some reason Jens code is running
>slower! I've been testing it out some lists of reals of size
>n=500,1000,2500, and 5000. Is it possible that the time for the
>conditional compares is exceeding the computational time of redundant
>calculations? Could someone try this out?
>
>(note: I'm working on some code for identifying outliers in large data
>sets. The efficient calculation of L-1 and L-2 distance matrices are
>important.)
>
>Thanks
>
>Mark
>
>
>

The package Statistics`ClusterAnalysis` has a very quick DistanceMatrix
function:

data=Table[Random[],{1000},{3}];

<<Statistics`ClusterAnalysis`

Sqrt[DistanceMatrix[data]];//Timing
{0.187 Second,Null}

The default distance function for DistanceMatrix is
SquaredEuclideanDistance, hence the use of Sqrt. It is possible to use
the option DistanceFunction->EuclideanDistance to obviate the need for Sqrt:

DistanceMatrix[data, DistanceFunction->EuclideanDistance];

However, this option seems to be about 25% slower. This is probably
because it is faster to take the square root of a matrix and rely on
vector operations rather than taking the square root of each matrix element.

Carl Woll
Wolfram Research

Jean-Marc Gulliet

unread,
Oct 3, 2006, 6:27:49 AM10/3/06
to
Chris Chiasson wrote:
> Could someone please explain why Mathematica seems to be expecting the
> output of cc to be a vector?
>
> pts[1]=Table[{Random[],Random[]},{3}];
> cc[p_,p_]=0;
> cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> Outer[ccC,pts[1],pts[1],1]

Hi Chris,

Written in the following way, Compile does not complain anymore:

In[1]:=
pts[1] = Table[{Random[], Random[]}, {3}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}},
If[p == q, 0, (Sqrt[#1 . #1] & )[p - q]]];
Outer[ccC, pts[1], pts[1], 1]

Out[3]=
{{0.,1.00434,0.161348},{1.00434,0.,0.845952},{0.161348,0.845952,0.}}

This, of course, does not explain why in first instance Compile was
expecting a rank 1 vector. Using the optional argument of Compile to
tell Mathematica what is the return type of an external function does
not help either:

ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{cc, _Real, 0}}];

HTH,
Jean-Marc

Ray Koopman

unread,
Oct 4, 2006, 6:24:11 AM10/4/06
to
Carl Woll wrote:
> [...]

> The package Statistics`ClusterAnalysis` has a very quick
> DistanceMatrix function:
> [...]

The ClusterAnalysis.m file contains some usage messages
for DistanceMatrix, but not the code. Where is the code?

Carl Woll

unread,
Oct 5, 2006, 3:55:13 AM10/5/06
to
Ray Koopman wrote:

>Carl Woll wrote:
>
>
>>[...]


>>The package Statistics`ClusterAnalysis` has a very quick
>>DistanceMatrix function:

>>[...]
>>
>>
>
>The ClusterAnalysis.m file contains some usage messages
>for DistanceMatrix, but not the code. Where is the code?
>
>

Sometimes a package developer will code up a function in the kernel, and
then include just the usage message in the .m file. This is what occurs
here. The DistanceMatrix function can be accessed without loading the
package, as it already exists in the kernel. Try the following from a
fresh kernel (without loading Statistics`ClusterAnalysis`):

Quit

data = Table[ Random[], {3}, {2}];

Statistics`ClusterAnalysis`DistanceMatrix[data]

and you will get the distance matrix. If you load the package, then you
won't need to fully qualify the function name to use it. So, if you were
looking at the package to discover how someone could write Mathematica
code to create a distance matrix so quickly you will discover that they
"cheated", i.e., they coded the function within the kernel.

Carl Woll
Wolfram Research

Peter Pein

unread,
Oct 5, 2006, 4:00:26 AM10/5/06
to
Chris Chiasson schrieb:

> Could someone please explain why Mathematica seems to be expecting the
> output of cc to be a vector?
>
> pts[1]=Table[{Random[],Random[]},{3}];
> cc[p_,p_]=0;
> cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> Outer[ccC,pts[1],pts[1],1]
>

I guess it's because p and q are vectors and compile doesn't analyze cc.

b.t.w. It doesn't make much sense to compile only the call to a function
but not the function itself:

pts[1] = Table[Random[], {1000}, {2}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{_cc, _Real}}];
Timing[Outer[ccC, pts[1], pts[1], 1];]

lasts 17.5 seconds but if you compile cc inline:

pts[1] = Table[Random[], {1000}, {2}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, Evaluate@cc[p, q]];
Timing[Outer[ccC, pts[1], pts[1], 1];]

only 1.8 seconds are necessary.

Peter

Chris Chiasson

unread,
Oct 5, 2006, 4:08:37 AM10/5/06
to
thanks!

I thought compile could go "inside" the function.

Of course, evaluating the code before compilation leaves out one of my
pattern definitions :-[

On 10/4/06, Peter Pein <pet...@dordos.net> wrote:
> Chris Chiasson schrieb:


> > Could someone please explain why Mathematica seems to be expecting the
> > output of cc to be a vector?
> >
> > pts[1]=Table[{Random[],Random[]},{3}];
> > cc[p_,p_]=0;
> > cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> > ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> > Outer[ccC,pts[1],pts[1],1]
> >
>

> I guess it's because p and q are vectors and compile doesn't analyze cc.
>
> b.t.w. It doesn't make much sense to compile only the call to a function
> but not the function itself:
>
> pts[1] = Table[Random[], {1000}, {2}];
> ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{_cc, _Real}}];
> Timing[Outer[ccC, pts[1], pts[1], 1];]
>
> lasts 17.5 seconds but if you compile cc inline:
>
> pts[1] = Table[Random[], {1000}, {2}];
> ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, Evaluate@cc[p, q]];
> Timing[Outer[ccC, pts[1], pts[1], 1];]
>
> only 1.8 seconds are necessary.
>
> Peter
>


--
http://chris.chiasson.name/

Chris Chiasson

unread,
Oct 6, 2006, 2:15:00 AM10/6/06
to
Was it ever not in the kernel, say in an older version?
If it wasn't in the kernel and it was in a human readable package, a
customer with a backlog of older versions may be able to tell us what
the code was.

On 10/5/06, Carl Woll <ca...@wolfram.com> wrote:
> Ray Koopman wrote:
>
> >Carl Woll wrote:
> >
> >
> >>[...]

> >>The package Statistics`ClusterAnalysis` has a very quick
> >>DistanceMatrix function:

> >>[...]
> >>
> >>
> >
> >The ClusterAnalysis.m file contains some usage messages
> >for DistanceMatrix, but not the code. Where is the code?
> >
> >
> Sometimes a package developer will code up a function in the kernel, and
> then include just the usage message in the .m file. This is what occurs
> here. The DistanceMatrix function can be accessed without loading the
> package, as it already exists in the kernel. Try the following from a
> fresh kernel (without loading Statistics`ClusterAnalysis`):
>
> Quit
>
> data = Table[ Random[], {3}, {2}];
>
> Statistics`ClusterAnalysis`DistanceMatrix[data]
>
> and you will get the distance matrix. If you load the package, then you
> won't need to fully qualify the function name to use it. So, if you were
> looking at the package to discover how someone could write Mathematica
> code to create a distance matrix so quickly you will discover that they
> "cheated", i.e., they coded the function within the kernel.
>
> Carl Woll
> Wolfram Research
>
>


--
http://chris.chiasson.name/

Reply all
Reply to author
Forward
0 new messages