# RE: Re: distance function

4 views

### Coleman, Mark

Sep 30, 2006, 5:37:27 AM9/30/06
to

Out of curiosity, I tested these two approaches on a number of data sets
for which I make frequent use. For some reason Jens code is running
slower! I've been testing it out some lists of reals of size
n=500,1000,2500, and 5000. Is it possible that the time for the
conditional compares is exceeding the computational time of redundant
calculations? Could someone try this out?

(note: I'm working on some code for identifying outliers in large data
sets. The efficient calculation of L-1 and L-2 distance matrices are
important.)

Thanks

Mark

-----Original Message-----
From: Murray Eisenberg [mailto:mur...@math.umass.edu]
Subject: Re: distance function

Yes, I KNOW that I'm computing the distances twice in my solution:
that's why I said it's an "extravagant" solution!

Jens-Peer Kuska wrote:
> Hi Murray,
>
> at least you should compute the distances not twice because the matrix

> is symmetric with zero diagonal ...
>
> d[{p_,p_}]:=0.0
> d[{q_,p_}]/; OrderedQ[{q,p}]:=d[{q,p}]= Norm[p - q]
> d[{q_,p_}]:=d[{p,q}]
>
> Regards
> Jens
>
>
> Murray Eisenberg wrote:
>> If you don't mind an "extravagant" solution -- one that is
>> conceptually simple and short but is probably inefficient due to
>> redundant calculations -- then this works, I believe:
>>
>> d[{p_, q_}] := Norm[p - q]
>> allDistances[pts_] := Union[Flatten[Outer[d, pts, pts]]]
>>
>>
>>
>> dimm...@yahoo.com wrote:
>>> In the book of Gaylord et al. (1996) there is one exercise which
>>> asks (see page 113)
>>>
>>> "Given a list of points in the plane, write a function that finds
>>> the set of all distances between the points."
>>>
>>> Although there is one solution, that solution makes use of the Table

>>> and Length commands.
>>>
>>> Is it a way to define the same function using Higher-Order functions

>>> like Outer, MapThread etc?
>>>
>>> Thanks in advance for any help.
>>>
>>>
>
>

--
Murray Eisenberg mur...@math.umass.edu
Mathematics & Statistics Dept.
Lederle Graduate Research Tower phone 413 549-1020 (H)
University of Massachusetts 413 545-2859 (W)
710 North Pleasant Street fax 413 545-1801
Amherst, MA 01003-9305

### Peter Pein

Oct 1, 2006, 4:32:10 AM10/1/06
to
Coleman, Mark schrieb:

Hi Mark,

I think the most efficient proposal came from Bob Hanlon. I tested
some alternatives and the fastest is d @@@S ubsets[pts, {2}] (2.98s for
1000 points) followed by d @@@ Flatten[MapIndexed[Drop[#1, #2[]-1] &,
Outer[List, Most@pts, Rest@pts, 1]], 1] with 3.31 s.

Peter

### Chris Chiasson

Oct 1, 2006, 4:43:32 AM10/1/06
to
Why not use caching? Also, Maxim said Norm is slow.

d[{p_,p_}]:=0
d[{p_,q_}]:=d[{p,q}]=d[{q,p}]=Sqrt[#.#]&[p-q]

If memory usage becomes a problem, you could always Clear d.
Alternatively, you could make a Module function that operates on an
entire matrix, using d as one of the temporary "variables" inside of
the Module. As long as you code it correctly (so that d doesn't appear
in the result expression), then d will automatically be Removed.

> --
> Murray Eisenberg mur...@math.umass.edu
> Mathematics & Statistics Dept.
> Lederle Graduate Research Tower phone 413 549-1020 (H)
> University of Massachusetts 413 545-2859 (W)
> 710 North Pleasant Street fax 413 545-1801
> Amherst, MA 01003-9305
>
>

### Chris Chiasson

Oct 2, 2006, 12:44:21 AM10/2/06
to
Could someone please explain why Mathematica seems to be expecting the
output of cc to be a vector?

pts=Table[{Random[],Random[]},{3}];
cc[p_,p_]=0;
cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
Outer[ccC,pts,pts,1]

### Carl Woll

Oct 2, 2006, 12:57:36 AM10/2/06
to
Coleman, Mark wrote:

>
>Out of curiosity, I tested these two approaches on a number of data sets
>for which I make frequent use. For some reason Jens code is running
>slower! I've been testing it out some lists of reals of size
>n=500,1000,2500, and 5000. Is it possible that the time for the
>conditional compares is exceeding the computational time of redundant
>calculations? Could someone try this out?
>
>(note: I'm working on some code for identifying outliers in large data
>sets. The efficient calculation of L-1 and L-2 distance matrices are
>important.)
>
>Thanks
>
>Mark
>
>
>

The package Statistics`ClusterAnalysis` has a very quick DistanceMatrix
function:

data=Table[Random[],{1000},{3}];

<<Statistics`ClusterAnalysis`

Sqrt[DistanceMatrix[data]];//Timing
{0.187 Second,Null}

The default distance function for DistanceMatrix is
SquaredEuclideanDistance, hence the use of Sqrt. It is possible to use
the option DistanceFunction->EuclideanDistance to obviate the need for Sqrt:

DistanceMatrix[data, DistanceFunction->EuclideanDistance];

However, this option seems to be about 25% slower. This is probably
because it is faster to take the square root of a matrix and rely on
vector operations rather than taking the square root of each matrix element.

Carl Woll
Wolfram Research

### Jean-Marc Gulliet

Oct 3, 2006, 6:27:49 AM10/3/06
to
Chris Chiasson wrote:
> Could someone please explain why Mathematica seems to be expecting the
> output of cc to be a vector?
>
> pts=Table[{Random[],Random[]},{3}];
> cc[p_,p_]=0;
> cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> Outer[ccC,pts,pts,1]

Hi Chris,

Written in the following way, Compile does not complain anymore:

In:=
pts = Table[{Random[], Random[]}, {3}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}},
If[p == q, 0, (Sqrt[#1 . #1] & )[p - q]]];
Outer[ccC, pts, pts, 1]

Out=
{{0.,1.00434,0.161348},{1.00434,0.,0.845952},{0.161348,0.845952,0.}}

This, of course, does not explain why in first instance Compile was
expecting a rank 1 vector. Using the optional argument of Compile to
tell Mathematica what is the return type of an external function does
not help either:

ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{cc, _Real, 0}}];

HTH,
Jean-Marc

### Ray Koopman

Oct 4, 2006, 6:24:11 AM10/4/06
to
Carl Woll wrote:
> [...]

> The package Statistics`ClusterAnalysis` has a very quick
> DistanceMatrix function:
> [...]

The ClusterAnalysis.m file contains some usage messages
for DistanceMatrix, but not the code. Where is the code?

### Carl Woll

Oct 5, 2006, 3:55:13 AM10/5/06
to
Ray Koopman wrote:

>Carl Woll wrote:
>
>
>>[...]

>>The package Statistics`ClusterAnalysis` has a very quick
>>DistanceMatrix function:

>>[...]
>>
>>
>
>The ClusterAnalysis.m file contains some usage messages
>for DistanceMatrix, but not the code. Where is the code?
>
>

Sometimes a package developer will code up a function in the kernel, and
then include just the usage message in the .m file. This is what occurs
here. The DistanceMatrix function can be accessed without loading the
package, as it already exists in the kernel. Try the following from a

Quit

data = Table[ Random[], {3}, {2}];

Statistics`ClusterAnalysis`DistanceMatrix[data]

and you will get the distance matrix. If you load the package, then you
won't need to fully qualify the function name to use it. So, if you were
looking at the package to discover how someone could write Mathematica
code to create a distance matrix so quickly you will discover that they
"cheated", i.e., they coded the function within the kernel.

Carl Woll
Wolfram Research

### Peter Pein

Oct 5, 2006, 4:00:26 AM10/5/06
to
Chris Chiasson schrieb:

> Could someone please explain why Mathematica seems to be expecting the
> output of cc to be a vector?
>
> pts=Table[{Random[],Random[]},{3}];
> cc[p_,p_]=0;
> cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> Outer[ccC,pts,pts,1]
>

I guess it's because p and q are vectors and compile doesn't analyze cc.

b.t.w. It doesn't make much sense to compile only the call to a function
but not the function itself:

pts = Table[Random[], {1000}, {2}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{_cc, _Real}}];
Timing[Outer[ccC, pts, pts, 1];]

lasts 17.5 seconds but if you compile cc inline:

pts = Table[Random[], {1000}, {2}];
ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, Evaluate@cc[p, q]];
Timing[Outer[ccC, pts, pts, 1];]

only 1.8 seconds are necessary.

Peter

### Chris Chiasson

Oct 5, 2006, 4:08:37 AM10/5/06
to
thanks!

I thought compile could go "inside" the function.

Of course, evaluating the code before compilation leaves out one of my
pattern definitions :-[

On 10/4/06, Peter Pein <pet...@dordos.net> wrote:
> Chris Chiasson schrieb:

> > Could someone please explain why Mathematica seems to be expecting the
> > output of cc to be a vector?
> >
> > pts=Table[{Random[],Random[]},{3}];
> > cc[p_,p_]=0;
> > cc[p_,q_]:=cc[q,p]=Sqrt[#.#]&[p-q];
> > ccC=Compile[{{p,_Real,1},{q,_Real,1}},cc[p,q]];
> > Outer[ccC,pts,pts,1]
> >
>

> I guess it's because p and q are vectors and compile doesn't analyze cc.
>
> b.t.w. It doesn't make much sense to compile only the call to a function
> but not the function itself:
>
> pts = Table[Random[], {1000}, {2}];
> ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, cc[p, q], {{_cc, _Real}}];
> Timing[Outer[ccC, pts, pts, 1];]
>
> lasts 17.5 seconds but if you compile cc inline:
>
> pts = Table[Random[], {1000}, {2}];
> ccC = Compile[{{p, _Real, 1}, {q, _Real, 1}}, Evaluate@cc[p, q]];
> Timing[Outer[ccC, pts, pts, 1];]
>
> only 1.8 seconds are necessary.
>
> Peter
>

### Chris Chiasson

Oct 6, 2006, 2:15:00 AM10/6/06
to
Was it ever not in the kernel, say in an older version?
If it wasn't in the kernel and it was in a human readable package, a
customer with a backlog of older versions may be able to tell us what
the code was.

On 10/5/06, Carl Woll <ca...@wolfram.com> wrote:
> Ray Koopman wrote:
>
> >Carl Woll wrote:
> >
> >
> >>[...]

> >>The package Statistics`ClusterAnalysis` has a very quick
> >>DistanceMatrix function:

> >>[...]
> >>
> >>
> >
> >The ClusterAnalysis.m file contains some usage messages
> >for DistanceMatrix, but not the code. Where is the code?
> >
> >
> Sometimes a package developer will code up a function in the kernel, and
> then include just the usage message in the .m file. This is what occurs
> here. The DistanceMatrix function can be accessed without loading the
> package, as it already exists in the kernel. Try the following from a
>
> Quit
>
> data = Table[ Random[], {3}, {2}];
>
> Statistics`ClusterAnalysis`DistanceMatrix[data]
>
> and you will get the distance matrix. If you load the package, then you
> won't need to fully qualify the function name to use it. So, if you were
> looking at the package to discover how someone could write Mathematica
> code to create a distance matrix so quickly you will discover that they
> "cheated", i.e., they coded the function within the kernel.
>
> Carl Woll
> Wolfram Research
>
>