Nearest Neighbour Search

harishameed_33

unread,

Dec 17, 2012, 5:25:05 AM12/17/12

to

hi,

i have a set A with 5,000 (x,y,z) triples and set B with
1,000,000 (x,y,z) triples.

For each point in set A i want to find set of nearest point from set B.(closest 10 points)

kindly guide me how to proceede to solve this problem

glen herrmannsfeldt

unread,

Dec 17, 2012, 5:45:05 AM12/17/12

to

Do you have any idea at all?

Most obvious, is that you have to compute the distance between each
point in A and each in B.

Well, maybe not, but that is the more obvious way.

You can do it a little faster computing the square of the distance.

There are faster ways, but for each point in A (you need a loop),
compute the distance (or distance squared) to each point in B
(another loop). You an easily store (on most modern computers)
all 5000 distances.

Now, sort them and choose the smallest ten.

-- glen

Arjen Markus

unread,

Dec 17, 2012, 8:26:24 AM12/17/12

to

On Monday, December 17, 2012 11:45:05 AM UTC+1, glen herrmannsfeldt wrote:

> There are faster ways, but for each point in A (you need a loop),
> compute the distance (or distance squared) to each point in B
> (another loop). You an easily store (on most modern computers)
> all 5000 distances.
>
> Now, sort them and choose the smallest ten.
>

Actually, that would involve sorting the 1 million points in set B for each one
of the 5000 points in set A.

An alternative is to do the following per point in A:
- Determine the distances of that point to each point in set B.
- Determine the smallest distance and double that - call it "search radius".
- Determine the set of points in B that is within a distance of this search
radius from the point in A.
- If the set contains 10 or more points, sort by distance and save the first 10.
- Otherwise increase the search radius and repeat the previous two steps.

(Doubling the search radius is one way. It depends on the distance between the
points in A and B whether doubling is suitable. You do not want to end up with
almost all points, so perhaps you need to reduce the factor)

Other solutions Glen hinted at involve special data structures, such as Kd-trees
(IIRC). Kd-trees are a means of quickly finding points in a multidimensional
space.

I suggest you first try the method Glen described, or my attempt to refine
that, before looking at these more sophisticated options.

Regards,

Arjen

Mark F

unread,

Dec 17, 2012, 8:27:03 AM12/17/12

to

harishameed_33 <harisha...@hotmail.com> wrote:
> i have a set A with 5,000 (x,y,z) triples and set B with
> 1,000,000 (x,y,z) triples.
>
> For each point in set A i want to find set of nearest point from set B.(closest 10 points)
>
> kindly guide me how to proceede to solve this problem

I'd start with:
http://en.wikipedia.org/wiki/Nearest_neighbor_search

Robin Vowels

unread,

Dec 17, 2012, 8:59:34 AM12/17/12

to

On Dec 18, 12:26 am, Arjen Markus <arjen.markus...@gmail.com> wrote:
> On Monday, December 17, 2012 11:45:05 AM UTC+1, glen herrmannsfeldt wrote:
> > There are faster ways, but for each point in A (you need a loop),
> > compute the distance (or distance squared) to each point in B
> > (another loop). You an easily store (on most modern computers)
> > all 5000 distances.
>
> > Now, sort them and choose the smallest ten.
>
> Actually, that would involve sorting the 1 million points in set B for each one
> of the 5000 points in set A.

I think you mean "searching", not "sorting".
Sorting isn't required.

Gordon Sande

unread,

Dec 17, 2012, 9:01:30 AM12/17/12

to

Ask google for "nearest neighbour search". You should get lots of
responses starting
with kd-trees developed in the 1960s. A very well researched topic in
computer science.
So well researched the most articles will come across of very abstract
and arcane as
all the basic facts were discovered long ago and all that is left is
polishing of fine
points. And even current text books (and such) want to show how
abstract they can be.

You want of do 5,000 queries against a datbase of 1,000,000 points. The
queries will cost
log ( n ) after a setup with n log ( n ) for n=1,000,000. The problem
with sorting
is that the sort order is unique to each query so you need to redo it.
Of course sorting
in 2 or more dimensions does not exist. kd-trees filled the void. And
they have been
tinkered with for minor impovements over time.

harishameed_33

unread,

Dec 17, 2012, 12:15:14 PM12/17/12

to

any one plz!!!

Ivan D. Reid

unread,

Dec 17, 2012, 1:50:08 PM12/17/12

to

On Mon, 17 Dec 2012 05:26:24 -0800 (PST), Arjen Markus
<arjen.m...@gmail.com>
wrote in <8eb979ba-ff7d-44a2...@googlegroups.com>:

> Other solutions Glen hinted at involve special data structures, such as Kd-trees
> (IIRC). Kd-trees are a means of quickly finding points in a multidimensional
> space.

Having recently used 2d-trees to significantly speed up LHC track
reconstruction seeding, I'd vote for kd-trees. My colleague who used to
be a CompSci Prof in Brazil tells me it's an elementary problem he set his
students; luckily I had a pre-canned class to work with. There must be
oodles of sample code out there, though how much is in Fortran...?

--
Ivan Reid, School of Engineering & Design, _____________ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".

Gordon Sande

unread,

Dec 17, 2012, 3:04:56 PM12/17/12

to

On 2012-12-17 14:50:08 -0400, Ivan D. Reid said:

> On Mon, 17 Dec 2012 05:26:24 -0800 (PST), Arjen Markus
> <arjen.m...@gmail.com>
> wrote in <8eb979ba-ff7d-44a2...@googlegroups.com>:
>
>> Other solutions Glen hinted at involve special data structures, such as
>> Kd-trees
>> (IIRC). Kd-trees are a means of quickly finding points in a multidimensional
>> space.
>
> Having recently used 2d-trees to significantly speed up LHC track
> reconstruction seeding, I'd vote for kd-trees. My colleague who used to
> be a CompSci Prof in Brazil tells me it's an elementary problem he set his
> students; luckily I had a pre-canned class to work with. There must be
> oodles of sample code out there, though how much is in Fortran...?

In graphics they are known as quad-trees for their use of quadrants
and they sometimes use oct-trees based on octants.

Basic idea is not that complicated but does leave lots of room for fiddling
with the details to exploit features of particular problems.

kd-trees beat list intersection which was the best technique around until
they were developed (by folks working for SLAC). list intersection goes
with the database technique of inverted files.

Ivan D. Reid

unread,

Dec 17, 2012, 3:36:02 PM12/17/12

to

On Mon, 17 Dec 2012 16:04:56 -0400, Gordon Sande <Gordon...@gmail.com>
wrote in <kantt8$q8t$1...@dont-email.me>:

> On 2012-12-17 14:50:08 -0400, Ivan D. Reid said:

>> On Mon, 17 Dec 2012 05:26:24 -0800 (PST), Arjen Markus
>> <arjen.m...@gmail.com>
>> wrote in <8eb979ba-ff7d-44a2...@googlegroups.com>:

>>> Other solutions Glen hinted at involve special data structures, such as
>>> Kd-trees
>>> (IIRC). Kd-trees are a means of quickly finding points in a multidimensional
>>> space.

>> Having recently used 2d-trees to significantly speed up LHC track
>> reconstruction seeding, I'd vote for kd-trees. My colleague who used to
>> be a CompSci Prof in Brazil tells me it's an elementary problem he set his
>> students; luckily I had a pre-canned class to work with. There must be
>> oodles of sample code out there, though how much is in Fortran...?

> In graphics they are known as quad-trees for their use of quadrants
> and they sometimes use oct-trees based on octants.

Hmm, you're right, I used octrees when I was looking at
reconstruction of emulsion tracks from neutrino interactions at UCL ten or
eleven years ago. IIRC the partitioning between them and kd-trees is
subtly different, octrees using strictly binary partitioning while kd-trees
usually partition on the median value.

> Basic idea is not that complicated but does leave lots of room for fiddling
> with the details to exploit features of particular problems.

> kd-trees beat list intersection which was the best technique around until
> they were developed (by folks working for SLAC). list intersection goes
> with the database technique of inverted files.

In my case, kd-trees on a single processor core beat parallelising
the problem out to run on 16 cores simultaneously. YMMV

Dick Hendrickson

unread,

Dec 17, 2012, 4:21:12 PM12/17/12

to

On 12/17/12 11:15 AM, harishameed_33 wrote:
> any one plz!!!
>
>
I think you posted this twice. There are several replies to the other post.

Dick Hendrickson

Gordon Sande

unread,

Dec 17, 2012, 4:22:46 PM12/17/12

to

On 2012-12-17 16:36:02 -0400, Ivan D. Reid said:

> On Mon, 17 Dec 2012 16:04:56 -0400, Gordon Sande <Gordon...@gmail.com>
> wrote in <kantt8$q8t$1...@dont-email.me>:
>> On 2012-12-17 14:50:08 -0400, Ivan D. Reid said:
>
>>> On Mon, 17 Dec 2012 05:26:24 -0800 (PST), Arjen Markus
>>> <arjen.m...@gmail.com>
>>> wrote in <8eb979ba-ff7d-44a2...@googlegroups.com>:
>
>>>> Other solutions Glen hinted at involve special data structures, such as
>>>> Kd-trees
>>>> (IIRC). Kd-trees are a means of quickly finding points in a multidimensional
>>>> space.
>
>>> Having recently used 2d-trees to significantly speed up LHC track
>>> reconstruction seeding, I'd vote for kd-trees. My colleague who used to
>>> be a CompSci Prof in Brazil tells me it's an elementary problem he set his
>>> students; luckily I had a pre-canned class to work with. There must be
>>> oodles of sample code out there, though how much is in Fortran...?
>
>> In graphics they are known as quad-trees for their use of quadrants
>> and they sometimes use oct-trees based on octants.
>
> Hmm, you're right, I used octrees when I was looking at
> reconstruction of emulsion tracks from neutrino interactions at UCL ten or
> eleven years ago. IIRC the partitioning between them and kd-trees is
> subtly different, octrees using strictly binary partitioning while kd-trees
> usually partition on the median value.

Exactly the kind of distinction someone would make to claim that their
version is
"new improved state of the art" without bothering to explain why it
might be true
or any of the cases where it will not be true. If you have funny
hardware and are
straped for memory then that might save your hide until the next generation of
hardware or memory comes out.

>> Basic idea is not that complicated but does leave lots of room for fiddling
>> with the details to exploit features of particular problems.
>
>> kd-trees beat list intersection which was the best technique around until
>> they were developed (by folks working for SLAC). list intersection goes
>> with the database technique of inverted files.
>
> In my case, kd-trees on a single processor core beat parallelising
> the problem out to run on 16 cores simultaneously. YMMV

Unfortunately I have read too many papers of speedups due to parallelization
that turn out to be speedups of poor algorithms that can be outrun by a
good serial algorithm. But who would ever want to publish that in a parallel
journal? It is interesting that a common story is that a successful parallel
algorithm when run serially will be a minor improvement on the earlier good
serial algorithm. I have seen that on some of the things I have converted to
OpenMP and attribute it to much more elaborate control of the pieces of the
computation.

Ivan D. Reid

unread,

Dec 17, 2012, 5:47:17 PM12/17/12

to

On Mon, 17 Dec 2012 17:22:46 -0400, Gordon Sande <Gordon...@gmail.com>
wrote in <kao2f6$rqt$1...@dont-email.me>:

I don't think I'll disagree with you there. I'm a little
concerned that "our" current push to parallelise and multithread our
software loses track of Amdahl's Law -- we need to get to nearly 100%
parallelisation before we see gains. OK, the multithreading bit is about
only needing to keep one copy of our static data structures in memory, so
we reduce the memory footprint, but the one study I've seen wasted all the
CPU advantages in recombining the results from all the threads. :-/

Large-scale parallelisation is as much an art as a science, it
needs a certain mind-set. I succeeded in parallelising a
muon-spin-rotation analysis programme back in 1984 on an FPS co-processor
in a PDP-11/60 that had resisted the efforts of two highly-proficient
physicists before me. For a year or so that changed how we did
experiments (preliminary results being available before a run ended), but
eventually faster VAXen managed the same results in less time. (A
pseudo-vector 8087 library realised about a 25% speedup over serial code
on the same programme, due to keeping scalars on the stack for the whole
operation. I've not done machine-language programming post-Pentium so
I've no idea if such an approach would work as well these days.)

More recently, reconstructing digital holograms with Nvidia GPUs,
(where the forward and reverse FFTs were trivially carried out by the
stock libraries) it took me six or seven iterations at refactoring the
intermediate transfer function before I came up with a form that was
able to be re-created in the CUDA programming model. I've no doubt it could
be pushed further, but at 17 fps real-time reconstruction on my laptop,
I don't need much more speed. :-) [The original CPU-serial code takes 25
seconds per frame on the same hardware.]

e p chandler

unread,

Dec 17, 2012, 8:52:40 PM12/17/12

to

"glen herrmannsfeldt" wrote

I interpret this slightly differently, that the OP wants to collect a set of
the 10 nearest points from B for each point in A.

Brute force works reasonably well to generate the data set and to compute
the distances.
Making each random variable uniform on [-1,1], it took 40 seconds on my
system. [g95, Win7, 3GHz i5]

The time killer here is maintaining a list of the 10 lowest distances for
each point in A.
Putting the index number for each point of B and the distance into the 11th
positions of arrays, and then sorting those arrays on the distance took 263
seconds. Improving the sort did not help much.

Finally, I inserted each new distance and index number into a list
maintained in arrays. The code is more complex because it has to deal with
some special cases: lengthening the list and excluding the distance from a
full list. The code below ran in 199 sec. It's my own code and not
thoroughly tested except that the output file is the same as using a simple
sorting routine.

---- start text ---
program foo
implicit none
integer,parameter :: NA=5000,NB=1000000,NC=10
real xa(NA),ya(NA),za(NA),xb(NB),yb(NB),zb(NB)
real d(NC)
real u,x,y,z,q
integer m(NC)
integer i,j,k,n,L
real t1,t2

open(10,file='out.txt',form='formatted',status='replace')
call cpu_time(t1)

do i=1,NA
call random_number(u)
xa(i)=2*u-1
call random_number(u)
ya(i)=2*u-1
call random_number(u)
za(i)=2*u-1
end do

do i=1,NB
call random_number(u)
xb(i)=2*u-1
call random_number(u)
yb(i)=2*u-1
call random_number(u)
zb(i)=2*u-1
end do

do i=1,NA
if (mod(i,100)==0) print *,i
x=xa(i)
y=ya(i)
z=za(i)
n=0
do j=1,NB
q=(x-xb(j))**2 + (y-yb(j))**2 + (z-zb(j))**2
do k=1,NC
if (k > n) then
d(k)=q
m(k)=j
n=n+1
exit
else if (q < d(k)) then
do L=n,k+1,-1
d(L)=d(L-1)
m(L)=m(L-1)
end do
d(k)=q
m(k)=j
exit
end if
end do
end do
write(10,*) i,d,m
end do
close(10)

call cpu_time(t2)

print *,'time:',t2-t1

end program foo
---- end text ---

The brute force and simple sort took little time to code. The insert into
sorted list took longer.

--- e

glen herrmannsfeldt

unread,

Dec 17, 2012, 11:59:40 PM12/17/12

to

e p chandler <ep...@juno.com> wrote:
> "glen herrmannsfeldt" wrote
> harishameed_33 wrote:
>>> i have a set A with 5,000 (x,y,z) triples and set B with
>>> 1,000,000 (x,y,z) triples.

(snip)

>> Do you have any idea at all?

>> Most obvious, is that you have to compute the distance between each
>> point in A and each in B.

>> Well, maybe not, but that is the more obvious way.

>> You can do it a little faster computing the square of the distance.

>> There are faster ways, but for each point in A (you need a loop),
>> compute the distance (or distance squared) to each point in B
>> (another loop). You an easily store (on most modern computers)
>> all 5000 distances.

>> Now, sort them and choose the smallest ten.

> I interpret this slightly differently, that the OP wants to collect a set of
> the 10 nearest points from B for each point in A.

Yes, I forget which set was 5000 and which 1000000 when wrote that.
Still, 1000000 isn't so big for most computers these days.

> Brute force works reasonably well to generate the data set and
> to compute the distances.

> Making each random variable uniform on [-1,1], it took 40 seconds
> on my system. [g95, Win7, 3GHz i5]

> The time killer here is maintaining a list of the 10 lowest
> distances for each point in A.

> Putting the index number for each point of B and the distance into
> the 11th positions of arrays, and then sorting those arrays on the
> distance took 263 seconds. Improving the sort did not help much.

Using an O(NlogN) sort, it should be pretty fast.

What sort method did you use?

> Finally, I inserted each new distance and index number into a list
> maintained in arrays.

Since log2(1000000) is about 20, or about twice 10, so close enough
that you have to get the fine details right to tell the difference.

The sort might make a litle better use of the cache, or maybe not.

> The code is more complex because it has to deal with some special
> cases: lengthening the list and excluding the distance from a
> full list.

I would instead initialize the list with really large distances,
and expect them to fall off the list pretty soon.

> The code below ran in 199 sec. It's my own code and not
> thoroughly tested except that the output file is the same
> as using a simple sorting routine.

(snip of program example)

-- glen

michael...@compuserve.com

unread,

Dec 18, 2012, 5:46:18 AM12/18/12

to

Am Dienstag, 18. Dezember 2012 02:52:40 UTC+1 schrieb e p chandler:

Ah, I rember k-d from coding for the SFM at the ISR a long time ago. (In fact, we finished up using graphs.)

I had my own simple-minded shot at this, see below. I simply maintain an unsorted list of the 10 closest points (initialising with the first 10), and replacing the current maximum value with any new value that is less. It runs in 58s @ 1,6MHz using Intel on Windows XP (an old machine). (The code you posted took 252s.)

Regards,

Mike Metcalf

program ten_least
implicit none
integer, parameter :: n1 = 5000, n2 = 1000000, n3 = 10
real :: a(n1, 3), b(n2, 3)
type closest
real :: c
integer :: ind
end type closest
type(closest) :: points(n3)
real :: highest, dist2, t1, t2
integer :: i, j, top

call cpu_time(t1)
call random_seed()
call random_number(a)
call random_number(b)
!b = b + 2.0 ! Use these 3 lines for debugging
!b( (/ 2, 22, 2222, 3333, 4444, 5555, 66666, 77777, 88888, 99999/), : ) = &
! b( (/ 2, 22, 2222, 3333, 4444, 5555, 66666, 77777, 88888, 99999/), : ) - 1.0

do i = 1, n1
do j = 1, n2
dist2 = (a(i, 1) - b(j, 1))**2 + (a(i, 2) - b(j, 2))**2 + (a(i, 3) - b &(j, 3))**2
select case (j)
case (:n3-1)
points(j)%c = dist2
points(j)%ind = j
case(n3)
points(j)%c = dist2
points(j)%ind = j
highest = maxval(points%c)
top = maxloc(points%c, dim=1)
case default
if(dist2 >= highest) then
cycle
else
points(top)%c = dist2
points(top)%ind = j
highest = maxval(points%c)
top = maxloc(points%c, dim=1)
end if
end select
end do
if(mod(i, 1000) == 0) &
write(*, *) 'Ten closest points to a(', i, ') are ', points%ind
end do

call cpu_time(t2)
write(*, *) 'Time used in seconds: ', t2- t1
end program ten_least

michael...@compuserve.com

unread,

Dec 18, 2012, 6:37:23 AM12/18/12

to

P.S. Sorry for the spurious & in this line:

dist2 = (a(i, 1) - b(j, 1))**2 + (a(i, 2) - b(j, 2))**2 + (a(i, 3) - b (j, 3))**2
Also, after the first end do, this line might be useful:

write(7, '(10i9,/,10e9.2)') points%ind, sqrt(points%c)

e p chandler

unread,

Dec 18, 2012, 7:17:41 AM12/18/12

to

harishameed_33 wrote:
> i have a set A with 5,000 (x,y,z) triples and set B with
> 1,000,000 (x,y,z) triples.

"glen herrmannsfeldt" wrote:
> Using an O(NlogN) sort, it should be pretty fast.
> What sort method did you use?

A simple exchange sort with flag (bubble sort).
Replacing that with a comb sort multiplied the run time by 5.

> I would instead initialize the list with really large distances,
> and expect them to fall off the list pretty soon.

That did take a few seconds off the list insertion. (Already done for the
sorts.)

Here are run times in seconds. Note that I added -O3 and removed bounds
checking.

10 no sort
119 bubble up
121 bubble down
561 comb sort
46 insertion
44 insertion - pre-filled

--- e

glen herrmannsfeldt

unread,

Dec 18, 2012, 7:26:34 AM12/18/12

to

michael...@compuserve.com wrote:
> Am Dienstag, 18. Dezember 2012 02:52:40 UTC+1 schrieb e p chandler:

> Ah, I rember k-d from coding for the SFM at the ISR a long time ago. (In fact, we finished up using graphs.)

> I had my own simple-minded shot at this, see below. I simply maintain
> an unsorted list of the 10 closest points (initialising with the first 10),
> and replacing the current maximum value with any new value that is less.

One that I thought of on my first post but didn't mention. Well, at
that time I thought the 5000 and 1000000 were the other way around.

You can group the 1000000 points, such as into cubes in 3-space.
Assuming that the extent is about equal in X, Y, and Z, divide each
dimension into ten equal parts.

Now, given a point the nearest such cube, and its neighbors, most likely
will have the 10 nearest points. You can tell from the counts if that is
true or not. With a fairly small number of tests, you can figure out
which cubes to test, and find the appropriate ten. The number of bins
depends on the number of points in A and B, the dimension, and at some
point how evenly they are distributed through the space.

But the OP should first do a compare-all method, either the sort or
insert into list of 10, before trying something else.

-- glen

e p chandler

unread,

Dec 18, 2012, 8:52:40 AM12/18/12

to

michael metcalf wrote:

> dist2 = (a(i, 1) - b(j, 1))**2 + (a(i, 2) - b(j, 2))**2 + (a(i, 3) - b
> (j, 3))**2
> Also, after the first end do, this line might be useful:
> write(7, '(10i9,/,10e9.2)') points%ind, sqrt(points%c)

Thanks for posting such a nice teaching example.

Thomas Koenig

unread,

Dec 18, 2012, 1:12:04 PM12/18/12

to

e p chandler <ep...@juno.com> schrieb:

> A simple exchange sort with flag (bubble sort).
> Replacing that with a comb sort multiplied the run time by 5.

Take this piece of advice from Numerical Recipes:

# We _will_ draw the line, however, at the inefficient N^2 algorithm,
# beloved of elementary computer science texts, called _bubble sort_.
# If you know what bubble sort is, wipe it from your mind; if you
# don't know, make a point of never finding out!

(I know NR has lots of weak spots, but this piece of advice
is part of The Truth, together with their advice on evaluating
polynomials :-)

glen herrmannsfeldt

unread,

Dec 18, 2012, 3:02:08 PM12/18/12

to

Thomas Koenig <tko...@netcologne.de> wrote:
> e p chandler <ep...@juno.com> schrieb:

>> A simple exchange sort with flag (bubble sort).
>> Replacing that with a comb sort multiplied the run time by 5.

> Take this piece of advice from Numerical Recipes:

> # We _will_ draw the line, however, at the inefficient N^2 algorithm,
> # beloved of elementary computer science texts, called _bubble sort_.
> # If you know what bubble sort is, wipe it from your mind; if you
> # don't know, make a point of never finding out!

(snip)

Bubblesort was the first one I knew, and only one for some years.

It was one of the sample programs in my first Fortran book,

Well, this one is newer, and seems to have different sample programs:
http://bitsavers.trailing-edge.com/pdf/ibm/370/fortran/GC28-6515-11_IBM_System360_and_System370_FORTRAN_IV_Language_Sep83.pdf

In high school, I wrote a program to print a big list of Pythagorean
triples, sorted. It isn't hard to make the list. I don't remember
now if I managed to remove multiples, such as 6, 8, 10, but did
at least remove duplicates. It took a loooooong time to sort.

More specifically:

For integer M and N, (M**2-N**2), 2*M*N, and (M**2+N**2) form
a Pythagorean triple. If you are careful at selecting M and N,
you might removes some of the redundancy, but I didn't do that.

-- glen

e p chandler

unread,

Dec 18, 2012, 7:26:38 PM12/18/12

to

e p chandler wrote:

> A simple exchange sort with flag (bubble sort).
> Replacing that with a comb sort multiplied the run time by 5.

Thomas Koenig replied:

>Take this piece of advice from Numerical Recipes:

># We _will_ draw the line, however, at the inefficient N^2 algorithm,
># beloved of elementary computer science texts, called _bubble sort_.
># If you know what bubble sort is, wipe it from your mind; if you
># don't know, make a point of never finding out!

>(I know NR has lots of weak spots, but this piece of advice
>is part of The Truth, together with their advice on evaluating
>polynomials :-)

After some additional refinements, using a simple exchange sort *is*
justifiable.
If the sort is done in the downward direction, the new value sifts down to
its final position.
Only one pass is needed. That gets rid of the flag variable and a DO loop.

Now a run takes 51 sec on my system.
An additional step saves even more time. Since the list is sorted, skip the
sift down if the new value is .GE. the end of the list.
Now a run takes 13 sec on my system.

While I agree in general that a bubble sort is a bad idea, in this special
case it's really not so bad at all.

Here is a copy of the inner loops of the program:
---- start text ----
d=100
m=-1

do j=1,NB
q=(x-xb(j))**2 + (y-yb(j))**2 + (z-zb(j))**2

if (q >= d(NC)) cycle
d(NC+1)=q
m(NC+1)=j
do k=NC,1,-1
if (d(k) > d(k+1)) then
t=d(k)
d(k)=d(k+1)
d(k+1)=t
n=m(k)
m(k)=m(k+1)
m(k+1)=n

end if
end do
end do

---- end text ----
Of course I recognize Michael Metcalf's solution as being better than this
one.
--- e

Gordon Sande

unread,

Dec 18, 2012, 7:52:43 PM12/18/12

to

On 2012-12-18 20:26:38 -0400, e p chandler said:

> e p chandler wrote:
>
>> A simple exchange sort with flag (bubble sort).
>> Replacing that with a comb sort multiplied the run time by 5.
>
> Thomas Koenig replied:
>
>> Take this piece of advice from Numerical Recipes:
>
>> # We _will_ draw the line, however, at the inefficient N^2 algorithm,
>> # beloved of elementary computer science texts, called _bubble sort_.
>> # If you know what bubble sort is, wipe it from your mind; if you
>> # don't know, make a point of never finding out!
>
>> (I know NR has lots of weak spots, but this piece of advice
>> is part of The Truth, together with their advice on evaluating
>> polynomials :-)
>
> After some additional refinements, using a simple exchange sort *is*
> justifiable.
> If the sort is done in the downward direction, the new value sifts down
> to its final position.
> Only one pass is needed. That gets rid of the flag variable and a DO loop.

Because it is a very highly ordered array. Some otherwise poor methods
shine under
such strong input asssunptions and some very good highly robust methods
completely
miss such structured input. Sorting is highly sensitive to its assumptions!

Quicksort looks golden on sorted arrays while treesort sucks. But
treesort has very
predicatble run times but quicksort does not on other data arrays.
Assumptions matter
and you get what you pay for.

e p chandler

unread,

Dec 18, 2012, 7:54:06 PM12/18/12

to

glen herrmannsfeldt wrote:

> Bubblesort was the first one I knew, and only one for some years.
> It was one of the sample programs in my first Fortran book,

[snip]

> In high school, I wrote a program to print a big list of Pythagorean
> triples, sorted. It isn't hard to make the list. I don't remember
> now if I managed to remove multiples, such as 6, 8, 10, but did
> at least remove duplicates. It took a loooooong time to sort.

> More specifically:

> For integer M and N, (M**2-N**2), 2*M*N, and (M**2+N**2) form
> a Pythagorean triple. If you are careful at selecting M and N,
> you might removes some of the redundancy, but I didn't do that.

M < N plus M & N relatively prime?

glen herrmannsfeldt

unread,

Dec 18, 2012, 8:35:45 PM12/18/12

to

e p chandler <ep...@juno.com> wrote:

(snip on bubblesort and Pythagorean triples)

>> For integer M and N, (M**2-N**2), 2*M*N, and (M**2+N**2) form
>> a Pythagorean triple. If you are careful at selecting M and N,
>> you might removes some of the redundancy, but I didn't do that.

> M < N plus M & N relatively prime?

Actually, M>N (unless you want negative values).

I think, though, that you still need another test, though.

Consider the first two, M=2, N=1 (3, 4, 5) and
M=3, N=1 (8, 6, 10) where in both M and N are relatively prime,
but, in increasing order, (6, 8, 10) is twice (3, 4, 5).

I might have even not removed those until after sorting,
but I believe I did remove them. That was not so long after
I had a running GCD routine, translated from S/360 assembly
to some high-level language.

So, the first non-relative primes, M=4, N=2, gives (12, 16, 20),
so four times (3, 4, 5).

-- glen

glen herrmannsfeldt

unread,

Dec 18, 2012, 8:43:53 PM12/18/12

to

Gordon Sande <Gordon...@gmail.com> wrote:

(snip)

> Because it is a very highly ordered array. Some otherwise poor
> methods shine under such strong input asssunptions and some
> very good highly robust methods completely miss such structured
> input. Sorting is highly sensitive to its assumptions!

> Quicksort looks golden on sorted arrays while treesort sucks.
> But treesort has very predicatble run times but quicksort does
> not on other data arrays.

The original quicksort, using the first element as the pivot
element, is O(N**2) for already sorted data.

Better choices are to randomly select the pivot, or use the median
of the first, middle, and last.

Also, when doing the recursion (with or without actual recursive
calls) don't sort blocks smaller than some size, such as 10.
Then follow with an insertion sort on the whole array.

> Assumptions matter and you get what you pay for.

-- glen

Robin Vowels

unread,

Dec 18, 2012, 9:13:59 PM12/18/12

to

On Dec 19, 11:26 am, "e p chandler" <e...@juno.com> wrote:

> After some additional refinements, using a simple exchange sort *is*
> justifiable.
> If the sort is done in the downward direction, the new value sifts down to
> its final position.
> Only one pass is needed. That gets rid of the flag variable and a DO loop.

It's then like a ranking sort or insertion sort.
In which case some elements need only to be moved,
and the new element slipped into position.

Robin Vowels

unread,

Dec 18, 2012, 9:21:45 PM12/18/12

to

On Dec 19, 5:12 am, Thomas Koenig <tkoe...@netcologne.de> wrote:

> Take this piece of advice from Numerical Recipes:
>
> # We _will_ draw the line, however, at the inefficient N^2 algorithm,
> # beloved of elementary computer science texts, called _bubble sort_.
> # If you know what bubble sort is, wipe it from your mind; if you
> # don't know, make a point of never finding out!

When few values are involved, the humble bubble sort is good;
it's a short procedure. It's companion, Cocktail-shaker sort,
is better as it eliminates the worst case.

If a list is already almost sorted, bubble sort is good.

Gordon Sande

unread,

Dec 19, 2012, 9:06:15 AM12/19/12

to

On 2012-12-18 21:43:53 -0400, glen herrmannsfeldt said:

> Gordon Sande <Gordon...@gmail.com> wrote:
>
> (snip)
>> Because it is a very highly ordered array. Some otherwise poor
>> methods shine under such strong input asssunptions and some
>> very good highly robust methods completely miss such structured
>> input. Sorting is highly sensitive to its assumptions!
>
>> Quicksort looks golden on sorted arrays while treesort sucks.
>> But treesort has very predicatble run times but quicksort does
>> not on other data arrays.
>
> The original quicksort, using the first element as the pivot
> element, is O(N**2) for already sorted data.

Thanks. Makes sense. I was assuming the common quicksort (there are many
variants as you point out) was using the middle element. The fact of
interest is that quicksort has a good average timeing but can have a
bad worst case timeing. One hopes the off the shelf quicksort will be a
variant that does well on sorted input.

James Van Buskirk

unread,

Dec 21, 2012, 6:19:37 AM12/21/12

to

"e p chandler" <ep...@juno.com> wrote in message
news:kar1k7$7b9$1...@dont-email.me...

>e p chandler wrote:

>> A simple exchange sort with flag (bubble sort).
>> Replacing that with a comb sort multiplied the run time by 5.

> Thomas Koenig replied:

>>Take this piece of advice from Numerical Recipes:

>># We _will_ draw the line, however, at the inefficient N^2 algorithm,
>># beloved of elementary computer science texts, called _bubble sort_.
>># If you know what bubble sort is, wipe it from your mind; if you
>># don't know, make a point of never finding out!

>>(I know NR has lots of weak spots, but this piece of advice
>>is part of The Truth, together with their advice on evaluating
>>polynomials :-)

> After some additional refinements, using a simple exchange sort *is*
> justifiable.

The O(N**2) may here be applied to sweeping through arrays xb, yb, and zb
5000 times. With an octree, nowhere near as much access to these arrays
will be necessary.

Actually your solution and Michael Metcalf's are indistinguishable.
Consider that the top ten array is only going to be updated about
100-120 times per [x,y,z] value here, while NB = 1000000, and that
the top ten array (here d(NC) and m(NC)) isn't very large so
the difference in run times is negligible. Using either your
method or Michael Metcalf's or a heap to represent the top ten
array resulted in a run time of about 15 seconds on my Core 2 Duo,
provided I split the inner loop into two parts, one to fill up the
top ten data structure and one to handle the general case so as to
remove a conditional from the inner loop.

Since all the time is spent on the two lines

> q=(x-xb(j))**2 + (y-yb(j))**2 + (z-zb(j))**2
> if (q >= d(NC)) cycle

I tried writing these in assembly language:

C:\gfortran\clf\neighbor>type inner.asm
format MS64 coff

section 'CODE' code readable executable align 16
align 16
;interface
; subroutine
inner(xArray,yArray,zArray,xyz,size,heap,heapsize,work,firstindex
,callback) bind(C,name='inner')
; use ISO_C_BINDING
; implicit none
; integer(C_SIZE_T), value :: size ! Size in units of 4 REALs
; real(C_FLOAT) xArray(size) ! Aligned
; real(C_FLOAT) yArray(size) ! Aligned
; real(C_FLOAT) zArray(size) ! Aligned
; real(C_FLOAT) xyz(3)
; integer(C_SIZE_T), value :: heapsize
; type, bind(C) :: heap_entry
; real(C_FLOAT) value
; integer(C_INT) index
; end type heap_entry
; type(heap_entry) heap(heapsize) ! Aligned
; real(C_FLOAT) work(4) ! Aligned
; integer(C_SIZE_T), value :: firstindex
; interface
; function callback(compares,heap,heapsize,firstindex) bind(C)
; import
; implicit none
; real(C_FLOAT) callback
; real(C_FLOAT) compares(4)
; integer(C_SIZE_T), value :: heapsize
; type(heap_entry) heap(heapsize)
; integer(C_SIZE_T), value :: firstindex
; end function callback
; end interface
; end subroutine inner
;end interface

public inner
inner:
sub rsp, 120
movaps [rsp+32], xmm7
movaps [rsp+48], xmm8
movaps [rsp+64], xmm9
mov [rsp+88], r15
mov [rsp+104], rdi
mov [rsp+112], rsi
mov [rsp+80], rbx
movaps [rsp+128], xmm6
mov rax, [rsp+160] ; size
shl rax, 4
lea rbx, [rcx+rax] ; xArray
lea rsi, [rdx+rax] ; yArray
lea rdi, [r8+rax] ; zArray
movups xmm6, [r9] ; xyz
pshufd xmm7, xmm6, 55h ; y
pshufd xmm8, xmm6, 0AAh ; z
pshufd xmm6, xmm6, 00h ; x
mov rcx, [rsp+168] ; heap
movss xmm0, [rcx]
pshufd xmm9, xmm0, 00h ; compare value
mov r15, [rsp+192] ; firstindex
lea r15, [rax+4*r15]
neg rax
align 16
main_loop:
movaps xmm0, [rax+rbx]
movaps xmm1, [rax+rsi]
movaps xmm2, [rax+rdi]
;jmp cleanup
subps xmm0, xmm6
subps xmm1, xmm7
subps xmm2, xmm8
mulps xmm0, xmm0
mulps xmm1, xmm1
mulps xmm2, xmm2
addps xmm0, xmm1
addps xmm0, xmm2
movaps xmm1, xmm0
cmpltps xmm0, xmm9
pmovmskb ecx, xmm0
test ecx, ecx
jnz insert
cleanup:
add rax, 16
js main_loop
movaps xmm7, [rsp+32]
movaps xmm8, [rsp+48]
movaps xmm9, [rsp+64]
mov r15, [rsp+88]
mov rdi, [rsp+104]
mov rsi, [rsp+112]
mov rbx, [rsp+80]
movaps xmm6, [rsp+128]
add rsp, 120
ret
insert:
mov [rsp+96], rax
mov rcx, [rsp+184] ; work
movaps [rcx], xmm1
mov rdx, [rsp+168] ; heap
mov r8, [rsp+176] ; heapsize
lea r9, [rax+r15]; firstindex
shr r9, 2
mov rax, [rsp+200] ; callback
call rax
pshufd xmm9, xmm0, 00h
mov rax, [rsp+96]
jmp cleanup

C:\gfortran\clf\neighbor>type heap3.f90
module funcs
use ISO_C_BINDING
implicit none
private
public inner, callback
type, bind(C), public :: heap_entry
real(C_FLOAT) value
integer(C_INT) index
end type heap_entry
interface
recursive subroutine
inner(xArray,yArray,zArray,xyz,size,heap,heapsize, &
work,firstindex,callback) bind(C,name='inner')
import
implicit none
integer(C_SIZE_T), value :: size ! Size in units of 4 REALs
real(C_FLOAT) xArray(size) ! Aligned
real(C_FLOAT) yArray(size) ! Aligned
real(C_FLOAT) zArray(size) ! Aligned
real(C_FLOAT) xyz(3)
integer(C_SIZE_T), value :: heapsize
type(heap_entry) heap(heapsize) ! Aligned
real(C_FLOAT) work(4) ! Aligned
integer(C_SIZE_T), value :: firstindex
interface
recursive function callback(compares,heap,heapsize,firstindex)
bind(
C)
import
implicit none
real(C_FLOAT) callback
real(C_FLOAT) compares(4)
integer(C_SIZE_T), value :: heapsize
type(heap_entry) heap(heapsize)
integer(C_SIZE_T), value :: firstindex
end function callback
end interface
end subroutine inner
end interface
contains
recursive function callback(compares,heap,heapsize,firstindex) bind(C)
real(C_FLOAT) callback
real(C_FLOAT) compares(4)
integer(C_SIZE_T), value :: heapsize
type(heap_entry) heap(heapsize)
integer(C_SIZE_T), value :: firstindex
integer i
integer current, left, right, child

do i = 1, 4
if(compares(i) < heap(1)%value) then
heap(1) = heap_entry(compares(i),firstindex+i-1)
current = 1
left = 2*current
right = 2*current+1
do while(left <= heapsize)
child = left
if(right <= heapsize) then
if(heap(right)%value > heap(left)%value) then
child = right
end if
end if
if(heap(child)%value > heap(current)%value) then
heap([current,child]) = heap([child,current])
current = child
left = 2*current
right = 2*current+1
else
exit
end if
end do
end if
end do
callback = heap(1)%value
end function callback
end module funcs

program test
use funcs
use ISO_C_BINDING

implicit none
integer, parameter :: n1 = 5000, n2 = 1000000, n3 = 10
real a(n1, 3)

real, target :: b(n2,3)
type(heap_entry) points(n3)
real dist2, t1, t2
integer i, j
integer current, parent, left, right, child
integer(C_INTPTR_T) address
real xyz(3)
real, target :: work(7)
real, pointer :: pwork(:)

open(10,file='test.dat',status='replace')
call cpu_time(t1)

call random_seed()
call random_number(a)
call random_number(b)

address = transfer(C_LOC(work(1)),address)
pwork => work(1+IAND(-int(address),3):)

do i = 1, n1

xyz = a(i,:)
do j = 1, n3
! dist2 = sum((a(i,:)-b(j,:))**2)
dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
points(j) = heap_entry(dist2,j)
current = j
do while(current > 1)
parent = current/2
if(points(current)%value > points(parent)%value) then
points([current,parent]) = points([parent,current])
current = parent
else
exit

end if
end do
end do

address = transfer(C_LOC(b(j,1)),address)/4
do j = j, j+IAND(-int(address),3)-1
! dist2 = sum((a(i,:)-b(j,:))**2)
dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
if(dist2 < points(1)%value) then
points(1) = heap_entry(dist2,j)
current = 1
left = 2*current
right = 2*current+1
do while(left <= n3)
child = left
if(right <= n3) then
if(points(right)%value > points(left)%value) then
child = right
end if
end if
if(points(child)%value > points(current)%value) then
points([current,child]) = points([child,current])
current = child
left = 2*current
right = 2*current+1
else
exit
end if
end do
end if
end do
call inner(b(j,1),b(j,2),b(j,3),xyz,int((n2-j+1)/4,C_SIZE_T), &
points,int(n3,C_SIZE_T),pwork,int(j,C_SIZE_T),callback)
do j = j+(n2-j+1)/4*4, n2
! dist2 = sum((a(i,:)-b(j,:))**2)
dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
if(dist2 < points(1)%value) then
points(1) = heap_entry(dist2,j)
current = 1
left = 2*current
right = 2*current+1
do while(left <= n3)
child = left
if(right <= n3) then
if(points(right)%value > points(left)%value) then
child = right
end if
end if
if(points(child)%value > points(current)%value) then
points([current,child]) = points([child,current])
current = child
left = 2*current
right = 2*current+1
else
exit
end if
end do
end if
end do
if(mod(i,1000) == 0) then
write(*,'(a,i0,a,10(1x,i0))') 'Ten closest points to a(', i, ') are
',
points%index
write(*,'(a,10(1x,f0.4))') 'Distances are ', sqrt(points%value)
end if
end do
call cpu_time(t2)
write(*,'(a,f0.2)') 'Time used in seconds: ', t2-t1
end program test

C:\gfortran\clf\neighbor>fasm inner.asm
flat assembler version 1.70 (1578954 kilobytes memory)
2 passes, 400 bytes.

C:\gfortran\clf\neighbor>gfortran -O2 heap3.f90 inner.obj -oheap3

C:\gfortran\clf\neighbor>heap3
Ten closest points to a(1000) are 439860 730552 732299 114619 474475 381987
350
023 276078 492099 535861
Distances are .0115 .0108 .0109 .0106 .0070 .0048 .0069 .0100 .0061 .0034
Ten closest points to a(2000) are 402266 755814 950446 685965 749339 992119
511
450 523805 791419 312908
Distances are .0138 .0136 .0130 .0113 .0081 .0104 .0129 .0087 .0113 .0072
Ten closest points to a(3000) are 312774 409874 983250 37724 760385 636340
4280
8 601657 149218 481998
Distances are .0118 .0106 .0117 .0105 .0077 .0082 .0089 .0070 .0101 .0076
Ten closest points to a(4000) are 493 322619 67245 578194 552804 439767
858292
460116 725680 296431
Distances are .0132 .0130 .0105 .0126 .0111 .0083 .0056 .0118 .0120 .0093
Ten closest points to a(5000) are 137180 180546 116382 953344 694621 570112
664
350 831770 396725 180879
Distances are .0129 .0106 .0123 .0095 .0104 .0106 .0089 .0061 .0088 .0036
Time used in seconds: 10.28

So even though my SSE code is vectorized and gfortran's compiler output is
probably not, I'm only getting a 33% speedup. The reason for this can be
seen if I uncomment the jmp cleanup instruction in the inner loop:

main_loop:
movaps xmm0, [rax+rbx]
movaps xmm1, [rax+rsi]
movaps xmm2, [rax+rdi]
jmp cleanup ; Skip all calculation!
subps xmm0, xmm6
subps xmm1, xmm7
subps xmm2, xmm8
mulps xmm0, xmm0
mulps xmm1, xmm1
mulps xmm2, xmm2
addps xmm0, xmm1
addps xmm0, xmm2
movaps xmm1, xmm0
cmpltps xmm0, xmm9
pmovmskb ecx, xmm0
test ecx, ecx
jnz insert
cleanup:
add rax, 16
js main_loop

In that case, no calculation will get done but all the 60 GB of memory
are read into the processor, so the program takes 9.61 seconds. Thus
we may conclude that pretty much all the time is taken up in
transferring data from memory and almost none in useful calculation.
Going to octrees should make this exercise almost instantaneous, but
I don't have any octree code handy.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Thomas Koenig

unread,

Dec 21, 2012, 9:17:28 AM12/21/12

to

e p chandler <ep...@juno.com> schrieb:

> e p chandler wrote:
>
>> A simple exchange sort with flag (bubble sort).
>> Replacing that with a comb sort multiplied the run time by 5.
>
> Thomas Koenig replied:
>
>>Take this piece of advice from Numerical Recipes:
>
>># We _will_ draw the line, however, at the inefficient N^2 algorithm,
>># beloved of elementary computer science texts, called _bubble sort_.
>># If you know what bubble sort is, wipe it from your mind; if you
>># don't know, make a point of never finding out!
>
>>(I know NR has lots of weak spots, but this piece of advice
>>is part of The Truth, together with their advice on evaluating
>>polynomials :-)
>
> After some additional refinements, using a simple exchange sort *is*
> justifiable.

[...]

> While I agree in general that a bubble sort is a bad idea, in this special
> case it's really not so bad at all.

Insertion sort or selection sort are generally thought to be
superior, if your problem is small enough and you want the reduced
complexity.

Maybe Knuth will convince you (indirect quote from the Wikipedia
article on bubble sort):

# the bubble sort seems to have nothing to recommend it, except a catchy
# name and the fact that it leads to some interesting theoretical
# problems

mecej4

unread,

Dec 21, 2012, 1:41:34 PM12/21/12

to

On 12/18/2012 6:26 PM, e p chandler wrote:
> q=(x-xb(j))**2 + (y-yb(j))**2 + (z-zb(j))**2
> if (q >= d(NC)) cycle

You can speed the code another 20 percent by replacing these two lines
with the following:

q=(x-xb(j))**2

if (q >= d(NC)) cycle

q = q + (y-yb(j))**2

if (q >= d(NC)) cycle

q = q + (z-zb(j))**2

if (q >= d(NC)) cycle

I think that many of the comments about bubble-sort, etc., appear to
have not taken into account the fact that FULL sorting is NOT being
attempted in your code.

All that is necessary is to build and maintain an array or list of 10
items, optionally sorted. More items are compared to the list and, if
the list does not get updated, that new item is simply discarded. If
updating is necessary, more work is needed, but the updated list has
never more than 10 items.

-- mecej4

James Van Buskirk

unread,

Dec 21, 2012, 3:38:10 PM12/21/12

to

"mecej4" <mec...@NOSPAM.operamail.com> wrote in message
news:50D4AD5E...@NOSPAM.operamail.com...

> On 12/18/2012 6:26 PM, e p chandler wrote:
>> q=(x-xb(j))**2 + (y-yb(j))**2 + (z-zb(j))**2
>> if (q >= d(NC)) cycle
> You can speed the code another 20 percent by replacing these two lines
> with the following:

> q=(x-xb(j))**2
> if (q >= d(NC)) cycle
> q = q + (y-yb(j))**2
> if (q >= d(NC)) cycle
> q = q + (z-zb(j))**2
> if (q >= d(NC)) cycle

At first I didn't think this comment was very insightful, but then I
realized that it could reduce the amount of memory throughput
required significantly, so I tried this with my assembly version
and saw some improvement. I suppose that means that the above comment
was VERY insightful, and pointed the way to new possibilities.
I changed the inner loop as follows:

main_loop:
movaps xmm0, [rax+rbx]
subps xmm0, xmm6
mulps xmm0, xmm0
movaps xmm3, xmm0

cmpltps xmm0, xmm9
pmovmskb ecx, xmm0
test ecx, ecx

jz cleanup
movaps xmm1, [rax+rsi]
subps xmm1, xmm7
mulps xmm1, xmm1
addps xmm3, xmm1
movaps xmm0, xmm3
cmpltps xmm3, xmm9
pmovmskb ecx, xmm3
test ecx, ecx
jz cleanup
movaps xmm2, [rax+rdi]
subps xmm2, xmm8
mulps xmm2, xmm2

addps xmm0, xmm2
movaps xmm1, xmm0
cmpltps xmm0, xmm9
pmovmskb ecx, xmm0
test ecx, ecx
jnz insert
cleanup:
add rax, 16
js main_loop

And execution time went from 10.28 seconds to 8.88 seconds, compared
to about 15 seconds for the pure Fortran versions. So it did help,
but the y-component still gets loaded into memory a lot because it
only helps if you can skip a whole cache line. In the meantime,
those extra tests add latency to the operation and can create hard
to predict branches. Accordingly, I commented out the first test:

main_loop:
movaps xmm0, [rax+rbx]
subps xmm0, xmm6
mulps xmm0, xmm0
movaps xmm3, xmm0

;cmpltps xmm0, xmm9
;pmovmskb ecx, xmm0
;test ecx, ecx

;jz cleanup
movaps xmm1, [rax+rsi]
subps xmm1, xmm7
mulps xmm1, xmm1
addps xmm3, xmm1
movaps xmm0, xmm3
cmpltps xmm3, xmm9
pmovmskb ecx, xmm3
test ecx, ecx
jz cleanup
movaps xmm2, [rax+rdi]
subps xmm2, xmm8
mulps xmm2, xmm2

addps xmm0, xmm2
movaps xmm1, xmm0
cmpltps xmm0, xmm9
pmovmskb ecx, xmm0
test ecx, ecx
jnz insert
cleanup:
add rax, 16
js main_loop

And my run time went down to 6.78 seconds. This way I can save at
best only 1/3 of memory access, but trying to save the second third
was more or less futile anyway. Therefore I recommend that you try
commenting out the first if (q >= d(NC)) cycle statement and see
if that helps in your case as well.

> I think that many of the comments about bubble-sort, etc., appear to have
> not taken into account the fact that FULL sorting is NOT being attempted
> in your code.

> All that is necessary is to build and maintain an array or list of 10
> items, optionally sorted. More items are compared to the list and, if the
> list does not get updated, that new item is simply discarded. If updating
> is necessary, more work is needed, but the updated list has never more
> than 10 items.

As I pointed out in my previous post, this brute force method is so
slow compared to the performance I would expect for an octree
approach that the details of how the top ten list is maintained
don't matter.

mecej4

unread,

Dec 21, 2012, 6:30:58 PM12/21/12

to

On 12/21/2012 2:38 PM, James Van Buskirk wrote:
<--SNIP-->

> In the meantime,
> those extra tests add latency to the operation and can create hard
> to predict branches. Accordingly, I commented out the first test:
>

<--SNIP assembly code-->

>
> And my run time went down to 6.78 seconds. This way I can save at
> best only 1/3 of memory access, but trying to save the second third
> was more or less futile anyway. Therefore I recommend that you try
> commenting out the first if (q >= d(NC)) cycle statement and see
> if that helps in your case as well.

Your conjecture was correct. The run went about ten percent faster when
the first short-cycle was commented out.

However, don't you think that all this is quite data-dependent? After
all, we are using random numbers in our codes, and their distribution
may be nothing similar to the real data that the O.P. has to contend with.

-- mecej4

michael...@compuserve.com

unread,

Dec 22, 2012, 4:56:59 AM12/22/12

to

I tried a slightly different approach. After

do j = 1, n2

in my program, I added:

select case (j)
case (n3+1:)
if(abs(a(i, 1) - b(j, 1)) > hr .or. abs(a(i, 2) - b(j, 2)) > hr .or. &
abs(a(i, 3) - b(j, 3)) > hr ) cycle
end select

where hr is sqrt(highest). This reduced the time from 60 to 45 seconds. Using this alternative

if(any ( (/abs(a(i, 1) - b(j, 1)), abs(a(i, 2) - b(j, 2)), &
abs(a(i, 3) - b(j, 3)) /) > hr )) cycle

disappointingly increased the time to 126s.

Regards,

Mike Metcalf

mecej4

unread,

Dec 22, 2012, 8:10:09 AM12/22/12

to

On 12/22/2012 3:56 AM, michael...@compuserve.com wrote:
> <----SNIP---->

> select case (j)
> case (n3+1:)
> if(abs(a(i, 1) - b(j, 1)) > hr .or. abs(a(i, 2) - b(j, 2)) > hr .or. &
> abs(a(i, 3) - b(j, 3)) > hr ) cycle
> end select
>
> where hr is sqrt(highest). This reduced the time from 60 to 45 seconds.

> <----SNIP---->

Changing the firing order of the comparisons to 3,2,1 from the order
used above (1,2,3) may give a slight increase in speed, as James Van
Buskirk has pointed out elsewhere in this thread. Is this a result of a
quirk in the RNG used to generate the data, I wonder!

This test for short-circuiting the loop should come _before_ the
computation of DIST2.

-- mecej4

michael...@compuserve.com

unread,

Dec 22, 2012, 12:51:13 PM12/22/12

to

Am Samstag, 22. Dezember 2012 14:10:09 UTC+1 schrieb mecej4:
> Changing the firing order of the comparisons to 3,2,1 from the order used >above (1,2,3) may give a slight increase in speed, as James Van Buskirk has >pointed out elsewhere in this thread. Is this a result of a quirk in the RNG >used to generate the data, I wonder! This test for short-circuiting the loop >should come _before_ the computation of DIST2. -- mecej4

The change makes no perceptible difference.

Mike Metcalf

James Van Buskirk

unread,

Dec 22, 2012, 2:48:07 PM12/22/12

to

<michael...@compuserve.com> wrote in message
news:139ed5d9-97ac-4c8f...@googlegroups.com...

Furthermore, I can't see where I or anyone else made this assertion.

James Van Buskirk

unread,

Dec 22, 2012, 3:08:21 PM12/22/12

to

"mecej4" <mec...@NOSPAM.operamail.com> wrote in message

news:50D4F132...@NOSPAM.operamail.com...

Yes, this may be data-dependent. For comparison, the program without
this last optimization:

C:\gfortran\clf\neighbor>heap

Ten closest points to a(1000) are 439860 730552 732299 114619 474475 381987
350
023 276078 492099 535861
Distances are .0115 .0108 .0109 .0106 .0070 .0048 .0069 .0100 .0061 .0034
Ten closest points to a(2000) are 402266 755814 950446 685965 749339 992119
511
450 523805 791419 312908
Distances are .0138 .0136 .0130 .0113 .0081 .0104 .0129 .0087 .0113 .0072
Ten closest points to a(3000) are 312774 409874 983250 37724 760385 636340
4280
8 601657 149218 481998
Distances are .0118 .0106 .0117 .0105 .0077 .0082 .0089 .0070 .0101 .0076
Ten closest points to a(4000) are 493 322619 67245 578194 552804 439767
858292
460116 725680 296431
Distances are .0132 .0130 .0105 .0126 .0111 .0083 .0056 .0118 .0120 .0093
Ten closest points to a(5000) are 137180 180546 116382 953344 694621 570112
664
350 831770 396725 180879
Distances are .0129 .0106 .0123 .0095 .0104 .0106 .0089 .0061 .0088 .0036

Time used in seconds: 15.64

With the optimization:

C:\gfortran\clf\neighbor>type heap5.f90
program test

implicit none
integer, parameter :: n1 = 5000, n2 = 1000000, n3 = 10

real a(n1, 3), b(n2,3)
type closest
real d2
integer index
end type closest
type(closest) points(n3)

real dist2, t1, t2
integer i, j
integer current, parent, left, right, child

call cpu_time(t1)
call random_seed()
call random_number(a)
call random_number(b)

do i = 1, n1

do j = 1, n3

dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
points(j) = closest(dist2,j)

current = j
do while(current > 1)
parent = current/2

if(points(current)%d2 > points(parent)%d2) then

points([current,parent]) = points([parent,current])
current = parent
else
exit

end if
end do
end do

do j = j, n2
! dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2
if(dist2 >= points(1)%d2) cycle
dist2 = dist2+(a(i,3)-b(j,3))**2
if(dist2 < points(1)%d2) then
points(1) = closest(dist2,j)

current = 1
left = 2*current
right = 2*current+1
do while(left <= n3)
child = left
if(right <= n3) then

if(points(right)%d2 > points(left)%d2) then

child = right
end if
end if

if(points(child)%d2 > points(current)%d2) then

points([current,child]) = points([child,current])
current = child
left = 2*current
right = 2*current+1
else
exit

end if
end do

end if
end do
if(mod(i,1000) == 0) then
write(*,'(a,i0,a,10(1x,i0))') 'Ten closest points to a(', i, ') are
',
points%index

write(*,'(a,10(1x,f0.4))') 'Distances are ', sqrt(points%d2)

end if
end do
call cpu_time(t2)
write(*,'(a,f0.2)') 'Time used in seconds: ', t2-t1
end program test

C:\gfortran\clf\neighbor>gfortran -O2 heap5.f90 -oheap5

C:\gfortran\clf\neighbor>heap5

Ten closest points to a(1000) are 439860 730552 732299 114619 474475 381987
350
023 276078 492099 535861
Distances are .0115 .0108 .0109 .0106 .0070 .0048 .0069 .0100 .0061 .0034
Ten closest points to a(2000) are 402266 755814 950446 685965 749339 992119
511
450 523805 791419 312908
Distances are .0138 .0136 .0130 .0113 .0081 .0104 .0129 .0087 .0113 .0072
Ten closest points to a(3000) are 312774 409874 983250 37724 760385 636340
4280
8 601657 149218 481998
Distances are .0118 .0106 .0117 .0105 .0077 .0082 .0089 .0070 .0101 .0076
Ten closest points to a(4000) are 493 322619 67245 578194 552804 439767
858292
460116 725680 296431
Distances are .0132 .0130 .0105 .0126 .0111 .0083 .0056 .0118 .0120 .0093
Ten closest points to a(5000) are 137180 180546 116382 953344 694621 570112
664
350 831770 396725 180879
Distances are .0129 .0106 .0123 .0095 .0104 .0106 .0089 .0061 .0088 .0036

Time used in seconds: 12.75

But another optimization suggests itself: if you look at the range of
distances that makes the cut, only about 1/40 of the range of the
b(:,1) array section is close enough. This means that if we sort the
b array according to the values in its first column, this range will
be consecutive and if we can make a reasonable guess as to the
start of the range, 39/40 of the the problem will go away. In the
following code also the a array is sorted according to the values
in its first column which allows us to make the guess as the first
entry that made the cut in the previous iteration and also avoids
randomly hopping around the b array so that the array elements
required can remain in cache. Of course now I spend more time
sorting the input arrays than I do searching for the nearest points,
but the overall time still is greatly reduced:

C:\gfortran\clf\neighbor>type heap4.f90
module funcs
implicit none
contains
subroutine heapsort(array, indices, N)
integer, intent(in) :: N
real array(N,3)
integer indices(N)
integer i
integer current, parent
integer left, right, child

indices = [(i,i=1,N)]
do i = 1, N
current = i

do while(current > 1)
parent = current/2

if(array(current,1) <= array(parent,1)) exit
array([current,parent],:) = array([parent,current],:)
indices([current,parent]) = indices([parent,current])
current = parent
end do
end do
do i = N, 1, -1
array([i,1],:) = array([1,i],:)
indices([i,1]) = indices([1,i])

current = 1
left = 2*current

do while(left < i)
child = left
right = left+1
if(right < i) then
if(array(right,1) > array(left,1)) child = right
end if
if(array(child,1) <= array(current,1)) exit
array([child,current],:) = array([current,child],:)
indices([child,current]) = indices([current,child])

current = child
left = 2*current

end do
end do
end subroutine heapsort

end module funcs

program test
use funcs

implicit none
integer, parameter :: n1 = 5000, n2 = 1000000, n3 = 10

real a(n1, 3), b(n2,3)
integer aIndex(n1), bIndex(n2)
type closest
real d2
integer index
end type closest
type(closest) points(n3)

real dist2, t1, t2
integer i, j
integer current, parent, left, right, child

integer firstindex

call cpu_time(t1)
call random_seed()
call random_number(a)
call random_number(b)

call heapsort(a,aIndex,n1)
call heapsort(b,bIndex,n2)
firstindex = 1

do i = 1, n1

do j = firstindex, firstindex+n3-1
dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
points(j-firstindex+1) = closest(dist2,j)
current = j-firstindex+1

do while(current > 1)
parent = current/2

if(points(current)%d2 <= points(parent)%d2) exit

points([current,parent]) = points([parent,current])
current = parent

end do
end do
do j = j, n2
! dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
dist2 = (a(i,1)-b(j,1))**2
if(dist2 >= points(1)%d2) exit
dist2 = dist2+(a(i,2)-b(j,2))**2
! if(dist2 >= points(1)%d2) cycle
dist2 = dist2+(a(i,3)-b(j,3))**2
if(dist2 < points(1)%d2) then
points(1) = closest(dist2,j)

current = 1
left = 2*current

do while(left <= n3)
child = left

right = left+1
if(right <= n3) then
if(points(right)%d2 > points(left)%d2) then

child = right
end if
end if

if(points(child)%d2 <= points(current)%d2) exit

points([current,child]) = points([child,current])
current = child
left = 2*current

end do
end if
end do

j = firstindex-1
do j = j, 1, -1
! dist2 = (a(i,1)-b(j,1))**2+(a(i,2)-b(j,2))**2+(a(i,3)-b(j,3))**2
dist2 = (a(i,1)-b(j,1))**2
if(dist2 >= points(1)%d2) exit
dist2 = dist2+(a(i,2)-b(j,2))**2
! if(dist2 >= points(1)%d2) cycle
dist2 = dist2+(a(i,3)-b(j,3))**2
if(dist2 < points(1)%d2) then
points(1) = closest(dist2,j)

current = 1
left = 2*current

do while(left <= n3)
child = left

right = left+1
if(right <= n3) then
if(points(right)%d2 > points(left)%d2) then

child = right
end if
end if

if(points(child)%d2 <= points(current)%d2) exit

points([current,child]) = points([child,current])
current = child
left = 2*current

end do
end if
end do

firstindex = minval(points%index)
if(mod(aIndex(i),1000) == 0) then

write(*,'(a,i0,a,10(1x,i0))') 'Ten closest points to a(',

aIndex(i), &
') are ', bIndex(points%index)
write(*,'(a,10(1x,f0.4))') 'Distances are ', sqrt(points%d2)

end if
end do
call cpu_time(t2)
write(*,'(a,f0.2)') 'Time used in seconds: ', t2-t1
end program test

C:\gfortran\clf\neighbor>gfortran -O2 heap4.f90 -oheap4

C:\gfortran\clf\neighbor>heap4
Ten closest points to a(4000) are 493 322619 460116 725680 578194 296431
67245
439767 552804 858292
Distances are .0132 .0130 .0118 .0120 .0126 .0093 .0105 .0083 .0111 .0056
Ten closest points to a(2000) are 402266 755814 950446 791419 511450 312908
685
965 749339 523805 992119
Distances are .0138 .0136 .0130 .0113 .0129 .0072 .0113 .0081 .0087 .0104
Ten closest points to a(1000) are 439860 732299 730552 474475 492099 114619
276
078 381987 350023 535861
Distances are .0115 .0109 .0108 .0070 .0061 .0106 .0100 .0048 .0069 .0034
Ten closest points to a(5000) are 137180 180546 116382 570112 664350 694621
396
725 180879 953344 831770
Distances are .0129 .0106 .0123 .0106 .0089 .0104 .0088 .0036 .0095 .0061
Ten closest points to a(3000) are 312774 983250 149218 37724 409874 760385
6016
57 636340 481998 42808
Distances are .0118 .0117 .0101 .0105 .0106 .0077 .0070 .0082 .0076 .0089
Time used in seconds: 1.11

I still haven't found time and energy to implement an octree
solution, though.

mecej4

unread,

Dec 22, 2012, 3:24:25 PM12/22/12

to

On 12/17/2012 4:25 AM, harishameed_33 wrote:
> hi,

>
> i have a set A with 5,000 (x,y,z) triples and set B with
> 1,000,000 (x,y,z) triples.
>

> For each point in set A i want to find set of nearest point from set B.(closest 10 points)
>
> kindly guide me how to proceed to solve this problem
>
>
>
A number of helpful replies, some including working code, have already
been provided in this thread by the capable people who inhabit CLF. The
programs given by Chandler, Metcalf, and the modifications provided by
Van Buskirk are able to complete the solution of the problem stated in
about 10 s of CPU time on a current microprocessor such as an i3 or i5.

Some have mentioned that other data structures such as k-D trees, b-D
trees and octrees are more appropriate and would result in faster
throughput. I tried out a few packages using such data structures, and
found one that is well-developed, documented and has all the
capabilities you need:

<http://www.cs.umd.edu/~mount/ANN/>

The only drawback (if it is one) is that the code is in C++. I barely
know C++ myself, but I was able to build the library and the test
program with ease. The program ran on your problem in 2.7 s on an
i3-2350M (2.3 GHz) laptop with Intel C++ under W7-X64. For reference,
here is the input data to the ANN-TEST program:

output_label n10
#validate on
dim 3
data_size 1000000
distribution uniform
gen_data_pts
dim 3
query_size 5000
distribution uniform
gen_query_pts
bucket_size 2
shrink_rule none
split_rule fair
build_ann
stats query_stats
epsilon 0.0
near_neigh 10
true_near_neigh 20
run_queries standard

In comparison, Metcalf's code ran in 13 s when compiled with Intel Fortran.

-- mecej4

James Van Buskirk

unread,

Dec 22, 2012, 11:29:34 PM12/22/12

to

"James Van Buskirk" <not_...@comcast.net> wrote in message
news:kb53vh$p9k$1...@dont-email.me...

> Time used in seconds: 1.11

Since sorting the input arrays was taking most of the time, I tried
switching to mergesort for this operation and showed some
improvement:

C:\gfortran\clf\neighbor>type merge.f90
module funcs
implicit none
integer, parameter :: INSERTION_LIMIT = 65
contains
subroutine mergesort(array, indices, N)

integer, intent(in) :: N
real array(N,3)
integer indices(N)

real work(N-(N+1)/2)
integer iwork(N-(N+1)/2)
integer j, jarray, jwork

indices = [(j,j=1,N)]
if(N <= INSERTION_LIMIT) then
call insertionsort(array(1,1), indices(1), N)
else
jarray = (N+1)/2
jwork = N-(N+1)/2
call sortInPlace(array(1,1), indices(1), work, iwork, jarray)
call sortToScratch(array(jarray+1,1), indices(jarray+1), &
work, iwork, jwork)
do j = N, 1, -1
if(array(jarray,1) > work(jwork)) then
array(j,1) = array(jarray,1)
indices(j) = indices(jarray)
jarray = jarray-1
if(jarray == 0) then
array(1:jwork,1) = work(1:jwork)
indices(1:jwork) = iwork(1:jwork)
exit
end if
else
array(j,1) = work(jwork)
indices(j) = iwork(jwork)
jwork = jwork-1
if(jwork == 0) exit

end if
end do
end if

array(:,2:3) = array(indices,2:3)
end subroutine mergesort

recursive subroutine sortInPlace(array, indices, work, iwork, N)
integer, intent(in) :: N
real array(N)
integer indices(N)
real work(N-(N+1)/2)
integer iwork(N-(N+1)/2)
integer j, jarray, jwork

if(N <= INSERTION_LIMIT) then
call insertionsort(array(1), indices(1), N)
else
jarray = (N+1)/2
jwork = N-(N+1)/2
call sortInPlace(array(1), indices(1), work, iwork, jarray)
call sortToScratch(array(jarray+1), indices(jarray+1), &
work, iwork, jwork)
do j = N, 1, -1
if(array(jarray) > work(jwork)) then
array(j) = array(jarray)
indices(j) = indices(jarray)
jarray = jarray-1
if(jarray == 0) then
array(1:jwork) = work(1:jwork)
indices(1:jwork) = iwork(1:jwork)
exit
end if
else
array(j) = work(jwork)
indices(j) = iwork(jwork)
jwork = jwork-1
if(jwork == 0) exit

end if
end do
end if

end subroutine sortInPlace

recursive subroutine sortToScratch(array, indices, work, iwork, N)
integer, intent(in) :: N
real array(N)
integer indices(N)
real work(N)
integer iwork(N)
integer j, jarray, jwork, kwork

if(N <= INSERTION_LIMIT) then
call insertionsort(array(1), indices(1), N)
work = array
iwork = indices
else
jarray = (N+1)/2
jwork = N-(N+1)/2
call sortInPlace(array(jwork+1), indices(jwork+1), work, &
iwork, jarray)
call sortToScratch(array(1), indices(1), work, iwork, jwork)
jarray = N
kwork = jwork
do j = N, 1, -1
if(array(jarray) >= work(jwork)) then
work(j) = array(jarray)
iwork(j) = indices(jarray)
jarray = jarray-1
if(jarray == kwork) exit
else
work(j) = work(jwork)
iwork(j) = iwork(jwork)
jwork = jwork-1
if(jwork == 0) then
work(1:j-1) = array(kwork+1:jarray)
iwork(1:j-1) = indices(kwork+1:jarray)
end if

end if
end do
end if

end subroutine sortToScratch

subroutine insertionsort(array, indices, N)
integer, intent(in) :: N
real array(N)
integer indices(N)
integer i, j
real temp
integer itemp

do i = 1, N

temp = array(i)
itemp = indices(i)
do j = i-1, 1, -1
if(temp >= array(j)) exit
array(j+1) = array(j)
indices(j+1) = indices(j)
end do
array(j+1) = temp
indices(j+1) = itemp
end do
end subroutine insertionsort

end module funcs

program test
use funcs
implicit none
integer, parameter :: n1 = 5000, n2 = 1000000, n3 = 10
real a(n1, 3), b(n2,3)
integer aIndex(n1), bIndex(n2)
type closest
real d2
integer index
end type closest
type(closest) points(n3)
real dist2, t1, t2
integer i, j
integer current, parent, left, right, child
integer firstindex

call cpu_time(t1)
call random_seed()
call random_number(a)
call random_number(b)

call mergesort(a,aIndex,n1)
call mergesort(b,bIndex,n2)

write(*,'(a,i0,a,10(1x,i0))') 'Ten closest points to a(', &
aIndex(i), ') are ', bIndex(points%index)

write(*,'(a,10(1x,f0.4))') 'Distances are ', sqrt(points%d2)
end if
end do
call cpu_time(t2)
write(*,'(a,f0.2)') 'Time used in seconds: ', t2-t1
end program test

C:\gfortran\clf\neighbor>gfortran -O2 merge.f90 -omerge

C:\gfortran\clf\neighbor>merge

Ten closest points to a(4000) are 493 322619 460116 725680 578194 296431
67245
439767 552804 858292
Distances are .0132 .0130 .0118 .0120 .0126 .0093 .0105 .0083 .0111 .0056
Ten closest points to a(2000) are 402266 755814 950446 791419 511450 312908
685
965 749339 523805 992119
Distances are .0138 .0136 .0130 .0113 .0129 .0072 .0113 .0081 .0087 .0104
Ten closest points to a(1000) are 439860 732299 730552 474475 492099 114619
276
078 381987 350023 535861
Distances are .0115 .0109 .0108 .0070 .0061 .0106 .0100 .0048 .0069 .0034
Ten closest points to a(5000) are 137180 180546 116382 570112 664350 694621
396

725 831770 953344 180879
Distances are .0129 .0106 .0123 .0106 .0089 .0104 .0088 .0061 .0095 .0036
Ten closest points to a(3000) are 312774 983250 636340 37724 409874 481998
6016
57 42808 149218 760385
Distances are .0118 .0117 .0082 .0105 .0106 .0076 .0070 .0089 .0101 .0077
Time used in seconds: .61

e p chandler

unread,

Dec 23, 2012, 1:43:48 AM12/23/12

to

"James Van Buskirk" wrote

> Time used in seconds: 1.11

> Since sorting the input arrays was taking most of the time, I tried
> switching to mergesort for this operation and showed some
> improvement:

[nice program listing snipped]

> C:\gfortran\clf\neighbor>gfortran -O2 merge.f90 -omerge

> [snip]

> Ten closest points to a(3000) are 312774 983250 636340 37724 409874

> 481998 601657 42808 149218 760385

> Distances are .0118 .0117 .0082 .0105 .0106 .0076 .0070 .0089 .0101 .0077
> Time used in seconds: .61

On my system, compiled with g95 -O3, your program took 1.65 seconds.

I followed Glen's suggestion to divide up space into cubes. The following
program creates a linked list of points from the large data set belonging to
each cube. It maintains both a head pointer and a tail pointer, at the
expense of more memory. For each query point, it checks distances in the
corresponding cube and surrounding cubes with offsets in each direction of
W, starting with 1. If the desired number of points is not found, W is
increased by 1 and the search is done over again. (I decided not to save the
incomplete result because this problem does not happen very often. In my
run, 4 times out of 5000.)

A version that wrote data and re-do data to a file took .19 sec

C:\Users\epc\temp>g95 -O3 x1.f90

C:\Users\epc\temp>a
10 closest points to a(1000) are 613919 956917 146506 57187 462514 147538
297288
113432 582251 946844
Distances are .0126 .0104 .0106 .0134 .0103 .0106 .0088 .0112 .0110 .0095
10 closest points to a(2000) are 469878 512634 364675 991312 191255 314131
34355
1 511672 422420 337192
Distances are .0097 .0118 .0098 .0053 .0111 .0116 .0123 .0089 .0046 .0120
10 closest points to a(3000) are 502419 656410 767105 841941 968192 60453
872903
85886 394243 148916
Distances are .0136 .0121 .0094 .0049 .0110 .0071 .0066 .0089 .0131 .0075
10 closest points to a(4000) are 367533 819985 97661 175810 3477 820248
707681 3
51258 767633 361448
Distances are .0129 .0131 .0126 .0116 .0097 .0116 .0126 .0039 .0046 .0114
10 closest points to a(5000) are 632236 444227 467342 285321 931383 886753
19325
2 457728 283612 930406
Distances are .0136 .0067 .0142 .0129 .0132 .0078 .0122 .0148 .0095 .0136
time:.12 sec

The program listing follows:

---- start text ----
program cube
implicit none
integer,parameter :: NA=5000,NB=1000000,NC=100,ND=10
real a(NA,3),b(NB,3)
integer h(NC,NC,NC),t(NC,NC,NC),l(NB)
real d(ND)
integer g(ND)
real x,y,z,d2,highest
integer i,j,k,m,p,i0,j0,k0,w,f,top
real t1,t2

call cpu_time(t1)

call random_seed()
call random_number(a)
call random_number(b)

h=0 !head
t=0 !tail
l=0 !next_node

do m=1,NB
i=1+int(b(m,1)*NC)
j=1+int(b(m,2)*NC)
k=1+int(b(m,3)*NC)
p=t(i,j,k)
if (p==0) then
h(i,j,k)=m
else
l(p)=m
end if
t(i,j,k)=m
end do

do m=1,NA
x=a(m,1)
y=a(m,2)
z=a(m,3)

i0=1+int(x*NC)
j0=1+int(y*NC)
k0=1+int(z*NC)
w=0
do
w=w+1
f=0
do i=max(1,i0-w),min(NC,i0+w)
do j=max(1,j0-w),min(NC,j0+w)
do k=max(1,k0-w),min(NC,k0+w)
p=h(i,j,k)
do
if (p==0) exit
f=f+1
d2=(x-b(p,1))**2 + (y-b(p,2))**2 + (z-b(p,3))**2
select case (f)
case (:ND-1)
d(f)=d2
g(f)=p
case (ND)
d(f)=d2
g(f)=p
highest=maxval(d)
top=maxloc(d,dim=1)
case default
if (d2 < highest) then
d(top)=d2
g(top)=p
highest=maxval(d)
top=maxloc(d,dim=1)
end if
end select
p=l(p)
end do
end do
end do
end do
if (f>=ND) exit
end do
if (mod(m,1000) == 0) then
write(*,'(i0,a,i0,a,999(1x,i0))') ND,' closest points to a(',m,') are',g
write(*,'(a,999(1x,f0.4))') 'Distances are',sqrt(d)

end if
end do

call cpu_time(t2)

write(*,'(a,f0.2,a)') 'time:',t2-t1,' sec'
end program cube
---- end text ----

Of course this program depends on the dataset and the query data both being
uniform on [0,1).

--- e

James Van Buskirk

unread,

Dec 23, 2012, 3:02:39 AM12/23/12

to

"e p chandler" <ep...@juno.com> wrote in message

news:kb697i$dsm$1...@dont-email.me...

> "James Van Buskirk" wrote

>> Time used in seconds: .61

> On my system, compiled with g95 -O3, your program took 1.65 seconds.

Thanks for trying my code.

> I followed Glen's suggestion to divide up space into cubes. The following
> program creates a linked list of points from the large data set belonging
> to each cube. It maintains both a head pointer and a tail pointer, at the
> expense of more memory. For each query point, it checks distances in the
> corresponding cube and surrounding cubes with offsets in each direction of
> W, starting with 1. If the desired number of points is not found, W is
> increased by 1 and the search is done over again. (I decided not to save
> the incomplete result because this problem does not happen very often. In
> my run, 4 times out of 5000.)

> time:.12 sec

Impressive time, but you seem to be missing points. Running your
latest code and mine on the same data set I get:

C:\gfortran\clf\neighbor>merge
Ten closest points to a(4000) are 493 322619 460116 725680 578194 296431
67245
439767 552804 858292
Distances are .0132 .0130 .0118 .0120 .0126 .0093 .0105 .0083 .0111 .0056
Ten closest points to a(2000) are 402266 755814 950446 791419 511450 312908
685
965 749339 523805 992119
Distances are .0138 .0136 .0130 .0113 .0129 .0072 .0113 .0081 .0087 .0104
Ten closest points to a(1000) are 439860 732299 730552 474475 492099 114619
276
078 381987 350023 535861
Distances are .0115 .0109 .0108 .0070 .0061 .0106 .0100 .0048 .0069 .0034
Ten closest points to a(5000) are 137180 180546 116382 570112 664350 694621
396
725 831770 953344 180879
Distances are .0129 .0106 .0123 .0106 .0089 .0104 .0088 .0061 .0095 .0036

Ten closest points to a(3000) are 312774 983250 636340 37724 409874 481998
6016
57 42808 149218 760385
Distances are .0118 .0117 .0082 .0105 .0106 .0076 .0070 .0089 .0101 .0077

Time used in seconds: .64

C:\gfortran\clf\neighbor>c3
10 closest points to a(1000) are 474475 492099 535861 439860 730552 381987
73229
9 114619 276078 350023
Distances are .0070 .0061 .0034 .0115 .0108 .0048 .0109 .0106 .0100 .0069
10 closest points to a(2000) are 402266 755814 772095 749339 950446 511450
79141
9 523805 992119 312908
Distances are .0138 .0136 .0144 .0081 .0130 .0129 .0113 .0087 .0104 .0072
10 closest points to a(3000) are 983250 149218 42808 636340 312774 601657
37724
481998 760385 409874
Distances are .0117 .0101 .0089 .0082 .0118 .0070 .0105 .0076 .0077 .0106
10 closest points to a(4000) are 858292 725680 199579 439767 552804 67245
578194
493 296431 460116
Distances are .0056 .0120 .0148 .0083 .0111 .0105 .0126 .0132 .0093 .0118
10 closest points to a(5000) are 396725 570112 180546 953344 137180 180879
83177
0 694621 664350 116382
Distances are .0088 .0106 .0106 .0095 .0129 .0036 .0061 .0104 .0089 .0123
1.13130379E-02 1.13419574E-02
time:.11 sec

Note that I added the line:

write(*,*) norm2(a(2000,:)-b(791419,:)),norm2(a(2000,:)-b(685965,:))

At the end of your code so that the output makes clear that you are
missing point 685965. Also you seem to be missing point 322619
when searching for nearest neighbors to a(4000). After you fix
these issues, is your code still as fast?

James Van Buskirk

unread,

Dec 23, 2012, 11:28:42 AM12/23/12

to

"James Van Buskirk" <not_...@comcast.net> wrote in message

news:kb6dqr$umi$1...@dont-email.me...

> Note that I added the line:

> write(*,*) norm2(a(2000,:)-b(791419,:)),norm2(a(2000,:)-b(685965,:))

> At the end of your code so that the output makes clear that you are
> missing point 685965. Also you seem to be missing point 322619
> when searching for nearest neighbors to a(4000). After you fix
> these issues, is your code still as fast?

I took the liberty of modifying your search loop:

w=0
f = 0
do
! w=w+1
! f=0

do i=max(1,i0-w),min(NC,i0+w)
do j=max(1,j0-w),min(NC,j0+w)
do k=max(1,k0-w),min(NC,k0+w)

if(abs(i-i0) < w .AND. abs(j-j0) < w .AND. abs(k-k0) < w) cycle

! if (f>=ND) exit
if(f >= ND) then
if(sqrt(highest)*NC <= w) exit
end if
w = w+1
end do

C:\gfortran\clf\neighbor>gfortran -O2 c4.f90 -oc4

C:\gfortran\clf\neighbor>c4
10 closest points to a(1000) are 474475 350023 492099 535861 439860 730552

38198
7 732299 114619 276078

Distances are .0070 .0069 .0061 .0034 .0115 .0108 .0048 .0109 .0106 .0100
10 closest points to a(2000) are 402266 755814 685965 749339 950446 511450

79141
9 523805 992119 312908

Distances are .0138 .0136 .0113 .0081 .0130 .0129 .0113 .0087 .0104 .0072
10 closest points to a(3000) are 481998 983250 636340 42808 409874 312774
601657
37724 760385 149218
Distances are .0076 .0117 .0082 .0089 .0106 .0118 .0070 .0105 .0077 .0101
10 closest points to a(4000) are 858292 725680 322619 439767 552804 67245
578194
493 296431 460116
Distances are .0056 .0120 .0130 .0083 .0111 .0105 .0126 .0132 .0093 .0118
10 closest points to a(5000) are 180879 831770 396725 570112 180546 953344
13718
0 694621 664350 116382
Distances are .0036 .0061 .0088 .0106 .0106 .0095 .0129 .0104 .0089 .0123
time:.17 sec

A little slower but still very fast and I think the new logic
gets all the nearest neighbors.

mecej4

unread,

Dec 23, 2012, 12:48:24 PM12/23/12

to

<--SNIP CODE-->
>
> --- e
>
>

I don't see how you handle the possibility that if the cubes are large
there may be more than one point within that cube. Could that be the
reason for the occasionally missing neighbors that James Van Buskirk
pointed out?

With NC=100 and m = 2000, I compared the list of the distances to the
"closest" ND points in your program and the result of computing all the
distances, sorting and taking the first ND of them. One point was
apparently missed. In increasing order of distance, here is where the
two lists differ:

S.No. EPC Sorted d
6 0.1026697 0.1020998 (missed)
7 0.1082851 0.1026697
8 0.1100086 0.1082851
9 0.1104050 0.1100086
10 0.1109655 0.1104050

I changed NC in your program from 100 to 50 and, at least for the cases
for which results are printed out (i.e., mod(m,1000) == 0), no points
went overlooked. (I did the sorting and checking only for those values
of m).

There is a tradeoff between making the cubes small enough (such that at
most one point falls in a cube) and making the cubes big enough that all
ND neighbors will fall in the immediately neighboring cubes.

If it is acceptable to capture ND _approximately_ nearest neighbors,
rather than ND exactly nearest neighbors, your program does a fine job.

-- mecej4

e p chandler

unread,

Dec 23, 2012, 1:08:36 PM12/23/12

to

James Van Buskirk wrote:

>> At the end of your code so that the output makes clear that you are
>> missing point 685965. Also you seem to be missing point 322619
>> when searching for nearest neighbors to a(4000). After you fix
>> these issues, is your code still as fast?

> I took the liberty of modifying your search loop:

> w=0
> f = 0
> do

> do i=max(1,i0-w),min(NC,i0+w)
> do j=max(1,j0-w),min(NC,j0+w)
> do k=max(1,k0-w),min(NC,k0+w)
> if(abs(i-i0) < w .AND. abs(j-j0) < w .AND. abs(k-k0) < w) cycle
> p=h(i,j,k)

[snip]

> if(f >= ND) then
> if(sqrt(highest)*NC <= w) exit
> end if
> w = w+1
> end do

[snip]

time:.17 sec

> A little slower but still very fast and I think the new logic gets all the
> nearest neighbors.

Thanks for looking at it and proposing a solution. My search method was just
a good guess. I can see where it might miss some extreme cases such as when
the point is in the corner of the (i0,j0,k0) cube.

--- e

"James Van Buskirk" wrote in message news:kb7bfk$ihh$1...@dont-email.me...

e p chandler

unread,

Dec 23, 2012, 1:27:38 PM12/23/12

to

"mecej4" wrote

>I don't see how you handle the possibility that if the cubes are large
>there may be more than one point within that cube. Could that be the reason
>for the occasionally missing neighbors that James Van Buskirk pointed out?

For each cube, there is an associated linked list that contains all of the
points in set B that
fall into that cube.

>With NC=100 and m = 2000, I compared the list of the distances to the
>"closest" ND points in your program and the result of computing all the
>distances, sorting and taking the first ND of them. One point was
>apparently missed. In increasing order of distance, here is where the two
>lists differ:

>S.No. EPC Sorted d
> 6 0.1026697 0.1020998 (missed)
> 7 0.1082851 0.1026697
> 8 0.1100086 0.1082851
> 9 0.1104050 0.1100086
> 10 0.1109655 0.1104050

> I changed NC in your program from 100 to 50 and, at least for the cases
> for which results are printed out (i.e., mod(m,1000) == 0), no points went
> overlooked. (I did the sorting and checking only for those values of m).

See James Van Buskirk's modification to my program. It uses a different
method to widen the search radius than I did.
Lowering NC raises the number of points in each cube and may prevent extreme
cases from appearing.

>There is a tradeoff between making the cubes small enough (such that at
>most one point falls in a cube) and making the cubes big enough that all ND
>neighbors will fall in the immediately neighboring cubes.

Yes, even with the linked list, there is that tradeoff.

> If it is acceptable to capture ND _approximately_ nearest neighbors,
> rather than ND exactly nearest neighbors, your program does a fine job.

Thanks! We have not heard again from the OP so we don't know what his task
actually is.
Maybe simulation would work for him.

--- e

Thomas Koenig

unread,

Dec 25, 2012, 10:41:43 AM12/25/12

to

michael...@compuserve.com <michael...@compuserve.com> wrote:

> if(abs(a(i, 1) - b(j, 1)) > hr .or. abs(a(i, 2) - b(j, 2)) > hr .or. &
> abs(a(i, 3) - b(j, 3)) > hr ) cycle

[...]

> Using this alternative
>
> if(any ( (/abs(a(i, 1) - b(j, 1)), abs(a(i, 2) - b(j, 2)), &
> abs(a(i, 3) - b(j, 3)) /) > hr )) cycle
>
> disappointingly increased the time to 126s.

For gfortran, this is now

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55806

James Van Buskirk

unread,

Dec 25, 2012, 4:12:41 PM12/25/12

to

"Thomas Koenig" <tko...@netcologne.de> wrote in message
news:kbchfn$u5h$1...@newsreader4.netcologne.de...

This is kind of a tricky optimization, though, isn't it?
What if the more idiomatic:

if(any(abs(a(i,:)-b(j,:))>hr)) cycle

were presented to the compiler? If size(a,2) == 3, then the
optimization would be a good one, but if size(a,2) gets large,
then the expression would vectorize nicely -- see my assembly
listing from an earlier post to this thread for an example.
Then you get to the case where a and b are allocatable. Does
the compiler at high optimization insert a test at run time to
determine whether to use the vectorized code or not?

Regarding PR55758: 80% is not 100% but at least it's progress :)

James Van Buskirk

unread,

Dec 25, 2012, 4:19:42 PM12/25/12

to

"James Van Buskirk" <not_...@comcast.net> wrote in message

news:kbd4s8$eu6$1...@dont-email.me...

> if(any(abs(a(i,:)-b(j,:))>hr)) cycle

> If size(a,2) == 3, then the
> optimization would be a good one, but if size(a,2) gets large,
> then the expression would vectorize nicely -- see my assembly
> listing from an earlier post to this thread for an example.

Whoops, wasn't thinking. This could only vectorize easily for a
processor with scatter/gather. Absent such an advanced instruction
set, you need to start with

if(any(abs(a(:,i)-b(:,j))>hr)) cycle

so as to access adjacent array elements. Tests I tried early on in
the thread (not posted) resulted in lower performance when arrays
were transposed like this, however.

Thomas Koenig

unread,

Dec 27, 2012, 12:45:12 PM12/27/12

to

James Van Buskirk <not_...@comcast.net> schrieb:

> What if the more idiomatic:
>
> if(any(abs(a(i,:)-b(j,:))>hr)) cycle
>
> were presented to the compiler?

This is now http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55814 .

Let no one say we don't take problem reports from this group
seriously :-)

Thomas Koenig

unread,

Mar 28, 2013, 7:26:27 PM3/28/13

to

michael...@compuserve.com <michael...@compuserve.com> schrieb:

> Using this alternative
>
> if(any ( (/abs(a(i, 1) - b(j, 1)), abs(a(i, 2) - b(j, 2)), &
> abs(a(i, 3) - b(j, 3)) /) > hr )) cycle
>
> disappointingly increased the time to 126s.

It's been some time, but this issue should now have been fixed in the
current development version for gfortran, which will be released
as 4.9. Didn't make it into 4.8, though.