Best match searching

Daniel Pryde

unread,

Dec 29, 2003, 7:14:42 AM12/29/03

to pytho...@python.org

Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array.
My problem is that I have an array of, say, 600*400, which contains a value
at each point, and I need to find the value in that array which is closest
to the input value. It's basically some euclidean distances that I've
calculated, and I need to be able to find the best matches over a large
number of input values, so I was looking for a speedy solution to this.
The array is implemented simply using a list of lists. The only solution I
could think of is to use some for statements, but that seems to take a
while, even for just one search.
Any hints or tips would be really helpful. :-)

Daniel

_________________________________________________________________
Express yourself with cool new emoticons http://www.msn.co.uk/specials/myemo

anton muhin

unread,

Dec 29, 2003, 8:46:21 AM12/29/03

to

Daniel Pryde wrote:

I'm not an expert, but I'd use numarray module. The most simple
example can look like:

a = numarray.array(....) # An array to look in
value = x # Value to look for
c = a.copy() # Create a copy
print numarray.abs(a.copy() - value).argmin() # Finds minimal absolute
difference

If performance of numarray is still too slow, I'd resort to simple
hasing scheme: you can split your inputs in reasonable ranges and look
for closest match only in this ranges.

hth,
anton.

Daniel Pryde

unread,

Dec 29, 2003, 9:28:08 AM12/29/03

to pytho...@python.org

Thanks for the advice guys. I'll give it a try. One other thing I should
mention is that, once a best match is found, the winner and its neighbours
are changed slightly, and the euclidean distances have to be calculated
again, so I'm not sure if a tree structure would be suitable for it. It's
really a self organising map I'm trying to implement overall.

Daniel

_________________________________________________________________
Sign-up for a FREE BT Broadband connection today!
http://www.msn.co.uk/specials/btbroadband

Jeff Epler

unread,

Dec 29, 2003, 9:08:12 AM12/29/03

to Daniel Pryde, pytho...@python.org

Searching 600*400 values by simply going in order is going to take
awhile. It's O(N), linear in the number of values.

If your "array" is relatively unchanging, then you can store it in a
different way so that searching takes much less time. I think that the
method I'm going to describe is O(log N) complexity.

You want to choose a tree representation for your data with the
following property: at each node of the tree, there are N children, and
each of the N children has approximately 1/N of the leaves below that
node. Furthermore, there must be a simple test that will let you choose
which of those children may hold the closest node under your distance
measure.

I think that one algorithm you could follow is described at
http://www.cs.sunysb.edu/~algorith/files/kd-trees.shtml

Beware that if you write your k-d trees in Python but compare to a
linear search using numpy that effectively runs in C, the numpy approach
may win on your example. But if the search space grows larger the
tree approach will eventually win over a brute-force approach.

Jeff

Christos TZOTZIOY

unread,

Dec 29, 2003, 10:05:51 AM12/29/03

to

On Mon, 29 Dec 2003 12:14:42 +0000, rumours say that "Daniel Pryde"
<ababo...@hotmail.com> might have written:

>Hi everyone.
>I was wondering if anyone might be able to help me out here. I'm currently
>looking to find the quickest way to find a best fit match in a large array.

[snip]

>It's basically some euclidean distances that I've
>calculated, and I need to be able to find the best matches over a large
>number of input values, so I was looking for a speedy solution to this.

These might be helpful:

http://groups.google.com.gr/groups?selm=JGWa8.159434%24XZ1.6306345%40typhoon.austin.rr.com

or

http://groups.google.com.gr/groups?threadm=mailman.411.1067919450.702.python-list%40python.org&rnum=1

groups.google.com is your friend :)
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix

Paul McGuire

unread,

Dec 29, 2003, 10:47:18 AM12/29/03

to

"Daniel Pryde" <ababo...@hotmail.com> wrote in message
news:mailman.3.10727000...@python.org...

> Hi everyone.
> I was wondering if anyone might be able to help me out here. I'm currently
> looking to find the quickest way to find a best fit match in a large
array.
> My problem is that I have an array of, say, 600*400, which contains a
value
> at each point, and I need to find the value in that array which is closest
> to the input value.

<snip>

> Any hints or tips would be really helpful. :-)
>
> Daniel

I think inverting your matrix to a dictionary may be more "Pythonic". The
included snippet constructs an array of values, then creates a dictionary of
which values are located at which positions in the array (also handles
possibility of duplicates in the array). If you change ROWS and COLS to
your problem values of 400 and 600, the matrix-to-dictionary inversion takes
a bit of time - but the searches are blindingly fast.

-- Paul

# mat2Dict.py
from random import random
import pprint

ROWS = 4
COLS = 6

# create matrix
print "creating data matrix"
matrix = [ [ random() for i in range(COLS) ] for j in range(ROWS) ]

print "create dictionary of values"
valdict = {}
for i,row in enumerate(matrix):
for j,elem in enumerate(row):
if valdict.has_key(elem):
valdict[elem].append( (i,j) )
else:
valdict[elem] = [ (i,j) ]

# uncomment this line to view the dictionary of values
# pprint.pprint( valdict )

matvals = valdict.keys()
matvals.sort()
print len(matvals),"unique values in matrix"

def findClosestVal( searchval ):
if searchval <= matvals[0]:
return matvals[0]
if searchval >= matvals[-1]:
return matvals[-1]

# do binary search - see
http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary
hi = len(matvals)
lo = -1
while (hi-lo) > 1:
loc = ( hi + lo ) / 2
if matvals[loc] > searchval:
hi = loc
else:
lo = loc

if abs(searchval-matvals[lo]) < abs(searchval-matvals[hi]):
return matvals[lo]
else:
return matvals[hi]

# look up some values
for v in range(10):
vv = v/10.
closestVal = findClosestVal( vv )
print vv, "->", closestVal, "occurring at", valdict[closestVal]

Terry Reedy

unread,

Dec 29, 2003, 3:50:30 PM12/29/03

to

"Daniel Pryde" <ababo...@hotmail.com> wrote in message
news:mailman.3.10727000...@python.org...

> Hi everyone.
> I was wondering if anyone might be able to help me out here. I'm
currently
> looking to find the quickest way to find a best fit match in a large
array.
> My problem is that I have an array of, say, 600*400, which contains a
value
> at each point, and I need to find the value in that array which is
closest
> to the input value. It's basically some euclidean distances that I've
> calculated, and I need to be able to find the best matches over a large
> number of input values, so I was looking for a speedy solution to this.
> The array is implemented simply using a list of lists. The only solution
I
> could think of is to use some for statements, but that seems to take a
> while, even for just one search.
> Any hints or tips would be really helpful. :-)

If the values in the array have any structure or constraints, try to
exploit that. If they are ordered in both directions, use binary search on
one axis and then trace the zigzag frontier between lower and height
values. If there are just a 'few' values (as in 0-255), then a dict might
work, though changing values would complicate this.

Terry J. Reedy

Brian Kelley

unread,

Dec 30, 2003, 12:17:16 PM12/30/03

to

Daniel Pryde wrote:
> The array is implemented simply using a list of lists. The only solution
> I could think of is to use some for statements, but that seems to take a
> while, even for just one search.

Ahh this brings back memories of my image processing days :)

I've implemented a couple of approaches.
1) using just one list to hold the data
2) using a list of lists
3) using numeric

I've optimized the search function for 1 and 2 a little bit by defining
the scoring function on the fly like this and binding input during the
function definition:

input = 0.5
def euclidean(x, input=input): return abs(x-input)

We can also use a bounding box technique such that if a minimum x is
found and the x < input then we can reject all points less than x
without having to perform the euclidean distance. Conversely if x >
input then we can reject all points greater than x. This really only
works if x is a single value and not a vector.

Don't forget that when looking for minimums using a euclidean distance
you don't have to perform the square root. (x1*x1+x2*x2) will find the
correct minimum as well as opposed to the more time consuming
math.sqrt(x1*x1+x2*x2)

Here are the results looking for data closest to an input value of 0.5
using a pretty loaded pentium 4 2ghz. The result is for a single search
against a random array of 600x400.

time row col value
0.130000114441 396 328 0.500001351171 single list
0.119999885559 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

So numeric is the clear winner here. However, when psyco is used
http://psyco.sourceforge.net/

time row col value
0.0400000810623 396 328 0.500001351171 single list
0.0499999523163 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

psyco actually wins here but more interesting is that the timing order
is reversed! I'll have to do a lot more trials to generate good
profiling, but I think the results are pretty good here.

Here is the test code, feel free to rip it apart as you will.
import Numeric, random, time

# create data
data = [random.random() for x in xrange(600*400)]

# test list of list
lists = []
for i in range(400): # rows
lists.append(data[i*600:(i+1)*600]) # columns

matrix = Numeric.array(lists)

def testarray(data=data):
t1 = time.time()
# straight forward approach
mn = 10e99
index = None
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top

input = 0.5
def euclidean(x, input=input):
return abs(x-input)

for i,x in enumerate(data):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = i

t2 = time.time()
row = index/600
col = index%600
print t2-t1, row, col, data[index], "single list"

def testlists(list=list):
mn = 10e99
index = None
input = 0.5
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top

def euclidean(x, input=input):
return abs(x-input)
t1 = time.time()
for r, row in enumerate(lists):
for c, x in enumerate(row):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = r,c

t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "list of lists"

def testmatrix(matrix=matrix):
t1 = time.time()
mins = abs(matrix-0.5)
mn = 10e99
index = None
for r,c in enumerate(Numeric.argmin(mins)):
y = mins[r,c]
if y < mn:
mn = y
index = r,c
t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "matrix"

# list of lists approach
testarray()
testlists()
testmatrix()

print "*"*44
print "try with psyco"
try:
import psyco
psyco.full()
testarray()
testlists()
testmatrix()
except ImportError:
print "psyco not available"