Daniel
_________________________________________________________________
Express yourself with cool new emoticons http://www.msn.co.uk/specials/myemo
I'm not an expert, but I'd use numarray module. The most simple
example can look like:
a = numarray.array(....) # An array to look in
value = x # Value to look for
c = a.copy() # Create a copy
print numarray.abs(a.copy() - value).argmin() # Finds minimal absolute
difference
If performance of numarray is still too slow, I'd resort to simple
hasing scheme: you can split your inputs in reasonable ranges and look
for closest match only in this ranges.
hth,
anton.
Daniel
_________________________________________________________________
Sign-up for a FREE BT Broadband connection today!
http://www.msn.co.uk/specials/btbroadband
If your "array" is relatively unchanging, then you can store it in a
different way so that searching takes much less time. I think that the
method I'm going to describe is O(log N) complexity.
You want to choose a tree representation for your data with the
following property: at each node of the tree, there are N children, and
each of the N children has approximately 1/N of the leaves below that
node. Furthermore, there must be a simple test that will let you choose
which of those children may hold the closest node under your distance
measure.
I think that one algorithm you could follow is described at
http://www.cs.sunysb.edu/~algorith/files/kd-trees.shtml
Beware that if you write your k-d trees in Python but compare to a
linear search using numpy that effectively runs in C, the numpy approach
may win on your example. But if the search space grows larger the
tree approach will eventually win over a brute-force approach.
Jeff
>Hi everyone.
>I was wondering if anyone might be able to help me out here. I'm currently
>looking to find the quickest way to find a best fit match in a large array.
[snip]
>It's basically some euclidean distances that I've
>calculated, and I need to be able to find the best matches over a large
>number of input values, so I was looking for a speedy solution to this.
These might be helpful:
http://groups.google.com.gr/groups?selm=JGWa8.159434%24XZ1.6306345%40typhoon.austin.rr.com
or
groups.google.com is your friend :)
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
-- Paul
# mat2Dict.py
from random import random
import pprint
ROWS = 4
COLS = 6
# create matrix
print "creating data matrix"
matrix = [ [ random() for i in range(COLS) ] for j in range(ROWS) ]
print "create dictionary of values"
valdict = {}
for i,row in enumerate(matrix):
for j,elem in enumerate(row):
if valdict.has_key(elem):
valdict[elem].append( (i,j) )
else:
valdict[elem] = [ (i,j) ]
# uncomment this line to view the dictionary of values
# pprint.pprint( valdict )
matvals = valdict.keys()
matvals.sort()
print len(matvals),"unique values in matrix"
def findClosestVal( searchval ):
if searchval <= matvals[0]:
return matvals[0]
if searchval >= matvals[-1]:
return matvals[-1]
# do binary search - see
http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary
hi = len(matvals)
lo = -1
while (hi-lo) > 1:
loc = ( hi + lo ) / 2
if matvals[loc] > searchval:
hi = loc
else:
lo = loc
if abs(searchval-matvals[lo]) < abs(searchval-matvals[hi]):
return matvals[lo]
else:
return matvals[hi]
# look up some values
for v in range(10):
vv = v/10.
closestVal = findClosestVal( vv )
print vv, "->", closestVal, "occurring at", valdict[closestVal]
If the values in the array have any structure or constraints, try to
exploit that. If they are ordered in both directions, use binary search on
one axis and then trace the zigzag frontier between lower and height
values. If there are just a 'few' values (as in 0-255), then a dict might
work, though changing values would complicate this.
Terry J. Reedy
I've implemented a couple of approaches.
1) using just one list to hold the data
2) using a list of lists
3) using numeric
I've optimized the search function for 1 and 2 a little bit by defining
the scoring function on the fly like this and binding input during the
function definition:
input = 0.5
def euclidean(x, input=input): return abs(x-input)
We can also use a bounding box technique such that if a minimum x is
found and the x < input then we can reject all points less than x
without having to perform the euclidean distance. Conversely if x >
input then we can reject all points greater than x. This really only
works if x is a single value and not a vector.
Don't forget that when looking for minimums using a euclidean distance
you don't have to perform the square root. (x1*x1+x2*x2) will find the
correct minimum as well as opposed to the more time consuming
math.sqrt(x1*x1+x2*x2)
Here are the results looking for data closest to an input value of 0.5
using a pretty loaded pentium 4 2ghz. The result is for a single search
against a random array of 600x400.
time row col value
0.130000114441 396 328 0.500001351171 single list
0.119999885559 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix
So numeric is the clear winner here. However, when psyco is used
http://psyco.sourceforge.net/
time row col value
0.0400000810623 396 328 0.500001351171 single list
0.0499999523163 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix
psyco actually wins here but more interesting is that the timing order
is reversed! I'll have to do a lot more trials to generate good
profiling, but I think the results are pretty good here.
Here is the test code, feel free to rip it apart as you will.
import Numeric, random, time
# create data
data = [random.random() for x in xrange(600*400)]
# test list of list
lists = []
for i in range(400): # rows
lists.append(data[i*600:(i+1)*600]) # columns
matrix = Numeric.array(lists)
def testarray(data=data):
t1 = time.time()
# straight forward approach
mn = 10e99
index = None
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top
input = 0.5
def euclidean(x, input=input):
return abs(x-input)
for i,x in enumerate(data):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = i
t2 = time.time()
row = index/600
col = index%600
print t2-t1, row, col, data[index], "single list"
def testlists(list=list):
mn = 10e99
index = None
input = 0.5
lx = -10e99 # keep track of the best found
hx = 10e99 # approaching from the bottom and top
def euclidean(x, input=input):
return abs(x-input)
t1 = time.time()
for r, row in enumerate(lists):
for c, x in enumerate(row):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = r,c
t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "list of lists"
def testmatrix(matrix=matrix):
t1 = time.time()
mins = abs(matrix-0.5)
mn = 10e99
index = None
for r,c in enumerate(Numeric.argmin(mins)):
y = mins[r,c]
if y < mn:
mn = y
index = r,c
t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "matrix"
# list of lists approach
testarray()
testlists()
testmatrix()
print "*"*44
print "try with psyco"
try:
import psyco
psyco.full()
testarray()
testlists()
testmatrix()
except ImportError:
print "psyco not available"