
Average of binary vectors?


saneman

May 21, 2008, 3:51:39 PM
Maybe this is a bit off-topic, but I thought that someone in this group might
have experience with bit patterns.

Assume that I have N binary vectors (bit patterns) of length 192. How do I
find the "average" binary vector? Do I just add the binary vectors and
divide by N? This would work for real vectors, but does it also
apply to binary vectors?

And is it possible to measure the distance between two binary vectors? For
reals the Euclidean distance could be used, but does it make sense to talk
about Euclidean distance between binary vectors?

Eric Sosman

May 21, 2008, 4:46:40 PM
saneman wrote:
> Maybe this is a bit off-topic, but I thought that someone in this group might
> have experience with bit patterns.
>
> Assume that I have N binary vectors (bit patterns) of length 192. How do I
> find the "average" binary vector? Do I just add the binary vectors and
> divide by N? This would work for real vectors, but does it also
> apply to binary vectors?

It depends on what you mean by "average." Note that dividing
(a[i]+b[i]+c[i]+...) by N is likely to yield a result that is
neither 0 nor 1, hence problematic in an "average bit-vector."

> And is it possible to measure the distance between two binary vectors? For
> reals the Euclidean distance could be used, but does it make sense to talk
> about Euclidean distance between binary vectors?

It makes sense, yes: Each vector is the co-ordinate of one
of the corners of a 192-dimensional unit hypercube, and it's
easy to define Euclidean distance in 192-space:

sqrt((a[0]-b[0])**2 + ... + (a[191]-b[191])**2)

Note that each squared term is either 0 or 1, so the sum inside
the radical is simply the number of bit positions where a[i]
and b[i] disagree; this may save some arithmetic.

Euclidean distance makes sense, but is not the only distance
metric you could use. "Manhattan distance" also generalizes to
192-space; in the case at hand, where all co-ordinates are 0 or 1,
the Manhattan distance is just the square of the Euclidean: the sum
inside the radical above. Other notions of distance are possible,
too.

The choice of how to define distance (and average) depends
on what problem you're trying to solve. What problem are you
trying to solve?

--
Eric....@sun.com

saneman

May 21, 2008, 4:57:27 PM

"Eric Sosman" <Eric....@sun.com> wrote in message
news:1211402782.644550@news1nwk...


I have written a classification module that classifies the
characters/classes A-Z and integers 0-9. I use a 192-bit binary
vector/descriptor for each sample (based on thresholded gradient directions)
and would like to plot the distance between the mean vectors for each class.
But I can't seem to find any literature covering the mean of N binary
vectors/samples.


Eric Sosman

May 21, 2008, 5:53:16 PM
saneman wrote:
>
> I have written a classification module that classifies the
> characters/classes A-Z and integers 0-9. I use a 192-bit binary
> vector/descriptor for each sample (based on thresholded gradient directions)
> and would like to plot the distance between the mean vectors for each class.
> But I can't seem to find any literature covering the mean of N binary
> vectors/samples.

If I understand you (and that's a big "if"), you have
measured 192 "somethings" for each widget and then clamped
each measurement to 0 or to 1. The widgets are pooled into
various classes and you'd like to (1) represent each class
by an "average" of its widgets' sets of 192 measurements,
and (2) compute "distances" between the averages for the
various classes. If that's not it, read no further: I've
completely misunderstood.

Let's start with the "average." Your first decision is
whether you want the individual measurements in this average
to be clamped {0 1} or unclamped [0..1] values. If the former,
the average will be just like an actual set of measurements,
a "synthetic widget" that sort of stands for the whole class.
If the latter, the average will be a horse of another color.
Either way, you compute the i'th component of the average by
adding up the i'th components of all the measurement sets in
the class and dividing by the number of sets, then optionally
clamping to 1 or 0 depending on whether the quotient is above
or below 0.5 (you can take some shortcuts to avoid division
if you know you're going to clamp the final quotient).

Now for "distance," which is trickier and more problem-
dependent. You could use Euclidean distance *if* that idea
of distance makes sense for your problem: Are two averages
that differ by 0.001 on all 192 components "as far apart as"
two that agree perfectly on 191 components but differ by
0.192 on the other? Or does the fact that the second pair
agree almost everywhere outweigh the fact that the first pair
differ everywhere but never by much? There's not a universal
Right or Wrong for questions like this; it depends on how you
model your problem space.

--
Eric....@sun.com
