Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Gesture Recognition (2 of 2)

13 views
Skip to first unread message

Don Y

unread,
Nov 26, 2011, 8:49:47 AM11/26/11
to
[Same comments from part 1 apply]

I'll skip over the chronology behind my current implementation.
Assume I've abandoned/refined various approaches because they
couldn't provide "acceptable" performance.

Rubine-style (Rubinesque ;-) ) recognizers are speedy and
simple to implement. But, they seem to produce bizarre
results with an unforgivably high frequency! They seem to
fall below the "Make everything as simple as possible, but
no simpler" yardstick. I spent a *lot* of time reviewing
raw data trying to locate elusive "bugs" in that recognizer
only to discover that it was correctly misrecognizing that
data!

[I suspect a problem here is that the feature sets distill
too much from the gestures -- esp much of the temporal
relationships -- and leave too many potential unresolved
conflicts in the data set. So, unexpected characteristics
of the input have disproportionate influence on the results]

DTW recognizers are incredibly expensive. And, enforcing
realistic constraints on the point matching/warping is
rocket science/mysticism. You can't "prove" that the
criteria you have adopted are "correct" -- just that they
*seem* to improve performance FOR YOUR PARTICULAR DATA
AND GESTURE SET!

OTOH, both of the above benefit from the fact that as
the size of the raw data increases ("long"/complex gestures
and/or slowly drawn), there is more "real time" available
for their recognition (because a gesture can't be expected
to be recognizable until it has been completely issued!)

[OTOOH, the increased size of that data set means you had
better be processing whatever you can *as* you collect it
else you will be taxed trying to crunch it all in the
milliseconds following the gesture's issuance!]

To make things affordable (esp to hit a ~100ms recognition
window), you have to seriously reduce the data that you
have to process. OTOH, you want enough incoming data
that will support fine detail in digitizing the motion
as well as supporting a wide range of "drawing speeds".

The first win is to design the recognizer such that it
is constrained as to the set of valid gestures that it
must consider at any given time. This reduces the
number of potential match candidates that must be
(numerically) evaluated. [It does so at the risk of
being "confused" by a "syntax error" -- a gesture
issued illegally!]

So, I pass a set of legal gestures to the recognizer
prior to each possible gesture issuance. Since one
gesture's (successful?) recognition can define the
set of valid subsequent gestures, this means I must
be able to recognize gesture #1, pass it to the UI
state machine and determine the set of valid subsequent
gestures *prior* to gesture #2 being STARTED.

[I could conceivably support a form of "type-ahead"
but it would be impossible to design with deterministic
support for that potential overloading as you wouldn't
be able to predict if future processing requirements
would INCREASE beyond what you are currently UNABLE to
handle! So, best plan on handling things as they happen]

A key aspect of this gesture set is the omnipresence
of a "none-of-the-above" result. The recognizer can't
assume that the input data set *is* one of these gestures
(i.e., it's not a question of "pick the best fit" but,
rather, pick the *likely* fit *if* likely)

For each possible gesture result, I note key characteristics
of that "model/template" gesture. These include a formal
definition of the gesture (i.e., a mathematical representation
of the "path" traversed in its issuance); the "event" that
should be signalled on its recognition; constraints placed
on its recognition (e.g., the minimum time required for
a non-resident of Krypton to draw the gesture); and a
summary of other "features" of the gesture.

The goal of much of this data is to quickly exclude
candidates that are not likely matches for a particular
input data set. E.g., drawing an "ampersand" takes
considerably longer than drawing a 5-pointed star -- even
though both have comparable total stroke lengths! And,
drawing *either* of those takes infinitely longer than
drawing a simple "stroke" (dash, etc.). By examining
*summaries* of the raw data set, these criteria let
the recognizer discount certain possibilities (or, at
least, decide they stand much less chance of being
successful matches so the recognizer's attention is
better directed elsewhere).

Models/templates are (relatively) static. So, any
computationally significant "characteristics" of those
models can be performed ahead of time. This reduces
the time required in the recognizer *if* these can be
*useful* characteristics!

Borrowing from the Rubine approach, I store many of the
"features" of the templates. Since I have a representation
of the gesture's (ideal) "path", computing these features
is fairly easy (it *could* be done at compile time *or*
as part of an initialization stage at run-time).

So, I store things like total stroke length (equivalent to
the "amount of ink" that would be used in writing said
gesture), total angle (for a circle/square/triangle this would
be 360 deg; for an 'L, it would be 90, etc.), aspect ratio
('1' for a circle; '0' for a hyphen; 'infinity' for a
vertical line, etc.), etc.

The "value" of these features is that they are economical
to compute *and* can be computed (for the raw input data set)
*as* the data is being generated. Contrast this with other
aspects of the recognition process that cannot be performed
until AFTER the input data is "complete".

Using these features, I prioritize the templates that I
examine in further attempts to discount unlikely candidates
and narrow in on the "actual" result.

For templates that look "highly likely to be worth considering
as potential matches for the input data set, I subdivide the
mathematical representation of the template's "path" to yield
a number of discrete data points proportional to the number
reported in the input data set (which has been manipulated
prior to this).

This is an attempt to further constrain the DTW algorithm
so that it has *less* flexibility in producing a correspondence
between input and template data points.

Roughly speaking, DTW maps the start point of the input data
to the start point of the template. The same applies to the
"end" points. Since DTW tolerates *differences* in the numbers
of points in the input and template data sets, everything between
start and end is "elastic".

I'm trying to make that elastic more *rigid*. I.e., the point
"in the middle" of the template's path "should" map to the
point "in the middle" of the path drawn by the input data set.
And, similarly, for the rest of the points in each set.

Ideally, I end up with N points in each of the template and
input data set (N is defined *by* the input data set and
the corresponding point in each template candidate are
*computed* to satisfy this N!). Then, I compute the
"difference" between the input and template curves based
on these points. Of course, the goal is to find the
smallest difference/tightest fit.

Some (bogus!) threshold differentiates between "good enough"
and "no cigar" in the final decision. (Note that I currently
pick *the* best "good enough" candidate as "the" result.
I probably need to return a *set* of likely results with
some sort of confidence ratings and let the application layer
cast the final decision -- including, possibly, asking
the use for clarification!)

In summary, I use inexpensive computations to try to discount
as many candidates as possible. (Rubinesque) Features are
"free" for the numerous templates and inexpensive for the
*sole* input data set. The explosive computational needs of
DTW are constrained by ensuring a ~1:1 correspondence of
input and template data points. The "difference" operator
is a bit more expensive than the simple Euclidean distance
operator used in DTW (e.g., I actually measure the *area*
between the "curves"). But, the cost of ensuring the 1:1
mapping is *not* born by these other algorithms (penalty
mine). I believe much of that cost is a consequence of the
way in which I represent the template "paths" (beziers).

The truly unfortunate aspect of all this is that it still
doesn't "guarantee" that the chosen template is "correct"!
(perhaps that simply isn't possible when trying to do this
sort of pattern recognition?)

So, the real questions are:
- anyone gone down this road before (good/bad experiences)
- other ways to represent "paths" (that don't have explosive
costs associated with their representations) that are
easier to "subdivide"
- easier ways of applying tranforms (I've glossed over the
"post processing" of the input data set) to support
scaling/translation/rotation
- other BFM that might be applicable to this domain
- any existing products that are worth evaluating to get
a feel for how well *they* perform (e.g., most writing
recognizers have too high of a failure rate to be
applied this way -- think of the car driving analogy!)
- other "less arbitrary" criteria to make the final
"no cigar" decision on a matching template

[I suspect I've seen most of the classic and current
literature on this subject -- in some form. And, most
of the available FOSS-ish implementations are far LESS
capable than what I've done.]

Thx,
--d

1 Lucky Texan

unread,
Nov 28, 2011, 2:17:29 PM11/28/11
to
On Nov 26, 7:49 am, Don Y <not.to...@seen.com> wrote:
> [Same comments from part 1 apply]

******[I could conceivably support a form of "type-ahead"
but it would be impossible to design with deterministic
support for that potential overloading as you wouldn't
be able to predict if future processing requirements
would INCREASE beyond what you are currently UNABLE to
handle! So, best plan on handling things as they happen] ******


Is there a way to implement any 'lock-outs'? That is, if certain
commands would NEVER follow certain different commands, would locking
out of one or more 'disallowed' commands help to narrow the 'set' from
which a subsequent entry is chosen?

seems like an interesting project.



Don Y

unread,
Nov 29, 2011, 10:17:24 AM11/29/11
to
Yes. My UI is implemented as a state machine (I can't
understand how anyone would *not* use an FSM for this
purpose!). So, in each state, the list of valid
"inputs" ("commands" -- things that can result in a
state transition) are known (by the design of the
state machine!). Rather than letting *any* "command"
be delivered ("recognized") to that atate machine
(i.e., by the gesture recognizer) and then *discarding*
"commands" that aren't acceptable (in that particular
state), I tell the recognizer what *is* acceptable
and let *it* filter out the gestures that wouldn't
be allowed.

This has the result of reducing the set of potential
gestures that might have to be considered at any
given point in time (which reduces the workload).

It also reduces the potential ambiguities that might
otherwise result from an "overly rich" feature set!
E.g., if <circle> and <square> are both in the
lexicon at a given instant, then the "roundness"
of the gesture becomes a critical issue (i.e.,
a <square> looks like a very poorly drawn <circle>
and vice versa).

OTOH, if only *one* of these is acceptable at a
particular instant (e.g., <square>), then the
significance of the "roundness feature" is greatly
diminished (assuming there are no other legal
"closed loop" figures in the lexicon). So, choosing
between, e.g., <square> and <horizontal stroke>
is a cake walk!

[I (poorly) track "misrecognitions" and try to
make guesses as to what a particular gesture
was *supposed* to be. I use this to help me
identify "competing" gestures -- e.g., 'O' vs '0'.
Eventually, I'll figure out a suitable API that
lets applications query this information and
adjust their "command (gesture) sets" to
minimize the potential for "near misses" like
that]

The problem is that it is very easy to end up with
*lots* of "legal" gestures. E.g., should it
matter of you draw that <circle> clockwise vs.
counterclockwise? Are they both considered
<circle>s? Or, are they different gestures
from each other -- <circle CW> vs. <circle CCW>?
And, what if you started drawing from the 6-o'clock
position vs. 12-o'clock?

Ideally, you want to give the user the freedom to
draw the gesture in whatever manner is most
convenient/reliable for him/her. So, there are
4 (e.g.) ways to draw that <circle>!

With an off-line recognizer, you wouldn't care.
You'd just look at the resulting "ink" and see
a <circle> -- you would have no knowledge of
*how* (temporally) the ink was applied.

OTOH, having access to the temporal aspect of
the drawing lets you enrich your gesture set
without having to create all sorts of bizarre
"drawing paths". So, you can have a small
set of "paths" (this is a bad term but one I
have saddled myself with since early on) that
are easy to draw AND easy to differentiate
between -- and just allow variations in how they
are "issued" (CW vs CCW, top-down vs bottom-up,
LR vs RL, etc.) to give you *more* gestures
to make available to the user.

This small set of "simple paths" can give you
a large number of "commands" with very little
processing complexity. E.g., imagine if the
"next page" and "previous page" gestures were
"horizontal line" and "vertical line" (two
substantially different paths) -- each of
which could be "drawn" any way desired -- instead
of <hline R> and <hline L>, respectively.

But, the gesture recognizer still has to be
able to *recognize* each variation -- whether
they are thereafter mapped to a single
"command" *or* are treated as separate commands
(mapped to different "events").

> seems like an interesting project.

<grin> That's what makes it worth doing!

1 Lucky Texan

unread,
Nov 29, 2011, 11:33:33 PM11/29/11
to
On Nov 29, 9:17 am, Don Y <not.to...@seen.com> wrote:
> On 11/28/2011 12:17 PM, 1 Lucky Texan wrote:
>
> > On Nov 26, 7:49 am, Don Y<not.to...@seen.com>  wrote:

Since you're using the term gesture, i assume a clockwise circle IS
different from a CCW circle. To me, gesture implies dynamics.

And, it seems that, within those 'sets'/'sequences' that are allowed
to follow each other, choosing shapes/gestures with maximum
'differences in form' would also be hlpful. As in staright lines
follow curves, curves follow straight lines, etc. Semaphore comes to
mind as well as Morse code for some reason when thinking about this
device.

I once read where a signature verifying program weighted the TIME a
signature was written more than the shape. Seems a forger would go
quite slowly compared to the legitimate 'author'.

Dunno if your devices would be 'assigned' to the same user or not. If
it is, then 'training' and 'auto-correlating' should improve accuracy.

it's all over my head except for the basic concept. i applaud your
efforts though. It seems to me some of what your working on could be
useful for disabled folks with certain neurological or motor-control
problems.

fun stuff to think about

Don Y

unread,
Dec 1, 2011, 3:54:40 PM12/1/11
to
> Since you're using the term gesture, i assume a clockwise circle IS
> different from a CCW circle. To me, gesture implies dynamics.

Yes, in my case, that's true. (note that the term
gesture has many meanings if you search the literature.
That's why I tried to present "The Backstory")

Note, however, that my implementation doesn't *require*
these to be different! E.g., I could define "circle"
and "square" to mean "cut" and "paste".

"Does it matter *how* I draw the circle/square?"
"No. As long as it is recognizable as a circle/square"

This gives the user freedom in how he issues the gestures.
I.e., with the above command binding, why should circleCW
be different from circleCCW? Should one of these be
"correct" while the other signals an error -- "unrecognized
gesture"?

OTOH, there are times when there *is* some difference that
wants to be highlighted by the different *way* the gesture
was issued. E.g., imagine controlling an inspection
system with these gestures. There, circleCW might mean
"rotate the object/image 90 degrees in a clockwise direction
(so I can get better look at some part of it)" while the
circleCCW gesture causes a rotation in the opposite direction.
(like "horizontal stroke right-to-left" can mean "move to
the next page" while the mirror image gesture can mean
"move to the previous page")

You might also want to provide some flexibility to the
user in *where* the gesture begins. E.g., I tend draw
circles beginning from the 12-o'clock position. But,
I've often seen folks who start at 6-o'clock. So,
two different starting points, two different directions.
Four different gestures??

I define the "paths" that define the gesture -- i.e.,
circle, square, slash, cup, box, etc. In a sense,
these are "shapes" but since they also have some
sense of "time" embodied in them (i.e., the direction
in which it is drawn), I call them "paths".

Each path is named (internally and externally).

A path has certain "features" that can be computed
(compile time) from that static data. E.g., its
aspect ratio, center of mass, total angle, curviness,
start-to-end vector, etc. These features greatly
distill the essence of the path in an attempt to
quickly differentiate "circle" from "stroke".

Paths can be easily *transformed*. Simple rotations and
reflections are the most apparent. E.g., a reflected
circle becomes a circle "drawn in the reverse direction".
Note that most features can be easily transformed to
correspond with the transformed path! (e.g., reflect
a circle and its total angle is negated but it's aspect
ratio remains the same, etc.)

Transformed gestures are easily named in a canonical
form: circle reflected, box rotated, etc. So, once
you have an understanding of the transforms, you can
visualize what any named path would look/feel like.

The recognizer takes a list of paths, transforms, etc.
as the "set of recognizable gestures" at any given
point in time. It examines the current input data set
trying to fit the data to one of these gestures.
*If* it can, it passes a gesture-specific event to
the application -- the equivalent of a keystroke.

> And, it seems that, within those 'sets'/'sequences' that are allowed
> to follow each other, choosing shapes/gestures with maximum
> 'differences in form' would also be hlpful. As in staright lines
> follow curves, curves follow straight lines, etc. Semaphore comes to
> mind as well as Morse code for some reason when thinking about this
> device.

I see your point but that limits what the user can do
in the interface. E.g. you couldn't "move to next
page" followed by "move to next page".

Instead, what I do is stall the recognizer after it
emits a "keystroke" (i.e., if there is more "raw data"
to be processed, it just sits there, idle). Then, let
the application/UI process the keystroke and decide
what the *next* set of valid gestures should be.
These are then passed to the recognizer and it processes
any pending and subsequent input data in that context.

Since the recognizer is operating synchronously, it
has to be *fast* (the application/UI must be equally
quick in its decisions as to "what can come next").

> I once read where a signature verifying program weighted the TIME a
> signature was written more than the shape. Seems a forger would go
> quite slowly compared to the legitimate 'author'.

Yes, I think that was initially true (no idea if it remains
so). AFAICT, many of the "signature pads" don't do any
checking of the data but, rather, act simply as "paper
replacements".

> Dunno if your devices would be 'assigned' to the same user
> or not.

No. A big part of the motivation behind this particular
implementation was to provide "operator independence".
But, not just in allowing multiple users to share a
device! Instead, facilitating the transport of "operator
characteristics" between devices and technologies.

I.e., *you* should be able to use your friend's
device with the same level of recognition accuracy
that you get on your own!

> If it is, then 'training' and 'auto-correlating' should improve
> accuracy.

I don't want an explicit "training phase" in which the
device is "taught" how you issue each gesture. Rather,
I want you to be able to begin using the device
immediately. The device can *learn* how you issue each
gesture as you use it. E.g., each "recognition" that
you "accept" (as correct) can be used to identify
your particular "gesture style". Similarly, each
gesture that you *reject* (as incorrect) can be
*discounted* from the training set (I haven't
yet figured out an algorithmic way of using those
"misrecognitions" to identify aspects of the models
that are *incorrect/ambguous*)

> it's all over my head except for the basic concept. i applaud your
> efforts though. It seems to me some of what your working on could be
> useful for disabled folks with certain neurological or motor-control
> problems.

The approach is intended to be *inclusive* of folks with
these sorts of problems. But, not specfically geared
towards them at the expense of (ahem) "normal" people.

For example, most "recognizers" use training to modify
the templates against which future input data are compared.
I.e., tweek the machine's idea of a <circle> to match
what the user *draws* as a <circle>.

This *might* increase recognition accuracy for future
<circle>s. But, it does nothing for future "figures-of-8"!
To get a similar training benefit for those, the user
would have to issue a comparable number of examples
from which the recognizer could "learn" the deviations
from the model that the user introduces for *that*
gesture. If figures-of-8 are less common in the
interface, then this learning can take quite a long
time -- without an explicit "learning exercise"
(in which the user is unnaturally prompted to issue
gestures under the COMMAND of the device! IMO, a
device should never "drive" the user! :< )

If, instead, you can identify "deformations" to the
model that account for the user's "drawing habits",
then you can conceivably apply those to *other*
models -- regardless of the shapes involved -- to
improve their recognition as well!

E.g., if you tend to draw <vertical_stroke>s on a
slight angle, chances are you will draw other
paths with a similar bias. For example, that touchpad
mounted on the jacket lapel might be a little crooked.
*Or*, it might be perfectly straight but the position
of your arm across you chest as you write on it might
lend that sort of skew to ALL of your gestures!
If this bias can be extracted and applied to all
of the gesture models (for *you*), then recognition
of ALL of them can be improved.

[And, *your* "characteristics" can be codified in this
"transform" which can be ported to other similar devices
to summarily adjust their gesture models to your needs]

For certain physical characteristics (e.g., disabilities),
the nature of the transforms would change -- but would
still apply to *all* gestures (as there is a physical
or anatomical basis underlying the transform). E.g.,
folks with ET would tend to exhibit a "high" frequency
left-right oscillation that wouldn't be present in folks
without that problem. At the same time, there might
be *no* top-bottom component present.

> fun stuff to think about

<frown> Beats the hell out of "shell sorts"! ;-)
0 new messages