We did a multi-touch demo a while back. The theory isn't that complex.
There are really two bits.
1. When is a hand touching?
2. When the hand releases, what do you do?
For touching, the straightforward approach is to use a depth
threshold. When the hand is close enough to the sensor, it creates a
contact with the image. When you pull back, it lets go. We did a demo
a while back where we used a grasp detector that we wrote to indicate
when we were touching. That works very well, but you obviously need to
code up a grasp detector, which isn't as simple as the depth-threshold
approach. One interesting concept might be to just measure the hand's
depth relative to the shoulders or torso: when the hand extends away
from the body, it's touching.
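
Here's a rough sketch of what the depth-threshold check (and the
torso-relative variant) could look like. This is just illustrative:
the depth image as a numpy array, the hand pixel, the threshold values,
and all the names are assumptions, not the code from our demo.

    import numpy as np

    TOUCH_THRESHOLD_MM = 900  # hypothetical: hand closer than this counts as touching

    def is_touching(depth_image, hand_px):
        """True when the hand pixel is closer to the sensor than the threshold."""
        x, y = hand_px
        # Sample a small window around the hand to ride out single-pixel noise.
        window = depth_image[max(y - 2, 0):y + 3, max(x - 2, 0):x + 3]
        valid = window[window > 0]  # depth sensors report 0 for unknown pixels
        if valid.size == 0:
            return False
        return float(np.median(valid)) < TOUCH_THRESHOLD_MM

    def is_touching_relative(hand_depth_mm, torso_depth_mm, extend_mm=350):
        """Torso-relative variant: touching when the hand is extended in front of the body."""
        return (torso_depth_mm - hand_depth_mm) > extend_mm

The relative version has the nice property that it doesn't care how far
the person is standing from the sensor.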
When the hand releases, you'll need to transform the image somehow. In
multi-touch, you have two operations you're looking for: rotation and
scaling. For scaling, you can just compute the distance between the
hands at touch and at release; the ratio of those distances gives the
scale of the image. For rotation, you can just make a vector from the
left hand to the right hand at touch and at release and compute the
angle between those vectors.
You could also add translation for a single touch.
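
A minimal sketch of that math, assuming you already have hand positions
as (x, y) screen coordinates from the tracking layer (the function and
variable names here are made up for illustration):

    import math

    def two_hand_transform(touch_left, touch_right, release_left, release_right):
        """Return (scale, rotation_radians) from hand positions at touch and release."""
        # Scale: ratio of hand separation at release vs. at touch.
        dist_touch = math.dist(touch_left, touch_right)
        dist_release = math.dist(release_left, release_right)
        scale = dist_release / dist_touch if dist_touch > 0 else 1.0

        # Rotation: angle between the left->right vector at touch and at release.
        angle_touch = math.atan2(touch_right[1] - touch_left[1],
                                 touch_right[0] - touch_left[0])
        angle_release = math.atan2(release_right[1] - release_left[1],
                                   release_right[0] - release_left[0])
        rotation = angle_release - angle_touch
        return scale, rotation

    def single_hand_translation(touch_pos, release_pos):
        """Single-touch translation is just the offset between touch and release."""
        return (release_pos[0] - touch_pos[0], release_pos[1] - touch_pos[1])

    # Example: hands move apart and the pair rotates slightly.
    scale, rotation = two_hand_transform((100, 200), (300, 200), (80, 190), (340, 230))
    print(f"scale={scale:.2f}, rotation={math.degrees(rotation):.1f} deg")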
Anyway, that's the high-level stuff. I can elaborate on things if
you'd like (and ask the guys on the team that wrote the code if it
gets complex. ;) )
dba