How the Kinect truly works


Andrew Miller

Nov 25, 2010, 6:08:11 AM
to OpenKinect
You probably know by now that the kinect works by emitting IR laser
dots, which are imaged by an IR camera.
http://groups.google.com/group/openkinect/browse_thread/thread/296a2963a9d67c9/a24e77a37741650f

I haven't seen anything mentioned here about how the processing works
- what happens in between the IR image and the processed depth image
we get. I think some people assume it involves a technique like stereo
vision, which would require that every neighborhood in the pattern can
be discerned from its neighbors. I don't think that's the case.

There are several good puzzles here: how does the kinect perform so
much better than stereo camera rigs, and faster than structured light
systems? Why don't multiple kinects interfere with each other, as
earlier observed? http://groups.google.com/group/openkinect/browse_thread/thread/1ff9b4501db3de05/49a578fce0da02ee

-------

I've been reading the PrimeSense patent application, as published here
(Jan 2010):
http://www.freepatentsonline.com/20100020078.pdf
Disclaimer: the following is entirely my own writing and it is not
authoritative. I may be completely wrong, and I have no knowledge
other than what's public.


I don't follow the patent 100%, but let me have a go at explaining what
I think it says:
The speckle image actually consists of two kinds of light. The dots
(D1) are like little laser pointers. They have roughly the same
brightness no matter how far they are from the source: D1(z) = A.
There's also a background illumination, an even covering of dark grey.
It acts like a flashlight: the farther away the wall is, the dimmer the
light: D2(z) = 1/z. The tricky optics in the Kinect make these two
sources effectively share the same point of origin. The brightness of
each kind of light decays with distance at a different rate.

In general, the brightness of each pixel P observing a point
(x,y,z) depends on the amount of light hitting the object, the
direction of the light source, and the surface properties at that
point. In the case of the kinect, a narrow band filter on the camera
means the only observable light striking the object comes from the two
sources D1, D2, which both share the same point of origin. Under these
conditions, the contributions from the two light sources simply add
up.

So, consider a small patch from an IR image, and assume that the
surface is uniform here:
+ + + +
+ + o + Where o is a 'dot' and
+ + + + + is 'background'

Then the brightness of the dot pixel is Pd = r * (D1 + D2) = r (A + 1/
z), where r is an unknown parameter depending on the surface. The
brightness of the nearby background pixel is Pb = r * D2 = r / z,
assuming the surface parameter r is roughly the same at both points.

You can determine the distance z without needing to know r: dividing
gives Pd / Pb = A z + 1, so
z = (Pd - Pb) / (A Pb)

In direct terms, to find depth from an IR image, do this:
- Find a really bright pixel. It's a dot (Pd).
- Find a really dark pixel very near the dot. It's background (Pb).
- Plug them into the formula above. Now you know z.
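
For illustration, here's a minimal NumPy sketch of that recipe (my own toy
code, not PrimeSense's algorithm; the constant A, the 8x8 patch size and the
synthetic test image are all made-up placeholders):

import numpy as np

A = 2.0  # hypothetical dot-brightness constant; the real value would come from calibration

def depth_from_patch(patch):
    pd = patch.max()       # brightest pixel in the patch: assume it's a dot (Pd)
    pb = np.median(patch)  # typical dark pixel nearby: assume it's background (Pb)
    if pb <= 0:
        return np.inf      # no background light measured -> looks "infinitely far away"
    # From Pd = r*(A + 1/z) and Pb = r/z it follows that z = (Pd - Pb) / (A * Pb).
    return (pd - pb) / (A * pb)

def depth_map(ir_image, patch=8):
    # Tile the IR image into patch x patch blocks and estimate one depth per block.
    h, w = ir_image.shape
    zs = np.zeros((h // patch, w // patch))
    for i in range(zs.shape[0]):
        for j in range(zs.shape[1]):
            zs[i, j] = depth_from_patch(
                ir_image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch])
    return zs

# Tiny synthetic test: a flat wall at z = 3 with uniform reflectance r = 0.5.
z_true, r = 3.0, 0.5
ir = np.full((64, 64), r / z_true)   # even background illumination
ir[::8, ::8] += r * A                # one "dot" per 8x8 patch
print(depth_map(ir)[0, 0])           # prints ~3.0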

---------------------------------

Implications:
- It isn't necessary to correspond a patch in the image to a specific
patch in the projected pattern. All that matters is a dot and the dark
area around it.

- Shadows look the same to the kinect as infinitely far away points.

- This is quite different from both structured-light systems and
time-of-flight sensors; it's sort of a cross between the two.

----------------
Open questions for discussion:
- Any mistakes in this explanation?
- What happens in the case of multiple kinects?
- What's special about the distance between the laser and the camera?
Would it matter if they were closer together or farther apart?
- Could we make a kinect microscope by putting a magnifying lens in
front of the camera and the laser?



Sibo Quan

Nov 25, 2010, 10:00:26 AM
to openk...@googlegroups.com
Very interesting topic, and I have another question to add:
- Is the parallax caused by the distance between the IR camera and the RGB camera an issue (for accuracy) when mapping the RGB image onto the depth info?
 
Thanks.
 
Quan Sibo
 

 
--
Sincerely yours,

QUAN Sibo
sbq...@gmail.com

Hector Martin

Nov 25, 2010, 10:03:54 AM
to openk...@googlegroups.com
That's a very interesting theory. Here's a different one that I've had
up until now:

The projector uses two diffraction gratings to produce a constellation
of dots. The camera sees these dots from a slightly different
perspective (due to the 10cm or so spacing). Due to this, different
depths produce slightly horizontally shifted positions for the dots. The
Kinect performs subpixel analysis to determine the center position of
each dot, then compares it to a reference calibration "image" with every
dot position, created in the factory. It then determines depth for each
dot by comparing the horizontal displacement.
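
For what it's worth, here's a toy sketch of the triangulation step that
theory implies. The focal length, baseline and reference-plane distance below
are invented numbers, and the subpixel dot detection is skipped entirely, so
it only illustrates the geometry:

FOCAL_PX = 580.0    # hypothetical IR-camera focal length in pixels
BASELINE_M = 0.075  # hypothetical projector-to-camera spacing (~7.5 cm)
Z_REF_M = 2.0       # hypothetical distance of the factory reference plane

def depth_from_shift(x_observed, x_reference):
    # Plain stereo triangulation: disparity d = f*b/z, so the shift of a dot
    # relative to where the reference image puts it is
    #   shift = d_obs - d_ref = f*b*(1/z - 1/z_ref)
    shift = x_observed - x_reference
    inv_z = shift / (FOCAL_PX * BASELINE_M) + 1.0 / Z_REF_M
    return 1.0 / inv_z

print(depth_from_shift(105.0, 100.0))  # 5 px shift -> closer than the 2 m reference
print(depth_from_shift(100.0, 100.0))  # no shift   -> exactly the reference distance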

The arguments I think support this theory are:
- Multiple kinects don't interfere except at a small fraction of points.
This makes sense, since the chip would ignore extra dots not in the
calibration image, except where they overlap and break that depth sample.
- The laser is temperature-controlled with a Peltier element (both for
heating and cooling), presumably for wavelength stability. This makes
sense because, since the constellation is produced by a diffraction
grating, different wavelengths would alter the dot spacing and throw off
the calibration.
- It has been said that the Kinect is highly sensitive to mechanical
stress on the internal structure, i.e. applying a bending force to the
metal frame that joins the projector and camera causes very noticeable
changes in the depth image. This also agrees with some sort of accurate
and sensitive calibration being involved, and the perspective between
camera and projector being critical.

It's still just guesswork though, and it certainly doesn't match that
particular patent (though we don't know if PrimeSense have been working
on several different methods to do this kind of thing). Thoughts?


--
Hector Martin (hec...@marcansoft.com)
Public Key: http://www.marcansoft.com/marcan.asc

Dawes, Andrew M.

Nov 25, 2010, 10:36:11 AM
to openk...@googlegroups.com
On 11/25/10 7:03 AM, Hector Martin wrote:
> That's a very interesting theory. Here's a different one that I've had
> up until now:
>
> The projector uses two diffraction gratings to produce a constellation
> of dots. The camera sees these dots from a slightly different
> perspective (due to the 10cm or so spacing). Due to this, different
> depths produce slightly horizontally shifted positions for the dots. The
> Kinect performs subpixel analysis to determine the center position of
> each dot, then compares it to a reference calibration "image" with every
> dot position, created in the factory. It then determines depth for each
> dot by comparing the horizontal displacement.
This description is certainly supported by the PrimeSense patent (most
accounts say PrimeSense licensed the tech in the Kinect to Microsoft).

http://www.google.com/patents/about?id=eTKuAAAAEBAJ&dq=primesense

They describe a system that casts a reference field and then
cross-correlates the viewed field with reference depth images. This
basically comes down to decomposing the viewed field into regions that
correlate strongly with each of the calibrated depth images.
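
A crude sketch of what "correlate the viewed patch against reference images"
could look like (the normalized cross-correlation is standard, but the 16x16
patch size and the three-depth reference set are invented for illustration):

import numpy as np

def ncc(a, b):
    # Normalized cross-correlation of two equally sized patches.
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def best_matching_depth(patch, reference_patches):
    # Pick the calibrated depth whose reference patch correlates best with the
    # observed patch -- the "decompose into strongly correlating regions" idea.
    return max(reference_patches, key=lambda z: ncc(patch, reference_patches[z]))

# Fake data: three "calibrated" reference patches, one per depth, plus a noisy view of one.
rng = np.random.default_rng(0)
references = {z: rng.random((16, 16)) for z in (1.0, 2.0, 3.0)}
observed = references[2.0] + 0.1 * rng.random((16, 16))
print(best_matching_depth(observed, references))  # prints 2.0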

This seems to suggest that each kinect needs to be "factory calibrated".
That isn't ideal for mass-marketing but that wouldn't rule out the
possibility. Hector's other supporting claims are consistent with this
patent as well (sensitivity to bending, lack of major interference, etc).

I'm reading through more recent primesense patents to see if there are
any improvements on this approach (i.e. some that wouldn't require
individual device calibration).

-Andy

GUNNM

Nov 25, 2010, 10:51:31 AM
to OpenKinect

Mike Harrison

Nov 25, 2010, 10:56:30 AM
to openk...@googlegroups.com
On Thu, 25 Nov 2010 07:36:11 -0800, you wrote:

>This seems to suggest that each kinect needs to be "factory calibrated".
>That isn't ideal for mass-marketing but that wouldn't rule out the
>possibility. Hector's other supporting claims are consistent with this
>patent as well (sensitivity to bending, lack of major interference, etc).

There is an EEPROM on the camera board, but it doesn't contain a great deal of data - about 600-odd
bytes. My guess is that some of it is to do with laser power calibration - the biggest block is 480 bytes,
and it looks like it's grouped as 30 blocks of 16 bytes.

David Hubbard

Nov 25, 2010, 1:46:32 PM
to openk...@googlegroups.com
Hi Andrew,

I followed it without any problems.

> - What happens in the case of multiple kinects?

If I'm understanding right, the parallax has no effect on the
calculation. In other words, the distance between the light source (D1,
D2) and the camera doesn't matter. I imagine a large difference in z
would matter, but between a single Kinect's laser and its PrimeSense
camera the z doesn't change -- so it doesn't matter.

OK, if I'm not too far off, then Kinect #1 could see dots from Kinect
#2 and still calculate the right result. But because the z is
different, there is probably going to be a fixed offset after the
calculation. I'd add some sort of Bayesian filter that evaluates the
probability that a dot is from Kinect #2 instead of from Kinect #1,
and go with the result with minimum mean squared error. For a more
advanced approach, I'd model the surface with NURBS and optimize the
coefficients at each control point to find the best fit.

> - What's special about the distance between the laser and the camera?

I think it doesn't matter as long as z is the same. However, I bet
factory calibration includes finding the "offset" (bias) of the
measurements.

I do think the laser output power is an important part of the
calibration. At least for A, and probably a scale & bias measurement.
That probably explains the (relatively expensive) peltier junction to
precisely control the laser output power.

> - Would it matter if they were closer together or farther apart?

Closer dots will have a brighter Pb (background pixel), so the
signal-to-noise ratio will be better. When computing z from Pd and Pb,
that means accuracy is better when the dots have a brighter Pb. (This
makes sense intuitively also.) It could mean that cooling the
PrimeSense camera (think overclockers) and boosting the laser output
power would dramatically increase the range.
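
A quick way to see that numerically, under Andrew's speculative intensity
model and with made-up noise numbers: hold the sensor noise on Pb fixed and
watch the depth spread shrink as Pb gets brighter.

import numpy as np

A = 2.0                     # same made-up constant as in Andrew's model
rng = np.random.default_rng(1)
z_true = 3.0

for r in (0.2, 0.5, 1.0):   # brighter surface -> brighter Pb
    pb_true = r / z_true
    pd = r * (A + 1.0 / z_true)
    pb_meas = pb_true + 0.01 * rng.standard_normal(10000)  # fixed sensor noise on Pb
    z = (pd - pb_meas) / (A * pb_meas)
    print("Pb=%.3f  depth std=%.3f" % (pb_true, z.std()))
# The printed spread drops as Pb rises: brighter background pixels give better depth accuracy.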

> - Could we make a kinect microscope by putting a magnifying lens in
> front of the camera and the laser?

Yes, but I bet you would have to recalibrate.

Cheers,
David

Mike Harrison

Nov 25, 2010, 1:59:53 PM
to openk...@googlegroups.com

>I do think the laser output power is an important part of the
>calibration. At least for A, and probably a scale & bias measurement.
>That probably explains the (relatively expensive) peltier junction to
>precisely control the laser output power.

The peltier controls wavelength - power will be done by adjusting drive current, using an optical
feedback mechanism, typically a photodiode in the laser diode package.

The question is why is wavelength so critical - the performance of the pattern generator, or the
bandwidth of the filter on the camera?
I wonder if the dot pattern spacing may be a function of wavelength - someone who knows about
diffraction may be able to answer this.

I have seen my Kinect not getting a depth image until the peltier has stabilised.

Considering the illuminator produces a geometric pattern from a point source, I find it hard to
believe that it is making significant use of strength of illumination, as opposed to geometry to
determine the depth. The size of a group of points of known angular spacing should give you all the
info you need to calculate distance.
Reflectivity of targets would have a bigger effect on intensity than proximity.

Murilo Saraiva de Queiroz

Nov 25, 2010, 4:56:54 PM
to openk...@googlegroups.com
On Thu, Nov 25, 2010 at 1:51 PM, GUNNM <steven...@laposte.net> wrote:
Maybe this article can help:

http://en.wikipedia.org/wiki/Time-of-flight_camera

Kinect IS NOT a time-of-flight camera. It uses structured light.

There is a lot of confusion regarding this because Microsoft also bought 3DV, another Israeli company that used to produce TOF cameras, but the Kinect is built with technology from PrimeSense, not 3DV. 

--
Murilo Saraiva de Queiroz, MSc.
Senior Software Engineer 
http://www.vettalabs.com
http://www.tecnologiainteligente.com.br
http://www.acalantoemcasa.com.br

Ismael Salvador

Nov 25, 2010, 5:39:39 PM
to OpenKinect
I was digging into the publications of the people who applied for the
patent, and I found that it all revolves around an optics department
that works with this kind of grating.

I just took a quick look at a paper dealing with this kind of pattern.
I didn't get very deep into it, but I think the "artifacts" between
the "big" points are caused by diffraction. In the next few days I will
try to find some time to read the whole paper.

http://www.uv.es/~gpoei/articulos/(2008)%20AO%2047%20(16)%203032%20Three-dimensional%20mapping%20and%20range%20measurement%20by%20means%20of%20projected%20speckle%20patterns.pdf

Nema Safvati

Nov 25, 2010, 10:43:25 PM
to OpenKinect
I don't think the "differences in illumination cast" by the laser and
IR patterns are referring to brightness. The laser is most likely
doing a time of flight calculation to provide depth info. As for the
IR light, I think it could be as simple as size; objects closer to the
projected light pattern will be bigger/smaller (depending on how the
light is projected). So the difference in TOF for the laser and the
difference in size for the IR pattern would combine to produce a depth map.

Just my two-cents.

Adam Crow

Nov 25, 2010, 11:31:12 PM
to openk...@googlegroups.com
I would be very surprised if there were no factory calibration. The
manufacturing tolerances in the cameras and in the assembly all add up
to a pretty variable configuration that, being physically small, can
create significant variation in depth measurement over distances of a
few metres.

Calibration would be quite quick.
With a few reference points the system can interpolate between these
points to map every single camera pixel to a vector.
Given the small amount of EEPROM needed to store every single pixel
vector, they could probably do that too.

There is no need for a precomputed fast mapping of camera pixel + laser
dot position -> xyz. This can be done (I used to do it) in real time
using floating-point calculation.

ACC

--
Adam Crow BEng (hons) MEngSc MIEEE
Technical Director
DC123 Pty Ltd
Suite 10, Level 2
13 Corporate Drive
HEATHERTON VIC 3202
http://dc123.com
phone: 1300 88 04 07
fax: +61 3  9923 6590
int: +61 3 8689 9798

Hector Martin

Nov 26, 2010, 2:20:43 AM
to openk...@googlegroups.com
On 11/25/2010 07:59 PM, Mike Harrison wrote:
> The question is why is wavelength so critical - the performance of the pattern generator, or the
> bandwidth of the filter on the camera?
> I wonder if the dot pattern spacing may be a function of wavelength - someone who knows about
> diffraction may be able to answer this.

It does (this is why CDs produce shiny rainbows).
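
To put a rough number on that - the grating pitch and diffraction order below
are guesses, not the real optics - even a 1 nm wavelength drift moves a
higher-order dot by a good fraction of a millimetre at 2 m, which would
matter for subpixel matching against a fixed reference:

import math

d = 30e-6    # hypothetical grating pitch (30 um), not the real part's spec
m = 10       # some higher diffraction order in the constellation
wall = 2.0   # distance to the wall in metres

def dot_offset(wavelength):
    # Grating equation d*sin(theta) = m*lambda, projected onto a wall 'wall' metres away.
    theta = math.asin(m * wavelength / d)
    return wall * math.tan(theta)

shift = dot_offset(831e-9) - dot_offset(830e-9)
print("1 nm of wavelength drift moves this dot by %.2f mm at 2 m" % (shift * 1000))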

If the kinect only has a small EEPROM to store calibration data, then my
guess is they're storing a series of parameters that characterize the
manufacturing deviations of the system, instead of the raw position of
each dot.

So, maybe each of the 9 "subconstellations" (each segment around a
brighter central dot, which is the zero-order beam for that portion,
which originally came from the primary grating) has a very well-defined
pattern (maybe even algorithmic), but I'm guessing they can't really
precisely know the position of each one. So all they'd have to do is
store the affine parameters for each of the 9 segments and then derive
the position of each dot from whatever algorithm.
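
A sketch of what that storage scheme could boil down to (the dot-generating
function and all the numbers here are invented; the only point is that 9
small affine corrections are far cheaper to store than every dot position):

import numpy as np

def nominal_dots(n=50, seed=0):
    # Stand-in for the "algorithmic" dot layout of one subconstellation segment.
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=(n, 2))

def apply_affine(points, params):
    # params = (a, b, c, d, tx, ty): a per-segment 2x3 affine correction.
    a, b, c, d, tx, ty = params
    return points @ np.array([[a, b], [c, d]]).T + np.array([tx, ty])

# 9 segments x 6 floats = 54 numbers in total, instead of every single dot position.
segment_params = {k: (1.0, 0.001 * k, -0.001 * k, 1.0, 0.01 * k, 0.0) for k in range(9)}
calibrated = {k: apply_affine(nominal_dots(seed=k), p) for k, p in segment_params.items()}
print(calibrated[3][:2])  # corrected positions of the first two dots of segment 3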

SK

Nov 26, 2010, 2:59:50 AM
to OpenKinect
> If the kinect only has a small EEPROM to store calibration data, then my
> guess is they're storing a series of parameters that characterize the
> manufacturing deviations of the system, instead of the raw position of
> each dot.
In the patent on the PrimeSense reference design
(http://www.google.com/patents?id=k3vLAAAAEBAJ) there is a section
about "Registration of Depth with Color Image Data". There, it is
explained that there is indeed no large lookup table; instead a
parabolic fit is used, so only 12 fitting coefficients have to be
stored after factory calibration.
It would of course be interesting to know whether these parameters
reside somewhere on the Kinect and are transmitted to the Xbox on
startup.
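
If the 12 coefficients really are a second-order image warp, one plausible
reading is 6 coefficients per output axis. A least-squares fit of such a
model might look like this - the 6+6 split and the synthetic data are my
assumption, not something the patent spells out:

import numpy as np

def design(x, y):
    # Second-order ("parabolic") terms in the depth-image coordinates.
    return np.column_stack([np.ones_like(x), x, y, x * y, x * x, y * y])

def fit_registration(xd, yd, xc, yc):
    # Least-squares fit from depth-image pixels (xd, yd) to colour-image pixels
    # (xc, yc): 6 coefficients per axis, 12 in total -- small enough for an EEPROM.
    A = design(xd, yd)
    cx, *_ = np.linalg.lstsq(A, xc, rcond=None)
    cy, *_ = np.linalg.lstsq(A, yc, rcond=None)
    return cx, cy

def apply_registration(cx, cy, xd, yd):
    A = design(xd, yd)
    return A @ cx, A @ cy

# Synthetic calibration data: a gentle quadratic distortion plus an offset.
rng = np.random.default_rng(0)
xd, yd = rng.uniform(0, 640, 200), rng.uniform(0, 480, 200)
xc = xd + 4.0 + 1e-5 * xd * xd
yc = yd - 3.0 + 2e-5 * xd * yd
cx, cy = fit_registration(xd, yd, xc, yc)
print(apply_registration(cx, cy, np.array([320.0]), np.array([240.0])))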

SK

Nov 26, 2010, 3:11:14 AM
to OpenKinect
On Nov 26, 4:43 am, Nema Safvati <cgn...@gmail.com> wrote:
> I don't think the "differences in illumination cast" by the laser and
> IR patterns are referring to brightness. The laser is most likely
> doing a time of flight calculation to provide depth info.
The patent is pretty clear on this:

[0013] In a disclosed embodiment, the depth-related image data include
an image of a pattern that is projected onto an object, and the
processing circuitry is configured to generate the depth map by
measuring shifts in the pattern relative to a reference image.

So there is very good reason to consider any theory involving TOF or
intensity-based depth estimation wrong. We already know that the IR
cam is a conventional rolling-shutter CMOS camera with an IR filter,
by the way, so from that fact alone it cannot be TOF.

Mike Harrison

Nov 26, 2010, 5:56:35 AM
to openk...@googlegroups.com

FFS - how many times does this need repeating - KINECT DOES NOT USE TOF!!!!!
It uses a standard CMOS image sensor, which can't do TOF, and the laser is not modulated.

Murilo Saraiva de Queiroz

Nov 26, 2010, 6:44:31 AM
to openk...@googlegroups.com
Kinect isn't a TOF camera. 

Mike Harrison

Nov 26, 2010, 10:40:56 AM
to openk...@googlegroups.com
If anyone's interested, here's the eeprom data
(Microchip 24LC64, U26)

Address 0 :
C0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
00 02 A5 A5 00 00 00 00 00 00 00 00 00 00 00 00
00 52 00 00 00 B1 FF FF 00 00 00 00 00 60 00 16
07 8C 00 00 00 04 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 01 00 00 00 01 00 00 00 00 00 00 00 00 00 01
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 16 FF FE 00 5F 02 0E 00 31 00 00 00 00 00 00


Address 0x400

00 F0 00 8B 00 63 01 39 01 75 00 00 00 00 00 00
00 F1 00 8C 00 63 01 6E 01 6E 00 00 00 00 00 00
00 F2 00 8C 00 62 01 6E 01 6E 00 00 00 00 00 00
00 F3 00 8C 00 62 01 6E 01 6E 00 00 00 00 00 00
00 F4 00 8B 00 61 01 36 01 6E 00 00 00 00 00 00
00 F5 00 8B 00 61 01 36 01 6E 00 00 00 00 00 00
00 F6 00 8B 00 61 01 32 01 6B 00 00 00 00 00 00
00 F7 00 8B 00 61 01 28 01 68 00 00 00 00 00 00
00 F8 00 8B 00 60 01 25 01 5E 00 00 00 00 00 00
00 F9 00 8B 00 60 01 25 01 93 00 00 00 00 00 00
00 FA 00 8B 00 60 01 25 02 62 00 00 00 00 00 00
00 DD 00 8B 00 68 01 5A 01 CF 00 00 00 00 00 00
00 DE 00 8B 00 68 01 5A 01 D2 00 00 00 00 00 00
00 DF 00 8A 00 16 00 D2 03 84 00 00 00 00 00 00
00 E0 00 8B 00 64 01 43 01 B4 00 00 00 00 00 00
00 E1 00 8B 00 6D 01 36 01 78 00 00 00 00 00 00
00 E2 00 8B 00 6C 01 36 01 C2 00 00 00 00 00 00
00 E3 00 8C 00 6A 01 8C 01 C8 00 00 00 00 00 00
00 E4 00 8B 00 47 00 78 03 84 00 00 00 00 00 00
00 E5 00 88 00 55 00 DC 03 84 00 00 00 00 00 00
00 E6 00 88 00 1A 00 D5 01 D6 00 00 00 00 00 00
00 E7 00 8C 00 3F 01 61 01 AE 00 00 00 00 00 00
00 E8 00 8C 00 56 01 78 01 BB 00 00 00 00 00 00
00 E9 00 8C 00 5C 01 86 01 C2 00 00 00 00 00 00
00 EA 00 8B 00 60 01 4A 01 8C 00 00 00 00 00 00
00 EB 00 8B 00 60 01 43 01 7F 00 00 00 00 00 00
00 EC 00 8B 00 62 01 3C 01 78 00 00 00 00 00 00
00 ED 00 8B 00 62 01 39 01 75 00 00 00 00 00 00
00 EE 00 8B 00 64 01 39 01 AA 00 00 00 00 00 00
00 EF 00 8C 00 63 01 75 01 75 00 00 00 00 00 00

Address 0x800
00 17 00 00 00 20 00 00 0A 10 00 00 16 04 01 01
00 17 00 00 00 21 00 00 00 00 00 00 00 00 00 00
00 17 00 00 00 20 00 00 05 E0 00 00 16 04 01 01
00 17 00 00 00 21 00 00 00 00 00 00 00 00 00 00
00 7D FF FF FF FF 00 00 00 00 00 00 00 00 00 00
00 7D 00 00 00 02 00 00 00 00 00 00 00 00 00 00
00 7D 00 00 00 07 00 00 00 00 00 00 00 00 00 00
FF FF FF FF 00 21 00 00 00 00 00 00 00 00 00 00
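
For anyone who wants to poke at the 0x400 block, here's a small parsing
sketch based purely on how it looks (records of big-endian 16-bit values with
zero padding); the field meanings are unknown and the grouping is only a guess:

import struct

block_0x400 = """
00 F0 00 8B 00 63 01 39 01 75 00 00 00 00 00 00
00 F1 00 8C 00 63 01 6E 01 6E 00 00 00 00 00 00
00 DD 00 8B 00 68 01 5A 01 CF 00 00 00 00 00 00
"""  # first rows only for brevity; paste the full 30-row dump above to parse all of it

def parse_records(hexdump):
    # Split each 16-byte row into five big-endian 16-bit values (guessed layout);
    # the remaining six bytes are all zero in the dump, so they're dropped here.
    records = []
    for line in hexdump.strip().splitlines():
        raw = bytes.fromhex(line.replace(" ", ""))
        records.append(struct.unpack(">5H", raw[:10]))
    return records

for rec in parse_records(block_0x400):
    print(["0x%04X" % v for v in rec])
# The first field runs from 0x00DD up to 0x00FA across the 30 rows, which is what
# you'd expect from some kind of indexed table (30 entries, matching the earlier guess).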

Mike Harrison

Nov 26, 2010, 10:53:52 AM
to openk...@googlegroups.com
Some observations of what happens when the illuminator is moved slightly WRT the sensor:

Slackening the screws on the illuminator, and rotating it slightly, just the amount that the screws
allow within their holes, progressively narrows the depth image field of view from full to a narrow
vertical strip about 10% of the normal width.

Slackening some more and panning left/right shifts the depth values without noticeably affecting the
FOV or the geometry.

Panning up/down shifts the FOV left/right - not the image, just the part of the image that remains
visible after rotating as above.

The amount by which even small movements affect the image suggests that some post-assembly
calibration would be necessary.
It also shows that a rigid metal mounting plate is essential for maintaining good alignment between
the illuminator and the sensor.

Mohamed Ikbel Boulabiar

Nov 26, 2010, 10:59:49 AM
to openk...@googlegroups.com
Hi,

Could you instead save the raw binary content to a file and attach it
to the mail?
Then anyone here could easily hexdump it.

Thanks.

Mike Harrison

Nov 26, 2010, 11:02:46 AM
to openk...@googlegroups.com
Just playing with some black IR filter material - although the sensor works fine through it, I can
still clearly see the red speckle pattern from the illuminator.
As this filter material attenuates pretty much any amount of visible light short of sunlight or
halogen spotlights to complete invisibility, this was a bit of a surprise, indicating my eyes have
some sensitivity at 830nm.

I wonder if this is one of those things that varies between people, like high frequency hearing - is
there anyone here with normal vision who _doesn't_ see the red pattern when looking towards the
illuminator?

Phil Noonan

Nov 26, 2010, 11:12:47 AM
to openk...@googlegroups.com
830 nm seems a bit high to be visible to the eye; maybe some scattering is shunting a small fraction of the light down to the just-visible range?

I too can see some red speckle when looking towards the illuminator, so maybe the filter is acting as a source of Rayleigh scatter? Can you still see red when the filter is not in place? I haven't found the courage to attack my Kinect with a screwdriver yet...

Phil Noonan

Nov 26, 2010, 11:20:50 AM
to openk...@googlegroups.com
Or more likely, the eye is sensitive to 830 nm, but only just, and you can still see it because of the sheer brightness of the laser source.

Also, I don't think I meant Rayleigh scattering - it's last thing on Friday though!