In many academic papers on the topic you'll see a pair of stereo
images being turned into a 3D model of the surroundings, but this
rarely works very well in practice because there may be correspondence
errors. Even if there are no correspondence errors there is still
insufficient information from a single stereo pair to identify the
ranges to surfaces precisely, except perhaps for very nearby objects.
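
To put a rough number on why only nearby ranges are precise, here is a
small sketch using the standard pinhole stereo relation (range = focal
length * baseline / disparity). The focal length, baseline and one pixel
matching error in the usage note are illustrative assumptions, not
values from any particular camera or from this project.

```python
def stereo_range(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Range to a surface from its disparity, for an ideal rectified stereo pair."""
    return focal_px * baseline_m / disparity_px


def range_error(focal_px: float, baseline_m: float, range_m: float,
                disparity_error_px: float = 1.0) -> float:
    """First-order range uncertainty caused by a small disparity error.

    Differentiating Z = f * B / d gives |dZ| ~= Z**2 / (f * B) * |dd|,
    so the uncertainty grows with the square of the range.
    """
    return range_m ** 2 / (focal_px * baseline_m) * disparity_error_px
```

With an assumed 500 pixel focal length and 0.1 m baseline, a one pixel
matching error is about 2 cm at 1 m range but about 0.5 m at 5 m range,
which is why a single pair says little about anything far away.
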
So context is important, but how much should be used? Two previous
frames? Or ten? The distributed particle SLAM algorithm solves this
problem in a nice way which does not contain any of the ad-hoc
assumptions so commonly encountered in the machine vision literature.
A series of observations is turned into an ancestry tree, with branches
of the tree being progressively pruned as new observations are made.
This essentially answers the question "what 3D structure most probably
explains the series of observations which I've made up until now?"
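
The tree idea can be sketched in a few lines of Python. This is only a
toy under loud assumptions, not the actual DP-SLAM implementation: the
pose is a single number along a corridor, the motion and sensor noise
values are made up, and the survivors are simply the highest weighted
hypotheses rather than a proper weighted resampling. All names here
(Node, step, prune and so on) are hypothetical.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, so list.remove() never recurses
class Node:
    pose: float                      # toy 1D position along a corridor
    weight: float                    # fit of this hypothesis to the latest observation
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def add_child(parent: Node, pose: float, weight: float) -> Node:
    child = Node(pose, weight, parent)
    parent.children.append(child)
    return child


def prune(node: Node) -> None:
    """Detach a dead leaf, then any ancestors that are left with no children."""
    while node.parent is not None and not node.children:
        parent = node.parent
        parent.children.remove(node)
        node.parent = None
        node = parent


def step(leaves: List[Node], observation: float, max_leaves: int,
         rng: random.Random, branching: int = 2) -> List[Node]:
    """Grow the tree by one observation and prune the least probable branches."""
    candidates = []
    for leaf in leaves:
        for _ in range(branching):
            # Assumed motion model: roughly one unit forward per step, plus noise.
            pose = leaf.pose + 1.0 + rng.gauss(0.0, 0.2)
            # Assumed sensor model: Gaussian likelihood of the observed position.
            weight = math.exp(-0.5 * ((observation - pose) / 0.3) ** 2)
            candidates.append(add_child(leaf, pose, weight))
    candidates.sort(key=lambda n: n.weight, reverse=True)
    for dead in candidates[max_leaves:]:
        prune(dead)
    return candidates[:max_leaves]


def path_to_root(node: Optional[Node]) -> List[Node]:
    """The chain of ancestors for one hypothesis, oldest first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]
```

Starting from a single root node and calling step() once per
observation keeps at most max_leaves live hypotheses, and calling
path_to_root() on any surviving leaf gives one candidate route. Whole
branches disappear as soon as none of their descendants survive.
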
At any point in time the robot inhabits a multiverse of possible
worlds, each very similar but slightly different. In the animation
below you can see how the robot's path through space (in this case
along a corridor) is represented as a tree structure. The branches
are possible worlds which the robot could be moving through. Over
time less probable explanations are pruned away, leaving the most
probable route.
http://sentience.googlegroups.com/web/dpslam_tree2.gif
You can also see how the 3D grid map is occasionally adjusted as the
robot in essence "decides" which world it has fallen into.
http://sentience.googlegroups.com/web/dpslam_map1.gif
This algorithm is very efficient and produces good quality maps whilst
simultaneously being able to resolve uncertainties. I think it will
turn out to be a very practical solution.