There is debate over which approach works best and which has the best performance-to-cost ratio.
Elon Musk of Tesla continues to say that you ONLY need stereo vision; his proof is that humans can drive cars and walk on sidewalks using only stereo vision.
The counterargument is that the other companies all use multiple 3D LIDAR sensors, and their cars drive much better than any Tesla does.
So my opinion is that Elon Musk's theory is correct: stereo vision should be all that is needed. But in practical terms, vision processing software is not yet good enough, LIDAR data is much easier to process, and in 2021 LIDAR gives better results.
Again, my opinion.
1) Depth cameras are a really good compromise.
2) No sensor can work if its data is processed one frame at a time. MANY frames of data must be accumulated to build up an internal model of the world. So you are not going to detect a curb in a single noisy frame of data. You need to watch for some number of frames and build up an understanding of the local environment. Humans, dogs and self-driving cars all do this. So do not expect to ever be able to buy a sensor that will detect every kind of object; sensors provide raw data, and detection requires UNDERSTANDING the data.
With one frame of data you would not be able to tell the difference between a wall attached to a building and a bus that was about to run you down. You have to watch it for half a second and see if it is moving, how fast, and in what direction, and only then can you say "it is a curb because it is low and moving toward me at the same speed I am moving forward".
It is critical to know your own speed and heading, or else EVERYTHING looks like a moving object heading at you on a near collision course. Not until you subtract out your own motion can you see that it is a fixed object.
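To make that concrete, here is a rough sketch (my own toy example, not anyone's real code) of how you can decide whether something is fixed or moving: collect a short history of where the object appeared, fit its apparent velocity over those frames, then subtract your own velocity. The function name, the 0.3 m/s threshold and the flat 2D robot-frame setup are all just assumptions for the illustration, and it ignores rotation entirely.

```python
import numpy as np

def classify_track(positions, ego_velocity, dt, speed_threshold=0.3):
    """Decide whether an observed object is fixed or moving.

    positions: (N, 2) history of the object's (x, y) in the robot frame, metres
    ego_velocity: (vx, vy) of the robot itself, metres/second
    dt: time between frames, seconds
    speed_threshold: residual speed below which we call the object fixed
    """
    positions = np.asarray(positions, dtype=float)
    t = np.arange(len(positions)) * dt
    # Fit the apparent velocity over the whole window -- this is the
    # "accumulate many frames" step that smooths out single-frame noise.
    vx = np.polyfit(t, positions[:, 0], 1)[0]
    vy = np.polyfit(t, positions[:, 1], 1)[0]
    # In the robot frame a fixed object appears to move at minus the robot's
    # own velocity, so adding our velocity back recovers its true motion
    # (rotation is ignored to keep the sketch short).
    world_v = np.array([vx, vy]) + np.asarray(ego_velocity, dtype=float)
    return "fixed" if np.linalg.norm(world_v) < speed_threshold else "moving"

# Robot driving forward at 1 m/s; a point 3 m ahead appears to approach at
# 1 m/s over 5 frames, so after subtracting ego motion it comes out "fixed".
history = [(3.0 - 0.1 * i, 0.0) for i in range(5)]
print(classify_track(history, ego_velocity=(1.0, 0.0), dt=0.1))
```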
All of the engineering is in the processing. The argument for lidar over vision is that it is easier to process lidar. My argument is that a depth camera is very much like a lidar but cheaper and works well at shorter distances.
My guess -- you are going to need a depth camera and a powerful, trained neural network. The network converts pixels and depth into a list of objects, and those objects then go to a tracker. Your motion planning must reference the list of tracked objects. This is not a job for a small microcontroller.
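To show what I mean by pixels -> objects -> tracker -> planner, here is a toy sketch of the tracker stage. The NearestNeighbourTracker class and its match_radius value are made up for this example; a real tracker would use a Kalman filter and proper data association, but the data flow is the same: the network hands the tracker labelled detections every frame, the tracker maintains object identity over time, and motion planning reads the tracks, not the raw pixels.

```python
import math
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    x: float          # metres, in the robot frame
    y: float
    label: str        # class name from the network, e.g. "curb" or "person"
    age: int = 1      # how many frames this object has been seen

class NearestNeighbourTracker:
    """Tiny tracker: match each detection to the closest existing track,
    otherwise start a new track. Real systems use Kalman filters and better
    data association, but the overall data flow is the same."""

    def __init__(self, match_radius=0.5):
        self.match_radius = match_radius   # metres; made-up value
        self.tracks = []
        self._next_id = 0

    def update(self, detections):
        """detections: list of (x, y, label) tuples from the neural network."""
        for dx, dy, label in detections:
            best, best_dist = None, self.match_radius
            for t in self.tracks:
                dist = math.hypot(t.x - dx, t.y - dy)
                if dist < best_dist:
                    best, best_dist = t, dist
            if best is not None:
                best.x, best.y, best.age = dx, dy, best.age + 1
            else:
                self.tracks.append(Track(self._next_id, dx, dy, label))
                self._next_id += 1
        return self.tracks

# Each frame: run the network, update the tracker, plan against the tracks.
tracker = NearestNeighbourTracker()
tracks = tracker.update([(1.2, 0.3, "curb"), (4.0, -1.0, "person")])
print(tracks)
```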