I've been looking at how robotic stereo vision works, and I think I've got the hang of it.
The eyes are a fixed, known distance apart (the baseline). The incoming images are analyzed for similarities, and then one image is superimposed on the other. The offset between a matched point in one image and the same point in the other (the disparity) closes a triangle: the length of the base (the distance between the eyes) and the viewing angles are known, therefore you can judge the distance to the object in focus. The art lies in recognizing the similarities in the first place, from what I understand.
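If it helps, here's the triangle math in code form. This is just a sketch under the usual simplifying assumption of two identical, parallel pinhole cameras; depth_from_disparity and all its numbers are mine for illustration, not from any real rig:

```python
# Minimal triangulation sketch, assuming a simple pinhole model with
# two parallel cameras. All names and numbers here are invented.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Similar triangles: Z / baseline = focal / disparity, so
    Z = focal * baseline / disparity. Units: focal length in pixels,
    baseline in meters, disparity in pixels, depth in meters."""
    if disparity_px <= 0:
        return float("inf")  # no measurable offset: point is "at infinity"
    return focal_px * baseline_m / disparity_px

# Example: 700 px focal length, eyes 10 cm apart, matched point shifted
# 20 px between the two images -> object is 3.5 m away.
print(depth_from_disparity(700, 0.10, 20))  # 3.5
```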
I think a neural net would certainly let you optimize border recognition, and that certain levels of camera contrast would allow greater precision. The idea, I think, is to map the immediate environment in 3D from the robot's perspective. If it has already encountered an object, it should already have a model of it in its database, and could tag a certain area of the screen as "object_Beer_Bottle047". Then, instead of actively scanning the bottle itself, it removes that area from active analysis (skips the particular overlaps corresponding to the known object) until the bottle is either removed from the vicinity or changed enough to warrant a new situation (such as falling off a table and breaking).
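Here's a rough sketch of what that "skip the known object" bookkeeping might look like. The tag comes from above, but the mask function, the box format, and everything else here are assumptions for illustration, not a real vision API:

```python
# Sketch: mask out screen regions already tagged as known objects so
# the stereo matcher skips them. All names here are hypothetical.

import numpy as np

def active_region_mask(frame_shape, tracked_objects):
    """Return a boolean mask that is True where the matcher should
    still do work, False over regions tagged as known, unchanged
    objects. tracked_objects maps tag -> (x, y, w, h) screen box."""
    mask = np.ones(frame_shape[:2], dtype=bool)
    for tag, (x, y, w, h) in tracked_objects.items():
        mask[y:y + h, x:x + w] = False  # skip this object's screen area
    return mask

# e.g. the bottle occupies a 40x120 box at (200, 150) on a 480x640 frame
tracked = {"object_Beer_Bottle047": (200, 150, 40, 120)}
mask = active_region_mask((480, 640), tracked)
```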
So, a recap... two images are fed to the analyzing program. One is held static; the other is shifted until similarities can be aligned. The program places both side by side, does a first-pass 'find similarities' routine, and notes the position of each similarity on the fixed image. The mobile picture is then superimposed onto the static one, pixel by pixel, each pixel representing a level of 3D resolution. When similarities line up, their positions are noted, providing an x, y, z coordinate relative to the bot's eyes. When similarities share certain parameters (lighting, color, position, a shared shadow, etc.), the coordinates noted during the stereo analysis are combined into a 3D model of the perceived object.
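That slide-until-aligned routine is essentially what the vision folks call block matching, and OpenCV ships a ready-made one, so you don't have to write the search loop yourself. A minimal sketch, assuming two already-rectified grayscale images of the same size (the file names are placeholders):

```python
# Minimal block-matching sketch using OpenCV (cv2). Assumes the two
# images are already rectified (rows aligned) and grayscale.

import cv2

left_gray = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right_gray = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# The matcher slides a 15x15 window from one image along the matching
# row of the other and records, per pixel, the shift (disparity) that
# lines the windows up best.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left_gray, right_gray)

# disparity is now a per-pixel depth cue (OpenCV stores it fixed-point,
# scaled by 16): bigger shift = closer object.
```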
Using more cameras would improve border recognition and increase depth resolution, I think, by providing more angles of analysis. Since every pair of cameras is a stereo pair, 4 cameras would provide 6 angles of perception and 5 would give 10 (it's n·(n−1)/2; quick check below). I'm sure at some point the power, processing, and cost requirements make it pretty unworkable to include lots of 'eyes,' but on the other hand, with a bunch of cheap cameras you could get really good resolution for realtime stereoscopic vision. Also, having more than 2 opens up the possibility of multiple focus points, since a bot has no inherent 'train of thought' limitation. It can handle multicameral thought as easily as the programmer can create a thread.
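Quick sanity check on those counts, since each camera pair is one stereo baseline and it's just n choose 2:

```python
# n cameras give n*(n-1)/2 stereo pairs.

from itertools import combinations

for n in (2, 3, 4, 5, 6):
    pairs = list(combinations(range(n), 2))
    print(n, "cameras ->", len(pairs), "pairs")
# 2 -> 1, 3 -> 3, 4 -> 6, 5 -> 10, 6 -> 15
```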
Anyway, 3D stereo vision is limited only by processor speed and camera resolution, and the resolution side can be attacked with multiple low-res cameras, which has the added benefit of increased z resolution. I know there have to be really cheap cameras out there, because they're ubiquitous in cell phones, and I can buy a webcam for $10 at TigerDirect.
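To put a number on the z-resolution benefit: under the same pinhole assumptions as the triangulation sketch above, depth error grows roughly as dZ ≈ Z² · Δd / (f · B), where Δd is the disparity error in pixels, so spreading cheap cameras out (a longer baseline B) tightens depth directly. A toy comparison, with all numbers invented:

```python
# Rough depth-error rule of thumb under the pinhole assumptions above:
# dZ ~= Z**2 * dd / (f * B), where dd is disparity error in pixels.

def depth_error(Z_m, focal_px, baseline_m, disparity_err_px=1.0):
    return (Z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

# Same scene, wider baseline -> better z resolution:
print(depth_error(3.0, 700, 0.10))  # ~0.13 m error at a 10 cm baseline
print(depth_error(3.0, 700, 0.50))  # ~0.026 m error at a 50 cm baseline
```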
So machine vision, in theory, shouldn't be too great a hurdle. Obviously 1" resolution at up to 6 kilometers isn't gonna be within reasonable bot expectations (unless the army supply truck tips over on the highway near my house and I happen to be driving by, eh heh). I think 0.5 cm resolution at up to 40 feet should be very reasonable, and the closer something gets, the better the detail. Also, the cameras don't have to be the same resolution; they just have to have a known position, and the images have to be the same size.
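Running the numbers on that spec with the same rule of thumb (the focal length and sub-pixel accuracy below are assumptions I'm plugging in, not measurements):

```python
# Back-of-envelope: what baseline does 0.5 cm depth resolution at 40 ft
# demand? Rearranging dZ = Z**2 * dd / (f * B) gives
# B = Z**2 * dd / (f * dZ). Focal length and sub-pixel accuracy here
# are assumed values, not from any real camera.

Z = 40 * 0.3048          # 40 ft in meters (~12.19 m)
dZ = 0.005               # target: 0.5 cm depth resolution
f = 3000                 # focal length in pixels (a fairly sharp camera)
dd = 0.1                 # disparity accuracy with sub-pixel matching

B = Z ** 2 * dd / (f * dZ)
print(round(B, 2), "m baseline needed")  # ~0.99 m
```

So under those assumptions, 0.5 cm at 40 feet wants roughly a meter of baseline plus decent sub-pixel matching. The good news is the requirement falls off with Z², which is exactly the "closer is better" effect.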