Active Stereo Without Pattern Projector


ICCV 2023


Luca Bartolomei
Matteo Poggi
Fabio Tosi
Andrea Conti
Stefano Mattoccia

University of Bologna

Virtual Pattern Projection for deep stereo. Whether in challenging outdoor (top) or indoor (bottom) environments (a), a stereo network such as PSMNet often struggles (b). By projecting a virtual pattern onto the images (c), the very same network dramatically improves its accuracy (d). Further training the model to deal with the augmented images (e) improves the results even more.




Abstract

"This paper proposes a novel framework integrating the principles of active stereo in standard passive cameras, yet in the absence of a physical pattern projector. Our methodology virtually projects a pattern over left and right images, according to sparse measurements obtained from a depth sensor. Any of such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environments overcoming the limitation of physical patterns, such as limited working range. Exhaustive experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks."



Method


1 - Problems

  1. Given a pair of stereo images, stereo algorithms try to solve the so-called "correspondence problem". This problem is not always easy: uniform areas, such as the wall shown in the figure, make it ambiguous. Furthermore, given the learning-based nature of stereo networks, they suffer when dealing with unseen scenarios. This latter problem is also known as domain shift.

    Performance on uniform areas. Even recent stereo networks struggle with uniform areas.

  2. Active stereo deals with ambiguous regions using a physical pattern projector, which aims to ease correspondence. However, a pattern projector is not feasible in outdoor scenarios, where ambient light cancels out the projected pattern. Furthermore, the projected light decreases proportionally to the square of the distance: consequently, active stereo cannot deal with long ranges.

    Performance of active pattern. Even in an indoor scenario, the projected light decreases proportionally to the square of the distance. As soon as we move to an outdoor environment, the projected pattern is dimmed by the ambient light.

2 - Proposal

  1. Inspired by active stereo, our technique, dubbed VPP, virtually projects a pattern onto the stereo image pair according to sparse depth measurements. We assume a calibrated setup composed of a stereo camera and a depth sensor appropriate for the target environment.

    Potential of our proposal. The previously shown network, even though trained only on synthetic data, dramatically improves its accuracy when coupled with our framework, even with few sparse points.


    Appropriate depth sensor choice. A LiDAR sensor is well suited for outdoor environments.

  2. Our proposal outperforms state-of-the-art fusion methods even with few depth points, e.g. 1% of the image pixels. Finally, as shown in the figure, the virtual pattern mitigates the domain shift issue without requiring any additional training procedure.


    Tackling domain shift. The previously shown network, trained on synthetic data, struggles (A) when dealing with a new environment. Fine-tuning the network on the target scenario (B) improves accuracy but requires annotated data. Using our framework in combination with CFNet tackles domain shift issues (C), even with a low amount (5%) of sparse points. The last image (D) shows the ground-truth disparity map.

3 - Virtual Pattern Projection

Framework overview. Given a vanilla stereo pair from a stereo camera and sparse depth points from a depth sensor, our framework virtually projects the depth seeds onto the stereo pair according to the system geometry and a patterning strategy.

As all fusion methods, our proposal relies on sparse depth seeds but, differently from them, we inject the sparse points directly into the images by means of a virtual pattern. For each known point, we convert depth into disparity and then apply a virtual pattern to each stereo image according to a patterning strategy.

$$I_L(x,y) \leftarrow \mathcal{A}(x,x',y)$$ $$I_R(x',y) \leftarrow \mathcal{A}(x,x',y)$$
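As a concrete illustration, here is a minimal NumPy sketch (not the authors' implementation) of this projection step with the random patterning strategy. It assumes grayscale float images, a sparse depth map with zeros at missing pixels, and hypothetical `fx`/`baseline` parameters; the right-view coordinate $x'$ is simply rounded here, while sub-pixel handling is covered by the splatting rule described below.

```python
import numpy as np

def vpp_random(img_l, img_r, depth, fx, baseline, rng=None):
    """Minimal sketch: project a random virtual pattern onto both views.

    img_l, img_r : (H, W) float grayscale images of a rectified pair
    depth        : (H, W) sparse depth map, 0 where no measurement is available
    fx, baseline : focal length (pixels) and stereo baseline (same unit as depth)
    """
    rng = np.random.default_rng() if rng is None else rng
    img_l, img_r = img_l.copy(), img_r.copy()
    h, w = depth.shape

    ys, xs = np.nonzero(depth > 0)                 # sparse depth seeds
    disps = fx * baseline / depth[ys, xs]          # depth -> disparity

    for x, y, d in zip(xs, ys, disps):
        x_r = int(round(x - d))                    # corresponding column x' in the right view
        if not (0 <= x_r < w):
            continue
        value = rng.integers(0, 256)               # A(x, x', y) ~ U(0, 255)
        img_l[y, x] = value                        # I_L(x, y)  <- A(x, x', y)
        img_r[y, x_r] = value                      # I_R(x', y) <- A(x, x', y)
    return img_l, img_r
```

The augmented pair can then be fed, unchanged, to any stereo matcher, e.g. `disparity = matcher(img_l_vpp, img_r_vpp)`.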

Our method does not require any change to, or assumption about, the stereo matcher: we treat it as a black-box model that takes as input a pair of rectified stereo images and produces a disparity map.

The augmented stereo pair is less affected by ambiguous regions and makes the stereo matcher's job easier. Since our framework alters only the stereo pair, any stereo matcher can benefit from it. Moreover, since disparity seeds can have sub-pixel accuracy, we use weighted splatting to avoid losing information.

$$I_R(\lfloor x'\rfloor,y) \leftarrow \beta I_R(\lfloor x'\rfloor,y) + (1-\beta) \mathcal{A}(x,x',y)$$ $$I_R(\lceil x'\rceil,y) \leftarrow (1-\beta) I_R(\lceil x'\rceil,y) + \beta \mathcal{A}(x,x',y)$$ $$\beta = x'-\lfloor x'\rfloor$$
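A minimal sketch of this splatting rule, under the same assumptions as the previous snippet (float images, hypothetical function name); on the left image the seed column $x$ is already integer, so splatting is only needed on the right view.

```python
import numpy as np

def splat_pattern(img_r, x_prime, y, value):
    """Weighted splatting of pattern `value` at sub-pixel column x' of row y.

    The two neighbouring pixels share the pattern proportionally to their
    distance from x', so a sub-pixel disparity seed is not rounded away.
    """
    x0 = int(np.floor(x_prime))                    # floor(x')
    x1 = int(np.ceil(x_prime))                     # ceil(x')
    if x0 < 0 or x1 >= img_r.shape[1]:
        return
    beta = x_prime - x0                            # beta = x' - floor(x')
    img_r[y, x0] = beta * img_r[y, x0] + (1 - beta) * value
    img_r[y, x1] = (1 - beta) * img_r[y, x1] + beta * value
```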

We propose different patterning strategies to enhance distinctiveness and ease correspondence, a patch-based approach to increase pattern visibility, alpha-blending with the original content to deal with adversarial effects on neural networks, a solution to the ambiguous projection of occluded points and, consequently, a heuristic to classify occluded sparse points. In particular:

  • We propose two different patterning strategies: i) a random patterning strategy and ii) a distinctiveness patterning strategy based on histogram analysis.

    1. Our random strategy is faster than the second one, but does not guarantee distinctiveness: for each known pair of corresponding points, a random pattern sampled from a uniform distribution is applied in each view.

      $$\mathcal{A}(x,x',y)\sim\mathcal{U}(0,255)$$
    2. Our histogram strategy analyses the neighbourhood of each corresponding point to guarantee local distinctiveness. For $(x,y)$ in the reference image, we consider two windows of length $L$ centered on it and on $(x',y)$ in the target image. Then, the histograms computed over the two windows are summed up and the operator $\mathcal{A}(x,x',y)$ picks the color maximizing the distance from any other color in the histogram, $\text{hdist}(i)$, with $\text{hdist}(i)$ returning the minimum distance from a filled bin in the sum histogram $\mathcal{H}$

      $$\text{hdist}(i) = \min\{\, |i-i_l|,\ |i-i_r| \,\},\quad i_l\in[0,i[\ :\ \mathcal{H}(i_l)>0,\quad i_r\in\,]i,255]\ :\ \mathcal{H}(i_r)>0$$
  • Instead of projecting onto two single corresponding pixels, a patch-based approach simply assumes the same disparity value for neighbouring pixels as well.

  • A virtual pattern might hinder a deep stereo model that is not used to dealing with it. Thus, we combine the original image content with the virtual pattern through alpha-blending.

    $$I_L(x,y) \leftarrow (1-\alpha) I_L(x,y) + \alpha \mathcal{A}(x,x',y)$$ $$I_R(x',y) \leftarrow (1-\alpha) I_R(x',y) + \alpha \mathcal{A}(x,x',y)$$
  • Occlusions are an intrinsic product of the stereo setup, as each view sees the scene from a different perspective. In particular, known depth seeds that are visible in the reference view but occluded in the target view can lead to ambiguities.

    In this example, the known point P is virtually projected according to a patterning strategy. However, as P is occluded in the target view, the virtual projection covers the original foreground content Q. Consequently, the foreground point no longer has a clear match. If we can correctly classify P as occluded, we can still exploit the potential of our proposal: instead of projecting a pattern that would cover the foreground content, we can virtually project the target content of Q onto the reference pixel of P. This solution eases the correspondence for P and does not interfere with the foreground pixel. We propose a simple heuristic to classify occluded sparse disparity points.

    The heuristic works in the target view and assumes at least one foreground sparse point in the neighborhood of the occluded sparse point. In this example, P and Q are known points that overlap in the target view, while in the reference view they are distinct. We aim to classify P as occluded.

    First, given the sparse disparity map obtained from the depth sensor, we warp each point at coordinate $(x, y)$ into an image-like grid $W$ at coordinate $(x',y)$. In this representation, sparse points are seen from the target view perspective: foreground and occluded points are now closer to each other. Each valid sparse point $W(x_o,y_o)$ is classified as occluded if the following inequality holds for at least one neighbor $W(x,y)$ within a 2D window of size $r_x \times r_y$.

    $$ W(x,y)-W(x_o,y_o) - \lambda (\gamma\lvert x-x_o \rvert + (1-\gamma) \lvert y-y_o \rvert ) > t$$

    This inequality simply thresholds the difference between the disparities of the two points, weighted by their coordinate distance. $\lambda$, $t$ and $\gamma$ are parameters that have to be tuned. Finally, points classified as occluded are warped back to the reference view to obtain an occlusion mask (see the sketch below).
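Below is a minimal NumPy sketch of this heuristic, not the authors' code: the function name and the default values of $\lambda$, $t$, $\gamma$, $r_x$, $r_y$ are placeholders, the input is a sparse disparity map in the reference view with zeros at missing pixels, and collisions of seeds warping to the same target pixel are resolved by the last write.

```python
import numpy as np

def classify_occluded_seeds(disp, lam=1.0, t=1.0, gamma=0.5, rx=7, ry=3):
    """Sketch of the occlusion heuristic (placeholder parameter values).

    disp : (H, W) sparse disparity map in the reference view, 0 = no seed.
    Returns a boolean mask marking the reference-view seeds classified as occluded.
    """
    h, w = disp.shape
    warped = np.zeros((h, w))                      # W: seeds warped to the target view
    src = -np.ones((h, w, 2), dtype=int)           # reference coordinates, to warp back

    ys, xs = np.nonzero(disp > 0)
    for x, y in zip(xs, ys):
        xt = int(round(x - disp[y, x]))            # target-view column x'
        if 0 <= xt < w:
            warped[y, xt] = disp[y, x]             # collisions: last write wins (sketch)
            src[y, xt] = (y, x)

    occluded = np.zeros((h, w), dtype=bool)
    yos, xos = np.nonzero(warped > 0)
    for xo, yo in zip(xos, yos):
        for y in range(max(0, yo - ry), min(h, yo + ry + 1)):
            for x in range(max(0, xo - rx), min(w, xo + rx + 1)):
                if warped[y, x] == 0 or (x == xo and y == yo):
                    continue
                coord_dist = gamma * abs(x - xo) + (1 - gamma) * abs(y - yo)
                # W(x,y) - W(xo,yo) - lambda*(gamma*|x-xo| + (1-gamma)*|y-yo|) > t
                if warped[y, x] - warped[yo, xo] - lam * coord_dist > t:
                    yr, xr = src[yo, xo]           # warp the occluded seed back
                    occluded[yr, xr] = True
    return occluded
```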




Video




Experimental Results

Performance versus Competitors

Plots show the error rate on the Middlebury 2014 training split, varying the density of sparse hint points from 0% to 5%. We compare two stereo networks coupled with our VPP framework against:

  • The very same two networks in conjunction with Guided Stereo Matching framework
  • LidarStereoNet
  • CCVNorm

As shown in the figures, VPP generally reaches near-optimal performance with only 1% depth density. Except for a few cases in the training configurations at higher densities, VPP achieves much lower error rates than its competitors.


VPP with off-the-shelf networks

These last plots show the effectiveness of our technique using off-the-shelf networks (i.e., HSMNet, CFNet, CREStereo, RAFT-Stereo) on four different datasets.

Even without any additional training, our framework boosts the accuracy of any model, with rare exceptions.



BibTeX

@InProceedings{Bartolomei_2023_ICCV,
		    author    = {Bartolomei, Luca and Poggi, Matteo and Tosi, Fabio and Conti, Andrea and Mattoccia, Stefano},
		    title     = {Active Stereo Without Pattern Projector},
		    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
		    month     = {October},
		    year      = {2023},
		    pages     = {18470-18482}
		}