Robust depth perception through Virtual Pattern Projection


CVPR 2024 DEMO

Arch 4AB (near the exhibition booth) Booth 13 - Thursday, June 20 10:30 am PDT - 6:45 pm PDT


Luca Bartolomei
Matteo Poggi
Fabio Tosi
Andrea Conti
Stefano Mattoccia

University of Bologna
ICCV23 Paper Journal Paper Video Code Flyer


Our prototype at CVPR 2024. We built a functional prototype composed of an OAK-D Lite stereo camera with a built-in stereo matching algorithm and an Intel RealSense L515 LiDAR as the sparse depth sensor. The OAK-D Lite and the L515 are rigidly fixed to a handmade aluminium support to guarantee a stable calibration over time.



Abstract

"The demo aims to showcase a novel matching paradigm, proposed at ICCV 2023, based on projecting virtual patterns onto conventional stereo pairs according to the sparse depth points gathered by a depth sensor to achieve robust and dense depth estimation at the resolution of the input images. Such a virtual hallucination strategy can be seamlessly used with any algorithm or deep network for stereo correspondence, dramatically improving the performance of the baseline methods. In contrast to active stereo systems based on a conventional pattern projector (e.g., Intel Realsense or OAK-D Pro stereo cameras), our proposal acts on the vanilla RGB images, is effective at any distance, even with sunlight, and does not require additional IR/RGB cameras nor a physical IR pattern projector. Moreover, the virtual projection paradigm can be used even for other tasks, such as depth completion, as proposed at 3DV 2024. We will showcase to the CVPR community how flexible and effective the virtual pattern projection paradigm is through a real-time demo based on off-the-shelf cameras and depth sensors. Specifically, we will demonstrate the advantage yielded by our proposal for stereo, through a live real-time session on a Jetson Nano connected to a commercial stereo camera (i.e., OAK-D camera) and an depth sensor (i.e., Realsense LiDAR L515). This setup will also allow us to explain thoroughly, with a live demonstration, the principle and the outcome of our virtual hallucination strategy on the vanilla images acquired by the camera in the task mentioned above. "



The Prototype

Calibration of the sensors. We calibrated the setup with a standard chessboard calibration algorithm, estimating the rigid transformation between the IR camera of the L515 and the left camera of the OAK-D Lite.​ The L515 and the OAK-D share the clock of the host machine and are time-synchronized using a nearest-timestamp algorithm (see the sketch below). Kudos to bachelor student Nicole Ferrari for the algorithm implementation.​
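A minimal sketch of nearest-timestamp matching, assuming each device yields frames tagged with host-clock timestamps in seconds; the function name and data layout are illustrative, not the actual demo code.

```python
import numpy as np

def nearest_timestamp_pairs(stereo_ts, lidar_ts, max_gap=0.05):
    """Pair each stereo frame with the LiDAR frame whose host timestamp is closest.

    stereo_ts, lidar_ts: 1-D sequences of host-clock timestamps (seconds).
    max_gap: discard pairs farther apart than this threshold (seconds).
    Returns a list of (stereo_index, lidar_index) tuples.
    """
    lidar_ts = np.asarray(lidar_ts)
    pairs = []
    for i, t in enumerate(stereo_ts):
        j = int(np.argmin(np.abs(lidar_ts - t)))   # closest LiDAR frame in time
        if abs(lidar_ts[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs
```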

Reprojection of sparse depth points. First, the Intel RealSense L515 captures sparse depth points with high confidence.​ The resulting sparse depth map is then reprojected from the L515 reference frame to the OAK-D left camera using the rigid transformation previously estimated with the chessboard calibration.​
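A minimal sketch of this reprojection step, assuming known pinhole intrinsics for both sensors and a LiDAR-to-left-camera rigid transform from the calibration; names and the simple z-buffering are illustrative, not the actual demo code.

```python
import numpy as np

def reproject_sparse_depth(depth, K_lidar, K_left, R, t, out_shape):
    """Reproject a sparse depth map from the LiDAR (IR) camera to the stereo left camera.

    depth: HxW sparse depth map in meters (0 where invalid), in the LiDAR frame.
    K_lidar, K_left: 3x3 intrinsics; R, t: LiDAR-to-left rigid transformation.
    out_shape: (H, W) of the left image. Returns a sparse depth map in the left frame.
    """
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    # Back-project valid pixels to 3D points in the LiDAR frame
    pts = np.linalg.inv(K_lidar) @ np.vstack([u * z, v * z, z])
    # Move the points into the left-camera frame and project them
    pts_left = R @ pts + t.reshape(3, 1)
    proj = K_left @ pts_left
    x = np.round(proj[0] / proj[2]).astype(int)
    y = np.round(proj[1] / proj[2]).astype(int)
    z_left = pts_left[2]
    out = np.zeros(out_shape, dtype=np.float32)
    valid = (x >= 0) & (x < out_shape[1]) & (y >= 0) & (y < out_shape[0]) & (z_left > 0)
    for xi, yi, zi in zip(x[valid], y[valid], z_left[valid]):
        if out[yi, xi] == 0 or zi < out[yi, xi]:   # keep the nearest point on collisions
            out[yi, xi] = zi
    return out
```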

Robust and vanilla depth estimation. The vanilla pair of stereo images is fed into the OAK-D StereoDepth module to obtain a vanilla depth estimate. The vanilla pair is also fed into our VPP module, which produces two enhanced stereo images in a fraction of a second.​ These VPP images are fed again into the OAK-D stereo algorithm to obtain a robust depth estimate. Our graphical interface shows the benefit of our proposal over vanilla passive stereo in critical areas such as textureless regions.​
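A minimal sketch of the two-pass flow just described, treating both the OAK-D StereoDepth block and our VPP module as black-box callables; the function names and signatures below are illustrative placeholders, not the actual demo or DepthAI API.

```python
def demo_step(left, right, sparse_depth_left, stereo_matcher, vpp):
    """One demo frame: vanilla vs. VPP-enhanced depth from the same stereo matcher.

    left, right: rectified stereo images; sparse_depth_left: LiDAR points
    reprojected onto the left view; stereo_matcher, vpp: black-box callables.
    """
    vanilla_depth = stereo_matcher(left, right)                 # baseline passive stereo
    left_vpp, right_vpp = vpp(left, right, sparse_depth_left)   # hallucinate virtual patterns
    robust_depth = stereo_matcher(left_vpp, right_vpp)          # same matcher, patterned inputs
    return vanilla_depth, robust_depth
```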



Video




Virtual Pattern Projection (VPP)


Virtual Pattern Projection for deep stereo. In challenging outdoor (top) or indoor (bottom) environments (a), stereo networks such as RAFT-Stereo (top) or PSMNet (bottom) struggle (b). By projecting a virtual pattern onto the images (c), the same networks dramatically improve their accuracy without retraining (d). Training the models to deal with the augmented images (e) further reduces errors.


1 - Problems

  1. Given a pair of stereo images, stereo algorithms try to solve the so-called "correspondence problem". This problem is not always easy: uniform areas, such as the wall shown in the figure, make matching ambiguous. Furthermore, given the learning-based nature of stereo networks, they suffer when dealing with unseen scenarios. This latter problem is also known as domain shift.

    Performance on uniform areas. Even recent stereo networks struggle with uniform areas.

  2. Active stereo deals with ambiguous regions using a physical pattern projector that eases correspondence. However, a pattern projector is not feasible in outdoor scenarios, where ambient light cancels out the projected pattern. Furthermore, the projected light decreases proportionally to the square of the distance; consequently, active stereo cannot deal with long ranges.

    Performance of an active pattern. Even in an indoor scenario, the projected light decreases proportionally to the square of the distance. As soon as we move to an outdoor environment, the projected pattern is washed out by ambient light.

2 - Proposal

  1. Inspired by active stereo, our technique, dubbed VPP, virtually projects a pattern onto the stereo image pair according to sparse depth measurements. We assume a calibrated setup composed of a stereo camera and a depth sensor appropriate for the target environment.

    Potential of our proposal. The previously shown network, even if trained only on synthetic data, dramatically improves its accuracy when coupled with our framework, even with few sparse points.


    Appropriate depth sensor choice. A LiDAR sensor is well suited for outdoor environments.

  2. Our proposal outperforms state-of-the-art fusion methods even with few depth points, e.g., 1% of the image pixels. Finally, as shown in the figure, the virtual pattern mitigates the domain shift issue without requiring any additional training procedure.


    Tackling domain shift. The previously shown network, trained on synthetic data, struggles (A) when dealing with a new environment. Fine-tuning the network on the target scenario (B) improves accuracy but requires annotated data. Using our framework in combination with CFNet tackles the domain shift issue (C), even with a low amount (5%) of sparse points. The last figure (D) shows the ground-truth disparity map.

3 - Proposed Framework (VPP)

Framework overview. Given a vanilla stereo pair from a stereo camera and sparse depth points from a depth sensor, our framework virtually projects the depth seeds onto the stereo pair according to the system geometry and a patterning strategy.

As with all fusion methods, our proposal relies on sparse depth seeds but, differently from them, we inject the sparse points directly into the images as a virtual pattern. For each known point, we convert its depth into disparity and then apply a virtual pattern to each stereo image according to a patterning strategy.

$$I_L(x,y) \leftarrow \mathcal{A}(x,x',y)$$ $$I_R(x',y) \leftarrow \mathcal{A}(x,x',y)$$
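A minimal sketch of the depth-to-disparity conversion and the basic (integer-coordinate) projection above, assuming a rectified pair with focal length `f` (pixels) and baseline `B` (meters); the patterning operator is passed in as a callable and all names are illustrative.

```python
import numpy as np

def project_virtual_pattern(left, right, sparse_depth, f, B, pattern_op):
    """Apply a virtual pattern at corresponding pixels (integer-disparity version).

    left, right: grayscale float images (H, W); sparse_depth: depth seeds on the
    left view in meters (0 where invalid); pattern_op(x, xp, y): returns the
    intensity A(x, x', y) chosen by the patterning strategy.
    """
    H, W = left.shape
    ys, xs = np.nonzero(sparse_depth > 0)
    for x, y in zip(xs, ys):
        d = f * B / sparse_depth[y, x]      # depth -> disparity
        xp = int(round(x - d))              # corresponding column in the right image
        if 0 <= xp < W:
            a = pattern_op(x, xp, y)
            left[y, x] = a                  # I_L(x, y)  <- A(x, x', y)
            right[y, xp] = a                # I_R(x', y) <- A(x, x', y)
    return left, right
```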

Our method does not require any change to or assumption about the stereo matcher: we treat it as a black-box model that takes in input a pair of rectified stereo images and produces a disparity map.

The augmented stereo pair is less affected by ambiguous regions and makes the stereo matcher's job easier. Since our framework alters only the stereo pair, any stereo matcher can benefit from it. As disparity seeds can have sub-pixel accuracy, we use weighted splatting to avoid losing information.

$$I_R(\lfloor x'\rfloor,y) \leftarrow \beta I_R(\lfloor x'\rfloor,y) + (1-\beta) \mathcal{A}(x,x',y)$$ $$I_R(\lceil x'\rceil,y) \leftarrow (1-\beta) I_R(\lceil x'\rceil,y) + \beta \mathcal{A}(x,x',y)$$ $$\beta = x'-\lfloor x'\rfloor$$
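A minimal sketch of the sub-pixel splatting above on the right image, following the weighting defined by $\beta$; names and data layout are illustrative.

```python
import math

def splat_right(right, xp, y, a):
    """Splat the pattern intensity a = A(x, x', y) onto the two pixels bracketing x'.

    right: grayscale float image (H, W); xp: sub-pixel column x'; y: valid row index.
    """
    x0, x1 = math.floor(xp), math.ceil(xp)
    beta = xp - x0                                         # beta = x' - floor(x')
    if 0 <= x0 < right.shape[1]:
        right[y, x0] = beta * right[y, x0] + (1 - beta) * a
    if x0 != x1 and 0 <= x1 < right.shape[1]:
        right[y, x1] = (1 - beta) * right[y, x1] + beta * a
    return right
```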

We propose different patterning strategies to enhance distinctiveness and ease correspondence, a patch-based approach to increase pattern visibility, alpha-blending with the original content to deal with adversarial effects on neural networks, a solution to ambiguous projections on occluded points and, consequently, a heuristic to classify occluded sparse points. In particular:

  • We propose two different patterning strategies: i) a random patterning strategy and ii) a distinctiveness patterning strategy based on histogram analysis.

    1. Our random strategy is faster than the second one, but it does not guarantee distinctiveness: for each known pair of corresponding points, a random pattern sampled from a uniform distribution is applied in both views.

      $$\mathcal{A}(x,x',y)\sim\mathcal{U}(0,255)$$
    2. Our histogram strategy analyses the neighbourhood of each corresponding point to guarantee local distinctiveness. For $(x,y)$ in the reference image, we consider two windows of length $L$, centered on it and on $(x',y)$ in the target image. The histograms computed over the two windows are then summed, and the operator $\mathcal{A}(x,x',y)$ picks the color $i$ maximizing $\text{hdist}(i)$, with $\text{hdist}(i)$ returning the minimum distance of $i$ from a filled bin in the sum histogram $\mathcal{H}$:

      $$\text{hdist}(i) = \min\big\{\, |i-i_l|,\ |i-i_r| \ :\ i_l\in[0,i[\,,\ \mathcal{H}(i_l)>0,\ i_r\in\,]i,255],\ \mathcal{H}(i_r)>0 \,\big\}$$
  • Instead of projecting onto two single corresponding pixels, a patch-based approach simply assumes the same disparity value also for neighbouring pixels.

  • The patch can be shaped to fit the context using two strategies: a distance-based patch size and an RGB-guided adaptive patch. The former accounts for the fact that objects appear smaller in the images at farther distances: a large virtual patch might cover most of a small object when it is far from the camera, preventing the stereo algorithm from inferring its possibly different disparity values. Considering a disparity hint $d(x,y)$, a disparity search range $\left[ D_\text{min}, D_\text{max} \right]$, and a maximum patch size $N_\text{max}$, we adjust the patch size $N(x,y)$ as follows:

    $$N(x,y) = \text{round}\left( \left(\frac{d(x,y)-D_\text{min}}{D_\text{max}-D_\text{min}}\right)^{\frac{1}{\phi}} \cdot \left( N_\text{max}-1 \right) + 1 \right)$$

    where $\phi$ models the mapping curve and $D_\text{min}, D_\text{max}$ are given respectively by the nearest and the farthest hint in the frame.

    Regardless of the distance from the camera, a fixed patch shape becomes problematic near depth discontinuities and with thin objects. As detailed next, following Bartolomei et al., we adapt the patch shape to the reference image content to address this challenge. Given a local window $\mathcal{N}(x,y)$ centered on the disparity hint $d(x,y)$, for each pixel $(u,v)\in\mathcal{N}(x,y)$ we estimate the spatial $S(x,y,u,v)=(x-u)^2+(y-v)^2$ and color $C(x,y,u,v)=\left|I_L(u,v)-I_L(x,y)\right|$ agreement with the reference hint point $(x,y)$ and feed them into the model $W_\text{c}(x,y,u,v)$:

    $$W_\text{c}(x,y,u,v)=\exp\left(\frac{S(x,y,u,v)}{-2\sigma_s^2}+\frac{C(x,y,u,v)}{-2\sigma_c^2}\right)$$

    where $\sigma_s$ and $\sigma_c$ control the impact of the spatial and color contributions, as in a bilateral filter. For each pixel $(u,v)$ within a patch centred on $(x,y)$, we apply the virtual projection only if $W_\text{c}(x,y,u,v)$ exceeds a threshold $t_w$. Additionally, we store the $W_\text{c}(x,y,u,v)$ values in a proper data structure to handle pixels shared by two or more overlapping patches -- i.e., we perform the virtual projection only for the candidate with the highest score.

  • A virtual pattern might hinder a deep stereo model that is not used to dealing with it. Thus, we combine the original image content with the virtual pattern through alpha-blending.

    $$I_L(x,y) \leftarrow (1-\alpha) I_L(x,y) + \alpha \mathcal{A}(x,x',y)$$ $$I_R(x',y) \leftarrow (1-\alpha) I_R(x',y) + \alpha \mathcal{A}(x,x',y)$$
  • Occlusions are an intrinsic product of a stereo system, as each view sees the scene from a different perspective. In particular, known depth seeds that are visible in the reference view but occluded in the target view could lead to ambiguities.

    In this example, the known point P is virtually projected according to a patterning strategy. However, since P is occluded in the target view, the virtual projection covers the original foreground content Q. Consequently, the foreground point no longer has a clear match. If we correctly classify P as occluded, we can still exploit the potential of our proposal: instead of projecting a pattern that would cover foreground content, we virtually project the target content of Q onto the reference pixel of P. This solution eases the correspondence for P and does not interfere with the foreground pixel. We propose a simple heuristic to classify occluded sparse disparity points.

    Considering the target view, the heuristic assumes at least one foreground sparse point in the neighbourhood of the occluded sparse point. In this example, P and Q are known points that overlap in the target view, while in the reference view they are distinct. We aim to classify P as occluded.

    First, given a sparse disparity map obtained from the depth sensor, we warp each point at coordinate $(x, y)$ into an image-like grid $W$ at coordinate $(x',y)$. In this representation the sparse points are seen from the target view perspective: foreground and occluded points are now close to each other. Each valid sparse point $W(x_o,y_o)$ is classified as occluded if the following inequality holds for at least one neighbour $W(x,y)$ within a 2D window of size $r_x \times r_y$.

    $$ W(x,y)-W(x_o,y_o) - \lambda (\gamma\lvert x-x_o \rvert + (1-\gamma) \lvert y-y_o \rvert ) > t$$

    This inequality simply thresholds the difference between the disparities of the two points, weighted according to their coordinate distance. $\lambda$, $t$ and $\gamma$ are parameters to be tuned. Finally, the points classified as occluded are warped back to obtain an occlusion mask (see the sketch below).
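A minimal sketch of this occlusion heuristic, assuming the sparse disparities have already been warped into the target-view grid $W$ (0 where no point landed) and interpreting $r_x \times r_y$ as half-sizes of the search window; parameter names follow the inequality above and the implementation details are illustrative.

```python
import numpy as np

def classify_occluded(W, rx, ry, lam, gamma, t):
    """Return a boolean mask marking sparse points of W (target-view grid) as occluded.

    W: HxW grid of warped disparities (0 where empty).
    rx, ry: half-sizes of the search window; lam, gamma, t: heuristic parameters.
    """
    H, Wd = W.shape
    occluded = np.zeros_like(W, dtype=bool)
    ys, xs = np.nonzero(W > 0)
    for yo, xo in zip(ys, xs):
        for y in range(max(0, yo - ry), min(H, yo + ry + 1)):
            for x in range(max(0, xo - rx), min(Wd, xo + rx + 1)):
                if W[y, x] <= 0 or (y == yo and x == xo):
                    continue
                # A sufficiently larger (closer) neighbouring disparity, discounted by
                # its coordinate distance, marks (xo, yo) as occluded.
                dist = gamma * abs(x - xo) + (1 - gamma) * abs(y - yo)
                if W[y, x] - W[yo, xo] - lam * dist > t:
                    occluded[yo, xo] = True
                    break
            if occluded[yo, xo]:
                break
    return occluded
```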

VPP in action. Hallucinated stereo pair using pattern (vi); zoomed-in view with FGD-Projection (a), corresponding area (b) with sub-pixel splatting and left border occlusion projection (c). Adaptive patches guarantee the preservation of depth discontinuities and thin details (d).




Experimental Results

Performance versus Competitors

The plots show the error rate on the Middlebury 2014 training split, varying the density of sparse hint points from 0% to 5%. We compare two stereo networks paired with our VPP framework against:

  • The very same two networks in conjunction with Guided Stereo Matching framework
  • LidarStereoNet
  • CCVNorm

As shown in the figures, VPP generally reaches almost optimal performance with only 1% depth density. Except for a few cases in the training configurations at higher densities, VPP achieves much lower error rates.


VPP with off-the-shelf networks

These plots show the effectiveness of our technique using off-the-shelf networks (i.e., HSMNet, CFNet, CREStereo, RAFT-Stereo) on four different datasets:

Even without any additional training, our framework boosts the accuracy of any model, with rare exceptions.


Additional Evaluation with Raw Depth Data

Qualitative result on DSEC night split. Passive stereo networks (e.g., LEAStereo and PCWNet) often fail when dealing with low ambient light, while VPP allows perceiving even challenging objects, such as the far background walls.

Qualitative result on M3ED day split. Both ELFNet and HITNet struggle with large texture-less areas, for example, created by scenes with high dynamic range. Our framework can leverage sparse depth points to solve this issue, whereas projected patterns would be ineffective.

Qualitative result on SIMSTEREO. Some networks, such as DLNR, often produce artefacts when fed with stereo pairs acquired with pattern projection active. In contrast, the same networks can seamlessly take advantage of our virtual patterns.

Qualitative result on M3ED-Active. Virtual and physical projected patterns help to improve the accuracy of a recent stereo network (i.e., RAFT-Stereo) when dealing with large uniform areas.

These last plots show the effectiveness of our technique using additional off-the-shelf networks (e.g., DLNR, ELFNet) on four additional passive/active raw datasets:

Even without any additional training, our framework boosts the accuracy of any model, with rare exceptions.



BibTeX

@InProceedings{Bartolomei_2023_ICCV,
		    author    = {Bartolomei, Luca and Poggi, Matteo and Tosi, Fabio and Conti, Andrea and Mattoccia, Stefano},
		    title     = {Active Stereo Without Pattern Projector},
		    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
		    month     = {October},
		    year      = {2023},
		    pages     = {18470-18482}
		}
@misc{bartolomei2024stereodepth,
            title={Stereo-Depth Fusion through Virtual Pattern Projection}, 
            author={Luca Bartolomei and Matteo Poggi and Fabio Tosi and Andrea Conti and Stefano Mattoccia},
            year={2024},
            eprint={2406.04345},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
      }



Acknowledgements

This study was carried out within the MOST – Sustainable Mobility National Research Center and received funding from the European Union Next-GenerationEU – PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D. 1033 17/06/2022, CN00000023. This manuscript reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them.