Intelligent Systems


2023


KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D

Liao, Y., Xie, J., Geiger, A.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292-3310, IEEE, March 2023 (article)

DOI [BibTex]


2022


ARAH: Animatable Volume Rendering of Articulated Human SDFs

Wang, S., Schwarz, K., Geiger, A., Tang, S.

Computer Vision – ECCV 2022, 13692, pages: 1-19, Lecture Notes in Computer Science (LNCS), (Editors: Avidan, S; Brostow, G; Cisse, M; Farinella, GM; Hassner, T), Springer, 17th European Conference on Computer Vision (ECCV), October 2022 (conference)

DOI [BibTex]



KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients

Hanselmann, N., Renz, K., Chitta, K., Bhattacharyya, A., Geiger, A.

Computer Vision – ECCV 2022, 13698, pages: 335-352, Lecture Notes in Computer Science (LNCS), (Editors: Avidan, S; Brostow, G; Cisse, M; Farinella, GM; Hassner, T), Springer, 17th European Conference on Computer Vision (ECCV), October 2022 (conference)

DOI [BibTex]



TensoRF: Tensorial Radiance Fields

Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.

Computer Vision – ECCV 2022, Part XXXII, 13692, pages: 333-350, Lecture Notes in Computer Science (LNCS), Springer, 17th European Conference on Computer Vision (ECCV), October 2022 (conference)

DOI [BibTex]



PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence

Dong, Z., Guo, C., Song, J., Chen, X., Geiger, A., Hilliges, O.

Proceedings 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pages: 20438-20448, IEEE, CVPR 2022, June 2022 (conference)

DOI [BibTex]



NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

Zhu, Z., Peng, S., Larsson, V., Xu, W., Bao, H., Cui, Z., Oswald, M. R., Pollefeys, M.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages: 12776-12786, IEEE, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), June 2022 (conference)

DOI [BibTex]



gDNA: Towards Generative Detailed Neural Avatars

Chen, X., Jiang, T., Song, J., Yang, J., Black, M. J., Geiger, A., Hilliges, O.

In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pages: 20395-20405, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), June 2022 (inproceedings)

Abstract
To make 3D human avatars widely available, we must be able to generate a variety of 3D virtual humans with varied identities and shapes in arbitrary poses. This task is challenging due to the diversity of clothed body shapes, their complex articulations, and the resulting rich, yet stochastic geometric detail in clothing. Hence, current methods to represent 3D people do not provide a full generative model of people in clothing. In this paper, we propose a novel method that learns to generate detailed 3D shapes of people in a variety of garments with corresponding skinning weights. Specifically, we devise a multi-subject forward skinning module that is learned from only a few posed, un-rigged scans per subject. To capture the stochastic nature of high-frequency details in garments, we leverage an adversarial loss formulation that encourages the model to capture the underlying statistics. We provide empirical evidence that this leads to realistic generation of local details such as clothing wrinkles. We show that our model is able to generate natural human avatars wearing diverse and detailed clothing. Furthermore, we show that our method can be used on the task of fitting human models to raw scans, outperforming the previous state-of-the-art.
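The forward skinning module described above builds on linear blend skinning (LBS). As a rough illustration only (not the paper's learned multi-subject module), a minimal numpy sketch of LBS forward skinning with toy bone transforms and weights:

```python
import numpy as np

def lbs_forward(x_canonical, skinning_weights, bone_transforms):
    """Deform canonical points by linear blend skinning.

    x_canonical:      (N, 3) points in canonical space
    skinning_weights: (N, B) convex weights over B bones
    bone_transforms:  (B, 4, 4) rigid bone transformations
    """
    N = x_canonical.shape[0]
    x_h = np.concatenate([x_canonical, np.ones((N, 1))], axis=1)   # homogeneous coords, (N, 4)
    per_bone = np.einsum('bij,nj->nbi', bone_transforms, x_h)      # each bone transforms every point, (N, B, 4)
    blended = np.einsum('nb,nbi->ni', skinning_weights, per_bone)  # weight-blend the transformed points, (N, 4)
    return blended[:, :3]

# toy example: two bones, the second rotated 90 degrees about z
R = np.eye(4); R[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
bones = np.stack([np.eye(4), R])
pts = np.array([[1.0, 0.0, 0.0]])
w = np.array([[0.5, 0.5]])
print(lbs_forward(pts, w, bones))   # blend of the identity-transformed and rotated point
```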

Project page Video Code DOI [BibTex]



RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs

Niemeyer, M., Barron, J. T., Mildenhall, B., Sajjadi, M. S. M., Geiger, A., Radwan, N.

Proceedings 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pages: 5470-5480, IEEE, CVPR 2022, 2022 (conference)

DOI [BibTex]


2021


Projected GANs Converge Faster

Sauer, A., Chitta, K., Müller, J., Geiger, A.

Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 34, (Editors: Ranzato, M; Beygelzimer, A; Dauphin, Y; Liang, PS; Vaughan, JW), NeurIPS, 35th Conference on Neural Information Processing Systems (NeurIPS), December 2021 (conference)

link (url) [BibTex]



CAMPARI: Camera-Aware Decomposed Generative Neural Radiance Fields

Niemeyer, M., Geiger, A.

In Proceedings 2021 International Conference on 3D Vision (3DV 2021), pages: 951-961, 3DV, 3DV 2021, December 2021 (inproceedings)

DOI [BibTex]



On the Frequency Bias of Generative Models

Schwarz, K., Liao, Y., Geiger, A.

In Advances in Neural Information Processing Systems 34, 22, pages: 18126-18136, (Editors: M. Ranzato and A. Beygelzimer and Y. Dauphin and P. S. Liang and J. Wortman Vaughan), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (inproceedings)

Abstract
The key objective of Generative Adversarial Networks (GANs) is to generate new data with the same statistics as the provided training data. However, multiple recent works show that state-of-the-art architectures yet struggle to achieve this goal. In particular, they report an elevated amount of high frequencies in the spectral statistics which makes it straightforward to distinguish real and generated images. Explanations for this phenomenon are controversial: While most works attribute the artifacts to the generator, other works point to the discriminator. We take a sober look at those explanations and provide insights on what makes proposed measures against high-frequency artifacts effective. To achieve this, we first independently assess the architectures of both the generator and discriminator and investigate if they exhibit a frequency bias that makes learning the distribution of high-frequency content particularly problematic. Based on these experiments, we make the following four observations: 1) Different upsampling operations bias the generator towards different spectral properties. 2) Checkerboard artifacts introduced by upsampling cannot explain the spectral discrepancies alone as the generator is able to compensate for these artifacts. 3) The discriminator does not struggle with detecting high frequencies per se but rather struggles with frequencies of low magnitude. 4) The downsampling operations in the discriminator can impair the quality of the training signal it provides. In light of these findings, we analyze proposed measures against high-frequency artifacts in state-of-the-art GAN training but find that none of the existing approaches can fully resolve spectral artifacts yet. Our results suggest that there is great potential in improving the discriminator and that this could be key to match the distribution of the training data more closely.
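The spectral statistics mentioned above are commonly summarized by an azimuthally averaged (reduced) power spectrum. A small numpy sketch of such a diagnostic on toy images, independent of the paper's evaluation code:

```python
import numpy as np

def reduced_spectrum(img):
    """Azimuthally averaged power spectrum of a grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2).astype(int)
    # mean power in each integer frequency ring
    spectrum = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    return spectrum[: min(h, w) // 2]

# compare a smooth image against one with added high-frequency noise
rng = np.random.default_rng(0)
smooth = np.outer(np.hanning(64), np.hanning(64))
noisy = smooth + 0.05 * rng.standard_normal((64, 64))
print(reduced_spectrum(smooth)[-5:])   # little energy at high frequencies
print(reduced_spectrum(noisy)[-5:])    # elevated high-frequency energy
```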

link (url) [BibTex]



ATISS: Autoregressive Transformers for Indoor Scene Synthesis

Paschalidou, D., Kar, A., Shugrina, M., Kreis, K., Geiger, A., Fidler, S.

In Advances in Neural Information Processing Systems 34, 15, pages: 12013-12026, (Editors: M. Ranzato and A. Beygelzimer and Y. Dauphin and P. S. Liang and J. Wortman Vaughan), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (inproceedings)

Abstract
The ability to synthesize realistic and diverse indoor furniture layouts automatically or based on partial input, unlocks many applications, from better interactive 3D tools to data synthesis for training and simulation. In this paper, we present ATISS, a novel autoregressive transformer architecture for creating diverse and plausible synthetic indoor environments, given only the room type and its floor plan. In contrast to prior work, which poses scene synthesis as sequence generation, our model generates rooms as unordered sets of objects. We argue that this formulation is more natural, as it makes ATISS generally useful beyond fully automatic room layout synthesis. For example, the same trained model can be used in interactive applications for general scene completion, partial room re-arrangement with any objects specified by the user, as well as object suggestions for any partial room. To enable this, our model leverages the permutation equivariance of the transformer when conditioning on the partial scene, and is trained to be permutation-invariant across object orderings. Our model is trained end-to-end as an autoregressive generative model using only labeled 3D bounding boxes as supervision. Evaluations on four room types in the 3D-FRONT dataset demonstrate that our model consistently generates plausible room layouts that are more realistic than existing methods. In addition, it has fewer parameters, is simpler to implement and train and runs up to 8x faster than existing methods.

link (url) [BibTex]



MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images

Wang, S., Mihajlovic, M., Ma, Q., Geiger, A., Tang, S.

In Advances in Neural Information Processing Systems 34, 4, pages: 2810-2822, (Editors: Ranzato, M. and Beygelzimer, A. and Dauphin, Y. and Liang, P. S. and Wortman Vaughan, J.), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (inproceedings)

Abstract
In this paper, we aim to create generalizable and controllable neural signed distance fields (SDFs) that represent clothed humans from monocular depth observations. Recent advances in deep learning, especially neural implicit representations, have enabled human shape reconstruction and controllable avatar generation from different sensor inputs. However, to generate realistic cloth deformations from novel input poses, watertight meshes or dense full-body scans are usually needed as inputs. Furthermore, due to the difficulty of effectively modeling pose-dependent cloth deformations for diverse body shapes and cloth types, existing approaches resort to per-subject/cloth-type optimization from scratch, which is computationally expensive. In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images. We achieve this by using meta-learning to learn an initialization of a hypernetwork that predicts the parameters of neural SDFs. The hypernetwork is conditioned on human poses and represents a clothed neural avatar that deforms non-rigidly according to the input poses. Meanwhile, it is meta-learned to effectively incorporate priors of diverse body shapes and cloth types and thus can be much faster to fine-tune compared to models trained from scratch. We qualitatively and quantitatively show that our approach outperforms state-of-the-art approaches that require complete meshes as inputs while our approach requires only depth frames as inputs and runs orders of magnitudes faster. Furthermore, we demonstrate that our meta-learned hypernetwork is very robust, being the first to generate avatars with realistic dynamic cloth deformations given as few as 8 monocular depth frames.

Project page arXiv link (url) Project Page Project Page [BibTex]



Shape As Points: A Differentiable Poisson Solver

Peng, S., Jiang, C. M., Liao, Y., Niemeyer, M., Pollefeys, M., Geiger, A.

In Advances in Neural Information Processing Systems 34, 16, pages: 13032-13044, (Editors: M. Ranzato and A. Beygelzimer and Y. Dauphin and P. S. Liang and J. Wortman Vaughan), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (inproceedings)

Abstract
In recent years, neural implicit representations gained popularity in 3D reconstruction due to their expressiveness and flexibility. However, the implicit nature of neural implicit representations results in slow inference times and requires careful initialization. In this paper, we revisit the classic yet ubiquitous point cloud representation and introduce a differentiable point-to-mesh layer using a differentiable formulation of Poisson Surface Reconstruction (PSR) which allows for a GPU-accelerated fast solution of the indicator function given an oriented point cloud. The differentiable PSR layer allows us to efficiently and differentiably bridge the explicit 3D point representation with the 3D mesh via the implicit indicator field, enabling end-to-end optimization of surface reconstruction metrics such as Chamfer distance. This duality between points and meshes hence allows us to represent shapes as oriented point clouds, which are explicit, lightweight and expressive. Compared to neural implicit representations, our Shape-As-Points (SAP) model is more interpretable, lightweight, and accelerates inference time by one order of magnitude. Compared to other explicit representations such as points, patches, and meshes, SAP produces topology-agnostic, watertight manifold surfaces. We demonstrate the effectiveness of SAP on the task of surface reconstruction from unoriented point clouds and learning-based reconstruction.
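The differentiable PSR layer amounts to a (screened) Poisson solve, which can be carried out in the spectral domain. A simplified 2D numpy analogue (illustrative only, not the released solver) that recovers an indicator-like field from an oriented normal field via the FFT:

```python
import numpy as np

def poisson_from_normals(vx, vy):
    """Solve laplacian(chi) = div(v) on a periodic 2D grid in the Fourier domain."""
    n = vx.shape[0]
    k = np.fft.fftfreq(n) * 2 * np.pi
    kx, ky = np.meshgrid(k, k, indexing='ij')
    div_hat = 1j * kx * np.fft.fft2(vx) + 1j * ky * np.fft.fft2(vy)
    denom = -(kx ** 2 + ky ** 2)
    denom[0, 0] = 1.0                      # avoid division by zero at the DC term
    chi_hat = div_hat / denom
    chi_hat[0, 0] = 0.0                    # pick the zero-mean solution
    return np.real(np.fft.ifft2(chi_hat))

# oriented "point cloud": outward normals rasterized on a circle
n = 64
y, x = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n), indexing='ij')
r = np.sqrt(x ** 2 + y ** 2) + 1e-8
on_circle = np.abs(r - 0.5) < 0.05
vx, vy = np.where(on_circle, x / r, 0.0), np.where(on_circle, y / r, 0.0)
chi = poisson_from_normals(vx, vy)
print(chi[n // 2, n // 2], chi[0, 0])      # interior and exterior settle at different levels
```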

Paper link (url) [BibTex]



NEAT: Neural Attention Fields for End-to-End Autonomous Driving

Chitta, K., Prakash, A., Geiger, A.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 15773-15783, IEEE, International Conference on Computer Vision (ICCV), 2021 (inproceedings)

Abstract
Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial pre-requisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end Imitation Learning (IL) models. Our representation is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows our model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. NEAT nearly matches the state-of-the-art on the CARLA Leaderboard while being far less resource-intensive. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability. On a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert used to generate its training data.

Paper Supplementary Material Video 1 Video 2 Project page link (url) DOI [BibTex]



SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes

Chen, X., Zheng, Y., Black, M. J., Hilliges, O., Geiger, A.

In Proc. International Conference on Computer Vision (ICCV), pages: 11574-11584, IEEE, Piscataway, NJ, International Conference on Computer Vision 2021, October 2021 (inproceedings)

Abstract
Neural implicit surface representations have emerged as a promising paradigm to capture 3D shapes in a continuous and resolution-independent manner. However, adapting them to articulated shapes is non-trivial. Existing approaches learn a backward warp field that maps deformed to canonical points. However, this is problematic since the backward warp field is pose dependent and thus requires large amounts of data to learn. To address this, we introduce SNARF, which combines the advantages of linear blend skinning (LBS) for polygonal meshes with those of neural implicit surfaces by learning a forward deformation field without direct supervision. This deformation field is defined in canonical, pose-independent, space, enabling generalization to unseen poses. Learning the deformation field from posed meshes alone is challenging since the correspondences of deformed points are defined implicitly and may not be unique under changes of topology. We propose a forward skinning model that finds all canonical correspondences of any deformed point using iterative root finding. We derive analytical gradients via implicit differentiation, enabling end-to-end training from 3D meshes with bone transformations. Compared to state-of-the-art neural implicit representations, our approach generalizes better to unseen poses while preserving accuracy. We demonstrate our method in challenging scenarios on (clothed) 3D humans in diverse and unseen poses.
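Finding the canonical correspondence of a deformed point amounts to solving d(x_c) = x_d for x_c, where d is the forward skinning map. A toy 2D numpy sketch of this iterative root finding, using plain Newton steps with numerical Jacobians rather than the Broyden updates used in practice:

```python
import numpy as np

def forward_deform(x_c, theta):
    """Toy forward skinning map: rotate a point by an angle that grows with its radius."""
    a = theta * np.linalg.norm(x_c)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return R @ x_c

def canonical_correspondence(x_d, theta, iters=20):
    """Solve forward_deform(x_c, theta) = x_d for x_c with Newton iterations."""
    x_c = x_d.copy()                       # initialize at the deformed location
    eps = 1e-5
    for _ in range(iters):
        f = forward_deform(x_c, theta) - x_d
        J = np.zeros((2, 2))               # finite-difference Jacobian of the forward map
        for j in range(2):
            dx = np.zeros(2); dx[j] = eps
            J[:, j] = (forward_deform(x_c + dx, theta) - forward_deform(x_c, theta)) / eps
        x_c = x_c - np.linalg.solve(J, f)
    return x_c

x_d = np.array([0.3, 0.8])
x_c = canonical_correspondence(x_d, theta=0.7)
print(np.linalg.norm(forward_deform(x_c, 0.7) - x_d))   # near zero: correspondence found
```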

pdf pdf 2 supplementary material project blog blog 2 video video 2 code DOI Project Page Project Page [BibTex]



GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

Niemeyer, M., Geiger, A.

In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages: 11448-11459, IEEE, Conference on Computer Vision and Pattern Recognition (CVPR), June 2021 (inproceedings)

Abstract
Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional. Further, only few works consider the compositional nature of scenes. Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. Representing scenes as compositional generative neural feature fields allows us to disentangle one or multiple objects from the background as well as individual objects' shapes and appearances while learning from unstructured and unposed image collections without any additional supervision. Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model. As evidenced by our experiments, our model is able to disentangle individual objects and allows for translating and rotating them in the scene as well as changing the camera pose.
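The composition of per-object feature fields follows a simple rule: densities add up, and features are averaged with density weights. A minimal numpy illustration of that composition operator (not the released code):

```python
import numpy as np

def compose_fields(sigmas, feats, eps=1e-8):
    """Compose N per-object fields sampled at the same 3D points.

    sigmas: (N, P) volume densities per object and sample point
    feats:  (N, P, C) feature vectors per object and sample point
    """
    sigma = sigmas.sum(axis=0)                                        # densities add, (P,)
    feat = (sigmas[..., None] * feats).sum(axis=0) / (sigma[:, None] + eps)  # density-weighted mean
    return sigma, feat

rng = np.random.default_rng(0)
sigmas = rng.uniform(0, 1, size=(3, 5))         # 3 objects, 5 sample points
feats = rng.standard_normal((3, 5, 16))         # 16-dimensional features
sigma, feat = compose_fields(sigmas, feats)
print(sigma.shape, feat.shape)                  # (5,) (5, 16)
```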

pdf suppmat video Project Page link (url) DOI [BibTex]



Learning Steering Kernels for Guided Depth Completion

Liu, L., Liao, Y., Wang, Y., Geiger, A., Liu, Y.

IEEE Transactions on Image Processing, 30, pages: 2850-2861, IEEE, February 2021 (article)

DOI [BibTex]



SMD-Nets: Stereo Mixture Density Networks

Tosi, F., Liao, Y., Schmitt, C., Geiger, A.

Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (article)

Abstract
Despite stereo matching accuracy has greatly improved by deep learning in the last few years, recovering sharp boundaries and high-resolution outputs efficiently remains challenging. In this paper, we propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures which ameliorates both issues. Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities while explicitly modeling the aleatoric uncertainty inherent in the observations. Moreover, we formulate disparity estimation as a continuous problem in the image domain, allowing our model to query disparities at arbitrary spatial precision. We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets. Our experiments demonstrate increased depth accuracy near object boundaries and prediction of ultra high-resolution disparity maps on standard GPUs. We demonstrate the flexibility of our technique by improving the performance of a variety of stereo backbones.
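The bimodal output head predicts, per pixel, two Laplacian modes and a mixture weight, and training minimizes the mixture negative log-likelihood. A numpy sketch of that loss with illustrative parameter names:

```python
import numpy as np

def laplacian_mixture_nll(d_gt, pi, mu1, mu2, b1, b2, eps=1e-8):
    """Negative log-likelihood of a two-mode Laplacian mixture, per pixel."""
    p1 = pi / (2 * b1) * np.exp(-np.abs(d_gt - mu1) / b1)
    p2 = (1 - pi) / (2 * b2) * np.exp(-np.abs(d_gt - mu2) / b2)
    return -np.log(p1 + p2 + eps)

# pixels near a depth discontinuity: the ground truth may follow either mode
d_gt = np.array([10.0, 10.1, 24.9, 25.0])
pi = np.full(4, 0.6)                     # weight of the first mode
mu1, mu2 = np.full(4, 10.0), np.full(4, 25.0)
b1, b2 = np.full(4, 0.5), np.full(4, 0.5)
print(laplacian_mixture_nll(d_gt, pi, mu1, mu2, b1, b2))
```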

pdf suppmat Project page Best of CVPR [BibTex]



UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction

Oechsle, M., Peng, S., Geiger, A.

In International Conference on Computer Vision (ICCV), 2021 (inproceedings)

Abstract
Neural implicit 3D representations have emerged as a powerful paradigm for reconstructing surfaces from multi-view images and synthesizing novel views. Unfortunately, existing methods such as DVR or IDR require accurate per-pixel object masks as supervision. At the same time, neural radiance fields have revolutionized novel view synthesis. However, NeRF's estimated volume density does not admit accurate surface reconstruction. Our key insight is that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering using the same model. This unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks. We compare our method on the DTU, BlendedMVS, and a synthetic indoor dataset. Our experiments demonstrate that we outperform NeRF in terms of reconstruction quality while performing on par with IDR without requiring masks.
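In the unified formulation, a ray is rendered by alpha-compositing occupancy values along the ray. A minimal numpy version of that compositing rule:

```python
import numpy as np

def render_ray(occupancy, colors):
    """Alpha-composite per-sample occupancies o_i in [0, 1] and colors along one ray."""
    # transmittance before sample i: product of (1 - o_j) for all j < i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - occupancy[:-1]]))
    weights = occupancy * trans
    return (weights[:, None] * colors).sum(axis=0), weights

occ = np.array([0.0, 0.1, 0.9, 1.0])                 # surface around the third sample
cols = np.array([[0, 0, 0], [0.2, 0.2, 0.2], [1.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
color, w = render_ray(occ, cols)
print(color, w.sum())                                # dominated by the red surface sample
```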

Paper Supplementary Material Video Project page [BibTex]


Zoomorphic Gestures for Communicating Cobot States

Sauer, V., Sauer, A., Mertens, A.

IEEE Robotics and Automation Letters, 6(2):2179-2185, 2021 (article)

DOI [BibTex]



Locally Aware Piecewise Transformation Fields for 3D Human Mesh Registration

Wang, S., Geiger, A., Tang, S.

In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages: 7635-7644, IEEE, Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (inproceedings)

Abstract
Registering point clouds of dressed humans to parametric human models is a challenging task in computer vision. Traditional approaches often rely on heavily engineered pipelines that require accurate manual initialization of human poses and tedious post-processing. More recently, learning-based methods are proposed in hope to automate this process. We observe that pose initialization is key to accurate registration but existing methods often fail to provide accurate pose initialization. One major obstacle is that, despite recent effort on rotation representation learning in neural networks, regressing joint rotations from point clouds or images of humans is still very challenging. To this end, we propose novel piecewise transformation fields (PTF), a set of functions that learn 3D translation vectors to map any query point in posed space to its correspond position in rest-pose space. We combine PTF with multi-class occupancy networks, obtaining a novel learning-based framework that learns to simultaneously predict shape and per-point correspondences between the posed space and the canonical space for clothed human. Our key insight is that the translation vector for each query point can be effectively estimated using the point-aligned local features; consequently, rigid per bone transformations and joint rotations can be obtained efficiently via a least-square fitting given the estimated point correspondences, circumventing the challenging task of directly regressing joint rotations from neural networks. Furthermore, the proposed PTF facilitate canonicalized occupancy estimation, which greatly improves generalization capability and results in more accurate surface reconstruction with only half of the parameters compared with the state-of-the-art. Both qualitative and quantitative studies show that fitting parametric models with poses initialized by our network results in much better registration quality, especially for extreme poses.

pdf suppmat video video_2 Project page link (url) DOI [BibTex]



SLIM: Self-Supervised LiDAR Scene Flow and Motion Segmentation

Baur, S., Emmerichs, D., Moosmann, F., Pinggera, P., Ommer, B., Geiger, A.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 13106-13116, IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021 (inproceedings)

Abstract
Recently, several frameworks for self-supervised learning of 3D scene flow on point clouds have emerged. Scene flow inherently separates every scene into multiple moving agents and a large class of points following a single rigid sensor motion. However, existing methods do not leverage this property of the data in their self-supervised training routines which could improve and stabilize flow predictions. Based on the discrepancy between a robust rigid ego-motion estimate and a raw flow prediction, we generate a self-supervised motion segmentation signal. The predicted motion segmentation, in turn, is used by our algorithm to attend to stationary points for aggregation of motion information in static parts of the scene. We learn our model end-to-end by backpropagating gradients through Kabsch's algorithm and demonstrate that this leads to accurate ego-motion which in turn improves the scene flow estimate. Using our method, we show state-of-the-art results across multiple scene flow metrics for different real-world datasets, showcasing the robustness and generalizability of this approach. We further analyze the performance gain when performing joint motion segmentation and scene flow in an ablation study. We also present a novel network architecture for 3D LiDAR scene flow which is capable of handling an order of magnitude more points during training than previously possible.
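Ego-motion enters through a Kabsch alignment of corresponding points, through which gradients are backpropagated during training. A plain numpy version of the Kabsch step itself (without the differentiable-framework machinery):

```python
import numpy as np

def kabsch(p, q):
    """Best-fit rotation R and translation t with R @ p_i + t ≈ q_i in a least-squares sense."""
    pc, qc = p - p.mean(axis=0), q - q.mean(axis=0)
    U, _, Vt = np.linalg.svd(pc.T @ qc)             # SVD of the 3x3 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = q.mean(axis=0) - R @ p.mean(axis=0)
    return R, t

# synthetic check: recover a known rigid motion from noiseless correspondences
rng = np.random.default_rng(0)
p = rng.standard_normal((100, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
q = p @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = kabsch(p, q)
print(np.allclose(R, R_true), np.allclose(t, [0.5, -0.2, 1.0]))   # True True
```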

Paper Supplementary Material Video DOI [BibTex]



Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks

Paschalidou, D., Katharopoulos, A., Geiger, A., Fidler, S.

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), pages: 3203-3214, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (inproceedings)

pdf suppmat video Project page DOI [BibTex]



Counterfactual Generative Networks

Sauer, A., Geiger, A.

In The Ninth International Conference on Learning Representations (ICLR 2021), 9th International Conference on Learning Representations (ICLR 2021), 2021 (inproceedings)

Abstract
Neural networks are prone to learning shortcuts -they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases.

pdf Project Page video code Blog link (url) [BibTex]



Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Prakash, A., Chitta, K., Geiger, A.

In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages: 7073-7083, IEEE, Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (inproceedings)

Abstract
How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end driving models. In this work, we demonstrate that existing sensor fusion methods under-perform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, such as handling traffic oncoming from multiple directions at uncontrolled intersections. Therefore, we propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention. We experimentally validate the efficacy of our approach in urban settings involving complex scenarios using the CARLA urban driving simulator. Our approach achieves state-of-the-art driving performance while reducing collisions by 80% compared to geometry-based fusion.
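Fusion relies on standard scaled dot-product attention between image and LiDAR feature tokens. A tiny single-head numpy sketch of such a fusion step, with random projections standing in for learned weights (illustrative only, not the TransFuser architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(img_tokens, lidar_tokens):
    """Single-head attention over the concatenated image and LiDAR tokens."""
    tokens = np.concatenate([img_tokens, lidar_tokens], axis=0)     # (N_img + N_lidar, C)
    C = tokens.shape[1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    A = softmax(Q @ K.T / np.sqrt(C))       # every token attends to both modalities
    return A @ V

img = np.random.default_rng(1).standard_normal((64, 32))    # e.g. 8x8 image feature map, 32 channels
lidar = np.random.default_rng(2).standard_normal((64, 32))  # e.g. 8x8 BEV LiDAR feature map
print(attention_fusion(img, lidar).shape)                    # (128, 32)
```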

pdf video Project Page link (url) DOI [BibTex]



Learning Cascaded Detection Tasks with Weakly-Supervised Domain Adaptation

Hanselmann, N., Schneider, N., Ortelt, B., Geiger, A.

In 2021 IEEE Intelligent Vehicles Symposium (IV), 4th IEEE Intelligent Vehicles Symposium (IV 2021), 2021 (inproceedings)

Abstract
In order to handle the challenges of autonomous driving, deep learning has proven to be crucial in tackling increasingly complex tasks, such as 3D detection or instance segmentation. State-of-the-art approaches for image-based detection tasks tackle this complexity by operating in a cascaded fashion: they first extract a 2D bounding box based on which additional attributes, e.g. instance masks, are inferred. While these methods perform well, a key challenge remains the lack of accurate and cheap annotations for the growing variety of tasks. Synthetic data presents a promising solution but, despite the effort in domain adaptation research, the gap between synthetic and real data remains an open problem. In this work, we propose a weakly supervised domain adaptation setting which exploits the structure of cascaded detection tasks. In particular, we learn to infer the attributes solely from the source domain while leveraging 2D bounding boxes as weak labels in both domains to explain the domain shift. We further encourage domain-invariant features through class-wise feature alignment using ground-truth class information, which is not available in the unsupervised setting. As our experiments demonstrate, the approach is competitive with fully supervised settings while outperforming unsupervised adaptation approaches by a large margin.

Paper Video Project page DOI [BibTex]



Benchmarking Unsupervised Object Representations for Video Sequences

Weis, M., Chitta, K., Sharma, Y., Brendel, W., Bethge, M., Geiger, A., Ecker, A.

Journal of Machine Learning Research (JMLR), 22, pages: 61, 2021 (article)

Abstract
Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video-extension of MONet, based on recurrent spatial attention, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial transformer based architectures. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.

Paper Project page link (url) [BibTex]



KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

Reiser, C., Peng, S., Liao, Y., Geiger, A.

In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages: 14315-14325, IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021 (inproceedings)

Abstract
NeRF synthesizes novel views of a scene with unprecedented quality by fitting a neural radiance field to RGB images. However, NeRF requires querying a deep Multi-Layer Perceptron (MLP) millions of times, leading to slow rendering times, even on modern GPUs. In this paper, we demonstrate that real-time rendering is possible by utilizing thousands of tiny MLPs instead of one single large MLP. In our setting, each individual MLP only needs to represent parts of the scene, thus smaller and faster-to-evaluate MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, rendering is accelerated by three orders of magnitude compared to the original NeRF model without incurring high storage costs. Further, using teacher-student distillation for training, we show that this speed-up can be achieved without sacrificing visual quality.
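The core idea is a uniform grid of tiny, independent MLPs, with each query point routed to the MLP that owns its grid cell. A compact numpy sketch of this routing, using randomly initialized toy MLPs rather than trained networks:

```python
import numpy as np

class TinyMLPGrid:
    """A res^3 grid of independent 2-layer MLPs; each point is handled by the MLP of its cell."""
    def __init__(self, res=4, hidden=8, out=4, seed=0):
        rng = np.random.default_rng(seed)
        n = res ** 3
        self.res = res
        self.W1 = rng.standard_normal((n, 3, hidden)) * 0.1
        self.W2 = rng.standard_normal((n, hidden, out)) * 0.1

    def __call__(self, x):
        """x: (P, 3) points in [0, 1)^3  ->  (P, out) predictions (e.g. color + density)."""
        cell = np.clip((x * self.res).astype(int), 0, self.res - 1)          # which cell owns each point
        idx = (cell[:, 0] * self.res + cell[:, 1]) * self.res + cell[:, 2]   # flat MLP index per point
        h = np.maximum(np.einsum('pi,pih->ph', x, self.W1[idx]), 0.0)        # per-point tiny MLP, ReLU
        return np.einsum('ph,pho->po', h, self.W2[idx])

model = TinyMLPGrid()
pts = np.random.default_rng(1).uniform(0, 1, size=(1000, 3))
print(model(pts).shape)     # (1000, 4)
```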

Paper Supplementary Material Video 1 Video 2 Project page Blog link (url) DOI [BibTex]


2020


GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis

Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.

In Advances in Neural Information Processing Systems 33, 25, pages: 20154-20166, (Editors: Larochelle, H. and Ranzato, M. and Hadsell, R. and Balcan, M. F. and Lin, H.), Curran Associates, Inc., Red Hook, NY, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), December 2020 (inproceedings)

Abstract
While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, eg, the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.

pdf suppmat video Project Page link (url) [BibTex]



Label Efficient Visual Abstractions for Autonomous Driving

Behl, A., Chitta, K., Prakash, A., Ohn-Bar, E., Geiger, A.

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, October 2020 (conference)

Abstract
It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, ie, the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (eg, object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.

pdf slides video Project Page [BibTex]



Convolutional Occupancy Networks

Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., Geiger, A.

In Computer Vision – ECCV 2020, 3, pages: 523-540, Lecture Notes in Computer Science, 12348, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (inproceedings)

Abstract
Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables the fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.
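The decoder conditions on features interpolated bilinearly (or trilinearly) from convolutional feature planes or volumes at the query location. A numpy sketch of bilinear feature lookup on a single plane, followed by a stand-in occupancy decoder (names and the single-layer decoder are illustrative):

```python
import numpy as np

def bilinear_lookup(feature_plane, xy):
    """Bilinearly interpolate a (H, W, C) feature plane at continuous xy in [0, 1]^2."""
    H, W, _ = feature_plane.shape
    u = xy[:, 0] * (W - 1)
    v = xy[:, 1] * (H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    wu, wv = (u - u0)[:, None], (v - v0)[:, None]
    f00, f01 = feature_plane[v0, u0], feature_plane[v0, u1]
    f10, f11 = feature_plane[v1, u0], feature_plane[v1, u1]
    return (1 - wv) * ((1 - wu) * f00 + wu * f01) + wv * ((1 - wu) * f10 + wu * f11)

# query features for 3D points projected onto a ground plane, then decode occupancy
plane = np.random.default_rng(0).standard_normal((32, 32, 16))      # H x W x C feature plane
points = np.random.default_rng(1).uniform(0, 1, size=(100, 3))
feats = bilinear_lookup(plane, points[:, :2])
W_dec = np.random.default_rng(2).standard_normal((16, 1)) * 0.1     # stand-in for the occupancy MLP
occupancy = 1.0 / (1.0 + np.exp(-(feats @ W_dec)))                   # sigmoid occupancy probability
print(occupancy.shape)                                               # (100, 1)
```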

pdf suppmat video Project Page DOI [BibTex]



Category Level Object Pose Estimation via Neural Analysis-by-Synthesis

Chen, X., Dong, Z., Song, J., Geiger, A., Hilliges, O.

In Computer Vision – ECCV 2020, 26, pages: 139-156, Lecture Notes in Computer Science, 12371, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (inproceedings)

Abstract
Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance based loss and the error signal is backpropagated through the network to the input parameters. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner and we experimentally show that the method can recover orientation of objects with high accuracy from 2D images alone. When provided with depth measurements, to overcome scale ambiguities, the method can accurately recover the full 6DOF pose successfully.

Project Page pdf suppmat DOI [BibTex]



Self-Supervised Linear Motion Deblurring

Liu, P., Janai, J., Pollefeys, M., Sattler, T., Geiger, A.

IEEE Robotics and Automation Letters, 5(2):2475-2482, IEEE, April 2020 (article)

DOI [BibTex]



Scalable Active Learning for Object Detection

Haussmann, E., Fenzi, M., Chitta, K., Ivanecky, J., Xu, H., Roy, D., Mittel, A., Koumchatzky, N., Farabet, C., Alvarez, J. M.

Proceedings 31st IEEE Intelligent Vehicles Symposium (IV), pages: 1430-1435, IEEE, 31st IEEE Intelligent Vehicles Symposium (IV), 2020 (conference)

DOI [BibTex]



Self-supervised motion deblurring

Liu, P., Janai, J., Pollefeys, M., Sattler, T., Geiger, A.

IEEE Robotics and Automation Letters, 2020 (article)

Abstract
Motion blurry images challenge many computer vision algorithms, e.g., feature detection, motion estimation, or object recognition. Deep convolutional neural networks are state-of-the-art for image deblurring. However, obtaining training data with corresponding sharp and blurry image pairs can be difficult. In this paper, we present a differentiable reblur model for self-supervised motion deblurring, which enables the network to learn from real-world blurry image sequences without relying on sharp images for supervision. Our key insight is that motion cues obtained from consecutive images yield sufficient information to inform the deblurring task. We therefore formulate deblurring as an inverse rendering problem, taking into account the physical image formation process: we first predict two deblurred images from which we estimate the corresponding optical flow. Using these predictions, we re-render the blurred images and minimize the difference with respect to the original blurry inputs. We use both synthetic and real dataset for experimental evaluations. Our experiments demonstrate that self-supervised single image deblurring is really feasible and leads to visually compelling results.
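The reblur step models a blurry frame as the temporal average of sharp content displaced along the estimated motion. A 1D numpy toy of that forward model, a strong simplification of the paper's differentiable reblur module:

```python
import numpy as np

def reblur_1d(sharp, flow, n_steps=9):
    """Average shifted copies of a sharp 1D signal along a constant flow (pixels per exposure)."""
    x = np.arange(sharp.size)
    acc = np.zeros_like(sharp, dtype=float)
    for s in np.linspace(-0.5, 0.5, n_steps):          # sample the exposure interval
        acc += np.interp(x - s * flow, x, sharp)       # backward-warp by a fraction of the flow
    return acc / n_steps

sharp = np.zeros(64); sharp[30:34] = 1.0               # a sharp bright bar
blurry = reblur_1d(sharp, flow=6.0)
print(blurry[26:38].round(2))                          # the bar is smeared across neighboring pixels
```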

pdf Project Page Blog [BibTex]



Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image

Paschalidou, D., Gool, L., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Humans perceive the 3D world as a set of distinct objects that are characterized by various low-level (geometry, reflectance) and high-level (connectivity, adjacency, symmetry) properties. Recent methods based on convolutional neural networks (CNNs) demonstrated impressive progress in 3D reconstruction, even when using a single 2D image as input. However, the majority of these methods focuses on recovering the local 3D geometry of an object without considering its part-based decomposition or relations between parts. We address this challenging problem by proposing a novel formulation that allows to jointly recover the geometry of a 3D object as a set of primitives as well as their latent hierarchical structure without part-level supervision. Our model recovers the higher level structural decomposition of various objects in the form of a binary tree of primitives, where simple parts are represented with fewer primitives and more complex parts are modeled with more components. Our experiments on the ShapeNet and D-FAUST datasets demonstrate that considering the organization of parts indeed facilitates reasoning about 3D geometry.

pdf suppmat Video 2 Project Page Slides Poster Video 1 [BibTex]



Learning Neural Light Transport

Sanzenbacher, P., Mescheder, L., Geiger, A.

arXiv, 2020 (article)

Abstract
In recent years, deep generative models have gained significance due to their ability to synthesize natural-looking images with applications ranging from virtual reality to data augmentation for training computer vision models. While existing models are able to faithfully learn the image distribution of the training set, they often lack controllability as they operate in 2D pixel space and do not model the physical image formation process. In this work, we investigate the importance of 3D reasoning for photorealistic rendering. We present an approach for learning light transport in static and dynamic 3D scenes using a neural network with the goal of predicting photorealistic images. In contrast to existing approaches that operate in the 2D image domain, our approach reasons in both 3D and 2D space, thus enabling global illumination effects and manipulation of 3D scene geometry. Experimentally, we find that our model is able to produce photorealistic renderings of static and dynamic scenes. Moreover, it compares favorably to baselines which combine path tracing and image denoising at the same computational budget.

arxiv [BibTex]


Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis

Liao, Y., Schwarz, K., Mescheder, L., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 5870-5879, IEEE, Piscataway, NJ, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent wrt. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.

pdf suppmat Video 2 Project Page Video 1 Slides Poster DOI [BibTex]



Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving

Prakash, A., Behl, A., Ohn-Bar, E., Chitta, K., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Data aggregation techniques can significantly improve vision-based policy learning within a training environment, e.g., learning to drive in a specific simulation condition. However, as on-policy data is sequentially sampled and added in an iterative manner, the policy can specialize and overfit to the training conditions. For real-world applications, it is useful for the learned policy to generalize to novel scenarios that differ from the training conditions. To improve policy learning while maintaining robustness when training end-to-end driving policies, we perform an extensive analysis of data aggregation techniques in the CARLA environment. We demonstrate how the majority of them have poor generalization performance, and develop a novel approach with empirically better generalization performance compared to existing techniques. Our two key ideas are (1) to sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior, and (2) to incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy's state distribution. We evaluate the proposed approach on the CARLA NoCrash benchmark, focusing on the most challenging driving scenarios with dense pedestrian and vehicle traffic. Our approach improves driving success rate by 16% over state-of-the-art, achieving 87% of the expert performance while also reducing the collision rate by an order of magnitude without the use of any additional modality, auxiliary tasks, architectural modifications or reward from the environment.

pdf suppmat Video 2 Project Page Slides Video 1 [BibTex]



HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking

Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixe, L., Leibe, B.

International Journal of Computer Vision, 129(2):548-578, 2020 (article)

Abstract
Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance.
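At a fixed localization threshold α, HOTA is the geometric mean of a detection accuracy and an association accuracy, and the final score averages over thresholds. A compact numpy sketch of that computation from precomputed counts (not the official evaluation code):

```python
import numpy as np

def hota_alpha(tp_ass_scores, fn, fp):
    """HOTA at one localization threshold from per-TP association scores and error counts.

    tp_ass_scores: array of A(c) = TPA / (TPA + FNA + FPA) for every true-positive match c
    fn, fp:        numbers of missed and spurious detections at this threshold
    """
    tp = len(tp_ass_scores)
    det_a = tp / (tp + fn + fp)                     # detection accuracy
    ass_a = np.mean(tp_ass_scores) if tp else 0.0   # association accuracy
    return np.sqrt(det_a * ass_a)

# the final HOTA score averages over localization thresholds alpha in (0, 1)
scores_per_alpha = [hota_alpha(np.array([0.9, 0.8, 1.0]), fn=1, fp=0),
                    hota_alpha(np.array([0.9, 0.7]), fn=2, fp=1)]
print(np.mean(scores_per_alpha))
```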

pdf DOI [BibTex]



On Joint Estimation of Pose, Geometry and svBRDF from a Handheld Scanner

Schmitt, C., Donne, S., Riegler, G., Koltun, V., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
We propose a novel formulation for joint recovery of camera pose, object geometry and spatially-varying BRDF. The input to our approach is a sequence of RGB-D images captured by a mobile, hand-held scanner that actively illuminates the scene with point light sources. Compared to previous works that jointly estimate geometry and materials from a hand-held scanner, we formulate this problem using a single objective function that can be minimized using off-the-shelf gradient-based solvers. By integrating material clustering as a differentiable operation into the optimization process, we avoid pre-processing heuristics and demonstrate that our model is able to determine the correct number of specular materials independently. We provide a study on the importance of each component in our formulation and on the requirements of the initial geometry. We show that optimizing over the poses is crucial for accurately recovering fine details and that our approach naturally results in a semantically meaningful material segmentation.

pdf Project Page Slides Video Poster [BibTex]



Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition

Alhaija, H., Mustikovela, S., Jampani, V., Thies, J., Niessner, M., Geiger, A., Rother, C.

In International Conference on 3D Vision (3DV), 2020 (inproceedings)

Abstract
Neural rendering techniques promise efficient photo-realistic image synthesis while providing rich control over scene parameters by learning the physical image formation process. While several supervised methods have been proposed for this task, acquiring a dataset of images with accurately aligned 3D models is very difficult. The main contribution of this work is to lift this restriction by training a neural rendering algorithm from unpaired data. We propose an autoencoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties. In contrast to a traditional graphics pipeline, our approach does not require to specify all scene properties, such as material parameters and lighting, by hand. Instead, we learn photo-realistic deferred rendering from a small set of 3D models and a larger set of unaligned real images, both of which are easy to acquire in practice. Simultaneously, we obtain accurate intrinsic decompositions of real images while not requiring paired ground truth. Our experiments confirm that a joint treatment of rendering and decomposition is indeed beneficial and that our approach outperforms state-of-the-art image-to-image translation baselines both qualitatively and quantitatively.

pdf suppmat [BibTex]



Learning Situational Driving

Ohn-Bar, E., Prakash, A., Behl, A., Chitta, K., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 11293-11302, IEEE, Piscataway, NJ, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Human drivers have a remarkable ability to drive in diverse visual conditions and situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane markings to turning in a busy intersection while yielding to pedestrians. In contrast, we find that state-of-the-art sensorimotor driving models struggle when encountering diverse settings with varying relationships between observation and action. To generalize when making decisions across diverse conditions, humans leverage multiple types of situation-specific reasoning and learning strategies. Motivated by this observation, we develop a framework for learning a situational driving policy that effectively captures reasoning under varying types of scenarios. Our key idea is to learn a mixture model with a set of policies that can capture multiple driving modes. We first optimize the mixture model through behavior cloning, and show it to result in significant gains in terms of driving performance in diverse conditions. We then refine the model by directly optimizing for the driving task itself, i.e., supervised with the navigation task reward. Our method is more scalable than methods assuming access to privileged information, e.g., perception labels, as it only assumes demonstration and reward-based supervision. We achieve over 98% success rate on the CARLA driving benchmark as well as state-of-the-art performance on a newly introduced generalization benchmark.
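
A minimal sketch of the mixture idea, assuming a simple gating network over linear expert policies; the feature dimensions, the number of experts and the two training stages noted in the comments are illustrative assumptions.

    # Illustrative sketch: a gating network blends several expert policies
    # conditioned on the same observation.
    import torch
    import torch.nn as nn

    class MixturePolicy(nn.Module):
        def __init__(self, obs_dim=128, act_dim=3, n_experts=3):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(obs_dim, act_dim) for _ in range(n_experts)])
            self.gate    = nn.Linear(obs_dim, n_experts)

        def forward(self, obs):
            w    = torch.softmax(self.gate(obs), dim=-1)                 # situation weights
            acts = torch.stack([e(obs) for e in self.experts], dim=1)    # (B, K, act_dim)
            return (w.unsqueeze(-1) * acts).sum(dim=1)                   # blended action

    policy = MixturePolicy()
    obs    = torch.randn(8, 128)   # e.g. features from a perception backbone
    action = policy(obs)           # steering / throttle / brake (illustrative)
    # Stage 1: behavior cloning (regress recorded expert actions).
    # Stage 2: fine-tune the same parameters against the navigation task reward.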

pdf suppmat Video 2 Project Page Video 1 Slides DOI [BibTex]


Learning Implicit Surface Light Fields

Oechsle, M., Niemeyer, M., Reiser, C., Mescheder, L., Strauss, T., Geiger, A.

In International Conference on 3D Vision (3DV), 2020 (inproceedings)

Abstract
Implicit representations of 3D objects have recently achieved impressive results on learning-based 3D reconstruction tasks. While existing works use simple texture models to represent object appearance, photo-realistic image synthesis requires reasoning about the complex interplay of light, geometry and surface properties. In this work, we propose a novel implicit representation for capturing the visual appearance of an object in terms of its surface light field. In contrast to existing representations, our implicit model represents surface light fields in a continuous fashion and independent of the geometry. Moreover, we condition the surface light field with respect to the location and color of a small light source. Compared to traditional surface light field models, this allows us to manipulate the light source and relight the object using environment maps. We further demonstrate the capabilities of our model to predict the visual appearance of an unseen object from a single real RGB image and corresponding 3D shape information. As evidenced by our experiments, our model is able to infer rich visual appearance including shadows and specular reflections. Finally, we show that the proposed representation can be embedded into a variational auto-encoder for generating novel appearances that conform to the specified illumination conditions.
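
A hedged sketch of such a conditional surface light field, assuming a plain MLP that maps a surface point, viewing direction, point-light position/color and an appearance code to an RGB value; layer sizes and the input layout are assumptions.

    # Illustrative sketch: a conditional surface light field as a single MLP.
    import torch
    import torch.nn as nn

    class SurfaceLightField(nn.Module):
        def __init__(self, z_dim=128, hidden=256):
            super().__init__()
            # inputs: 3 (point) + 3 (view dir) + 3 (light pos) + 3 (light color) + z_dim
            self.net = nn.Sequential(
                nn.Linear(12 + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3), nn.Sigmoid())     # RGB in [0, 1]

        def forward(self, p, view, light_pos, light_col, z):
            return self.net(torch.cat([p, view, light_pos, light_col, z], dim=-1))

    f   = SurfaceLightField()
    p   = torch.rand(1024, 3)       # surface points from the given geometry
    v   = torch.randn(1024, 3)      # viewing directions
    lp  = torch.rand(1024, 3)       # light positions
    lc  = torch.rand(1024, 3)       # light colors
    z   = torch.randn(1024, 128)    # appearance code (e.g. inferred from a single image)
    rgb = f(p, v, lp, lc, z)        # predicted outgoing radiance per surface point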

pdf suppmat Project Page [BibTex]


Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art

Janai, J., Güney, F., Behl, A., Geiger, A.

12(1-3), Foundations and Trends® in Computer Graphics and Vision, now Publishers Inc., Hanover, MA, 2020 (book)

Abstract
Recent years have witnessed enormous progress in AI-related fields such as computer vision, machine learning, and autonomous vehicles. As with any rapidly growing field, it becomes increasingly difficult to stay up-to-date or enter the field as a beginner. While several survey papers on particular sub-problems have appeared, no comprehensive survey on problems, datasets, and methods in computer vision for autonomous vehicles has been published. This monograph attempts to narrow this gap by providing a survey on the state-of-the-art datasets and techniques. Our survey includes both the historically most relevant literature as well as the current state of the art on several specific topics, including recognition, reconstruction, motion estimation, tracking, scene understanding, and end-to-end learning for autonomous driving. Towards this goal, we analyze the performance of the state of the art on several challenging benchmarking datasets, including KITTI, MOT, and Cityscapes. In addition, we discuss open problems and current research challenges. To ease accessibility and accommodate missing references, we also provide a website that allows navigating topics as well as methods and provides additional information.

pdf Project Page link DOI Project Page [BibTex]


Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision

Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.

In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), pages: 3501-3512, IEEE, Piscataway, NJ, IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020 (inproceedings)

Abstract
Learning-based 3D reconstruction methods have shown impressive results. However, most methods require 3D supervision which is often hard to obtain for real-world datasets. Recently, several works have proposed differentiable rendering techniques to train reconstruction models from RGB images. Unfortunately, these approaches are currently restricted to voxel- and mesh-based representations, suffering from discretization or low resolution. In this work, we propose a differentiable rendering formulation for implicit shape and texture representations. Implicit representations have recently gained popularity as they represent shape and texture continuously. Our key insight is that depth gradients can be derived analytically using the concept of implicit differentiation. This allows us to learn implicit shape and texture representations directly from RGB images. We experimentally show that our single-view reconstructions rival those learned with full 3D supervision. Moreover, we find that our method can be used for multi-view 3D reconstruction, directly resulting in watertight meshes.
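
For readers interested in the key insight, a short sketch of the implicit-differentiation argument follows; the notation is ours and simplified, intended only to convey the idea stated in the abstract.

    % A camera ray r(d) = r_0 + d w intersects the surface of the implicit field
    % f_\theta at depth \hat{d}, i.e. f_\theta(r(\hat{d})) = \tau for a fixed level \tau.
    % Differentiating this identity with respect to the network parameters \theta:
    \[
      \frac{\partial f_\theta}{\partial \theta}(\hat{p})
        + \nabla_p f_\theta(\hat{p}) \cdot w \,\frac{\partial \hat{d}}{\partial \theta} = 0
      \quad\Longrightarrow\quad
      \frac{\partial \hat{d}}{\partial \theta}
        = -\bigl(\nabla_p f_\theta(\hat{p}) \cdot w\bigr)^{-1}
          \frac{\partial f_\theta}{\partial \theta}(\hat{p}),
      \qquad \hat{p} = r(\hat{d}).
    \]
    % The predicted surface depth thus receives analytic gradients from an image
    % reconstruction loss, without differentiating through the root-finding steps.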

pdf suppmat Video 2 Project Page Video 1 Video 3 Slides Poster DOI [BibTex]

2019


Attacking Optical Flow

Ranjan, A., Janai, J., Geiger, A., Black, M. J.

In Proceedings International Conference on Computer Vision (ICCV), pages: 2404-2413, IEEE, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), November 2019, ISSN: 2380-7504 (inproceedings)

Abstract
Deep neural nets achieve state-of-the-art performance on the problem of optical flow estimation. Since optical flow is used in several safety-critical applications like self-driving cars, it is important to gain insights into the robustness of those techniques. Recently, it has been shown that adversarial attacks easily fool deep neural networks to misclassify objects. The robustness of optical flow networks to adversarial attacks, however, has not been studied so far. In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance. We show that corrupting a small patch of less than 1% of the image size can significantly affect optical flow estimates. Our attacks lead to noisy flow estimates that extend significantly beyond the region of the attack, in many cases even completely erasing the motion of objects in the scene. While networks using an encoder-decoder architecture are very sensitive to these attacks, we found that networks using a spatial pyramid architecture are less affected. We analyse the success and failure of attacking both architectures by visualizing their feature maps and comparing them to classical optical flow techniques which are robust to these attacks. We also demonstrate that such attacks are practical by placing a printed pattern into real scenes.
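
A hedged sketch of a patch attack in this spirit: a small patch, pasted into both input frames, is optimized so that the predicted flow deviates as far as possible from the prediction on the clean frames. The callable flow_net, the fixed patch location and the loss are illustrative assumptions, not the paper's exact attack.

    # Illustrative sketch: optimizing an adversarial patch against a flow estimator.
    import torch

    def apply_patch(frame, patch, y, x):
        out = frame.clone()
        out[..., y:y + patch.shape[-2], x:x + patch.shape[-1]] = patch
        return out

    def attack(flow_net, frame1, frame2, size=32, steps=100, lr=1e-2):
        patch = torch.rand(1, 3, size, size, requires_grad=True)
        opt = torch.optim.Adam([patch], lr=lr)
        with torch.no_grad():
            clean_flow = flow_net(frame1, frame2)       # reference prediction
        y, x = 10, 10                                   # fixed location for simplicity
        for _ in range(steps):
            opt.zero_grad()
            f1 = apply_patch(frame1, patch.clamp(0, 1), y, x)
            f2 = apply_patch(frame2, patch.clamp(0, 1), y, x)
            adv_flow = flow_net(f1, f2)
            # maximize the end-point error relative to the clean prediction
            loss = -((adv_flow - clean_flow) ** 2).sum(dim=1).sqrt().mean()
            loss.backward()
            opt.step()
        return patch.detach().clamp(0, 1)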

Video Project Page Paper Supplementary Material link (url) DOI Project Page [BibTex]


Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics

Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.

International Conference on Computer Vision, October 2019 (conference)

Abstract
Deep learning based 3D reconstruction techniques have recently achieved impressive results. However, while state-of-the-art methods are able to output complex 3D geometry, it is not clear how to extend these results to time-varying topologies. Approaches treating each time step individually lack continuity and exhibit slow inference, while traditional 4D reconstruction methods often utilize a template model or discretize the 4D space at fixed resolution. In this work, we present Occupancy Flow, a novel spatio-temporal representation of time-varying 3D geometry with implicit correspondences. Towards this goal, we learn a temporally and spatially continuous vector field which assigns a motion vector to every point in space and time. In order to perform dense 4D reconstruction from images or sparse point clouds, we combine our method with a continuous 3D representation. Implicitly, our model yields correspondences over time, thus enabling fast inference while providing a sound physical description of the temporal dynamics. We show that our method can be used for interpolation and reconstruction tasks, and demonstrate the accuracy of the learned correspondences. We believe that Occupancy Flow is a promising new 4D representation which will be useful for a variety of spatio-temporal reconstruction tasks.
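
A minimal sketch of the underlying representation, assuming a small MLP velocity field and plain Euler integration to obtain correspondences over time; the full method additionally couples this field with a continuous 3D shape representation.

    # Illustrative sketch: a network assigns a velocity to every (point, time) pair,
    # and correspondences follow by integrating the ODE dp/dt = v(p, t).
    import torch
    import torch.nn as nn

    velocity = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                             nn.Linear(128, 128), nn.ReLU(),
                             nn.Linear(128, 3))          # (x, y, z, t) -> (vx, vy, vz)

    def advect(points, t0=0.0, t1=1.0, steps=50):
        """Transport points at time t0 to their corresponding locations at time t1."""
        p, dt = points, (t1 - t0) / steps
        for i in range(steps):
            t = torch.full((p.shape[0], 1), t0 + i * dt)
            p = p + dt * velocity(torch.cat([p, t], dim=-1))
        return p

    pts_t0 = torch.rand(2048, 3)    # e.g. surface points of the shape at t = 0
    pts_t1 = advect(pts_t0)         # implicit correspondences at t = 1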

pdf poster suppmat code Project page video blog [BibTex]


Texture Fields: Learning Texture Representations in Function Space

Oechsle, M., Mescheder, L., Niemeyer, M., Strauss, T., Geiger, A.

International Conference on Computer Vision, October 2019 (conference)

Abstract
In recent years, substantial progress has been achieved in learning-based reconstruction of 3D objects. At the same time, generative models were proposed that can generate highly realistic images. However, despite this success in these closely related tasks, texture reconstruction of 3D objects has received little attention from the research community and state-of-the-art methods are either limited to comparably low resolution or constrained experimental setups. A major reason for these limitations is that common representations of texture are inefficient or hard to interface for modern deep learning techniques. In this paper, we propose Texture Fields, a novel texture representation which is based on regressing a continuous 3D function parameterized with a neural network. Our approach circumvents limiting factors like shape discretization and parameterization, as the proposed texture representation is independent of the shape representation of the 3D object. We show that Texture Fields are able to represent high frequency texture and naturally blend with modern deep learning techniques. Experimentally, we find that Texture Fields compare favorably to state-of-the-art methods for conditional texture reconstruction of 3D objects and enable learning of probabilistic generative models for texturing unseen 3D models. We believe that Texture Fields will become an important building block for the next generation of generative 3D models.
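
A hedged sketch of a texture field as a conditional function: an MLP regresses an RGB color for any 3D point given a shape encoding and an image encoding, independently of how the surface itself is represented. Network sizes and the conditioning scheme are assumptions for illustration.

    # Illustrative sketch: a continuous texture function over 3D space.
    import torch
    import torch.nn as nn

    class TextureField(nn.Module):
        def __init__(self, z_shape=256, z_img=256, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + z_shape + z_img, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3), nn.Sigmoid())     # RGB in [0, 1]

        def forward(self, p, z_s, z_i):
            return self.net(torch.cat([p, z_s, z_i], dim=-1))

    tex = TextureField()
    p   = torch.rand(4096, 3)                   # points sampled on a reconstructed surface
    z_s = torch.randn(1, 256).expand(4096, -1)  # shape encoding
    z_i = torch.randn(1, 256).expand(4096, -1)  # encoding of a single input view
    colors = tex(p, z_s, z_i)                   # continuous texture, no UV parameterization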

pdf suppmat video poster blog Project Page [BibTex]