Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, Liwei Wang

Abstract
Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, because point clouds are sparse, applying a standard Transformer to sparse points is non-trivial. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To process sparse points efficiently in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly because it uses no customized CUDA operations. Our model achieves state-of-the-art performance on a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed with TensorRT at real-time inference speed (27 Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
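The partitioning idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `partition_window`, the `set_size` parameter, the 2D coordinates, and the choice of x-major versus y-major sorting as the two alternating configurations are all assumptions made for illustration. The sketch only shows how the number of sets in a window can grow with its occupancy while every set keeps a fixed size, so that all sets can be batched together.

```python
import numpy as np


def partition_window(coords, set_size, axis_major):
    """Split the non-empty voxels of one window into fixed-size sets.

    Hypothetical helper, not taken from the DSVT code base.
    coords:     (N, 2) integer voxel coordinates (x, y) inside the window.
    set_size:   assumed maximum number of voxels per set.
    axis_major: 0 sorts x-major, 1 sorts y-major (the two assumed
                configurations that consecutive layers alternate between).
    Returns (index, mask), both of shape (num_sets, set_size); padded
    slots hold -1 in index and False in mask.
    """
    n = coords.shape[0]
    # Ceiling division: denser windows are split into more sets.
    num_sets = max(1, -(-n // set_size))
    minor = 1 - axis_major
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((coords[:, minor], coords[:, axis_major]))
    padded = np.full(num_sets * set_size, -1, dtype=np.int64)
    padded[:n] = order
    index = padded.reshape(num_sets, set_size)
    mask = index >= 0
    return index, mask


# Five occupied voxels in one window, grouped into sets of two.
coords = np.array([[0, 0], [2, 1], [1, 3], [3, 0], [0, 2]])
sets_x, mask_x = partition_window(coords, set_size=2, axis_major=0)
sets_y, mask_y = partition_window(coords, set_size=2, axis_major=1)
print(sets_x)  # grouping used in one layer
print(sets_y)  # regrouping used in the next layer
```

Because the two orderings group different voxels together, alternating them between layers lets information flow across set boundaries. In a full model the `mask` would be used to exclude padded slots from attention.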
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-object-detection-on-nuscenes | DSVT | NDS: 0.73; mAAE: 0.14; mAOE: 0.30; mASE: 0.23; mATE: 0.25; mAVE: 0.25 |
| 3d-object-detection-on-nuscenes-lidar-only | DSVT | NDS: 72.7; NDS (val): 71.1; mAP: 68.4; mAP (val): 66.4 |
| 3d-object-detection-on-waymo-cyclist | DSVT (val) | APH/L2: 78.0 |
| 3d-object-detection-on-waymo-open-dataset | DSVT | mAPH/L2: 72.1 |
| 3d-object-detection-on-waymo-pedestrian | DSVT (val) | APH/L2: 76.4 |
| 3d-object-detection-on-waymo-vehicle | DSVT (val) | APH/L2: 74.1; L1 mAP: 82.1 |