PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
An efficient, camera-only, end-to-end autonomous driving model that achieves state-of-the-art performance without LiDAR or explicit BEV representations.
Code will be available upon publication.
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels), a novel and efficient end-to-end driving architecture that operates on camera data alone, without an explicit BEV representation and with no need for LiDAR. PRIX couples a visual feature extractor with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. Comprehensive experiments demonstrate that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger multimodal diffusion planners while being significantly more efficient in inference speed and model size, making it a practical solution for real-world deployment.
High Efficiency
Introduced PRIX, a novel camera-only, end-to-end planner that is significantly more efficient than multimodal and previous camera-only approaches in terms of inference speed and model size.
CaRT Module
Proposed the Context-aware Recalibration Transformer (CaRT), a new module designed to effectively enhance multi-level visual features for more robust planning.
Comprehensive Validation
Provided a comprehensive ablation study that validates our architectural choices and offers insights into optimizing the trade-off between performance, speed, and model size.
State-of-the-Art Performance
Achieved SOTA performance on NavSim and nuScenes datasets, outperforming larger, multimodal planners while being much smaller and faster.
PRIX's architecture processes multi-camera images through a visual backbone featuring our novel CaRT module. These enhanced visual features, combined with the vehicle's state and noisy anchors, are fed into a conditional diffusion planner to generate the final, safe trajectory.
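The recalibration step can be illustrated structurally with a minimal NumPy sketch. This is not the released implementation: the pooling and projection choices are our assumptions, and for brevity we use a single head and a single attention layer rather than the 4 heads and 2 layers listed in the hyperparameter table. Only the shared 512-dimensional token space comes from the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_head):
    # Single-head scaled dot-product self-attention with random
    # (untrained) projections -- a structural sketch only.
    rng = np.random.default_rng(0)
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_head)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head), axis=-1)
    return attn @ v

def recalibrate(multi_level_feats, shared_dim=512):
    # Project each backbone level into the shared dimension, pool each
    # level to one token, and let the levels attend to one another so
    # every level is recalibrated with context from the others.
    rng = np.random.default_rng(1)
    tokens = []
    for f in multi_level_feats:          # f: (C, H, W)
        pooled = f.mean(axis=(1, 2))     # global-average-pool to (C,)
        W = rng.standard_normal((f.shape[0], shared_dim)) / np.sqrt(f.shape[0])
        tokens.append(pooled @ W)
    tokens = np.stack(tokens)            # (num_levels, shared_dim)
    return self_attention(tokens, d_head=shared_dim)

# Three backbone levels with typical ResNet-style channel counts.
feats = [np.random.rand(64, 56, 56), np.random.rand(128, 28, 28), np.random.rand(256, 14, 14)]
out = recalibrate(feats)
print(out.shape)  # (3, 512)
```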
Performance vs. Speed
PRIX outperforms or matches multimodal methods like DiffusionDrive while being significantly smaller and faster, operating at a highly competitive framerate.
Key Benchmarks
NavSim-v1
87.8 PDMS
Top-performing model, surpassing both camera-only and multimodal competitors.
NavSim-v2
84.2 EPDMS
Achieved the best overall score, solidifying its position as the leading model.
nuScenes
0.57m Avg. L2 Error
Outperforms all existing camera-based baselines with the lowest error and collision rate.
Our model correctly handles complex scenarios like busy intersections and can even generate safer trajectories than the ground truth by maintaining a larger safety distance.
Sharp Left Turn
Improved Safety Margin
The model generates a trajectory that is safer than the ground truth by keeping a larger distance from other vehicles.
Robustness to Weather Conditions
PRIX demonstrates strong robustness, generating consistent and safe trajectories across clear, rainy, and snowy conditions.
Weather Example
Clean
Rainy
Snowy
Corresponding Model Predictions (Example 1)
Clean Prediction
Rain Prediction
Snow Prediction
Weather Example 2
Clean
Rainy
Snowy
Corresponding Model Predictions (Example 2)
Clean Prediction
Rain Prediction
Snow Prediction
Trajectory Refinement via Diffusion
The model generates initial trajectories and refines them over diffusion steps. The final selected path is shown in red, with the second-best alternative in blue.
Going Straight on a Narrow Road
Lane Change
Right Turn
Left Turn
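The refinement-and-selection loop described above can be sketched as follows. The denoiser and scoring function here are toy stand-ins (a pull toward a straight path, and a smoothness score) for the conditional diffusion planner and the actual trajectory scorer; all names are illustrative.

```python
import numpy as np

def refine_trajectories(anchors, denoise_step, n_steps=2, score=None):
    """Toy sketch of anchor-based refinement: noisy anchor trajectories
    are denoised for a few steps, then ranked; the best is the selected
    path (red in the figures), the runner-up the alternative (blue)."""
    trajs = anchors.copy()
    for t in reversed(range(n_steps)):
        trajs = denoise_step(trajs, t)        # conditional denoiser (stand-in)
    scores = np.array([score(tr) for tr in trajs])
    order = np.argsort(-scores)               # descending by score
    return trajs[order[0]], trajs[order[1]]

# Stand-ins: the "denoiser" pulls points toward a straight path,
# and the score rewards smooth, short trajectories.
target = np.stack([np.linspace(0, 10, 8), np.zeros(8)], axis=-1)  # (8, 2) waypoints
denoise = lambda trajs, t: trajs + 0.5 * (target - trajs)
smoothness = lambda tr: -np.abs(np.diff(tr, axis=0)).sum()

anchors = target + np.random.default_rng(0).normal(0, 1.0, size=(5, 8, 2))
best, runner_up = refine_trajectories(anchors, denoise, score=smoothness)
print(best.shape)  # (8, 2)
```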
Hyper Parameters
The table below summarizes the full hyperparameter configuration used for the PRIX model. We group settings for the backbone, detection and planning heads, batching and precision, optimization, distributed training, and loss weights.
| Hyperparameter | Value |
|---|---|
| Backbone configuration | |
| Image backbone | ResNet34 |
| Shared CaRT dimension | 512 |
| Number of CaRT self-attention layers | 2 |
| Number of attention heads | 4 |
| Heads configuration (detection and planning) | |
| Max number of bounding boxes | 30 |
| Segmentation feature channels | 64 |
| Segmentation number of classes | 7 |
| Trajectory output | (x, y, yaw) |
| Batching and precision | |
| GPUs | 8 × A100 (40 GB) |
| Per-GPU batch size | 64 |
| Mixed precision | bfloat16 (AMP) |
| Gradient clipping | 0.1 |
| Optimization | |
| Optimizer | AdamW |
| Initial learning rate | 1e-5 |
| Weight decay | 1e-3 |
| AdamW (β₁, β₂) | (0.9, 0.999) |
| LR scheduler | MultiStepLR |
| LR decay factor | 0.1 |
| Param-wise LR multiplier (image encoder) | 0.5 |
| Loss weights | |
| Trajectory loss weight | 10.0 |
| Agent classification weight | 10.0 |
| Agent box regression weight | 1.0 |
| Semantic segmentation weight | 10.0 |
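Putting the last group together, the training objective is presumably a weighted sum of the per-task losses. A minimal sketch using the weights from the table (the function and key names are illustrative, not from the codebase):

```python
# Weights match the hyperparameter table above; key names are hypothetical.
LOSS_WEIGHTS = {"trajectory": 10.0, "agent_cls": 10.0, "agent_box": 1.0, "semantic_seg": 10.0}

def total_loss(losses, weights=LOSS_WEIGHTS):
    # Weighted sum of the individual task losses.
    return sum(weights[name] * value for name, value in losses.items())

example = {"trajectory": 0.2, "agent_cls": 0.1, "agent_box": 0.5, "semantic_seg": 0.05}
print(total_loss(example))  # 2.0 + 1.0 + 0.5 + 0.5 = 4.0
```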
Additional Experiments
All experiments in this section are performed on NavSim-v1 unless noted otherwise.
Backbone Capacity, Speed, and Stability
- Aim: To identify the backbone (ResNet34, ResNet50, ResNet101) that offers the best balance of performance, speed, and stability.
- Method: PRIX was trained with each backbone, and PDMS (mean ± standard deviation over five runs), parameter count, and FPS were measured.
- Result: ResNet34 provides the best overall trade-off. It is the fastest model (57.0 FPS) and the most stable (87.8 ± 0.1 PDMS), while showing only a minor performance difference compared to the larger and slower ResNet101.
| Model | Backbone | PDMS | Params | FPS |
|---|---|---|---|---|
| PRIX (default) | ResNet34 | 87.8 ± 0.1 | 37M | 57.0 |
| PRIX-50 | ResNet50 | 87.8 ± 0.2 | 41M | 47.3 |
| PRIX-101 | ResNet101 | 87.9 ± 0.4 | 58M | 28.6 |
Ablation on Loss Weights
- Aim: To determine the best weighting for the detection loss and the semantic loss.
- Method: A grid search was performed over different weight combinations, and the PDMS score was recorded for each setting.
- Result: A low detection loss weight and a high semantic loss weight perform best. The configuration with detection weight 1 and semantic weight 10 yields 87.8 PDMS. Performance increases consistently as the semantic loss weight increases.
Sensor Failures
- Aim: To test PRIX's robustness to sensor failures, such as camera noise or dropout.
- Method: Sensor failures were simulated at test time for a standard model. New models were then trained with these corruptions (noise or dropout) to evaluate whether robustness improves.
- Result: While failures reduce the baseline model's score from 88.7 to 82.2 PDMS, training with noise recovers the score to 84.7.
| Training Method | Test-Time Input | PDMS Score |
|---|---|---|
| Standard (Baseline) | Clean | 88.7 |
| Standard (Baseline) | With Failures | 82.2 |
| Train w/ Full Camera Dropout | With Failures | 83.9 |
| Train w/ Random Noise | With Failures | 84.7 |
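The train-time corruptions can be sketched as a simple camera augmentation. This is a hypothetical helper: the paper's exact noise magnitude and dropout schedule are not specified here, so the parameters below are assumptions.

```python
import numpy as np

def corrupt_cameras(images, mode="noise", p=0.5, sigma=0.1, rng=None):
    """Sketch of the corruptions used for robustness training: additive
    Gaussian noise, or full dropout of randomly chosen camera views."""
    rng = rng or np.random.default_rng()
    out = images.copy()
    for i in range(len(out)):
        if rng.random() < p:                  # corrupt this view?
            if mode == "noise":
                out[i] = np.clip(out[i] + rng.normal(0, sigma, out[i].shape), 0, 1)
            elif mode == "dropout":
                out[i] = 0.0                  # simulate a fully failed camera
    return out

cams = np.random.default_rng(0).random((3, 4, 4, 3))  # 3 views, tiny H x W x C
noisy = corrupt_cameras(cams, mode="noise")
print(noisy.shape)  # (3, 4, 4, 3)
```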
Impact of Ego Status in the Planning Head
- Aim: To understand how important the ego status input (velocity, acceleration) is for the planner.
- Method: We corrupted the ego status at test time (masking it, replacing it with noise) for both a standard model and a model trained *with* status corruption.
- Result: Ego status is critical, but PRIX can be trained to be robust. Corrupting the status drops the baseline model's score from 87.8 to 66.8 PDMS. However, the model trained with corruption achieves a strong 84.7 PDMS, even with random inputs.
| Ego Status (Test Time) | Corruption at Training | PRIX | DiffusionDrive |
|---|---|---|---|
| Status (Clean) | × | 87.8 | 88.1 |
| Zero (masked) | × | 64.4 | 63.9 |
| Random | × | 66.8 | 68.1 |
| Scaled | × | 80.9 | 81.3 |
| Zero (masked) | ✓ | 81.3 | 81.1 |
| Random | ✓ | 84.7 | 83.9 |
| Scaled | ✓ | 84.4 | 84.0 |
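The three ego-status corruptions in the table can be sketched with a small helper. The helper name, the (velocity, acceleration) layout, and the noise and scaling distributions are all assumptions for illustration.

```python
import numpy as np

def corrupt_ego_status(status, mode, rng=None):
    """Sketch of the test-time corruptions: zero-masking, random
    replacement, or random scaling of the ego-status vector."""
    rng = rng or np.random.default_rng(0)
    if mode == "zero":
        return np.zeros_like(status)          # mask the status entirely
    if mode == "random":
        return rng.standard_normal(status.shape)
    if mode == "scaled":
        return status * rng.uniform(0.5, 1.5)
    return status                             # "clean": pass through

status = np.array([5.0, 0.2])  # hypothetical [velocity, acceleration]
print(corrupt_ego_status(status, "zero"))  # [0. 0.]
```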