PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
An efficient, camera-only, end-to-end autonomous driving model that achieves state-of-the-art performance without LiDAR or explicit BEV representations.
Code will be available upon publication.
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels), a novel and efficient end-to-end driving architecture that operates on camera data alone, without an explicit BEV representation and with no need for LiDAR. PRIX couples a visual feature extractor with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. Comprehensive experiments demonstrate that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger multimodal diffusion planners while being significantly more efficient in inference speed and model size, making it a practical solution for real-world deployment.
High Efficiency
Introduced PRIX, a novel camera-only, end-to-end planner that is significantly more efficient than multimodal and previous camera-only approaches in terms of inference speed and model size.
CaRT Module
Proposed the Context-aware Recalibration Transformer (CaRT), a new module designed to effectively enhance multi-level visual features for more robust planning.
Comprehensive Validation
Provided a comprehensive ablation study that validates our architectural choices and offers insights into optimizing the trade-off between performance, speed, and model size.
State-of-the-Art Performance
Achieved SOTA performance on NavSim and nuScenes datasets, outperforming larger, multimodal planners while being much smaller and faster.
PRIX's architecture processes multi-camera images through a visual backbone featuring our novel CaRT module. These enhanced visual features, combined with the vehicle's state and noisy anchors, are fed into a conditional diffusion planner to generate the final, safe trajectory.
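The recalibration step can be illustrated structurally with a minimal NumPy sketch. This is not the released implementation: the pooling and projection choices are our assumptions, and for brevity we use a single head and a single attention layer rather than the 4 heads and 2 layers listed in the hyperparameter table. Only the shared 512-dimensional token space comes from the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_head):
    # Single-head scaled dot-product self-attention with random
    # (untrained) projections -- a structural sketch only.
    rng = np.random.default_rng(0)
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_head)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head), axis=-1)
    return attn @ v

def recalibrate(multi_level_feats, shared_dim=512):
    # Project each backbone level into the shared dimension, pool each
    # level to one token, and let the levels attend to one another so
    # every level is recalibrated with context from the others.
    rng = np.random.default_rng(1)
    tokens = []
    for f in multi_level_feats:          # f: (C, H, W)
        pooled = f.mean(axis=(1, 2))     # global-average-pool to (C,)
        W = rng.standard_normal((f.shape[0], shared_dim)) / np.sqrt(f.shape[0])
        tokens.append(pooled @ W)
    tokens = np.stack(tokens)            # (num_levels, shared_dim)
    return self_attention(tokens, d_head=shared_dim)

# Three backbone levels with typical ResNet-style channel counts.
feats = [np.random.rand(64, 56, 56), np.random.rand(128, 28, 28), np.random.rand(256, 14, 14)]
out = recalibrate(feats)
print(out.shape)  # (3, 512)
```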
Performance vs. Speed
PRIX outperforms or matches multimodal methods like DiffusionDrive while being significantly smaller and faster, operating at a highly competitive framerate.
Key Benchmarks
NavSim-v1
87.8 PDMS
Top-performing model, surpassing both camera-only and multimodal competitors.
NavSim-v2
84.2 EPDMS
Achieved the best overall score, solidifying its position as the leading model.
nuScenes
0.57m Avg. L2 Error
Outperforms all existing camera-based baselines with the lowest error and collision rate.
Our model correctly handles complex scenarios like busy intersections and can even generate safer trajectories than the ground truth by maintaining a larger safety distance.
Sharp Left Turn
Improved Safety Margin
The model generates a trajectory that is safer than the ground truth by keeping a larger distance from other vehicles.
Robustness to Weather Conditions
PRIX demonstrates strong robustness, generating consistent and safe trajectories across clear, rainy, and snowy conditions.
Weather Example
Clean
Rainy
Snowy
Corresponding Model Predictions (Example 1)
Clean Prediction
Rain Prediction
Snow Prediction
Weather Example 2
Clean
Rainy
Snowy
Corresponding Model Predictions (Example 2)
Clean Prediction
Rain Prediction
Snow Prediction
Trajectory Refinement via Diffusion
The model generates initial trajectories and refines them over diffusion steps. The final selected path is shown in red, with the second-best alternative in blue.
Going Straight on a Narrow Road
Lane Change
Right Turn
Left Turn
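The refinement-and-selection loop described above can be sketched as follows. The denoiser and scoring function here are toy stand-ins (a pull toward a straight path, and a smoothness score) for the conditional diffusion planner and the actual trajectory scorer; all names are illustrative.

```python
import numpy as np

def refine_trajectories(anchors, denoise_step, n_steps=2, score=None):
    """Toy sketch of anchor-based refinement: noisy anchor trajectories
    are denoised for a few steps, then ranked; the best is the selected
    path (red in the figures), the runner-up the alternative (blue)."""
    trajs = anchors.copy()
    for t in reversed(range(n_steps)):
        trajs = denoise_step(trajs, t)        # conditional denoiser (stand-in)
    scores = np.array([score(tr) for tr in trajs])
    order = np.argsort(-scores)               # descending by score
    return trajs[order[0]], trajs[order[1]]

# Stand-ins: the "denoiser" pulls points toward a straight path,
# and the score rewards smooth, short trajectories.
target = np.stack([np.linspace(0, 10, 8), np.zeros(8)], axis=-1)  # (8, 2) waypoints
denoise = lambda trajs, t: trajs + 0.5 * (target - trajs)
smoothness = lambda tr: -np.abs(np.diff(tr, axis=0)).sum()

anchors = target + np.random.default_rng(0).normal(0, 1.0, size=(5, 8, 2))
best, runner_up = refine_trajectories(anchors, denoise, score=smoothness)
print(best.shape)  # (8, 2)
```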
Hyper Parameters
The table below summarizes the full hyperparameter configuration used for the PRIX model. We group settings for the backbone, detection and planning heads, batching and precision, optimization, distributed training, and loss weights.
| Hyperparameter | Value |
|---|---|
| Backbone configuration | |
| Image backbone | ResNet34 |
| Shared CaRT dimension | 512 |
| Number of CaRT self-attention layers | 2 |
| Number of attention heads | 4 |
| Heads configuration (detection and planning) | |
| Max number of bounding boxes | 30 |
| Segmentation feature channels | 64 |
| Segmentation number of classes | 7 |
| Trajectory output | (x, y, yaw) |
| Batching and precision | |
| GPUs | 8 × A100 (40 GB) |
| Per-GPU batch size | 64 |
| Mixed precision | bfloat16 (AMP) |
| Gradient clipping | 0.1 |
| Optimization | |
| Optimizer | AdamW |
| Initial learning rate | 1e-5 |
| Weight decay | 1e-3 |
| AdamW (β₁, β₂) | (0.9, 0.999) |
| LR scheduler | MultiStepLR |
| LR decay factor | 0.1 |
| Param-wise LR multiplier (image encoder) | 0.5 |
| Loss weights | |
| Trajectory loss weight | 10.0 |
| Agent classification weight | 10.0 |
| Agent box regression weight | 1.0 |
| Semantic segmentation weight | 10.0 |
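Putting the last group together, the training objective is presumably a weighted sum of the per-task losses. A minimal sketch using the weights from the table (the function and key names are illustrative, not from the codebase):

```python
# Weights match the hyperparameter table above; key names are hypothetical.
LOSS_WEIGHTS = {"trajectory": 10.0, "agent_cls": 10.0, "agent_box": 1.0, "semantic_seg": 10.0}

def total_loss(losses, weights=LOSS_WEIGHTS):
    # Weighted sum of the individual task losses.
    return sum(weights[name] * value for name, value in losses.items())

example = {"trajectory": 0.2, "agent_cls": 0.1, "agent_box": 0.5, "semantic_seg": 0.05}
print(total_loss(example))  # 2.0 + 1.0 + 0.5 + 0.5 = 4.0
```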
Additional Experiments
All experiments in this section are performed on NavSim-v1 unless noted otherwise.
Backbone Capacity, Speed, and Stability
- Aim: To identify the backbone (ResNet34, ResNet50, ResNet101) that offers the best balance of performance, speed, and stability.
- Method: PRIX was trained with each backbone, and PDMS (mean ± standard deviation over five runs), parameter count, and FPS were measured.
- Result: ResNet34 provides the best overall trade-off. It is the fastest model (57.0 FPS) and the most stable (87.8 ± 0.1 PDMS), while showing only a minor performance difference compared to the larger and slower ResNet101.
| Model | Backbone | PDMS | Params | FPS |
|---|---|---|---|---|
| PRIX (default) | ResNet34 | 87.8 ± 0.1 | 37M | 57.0 |
| PRIX-50 | ResNet50 | 87.8 ± 0.2 | 41M | 47.3 |
| PRIX-101 | ResNet101 | 87.9 ± 0.4 | 58M | 28.6 |
Ablation on Loss Weights
- Aim: To determine the best weighting for the detection loss and the semantic loss.
- Method: A grid search was performed over different weight combinations, and the PDMS score was recorded for each setting.
- Result: A low detection loss weight and a high semantic loss weight perform best. The configuration with detection weight 1 and semantic weight 10 yields 87.8 PDMS. Performance increases consistently as the semantic loss weight increases.
Sensor Failures
- Aim: To test PRIX's robustness to sensor failures, such as camera noise or dropout.
- Method: Sensor failures were simulated at test time for a standard model. New models were then trained with these corruptions (noise or dropout) to evaluate whether robustness improves.
- Result: While failures reduce the baseline model's score from 88.7 to 82.2 PDMS, training with noise recovers the score to 84.7.
| Training Method | Test-Time Input | PDMS Score |
|---|---|---|
| Standard (Baseline) | Clean | 88.7 |
| Standard (Baseline) | With Failures | 82.2 |
| Train w/ Full Camera Dropout | With Failures | 83.9 |
| Train w/ Random Noise | With Failures | 84.7 |
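The train-time corruptions can be sketched as a simple camera augmentation. This is a hypothetical helper: the paper's exact noise magnitude and dropout schedule are not specified here, so the parameters below are assumptions.

```python
import numpy as np

def corrupt_cameras(images, mode="noise", p=0.5, sigma=0.1, rng=None):
    """Sketch of the corruptions used for robustness training: additive
    Gaussian noise, or full dropout of randomly chosen camera views."""
    rng = rng or np.random.default_rng()
    out = images.copy()
    for i in range(len(out)):
        if rng.random() < p:                  # corrupt this view?
            if mode == "noise":
                out[i] = np.clip(out[i] + rng.normal(0, sigma, out[i].shape), 0, 1)
            elif mode == "dropout":
                out[i] = 0.0                  # simulate a fully failed camera
    return out

cams = np.random.default_rng(0).random((3, 4, 4, 3))  # 3 views, tiny H x W x C
noisy = corrupt_cameras(cams, mode="noise")
print(noisy.shape)  # (3, 4, 4, 3)
```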
Impact of Ego Status in the Planning Head
- Aim: To understand how important the ego status input (velocity, acceleration) is for the planner.
- Method: We corrupted the ego status at test time (masking it, replacing it with noise) for both a standard model and a model trained *with* status corruption.
- Result: Ego status is critical, but PRIX can be trained to be robust. Corrupting the status drops the baseline model's score from 87.8 to 66.8 PDMS. However, the model trained with corruption achieves a strong 84.7 PDMS, even with random inputs.
| Ego Status (Test Time) | Corruption at Training | PRIX | DiffusionDrive |
|---|---|---|---|
| Status (Clean) | × | 87.8 | 88.1 |
| Zero (masked) | × | 64.4 | 63.9 |
| Random | × | 66.8 | 68.1 |
| Scaled | × | 80.9 | 81.3 |
| Zero (masked) | ✓ | 81.3 | 81.1 |
| Random | ✓ | 84.7 | 83.9 |
| Scaled | ✓ | 84.4 | 84.0 |
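The three ego-status corruptions in the table can be sketched with a small helper. The helper name, the (velocity, acceleration) layout, and the noise and scaling distributions are all assumptions for illustration.

```python
import numpy as np

def corrupt_ego_status(status, mode, rng=None):
    """Sketch of the test-time corruptions: zero-masking, random
    replacement, or random scaling of the ego-status vector."""
    rng = rng or np.random.default_rng(0)
    if mode == "zero":
        return np.zeros_like(status)          # mask the status entirely
    if mode == "random":
        return rng.standard_normal(status.shape)
    if mode == "scaled":
        return status * rng.uniform(0.5, 1.5)
    return status                             # "clean": pass through

status = np.array([5.0, 0.2])  # hypothetical [velocity, acceleration]
print(corrupt_ego_status(status, "zero"))  # [0. 0.]
```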