I have seen a large discrepancy between identical PINO training runs on a 4090 with torch 2.6.0 and a 5090 with torch 2.8.0. After some investigation, I found the cause to be the TF32 tensor-core optimization that cuDNN applies to float32 operations by default. In my case, I was able to fix the discrepancy (and gain a substantial improvement in accuracy) by setting:
torch.backends.cudnn.allow_tf32 = False
for torch < 2.9.0, and
torch.backends.fp32_precision = "ieee"
for torch >= 2.9.0
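The two settings above can be wrapped in a small version-aware helper. This is a minimal sketch, assuming the API change lands exactly at torch 2.9.0 as described; the helper name and the extra `torch.backends.cuda.matmul.allow_tf32` flag (which covers cuBLAS matmuls alongside cuDNN convolutions) are my additions, not from the original post.

```python
def torch_version_tuple(version: str) -> tuple:
    # Handle local version suffixes like "2.8.0+cu128".
    parts = version.split("+")[0].split(".")
    return tuple(int(p) for p in parts[:2])

def disable_tf32(torch) -> None:
    # Force float32 ops to use full IEEE FP32 precision instead of TF32.
    if torch_version_tuple(torch.__version__) >= (2, 9):
        # Unified precision API (torch >= 2.9.0)
        torch.backends.fp32_precision = "ieee"
    else:
        # Per-backend flags (torch < 2.9.0)
        torch.backends.cudnn.allow_tf32 = False        # cuDNN convolutions
        torch.backends.cuda.matmul.allow_tf32 = False  # cuBLAS matmuls (related flag, assumption)
```

Usage: call `import torch; disable_tf32(torch)` once, before building the model and starting training.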