Add patches for PyTorch 1.7.1 avoiding failures on POWER and A100 #12753
boegel merged 14 commits into easybuilders:develop from …
Conversation
@boegelbot please test @ generoso

@boegel: Request for testing this PR well received on generoso; PR test command '…'. Test results coming soon (I hope)...
Details: notification for comment with ID 827555748 processed. Message to humans: this is just bookkeeping information for me, …

Test report by @boegelbot

Test report by @boegel
I was running my last test build on a node that has the latest CUDA driver (465.19.01), which seems to have an impact on the tests for … Ever seen something like that, @Flamefire? Perhaps the test is just unstable, and it was bad luck?
edit: This also happens without the extra patches being added here, BTW...
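For context, a quick way to correlate such failures with the driver/toolkit combination is to dump what PyTorch itself sees on the node. A minimal sketch (the nvidia-smi call assumes that tool is on $PATH; it is not part of this PR):

```python
import subprocess

import torch

# CUDA runtime version this PyTorch build was compiled against
print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)

# Name and compute capability of each visible GPU (an A100 reports 8.0)
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
              f"(compute capability {major}.{minor})")

# Driver version as reported by nvidia-smi
print("driver:", subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
).decode().strip())
```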
Was your update to CUDA 11.2? The PyTorch developers have seen such failures: pytorch/pytorch#51905
Edit: That test was on foss, i.e. without CUDA. Running on fosscuda now.
Test report by @boegel
Test failure on … Driver version is …
Test report by @boegel

Test report by @Flamefire
@Flamefire Fix with the latest CUDA drivers confirmed, but trouble on other systems?
@verdurin Does that system have the latest patches for OpenBLAS? Because that is a CPU test failure which works on our POWER system. @boegel IIRC it never worked on our A100 system (the previous test was foss only). FWIW the upstream issue is pytorch/pytorch#52278. Maybe I'll just disable …
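For reference, the PyTorch easyblock lets an easyconfig skip individual tests via its excluded_tests parameter, keyed by architecture (the empty key applies everywhere). A minimal sketch; the test names below are placeholders for illustration, not the ones from this PR:

```python
# Fragment of a PyTorch easyconfig (sketch only, not the file from this PR)
excluded_tests = {
    # tests excluded on all architectures
    '': [
        'distributed/test_distributed_fork',  # placeholder: a flaky test
    ],
    # tests excluded only on POWER
    'POWER': [
        'test_nn',  # placeholder
    ],
}
```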
@boegel If it helps, this is what is currently installed on the AMD EPYC 7552 48-core machine I have access to: …
@Flamefire this is the same node I mentioned recently. It has the following patches: …
Test report by @Flamefire

Test report by @Flamefire

Test report by @Flamefire
Test report by @branfosj

I've seen that for PyTorch 1.8.1 on our A100 GPUs, but never on the older ones. Maybe it is CUDA driver version related? I still have to check whether I see this on our new partition with this EC too. It might be that our (EB) NCCL is faulty.
The failed test was also with A100s, NVIDIA driver 460.73.01. I have this built fine on a system with a single P100 (NVIDIA driver 460.32.03). PyTorch 1.8.1 failed on the A100s with the same problem that you have reported to PyTorch. That also failed with a self-built NCCL 2.9.8.
Confirmed also with PyTorch 1.8.1 fosscuda/2020b and its submodule NCCL (2.7.8).
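To separate an NCCL problem from a PyTorch one, a bare all-reduce over the nccl backend of torch.distributed is a useful check. A minimal single-node sketch (the rendezvous address and port are arbitrary choices, not values from this thread):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # arbitrary single-node rendezvous settings
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # default op is SUM: every rank should end up with world_size
    print(f"rank {rank}: {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```

If this hangs or produces wrong sums while plain single-GPU workloads pass, the NCCL build (or the driver underneath it) is the prime suspect.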
@boegelbot please test @ generoso

@boegel: Request for testing this PR well received on generoso; PR test command '…'. Test results coming soon (I hope)...
Details: notification for comment with ID 851550784 processed. Message to humans: this is just bookkeeping information for me, …
Test report by @Flamefire

Test report by @Flamefire

Test report by @branfosj

Test report by @branfosj

Test report by @boegel

Test report by @Flamefire
The failed tests are bottleneck_test, which will be disabled by easybuilders/easybuild-easyblocks#2450, and a flaky test with fosscuda-2020b where a timeout is reached too early, i.e. all are non-critical.
OK, thanks for clarifying @Flamefire!
Going in, thanks @Flamefire!
(created using eb --new-pr)