@bedroge and I hit an issue on his CC 12.0 system where my build for NCCL failed with:
```
Warnings and errors of '/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/share/Lmod/libexec/lmod ...' command (stderr only):
Lmod has detected the following error:
Your driver CUDA version is 12.0
but the module you want to load requires CUDA 12.8.0. Please update your
CUDA driver libraries and then let EESSI know about the update.
For more information on how to do this, see
https://www.eessi.io/docs/site_specific_config/gpu/.
While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0  /cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen5/accel/nvidia/cc120/modules/all/UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0.lua
```
We were surprised to see this, since the driver on the host is 13.0. Initially, we figured @bedroge just had to rerun the driver symlink script - but that didn't help. Even removing the `cuda_version.txt` from the `shared_fs` of the bot didn't help: it regenerated the `cuda_version.txt`, but with the same 12.0 content - rather than the expected 13.0.
## BUG 1
Turns out the driver version injected into `cuda_version.txt` is wrong. We query `nvidia-smi --query-gpu=gpu_name,count,driver_version,compute_cap` here, but `compute_cap` returns the compute capability of the hardware - not the CUDA version supported by the driver. Since @bedroge has RTX 6000 cards, it returns 12.0. I checked on my system, and there the `cuda_version.txt` indeed contained 9.0 (for our H100 cards).
## BUG 2
The check in `SitePackage.lua` is also broken, which is why the issue above went unnoticed for a very long time: it compares the version components as strings. On our system, that means `major=9` and `major_req=12` if you're installing CUDA 12.x-related software. This passes because, it being a string comparison, it compares "9" against "1", concludes that 9 is larger, and decides the requirement is met. On @bedroge's system it fails because `major=12`, `major_req=12`, `minor=0`, and `minor_req=8`: the comparison of the majors passes (they are equal), but the comparison of the minors fails (since 0 < 8), hence raising the error.
What should be done here is `tonumber(major) < tonumber(major_req)`, and similarly for the minors.
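The pitfall is easy to reproduce outside Lua: Python string comparison is lexicographic in exactly the same way, so a minimal illustration of both failure modes (not the actual `SitePackage.lua` code) looks like this:

```python
# Lexicographic (string) comparison gets version numbers wrong:
# "9" > "12" because only the first characters are compared, and "9" > "1".
major, major_req = "9", "12"
assert not (major < major_req)      # string compare: 9 wrongly "satisfies" a 12 requirement
assert int(major) < int(major_req)  # numeric compare: 9 does NOT satisfy 12

# The same string semantics explain the CC 12.0 failure: the majors compare
# equal, so the minors are compared, and "0" < "8" raises the error.
major, major_req = "12", "12"
minor, minor_req = "0", "8"
assert major == major_req and minor < minor_req
```

Converting with `tonumber` (or `int` here) before comparing fixes both cases at once.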
## The fix
Before we fix bug 2, we should fix bug 1; otherwise, this will break CUDA runtime support on the many systems that have the wrong version encoded in their `cuda_version.txt`. In all honesty, to not break it, I think we should never use `cuda_version.txt` again, as we now know it may contain wrong information. Probably, we should just push for #189 . I was hoping we could use the `cuda_version.txt` as a fallback option there, but it seems we can't rely on it any longer (unless we put a second text file under a different name in place, or something). Maybe the fallback should just be: if we can't grep the CUDA driver version by parsing `nvidia-smi` output, we just print a warning that runtime support may be broken, since we cannot check whether the linked driver is new enough...
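A fallback along those lines could look like the sketch below (a hypothetical helper, not existing EESSI code): it parses the `CUDA Version: X.Y` field that `nvidia-smi` prints in its banner, and warns instead of failing when the version cannot be determined.

```python
import re
import subprocess

def get_driver_cuda_version(smi_output=None):
    """Return the CUDA version supported by the driver as (major, minor),
    or None if it cannot be determined. Hypothetical sketch, not EESSI code."""
    if smi_output is None:
        try:
            # The nvidia-smi banner contains e.g.
            # "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
            smi_output = subprocess.run(["nvidia-smi"], capture_output=True,
                                        text=True, check=True).stdout
        except (OSError, subprocess.CalledProcessError):
            return None
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

version = get_driver_cuda_version()
if version is None:
    # The proposed soft failure: warn rather than block the installation.
    print("WARNING: could not determine the CUDA version supported by the driver; "
          "CUDA runtime support may be broken.")
```

Comparing the returned `(major, minor)` tuple against the requirement as integers would then also sidestep bug 2 entirely.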