
NVIDIA GPU driver version written by link_nvidia_host_libraries.sh AND check in SitePackage.lua are broken #201

@casparvl

Description


@bedroge and I hit an issue on his CC 12.0 system where my build for NCCL failed with:

Warnings and errors of '/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/share/Lmod/libexec/lmod ...' command (stderr only):
Lmod has detected the following error:
Your driver CUDA version is 12.0
but the module you want to load requires CUDA 12.8.0. Please update your
CUDA driver libraries and then let EESSI know about the update.
For more information on how to do this, see
https://www.eessi.io/docs/site_specific_config/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0  /cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen5/accel/nvidia/cc120/modules/all/UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0.lua

We were surprised to see this, since the driver on the host supports CUDA 13.0. Initially, we figured @bedroge just had to rerun the driver symlink script, but that didn't help. Even removing the cuda_version.txt from the bot's shared_fs didn't help: it regenerated cuda_version.txt, but with the same 12.0 content rather than the expected 13.0.

BUG 1

Turns out the driver version we inject into cuda_version.txt is wrong. We query nvidia-smi --query-gpu=gpu_name,count,driver_version,compute_cap here, but compute_cap returns the compute capability of the hardware, not the CUDA version supported by the driver. Since @bedroge has RTX6000 cards, it returns 12.0. I checked on my system: there, cuda_version.txt indeed contained 9.0 (the compute capability of our H100 cards).
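To illustrate the distinction: the driver's supported CUDA version does appear in the plain nvidia-smi banner, just not in the compute_cap query field. A minimal sketch of extracting it by parsing that banner line (the sample string below stands in for real nvidia-smi output; the actual fix may well query a different field or format instead):

```shell
# Sample banner line as printed by plain `nvidia-smi` (hard-coded here so the
# sketch runs without a GPU); note "CUDA Version" is a driver property,
# independent of the compute capability of the installed cards.
sample_header='| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |'

# Pull out the CUDA version supported by the driver.
driver_cuda_version=$(echo "$sample_header" | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
echo "$driver_cuda_version"   # prints 13.0
```

On @bedroge's host this would yield 13.0, whereas the compute_cap query yields the hardware's 12.0.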

BUG 2

The check in SitePackage.lua is also broken, which is why the above issue went unnoticed for a very long time. It compares the version components as strings. On our system, that means major=9 and major_req=12 if you're installing CUDA 12.X-related software. This passes because, under string comparison, "9" is compared against "1", so "9" is considered larger and the requirement appears to be met. On @bedroge's system it fails because major=12, major_req=12, minor=0 and minor_req=8: the majors compare equal, but the minor comparison fails (since 0 < 8), hence the error.

What should be done here is tonumber(major) < tonumber(major_req), and similarly for the minors.
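The failure mode is easy to reproduce outside Lua. A small shell sketch of lexicographic vs numeric ordering, mirroring what the broken check does and what the tonumber() fix changes:

```shell
# The two major versions from the "9 vs 12" case described above.
major=9
major_req=12

# Lexicographic ordering (what string comparison effectively does):
# "12" sorts before "9" because '1' < '9', so 9 wrongly looks >= 12.
smallest_str=$(printf '%s\n%s\n' "$major" "$major_req" | sort | head -n1)
echo "string ordering puts first: $smallest_str"    # prints 12

# Numeric ordering (the tonumber() equivalent): 9 correctly comes first,
# so the 9.0 "driver version" would fail the >= 12 requirement.
smallest_num=$(printf '%s\n%s\n' "$major" "$major_req" | sort -n | head -n1)
echo "numeric ordering puts first: $smallest_num"   # prints 9
```

The same inversion explains why 9.0 passed a 12.X requirement for years while 12.0 vs 12.8 fails on the minor digit.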

The fix

Before we fix bug 2, we should fix bug 1; otherwise, fixing the comparison will break CUDA runtime support on the many systems that have the wrong version encoded in their cuda_version.txt. In all honesty, to avoid breaking anything, I think we should never use cuda_version.txt again, as we now know it may contain wrong information. Probably we should just push for #189. I was hoping we could use cuda_version.txt as a fallback option there, but it seems we can't rely on it any longer (unless we put a second text file under a different name in place, or something similar). Maybe the fallback should just be: if we can't determine the CUDA driver version by parsing nvidia-smi output, we print a warning that runtime support may be broken, since we cannot check whether the linked driver is new enough.
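The warn-instead-of-fail fallback could be sketched roughly as follows (the function name and messages are illustrative, not the real EESSI code; the stub returns nothing to exercise the fallback path):

```shell
# Hypothetical helper: in real use this would parse nvidia-smi output.
# Here it is stubbed to return an empty string, simulating the case where
# the driver's CUDA version cannot be determined.
get_driver_cuda_version() {
    echo ""
}

driver_cuda_version=$(get_driver_cuda_version)
if [ -z "$driver_cuda_version" ]; then
    # Cannot verify the driver, so warn rather than record a possibly-wrong
    # version that a later Lmod check would trust.
    echo "WARNING: cannot determine the driver CUDA version from nvidia-smi;" \
         "CUDA runtime support may be broken." >&2
else
    echo "$driver_cuda_version" > cuda_version.txt
fi
```

The key design point is that the fallback never writes a value it cannot verify, so a stale or wrong cuda_version.txt can no longer cause false failures like the one above.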
