
NVIDIA GPU driver version written by link_nvidia_host_libraries.sh AND check in SitePackage.lua are broken #201

@casparvl

Description


@bedroge and I hit an issue on his CC 12.0 system where my build for NCCL failed with:

Warnings and errors of '/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/share/Lmod/libexec/lmod ...' command (stderr only):
Lmod has detected the following error:
Your driver CUDA version is 12.0
but the module you want to load requires CUDA 12.8.0. Please update your
CUDA driver libraries and then let EESSI know about the update.
For more information on how to do this, see
https://www.eessi.io/docs/site_specific_config/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0  /cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen5/accel/nvidia/cc120/modules/all/UCX-CUDA/1.18.0-GCCcore-14.2.0-CUDA-12.8.0.lua

We were surprised to see this, since the driver on the host supports CUDA 13.0. Initially, we figured @bedroge just had to rerun the driver symlink script, but that didn't help. Even removing the cuda_version.txt from the bot's shared_fs didn't help: it regenerated cuda_version.txt, but with the same 12.0 content rather than the expected 13.0.

BUG 1

Turns out the driver version we inject into cuda_version.txt is wrong. We query nvidia-smi --query-gpu=gpu_name,count,driver_version,compute_cap here, but compute_cap returns the compute capability of the hardware, not the CUDA version supported by the driver. Since @bedroge has RTX6000 cards, it returns 12.0. I checked on my system: there, cuda_version.txt indeed contained 9.0 (the compute capability of our H100 cards).
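To illustrate the distinction: the driver's supported CUDA version does appear in the plain nvidia-smi banner, just not in the compute_cap query field. A minimal sketch of extracting it by parsing that banner line (the sample string below stands in for real nvidia-smi output; the actual fix may well query a different field or format instead):

```shell
# Sample banner line as printed by plain `nvidia-smi` (hard-coded here so the
# sketch runs without a GPU); note "CUDA Version" is a driver property,
# independent of the compute capability of the installed cards.
sample_header='| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |'

# Pull out the CUDA version supported by the driver.
driver_cuda_version=$(echo "$sample_header" | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
echo "$driver_cuda_version"   # prints 13.0
```

On @bedroge's host this would yield 13.0, whereas the compute_cap query yields the hardware's 12.0.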

BUG 2

The check in SitePackage.lua is also broken, which is why the above issue went unnoticed for a very long time. It compares the version components as strings. On our system, that means major=9 and major_req=12 if you're installing CUDA 12.X-related software. This passes because, under string comparison, "9" is compared against "1", so "9" is considered larger and the requirement appears to be met. On @bedroge's system it fails because major=12, major_req=12, minor=0 and minor_req=8: the majors compare equal, but the minor comparison fails (since 0 < 8), hence the error.

What should be done here is tonumber(major) < tonumber(major_req), and similarly for the minors.
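The failure mode is easy to reproduce outside Lua. A small shell sketch of lexicographic vs numeric ordering, mirroring what the broken check does and what the tonumber() fix changes:

```shell
# The two major versions from the "9 vs 12" case described above.
major=9
major_req=12

# Lexicographic ordering (what string comparison effectively does):
# "12" sorts before "9" because '1' < '9', so 9 wrongly looks >= 12.
smallest_str=$(printf '%s\n%s\n' "$major" "$major_req" | sort | head -n1)
echo "string ordering puts first: $smallest_str"    # prints 12

# Numeric ordering (the tonumber() equivalent): 9 correctly comes first,
# so the 9.0 "driver version" would fail the >= 12 requirement.
smallest_num=$(printf '%s\n%s\n' "$major" "$major_req" | sort -n | head -n1)
echo "numeric ordering puts first: $smallest_num"   # prints 9
```

The same inversion explains why 9.0 passed a 12.X requirement for years while 12.0 vs 12.8 fails on the minor digit.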

The fix

Before we fix bug 2, we should fix bug 1; otherwise, fixing the comparison will break CUDA runtime support on the many systems that have the wrong version encoded in their cuda_version.txt. In all honesty, to avoid breaking anything, I think we should never use cuda_version.txt again, as we now know it may contain wrong information. Probably we should just push for #189. I was hoping we could use cuda_version.txt as a fallback option there, but it seems we can't rely on it any longer (unless we put a second text file under a different name in place, or something similar). Maybe the fallback should just be: if we can't determine the CUDA driver version by parsing nvidia-smi output, we print a warning that runtime support may be broken, since we cannot check whether the linked driver is new enough.
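The warn-instead-of-fail fallback could be sketched roughly as follows (the function name and messages are illustrative, not the real EESSI code; the stub returns nothing to exercise the fallback path):

```shell
# Hypothetical helper: in real use this would parse nvidia-smi output.
# Here it is stubbed to return an empty string, simulating the case where
# the driver's CUDA version cannot be determined.
get_driver_cuda_version() {
    echo ""
}

driver_cuda_version=$(get_driver_cuda_version)
if [ -z "$driver_cuda_version" ]; then
    # Cannot verify the driver, so warn rather than record a possibly-wrong
    # version that a later Lmod check would trust.
    echo "WARNING: cannot determine the driver CUDA version from nvidia-smi;" \
         "CUDA runtime support may be broken." >&2
else
    echo "$driver_cuda_version" > cuda_version.txt
fi
```

The key design point is that the fallback never writes a value it cannot verify, so a stale or wrong cuda_version.txt can no longer cause false failures like the one above.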
