Replace nvgpu with nvidia-ml-py by matthewfeickert · Pull Request #2160 · PlasmaControl/DESC

matthewfeickert · 2026-04-12T06:59:05Z

Resolves #2159

nvgpu is no longer maintained and depends on pynvml, which tells the user at import time to use nvidia-ml-py instead:

"The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you."

This PR therefore makes the following changes:

* Drop nvgpu and replace its nvgpu.gpu_info() call with a single function using nvidia-ml-py (which uses the pynvml namespace).
* Place a lower bound on nvidia-ml-py of 12.535.77, the first release to support nvmlMemory_v2, which properly accounts for system-reserved memory.
* Remove all mentions of nvgpu in other areas of the codebase and replace them with nvidia-ml-py, except in publications/, as this is historical information.
  - Do not add nvidia-ml-py to dependabot.yml, as pinning this tightly will cause installation issues, especially with NVIDIA libraries.
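As a rough illustration of the lower bound described above, the sketch below checks an installed nvidia-ml-py version string against 12.535.77. The helper names (parse_release, meets_lower_bound) are hypothetical and are not part of DESC or of this PR; in practice the bound would normally be enforced by the installer through the dependency specification rather than at runtime.

```python
from importlib.metadata import PackageNotFoundError, version

# Lower bound stated in the PR description.
MINIMUM = (12, 535, 77)


def parse_release(text):
    """Turn a dotted version string into a tuple of leading integer parts."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def meets_lower_bound(installed, minimum=MINIMUM):
    """Return True if the installed version is at least the minimum."""
    return parse_release(installed) >= minimum


try:
    installed_version = version("nvidia-ml-py")
except PackageNotFoundError:
    installed_version = None  # the distribution is not installed here

if installed_version is None:
    print("nvidia-ml-py is not installed")
else:
    print(installed_version, meets_lower_bound(installed_version))
```

The check compares integer tuples, so a release such as 11.525.150 is rejected while 12.535.77 and anything newer is accepted.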

Example:

$ nvidia-smi --version
NVIDIA-SMI version  : 590.48.01
NVML version        : 590.48
DRIVER version      : 590.48.01
CUDA Version        : 13.1

On main (2471d55)

$ uv venv main
$ . main/bin/activate
$ uv pip install .
$ python
Python 3.13.7 (main, Sep 18 2025, 19:47:49) [Clang 20.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import desc
>>> desc.set_device("gpu")
/tmp/DESC/main/lib/python3.13/site-packages/nvgpu/__init__.py:8: SyntaxWarning: invalid escape sequence '\('
  gpu_infos = [re.match('GPU ([0-9]+): (.+?) \(UUID: ([^)]+)\)', gpu) for gpu in gpus]
>>> import os
>>> os.environ["CUDA_VISIBLE_DEVICES"]
'0'
>>>

$ python -c 'import nvgpu; print(nvgpu.gpu_info())'
[{'index': '0', 'type': 'NVIDIA GeForce RTX 4060 Laptop GPU', 'uuid': 'GPU-7fef9454-d8d1-86cf-c4b3-e2fd5e35e862', 'mem_used': 8, 'mem_total': 8188, 'mem_used_percent': 0.09770395701025891}]
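The SyntaxWarning in the session above is raised because the pattern in nvgpu/__init__.py is a non-raw string literal containing \(. As a minimal sketch (not a patch to nvgpu), the same pattern written as a raw string parses a device line without the warning. The sample line is reconstructed from the nvgpu.gpu_info() output above and is assumed to match the format that nvgpu parses.

```python
import re

# Same pattern as in the warning above, written as a raw string so that
# "\(" reaches the regex engine unchanged and Python emits no SyntaxWarning.
GPU_LINE = re.compile(r"GPU ([0-9]+): (.+?) \(UUID: ([^)]+)\)")

# Assumed sample line, rebuilt from the gpu_info() output shown above.
sample = (
    "GPU 0: NVIDIA GeForce RTX 4060 Laptop GPU "
    "(UUID: GPU-7fef9454-d8d1-86cf-c4b3-e2fd5e35e862)"
)

match = GPU_LINE.match(sample)
index, name, uuid = match.groups()
print(index, name, uuid)
```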

This PR

$ uv venv
$ . .venv/bin/activate
$ uv pip install .
$ python
Python 3.13.7 (main, Sep 18 2025, 19:47:49) [Clang 20.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import desc
>>> desc.set_device("gpu")
>>> import os
>>> os.environ["CUDA_VISIBLE_DEVICES"]
'0'
>>>

Because I made it a guarded import, I can't import the function directly from desc, but the following is the same code:

# _implementation.py
from pynvml import (
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetName,
    nvmlDeviceGetUUID,
    nvmlInit,
    nvmlMemory_v2,
    nvmlShutdown,
)


def _gpu_info():
    """Equivalent to nvgpu.gpu_info() using nvidia-ml-py."""
    # NVML reports memory in bytes; nvgpu reported it in MiB.
    _bytes_to_mib = 1024 * 1024
    nvmlInit()
    try:
        info = []
        for device_idx in range(nvmlDeviceGetCount()):
            handle = nvmlDeviceGetHandleByIndex(device_idx)
            # Request the v2 memory struct, which accounts for
            # system-reserved memory.
            mem = nvmlDeviceGetMemoryInfo(handle, version=nvmlMemory_v2)
            mem_used = mem.used // _bytes_to_mib
            mem_total = mem.total // _bytes_to_mib
            info.append(
                {
                    "index": str(device_idx),
                    "type": nvmlDeviceGetName(handle),
                    "uuid": nvmlDeviceGetUUID(handle),
                    "mem_used": mem_used,
                    "mem_total": mem_total,
                    "mem_used_percent": 100.0 * mem_used / mem_total,
                }
            )
        return info
    finally:
        nvmlShutdown()


if __name__ == "__main__":
    print(_gpu_info())

Running it gives:

$ python ./_implementation.py
[{'index': '0', 'type': 'NVIDIA GeForce RTX 4060 Laptop GPU', 'uuid': 'GPU-7fef9454-d8d1-86cf-c4b3-e2fd5e35e862', 'mem_used': 7, 'mem_total': 8188, 'mem_used_percent': 0.08549096238397655}]

So the reported memory usage is effectively the same (good), and an unmaintained dependency is replaced with a maintained one.
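For reference, the two mem_used_percent values printed above are consistent with the formula used in _gpu_info(), 100.0 * mem_used / mem_total, applied to the whole-MiB values. The short sketch below reproduces that arithmetic; the function name is illustrative only.

```python
def mem_used_percent(mem_used_mib, mem_total_mib):
    """Percentage of device memory in use, from whole-MiB values."""
    return 100.0 * mem_used_mib / mem_total_mib


# Values taken from the two outputs above: 8 MiB used on main,
# 7 MiB used with this PR, 8188 MiB total in both cases.
on_main = mem_used_percent(8, 8188)
this_pr = mem_used_percent(7, 8188)

print(on_main, this_pr)
```

The 1 MiB difference in mem_used accounts for the whole gap between the two percentages.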

github-actions · 2026-04-12T07:24:06Z

Memory benchmark result

|               Test Name                |      %Δ      |    Master (MB)     |      PR (MB)       |    Δ (MB)    |    Time PR (s)     |  Time Master (s)   |
| -------------------------------------- | ------------ | ------------------ | ------------------ | ------------ | ------------------ | ------------------ |
| test_objective_jac_w7x                 |    3.39 %    |     4.139e+03      |     4.279e+03      |    140.50    |       41.78        |       38.96        |
| test_proximal_jac_w7x_with_eq_update   |    0.12 %    |     6.615e+03      |     6.623e+03      |     7.99     |       165.34       |       161.01       |
| test_proximal_freeb_jac                |    0.33 %    |     1.337e+04      |     1.341e+04      |    44.06     |       87.76        |       88.57        |
| test_proximal_freeb_jac_blocked        |   -0.19 %    |     7.792e+03      |     7.777e+03      |    -14.85    |       78.93        |       80.35        |
| test_proximal_freeb_jac_batched        |    0.89 %    |     7.675e+03      |     7.743e+03      |    68.34     |       77.82        |       78.32        |
| test_proximal_jac_ripple               |    0.41 %    |     3.602e+03      |     3.616e+03      |    14.74     |       65.10        |       62.70        |
| test_proximal_jac_ripple_bounce1d      |    2.63 %    |     3.796e+03      |     3.896e+03      |    100.02    |       77.56        |       75.24        |
| test_eq_solve                          |   -0.03 %    |     2.202e+03      |     2.201e+03      |    -0.69     |       98.78        |       100.58       |

For the memory plots, go to the summary of Memory Benchmarks workflow and download the artifact.
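The %Δ column in the table above appears to be the memory difference relative to master, that is 100 * Δ / Master. This is an assumption inferred from the numbers and not taken from the benchmark workflow itself; the sketch below recomputes the column from the rounded values shown in the table.

```python
# (test name, master MB, delta MB, reported percent) copied from the table.
rows = [
    ("test_objective_jac_w7x", 4.139e03, 140.50, 3.39),
    ("test_proximal_jac_w7x_with_eq_update", 6.615e03, 7.99, 0.12),
    ("test_proximal_freeb_jac", 1.337e04, 44.06, 0.33),
    ("test_proximal_freeb_jac_blocked", 7.792e03, -14.85, -0.19),
    ("test_proximal_freeb_jac_batched", 7.675e03, 68.34, 0.89),
    ("test_proximal_jac_ripple", 3.602e03, 14.74, 0.41),
    ("test_proximal_jac_ripple_bounce1d", 3.796e03, 100.02, 2.63),
    ("test_eq_solve", 2.202e03, -0.69, -0.03),
]


def percent_delta(master_mb, delta_mb):
    """Relative change with respect to master, in percent."""
    return 100.0 * delta_mb / master_mb


# Recompute each percentage and round to the two decimals used in the table.
recomputed = [round(percent_delta(master, delta), 2) for _, master, delta, _ in rows]
reported = [pct for _, _, _, pct in rows]

for (name, _, _, pct), value in zip(rows, recomputed):
    print(f"{name}: reported {pct}, recomputed {value}")
```

Because the Master column is rounded to four significant figures, agreement to two decimals is the most that can be expected from this check.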

matthewfeickert · 2026-04-12T07:44:17Z

This is ready for review, but needs a maintainer to approve the CI runs. Let me know if you have any questions. 👍

codecov · 2026-04-12T18:54:04Z

Codecov Report

❌ Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.42%. Comparing base (03637dd) to head (89fd614).

Files with missing lines	Patch %	Lines
desc/__init__.py	0.00%	15 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2160      +/-   ##
==========================================
- Coverage   94.45%   94.42%   -0.04%     
==========================================
  Files         101      101              
  Lines       28604    28617      +13     
==========================================
+ Hits        27018    27021       +3     
- Misses       1586     1596      +10

Files with missing lines	Coverage Δ
desc/__init__.py	`36.48% <0.00%> (-7.78%)`	⬇️

... and 2 files with indirect coverage changes

matthewfeickert · 2026-04-12T23:12:47Z

Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.

Given that the changes in this PR should be covered by

DESC/tests/benchmarks/benchmark_gpu_small.py

Line 11 in 2471d55

desc.set_device("gpu")

DESC/tests/benchmarks/benchmark_gpu_large.py

Line 12 in 2471d55

desc.set_device("gpu")

DESC/tests/benchmarks/memory_funcs.py

Line 21 in 2471d55

set_device("gpu")

I assume the reported lack of coverage is because the benchmarks haven't finished running or need additional approval to run.

dpanici · 2026-04-12T23:27:59Z

Benchmarks don't increase coverage, and our CI cannot run GPU things, so this will just have 0 coverage which is fine with the devs
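For readers wondering how logic like this could be exercised without a GPU, one possible approach (a sketch only, not something this PR or the DESC test suite does) is to pass the NVML bindings in as a parameter and substitute a stand-in object. Everything named fake_nvml below is hypothetical, including the placeholder value for nvmlMemory_v2, and the gpu_info function is a restatement of the PR's _gpu_info with the module injected.

```python
from types import SimpleNamespace


def gpu_info(nvml):
    """Same logic as the PR's _gpu_info, with the NVML bindings passed in."""
    bytes_per_mib = 1024 * 1024
    nvml.nvmlInit()
    try:
        info = []
        for idx in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(idx)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle, version=nvml.nvmlMemory_v2)
            used = mem.used // bytes_per_mib
            total = mem.total // bytes_per_mib
            info.append(
                {
                    "index": str(idx),
                    "type": nvml.nvmlDeviceGetName(handle),
                    "uuid": nvml.nvmlDeviceGetUUID(handle),
                    "mem_used": used,
                    "mem_total": total,
                    "mem_used_percent": 100.0 * used / total,
                }
            )
        return info
    finally:
        nvml.nvmlShutdown()


# Hypothetical stand-in for the pynvml module: one device with 7 MiB used
# out of 8188 MiB. The nvmlMemory_v2 value is a placeholder, not the real one.
fake_nvml = SimpleNamespace(
    nvmlMemory_v2=2,
    nvmlInit=lambda: None,
    nvmlShutdown=lambda: None,
    nvmlDeviceGetCount=lambda: 1,
    nvmlDeviceGetHandleByIndex=lambda idx: idx,
    nvmlDeviceGetMemoryInfo=lambda handle, version=None: SimpleNamespace(
        used=7 * 1024 * 1024, total=8188 * 1024 * 1024
    ),
    nvmlDeviceGetName=lambda handle: "Fake GPU",
    nvmlDeviceGetUUID=lambda handle: "GPU-fake",
)

result = gpu_info(fake_nvml)
print(result)
```

A stand-in like this only checks the bookkeeping around the NVML calls; it says nothing about behavior on real hardware.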

matthewfeickert · 2026-04-12T23:29:38Z

Benchmarks don't increase coverage, and our CI cannot run GPU things, so this will just have 0 coverage which is fine with the devs

Great. Thanks for the very fast follow up!

YigitElma

Thanks! It mostly looks good, but I think the pynvml issue has to be documented.

@dpanici @ddudt can you also test this on a cluster? I tested on my laptop GPU and on Della, and it works. However, on the Della login node you cannot run from desc import set_device; set_device("gpu") in Python; it says the drivers are not there. I am not sure if we could run this on the login node before, though.
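To illustrate the login-node situation described above, here is a minimal sketch of how a caller could tolerate both a missing nvidia-ml-py installation and a missing NVIDIA driver. It is not the guarded import used in desc/__init__.py, whose exact form is not shown in this thread; the function name gpu_names_or_empty is hypothetical. It relies on pynvml.NVMLError, the base exception that the NVML bindings raise when initialization fails.

```python
try:
    import pynvml  # module provided by the nvidia-ml-py distribution
except ImportError:
    pynvml = None  # bindings not installed


def gpu_names_or_empty():
    """Return the names of visible GPUs, or [] if NVML cannot be used."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # Raised, for example, when the driver library is not present,
        # as can be the case on a cluster login node.
        return []
    try:
        names = []
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            names.append(pynvml.nvmlDeviceGetName(handle))
        return names
    finally:
        pynvml.nvmlShutdown()


names = gpu_names_or_empty()
print(names)
```

Whether returning an empty list or raising a clearer error is the better behavior for set_device("gpu") is a design choice left to the maintainers.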

YigitElma

It looks good to me. I have tested on my personal laptop and Della.

We should wait for one more developer to test it.

matthewfeickert marked this pull request as ready for review April 12, 2026 07:15

matthewfeickert force-pushed the feat/drop-nvgpu-for-nvidia-ml-py branch from 3671f39 to f63fb54 Compare April 12, 2026 07:21

matthewfeickert force-pushed the feat/drop-nvgpu-for-nvidia-ml-py branch from f63fb54 to 11c073b Compare April 12, 2026 07:40

matthewfeickert force-pushed the feat/drop-nvgpu-for-nvidia-ml-py branch 3 times, most recently from aba1019 to 37170df Compare April 12, 2026 08:08

matthewfeickert mentioned this pull request Apr 12, 2026

Add orthax and desc-opt conda-forge/staged-recipes#32937

Draft

9 tasks

ddudt requested review from a team, YigitElma, ddudt, dpanici, f0uriest, rahulgaur104 and unalmis and removed request for a team April 13, 2026 16:21

YigitElma requested changes Apr 13, 2026

Comment thread desc/__init__.py

Comment thread CHANGELOG.md Outdated

matthewfeickert added 2 commits April 13, 2026 15:54

Add dependency change to CHANGELOG

89fd614

* Note that nvgpu has been replaced with nvidia-ml-py.

matthewfeickert force-pushed the feat/drop-nvgpu-for-nvidia-ml-py branch from 37170df to 89fd614 Compare April 13, 2026 22:00

matthewfeickert requested a review from YigitElma April 13, 2026 22:01

YigitElma approved these changes Apr 13, 2026

YigitElma added the override codecov Override codecov label Apr 14, 2026

matthewfeickert wants to merge 2 commits into PlasmaControl:master from matthewfeickert:feat/drop-nvgpu-for-nvidia-ml-py
