cuda.bindings: Fix segfault when converting `char*` `NULL` to `bytes` by rwgk · Pull Request #497 · NVIDIA/cuda-python

rwgk · 2025-03-06T21:10:57Z

The segfault was discovered while working on PR #458.

Full test coverage for changed code.

Update cuda/bindings/driver.pyx.in, nvrtc.pyx.in to return None instead of segfaulting when converting char* NULL to bytes.

Based on: NVIDIA@d3df80d#diff-29c7ab322cdb6dfa72e21edffba21d51afc1a5fed8b3974206c6ba7bd4dcfd06
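For readers unfamiliar with the failure mode, here is a minimal sketch of the kind of guard this PR is about: check a `char*` for NULL before turning it into `bytes`, and return `None` otherwise. It uses `ctypes` rather than the Cython in `driver.pyx.in` / `nvrtc.pyx.in`, and the helper name `char_ptr_to_bytes` is hypothetical (it is not part of cuda.bindings).

```python
import ctypes
from typing import Optional


def char_ptr_to_bytes(address: int) -> Optional[bytes]:
    """Hypothetical helper: convert a C `char*`, given as an integer address, to bytes."""
    if not address:
        # NULL pointer: return None instead of dereferencing it (which would crash).
        return None
    # Copy the NUL-terminated string that starts at `address`.
    return ctypes.string_at(address)


# A real, NUL-terminated C buffer to point at.
buf = ctypes.create_string_buffer(b"invalid argument")
print(char_ptr_to_bytes(ctypes.addressof(buf)))  # b'invalid argument'
print(char_ptr_to_bytes(0))                      # None
```

Returning `None` keeps the "no string available" case distinguishable from an empty string (`b""`), which is presumably why the bindings report it that way.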

copy-pr-bot · 2025-03-06T21:11:00Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rwgk · 2025-03-06T21:12:35Z

/ok to test

rwgk · 2025-03-06T22:22:10Z

@leofang The two test failures are these flakes:

>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 503.1598205566406 < ((0.5 * 1000) + 2)

>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 505.0583190917969 < ((0.5 * 1000) + 2)
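To make the flake concrete, the sketch below replays the two observed timings against the same window check. The function name `elapsed_within_tolerance` and the 10 ms value are assumptions for illustration only; the tolerance actually chosen for `test_timing()` is not shown in this thread.

```python
def elapsed_within_tolerance(delay_seconds: float, elapsed_time_ms: float, tolerance_ms: float) -> bool:
    """Same shape as the failing assertion: expected <= elapsed < expected + tolerance."""
    expected_ms = delay_seconds * 1000
    return expected_ms <= elapsed_time_ms < expected_ms + tolerance_ms


# The two elapsed times reported by the failing CI runs above.
observed_ms = [503.1598205566406, 505.0583190917969]

for elapsed in observed_ms:
    strict = elapsed_within_tolerance(0.5, elapsed, tolerance_ms=2)   # tolerance used by the test
    loose = elapsed_within_tolerance(0.5, elapsed, tolerance_ms=10)   # assumed, looser tolerance
    print(f"{elapsed:.2f} ms: 2 ms window -> {strict}, 10 ms window -> {loose}")
```

Both timings overshoot the 500 ms delay by only 3 to 5 ms, so they fall outside the 2 ms window while still being reasonable for a loaded CI machine.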

vzhurba01 · 2025-03-06T22:26:53Z

Changes LGTM at the moment but I see that this is still a draft

rwgk · 2025-03-06T22:33:46Z

> Changes LGTM at the moment

Thanks!

> but I see that this is still a draft

I'm looking into adding more tests, to ideally cover all changes.

And we need to do something about the flakes, even though they are unrelated.

Observed failures:

```
>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 503.1598205566406 < ((0.5 * 1000) + 2)
```

```
>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 505.0583190917969 < ((0.5 * 1000) + 2)
```

rwgk · 2025-03-06T23:02:34Z

@leofang I piggy-backed commit 9dd2630 here to resolve the problem with the flaky tests.

rwgk · 2025-03-06T23:46:44Z

/ok to test

copy-pr-bot · 2025-03-06T23:46:49Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rwgk · 2025-03-07T00:22:28Z

Woah, looks like I'm good at generating segfaults ...

Fatal Python error: Segmentation fault

Current thread 0x0000ffff9f213020 (most recent call first):
  File "/__w/cuda-python/cuda-python/cuda_bindings/tests/test_nvrtc.py", line 35 in test_nvrtcGetLoweredName_failure
...
  File "/opt/hostedtoolcache/Python/3.9.21/arm64/lib/python3.9/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/opt/hostedtoolcache/Python/3.9.21/arm64/bin/pytest", line 8 in <module>
/__w/_temp/6c5a43e8-eaeb-4c41-8067-c06dcfabc91e.sh: line 12:  4545 Segmentation fault      (core dumped) pytest -rxXs -v tests/
tests/test_nvrtc.py::test_nvrtcGetLoweredName_failure

@leofang should I just remove that new test and leave that for another PR? I don't think I'll get to the bottom of the newly discovered segfault today. Tomorrow pretty sure.

leofang · 2025-03-07T00:24:52Z

Just isolate out the cuda.core changes from those for bindings. Let's merge core-related things first; bindings can wait.

rwgk · 2025-03-07T00:25:58Z

/ok to test

rwgk · 2025-03-07T00:27:54Z

The cuda.core change is to resolve the flakes. (Without it I'd (maybe) have to hit rerun a few times to deflake.)

I just backed out the test that triggers the segfaults: commit a39720b

leofang · 2025-03-07T02:55:38Z

> The cuda.core change is to resolve the flakes

Ah, I meant we can separate it out and merge it first. Mainly if we want to auto-backport the fix, the cuda-11 branch does not have any cuda.core code, so the auto-backport would fail. I am curious if it could work without any cuda.core changes involved.

kkraus14

lgtm

rwgk · 2025-03-07T16:25:05Z

/ok to test

leofang · 2025-03-07T17:49:21Z

@rwgk could you update the commit message and avoid @someone? Last time you did that I ended up receiving a ton of notifications whenever someone rebased a branch that contains this commit lol

(I noticed this because of a notification 😂)

rwgk · 2025-03-07T18:06:42Z

> @rwgk could you update the commit message and avoid @someone? Last time you did that I ended up receiving a ton of notifications whenever someone rebased a branch that contains this commit lol

Oh ... sorry, done.

(I did this a lot in the pybind11 repo, wanting to give proper credit. I didn't realize this can lead to spammy notifications.)

leofang · 2025-03-07T18:09:32Z

/ok to test

rwgk · 2025-03-07T18:16:51Z

@leofang The tests passed previously with the exact same code. The only change was to the commit message. I.e. admin merge would be ideal.

leofang · 2025-03-07T18:18:35Z

ok

github-actions · 2025-03-07T18:18:55Z

Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits.

rwgk · 2025-03-07T18:29:38Z

> Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits.

Should I take care of that? (Happy to.)

github-actions · 2025-03-07T18:37:30Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

leofang · 2025-03-07T18:41:14Z

> Should I take care of that? (Happy to.)

Yes, please. My bad in updating your branch, sorry...

rwgk · 2025-03-07T20:34:12Z

I'll backport this together with #499, after that is merged.

* PR #497 squash-merged
* Bring back test_nvrtcGetLoweredName_failure() (it was originally under PR #497).
* Add code for debugging
* test_all_CUresult_codes(): max_code = int(max(cuda.CUresult)) as suggested by at-leofang
* Change pytest options, mostly to disable output capturing (of both stdout and stderr)
* Undo debugging changes in nvrtc.pyx.in
* Revert "Change pytest options, mostly to disable output capturing (of both stdout and stderr)" This reverts commit b0464e7.
* Skip new test if nvrtc version < 12.1

* Backport #497
* Backport #499
* Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)
* Revert "Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)" This reverts commit 41160a8.

rwgk added 2 commits March 6, 2025 11:52

Update cuda/bindings/driver.pyx.in, nvrtc.pyx.in to return None instead of segfaulting when converting char* NULL to bytes.

ef95b46

Add test_all_CUresult_codes

0ff0cf4

Based on: NVIDIA@d3df80d#diff-29c7ab322cdb6dfa72e21edffba21d51afc1a5fed8b3974206c6ba7bd4dcfd06

leofang requested a review from vzhurba01 March 6, 2025 22:02

leofang assigned rwgk Mar 6, 2025

rwgk added 2 commits March 6, 2025 15:11

Add to the new test_all_CUresult_codes() to also exercise cuGetErrorString() CUDA_ERROR_INVALID_VALUE

dffc603

Exercise all other bindings changed in this PR. All of these would have segfaulted before.

dee59ba
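As a rough sketch of the testing idea in these commits (walk every result code up to the largest known one and confirm that unknown codes produce an error plus `None` rather than a crash), the example below uses stand-ins only. `FakeResult` and `fake_get_error_string` are hypothetical substitutes for `cuda.CUresult` and a binding such as `cuGetErrorString()`; the real test and the real return values may differ.

```python
import enum
from typing import Optional, Tuple


class FakeResult(enum.IntEnum):
    # Hypothetical stand-in for cuda.CUresult; note the gap at code 2.
    SUCCESS = 0
    ERROR_INVALID_VALUE = 1
    ERROR_NOT_INITIALIZED = 3


_MESSAGES = {
    FakeResult.SUCCESS: b"no error",
    FakeResult.ERROR_INVALID_VALUE: b"invalid argument",
    FakeResult.ERROR_NOT_INITIALIZED: b"initialization error",
}


def fake_get_error_string(code: int) -> Tuple[FakeResult, Optional[bytes]]:
    # Stand-in for the binding: an unknown code yields an error status and
    # None (the NULL char* case) instead of crashing.
    try:
        known = FakeResult(code)
    except ValueError:
        return FakeResult.ERROR_INVALID_VALUE, None
    return FakeResult.SUCCESS, _MESSAGES[known]


def check_all_codes() -> int:
    max_code = int(max(FakeResult))  # same idiom as int(max(cuda.CUresult))
    num_good = 0
    for code in range(max_code + 2):  # also probe one code past the largest
        err, message = fake_get_error_string(code)
        if err == FakeResult.SUCCESS:
            assert isinstance(message, bytes)
            num_good += 1
        else:
            assert err == FakeResult.ERROR_INVALID_VALUE
            assert message is None
    return num_good


print(check_all_codes())  # 3
```

Deriving the upper bound from the enum itself means the loop keeps covering new codes as they are added, without a hard-coded limit.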

rwgk marked this pull request as ready for review March 6, 2025 23:46

rwgk requested a review from leofang March 6, 2025 23:47

leofang added the labels bug (Something isn't working), P0 (High priority - Must do!), cuda.bindings (Everything related to the cuda.bindings module), and to-be-backported (Trigger the bot to raise a backport PR upon merge) Mar 6, 2025

leofang added this to the cuda-python 12-next, 11-next milestone Mar 6, 2025

Remove one new test because it's generating segfaults:

a39720b

NVIDIA#497 (comment)

Revert "Increase tolerance in test_timing() to avoid flaky tests."

6d45bda

This reverts commit 9dd2630.

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 7, 2025

PR NVIDIA#497 squash-merged

d6b53a3

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 7, 2025

Bring back test_nvrtcGetLoweredName_failure() (it was originally under PR NVIDIA#497).

ff5deec

rwgk mentioned this pull request Mar 7, 2025

cuda.bindings: Add test_nvrtcGetLoweredName_failure #499

Merged

kkraus14 previously approved these changes Mar 7, 2025

rwgk dismissed stale reviews from kkraus14 and leofang via 7814469 March 7, 2025 16:24

test_all_CUresult_codes(): max_code = int(max(cuda.CUresult)) as suggested by at-leofang

ab6db22

rwgk force-pushed the fix_segfault_char_ptr_to_bytes branch from 7814469 to ab6db22 Compare March 7, 2025 18:05

leofang approved these changes Mar 7, 2025

Merge branch 'main' into fix_segfault_char_ptr_to_bytes

a78a8aa

leofang enabled auto-merge March 7, 2025 18:09

vzhurba01 approved these changes Mar 7, 2025

leofang disabled auto-merge March 7, 2025 18:18

leofang merged commit 82df864 into NVIDIA:main Mar 7, 2025
16 checks passed

rwgk deleted the fix_segfault_char_ptr_to_bytes branch March 7, 2025 18:29

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 8, 2025

Backport NVIDIA#497

4c14ca1

rwgk mentioned this pull request Mar 8, 2025

Backport #497 and #499 #502

Merged
