4 changes: 2 additions & 2 deletions cuda_core/README.md
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module

Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.

## Installing

-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.

## Developing

4 changes: 4 additions & 0 deletions cuda_core/docs/nv-versions.json
@@ -3,6 +3,10 @@
  "version": "latest",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
},
{
  "version": "0.7.0",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
},
{
  "version": "0.6.0",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
28 changes: 28 additions & 0 deletions cuda_core/docs/source/api.rst
@@ -129,12 +129,40 @@ Each subclass exposes attributes unique to its operation type.
graph.SwitchNode


Graphics interoperability
-------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   GraphicsResource


Tensor Memory Accelerator (TMA)
-------------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   TensorMapDescriptor

   :template: dataclass.rst

   TensorMapDescriptorOptions


CUDA compilation toolchain
--------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   Program
   Linker
   ObjectCode
5 changes: 0 additions & 5 deletions cuda_core/docs/source/release/0.6.0-notes.rst
@@ -54,11 +54,6 @@ New features
- Added CUDA version compatibility check at import time to detect mismatches between
``cuda.core`` and the installed ``cuda-bindings`` version.

- ``Program.compile()`` now automatically resizes the NVRTC PCH heap and
retries when precompiled header creation fails due to heap exhaustion.
The ``pch_status`` property reports the PCH creation outcome
(``"created"``, ``"not_attempted"``, ``"failed"``, or ``None``).


Fixes and enhancements
----------------------
116 changes: 116 additions & 0 deletions cuda_core/docs/source/release/0.7.0-notes.rst
@@ -0,0 +1,116 @@
.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. currentmodule:: cuda.core

``cuda.core`` 0.7.0 Release Notes
=================================


Highlights
----------

- Introduced support for explicit graph construction. CUDA graphs can now be
built programmatically by adding nodes and edges, and their topology can be
modified after construction.
- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of
GPU memory between CUDA compute kernels and OpenGL renderers.
- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
bulk data movement, with automatic kernel argument integration.
- :class:`~utils.StridedMemoryView` now supports DLPack export, so any
library's array-API ``from_dlpack()`` entry point can consume it.


New features
------------

- Added the :mod:`cuda.core.graph` public module containing
:class:`~graph.GraphDef` for explicit graph construction, typed node
subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
capture) also moves into this module.
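As an illustration of the explicit-construction workflow, here is a minimal guarded sketch. Only ``GraphDef`` and the ``cuda.core.graph`` module path come from these notes; the node- and edge-adding method names used below (``add_empty_node``, ``add_edge``) are hypothetical placeholders, not the documented API.

```python
# Guarded sketch of explicit graph construction; returns None when
# cuda.core 0.7.0 (or a matching API surface) is unavailable.
def build_graph_sketch():
    try:
        from cuda.core.graph import GraphDef  # module path per the notes
        gd = GraphDef()
        a = gd.add_empty_node()  # hypothetical: add a node to the topology
        b = gd.add_empty_node()
        gd.add_edge(a, b)        # hypothetical: order node a before node b
        return gd
    except Exception:
        return None

graph = build_graph_sketch()
```

Since the notes state that topology can be modified after construction, subsequent node and edge edits on the same object would follow this pattern.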

- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.

- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
:meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
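A guarded sketch of the registration flow follows. The classmethod name is from the notes; the mapping call shape, the ``size`` attribute, and the pre-existing OpenGL buffer id are assumptions, and creating a GL context/buffer is outside the sketch's scope.

```python
# Guarded sketch: register an existing OpenGL buffer object for CUDA access.
def map_gl_buffer(gl_buffer_id):
    try:
        from cuda.core import Device, GraphicsResource
        Device().set_current()
        res = GraphicsResource.from_gl_buffer(gl_buffer_id)  # name per the notes
        buf = res.map()   # assumed: returns a Buffer for zero-copy kernel use
        return buf.size
    except Exception:
        return None       # no GPU or GL context in this environment
```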

- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement.
:class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map`
method for convenient descriptor creation, with automatic dtype inference, stride
computation, and first-class kernel argument integration.
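A guarded sketch of descriptor creation: ``as_tensor_map`` comes from the notes, while the ``StridedMemoryView`` constructor call and the tile-shape keyword are assumptions about the API surface.

```python
# Guarded sketch: derive a TMA descriptor from a strided view.
def make_tensor_map(obj):
    try:
        from cuda.core.utils import StridedMemoryView
        view = StridedMemoryView(obj, stream_ptr=-1)  # constructor shape assumed
        # dtype and strides are inferred from the view per the notes;
        # box_shape is a hypothetical tiling parameter.
        return view.as_tensor_map(box_shape=(64, 64))
    except Exception:
        return None  # requires cuda.core 0.7.0 and a Hopper+ GPU
```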

- Added DLPack export support to :class:`~utils.StridedMemoryView` via
``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
path.
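The export side follows the standard DLPack producer protocol, whose effect can be illustrated with NumPy alone: any DLPack-capable consumer ingests a ``StridedMemoryView`` the same way it ingests the NumPy producer below.

```python
import numpy as np

# NumPy arrays also implement __dlpack__/__dlpack_device__, so they can
# stand in for any DLPack producer in this demonstration.
producer = np.arange(6, dtype=np.int32).reshape(2, 3)
consumer = np.from_dlpack(producer)   # zero-copy handoff via __dlpack__

# Both names now view the same underlying buffer:
assert np.shares_memory(producer, consumer)
```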

- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to
:class:`~utils.StridedMemoryView`.

- Added NVRTC precompiled header (PCH) support (CUDA 12.8+).
:class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``,
``pch_dir``, and related options. :attr:`Program.pch_status` reports the
PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC
PCH heap and retries when PCH creation fails due to heap exhaustion.
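A guarded sketch of opting in: the option and property names are from the notes, while the ``Program`` constructor arguments shown here are assumptions.

```python
# Guarded sketch: compile with NVRTC precompiled headers enabled.
def compile_with_pch(source):
    try:
        from cuda.core import Program, ProgramOptions
        opts = ProgramOptions(pch=True, pch_dir="./pch_cache")  # names per the notes
        prog = Program(source, code_type="c++", options=opts)   # signature assumed
        prog.compile("ptx")
        return prog.pch_status  # "created", "not_attempted", "failed", or None
    except Exception:
        return None             # needs cuda.core 0.7.0 and CUDA 12.8+
```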

- Added NUMA-aware managed memory pool placement.
:class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type``
option (``"device"``, ``"host"``, or ``"host_numa"``), and
:attr:`ManagedMemoryResource.preferred_location` queries the resolved
location. The existing ``preferred_location`` parameter remains supported
for backwards compatibility.
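A guarded sketch of host-NUMA placement: the option value and query property are from the notes, and the constructor shape is an assumption.

```python
# Guarded sketch: place a managed memory pool on the host NUMA side.
def managed_pool_location():
    try:
        from cuda.core import Device, ManagedMemoryResource, ManagedMemoryResourceOptions
        Device().set_current()
        opts = ManagedMemoryResourceOptions(preferred_location_type="host_numa")
        mr = ManagedMemoryResource(opts)  # constructor shape assumed
        return mr.preferred_location      # resolved placement, per the notes
    except Exception:
        return None  # no GPU, or no default managed pool (e.g. WSL2)
```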

- Added NUMA-aware pinned memory pool placement.
:class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and
:attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for
pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the
NUMA node is automatically derived from the current CUDA device.
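A matching guarded sketch for pinned pools; again, the ``numa_id`` option and property are from the notes and the constructor shape is assumed.

```python
# Guarded sketch: pin a host memory pool to an explicit NUMA node.
def pinned_pool_numa(node=0):
    try:
        from cuda.core import Device, PinnedMemoryResource, PinnedMemoryResourceOptions
        Device().set_current()
        opts = PinnedMemoryResourceOptions(numa_id=node)  # option per the notes
        mr = PinnedMemoryResource(opts)                   # constructor shape assumed
        return mr.numa_id  # host NUMA node backing the pool
    except Exception:
        return None        # requires cuda.core 0.7.0 and a CUDA device
```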

- Added support for CUDA 13.2.


New examples
------------

- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
interoperability via :class:`GraphicsResource`.
- ``tma_tensor_map.py``: TMA bulk data movement using
:class:`TensorMapDescriptor` on Hopper+ GPUs.


Fixes and enhancements
----------------------

- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
device mapping. They are now correctly reported as ``kDLCUDAManaged``.
(`#1863 <https://github.com/NVIDIA/cuda-python/pull/1863>`__)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
instead of the NUMA node closest to the active CUDA device. On multi-NUMA
systems where the device is attached to a non-zero host NUMA node, this could
cause pool creation or allocation failures. (`#1603 <https://github.com/NVIDIA/cuda-python/issues/1603>`__)
- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
a non-owned (default) memory pool. The property now always queries the CUDA driver for
non-owned pools, so multiple wrappers around the same pool see consistent state. (`#1720 <https://github.com/NVIDIA/cuda-python/issues/1720>`__)
- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
supported" case is now caught. (`#1631 <https://github.com/NVIDIA/cuda-python/issues/1631>`__)
- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
layouts fail immediately instead of on first metadata access. (`#1429 <https://github.com/NVIDIA/cuda-python/issues/1429>`__)
- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
- Improved error message when :class:`ManagedMemoryResource` is called without options on platforms
that lack a default managed memory pool (e.g. WSL2). (`#1617 <https://github.com/NVIDIA/cuda-python/issues/1617>`__)
- Handle properties on core API objects now return ``None`` during Python shutdown instead of
crashing.
- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
linking operations to the C level and releasing the GIL during backend calls. This benefits
workloads that create many programs or linkers, and enables concurrent compilation in
multithreaded applications.
- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
(bindings 12.9.6+ or 13.2.0+), with frozen tables as a fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
missing optional modules are treated as unavailable; unrelated import failures now surface
normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
76 changes: 0 additions & 76 deletions cuda_core/docs/source/release/0.7.x-notes.rst

This file was deleted.

2 changes: 1 addition & 1 deletion cuda_core/pixi.toml
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
# TODO: check if these can be extracted from pyproject.toml
[package]
name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"

[package.build]
backend = { name = "pixi-build-python", version = "*" }
2 changes: 1 addition & 1 deletion cuda_core/pyproject.toml
@@ -19,7 +19,7 @@ dynamic = [
"readme",
]
requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
authors = [
{ name = "NVIDIA Corporation" }
]