4 changes: 2 additions & 2 deletions cuda_core/README.md
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module

Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.

## Installing

-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.

## Developing

4 changes: 4 additions & 0 deletions cuda_core/docs/nv-versions.json
@@ -3,6 +3,10 @@
  "version": "latest",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
},
{
  "version": "0.7.0",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
},
{
  "version": "0.6.0",
  "url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
28 changes: 28 additions & 0 deletions cuda_core/docs/source/api.rst
@@ -129,12 +129,40 @@ Each subclass exposes attributes unique to its operation type.
graph.SwitchNode


Graphics interoperability
-------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   GraphicsResource


Tensor Memory Accelerator (TMA)
-------------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   TensorMapDescriptor

   :template: dataclass.rst

   TensorMapDescriptorOptions


CUDA compilation toolchain
--------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   Program
   Linker
   ObjectCode
5 changes: 0 additions & 5 deletions cuda_core/docs/source/release/0.6.0-notes.rst
@@ -54,11 +54,6 @@ New features
- Added CUDA version compatibility check at import time to detect mismatches between
``cuda.core`` and the installed ``cuda-bindings`` version.

- ``Program.compile()`` now automatically resizes the NVRTC PCH heap and
retries when precompiled header creation fails due to heap exhaustion.
The ``pch_status`` property reports the PCH creation outcome
(``"created"``, ``"not_attempted"``, ``"failed"``, or ``None``).


Fixes and enhancements
----------------------
116 changes: 116 additions & 0 deletions cuda_core/docs/source/release/0.7.0-notes.rst
@@ -0,0 +1,116 @@
.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. currentmodule:: cuda.core

``cuda.core`` 0.7.0 Release Notes
=================================


Highlights
----------

- Introduced support for explicit graph construction. CUDA graphs can now be
built programmatically by adding nodes and edges, and their topology can be
modified after construction.
- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of
GPU memory between CUDA compute kernels and OpenGL renderers.
- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
bulk data movement, with automatic kernel argument integration.
- :class:`~utils.StridedMemoryView` now supports DLPack export, so any
library's array-API ``from_dlpack()`` entry point can consume it.


New features
------------

- Added the :mod:`cuda.core.graph` public module containing
:class:`~graph.GraphDef` for explicit graph construction, typed node
subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
capture) also moves into this module.
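As an illustration of the explicit-construction workflow, here is a minimal guarded sketch. Only ``GraphDef`` and the ``cuda.core.graph`` module path come from these notes; the node- and edge-adding method names used below (``add_empty_node``, ``add_edge``) are hypothetical placeholders, not the documented API.

```python
# Guarded sketch of explicit graph construction; returns None when
# cuda.core 0.7.0 (or a matching API surface) is unavailable.
def build_graph_sketch():
    try:
        from cuda.core.graph import GraphDef  # module path per the notes
        gd = GraphDef()
        a = gd.add_empty_node()  # hypothetical: add a node to the topology
        b = gd.add_empty_node()
        gd.add_edge(a, b)        # hypothetical: order node a before node b
        return gd
    except Exception:
        return None

graph = build_graph_sketch()
```

Since the notes state that topology can be modified after construction, subsequent node and edge edits on the same object would follow this pattern.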

- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.

- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
:meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
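A guarded sketch of the registration flow follows. The classmethod name is from the notes; the mapping call shape, the ``size`` attribute, and the pre-existing OpenGL buffer id are assumptions, and creating a GL context/buffer is outside the sketch's scope.

```python
# Guarded sketch: register an existing OpenGL buffer object for CUDA access.
def map_gl_buffer(gl_buffer_id):
    try:
        from cuda.core import Device, GraphicsResource
        Device().set_current()
        res = GraphicsResource.from_gl_buffer(gl_buffer_id)  # name per the notes
        buf = res.map()   # assumed: returns a Buffer for zero-copy kernel use
        return buf.size
    except Exception:
        return None       # no GPU or GL context in this environment
```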

- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement.
:class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map`
method for convenient descriptor creation, with automatic dtype inference, stride
computation, and first-class kernel argument integration.
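A guarded sketch of descriptor creation: ``as_tensor_map`` comes from the notes, while the ``StridedMemoryView`` constructor call and the tile-shape keyword are assumptions about the API surface.

```python
# Guarded sketch: derive a TMA descriptor from a strided view.
def make_tensor_map(obj):
    try:
        from cuda.core.utils import StridedMemoryView
        view = StridedMemoryView(obj, stream_ptr=-1)  # constructor shape assumed
        # dtype and strides are inferred from the view per the notes;
        # box_shape is a hypothetical tiling parameter.
        return view.as_tensor_map(box_shape=(64, 64))
    except Exception:
        return None  # requires cuda.core 0.7.0 and a Hopper+ GPU
```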

- Added DLPack export support to :class:`~utils.StridedMemoryView` via
``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
path.
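The export side follows the standard DLPack producer protocol, whose effect can be illustrated with NumPy alone: any DLPack-capable consumer ingests a ``StridedMemoryView`` the same way it ingests the NumPy producer below.

```python
import numpy as np

# NumPy arrays also implement __dlpack__/__dlpack_device__, so they can
# stand in for any DLPack producer in this demonstration.
producer = np.arange(6, dtype=np.int32).reshape(2, 3)
consumer = np.from_dlpack(producer)   # zero-copy handoff via __dlpack__

# Both names now view the same underlying buffer:
assert np.shares_memory(producer, consumer)
```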

- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to
:class:`~utils.StridedMemoryView`.

- Added NVRTC precompiled header (PCH) support (CUDA 12.8+).
:class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``,
``pch_dir``, and related options. :attr:`Program.pch_status` reports the
PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC
PCH heap and retries when PCH creation fails due to heap exhaustion.
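A guarded sketch of opting in: the option and property names are from the notes, while the ``Program`` constructor arguments shown here are assumptions.

```python
# Guarded sketch: compile with NVRTC precompiled headers enabled.
def compile_with_pch(source):
    try:
        from cuda.core import Program, ProgramOptions
        opts = ProgramOptions(pch=True, pch_dir="./pch_cache")  # names per the notes
        prog = Program(source, code_type="c++", options=opts)   # signature assumed
        prog.compile("ptx")
        return prog.pch_status  # "created", "not_attempted", "failed", or None
    except Exception:
        return None             # needs cuda.core 0.7.0 and CUDA 12.8+
```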

- Added NUMA-aware managed memory pool placement.
:class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type``
option (``"device"``, ``"host"``, or ``"host_numa"``), and
:attr:`ManagedMemoryResource.preferred_location` queries the resolved
location. The existing ``preferred_location`` parameter remains supported
for backwards compatibility.
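A guarded sketch of host-NUMA placement: the option value and query property are from the notes, and the constructor shape is an assumption.

```python
# Guarded sketch: place a managed memory pool on the host NUMA side.
def managed_pool_location():
    try:
        from cuda.core import Device, ManagedMemoryResource, ManagedMemoryResourceOptions
        Device().set_current()
        opts = ManagedMemoryResourceOptions(preferred_location_type="host_numa")
        mr = ManagedMemoryResource(opts)  # constructor shape assumed
        return mr.preferred_location      # resolved placement, per the notes
    except Exception:
        return None  # no GPU, or no default managed pool (e.g. WSL2)
```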

- Added NUMA-aware pinned memory pool placement.
:class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and
:attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for
pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the
NUMA node is automatically derived from the current CUDA device.
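A matching guarded sketch for pinned pools; again, the ``numa_id`` option and property are from the notes and the constructor shape is assumed.

```python
# Guarded sketch: pin a host memory pool to an explicit NUMA node.
def pinned_pool_numa(node=0):
    try:
        from cuda.core import Device, PinnedMemoryResource, PinnedMemoryResourceOptions
        Device().set_current()
        opts = PinnedMemoryResourceOptions(numa_id=node)  # option per the notes
        mr = PinnedMemoryResource(opts)                   # constructor shape assumed
        return mr.numa_id  # host NUMA node backing the pool
    except Exception:
        return None        # requires cuda.core 0.7.0 and a CUDA device
```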

- Added support for CUDA 13.2.


New examples
------------

- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
interoperability via :class:`GraphicsResource`.
- ``tma_tensor_map.py``: TMA bulk data movement using
:class:`TensorMapDescriptor` on Hopper+ GPUs.


Fixes and enhancements
----------------------

- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
device mapping. They are now correctly reported as ``kDLCUDAManaged``.
(`#1863 <https://github.com/NVIDIA/cuda-python/pull/1863>`__)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
instead of the NUMA node closest to the active CUDA device. On multi-NUMA
systems where the device is attached to a non-zero host NUMA node, this could
cause pool creation or allocation failures. (`#1603 <https://github.com/NVIDIA/cuda-python/issues/1603>`__)
- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
a non-owned (default) memory pool. The property now always queries the CUDA driver for
non-owned pools, so multiple wrappers around the same pool see consistent state. (`#1720 <https://github.com/NVIDIA/cuda-python/issues/1720>`__)
- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
supported" case is now caught. (`#1631 <https://github.com/NVIDIA/cuda-python/issues/1631>`__)
- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
layouts fail immediately instead of on first metadata access. (`#1429 <https://github.com/NVIDIA/cuda-python/issues/1429>`__)
- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
- Improved error message when :class:`ManagedMemoryResource` is called without options on platforms
that lack a default managed memory pool (e.g. WSL2). (`#1617 <https://github.com/NVIDIA/cuda-python/issues/1617>`__)
- Handle properties on core API objects now return ``None`` during Python shutdown instead of
crashing.
- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
linking operations to the C level and releasing the GIL during backend calls. This benefits
workloads that create many programs or linkers, and enables concurrent compilation in
multithreaded applications.
- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
(bindings 12.9.6+ or 13.2.0+), with frozen tables as a fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
missing optional modules are treated as unavailable; unrelated import failures now surface
normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
76 changes: 0 additions & 76 deletions cuda_core/docs/source/release/0.7.x-notes.rst

This file was deleted.

2 changes: 1 addition & 1 deletion cuda_core/pixi.toml
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
# TODO: check if these can be extracted from pyproject.toml
[package]
name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"

[package.build]
backend = { name = "pixi-build-python", version = "*" }
2 changes: 1 addition & 1 deletion cuda_core/pyproject.toml
@@ -19,7 +19,7 @@ dynamic = [
"readme",
]
requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
authors = [
{ name = "NVIDIA Corporation" }
]