This directory contains practical examples demonstrating how to use the ZigCUDA library.
- CUDA-capable GPU with driver installed
- Zig compiler (0.15.2 or later)
- CUDA Toolkit (for PTX compilation, optional)
Basic device enumeration and querying
Demonstrates:
- Initializing CUDA context
- Enumerating available devices
- Querying device properties (name, compute capability, memory, SM count)
zig build-exe examples/01_device_info.zig --dep zigcuda --mod zigcuda::src/lib.zig
./01_device_infoHost ↔ Device memory operations
Demonstrates:
- Allocating device memory
- Copying data from host to device
- Copying data from device to host
- Memory cleanup and verification
zig build-exe examples/02_memory_transfer.zig --dep zigcuda --mod zigcuda::src/lib.zig
./02_memory_transferLoading and launching CUDA kernels
Demonstrates:
- Loading PTX modules
- Extracting kernel functions
- Setting up kernel parameters
- Launching kernels with grid/block configuration
- Result verification
zig build-exe examples/03_kernel_launch.zig --dep zigcuda --mod zigcuda::src/lib.zig
./03_kernel_launchAsynchronous operations with CUDA streams
Demonstrates:
- Creating multiple CUDA streams
- Launching concurrent operations
- Stream synchronization
- Querying stream status
zig build-exe examples/04_streams.zig --dep zigcuda --mod zigcuda::src/lib.zig
./04_streamsYou can add these examples to your build.zig to make them easy to build:
// In build.zig, add:
const examples = [_][]const u8{
"01_device_info",
"02_memory_transfer",
"03_kernel_launch",
"04_streams",
};
for (examples) |example_name| {
const exe = b.addExecutable(.{
.name = example_name,
.root_source_file = b.path(b.fmt("examples/{s}.zig", .{example_name})),
.target = target,
.optimize = optimize,
});
exe.root_module.addImport("zigcuda", lib_module);
b.installArtifact(exe);
}Then build with:
zig buildThe kernels/ subdirectory contains PTX (Parallel Thread Execution) code for CUDA kernels:
- vector_add.ptx: Vector addition kernel used by example 03
For compute capability 12.0 (Blackwell):
nvcc --cubin -arch=compute_120 -code=sm_120 your_kernel.cu -o your_kernel.cubinFound 1 CUDA device(s)
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Compute Capability: 12.0
Total Memory: 95.59 GB
Multiprocessors: 188
=== CUDA Memory Transfer Example ===
Allocated 4096 bytes on host
Input data: [0.0, 1.0, 2.0, ..., 1023.0]
Allocated 4096 bytes on device
Copying data from host to device...
✓ Host-to-device transfer complete
Copying data from device to host...
✓ Device-to-host transfer complete
✓ SUCCESS: All 1024 elements transferred correctly
Output data: [0.0, 1.0, 2.0, ..., 1023.0]
=== CUDA Kernel Launch Example ===
Input A: [0.0, 1.0, 2.0, ..., 1023.0]
Input B: [0.0, 2.0, 4.0, ..., 2046.0]
✓ PTX module loaded
✓ Kernel function extracted
Launching kernel with 4 blocks x 256 threads
✓ Kernel launched
✓ SUCCESS: Vector addition completed correctly
Result C: [0.0, 3.0, 6.0, ..., 3069.0]
Verified: C[i] = A[i] + B[i] for all 1024 elements
=== CUDA Streams Example ===
Creating 3 CUDA streams...
✓ 3 streams created
Memory allocated for 3 streams
Each stream handles 512 elements (2048 bytes)
Launching async memory copies on all streams...
Synchronizing all streams...
✓ All streams synchronized
Time elapsed: 2 ms
Querying stream status...
Stream 0: ✓ Complete
Stream 1: ✓ Complete
Stream 2: ✓ Complete
✓ SUCCESS: Streams example completed
Total data transferred: 6 KB across 3 streams
Ensure your NVIDIA driver is properly installed:
nvidia-smiCheck that your GPU is recognized:
lspci | grep -i nvidiaEnsure the PTX file exists and the path is correct. PTX files are embedded at compile time using @embedFile().
- Modify the examples to experiment with different configurations
- Create your own CUDA kernels
- Explore the cuBLAS integration for matrix operations
- Check the test suite in
test/for more advanced usage patterns