[gpu-operator] Update to v26.3.0 and add vGPU variant#2323
Conversation
Update vendored upstream chart to v26.3.0.

Key changes:
- CDI enabled by default
- Driver version 580.126.20
- NFD subchart updated to 0.18.3
- Component versions bumped (toolkit v1.19.0, device-plugin v0.19.0, dcgm-exporter 4.5.2)
- Pin chart version in Makefile for reproducibility

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
Add vGPU mode as a new variant alongside the existing passthrough (default) variant. This enables sharing a single physical GPU across multiple VMs using NVIDIA mediated devices.

Changes:
- Add `values-talos-vgpu.yaml` with vGPU Manager and vGPU Device Manager enabled
- Add `vgpu` variant to PackageSource
- Add documentation for vGPU setup (driver image, NLS licensing, profiles, KubeVirt)

Users select the `vgpu` variant via Package CR and provide their proprietary vGPU Manager image coordinates through component values.

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
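For orientation, a values file enabling the vGPU components might look roughly like the sketch below. The top-level keys (`vgpuManager`, `vgpuDeviceManager`, `sandboxWorkloads`) follow the upstream gpu-operator chart; the image coordinates are placeholders that the user must replace with their own private-registry values, since the vGPU Manager image is proprietary and cannot be redistributed:

```yaml
# Hypothetical sketch of values-talos-vgpu.yaml — key names follow the
# upstream gpu-operator chart; image coordinates are user-supplied placeholders.
vgpuManager:
  enabled: true
  repository: registry.example.com/nvidia   # private registry (placeholder)
  image: vgpu-manager
  version: "550.90.05"                      # proprietary vGPU driver version
vgpuDeviceManager:
  enabled: true
sandboxWorkloads:
  # Required so the operator manages VM (sandbox) workloads rather than
  # only container workloads.
  enabled: true
```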
No actionable comments were generated in the recent review. 🎉

Walkthrough: Adds vGPU support to the GPU Operator.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin
    participant BuildServer
    participant Registry
    participant GPU_Operator
    participant NLS as "NVIDIA License Server"
    participant KubeVirt
    Admin->>BuildServer: Provide NVIDIA .run files & EULA
    BuildServer->>Registry: Build & push vgpu-manager driver image
    Admin->>GPU_Operator: Install Package CR with variant=vgpu (values-talos-vgpu)
    GPU_Operator->>Registry: Pull vgpu-manager image
    GPU_Operator->>NLS: Reference licensing Secret (gridd.conf) -> configure licensing
    Admin->>KubeVirt: Create VirtualMachine spec referencing vGPU resource
    KubeVirt->>GPU_Operator: Request mediated device allocation on host
    GPU_Operator->>Registry: Ensure driver/kernel modules loaded (vgpuManager)
    GPU_Operator->>KubeVirt: Expose vGPU resource to VM (permittedHostDevices/mediatedDeviceTypes)
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request introduces NVIDIA vGPU support, updates the GPU Operator to version v26.3.0, and upgrades Node Feature Discovery (NFD) to 0.18.3. Key changes include the addition of a vgpu package variant, transition to HTTP-based health probes in NFD, and the adoption of a dedicated utility for CRD management. Feedback suggests updating the documentation to use Secrets for licensing configuration to align with security best practices and removing the deprecated cdi.default field from the ClusterPolicy template.
> 1. Create a ConfigMap with the NLS client configuration in the `cozy-gpu-operator` namespace:
>
>    ```yaml
>    apiVersion: v1
>    kind: ConfigMap
>    metadata:
>      name: licensing-config
>      namespace: cozy-gpu-operator
>    data:
>      gridd.conf: |
>        ServerAddress=nls.example.com
>        ServerPort=443
>        FeatureType=1
>    ```
>
> 2. Reference the ConfigMap in the Package values:
>
>    ```yaml
>    gpu-operator:
>      vgpuManager:
>        repository: registry.example.com/nvidia
>        version: "550.90.05"
>      driver:
>        licensingConfig:
>          configMapName: licensing-config
>    ```
The example uses ConfigMap and configMapName for licensing configuration. However, the updated CRDs (nvidia.com_clusterpolicies.yaml and nvidia.com_nvidiadrivers.yaml) mark configMapName as deprecated in favor of secretName. It is recommended to update the documentation to use a Secret for licensing configuration to align with the new best practices for security.
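Following that recommendation, the doc's licensing example could instead create a Secret carrying `gridd.conf` under `stringData` and reference it via `secretName`, as the updated CRD schema suggests (server address, port, and feature type below are illustrative values carried over from the original example):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
type: Opaque
stringData:
  # Same NLS client configuration as before, now stored in a Secret.
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

The Package values would then reference `driver.licensingConfig.secretName: licensing-config` in place of `configMapName`.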
> ```yaml
> {{- if .Values.cdi.default }}
> default: {{ .Values.cdi.default }}
> {{- end }}
> ```
Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file.

- Add EULA notice about public redistribution of vGPU driver images
- Add note about NLS ServerPort being deployment-dependent

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>

ConfigMap-based licensing configuration (`configMapName`) is deprecated in GPU Operator v26.x. Switch to Secret-based configuration (`secretName`) as recommended by the updated CRD schema.

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/gpu-vgpu.md`:
- Around line 81-109: The doc incorrectly instructs creating a ConfigMap and
using driver.licensingConfig.configMapName; update it to create a Secret (kind:
Secret) containing the gridd.conf under stringData and change the package values
example to use driver.licensingConfig.secretName: licensing-config so the
chart/schema picks up the NLS licensing configuration.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`:
- Around line 97-100: The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`:
- Around line 917-924: The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
⛔ Files ignored due to path filters (1)
`packages/system/gpu-operator/charts/gpu-operator/Chart.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (32)
- docs/gpu-vgpu.md
- packages/core/platform/sources/gpu-operator.yaml
- packages/system/gpu-operator/Makefile
- packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/Chart.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/README.md
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/_helpers.tpl
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc-pdb.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/prometheus.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/topologyupdater.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/worker.yaml
- packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/values.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_clusterpolicies.yaml
- packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/cleanup_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/clusterpolicy.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/extra-objects.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nodefeaturerules.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/nvidiadriver.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/operator.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/role.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/rolebinding.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/serviceaccount.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/upgrade_crd.yaml
- packages/system/gpu-operator/charts/gpu-operator/templates/validations.yaml
- packages/system/gpu-operator/charts/gpu-operator/values.yaml
- packages/system/gpu-operator/values-talos-vgpu.yaml
> ```yaml
> {{- with .Values.master.resources }}
> resources:
>   {{- toYaml . | nindent 8 }}
> {{- end }}
> ```
Resources block is misplaced outside the container spec.
The resources block is rendered at the pod spec level (after tolerations) rather than inside the nfd-master container definition. This would cause a Kubernetes API validation error since resources is not a valid pod spec field.
🐛 Proposed fix to move resources inside container spec
The resources block should be inside the container definition (after line 83, inside the nfd-master container):
```diff
         args:
           - "-prune"
         {{- if .Values.master.instance | empty | not }}
           - "-instance={{ .Values.master.instance }}"
         {{- end }}
+        {{- with .Values.master.resources }}
+        resources:
+          {{- toYaml . | nindent 12 }}
+        {{- end }}
       restartPolicy: Never
       {{- with .Values.master.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.affinity }}
       affinity:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.tolerations }}
       tolerations:
         {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- with .Values.master.resources }}
-      resources:
-        {{- toYaml . | nindent 8 }}
-      {{- end }}
 {{- end }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`
around lines 97 - 100, The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.
> ```diff
>  operator:
>    description: |-
>      Operator represents a key's relationship to the value.
> -    Valid operators are Exists and Equal. Defaults to Equal.
> +    Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
>      Exists is equivalent to wildcard for value, so that a pod can
>      tolerate all taints of a particular category.
> +    Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
>    type: string
> ```
🧩 Analysis chain
🌐 Web query:
Do Kubernetes Pod tolerations support `Lt` and `Gt` operators behind the `TaintTolerationComparisonOperators` feature gate, and in which Kubernetes versions is that feature available or enabled by default?
💡 Result:
Yes, Kubernetes Pod tolerations support the Lt (Less Than) and Gt (Greater Than) operators behind the TaintTolerationComparisonOperators feature gate. These operators enable numeric comparisons between taint values and toleration values (both must be valid 64-bit positive integers), allowing threshold-based scheduling. The feature is available starting in Kubernetes v1.35 as an alpha feature (disabled by default). It must be explicitly enabled via the --feature-gates=TaintTolerationComparisonOperators=true flag on the API server and kube-scheduler. As of 2026-04-02 (during v1.36 development), it remains alpha and is not enabled by default in any version.
Citations:
- 1: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration
- 2: https://github.com/kubernetes/api/blob/master/core/v1/toleration.go
- 3: https://kubernetes.io/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/
- 4: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
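For reference, a toleration using one of the new comparison operators would look like the sketch below. This requires Kubernetes v1.35+ with the feature gate enabled on both the API server and the scheduler, and both the taint value and the toleration value must be valid positive 64-bit integers; the taint key shown is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer
spec:
  tolerations:
    # Lt compares the toleration's value numerically against the taint's
    # value (alpha; gated behind TaintTolerationComparisonOperators, v1.35+).
    - key: nvidia.com/gpu-memory   # hypothetical numeric taint key
      operator: Lt
      value: "8192"
      effect: NoSchedule
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```

On clusters without the feature gate (or before v1.35), the API server rejects such a spec, which is exactly the compatibility gap the comment below raises.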
🏁 Script executed:

```bash
#!/bin/bash
# Search for Kubernetes version constraints in the repository
fd -t f -e yaml -e txt -e md | xargs grep -l "kubernetes" | head -20
# Also search for version info in common locations
if [ -f "Chart.yaml" ]; then echo "=== Chart.yaml ===" && cat Chart.yaml; fi
if [ -f "requirements.yaml" ]; then echo "=== requirements.yaml ===" && cat requirements.yaml; fi
if [ -f "README.md" ]; then echo "=== README.md (first 50 lines) ===" && head -50 README.md; fi
if [ -f "go.mod" ]; then echo "=== go.mod ===" && cat go.mod; fi
# Search for version constraints in the gpu-operator chart
if [ -d "packages/system/gpu-operator" ]; then
  echo "=== Files in gpu-operator ==="
  ls -la packages/system/gpu-operator/ | head -20
fi
```

Repository: cozystack/cozystack
Length of output: 10781
🏁 Script executed:

```bash
#!/bin/bash
# Search for references to TaintTolerationComparisonOperators or kubernetes version requirements
rg -i "kubernetes|k8s|version|feature.*gate|taint" --type yaml -A 2 -B 2 | head -100
```

Repository: cozystack/cozystack
Length of output: 13930
🏁 Script executed:

```bash
#!/bin/bash
# Check if this is a vendored chart by looking for upstream references
if [ -d "packages/system/gpu-operator/charts" ]; then
  echo "=== GPU Operator Chart Structure ==="
  find packages/system/gpu-operator/charts -type f -name "Chart.yaml" -o -name "values.yaml" | xargs head -20
fi
```

Repository: cozystack/cozystack
Length of output: 2228
Verify Lt/Gt toleration operators are supported in supported Kubernetes versions.
The GPU operator chart supports Kubernetes >= 1.16.0, but the TaintTolerationComparisonOperators feature gate is available only in Kubernetes v1.35+ and is disabled by default. Users can define valid NVIDIADriver specs with Lt/Gt operators that will be rejected by the API server on Kubernetes versions 1.16–1.34 or on v1.35+ without the feature gate explicitly enabled. Either update the CRD documentation to clarify this requirement or update the chart's kubeVersion constraint to reflect the actual compatibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`
around lines 917 - 924, The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.
Testing Report: gpu-operator v26.3.0 on Talos Linux

Tested this PR on a Talos v1.12.6 cluster (3× VM nodes with NVIDIA A10 GPUs, Kubernetes v1.34.3).

Critical: Talos OS version parsing failure

The ClusterPolicy enters an error state. This is a known bug in gpu-operator v26.3.0 (NVIDIA/gpu-operator#2239), already fixed upstream in NVIDIA/gpu-operator#2244 and released as v26.3.1 on 2026-04-18. The vendored chart should be bumped from v26.3.0 to v26.3.1. This affects all Talos deployments regardless of variant (`default` or `vgpu`).

Secondary: nvidia-operator-validator permission error

On nodes where the validator pod was re-rolled, the `nvidia-operator-validator` pod failed with a permission error during validation.

Summary

The primary blocker is the OS version parsing bug — any Talos cluster with GPU nodes will hit it. Bumping the vendored chart to v26.3.1 should resolve this. The validator permission issue needs verification after the bump.
What this PR does

Updates the NVIDIA GPU Operator from v25.3.0 to v26.3.0 and adds a new `vgpu` variant for GPU sharing between VMs using mediated devices.

Changes:
- Add `values-talos-vgpu.yaml` for vGPU mode (vGPU Manager + vGPU Device Manager)
- Add `vgpu` variant to PackageSource alongside existing `default` (passthrough)

The vGPU variant enables users to deploy GPU Operator with NVIDIA vGPU Manager. Users must provide their own proprietary vGPU Manager image via Package CR component values.
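Selecting the variant might look like the following sketch. The Package CR schema is assumed here (API group, version, and the `variant`/`values` field names are illustrative, not confirmed by this PR); the chart values under `gpu-operator` mirror the documented licensing example:

```yaml
# Hypothetical Package CR: the variant selects values-talos-vgpu.yaml, and
# component values carry the user's proprietary vGPU Manager image coordinates.
apiVersion: cozystack.io/v1alpha1   # assumed API group/version
kind: Package
metadata:
  name: gpu-operator
spec:
  variant: vgpu
  values:
    gpu-operator:
      vgpuManager:
        repository: registry.example.com/nvidia   # user-supplied registry
        version: "550.90.05"                      # user-supplied driver version
```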
Release note
Summary by CodeRabbit
New Features
Upgrades
Improvements