[gpu-operator] Update to v26.3.0 and add vGPU variant #2323

Draft
lexfrei wants to merge 4 commits into `main` from `feat/update-gpu-operator-v26.3.0`

Conversation


@lexfrei lexfrei commented Apr 2, 2026

What this PR does

Updates the NVIDIA GPU Operator from v25.3.0 to v26.3.0 and adds a new vgpu variant for GPU sharing between VMs using mediated devices.

Changes:

  • Update vendored upstream chart to v26.3.0 (CDI enabled by default, component version bumps)
  • Pin chart version in Makefile for reproducibility
  • Add values-talos-vgpu.yaml for vGPU mode (vGPU Manager + vGPU Device Manager)
  • Add vgpu variant to PackageSource alongside existing default (passthrough)
  • Add internal documentation for vGPU setup

The vGPU variant enables users to deploy GPU Operator with NVIDIA vGPU Manager. Users must provide their own proprietary vGPU Manager image via Package CR component values.
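As a rough sketch, selecting the variant and supplying the image coordinates through a Package CR might look like the following. The `apiVersion`, `kind`, and field layout are assumptions for illustration (the Cozystack Package CR schema is authoritative); only the `vgpu` variant name and the idea of passing vGPU Manager image coordinates come from this PR.

```yaml
# Hypothetical example; check field names against the Package CR schema.
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: gpu-operator
spec:
  variant: vgpu
  values:
    gpu-operator:
      vgpuManager:
        repository: registry.example.com/nvidia   # user-supplied private registry
        version: "550.90.05"                      # proprietary vGPU Manager image tag
```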

Release note

[gpu-operator] Update NVIDIA GPU Operator to v26.3.0 and add vGPU variant for GPU sharing between VMs

Summary by CodeRabbit

  • New Features

    • vGPU support for the GPU Operator with KubeVirt integration and a new end-to-end vGPU setup guide.
    • Kata sandbox device plugin and expanded CDI/NRI configuration options.
  • Upgrades

    • GPU Operator bumped to v26.3.0.
    • Node Feature Discovery bumped to v0.18.3.
  • Improvements

    • HTTP-based health probes, PodDisruptionBudget templates, imagePullSecrets helper, enhanced RBAC/namespace scoping, CRD schema additions (licensing secret, vGPU fields, toleration ops), and chart/values refinements.

lexfrei added 2 commits April 2, 2026 14:58
Update vendored upstream chart to v26.3.0. Key changes:
- CDI enabled by default
- Driver version 580.126.20
- NFD subchart updated to 0.18.3
- Component versions bumped (toolkit v1.19.0, device-plugin v0.19.0, dcgm-exporter 4.5.2)
- Pin chart version in Makefile for reproducibility

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>

Add vGPU mode as a new variant alongside the existing passthrough (default) variant.
This enables sharing a single physical GPU across multiple VMs using NVIDIA mediated devices.

Changes:
- Add values-talos-vgpu.yaml with vGPU Manager and vGPU Device Manager enabled
- Add 'vgpu' variant to PackageSource
- Add documentation for vGPU setup (driver image, NLS licensing, profiles, KubeVirt)

Users select the vgpu variant via Package CR and provide their proprietary
vGPU Manager image coordinates through component values.

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>

coderabbitai Bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50f62eaa-c639-4f6c-acb7-1e95d4184b65

📥 Commits

Reviewing files that changed from the base of the PR and between ab5c618 and 783b51b.

📒 Files selected for processing (1)
  • docs/gpu-vgpu.md
✅ Files skipped from review due to trivial changes (1)
  • docs/gpu-vgpu.md

📝 Walkthrough

Walkthrough

Adds vGPU support to the GPU Operator (new vgpu variant and Talos vGPU values), upgrades GPU Operator to v26.3.0 and Node Feature Discovery to v0.18.3, migrates NFD probes from gRPC to HTTP, enhances multiple CRDs/templates for licensing, vGPU/kernel-module config, CDI/NRI and kata sandbox integration, and adds documentation.

Changes

- **Documentation** (`docs/gpu-vgpu.md`): New end-to-end guide for configuring GPU Operator with NVIDIA vGPU: prerequisites, driver image build/publish, Package CR deployment, NLS licensing Secret, vGPU profile selection, vgpuDeviceManager config, and KubeVirt integration examples.
- **Platform variant** (`packages/core/platform/sources/gpu-operator.yaml`): Added a new `vgpu` `spec.variant` referencing `values-talos-vgpu.yaml`, enabling a vGPU-specific installation path alongside the existing default variant.
- **Chart pinning & metadata** (`packages/system/gpu-operator/Makefile`, `packages/system/gpu-operator/charts/gpu-operator/Chart.yaml`): Pinned the Helm pull to `--version v26.3.0`; bumped the chart appVersion/version to v26.3.0 and updated the NFD dependency to `oci://registry.k8s.io/nfd/charts` 0.18.3.
- **Node Feature Discovery (NFD) chart** (`packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/...`): Migrated NFD components from gRPC probes to HTTP `/healthz`, consolidated metrics/health ports into a single port, added `dnsPolicy` values, an `imagePullSecrets` helper, and PodDisruptionBudget templates, and updated values/README. CRDs extended with `Ge`/`Le`/`GeLe`, `type`, and rule `vars`/`varsTemplate`.
- **CRD extensions** (`.../crds/nvidia.com_clusterpolicies.yaml`, `.../crds/nvidia.com_nvidiadrivers.yaml`): Extended the ClusterPolicy and NVIDIADriver CRDs: added `vgpuManager`/`kernelModuleConfig`, licensing `secretName` (the ConfigMap field is deprecated), `secretEnv`, `podSecurityContext`, `kataSandboxDevicePlugin`, `sandboxWorkloads.mode`, NRI plugin/CDI-related fields, expanded toleration operators, and a `disabled` state.
- **Templates & rendering logic** (`packages/system/gpu-operator/charts/gpu-operator/templates/*.yaml`): Numerous template changes: namespace-scoped RBAC/ServiceAccounts, use of the `manage-crds` binary for CRD apply/delete in hooks, an added `extraObjects` template, refined ClusterPolicy rendering for new fields (vGPU/Kata/CDI), conditional NVIDIADriver/license-secret handling, simplified nodefeaturerules down to kernel-module rules, and a validation preventing NRI enablement when CDI is disabled.
- **Values & Talos vGPU overrides** (`packages/system/gpu-operator/charts/gpu-operator/values.yaml`, `packages/system/gpu-operator/values-talos-vgpu.yaml`): Major values updates: CDI enabled by default with `nriPluginEnabled: false`, introduced `sandboxWorkloads.mode`, switched many env lists to `env: []`, added `extraObjects`, consolidated NFD ports to 8080, added PodDisruptionBudgets and driver/licensing secret wiring, and added `values-talos-vgpu.yaml` enabling vgpuManager/vgpuDeviceManager and disabling the standard driver/devicePlugin for Talos vGPU setups.
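Based on the summary above, the shape of the Talos vGPU override values is roughly the following. This is an illustrative sketch only; the actual `values-talos-vgpu.yaml` in the PR is authoritative, and the key nesting is an assumption based on the upstream GPU Operator chart values.

```yaml
# Sketch of the vGPU override values described above.
driver:
  enabled: false          # standard datacenter driver is not used in vGPU mode
devicePlugin:
  enabled: false          # VMs, not pods, consume the GPU
sandboxWorkloads:
  enabled: true           # assumption: enables the sandbox (VM) workload path
vgpuManager:
  enabled: true           # image coordinates are supplied by the user
vgpuDeviceManager:
  enabled: true           # creates mediated devices from selected vGPU profiles
```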

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin
    participant BuildServer
    participant Registry
    participant GPU_Operator
    participant NLS as "NVIDIA License Server"
    participant KubeVirt

    Admin->>BuildServer: Provide NVIDIA .run files & EULA
    BuildServer->>Registry: Build & push vgpu-manager driver image
    Admin->>GPU_Operator: Install Package CR with variant=vgpu (values-talos-vgpu)
    GPU_Operator->>Registry: Pull vgpu-manager image
    GPU_Operator->>NLS: Reference licensing Secret (gridd.conf) -> configure licensing
    Admin->>KubeVirt: Create VirtualMachine spec referencing vGPU resource
    KubeVirt->>GPU_Operator: Request mediated device allocation on host
    GPU_Operator->>Registry: Ensure driver/kernel modules loaded (vgpuManager)
    GPU_Operator->>KubeVirt: Expose vGPU resource to VM (permittedHostDevices/mediatedDeviceTypes)
```
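The final step involves KubeVirt's host-device wiring (`permittedHostDevices`/`mediatedDeviceTypes`). A hedged sketch of that configuration follows; the mdev selector and resource name are examples for an A10 profile, not values taken from this PR, and the exact names depend on the GPU model and the profile selected via the vGPU Device Manager.

```yaml
# Illustrative only: profile and resource names vary by GPU model and profile.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    permittedHostDevices:
      mediatedDevices:
        - mdevNameSelector: "NVIDIA A10-4Q"          # example vGPU profile
          resourceName: "nvidia.com/NVIDIA_A10-4Q"   # requested by VM specs
```

A VirtualMachine then requests the device by name under `spec.template.spec.domain.devices.gpus`.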

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 I nibble bytes and build a crate,
I stitch vGPU to the cluster gate.
From driver image to license key,
I hop through pods so VMs roam free.
Hooray — kernels, secrets, configs done; let rabbits dance beneath the sun!

🚥 Pre-merge checks: ✅ 3 passed

- Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
- Title Check: ✅ Passed. The title '[gpu-operator] Update to v26.3.0 and add vGPU variant' accurately and concisely summarizes the main changes: upgrading the GPU Operator chart to v26.3.0 and introducing a new vGPU variant for GPU sharing. It is specific, clear, and reflects the primary objectives of the PR.
- Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; the check was skipped.



@lexfrei lexfrei self-assigned this Apr 2, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces NVIDIA vGPU support, updates the GPU Operator to version v26.3.0, and upgrades Node Feature Discovery (NFD) to 0.18.3. Key changes include the addition of a vgpu package variant, transition to HTTP-based health probes in NFD, and the adoption of a dedicated utility for CRD management. Feedback suggests updating the documentation to use Secrets for licensing configuration to align with security best practices and removing the deprecated cdi.default field from the ClusterPolicy template.

Comment thread docs/gpu-vgpu.md Outdated
Comment on lines +79 to +104
1. Create a ConfigMap with the NLS client configuration in the `cozy-gpu-operator` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
data:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

2. Reference the ConfigMap in the Package values:

```yaml
gpu-operator:
  vgpuManager:
    repository: registry.example.com/nvidia
    version: "550.90.05"
  driver:
    licensingConfig:
      configMapName: licensing-config
```


Severity: medium

The example uses ConfigMap and configMapName for licensing configuration. However, the updated CRDs (nvidia.com_clusterpolicies.yaml and nvidia.com_nvidiadrivers.yaml) mark configMapName as deprecated in favor of secretName. It is recommended to update the documentation to use a Secret for licensing configuration to align with the new best practices for security.
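Following that recommendation, a Secret-based equivalent of the doc's example might look like the sketch below (reusing the same hypothetical NLS endpoint as the original snippet):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config
  namespace: cozy-gpu-operator
type: Opaque
stringData:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```

The Package values would then reference it via `driver.licensingConfig.secretName: licensing-config` instead of `configMapName`.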

Comment on lines +152 to +154
```yaml
{{- if .Values.cdi.default }}
default: {{ .Values.cdi.default }}
{{- end }}
```


Severity: medium

The cdi.default field is deprecated in the ClusterPolicy CRD and is no longer used by the GPU Operator (setting cdi.enabled=true now automatically configures it as the default). This block should be removed to align with the updated CRD schema and avoid confusion.

Replace simplified Containerfile with NVIDIA's Makefile-based build
system from gitlab.com/nvidia/container-images/driver. The GPU Operator
expects pre-compiled kernel modules, not a raw .run file.

Add EULA notice about public redistribution of vGPU driver images.
Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
@lexfrei lexfrei marked this pull request as ready for review April 2, 2026 12:51
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Apr 2, 2026
ConfigMap-based licensing configuration (configMapName) is deprecated
in GPU Operator v26.x. Switch to Secret-based configuration (secretName)
as recommended by the updated CRD schema.

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/gpu-vgpu.md`:
- Around line 81-109: The doc incorrectly instructs creating a ConfigMap and
using driver.licensingConfig.configMapName; update it to create a Secret (kind:
Secret) containing the gridd.conf under stringData and change the package values
example to use driver.licensingConfig.secretName: licensing-config so the
chart/schema picks up the NLS licensing configuration.

In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`:
- Around line 97-100: The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.

In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`:
- Around line 917-924: The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51f9626b-f0b3-498a-82e0-c845bb0b24ad

📥 Commits

Reviewing files that changed from the base of the PR and between 73ec5a5 and ab5c618.

⛔ Files ignored due to path filters (1)
  • packages/system/gpu-operator/charts/gpu-operator/Chart.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • docs/gpu-vgpu.md
  • packages/core/platform/sources/gpu-operator.yaml
  • packages/system/gpu-operator/Makefile
  • packages/system/gpu-operator/charts/gpu-operator/Chart.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/Chart.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/README.md
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/_helpers.tpl
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master-pdb.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/master.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc-pdb.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/nfd-gc.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/prometheus.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/topologyupdater.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/worker.yaml
  • packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/values.yaml
  • packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_clusterpolicies.yaml
  • packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/cleanup_crd.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/clusterpolicy.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/extra-objects.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/nodefeaturerules.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/nvidiadriver.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/operator.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/role.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/rolebinding.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/serviceaccount.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/upgrade_crd.yaml
  • packages/system/gpu-operator/charts/gpu-operator/templates/validations.yaml
  • packages/system/gpu-operator/charts/gpu-operator/values.yaml
  • packages/system/gpu-operator/values-talos-vgpu.yaml

Comment thread packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml Outdated
Comment on lines +97 to +100
```yaml
{{- with .Values.master.resources }}
resources:
  {{- toYaml . | nindent 8 }}
{{- end }}
```


⚠️ Potential issue | 🔴 Critical

Resources block is misplaced outside the container spec.

The resources block is rendered at the pod spec level (after tolerations) rather than inside the nfd-master container definition. This would cause a Kubernetes API validation error since resources is not a valid pod spec field.

🐛 Proposed fix to move resources inside container spec

The resources block should be inside the container definition (after line 83, inside the nfd-master container):

```diff
           args:
             - "-prune"
             {{- if .Values.master.instance | empty | not }}
             - "-instance={{ .Values.master.instance }}"
             {{- end }}
+          {{- with .Values.master.resources }}
+          resources:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}
       restartPolicy: Never
       {{- with .Values.master.nodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.affinity }}
       affinity:
         {{- toYaml . | nindent 8 }}
       {{- end }}
       {{- with .Values.master.tolerations }}
       tolerations:
         {{- toYaml . | nindent 8 }}
       {{- end }}
-      {{- with .Values.master.resources }}
-      resources:
-        {{- toYaml . | nindent 8 }}
-      {{- end }}
 {{- end }}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/system/gpu-operator/charts/gpu-operator/charts/node-feature-discovery/templates/post-delete-job.yaml`
around lines 97 - 100, The resources block is being rendered at the Pod spec
level using .Values.master.resources but must be moved into the nfd-master
container spec; update the template so the resources: {{- toYaml
.Values.master.resources }} block is placed inside the container definition for
"nfd-master" (e.g., immediately after that container's fields) rather than after
tolerations, ensuring the resources stanza is nested under the nfd-master
container.

Comment on lines 917 to 924
```diff
       operator:
         description: |-
           Operator represents a key's relationship to the value.
-          Valid operators are Exists and Equal. Defaults to Equal.
+          Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
           Exists is equivalent to wildcard for value, so that a pod can
           tolerate all taints of a particular category.
+          Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
         type: string
```


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Do Kubernetes Pod tolerations support Lt and Gt operators behind the TaintTolerationComparisonOperators feature gate, and in which Kubernetes versions is that feature available or enabled by default?

💡 Result:

Yes, Kubernetes Pod tolerations support the Lt (Less Than) and Gt (Greater Than) operators behind the TaintTolerationComparisonOperators feature gate. These operators enable numeric comparisons between taint values and toleration values (both must be valid 64-bit positive integers), allowing threshold-based scheduling. The feature is available starting in Kubernetes v1.35 as an alpha feature (disabled by default). It must be explicitly enabled via the --feature-gates=TaintTolerationComparisonOperators=true flag on the API server and kube-scheduler. As of 2026-04-02 (during v1.36 development), it remains alpha and is not enabled by default in any version.


🏁 Script executed:

```bash
#!/bin/bash
# Search for Kubernetes version constraints in the repository
fd -t f -e yaml -e txt -e md | xargs grep -l "kubernetes" | head -20

# Also search for version info in common locations
if [ -f "Chart.yaml" ]; then echo "=== Chart.yaml ===" && cat Chart.yaml; fi
if [ -f "requirements.yaml" ]; then echo "=== requirements.yaml ===" && cat requirements.yaml; fi
if [ -f "README.md" ]; then echo "=== README.md (first 50 lines) ===" && head -50 README.md; fi
if [ -f "go.mod" ]; then echo "=== go.mod ===" && cat go.mod; fi

# Search for version constraints in the gpu-operator chart
if [ -d "packages/system/gpu-operator" ]; then
  echo "=== Files in gpu-operator ==="
  ls -la packages/system/gpu-operator/ | head -20
fi
```

Repository: cozystack/cozystack

Length of output: 10781


🏁 Script executed:

```bash
#!/bin/bash
# Search for references to TaintTolerationComparisonOperators or kubernetes version requirements
rg -i "kubernetes|k8s|version|feature.*gate|taint" --type yaml -A 2 -B 2 | head -100
```

Repository: cozystack/cozystack

Length of output: 13930


🏁 Script executed:

```bash
#!/bin/bash
# Check if this is a vendored chart by looking for upstream references
if [ -d "packages/system/gpu-operator/charts" ]; then
  echo "=== GPU Operator Chart Structure ==="
  find packages/system/gpu-operator/charts -type f -name "Chart.yaml" -o -name "values.yaml" | xargs head -20
fi
```

Repository: cozystack/cozystack

Length of output: 2228


Verify Lt/Gt toleration operators are supported in supported Kubernetes versions.

The GPU operator chart supports Kubernetes >= 1.16.0, but the TaintTolerationComparisonOperators feature gate is available only in Kubernetes v1.35+ and is disabled by default. Users can define valid NVIDIADriver specs with Lt/Gt operators that will be rejected by the API server on Kubernetes versions 1.16–1.34 or on v1.35+ without the feature gate explicitly enabled. Either update the CRD documentation to clarify this requirement or update the chart's kubeVersion constraint to reflect the actual compatibility.
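For reference, a toleration using the numeric operators would look like the sketch below. It is only accepted by an API server running Kubernetes v1.35+ with the alpha TaintTolerationComparisonOperators feature gate enabled on kube-apiserver and kube-scheduler; the key and value here are made up for illustration.

```yaml
tolerations:
  - key: "example.com/gpu-priority"   # hypothetical numeric taint key
    operator: "Gt"                    # numeric comparison; taint and toleration
    value: "100"                      # values must be positive 64-bit integers
    effect: "NoSchedule"
```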

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/system/gpu-operator/charts/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml`
around lines 917 - 924, The CRD docs for the `operator` field currently list
`Lt`/`Gt` without noting they require the Kubernetes feature gate; update the
`operator` description under the `operator` field (in the NVIDIADriver CRD) to
state that `Lt` and `Gt` require Kubernetes v1.35+ with the
TaintTolerationComparisonOperators feature gate enabled, or alternatively update
the chart's `kubeVersion` constraint to >=1.35.0 so that the chart and the
`NVIDIADriver` spec only allow these operators on supported Kubernetes versions;
pick one approach and make the matching change to either the `operator`
description or the chart `kubeVersion`.

@IvanHunters IvanHunters self-assigned this Apr 9, 2026
@lexfrei lexfrei marked this pull request as draft April 10, 2026 11:46
@Arsolitt

Testing Report: gpu-operator v26.3.0 on Talos Linux

Tested this PR on a Talos v1.12.6 cluster (3× VM nodes with NVIDIA A10 GPUs, Kubernetes v1.34.3).

Critical: Talos OS version parsing failure

The ClusterPolicy enters notReady state immediately after upgrade:

```
failed to retrieve GPU node OS tag: error processing OS major version v1: strconv.Atoi: parsing "v1": invalid syntax
```

This is a known bug in gpu-operator v26.3.0 (NVIDIA/gpu-operator#2239). The getGPUNodeOSTag() function in controllers/state_manager.go runs unconditionally during ClusterPolicy controller init for any cluster with GPU-labeled nodes — it is not gated by driver.enabled or any other ClusterPolicy setting. Since Talos sets VERSION_ID=v1.12.6 (with v prefix), strconv.Atoi fails on the major version.
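The failure is straightforward to reproduce outside the operator. The sketch below mimics the version parsing; `parseMajor` is a stand-in written for illustration, not the operator's actual function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajor is a stand-in for the major-version parsing done in
// getGPUNodeOSTag(): take the text before the first dot and convert it.
func parseMajor(versionID string) (int, error) {
	return strconv.Atoi(strings.SplitN(versionID, ".", 2)[0])
}

func main() {
	// Talos /etc/os-release reports VERSION_ID=v1.12.6, so the major
	// component is "v1", which strconv.Atoi rejects.
	if _, err := parseMajor("v1.12.6"); err != nil {
		fmt.Println("unpatched:", err)
	}
	// Tolerating the "v" prefix before converting, as the upstream fix
	// does, succeeds.
	major, err := parseMajor(strings.TrimPrefix("v1.12.6", "v"))
	fmt.Println("patched:", major, err)
}
```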

Already fixed upstream in NVIDIA/gpu-operator#2244, released as v26.3.1 on 2026-04-18. The vendored chart should be bumped from v26.3.0 to v26.3.1.

This affects all Talos deployments regardless of variant (default or vgpu).

Secondary: nvidia-operator-validator permission error

On nodes where the validator pod was re-rolled, the driver-validation init container crashes with:

```
error creating symlink creator: failed to create device node nvidia-modeset: permission denied
```

This happens during /dev/char symlink creation. Could not fully isolate this issue since the ClusterPolicy failure above prevents normal reconciliation. Might be resolved by v26.3.1 as well, but worth verifying separately. See NVIDIA/gpu-operator#1687 for related Talos context.

Summary

The primary blocker is the OS version parsing bug — any Talos cluster with GPU nodes will hit it. Bumping the vendored chart to v26.3.1 should resolve this. The validator permission issue needs verification after the bump.

@IvanHunters IvanHunters removed their assignment Apr 22, 2026

Labels

do-not-merge enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.
