Skip to content

Update NVIDIA --gpu requests using CDI to match AMD logic#52145

Draft
elezar wants to merge 4 commits intomoby:masterfrom
elezar:cdi-vendor-detection
Draft

Update NVIDIA --gpu requests using CDI to match AMD logic#52145
elezar wants to merge 4 commits intomoby:masterfrom
elezar:cdi-vendor-detection

Conversation

@elezar
Copy link
Contributor

@elezar elezar commented Mar 5, 2026

- What I did

Refactored the handling of --gpus requests for NVIDIA devices to align with AMD device requests. See #52048.

- How I did it

Introduced a cdiCacheInjector and update the AMD-specific code to make use of this type. I then migrated the NVIDIA implementation to this.

- How to verify it

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

@github-actions github-actions bot added the area/daemon Core Engine label Mar 5, 2026
@elezar elezar changed the title Cdi vendor detection Update NVIDIA --gpu requests using CDI to match AMD logic Mar 5, 2026
@vvoland vvoland added the kind/refactor PR's that refactor, or clean-up code label Mar 5, 2026
@vvoland vvoland added this to the 29.3.1 milestone Mar 5, 2026
Comment on lines 10 to 12
// RegisterGPUDeviceDrivers registers GPU device drivers.
// If the cdiCache is provided, it is used to detect presence of CDI specs for AMD GPUs.
// For NVIDIA GPUs, presence of CDI specs is detected by checking for the nvidia-cdi-hook binary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RegisterGPUDeviceDrivers registers GPU device drivers.
// If the cdiCache is provided, it is used to detect presence of CDI specs for AMD GPUs.
// For NVIDIA GPUs, presence of CDI specs is detected by checking for the nvidia-cdi-hook binary.
// RegisterGPUDeviceDrivers registers GPU device drivers.
// If the cdiCache is provided, it is used to discover the vendor via available CDI specs
// and translate the GPU requests to CDI device requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elezar ptal 🤗

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Sorry for the delay.

@elezar elezar force-pushed the cdi-vendor-detection branch from c61a286 to 5a0ed51 Compare March 12, 2026 13:37
}
}

// a cdiCacheInjector uses the specified CDI cache to inject device requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// a cdiCacheInjector uses the specified CDI cache to inject device requests
// cdiCacheInjector uses the specified CDI cache to inject device requests

Comment on lines +50 to +55
if c.optionalVendor != "" && vendor != c.optionalVendor {
return fmt.Errorf("CDI specs for required vendor %v not found", c.optionalVendor)
}
injector := &cdiDeviceInjector{
defaultCDIDeviceKind: c.optionalVendor + "/gpu",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be called requiredVendor (or just vendor)? Otherwise the injectDevices will error out.

Otherwise, if optionalVendor is empty, we end up with defaultCDIDeviceKind: "/gpu"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The logic here is not correct. Let me work through it and also add unit tests to capture the intent.

elezar added 3 commits March 17, 2026 16:17
This change adds a cdiCacheInjector type for handling --gpus requests
as vendor-specific CDI requests. The handling of --gpus requests for
AMD devices is updated to use this type.

Signed-off-by: Evan Lezar <[email protected]>
This change aligns the handling of --gpus requests using CDI
for NVIDIA device with those for AMD devices. Instead of checking
for the existence of the nvidia-cdi-hook executable, available CDI
vendors are checked. This makes the implementation more robust w.r.t
changing implementation details in NVIDIA CDI specs.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the cdi-vendor-detection branch from b3a33e2 to dad2476 Compare March 17, 2026 15:17
@thaJeztah
Copy link
Member

🙈 now a linter is failing

goanalysis_metalinter: 1m24.292750831s 
daemon/devices_nvidia_linux.go:223:29: ST1016: methods on the same type should have the same receiver name (seen 1x "c", 2x "i") (staticcheck)
func (i *cdiDeviceInjector) injectDevices(s *specs.Spec, dev *deviceInstance) error {
                            ^

@elezar elezar force-pushed the cdi-vendor-detection branch from dad2476 to 2fc090d Compare March 18, 2026 08:26
ListVendors() []string
}

// createCDIInjector creates a injector for CDI devices from the specified a set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// createCDIInjector creates a injector for CDI devices from the specified a set
// createCDIDeviceInjector creates an injector for CDI devices from the specified a set

Comment on lines +279 to +284
func (i *cdiDeviceInjector) resolveVendor() (string, error) {
vendor, err := i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors())
if err != nil {
return "", err
}
return vendor, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, can be simplified:

Suggested change
func (i *cdiDeviceInjector) resolveVendor() (string, error) {
vendor, err := i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors())
if err != nil {
return "", err
}
return vendor, nil
func (i *cdiDeviceInjector) resolveVendor() (string, error) {
return i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors())

Comment on lines +42 to +50
if cdiCache != nil {
// Create a device injector that handles --gpus device requests using
// nvidia.com CDI requests.
injector := createCDIDeviceInjector(cdiCache, "nvidia.com")
// Register a named device driver for this injector so that a user can
// explicitly request it.
// This has no capabilities associated to not inadvertently match requests.
cdiDeviceDriver := &deviceDriver{
updateSpec: (&cdiDeviceInjector{
defaultCDIDeviceKind: "nvidia.com/gpu",
}).injectDevices,
nvidiaDrivers["nvidia.cdi"] = &deviceDriver{
updateSpec: injector.injectDevices,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with this approach is that this will ALWAYS register an NVIDIA driver and the AMD driver will not be used. I need to rethink this and also how we will test this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've marked this as draft for the time being.

@elezar elezar marked this pull request as draft March 18, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/daemon Core Engine kind/refactor PR's that refactor, or clean-up code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants