Update NVIDIA --gpu requests using CDI to match AMD logic#52145
Update NVIDIA --gpu requests using CDI to match AMD logic#52145elezar wants to merge 4 commits intomoby:masterfrom
Conversation
daemon/devices_linux.go
Outdated
| // RegisterGPUDeviceDrivers registers GPU device drivers. | ||
| // If the cdiCache is provided, it is used to detect presence of CDI specs for AMD GPUs. | ||
| // For NVIDIA GPUs, presence of CDI specs is detected by checking for the nvidia-cdi-hook binary. |
There was a problem hiding this comment.
| // RegisterGPUDeviceDrivers registers GPU device drivers. | |
| // If the cdiCache is provided, it is used to detect presence of CDI specs for AMD GPUs. | |
| // For NVIDIA GPUs, presence of CDI specs is detected by checking for the nvidia-cdi-hook binary. | |
| // RegisterGPUDeviceDrivers registers GPU device drivers. | |
| // If the cdiCache is provided, it is used to discover the vendor via available CDI specs | |
| // and translate the GPU requests to CDI device requests. |
There was a problem hiding this comment.
Done. Sorry for the delay.
c61a286 to
5a0ed51
Compare
daemon/devices_linux.go
Outdated
| } | ||
| } | ||
|
|
||
| // a cdiCacheInjector uses the specified CDI cache to inject device requests |
There was a problem hiding this comment.
| // a cdiCacheInjector uses the specified CDI cache to inject device requests | |
| // cdiCacheInjector uses the specified CDI cache to inject device requests |
daemon/devices_linux.go
Outdated
| if c.optionalVendor != "" && vendor != c.optionalVendor { | ||
| return fmt.Errorf("CDI specs for required vendor %v not found", c.optionalVendor) | ||
| } | ||
| injector := &cdiDeviceInjector{ | ||
| defaultCDIDeviceKind: c.optionalVendor + "/gpu", | ||
| } |
There was a problem hiding this comment.
Shouldn't it be called requiredVendor (or just vendor)? Otherwise the injectDevices will error out.
Otherwise, if optionalVendor is empty, we end up with defaultCDIDeviceKind: "/gpu"
There was a problem hiding this comment.
You're right. The logic here is not correct. Let me work through it and also add unit tests to capture the intent.
Signed-off-by: Evan Lezar <[email protected]>
This change adds a cdiCacheInjector type for handling --gpus requests as vendor-specific CDI requests. The handling of --gpus requests for AMD devices is updated to use this type. Signed-off-by: Evan Lezar <[email protected]>
This change aligns the handling of --gpus requests using CDI for NVIDIA device with those for AMD devices. Instead of checking for the existence of the nvidia-cdi-hook executable, available CDI vendors are checked. This makes the implementation more robust w.r.t changing implementation details in NVIDIA CDI specs. Signed-off-by: Evan Lezar <[email protected]>
b3a33e2 to
dad2476
Compare
|
🙈 now a linter is failing |
Signed-off-by: Evan Lezar <[email protected]>
dad2476 to
2fc090d
Compare
| ListVendors() []string | ||
| } | ||
|
|
||
| // createCDIInjector creates a injector for CDI devices from the specified a set |
There was a problem hiding this comment.
| // createCDIInjector creates a injector for CDI devices from the specified a set | |
| // createCDIDeviceInjector creates an injector for CDI devices from the specified a set |
| func (i *cdiDeviceInjector) resolveVendor() (string, error) { | ||
| vendor, err := i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors()) | ||
| if err != nil { | ||
| return "", err | ||
| } | ||
| return vendor, nil |
There was a problem hiding this comment.
nit, can be simplified:
| func (i *cdiDeviceInjector) resolveVendor() (string, error) { | |
| vendor, err := i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors()) | |
| if err != nil { | |
| return "", err | |
| } | |
| return vendor, nil | |
| func (i *cdiDeviceInjector) resolveVendor() (string, error) { | |
| return i.validVendors.getFirstAvailableIn(i.availableVendors.ListVendors()) |
| if cdiCache != nil { | ||
| // Create a device injector that handles --gpus device requests using | ||
| // nvidia.com CDI requests. | ||
| injector := createCDIDeviceInjector(cdiCache, "nvidia.com") | ||
| // Register a named device driver for this injector so that a user can | ||
| // explicitly request it. | ||
| // This has no capabilities associated to not inadvertently match requests. | ||
| cdiDeviceDriver := &deviceDriver{ | ||
| updateSpec: (&cdiDeviceInjector{ | ||
| defaultCDIDeviceKind: "nvidia.com/gpu", | ||
| }).injectDevices, | ||
| nvidiaDrivers["nvidia.cdi"] = &deviceDriver{ | ||
| updateSpec: injector.injectDevices, |
There was a problem hiding this comment.
One issue with this approach is that this will ALWAYS register an NVIDIA driver and the AMD driver will not be used. I need to rethink this and also how we will test this.
There was a problem hiding this comment.
I've marked this as draft for the time being.
- What I did
Refactored the handling of
--gpusrequests for NVIDIA devices to align with AMD device requests. See #52048.- How I did it
Introduced a
cdiCacheInjectorand update the AMD-specific code to make use of this type. I then migrated the NVIDIA implementation to this.- How to verify it
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)