@JunAr7112
Contributor

No description provided.

@copy-pr-bot
copy-pr-bot bot commented Nov 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

return 0, fmt.Errorf("vGPU type %s not found in file %s", vgpuTypeName, filePath)
}

func (m *nvlibVGPUConfigManager) IsVFIOEnabled() (bool, error) {
Contributor

This method checks whether /etc/os-release mentions Ubuntu 24.04 or RHEL 10. How does that tell us whether VFIO is enabled? I also don't see the receiver m *nvlibVGPUConfigManager being used anywhere in this method.

Contributor Author

OK, I see it would be better to check the directories directly rather than look at the distro. I updated the method to check for devices in /sys/class/mdev_bus.

return nil, fmt.Errorf("unable to get GPU by index %d: %v", gpu, err)
}
vgpuConfig := types.VGPUConfig{}
VFnum := 0
Contributor

Let's not capitalise local variable names.

Contributor Author

Updated to remove the capitalization.

return fmt.Errorf("GPU at index %d not found in available NVIDIA devices", gpu)
}

cmd := exec.Command("chroot", "/host", "/run/nvidia/driver/usr/lib/nvidia/sriov-manage", "-e", nvdevice.Address)
Contributor

We shouldn't hardcode the /run/nvidia/driver/ path here. It should be retrieved from a parameter instead; the parameter is called driverRoot.

Contributor Author

Do you know where driverRoot is defined? Should I manually define it as a constant in the file?

Contributor Author

I had difficulty running sriov-manage via driverRoot from within the container: the container is built from a distroless image, and sriov-manage is a bash script, so it cannot run there.
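
A rough sketch of how the command could be assembled from parameters instead of a hardcoded path. The function name sriovManageCommand and the hostRoot parameter are hypothetical; only driverRoot is named in the review. This does not address the distroless problem raised above; it only shows the path being derived from driverRoot.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// sriovManageCommand builds the chroot invocation seen in the diff, but takes
// the host root and driver root as parameters. Both the function name and the
// hostRoot parameter are assumptions made for illustration.
func sriovManageCommand(hostRoot, driverRoot, pciAddress string) *exec.Cmd {
	script := filepath.Join(driverRoot, "usr/lib/nvidia/sriov-manage")
	return exec.Command("chroot", hostRoot, script, "-e", pciAddress)
}

func main() {
	// Example values only; the PCI address is made up.
	cmd := sriovManageCommand("/host", "/run/nvidia/driver", "0000:41:00.0")
	fmt.Println(cmd.Args)
}
```

Returning the *exec.Cmd without running it keeps the path construction easy to unit test.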

Contributor

@cdesiniotis left a comment

Thanks @JunAr7112, this is a good start. As we iterate on this and get more familiar with the internals, it may be valuable to create a new internal/vgpu package that hides the vfio vs mdev framework complexity. We need to think through what the right interface would be, but I imagine we will need methods for 1) getting all vGPU devices, 2) getting all parent devices (on top of which a vGPU device can be created), and 3) creating a vGPU device. The pkg/vgpu/config.go file, which is concerned with getting / setting a particular vGPU config, can then invoke these methods without having to know what vfio / mdev is.
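
One possible shape for such an interface, sketched under the assumption that the three methods listed above are all that is needed. Every name here (Device, Manager, fakeManager and the method names) is hypothetical; the review explicitly leaves the final interface open.

```go
package main

import "fmt"

// Device is a hypothetical, framework-neutral description of a vGPU device.
type Device struct {
	Parent string // PCI address of the parent device
	Type   string // vGPU type name
}

// Manager is a hypothetical interface covering the three operations listed in
// the review. A vfio and an mdev implementation would both satisfy it.
type Manager interface {
	GetAllDevices() ([]Device, error)
	GetAllParentDevices() ([]string, error)
	CreateDevice(parent, vgpuType string) (Device, error)
}

// fakeManager is an in-memory stand-in used only to exercise the interface.
type fakeManager struct {
	parents []string
	devices []Device
}

func (f *fakeManager) GetAllDevices() ([]Device, error) {
	return append([]Device(nil), f.devices...), nil
}

func (f *fakeManager) GetAllParentDevices() ([]string, error) {
	return append([]string(nil), f.parents...), nil
}

func (f *fakeManager) CreateDevice(parent, vgpuType string) (Device, error) {
	for _, p := range f.parents {
		if p == parent {
			d := Device{Parent: parent, Type: vgpuType}
			f.devices = append(f.devices, d)
			return d, nil
		}
	}
	return Device{}, fmt.Errorf("parent device %s not found", parent)
}

func main() {
	// Made-up PCI address and type name.
	var m Manager = &fakeManager{parents: []string{"0000:41:00.0"}}
	if _, err := m.CreateDevice("0000:41:00.0", "EXAMPLE-1Q"); err != nil {
		fmt.Println("error:", err)
		return
	}
	devices, _ := m.GetAllDevices()
	fmt.Println(devices)
}
```

With an interface like this, pkg/vgpu/config.go would depend only on Manager, and the choice between vfio and mdev would be made once, when the implementation is constructed.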

Comment on lines 194 to 195
// Check if mdev_bus exists and has entries
mdevBusPath := "/sys/class/mdev_bus"
Contributor

Is this the right check? Is it possible for mdev_bus to exist in cases where we want to use the VFIO framework?

}
vgpuConfig := types.VGPUConfig{}
vfnum := 0
if nvdevice.SriovInfo.PhysicalFunction == nil {
Contributor

nit

Suggested change
- if nvdevice.SriovInfo.PhysicalFunction == nil {
+ if nvdevice.SriovInfo.IsVF() {

Contributor Author

Switched to this.

Comment on lines 72 to 75
if _, err := os.Stat(vfAddr); err != nil {
vfnum++
continue
}
Contributor

Why are we incrementing vfnum here when the directory does not exist? As I said above, we can just use the number of VFs already calculated when constructing the NVIDIA PCI device using go-nvlib.

Contributor Author

I'll remove this check.

return vgpuConfig, nil
}
totalVF := int(nvdevice.SriovInfo.PhysicalFunction.TotalVFs)
for vfnum < totalVF {
Contributor

Question -- don't we already know the number of VFs from nvdevice.SriovInfo.PhysicalFunction.NumVFs?

Contributor Author

I think we should be able to iterate over 0..nvdevice.SriovInfo.PhysicalFunction.NumVFs. The virtual functions outside that range should be empty. I'll switch to this.

Contributor Author

@JunAr7112 Nov 19, 2025

Would we ever run into a scenario where, say, devices were manually deployed on virtual functions in a non-sequential order?

}
totalVF := int(nvdevice.SriovInfo.PhysicalFunction.TotalVFs)
for vfnum < totalVF {
vfAddr := HostPCIDevicesRoot + "/" + nvdevice.Address + "/virtfn" + strconv.Itoa(vfnum) + "/nvidia"
Contributor

nit: when constructing file paths let's use filepath.Join() throughout.

Contributor Author

Switched to filepath.Join
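
For illustration, the path construction from the diff could look roughly like this with filepath.Join, combined with the earlier suggestion to iterate over the known number of VFs. The helper names are hypothetical, and the root directory is passed in because the value of HostPCIDevicesRoot is not shown in this conversation.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// vfNvidiaPath returns <root>/<pfAddress>/virtfn<index>/nvidia.
// The helper name is an assumption made for this sketch.
func vfNvidiaPath(root, pfAddress string, index int) string {
	return filepath.Join(root, pfAddress, "virtfn"+strconv.Itoa(index), "nvidia")
}

// vfNvidiaPaths returns the paths for VF indices 0 through numVFs-1.
func vfNvidiaPaths(root, pfAddress string, numVFs int) []string {
	var paths []string
	for i := 0; i < numVFs; i++ {
		paths = append(paths, vfNvidiaPath(root, pfAddress, i))
	}
	return paths
}

func main() {
	// The root and PCI address below are example values only.
	for _, p := range vfNvidiaPaths("/sys/bus/pci/devices", "0000:41:00.0", 2) {
		fmt.Println(p)
	}
}
```

filepath.Join also cleans the result, so a trailing slash on the root does not produce a double separator.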

remainingToCreate := val
for remainingToCreate > 0 {
vfAddr := HostPCIDevicesRoot + "/" + nvdevice.Address + "/virtfn" + strconv.Itoa(vfnum) + "/nvidia"
number, err := m.getVGPUTypeNumberforVFIO(vfAddr + "/creatable_vgpu_types", key)
Contributor

The name of this method could be improved. Maybe getIdForVGPUTypeName.

Contributor Author

Switched to this.

vfnum++
continue
}
vgpuTypeName, err := m.getVGPUTypeNameforVFIO(vfAddr + "/creatable_vgpu_types", vgpuTypeNumber)
Contributor

The name of this method could be improved. Maybe getVGPUTypeNameForId

Contributor Author

Switched to this.

Comment on lines 43 to 45
IsVFIOEnabled() (bool, error)
GetVGPUConfigforVFIO(gpu int) (types.VGPUConfig, error)
SetVGPUConfigforVFIO(gpu int, config types.VGPUConfig) error
Contributor

I would argue this interface should not change. The caller should just call GetVGPUConfig / SetVGPUConfig, and the details of vfio / mdev should be hidden from them.

Contributor Author

OK, I switched back to the old interface names. We will use the VFIO-enabled check in this file instead.

matched := make([]bool, len(gpus))
err = WalkSelectedVGPUConfigForEachGPU(c.VGPUConfig, func(vc *v1.VGPUConfigSpec, i int, d types.DeviceID) error {
configManager := vgpu.NewNvlibVGPUConfigManager()
current, err := configManager.GetVGPUConfig(i)
Contributor

Let's keep the original interface and just call configManager.GetVGPUConfig(i) here. The vfio/mdev details can be captured in that method.

Contributor Author

Switched to this.

func VGPUConfig(c *Context) error {
return assert.WalkSelectedVGPUConfigForEachGPU(c.VGPUConfig, func(vc *v1.VGPUConfigSpec, i int, d types.DeviceID) error {
configManager := vgpu.NewNvlibVGPUConfigManager()
current, err := configManager.GetVGPUConfig(i)
Contributor

Let's keep the original interface and just call configManager.GetVGPUConfig(i) / configManager.SetVGPUConfig(i) here. The vfio/mdev details can be captured in those methods.

Contributor Author

Switched to this.

@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 5 times, most recently from 8846795 to 5daf473 Compare November 24, 2025 16:54
@JunAr7112 JunAr7112 force-pushed the vfio_changes branch 6 times, most recently from 15a9586 to b1fd32d Compare December 4, 2025 17:17
Signed-off-by: Arjun <[email protected]>
3 participants