-
Notifications
You must be signed in to change notification settings - Fork 22
Create VGPU changes with VFIO Framework #139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fffa26f to
cbe892c
Compare
Signed-off-by: Arjun <[email protected]>
cbe892c to
3f31828
Compare
pkg/vgpu/config.go
Outdated
| return 0, fmt.Errorf("vGPU type %s not found in file %s", vgpuTypeName, filePath) | ||
| } | ||
|
|
||
| func (m *nvlibVGPUConfigManager) IsVFIOEnabled() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this method, you're checking if there are occurrences of ubuntu 24.04 or rhel 10 in the /etc/os-release file. How does that tell us if VFIO is enabled? I also don't see the receiver m *nvlibVGPUConfigManager being used anywhere in this method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see it would probably be better to just check the directories directly rather than look for distro. I updated the method to check for devices in /sys/class/mdev_bus.
pkg/vgpu/config.go
Outdated
| return nil, fmt.Errorf("unable to get GPU by index %d: %v", gpu, err) | ||
| } | ||
| vgpuConfig := types.VGPUConfig{} | ||
| VFnum := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not use capitalisation when naming local variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to remove capitalization
pkg/vgpu/config.go
Outdated
| return fmt.Errorf("GPU at index %d not found in available NVIDIA devices", gpu) | ||
| } | ||
|
|
||
| cmd := exec.Command("chroot", "/host", "/run/nvidia/driver/usr/lib/nvidia/sriov-manage", "-e", nvdevice.Address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't be hardcoding the /run/nvidia/driver/ path here. They should be retrieved from a parameter instead. The param is called driverRoot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know where driverRoot is defined? Should I manually define it as a constant in the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having some difficulties running sriov-manage using driverRoot from within the container as the container is built as a distroless image and cannot run sriov-manage which is a bash script
Signed-off-by: Arjun <[email protected]>
1682d72 to
4a52cbe
Compare
cdesiniotis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JunAr7112, this is a good start. As we make iterations on this and get more familiar with the internals here, it may be valuable to create a new internal/vgpu package that hides away the vfio vs mdev framework complexity. We need to think through what the right interface would be, but I imagine we will need methods for 1) getting all vGPU devices, 2) getting all parent devices (of which you can create a vGPU device on top of), 3) creating a vGPU device. The pkg/vgpu/config.go file, which is concerned with getting / setting a particular vGPU config, can invoke these methods without having to know what vfio / mdev is.
pkg/vgpu/config.go
Outdated
| // Check if mdev_bus exists and has entries | ||
| mdevBusPath := "/sys/class/mdev_bus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right check? Is it possible for mdev_bus to exist in cases where we want to use the VFIO framework?
pkg/vgpu/config.go
Outdated
| } | ||
| vgpuConfig := types.VGPUConfig{} | ||
| vfnum := 0 | ||
| if nvdevice.SriovInfo.PhysicalFunction == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
| if nvdevice.SriovInfo.PhysicalFunction == nil { | |
| if nvdevice.SriovInfo.IsVF() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this.
pkg/vgpu/config.go
Outdated
| if _, err := os.Stat(vfAddr); err != nil { | ||
| vfnum++ | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we incrementing vfnum here when the directory does not exist? As I said above, we can just use the number of VFs already calculated when constructing the NVIDIA PCI device using go-nvlib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this check.
pkg/vgpu/config.go
Outdated
| return vgpuConfig, nil | ||
| } | ||
| totalVF := int(nvdevice.SriovInfo.PhysicalFunction.TotalVFs) | ||
| for vfnum < totalVF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- don't we already know the number of VFs from nvdevice.SriovInfo.PhysicalFunction.NumVFs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be able to iterate over 0....nvdevice.SriovInfo.PhysicalFunction.NumVFs. The other virtual functions outside that range should be empty. I'll switch to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we ever run into a scenario where say devices were manually deployed on Virtual Functions in a non-sequential order?
pkg/vgpu/config.go
Outdated
| } | ||
| totalVF := int(nvdevice.SriovInfo.PhysicalFunction.TotalVFs) | ||
| for vfnum < totalVF { | ||
| vfAddr := HostPCIDevicesRoot + "/" + nvdevice.Address + "/virtfn" + strconv.Itoa(vfnum) + "/nvidia" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: when constructing file paths let's use filepath.Join() throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to filepath.Join
pkg/vgpu/config.go
Outdated
| remainingToCreate := val | ||
| for remainingToCreate > 0 { | ||
| vfAddr := HostPCIDevicesRoot + "/" + nvdevice.Address + "/virtfn" + strconv.Itoa(vfnum) + "/nvidia" | ||
| number, err := m.getVGPUTypeNumberforVFIO(vfAddr + "/creatable_vgpu_types", key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method could be improved. Maybe getIdForVGPUTypeName.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this.
pkg/vgpu/config.go
Outdated
| vfnum++ | ||
| continue | ||
| } | ||
| vgpuTypeName, err := m.getVGPUTypeNameforVFIO(vfAddr + "/creatable_vgpu_types", vgpuTypeNumber) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method could be improved. Maybe getVGPUTypeNameForId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this.
pkg/vgpu/config.go
Outdated
| IsVFIOEnabled() (bool, error) | ||
| GetVGPUConfigforVFIO(gpu int) (types.VGPUConfig, error) | ||
| SetVGPUConfigforVFIO(gpu int, config types.VGPUConfig) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue this Interface should not change. The caller should just call GetVGPUConfig / SetVGPUConfig and the details about vfio / mdev should be hidden from them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I switched back to the old interface names. We will use the VFIO enabled check in this file instead.
| matched := make([]bool, len(gpus)) | ||
| err = WalkSelectedVGPUConfigForEachGPU(c.VGPUConfig, func(vc *v1.VGPUConfigSpec, i int, d types.DeviceID) error { | ||
| configManager := vgpu.NewNvlibVGPUConfigManager() | ||
| current, err := configManager.GetVGPUConfig(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the original interface and just call configManager.GetVGPUConfig(i) here. The vfio/mdev details can be captured in that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this.
| func VGPUConfig(c *Context) error { | ||
| return assert.WalkSelectedVGPUConfigForEachGPU(c.VGPUConfig, func(vc *v1.VGPUConfigSpec, i int, d types.DeviceID) error { | ||
| configManager := vgpu.NewNvlibVGPUConfigManager() | ||
| current, err := configManager.GetVGPUConfig(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the original interface and just call configManager.GetVGPUConfig(i) / configManager.SetVGPUConfig(i) here. The vfio/mdev details can be captured in those methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this.
8846795 to
5daf473
Compare
5daf473 to
89a1e62
Compare
15a9586 to
b1fd32d
Compare
Signed-off-by: Arjun <[email protected]>
b1fd32d to
82b9164
Compare
Signed-off-by: Arjun <[email protected]>
ae21d52 to
50e2185
Compare
No description provided.