[NVIDIA 6.14-next] MPAM 24.04 linux nvidia adv 6.14 next #177

fyu1 · 2025-08-04T14:21:20Z

ARM64 MPAM is backported from Morse's mpam/snapshot/v6.16-rc5 branch: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git?h=mpam%2Fsnapshot%2Fv6.16-rc5

Please pull to 24.04_linux-nvidia-adv-6.14-next branch.

Similarly to other cpumask search functions, accept -1, and consider it as 'any CPU' hint. This helps users to avoid coding special cases. Signed-off-by: Yury Norov [NVIDIA] <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: James Morse <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Tested-by: James Morse <[email protected]> Tested-by: Tony Luck <[email protected]> Tested-by: Fenghua Yu <[email protected]> Link: https://lore.kernel.org/[email protected]
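
The '-1 means any CPU' convention can be illustrated with a small user-space sketch. It assumes a 64-bit word stands in for a cpumask; `mask_any_but()` and `NR_CPUS_SKETCH` are hypothetical names and not the kernel implementation.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 64	/* assumption: one 64-bit word models the cpumask */

/*
 * Return the first CPU set in mask that is not 'cpu', or NR_CPUS_SKETCH
 * if there is none. cpu == -1 never equals a real CPU number, so it
 * behaves as the 'any CPU' hint without a special case in the caller.
 */
unsigned int mask_any_but(uint64_t mask, int cpu)
{
	for (unsigned int i = 0; i < NR_CPUS_SKETCH; i++) {
		if (((mask >> i) & 1) && ((int)i != cpu))
			return i;
	}
	return NR_CPUS_SKETCH;
}
```

With CPUs 1 and 2 set, excluding CPU 1 yields CPU 2, while passing -1 yields CPU 1.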

The function helps to implement cpumask_andnot() APIs. Signed-off-by: Yury Norov [NVIDIA] <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: James Morse <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Tested-by: James Morse <[email protected]> Tested-by: Tony Luck <[email protected]> Tested-by: Fenghua Yu <[email protected]> Link: https://lore.kernel.org/[email protected]

Without these functions, client code has to abuse the less efficient cpumask_nth(). Signed-off-by: Yury Norov [NVIDIA] <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: James Morse <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Tested-by: Fenghua Yu <[email protected]> Tested-by: James Morse <[email protected]> Tested-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/[email protected]

Currently, if architectures want to support HOTPLUG_SMT they need to provide a topology_is_primary_thread() telling the framework which thread in the SMT cannot be offlined. However, arm64 has no restriction on which SMT thread can be offlined, so the simplest choice is to treat the first thread as the "primary" thread. Make this the default implementation in the framework and let architectures like x86 that have a special primary thread override this function (which they already do). There's no need to provide a stub function if !CONFIG_SMP or !CONFIG_HOTPLUG_SMT. In such cases the CPU being tested is already the first CPU in the SMT, so it is always the primary thread. Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Pierre Gondois <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Signed-off-by: Yicong Yang <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
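
The 'first thread is primary' default can be sketched in user space as follows. The sibling mask is modelled as a 64-bit word and `is_primary_thread()` is a hypothetical name; this is an illustration of the rule, not the framework code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * sibling_mask has bit n set if CPU n shares a core with 'cpu'
 * (including 'cpu' itself). The lowest-numbered sibling is treated
 * as the primary thread. An empty mask models the no-SMT case, where
 * the CPU is trivially primary.
 */
bool is_primary_thread(uint64_t sibling_mask, unsigned int cpu)
{
	unsigned int first = 0;

	if (sibling_mask == 0)
		return true;
	while (!((sibling_mask >> first) & 1))
		first++;
	return cpu == first;
}
```

For a core whose siblings are CPUs 2 and 3, only CPU 2 is reported as primary.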

kernfs_notify_workfn() dereferences kernfs_node::name and passes it later to fsnotify(). If the node is renamed then the previously observed name pointer becomes invalid. Acquire kernfs_root::kernfs_rwsem to block renames of the node. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

kernfs_get_parent_dentry() passes kernfs_node::parent to kernfs_get_inode(). Acquire kernfs_root::kernfs_rwsem to ensure kernfs_node::parent isn't replaced during the operation. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

kernfs_node_dentry() passes kernfs_node::name to lookup_positive_unlocked(). Acquire kernfs_root::kernfs_rwsem to ensure the node is not renamed during the operation. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

The readdir operation iterates over all entries and invokes dir_emit() for every entry passing kernfs_node::name as argument. Since the name argument can change, and become invalid, the kernfs_root::kernfs_rwsem lock should not be dropped to prevent renames during the operation. The lock drop around dir_emit() was initially introduced in commit 1e5289c ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2") to avoid holding a global lock during a page fault. The lock drop has been wrong since renames were supported, and holding the lock is not a big burden since it is no longer global. Don't drop and re-acquire kernfs_root::kernfs_rwsem while copying the name to the userspace buffer. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent} pointer. This is a preparation to access kernfs_node::parent under RCU and ensure that the pointer remains stable under the RCU lifetime guarantees. For a complete path, as it is done in kernfs_path_from_node(), the kernfs_rename_lock is still required in order to obtain a stable parent relationship while computing the relevant node depth. This must not change while the nodes are inspected in order to build the path. If the kernfs user never moves the nodes (changes the parent) then the kernfs_rename_lock is not required and the RCU guarantees are sufficient. This "restriction" can be set with KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required. Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU access and use RCU accessor while accessing the node. Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can not change. Acked-by: Tejun Heo <[email protected]> Cc: Yonghong Song <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

Using RCU lifetime rules to access kernfs_node::name can avoid the trouble with kernfs_rename_lock in kernfs_name() and kernfs_path_from_node() if the fs was created with KERNFS_ROOT_INVARIANT_PARENT. This is useful as it allows kernfs_path_from_node() to be implemented with only RCU protection, avoiding kernfs_rename_lock. The lock is only required if the __parent node can be changed and the function requires an unchanged hierarchy while it iterates from the node to its parent. The change is needed to allow the lookup of the node's path (kernfs_path_from_node()) from a context which always runs with preemption and/or interrupts disabled, even on PREEMPT_RT. The problem is that kernfs_rename_lock becomes a sleeping lock on PREEMPT_RT. I went through all ::name users and added the required access for the lookup with a few extensions: - rdtgroup_pseudo_lock_create() drops all locks and then uses the name later on. resctrl supports rename with different parents. Here I made a temporary copy of the name while it is used outside of the lock. - kernfs_rename_ns() accepts NULL as new_parent. This simplifies sysfs_move_dir_ns() where it can set NULL in order to reuse the current name. - kernfs_rename_ns() is only using kernfs_rename_lock if the parents are different. All users use either kernfs_rwsem (for stable path view) or just RCU for the lookup. The ::name is always freed via RCU. Use RCU lifetime guarantees to access kernfs_node::name. Suggested-by: Tejun Heo <[email protected]> Acked-by: Tejun Heo <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Hillf Danton <[email protected]> Closes: https://lore.kernel.org/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

Move the following sysctl tables into arch/x86/kernel/setup.c: panic_on_{unrecoverable_nmi,io_nmi} bootloader_{type,version} io_delay_type unknown_nmi_panic acpi_realmode_flags Variables moved from include/linux/ to arch/x86/include/asm/ because there is no longer need for them outside arch/x86/kernel: acpi_realmode_flags panic_on_{unrecoverable_nmi,io_nmi} Include <asm/nmi.h> in arch/x86/kernel/setup.h in order to bring in panic_on_{io_nmi,unrecovered_nmi}. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]

When the HED driver is built-in, it initializes after evged because they both are at the same initcall level, so the initialization ordering depends on the Makefile order. However, this prevents RAS records coming in between the evged driver initialization and the HED driver initialization from being handled. If the number of such RAS records is above the APEI HEST error source number, the HEST resources may be exhausted, and that may affect subsequent RAS error reporting. To fix this issue, change the initcall level of HED to subsys_initcall and prevent the driver from being built as a module by changing ACPI_HED in Kconfig from "tristate" to "bool". Signed-off-by: Xiaofei Tan <[email protected]> Link: https://patch.msgid.link/[email protected] [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>

The mcount_loc section holds the addresses of the functions that get patched by ftrace when enabling function callbacks. It can contain tens of thousands of entries. These addresses must be sorted. If they are not sorted at compile time, they are sorted at boot. Sorting at boot does take some time and does have a small impact on boot performance. x86 and arm32 have the addresses in the mcount_loc section of the ELF file. But for arm64, the section just contains zeros. The .rela.dyn Elf_Rela section holds the addresses and they get patched at boot during the relocation phase. In order to sort these addresses, the Elf_Rela needs to be updated instead of the location in the binary that holds the mcount_loc section. Have the sorttable code, allocate an array to hold the functions, load the addresses from the Elf_Rela entries, sort them, then put them back in order into the Elf_rela entries so that they will be sorted at boot up without having to sort them during boot up. Cc: bpf <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nicolas Schier <[email protected]> Cc: Zheng Yejian <[email protected]> Cc: Martin Kelly <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Will Deacon <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Alexander Gordeev <[email protected]> Link: https://lore.kernel.org/[email protected] Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
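
The load, sort, and write-back sequence described above can be sketched as below. `struct rela_entry` is a hypothetical stand-in for an Elf64_Rela entry, and the sketch assumes the caller has already narrowed the array to the relocations that belong to mcount_loc; the real sorttable code has to locate those entries itself.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an Elf64_Rela entry. */
struct rela_entry {
	uint64_t r_offset;	/* location patched at boot */
	uint64_t r_info;
	int64_t  r_addend;	/* function address written to that location */
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the addends of rela[0..n) in ascending order while leaving each
 * r_offset in place, so the patched section ends up sorted after the
 * boot-time relocation pass. Returns 0 on success, -1 on allocation failure.
 */
int sort_rela_addends(struct rela_entry *rela, size_t n)
{
	uint64_t *addrs;
	size_t i;

	if (n == 0)
		return 0;
	addrs = malloc(n * sizeof(*addrs));
	if (!addrs)
		return -1;
	for (i = 0; i < n; i++)
		addrs[i] = (uint64_t)rela[i].r_addend;
	qsort(addrs, n, sizeof(*addrs), cmp_u64);
	for (i = 0; i < n; i++)
		rela[i].r_addend = (int64_t)addrs[i];
	free(addrs);
	return 0;
}
```

Only the addends move; the offsets keep pointing at consecutive slots of the section, which is why the result is sorted once relocations have been applied.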

Some architectures need to expose architecture-specific data to the vDSO. Enable the generic vDSO storage mechanism to both store and map this data. Some architectures require more than a single page, like LoongArch, so prepare for that usecase, too. Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]

The generic storage implementation provides the same features as the custom one. However it can be shared between architectures, making maintenance easier. This switch also moves the random state data out of the time data page. The currently used hardcoded __VDSO_RND_DATA_OFFSET does not take into account changes to the time data page layout. Co-developed-by: Nam Cao <[email protected]> Signed-off-by: Nam Cao <[email protected]> Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]

Rockchip RK3566/RK3568 GIC600 integration has DDR addressing limited to the first 32bit of physical address space. Rockchip assigned Erratum ID #3568002 for this issue. Add driver quirk for this Rockchip GIC Erratum. Note, that the 0x0201743b GIC600 ID is not Rockchip-specific and is common for many ARM GICv3 implementations. Hence, there is an extra of_machine_is_compatible() check. Signed-off-by: Dmitry Osipenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/all/[email protected]
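
The two-step match (common GIC600 ID plus a machine check) can be sketched in user space like this. The function name is hypothetical, and the compatible strings "rockchip,rk3566" and "rockchip,rk3568" are assumptions made for the illustration rather than values quoted from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GIC600_ID 0x0201743bu	/* ID quoted in the commit message */

/*
 * The ID alone is not enough because it is shared by many GICv3
 * implementations, so the machine compatible string is checked too.
 */
bool rk356x_gic_quirk_applies(uint32_t gic_id, const char *machine_compatible)
{
	if (gic_id != GIC600_ID)
		return false;
	return strcmp(machine_compatible, "rockchip,rk3566") == 0 ||
	       strcmp(machine_compatible, "rockchip,rk3568") == 0;
}
```

A non-Rockchip machine with the same GIC600 ID therefore does not get the 32-bit addressing restriction.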

The current cxl region size only indicates the size of the CXL memory region without accounting for the extended linear cache size. Retrieve the cache size from HMAT and append that to the cxl region size for the cxl region range that matches the SRAT range that has extended linear cache enabled. The SRAT defines the whole memory range that includes the extended linear cache and the CXL memory region. The new HMAT ECN/ECR to the Memory Side Cache Information Structure defines the size of the extended linear cache and matches it to the SRAT Memory Affinity Structure by the memory proximity domain. Add a helper to match the cxl range to the SRAT memory range in order to retrieve the cache size. Several places check the cxl region range against the decoder range. Use the new helper to compare the two ranges and account for the new cache size. Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Li Ming <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Dave Jiang <[email protected]>

There are three variants of which Huawei released the first two simultaneously. Huawei Matebook E Go LTE(sc8180x), codename seems to be gaokun2. Huawei Matebook E Go([email protected]), codename must be gaokun3. (see [1]) Huawei Matebook E Go 2023([email protected]), codename should be also gaokun3. Adding support for the latter two variants for now, this driver should also work for the sc8180x variant according to acpi table files, but I don't have the device to test yet. Different from other Qualcomm Snapdragon sc8280xp based machines, the Huawei Matebook E Go uses an embedded controller while others use a system called PMIC GLink. This embedded controller can be used to perform a set of various functions, including, but not limited to: - Battery and charger monitoring; - Charge control and smart charge; - Fn_lock settings; - Tablet lid status; - Temperature sensors; - USB Type-C notifications (ports orientation, DP alt mode HPD); - USB Type-C PD (according to observation, up to 48 W). Add a driver for the EC which creates devices for UCSI and power supply devices. This driver is inspired by the following drivers: drivers/platform/arm64/acer-aspire1-ec.c drivers/platform/arm64/lenovo-yoga-c630.c drivers/platform/x86/huawei-wmi.c Thanks also to the reviewers, whose feedback improved this patch a lot. [1] https://bugzilla.kernel.org/show_bug.cgi?id=219645 Signed-off-by: Pengyu Luo <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ilpo Järvinen <[email protected]>

…ce list Resctrl occasionally wants to know something about a specific resource, in these cases it reaches into the arch code's rdt_resources_all[] array. Once the filesystem parts of resctrl are moved to /fs/, this means it will need visibility of the architecture specific struct rdt_hw_resource definition, and the array of all resources. All architectures would also need a r_resctrl member in this struct. Instead, abstract this via a helper to allow architectures to do different things here. Move the level enum to the resctrl header and add a helper to retrieve the struct rdt_resource by 'rid'. resctrl_arch_get_resource() should not return NULL for any value in the enum, it may instead return a dummy resource that is !alloc_enabled && !mon_enabled. Co-developed-by: Dave Martin <[email protected]> Signed-off-by: Dave Martin <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
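
The 'never return NULL, return a dummy instead' contract can be sketched as follows. All type, enum, and function names here are hypothetical simplifications; the field names follow the wording of the commit message, not the kernel's struct rdt_resource.

```c
#include <stdbool.h>

/* Hypothetical resource levels shared by filesystem and arch code. */
enum res_level { RES_L3, RES_L2, RES_MBA, RES_SMBA, RES_NUM_LEVELS };

struct resource {
	enum res_level rid;
	bool alloc_enabled;
	bool mon_enabled;
};

static struct resource l3_res = { RES_L3, true, true };
static struct resource mba_res = { RES_MBA, true, false };
/* Returned for levels this hypothetical architecture does not implement. */
static struct resource dummy_res = { RES_NUM_LEVELS, false, false };

static struct resource *res_table[RES_NUM_LEVELS] = {
	[RES_L3] = &l3_res,
	[RES_MBA] = &mba_res,
};

/* Look up a resource by 'rid'; callers never have to check for NULL. */
struct resource *get_resource(enum res_level rid)
{
	if ((unsigned int)rid < RES_NUM_LEVELS && res_table[rid])
		return res_table[rid];
	return &dummy_res;
}
```

Filesystem-side callers can then test the two flags on whatever comes back, instead of reaching into an architecture-specific array.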

The resctrl arch code specifies whether a resource controls a cache or memory using the fflags field. This field is then used by resctrl to determine which files should be exposed in the filesystem. Allowing the architecture to pick this value means the RFTYPE_ flags have to be in a shared header, and allows an architecture to create a combination that resctrl does not support. Remove the fflags field, and pick the value based on the resource id. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
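
Picking the file flags from the resource id might look like the sketch below. The enum, the flag values, and `fflags_from_rid()` are hypothetical; the point is only that the mapping lives in one place and the architecture cannot supply an unsupported combination.

```c
/* Hypothetical resource levels and file-type flags for the illustration. */
enum res_level { RES_L3, RES_L2, RES_MBA, RES_SMBA, RES_NUM_LEVELS };

#define FTYPE_RES_CACHE	0x1u	/* files for cache resources */
#define FTYPE_RES_MB	0x2u	/* files for memory bandwidth resources */

/* Derive the flags from the resource id instead of an arch-provided field. */
unsigned int fflags_from_rid(enum res_level rid)
{
	switch (rid) {
	case RES_L3:
	case RES_L2:
		return FTYPE_RES_CACHE;
	case RES_MBA:
	case RES_SMBA:
		return FTYPE_RES_MB;
	default:
		return 0;	/* unknown id: expose no resource files */
	}
}
```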

Resctrl's architecture code gets to specify a function pointer that is used when parsing schema entries. This is expected to be one of two helpers from the filesystem code. Setting this function pointer allows the architecture code to change the ABI resctrl presents to user-space, and forces resctrl to expose these helpers. Instead, add a schema format enum to choose which schema parser to use. This allows the helpers to be made static and the structs used for passing arguments moved out of shared headers. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Resctrl's architecture code gets to specify a format string that is used when printing schema entries. This is expected to be one of two values that the filesystem code supports. Setting this format string allows the architecture code to change the ABI resctrl presents to user-space. Instead, use the schema format enum to choose which format string to use. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

The resctrl architecture code provides a data_width for the controls of each resource. This is used to zero pad all control values in the schemata file so they appear in columns. The same is done with the resource names to complete the visual effect. e.g. | SMBA:0=2048 | L3:0=00ff AMD platforms discover their maximum bandwidth for the MB resource from firmware, but hard-code the data_width to 4. If the maximum bandwidth requires more digits - the tabular format is silently broken. This is also broken when the mba_MBps mount option is used as the field width isn't updated. If new schema are added resctrl will need to be able to determine the maximum width. The benefit of this pretty-printing is questionable. Instead of handling runtime discovery of the data_width for AMD platforms, remove the feature. These fields are always zero padded so should be harmless to remove if the whole field has been treated as a number. In the above example, this would now look like this: | SMBA:0=2048 | L3:0=ff Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
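
The effect of dropping the zero padding can be reproduced with plain snprintf(). `format_entry()` is a hypothetical helper: a width of 4 mimics the old hard-coded data_width, and a width of 0 gives the unpadded output.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format one "domain=value" schemata entry in hex. The '0' flag with a
 * '*' width pads with zeros up to 'width' characters; width 0 disables
 * the padding. Returns the number of characters snprintf() produced.
 */
int format_entry(char *buf, size_t len, int dom_id, unsigned int val, int width)
{
	return snprintf(buf, len, "%d=%0*x", dom_id, width, val);
}
```

For domain 0 and value 0xff this yields "0=00ff" with width 4 and "0=ff" with width 0, matching the before and after example in the commit message.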

__rdt_get_mem_config_amd() and __get_mem_config_intel() both use the default_ctrl property as a maximum value. This is because the MBA schema works differently between these platforms. Doing this complicates determining whether the default_ctrl property belongs to the arch code, or can be derived from the schema format. Deriving the maximum or default value from the schema format would avoid the architecture code having to tell resctrl such obvious things as the maximum percentage is 100, and the maximum bitmap is all ones. Maximum bandwidth is always going to vary per platform. Add max_bw as a special case. This is currently used for the maximum MBA percentage on Intel platforms, but can be removed from the architecture code if 'percentage' becomes a schema format resctrl supports directly. This value isn't needed for other schema formats. This will allow the default_ctrl to be generated from the schema properties when it is needed. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

The struct rdt_resource default_ctrl is used by both the architecture code for resetting the hardware controls, and sometimes by the filesystem code as the default value for the schema, unless the bandwidth software controller is in use. Having the default exposed by the architecture code causes unnecessary duplication for each architecture as the default value must be specified, but can be derived from other schema properties. Now that the maximum bandwidth is explicitly described, resctrl can derive the default value from the schema format and the other resource properties. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
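
Deriving the default from the schema format could be sketched like this. The enum, struct, and function names are hypothetical, and cbm_len is assumed to be between 1 and 32.

```c
#include <stdint.h>

enum schema_fmt { SCHEMA_BITMAP, SCHEMA_RANGE };	/* hypothetical names */

struct resource_props {
	enum schema_fmt fmt;
	unsigned int cbm_len;	/* number of bits in a bitmap control */
	uint32_t max_bw;	/* maximum bandwidth for range controls */
};

/* Default control value: all ones for bitmaps, max_bw for ranges. */
uint32_t default_ctrl(const struct resource_props *r)
{
	switch (r->fmt) {
	case SCHEMA_BITMAP:
		/* 64-bit shift so that cbm_len == 32 does not overflow */
		return (uint32_t)((UINT64_C(1) << r->cbm_len) - 1);
	case SCHEMA_RANGE:
		return r->max_bw;
	}
	return 0;
}
```

An 8-bit cache bitmap defaults to 0xff and a percentage-style range with max_bw 100 defaults to 100, so the architecture no longer has to state either value.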

rdtgroup_rmdir_ctrl() and rdtgroup_rmdir_mon() set the per-CPU pqr_state for CPUs that were part of the rmdir()'d group. Another architecture might not have a 'pqr_state', its hardware may need the values in a different format. MPAM's equivalent of RMID values are not unique, and always need the CLOSID to be provided too. There is only one caller that modifies a single value, (rdtgroup_rmdir_mon()). MPAM always needs both CLOSID and RMID for the hardware value as these are written to the same system register. As rdtgroup_rmdir_mon() has the CLOSID on hand, only provide a helper to set both values. These values are read by __resctrl_sched_in(), but may be written by a different CPU without any locking, add READ_ONCE()/WRITE_ONCE() to avoid torn values. Co-developed-by: Dave Martin <[email protected]> Signed-off-by: Dave Martin <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
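
A user-space approximation of a helper that sets both values is sketched below. The READ_ONCE()/WRITE_ONCE() macros are simplified stand-ins built on volatile accesses and the GCC/Clang `__typeof__` extension, and the per-CPU array, the function names, and the 64-bit packing in `sched_in_value()` are all hypothetical.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4

/* Simplified stand-ins for the kernel macros (GCC/Clang extension). */
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))

struct cpu_defaults {
	uint32_t closid;
	uint32_t rmid;
};

static struct cpu_defaults defaults[NR_CPUS_SKETCH];

/* Single helper that always sets both values, as MPAM needs the pair. */
void set_cpu_default_closid_rmid(unsigned int cpu, uint32_t closid, uint32_t rmid)
{
	WRITE_ONCE(defaults[cpu].closid, closid);
	WRITE_ONCE(defaults[cpu].rmid, rmid);
}

/* Reader side: each field is loaded exactly once before being combined. */
uint64_t sched_in_value(unsigned int cpu)
{
	uint32_t closid = READ_ONCE(defaults[cpu].closid);
	uint32_t rmid = READ_ONCE(defaults[cpu].rmid);

	return ((uint64_t)closid << 32) | rmid;
}
```

The volatile accesses keep the compiler from splitting or repeating each individual load and store; they do not make the pair of fields update atomically as a unit.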

update_cpu_closid_rmid() takes a struct rdtgroup as an argument, which it uses to update the local CPUs default pqr values. This is a problem once the resctrl parts move out to /fs/, as the arch code cannot poke around inside struct rdtgroup. Rename update_cpu_closid_rmid() as resctrl_arch_sync_cpus_defaults() to be used as the target of an IPI, and pass the effective CLOSID and RMID in a new struct. Co-developed-by: Dave Martin <[email protected]> Signed-off-by: Dave Martin <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

rdtgroup_init() needs exposing to the rest of the kernel so that arch code can call it once it lives in core code. As this is one of the few functions exposed, rename it to have "resctrl" in the name. The same goes for the exit call. Rename x86's arch code init functions for RDT to have an arch prefix to make it clear these are part of the architecture code. Co-developed-by: Dave Martin <[email protected]> Signed-off-by: Dave Martin <[email protected]> Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

rdt_find_domain() finds a domain given a resource and a cache-id. This is used by both the architecture code and the filesystem code. After the filesystem code moves to live in /fs/, this helper is either duplicated by all architectures, or needs exposing by the filesystem code. Add the declaration to the global header file. As it's now globally visible, and has only a handful of callers, swap the 'rdt' for 'resctrl'. Move the function to live with its caller in ctrlmondata.c as the filesystem code will not have anything corresponding to core.c. Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Babu Moger <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Shaopeng Tan <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

When resctrl is fully factored into core and per-arch code, each arch will need to use some resctrl common definitions in order to define its own specializations and helpers. Following conventional practice, it would be desirable to put the dependent arch definitions in an <asm/resctrl.h> header that is included by the common <linux/resctrl.h> header. However, this can make it awkward to avoid a circular dependency between <linux/resctrl.h> and the arch header. To avoid such dependencies, move the affected common types and constants into a new header that does not need to depend on <linux/resctrl.h> or on the arch headers. The same logic applies to the monitor-configuration defines, move these too. Some kind of enumeration for events is needed between the filesystem and architecture code. Take the x86 definition as it's convenient for x86. The definition of enum resctrl_event_id is needed to allow the architecture code to define resctrl_arch_mon_ctx_alloc() and resctrl_arch_mon_ctx_free(). The definition of enum resctrl_res_level is needed to allow the architecture code to define resctrl_arch_set_cdp_enabled() and resctrl_arch_get_cdp_enabled(). The bits for mbm_local_bytes_config et al are ABI, and must be the same on all architectures. These are documented in Documentation/arch/x86/resctrl.rst. The maintainers entry for these headers was missed when resctrl.h was created. Add a wildcard entry to match both resctrl.h and resctrl_types.h.
Signed-off-by: James Morse <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Shaopeng Tan <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Carl Worth <[email protected]> # arm64 Tested-by: Shaopeng Tan <[email protected]> Tested-by: Peter Newman <[email protected]> Tested-by: Amit Singh Tomar <[email protected]> # arm64 Tested-by: Shanker Donthineni <[email protected]> # arm64 Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]

resctrl exposes a counter via a file named llc_occupancy. This isn't really a counter as its value goes up and down; it is a snapshot of the cache storage usage monitor. Add some picking code to find a cache as close as possible to the L3 that supports the CSU monitor. If there is an L3, but it doesn't have any controls, force the L3 resource to exist. The existing topology_matches_l3() and mpam_resctrl_domain_hdr_init() code will ensure this looks like the L3, even if the class belongs to a later cache. Signed-off-by: James Morse <[email protected]>

resctrl has two types of counters, NUMA-local and global. MPAM has only bandwidth counters, but the position of the MSC may mean it counts NUMA-local or global traffic. The topology information needed to tell these apart is not available. Apply a heuristic: if the L2 or L3 supports bandwidth monitors, these are probably NUMA-local. If the memory controller supports bandwidth monitors, they are probably global. This also allows us to assert that we don't have the same class backing two different resctrl events. Signed-off-by: James Morse <[email protected]>
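
As an illustration of the heuristic described in this commit message, here is a minimal user-space sketch. The enum and function names are hypothetical and are not taken from the driver; only the rule itself (cache-level monitors are treated as local, memory-controller monitors as global) comes from the text above.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; the driver's real types differ. */
enum msc_class_type { CLASS_CACHE_L2, CLASS_CACHE_L3, CLASS_MEMORY };
enum mbm_event { MBM_NONE, MBM_LOCAL, MBM_TOTAL };

/* Pick the resctrl event that a class with bandwidth monitors should back. */
enum mbm_event pick_mbm_event(enum msc_class_type type, bool has_mbwu)
{
	if (!has_mbwu)
		return MBM_NONE;

	switch (type) {
	case CLASS_CACHE_L2:
	case CLASS_CACHE_L3:
		return MBM_LOCAL;	/* close to the CPUs: probably NUMA-local */
	case CLASS_MEMORY:
		return MBM_TOTAL;	/* memory controller: probably global */
	}
	return MBM_NONE;
}
```

Because each class type maps to exactly one event, no class can end up backing both events at once.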

When there are enough monitors, the resctrl mbm local and total files can be exposed. These need all the monitors that resctrl may use to be allocated up front. Add helpers to do this. If a different candidate class is discovered, the old array should be freed and the allocated monitors returned to the driver. Signed-off-by: James Morse <[email protected]>

When resctrl wants to read a domain's 'QOS_L3_OCCUP', it needs to allocate a monitor on the corresponding resource. Monitors are allocated by class instead of component. Add helpers to do this. The MBM events depend on having their monitors allocated at init time so that they can be left running. Where possible, allocate the MBWU monitors up front so the mbm_{local,total} files can be exposed via resctrl. resctrl_arch_mon_ctx_alloc() is implemented to have a no_wait version and a waitqueue for callers that sleep. The no_wait version will later become an interface for the resctrl_pmu to use. Signed-off-by: James Morse <[email protected]>

…t_rmid() resctrl uses resctrl_arch_rmid_read() to read counters. CDP emulation means the counter may need reading twice to get both the I and D side allocations. The same goes for reset. Add the rounding helper for checking monitor values while we're here. Signed-off-by: James Morse <[email protected]>

MPAM MSCs may have support for filtering reads or writes when monitoring traffic. Resctrl has a configuration bitmap for which kind of accesses should be monitored. Bridge the gap where possible. MPAM only has a read/write bit, so not all the combinations can be supported. Signed-off-by: James Morse <[email protected]>
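
The gap between resctrl's configuration bitmap and a single read/write filter could be bridged along the following lines. This is a sketch only: the grouping of bitmap bits into "reads" and "writes" is an assumption made for the example, and the names are hypothetical. A bitmap that selects only part of a group cannot be expressed and is reported as unsupported.

```c
#include <stdint.h>

/*
 * Assumed grouping of the resctrl event-configuration bits (sketch only):
 * bits 0, 1, 4, 5 are treated as reads, bits 2, 3, 6 as writes.
 */
#define CFG_READ_BITS	0x33u
#define CFG_WRITE_BITS	0x4cu

enum mbwu_filter { FLT_UNSUPPORTED, FLT_READS, FLT_WRITES, FLT_BOTH };

/* Map a configuration bitmap onto what a read/write-only filter can express. */
enum mbwu_filter mbwu_filter_from_config(uint32_t cfg)
{
	if (cfg == (CFG_READ_BITS | CFG_WRITE_BITS))
		return FLT_BOTH;
	if (cfg == CFG_READ_BITS)
		return FLT_READS;
	if (cfg == CFG_WRITE_BITS)
		return FLT_WRITES;

	/* e.g. "local reads only": no way to say this with one read/write bit */
	return FLT_UNSUPPORTED;
}
```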

resctrl has individual hooks to separately enable and disable the closid/partid and rmid/pmg context switching code. For MPAM this is all the same thing, as the value in struct task_struct is used to cache the value that should be written to hardware. arm64's context switching code is enabled once MPAM is usable, but doesn't touch the hardware unless the value has changed. Resctrl doesn't need to ask. Add empty definitions for these hooks. Signed-off-by: James Morse <[email protected]>

Enough MPAM support is present to enable ARCH_HAS_CPU_RESCTRL. Let it rip^Wlink! ARCH_HAS_CPU_RESCTRL indicates resctrl can be enabled. It is enabled by the arch code simply because it has 'arch' in its name. This removes ARM_CPU_RESCTRL as a mimic of X86_CPU_RESCTRL and defines a dummy ARM64_MPAM_DRIVER to hold the bits and pieces relevant to the MPAM driver. While here, move the ACPI dependency to the driver's Kconfig file. Signed-off-by: James Morse <[email protected]>

…monitors On platforms with no monitors the rmid_ptrs[] array is not allocated. The rmid on these platforms is likely to be '0' for all control groups, which may lead to free_rmid() being called on rmid 0. Dave points out that the index == (0,0) check to skip freeing of a non-existent rmid is not sufficient on MPAM because the provided closid may be non-zero. The index can't be used to spot this case. Instead, check if there are any resctrl monitors enabled. This avoids a null pointer dereference in free_rmid() when control groups are freed. It isn't possible to hit this on x86 platforms. This patch is to be replaced by one from Dave. Reported-by: Dave Martin <[email protected]> Signed-off-by: James Morse <[email protected]> Tested-by: Shaopeng Tan <[email protected]> Tested-by: Shanker Donthineni <[email protected]> # arm64

…id[] On MPAM systems, if an error occurs the architecture code will call resctrl_exit(). This calls dom_data_exit(), which takes the rdtgroup_mutex and kfree()s closid_num_dirty_rmid[]. It is possible that another syscall tries to access that same array in the meantime, but is blocked on the mutex. Once dom_data_exit() completes, that syscall will see a NULL pointer. Pull the IS_ENABLED() Kconfig checks into a helper and additionally check that the array has been allocated. This will cause callers to fall back to the regular CLOSID allocation strategy. Signed-off-by: James Morse <[email protected]>

On MPAM systems, if an error occurs the architecture code will call resctrl_exit(). This calls dom_data_exit(), which takes the rdtgroup_mutex and kfree()s rmid_ptrs[]. It is possible that another syscall tries to access that same array in the meantime, but is blocked on the mutex. Once dom_data_exit() completes, that syscall will see a NULL pointer. Make __rmid_entry() return NULL in this case. Neither __check_limbo() nor free_rmid() return an error, and can silently stop their work if this occurs. dom_data_init() has only just allocated the array and still holds the lock, so __rmid_entry() should never return NULL here. Signed-off-by: James Morse <[email protected]>

Carl reports that when both the MPAM driver and CMN driver are built into the kernel, they fight over who can claim the resources associated with their registers. This prevents the second of these two drivers from probing. Currently the CMN PMU driver claims all the CMN registers. The MPAM registers are grouped together in a small number of pages, whereas the PMU registers that the CMN PMU driver uses appear throughout the CMN register space. Having the CMN driver claim all the resources is the wrong thing to do, and claiming individual registers here and there is not worthwhile. Instead, stop the CMN driver from claiming any resources as its registers are not grouped together. Reported-by: Carl Worth <[email protected]> Tested-by: Carl Worth <[email protected]> Signed-off-by: James Morse <[email protected]> CC: Ilkka Koskinen <[email protected]>

Now that mpam links against resctrl, call the cpu and domain online/offline calls at the appropriate point. Signed-off-by: James Morse <[email protected]>

All of MPAM's errors indicate a software bug, e.g. an out-of-bounds partid has been generated. When this happens, the mpam driver is disabled. If resctrl_init() succeeded, also call resctrl_exit() to remove resctrl. mpam_devices.c calls mpam_resctrl_teardown_class() when a class becomes incomplete, and can no longer be used by resctrl. If resctrl was using this class, then resctrl_exit() is called. This in turn removes the kernfs hierarchy from the filesystem and free()s memory that was allocated by resctrl. Signed-off-by: James Morse <[email protected]>

resctrl's limbo code needs to be told when the data left in a cache is small enough for the partid+pmg value to be re-allocated. x86 uses the cache size divided by the number of rmid users the cache may have. Do the same, but for the smallest cache, and with the number of partid-and-pmg users. Querying the cache size can't happen until after cacheinfo_sysfs_init() has run, so mpam_resctrl_setup() must wait until then. Signed-off-by: James Morse <[email protected]>
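
The threshold arithmetic described above can be sketched as follows. The function name is hypothetical, and treating the number of partid-and-pmg users as the product of the two counts is an assumption of this sketch rather than something the commit message spells out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: occupancy (in bytes) below which a partid+pmg pair
 * is considered clean enough to re-allocate. Uses the smallest cache so the
 * threshold is safe for every cache.
 */
uint64_t limbo_threshold_bytes(const uint64_t *cache_sizes, size_t n,
			       uint32_t num_partid, uint32_t num_pmg)
{
	/* Assumption: every partid/pmg combination counts as one user. */
	uint64_t users = (uint64_t)num_partid * num_pmg;
	uint64_t smallest = UINT64_MAX;

	if (n == 0 || users == 0)
		return 0;

	for (size_t i = 0; i < n; i++)
		if (cache_sizes[i] < smallest)
			smallest = cache_sizes[i];

	return smallest / users;
}
```

For example, with caches of 32 MiB and 16 MiB, 64 partids and 4 pmgs, the sketch yields 16 MiB / 256 = 65536 bytes.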

resctrl expects the domain list to be sorted by id. Do that. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Pulled out of a larger patch ] Signed-off-by: James Morse <[email protected]>
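
Keeping a list ordered by id at insertion time might look like the sketch below. It uses a plain singly linked list with hypothetical names so that it stays self-contained; the kernel code works on its own list type instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for a resctrl domain; only the id matters here. */
struct domain {
	int id;
	struct domain *next;
};

/* Insert d so that the list stays sorted by ascending id. */
void domain_insert_sorted(struct domain **head, struct domain *d)
{
	struct domain **pos = head;

	/* Walk to the first entry whose id is not smaller than the new one. */
	while (*pos && (*pos)->id < d->id)
		pos = &(*pos)->next;

	d->next = *pos;
	*pos = d;
}
```

Inserting in sorted position avoids a separate sort pass whenever a domain comes online.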

MPAM supports a minimum and maximum control for memory bandwidth. The purpose of the minimum control is to give priority to tasks that are below their minimum value. Resctrl only provides one value for the bandwidth configuration, which is used for the maximum. The minimum control is always programmed to zero on hardware that supports it. Generate a minimum bandwidth value that is 5% lower than the value provided by resctrl. This means tasks that are not receiving their target bandwidth can be prioritised by the hardware. CC: Zeng Heng <[email protected]> Signed-off-by: James Morse <[email protected]>
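
A minimal sketch of deriving the minimum from the resctrl-provided maximum follows. It works in whole percent and assumes that "5% lower" means five percentage points; the message above does not say whether the gap is absolute or relative, and the function name is hypothetical.

```c
/* Assumption: the gap is five percentage points of full bandwidth. */
#define MBW_MIN_GAP_PCT	5u

/* Derive a minimum-bandwidth percentage from the configured maximum. */
unsigned int mbw_min_from_max(unsigned int max_pct)
{
	if (max_pct > 100)
		max_pct = 100;

	/* Clamp at zero rather than wrapping for very small maximums. */
	return max_pct > MBW_MIN_GAP_PCT ? max_pct - MBW_MIN_GAP_PCT : 0;
}
```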

The MPAM specification includes the MPAMF_IIDR, which serves to uniquely identify the MSC implementation through a combination of implementer details, product ID, variant, and revision. Certain hardware issues/errata can be resolved using software workarounds. Introduce a quirk framework to allow workarounds to be enabled based on the MPAMF_IIDR value. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Stash the IIDR so this doesn't need an IPI, enable quirks only once, move the description to the callback so it can be pr_once()d, add an enum of workarounds for popular errata. Add macros for making lists of product/revision/vendor half readable ] Signed-off-by: James Morse <[email protected]>
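
An IIDR-keyed quirk table could be sketched as below. The table entries, masks and quirk names are made up for the example and do not correspond to any real implementation or erratum; only the idea of matching a stashed MPAMF_IIDR value against a list of known-affected parts comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk identifiers. */
enum quirk_id { QUIRK_A, QUIRK_B, QUIRK_MAX };

struct quirk {
	uint32_t iidr;		/* value to match after masking */
	uint32_t iidr_mask;	/* which IIDR bits take part in the match */
	enum quirk_id id;
};

/* Made-up IIDR values, for illustration only. */
static const struct quirk quirk_table[] = {
	{ .iidr = 0x12300abc, .iidr_mask = 0xffffffff, .id = QUIRK_A },
	{ .iidr = 0x45600000, .iidr_mask = 0xfff00000, .id = QUIRK_B },
};

/* Return a bitmap of quirks that apply to an MSC with this IIDR. */
uint32_t quirks_for_iidr(uint32_t iidr)
{
	uint32_t enabled = 0;

	for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++)
		if ((iidr & quirk_table[i].iidr_mask) == quirk_table[i].iidr)
			enabled |= UINT32_C(1) << quirk_table[i].id;

	return enabled;
}
```

Storing the result as a per-MSC bitmap means later code only has to test a bit, which fits the note about enabling quirks once.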

The MPAM bandwidth partitioning controls will not be correctly configured, and hardware will retain default configuration register values, meaning generally that bandwidth will remain unprovisioned. To address the issue, follow the steps below after updating the MBW_MIN and/or MBW_MAX registers.

- Perform 64b reads from all 12 bridge MPAM shadow registers at offsets (0x360048 + slice*0x10000 + partid*8). These registers are read-only.
- Continue iterating until all 12 shadow register values match in a loop. pr_warn_once if the values fail to match within the loop count 1000.
- Perform 64b writes with the value 0x0 to the two spare registers at offsets 0x1b0000 and 0x1c0000.

In the hardware, writes to the MPAMCFG_MBW_MAX and MPAMCFG_MBW_MIN registers are transformed into broadcast writes to the 12 shadow registers. The final two writes to the spare registers cause a final rank of downstream micro-architectural MPAM registers to be updated from the shadow copies. The intervening loop to read the 12 shadow registers helps avoid a race condition where writes to the spare registers occur before all shadow registers have been updated.

Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Merged the min/max update into a single mpam_quirk_post_config_change() helper. Stashed the t241_id in the msc instead of carrying the physical address around. Test the msc quirk bit instead of a static key. ] Signed-off-by: James Morse <[email protected]>
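
The offset calculation and the polling loop from this workaround can be sketched in user space as follows. The constants come from the commit message; the function names, the read callback and the simulated device are hypothetical and exist only to make the example self-contained. The final writes to the two spare registers are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define T241_SLICES		12u
#define T241_SHADOW_BASE	0x360048u
#define T241_SLICE_STRIDE	0x10000u
#define T241_POLL_LIMIT		1000

/* Hypothetical 64-bit register read hook. */
typedef uint64_t (*read64_fn)(uint32_t offset, void *ctx);

/* Offset of the shadow register for a given slice and partid. */
uint32_t t241_shadow_offset(unsigned int slice, unsigned int partid)
{
	return T241_SHADOW_BASE + slice * T241_SLICE_STRIDE + partid * 8;
}

/* Poll until all 12 shadow registers agree; false if the limit is hit. */
bool t241_wait_shadows(read64_fn rd, void *ctx, unsigned int partid)
{
	for (int tries = 0; tries < T241_POLL_LIMIT; tries++) {
		uint64_t first = rd(t241_shadow_offset(0, partid), ctx);
		bool same = true;

		for (unsigned int s = 1; s < T241_SLICES; s++)
			if (rd(t241_shadow_offset(s, partid), ctx) != first)
				same = false;

		if (same)
			return true;
	}
	return false;	/* the real driver would warn once here */
}

/* Simulated device: returns per-slice junk until the broadcast settles. */
struct sim_msc {
	unsigned int stale_reads;
};

uint64_t sim_read64(uint32_t offset, void *ctx)
{
	struct sim_msc *m = ctx;

	if (m->stale_reads) {
		m->stale_reads--;
		return offset;	/* differs per slice, so values do not match */
	}
	return 0x42;
}
```

With `stale_reads` set to a small number the loop settles on the second pass; with a value larger than 12 * 1000 it gives up and returns false.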

In the T241 implementation of memory-bandwidth partitioning, in the absence of contention for bandwidth, the minimum bandwidth setting can affect the amount of achieved bandwidth. Specifically, the achieved bandwidth in the absence of contention can settle to any value between the values of MPAMCFG_MBW_MIN and MPAMCFG_MBW_MAX. Also, if MPAMCFG_MBW_MIN is set to zero (below 0.78125%), once a core enters a throttled state, it will never leave that state. The first issue is not a concern if the MPAM software allows MPAMCFG_MBW_MIN to be programmed through the sysfs interface. This patch ensures that MBW_MIN=1 (0.78125%) is programmed whenever MPAMCFG_MBW_MIN=0 is requested. In the scenario where resctrl doesn't support the MBW_MIN interface via sysfs, to achieve bandwidth closer to MBW_MAX in the absence of contention, software should configure a relatively narrow gap between MBW_MIN and MBW_MAX. The recommendation is to use a 5% gap to mitigate the problem. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Added as second quirk, adapted to use the new intermediate values in mpam_extend_config() ] Signed-off-by: James Morse <[email protected]>
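
The zero-to-one fix-up for the minimum can be illustrated with the short sketch below. The function names are hypothetical, and the 1/128 step size is inferred from the 0.78125% figure quoted above rather than stated outright.

```c
/* Assumption: one field step is 1/128 of full bandwidth (0.78125%). */
#define MBW_FIELD_STEPS	128.0

/* Never let a zero minimum reach the hardware; use the smallest step. */
unsigned int t241_fixup_mbw_min(unsigned int min_field)
{
	return min_field ? min_field : 1;
}

/* Convert a field value to a percentage of full bandwidth. */
double mbw_field_to_percent(unsigned int field)
{
	return field * 100.0 / MBW_FIELD_STEPS;
}
```

Non-zero requests pass through unchanged, so the fix-up only affects the case that would otherwise leave a core stuck in the throttled state.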

The registers MSMON_MBWU_L and MSMON_MBWU return the number of requests rather than the number of bytes transferred. Bandwidth resource monitoring is performed at the last level cache, where each request arrives at 64-byte granularity. The current implementation returns the number of transactions received at the last level cache but does not provide the value in bytes. Scaling by 64 gives an accurate byte count to match the MPAM specification for the MSMON_MBWU and MSMON_MBWU_L registers. This patch fixes the issue by reporting the actual number of bytes instead of the number of transactions from __ris_msmon_read(). Signed-off-by: Shanker Donthineni <[email protected]> Signed-off-by: James Morse <[email protected]>
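
The scaling itself is a single multiplication, sketched here with hypothetical names. The 64-byte request size is the figure given in the message above.

```c
#include <stdint.h>

/* Each request seen at the last level cache moves 64 bytes. */
#define MBWU_REQUEST_BYTES	64u

/* Convert a raw request count into the byte count resctrl expects. */
uint64_t mbwu_requests_to_bytes(uint64_t requests)
{
	return requests * MBWU_REQUEST_BYTES;
}
```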

CMN-650 is afflicted with an erratum where the CSU NRDY bit never clears. This tells us the monitor never finishes scanning the cache. The erratum document says to wait the maximum time, then ignore the field. Add a flag to indicate whether this is the final attempt to read the counter, and when this quirk is applied, ignore the NRDY field. This means accesses to this counter will always retry, even if the counter was previously programmed to the same values. The counter value is not expected to be stable, it drifts up and down with each allocation and eviction. The CSU register provides the value for a point in time. Signed-off-by: James Morse <[email protected]>
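
The decision this quirk adds to the read path might be sketched as follows. The names are hypothetical; the logic mirrors the description above, where NRDY is honoured on every attempt except the final one on affected hardware.

```c
#include <stdbool.h>

enum read_status { READ_OK, READ_RETRY };

/*
 * Decide whether a CSU counter value can be used.
 * nrdy:             the monitor reports "not ready"
 * final_attempt:    the caller has already waited the maximum time
 * nrdy_stuck_quirk: this MSC is known to never clear NRDY
 */
enum read_status csu_read_status(bool nrdy, bool final_attempt,
				 bool nrdy_stuck_quirk)
{
	if (!nrdy)
		return READ_OK;

	/* Affected hardware: after the maximum wait, ignore the field. */
	if (nrdy_stuck_quirk && final_attempt)
		return READ_OK;

	return READ_RETRY;
}
```

On affected hardware the first attempt always reports a retry, which matches the note that accesses to this counter will always retry.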

debugfs has handy helpers to make a bool, integer or string available through debugfs. Add helpers to do the same for cpumasks. These are read only. CC: Ben Horgan <[email protected]> Signed-off-by: James Morse <[email protected]>

Not all of MPAM is visible through the resctrl user-space interface. To make it easy to debug why certain devices were not exposed through resctrl, allow the properties of the devices to be read through debugfs. This adds an mpam directory to debugfs, and exposes the devices as well as the hierarchy that was built. Signed-off-by: James Morse <[email protected]>

MPAM has an error interrupt that can be triggered by an MSC when corrupt or out of range values are seen. The hardware only needs to raise an error interrupt if the error was detected; it is also permissible for the hardware to just use the corrupt or out of range value. All the reasons to raise an error indicate a software bug. When the error interrupt is triggered, the MPAM driver attempts to reset all the CPUs back to PARTID-0 and reset PARTID-0 to be unrestricted. This is done to ensure important tasks aren't accidentally given the performance of unimportant tasks. This teardown path in the driver is hard to trigger. Add a debugfs file to poke this manually. It is expected you have to reboot to make MPAM work again after this. Signed-off-by: James Morse <[email protected]>

It's really popular to tie NRDY high, and then act surprised when the OS never reads the counters, because they aren't ready. The spec obliges hardware to clear this bit automatically before the firmware advertised timeout. To make it easier to find errant hardware, count the number of retries and expose that number in debugfs. Signed-off-by: James Morse <[email protected]>

nvmochs · 2025-08-06T21:23:04Z

Closing, this is replaced by #185

YuryNorov and others added 30 commits August 4, 2025 03:28

James Morse and others added 26 commits August 4, 2025 03:28

arm_mpam: resctrl: Tell resctrl about cpu/domain online/offline

edfed21

Now that mpam links against resctrl, call the cpu and domain online/offline calls at the appropriate point. Signed-off-by: James Morse <[email protected]>

FIX ME: arm_mpam: Sort the domain list by domain-id

8cfa5a8

resctrl expects the domain list to be sorted by id. Do that. Signed-off-by: Shanker Donthineni <[email protected]> [ morse: Pulled out of a larger patch ] Signed-off-by: James Morse <[email protected]>

fyu1 changed the title ~~Pull request for MPAM 24.04 linux nvidia adv 6.14 next~~ [6.14-next] MPAM 24.04 linux nvidia adv 6.14 next Aug 4, 2025

fyu1 changed the title ~~[6.14-next] MPAM 24.04 linux nvidia adv 6.14 next~~ [NVIDIA 6.14-next] MPAM 24.04 linux nvidia adv 6.14 next Aug 4, 2025

nvmochs closed this Aug 6, 2025


20 participants
