Skip to content

Conversation

@pcwalton
Copy link
Contributor

This commit restores the functionality present in older versions of Hanabi that allows multiple effects to be packed into a single buffer. This is in preparation for reenabling batching.

I use the base_instance field on the effects metadata to supply the offset within each indirect buffer to the shader. This is convenient, because it naturally paves the way to use multi-draw as a follow-up, and it avoids having to add more fields to the effects metadata.

As suggested in the code, I rewrote sorting to be prior to batching instead of after batching. It now uses a modified version of Tarjan's algorithm to sort the effects into the optimum batching order. Not only does it make sure that parents precede children, but it also groups buffers together, and sorts by offset, in order to make batching easier as a follow-up. Sorting by offset within the buffer should enable the use of binary search in order for the shader to map from the global thread ID to the effect instance.

This commit restores the functionality present in older versions of
Hanabi that allows multiple effects to be packed into a single buffer.
This is in preparation for reenabling batching.

I use the `base_instance` field on the effects metadata to supply the
offset within each indirect buffer to the shader. This is convenient,
because it naturally paves the way to use multi-draw as a follow-up, and
it avoids having to add more fields to the effects metadata.

As suggested in the code, I rewrote sorting to be prior to batching
instead of after batching. It now uses a modified version of [Tarjan's
algorithm] to sort the effects into the optimum batching order. Not only
does it make sure that parents precede children, but it also groups
buffers together, and sorts by offset, in order to make batching easier
as a follow-up. Sorting by offset within the buffer should enable the
use of binary search in order for the shader to map from the global
thread ID to the effect instance.

[Tarjan's algorithm]: https://en.wikipedia.org/wiki/Topological_sorting#Depth-first_search
@djeedai
Copy link
Owner

djeedai commented Jun 18, 2025

@pcwalton did you see some example fails with what appears to be a genuine error?

thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-24.0.5/src/backend/wgpu_core.rs:2750:26:
wgpu error: Validation Error

Caused by:
  In ComputePass::end
    In a set_bind_group command
      Dynamic binding index 0 (targeting BindGroup with 'hanabi:bind_group:sort_fill' label 0, binding 1) with value 3200, does not respect device's requested `min_storage_buffer_offset_alignment` limit: 256

@djeedai djeedai added C - enhancement New feature or request A - internal Internal change on a core system labels Jun 18, 2025
The sort shaders weren't updated, which caused failures in the `ribbons`
example. It looks like the plan was to use dynamic offsets in order to
move the window around within the buffer; however, that doesn't work due
to alignment limitations. So I removed the dynamic offsets and instead
use the new field in the effects metadata.

Much of the complexity comes from the fact that the render shader now
needs access to the effects metadata (in just a single place), because
there's no special WGSL shader variable available for the base instance.
Thus I had to update a bunch of bind groups and such in order to make
the effects metadata accessible to the render shader. The good news is
that, if we ever need to add more fields to the effects metadata that
the render shader needs to access in the future, doing so will be
straightforward.
@pcwalton
Copy link
Contributor Author

@djeedai Thanks, it should be fixed now with the latest commit. This was because the sorting shaders previously weren't updated to deal with multiple effects in the same buffer. See the commit message for more details.

@djeedai
Copy link
Owner

djeedai commented Jun 19, 2025

The ribbon example is completely broken.

ribbon.mp4

@pcwalton
Copy link
Contributor Author

@djeedai Hmm, it works for me. What system are you on?

@pcwalton
Copy link
Contributor Author

ribbon is tested and is working on both Windows 11, NVIDIA GeForce RTX 4070 Laptop GPU and an macOS Apple M2 mac mini.

effect_slice,
effect_slice: effect_slice.clone(),
init_and_update_pipeline_ids,
parent_buffer_index,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have the index of the parent buffer, we can't batch together updates for effect which have a parent, because the shader only has 1 bind point for a parent. It might not be used yet but conceptually we need that info for batching. Or am I missing something?

Copy link
Contributor Author

@pcwalton pcwalton Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this patch, we only used the parent_buffer_index field for the sorting. When I moved the sorting to be before batching instead of after it, the field ended up becoming unused, so I deleted it. I believe you're right that yes, we do conceptually need that info when we re-add batching later, but I figured you'd prefer not having unused fields.

Copy link
Owner

@djeedai djeedai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See individual comments, there's multiple bugs in the sorter.

mod tests {
use super::*;

fn make_batch(buffer_index: u32, parent_buffer_index: Option<u32>) -> EffectBatch {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep tests like that. Best example why is that I tried to re-add it and that's how I realized:

  1. the sorter in your change was sorting in the wrong order.
  2. the algorithm is broken in the first place, doesn't even assign the correct levels to parents and children

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my test
#[cfg(test)]
mod tests {
    use super::*;

    fn insert_entry(
        sorter: &mut EffectSorter,
        entity: Entity,
        buffer_index: u32,
        base_instance: u32,
        parent: Option<Entity>,
    ) {
        sorter.effects.push(EffectToBeSorted {
            entity,
            base_instance,
            buffer_index,
        });
        if let Some(parent) = parent {
            sorter.child_to_parent.insert(entity, parent);
        }
    }

    #[test]
    fn toposort_batches() {
        let mut sorter = EffectSorter::new();

        // Some "parent" effect
        let e1 = Entity::from_raw(1);
        insert_entry(&mut sorter, e1, 42, 0, None);
        assert_eq!(sorter.effects.len(), 1);
        assert_eq!(sorter.effects[0].entity, e1);
        assert!(sorter.child_to_parent.is_empty());

        // Some "child" effect in a different buffer
        let e2 = Entity::from_raw(2);
        insert_entry(&mut sorter, e2, 5, 30, Some(e1));
        assert_eq!(sorter.effects.len(), 2);
        assert_eq!(sorter.effects[0].entity, e1);
        assert_eq!(sorter.effects[1].entity, e2);
        assert_eq!(sorter.child_to_parent.len(), 1);
        assert_eq!(sorter.child_to_parent[&e2], e1);

        sorter.sort();
        assert_eq!(sorter.effects.len(), 2);
        assert_eq!(sorter.effects[0].entity, e2); // child first
        assert_eq!(sorter.effects[1].entity, e1); // parent after
        assert_eq!(sorter.child_to_parent.len(), 1); // unchanged
        assert_eq!(sorter.child_to_parent[&e2], e1); // unchanged

        // Some "child" effect in the same buffer as its parent
        let e3 = Entity::from_raw(3);
        insert_entry(&mut sorter, e3, 5, 20, Some(e1));
        assert_eq!(sorter.effects.len(), 3);
        assert_eq!(sorter.effects[0].entity, e2); // from previous sort
        assert_eq!(sorter.effects[1].entity, e1); // from previous sort
        assert_eq!(sorter.effects[2].entity, e3); // simply appended
        assert_eq!(sorter.child_to_parent.len(), 2);
        assert_eq!(sorter.child_to_parent[&e2], e1);
        assert_eq!(sorter.child_to_parent[&e3], e1);

        sorter.sort();
        assert_eq!(sorter.effects.len(), 3);
        // Note: Ideally this would want e1 to be first and e3 after, so that e3 and e2
        // are grouped together since they're in the same buffer. But sorting by
        // topologicial order AND by buffer at the same time is in general impossible
        // (conflicting requirements; no solution).
        if sorter.effects[0].entity == e1 {
            assert_eq!(sorter.effects[1].entity, e3); // child first
        } else {
            assert_eq!(sorter.effects[0].entity, e3); // child first
            assert_eq!(sorter.effects[1].entity, e1); // child first
        }
        assert_eq!(sorter.effects[2].entity, e2); // parent after all children
        assert_eq!(sorter.child_to_parent.len(), 2);
        assert_eq!(sorter.child_to_parent[&e2], e1);
        assert_eq!(sorter.child_to_parent[&e3], e1);
    }
}

Copy link
Contributor Author

@pcwalton pcwalton Jun 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that last test is right? We have:

e1 -> e2 (i.e. e2 is a child of e1)
e1 -> e3 (i.e. e3 is a child of e1)

e1 buffer = 42
e2 buffer = 5
e3 buffer = 5

So a possible order is:

e3, e2, e1

because this satisfies the constraints of children before parents (e3 and e2 before e1) and buffers grouped together (e2 = 5, e3 = 5). But this code doesn't admit that order:

// Note: Ideally this would want e1 to be first and e3 after, so that e3 and e2
// are grouped together since they're in the same buffer. But sorting by
// topological order AND by buffer at the same time is in general impossible
// (conflicting requirements; no solution).
if sorter.effects[0].entity == e1 {
    assert_eq!(sorter.effects[1].entity, e3); // child first
} else {
    assert_eq!(sorter.effects[0].entity, e3); // child first
    assert_eq!(sorter.effects[1].entity, e1); // child first
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit that changes the test to match the comments. I think this is what you were going for.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I changed my mental model mid-way. The test starts with "e1 is parent" but my mental model was that e2 was. The test is likely wrong.

Comment on lines 224 to 232
while let Some(effect_index) = visited.zeroes().next() {
visit(
&mut levels,
&mut visiting,
&mut visited,
&parent_to_children,
effect_index,
0,
);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm simply doesn't work. I wrote a small test with 3 entities, inserted in that order:

  • [0], child of [1]
  • [1], parent of both [0] and [2]
  • [2], child of [1]

Here's what happens:

  • On first while iteration, index 0 is returned because it's never been visited. visit() then mark [0] as visited, skip the for loop since it has no children, and insert [0] with level=0.
  • On second while iteration, index 1 is returned (the common parent). This time the for loop is executed, but child [0] is already visited so it's skipped. Child [2] is recursively visited with level=1. Finally parent [1] is inserted with level=0

We end up with:

  • [0] level=0
  • [1] level=0
  • [2] level=1

This is wrong; [0] and [2] should have the same level, different (larger or smaller depending on convention) than their parent [1].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this should be fixed now. Now it's two phases. First, I do a topological sort (in the reverse order I had before). Second, I assign levels, based on the constraint that, for each (parent, child) pair, the level of the parent must be at least the level of the child + 1. This seems correct.

Now children are sorted before parents, and the sorting logic has been
rewritten.

The test isn't the same as the one suggested, because I believe the
suggested test had a bug in it. I changed the test to follow the
comments.
@pcwalton
Copy link
Contributor Author

Addressed review comments.

@pcwalton pcwalton marked this pull request as draft June 22, 2025 21:53
@pcwalton
Copy link
Contributor Author

Marking this as a draft because I can reproduce the ribbon failure now.

@pcwalton
Copy link
Contributor Author

So it's very strange: I saw the bad behavior once, but can't reproduce it again. Whatever it is is very nondeterministic.

@pcwalton pcwalton marked this pull request as ready for review June 23, 2025 01:26
@pcwalton
Copy link
Contributor Author

pcwalton commented Jun 23, 2025

I only ever reproduced the problem once. Incredibly obnoxious.

@djeedai Sorry about this, I hate to ask someone else to debug my code, but if you can reproduce it reliably it might just be faster for you to look into it. I'm kind of wondering whether it was a pre-existing off-by-one error that suddenly became problematic when multiple effects were packed into the same buffer, looking at your video.

@pcwalton
Copy link
Contributor Author

pcwalton commented Jun 23, 2025

I've tested and ribbon is working on:

  • AMD Ryzen 9 8945HS, NVIDIA GeForce RTX 4070 Laptop GPU, Windows, Vulkan
  • AMD Ryzen 9 8945HS, NVIDIA GeForce RTX 4070 Laptop GPU, Windows, OpenGL
  • 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, NVIDIA GeForce RTX 2060, Windows, Vulkan
  • 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, NVIDIA GeForce RTX 2060, Windows, OpenGL
  • Apple M2 Mac mini, macOS, Metal

Those are all the machines I have easy access to. It doesn't work on DX12, but that's expected because of the downlevel flag VERTEX_AND_INSTANCE_INDEX_RESPECTS_RESPECTIVE_FIRST_VALUE_IN_INDIRECT_DRAW (which I think is gone in newer wgpu).

@djeedai
Copy link
Owner

djeedai commented Jun 23, 2025

Thanks for the extensive testing. I also couldn't reproduce after the first try, not sure why. But this looks like one of the bugs we had in the past, indexing being subtly wrong, but since it addresses a large buffer and doesn't go out of bounds, so we only see occasional visual artifacts. I will try to find some time this week or more likely week-end to debug once more, and if I don't find anything I'll merge as is.

pcwalton added a commit to pcwalton/bevy_hanabi that referenced this pull request Jul 4, 2025
offsets.

Currently, we use a dynamic offset to point to the `Spawner`. This
prevents batching, so this PR changes the shaders so that the effect
metadata contains an index to the spawner, and the shader picks the
spawner out of an array.

This is built on top of djeedai#480.
pcwalton added a commit to pcwalton/bevy_hanabi that referenced this pull request Jul 4, 2025
offsets.

Currently, we use a dynamic offset to point to the `Spawner`. This
prevents batching, so this PR changes the shaders so that the effect
metadata contains an index to the spawner, and the shader picks the
spawner out of an array.

This is built on top of djeedai#480.
pcwalton added a commit to pcwalton/bevy_hanabi that referenced this pull request Jul 4, 2025
offsets.

Currently, we use a dynamic offset to point to the `Spawner`. This
prevents batching, so this PR changes the shaders so that the effect
metadata contains an index to the spawner, and the shader picks the
spawner out of an array.

This is built on top of djeedai#480.
@pcwalton
Copy link
Contributor Author

Thought: Maybe the random failure is due to an assumption that the head of a ribbon is the first instance, which is usually but not always the case?

@djeedai
Copy link
Owner

djeedai commented Sep 11, 2025

Re:

Those are all the machines I have easy access to. It doesn't work on DX12, but that's expected because of the downlevel flag VERTEX_AND_INSTANCE_INDEX_RESPECTS_RESPECTIVE_FIRST_VALUE_IN_INDIRECT_DRAW (which I think is gone in newer wgpu).

Yes and no. As I read the docs, this will only work if you keep the VALIDATION_INDIRECT_CALL feature, which is I think that monstrosity which injects an extra compute dispatch before all indirect compute dispatches, and does so without caring for the bindings, that is, it will let the user code do the bindings, they undo them with its own bindings, then do the injected call, then redo the user bindings, then finally let the user call dispatch:

image

So I want to disable that ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A - internal Internal change on a core system C - enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants