Conversation

@miscco (Contributor) commented Nov 25, 2025

This implements cuda::std::reduce using a CUB backend.
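A minimal usage sketch, assembled from the test snippet further down in this conversation. The header <cuda/std/numeric> for the policy overload, the container choice, and the policy spelling cuda::execution::__cub_par_unseq are assumptions taken from that snippet and may change before this lands.

#include <cuda/std/functional>
#include <cuda/std/numeric>

#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main()
{
  thrust::device_vector<int> data(1000);
  thrust::sequence(data.begin(), data.end(), 1); // 1, 2, ..., 1000

  // Parallel overload added by this PR: policy, range, init, binary op
  const auto policy = cuda::execution::__cub_par_unseq;
  const int sum     = cuda::std::reduce(policy, data.begin(), data.end(), 0, cuda::std::plus<int>{});

  return sum == 500500 ? 0 : 1; // sum of 1..1000
}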

@miscco miscco requested review from a team as code owners November 25, 2025 09:12
@miscco miscco requested a review from wmaxey November 25, 2025 09:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 25, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 25, 2025
Comment on lines 71 to 79
_CCCL_TRY_CUDA_API(
::cub::DeviceReduce::Reduce,
"__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
::cuda::std::move(__first),
__device_ret_ptr,
__count,
::cuda::std::move(__func),
::cuda::std::move(__init),
::cuda::std::move(__policy));
Contributor
Important: in the for_each_n implementation we create a ::cuda::stream_ref __stream{cudaStreamPerThread}; and pass the stream here instead of the policy. I think we need to add __stream to __policy here.
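For reference, a rough sketch of that for_each_n pattern applied here, assuming the cub::DeviceReduce::Reduce overload used above also accepts a stream argument in place of the policy (the __stream name mirrors for_each_n; this is illustrative, not the final call):

::cuda::stream_ref __stream{cudaStreamPerThread};

_CCCL_TRY_CUDA_API(
  ::cub::DeviceReduce::Reduce,
  "__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
  ::cuda::std::move(__first),
  __device_ret_ptr,
  __count,
  ::cuda::std::move(__func),
  ::cuda::std::move(__init),
  __stream); // pass the stream rather than the policy (assumption about the overload)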

Collaborator
Agreed, we should pass a stream. Regarding adding the stream to the policy: that is a complicated subject that we are deferring until after reduction is merged.

Comment on lines 65 to 66
_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));
Contributor
Important: This must be cudaMallocAsync.

Collaborator
I don't think cudaMallocAsync is enough. Many calls below throw without ever freeing allocated memory. We need a RAII abstraction, like async_buffer, at least internally.

Contributor
How about a unique_ptr?
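A rough sketch of that idea, assuming a __stream is in scope at this point; all names here are illustrative, not part of the PR:

#include <memory>

// Deleter that releases the device allocation on the stream; noexcept so it is safe during unwinding.
struct __device_delete
{
  cudaStream_t __stream_;
  void operator()(void* __ptr) const noexcept
  {
    ::cudaFreeAsync(__ptr, __stream_);
  }
};

_Tp* __raw = nullptr;
_CCCL_TRY_CUDA_API(
  ::cudaMallocAsync,
  "__pstl_cuda_reduce: allocation failed",
  reinterpret_cast<void**>(&__raw),
  sizeof(_Tp),
  __stream.get());

// The unique_ptr frees the device memory even if a later call throws.
::std::unique_ptr<_Tp, __device_delete> __device_ret{__raw, __device_delete{__stream.get()}};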

Comment on lines 81 to 89
_CCCL_TRY_CUDA_API(
::cudaMemcpy,
"__pstl_cuda_reduce: copy of result from device to host failed",
::cuda::std::addressof(__ret),
__device_ret_ptr,
sizeof(_Tp),
::cudaMemcpyDeviceToHost);

_CCCL_TRY_CUDA_API(::cudaFree, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr);
Contributor
Important: this should be cudaMemcpyAsync and cudaFreeAsync, followed by a sync of the stream.
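A sketch of that pattern, assuming the allocation above was made on __stream (names mirror the snippet above but are illustrative):

_CCCL_TRY_CUDA_API(
  ::cudaMemcpyAsync,
  "__pstl_cuda_reduce: copy of result from device to host failed",
  ::cuda::std::addressof(__ret),
  __device_ret_ptr,
  sizeof(_Tp),
  ::cudaMemcpyDeviceToHost,
  __stream.get());

_CCCL_TRY_CUDA_API(::cudaFreeAsync, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr, __stream.get());

// Ensure the copy has completed before __ret is read on the host.
_CCCL_TRY_CUDA_API(::cudaStreamSynchronize, "__pstl_cuda_reduce: stream synchronization failed", __stream.get());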

Comment on lines 63 to 67
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));

Collaborator
important: this might lead to an issue if the user's type has a non-trivial constructor. Thrust handles that by invoking a kernel only when a constructor call is needed. We might do a bit better: consider a fancy iterator that does an in-place new for non-trivial types and uses a raw pointer otherwise.

Contributor
Yep, I think the iterator handling the output should do placement new.
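A rough sketch of such an output iterator, purely as an illustration of the placement-new idea (all names are assumptions, not the final design):

// Write-only output iterator whose assignment placement-news the reduction
// result into raw device memory instead of assigning to an unconstructed object.
template <class _Tp>
struct __construct_at_output_iterator
{
  using iterator_category = ::cuda::std::output_iterator_tag;
  using value_type        = void;
  using difference_type   = ::cuda::std::ptrdiff_t;
  using pointer           = void;
  using reference         = __construct_at_output_iterator&;

  _Tp* __ptr_;

  __host__ __device__ __construct_at_output_iterator& operator*() { return *this; }
  __host__ __device__ __construct_at_output_iterator& operator++() { return *this; }
  __host__ __device__ __construct_at_output_iterator operator++(int) { return *this; }

  __host__ __device__ __construct_at_output_iterator& operator=(const _Tp& __value)
  {
    ::new (static_cast<void*>(__ptr_)) _Tp(__value); // construct in place, do not assign
    return *this;
  }
};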

// Allocate memory for result
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
Collaborator
important: given that we throw below, this looks like a memory leak. Consider a RAII abstraction.

thrust::sequence(data.begin(), data.end(), 1);

const auto policy = cuda::execution::__cub_par_unseq;
decltype(auto) res = cuda::std::reduce(policy, data.begin(), data.end(), 42, plus_two{});
Collaborator
suggestion: consider adding a check in the binary operator to verify that it is actually invoked on the GPU.
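For example (a sketch only; NV_IF_TARGET comes from <nv/target>, and the functor body is a placeholder since plus_two's definition isn't shown in this excerpt):

#include <nv/target>

#include <cassert>

struct plus_two_checked
{
  __host__ __device__ int operator()(int lhs, int rhs) const
  {
    // Fail loudly if the reduction ever invokes the operator on the host.
    NV_IF_TARGET(NV_IS_HOST, (assert(false && "binary operator was invoked on the host");));
    return lhs + rhs + 2; // placeholder body; the real plus_two is defined elsewhere in the test
  }
};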

We need to guard against memory leaks and make sure we properly construct the output on device
@github-actions bot commented Dec 4, 2025

🥳 CI Workflow Results

🟩 Finished in 4h 10m: Pass: 100%/90 | Total: 2d 09h | Max: 2h 42m | Hits: 87%/202475

See results here.


Labels

None yet

Projects

Status: In Review

3 participants