Conversation

@miscco (Contributor) commented Nov 25, 2025

This implements cuda::std::reduce using a CUB backend.
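A minimal usage sketch, assembled from the test snippet further down in this conversation. The header <cuda/std/numeric> for the policy overload, the container choice, and the policy spelling cuda::execution::__cub_par_unseq are assumptions taken from that snippet and may change before this lands.

#include <cuda/std/functional>
#include <cuda/std/numeric>

#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main()
{
  thrust::device_vector<int> data(1000);
  thrust::sequence(data.begin(), data.end(), 1); // 1, 2, ..., 1000

  // Parallel overload added by this PR: policy, range, init, binary op
  const auto policy = cuda::execution::__cub_par_unseq;
  const int sum     = cuda::std::reduce(policy, data.begin(), data.end(), 0, cuda::std::plus<int>{});

  return sum == 500500 ? 0 : 1; // sum of 1..1000
}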

@miscco miscco requested review from a team as code owners November 25, 2025 09:12
@miscco miscco requested a review from wmaxey November 25, 2025 09:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 25, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 25, 2025
Comment on lines 71 to 79
_CCCL_TRY_CUDA_API(
::cub::DeviceReduce::Reduce,
"__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
::cuda::std::move(__first),
__device_ret_ptr,
__count,
::cuda::std::move(__func),
::cuda::std::move(__init),
::cuda::std::move(__policy));
Contributor
Important: in the for_each_n implementation we create a ::cuda::stream_ref __stream{cudaStreamPerThread}; and pass the stream here instead of the policy. I think we need to add __stream to __policy here.
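For reference, a rough sketch of that for_each_n pattern applied here, assuming the cub::DeviceReduce::Reduce overload used above also accepts a stream argument in place of the policy (the __stream name mirrors for_each_n; this is illustrative, not the final call):

::cuda::stream_ref __stream{cudaStreamPerThread};

_CCCL_TRY_CUDA_API(
  ::cub::DeviceReduce::Reduce,
  "__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
  ::cuda::std::move(__first),
  __device_ret_ptr,
  __count,
  ::cuda::std::move(__func),
  ::cuda::std::move(__init),
  __stream); // pass the stream rather than the policy (assumption about the overload)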

Collaborator
Agreed, we should pass a stream. Regarding adding the stream to the policy: that is a complicated subject that we are deferring until after reduction is merged.

Comment on lines 65 to 66
_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));
Contributor
Important: This must be cudaMallocAsync.

Collaborator
I don't think cudaMallocAsync is enough. Many calls below throw without ever freeing allocated memory. We need a RAII abstraction, like async_buffer, at least internally.

Contributor
How about a unique_ptr?
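A rough sketch of that idea, assuming a __stream is in scope at this point; all names here are illustrative, not part of the PR:

#include <memory>

// Deleter that releases the device allocation on the stream; noexcept so it is safe during unwinding.
struct __device_delete
{
  cudaStream_t __stream_;
  void operator()(void* __ptr) const noexcept
  {
    ::cudaFreeAsync(__ptr, __stream_);
  }
};

_Tp* __raw = nullptr;
_CCCL_TRY_CUDA_API(
  ::cudaMallocAsync,
  "__pstl_cuda_reduce: allocation failed",
  reinterpret_cast<void**>(&__raw),
  sizeof(_Tp),
  __stream.get());

// The unique_ptr frees the device memory even if a later call throws.
::std::unique_ptr<_Tp, __device_delete> __device_ret{__raw, __device_delete{__stream.get()}};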

Comment on lines 81 to 89
_CCCL_TRY_CUDA_API(
::cudaMemcpy,
"__pstl_cuda_reduce: copy of result from device to host failed",
::cuda::std::addressof(__ret),
__device_ret_ptr,
sizeof(_Tp),
::cudaMemcpyDeviceToHost);

_CCCL_TRY_CUDA_API(::cudaFree, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr);
Contributor
Important: this should be cudaMemcpyAsync and cudaFreeAsync, followed by a sync of the stream.
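A sketch of that pattern, assuming the allocation above was made on __stream (names mirror the snippet above but are illustrative):

_CCCL_TRY_CUDA_API(
  ::cudaMemcpyAsync,
  "__pstl_cuda_reduce: copy of result from device to host failed",
  ::cuda::std::addressof(__ret),
  __device_ret_ptr,
  sizeof(_Tp),
  ::cudaMemcpyDeviceToHost,
  __stream.get());

_CCCL_TRY_CUDA_API(::cudaFreeAsync, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr, __stream.get());

// Ensure the copy has completed before __ret is read on the host.
_CCCL_TRY_CUDA_API(::cudaStreamSynchronize, "__pstl_cuda_reduce: stream synchronization failed", __stream.get());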

Comment on lines 63 to 67
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));

Collaborator
important: this might lead to an issue if the user's type has a non-trivial constructor. Thrust handles that by invoking a kernel only when a constructor call is needed. We might do a bit better: consider a fancy iterator that does an in-place new for non-trivial types and uses a raw pointer otherwise.

Contributor
Yep, I think the iterator handling the output should do placement new.
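A rough sketch of such an output iterator, purely as an illustration of the placement-new idea (all names are assumptions, not the final design):

// Write-only output iterator whose assignment placement-news the reduction
// result into raw device memory instead of assigning to an unconstructed object.
template <class _Tp>
struct __construct_at_output_iterator
{
  using iterator_category = ::cuda::std::output_iterator_tag;
  using value_type        = void;
  using difference_type   = ::cuda::std::ptrdiff_t;
  using pointer           = void;
  using reference         = __construct_at_output_iterator&;

  _Tp* __ptr_;

  __host__ __device__ __construct_at_output_iterator& operator*() { return *this; }
  __host__ __device__ __construct_at_output_iterator& operator++() { return *this; }
  __host__ __device__ __construct_at_output_iterator operator++(int) { return *this; }

  __host__ __device__ __construct_at_output_iterator& operator=(const _Tp& __value)
  {
    ::new (static_cast<void*>(__ptr_)) _Tp(__value); // construct in place, do not assign
    return *this;
  }
};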

// Allocate memory for result
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
Collaborator
important: given that we throw below, this looks like a memory leak. Consider a RAII abstraction.

thrust::sequence(data.begin(), data.end(), 1);

const auto policy = cuda::execution::__cub_par_unseq;
decltype(auto) res = cuda::std::reduce(policy, data.begin(), data.end(), 42, plus_two{});
Collaborator
suggestion: consider adding a check in the binary operator to verify that it is actually invoked on the GPU.
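For example (a sketch only; NV_IF_TARGET comes from <nv/target>, and the functor body is a placeholder since plus_two's definition isn't shown in this excerpt):

#include <nv/target>

#include <cassert>

struct plus_two_checked
{
  __host__ __device__ int operator()(int lhs, int rhs) const
  {
    // Fail loudly if the reduction ever invokes the operator on the host.
    NV_IF_TARGET(NV_IS_HOST, (assert(false && "binary operator was invoked on the host");));
    return lhs + rhs + 2; // placeholder body; the real plus_two is defined elsewhere in the test
  }
};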

We need to guard against memory leaks and make sure we properly construct the output on device
@github-actions bot commented Dec 4, 2025

🥳 CI Workflow Results

🟩 Finished in 4h 10m: Pass: 100%/90 | Total: 2d 09h | Max: 2h 42m | Hits: 87%/202475

See results here.


Labels

None yet

Projects

Status: In Review

3 participants