Segmentation error in test_cross_entropy_bwd_benchmark[executor='thunder'-variation='hf_phi3'] #5542

```
00:25:18 E       RuntimeError:  INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/fusion_segmenter.cpp:4695, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. 
00:25:18 E       Expected out_val_to_replace->uses().size() == 1 . Multiple use of replicated upcast tensor found: 
00:25:18 E       Exception raised from revertPrivatizedOps at /opt/pytorch/nvfuser/csrc/fusion_segmenter.cpp:4695 (most recent call first):
00:25:18 E       frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfb (0x7efca0ef2071 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #1: nvfuser::nvfErrorFail(char const*, char const*, long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x69 (0x7efca129ce19 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #2: <unknown function> + 0xa89ecf (0x7efca130becf in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #3: <unknown function> + 0xa8a19d (0x7efca130c19d in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #4: <unknown function> + 0xa979dd (0x7efca13199dd in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #5: <unknown function> + 0xa9f24c (0x7efca132124c in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #6: nvfuser::SegmentCandidateFinder::SegmentCandidateFinder(std::unique_ptr<nvfuser::Fusion, std::default_delete<nvfuser::Fusion> >, nvfuser::KernelArgumentHolder const&, nvfuser::SegmentCandidateFinderOptions, bool) + 0x2b4 (0x7efca13235c4 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #7: <unknown function> + 0xaa186f (0x7efca132386f in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #8: <unknown function> + 0xaa2428 (0x7efca1324428 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #9: <unknown function> + 0xeb4edc (0x7efca1736edc in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #10: nvfuser::FusionExecutorCache::getKernelRuntimeFor(nvfuser::KernelArgumentHolder const&, std::optional<nvfuser::PrimDataType>) + 0xb53 (0x7efca172b4e3 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #11: nvfuser::FusionExecutorCache::runFusionWithInputs(nvfuser::KernelArgumentHolder, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xba (0x7efca172c2ea in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #12: <unknown function> + 0x16191d (0x7efca3c2e91d in /opt/pytorch/nvfuser/python/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
00:25:18 E       frame #13: <unknown function> + 0x164bd3 (0x7efca3c31bd3 in /opt/pytorch/nvfuser/python/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
00:25:18 E       frame #14: <unknown function> + 0x6a44b (0x7efca3b3744b in /opt/pytorch/nvfuser/python/nvfuser_direct/_C_DIRECT.cpython-312-x86_64-linux-gnu.so)
00:25:18 E       frame #15: /usr/bin/python3() [0x581a6f]
00:25:18 E       frame #16: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
00:25:18 E       frame #17: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #18: _PyObject_Call_Prepend + 0xc2 (0x54ab42 in /usr/bin/python3)
00:25:18 E       frame #19: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #20: _PyObject_MakeTpCall + 0x75 (0x5492f5 in /usr/bin/python3)
00:25:18 E       frame #21: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #22: _PyObject_Call_Prepend + 0x18a (0x54ac0a in /usr/bin/python3)
00:25:18 E       frame #23: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #24: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
00:25:18 E       frame #25: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #26: _PyObject_Call_Prepend + 0x18a (0x54ac0a in /usr/bin/python3)
00:25:18 E       frame #27: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #28: PyObject_Call + 0x6c (0x54b47c in /usr/bin/python3)
00:25:18 E       frame #29: _PyEval_EvalFrameDefault + 0x4cb0 (0x5daa90 in /usr/bin/python3)
00:25:18 E       frame #30: _PyObject_Call_Prepend + 0x18a (0x54ac0a in /usr/bin/python3)
00:25:18 E       frame #31: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #32: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
00:25:18 E       frame #33: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #34: _PyObject_Call_Prepend + 0x18a (0x54ac0a in /usr/bin/python3)
00:25:18 E       frame #35: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #36: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
00:25:18 E       frame #37: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #38: _PyObject_Call_Prepend + 0x18a (0x54ac0a in /usr/bin/python3)
00:25:18 E       frame #39: /usr/bin/python3() [0x5a30c8]
00:25:18 E       frame #40: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
00:25:18 E       frame #41: _PyEval_EvalFrameDefault + 0xadf (0x5d68bf in /usr/bin/python3)
00:25:18 E       frame #42: PyEval_EvalCode + 0x15b (0x5d4dab in /usr/bin/python3)
00:25:18 E       frame #43: /usr/bin/python3() [0x607fc2]
00:25:18 E       frame #44: /usr/bin/python3() [0x6b4393]
00:25:18 E       frame #45: _PyRun_SimpleFileObject + 0x1aa (0x6b40fa in /usr/bin/python3)
00:25:18 E       frame #46: _PyRun_AnyFileObject + 0x4f (0x6b3f2f in /usr/bin/python3)
00:25:18 E       frame #47: Py_RunMain + 0x3b5 (0x6bbf45 in /usr/bin/python3)
00:25:18 E       frame #48: Py_BytesMain + 0x2d (0x6bba2d in /usr/bin/python3)
00:25:18 E       frame #49: <unknown function> + 0x2a1ca (0x7f01d710b1ca in /usr/lib/x86_64-linux-gnu/libc.so.6)
00:25:18 E       frame #50: __libc_start_main + 0x8b (0x7f01d710b28b in /usr/lib/x86_64-linux-gnu/libc.so.6)
00:25:18 E       frame #51: _start + 0x25 (0x656a35 in /usr/bin/python3)
```
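Most of the 52 frames in the trace above are CPython interpreter frames or unsymbolized offsets. The assert itself is reported as raised from `revertPrivatizedOps`, and the nearest symbolized caller is the `SegmentCandidateFinder` constructor. For triage, a small stdlib-only sketch like the one below can reduce such a log to its symbolized `nvfuser::` frames. The helper names (`named_nvfuser_frames`, `short_name`) and the regex are illustrative assumptions, not part of nvFuser or Thunder. The sample lines are copied from the trace.

```python
import re

# Matches symbolized frame lines of the form
#   "... frame #N: <symbol> + 0x<offset> (0x<address> in <library>)"
# Lines such as "frame #15: /usr/bin/python3() [0x581a6f]" do not match.
FRAME_RE = re.compile(
    r"frame #(\d+): (.+?) \+ 0x[0-9a-f]+ \(0x[0-9a-f]+ in (\S+)\)"
)


def named_nvfuser_frames(log_text):
    """Return (frame_number, symbol) pairs for frames whose symbol is in nvfuser::."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line, or a frame without symbol + offset
        number, symbol, _library = match.groups()
        # Skips "<unknown function>" and CPython symbols such as _PyObject_MakeTpCall.
        if symbol.startswith("nvfuser::"):
            frames.append((int(number), symbol))
    return frames


def short_name(symbol):
    """Drop the argument list: 'nvfuser::A::b(int)' becomes 'nvfuser::A::b'."""
    return symbol.split("(", 1)[0]


# Five lines copied verbatim from the trace in this issue.
SAMPLE = """\
00:25:18 E       frame #9: <unknown function> + 0xeb4edc (0x7efca1736edc in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #10: nvfuser::FusionExecutorCache::getKernelRuntimeFor(nvfuser::KernelArgumentHolder const&, std::optional<nvfuser::PrimDataType>) + 0xb53 (0x7efca172b4e3 in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #11: nvfuser::FusionExecutorCache::runFusionWithInputs(nvfuser::KernelArgumentHolder, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xba (0x7efca172c2ea in /opt/pytorch/nvfuser/python/nvfuser_direct/../build/libnvfuser_codegen.so)
00:25:18 E       frame #15: /usr/bin/python3() [0x581a6f]
00:25:18 E       frame #16: _PyObject_MakeTpCall + 0x13e (0x5493be in /usr/bin/python3)
"""

for number, symbol in named_nvfuser_frames(SAMPLE):
    print(f"#{number}: {short_name(symbol)}")
```

Applied to the full trace, this keeps frames #0, #1, #6, #10 and #11. Frames #2 through #5, which sit between the assert and the `SegmentCandidateFinder` constructor, are unsymbolized in this build, so a build with debug symbols would be needed to name them.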

```
00:25:18 =================================== FAILURES ===================================
00:25:18 ___ test_cross_entropy_bwd_benchmark[executor='thunder'-variation='hf_phi3'] ___
00:25:18 
00:25:18 benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7ef4e0a7ce60>
00:25:18 variation = 'hf_phi3', executor = 'thunder'
00:25:18 
00:25:18     @pytest.mark.parametrize(
00:25:18         "variation",
00:25:18         [
00:25:18             "hf_qwen2",
00:25:18             "hf_phi3",
00:25:18             "hf_mistral_nemo",
00:25:18         ],
00:25:18     )
00:25:18     @pytest.mark.parametrize(
00:25:18         "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
00:25:18     )
00:25:18     def test_cross_entropy_bwd_benchmark(
00:25:18         benchmark,
00:25:18         variation: str,
00:25:18         executor: str,
00:25:18     ):
00:25:18         kwargs = {}
00:25:18         if executor == "torchcompile":
00:25:18             clear_dynamo_cache()
00:25:18     
00:25:18         test_case = cross_entropy_loss_setup[variation](dtype=torch.bfloat16)
00:25:18         fwd_inputs = test_case.inputs()
00:25:18         model = test_case.model()
00:25:18     
00:25:18         def fwd_call(inp):
00:25:18             return model(**inp)
00:25:18     
00:25:18         # execute the compiled fwd fn
00:25:18         fwd_fn = with_executor(executor, fwd_call, **kwargs)
00:25:18 >       outputs = fwd_fn(fwd_inputs)
00:25:18                   ^^^^^^^^^^^^^^^^^^
00:25:18 
00:25:18 benchmarks/python/test_cross_entropy_loss.py:87: 
00:25:18 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
00:25:18 ../lightning-thunder/thunder/__init__.py:843: in wrapped
00:25:18     return fn(*args, **kwargs)
00:25:18            ^^^^^^^^^^^^^^^^^^^
00:25:18 ../lightning-thunder/thunder/__init__.py:893: in fn_
00:25:18     result = cache_entry.computation_fn(*inps)
00:25:18              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00:25:18 ../lightning-thunder/thunder/__init__.py:798: in wrapped
00:25:18     return fn(*args, **kwargs)
00:25:18            ^^^^^^^^^^^^^^^^^^^
00:25:18 /usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py:121: in decorate_context
00:25:18     return func(*args, **kwargs)
00:25:18            ^^^^^^^^^^^^^^^^^^^^^
00:25:18 ../lightning-thunder/thunder/executors/torchex.py:169: in no_autocast_fn
00:25:18     return fn(*args, **kwargs)
00:25:18            ^^^^^^^^^^^^^^^^^^^
00:25:18 thunder.computation_11:14: in computation
00:25:18     [t65, t68, t66, t67] = nvFusion0(t74, labels)
00:25:18                            ^^^^^^^^^^^^^^^^^^^^^^
00:25:18 ../lightning-thunder/thunder/executors/nvfuserex_impl.py:566: in __call__
00:25:18     return fd.execute(
00:25:18 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
00:25:18 
00:25:18 self = def nvfuser_fusion(fd : FusionDefinition) -> None :
00:25:18     tv0 = fd.define_tensor(shape=[1, 8192, 32064], contiguity=[Non...fd.ops.div(tv29, tv28)
00:25:18     fd.add_output(tv30)
00:25:18     fd.add_output(tv15)
00:25:18     fd.add_output(tv21)
00:25:18     fd.add_output(tv28)
00:25:18 
00:25:18 inputs = (tensor([[[ 1.4922, -0.9609,  0.1025,  ..., -0.5469,  0.2578,  0.0874],
00:25:18          [ 0.0366, -0.8633, -1.8672,  ...,  1....   device='cuda:0', dtype=torch.bfloat16), tensor([[12263,  8142, 23663,  ...,  9877, 12805, 31727]], device='cuda:0'))
00:25:18 
00:25:18     def execute(
00:25:18         self,
00:25:18         inputs,
00:25:18         *,
00:25:18         device=None,
00:25:18         save_repro_inputs=False,
00:25:18         _enable_options: list[str] = [],
00:25:18         _disable_options: list[str] = [],
00:25:18     ) -> list[torch.Tensor]:
00:25:18         """
00:25:18         Execute the fusion with the given inputs. The fusion is automatically
00:25:18         scheduled and supports input caching.
00:25:18     
00:25:18         Parameters
00:25:18         ----------
00:25:18         inputs : list of torch.Tensor
00:25:18             Input tensors and scalars to the fusion
00:25:18         device : torch.device, optional
00:25:18             Device to execute the fusion on
00:25:18         save_repro_inputs : bool, default=False
00:25:18             Whether to save the inputs for last_repro_script() to provide a reproduction script.
00:25:18         _enable_options : list of str, default=[]
00:25:18             A list of enable options. An alternative to setting NVFUSER_ENABLE environment variable.
00:25:18         _disable_options : list of str, default=[]
00:25:18             A list of disable options. An alternative to setting NVFUSER_DISABLE environment variable.
00:25:18     
00:25:18         Returns
00:25:18         -------
00:25:18         list of torch.Tensor
00:25:18             Output tensors from the fusion
00:25:18         """
00:25:18     
00:25:18         if save_repro_inputs:
00:25:18             from torch._subclasses.fake_tensor import FakeTensorMode
00:25:18     
00:25:18             fake_mode = FakeTensorMode()
00:25:18             self.fake_inputs = [fake_mode.from_tensor(inp) for inp in inputs]
00:25:18     
00:25:18         assert not hasattr(
00:25:18             self, "ke"
00:25:18         ), "KernelExecutor already exists! Use manual_execute() to execute the fusion."
00:25:18         if not hasattr(self, "fec"):
00:25:18             is_valid, error_message = self.validate_definition()
00:25:18             if not is_valid:
00:25:18                 raise NotImplementedError(error_message)
00:25:18     
00:25:18             self.fec = FusionExecutorCache(self._fusion)
00:25:18             # A copy of fusion is created after construction FusionExecutorCache
00:25:18             # Delete the _fusion and reference the fusion inside FusionExecutorCache
00:25:18             del self._fusion
00:25:18 >       return self.fec.execute(
00:25:18             inputs,
00:25:18             device=self._get_device_index(device),
00:25:18             _enable_options=_enable_options,
00:25:18             _disable_options=_disable_options,
00:25:18         )
```
