metal: use shared buffers on eGPU #17866
Conversation
With ggml-org#15906, I noticed an important regression when using the Metal backend on an eGPU. This commit restores the previous behavior and adds an option to force its activation.
I'm not familiar with the concept of eGPU - is this running on an Intel Mac?
Looks like it, and an external GPU connected via Thunderbolt.
Yes, this is specific to Intel Macs when a desktop GPU is plugged in behind Thunderbolt.
Thanks. Would need to fix the iOS, tvOS and visionOS builds.
For sure. |
CI is still failing :/ |
My bad, I thought TARGET_OS_OSX was not defined for iOS, tvOS, and visionOS.
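For context, this is a common pitfall: TargetConditionals.h defines every TARGET_OS_* macro as either 0 or 1 on all Apple platforms, so the macro is always defined and only its value distinguishes macOS from iOS, tvOS and visionOS. A minimal sketch of the guard pattern (illustrative only, not the actual patch in this PR):

```objc
#include <TargetConditionals.h>

// TARGET_OS_OSX is defined (as 0 or 1) on every Apple platform, so its
// value has to be tested; `#ifdef TARGET_OS_OSX` would also be true on
// iOS, tvOS and visionOS.
#if TARGET_OS_OSX
    // macOS-only code path (e.g. eGPU-related device queries)
#else
    // iOS, tvOS and visionOS builds take this branch
#endif
```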
With #15906, I noticed an important regression when using the Metal backend on an eGPU.
This commit restores the previous behavior and adds an option to force its activation.
Before #15906, llama-bench on Gemma 3 gave me this kind of result:
So above 45 t/s on the pp test, and more than 5 t/s on the tg test.
After #15906, the pp test improved but the tg test was roughly halved.
Launching the benchmark with "Metal System Trace" in Instruments.app reveals some usage of the DMA1 channel, which introduced a lot of latency (at least, this is how I interpreted it).
With this PR, performance on eGPU is back to its previous level, and the change should not impact any other configuration (dGPU and M1-M5).
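To illustrate the idea, a check along these lines could be used to pick the buffer storage mode. This is a sketch under assumptions, not the PR's actual diff: the helper name `buffer_storage_mode` and the `force_shared` flag are hypothetical; it only shows how MTLDevice properties can select shared storage for removable eGPUs while leaving other configurations alone.

```objc
#import <Metal/Metal.h>
#include <TargetConditionals.h>

// Hypothetical helper (not the PR's actual code): pick the storage mode
// for backend buffers on a given Metal device.
static MTLResourceOptions buffer_storage_mode(id<MTLDevice> device, BOOL force_shared) {
    if (force_shared || device.hasUnifiedMemory) {
        // Unified memory (Apple Silicon) or explicit override: shared buffers.
        return MTLResourceStorageModeShared;
    }
#if TARGET_OS_OSX
    if (device.removable) {
        // Removable device, i.e. an eGPU behind Thunderbolt: keep shared
        // buffers to avoid the DMA traffic seen in the Metal System Trace.
        // (The `removable` property is macOS-only, hence the guard.)
        return MTLResourceStorageModeShared;
    }
#endif
    // Built-in discrete GPUs keep private buffers.
    return MTLResourceStorageModePrivate;
}
```

The macOS-only `removable` property also explains why such a change would need the TARGET_OS_OSX guard mentioned above to keep the iOS, tvOS and visionOS builds green.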