informatique:ai_lm:gpu_bench

Differences

Below are the differences between two revisions of the page.

informatique:ai_lm:gpu_bench [09/06/2026 20:29] – [Qwen3-Coder-30B-A3B-Instruct-Q4_K_M] cyrille
informatique:ai_lm:gpu_bench [25/06/2026 18:18] (current version) – [Nemotron-Cascade-2-30B-A3B] cyrille
Line 255: Line 255:
 **Sensitive environment and build settings** for llama.cpp:
   * https://github.com/ggml-org/llama.cpp/issues/23546#issuecomment-4662239477
 +
 +
 +^ Model ^ Params ^ GPU offload (layers) ^ Prompt (t/s) ^ Eval (t/s) ^ Total (ms) ^ Generated tokens ^ Graphs reused ^
 +| Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL | 24B | 17/41 | 427.81 – 545.85 | 0.80 – 3.19 | 123,500 – 568,458 | 9,629 – 47,241 | 0 |
 +| Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL | 30B | 49/49 | 590.38 – 591.76 | 28.64 – 30.06 | 4,715 – 12,818 | 19,919 – 22,804 | 294 – 530 |
 +| Qwen3-Coder-Next-UD-Q4_K_XL | 80B | 49/49 | 29.00 – 400.09 | 18.68 – 32.44 | 25,057 – 87,659 | 719 – 43,214 | 10 – 1,024 |
 +| DeepSeek-R1-Distill-Qwen-32B-Q4_K_M | 32B | 24/65 | 88.97 – 428.81 | 2.14 – 2.32 | 116,052 – 189,566 | 925 – 3,397 | 228 – 419 |
 +| DeepSeek-R1-Distill-Qwen-14B-Q8_0 | 14B | 24/49 | 225.55 – 775.01 | 4.10 – 4.13 | 81,383 – 147,476 | 1,307 – 3,858 | 313 – 582 |
 +
 +=== gpt-oss-20b-UD-Q4_K_XL ===
 +
 +<code>
 +$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
 +ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
 +  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
 +| model                     |       size |     params | backend | ngl |    test |            t/s |
 +| ------------------------- | ---------: | ---------: | ------- | --: | ------: | -------------: |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg128 |  155.79 ± 0.21 |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg256 |  155.81 ± 0.03 |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB |    20.91 B | CUDA    |  -1 |   tg512 |  155.15 ± 0.01 |
 +
 +build: e25a32e98 (9584)
 +
 +$ ./llama.cpp/build/bin/llama-bench -m /data/models/gpt-oss-20b-UD-Q4_K_XL.gguf -p 1024 -n 0 -b 128,256,512
 +ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
 +  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
 +| model                     |       size |  params | backend | ngl | n_batch |    test |             t/s |
 +| ------------------------- | ---------: | ------: | ------- | --: | ------: | ------: | --------------: |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     128 |  pp1024 | 3308.23 ± 19.28 |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     256 |  pp1024 | 4792.27 ± 39.25 |
 +| gpt-oss 20B Q4_K - Medium |  11.04 GiB | 20.91 B | CUDA    |  -1 |     512 |  pp1024 | 6048.13 ± 32.16 |
 +
 +build: e25a32e98 (9584)
 +</code>
  
 === Qwen2.5-coder-7b-instruct-q8_0 ===
Line 304: Line 338:
 
 build: e25a32e98 (9584)
 +</code>
 +
 +=== gemma-4-26B-A4B-it-qat-UD-Q4_K_XL ===
 +
 +<code>
 +prompt eval time =     318.17 ms /   165 tokens (    1.93 ms per token,   518.59 tokens per second)
 +       eval time =    1338.88 ms /    86 tokens (   15.57 ms per token,    64.23 tokens per second)
 +      total time =    1657.05 ms /   251 tokens
 +   graphs reused =       1916
 +stop processing: n_tokens = 20931, truncated = 0
 +
 +prompt eval time =    3143.73 ms /  4850 tokens (    0.65 ms per token,  1542.75 tokens per second)
 +       eval time =   31502.45 ms /  1854 tokens (   16.99 ms per token,    58.85 tokens per second)
 +      total time =   34646.18 ms /  6704 tokens
 +   graphs reused =       3762
 +stop processing: n_tokens = 27604, truncated = 0
 </code>
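
The throughput figures in the timing lines above follow directly from the reported milliseconds and token counts (tokens / ms × 1000). A minimal sketch to recompute them, assuming the line layout shown in this log and an available `awk`; the function name `tokens_per_second` is hypothetical and is not part of llama.cpp:

```shell
# Recompute tokens per second from a llama.cpp timing line of the form
#   "<label> = <ms> ms / <n> tokens (...)"
# The field layout is assumed from the log lines quoted on this page.
tokens_per_second() {
    awk -F'[=/]' '{
        ms = $2 + 0        # e.g. "     318.17 ms " -> 318.17
        tokens = $3 + 0    # e.g. "   165 tokens (...)" -> 165
        printf "%.2f\n", tokens / ms * 1000
    }'
}

# First prompt-eval line of the gemma run: prints 518.59
echo 'prompt eval time =     318.17 ms /   165 tokens (    1.93 ms per token,   518.59 tokens per second)' | tokens_per_second
```

The same helper applied to an ''eval time'' line gives the generation rate. It is not meaningful for the ''total time'' line, which mixes prompt and generated tokens.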
  
 === Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ===
 +
 +I tried small ''-ngl'' values, but it still fails to load.
  
 <code>
 $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -p 0 -n 128,256,512
 
-llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf'
 +llama_bench: error: failed to load model ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
 </code>
 +
 +=== Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL ===
 +
 +I tried small ''-ngl'' values, but it still fails to load.
  
 <code>
-exec llama-server \
- -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
- --host 0.0.0.0 --port 8012 \
- --verbosity $VERBOSITY \
- --threads-http 2 \
- --flash-attn on \
- --no-mmap \
- --cache-type-k q8_0 --cache-type-v q8_0 \
- --jinja \
- -c 96000
 +$ ./llama.cpp/build/bin/llama-bench -m /data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
 
-common_params_print_info: build 9584 (e25a32e98) with GNU 15.2.0 for Linux x86_64
-log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
-device_info:
-  - CUDA0   : NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
-  - CPU     : Intel(R) Core(TM) Ultra 7 270K Plus (93508 MiB, 93508 MiB free)
-system_info: n_threads = 4 (n_threads_batch = 4) / 24 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
-srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
-...
-common_params_fit_impl: memory for test allocation by device:
-common_params_fit_impl: id=0, n_layer=49, n_part=24, overflow_type=3, mem= 14787 MiB
-common_params_fit_impl:   CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers (24 overflowing),  14678 MiB used,   1034 MiB free
-common_fit_params: successfully fit params to free device memory
-common_fit_params: fitting params to free memory took 6.76 seconds
-llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
-...
-load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
-load_tensors: offloading output layer to GPU
-load_tensors: offloading 47 repeating layers to GPU
-load_tensors: offloaded 49/49 layers to GPU
-load_tensors:          CPU model buffer size =   166.92 MiB
-load_tensors:        CUDA0 model buffer size =  9585.43 MiB
-load_tensors:    CUDA_Host model buffer size =  7939.00 MiB
-...
-llama_context: n_ctx_seq (96000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
-llama_context:  CUDA_Host  output buffer size =     2.32 MiB
-llama_kv_cache:      CUDA0 KV buffer size =  4781.25 MiB
-llama_kv_cache: size = 4781.25 MiB ( 96000 cells,  48 layers,  4/1 seqs), K (q8_0): 2390.62 MiB, V (q8_0): 2390.62 MiB
-...
-sched_reserve: resolving fused Gated Delta Net support:
-sched_reserve: fused Gated Delta Net (autoregressive) enabled
-sched_reserve: fused Gated Delta Net (chunked) enabled
-sched_reserve:      CUDA0 compute buffer size =   311.34 MiB
-sched_reserve:  CUDA_Host compute buffer size =   101.84 MiB
-sched_reserve: graph nodes  = 3606
-sched_reserve: graph splits = 70 (with bs=512), 50 (with bs=1)
-...
-srv    load_model: prompt cache is enabled, size limit: 8192 MiB
-...
-srv          init: init: chat template, thinking = 0
-srv  llama_server: model loaded
-srv  llama_server: server is listening on http://0.0.0.0:8012
-srv  update_slots: all slots are idle
 +ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
 +  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
 +| model                          |       size |     params | backend    | ngl |            test |                  t/s |
 +| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
 +llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf'
 +</code>
  
 +=== Nemotron-Cascade-2-30B-A3B ===
 +
 +I tried small ''-ngl'' values, but it still fails to load.
 +
 +<code>
 +$ ./llama.cpp/build/bin/llama-bench -m /data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 0 -n 128,256,512
 +ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
 +  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
 +| model                          |       size |     params | backend    | ngl |            test |                  t/s |
 +| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
 +llama_bench: error: failed to load model '/data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf'
 </code>
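
For the 30B models that fail to load above, one way to try several offload depths in a single run is to give ''-ngl'' a comma-separated list, which llama-bench accepts for its parameters. The sketch below only builds and prints such a command. The helper name `ngl_sweep_cmd` and the layer counts are illustrative assumptions, not the commands actually run for this page:

```shell
# Hypothetical helper: print (do not run) a llama-bench command that sweeps
# several -ngl values in one invocation.
# Usage: ngl_sweep_cmd MODEL NGL [NGL...]
ngl_sweep_cmd() (
    model=$1
    shift
    IFS=,   # makes "$*" join the remaining arguments with commas
    printf '%s\n' "./llama.cpp/build/bin/llama-bench -m $model -p 0 -n 128 -ngl $*"
)

# The layer counts below are illustrative, not values taken from this page.
ngl_sweep_cmd /data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf 0 8 16 24
```

Printing the command first makes it easy to check the model path before launching a run that may take a long time or fail again.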
  
-==== Stability With eGPU 😩 ====
 +==== INstability with eGPU 😩 ====
  
 Reset NVIDIA and CUDA:
informatique/ai_lm/gpu_bench.1781029763.txt.gz · Last modified: by cyrille

Unless otherwise noted, the content of this wiki is licensed under: CC0 1.0 Universal