Differences

Below are the differences between two revisions of the page.

--- informatique:ai_lm:gpu_bench [17/06/2026 13:52] – [Qwen3-Coder-30B-A3B-Instruct-Q4_K_M] cyrille
+++ informatique:ai_lm:gpu_bench [25/06/2026 18:18] (Current version) – [Nemotron-Cascade-2-30B-A3B] cyrille
@@ Line 357: / Line 357: @@
 === Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ===
+I tried small ''-ngl'' values, but the model still fails to load.
 <code>
 $ ./llama.cpp/build/bin/llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -p 0 -n 128,256,512
-llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf'
+llama_bench: error: failed to load model ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
 </code>
+=== Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL ===
+I tried small ''-ngl'' values, but the model still fails to load.
 <code>
-exec llama-server \
+$ ./llama.cpp/build/bin/llama-bench -m /data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf -p 0 -n 128,256,512
- -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
- --host 0.0.0.0 --port 8012 \
- --verbosity $VERBOSITY \
- --threads-http 2 \
- --flash-attn on \
- --no-mmap \
- --cache-type-k q8_0 --cache-type-v q8_0 \
- --jinja \
- -c 96000
-common_params_print_info: build 9584 (e25a32e98) with GNU 15.2.0 for Linux x86_64
+ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
-log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
+  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
-device_info:
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
-  - CUDA0   : NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
-  - CPU     : Intel(R) Core(TM) Ultra 7 270K Plus (93508 MiB, 93508 MiB free)
+llama_bench: error: failed to load model '/data/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf'
-system_info: n_threads = 4 (n_threads_batch = 4) / 24 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+</code>
-srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
-...
-common_params_fit_impl: memory for test allocation by device:
-common_params_fit_impl: id=0, n_layer=49, n_part=24, overflow_type=3, mem= 14787 MiB
-common_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers (24 overflowing),  14678 MiB used,   1034 MiB free
-common_fit_params: successfully fit params to free device memory
-common_fit_params: fitting params to free memory took 6.76 seconds
-llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /data/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
-...
-load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
-load_tensors: offloading output layer to GPU
-load_tensors: offloading 47 repeating layers to GPU
-load_tensors: offloaded 49/49 layers to GPU
-load_tensors:          CPU model buffer size =   166.92 MiB
-load_tensors:        CUDA0 model buffer size =  9585.43 MiB
-load_tensors:    CUDA_Host model buffer size =  7939.00 MiB
-...
-llama_context: n_ctx_seq (96000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
-llama_context:  CUDA_Host  output buffer size =     2.32 MiB
-llama_kv_cache:      CUDA0 KV buffer size =  4781.25 MiB
-llama_kv_cache: size = 4781.25 MiB ( 96000 cells,  48 layers,  4/1 seqs), K (q8_0): 2390.62 MiB, V (q8_0): 2390.62 MiB
-...
-sched_reserve: resolving fused Gated Delta Net support:
-sched_reserve: fused Gated Delta Net (autoregressive) enabled
-sched_reserve: fused Gated Delta Net (chunked) enabled
-sched_reserve:      CUDA0 compute buffer size =   311.34 MiB
-sched_reserve:  CUDA_Host compute buffer size =   101.84 MiB
-sched_reserve: graph nodes  = 3606
-sched_reserve: graph splits = 70 (with bs=512), 50 (with bs=1)
-...
-srv    load_model: prompt cache is enabled, size limit: 8192 MiB
-...
-srv          init: init: chat template, thinking = 0
-srv  llama_server: model loaded
-srv  llama_server: server is listening on http://0.0.0.0:8012
-srv  update_slots: all slots are idle
-$ nvidia-smi
+=== Nemotron-Cascade-2-30B-A3B ===
-+-----------------------------------------------------------------------------------------+
-| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
-+-----------------------------------------+------------------------+----------------------+
-| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
-|                                         |                        |               MIG M. |
-|=========================================+========================+======================|
-|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
-|  0%   29C    P8              6W /  180W |   14856MiB /  16311MiB |      0%      Default |
-|                                         |                        |                  N/A |
-+-----------------------------------------+------------------------+----------------------+
-+-----------------------------------------------------------------------------------------+
+I tried small ''-ngl'' values, but the model still fails to load.
-| Processes:                                                                              |
-|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+<code>
-|        ID   ID                                                               Usage      |
+$ ./llama.cpp/build/bin/llama-bench -m /data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -p 0 -n 128,256,512
-|=========================================================================================|
+ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15849 MiB):
-|    0   N/A  N/A            2643      C   ...ma.cpp/build/bin/llama-server      14830MiB |
+  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
-+-----------------------------------------------------------------------------------------+
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+llama_bench: error: failed to load model '/data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf'
 </code>
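
For reference, the ''-ngl'' retries mentioned in these notes could be scripted roughly as below. This is only a sketch, not the exact procedure used here: the ''sweep_ngl'' helper and the list of layer counts are hypothetical, while the binary and model paths are copied from the log above. It stops at the first ''-ngl'' value for which ''llama-bench'' exits successfully.

```shell
#!/bin/sh
# Sketch: retry llama-bench with progressively fewer GPU layers.
# BENCH and MODEL default to the paths seen in the log above;
# sweep_ngl is a hypothetical helper, not part of llama.cpp.
BENCH=${BENCH:-./llama.cpp/build/bin/llama-bench}
MODEL=${MODEL:-/data/models/Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf}

sweep_ngl() {
    # Each argument is one -ngl value to try, in the order given.
    for ngl in "$@"; do
        echo "== -ngl $ngl =="
        if "$BENCH" -m "$MODEL" -ngl "$ngl" -p 0 -n 128; then
            echo "loaded with -ngl $ngl"
            return 0
        fi
    done
    echo "no -ngl value worked" >&2
    return 1
}
```

Calling ''sweep_ngl 24 16 8 0'' would try 24, 16, 8 and then 0 offloaded layers. If even ''-ngl 0'' fails, VRAM is probably not the limiting factor and the model file or the build may be worth checking instead.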