GPU Bench
- RTX AI Garage on the NVIDIA blog
AI benchmark for name extraction:
- with the Mistral service, Codestral model: 00d 01h 02m 48s
- RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0 model: 02d 16h 11m 34s
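The gap between the two runs above works out to roughly a 61× wall-clock difference; a quick check (plain Python, durations copied from the figures above):

```python
# Durations of the name-extraction benchmark, taken from the results above.
def to_seconds(days, hours, minutes, seconds):
    """Convert a d/h/m/s duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

codestral = to_seconds(0, 1, 2, 48)    # Mistral service, Codestral
granite = to_seconds(2, 16, 11, 34)    # RTX 3060 + Intel i7, granite-4.0-h-small-Q8_0

print(round(granite / codestral, 1))   # hosted service is ~61x faster here
```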
According to LeChat:
| Graphics card | TOPS (INT8) | TOPS (FP16) | Architecture |
|---|---|---|---|
| RTX 3060 (12 Go) | ~120 TOPS | ~60 TOPS | Ampere |
| RTX 5060 Ti (16 Go) | ~759 TOPS | ~380 TOPS | Blackwell |
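For perspective, the raw TOPS ratio between the two cards is about 6.3×, while the measured tg speedups in the llama-bench table below are only ~1.3×: token generation is largely memory-bandwidth-bound, so compute TOPS alone is a poor predictor. A quick comparison (Python; all numbers are taken from the tables on this page):

```python
# Claimed INT8 TOPS (table above) vs measured tg128 throughput
# (Qwen2.5-coder-7b-instruct-q5_k_m, llama-bench table below).
tops_3060, tops_5060ti = 120, 759
tg_3060, tg_5060ti = 57.65, 73.54

print(tops_5060ti / tops_3060)   # compute ratio, ~6.3x
print(tg_5060ti / tg_3060)       # observed generation ratio, ~1.3x
```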
llama.cpp bench:
- Text generation (tg128, tg256, tg512): -p 0 -n 128,256,512
- Prompt processing (b128, b256, b512): -p 1024 -n 0 -b 128,256,512
All values are tokens/second (t/s).

| model | test | i7-1360P | RTX 3060 | RTX 5060 Ti |
|---|---|---|---|---|
| Qwen2.5-coder-7b-instruct-q5_k_m (size: 5.07 GiB) | tg128 | 5.47 | 57.65 | 73.54 |
| | tg256 | … | 57.61 | 73.32 |
| | tg512 | … | 56.20 | 71.80 |
| | b128 | … | 1825.17 | 2840.57 |
| | b256 | … | 1924.10 | 3209.52 |
| | b512 | … | 1959.18 | 3271.22 |
| Qwen2.5-coder-7b-instruct-q8_0 (size: 7.54 GiB) | tg128 | … | 41.42 | 50.33 |
| | tg256 | … | 41.38 | 50.33 |
| | tg512 | … | 40.70 | 49.62 |
| | b128 | 13.98 | 1952.96 | 2972.52 |
| | b256 | … | 2054.09 | 3460.41 |
| | b512 | … | 2093.21 | 3511.29 |
| EuroLLM-9B-Instruct-Q4_0 (size: 4.94 GiB) | tg128 | … | 56.06 | 71.41 |
| | tg256 | … | 55.96 | 71.15 |
| | tg512 | … | 53.87 | 69.45 |
| | b128 | … | 1433.95 | CUDA error |
| | b256 | … | 1535.06 | … |
| | b512 | … | 1559.88 | … |
| Qwen3-14B-UD-Q5_K_XL (size: 9.82 GiB) | tg128 | … | 30.00 | 37.66 |
| | tg256 | … | 29.97 | 38.17 |
| | tg512 | … | 29.25 | 37.30 |
| | b128 | … | 903.97 | CUDA error |
| | b256 | … | 951.71 | … |
| | b512 | … | 963.76 | … |
| Qwen3-4B-UD-Q8_K_XL (size: 4.70 GiB) | tg128 | 7.37 | 56.35 | … |
| | tg256 | 6.63 | 56.35 | … |
| | tg512 | 6.24 | 54.56 | … |
| | b128 | 20.66 | 2163.17 | … |
| | b256 | … | 2405.27 | … |
| | b512 | … | 2495.35 | … |
| GemmaCoder3-12B-IQ4_NL.gguf (size: 6.41 GiB) | tg128 | … | 40.70 | … |
| | tg256 | … | 40.67 | … |
| | tg512 | … | 39.54 | … |
| | b128 | … | 1150.11 | … |
| | b256 | … | 1218.27 | … |
| | b512 | … | 1253.92 | … |
| Gemma3-Code-Reasoning-4B.Q8_0 (size: 3.84 GiB) | tg128 | … | 66.98 | … |
| | tg256 | … | 66.95 | … |
| | tg512 | … | 65.75 | … |
| | b128 | … | 2885.80 | … |
| | b256 | … | 3266.87 | … |
| | b512 | … | 3457.03 | … |
| GemmaCoder3-12B-Q5_K_M (size: 7.86 GiB) | tg128 | … | 34.10 | … |
| | tg256 | … | 34.06 | … |
| | tg512 | … | 33.28 | … |
| | b128 | … | 1045.27 | … |
| | b256 | … | 1108.95 | … |
| | b512 | … | 1144.97 | … |
- The “CUDA error” failures appear with the RTX 5060 Ti behind the “Wikingoo L17” PCIe/Thunderbolt 4 bridge, with NVIDIA driver 580.
- On CPU, leave the number of cores on automatic: the physical cores are then used. Forcing more threads lowers performance.
- Physical multi-threading does help. Example: 7.37 t/s in auto vs 3.39 t/s with 1 thread.
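The figures in the last note give the multi-thread gain directly (trivial Python check, values from the note above):

```python
# tokens/s on the i7-1360P with automatic thread count (physical cores)
# vs a single thread, from the note above.
auto_tps, single_tps = 7.37, 3.39
speedup = auto_tps / single_tps
print(round(speedup, 2))  # roughly 2.2x from using the physical cores
```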
Intel® Core™ i7-1360P 13th Gen
For comparison …
Qwen2.5-coder-7b-instruct-q5_k_m:
```
./llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128
load_backend: loaded RPC backend from /home/.../llama-b7109/libggml-rpc.so
load_backend: loaded CPU backend from /home/.../llama-b7109/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CPU        |       4 |           tg128 |          5.47 ± 0.72 |
```
Gigabyte Windforce OC 12GB Geforce RTX 3060
With sudo nsys-ui:
| NVIDIA GeForce RTX 3060 | |
|---|---|
| Chip Name | GA104 |
| SM Count | 28 |
| L2 Cache Size | 2.25 MiB |
| Memory Bandwidth | 335.32 GiB/s |
| Memory Size | 11.63 GiB |
| Core Clock | 1.79 GHz |
| Bus Location | 0000:05:00.0 |
| GSP firmware version | 580.105.08 |
| Video accelerator tracing | Supported |
With llama.cpp and CUDA 12.9.
Qwen2.5-coder-7b-instruct-q5_k_m
```
./build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         57.65 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         57.61 ± 0.03 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         56.24 ± 0.05 |
```
GemmaCoder3-12B-Q5_K_M
To launch llama-server with the “GemmaCoder3-12B-Q5_K_M.gguf” model (an 8.4 GB file made of 49 layers) at its maximum context of “131072” (--ctx-size 0 instead of the default “4096”), some layers must be offloaded to the CPU, otherwise loading fails with main: error: unable to load model. Note that the same applies to llama-cli.
| n-gpu-layers | test | tokens/s | time | % perf loss |
|---|---|---|---|---|
| (all) 49 | tg128 | 34.15 | 0m25.904s | 0.00% |
| | b128 | 1041.60 | 0m13.117s | 0.00% |
| 44 | tg128 | 15.55 | 0m48.049s | 54.47% |
| | b128 | 279.26 | 0m28.613s | 73.19% |
| 39 | tg128 | 10.74 | 1m07.555s | 68.55% |
| | b128 | 150.49 | 0m46.996s | 85.55% |
| 30 | tg128 | 6.83 | 1m42.221s | 80.01% |
| | b128 | 82.91 | 1m19.729s | 92.04% |
| full CPU | tg128 | 3.12 | 3m28.308s | 90.86% |
| | b128 | 4.50 | 22m37.674s | 99.57% |
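The “% perf loss” column is simply the throughput drop relative to the all-on-GPU baseline; it can be re-derived from the tokens/s column (Python, values from the table above):

```python
# Re-derive the "% perf loss" column from tokens/s,
# relative to the n-gpu-layers=49 (all on GPU) baseline.
def perf_loss(tps, baseline):
    """Percent throughput loss vs the baseline, rounded to 2 decimals."""
    return round((1 - tps / baseline) * 100, 2)

tg_baseline, b_baseline = 34.15, 1041.60
print(perf_loss(15.55, tg_baseline))   # n-gpu-layers=44, tg128
print(perf_loss(279.26, b_baseline))   # n-gpu-layers=44, b128
print(perf_loss(3.12, tg_baseline))    # full CPU, tg128
```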
Values that allow this model to load:

llama-cli:
- at the full 131072 context, 30 layers fit on the GPU: --n-gpu-layers 30, hence ~80% perf loss
- --ctx-size 70000 --n-gpu-layers 41
- for all layers on the GPU: --ctx-size 42000

llama-server:
- --ctx-size 40000 --n-gpu-layers 44
- --ctx-size 43500 --n-gpu-layers 43
- --ctx-size 52500 --n-gpu-layers 42

With --ctx-size 52500 --n-gpu-layers 42:
```
... NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes ...
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_embd_inp = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = gemma-3-12b-it-codeforces-SFT
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
...
print_info: max token length = 48
...
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 42/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1720.59 MiB
load_tensors: CUDA0 model buffer size = 6327.03 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 52736
llama_context: n_ctx_seq = 52736
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_seq (52736) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 52736 cells
llama_kv_cache: CPU KV buffer size = 412.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 2884.00 MiB
llama_kv_cache: size = 3296.00 MiB ( 52736 cells, 8 layers, 4/1 seqs), K (f16): 1648.00 MiB, V (f16): 1648.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
llama_kv_cache: CPU KV buffer size = 180.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 1260.00 MiB
llama_kv_cache: size = 1440.00 MiB ( 4608 cells, 40 layers, 4/1 seqs), K (f16): 720.00 MiB, V (f16): 720.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1307.32 MiB
llama_context: CUDA_Host compute buffer size = 120.02 MiB
llama_context: graph nodes = 1929
llama_context: graph splits = 94 (with bs=512), 27 (with bs=1)
```
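The KV-cache buffer sizes printed in the log can be cross-checked from the dimensions it reports: each cache holds cells × layers × n_embd_k_gqa half-precision values for K, and the same again for V (a sanity check in Python; all numbers come from the log above):

```python
# Cross-check the KV-cache sizes printed in the llama.cpp log above.
# MiB = cells * layers * n_embd_gqa * 2 bytes (f16), for K and likewise for V.
def kv_mib(cells, layers, n_embd_gqa, bytes_per_val=2):
    return cells * layers * n_embd_gqa * bytes_per_val / (1024 * 1024)

# non-SWA cache: 52736 cells, 8 layers, n_embd_k_gqa = 2048
print(kv_mib(52736, 8, 2048))   # K buffer, matches "K (f16): 1648.00 MiB"
# SWA cache: 4608 cells, 40 layers
print(kv_mib(4608, 40, 2048))   # K buffer, matches "K (f16): 720.00 MiB"
```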
PNY OC 16 Go Geforce RTX 5060 Ti
Qwen2.5-coder-7b-instruct-q5_k_m
```
$ ./llama.cpp/build/bin/llama-bench -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q5_k_m.gguf -p 0 -n 128,256,512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg128 |         73.54 ± 0.01 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg256 |         73.32 ± 0.40 |
| qwen2 7B Q5_K - Medium         |   5.07 GiB |     7.62 B | CUDA       |  99 |           tg512 |         71.80 ± 0.61 |

build: 3f3a4fb9c (7130)
```
Stability
Reset nvidia and CUDA:
$ sudo rmmod nvidia_uvm nvidia
- Lucie-7B_OpenLLM-France.Instruct-human-data.Q8_0.gguf
- Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
- CUDA0 model buffer size = 7605.33 MiB
- CUDA0 compute buffer size = 258.50 MiB
After two months of retries with grub and modprobe configs of all kinds, with help from forums and assistants (Claude, ChatGPT, LeChat), a solution appeared in this ticket: force the PCIe link to “Gen 3”.
```
# To get the PCI address "0000:05:00.0" of the RTX:
lspci | grep -i nvidia

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
    LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
    LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
    LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo setpci -s 0000:05:00.0 CAP_EXP+0xC.W=0x0003

sudo lspci -vvv -s 0000:05:00.0 | grep -i "LnkCap\|LnkSta"
    LnkCap: Port #0, Speed 32GT/s, Width x8, ASPM L1, Exit Latency L1 unlimited
    LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
    LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
```
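The setpci write above lands in the Link Control 2 register (offset 0xC into the PCIe capability), whose low 4 bits are the Target Link Speed field; per the PCIe Base Specification, encoding 3 selects 8 GT/s, i.e. Gen 3 (note the LnkSta capture shows the link then retrained to 2.5 GT/s, so requested and negotiated speeds can differ). A small lookup sketch of the spec encodings:

```python
# Target Link Speed encodings of the PCIe Link Control 2 register
# (low 4 bits of the word written by "setpci ... CAP_EXP+0xC.W=0x0003" above),
# per the PCIe Base Specification.
TARGET_LINK_SPEED = {
    1: "2.5 GT/s (Gen 1)",
    2: "5 GT/s (Gen 2)",
    3: "8 GT/s (Gen 3)",
    4: "16 GT/s (Gen 4)",
    5: "32 GT/s (Gen 5)",
}

written = 0x0003
print(TARGET_LINK_SPEED[written & 0xF])  # the value written requests Gen 3
```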
But no: it worked fine with llama-bench, yet not with YOLO:
```
kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
kernel: NVRM: GPU Board Serial Number: 0
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: Class 0xffff Subchannel 0x0 Mismatch
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x4041b0=0x3f20ffff
kernel: NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ESR 0x404000=0x80000002
kernel: NVRM: Xid (PCI:0000:05:00): 13, pid=6871, name=python3, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000100, Data deaddead
```
Translation
One sentence
Very fast with the RTX 3060.
./build/bin/llama-cli -m ~/Data/AI_Models/EuroLLM-9B-Instruct-Q4_0.gguf -i
Prompt:
french to english “C'est une occasion rare pour les élèves de pouvoir échanger sur les parcours de nos anciens élèves et ainsi, de répondre à de nombreuses questions voire à lever des freins.”
Replies:
EuroLLM-9B-Instruct:
“This is a rare opportunity for students to be able to discuss the career paths of our former students and, in doing so, to answer many questions and even remove some of the barriers they may be facing.”
EuroLLM-9B-q8_0:
The provided text is in French and needs to be translated into English.
I'll translate it step-by-step:
1. Identify the source language: The text is in French.
2. Analyze the text: The text appears to be a sentence describing an event or opportunity. It mentions “élèves” (students), “anciens élèves” (former students), and “lever des freins” (lifting barriers or obstacles).
3. Translate each part:
* “C'est une occasion rare pour les élèves de pouvoir échanger sur les parcours de nos anciens élèves” translates to “This is a rare opportunity for students to be able to discuss the journeys of our former students.”
* “et ainsi, de répondre à de nombreuses questions voire à lever des freins” translates to “and thus, to answer many questions and even to lift barriers.”
4. Combine the translated parts: The complete translation is: “This is a rare opportunity for students to be able to discuss the journeys of our former students and thus, to answer many questions and even to lift barriers.”
Qwen2.5-coder-7b-instruct-q5_k_m:
Here's the English translation:
“It's a rare opportunity for students to exchange on the paths of our former students, thus being able to answer many questions or even overcome obstacles.”
This translation conveys the meaning that the situation being described is unusual and provides students with a valuable learning experience by allowing them to share and discuss the career paths of alumni, which can help them answer questions and potentially resolve challenges they may face in their own professional development.
Qwen2.5-coder-7b-instruct-q8_0:
Here's the English translation of the French sentence:
“This is a rare opportunity for students to exchange about the paths of our former students, and thus, to answer many questions or even to overcome obstacles.”
This translation preserves the meaning and tone of the original French statement, highlighting the value of this unique educational experience.
Qwen3-14B-UD-Q5_K_XL:
<think>
Okay, let's tackle this French to English translation. The user provided the sentence: “C'est une occasion rare pour les élèves de pouvoir échanger sur les parcours de nos anciens élèves et ainsi, de répondre à de nombreuses questions voire à lever des freins.”
First, I need to understand the structure and meaning. The sentence starts with “C'est une occasion rare” which translates to “It's a rare opportunity.” The subject is “les élèves” (the students), and the verb is “pouvoir échanger” (to be able to exchange). The object of the exchange is “les parcours de nos anciens élèves” (the paths/careers of our alumni). Then, the purpose is “de répondre à de nombreuses questions voire à lever des freins” (to answer many questions or even to remove barriers).
Now, I should check for any nuances. “Parcours” here likely refers to career paths or journeys, not just physical paths. “Anciens élèves” is alumni, so “our alumni's paths” makes sense. “Répondre à de nombreuses questions” is straightforward: answer many questions. “Voire à lever des freins” – “voire” is like “or even,” so “or even to remove barriers.”
I need to make sure the flow is natural in English. Maybe “a rare opportunity for students to exchange about the career paths of our alumni, and thus answer many questions or even overcome barriers.”
Wait, “échanger sur” could be “discuss” instead of “exchange about.” Also, “lever des freins” – “freins” can mean barriers or brakes. In this context, it's more about barriers. “Overcome barriers” sounds better than “remove barriers.”
Putting it all together: “It's a rare opportunity for students to discuss the career paths of our alumni and thus answer many questions or even overcome barriers.”
I should check if “thus” is the best conjunction. “And thus” or “and thereby” might work. Alternatively, “thereby” could be more formal. Let me see. “And thus” is okay, but maybe “and thereby” sounds more academic.
Another point: “pouvoir échanger” is “to be able to exchange,” but in English, “to exchange” might not be the best term here. “Discuss” or “talk about” is more natural.
So revised: “It's a rare opportunity for students to discuss the career paths of our alumni and thereby answer many questions or even overcome barriers.”
That seems accurate. Let me make sure there are no other nuances. The original sentence emphasizes that this opportunity is rare, which is captured. The purpose is to answer questions and overcome barriers, which is clear. I think that's a solid translation.
</think>
It's a rare opportunity for students to discuss the career paths of our alumni and thereby answer many questions or even overcome barriers.

