Artificial intelligence (AI) models, from simple regression algorithms to the complex neural networks used in deep learning, run on mathematical logic. All data consumed by an AI model, including unstructured data such as text, audio or images, must be expressed in numerical form. Vector embedding (vector representation) is a method that converts an unstructured data point into an array of numbers while preserving the original meaning of the data.
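To get a concrete feel for what such a vector looks like, a local llama-server (covered later in these notes) can be queried for embeddings. A minimal sketch, assuming the server was started with an embedding-capable GGUF model and its embeddings endpoint enabled (the --embeddings flag in recent llama.cpp builds); the "model" field is only there to satisfy the OpenAI-style schema:
$ curl http://127.0.0.1:8012/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "input": "Vector embeddings turn text into numbers"}'
# the response contains an array of floats: the numerical representation of the sentence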
Articles:
Other pages:
Open model classification: Foundation models by IBM
How to make an LLM call return a reproducible result from one run to the next?
Hugging Face, a French company founded in 2016 → "L'IA open source par Hugging Face" - Gen AI Nantes 2024-01 by Julien Simon
Launch an opencode server:
opencode serve --port=30781 --print-logs --log-level DEBUG
Then send the prompt "Explain async/await in JavaScript" with:
time opencode run -m <ProviderId/ModelId> --attach=http://127.0.0.1:30781 --agent=plan "Explain async/await in JavaScript"
👾 Warning, the results can vary a lot depending on:
- the context, which has a big impact on the size/quality of the response …
- the system message prompt is selected by opencode …
Hailo
Axelera
seeedstudio
Ollama & Nvidia Jetpack
Nvidia
| | A 10 | A 30 | A 40 | A 100 SXM4 | A 800 | H 100 SXM5 |
|---|---|---|---|---|---|---|
| eBay price | $2,330 | $3,999 | $9,950 | $4,000 | $20,000 | $20,000 |
| Architecture | Ampere | Ampere | Ampere | Ampere | Ampere | Hopper |
| Code name | GA102 | GA100 | GA102 | GA100 | GA100 | GH100 |
| Launch date | 2021-04 | 2021-04 | 2020-10 | 2020-05 | 2022-11 | 2022-03 |
| Maximum RAM | 24 GB | 24 GB | 48 GB | 40 GB | 40 GB | 96 GB |
| Memory type | GDDR6 | HBM2e | GDDR6 | HBM2e | HBM2e | HBM3 |
| Memory bandwidth | 600.2 GB/s | 933.1 GB/s | 695.8 GB/s | 1,555 GB/s | 1,560 GB/s | 1,681 GB/s |
| Memory bus width | 384 bit | 3072 bit | 384 bit | 5120 bit | 5120 bit | 5120 bit |
| Memory clock speed | 1563 MHz | 1215 MHz | 1812 MHz | 1215 MHz | 1215 MHz | 1313 MHz |
| Core clock speed | 885 MHz | 930 MHz | 1305 MHz | 1095 MHz | 765 MHz | 1837 MHz |
| Boost clock speed | 1695 MHz | 1440 MHz | 1740 MHz | 1410 MHz | 1410 MHz | 1665 MHz |
| Peak half precision (FP16) | 31.24 TFLOPS (1:1) | 10.32 TFLOPS (1:1) | 37.42 TFLOPS (1:1) | 77.97 TFLOPS (4:1) | | |
| Pipelines | 9216 | 3584 | 10752 | 6912 | 6912 | 16896 |
| Thermal Design Power | 150 W | 165 W | 300 W | 400 W | 250 W | 700 W |
| OpenCL | 3.0 | 3.0 | 3.0 | 3.0 | | |
Nvidia
Tips: resetting nvidia and CUDA:
# power off the card
# unplug the THB
$ sudo rmmod nvidia_uvm nvidia
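To reload the modules afterwards (assuming the standard module names used above):
$ sudo modprobe nvidia && sudo modprobe nvidia_uvm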
Also known as "GPU enclosures". Requires a Thunderbolt 3 or 4 port (or, soon, Thunderbolt 5).
eGPU docks
Accelerating Machine Learning on a Linux Laptop with an External GPU by NVidia (Setting up Ubuntu to use NVIDIA eGPU)
https://github.com/ggml-org/llama.cpp
Start the server with a local model:
./bin/llama-server -m devstralQ5_K_M.gguf --port 8012 --jinja --ctx-size 20000
./bin/llama-server --port 8012 --chatml -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q8_0.gguf --ctx-size 48000
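Once the server is listening, it can be exercised with a plain HTTP request against its OpenAI-compatible endpoint (a minimal sketch; the "model" field is essentially ignored by llama-server but required by the schema):
$ curl http://127.0.0.1:8012/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "messages": [{"role": "user", "content": "Explain async/await in JavaScript"}], "temperature": 0.2}'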
What's new, winter 2025-26:
- --n-gpu-layers
- empty `content` and a completely stuffed `reasoning_content`; using the --cache-ram 0 option seems to fix these crashes
- What about chat formats? Is it tied to the model?
--jinja / --chatml:
$ llama-server --help
...
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's
metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3,
exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2,
hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys,
llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm,
mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7,
mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3,
phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca,
yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
...
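For example, to force one of the built-in templates listed above (a sketch; the model file name is a placeholder):
$ ./bin/llama-server -m model.gguf --chat-template chatml --port 8012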
Models:
$ ./bin/llama-server --jinja -m ./Qwen3-Coder-30B-A3B-Instruct-Q5_K_S.gguf
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
Widening the "context window":
--rope-scaling {none,linear,yarn}   RoPE frequency scaling method, defaults to linear unless specified by the model
--rope-scale N                      RoPE context scaling factor, expands context by a factor of N
--yarn-orig-ctx N                   YaRN: original context size of model (default: 0 = model training context size)
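For example, to run a model beyond its training context with YaRN (a sketch; the file name and the 32k training context are placeholders, the real values depend on the model):
$ ./bin/llama-server -m model.gguf --ctx-size 65536 \
    --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768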
llama.cpp must be compiled with CUDA, with a version >= 11.7 for syntax compatibility. I installed CUDA from the Nvidia CUDA repository, with CUDA toolkit 13:
$ sudo cat /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /
My most recent install:
sudo apt install nvidia-headless-590-open nvidia-utils-590 nvidia-cuda-toolkit nvidia-cuda-dev

Package: nvidia-headless-590-open
Version: 590.48.01-0ubuntu0.24.04.1
APT-Sources: http://fr.archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages

Package: nvidia-cuda-toolkit
Version: 12.0.140~12.0.1-4build4
APT-Sources: http://fr.archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages

# I don't understand: I do have a /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
# which points at /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list
Optional, or to be specified for the cmake build:
export PATH=$PATH:/usr/local/cuda-<version>/bin/
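To check which nvcc ends up being picked up:
$ nvcc --version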
Then comes a long compilation:
# DCMAKE_CUDA_ARCHITECTURES :
# CUDA GPU Compute Capability https://developer.nvidia.com/cuda-gpus
# RTX 3060 : 86
# RTX 5060 : 120
$ export CUDA_VERSION=12.9 && cmake -B build -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;120" \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-${CUDA_VERSION}/bin/nvcc \
-DCMAKE_INSTALL_RPATH="/usr/local/cuda-${CUDA_VERSION}/lib64;\$ORIGIN"
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- CUDA Toolkit found
-- Using CUDA architectures: 86;120
-- CUDA host compiler is GNU 13.3.0
-- Including CUDA backend
-- ggml version: 0.9.4
-- ggml commit: 6016d0bd4
-- Configuring done (0.5s)
-- Generating done (0.2s)
-- Build files have been written to: /home/cyrille/Code/bronx/AI_Coding/llama.cpp/build
$ time cmake --build build --config Release -j 10
# host: i7-1360P + SSD
...
real 44m35,149s
user 42m38,100s
sys 1m51,594s
...
# With `-j 10` (concurrent tasks)
real 11m6,449s
user 104m56,615s
sys 3m45,431s
# More recently
real 6m35,663s
user 61m37,436s
sys 2m37,613s
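Once built, a quick way to check that the CUDA backend is actually used is to offload layers to the GPU (a sketch; the model path reuses the one above, and -ngl 99 offloads as many layers as possible):
$ ./build/bin/llama-server -m ~/Data/AI_Models/Qwen2.5-coder-7b-instruct-q8_0.gguf -ngl 99 --port 8012
# the startup log should report how many layers were offloaded to the GPU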
With CUDA 13.1, llama.cpp crashes immediately on the first request, but with no message in syslog: so it is not the driver but llama.cpp itself that does not support this CUDA version:
/home/cyrille/Code/bronx/AI_Coding/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_mul_mat_q at /home/cyrille/Code/bronx/AI_Coding/llama.cpp/ggml/src/ggml-cuda/mmq.cu:179
Linux oneAPI toolkit
By default, intel-oneapi-toolkit pulls in all of this:
intel-oneapi-ccl-2022.0 intel-oneapi-ccl-devel intel-oneapi-ccl-devel-2022.0 intel-oneapi-common-licensing intel-oneapi-common-licensing-2026.0 intel-oneapi-common-oneapi-vars intel-oneapi-common-oneapi-vars-2026.0 intel-oneapi-common-vars intel-oneapi-compiler-cpp-eclipse-cfg-2026.0 intel-oneapi-compiler-dpcpp-cpp intel-oneapi-compiler-dpcpp-cpp-2026.0 intel-oneapi-compiler-dpcpp-cpp-common-2026.0 intel-oneapi-compiler-dpcpp-cpp-runtime-2026.0 intel-oneapi-compiler-dpcpp-eclipse-cfg-2026.0 intel-oneapi-compiler-fortran-2026.0 intel-oneapi-compiler-fortran-common-2026.0 intel-oneapi-compiler-fortran-runtime-2026.0 intel-oneapi-compiler-shared-2026.0 intel-oneapi-compiler-shared-common-2026.0 intel-oneapi-compiler-shared-runtime-2026.0 intel-oneapi-dev-utilities intel-oneapi-dev-utilities-2026.0 intel-oneapi-dev-utilities-eclipse-cfg-2026.0 intel-oneapi-dnnl-2026.0 intel-oneapi-dnnl-devel intel-oneapi-dnnl-devel-2026.0 intel-oneapi-dpcpp-cpp-2026.0 intel-oneapi-dpcpp-debugger-2026.0 intel-oneapi-icc-eclipse-plugin-cpp-2026.0 intel-oneapi-ipp-2026.0 intel-oneapi-ipp-devel intel-oneapi-ipp-devel-2026.0 intel-oneapi-ippcp-2026.0 intel-oneapi-ippcp-devel intel-oneapi-ippcp-devel-2026.0 intel-oneapi-libdpstd-devel-2022.12 intel-oneapi-mkl-classic-devel-2026.0 intel-oneapi-mkl-classic-include-2026.0 intel-oneapi-mkl-cluster-2026.0 intel-oneapi-mkl-cluster-devel-2026.0 intel-oneapi-mkl-core-2026.0 intel-oneapi-mkl-core-devel-2026.0 intel-oneapi-mkl-devel intel-oneapi-mkl-devel-2026.0 intel-oneapi-mkl-sycl-2026.0 intel-oneapi-mkl-sycl-blas-2026.0 intel-oneapi-mkl-sycl-data-fitting-2026.0 intel-oneapi-mkl-sycl-devel-2026.0 intel-oneapi-mkl-sycl-dft-2026.0 intel-oneapi-mkl-sycl-include-2026.0 intel-oneapi-mkl-sycl-lapack-2026.0 intel-oneapi-mkl-sycl-rng-2026.0 intel-oneapi-mkl-sycl-sparse-2026.0 intel-oneapi-mkl-sycl-stats-2026.0 intel-oneapi-mkl-sycl-vm-2026.0 intel-oneapi-mpi-2021.18 intel-oneapi-mpi-devel intel-oneapi-mpi-devel-2021.18 intel-oneapi-openmp-2026.0 intel-oneapi-openmp-common-2026.0 intel-oneapi-tbb-2023.0 intel-oneapi-tbb-devel intel-oneapi-tbb-devel-2023.0 intel-oneapi-tcm-1.5 intel-oneapi-tlt intel-oneapi-tlt-2026.0 intel-oneapi-toolkit intel-oneapi-toolkit-env-2026.0 intel-oneapi-toolkit-getting-started-2026.0 intel-oneapi-umf-1.1 intel-oneapi-vtune
$ source /opt/intel/oneapi/setvars.sh
$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-1360P OpenCL 3.0 (Build 0) [2026.21.3.0.31_160000]
In fact that does not work, because:
$ ./llama-ls-sycl-device
./llama-ls-sycl-device: error while loading shared libraries: libsycl.so.8: cannot open shared object file: No such file or directory
# Version problem 😩
$ find /opt/intel/oneapi -name "libsycl.so*"
/opt/intel/oneapi/2026.0/lib/libsycl.so.9.0.0
/opt/intel/oneapi/2026.0/lib/libsycl.so.9.0.0-gdb.py
/opt/intel/oneapi/2026.0/lib/libsycl.so
/opt/intel/oneapi/2026.0/lib/libsycl.so.9
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9.0.0
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9.0.0-gdb.py
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so
/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so.9
OK, on to compiling as explained at https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#ii-build-llamacpp so that the binary uses the SYCL version installed by intel-oneapi-toolkit.
./examples/sycl/build.sh
The build completes without errors, but … "what(): can not find preferred GPU platform" 😩
$ ./build/bin/llama-ls-sycl-device
# same with $ ./build/bin/llama-bench -p 0 -n 128,256,512
[New LWP 35410]
[New LWP 35409]
[New LWP 35408]
[New LWP 35407]
[New LWP 35406]
[New LWP 35405]
[New LWP 35404]
[New LWP 35403]
[New LWP 35402]
[New LWP 35401]
[New LWP 35400]
[New LWP 35399]
[New LWP 35398]
[New LWP 35397]
[New LWP 35396]
This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
...
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce nom
#0  0x000079304a910813 in __GI___wait4 (pid=35411, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000079304e48aa1a in ggml_print_backtrace () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
#2  0x000079304e4a3d76 in ggml_uncaught_exception() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-base.so.0
#3  0x000079304acbb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000079304aca5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000079304acbb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x000079304b19e765 in dpct::dev_mgr::dev_mgr() () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
#7  0x000079304b16e8f3 in ggml_backend_sycl_print_sycl_devices () from /home/cyrille/Code/bronx/AI_Coding/llama.cpp-SYCL/build/bin/libggml-sycl.so.0
#8  0x0000000000405527 in main ()
[Inferior 1 (process 35394) detached]
terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
PLEASE submit a bug report to https://software.intel.com/en-us/support/priority-support and include the crash backtrace and instructions to reproduce the bug.
Abandon (core dumped)
Then, after a reboot, it works. Performance: 2.6× faster than without SYCL (36.34 vs 13.94).
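For the record, the SYCL build is run with the oneAPI environment loaded (a sketch based on the upstream SYCL docs; the model path is a placeholder, and the ONEAPI_DEVICE_SELECTOR value assumes a single Level Zero GPU and is only needed to pin a specific device):
$ source /opt/intel/oneapi/setvars.sh
$ ONEAPI_DEVICE_SELECTOR=level_zero:0 ./build/bin/llama-server -m model.gguf -ngl 99 --port 8012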
- https://ollama.com
- https://github.com/ollama/ollama
Chat & build with open models.
A user interface to manage and run models locally; it uses llama.cpp under the hood.
On Linux, it installs a systemd service.
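Typical usage once the service is running (a sketch; the model tag is just an example from the Ollama library):
$ ollama pull qwen2.5-coder:7b
$ ollama run qwen2.5-coder:7b "Explain async/await in JavaScript"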
A single self-contained distributable that builds off llama.cpp and adds many additional powerful features
vLLM is an open-source library optimized to serve LLMs efficiently in production, unlike llama.cpp, which targets development or solo use on standard hardware (RTX or CPU).
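A minimal way to try it (a sketch; the model id is an example from the Hugging Face hub, and vllm serve exposes an OpenAI-compatible API, by default on port 8000):
$ pip install vllm
$ vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000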