informatique:egpu

Differences

Below are the differences between two revisions of the page.

informatique:egpu [24/04/2026 10:55] – [Nvidia] cyrille
informatique:egpu [24/04/2026 11:52] (current version) – [Update 2026-04] cyrille
Line 57: Line 57:
 sudo apt upgrade
 > ... Building initial module nvidia/595.58.03 for 6.17.0-22-generic ...
 +# Oops, remember to remove version 590
 +sudo apt purge nvidia-utils-590 nvidia-driver-590-open nvidia-dkms-590-open nvidia-compute-utils-590
 </code>
 +
 +After installation, check in ''/etc/modprobe.d/nvidia-graphics-drivers-kms.conf'' that ''options nvidia_drm modeset=0'' is set: the default is ''1'', in which case Xorg keeps a process on the RTX, visible with ''nvidia-smi''.
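
The check on ''modeset'' can be scripted. The following is a minimal sketch, assuming the file follows the usual modprobe.d syntax (`options <module> key=value ...`, with `#` starting a comment line). The function name `nvidia_drm_modeset` is hypothetical and is not part of any NVIDIA package.

```python
import re
from pathlib import Path
from typing import Optional

# Path taken from the note above.
KMS_CONF = Path("/etc/modprobe.d/nvidia-graphics-drivers-kms.conf")

# Matches "options nvidia_drm ..." (also accepts the "nvidia-drm" spelling).
# Comment lines start with '#', so they never match.
_OPTIONS_RE = re.compile(r"^\s*options\s+nvidia[-_]drm\s+(.*)$")


def nvidia_drm_modeset(conf_text: str) -> Optional[str]:
    """Return the last modeset value set for nvidia_drm, or None if unset."""
    value = None
    for line in conf_text.splitlines():
        match = _OPTIONS_RE.match(line)
        if match is None:
            continue
        for token in match.group(1).split():
            key, _, val = token.partition("=")
            if key == "modeset":
                value = val  # a later line overrides an earlier one
    return value


if __name__ == "__main__":
    if KMS_CONF.exists():
        print("nvidia_drm modeset =", nvidia_drm_modeset(KMS_CONF.read_text()))
    else:
        print(KMS_CONF, "not found")
```

The sketch only reads this one file. Other files under ''/etc/modprobe.d/'' or kernel command-line parameters may also set the option, so a value of ''0'' here is a hint rather than proof.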
 +
 +
 +Connecting the RTX over Thunderbolt (THB)
 +<code>
 +kernel: thunderbolt 0-1: new device found, vendor=0x215 device=0x41
 +kernel: thunderbolt 0-1: TB4 HOME TB4 eGFX
 +boltd[1096]: [c9030000-0080-TB4 eGFX                   ] parent is 7dbb8780-4047...
 +...
 +kernel: nvidia: loading out-of-tree module taints kernel.
 +kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
 +kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 508
 +kernel: 
 +kernel: nvidia 0000:05:00.0: enabling device (0000 -> 0003)
 +kernel: nvidia 0000:05:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
 +kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  595.58.03  Release Build  (dvs-builder@U22-I3-AM25-28-3)  Tue Mar 17 19:55:10 UTC 2026
 +systemd[2149]: Reached target sound.target - Sound Card.
 +kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  595.58.03  Release Build  (dvs-builder@U22-I3-AM25-28-3)  Tue Mar 17 19:39:14 UTC 2026
 +kernel: [drm] [nvidia-drm] [GPU ID 0x00000500] Loading driver
 +kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:05:00.0 on minor 0
 +kernel: nvidia 0000:05:00.0: [drm] Cannot find any crtc or sizes
 +systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
 +nvidia-persistenced[4123]: Verbose syslog connection opened
 +nvidia-persistenced[4123]: Now running with user ID 124 and group ID 127
 +nvidia-persistenced[4123]: Started (4123)
 +nvidia-persistenced[4123]: device 0000:05:00.0 - registered
 +nvidia-persistenced[4123]: Local RPC services initialized
 +systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.
 +boltd[1120]: probing: timeout, done: [2644839] (2000000)
 +...
 +</code>
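
To confirm that the module actually loaded is 595 and not a leftover 590, the version can be pulled out of the NVRM load line. This is a minimal sketch: the regular expression is written against the exact line shape shown in the log above (open kernel module), and other driver flavours may log a different format. The name `loaded_driver_version` is hypothetical.

```python
import re
from typing import Optional

# Written against this line from the log above:
#   NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  595.58.03  Release Build ...
_NVRM_LOAD_RE = re.compile(
    r"NVRM: loading NVIDIA UNIX .*?Kernel Module for \S+\s+(\d+(?:\.\d+)+)"
)


def loaded_driver_version(log_text: str) -> Optional[str]:
    """Return the driver version from the first NVRM load line, or None."""
    match = _NVRM_LOAD_RE.search(log_text)
    return match.group(1) if match else None
```

The function takes the log as a string, for example the captured output of ''journalctl -k''.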
 +
 +Test with a freshly built llama.cpp, compiled with CUDA_ARCHITECTURES=120 and CUDA 12.9.
 +  * 🚀 A few questions in the llama.cpp chat: OK
 +  * 🚀 Code refactoring with ''opencode'' and the ''gpt-oss-20b-UD-Q4_K_XL.gguf'' model: OK
 +  * 😩 Detection over a loop of images with Yolo26: fail, crashes after a certain number of iterations
 +    * **Xid 79 "GPU has fallen off the bus"**
 +
 +<code>
 +kernel: NVRM: GPU at PCI:0000:05:00: GPU-ab296f23-e6a6-a23b-b6c1-33f9b813df84
 +kernel: NVRM: GPU Board Serial Number: 0
 +kernel: NVRM: Xid (PCI:0000:05:00): 79, pid=8699, name=python3, GPU has fallen off the bus.
 +kernel: NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
 +kernel: NVRM: GPU 0000:05:00.0: GPU serial number is 0.
 +kernel: NVRM: GPU0 krcRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
 +kernel: NVRM: GPU0 _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
 +...
 +kernel: NVRM: GPU0 _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78 sequence 2397!
 +kernel: NVRM: GPU0 nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
 +...
 +</code>
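
When the crash only shows up after many iterations, it helps to spot Xid events in the kernel log automatically. The following is a minimal sketch, assuming the Xid line keeps the shape shown in the log above. The names `XidEvent`, `parse_xid` and `scan` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class XidEvent(NamedTuple):
    pci: str     # PCI address as printed by NVRM, e.g. "0000:05:00"
    code: int    # Xid number, e.g. 79
    detail: str  # rest of the line (pid, process name, message)


# Written against this line from the log above:
#   NVRM: Xid (PCI:0000:05:00): 79, pid=8699, name=python3, GPU has fallen off the bus.
_XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)$")


def parse_xid(line: str) -> Optional[XidEvent]:
    """Parse one kernel log line; return None if it is not an Xid report."""
    match = _XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match.group(1), int(match.group(2)), match.group(3).strip())


def scan(lines: Iterable[str]) -> List[XidEvent]:
    """Collect every Xid event found in an iterable of log lines."""
    events = []
    for line in lines:
        event = parse_xid(line)
        if event is not None:
            events.append(event)
    return events
```

For example, `scan(open("kernel.log"))` on a saved copy of the kernel log returns one `XidEvent` per Xid line. The file name ''kernel.log'' is only an illustration.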
 +
 +So I am moving on to the new attempt suggested in [[https://github.com/NVIDIA/open-gpu-kernel-modules/issues/974#issuecomment-4311518502|RTX 5060 Ti eGPU unable to init, falls of the bus immediately]].
  
 ===== Nvidia =====
Line 89: Line 145:
 The RTX 3060 works well with version 580 ''nvidia-headless-580-open, nvidia-dkms-580-open''
  
-==== nvidia-headless-575-open ==== 
- 
-<code> 
-$ sudo apt install nvidia-headless-575-open 
-The following NEW packages will be installed:
-libnvidia-cfg1-575 libnvidia-compute-575 libnvidia-decode-575 libnvidia-gpucomp-575 nvidia-compute-utils-575 nvidia-dkms-575-open nvidia-firmware-575 nvidia-headless-575-open nvidia-headless-no-dkms-575-open nvidia-kernel-common-575 nvidia-kernel-source-575-open nvidia-persistenced 
-</code> 
- 
-<code> 
-ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version 
-</code> 
- 
-==== nvidia-uvm ==== 
- 
-<code> 
-$ modinfo nvidia-uvm 
- 
-filename:       /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-uvm.ko.zst 
-version:        580.126.09 
-supported:      external 
-license:        Dual MIT/GPL 
-srcversion:     B7E9DECF7BD1D315EBCCCF0 
-depends:        nvidia 
-name:           nvidia_uvm 
-retpoline:      Y 
-vermagic:       6.14.0-37-generic SMP preempt mod_unload modversions  
-sig_id:         PKCS#7 
-signer:         NS5x-NS7xAU Secure Boot Module Signature key 
-sig_key:        3B:82:8F:E4:B9:99:2E:1F:E5:76:9C:33:AC:26:A9:F0:0A:1A:E3:46 
-sig_hashalgo:   sha512 
-signature:      66:E9:9A:75:7C:2D:5B:1C:56:B9:CD:CE:E4:64:3B:5F:66:BB:F3:B2: 
- 8F:E8:34:44:62:FD:02:32:A3:27:A8:EA:20:BB:BA:87:6F:F7:F8:6E: 
- F5:27:67:07:97:55:39:39:B2:7E:DE:01:F1:E5:64:AF:3A:29:98:90: 
- 8D:A3:7A:0C:D9:D2:60:A8:15:C1:55:6E:F1:53:FE:85:D2:07:54:12: 
- B0:A4:D5:76:96:D4:A9:5F:85:B4:75:18:B4:38:A2:8B:15:3D:8C:8B: 
- F3:0A:AA:1E:F6:81:F1:27:CC:1E:22:EC:E6:72:BC:DC:3A:FD:39:2F: 
- F4:BF:DE:47:38:7E:1D:FE:04:D1:29:24:AD:CB:46:44:7F:4F:62:67: 
- 38:FA:96:10:58:47:02:C8:65:05:67:7A:53:A6:70:76:A1:10:39:56: 
- 0B:B3:5F:98:E2:D3:F1:FC:7E:85:02:E0:37:04:E4:91:E6:7D:92:25: 
- FE:3E:CD:0F:E1:26:B8:78:FA:C6:DB:AD:AA:CB:A9:22:2E:E7:20:DA: 
- 91:46:FC:14:EB:54:54:B4:AF:1D:66:72:9B:C2:99:18:1B:57:77:14: 
- FD:65:14:B0:96:A5:0A:78:A4:AA:E2:F3:49:96:85:53:A3:28:50:C9: 
- E4:74:89:65:C7:24:19:BC:AF:4C:15:5E:55:8C:53:CC 
-parm:           uvm_conf_computing_channel_iv_rotation_limit:ulong 
-parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int) 
-parm:           uvm_perf_prefetch_enable:uint 
-parm:           uvm_perf_prefetch_threshold:uint 
-parm:           uvm_perf_prefetch_min_faults:uint 
-parm:           uvm_perf_thrashing_enable:uint 
-parm:           uvm_perf_thrashing_threshold:uint 
-parm:           uvm_perf_thrashing_pin_threshold:uint 
-parm:           uvm_perf_thrashing_lapse_usec:uint 
-parm:           uvm_perf_thrashing_nap:uint 
-parm:           uvm_perf_thrashing_epoch:uint 
-parm:           uvm_perf_thrashing_pin:uint 
-parm:           uvm_perf_thrashing_max_resets:uint 
-parm:           uvm_perf_map_remote_on_native_atomics_fault:uint 
-parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool) 
-parm:           uvm_perf_migrate_cpu_preunmap_enable:int 
-parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint 
-parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int) 
-parm:           uvm_perf_pma_batch_nonpinned_order:uint 
-parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint) 
-parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int) 
-parm:           uvm_force_prefetch_fault_support:uint 
-parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint) 
-parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint) 
-parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp) 
-parm:           uvm_perf_access_counter_migration_enable:Whether access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int) 
-parm:           uvm_perf_access_counter_batch_count:uint 
-parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint) 
-parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint 
-parm:           uvm_perf_fault_batch_count:uint 
-parm:           uvm_perf_fault_replay_policy:uint 
-parm:           uvm_perf_fault_replay_update_put_ratio:uint 
-parm:           uvm_perf_fault_max_batches_per_service:uint 
-parm:           uvm_perf_fault_max_throttle_per_service:uint 
-parm:           uvm_perf_fault_coalesce:uint 
-parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int) 
-parm:           uvm_perf_map_remote_on_eviction:int 
-parm:           uvm_block_cpu_to_cpu_copy_with_ce:Use GPU CEs for CPU-to-CPU migrations. (int) 
-parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint) 
-parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint) 
-parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint) 
-parm:           uvm_channel_num_gpfifo_entries:uint 
-parm:           uvm_channel_gpfifo_loc:charp 
-parm:           uvm_channel_gpput_loc:charp 
-parm:           uvm_channel_pushbuffer_loc:charp 
-parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int) 
-parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int) 
-parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp) 
-parm:           uvm_debug_prints:Enable uvm debug prints. (int) 
-parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int) 
-parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int) 
-parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int) 
-parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int) 
- 
-$ systool -m nvidia_uvm -v 
- 
-Module = "nvidia_uvm" 
-  Attributes: 
-    coresize            = "2154496" 
-    initsize            = "0" 
-    initstate           = "live" 
-    refcnt              = "4" 
-    srcversion          = "B7E9DECF7BD1D315EBCCCF0" 
-    taint               = "OE" 
-    uevent              = <store method only> 
-    version             = "580.126.09" 
-  Parameters: 
-    uvm_ats_mode        = "1" 
-    uvm_block_cpu_to_cpu_copy_with_ce= "0" 
-    uvm_channel_gpfifo_loc= "auto" 
-    uvm_channel_gpput_loc= "auto" 
-    uvm_channel_num_gpfifo_entries= "1024" 
-    uvm_channel_pushbuffer_loc= "auto" 
-    uvm_conf_computing_channel_iv_rotation_limit= "2147483648" 
-    uvm_cpu_chunk_allocation_sizes= "2166784" 
-    uvm_debug_enable_push_acquire_info= "0" 
-    uvm_debug_enable_push_desc= "0" 
-    uvm_debug_prints    = "0" 
-    uvm_disable_hmm     = "Y" 
-    uvm_downgrade_force_membar_sys= "1" 
-    uvm_enable_builtin_tests= "0" 
-    uvm_enable_debug_procfs= "0" 
-    uvm_enable_va_space_mm= "1" 
-    uvm_exp_gpu_cache_peermem= "0" 
-    uvm_exp_gpu_cache_sysmem= "0" 
-    uvm_fault_force_sysmem= "0" 
-    uvm_force_prefetch_fault_support= "0" 
-    uvm_global_oversubscription= "1" 
-    uvm_leak_checker    = "0" 
-    uvm_page_table_location= "(null)" 
-    uvm_peer_copy       = "phys" 
-    uvm_perf_access_counter_batch_count= "256" 
-    uvm_perf_access_counter_migration_enable= "-1" 
-    uvm_perf_access_counter_threshold= "256" 
-    uvm_perf_fault_batch_count= "256" 
-    uvm_perf_fault_coalesce= "1" 
-    uvm_perf_fault_max_batches_per_service= "20" 
-    uvm_perf_fault_max_throttle_per_service= "5" 
-    uvm_perf_fault_replay_policy= "2" 
-    uvm_perf_fault_replay_update_put_ratio= "50" 
-    uvm_perf_map_remote_on_eviction= "1" 
-    uvm_perf_map_remote_on_native_atomics_fault= "0" 
-    uvm_perf_migrate_cpu_preunmap_block_order= "2" 
-    uvm_perf_migrate_cpu_preunmap_enable= "1" 
-    uvm_perf_pma_batch_nonpinned_order= "6" 
-    uvm_perf_prefetch_enable= "1" 
-    uvm_perf_prefetch_min_faults= "1" 
-    uvm_perf_prefetch_threshold= "51" 
-    uvm_perf_reenable_prefetch_faults_lapse_msec= "1000" 
-    uvm_perf_thrashing_enable= "1" 
-    uvm_perf_thrashing_epoch= "2000" 
-    uvm_perf_thrashing_lapse_usec= "500" 
-    uvm_perf_thrashing_max_resets= "4" 
-    uvm_perf_thrashing_nap= "1" 
-    uvm_perf_thrashing_pin= "300" 
-    uvm_perf_thrashing_pin_threshold= "10" 
-    uvm_perf_thrashing_threshold= "3" 
-    uvm_release_asserts = "1" 
-    uvm_release_asserts_dump_stack= "0" 
-    uvm_release_asserts_set_global_error= "0" 
-</code> 
- 
-The RTX 5060 Ti crash happens later when ''options nvidia_uvm uvm_disable_hmm=1'' is set.
  
  
informatique/egpu.1777020900.txt.gz · Last modified: by cyrille

Unless otherwise noted, content on this wiki is licensed under the following license: CC0 1.0 Universal