Fixing a Broken CUDA Driver on Linux
2024-12-18 21:18:31+08:00

If you upgrade a Linux system frequently, you will occasionally find that the NVIDIA driver has stopped working:

$ nvidia-smi 
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

This happens because, after the upgrade, the NVIDIA driver version no longer matches the kernel or the GPU.

First, check the running kernel version with uname; here it is 6.5.0-35-generic:

$ uname -a
Linux ROG 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Then check the NVIDIA driver version currently in use; here it is 555.42.02:

$ ls /usr/src/ | grep nvidia
nvidia-555.42.02
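
The two checks above can be combined into one small script. This is only a minimal sketch: it assumes the driver sources live under /usr/src as nvidia-<version>, as in the listing above, and the helper name nvidia_src_versions and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the driver versions that have sources installed.
# The directory argument is an assumption for illustration; it defaults to /usr/src.
nvidia_src_versions() {
  src_dir="${1:-/usr/src}"
  for d in "$src_dir"/nvidia-*; do
    # Skip the unexpanded pattern when nothing matches.
    [ -d "$d" ] || continue
    basename "$d" | sed 's/^nvidia-//'
  done
}

# Print both values side by side so a mismatch is easy to spot.
echo "kernel: $(uname -r)"
echo "driver sources: $(nvidia_src_versions | tr '\n' ' ')"
```

If the second line is empty, no driver sources were found in that directory.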

Search the official NVIDIA website for available driver versions; here the latest available version is 550.90.07:

(Screenshot: search)

(Screenshot: version)

Install the matching driver version:

sudo apt install nvidia-driver-550
sudo dkms install -m nvidia -v 550.90.07
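
Note that the apt package is named after the driver branch (the part of the version before the first dot), while dkms takes the full version string. The sketch below derives both commands from one version value and prints them for review instead of running them. The helper name driver_package is hypothetical, and the nvidia-driver-<branch> naming is assumed from the command above.

```shell
#!/bin/sh
# Full driver version, taken from the text above.
version="550.90.07"

# Hypothetical helper: derive the package name from the branch number,
# i.e. everything before the first dot of the version.
driver_package() {
  echo "nvidia-driver-${1%%.*}"
}

# Print the commands rather than executing them, so they can be checked first.
echo "sudo apt install $(driver_package "$version")"
echo "sudo dkms install -m nvidia -v $version"
```

After installing and rebooting, nvidia-smi should respond again if the driver and kernel now match.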

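Note
If UEFI Secure Boot is enabled, the driver installation asks you to enter a key (a password).
After rebooting, a special screen appears; choose to enroll the key (enroll key) there and enter the same key. Only then is the driver allowed to load.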

NVIDIA Driver Installs Successfully but Does Not Work
2024-12-18 21:18:31+08:00

Symptoms

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ dkms status
nvidia, 470.86, 5.13.0-22-generic, x86_64: installed

$ nvidia-settings
ERROR: NVIDIA driver is not loaded
ERROR: Unable to load info from any available system

Cause and solution

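If UEFI Secure Boot is enabled, the driver installation asks you to enter a key (a password).
After rebooting, a special screen appears; choose to enroll the key (enroll key) there and enter the same key. Only then is the driver allowed to load.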

Alternatively, disable Secure Boot in the BIOS.
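
To confirm that Secure Boot is the likely culprit, check its state and whether the nvidia module is loaded. This is a minimal sketch that assumes the mokutil tool is installed; the helper name sb_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reduce the output of `mokutil --sb-state`
# to "enabled", "disabled", or "unknown".
sb_state() {
  case "$1" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;
  esac
}

# Query the firmware state only if mokutil is available.
if command -v mokutil >/dev/null 2>&1; then
  state=$(sb_state "$(mokutil --sb-state 2>/dev/null)")
else
  state=unknown
fi
echo "Secure Boot: $state"

# Check whether any nvidia kernel module is currently loaded.
if lsmod 2>/dev/null | grep -q '^nvidia'; then
  echo "nvidia module: loaded"
else
  echo "nvidia module: not loaded"
fi
```

"Secure Boot: enabled" together with "nvidia module: not loaded" matches the situation described above, where dkms reports the module as installed but the key has not been enrolled.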

CUDA Driver Upgrade Fails on Ubuntu
2024-12-18 21:18:31+08:00

The upgrade fails:

sudo apt update
sudo apt upgrade

Error message:

$ sudo apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following package was automatically installed and is no longer required:
  nvidia-firmware-535-535.129.03
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  nvidia-kernel-common-535
The following packages will be upgraded:
  nvidia-kernel-common-535
1 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.
1 not fully installed or removed.
Need to get 0 B/38,3 MB of archives.
After this operation, 61,2 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
(Reading database ... 291692 files and directories currently installed.)
Preparing to unpack .../nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-common-535 (535.129.03-0ubuntu1) over (535.129.03-0ubuntu0.22.04.1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb (--unpack):
 trying to overwrite '/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin', which is also in package nvidia-firmware-535-535.129.03 535.129.03-0ubuntu0.22.04.1
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

Judging from the error message, the two packages nvidia-kernel-common-535 and nvidia-firmware-535-535.129.03 both contain the file /lib/firmware/nvidia/535.129.03/gsp_ga10x.bin.

Solution:

Tell dpkg to take the file from one of the two packages, here by letting nvidia-kernel-common-535 overwrite it:

sudo dpkg -i --force-overwrite /var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb
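
Before forcing an overwrite, it helps to confirm exactly which file and which package are in conflict. This is a minimal sketch that extracts both from dpkg's error output; the helper name overwrite_conflicts is hypothetical, and it assumes the message is on a single unwrapped line.

```shell
#!/bin/sh
# Hypothetical helper: read dpkg output on stdin and print
# "<file> <owning package>" for every overwrite conflict it reports.
overwrite_conflicts() {
  sed -n "s/.*trying to overwrite '\([^']*\)', which is also in package \([^ ]*\).*/\1 \2/p"
}

# Example input: the conflict line from the log above.
log=" trying to overwrite '/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin', which is also in package nvidia-firmware-535-535.129.03 535.129.03-0ubuntu0.22.04.1"
echo "$log" | overwrite_conflicts

# On a Debian/Ubuntu system, dpkg -S shows which installed packages own the file.
if command -v dpkg >/dev/null 2>&1; then
  dpkg -S /lib/firmware/nvidia/535.129.03/gsp_ga10x.bin 2>/dev/null || true
fi
```

In the log above, apt reports nvidia-firmware-535-535.129.03 as no longer required, which is why overwriting its copy of the file is reasonable in this case.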

CMake Cannot Find the CUDA Toolset on Windows
2024-12-18 21:18:31+08:00

Problem 1

CMake Error at C:/Program Files/CMake/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:470 (message):
  No CUDA toolset found.

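Copy the contents of the MSBuildExtensions directory from the CUDA installation into Visual Studio's BuildCustomizations directory (in PowerShell, cp is an alias for Copy-Item):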

cp "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files\Microsoft Visual Studio\2022\Community\Msbuild\Microsoft\VC\v170\BuildCustomizations"

Problem 2

The CUDA compiler identification is unknown
CMake Error at src/matrix/cuda/CMakeLists.txt:2 (project):
  No CMAKE_CUDA_COMPILER could be found.

This is because CUDA does not support 32-bit targets; add -A x64 to the cmake command line to build a 64-bit target.

In addition, CUDA 11.5 only supports VS2017 through VS2019, so using VS2022 also triggers this error; the fix is to upgrade CUDA.
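
As an illustration of the -A x64 option, the sketch below assembles a configure command for a Visual Studio generator and prints it, for example in Git Bash. The helper name cmake_configure_cmd, the source and build directories, and the choice of generator are assumptions for illustration; pick the generator that matches an installed Visual Studio version supported by your CUDA release.

```shell
#!/bin/sh
# Hypothetical helper: print a cmake configure command that targets x64.
#   $1 = source directory, $2 = build directory, $3 = generator name
cmake_configure_cmd() {
  printf 'cmake -S %s -B %s -G "%s" -A x64\n' "$1" "$2" "$3"
}

# "Visual Studio 16 2019" is the CMake generator name for VS2019.
cmake_configure_cmd . build "Visual Studio 16 2019"
```

Run the printed command from the project root. If it still reports that no CUDA compiler could be found, recheck the Visual Studio and CUDA version pairing described above.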