Fixing a Broken CUDA Driver on Linux
If you upgrade a Linux system frequently, you will occasionally find that the NVIDIA driver has stopped working:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
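This happens when, after an upgrade, the NVIDIA driver version no longer matches the kernel or the graphics card.
First, check the current kernel version with the uname command. Here it is 6.5.0-35-generic: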
$ uname -a
Linux ROG 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Next, check which NVIDIA driver version is currently in use by listing the driver sources under /usr/src/. Here it is 555.42.02:
$ ls /usr/src/ | grep nvidia
nvidia-555.42.02
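The two lookups above can be combined into one small script. This is only a sketch: `nvidia_src_versions` is a hypothetical helper name, and it assumes the driver sources sit in directories named `nvidia-<version>` under /usr/src, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: reads directory names on stdin and prints the version
# part of every entry that looks like nvidia-<version>.
nvidia_src_versions() {
    sed -n 's/^nvidia-\([0-9][0-9.]*\)$/\1/p'
}

# Print the running kernel release next to the driver source versions found.
echo "kernel: $(uname -r)"
if [ -d /usr/src ]; then
    ls /usr/src | nvidia_src_versions | while read -r version; do
        echo "driver source: $version"
    done
fi
```

If no "driver source" line is printed, no NVIDIA driver sources were found under /usr/src.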
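Search the official NVIDIA website for the driver versions available for your card. In this case the latest available version is 550.90.07.
Install the matching driver version: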
sudo apt install nvidia-driver-550
sudo dkms install -m nvidia -v 550.90.07
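Note
If UEFI Secure Boot is enabled, you are asked to enter a key (a password) while the driver is being installed.
After rebooting, a special screen appears. Choose the option to enroll the key (enroll key) on that screen and enter the same key. Only after the key is enrolled will the system let the driver load.
NVIDIA driver installs successfully but does not work
Symptoms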
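After the installation, one way to confirm the result is to check whether DKMS reports an nvidia module installed for the running kernel. The helper below is a sketch, not part of the original steps: `has_nvidia_module_for_kernel` is a hypothetical name, and it only pattern-matches the text printed by `dkms status`.

```shell
#!/bin/sh
# Hypothetical helper: reads `dkms status` output on stdin and succeeds if an
# nvidia module is listed as installed for the kernel release passed as $1.
has_nvidia_module_for_kernel() {
    grep -E '^nvidia[,/]' | grep -F -- "$1" | grep -q 'installed'
}

kernel="$(uname -r)"
if command -v dkms >/dev/null 2>&1; then
    if dkms status 2>/dev/null | has_nvidia_module_for_kernel "$kernel"; then
        echo "nvidia module is installed for $kernel"
    else
        echo "no nvidia module installed for $kernel"
    fi
else
    echo "dkms not found; skipping check"
fi
```

A module that is listed as installed can still fail to load, for example when Secure Boot blocks it, so treat this as a first check only.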
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ dkms status
nvidia, 470.86, 5.13.0-22-generic, x86_64: installed
$ nvidia-settings
ERROR: NVIDIA driver is not loaded
ERROR: Unable to load info from any available system
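Cause and solution
If UEFI Secure Boot is enabled, you are asked to enter a key (a password) while the driver is being installed.
After rebooting, a special screen appears. Choose the option to enroll the key (enroll key) on that screen and enter the same key. Only after the key is enrolled will the system let the driver load.
Alternatively, disable Secure Boot in the BIOS.
Ubuntu CUDA driver upgrade failure
Failure during upgrade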
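To see whether Secure Boot is the likely culprit, you can query its state with `mokutil --sb-state`. The sketch below assumes the mokutil tool is installed; `secure_boot_enabled` is a hypothetical helper name that simply matches mokutil's "SecureBoot enabled" line.

```shell
#!/bin/sh
# Hypothetical helper: reads `mokutil --sb-state` output on stdin and succeeds
# when it reports that Secure Boot is enabled.
secure_boot_enabled() {
    grep -qi '^SecureBoot enabled'
}

if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
        echo "Secure Boot is enabled: enroll the key or disable Secure Boot"
    else
        echo "Secure Boot is not reported as enabled"
    fi
else
    echo "mokutil not found; cannot query the Secure Boot state"
fi
```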
sudo apt update
sudo apt upgrade
Error message:
$ sudo apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following package was automatically installed and is no longer required:
nvidia-firmware-535-535.129.03
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
nvidia-kernel-common-535
The following packages will be upgraded:
nvidia-kernel-common-535
1 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.
1 not fully installed or removed.
Need to get 0 B/38,3 MB of archives.
After this operation, 61,2 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
(Reading database ... 291692 files and directories currently installed.)
Preparing to unpack .../nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-common-535 (535.129.03-0ubuntu1) over (535.129.03-0ubuntu0.22.04.1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb (--unpack):
 trying to overwrite '/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin', which is also in package nvidia-firmware-535-535.129.03 535.129.03-0ubuntu0.22.04.1
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
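Judging by the error message, the two packages nvidia-kernel-common-535 and nvidia-firmware-535-535.129.03 both contain the file /lib/firmware/nvidia/535.129.03/gsp_ga10x.bin.
Solution:
Tell dpkg which package's copy of the file to use. Here, nvidia-kernel-common-535 is allowed to overwrite it: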
sudo dpkg -i --force-overwrite /var/cache/apt/archives/nvidia-kernel-common-535_535.129.03-0ubuntu1_amd64.deb
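Before forcing an overwrite, it can help to pull the conflicting file and its current owner out of the dpkg error. The sketch below uses two hypothetical helpers, `conflicting_file` and `owning_package`, and assumes the error message is on a single line with the terminal wrapping removed.

```shell
#!/bin/sh
# Hypothetical helpers: both read dpkg's error output on stdin.
# Prints the path dpkg refused to overwrite.
conflicting_file() {
    sed -n "s/.*trying to overwrite '\([^']*\)'.*/\1/p"
}
# Prints the name of the package that already owns that path.
owning_package() {
    sed -n 's/.*which is also in package \([^ ]*\).*/\1/p'
}

# The message from the log above, joined onto one line.
msg="trying to overwrite '/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin', which is also in package nvidia-firmware-535-535.129.03 535.129.03-0ubuntu0.22.04.1"
file="$(printf '%s\n' "$msg" | conflicting_file)"
owner="$(printf '%s\n' "$msg" | owning_package)"
echo "file:  $file"
echo "owner: $owner"
```

On the affected machine, `dpkg -S "$file"` lists the installed packages that currently own the file, which confirms the conflict before you override it.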
CMake cannot find the CUDA toolset on Windows
Problem 1
CMake Error at C:/Program Files/CMake/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:470 (message):
No CUDA toolset found.
Copy the contents of the MSBuildExtensions folder in the CUDA directory into the Visual Studio directory:
cp "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files\Microsoft Visual Studio\2022\Community\Msbuild\Microsoft\VC\v170\BuildCustomizations"
Problem 2
The CUDA compiler identification is unknown
CMake Error at src/matrix/cuda/CMakeLists.txt:2 (project):
No CMAKE_CUDA_COMPILER could be found.
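This happens because CUDA does not support 32-bit targets. Add -A x64 to the CMake command line to build a 64-bit target.
In addition, CUDA 11.5 only supports VS2017 through VS2019, so using VS2022 produces the same error. In that case, upgrade CUDA.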
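As an illustration, a configure command with the 64-bit platform selected might look like the sketch below. The generator name and the directories are assumptions chosen for the example, not values from the original project. The script only prints the command, so you can review it before running it in your own shell.

```shell
#!/bin/sh
# Assumed values for illustration; adjust them to your project.
generator="Visual Studio 16 2019"   # assumed: a Visual Studio version CUDA 11.5 supports
src_dir="."
build_dir="build"

# -A x64 selects the 64-bit platform, which CUDA requires.
configure_cmd="cmake -S \"$src_dir\" -B \"$build_dir\" -G \"$generator\" -A x64"

# Dry run: print the command instead of executing it.
echo "$configure_cmd"
```

If you switch to a newer CUDA release together with VS2022, only the generator value changes. The -A x64 argument stays the same.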