Base Environment

root@server1:~# lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:    22.04
Codename:    jammy

root@server1:~# lspci | grep -i nvidia
34:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
35:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
36:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
37:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
9b:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
9c:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
9d:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
9e:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)

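Installing the GPU Driver

1. Environment preparation

1.1 Remove previously installed drivers

To reinstall from a clean state, first purge any previously installed NVIDIA driver packages. The pattern is quoted so that the shell passes it to apt unchanged instead of expanding it against files in the current directory.

sudo apt purge 'nvidia*'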

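1.2 Disable the bundled nouveau driver

Before installing the NVIDIA driver, the open-source nouveau driver that ships with the system must be disabled.

First check whether nouveau is active with lsmod | grep nouveau. Any output means the nouveau driver is currently loaded; no output means it is already disabled.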
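
The check can be wrapped in a small helper for use in scripts. This is a minimal sketch, assuming lsmod-style text arrives on standard input; the function name nouveau_loaded is made up for illustration.

```shell
# Hypothetical helper: exits 0 if lsmod-style input lists the nouveau module.
# It reads stdin, so it works with real `lsmod` output or with sample text.
nouveau_loaded() {
    # lsmod prints the module name in the first column.
    awk '$1 == "nouveau" { found = 1 } END { exit found ? 0 : 1 }'
}

# Typical use on a real machine:
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi
```

Matching on the first column avoids a false positive from modules that merely list nouveau in their "Used by" column.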

If it is loaded, disable it as follows.

Open the file with sudo vim /etc/modprobe.d/blacklist.conf, append the following two lines to the end of the file, and save:

blacklist nouveau
options nouveau modeset=0
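
The same two lines can also be added without an editor. The following is a sketch only: blacklist_nouveau is a hypothetical helper name, and it appends each line only when that exact line is missing, so running it twice does not duplicate entries.

```shell
# Hypothetical helper: append the nouveau settings to a modprobe config file,
# skipping any line that is already present.
blacklist_nouveau() {
    conf_file="$1"
    touch "$conf_file"
    for line in "blacklist nouveau" "options nouveau modeset=0"; do
        # -x matches whole lines only, -F treats the pattern as a fixed string.
        grep -qxF "$line" "$conf_file" || printf '%s\n' "$line" >> "$conf_file"
    done
}

# On a real system this must run as root, for example:
#   blacklist_nouveau /etc/modprobe.d/blacklist.conf
```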

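Run the following command so the change takes effect:

sudo update-initramfs -u   # rebuild the initramfs so the blacklist applies at boot

Reboot. The nouveau driver bundled with Ubuntu 22.04 is now disabled, and the NVIDIA driver can be installed.

If the machine hangs at a blinking cursor after the reboot and does not finish booting, hold ESC or F2 while it restarts to enter recovery mode, then continue with the steps below.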

2. Install the GPU driver

2.1. Query the NVIDIA driver version recommended for the system

root@server1:~# ubuntu-drivers devices
ERROR:root:aplay command not found
== /sys/devices/pci0000:30/0000:30:02.0/0000:31:00.0/0000:32:04.0/0000:37:00.0 ==
modalias : pci:v000010DEd000020B5sv000010DEsd00001533bc03sc02i00
vendor   : NVIDIA Corporation
model    : GA100 [A100 PCIe 80GB]
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-545 - distro non-free
driver   : nvidia-driver-550-open - distro non-free
driver   : nvidia-driver-535 - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-545-open - distro non-free
driver   : nvidia-driver-550 - distro non-free recommended # nvidia-driver-550 is the recommended driver
driver   : xserver-xorg-video-nouveau - distro free builtin
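
To pick the recommended package out of this output in a script, the line tagged "recommended" can be filtered. A minimal sketch, assuming the output format shown above; recommended_driver is a hypothetical helper name.

```shell
# Hypothetical helper: print the package name from the first "driver" line
# that is tagged "recommended" in `ubuntu-drivers devices` output (read from stdin).
recommended_driver() {
    # Fields look like: driver : nvidia-driver-550 - distro non-free recommended
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Typical use on a real machine:
#   ubuntu-drivers devices 2>/dev/null | recommended_driver
```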

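2.2. Install the recommended driver

  • Using the ubuntu-drivers tool:
sudo ubuntu-drivers autoinstall

This command automatically installs the driver the system recommends.

  • Manual installation:
sudo apt install nvidia-driver-550

After the driver is installed, the system must be rebooted for it to take effect.

2.3. Check the NVIDIA driver and GPU information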

root@server1:~# nvidia-smi 
Thu Dec 12 02:20:47 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:34:00.0 Off |                    0 |
| N/A   34C    P0             52W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:35:00.0 Off |                    0 |
| N/A   37C    P0             54W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:36:00.0 Off |                    0 |
| N/A   36C    P0             53W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:37:00.0 Off |                    0 |
| N/A   36C    P0             55W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100 80GB PCIe          Off |   00000000:9B:00.0 Off |                    0 |
| N/A   35C    P0             51W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100 80GB PCIe          Off |   00000000:9C:00.0 Off |                    0 |
| N/A   36C    P0             54W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100 80GB PCIe          Off |   00000000:9D:00.0 Off |                    0 |
| N/A   35C    P0             51W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100 80GB PCIe          Off |   00000000:9E:00.0 Off |                    0 |
| N/A   34C    P0             51W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

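Note: the CUDA version shown in the top-right corner of the nvidia-smi output is the highest CUDA toolkit version that the installed driver supports, not the version of any toolkit that is installed. Here the driver is compatible with CUDA toolkits up to 12.4.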
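
This compatibility rule can be checked before choosing a toolkit. The sketch below assumes the nvidia-smi header format shown above and GNU sort (for the -V version ordering); both function names are hypothetical.

```shell
# Hypothetical helper: extract the "CUDA Version" value from nvidia-smi
# header text read on stdin.
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed if version $1 is less than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use on a real machine:
#   max=$(nvidia-smi | driver_cuda_version)
#   version_le 12.4 "$max" && echo "toolkit 12.4 is supported by this driver"
```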

2.4. Install CUDA

Official installation steps: https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

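Note: the driver was already installed above, so only cuda-toolkit-12-4 needs to be installed.

Set the environment variables and verify that CUDA is configured correctly. The single quotes matter: they write the literal variable references into the script so they are expanded each time the script is sourced, rather than freezing root's current values into the file. The ${VAR:+$VAR:} form avoids a leading colon (which would mean "current directory") when the variable is empty.

root@server1:~# echo 'export PATH=$PATH:/usr/local/cuda/bin' >> /etc/profile.d/cuda.sh
root@server1:~# echo 'export LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/usr/local/cuda/lib64' >> /etc/profile.d/cuda.sh
root@server1:~# echo 'export LIBRARY_PATH=${LIBRARY_PATH:+$LIBRARY_PATH:}/usr/local/cuda/lib64' >> /etc/profile.d/cuda.sh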

root@server1:~# source /etc/profile.d/cuda.sh 
root@server1:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
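
Sourcing a profile script more than once appends the same directory again each time. One way to guard against that is sketched below; append_once is a hypothetical helper, not part of CUDA or Ubuntu, and it only prints the new value so the caller decides what to export.

```shell
# Hypothetical helper: print a colon-separated list with a directory appended,
# unless the directory is already one of its entries.
append_once() {
    list="$1"
    dir="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;            # already present
        *)          printf '%s\n' "${list:+$list:}$dir" ;; # append, no leading colon
    esac
}

# Typical use inside a profile script:
#   PATH=$(append_once "$PATH" /usr/local/cuda/bin); export PATH
```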

3. Uninstalling the NVIDIA driver

sudo apt-get purge 'nvidia-*'
sudo apt-get update
sudo apt-get autoremove

Reboot afterwards to clear anything left over from the uninstall.
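
To confirm that nothing was left behind, the installed package list can be filtered. A minimal sketch, assuming standard dpkg -l output on stdin; installed_nvidia_packages is a hypothetical name, and the libnvidia- prefix is included on the assumption that driver libraries use it.

```shell
# Hypothetical helper: print the names of fully installed packages (state "ii")
# whose names start with nvidia- or libnvidia-, from `dpkg -l` output on stdin.
installed_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /^(nvidia|libnvidia)-/ { print $2 }'
}

# Typical use on a real machine (no output means nothing is left):
#   dpkg -l | installed_nvidia_packages
```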

Installing Anaconda

1. Download Anaconda

Download page: https://www.anaconda.com/download
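
Before running a downloaded installer, it is worth comparing its SHA-256 hash with the published one. The sketch below is generic: verify_sha256 is a hypothetical helper, and the expected hash is assumed to be copied from Anaconda's published hash list for the exact installer file.

```shell
# Hypothetical helper: succeed only if the file's SHA-256 hash equals the
# expected hex string passed as the second argument.
verify_sha256() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Typical use (replace <published-hash> with the value from the vendor's list):
#   verify_sha256 Anaconda3-2024.10-1-Linux-x86_64.sh <published-hash> || echo "hash mismatch"
```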

2. Install Anaconda

root@server1:~# bash Anaconda3-2024.10-1-Linux-x86_64.sh 

Welcome to Anaconda3 2024.10-1

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
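>>>  # press Enter
ANACONDA TERMS OF SERVICE
......  # long text omitted; press q to stop viewing
Do you accept the license terms? [yes|no] # answer yes to accept
>>> yes 

Anaconda3 will now be installed into this location: # choose the install path; the directory must not already exist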
/root/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda3] >>> /anaconda
PREFIX=/anaconda
Unpacking payload ...
                                                                                                                                                                                                                                             
Installing base environment...


Downloading and Extracting Packages:
......  # output omitted
Downloading and Extracting Packages:

Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish to update your shell profile to automatically initialize conda?
This will activate conda on startup and change the command prompt when activated.
If you'd prefer that conda's base environment not be activated on startup,
   run the following command when conda is activated:

conda config --set auto_activate_base false
# Whether conda should be initialized automatically in every new terminal. Answering "yes" sets up conda at shell startup so the conda command is available right away.
You can undo this by running `conda init --reverse $SHELL`? [yes|no]
[no] >>> yes
no change     /anaconda/condabin/conda
no change     /anaconda/bin/conda
no change     /anaconda/bin/conda-env
no change     /anaconda/bin/activate
no change     /anaconda/bin/deactivate
no change     /anaconda/etc/profile.d/conda.sh
no change     /anaconda/etc/fish/conf.d/conda.fish
no change     /anaconda/shell/condabin/Conda.psm1
no change     /anaconda/shell/condabin/conda-hook.ps1
no change     /anaconda/lib/python3.12/site-packages/xontrib/conda.xsh
no change     /anaconda/etc/profile.d/conda.csh
modified      /root/.bashrc

==> For changes to take effect, close and re-open your current shell. <==

Thank you for installing Anaconda3!

3. Set the environment variable

echo 'export PATH=$PATH:/anaconda/bin/' >> /etc/profile.d/anaconda.sh && source /etc/profile.d/anaconda.sh

Using Anaconda

1. Create a virtual environment

conda create --name newenv python=3.10

Example:

root@server1:~# conda create --name DeepSeek-V2.5 python=3.10
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /anaconda/envs/DeepSeek-V2.5

  added / updated specs:
    - python=3.10


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2024.11.26 |       h06a4308_0         131 KB
    pip-24.2                   |  py310h06a4308_0         2.3 MB
    python-3.10.16             |       he870216_1        26.9 MB
    setuptools-75.1.0          |  py310h06a4308_0         1.7 MB
    wheel-0.44.0               |  py310h06a4308_0         109 KB
    ------------------------------------------------------------
                                           Total:        31.1 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2024.11.26-h06a4308_0 
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 
  libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 
  libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 
  ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 
  openssl            pkgs/main/linux-64::openssl-3.0.15-h5eee18b_0 
  pip                pkgs/main/linux-64::pip-24.2-py310h06a4308_0 
  python             pkgs/main/linux-64::python-3.10.16-he870216_1 
  readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0 
  setuptools         pkgs/main/linux-64::setuptools-75.1.0-py310h06a4308_0 
  sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 
  tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0 
  tzdata             pkgs/main/noarch::tzdata-2024b-h04d1e81_0 
  wheel              pkgs/main/linux-64::wheel-0.44.0-py310h06a4308_0 
  xz                 pkgs/main/linux-64::xz-5.4.6-h5eee18b_1 
  zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 


Proceed ([y]/n)? y # confirm the installation


Downloading and Extracting Packages:
                                                                                                                                                                                                                                             
Preparing transaction: done                                                                                                                                                                                                                  
Verifying transaction: done                                                                                                                                                                                                                  
Executing transaction: done                                                                                                                                                                                                                  
#                                                                                                                                                                                                                                            
# To activate this environment, use
#
#     $ conda activate DeepSeek-V2.5
#
# To deactivate an active environment, use
#
#     $ conda deactivate

2. List virtual environments

conda env list

Example:

root@server1:~# conda env list
# conda environments:
#
base                     /anaconda
DeepSeek-V2.5            /anaconda/envs/DeepSeek-V2.5
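
In a script, this listing can be used to test whether an environment already exists before creating it. A minimal sketch, assuming the output format shown above; conda_env_exists is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if `conda env list` output on stdin contains an
# environment with exactly the given name.
conda_env_exists() {
    name="$1"
    # Skip comment lines; the environment name is the first column.
    awk -v name="$name" '!/^#/ && $1 == name { found = 1 } END { exit found ? 0 : 1 }'
}

# Typical use on a real machine:
#   conda env list | conda_env_exists DeepSeek-V2.5 || conda create -y --name DeepSeek-V2.5 python=3.10
```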

3. Activate a virtual environment

conda activate newenv

Example:

root@server1:~# conda activate DeepSeek-V2.5
(DeepSeek-V2.5) root@server1:~# 

You may hit the error CondaError: Run 'conda init' before 'conda activate':

root@server1:~# conda activate DeepSeek-V2.5

CondaError: Run 'conda init' before 'conda activate'

Fix:

source ~/.bashrc  # loads the conda setup written by the installer and enters the (base) environment
conda deactivate # leave the (base) environment
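
The error means the current shell has not loaded conda's shell integration. In non-interactive scripts, where ~/.bashrc is usually not read, the hook can be loaded directly. This is a sketch assuming bash and that conda is on PATH; init_conda_shell is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: load conda's shell integration into the current bash
# session, or fail with a message if conda is not on PATH.
init_conda_shell() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH" >&2
        return 1
    fi
    # `conda shell.bash hook` prints the shell code that defines `conda activate`.
    eval "$(conda shell.bash hook)"
}

# Typical use at the top of a script:
#   init_conda_shell && conda activate DeepSeek-V2.5
```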

4. Deactivate a virtual environment

conda deactivate

Example:

(DeepSeek-V2.5) root@server1:~# conda deactivate
root@server1:~# 

5. Delete a virtual environment

conda remove --name newenv --all

Example:

root@server1:~# conda remove --name DeepSeek-V2.5 --all

Remove all packages in environment /anaconda/envs/DeepSeek-V2.5:


## Package Plan ##

  environment location: /anaconda/envs/DeepSeek-V2.5


The following packages will be REMOVED:

  _libgcc_mutex-0.1-main
  _openmp_mutex-5.1-1_gnu
  bzip2-1.0.8-h5eee18b_6
  ca-certificates-2024.11.26-h06a4308_0
  ld_impl_linux-64-2.40-h12ee557_0
  libffi-3.4.4-h6a678d5_1
  libgcc-ng-11.2.0-h1234567_1
  libgomp-11.2.0-h1234567_1
  libstdcxx-ng-11.2.0-h1234567_1
  libuuid-1.41.5-h5eee18b_0
  ncurses-6.4-h6a678d5_0
  openssl-3.0.15-h5eee18b_0
  pip-24.2-py310h06a4308_0
  python-3.10.16-he870216_1
  readline-8.2-h5eee18b_0
  setuptools-75.1.0-py310h06a4308_0
  sqlite-3.45.3-h5eee18b_0
  tk-8.6.14-h39e8969_0
  tzdata-2024b-h04d1e81_0
  wheel-0.44.0-py310h06a4308_0
  xz-5.4.6-h5eee18b_1
  zlib-1.2.13-h5eee18b_1


Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Everything found within the environment (/anaconda/envs/DeepSeek-V2.5), including any conda environment configurations and any non-conda files, will be deleted. Do you wish to continue?
 # confirm deleting everything in the environment (/anaconda/envs/DeepSeek-V2.5)
 (y/[n])? y

6. Show the default directories for virtual environments

conda config --show envs_dirs

Example:

root@server1:~# conda config --show envs_dirs
envs_dirs:
  - /anaconda/envs
  - /root/.conda/envs

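7. Change the default directory for virtual environments

conda config --add envs_dirs <new_directory_path>

The first path in the envs_dirs list is the directory conda uses by default when it creates a new virtual environment.
To change the default, add a new directory: --add places it at the front of the list, so it becomes the new default.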
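
To check which directory is currently first, the listing can be parsed. A minimal sketch, assuming the output format of conda config --show envs_dirs shown earlier and paths without spaces; default_envs_dir is a hypothetical helper name.

```shell
# Hypothetical helper: print the first entry of the envs_dirs list from
# `conda config --show envs_dirs` output on stdin.
default_envs_dir() {
    # List entries look like "  - /anaconda/envs"; the first one is the default.
    awk '$1 == "-" { print $2; exit }'
}

# Typical use on a real machine:
#   conda config --show envs_dirs | default_envs_dir
```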

Last modified: December 13, 2024