Building LLM Frameworks on the Blackwell Architecture

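The RTX 5070 Ti has only just launched and most frameworks do not support it yet. This post covers how to install the latest GPU driver, CUDA, PyTorch, triton, flash-attention, and more. (Updated continuously.)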

2025-03-23


A New GPU Generation

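Since university I have only used thin-and-light laptops, first a Surface Pro 5 and then a MacBook M1 Pro, neither of which has a discrete GPU. The GTX 1050 Ti at home is an old card with only 4 GB of VRAM. In the age of AI that hardware falls short: LLM research needs a discrete GPU for training and inference, and games need a high-performance card for modern ray tracing. So for a long time I have wanted an NVIDIA card of my own. Two years after the 40 series launched, the 50 series has finally arrived. Reviewers generally find the gaming uplift of this generation underwhelming, but with ~~frame stitching~~ DLSS 4 moving to a Transformer-based architecture, this generation clearly leans further toward AI workloads. So, paying above list price, I bought a 魔鹰 RTX 5070 Ti from JD.com's self-operated store, with 16 GB of VRAM and 8960 CUDA cores (fully loaded, and given the recent controversy over missing ROP units, buying from an official channel felt safer).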

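Although the Blackwell architecture first appeared in data-center GPUs such as the B200, it had not reached consumer cards until now. Support for the new cards in LLM frameworks is therefore still immature and nothing works out of the box, which is unfriendly for AI work. Fortunately, some frameworks already have testable versions, and building from source at least gets code running. I could not find any write-ups on this yet, so I am sharing my notes for reference; in a while none of this should be necessary.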

Environment Setup

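Linux is undoubtedly the best choice for AI research, but so that it does not get in the way of gaming, I followed a tutorial and set up a Windows 11 / Ubuntu 24.10 dual boot. On the Ubuntu side I use frp to make SSH reachable from outside the local network (it starts at boot as a systemctl-managed service; see this blog post if you are interested).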

GPU Driver

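Now to the main topic. The first obstacle is the GPU driver. This card is so new that its driver is not yet available through Ubuntu's built-in Software & Updates, so the package source has to be added manually before installing.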

sudo apt install build-essential
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update

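Next, shut down the graphical desktop environment and install from a text console. This step matters: otherwise the driver installation can hang, and recovering from that means reinstalling the system. Press Ctrl + Alt + F3 to switch to tty3, log in, and stop the desktop: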

sudo systemctl stop gdm3  # gnome

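Then install the driver. When the installer asks, choose the MIT-licensed (open kernel module) driver for now; in my testing the GPU was not detected by nvidia-smi with the proprietary kernel module: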

sudo apt install nvidia-driver-570

When the installation finishes, reboot:

reboot

Then verify with nvidia-smi. If the GPU is listed, the driver was installed successfully.

Sun Mar 23 22:15:50 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   27C    P5             38W /  300W |     130MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
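
If you want to check the driver from a script rather than by eye, the banner line of the output above carries everything needed. The following is a minimal sketch, assuming the banner keeps the `Driver Version:` / `CUDA Version:` wording shown above; `parse_smi_header` and `meets_minimum` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re


def parse_smi_header(line: str) -> dict:
    """Extract the driver and CUDA versions from the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare a dotted version string against a (major, minor) tuple."""
    parts = tuple(int(p) for p in version.split(".")[: len(minimum)])
    return parts >= minimum


# Banner line copied from the nvidia-smi output above.
banner = (
    "| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06"
    "     CUDA Version: 12.8     |"
)
info = parse_smi_header(banner)
print(info)
print("CUDA >= 12.8:", meets_minimum(info["cuda"], (12, 8)))
```

Note that the CUDA version in this banner is the highest version the driver supports, not the toolkit that is installed; the toolkit comes in the next section.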

CUDA

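According to this important official note, 50-series cards currently require CUDA 12.8, so download and install it following the official instructions. The key commands are reproduced here for reference: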

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

To be safe, add CUDA to your paths after installation (either temporarily in the current shell, or permanently by editing ~/.bashrc):

export PATH="/usr/local/cuda-12.8/bin:$PATH"
export CUDA_HOME="/usr/local/cuda-12.8"
export LIBRARY_PATH="/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/lib64/stubs:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH"

Open a new terminal or run source ~/.bashrc for the change to take effect.
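
When a build is launched from Python (for example through subprocess), the same variables can be assembled programmatically instead of relying on the shell profile. This is only a sketch under the assumption that CUDA lives in /usr/local/cuda-12.8 as above; `cuda_env` is a hypothetical helper.

```python
import posixpath


def cuda_env(cuda_home: str, base: dict) -> dict:
    """Return a copy of `base` with the CUDA paths from the exports above prepended."""
    env = dict(base)

    def prepend(var: str, value: str) -> None:
        # Avoid a trailing ":" when the variable was previously unset.
        old = env.get(var, "")
        env[var] = value + (":" + old if old else "")

    env["CUDA_HOME"] = cuda_home
    prepend("PATH", posixpath.join(cuda_home, "bin"))
    # Prepend stubs first so lib64 ends up in front of it, as in the exports.
    prepend("LIBRARY_PATH", posixpath.join(cuda_home, "lib64", "stubs"))
    prepend("LIBRARY_PATH", posixpath.join(cuda_home, "lib64"))
    prepend("LD_LIBRARY_PATH", posixpath.join(cuda_home, "lib64"))
    return env


env = cuda_env("/usr/local/cuda-12.8", {"PATH": "/usr/bin"})
for key in ("CUDA_HOME", "PATH", "LIBRARY_PATH", "LD_LIBRARY_PATH"):
    print(f"{key}={env[key]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run`, merged with `os.environ` first if the rest of the environment is needed.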

Anaconda

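To isolate Python environments (the system Python is protected and generally cannot be modified directly with pip), I recommend installing Anaconda for environment management. conda can also install CUDA, which helps if you need several CUDA versions side by side; see the CUDA installation documentation. The following is taken from the Anaconda installation instructions: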

curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash ~/Anaconda3-2024.10-1-Linux-x86_64.sh
# The steps below assume the default install location
source ~/.bashrc

After installation, create a new environment:

conda create -n llm python=3.12

Python 3.12 is used here because some recent packages (langchain, for example) require a fairly new Python, and a newer Python also gives access to newer, more convenient syntax.

Once it is created, activate the environment:

conda activate llm

All subsequent installs are run inside this environment:

(llm) $

PyTorch

torch

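According to the same official note, you can in fact already install PyTorch from the latest prebuilt nightly binaries: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128. If you only need to run non-quantized LLMs, that is enough.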

Otherwise, triton and, later, vllm both depend on PyTorch 2.6.0. According to triton's official instructions, using CUDA 12.8 with this older release requires building that PyTorch version from source.

Before building from source, install some build prerequisites:

sudo apt-get install cmake git ninja-build

Then fetch the code and prepare the checkout:

# Clone only the PyTorch v2.6.0-rc9 tag (shallow clone, no full history)
git clone https://github.com/pytorch/pytorch -b v2.6.0-rc9 --depth 1
cd pytorch
git submodule sync
git submodule update --init --recursive -j 8

# Install PyTorch's other dependencies
pip install -r requirements.txt
pip install mkl-static mkl-include wheel

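The build-essential package installed earlier already includes the gcc compiler, but the default on Ubuntu 24.10 is gcc-14. According to the still-unresolved issue pytorch/pytorch#129358, PyTorch currently fails to compile with it because its bundled third-party fbgemm library is too old (I also tried updating that library, and the old PyTorch release still would not build). So you need to install an older gcc and make CMake use it. Setting only the CC environment variable, as the PyTorch documentation suggests, is not enough; CXX must be set as well. Taken from here: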

sudo apt install gcc-13 g++-13
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13

After that, you may also need to update a hard-coded gcc program name, changing the code here to:

for command in all_commands:
    if command["command"].startswith("gcc-13 "):
        command["command"] = "g++-13 " + command["command"][7:]
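
To make the effect of that loop concrete, here is a standalone sketch run on made-up compile commands. The `rewrite_compiler` wrapper and the sample entries are hypothetical; it also uses `len(prefix)` in place of the hard-coded offset 7 so the slice stays correct if the compiler name changes.

```python
def rewrite_compiler(commands, old="gcc-13", new="g++-13"):
    """Replace a leading `old` compiler invocation with `new` in each entry."""
    prefix = old + " "  # trailing space so that e.g. "gcc-130" is not matched
    for entry in commands:
        if entry["command"].startswith(prefix):
            entry["command"] = new + " " + entry["command"][len(prefix):]
    return commands


# Made-up entries in the shape of a compile_commands.json record.
sample = [
    {"command": "gcc-13 -O2 -c a.cpp -o a.o"},
    {"command": "g++-13 -O2 -c b.cpp -o b.o"},
    {"command": "gcc-130 -c c.cpp"},
]
for entry in rewrite_compiler(sample):
    print(entry["command"])
```

Only the first entry is rewritten; the other two do not start with the exact prefix and are left alone.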

Set the environment variables and start the build:

# Build and install
export CUDA_HOME=/usr/local/cuda-12.8
export CUDA_PATH=$CUDA_HOME
export TORCH_CUDA_ARCH_LIST=Blackwell
python setup.py develop

# Optionally export a wheel as a backup
python setup.py bdist_wheel
ls dist # the wheel is here

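Once the build finishes, pip should list the package. (The version string below follows the nightly-wheel naming scheme, so pip may be reporting a previously installed nightly build rather than the source build.)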

$ pip list | grep torch    # not sure why this reports 2.7.0
torch                             2.7.0.dev20250308+cu128
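
Seeing the package in pip does not yet show that the build contains kernels for the new card. A rough check is to compare the GPU's compute capability with the architectures the build was compiled for. The sketch below assumes the RTX 5070 Ti reports capability (12, 0), i.e. sm_120; `sm_name`, `arch_is_compiled`, and `check_gpu` are hypothetical helpers, while `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing torch functions. It only looks for an exact sm_XY match and ignores PTX (`compute_XY`) entries.

```python
def sm_name(capability):
    """Turn a (major, minor) capability tuple into an 'sm_XY' string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_is_compiled(capability, arch_list):
    """True if the build lists native code for exactly this capability."""
    return sm_name(capability) in arch_list


def check_gpu():
    """Query the local torch build; needs torch and a visible GPU."""
    import torch  # imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return "CUDA is not available to this torch build"
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    return f"torch {torch.__version__}: {sm_name(cap)} in {archs} -> {arch_is_compiled(cap, archs)}"


# Pure-Python demonstration with assumed values (no GPU needed).
print(arch_is_compiled((12, 0), ["sm_90", "sm_120"]))
```

On the machine itself, `print(check_gpu())` inside the llm environment gives the actual answer.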

torchvision

torchvision also has to be built from source. Following its installation instructions, skip the optional components and just run:

git clone https://github.com/pytorch/vision.git -b v0.21.0-rc8 --depth 1
cd vision
# GCC 13 was specified earlier, so specify it here as well
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop

Verify the installation:

$ pip list | grep torchvision
torchvision                       0.21.0+7af6987

torchaudio

torchaudio is installed the same way, but the build may download dependencies, so make sure you have network access:

git clone https://github.com/pytorch/audio.git -b v2.6.0-rc7 --depth 1
cd audio
# GCC 13 was specified earlier, so specify it here as well
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
python setup.py develop

Verify the installation:

$ pip list | grep torchaudio
torchaudio                        2.6.0a0+d883142

triton

Since autoawq, flash-attention, and deepspeed may all depend on the triton package, install triton first.

Continuing with triton's official instructions, first clone the repository:

git clone https://github.com/triton-lang/triton.git --depth 1
cd triton

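Because CUDA library linking errors can currently occur, set the following environment variable (similar problems may show up at runtime later, so I recommend adding it to ~/.bashrc):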

export TRITON_LIBCUDA_PATH=/usr/local/cuda-12/lib64/stubs
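
A wrong directory here tends to surface only much later as a linker error, so it can be worth checking that the directory really contains a libcuda.so. A minimal sketch, assuming the variable is set as above; `find_libcuda` is a hypothetical helper.

```python
import os
from pathlib import Path


def find_libcuda(directory):
    """Return the names of libcuda.so* files found directly in `directory`."""
    d = Path(directory)
    if not d.is_dir():
        return []
    return sorted(p.name for p in d.glob("libcuda.so*"))


stub_dir = os.environ.get("TRITON_LIBCUDA_PATH", "")
found = find_libcuda(stub_dir) if stub_dir else []
print(f"TRITON_LIBCUDA_PATH={stub_dir!r} -> {found or 'no libcuda.so found'}")
```

An empty result means the linker will not find -lcuda through this path.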

Then continue with the build:

pip install pybind11
pip install -e python

After the build and install finish, verify it:

$ pip list | grep triton
pytorch-triton                    3.2.0+git4b3bb1f8

Flash Attention

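Many LLMs can use flash attention for speed, so install it too. Following the flash-attention installation instructions, choose a conservative parallel job count based on your CPU core count and RAM (with too little memory, too many parallel jobs will make the build hang):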

MAX_JOBS=4 pip install flash-attn --no-build-isolation
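
One way to pick that number is to take the smaller of the CPU count and the RAM divided by a per-job memory budget. The sketch below is only a heuristic: the 8 GiB per job default is a placeholder I chose for illustration, not a figure from the flash-attention documentation, and `conservative_max_jobs` and `total_memory_gib` are hypothetical helpers.

```python
import os


def conservative_max_jobs(cpu_count, mem_gib, gib_per_job=8.0):
    """Cap parallel jobs by CPU count and by RAM / assumed per-job budget."""
    by_cpu = max(1, cpu_count)
    by_mem = max(1, int(mem_gib // gib_per_job))
    return min(by_cpu, by_mem)


def total_memory_gib():
    """Physical RAM in GiB via POSIX sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


# Example with assumed hardware: 16 cores and 32 GiB of RAM.
print("MAX_JOBS =", conservative_max_jobs(16, 32))
```

On the build machine, `conservative_max_jobs(os.cpu_count(), total_memory_gib())` gives a starting value; lower it further if the build still stalls.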

Blackwell does not support flash attention 3 for now, so in practice only flash-attention 2 is used. If you hit the Operation Error: /usr/bin/ld: cannot find -lcuda error described in Dao-AILab/flash-attention#1312, the TRITON_LIBCUDA_PATH from earlier is not set correctly.

AutoAWQ

With Triton installed, this package is relatively easy to install:

pip install autoawq

You can then run AWQ-quantized LLMs.

Transformers

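At this point you should be able to run local LLMs efficiently on top of these prerequisites. With Hugging Face transformers installed, you should be able to run DeepSeek-R1-Distill-Qwen-14B-AWQ on the 5070 Ti: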

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_name = "casperhansen/deepseek-r1-distill-qwen-14b-awq"
prompt = "你好"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

streamer = TextStreamer(tokenizer=tokenizer)
text = prompt + "<think>"
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    streamer=streamer,
    do_sample=False,
)

Access to Hugging Face from mainland China is slow, so consider running the script (saved here as start_llm.py) through a mirror endpoint:

HF_ENDPOINT=https://hf-mirror.com python start_llm.py

The output looks like this:

<｜begin▁of▁sentence｜>你好<think>

</think>

你好！很高兴见到你，有什么我可以帮忙的吗？无论是学习、工作还是生活中的问题，都可以告诉我哦！😊<｜end▁of▁sentence｜>

GPU usage as monitored with watch nvidia-smi:

Sun Mar 23 23:50:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0  On |                  N/A |
| 30%   53C    P1            300W /  300W |   10102MiB /  16303MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           10730      C   python                                 9964MiB |
+-----------------------------------------------------------------------------------------+

vLLM

vLLM is commonly used for LLM inference serving. Following the experimental instructions, run these steps:

git clone https://github.com/vllm-project/vllm.git --depth 1
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install setuptools_scm

Next, set a few environment variables, or alternatively pass them to CMake in vllm/setup.py (otherwise Caffe2 reports that it cannot find CUDA):

export CUDA_HOME=/usr/local/cuda
export CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
export CUDA_INCLUDE_DIRS=/usr/local/cuda/include

It seems that CUDA 12.8 does not yet support NVTX3 (skipping this step leads to a link error later), so manually download the NVTX library into some directory <path>:

git clone https://github.com/NVIDIA/NVTX.git --depth 1

Then, in ~/anaconda3/envs/llm/lib/python3.12/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake (a file inside the torch package of the Python environment), comment out the block below and replace it as shown:

# nvToolsExt
# if(USE_SYSTEM_NVTX)
#   find_path(nvtx3_dir NAMES nvtx3 PATHS ${CUDA_INCLUDE_DIRS})
# else()
#   find_path(nvtx3_dir NAMES nvtx3 PATHS "${PROJECT_SOURCE_DIR}/third_party/NVTX/c/include" NO_DEFAULT_PATH)
# endif()
find_path(nvtx3_dir NAMES nvtx3 PATHS "<path>/NVTX/c/include") # use custom nvtx3; replace <path> with your own path
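
The location of cuda.cmake depends on where torch is installed, so the path quoted above may differ on your machine (for a develop-mode install it may resolve into the source checkout instead of site-packages). A small sketch for locating the file from the active environment; `caffe2_cuda_cmake` and `locate_in_current_env` are hypothetical helpers.

```python
from pathlib import Path


def caffe2_cuda_cmake(torch_pkg_dir):
    """Path of Caffe2's cuda.cmake relative to a torch package directory."""
    return Path(torch_pkg_dir) / "share" / "cmake" / "Caffe2" / "public" / "cuda.cmake"


def locate_in_current_env():
    """Resolve the file for the torch that the current interpreter imports."""
    import torch  # imported lazily; requires torch to be installed

    return caffe2_cuda_cmake(Path(torch.__file__).parent)


# Demonstration with the directory used in this post.
demo = caffe2_cuda_cmake("~/anaconda3/envs/llm/lib/python3.12/site-packages/torch")
print(demo.as_posix())
```

Running `print(locate_in_current_env())` inside the llm environment shows which file actually needs editing.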

Continuing with the instructions, vllm now builds correctly:

MAX_JOBS=4 VLLM_FLASH_ATTN_VERSION=2 python setup.py develop

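A few follow-up fixes are needed afterwards. For example, triton may no longer be found (ImportError: cannot import name 'Config' from 'triton' (unknown location)); go back to the triton repository cloned earlier and reinstall it: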

cd triton && pip install -e python

For ImportError: Numba needs NumPy 2.0 or less. Got NumPy 2.2., install numpy 2.0:

pip install numpy==2.0

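vLLM can now be deployed. To avoid running out of GPU memory, the default parameters need adjusting; for example, the following command deploys successfully: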

vllm serve /home/<username>/.cache/huggingface/hub/models--casperhansen--deepseek-r1-distill-qwen-14b-awq/snapshots/bc43ec1bbf08de53452630806d5989208b4186db --max_num_seqs 2 --gpu_memory_utilization 0.9 --max_model_len 2048 --port 3003

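Here max_num_seqs is the maximum number of sequences processed concurrently, max_model_len is the model context length, and port is the port the server listens on. The model can then be called through the OpenAI-compatible API.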
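
As an illustration of such a call, the sketch below builds a chat completion request with only the standard library. It assumes the server from the command above is listening on port 3003 and that the model name equals the path passed to vllm serve (querying /v1/models on the server should confirm the exact name); `build_chat_request` and `chat` are hypothetical helpers.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt):
    """Send the request and return the text of the first choice."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network.
req = build_chat_request("http://localhost:3003", "<model path>", "你好")
print(req.get_method(), req.full_url)
```

With the server running, `chat("http://localhost:3003", model_path, "你好")` returns the reply; keep max_tokens well below the max_model_len of 2048 configured above.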

Manually Removing the egg Warnings

Save the following list as requirements.txt, then reinstall with pip install --force-reinstall -r requirements.txt --no-deps.

python_json_logger==3.3.0
distro==1.9.0
lm_format_enforcer==0.10.11
dnspython==2.7.0
lark==1.2.2
httpx==0.28.1
openai==1.68.2
msgspec==0.19.0
cloudpickle==3.1.1
uvicorn==0.34.0
depyf==0.18.0
nest_asyncio==1.6.0
python_dotenv==1.0.1
numpy==1.26.4
anyio==4.9.0
transformers==4.50.0
tiktoken==0.9.0
xgrammar==0.1.16
opencv_python_headless==4.11.0.86
websockets==15.0.1
httpcore==1.0.7
partial_json_parser==0.2.1.1.post5
scipy==1.15.2
h11==0.14.0
pyzmq==26.3.0
shellingham==1.5.4
ray==2.44.0
outlines==0.1.11
python_multipart==0.0.20
sentencepiece==0.2.0
watchfiles==1.0.4
compressed_tensors==0.9.2
sniffio==1.3.1
rich_toolkit==0.13.2
mistral_common==1.5.4
jiter==0.9.0
fastapi==0.115.12
airportsdata==20250224
interegular==0.3.3
uvloop==0.21.0
llvmlite==0.43.0
numba==0.60.0
outlines_core==0.1.26
httptools==0.6.4
fastapi_cli==0.0.7
starlette==0.46.1
pycountry==24.6.1
diskcache==5.6.3
typer==0.15.2
astor==0.8.1
blake3==1.0.4
email_validator==2.2.0
llguidance==0.7.9
cachetools==6.0.0b1
prometheus_fastapi_instrumentator==7.1.0
gguf==0.10.0
prometheus_client==0.21.1
fastrlock==0.8.3
cupy_cuda12x==13.4.1
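
The pins above reflect my environment on the day of writing. If your installed versions differ, a list in the same format can be regenerated from the active environment so that the forced reinstall does not change any version. This is a sketch under the assumption that you supply the package names yourself; `pin` is a hypothetical helper built on the standard `importlib.metadata.version`.

```python
from importlib import metadata


def pin(names, lookup=metadata.version):
    """Return 'name==version' lines for the given distribution names."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently dropping the package.
            lines.append(f"# {name}: not installed")
    return lines


# Example: pin a few of the packages from the list above.
for line in pin(["numpy", "transformers", "fastapi"]):
    print(line)
```

Redirecting the printed lines into requirements.txt gives a file that can be used with the pip command above.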

DeepSpeed

DeepSpeed can be used to share the workload between the GPU and the CPU. First, clone the repository:

git clone https://github.com/deepspeedai/DeepSpeed.git --depth 1
cd DeepSpeed

Try installing directly:

pip install .

Use ds_report to see which problems are reported. In general, the following extra steps are needed:

For [WARNING] gds: please install the libaio-dev package with apt, install the library:

sudo apt install libaio-dev

There may then be ld link errors; in that case, specify GCC 13 here as well:

export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13

Clone cutlass into <path> and set CUTLASS_PATH:

export CUTLASS_PATH=<path>/cutlass

If dskernels cannot be found, install this package:

pip install deepspeed-kernels

Then build and install all the ops; some may fail to compile and can be ignored:

DS_BUILD_OPS=1 pip install . --global-option="build_ext" --global-option="-j8"

After installation, verify with ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_lion ............... [YES] ...... [OKAY]
evoformer_attn ......... [YES] ...... [OKAY]
 [WARNING]  FP Quantizer is using an untested triton version (3.3.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [YES] ...... [OKAY]
fused_lion ............. [YES] ...... [OKAY]
gds .................... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
inference_core_ops ..... [YES] ...... [OKAY]
cutlass_ops ............ [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
ragged_device_ops ...... [YES] ...... [OKAY]
ragged_ops ............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.7
 [WARNING]  using untested triton version (3.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
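
With this many ops it is easy to overlook the ones that did not build. The sketch below pulls the op rows out of a ds_report dump and lists those not installed. It assumes the row layout shown above (`name ... [YES|NO] ... [OKAY|NO]`); `parse_ops` is a hypothetical helper, not part of DeepSpeed.

```python
import re

# One op row: name, dots, [installed], dots, [compatible].
ROW = re.compile(r"^(\w+)\s\.+\s\[(YES|NO)\]\s\.+\s\[(OKAY|NO)\]$")


def parse_ops(report):
    """Map op name -> (installed, compatible) for each op row in the report."""
    ops = {}
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            ops[match.group(1)] = (match.group(2) == "YES", match.group(3) == "OKAY")
    return ops


# A few rows copied from the report above.
sample = "\n".join([
    "ninja .................. [OKAY]",
    "async_io ............... [YES] ...... [OKAY]",
    "fp_quantizer ........... [NO] ....... [NO]",
    "stochastic_transformer . [YES] ...... [OKAY]",
])
missing = [name for name, (installed, _) in parse_ops(sample).items() if not installed]
print("not installed:", missing)
```

Feeding it the captured output of ds_report (for example via `subprocess.run(["ds_report"], capture_output=True, text=True).stdout`) gives the full list for your build.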

After this you may need to reinstall triton.