(KTransformers) Running DeepSeek-R1 671B on a Single RTX 4090

news2025/7/15 21:38:44

Installation environment: Ubuntu 22.04 x86_64

Download the model

Create the file with vim url.list and add the following lines:

https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00002-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00003-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00004-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00005-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00006-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00007-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00008-of-00009.gguf 
https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00009-of-00009.gguf
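The nine shard URLs above differ only in the shard index, so url.list can also be generated instead of typed by hand. The following is a minimal sketch; the helper name build_shard_urls is hypothetical, while the base URL and the shard count are taken from the list above.

```python
# Sketch: print the nine Q4_K_M shard URLs, one per line.
BASE = (
    "https://modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/resolve/master/"
    "DeepSeek-R1-Q4_K_M"
)


def build_shard_urls(total: int = 9) -> list[str]:
    """Return the download URL of every shard, numbered 00001..total."""
    return [
        f"{BASE}/DeepSeek-R1-Q4_K_M-{i:05d}-of-{total:05d}.gguf"
        for i in range(1, total + 1)
    ]


if __name__ == "__main__":
    # Redirect stdout to create the list file, e.g. `python gen_urls.py > url.list`
    # (gen_urls.py is just an example script name).
    print("\n".join(build_shard_urls()))
```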

# Download with wget
wget -i url.list -P /data/model/DeepSeek-R1
# Or download in parallel with aria2 instead
sudo apt-get install aria2
aria2c -i url.list -d /data/model/DeepSeek-R1 --log=aria2.log
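An interrupted download leaves the model directory incomplete, so it can help to confirm that every shard arrived before moving on. The following is a minimal sketch that assumes the file names from url.list; missing_shards is a hypothetical helper. aria2 keeps a .aria2 control file next to a download that has not finished, so such shards are reported as incomplete here.

```python
from pathlib import Path


def missing_shards(model_dir: str, total: int = 9) -> list[str]:
    """Return shard file names that are absent or still being downloaded."""
    root = Path(model_dir)
    missing = []
    for i in range(1, total + 1):
        name = f"DeepSeek-R1-Q4_K_M-{i:05d}-of-{total:05d}.gguf"
        shard = root / name
        control = root / (name + ".aria2")  # present while aria2 has not finished
        if not shard.is_file() or control.exists():
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_shards("/data/model/DeepSeek-R1")
    print("all shards present" if not gaps else f"missing or incomplete: {gaps}")
```

This checks presence only; it does not verify file sizes or checksums.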

Preparation

Prerequisites:

CUDA 12.1 or later. Official site: CUDA Toolkit 12.8 Downloads | NVIDIA Developer

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb 
sudo dpkg -i cuda-keyring_1.1-1_all.deb 
sudo apt-get update 
sudo apt-get -y install cuda-toolkit-12-8

Add the environment variables

# Adding CUDA to PATH
if [ -d "/usr/local/cuda/bin" ]; then
    export PATH=$PATH:/usr/local/cuda/bin
fi

if [ -d "/usr/local/cuda/lib64" ]; then
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
    # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
    # sudo ldconfig
fi

if [ -d "/usr/local/cuda" ]; then
    export CUDA_PATH=/usr/local/cuda
fi
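After these lines have been sourced, a quick check that the toolchain is visible can save a failed build later. The following is a sketch using only the standard library; cuda_env_problems is a hypothetical helper, and the test for "cuda" in the library path is a rough heuristic rather than a proper validation.

```python
import os
import shutil


def cuda_env_problems(env: dict[str, str]) -> list[str]:
    """Return human-readable problems found in the given environment mapping."""
    problems = []
    # nvcc should be reachable through PATH once /usr/local/cuda/bin is added.
    if shutil.which("nvcc", path=env.get("PATH", "")) is None:
        problems.append("nvcc not found on PATH")
    # Heuristic: at least one LD_LIBRARY_PATH entry should mention cuda.
    lib_dirs = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if not any("cuda" in p for p in lib_dirs):
        problems.append("no CUDA directory in LD_LIBRARY_PATH")
    if not env.get("CUDA_PATH"):
        problems.append("CUDA_PATH is not set")
    return problems


if __name__ == "__main__":
    found = cuda_env_problems(dict(os.environ))
    for line in found or ["CUDA environment looks fine"]:
        print(line)
```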

Install gcc, g++, and cmake

sudo apt-get install build-essential cmake ninja-build

Switch to the root user: sudo -i

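# Create the Python virtual environment
conda create --name ktransformers python=3.11

# Activate the environment:
conda activate ktransformers

# Install the package:
conda install -c conda-forge libstdcxx-ng

# Make sure the GNU C++ standard library used by Anaconda (under ~/anaconda3) includes the version identifier GLIBCXX_3.4.32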
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
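The strings output is long and sorted as text, so the required version is easy to overlook. The following is a sketch that scans the library bytes for version identifiers and compares them numerically; glibcxx_versions and has_glibcxx are hypothetical helpers, and the library path is the one assumed by the command above.

```python
import re
from pathlib import Path

# Matches identifiers such as GLIBCXX_3.4 or GLIBCXX_3.4.32 in raw bytes.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(blob: bytes) -> list[tuple[int, ...]]:
    """Return the sorted, de-duplicated GLIBCXX versions found in a binary blob."""
    found = {
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in _GLIBCXX.finditer(blob)
    }
    return sorted(found)


def has_glibcxx(blob: bytes, wanted: tuple[int, ...] = (3, 4, 32)) -> bool:
    """True when the library lists exactly the wanted GLIBCXX version."""
    return wanted in glibcxx_versions(blob)


if __name__ == "__main__":
    lib = Path.home() / "anaconda3/envs/ktransformers/lib/libstdc++.so.6"
    if lib.is_file():
        print("GLIBCXX_3.4.32 present:", has_glibcxx(lib.read_bytes()))
    else:
        print(f"library not found: {lib}")
```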

Make sure PyTorch, packaging, and ninja are installed

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip3 install packaging ninja cpufeature numpy

You should also download and install the matching version of flash-attention from Releases · Dao-AILab/flash-attention · GitHub

# Check your PyTorch and CUDA versions
python3 -c "import torch; print(torch.__version__)"
2.6.0+cu126
# Check the Python version
python3 --version
Python 3.11.11
# Check whether PyTorch was built with the C++11 ABI; GLIBCXX_USE_CXX11_ABI=1 in the output means it is enabled
python -c "import torch; print(torch.__config__.show())" | grep -o "GLIBCXX_USE_CXX11_ABI=[01]"



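# Based on this output, install a Flash-Attention build compatible with these versions.
# - The installed PyTorch is 2.6.0+cu126, i.e. built for CUDA 12.6: choose the cu12torch2.6 build
# - The C++11 ABI is enabled: choose the cxx11abiTRUE build
# - Python is 3.11.11: choose the cp311 build

# Download Flash-Attention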
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

# Install Flash-Attention
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
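The wheel name encodes the three values checked earlier: the CUDA major version and PyTorch minor version, the C++11 ABI setting, and the Python version. The following is a sketch of how those pieces map onto the file name, following the naming pattern of the wheel used above. flash_attn_wheel is a hypothetical helper, and whether a wheel exists for a given combination still has to be confirmed on the releases page.

```python
def flash_attn_wheel(torch_version: str, cxx11_abi: bool, python_version: str,
                     flash_version: str = "2.7.4.post1") -> str:
    """Build the expected wheel file name for a Linux x86_64 CUDA install."""
    base, _, local = torch_version.partition("+")   # "2.6.0", "cu126"
    if not local.startswith("cu"):
        raise ValueError(f"not a CUDA build of PyTorch: {torch_version!r}")
    torch_tag = ".".join(base.split(".")[:2])       # "2.6"
    cuda_major = local[2:4]                         # "12" from "cu126"
    py_major, py_minor = python_version.split(".")[:2]
    py_tag = f"cp{py_major}{py_minor}"              # "cp311"
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (
        f"flash_attn-{flash_version}+cu{cuda_major}torch{torch_tag}"
        f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
    )


if __name__ == "__main__":
    # In a real environment the inputs would come from torch.__version__,
    # torch.compiled_with_cxx11_abi() and platform.python_version().
    print(flash_attn_wheel("2.6.0+cu126", True, "3.11.11"))
```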

Installation

Download the source code and build:

Initialize the source code

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update

Simple install on Linux

bash install.sh

# For machines with two CPUs and 1 TB of RAM, enable NUMA and install with:
export USE_NUMA=1
bash install.sh
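Whether the USE_NUMA=1 variant applies depends on how many NUMA nodes the machine exposes. On Linux each node appears as a nodeN directory under /sys/devices/system/node, so the count can be read without extra tools. The following is a sketch; count_numa_nodes is a hypothetical helper, and a dual-socket machine usually, though not always (it depends on firmware settings), reports two or more nodes.

```python
import re
from pathlib import Path


def count_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> int:
    """Count nodeN directories; returns 0 if the sysfs path does not exist."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    return sum(
        1 for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"node\d+", entry.name)
    )


if __name__ == "__main__":
    nodes = count_numa_nodes()
    print(f"NUMA nodes: {nodes}")
    if nodes >= 2:
        print("multiple NUMA nodes: the USE_NUMA=1 build may apply")
```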

Run the model from the command line

python -m ktransformers.local_chat --model_path /home/ubuntu/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-R1 --gguf_path /data/model/DeepSeek-R1 --cpu_infer 62 --optimize_config_path ./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml

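--model_path (required): the model name (for example "deepseek-ai/DeepSeek-V2-Lite-Chat"), in which case the configuration is downloaded automatically from Hugging Face. If you already have the files locally, you can pass that path directly to initialize the model.
Note: the directory does not need to contain .safetensors files. Only the configuration files are needed to build the model and tokenizer.
--gguf_path (required): path to the directory containing the GGUF files, which can be downloaded from Hugging Face. The directory should contain only the GGUF files of the current model, so each model needs its own directory.
--optimize_config_path (required except for Qwen2Moe and DeepSeek-V2): path to the YAML file containing the optimization rules. Two rule files are pre-written in the ktransformers/optimize/optimize_rules directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
--max_new_tokens: Int (default = 1000). Maximum number of new tokens to generate.
--cpu_infer: Int (default = 10). Number of CPUs used for inference. Ideally set to (total number of cores - 2).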
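The rule of thumb for --cpu_infer (total cores minus 2) can be applied mechanically when assembling the command. The following is a sketch; suggest_cpu_infer and local_chat_command are hypothetical helpers, and the flags are only the ones described above. Note that os.cpu_count() reports logical CPUs, which is higher than the physical core count on machines with hyper-threading, so the suggested value may need to be lowered by hand.

```python
import os
import shlex


def suggest_cpu_infer(total_cores: int) -> int:
    """Total cores minus 2, never below 1."""
    return max(1, total_cores - 2)


def local_chat_command(model_path: str, gguf_path: str, cpu_infer: int,
                       optimize_config_path: str, max_new_tokens: int = 1000) -> str:
    """Assemble the ktransformers.local_chat invocation as a shell-safe string."""
    args = [
        "python", "-m", "ktransformers.local_chat",
        "--model_path", model_path,
        "--gguf_path", gguf_path,
        "--cpu_infer", str(cpu_infer),
        "--optimize_config_path", optimize_config_path,
        "--max_new_tokens", str(max_new_tokens),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # logical CPUs; may overcount physical cores
    print(local_chat_command(
        "/home/ubuntu/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-R1",
        "/data/model/DeepSeek-R1",
        suggest_cpu_infer(cores),
        "./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml",
    ))
```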

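Successful run

Because hardware differs from machine to machine, the generation speed in tokens/s varies considerably with memory speed, GPU performance, CPU performance, and so on.

Problems encountered

Problem: Turing and Volta GPUs are incompatible with the default configuration

If you reach the chat prompt but get the following error as soon as you send a message: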

RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
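CUDA reports "invalid device function" when a kernel was not compiled for the architecture of the GPU it is launched on. To see whether a card belongs to the affected families, its compute capability can be queried: Volta is 7.0/7.2 and Turing is 7.5, while the RTX 4090 (Ada Lovelace) is 8.9. The following is a diagnostic sketch only; gpu_architecture and is_volta_or_turing are hypothetical helpers, and the table covers just a few common architectures.

```python
def gpu_architecture(major: int, minor: int) -> str:
    """Map a CUDA compute capability to its architecture family name."""
    table = {
        (7, 0): "Volta", (7, 2): "Volta", (7, 5): "Turing",
        (8, 0): "Ampere", (8, 6): "Ampere", (8, 7): "Ampere",
        (8, 9): "Ada Lovelace", (9, 0): "Hopper",
    }
    return table.get((major, minor), "unknown")


def is_volta_or_turing(major: int, minor: int) -> bool:
    """True for the two families named in the problem description."""
    return gpu_architecture(major, minor) in {"Volta", "Turing"}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # (major, minor) of GPU 0
        print(cap, gpu_architecture(*cap), "affected:", is_volta_or_turing(*cap))
    else:
        print("PyTorch with a visible CUDA device is required for this check")
```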