One way to debug DeepSpeed in VS Code (no script changes required)

news2026/3/13 6:41:27

The current DeepSpeed launch script is:

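# RTX 4000 series GPUs do not support the faster P2P or IB communication paths, so the following two environment variables must be set
# Disable NCCL P2P communication to avoid possible compatibility problems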
export NCCL_P2P_DISABLE="1"
# Disable NCCL IB communication to suit RTX 4000 series GPUs
export NCCL_IB_DISABLE="1"

# Point Hugging Face downloads at a mirror of the model hub, which makes fetching models and other resources easier
export HF_ENDPOINT=https://hf-mirror.com

# Run simple_LLaVA_run.py with the deepspeed launcher
# --include localhost:0,1 runs the job on GPUs 0 and 1 of the local machine
# Note: localhost is the local machine; 0 and 1 are GPU indices
deepspeed --include localhost:0,1 simple_LLaVA_run.py \
    --deepspeed ds_zero2_no_offload.json \
    --model_name_or_path /home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01 \
    --train_type use_lora \
    --data_path /home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data \
    --remove_unused_columns false \
    --bf16 true \
    --fp16 false \
    --dataloader_pin_memory True \
    --dataloader_num_workers 10 \
    --dataloader_persistent_workers True \
    --output_dir output_model_user_lora_simple_train \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 3 \
    --report_to "tensorboard" \
    --learning_rate 4e-4 \
    --logging_steps 10
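
One thing to keep in mind when comparing this two-GPU run with a single-GPU debug session is that the number of samples per optimizer step changes. The sketch below works through that arithmetic using the values from the script above. It assumes plain data parallelism, where each GPU processes its own micro-batch; `effective_batch_size` is a hypothetical helper written for this illustration and is not part of DeepSpeed or transformers.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples that contribute to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values from the script: --per_device_train_batch_size 1, --gradient_accumulation_steps 8
two_gpu = effective_batch_size(1, 8, 2)  # --include localhost:0,1
one_gpu = effective_batch_size(1, 8, 1)  # a single-GPU debug run

print(two_gpu, one_gpu)
```

With the same flags, a single-GPU session therefore takes optimizer steps on half as many samples, which is usually acceptable for stepping through code but can matter if loss values are compared against the two-GPU run.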

To debug the Python program launched by this deepspeed command in VS Code, one approach is:

1) Click the "Debug" button in the sidebar

Then click "Settings"; the launch.json file will open.

2) Add an entry to launch.json

In launch.json, add the following entry to the "configurations" array:

{
            "name": "DeepSpeed debug (single GPU)",
            "type": "debugpy",
            "request": "launch",
            "program": "/home/louis/anaconda3/envs/unsloth_env_py311_torch240/bin/deepspeed",  // replace with the path to the deepspeed launcher script in your own environment
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--num_gpus", "1",
                "/home/louis/LK/study/transformers/lk_study/llava_study/simple_LLaVA_run.py",
                "--deepspeed", "/home/louis/LK/study/transformers/lk_study/llava_study/ds_zero2_no_offload.json",
                "--model_name_or_path", "/home/louis/LK/study/transformers/lk_study/llava_study/my_llava_model/model_01",
                "--train_type", "use_lora",
                "--data_path", "/home/louis/LK/study/transformers/lk_study/llava_study/train_llava/data",
                "--remove_unused_columns", "false",
                "--bf16", "true",
                "--fp16", "false",
                "--dataloader_pin_memory", "True",
                "--dataloader_num_workers", "10",
                "--dataloader_persistent_workers", "True",
                "--output_dir", "output_model_user_lora_simple_train",
                "--num_train_epochs", "10",
                "--per_device_train_batch_size", "1",
                "--per_device_eval_batch_size", "1",
                "--gradient_accumulation_steps", "8",
                "--evaluation_strategy", "no",
                "--save_strategy", "epoch",
                "--save_total_limit", "3",
                "--report_to", "tensorboard",
                "--learning_rate", "4e-4",
                "--logging_steps", "10"
            ],
            "env": {
                "NCCL_P2P_DISABLE": "1",
                "NCCL_IB_DISABLE": "1",
                "HF_ENDPOINT": "https://hf-mirror.com",
                "CUDA_VISIBLE_DEVICES": "0",  // key setting: force single-GPU debugging
                "PYTHONUNBUFFERED": "1",      // make sure log output appears immediately
                "CUDA_LAUNCH_BLOCKING": "1"   // run CUDA operations synchronously
            }
}
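
Copying each flag from the shell script into the "args" array by hand is easy to get wrong, so the conversion can be sketched in Python. This is a minimal illustration rather than part of the original setup: `shell_args_to_list` is a hypothetical helper, the command is shortened to a few flags, and the "program" path is a placeholder to replace with your own deepspeed location. It uses only the standard-library `shlex` and `json` modules.

```python
import json
import shlex


def shell_args_to_list(command: str) -> list[str]:
    """Split a shell-style argument string into the list form launch.json expects."""
    # Join backslash-continued lines first: shlex.split does not treat
    # backslash-newline as a line continuation the way a shell does.
    joined = command.replace("\\\n", " ")
    return shlex.split(joined)


# A shortened copy of the arguments from the shell script (raw string keeps the backslashes).
command = r"""
simple_LLaVA_run.py \
    --deepspeed ds_zero2_no_offload.json \
    --evaluation_strategy "no" \
    --learning_rate 4e-4
"""

args = shell_args_to_list(command)

config = {
    "name": "DeepSpeed debug (single GPU)",
    "type": "debugpy",
    "request": "launch",
    "program": "/path/to/env/bin/deepspeed",  # placeholder, not a real path
    "console": "integratedTerminal",
    "args": ["--num_gpus", "1", *args],
}

print(json.dumps(config, indent=4))
```

The sketch leaves relative paths such as `simple_LLaVA_run.py` and `ds_zero2_no_offload.json` untouched. The configuration above uses absolute paths instead, which avoids depending on the working directory the debugger starts in.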