The server.py and req.py code used in this article can be found at: https://github.com/zysNLP/quickllm
Companion course: 《AIGC大模型理论与工业落地实战》 (AIGC large-model theory and industrial deployment in practice); DeepSeek-related lessons are being updated.
1. Pull the Docker image: nvcr.io/nvidia/pytorch:25.02-py3
docker pull nvcr.io/nvidia/pytorch:25.02-py3
2. Start the Docker container
docker run -idt --network host --shm-size=64g --name vllm --restart=always --gpus all -v /data2/users/yszhang/quickllm:/quickllm nvcr.io/nvidia/pytorch:25.02-py3 /bin/bash
3. Download the model from ModelScope
pip install modelscope
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir /data2/users/yszhang/quickllm/qwen2.5-vl-instruct
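If you prefer to script the download, the same model can also be fetched from Python; a minimal sketch, assuming a recent modelscope release that supports the local_dir argument:

from modelscope import snapshot_download

# Download Qwen2.5-VL-7B-Instruct into the same directory used by the CLI command above.
model_dir = snapshot_download(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    local_dir="/data2/users/yszhang/quickllm/qwen2.5-vl-instruct",
)
print(model_dir)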
4. Enter the Docker container, set up the conda environment, and clone LLaMA-Factory
docker exec -it vllm /bin/bash
cd /quickllm
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n sft python=3.11
conda activate sft
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
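Before launching the web UI, it can save time to confirm that the new environment actually sees the GPUs passed through by --gpus all; a minimal sketch:

import torch

# Quick sanity check inside the sft environment: PyTorch version, CUDA availability, GPU count.
print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())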
5. Launch the LLaMA-Factory web UI
llamafactory-cli webui
6. Train the model and merge the LoRA weights
# Merged model path: /quickllm/LLaMA-Factory/qwen2.5-mmlm0513; adjust to match your actual web UI settings (the LoRA merge itself is done from the web UI's Export tab, or with llamafactory-cli export)
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /quickllm/qwen2.5-vl-instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen2_vl \
--flash_attn auto \
--dataset_dir data \
--dataset mllm_demo \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--output_dir saves/Qwen2.5-VL-7B-Instruct/lora/train_2025-05-16-05-48-02 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
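For reference, the mllm_demo dataset that ships with LLaMA-Factory stores each sample as a messages list plus an images list; the sketch below writes a custom dataset in that (assumed) layout, which would additionally need an entry in data/dataset_info.json before it can be selected via --dataset:

import json

# One sample in the mllm_demo-style multimodal format (field names assumed from the demo file).
samples = [
    {
        "messages": [
            {"role": "user", "content": "<image>What is shown in this picture?"},
            {"role": "assistant", "content": "A short description of the picture."},
        ],
        "images": ["mllm_demo_data/1.jpg"],  # image path relative to the data directory
    }
]

with open("data/my_mllm_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)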
7. Create a conda environment and install vLLM and transformers
conda create -n vllm python=3.11
conda activate vllm
pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
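A quick check that both packages import cleanly in the new environment (a minimal sketch):

import transformers
import vllm

# Print the installed versions; the use_fast warning in the logs below is emitted by transformers.
print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)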
8. Start the vLLM + FastAPI service
python server.py
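The actual server.py is in the repository linked at the top. The sketch below only illustrates how such a service can be wired up with vLLM's offline LLM API plus FastAPI, assuming the merged model from step 6 and a hypothetical request schema of parallel texts/images_b64 lists; the port 7868 matches the log output further down.

import base64
import io

import uvicorn
from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_PATH = "/quickllm/LLaMA-Factory/qwen2.5-mmlm0513"  # merged model from step 6

app = FastAPI()
processor = AutoProcessor.from_pretrained(MODEL_PATH)
llm = LLM(model=MODEL_PATH, trust_remote_code=True, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)


class ChatRequest(BaseModel):
    texts: list[str]       # one question per sample
    images_b64: list[str]  # one base64-encoded image per sample


@app.post("/chat")
def chat(req: ChatRequest):
    inputs = []
    for text, img_b64 in zip(req.texts, req.images_b64):
        image = Image.open(io.BytesIO(base64.b64decode(img_b64))).convert("RGB")
        # Build the Qwen2.5-VL chat prompt with an image placeholder.
        messages = [{"role": "user",
                     "content": [{"type": "image"}, {"type": "text", "text": text}]}]
        prompt = processor.apply_chat_template(messages, tokenize=False,
                                               add_generation_prompt=True)
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
    # vLLM processes the whole batch in a single generate() call.
    outputs = llm.generate(inputs, sampling_params)
    return {"answers": [o.outputs[0].text for o in outputs]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7868)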
9. Send requests to the service
python req.py
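Likewise, the actual req.py is in the repository; below is a minimal batched-client sketch against the assumed schema above (demo.jpg is a placeholder for any local test image):

import base64
import time

import requests

URL = "http://localhost:7868/chat"

with open("demo.jpg", "rb") as f:           # any local test image
    img_b64 = base64.b64encode(f.read()).decode()

batch_size = 1000                            # the logs below use batches of 10/100/1000
payload = {
    "texts": ["Describe this image."] * batch_size,
    "images_b64": [img_b64] * batch_size,    # same image-text pair repeated, as in the test
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
print(f"{batch_size} samples in {time.time() - start:.1f}s")
print(resp.json()["answers"][0])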
Measured inference speed
Requests are sent in batches with the same sample repeated; throughput is roughly 1000 samples per 20 s. (Because the same image-text pair is reused, the speed here is on the high side; with different image-text data it will be somewhat slower, but still very fast!)
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO: Started server process [19930]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7868 (Press CTRL+C to quit)
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 7.68it/s, est. speed input: 899.09 toks/s, output: 430.33 toks/s]
INFO: 127.0.0.1:60618 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 7.58it/s, est. speed input: 886.83 toks/s, output: 424.46 toks/s]
INFO: 127.0.0.1:60620 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 7.58it/s, est. speed input: 887.41 toks/s, output: 424.74 toks/s]
INFO: 127.0.0.1:44776 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 47.28it/s, est. speed input: 5532.30 toks/s, output: 2647.93 toks/s]
INFO: 127.0.0.1:47144 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:16<00:00, 62.39it/s, est. speed input: 7299.34 toks/s, output: 3493.70 toks/s]
INFO: 127.0.0.1:38156 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:15<00:00, 62.51it/s, est. speed input: 7313.23 toks/s, output: 3500.35 toks/s]
INFO: 127.0.0.1:50830 - "POST /chat HTTP/1.1" 200 OK
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:16<00:00, 62.35it/s, est. speed input: 7295.48 toks/s, output: 3491.85 toks/s]