Phi-3.5-vision-instruct is a model in Microsoft's recently released Phi-3.5 family, aimed at multimodal tasks and, in particular, visual reasoning.
The model covers broad image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video-clip summarization, which makes it a good fit for many AI-driven applications; it shows clear gains on image- and video-related benchmarks.
Architecturally, Phi-3.5-vision-instruct is a 4.2-billion-parameter system that combines an image encoder, a connector, a projector, and the Phi-3 Mini language model. Training used 256 NVIDIA A100-80G GPUs and took 6 days.
On the MMMU benchmark, Phi-3.5-vision scores 43.0, an improvement over the previous version that reflects stronger handling of complex image-understanding tasks.
GitHub project: https://github.com/microsoft/Phi-3CookBook
I. Environment Setup
1. Python environment
A Python version of 3.10 or later is recommended.
2. Installing dependencies with pip
pip install torch==2.3.0+cu118 torchvision==0.18.0+cu118 torchaudio==2.3.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --upgrade transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation
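After installation, a quick sanity check like the one below (a minimal sketch; the printed versions depend on your environment) confirms that the CUDA build of PyTorch is active and that flash-attn is importable:

import torch

# Verify that the CUDA build of PyTorch is active.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# flash-attn is optional, but required for _attn_implementation='flash_attention_2'.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; use _attn_implementation='eager' instead.")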
3. Model download
git lfs install
git clone https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct
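Alternatively, the weights can be fetched with the ModelScope Python SDK (a sketch assuming the modelscope package is installed; the returned path can later be passed to the test script as --model_path):

from modelscope import snapshot_download

# Download the Phi-3.5-vision-instruct weights to the local ModelScope cache
# and print the resulting directory.
model_dir = snapshot_download('LLM-Research/Phi-3.5-vision-instruct')
print(model_dir)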
II. Functional Testing
1. Running the test
(1) Calling the model from Python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
import argparse


class VisionInstructModel:
    def __init__(self, model_path, local_image_path, torch_dtype='auto'):
        self.model_path = model_path
        self.local_image_path = local_image_path
        self.torch_dtype = torch_dtype
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self._load_model_and_processor()

    def _load_model_and_processor(self):
        # trust_remote_code is required: Phi-3.5-vision ships its own modeling and processing code.
        self.processor = AutoProcessor.from_pretrained(self.model_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            trust_remote_code=True,
            torch_dtype=self.torch_dtype,
            _attn_implementation='flash_attention_2'  # use 'eager' here if flash-attn is not installed
        ).to(self.device)

    def _prepare_input(self, prompt, image_path):
        # The processor takes the prompt text plus a list of PIL images.
        image = Image.open(image_path)
        return self.processor(prompt, [image], return_tensors="pt").to(self.device)

    def generate_response(self, prompt, max_new_tokens=1000):
        inputs = self._prepare_input(prompt, self.local_image_path)
        generate_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=self.processor.tokenizer.eos_token_id
        )
        # Strip the prompt tokens so only the newly generated text is decoded.
        generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
        response = self.processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        return response

    def describe_image(self):
        # Phi-3.5-vision chat format: <|user|> ... <|image_1|> ... <|end|>, then <|assistant|>.
        user_prompt = '<|user|>\n'
        assistant_prompt = '<|assistant|>\n'
        prompt_suffix = "<|end|>\n"
        prompt = f"{user_prompt}<|image_1|>\nDescribe the picture{prompt_suffix}{assistant_prompt}"
        response = self.generate_response(prompt)
        print("response:", response)
        return response


def main(model_path, image_path):
    # Pass a torch.dtype rather than the string 'bfloat16', which older transformers versions reject.
    model = VisionInstructModel(model_path, image_path, torch_dtype=torch.bfloat16)
    model.describe_image()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run VisionInstructModel to describe an image.")
    parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
    parser.add_argument("--image_path", type=str, required=True, help="Path to the image file.")

    args = parser.parse_args()
    main(args.model_path, args.image_path)
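Assuming the script above is saved as phi35_vision_test.py (the filename and both paths below are placeholders), it can be run as:

python phi35_vision_test.py --model_path ./Phi-3.5-vision-instruct --image_path ./test.jpg

The script only exercises single-image description. For the multi-image summarization mentioned earlier, the chat format numbers each image placeholder (<|image_1|>, <|image_2|>, ...) and the images are passed to the processor as a list. A minimal sketch reusing the class above (summarize_images is a hypothetical helper, not part of the original script):

from PIL import Image

def summarize_images(vision_model, image_paths, question="Summarize these images."):
    # Build a prompt with one numbered placeholder per image.
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, len(image_paths) + 1))
    prompt = f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"
    images = [Image.open(p) for p in image_paths]
    inputs = vision_model.processor(prompt, images, return_tensors="pt").to(vision_model.device)
    generate_ids = vision_model.model.generate(
        **inputs,
        max_new_tokens=1000,
        eos_token_id=vision_model.processor.tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens.
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    return vision_model.processor.batch_decode(generate_ids, skip_special_tokens=True)[0]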
To be continued... For more details, follow: 杰哥新技术