Phi-3.5-vision-instruct is a model in Microsoft's recently released Phi-3.5 family, aimed at multimodal tasks and, in particular, visual reasoning.
The model covers broad image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video-clip summarization, which makes it a good fit for many AI-driven applications; it shows clear gains on image- and video-related benchmarks.
Architecturally, Phi-3.5-vision-instruct is a 4.2-billion-parameter system that combines an image encoder, a connector, a projector, and the Phi-3 Mini language model. Training used 256 NVIDIA A100-80G GPUs and took 6 days.
On the MMMU benchmark, Phi-3.5-vision scores 43.0, an improvement over the previous version that reflects stronger handling of complex image-understanding tasks.
GitHub project: https://github.com/microsoft/Phi-3CookBook
I. Environment Setup
1. Python environment
A Python version of 3.10 or later is recommended.
2. Installing dependencies with pip
pip install torch==2.3.0+cu118 torchvision==0.18.0+cu118 torchaudio==2.3.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --upgrade transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation
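After installation, a quick sanity check like the one below (a minimal sketch; the printed versions depend on your environment) confirms that the CUDA build of PyTorch is active and that flash-attn is importable:

import torch

# Verify that the CUDA build of PyTorch is active.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# flash-attn is optional, but required for _attn_implementation='flash_attention_2'.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; use _attn_implementation='eager' instead.")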
3. Model download
git lfs install
git clone https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct
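Alternatively, the weights can be fetched with the ModelScope Python SDK (a sketch assuming the modelscope package is installed; the returned path can later be passed to the test script as --model_path):

from modelscope import snapshot_download

# Download the Phi-3.5-vision-instruct weights to the local ModelScope cache
# and print the resulting directory.
model_dir = snapshot_download('LLM-Research/Phi-3.5-vision-instruct')
print(model_dir)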
II. Functional Testing
1. Running the test
(1) Calling the model from Python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
import argparse


class VisionInstructModel:
    def __init__(self, model_path, local_image_path, torch_dtype='auto'):
        self.model_path = model_path
        self.local_image_path = local_image_path
        self.torch_dtype = torch_dtype
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self._load_model_and_processor()

    def _load_model_and_processor(self):
        # trust_remote_code is required: Phi-3.5-vision ships its own modeling and processing code.
        self.processor = AutoProcessor.from_pretrained(self.model_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            trust_remote_code=True,
            torch_dtype=self.torch_dtype,
            _attn_implementation='flash_attention_2'  # use 'eager' here if flash-attn is not installed
        ).to(self.device)

    def _prepare_input(self, prompt, image_path):
        # The processor takes the prompt text plus a list of PIL images.
        image = Image.open(image_path)
        return self.processor(prompt, [image], return_tensors="pt").to(self.device)

    def generate_response(self, prompt, max_new_tokens=1000):
        inputs = self._prepare_input(prompt, self.local_image_path)
        generate_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=self.processor.tokenizer.eos_token_id
        )
        # Strip the prompt tokens so only the newly generated text is decoded.
        generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
        response = self.processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        return response

    def describe_image(self):
        # Phi-3.5-vision chat format: <|user|> ... <|image_1|> ... <|end|>, then <|assistant|>.
        user_prompt = '<|user|>\n'
        assistant_prompt = '<|assistant|>\n'
        prompt_suffix = "<|end|>\n"
        prompt = f"{user_prompt}<|image_1|>\nDescribe the picture{prompt_suffix}{assistant_prompt}"
        response = self.generate_response(prompt)
        print("response:", response)
        return response


def main(model_path, image_path):
    # Pass a torch.dtype rather than the string 'bfloat16', which older transformers versions reject.
    model = VisionInstructModel(model_path, image_path, torch_dtype=torch.bfloat16)
    model.describe_image()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run VisionInstructModel to describe an image.")
    parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
    parser.add_argument("--image_path", type=str, required=True, help="Path to the image file.")

    args = parser.parse_args()
    main(args.model_path, args.image_path)
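Assuming the script above is saved as phi35_vision_test.py (the filename and both paths below are placeholders), it can be run as:

python phi35_vision_test.py --model_path ./Phi-3.5-vision-instruct --image_path ./test.jpg

The script only exercises single-image description. For the multi-image summarization mentioned earlier, the chat format numbers each image placeholder (<|image_1|>, <|image_2|>, ...) and the images are passed to the processor as a list. A minimal sketch reusing the class above (summarize_images is a hypothetical helper, not part of the original script):

from PIL import Image

def summarize_images(vision_model, image_paths, question="Summarize these images."):
    # Build a prompt with one numbered placeholder per image.
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, len(image_paths) + 1))
    prompt = f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"
    images = [Image.open(p) for p in image_paths]
    inputs = vision_model.processor(prompt, images, return_tensors="pt").to(vision_model.device)
    generate_ids = vision_model.model.generate(
        **inputs,
        max_new_tokens=1000,
        eos_token_id=vision_model.processor.tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens.
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    return vision_model.processor.batch_decode(generate_ids, skip_special_tokens=True)[0]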
To be continued... For more details, follow: 杰哥新技术