Vllm快速入门

news2026/3/17 22:45:47

背景vLLM 是一个用于 LLM 推理和服务的快速易用的库。vLLM 最初是在加州大学伯克利分校的 Sky Computing Lab 开发的现已发展成为一个社区驱动的项目融合了学术界和工业界的贡献。核心细节参考官网https://docs.vllm.com.cn/en/latest/快速入门这里仅仅介绍cuda其他显卡请参考官网uv venv --python 3.12 --seed source .venv/bin/activate uv pip install vllm --torch-backendauto离线批量推理# SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project from vllm import LLM, SamplingParams # Sample prompts. prompts [ Hello, my name is, The president of the United States is, The capital of France is, The future of AI is, ] # Create a sampling params object. sampling_params SamplingParams(temperature0.8, top_p0.95) def main(): # Create an LLM. llm LLM(modelfacebook/opt-125m) # Generate texts from the prompts. # The output is a list of RequestOutput objects # that contain the prompt, generated text, and other information. outputs llm.generate(prompts, sampling_params) # Print the outputs. print(\nGenerated Outputs:\n - * 60) for output in outputs: prompt output.prompt generated_text output.outputs[0].text print(fPrompt: {prompt!r}) print(fOutput: {generated_text!r}) print(- * 60) if __name__ __main__: main()采样温度设置为0.8 nucleus sampling 概率设置为0.95分别控制着“抽奖的随机程度”和“候选词的范围”。兼容 OpenAI 的服务器from vllm import LLM llm LLM(modelmeta-llama/Meta-Llama-3-8B-Instruct) conversation [ { role: system, content: You are a helpful assistant, }, { role: user, content: Hello, }, { role: assistant, content: Hello! How can I assist you today?, }, { role: user, content: Write an essay about the importance of higher education., }, ] outputs llm.chat(conversation) for output in outputs: prompt output.prompt generated_text output.outputs[0].text print(fPrompt: {prompt!r}, Generated text: {generated_text!r})如果模型没有聊天模板或您想指定另一个您可以显式地传递一个聊天模板。from vllm.entrypoints.chat_utils import load_chat_template # You can find a list of existing chat templates under examples/ custom_template load_chat_template(chat_templatepath_to_template) print(Loaded chat template:, custom_template) outputs llm.chat(conversation, chat_templatecustom_template)Ray Serve LLM¶Ray Serve LLM 实现了 vLLM 引擎的可扩展、生产级服务。它与 vLLM 紧密集成并增加了自动扩展、负载均衡和反压等功能。主要功能提供与 OpenAI 兼容的 HTTP API 和 Pythonic API。可从单个 GPU 扩展到多节点集群无需更改代码。通过 Ray Dashboard 和指标提供可观测性和自动扩展策略。以下示例展示了如何使用 Ray Serve LLM 部署 DeepSeek R1 等大型模型 examples/online_serving/ray_serve_deepseek.py。通过官方 Ray Serve LLM 文档了解更多关于 Ray Serve LLM 的信息。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.coloradmin.cn/o/2420886.html

如若内容造成侵权/违法违规/事实不符，请联系多彩编程网进行投诉反馈，一经查实，立即删除！