From 0 to 1: Hands-On Model Deployment and Evaluation

news2026/3/24 10:09:39

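The steps below were run on macOS. On Windows, substitute the equivalent commands where they differ.

1. Install Python

Python 3.9 or later is recommended.

2. Create a Python virtual environment

    # Create the virtual environment
    python3 -m venv venv
    # Activate it; every remaining shell command must be run inside the virtual environment
    source venv/bin/activate

When all work is finished, leave the virtual environment:

    deactivate

3. Install the evaluation framework

    # Download the evaluation framework
    git clone https://github.com/EleutherAI/lm-evaluation-harness
    # Install it in editable mode
    cd lm-evaluation-harness
    pip install -e .

4. Download a small model

You can download a small model directly from https://huggingface.co/ to a local directory, or download it from code.

| Model name | Description |
|---|---|
| gpt2 | GPT-2 base model; very small and well suited to a first pass through the evaluation pipeline |
| EleutherAI/pythia-160m | Small model with roughly 160M parameters; fast to train and evaluate |
| StabilityAI/stablelm-2-1.6b | Mid-sized open model with a good balance of quality and speed; runs locally (published on the Hub as stabilityai/stablelm-2-1_6b) |

Taking gpt2 as the example:

    # First install transformers
    pip install transformers
    # Then install torch
    pip install torch
    # Then install accelerate
    pip install accelerate
    # Once everything is installed, verify with:
    python -c "import torch; import transformers; import accelerate; print('All good!')"

Download the gpt2 model from Python code:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # could also be "EleutherAI/pythia-160m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

This code downloads the model weights automatically into the local cache (by default ~/.cache/huggingface/hub in current versions; older transformers releases used ~/.cache/huggingface/transformers).

Common problem 1: the script fails because PyTorch is not installed

Install command (CPU build):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

For the GPU build (CUDA 11.8):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you are unsure about your GPU or driver, start with the CPU build. It is enough to practice with small models.

To verify the installation, run this in Python:

    import torch
    print(torch.__version__)
    print(torch.cuda.is_available())

Output similar to the following means PyTorch is installed and usable on the CPU. False only means no CUDA GPU was detected, which is fine because the GPU is optional here.

    2.1.0
    False

Common problem 2: the code times out

This is a network problem. Using a mirror inside China is the most reliable fix:

    import os

    # Point the Hugging Face client at a mirror to speed up downloads.
    # This must be set before transformers is imported.
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    # Let transformers manage the cache automatically; do not set a path by hand
    tokenizer = AutoTokenizer.from_pretrained(model_name)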
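    model = AutoModelForCausalLM.from_pretrained(model_name)
    print("Model loaded successfully!")
    print(f"Model parameter count: {sum(p.numel() for p in model.parameters()):,}")

Once loading finishes, the script prints the success message and the parameter count. If loading still fails, download the model's core files directly from its Hugging Face page instead.

5. Run the evaluation command

List the available evaluation tasks:

    lm-eval ls tasks

To assess a model's basic ability, for example scoring GPT-2 on the HellaSwag benchmark:

    lm_eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json

Note: if this fails with a connection error, see common problem 3 and run against a local copy of the dataset.

Parameters:

- --model hf: use the HuggingFace model backend
- --model_args pretrained=gpt2: model name; a local path also works
- --tasks hellaswag: name of the evaluation task
- --device cpu: set to cuda:0 if a GPU is available
- --batch_size 4: number of samples per batch
- --output_path results.json: JSON file the results are written to

The author estimates 5 to 10 minutes for the run; the CPU run logged later in this post took about 16 minutes. When it finishes you will see something like the following (this output comes from the local task hellaswag_local):

    {
      "results": {
        "hellaswag_local": {
          "name": "hellaswag_local",
          "alias": "hellaswag_local",
          "sample_len": 10042,
          "acc,none": 0.2891854212308305,
          "acc_stderr,none": 0.004524575892953094,
          "acc_norm,none": 0.31139215295757816,
          "acc_norm_stderr,none": 0.004621163476949437
        }
      }
    }

This means GPT-2 reaches an accuracy of roughly 28.92% on HellaSwag:

- acc,none → accuracy, 28.92%
- acc_stderr,none → standard error, 0.45% (the number after ±)
- acc_norm,none → length-normalized accuracy, 31.14%
- acc_norm_stderr,none → standard error, 0.46%

The same figures also appear in the process log file eval_output.log and in the console log output.

Several tasks can also be evaluated in a single run.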
Example:

    lm_eval --model hf \
      --model_args pretrained=gpt2 \
      --tasks hellaswag,mmlu \
      --device cpu \
      --batch_size 4 \
      --output_path full_results.json

Here is a run log. It comes from the local-dataset run (task hellaswag_local) described under common problem 3:

    2026-03-20:14:26:02 INFO [_cli.run:377] Including path: /Users/hongshao/dataset/tasks
    2026-03-20:14:26:02 INFO [_cli.run:378] Selected Tasks: ['hellaswag_local']
    2026-03-20:14:26:03 INFO [evaluator:213] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
    2026-03-20:14:26:03 INFO [evaluator:238] Initializing hf model, with arguments: {'pretrained': '/Users/hongshao/models/gpt2'}
    2026-03-20:14:26:05 INFO [models.huggingface:256] Using device 'cpu'
    2026-03-20:14:26:05 INFO [models.huggingface:518] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}
    Loading weights: 0%| | 0/148 [00:00<?, ?it/s]
    Loading weights: 100%|██████████| 148/148 [00:00<00:00, 66519.18it/s]
    2026-03-20:14:26:06 INFO [evaluator_utils:446] Selected tasks:
    2026-03-20:14:26:06 INFO [evaluator_utils:480] Task: hellaswag_local (/Users/hongshao/dataset/tasks/hellaswag_local.yaml)
    2026-03-20:14:26:06 INFO [api.task:312] Building contexts for hellaswag_local on rank 0...
    0%| | 0/10042 [00:00<?, ?it/s]
    3%|▎ | 296/10042 [00:00<00:08, 1216.45it/s]
    7%|▋ | 727/10042 [00:00<00:03, 2359.78it/s]
    12%|█▏ | 1181/10042 [00:00<00:02, 3112.42it/s]
    ... (middle of the log omitted) ...
    Running loglikelihood requests: 100%|█████████▉| 40164/40168 [16:02<00:00, 90.43it/s]
    Running loglikelihood requests: 100%|██████████| 40168/40168 [16:02<00:00, 41.73it/s]
    fatal: not a git repository (or any of the parent directories): .git
    2026-03-20:14:42:21 INFO [loggers.evaluation_tracker:247] Saving results aggregated
    hf ({'pretrained': '/Users/hongshao/models/gpt2'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4

    |     Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
    |---------------|------:|------|-----:|--------|---|-----:|---|-----:|
    |hellaswag_local|      1|none  |     0|acc     |↑  |0.2892|±  |0.0045|
    |               |       |none  |     0|acc_norm|↑  |0.3114|±  |0.0046|

The "fatal: not a git repository" line did not affect the results here. It appears to come from the harness trying to record a git commit hash while running outside a git checkout.

Common problem 1: ModuleNotFoundError: No module named 'accelerate'

Run this inside the virtual environment:

    pip install accelerate

Common problem 2: httpx.ConnectTimeout: [Errno 60] Operation timed out

The evaluation loads the model over the network, so it is exposed to network problems. The fix is to download the gpt2 model to a local directory and then load it from there:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_dir = "/Users/hongshao/models/gpt2"
    # local_files_only=True prevents any attempt to reach the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)

The evaluation command should point at the same local path:

    # Loading the model from a local path also avoids the unstable network
    lm_eval --model hf --model_args pretrained=/Users/hongshao/models/gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json

Common problem 3: timed out thrown while requesting HEAD https://huggingface.co/datasets/Rowan/hellaswag/resolve/main/README.md Retrying in 1s [Retry 1/5].

Cause: the model has already loaded, but lm-evaluation-harness is still trying to download the benchmark dataset from the HuggingFace Hub. The hellaswag dataset is not available locally by default and has to be fetched over the network. If the network is unstable or the site is blocked, the request times out.

Solution:

1. Open the HellaSwag dataset page: https://huggingface.co/datasets/Rowan/hellaswag

2. Click "Files and versions" and download the files to a local directory under /Users/hongshao/dataset/ (the configuration in step 4 uses /Users/hongshao/dataset/hellaswag). From this point a custom task definition is required, because lm-evaluation-harness has no CLI argument for loading a local evaluation dataset.

3. Handle the field differences.

Fields in the original hellaswag dataset:

    {
      "activity_label": "Removing ice from car",
      "ctx_a": "Then, the man writes over the snow...",
      "ctx_b": "then",
      "endings": ["option1", "option2", "option3", "option4"],
      "label": "3"  # string type
    }

Fields lm-eval needs:

    {
      "query": "Removing ice from car: Then, the man writes...",  # must be concatenated
      "choices": ["option1", "option2", "option3", "option4"],
      "gold": 3  # must be an integer
    }

4. Run the evaluation.

4.1 Create the local YAML configuration file /Users/hongshao/dataset/tasks/hellaswag_local.yaml:

    task: hellaswag_local
    dataset_path: /Users/hongshao/dataset/hellaswag
    dataset_name: null
    output_type: multiple_choice
    training_split: null
    validation_split: validation
    test_split: null
    process_docs: !function utils.process_docs
    doc_to_text: "{{query}}"
    doc_to_target: "{{gold}}"
    doc_to_choice: "choices"
    metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true
      - metric: acc_norm
        aggregation: mean
        higher_is_better: true
    metadata:
      version: 1.0

4.2 Create the local utility file /Users/hongshao/dataset/tasks/utils.py. The same result can be achieved with YAML alone, as described in the supplementary note further down.

    import re


    def preprocess(text):
        text = text.strip()
        # Bracketed tags such as [title] are artifacts in the source data
        text = text.replace(" [title]", ". ")
        text = re.sub("\\[.*?\\]", "", text)
        # Collapse double spaces left behind by the removals
        text = text.replace("  ", " ")
        return text


    def process_docs(dataset):
        def _process_doc(doc):
            ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
            # label is stored as a string; fall back to 0 if it cannot be parsed
            label = doc.get("label", 0)
            try:
                gold = int(label)
            except (ValueError, TypeError):
                gold = 0
            out_doc = {
                "query": preprocess(doc["activity_label"] + ": " + ctx),
                "choices": [preprocess(ending) for ending in doc["endings"]],
                "gold": gold,
            }
            return out_doc

        return dataset.map(_process_doc)

The process_docs function does three things:

1. Field concatenation: joins activity_label, ctx_a and ctx_b into a complete query.
2. Type conversion: turns label from the string "3" into the integer 3.
3. Text cleanup: preprocess removes extra spaces and bracket artifacts.

Run the evaluation inside the virtual environment:

    HF_ENDPOINT=https://hf-mirror.com lm-eval run \
      --model hf \
      --model_args pretrained=/Users/hongshao/models/gpt2 \
      --tasks hellaswag_local \
      --include_path /Users/hongshao/dataset/tasks \
      --device cpu \
      --batch_size 4 \
      --output_path /Users/hongshao/results.json

From here, just wait for the results.

Supplementary note: YAML only, without a Python utils.py file

This variant skips the preprocess cleanup step, so its scores may differ slightly from the utils.py version.

    task: hellaswag_simple
    dataset_path: /Users/hongshao/dataset/hellaswag
    dataset_name: null
    output_type: multiple_choice
    validation_split: validation
    doc_to_text: "{{activity_label}}: {{ctx_a}} {{ctx_b | capitalize}}"
    doc_to_target: "{{label | int}}"
    doc_to_choice: "{{endings}}"
    metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true
    metadata:
      version: 1.0

Ways to inspect the contents of a .parquet file

1) Use Python pandas (the simplest):

    source venv/bin/activate
    python -c "
    import pandas as pd
    df = pd.read_parquet('/Users/hongshao/dataset/hellaswag/data/validation-00000-of-00001.parquet')
    print(df.head(2))   # print the first 2 rows
    print(df.columns)   # print the column names
    print(df.shape)     # print the shape
    "

2) Use the datasets library, which is installed together with lm-eval:

    source venv/bin/activate
    python -c "
    from datasets import load_dataset
    ds = load_dataset('/Users/hongshao/dataset/hellaswag', split='validation')
    print(ds.features)  # show the fields
    print(ds[0])        # show the first record
    "

Output:

Field definitions:

    {'ind': Value('int32'), 'activity_label': Value('string'), 'ctx_a': Value('string'), 'ctx_b': Value('string'), 'ctx': Value('string'), 'endings': List(Value('string')), 'source_id': Value('string'), 'split': Value('string'), 'split_type': Value('string'), 'label': Value('string')}

First record:

    ind: 24
    activity_label: Roof shingle removal
    ctx_a: A man is sitting on a roof.
    ctx_b: he
    ctx: A man is sitting on a roof. he
    endings: [is using wrap to wrap a pair of skis., is ripping level tiles off., is holding a rubiks cube., starts pulling up roofing on a roof.]
    source_id: activitynet~v_-JhWjGDPHMY
    split: val
    split_type: indomain
    label: 3
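
To make the field conversion concrete, the sketch below applies the same concatenation, cleanup and type conversion to the first validation record shown above, without needing the datasets library. The helper name to_lm_eval_doc is hypothetical and only stands in for the per-record logic of _process_doc; it omits the fallback for unparsable labels.

```python
import re


def preprocess(text: str) -> str:
    """Same cleanup steps as the utils.py used in the walkthrough."""
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub(r"\[.*?\]", "", text)
    text = text.replace("  ", " ")
    return text


def to_lm_eval_doc(doc: dict) -> dict:
    """Hypothetical per-record helper: builds query, choices and gold."""
    ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
    return {
        "query": preprocess(doc["activity_label"] + ": " + ctx),
        "choices": [preprocess(ending) for ending in doc["endings"]],
        "gold": int(doc["label"]),  # the dataset stores the label as a string
    }


# First validation record, copied from the inspection output
first_record = {
    "activity_label": "Roof shingle removal",
    "ctx_a": "A man is sitting on a roof.",
    "ctx_b": "he",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubiks cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",
}

converted = to_lm_eval_doc(first_record)
print(converted["query"])                       # the prompt the model is scored on
print(converted["choices"][converted["gold"]])  # the correct ending
```

Running it prints the query "Roof shingle removal: A man is sitting on a roof. He" followed by the correct ending "starts pulling up roofing on a roof.", which is a quick way to check the conversion before starting a full evaluation run.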
