15 ChatGLM3 half-precision fine-tuning

2025/12/18 15:56:26

1 Model preparation

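The data is the same as before, but for the model we use ChatGLM3 (chatglm3-6b). The model has 6B parameters, so full fine-tuning needs roughly 24*4 = 96 GB: about 24 GB for the fp32 weights, times roughly four once gradients and optimizer states are counted. That is a steep hardware requirement, so we fine-tune in half precision instead. Half-precision fine-tuning has plenty of pitfalls, so take care not to step into them.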
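The arithmetic behind the 96 GB figure can be sketched as follows. This is only a rough estimate: it assumes fp32 storage, an Adam-style optimizer with two moment buffers per parameter, and it ignores activations. The function name `full_finetune_gb` is made up for illustration.

```python
def full_finetune_gb(n_params: float, bytes_per_param: int, copies: int = 4) -> float:
    """Rough memory estimate in GB (10^9 bytes).

    copies=4 assumes weights + gradients + two Adam moment buffers,
    all stored at the same precision. Activations are ignored.
    """
    return n_params * bytes_per_param * copies / 1e9


if __name__ == "__main__":
    n = 6e9  # about 6B parameters
    print("fp32 full fine-tuning:", full_finetune_gb(n, 4), "GB")
    print("bf16 weights only:", full_finetune_gb(n, 2, copies=1), "GB")
```

The second line shows why loading the weights in bf16 and training only a small adapter fits on far more modest hardware.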

# Dependency: pip install modelscope

# pip install transformers==4.40.2 (for reasons I have not tracked down, inference had problems with the version I used before)

Model

http://chatGLM3

The model files are large, a dozen or so GB in total, so try the download yourself.

2 Model introduction

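If we assume ChatGLM3 is the successor in the ChatGLM series, we can guess that it further improves and extends the existing ChatGLM models. Such improvements might include, but are not limited to, the following:

Model scale: more parameters, to improve expressiveness and generalization.
Architecture: new design elements, such as more advanced attention mechanisms or other innovations, to improve performance.
Training data: more training data, especially high-quality dialogue data, to strengthen understanding and generation.
Optimization: more efficient training methods and optimizers, to speed up training and convergence.
Multimodality: better handling of other modalities (images, video, and so on), making it a more complete multimodal model.
Safety and ethics: tighter control over the safety and ethics of model output, so that generated content is more reliable.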

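ChatGLM2 and ChatGLM3 have exactly the same model architecture, while the original ChatGLM differs from its successors. So ChatGLM3 brings no architectural improvement over ChatGLM2.

Changes in ChatGLM2 and ChatGLM3 relative to ChatGLM:

The vocabulary shrank from 150528 in ChatGLM to 65024 (one noticeable effect is that ChatGLM2 and ChatGLM3 load a good deal faster than ChatGLM).
The position encoding went from one copy per GLMBlock to a single global copy.
The feed-forward network after SelfAttention is different. ChatGLM uses GELU (Gaussian Error Linear Unit) as its activation; ChatGLM2 and ChatGLM3 use Swish-1. ChatGLM2 and ChatGLM3 also appear to fix an earlier bug: in a GLU (Gated Linear Unit), half of the input serves only as the gate and does not need to be passed on to the next layer, so the mismatch between the input and output widths in ChatGLM2 and ChatGLM3 (27392 -> 13696) is in fact correct.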
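To make the 27392 -> 13696 halving concrete, here is a minimal pure-Python sketch of a Swish-gated GLU. It is an illustration under the assumption that the first half of the vector is the gate and the second half is the value; the names `silu` and `swiglu` are chosen for this example and are not taken from the model code.

```python
import math


def silu(x: float) -> float:
    """Swish-1 (SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(h: list) -> list:
    """Split h in half; the activated first half gates the second half.

    The output is half as wide as the input, because the gate half is
    consumed by the multiplication and never passed on.
    """
    assert len(h) % 2 == 0, "input width must be even"
    half = len(h) // 2
    gate, value = h[:half], h[half:]
    return [silu(g) * v for g, v in zip(gate, value)]


if __name__ == "__main__":
    out = swiglu([1.0] * 27392)
    print(len(out))  # width after gating
```

Seen this way, the narrower output is a property of the gating itself rather than a shape mismatch.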

model

Fine-tuning with LoRA:

Tokenizing with ChatGLM produces:

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

ds = Dataset.load_from_disk("../data/")
# Remember to pass trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("../model/chatglm3-6b/", trust_remote_code=True)

def process_func(example):
    MAX_LENGTH = 256
    input_ids, attention_mask, labels = [], [], []
    instruction = "\n".join([example["instruction"], example["input"]]).strip()     # query
    instruction = tokenizer.build_chat_input(instruction, history=[], role="user")  # [gMASK]sop<|user|> \n query<|assistant|>
    response = tokenizer("\n" + example["output"], add_special_tokens=False)        # \n response, no eos token yet
    input_ids = instruction["input_ids"][0].numpy().tolist() + response["input_ids"] + [tokenizer.eos_token_id]
    attention_mask = instruction["attention_mask"][0].numpy().tolist() + response["attention_mask"] + [1]
    labels = [-100] * len(instruction["input_ids"][0].numpy().tolist()) + response["input_ids"] + [tokenizer.eos_token_id]
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)
tokenized_ds
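The label construction in process_func can be seen more easily on a toy example. The sketch below uses made-up token ids instead of a real tokenizer, and the helper name `build_example` is hypothetical; it only mirrors the concatenate, mask, and truncate steps.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, response_ids, eos_id, max_length):
    """Concatenate prompt + response + eos and hide the prompt from the loss."""
    input_ids = prompt_ids + response_ids + [eos_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    # Truncate from the right, as process_func does. A sample longer than
    # max_length therefore loses its trailing eos token.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


if __name__ == "__main__":
    # Token ids here are invented purely for illustration.
    sample = build_example([11, 12, 13], [21, 22], eos_id=2, max_length=256)
    print(sample["labels"])
```

Only the response tokens and the eos token contribute to the loss; the prompt positions carry -100 and are ignored.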


import torch


# With multiple GPUs, leave out device_map="auto"; otherwise the model is split across the cards
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="../model/chatglm3-6b/",
                                             trust_remote_code=True, 
                                             torch_dtype=torch.bfloat16)

from peft import LoraConfig, TaskType, get_peft_model, PeftModel

config = LoraConfig(target_modules=["query_key_value"], modules_to_save=["post_attention_layernorm"])
config

model = get_peft_model(model, config)

model.print_trainable_parameters()
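As a sanity check on what print_trainable_parameters() reports, the LoRA part can be estimated by hand. The numbers below are assumptions rather than values read from the checkpoint: rank 8 (the PEFT default when r is not set), hidden size 4096, a query_key_value output width of 4608, and 28 layers. The helper `lora_params` is made up for this sketch.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two small matrices next to one linear layer:
    A with shape (r, in_features) and B with shape (out_features, r)."""
    return r * in_features + out_features * r


if __name__ == "__main__":
    # Assumed values, not read from the model files.
    r = 8
    hidden = 4096
    qkv_out = 4608
    num_layers = 28

    per_layer = lora_params(hidden, qkv_out, r)
    print("LoRA params per layer:", per_layer)
    print("LoRA params in total:", per_layer * num_layers)
```

Because modules_to_save also trains a full copy of each listed module, the count printed by PEFT should come out somewhat higher than this LoRA-only estimate.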

from transformers.trainer_callback import TrainerCallback
import matplotlib.pyplot as plt

class PrintLossCallback(TrainerCallback):
    
    def __init__(self):
        self.losses = []
        self.steps = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Print and record the training loss. The final summary log has no
        # 'loss' key, so skip any log entry that lacks it.
        if logs is None or "loss" not in logs:
            return
        lr = logs.get("learning_rate", float("nan"))
        print(f"Step {state.global_step}: Loss={logs['loss']:.4f}, Learning Rate={lr:.6f}")
        self.losses.append(logs["loss"])
        self.steps.append(state.global_step)
    
    def plot_losses(self):
        plt.figure(figsize=(10, 5))
        plt.plot(self.steps, self.losses, label='Training Loss')
        plt.xlabel('Steps')
        plt.ylabel('Loss')
        plt.title('Training Loss Over Time')
        plt.legend()
        plt.show()


args = TrainingArguments(
    output_dir="./chatbot_gml3",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    logging_steps=10,
    num_train_epochs=1,
    learning_rate=1e-4,
    remove_unused_columns=False,
    save_strategy="epoch"
)
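The batch settings above interact: gradients are accumulated over several small batches before each optimizer step. A small sketch of the arithmetic, assuming a single GPU (the function names are chosen for this example):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * num_devices


def samples_per_log(per_device: int, grad_accum: int, logging_steps: int,
                    num_devices: int = 1) -> int:
    """Number of samples seen between two consecutive log lines."""
    return effective_batch_size(per_device, grad_accum, num_devices) * logging_steps


if __name__ == "__main__":
    print(effective_batch_size(8, 16))   # samples per optimizer step
    print(samples_per_log(8, 16, 10))    # samples per printed loss value
```

So each loss value the callback prints summarizes more than a thousand training samples, which is why the curve has relatively few points for a single epoch.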

plot_losses_callback = PrintLossCallback()

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds,#.select(range(6000)),
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[plot_losses_callback]  # register the custom callback
)
if torch.cuda.is_available():
    trainer.model = trainer.model.to("cuda")
# Train the model
trainer.train()

You can see that the loss finally came down to 1.9.