Introduction to Large Language Models

1 Steps in building a large language model
- 1.1 Pre-training
  - 1.1.1 Crawling data from the web
  - 1.1.2 Tokenization
    - 1.1.2.1 Tokenization using byte pair encoding
- 1.3 Pre-training
  - 1.3.1 Context
  - 1.3.2 Training
  - 1.3.3 Output
- 1.2 Post-training 1: SFT (supervised fine-tuning)
  - 1.2.1 Token
- 1.3 Reinforcement learning
  - 1.3.1 Trial and error guided by a reward function
  - 1.3.2 Reinforcement learning from human feedback
    - 1.3.2.3 Why models trained with RLHF perform better
    - 1.3.2.4 Drawbacks of RLHF
  - PPO
  - GRPO
4 Can large language models be trusted?
- 4.1 AI hallucinations
  - 4.1.1 How to detect hallucinations
  - 4.1.2 How to reduce hallucinations
- 4.2 LLM psychology
- 4.3 The model's sense of self
- 4.4 The model's math ability
  - 4.4.1 Let the model think
  - 4.4.2 Forcing the model to answer directly can cause errors
  - 4.4.3 Use tools
    - 4.4.3.1 Use tools to count
    - 4.4.3.2 Use tools to spell
  - 4.4.4 Why the model thinks 4.11 > 4.9
All in all
DeepSeek
- Reasoning-oriented RL using GRPO

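1 Steps in building a large language model

1.1 Pre-training

1.1.1 Crawling data from the web

Download and preprocess the internet. Preprocessing includes URL filtering (for example, removing racist or adult websites) and PII removal (personally identifiable information should be stripped out; Doubao was reported to have output personal information, which suggests this step was not done well).
The resulting dataset is roughly 44 TB. That is not large: it would fit on a single large portable hard drive.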
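
A minimal sketch of the two preprocessing steps named above, URL filtering and PII removal. The blocklist domains, the `<EMAIL>` placeholder, and treating e-mail addresses as the only kind of PII are assumptions made for illustration; a real pipeline would use much larger lists and more detectors.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines rely on large curated lists.
BLOCKED_DOMAINS = {"adult.example", "hate.example"}

# Simple e-mail pattern, used here as the only example of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def url_allowed(url: str) -> bool:
    """Return False when the URL's host is on the blocklist."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def redact_pii(text: str) -> str:
    """Replace e-mail addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def preprocess(pages):
    """Keep pages from allowed URLs and redact PII in their text."""
    return [(url, redact_pii(text)) for url, text in pages if url_allowed(url)]


pages = [
    ("https://blog.example/post", "Contact me at alice@example.com for details."),
    ("https://adult.example/page", "filtered out"),
]
cleaned = preprocess(pages)
```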

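1.1.2 Tokenization

Tokenization converts text into a sequence of symbols (tokens).

At its core, text encoding maps characters or letters onto a finite set of symbols; for example, a-z can be represented as 1-26.
The resulting sequence should not be too long, because a longer sequence uses up more of the model's limited input length.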

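1.1.2.1 Tokenization using byte pair encoding

Find the pair of adjacent symbols that occurs most often, for example (125 67), and merge it into a new symbol (for example 301). Then look for frequent pairs again, which can now include the new symbol, for example (301 786), and merge that too. Each merge shortens the sequence and enlarges the vocabulary. In practice the vocabulary ends up at around 100k symbols.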
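
The merge loop described above can be sketched as follows. Starting from raw UTF-8 bytes (ids 0..255) and numbering new symbols from 256 are assumptions of this sketch rather than details given in the notes.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the adjacent pair that occurs most often, or None."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(ids, num_merges, first_new_id=256):
    """Run up to `num_merges` rounds of merging; return new ids and the merge table."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = first_new_id + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # 11 byte-level symbols
compressed, merges = train_bpe(ids, num_merges=3)
```

After three merges the sequence is shorter and the vocabulary has three extra symbols, which is the trade-off the paragraph describes.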

44 TB of text -> about 15 trillion tokens

1.3 Pre-training

1.3.1 Context

Take windows of tokens of arbitrary length from the data. In practice the maximum window length is often set to 4k, 8k, or 16k tokens.
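
A small sketch of how a training window might be cut from the token stream. The function name, the toy token ids, and the minimum window length of 2 are assumptions for illustration; the targets are simply the inputs shifted by one position.

```python
import random


def sample_window(tokens, max_len, rng):
    """Pick a random window and split it into inputs and next-token targets."""
    length = rng.randint(2, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    window = tokens[start:start + length]
    inputs, targets = window[:-1], window[1:]
    return inputs, targets


rng = random.Random(0)
tokens = list(range(100, 120))  # stand-in for a long stream of token ids
inputs, targets = sample_window(tokens, max_len=8, rng=rng)
```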

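1.3.2 Training

The model outputs a probability for every possible next token, and the loss is computed against the token that actually comes next in the data.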
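
To make the loss concrete, here is a sketch that assumes the usual cross-entropy loss (the notes do not name the loss function) and a toy vocabulary of four tokens. The loss is the negative log-probability the model assigns to the true next token.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_id):
    """Cross-entropy: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_id])


logits = [2.0, 0.5, -1.0, 0.0]           # toy scores over 4 tokens
loss_right = next_token_loss(logits, 0)  # truth is the model's top choice
loss_wrong = next_token_loss(logits, 2)  # truth is the model's least likely token
```

The loss is low when the model already favours the correct token and high when it does not, which is the signal training uses to adjust the weights.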

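1.3.3 Output

The model that comes out of pre-training is usually called the base model. Base models are generally not released.
If you give a base model a sentence, it does not answer it; it only continues the text.
To get useful answers out of a base model, write a few-shot prompt: a few worked examples followed by an unfinished one. This works because the model has in-context learning ability.

The base model behaves more like an internet document simulator.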
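
A sketch of a few-shot prompt for a base model. The English-to-French word pairs and the "source: target" layout are assumptions chosen for illustration; the point is that the last line is left open so that continuing the document amounts to answering.

```python
def few_shot_prompt(examples, query):
    """Format solved examples as a document the base model can continue."""
    lines = [f"{src}: {dst}" for src, dst in examples]
    lines.append(f"{query}:")  # left open for the model to complete
    return "\n".join(lines)


examples = [("cat", "chat"), ("dog", "chien"), ("house", "maison")]
prompt = few_shot_prompt(examples, "book")
```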

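1.2 Post-training 1: SFT (supervised fine-tuning)

The pre-training stage teaches the model to sample documents; the post-training stage teaches it to answer questions.
Pre-training relies on online documents, but post-training throws them out entirely and uses datasets of human-labeled conversations.
Knowledge in the parameters is a vague recollection (like something a person read a month ago), while knowledge in the context is working memory (like something a person read just now). So when writing a prompt, give the model as much of the relevant information as possible: working memory can be accessed directly and is more accurate.
Pre-training takes roughly 3 months while post-training takes roughly 3 hours, because the conversation datasets are much smaller.
This stage is therefore much cheaper computationally.
In this stage the model learns to interact with people and to refuse unreasonable requests (for example, how to hack into someone else's computer).
This stage does not and cannot cover every possible question. From the examples, though, the model picks up the statistical pattern, so at inference time it can also answer questions it was never trained on.
An answer from an LLM is really an imitation of a human labeler's answer, or more precisely of an answer written to the LLM company's labeling guidelines. You are not talking to a magical AI but to an average labeler.
To reduce hallucinations, the model can be given web search and asked to compose its answer from the retrieved information. Examples of this behaviour also have to be added to the training set.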

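1.2.1 Token

Because the input is a conversation, it has to be processed first: special tokens are added at the start and end of each turn (the "im" in <|im_start|> stands for "imaginary monologue"), and each turn is marked as coming from the user or the assistant.

At inference time, the input runs up to <|im_start|>assistant<|im_sep|>, and whatever the model generates after that is the answer.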
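
A sketch of how a conversation could be flattened into this format. `<|im_start|>` and `<|im_sep|>` come from the text above; the closing `<|im_end|>` token and the `render_chat` helper are assumptions, and real models each define their own template.

```python
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"  # IM_END is assumed


def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {role, content} turns into one string."""
    parts = [
        f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn and leave it for the model to complete.
        parts.append(f"{IM_START}assistant{IM_SEP}")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "What is 2+2?"}])
```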

1.3 Reinforcement learning

GPT-4o in ChatGPT is mostly an SFT model, whereas DeepSeek's model is an RL model, which is why DeepSeek can show its thinking process.
RL is a powerful way to learn. In AlphaGo's training, the version trained with reinforcement learning reached stronger performance. A model cannot fundamentally go beyond human players if it only imitates them; RL is not bounded by human performance.

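1.3.1 Trial and error guided by a reward function

Give the model a question, let it generate a large number of answers, pick the best ones (correct and most concise), and train on them.
SFT is more like an initialization for RL: it teaches the model what an answer looks like, but learning to produce good answers depends on reinforcement learning.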
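
The sample-then-select step can be sketched like this. The canned attempts stand in for sampling from a model, and "best" is taken to mean correct and shortest, following the description above; the helper names are hypothetical.

```python
from itertools import cycle


def pick_best(candidates, correct_answer):
    """Keep correct candidates and return the shortest, or None if all fail."""
    correct = [c for c in candidates if c["answer"] == correct_answer]
    return min(correct, key=lambda c: len(c["text"])) if correct else None


def collect_training_example(sample_fn, question, correct_answer, n=8):
    """Sample n attempts for one question and keep the best one for training."""
    candidates = [sample_fn(question) for _ in range(n)]
    return pick_best(candidates, correct_answer)


# Hypothetical stand-in for a model: cycles through canned attempts.
ATTEMPTS = cycle([
    {"text": "First 3 * 4 = 12. Then adding 1 gives 12 + 1 = 13. So 13.", "answer": 13},
    {"text": "3 * 4 = 12, plus 1 is 13.", "answer": 13},
    {"text": "3 * 4 = 11, plus 1 is 12.", "answer": 12},
])
best = collect_training_example(lambda q: next(ATTEMPTS), "3 * 4 + 1 = ?", 13, n=6)
```

Only the selected attempt would be added to the training data, so answer patterns that lead to correct, concise results become more likely over time.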

In LLMs, pre-training and SFT are already standardized, but RL is still early and nascent, so many companies do not discuss its details publicly.

That makes DeepSeek's decision to publish its RL method an important contribution to the field. The paper reinvigorated interest in RL for LLMs and gave the details needed to reproduce the results.

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Link: https://arxiv.org/abs/2501.12948

As RL training progresses, the model uses more tokens and reaches higher accuracy.

This is because during training the model keeps trying out different ideas and can even have an "aha moment". These thinking steps are what make the answers longer, and they are what DeepSeek shows as its "think" process.

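1.3.2 Reinforcement learning from human feedback

Everything above concerns tasks that are easy to verify: an automated check can score the answer and steer the model. For unverifiable tasks, such as creative writing, automated scoring is not available, and humans have to evaluate and give feedback.

The human feedback is a ranking, not a score. For some tasks (telling a joke, for example) it is hard to assign a number, so the model generates several answers to the same question and a human ranks them.
Calling in a panel of reviewers for every answer the model generates would be naive and costly. Instead, the reviewers' rankings are used to train a model that learns to assign scores, and those scores are positively correlated with the human rankings. The reward model is a completely separate neural net that acts as a simulator of human preferences.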
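
One common way to turn rankings into a training signal is a pairwise logistic loss over every (preferred, rejected) pair. The notes do not say which loss is used, so treat this as an assumed formulation; the joke names and scores are made up for illustration.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))


def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred answer scores higher."""
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))


ranked = ["joke_b", "joke_a", "joke_c"]  # a human's ranking, best first
scores = {"joke_a": 0.1, "joke_b": 1.5, "joke_c": -0.7}  # hypothetical reward-model outputs
pairs = ranking_to_pairs(ranked)
total_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs)
```

Minimizing this loss pushes the reward model's scores into the same order as the human ranking, without ever asking a human for a numeric score.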

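1.3.2.3 Why models trained with RLHF perform better

The best guess, not yet proven, is the gap between discriminating and generating.

In general, judging is much easier than generating (compare deciding which of several jokes is funniest with writing a joke yourself).
RLHF spares humans from generating (writing jokes) and only asks them to evaluate (which joke is funnier); the generation task is left for the model to learn.

That step of indirection allows the models to become even better.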

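1.3.2.4 Drawbacks of RLHF

The reward function can be misleading. The trained reward model is a lossy simulation of human judgment, so it may not reflect what humans would actually say.
The reward function is gameable, and the model becomes devious. RL is extremely good at discovering ways to game the reward model: it finds outputs that are completely nonsensical yet receive very high scores.

Example: teaching a model to tell jokes

- Over the first few hundred steps, the model's jokes improve.
- After that, quality drops sharply. By around 1k steps the model outputs "the the the the", and the reward model gives this "top joke" a high score.

This is hard to fix, because there are many such nonsensical adversarial examples and they cannot all be enumerated. RL can always find a way to game the reward model.

Conclusion: stop after a few hundred steps. You can run RL indefinitely when the reward is verifiable (as with AlphaGo, which went on to beat Lee Sedol), but you cannot run RLHF indefinitely.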

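PPO

Proximal Policy Optimization.
It maximizes long-term reward, while clipping each update so that the new policy does not move too far from the current one.

Example: a smart home system
A smart home has a thermostat that adjusts its setting according to indoor and outdoor temperature, energy prices, and similar factors, to balance comfort against energy savings. The thermostat is the agent, and its goal is to maximize long-term reward (for example, lower energy bills while staying comfortable) by adjusting the temperature.
Setup:
- State: current indoor temperature, outdoor temperature, time of day, and so on.
- Action: set the thermostat to 20°C, 21°C, or 22°C.
- Reward: a score based on energy use and comfort. If the temperature is too low or too high, the reward is low (uncomfortable). If the temperature is moderate and energy use is low, the reward is high.
The thermostat has to try different settings repeatedly to find the best policy. PPO helps it improve the policy step by step during learning, without making mistakes by "going too far" in one update.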
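
The clipping idea can be written down for a single sample using the standard PPO clipped surrogate. The value `eps=0.2` and the example numbers are illustrative choices, not values from the notes.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample.

    ratio     = new_policy_prob / old_policy_prob for the action taken
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): the gain stops growing once ratio exceeds 1 + eps.
capped = ppo_clip_objective(ratio=1.5, advantage=2.0)
# Bad action (advantage < 0): once ratio falls below 1 - eps the objective stays
# at the clipped value, so there is no incentive to push the policy further.
floored = ppo_clip_objective(ratio=0.5, advantage=-1.0)
```

In the thermostat picture, the ratio measures how much more (or less) often the new policy picks a setting than the old one did, and the clip keeps any single update from changing that too drastically.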

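GRPO

Group Relative Policy Optimization.
A policy optimization algorithm used for large models. For each prompt it samples a group of answers and scores each one relative to the group's average, which makes training more efficient and stable. PPO, which was used before, needs a separate value model to estimate that baseline; GRPO gets the baseline from the group itself.

Example: study groups in a class
Suppose a class wants to raise its math scores and the teacher designs the study plan along GRPO lines:
Grouping: split the students into groups A, B, and C, each using a different study method:
- Group A: drilling large numbers of problems
- Group B: analyzing mistakes
- Group C: deriving concepts
Relative comparison:
- After each weekly test, compare the groups' average scores. If group B scores highest, "analyzing mistakes" is the more effective method.
Policy optimization:
- Group B shares its experience, but the other groups are not forced to copy it wholesale (changing too fast causes confusion).
- Groups A and C borrow from group B and adjust their own approach moderately (for example, adding mistake analysis to problem drilling).
Result:
- Scores rise across the class, and each group keeps its own character (stability).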
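
The "relative comparison" step corresponds to computing each answer's advantage against its own group. The sketch below normalizes by the group mean and standard deviation; using the population standard deviation and the small `eps` term are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four answers sampled for the same prompt (1 = correct, 0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average get a positive advantage and are reinforced; answers below it get a negative one, so no separate value model is needed to decide what counts as "better than expected".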

4 Can large language models be trusted?

Things that occur very frequently on the internet are more likely to be remembered correctly. The output of an LLM is just a vague recollection of internet documents.

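4.1 AI hallucinations

Given input it has never seen, the model simply keeps predicting; in essence it makes its best guess under a probability distribution.
The training distribution does not contain "I don't know", only confident answers, so even when the model does not know something it produces a confident fabrication. Even if the model internally knows that it doesn't know, it will not surface that.

The model is not looking the information up; it is just imitating what an answer looks like.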

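4.1.1 How to detect hallucinations

Testing whether a model hallucinates:
a. Use model A to generate a set of question-answer pairs.
b. Put those questions to the model under test.
c. If it answers incorrectly, the model is hallucinating on that question.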
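
Steps a to c can be sketched as a small probing loop. `ask_model` is a hypothetical callable standing in for the model under test, asking each question several times and using a substring match to grade the answer are simplifying assumptions.

```python
def find_unknown_questions(ask_model, qa_pairs, tries=3):
    """Return the questions the model never answers correctly."""
    unknown = []
    for question, reference in qa_pairs:
        answers = [ask_model(question) for _ in range(tries)]
        if not any(reference.lower() in a.lower() for a in answers):
            unknown.append(question)
    return unknown


# Hypothetical stand-in for the model under test.
def toy_model(question):
    known = {"Capital of France?": "The capital is Paris."}
    return known.get(question, "It is definitely Springfield.")


qa_pairs = [("Capital of France?", "Paris"), ("Capital of Australia?", "Canberra")]
unknown = find_unknown_questions(toy_model, qa_pairs)
```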

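4.1.2 How to reduce hallucinations

Train the model to say "I don't know":
a. For the questions it got wrong, train it to answer "I don't know".
In this way the model learns to associate a refusal with its internal signal of not knowing.
Web search: pass the retrieved information to the model as part of its input.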
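
Both mitigations can be sketched as data preparation. The refusal wording, the conversation layout, and the "Search results" prompt format are assumptions for illustration; no real search API is called here.

```python
def refusal_examples(unknown_questions, refusal="I'm sorry, I don't know."):
    """Build one user/assistant training conversation per unknown question."""
    return [
        [
            {"role": "user", "content": q},
            {"role": "assistant", "content": refusal},
        ]
        for q in unknown_questions
    ]


def prompt_with_search(question, snippets):
    """Place retrieved snippets in the context ahead of the question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Search results:\n{context}\n\nQuestion: {question}"


train = refusal_examples(["Capital of Australia?"])
prompt = prompt_with_search(
    "Capital of Australia?", ["Canberra is the capital of Australia."]
)
```

The first function yields data for fine-tuning; the second puts facts into the context window, where the model can read them directly instead of relying on vague recollection.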

4.2 LLM psychology

Emergent cognitive effects.

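4.3 The model's sense of self

A large model has no real sense of self; it just makes a best guess based on its training set. If it was never trained on this question specifically, it may say it is GPT (even though it is not), because a lot of data on the web was generated by GPT.
One fix is to train it on this question specifically, or to hardcode the answer.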

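4.4 The model's math ability

For a model to become good at math, the examples it is trained on matter a great deal. Of two example answers to the same problem, one that states the result immediately is the worse one, and the model may fail to learn from it.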

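4.4.1 Let the model think

The worse answer gives the result right at the start, with no reasoning before it. The model produces its answer autoregressively, so it would have to finish all the reasoning and computation by the time it has written "The answer is", and then emit the result. An answer that works step by step instead lets the model compute intermediate results as it outputs tokens, keep them in working memory, and complete the reasoning gradually, which improves its math ability.

We should teach the model to spread its reasoning and computation out over the tokens. In other words, the model needs tokens to think.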

4.4.2 Forcing the model to answer directly can cause errors


4.4.3 Use tools

Use tools instead of letting the model do all the calculations in its head.
Computing in memory is like mental arithmetic for a person: not always reliable.
Adding "Use code" to the prompt is enough.

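4.4.3.1 Use tools to count

For example, models are poor at counting, because the items to be counted are split across several discrete tokens.

With a tool, the counting is not done by the model's mental arithmetic. The model only copies and pastes (it copies the dots into Python) and writes the Python code; Python produces the final result. This has two advantages:

- The code makes the approach easy to check.
- The model avoids mental arithmetic and only has to lay out the solution steps.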
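
The kind of code the model would write for this task is very short. The string of dots below is a made-up stand-in for whatever was pasted from the prompt.

```python
# Text copied verbatim from the prompt (illustrative stand-in).
pasted = "................................................"

# Python does the counting, not the model.
count = pasted.count(".")
print(count)
```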

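4.4.3.2 Use tools to spell

Models are weak at spelling because they only see tokens, and one token covers several characters or letters, whereas the human eye can access every character.
The model works at the token level, not the byte or character level.

With a tool, the model supplies the approach and copies the tokens across; Python produces the answer.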
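
Two character-level tasks of this kind, written as the code a model could hand off to Python. The example words are chosen for illustration and are not taken from the notes.

```python
word = "ubiquitous"
# Every third character, starting from the first (positions 0, 3, 6, 9).
every_third = word[::3]

# Count how many times a letter occurs in a word.
r_count = "strawberry".count("r")
```

Python indexes individual characters, so the token boundaries that trip the model up never come into play.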

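4.4.4 Why the model thinks 4.11 > 4.9

One explanation is that the model is recalling Bible verses: there, verse 4.9 comes first and 4.11 comes after it, so the model concludes that 4.11 > 4.9, even though 4.9 is the larger decimal number.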
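
The two readings give opposite answers, which a few lines of code make explicit. The `as_section` helper is a hypothetical name for the chapter-and-verse interpretation.

```python
a, b = "4.11", "4.9"

# Decimal reading: 4.11 versus 4.90.
as_numbers = float(a) > float(b)


def as_section(s):
    """Read '4.11' as chapter 4, verse 11."""
    return tuple(int(part) for part in s.split("."))


# Chapter-and-verse reading: (4, 11) versus (4, 9).
as_sections = as_section(a) > as_section(b)
```

`as_numbers` is False while `as_sections` is True: the string "4.11" really is "bigger" under verse or version-number ordering, which is the pattern the model appears to be following.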

All in all

Don't fully trust LLMs. They are not infallible. An LLM is like Swiss cheese: a lot of it works (tastes) very well, but it still has holes (drawbacks).