Problem: understanding the OpenAI Five model (1024-unit LSTM, reinforcement learning)
Background:
I recently came across OpenAI Five and was curious to see how its model is built. Wikipedia says that it "contains a single layer with a 1024-unit LSTM". I then found this PDF containing a diagram of the architecture.
My Questions
From all this, there are a few things I don't understand:
-  What does it mean to have a 1024-unit LSTM layer? Does it mean we have 1024 time steps with a single LSTM cell, or does it mean we have 1024 cells? Could you show me some kind of diagram visualizing this? I'm having an especially hard time picturing 1024 cells in one layer. (I looked at several SO questions and at the OpenAI Five blog, but they didn't help much.)
-  How can you do reinforcement learning on such a model? I'm used to RL being done with Q-tables that are updated during training. Does this simply mean that their loss function is the reward?
-  How come such a large model doesn't suffer from vanishing gradients or similar problems? I haven't seen any kind of normalization in the PDF.
-  In the PDF you can see a blue rectangle; it seems to be a unit, and there are N of those. What does this mean? And please correct me if I'm mistaken: are the pink boxes used to select the best move/item(?)
In general, all of this can be summarized as "how does the OpenAI Five model work?"
Answer:
-  It means that the size of the hidden state is 1024 units, which essentially means that your LSTM has 1024 cells at each timestep. The number 1024 says nothing about the number of timesteps, which we do not know in advance.
-  The state of the LSTM (the hidden state) represents the current state as observed by the agent. It gets updated at every timestep using the input received. This hidden state can be used to predict the Q-function (as in Deep Q-learning). You don't have an explicit table of (state, action) -> q_value; instead you have a vector of size 1024 which represents the state and feeds into another dense layer, which outputs the q_values for all possible actions. So the loss is not the reward itself: it is computed from the network's predictions and the rewards received. (OpenAI Five itself was trained with Proximal Policy Optimization, a policy-gradient method, so its output heads produce action probabilities and a value estimate rather than Q-values, but the idea of replacing the table with a network is the same.)
-  LSTMs are themselves a mechanism that helps mitigate vanishing gradients: the cell state that carries long-range memory forward also gives the gradients an easier path to flow backward through time.
-  If you are referring to the big blue and pink boxes, then the pink ones seem to be input values that are put through a network and pooled over each pickup or modifier. The blue region seems to do the same thing over each unit. The terms pickup, modifier, unit, etc. should be meaningful in the context of the game being played.
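The last point, running each unit (or pickup, or modifier) through the same network and then pooling, can be sketched as follows. This is only an illustration under assumptions: the feature size, the embedding size, the single ReLU layer, and the choice of max-pooling are hypothetical and are not taken from the PDF.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 16    # hypothetical number of features describing one unit
embed_dim = 32   # hypothetical size of the per-unit embedding

# One set of weights, shared by every unit.
W = rng.normal(scale=0.1, size=(embed_dim, feat_dim))
b = np.zeros(embed_dim)

def embed_and_pool(units):
    """units has shape (N, feat_dim); N may differ from call to call."""
    embedded = np.maximum(units @ W.T + b, 0.0)  # shared dense layer + ReLU, shape (N, embed_dim)
    return embedded.max(axis=0)                  # max over the N units, shape (embed_dim,)

few = embed_and_pool(rng.normal(size=(3, feat_dim)))
many = embed_and_pool(rng.normal(size=(20, feat_dim)))
print(few.shape, many.shape)  # both (32,)
```

Pooling turns a variable number of units into a fixed-size vector, and the result does not depend on the order in which the units are listed, which is presumably why the diagram repeats one box N times instead of drawing a separate network per unit.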
Here is an image of the LSTM; the yellow nodes at each step are the n:

The vector h is the hidden state of the LSTM; it is both passed on to the next timestep and used as the output of the current timestep.
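To make the "1024 units, any number of timesteps" point concrete, here is a minimal NumPy sketch of a standard LSTM step. The input size of 64, the random weights, and the helper names `lstm_step` and `run_episode` are assumptions made for illustration; this is not OpenAI's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (4n, d), U: (4n, n), b: (4n,)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # pre-activations for all four gates
    i = sigmoid(z[0 * n:1 * n])     # input gate
    f = sigmoid(z[1 * n:2 * n])     # forget gate
    o = sigmoid(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:4 * n])     # candidate cell values
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state, also this step's output
    return h, c

n_units = 1024   # size of the hidden state, the "1024 units"
input_dim = 64   # hypothetical observation size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4 * n_units, input_dim))
U = rng.normal(scale=0.01, size=(4 * n_units, n_units))
b = np.zeros(4 * n_units)

def run_episode(num_steps):
    """Feed num_steps random observations through the same cell."""
    h = np.zeros(n_units)
    c = np.zeros(n_units)
    for _ in range(num_steps):
        x = rng.normal(size=input_dim)  # stand-in for one observation
        h, c = lstm_step(x, h, c, W, U, b)
    return h

h_short = run_episode(3)
h_long = run_episode(10)
print(h_short.shape, h_long.shape)  # (1024,) (1024,)
```

The same weights are reused at every step, and h has 1024 entries whether the episode lasts 3 steps or 10; the number of timesteps only controls how many times the loop runs.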

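The second point, a dense layer on top of the hidden state in place of a Q-table, might look like the sketch below. It illustrates the Deep Q-learning analogy from the answer, not OpenAI Five's actual training procedure; the number of actions, the random weights, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 1024   # size of the LSTM hidden state
n_actions = 8    # hypothetical number of discrete actions

# Dense layer mapping the hidden state to one Q-value per action.
W_q = rng.normal(scale=0.01, size=(n_actions, n_units))
b_q = np.zeros(n_actions)

def q_values(h):
    return W_q @ h + b_q

def greedy_action(h):
    return int(np.argmax(q_values(h)))

def td_target(reward, h_next, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward
    return reward + gamma * float(np.max(q_values(h_next)))

# Stand-ins for the LSTM hidden state at two consecutive timesteps.
h = np.tanh(rng.normal(size=n_units))
h_next = np.tanh(rng.normal(size=n_units))

q = q_values(h)
a = greedy_action(h)
target = td_target(reward=1.0, h_next=h_next)
loss = (q[a] - target) ** 2   # squared TD error for the chosen action
print(q.shape, a, loss)
```

In this setup the reward enters through the target; the loss is the squared gap between the predicted Q-value and that target, and its gradient would be backpropagated through the dense layer and the LSTM.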

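The third point, about gradients, can be made a little more precise with the standard LSTM equations (the notation here is an assumption, since the answer gives none: f_t, i_t and g_t are the forget gate, input gate and candidate values, and ⊙ is the elementwise product). The cell state is updated additively, so along the direct path from c_{t-1} to c_t the Jacobian is just the forget gate:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot g_t,
\qquad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\quad \text{(ignoring the gates' own dependence on } h_{t-1}\text{)}.

% Compare with a plain RNN, h_t = \tanh(W x_t + U h_{t-1} + b):
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^{2}\right) U .
```

When the forget gate stays close to 1, the product of these factors over many timesteps shrinks slowly, whereas in the plain RNN every step multiplies by U and by a tanh derivative that is at most 1. This mitigates vanishing gradients through time; it does not eliminate them.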