Contents
- Deep Learning
- Feedforward Neural Network
- Neuron
- Output Layer
- Optimization
- Regularization
- Topic Classification
- Language Models as Classifiers
- Word Embeddings
- Training a Feed-Forward Neural Network Language Model
- Feed-Forward Neural Network for POS Tagging
- Convolutional Networks
- Convolutional Networks for NLP
 
 
Feedforward Neural Networks Basics
Deep Learning
-  A branch of machine learning
-  A re-branded name for neural networks
-  Deep: many layers are chained together in modern deep learning models
-  Neural networks: historically inspired by the way computation works in the brain
   - Consist of computation units called neurons
 
Feedforward Neural Network
-  Also called a multilayer perceptron
-  Example architecture: [diagram]
-  Each arrow carries a weight, reflecting its importance
-  Certain layers have non-linear activation functions
Neuron
-  Each neuron is a function: given input x, it computes a real value h:
   $h = \tanh\left(\sum_j w_j x_j + b\right)$
   - Scales the input (with weights, w) and adds an offset (bias, b)
   - Applies a non-linear function, such as the logistic sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU)
   - w and b are parameters of the model
-  A layer typically has several hidden units, each with its own weights $w_i$ and bias term $b_i$:
   $h_i = \tanh\left(\sum_j w_{ij} x_j + b_i\right)$
-  Can be expressed using matrix and vector operators:
   $\vec{h} = \tanh(W\vec{x} + \vec{b})$
   - where $W$ is a matrix comprising the weight vectors and $\vec{b}$ is a vector of all bias terms
   - the non-linear function is applied element-wise
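As a concrete illustration, here is a minimal NumPy sketch of the matrix form above; the sizes (3 inputs, 4 hidden units) and random initialisation are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 3 input features, 4 hidden units.
W = rng.normal(size=(4, 3))     # one weight vector per hidden unit
b = rng.normal(size=4)          # one bias term per hidden unit
x = np.array([1.0, 0.5, -2.0])  # input vector

# h = tanh(Wx + b); the non-linearity is applied element-wise.
h = np.tanh(W @ x + b)
print(h.shape)  # (4,)
```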
Output Layer
-  For binary classification problems: sigmoid activation function, $\sigma(v) = \frac{1}{1 + e^{-v}}$
-  For multi-class classification problems: softmax activation function, which ensures the outputs are valid probabilities (each greater than 0, and summing to 1): $\mathrm{softmax}(\vec{v})_i = \frac{e^{v_i}}{\sum_j e^{v_j}}$
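A small sketch of the softmax computation in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the notes mention.

```python
import numpy as np

def softmax(v):
    # Shift by the max for numerical stability, exponentiate,
    # then normalise so outputs are positive and sum to 1.
    e = np.exp(v - v.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # -> [0.659 0.242 0.099] 1.0
```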
Optimization
-  Consider how well the model fits the training data, in terms of the probability it assigns to the correct outputs
-  Want to maximize the total probability of the training data: $L = \prod_i P(y_i \mid x_i)$
-  Equivalently, minimize $-\log L$ with respect to the parameters
-  Trained using gradient descent
   - Tools like TensorFlow, PyTorch, and DyNet use autodiff to compute the gradients automatically
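A hedged PyTorch sketch of this training loop on toy data; the model shape, learning rate, and random data are placeholders. CrossEntropyLoss combines softmax with the negative log likelihood, i.e. $-\log L$ for a batch.

```python
import torch
import torch.nn as nn

# Toy placeholders: 5 input features, 3 classes, 8 random examples.
model = nn.Sequential(nn.Linear(5, 16), nn.Tanh(), nn.Linear(16, 3))
X = torch.randn(8, 5)
y = torch.randint(0, 3, (8,))

loss_fn = nn.CrossEntropyLoss()  # softmax + negative log likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)  # -log L over the batch
    loss.backward()              # autodiff computes all gradients
    opt.step()                   # gradient descent update
```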
 
Regularization
-  Neural networks have many parameters, so they overfit easily
-  Low bias, high variance
-  Regularization is very important in neural networks
-  L1-norm: sum of absolute values of all parameters
-  L2-norm: sum of squares of all parameters
-  Dropout: randomly zero out some neurons of a layer
   - If the dropout rate = 0.1, a random 10% of the neurons have their values set to 0
   - Can be applied to any layer, but in practice mostly to the hidden layers
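A brief PyTorch sketch of dropout on a hidden layer; the layer sizes are arbitrary. Note that train()/eval() mode controls whether dropout is active.

```python
import torch
import torch.nn as nn

# A hidden layer followed by dropout with rate 0.1.
layer = nn.Sequential(nn.Linear(10, 10), nn.Tanh(), nn.Dropout(p=0.1))
h = torch.randn(4, 10)

layer.train()        # training mode: a random 10% of units are zeroed
out_train = layer(h)
layer.eval()         # evaluation mode: dropout is disabled
out_eval = layer(h)
```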
 
Applications in NLP
Topic Classification
-  Given a document, classify it into a predefined set of topics, e.g. economy, politics, sports
-  Input: bag-of-words representation of the document (a vector of word counts over the vocabulary)
-  Training:
   - Architecture: hidden layers such as $h_1 = \tanh(W_1 x + b_1)$ and $h_2 = \tanh(W_2 h_1 + b_2)$, with output $y = \mathrm{softmax}(W_3 h_2)$
   - Randomly initialize the weights $W$ and biases $b$
   - Output: a probability distribution over the topic classes
   - Loss: $-\log y_c$ if the true label is class $c$
-  Prediction: feed the document's bag-of-words vector through the network; the predicted class is $\operatorname{argmax}_c y_c$ (see the sketch at the end of this section)
-  Potential improvements:
   - Use a bag of bigrams as input
   - Preprocess the text to lemmatize words and remove stopwords
   - Instead of raw counts, weight words using TF-IDF or binary indicators
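A minimal sketch of such a bag-of-words topic classifier in PyTorch; the vocabulary size, hidden width, number of topics, and the word id used below are all hypothetical.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, TOPICS = 1000, 64, 3  # assumed toy sizes

# Two tanh hidden layers, then a linear output layer;
# softmax is applied inside the loss during training.
model = nn.Sequential(
    nn.Linear(VOCAB, HIDDEN), nn.Tanh(),
    nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
    nn.Linear(HIDDEN, TOPICS),
)

x = torch.zeros(1, VOCAB)      # bag-of-words: raw count per vocabulary word
x[0, 42] = 3.0                 # hypothetical word id 42 occurs 3 times
pred = model(x).argmax(dim=1)  # predicted class = argmax over topics
```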
 
Language Models as Classifiers
-  Language models can be considered simple classifiers
   - E.g. a trigram model classifies the likely next word in a sequence, given the two preceding words "salt" and "and": $P(w_i \mid w_{i-2} = \textit{salt}, w_{i-1} = \textit{and})$
-  Feed-Forward Neural Network Language Model
   - Use a neural network as a classifier to model $P(w_i \mid w_{i-2}, w_{i-1})$
   - Input features: the previous words
   - Output classes: the next word
   - Use word embeddings to represent each word
     - E.g. of word embeddings: [table of example word vectors]
 
Word Embeddings
-  Map discrete word symbols to continuous vectors in a relatively low-dimensional space
-  Word embeddings allow the model to capture similarity between words
-  In the feed-forward neural network language model, the first layer is the sum of the input word embeddings
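A short sketch of an embedding lookup in PyTorch; the vocabulary size, dimensionality, and word ids are assumptions.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=50)  # |V|=1000, d=50

word_ids = torch.tensor([5, 17, 99])  # hypothetical ids of three input words
vectors = emb(word_ids)               # (3, 50): one vector per word
first_layer = vectors.sum(dim=0)      # sum of the input word embeddings
```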
Training a Feed-Forward Neural Network Language Model
-  E.g. predicting the word that follows the context "a cow eats":
   - Look up the word embeddings for *a*, *cow*, and *eats*
   - Concatenate them and feed the result to the network: $x = [v_a ; v_{cow} ; v_{eats}]$
   - $h = \tanh(W_2 x + b)$; then $y = \mathrm{softmax}(W_3 h)$ gives the probability distribution over all words in the vocabulary
-  Loss: $-\log y_w$ for the correct next word $w$
-  Most parameters are in the word embeddings $W_1$ (size $= d \times |V|$) and the output embeddings $W_3$ (size $= |V| \times d$)
-  Example architecture: [diagram]
 
-  Problems of count-based N-gram models:
   - Cheap to train, but suffer from sparsity and scale poorly to larger contexts
   - Don't adequately capture the properties of words
-  Advantages of the feed-forward neural network language model:
   - Automatically captures word properties, leading to more robust estimates (see the sketch below)
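A hedged PyTorch sketch of this trigram-style model: look up embeddings for the three context words, concatenate, apply one tanh hidden layer, and score every vocabulary word. All sizes and word ids are toy assumptions.

```python
import torch
import torch.nn as nn

V, D, H, CONTEXT = 1000, 50, 128, 3  # assumed: |V|, emb dim, hidden, context

class FFNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Embedding(V, D)         # word embeddings (W1 in the notes)
        self.W2 = nn.Linear(CONTEXT * D, H)  # hidden layer
        self.W3 = nn.Linear(H, V)            # output layer over the vocabulary (W3)

    def forward(self, context_ids):
        v = self.W1(context_ids)             # (batch, 3, D) embedding lookup
        x = v.flatten(start_dim=1)           # concatenate the three embeddings
        h = torch.tanh(self.W2(x))
        return self.W3(h)                    # logits; softmax applied in the loss

model = FFNNLanguageModel()
ctx = torch.tensor([[11, 22, 33]])  # hypothetical ids for "a cow eats"
target = torch.tensor([44])         # hypothetical id for the correct next word
loss = nn.CrossEntropyLoss()(model(ctx), target)  # -log P(next word | context)
```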
 
Feed-Forward Neural Network for POS Tagging
-  POS tagging can also be framed as classification
   - E.g. classify the likely POS tag for *eats*, given the surrounding words and the preceding tags
-  The feed-forward neural network language model architecture can be adapted to this task directly
-  Inputs:
   - recent words: $w_{i-2}, w_{i-1}, w_i$
   - recent tags: $t_{i-2}, t_{i-1}$
-  Outputs:
   - the current tag: $t_i$
-  Frame as a neural network with:
   - 5 inputs: 3 word embeddings and 2 tag embeddings
   - 1 output: a vector of size $|T|$ (the size of the POS tag set), using softmax
-  Train to minimize: $-\sum_i \log P(t_i \mid w_{i-2}, w_{i-1}, w_i, t_{i-2}, t_{i-1})$
-  Example architecture: [diagram] (see the sketch below)
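A sketch of such a tagger, assuming hypothetical sizes for the vocabulary, tag set, and embedding dimensions.

```python
import torch
import torch.nn as nn

V, T, DW, DT, H = 1000, 45, 50, 10, 128  # assumed: |V|, |T|, emb dims, hidden

class FFNNTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(V, DW)
        self.tag_emb = nn.Embedding(T, DT)
        self.hidden = nn.Linear(3 * DW + 2 * DT, H)  # 5 concatenated inputs
        self.out = nn.Linear(H, T)                   # vector of size |T|

    def forward(self, words, tags):
        x = torch.cat([self.word_emb(words).flatten(start_dim=1),
                       self.tag_emb(tags).flatten(start_dim=1)], dim=1)
        return self.out(torch.tanh(self.hidden(x)))  # softmax in the loss

tagger = FFNNTagger()
# Hypothetical ids: 3 recent words and 2 recent tags.
logits = tagger(torch.tensor([[1, 2, 3]]), torch.tensor([[4, 5]]))
```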
 
Convolutional Networks
Convolutional Networks
-  Commonly used in computer vision
-  Identify indicative local predictors
-  Combine them to produce a fixed-size representation
Convolutional Networks for NLP
-  Slide a window over the sequence
-  W = convolution filter (linear transformation + tanh)
-  Max-pooling to produce a fixed-size representation
-  Example architecture: [diagram] (see the sketch below)
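A minimal sketch of the sliding-window plus max-pooling pattern over a sequence of word embeddings; the dimensions, filter count, and window size are assumptions.

```python
import torch
import torch.nn as nn

D, FILTERS, WINDOW = 50, 100, 3  # assumed: emb dim, filter count, window size

emb_seq = torch.randn(1, D, 12)  # (batch, channels=D, sequence of 12 words)

conv = nn.Conv1d(D, FILTERS, kernel_size=WINDOW)  # the sliding-window filter W
features = torch.tanh(conv(emb_seq))              # (1, FILTERS, 10) window scores
fixed = features.max(dim=2).values                # max-pool over positions
print(fixed.shape)                                # (1, 100): fixed-size vector
```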
 