Python Deep Learning in Practice, Chapter 11: Natural Language Processing (NLP) Examples

Overview

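Chapter 11 takes an in-depth look at deep learning for natural language processing (NLP), covering techniques that range from text preprocessing to sequence-to-sequence learning. Using IMDB movie-review sentiment classification and English-to-Spanish translation as running examples, the chapter shows how to process text with recurrent neural networks (RNNs), convolutional neural networks (CNNs), and the Transformer architecture. Readers learn how to apply deep learning to text classification and sequence-to-sequence problems, and how the Transformer works.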

Main Topics

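Text preprocessing
- Text standardization: convert text to lowercase, strip punctuation, and so on.
- Tokenization: split text into words or phrases.
- Vocabulary indexing: map each token to an integer index.
- TextVectorization layer: use Keras's TextVectorization layer to run all of these steps efficiently.
Text representation
- Bag-of-words model: treat text as an unordered set of words, discarding word order.
- Sequence model: preserve word order; this is the input form used by RNNs, CNNs, and Transformers.
Word embeddings
- Learned word embeddings: train word vectors with an Embedding layer as part of the model.
- Pretrained word embeddings: load pretrained vectors such as GloVe.
Transformer architecture
- Self-attention: compute how relevant each word is to every other word and use those scores to build context-aware word representations.
- Multi-head attention: run several attention heads in parallel, each working in its own learned subspace.
- Transformer encoder (TransformerEncoder): combine self-attention with a feed-forward network.
- Positional encoding: inject word-order information into the model.
Sequence-to-sequence learning
- RNN sequence-to-sequence model: use a GRU or LSTM for sequence-to-sequence tasks.
- Transformer sequence-to-sequence model: pair a Transformer encoder with a Transformer decoder for machine translation.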
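
The self-attention item above can be illustrated with a minimal sketch in plain Python. It is a simplification rather than the Keras implementation: it uses the token vectors directly as queries, keys, and values, whereas MultiHeadAttention also applies learned projections and several heads. The function names and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (outputs, weights) for a list of equal-length token vectors."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Relevance of every token to the current one: scaled dot product.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        # Context-aware representation: weighted sum of all token vectors.
        context = [sum(w * vec[d] for w, vec in zip(row, vectors))
                   for d in range(dim)]
        weights.append(row)
        outputs.append(context)
    return outputs, weights

# Toy 2-dimensional "embeddings" for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the input vectors, with more weight on the tokens most similar to the query token.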

Key Code and Algorithms

1.1 Text standardization and tokenization

import re  # needed for re.escape in custom_standardization_fn
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

text_vectorization = tf.keras.layers.TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn
)

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]

text_vectorization.adapt(dataset)
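
To make the three steps concrete (standardize, split, index), here is a rough plain-Python sketch of the same pipeline. It only approximates what the layer does: the helper names are hypothetical, and the alphabetical tie-breaking among equally frequent tokens is an assumption of this sketch, not a documented Keras guarantee.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def tokenize(text):
    """Standardize, then split on whitespace."""
    return standardize(text).split()

def build_vocabulary(corpus):
    """Index 0 is reserved for padding and index 1 for unknown tokens."""
    counts = Counter(tok for line in corpus for tok in tokenize(line))
    # Most frequent first; ties broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(["", "[UNK]"] + ordered)}

def encode(text, vocabulary):
    """Map each token to its index, falling back to the [UNK] index."""
    return [vocabulary.get(tok, vocabulary["[UNK]"]) for tok in tokenize(text)]

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vocabulary = build_vocabulary(dataset)
encoded = encode("I write, rewrite, and still rewrite again", vocabulary)
```

The word "still" never appears in the corpus, so it is encoded as the unknown-token index 1.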

1.2 Bag-of-words model (binary encoding)

text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot"
)

# train_ds and val_ds are assumed to be the batched IMDB (text, label) datasets loaded earlier.
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
# The validation set needs the same vectorization as the training set.
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = tf.keras.Input(shape=(max_tokens,))
    x = tf.keras.layers.Dense(hidden_dim, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = get_model()
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10)
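
What the "multi_hot" output mode produces can be sketched in a few lines of plain Python. The function name and the token ids are hypothetical; the point is that word order and repetition are both discarded.

```python
def multi_hot(token_ids, vocabulary_size):
    """Return a 0/1 vector with a 1 at every index present in token_ids."""
    vector = [0] * vocabulary_size
    for index in token_ids:
        vector[index] = 1  # repeated tokens still yield 1, not a count
    return vector

# Two orderings of the same words (hypothetical token ids).
first = multi_hot([3, 1, 4, 1], vocabulary_size=6)
second = multi_hot([4, 1, 1, 3], vocabulary_size=6)
```

Both calls return the same vector, which is why a bag-of-words model cannot distinguish sentences that differ only in word order.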

1.3 Sequence model (embedding layer and bidirectional LSTM)

max_length = 600
max_tokens = 20000
text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length
)
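text_vectorization.adapt(text_only_train_ds)
# Vectorize the raw (text, label) datasets into integer sequences.
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))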

inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embedded)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10)
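
The model above learns its embeddings from scratch. For the pretrained alternative mentioned under word embeddings, the following plain-Python sketch shows one way to turn GloVe-style text lines (a word followed by its vector components) into an embedding matrix aligned with a vocabulary. The sample vectors are made up, and the function names are hypothetical.

```python
def parse_glove_lines(lines):
    """Map each word to its vector; each line is 'word v1 v2 ... vd'."""
    embeddings = {}
    for line in lines:
        word, *values = line.split()
        embeddings[word] = [float(v) for v in values]
    return embeddings

def build_embedding_matrix(vocabulary, embeddings, dim):
    """Row i holds the vector of the token with index i, or zeros if unknown."""
    matrix = [[0.0] * dim for _ in vocabulary]
    for index, token in enumerate(vocabulary):
        vector = embeddings.get(token)
        if vector is not None:
            matrix[index] = vector
    return matrix

# Made-up 3-dimensional vectors standing in for a real GloVe file.
sample_lines = [
    "movie 0.1 0.2 0.3",
    "great 0.4 0.5 0.6",
]
vocabulary = ["", "[UNK]", "great", "movie", "plot"]
matrix = build_embedding_matrix(vocabulary, parse_glove_lines(sample_lines), dim=3)
```

A matrix built this way could then be passed to an Embedding layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(matrix)` with `trainable=False`, so that the pretrained vectors are not overwritten during training.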

1.4 Transformer encoder

class TransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential([tf.keras.layers.Dense(dense_dim, activation="relu"), tf.keras.layers.Dense(embed_dim)])
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(int_train_ds, validation_data=int_val_ds, epochs=20)
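
The classifier above feeds plain token embeddings to the encoder, so it has no information about word order. The sketch below shows one possible positional embedding layer that adds a learned vector per position to each token vector. The class name and constructor arguments are chosen to match how PositionalEmbedding is called in section 1.5, but the body is an assumption, since the layer's definition is not included in these notes. It assumes TensorFlow 2.x and treats token id 0 as padding.

```python
import tensorflow as tf

class PositionalEmbedding(tf.keras.layers.Layer):
    """Token embedding plus a learned embedding of each position index."""

    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = tf.keras.layers.Embedding(input_dim, output_dim)
        self.position_embeddings = tf.keras.layers.Embedding(sequence_length, output_dim)

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        # Broadcasting adds the same position vectors to every sample in the batch.
        return self.token_embeddings(inputs) + self.position_embeddings(positions)

    def compute_mask(self, inputs, mask=None):
        # Treat token id 0 as padding so downstream layers can ignore it.
        return tf.math.not_equal(inputs, 0)

layer = PositionalEmbedding(sequence_length=10, input_dim=100, output_dim=8)
embedded = layer(tf.constant([[5, 7, 0]], dtype="int64"))
```

One limitation of this design is that it cannot handle sequences longer than `sequence_length`, because positions beyond that have no embedding row.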

1.5 Transformer sequence-to-sequence model

class TransformerDecoder(tf.keras.layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential([tf.keras.layers.Dense(dense_dim, activation="relu"), tf.keras.layers.Dense(embed_dim)])
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()
        self.supports_masking = True

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat([tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No padding mask was propagated, so leave cross-attention unmasked.
            padding_mask = None
        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(query=attention_output_1, value=encoder_outputs, key=encoder_outputs, attention_mask=padding_mask)
        attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)
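
The mask built by get_causal_attention_mask is easier to see on a small example. This plain-Python sketch reproduces the same `i >= j` rule for a single sequence, without the batch dimension; the function name is made up for illustration.

```python
def causal_mask(sequence_length):
    """mask[i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

The result is a lower-triangular matrix: the first position sees only itself, and the last position sees the whole sequence, so the decoder can never look at tokens it is supposed to predict.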

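# Assumes PositionalEmbedding, sequence_length, and vocab_size are defined earlier,
# and that train_ds / val_ds here are the English-Spanish translation datasets.
embed_dim = 256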
dense_dim = 2048
num_heads = 8
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = tf.keras.layers.Dropout(0.5)(x)
decoder_outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)
transformer = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
transformer.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
transformer.fit(train_ds, epochs=30, validation_data=val_ds)
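
The model takes the Spanish sentence as an input and also predicts it, which only makes sense if the two copies are offset by one token. The sketch below shows that offset on a single tokenized sentence. It is an assumption about how the translation train_ds is prepared, since the dataset code is not included in these notes; the "[start]" and "[end]" markers and the function name are hypothetical.

```python
def make_decoder_pair(target_tokens):
    """Split one target sentence into (decoder input, expected output)."""
    decoder_input = target_tokens[:-1]   # everything except the last token
    expected_output = target_tokens[1:]  # the same sequence shifted by one
    return decoder_input, expected_output

# "[start]" and "[end]" are marker tokens assumed by this sketch.
spanish = ["[start]", "hola", "mundo", "[end]"]
decoder_input, expected_output = make_decoder_pair(spanish)
```

At each position the decoder sees the tokens up to and including that position in `decoder_input` and is trained to produce the token at the same position in `expected_output`, which is the next word of the sentence.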

Notable Quotes

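Summary: The "rules" of natural language were formalized only after the language was already in use, which sets it apart from machine languages.
Original text: Natural language was shaped by an evolution process, much like biological organisms—that’s what makes it “natural.” Its “rules,” like the grammar of English, were formalized after the fact and are often ignored or broken by its users.
Commentary: The quote highlights that natural language keeps evolving, and that this is what distinguishes it from machine languages.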
Summary: The goal of machine learning is to have a model find useful rules in data rather than having engineers design them by hand.
Original text: When you find yourself building systems that are big piles of ad hoc rules, as a clever engineer, you’re likely to start asking: “Could I use a corpus of data to automate the process of finding these rules? Could I search for the rules within some kind of rule space, instead of having to come up with them myself?”
Commentary: The quote explains why machine learning matters for natural language processing.
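Summary: The Transformer architecture, built on attention and made up of an encoder and a decoder, achieves excellent results on sequence-to-sequence tasks.
Original text: The Transformer architecture, which consists of a TransformerEncoder and a TransformerDecoder, yields excellent results on sequence-to-sequence tasks.
Commentary: The quote sums up the core strength of the Transformer architecture.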
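Summary: Word embeddings model semantic relationships between words as distance relationships in a vector space.
Original text: Word embeddings are vector spaces where semantic relationships between words are modeled as distance relationships between vectors that represent those words.
Commentary: The quote introduces the basic idea behind word embeddings.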
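Summary: Sequence-to-sequence learning is a powerful framework that applies to many NLP tasks.
Original text: Sequence-to-sequence learning is a generic, powerful learning framework that can be applied to solve many NLP problems, including machine translation.
Commentary: The quote stresses how broadly sequence-to-sequence learning applies.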