GPT text generation from scratch with KerasHub

Author: Jesse Chan
Date created: 2022/07/25
Last modified: 2022/07/25
Description: Using KerasHub to train a mini-GPT model for text generation.

ⓘ This example uses Keras 3

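Introduction

In this example, we will use KerasHub to build a scaled-down Generative Pre-Trained (GPT) model. GPT is a Transformer-based model that can generate sophisticated text from a prompt.

We will train the model on the simplebooks-92 corpus, a dataset made from several novels. It suits this example well because it has a small vocabulary and high word frequency, which helps when training a model with few parameters.

This example combines the concepts behind text generation with a miniature GPT with KerasHub's abstractions. We will demonstrate how KerasHub tokenization, layers, and metrics simplify the training process, and then show how to generate output text using the KerasHub sampling utilities.

Note: If you are running this example on Colab, make sure to enable a GPU runtime for faster training.

This example requires KerasHub. You can install it with the following command: pip install keras-hub

Setup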

!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras  # Upgrade to Keras 3.

import os
import keras_hub
import keras

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

Settings & hyperparameters

# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512  # Strings shorter than this will be discarded
SEQ_LEN = 128  # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 5

# Inference
NUM_TOKENS_TO_GENERATE = 80

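Load the data

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books and has one of the smallest ratios of vocabulary size to word-level tokens. Its vocabulary size is about 98k, a third of WikiText-103's, while the number of tokens (about 100M) is roughly the same. This makes it easy to fit a small model.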

keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
 282386239/282386239 ━━━━━━━━━━━━━━━━━━━━ 7s 0us/step

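Train the tokenizer

We train the tokenizer on the training dataset with a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible because, as we will see later, it has a large effect on the number of model parameters. We also don't want to include too few vocabulary terms, or there would be too many out-of-vocabulary (OOV) subwords. In addition, three tokens are reserved in the vocabulary:

"[PAD]" for padding sequences to SEQ_LEN. This token has index 0 in both reserved_tokens and vocab, since WordPieceTokenizer (and other layers) treat 0/vocab[0] as the default padding.
"[UNK]" for OOV subwords, which should match the default oov_token="[UNK]" in WordPieceTokenizer.
"[BOS]" stands for beginning of sentence, but here it is technically a token marking the beginning of each line of training data.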

# Train tokenizer vocabulary
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

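Load the tokenizer

We use the vocabulary data to initialize keras_hub.tokenizers.WordPieceTokenizer. WordPieceTokenizer is an efficient implementation of the WordPiece algorithm used by BERT and other models. It strips, lowercases, and applies other irreversible preprocessing operations to the text.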

tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
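To make the WordPiece idea concrete, here is a minimal pure-Python sketch of greedy longest-match-first subword splitting. It is not the KerasHub implementation: the toy vocabulary is hypothetical, and the BERT-style `##` continuation prefix is assumed for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # Marks a word continuation.
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # No piece fits: the whole word is OOV.
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the one trained in this example.
toy_vocab = {"[PAD]", "[UNK]", "[BOS]", "play", "##ing", "##ed", "the"}
text = "The boy playing"
tokens = [t for w in text.lower().split() for t in wordpiece_tokenize(w, toy_vocab)]
print(tokens)  # ['the', '[UNK]', 'play', '##ing']
```

A word with no matching pieces ("boy" here) collapses to "[UNK]", which is why a vocabulary that is too small produces many OOV subwords.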

Tokenize the data

We preprocess the dataset by tokenizing it and splitting it into features and labels.

# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
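The relationship between features and labels can be sketched with plain lists. This is a toy imitation of the preprocessing above, assuming the packer prepends the start id and then truncates to the sequence length; the token ids and the short `TOY_SEQ_LEN` are hypothetical.

```python
# Ids of "[PAD]" and "[BOS]": positions 0 and 2 in reserved_tokens.
PAD_ID, BOS_ID = 0, 2
TOY_SEQ_LEN = 6  # Hypothetical short length so the lists stay readable.


def toy_preprocess(token_ids, seq_len=TOY_SEQ_LEN):
    # Tokenizer step: truncate or pad the ids to a fixed length.
    labels = (token_ids + [PAD_ID] * seq_len)[:seq_len]
    # Packer step: prepend the start id, then truncate to the same length.
    features = ([BOS_ID] + labels)[:seq_len]
    return features, labels


features, labels = toy_preprocess([17, 42, 8, 99])
print(features)  # [2, 17, 42, 8, 99, 0]
print(labels)    # [17, 42, 8, 99, 0, 0]
```

At every position the label is the token that follows the corresponding feature token, so the model learns next-token prediction.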

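Build the model

We create our scaled-down GPT model with the following layers:

One keras_hub.layers.TokenAndPositionEmbedding layer, which combines the embeddings for each token and its position.
Multiple keras_hub.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with the decoder sequence only.
One final dense linear layer.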

inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

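Let's take a look at our model summary: the large majority of the parameters are in the token_and_position_embedding and the output dense layer! This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) affects it much less.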

model.summary()

Model: "functional_1"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ token_and_position_embedding    │ (None, None, 256)         │  1,312,768 │
│ (TokenAndPositionEmbedding)     │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder             │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder_1           │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, None, 5000)        │  1,285,000 │
└─────────────────────────────────┴───────────────────────────┴────────────┘

 Total params: 3,255,938 (12.42 MB)

 Trainable params: 3,255,938 (12.42 MB)

 Non-trainable params: 0 (0.00 B)
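The parameter counts in the summary can be reproduced with a little arithmetic. The sketch below assumes each attention head has size EMBED_DIM // NUM_HEADS and that the decoder holds two layer normalizations; these are assumptions about the layer's internals, chosen because they reproduce the reported 329,085 figure.

```python
# Hyperparameters copied from this example.
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 5000, 128, 256
FEED_FORWARD_DIM, NUM_HEADS, NUM_LAYERS = 128, 3, 2

# Token embeddings plus learned position embeddings.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# Assumption: each attention head has size EMBED_DIM // NUM_HEADS.
head_dim = EMBED_DIM // NUM_HEADS
inner_dim = head_dim * NUM_HEADS
# Query, key and value projections (weights + biases).
qkv_params = 3 * (EMBED_DIM * inner_dim + inner_dim)
# Attention output projection back to EMBED_DIM.
attn_out_params = inner_dim * EMBED_DIM + EMBED_DIM
# Two layer normalizations, each with a scale and an offset vector.
layer_norm_params = 2 * (2 * EMBED_DIM)
# Feed-forward network: EMBED_DIM -> FEED_FORWARD_DIM -> EMBED_DIM.
ffn_params = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
decoder_params = qkv_params + attn_out_params + layer_norm_params + ffn_params

# Final dense layer: weights + biases.
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + NUM_LAYERS * decoder_params + dense_params
vocab_share = (embedding_params + dense_params) / total_params
print(embedding_params, decoder_params, dense_params, total_params)
print(f"Share tied to the vocabulary-sized layers: {vocab_share:.1%}")
```

Roughly 80% of the parameters sit in the embedding and output layers, both of which scale with VOCAB_SIZE.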

Training

Now that we have our model, let's train it with the fit() method.

model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Epoch 1/5
 2445/2445 ━━━━━━━━━━━━━━━━━━━━ 216s 66ms/step - loss: 5.0008 - perplexity: 180.0715 - val_loss: 4.2176 - val_perplexity: 68.0438
Epoch 2/5
 2445/2445 ━━━━━━━━━━━━━━━━━━━━ 127s 48ms/step - loss: 4.1699 - perplexity: 64.7740 - val_loss: 4.0553 - val_perplexity: 57.7996
Epoch 3/5
 2445/2445 ━━━━━━━━━━━━━━━━━━━━ 126s 47ms/step - loss: 4.0286 - perplexity: 56.2138 - val_loss: 4.0134 - val_perplexity: 55.4446
Epoch 4/5
 2445/2445 ━━━━━━━━━━━━━━━━━━━━ 134s 50ms/step - loss: 3.9576 - perplexity: 52.3643 - val_loss: 3.9900 - val_perplexity: 54.1153
Epoch 5/5
 2445/2445 ━━━━━━━━━━━━━━━━━━━━ 135s 51ms/step - loss: 3.9080 - perplexity: 49.8242 - val_loss: 3.9500 - val_perplexity: 52.0006

<keras.src.callbacks.history.History at 0x7f7de0365ba0>
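The perplexity reported in the log is, in essence, the exponential of the mean per-token cross-entropy with padding positions ignored. The following is a minimal sketch of that definition, not the KerasHub metric itself; the probabilities and labels are hypothetical.

```python
import math


def masked_perplexity(token_probs, labels, mask_token_id=0):
    """exp of the mean negative log-likelihood over non-padding labels.

    token_probs[i] is the probability the model gave to labels[i].
    """
    losses = [
        -math.log(p) for p, y in zip(token_probs, labels) if y != mask_token_id
    ]
    return math.exp(sum(losses) / len(losses))


# Hypothetical probabilities for four positions; the last label is padding.
ppl = masked_perplexity([0.25, 0.25, 0.25, 0.9], labels=[7, 3, 9, 0])
print(round(ppl, 4))  # 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens, so the validation perplexity of about 52 in the log corresponds to an effective choice among roughly 52 tokens out of 5,000.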

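Inference

With our trained model, we can test it out to gauge its performance. To do this, we can seed the model with an input sequence starting with the "[BOS]" token, and progressively sample the model by making predictions for each subsequent token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.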

# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

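We will use the keras_hub.samplers module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.

Note: There are two more advanced capabilities available when defining your callback. The first is the ability to take in a cache of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by keras_hub.samplers.ContrastiveSampler, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.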

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache

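Creating the wrapper function is the most complex part of using these functions. Now that it's done, let's test out the different utilities, starting with greedy search.

Greedy search

We greedily pick the most probable token at each timestep. In other words, we take the argmax of the model output.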

sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
[b'[BOS] " i \' m going to tell you , " said the boy , " i \' ll tell you , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good']

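As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!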
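A toy decoding loop shows why greedy search tends to fall into cycles. This sketch is independent of KerasHub: the bigram logit table stands in for a model and is purely hypothetical.

```python
def greedy_decode(next_logits, prompt, num_steps):
    """Repeatedly append the index of the largest logit."""
    tokens = list(prompt)
    for _ in range(num_steps):
        logits = next_logits(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Hypothetical bigram "model": row i holds the logits for the token after i.
TOY_LOGITS = [
    [0.1, 2.0, 0.3],  # after token 0, token 1 is most likely
    [0.2, 0.1, 1.5],  # after token 1, token 2 is most likely
    [0.4, 1.8, 0.9],  # after token 2, token 1 is most likely
]
output = greedy_decode(lambda toks: TOY_LOGITS[toks[-1]], prompt=[0], num_steps=6)
print(output)  # [0, 1, 2, 1, 2, 1, 2]
```

Because the choice is deterministic, once the decoder revisits a state it has seen before it repeats the same continuation forever.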

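Beam search

At a high level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all sequences. It is an improvement over greedy search since it stores more possibilities. However, it is less efficient than greedy search since it has to compute and store multiple potential sequences.

Note: beam search with num_beams=1 is identical to greedy search.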

sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
[b'[BOS] " i don \' t know anything about it , " she said . " i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \'']

Similar to greedy search, beam search quickly starts repeating itself, since it is still a deterministic method.
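The bookkeeping behind beam search can be sketched in a few lines. This is a simplified illustration rather than the KerasHub sampler: the probability table is hypothetical and was picked so that a wider beam finds a better sequence than the greedy choice.

```python
import math


def beam_search(next_log_probs, prompt, num_beams, num_steps):
    """Keep the num_beams sequences with the highest total log-probability."""
    beams = [(0.0, list(prompt))]
    for _ in range(num_steps):
        candidates = []
        for score, tokens in beams:
            for token, log_p in enumerate(next_log_probs(tokens)):
                candidates.append((score + log_p, tokens + [token]))
        # Sort by score, best first, and prune to the beam width.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][1]


# Hypothetical model: row i holds next-token probabilities after token i.
TOY_PROBS = [
    [0.0, 0.6, 0.4],     # after 0: token 1 looks best locally...
    [0.34, 0.33, 0.33],  # ...but leads to a flat distribution,
    [0.05, 0.05, 0.9],   # while token 2 leads to a confident one.
]


def toy_log_probs(tokens):
    return [math.log(p) if p > 0 else -math.inf for p in TOY_PROBS[tokens[-1]]]


greedy_like = beam_search(toy_log_probs, [0], num_beams=1, num_steps=2)
wider = beam_search(toy_log_probs, [0], num_beams=2, num_steps=2)
print(greedy_like)  # [0, 1, 0]
print(wider)        # [0, 2, 2]
```

With one beam the search commits to token 1 and ends with sequence probability 0.6 × 0.34 ≈ 0.20; with two beams it keeps token 2 alive and finds a sequence with probability 0.4 × 0.9 = 0.36.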

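Random search

Random search is our first probabilistic method. At each time step, it samples the next token using the softmax probabilities provided by the model.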

sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
[b'[BOS] eleanor . like ice , not children would have suspicious forehead . they will see him , no goods in her plums . i have made a stump one , on the occasion , - - it is sacred , and one is unholy - plaything - - the partial consequences , and one refuge in a style of a boy , who was his grandmother . it was a young gentleman who bore off upon the middle of the day , rush and as he maltreated the female society , were growing at once . in and out of the craid little plays , stopping']

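Voilà, no repetitions! However, with random search, we may see some nonsensical words appearing, since any word in the vocabulary has a chance of appearing with this sampling method. This is fixed by our next search utility, top-k search.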
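Sampling from the softmax distribution can be sketched with the standard library alone. The logits below are hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_sample(logits, rng):
    """Draw one token index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # Fixed seed so the sketch is repeatable.
toy_logits = [2.0, 1.0, 0.0, -3.0]  # Hypothetical logits for a 4-token vocabulary.
draws = [random_sample(toy_logits, rng) for _ in range(1000)]
print([draws.count(i) for i in range(4)])
```

Even the last token, with a probability under 1%, can be drawn occasionally, which is the source of the nonsensical words mentioned above.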

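Top-K search

Similar to random search, we sample the next token from the probability distribution provided by the model. The only difference is that here, we select the top k most probable tokens and redistribute the probability mass over them before sampling. This way, we won't be sampling from low-probability tokens, and hence we get fewer nonsensical words!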

sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
[b'[BOS] " the young man was not the one , and the boy went away to the green forest . they were a little girl \' s wife , and the child loved him as much as he did , and he had often heard of a little girl who lived near the house . they were too tired to go , and when they went down to the barns and get into the barn , and they got the first of the barns that they had been taught to do so , and the little people went to their homes . she did , she told them that she had been a very clever , and they had made the first . she knew they']
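The top-k filtering step can be illustrated as follows. This is a sketch under simple assumptions (a flat list of hypothetical logits, softmax restricted to the kept tokens), not the KerasHub sampler.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample from the k highest-scoring tokens after renormalizing them."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to the kept tokens.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # Fixed seed so the sketch is repeatable.
toy_logits = [0.5, 3.0, 2.5, -1.0, 0.0]  # Hypothetical 5-token vocabulary.
draws = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(200)}
print(sorted(draws))
```

With k=2, only tokens 1 and 2 can ever be drawn; the three lower-scoring tokens are excluded no matter how many samples are taken.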

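Top-P search

Even with top-k search, there is something to improve upon. With top-k search, the number k is fixed, which means it selects the same number of tokens for any probability distribution. Consider two scenarios: one where the probability mass is concentrated on 2 words, and another where it is spread evenly across 10. Should we choose k=2 or k=10? There is no one-size-fits-all k here.

This is where top-p search comes in! Instead of choosing a k, we choose a probability p that we want the probabilities of the top tokens to sum up to. This way, k adjusts dynamically to the probability distribution. With p=0.9, if 90% of the probability mass is concentrated on the top 2 tokens, we keep only those 2 tokens to sample from. If instead the 90% is spread over 10 tokens, we keep those 10 tokens. The code below uses a more restrictive p=0.5.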

sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
[b'[BOS] the children were both born in the spring , and the youngest sister were very much like the other children , but they did not see them . they were very happy , and their mother was a beautiful one . the youngest was one of the youngest sister of the youngest , and the youngest baby was very fond of the children . when they came home , they would see a little girl in the house , and had the beautiful family , and the children of the children had to sit and look on their backs , and the eldest children were very long , and they were so bright and happy , as they were , they had never noticed their hair ,']
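The way top-p adapts the number of kept tokens can be sketched as below. The two probability distributions are hypothetical, and the rule of keeping tokens until the cumulative probability first reaches p is one common formulation, assumed here for illustration.

```python
def top_p_filter(probs, p):
    """Return the smallest set of top tokens whose probabilities sum to >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


# Hypothetical distributions over a 6-token vocabulary.
peaked = [0.5, 0.45, 0.02, 0.01, 0.01, 0.01]
flat = [0.24, 0.22, 0.2, 0.18, 0.1, 0.06]
print(top_p_filter(peaked, p=0.9))  # [0, 1]
print(top_p_filter(flat, p=0.9))    # [0, 1, 2, 3, 4]
```

The same p keeps 2 tokens for the peaked distribution and 5 for the flat one; sampling then proceeds over the kept tokens after renormalizing their probabilities.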

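Using callbacks for text generation

We can also wrap the utilities in a callback, which lets you print out a prediction sequence after every epoch of training! Here is an example of a callback for top-k search: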

class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        super().__init__()  # Let the base Callback set up its own state.
        self.sampler = keras_hub.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2
Top-K search generated text: 
[b"[BOS] the young man was in the middle of a month , and he was able to take the crotch , but a long time , for he felt very well for himself in the sepoys ' s hands were chalks . he was the only boy , and he had a few years before been married , and the man said he was a tall one . he was a very handsome , and he was a very handsome young fellow , and a handsome , noble young man , but a boy , and man . he was a very handsome man , and was tall and handsome , and he looked like a gentleman . he was an"]

1/1 - 16s - 16s/step - loss: 3.9454 - perplexity: 51.6987
Epoch 2/2
Top-K search generated text: 
[b'[BOS] " well , it is true . it is true that i should go to the house of a collector , in the matter of prussia that there is no other way there . there is no chance of being in the habit of being in the way of an invasion . i know not what i have done , but i have seen the man in the middle of a day . the next morning i shall take him to my father , for i am not the very day of the town , which would have been a little more than the one \' s daughter , i think it over and the whole affair will be']

1/1 - 17s - 17s/step - loss: 3.7860 - perplexity: 44.0932

<keras.src.callbacks.history.History at 0x7f7de0325600>

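Conclusion

To recap, in this example, we use KerasHub layers to train a subword vocabulary, tokenize training data, create a miniature GPT model, and perform inference with the text generation library.

If you would like to understand how Transformers work, or learn more about training the full GPT model, here are some further readings:

Attention Is All You Need, Vaswani et al., 2017
GPT-3 paper, Brown et al., 2020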