Author: Jesse Chan
Date created: 2022/07/25
Last modified: 2022/07/25
Description: Using KerasHub to train a mini-GPT model for text generation.
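In this example, we will use KerasHub to build a scaled-down Generative Pre-Trained (GPT) model. GPT is a Transformer-based model that allows you to generate sophisticated text from a prompt.
We will train the model on the simplebooks-92 corpus, a dataset made from several novels. It is a good fit for this example because it has a small vocabulary and high word frequency, which helps when training a model with few parameters.
This example combines concepts from "Text generation with a miniature GPT" with KerasHub abstractions. We will demonstrate how KerasHub tokenization, layers, and metrics simplify the training process, and then show how to generate output text using the KerasHub sampling utilities.
Note: If you are running this example on Colab, make sure to enable the GPU runtime for faster training.
This example requires KerasHub. You can install it via the following command: pip install keras-hub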
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
import keras_hub
import keras
import tensorflow.data as tf_data
import tensorflow.strings as tf_strings
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512 # Strings shorter than this will be discarded
SEQ_LEN = 128 # Length of training sequences, in tokens
# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000 # Limits parameters in model.
# Training
EPOCHS = 5
# Inference
NUM_TOKENS_TO_GENERATE = 80
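Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books and has one of the smallest ratios of vocabulary size to word-level tokens. It has a vocabulary size of about 98k, a third of WikiText-103's, with roughly the same number of tokens (about 100M). This makes it easy to fit with a small model.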
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")
# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)
# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)
Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
282386239/282386239 ━━━━━━━━━━━━━━━━━━━━ 7s 0us/step
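We train the tokenizer on the training dataset with a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter. We want to limit the vocabulary as much as possible because, as we will see later, it has a large effect on the number of model parameters. We also don't want to include too few vocabulary terms, or there would be too many out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:
"[PAD]" for padding sequences to SEQ_LEN. This token has index 0 in both reserved_tokens and vocab, since WordPieceTokenizer (and other layers) treat 0/vocab[0] as the default padding.
"[UNK]" for OOV sub-words, which should match the default oov_token="[UNK]" in WordPieceTokenizer.
"[BOS]" stands for beginning of sentence, but here it is technically a token marking the beginning of each line of training data.
# Train tokenizer vocabulary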
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)
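We use the vocabulary data to initialize keras_hub.tokenizers.WordPieceTokenizer. WordPieceTokenizer is an efficient implementation of the WordPiece algorithm used by BERT and other models. It will strip, lower-case, and perform other irreversible preprocessing operations.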
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
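To make the sub-word splitting concrete, here is a minimal pure-Python sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word. The toy vocabulary and the helper name wordpiece_split are assumptions made for illustration only; the real WordPieceTokenizer additionally handles whitespace and punctuation splitting, lowercasing, padding, and batching.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" marker.
                candidate = suffix_prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the OOV token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the one learned from simplebooks-92.
toy_vocab = {"[PAD]", "[UNK]", "[BOS]", "play", "##ing", "##ed", "the"}

print(wordpiece_split("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_split("played", toy_vocab))   # ['play', '##ed']
print(wordpiece_split("zebra", toy_vocab))    # ['[UNK]']
```

A smaller vocabulary forces more words to be split into several pieces or mapped to "[UNK]", which is the trade-off behind tuning VOCAB_SIZE.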
We preprocess the dataset by tokenizing it and splitting it into features and labels.
# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)
def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels
# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
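The effect of the preprocessing step can be pictured with plain lists: the labels are the tokenized line, and the features are the same tokens shifted right by one position with "[BOS]" in front, so that position i of the features is trained to predict token i of the labels. The sketch below is a simplified stand-in for the tokenizer and packer, not the KerasHub code; the helper names, the token IDs, and the toy sequence length of 6 are assumptions.

```python
PAD_ID, BOS_ID = 0, 2  # IDs follow the reserved_tokens order used above.


def pad_or_truncate(token_ids, seq_len):
    """Cut the sequence to seq_len, or right-pad it with PAD_ID."""
    clipped = token_ids[:seq_len]
    return clipped + [PAD_ID] * (seq_len - len(clipped))


def make_features_and_labels(token_ids, seq_len):
    """Mimic preprocess(): labels are the tokens, features are shifted right."""
    labels = pad_or_truncate(token_ids, seq_len)
    # Prepending [BOS] and re-truncating drops the last position of the input.
    features = pad_or_truncate([BOS_ID] + labels, seq_len)
    return features, labels


# Hypothetical token IDs for a short line, with a toy sequence length of 6.
features, labels = make_features_and_labels([11, 12, 13, 14], seq_len=6)
print(features)  # [2, 11, 12, 13, 14, 0]
print(labels)    # [11, 12, 13, 14, 0, 0]
```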
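We create our scaled-down GPT model with the following layers:
One keras_hub.layers.TokenAndPositionEmbedding layer, which combines the embedding for each token and its position.
Multiple keras_hub.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with the decoder sequence only.
inputs = keras.layers.Input(shape=(None,), dtype="int32")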
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
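The causal masking that the TransformerDecoder layers apply by default can be pictured as a lower-triangular matrix: position i may attend to positions 0 through i, and never to later ones. The helper below is an illustrative sketch with a hypothetical name, not the KerasHub implementation.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask; 1 means position i may attend to j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

This is what lets the model be trained on whole sequences at once while still only ever predicting a token from the tokens before it.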
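Let's take a look at our model summary: a large majority of the parameters are in the token_and_position_embedding and the output dense layer! This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) doesn't affect it as much.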
model.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ token_and_position_embedding    │ (None, None, 256)         │  1,312,768 │
│ (TokenAndPositionEmbedding)     │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder             │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ transformer_decoder_1           │ (None, None, 256)         │    329,085 │
│ (TransformerDecoder)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, None, 5000)        │  1,285,000 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 3,255,938 (12.42 MB)
Trainable params: 3,255,938 (12.42 MB)
Non-trainable params: 0 (0.00 B)
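The parameter counts in the summary can be reproduced by hand, which shows where VOCAB_SIZE enters. The sketch below is plain arithmetic on the hyperparameters defined earlier. It assumes each attention head has size EMBED_DIM // NUM_HEADS and that every dense projection and layer normalization carries a bias or offset term; under those assumptions the numbers match the summary.

```python
VOCAB_SIZE, SEQ_LEN = 5000, 128
EMBED_DIM, FEED_FORWARD_DIM, NUM_HEADS = 256, 128, 3

# Token embedding table plus position embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# Assumption: each attention head has size EMBED_DIM // NUM_HEADS (85 here).
head_dim = EMBED_DIM // NUM_HEADS
inner_dim = NUM_HEADS * head_dim
qkv_params = 3 * (EMBED_DIM * inner_dim + inner_dim)  # query, key, value
attn_out_params = inner_dim * EMBED_DIM + EMBED_DIM   # attention output
layer_norm_params = 2 * (2 * EMBED_DIM)               # two norms, scale + offset
feed_forward_params = (
    EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM   # first dense layer
    + FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM        # second dense layer
)
decoder_params = (
    qkv_params + attn_out_params + layer_norm_params + feed_forward_params
)

# Output projection: weights plus one bias per vocabulary entry.
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

total = embedding_params + 2 * decoder_params + dense_params
print(embedding_params, decoder_params, dense_params, total)
# 1312768 329085 1285000 3255938
```

The embedding and output layers together hold roughly 80% of the total, and both scale linearly with VOCAB_SIZE, whereas each extra decoder layer adds only about 0.33M parameters.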
Now that we have our model, let's train it with the fit() method.
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Epoch 1/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 216s 66ms/step - loss: 5.0008 - perplexity: 180.0715 - val_loss: 4.2176 - val_perplexity: 68.0438
Epoch 2/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 127s 48ms/step - loss: 4.1699 - perplexity: 64.7740 - val_loss: 4.0553 - val_perplexity: 57.7996
Epoch 3/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 126s 47ms/step - loss: 4.0286 - perplexity: 56.2138 - val_loss: 4.0134 - val_perplexity: 55.4446
Epoch 4/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 134s 50ms/step - loss: 3.9576 - perplexity: 52.3643 - val_loss: 3.9900 - val_perplexity: 54.1153
Epoch 5/5
2445/2445 ━━━━━━━━━━━━━━━━━━━━ 135s 51ms/step - loss: 3.9080 - perplexity: 49.8242 - val_loss: 3.9500 - val_perplexity: 52.0006
<keras.src.callbacks.history.History at 0x7f7de0365ba0>
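Perplexity, the metric logged next to the loss, is the exponential of the average per-token cross-entropy. The sketch below illustrates that definition in plain Python, assuming padding positions (token ID 0) are skipped as the mask_token_id=0 argument suggests. The function name and the loss values are hypothetical. Note that the logged perplexity is close to, but not exactly, exp(loss) (for example, exp(3.9500) is about 51.9 versus the reported 52.0), presumably because the two quantities are masked and averaged differently.

```python
import math


def masked_perplexity(token_losses, token_ids, mask_token_id=0):
    """exp of the mean cross-entropy over non-padding positions."""
    kept = [
        loss for loss, tok in zip(token_losses, token_ids) if tok != mask_token_id
    ]
    return math.exp(sum(kept) / len(kept))


# Hypothetical per-token cross-entropy values; the last two positions are padding.
losses = [4.0, 3.5, 4.5, 9.0, 9.0]
labels = [17, 42, 8, 0, 0]
print(masked_perplexity(losses, labels))  # exp(4.0), about 54.6
```

A perplexity of about 50 can be read as the model being, on average, as uncertain as if it were choosing uniformly among about 50 tokens at each step.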
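With our trained model, we can test it out to gauge its performance. To do this, we can seed the model with an input sequence starting with the "[BOS]" token, and progressively sample the model by making predictions for each subsequent token in a loop.
To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.
# The "packer" layer adds the [BOS] token for us.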
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens
<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>
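We will use the keras_hub.samplers module for inference, which requires a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the current token we are generating.
Note: There are two more advanced capabilities available when defining your callback. The first is the ability to take in a cache of states computed in previous generation steps, which can be used to speed up generation. The second is the ability to output the final dense "hidden state" of each generated token. This is used by keras_hub.samplers.ContrastiveSampler, which avoids repetition by penalizing repeated hidden states. Both are optional, and we will ignore them for now.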
def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache
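Creating the wrapper function is the most complex part of using these functions. Now that it's done, let's test out the different utilities, starting with greedy search.
We greedily pick the most probable token at each timestep. In other words, we take the argmax of the model output.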
sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")
Greedy search generated text:
[b'[BOS] " i \' m going to tell you , " said the boy , " i \' ll tell you , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good friend , and you \' ll be a good']
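As you can see, greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities shown later on!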
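Why greedy decoding tends to loop can be seen on a toy example. The sketch below replaces the trained network with a hypothetical lookup table that gives next-token probabilities from the last token only; it is not the GreedySampler implementation. Because the argmax is deterministic, the sequence repeats forever as soon as it revisits a token.

```python
# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "[BOS]": {"you": 0.6, "and": 0.4},
    "you": {"will": 0.7, "and": 0.3},
    "will": {"be": 0.9, "you": 0.1},
    "be": {"good": 0.6, "and": 0.4},
    "good": {"and": 0.8, "you": 0.2},
    "and": {"you": 0.9, "be": 0.1},
}


def greedy_decode(model, start_token, num_tokens):
    """Repeatedly append the single most probable next token (argmax)."""
    tokens = [start_token]
    for _ in range(num_tokens):
        probs = model[tokens[-1]]
        tokens.append(max(probs, key=probs.get))
    return tokens


print(" ".join(greedy_decode(TOY_MODEL, "[BOS]", 11)))
# [BOS] you will be good and you will be good and you
```

The real model conditions on the whole prefix rather than one token, but the same mechanism applies once a high-probability phrase leads back to itself.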
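At a high level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all of those sequences. It is an improvement over greedy search because it keeps more possibilities open. However, it is less efficient than greedy search because it has to compute and store multiple candidate sequences.
Note: beam search with num_beams=1 is identical to greedy search.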
sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")
Beam search generated text:
[b'[BOS] " i don \' t know anything about it , " she said . " i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know anything about it . i don \' t know anything about it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \' t know it , but i don \' t know it . i don \'']
Similar to greedy search, beam search quickly starts repeating itself, since it is still a deterministic method.
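The bookkeeping behind beam search can be sketched in a few lines. The example below uses a hypothetical lookup-table model and scores sequences by summed log-probability; it is a simplified illustration, not the BeamSampler implementation. With a beam width of 1 it follows the greedy path, while a wider beam finds a sequence that starts with a less likely token but has a higher overall probability.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "[BOS]": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"b": 0.9, "a": 0.1},
}


def beam_search(model, start_token, num_tokens, num_beams):
    """Keep the num_beams highest-scoring sequences at every step."""
    beams = [([start_token], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(num_tokens):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Sort by score, best first, and prune to the beam width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


print(beam_search(TOY_MODEL, "[BOS]", 3, num_beams=1))  # ['[BOS]', 'a', 'a', 'a']
print(beam_search(TOY_MODEL, "[BOS]", 3, num_beams=3))  # ['[BOS]', 'b', 'b', 'b']
```

Here the greedy path has probability 0.6 * 0.55 * 0.55, about 0.18, while the path found by the wider beam has 0.4 * 0.9 * 0.9, about 0.32. Both are still deterministic, which is why neither avoids repetition.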
Random search is our first probabilistic method. At each timestep, it samples the next token using the softmax probabilities provided by the model.
sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")
Random search generated text:
[b'[BOS] eleanor . like ice , not children would have suspicious forehead . they will see him , no goods in her plums . i have made a stump one , on the occasion , - - it is sacred , and one is unholy - plaything - - the partial consequences , and one refuge in a style of a boy , who was his grandmother . it was a young gentleman who bore off upon the middle of the day , rush and as he maltreated the female society , were growing at once . in and out of the craid little plays , stopping']
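Voilà, no repetitions! However, with random search, we may see some nonsensical words appearing, since any word in the vocabulary has a chance of being picked with this sampling method. This is fixed by our next search utility, top-k search.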
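The sampling step of random search can be illustrated with the standard library alone. The sketch below converts logits to probabilities with a softmax and draws token IDs in proportion to them; the logits, the seed, and the helper names are assumptions, and this is not the RandomSampler implementation. Even the least likely token is drawn occasionally, which is the source of the nonsensical words mentioned above.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Draw one token ID with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # Fixed seed so the sketch is repeatable.
toy_logits = [2.0, 1.0, 0.1, -3.0]  # Hypothetical logits for a 4-token vocabulary.
draws = [sample_token(toy_logits, rng) for _ in range(1000)]
# Count how often each token ID was drawn.
print([draws.count(i) for i in range(4)])
```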
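Similar to random search, we sample the next token from the probability distribution provided by the model. The only difference is that here, we select the top k most probable tokens and redistribute the probability mass over them before sampling. This way, we never sample from low-probability tokens, and hence we get fewer nonsensical words!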
sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")
Top-K search generated text:
[b'[BOS] " the young man was not the one , and the boy went away to the green forest . they were a little girl \' s wife , and the child loved him as much as he did , and he had often heard of a little girl who lived near the house . they were too tired to go , and when they went down to the barns and get into the barn , and they got the first of the barns that they had been taught to do so , and the little people went to their homes . she did , she told them that she had been a very clever , and they had made the first . she knew they']
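The filtering step that distinguishes top-k search from random search can be sketched as follows: keep the k most probable tokens, set the rest to zero, and renormalize so the kept probabilities sum to 1. The helper name and the toy distribution are assumptions; this is not the TopKSampler implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; zero out the rest."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top_ids)
    return [p / kept_mass if i in top_ids else 0.0 for i, p in enumerate(probs)]


toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # Hypothetical next-token distribution.
filtered = top_k_filter(toy_probs, k=2)
print([round(p, 3) for p in filtered])  # [0.625, 0.375, 0.0, 0.0, 0.0]
```

Sampling then proceeds from the filtered distribution exactly as in random search.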
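Even with top-k search, there is something to improve upon. With top-k search, the number k is fixed, which means it selects the same number of tokens for any probability distribution. Consider two scenarios: one where the probability mass is concentrated on 2 words, and another where the probability mass is spread evenly across 10. Should we choose k=2 or k=10? There is no one-size-fits-all k here.
This is where top-p search comes in! Instead of choosing a k, we choose a probability p that we want the probabilities of the top tokens to sum up to. This way, k is adjusted dynamically based on the probability distribution. With p=0.9, if 90% of the probability mass is concentrated on the top 2 tokens, we keep only those 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, we similarly keep only those 10 tokens to sample from. Note that the code below uses a smaller value, p=0.5.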
sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")
Top-P search generated text:
[b'[BOS] the children were both born in the spring , and the youngest sister were very much like the other children , but they did not see them . they were very happy , and their mother was a beautiful one . the youngest was one of the youngest sister of the youngest , and the youngest baby was very fond of the children . when they came home , they would see a little girl in the house , and had the beautiful family , and the children of the children had to sit and look on their backs , and the eldest children were very long , and they were so bright and happy , as they were , they had never noticed their hair ,']
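The two scenarios described for top-p search can be checked with a small sketch. The helper below keeps the smallest set of most probable tokens whose cumulative probability reaches p and renormalizes over them. The helper name and both toy distributions are assumptions, and implementations differ in how they treat the token that crosses the threshold, so this is an illustration rather than the TopPSampler code.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [q / mass if i in kept else 0.0 for i, q in enumerate(probs)]


# Hypothetical distributions over a 12-token vocabulary.
peaked = [0.6, 0.35] + [0.005] * 10  # most mass on 2 tokens
flat = [0.095] * 10 + [0.025] * 2    # mass spread over 10 tokens

print(sum(q > 0 for q in top_p_filter(peaked, p=0.9)))  # 2
print(sum(q > 0 for q in top_p_filter(flat, p=0.9)))    # 10
```

The same p yields an effective k of 2 in one case and 10 in the other, which is the dynamic adjustment a fixed k cannot provide.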
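We can also wrap the utilities in a callback, which allows you to print out a prediction sequence after every epoch of training! Here is an example of a callback for top-k search: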
class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""
    def __init__(self, k):
        super().__init__()
        self.sampler = keras_hub.samplers.TopKSampler(k)
    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")
text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])
Epoch 1/2
Top-K search generated text:
[b"[BOS] the young man was in the middle of a month , and he was able to take the crotch , but a long time , for he felt very well for himself in the sepoys ' s hands were chalks . he was the only boy , and he had a few years before been married , and the man said he was a tall one . he was a very handsome , and he was a very handsome young fellow , and a handsome , noble young man , but a boy , and man . he was a very handsome man , and was tall and handsome , and he looked like a gentleman . he was an"]
1/1 - 16s - 16s/step - loss: 3.9454 - perplexity: 51.6987
Epoch 2/2
Top-K search generated text:
[b'[BOS] " well , it is true . it is true that i should go to the house of a collector , in the matter of prussia that there is no other way there . there is no chance of being in the habit of being in the way of an invasion . i know not what i have done , but i have seen the man in the middle of a day . the next morning i shall take him to my father , for i am not the very day of the town , which would have been a little more than the one \' s daughter , i think it over and the whole affair will be']
1/1 - 17s - 17s/step - loss: 3.7860 - perplexity: 44.0932
<keras.src.callbacks.history.History at 0x7f7de0325600>
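To recap, in this example we used KerasHub layers to train a sub-word vocabulary, tokenize the training data, create a miniature GPT model, and perform inference with the text generation library.
If you would like to understand how Transformers work, or learn more about training the full GPT model, here are some further readings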