作者: 洪宇
创建日期 2024/05/14
最后修改 2024/05/14
描述:使用 float8 量化训练简单的 Transformer 模型。
随着 Transformer 模型中参数数量的不断增加,训练和推理变得高度内存和计算密集。因此,引入了 8 位浮点数 (FP8),与 16 位浮点数相比,它提供了改进的性能,几乎没有精度下降。
具体来说,FP8 有两种截然不同的类型:E4M3 和 E5M2,它们在训练的不同部分非常有用。
通常,E4M3 最适合在正向传播期间使用,因为激活和权重需要更高的精度。然而,在反向传播期间,使用 E5M2,因为梯度不太容易受到精度损失的影响,但需要更高的动态范围。
值得注意的是,FP8 推理部署得到了极大简化,因为推理和训练使用相同的类型。这与在 32 位或 16 位浮点数中训练的网络的 INT8 推理形成对比,后者需要进行训练后量化 (PTQ) 校准,甚至需要进行量化感知训练 (QAT) 才能保持模型精度。
在本示例中,我们将构建一个简单的 Transformer 模型,并使用 FP16 和 FP8 精度对其进行训练。您将观察到,精度不会随着精度的降低而降低。
注意:您将需要一个支持 FP8 Tensor Core 的良好 GPU 才能获得预期的性能提升。
我们将使用 KerasHub 库来简化模型实现。此外,使用混合精度训练以减少训练时间。
注意:对 TensorFlow 的依赖仅用于数据处理。
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
os.environ["KERAS_BACKEND"] = "jax"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import re
import keras
import keras_hub
import tensorflow as tf
keras.config.set_dtype_policy("mixed_bfloat16")
定义一些超参数。
EPOCHS = 3
BATCH_SIZE = 32
VOCABULARY_SIZE = 20000
MAX_SEQUENCE_LENGTH = 200
MODEL_KWARGS = dict(
vocabulary_size=VOCABULARY_SIZE,
max_sequence_length=MAX_SEQUENCE_LENGTH,
hidden_dim=32, # Hidden size for each token
num_heads=2, # Number of attention heads
intermediate_dim=32, # Intermediate size in feedforward network
dropout=0.1, # Dropout rate
)
首先,让我们下载 IMDB 数据集并提取它。
!mkdir -p datasets
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -q -O datasets/aclImdb_v1.tar.gz
!mkdir -p datasets/aclImdb
!tar -xzf datasets/aclImdb_v1.tar.gz -C datasets
!rm -rf datasets/aclImdb/train/unsup
我们将使用 keras.utils.text_dataset_from_directory
实用程序从文本文件生成带标签的 tf.data.Dataset
数据集。
train_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/train",
batch_size=BATCH_SIZE,
validation_split=0.2,
subset="training",
seed=42,
)
val_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/train",
batch_size=BATCH_SIZE,
validation_split=0.2,
subset="validation",
seed=42,
)
test_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/test", batch_size=BATCH_SIZE
)
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
现在,我们将把文本转换为小写。
train_ds = train_ds.map(lambda x, y: (tf.strings.lower(x), y))
val_ds = val_ds.map(lambda x, y: (tf.strings.lower(x), y))
test_ds = test_ds.map(lambda x, y: (tf.strings.lower(x), y))
让我们打印一些样本。
for text_batch, label_batch in train_ds.take(1):
for i in range(3):
print(f"Text: {text_batch.numpy()[i]}")
print(f"Label: {label_batch.numpy()[i]}")
Text: b'"pandemonium" is a horror movie spoof that comes off more stupid than funny. believe me when i tell you, i love comedies. especially comedy spoofs. "airplane", "the naked gun" trilogy, "blazing saddles", "high anxiety", and "spaceballs" are some of my favorite comedies that spoof a particular genre. "pandemonium" is not up there with those films. most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. there are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. geez, "scream" had more laughs than this film and that was more of a horror film. how bizarre is that?<br /><br />*1/2 (out of four)'
Label: 0
Text: b"david mamet is a very interesting and a very un-equal director. his first movie 'house of games' was the one i liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer.<br /><br />so is 'homicide' which from the title tries to set the mind of the viewer to the usual crime drama. the principal characters are two cops, one jewish and one irish who deal with a racially charged area. the murder of an old jewish shop owner who proves to be an ancient veteran of the israeli independence war triggers the jewish identity in the mind and heart of the jewish detective.<br /><br />this is were the flaws of the film are the more obvious. the process of awakening is theatrical and hard to believe, the group of jewish militants is operatic, and the way the detective eventually walks to the final violent confrontation is pathetic. the end of the film itself is mamet-like smart, but disappoints from a human emotional perspective.<br /><br />joe mantegna and william macy give strong performances, but the flaws of the story are too evident to be easily compensated."
Label: 0
Text: b'great documentary about the lives of ny firefighters during the worst terrorist attack of all time.. that reason alone is why this should be a must see collectors item.. what shocked me was not only the attacks, but the"high fat diet" and physical appearance of some of these firefighters. i think a lot of doctors would agree with me that,in the physical shape they were in, some of these firefighters would not of made it to the 79th floor carrying over 60 lbs of gear. having said that i now have a greater respect for firefighters and i realize becoming a firefighter is a life altering job. the french have a history of making great documentary\'s and that is what this is, a great documentary.....'
Label: 1
我们将使用 keras_hub.tokenizers.WordPieceTokenizer
层对文本进行标记化。 keras_hub.tokenizers.WordPieceTokenizer
采用 WordPiece 词汇表,并具有对文本进行标记化和将标记序列去标记化回原始文本的功能。
在定义标记器之前,我们首先需要对我们拥有的数据集进行训练。WordPiece 标记化算法是一种子词标记化算法;在语料库上对其进行训练会为我们提供子词词汇表。子词标记器是词标记器(词标记器需要非常大的词汇表才能很好地覆盖输入词)和字符标记器(字符不像词那样真正编码含义)之间的折衷方案。幸运的是,KerasHub 使得使用 keras_hub.tokenizers.compute_word_piece_vocabulary
实用程序在语料库上训练 WordPiece 变得非常简单。
def train_word_piece(ds, vocab_size, reserved_tokens):
word_piece_ds = ds.unbatch().map(lambda x, y: x)
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
word_piece_ds.batch(1000).prefetch(2),
vocabulary_size=vocab_size,
reserved_tokens=reserved_tokens,
)
return vocab
每个词汇表都有一些特殊的保留标记。我们有两个这样的标记
"[PAD]"
- 填充标记。当输入序列长度小于最大序列长度时,填充标记将附加到输入序列长度。"[UNK]"
- 未知标记。reserved_tokens = ["[PAD]", "[UNK]"]
train_sentences = [element[0] for element in train_ds]
vocab = train_word_piece(train_ds, VOCABULARY_SIZE, reserved_tokens)
让我们看看一些标记!
print("Tokens: ", vocab[100:110])
Tokens: ['à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é']
现在,让我们定义标记器。我们将使用上面训练的词汇表配置标记器。我们将定义一个最大序列长度,以便将所有序列填充到相同的长度,如果序列的长度小于指定的序列长度。否则,序列将被截断。
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
vocabulary=vocab,
lowercase=False,
sequence_length=MAX_SEQUENCE_LENGTH,
)
让我们尝试对来自我们数据集的样本进行标记化!为了验证文本是否已正确标记化,我们还可以将标记列表去标记化回原始文本。
input_sentence_ex = train_ds.take(1).get_single_element()[0][0]
input_tokens_ex = tokenizer(input_sentence_ex)
print("Sentence: ", input_sentence_ex)
print("Tokens: ", input_tokens_ex)
print("Recovered text after detokenizing: ", tokenizer.detokenize(input_tokens_ex))
Sentence: tf.Tensor(b'great movie - especially the music - etta james - "at last". this speaks volumes when you have finally found that special someone.', shape=(), dtype=string)
Tokens:
[ 218 150 14 393 137 356 14 4917 2941 719 14 3
164 370 3 15 145 2705 11670 186 155 160 557 391
146 452 416 15 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0]
Recovered text after detokenizing: tf.Tensor(b'great movie - especially the music - etta james - " at last " . this speaks volumes when you have finally found that special someone . [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]', shape=(), dtype=string)
接下来,我们将以将被馈送到模型的形式格式化我们的数据集。我们需要对文本进行标记化。
def format_dataset(sentence, label):
sentence = tokenizer(sentence)
return ({"input_ids": sentence}, label)
def make_dataset(dataset):
dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
return dataset.shuffle(512).prefetch(tf.data.AUTOTUNE).cache()
train_ds = make_dataset(train_ds)
val_ds = make_dataset(val_ds)
test_ds = make_dataset(test_ds)
让我们构建一个简单的 Transformer 模型。我们将使用 KerasHub 库中的 TokenAndPositionEmbedding
和 TransformerDecoder
。TokenAndPositionEmbedding
表示句子中的单词及其顺序,而 TransformerDecoder
为输入序列的每个时间步输出一个向量。在这里,我们对所有时间步取平均值,并在其之上使用前馈网络对文本进行分类。
def build_model(
vocabulary_size=20000,
max_sequence_length=200,
hidden_dim=32,
num_heads=2,
intermediate_dim=32,
dropout=0.1,
):
token_id_input = keras.layers.Input(shape=(None,), dtype="int32", name="input_ids")
x = keras_hub.layers.TokenAndPositionEmbedding(
vocabulary_size=vocabulary_size,
sequence_length=max_sequence_length,
embedding_dim=hidden_dim,
)(token_id_input)
x = keras.layers.Dropout(rate=dropout)(x)
x = keras_hub.layers.TransformerDecoder(
intermediate_dim=intermediate_dim,
num_heads=num_heads,
dropout=dropout,
)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(dropout)(x)
x = keras.layers.Dense(intermediate_dim, activation="relu")(x)
x = keras.layers.Dropout(dropout)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
return keras.Model(inputs=token_id_input, outputs=outputs)
首先,我们使用混合精度 ("mixed_bfloat16"
) 训练和评估模型。之后,我们将结果与 FP8 训练/推理进行比较。
model = build_model(**MODEL_KWARGS)
model.summary()
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"],
)
history = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
result = model.evaluate(test_ds)
print(f"Accuracy (mixed_bfloat16): {result[1]:.2%}")
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_ids (InputLayer) │ (None, None) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ token_and_position_embedding │ (None, None, 32) │ 646,400 │ │ (TokenAndPositionEmbedding) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout (Dropout) │ (None, None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ transformer_decoder │ (None, None, 32) │ 6,464 │ │ (TransformerDecoder) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ global_average_pooling1d │ (None, 32) │ 0 │ │ (GlobalAveragePooling1D) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_2 (Dropout) │ (None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 32) │ 1,056 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_3 (Dropout) │ (None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 33 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 653,953 (2.49 MB)
Trainable params: 653,953 (2.49 MB)
Non-trainable params: 0 (0.00 B)
Accuracy (mixed_bfloat16): 75.56%
我们可以通过一行 API 启用 FP8 训练/推理:model.quantize("float8")
。
model = build_model(**MODEL_KWARGS)
model.quantize("float8")
为了检查 FP8 训练是否已完成,我们可以打印出一些与 FP8 训练相关的变量
*_scale
:缩放因子,用于将输入、权重和梯度的分布移到 FP8 的可表示范围内。默认为 1.0
*_amax_history
:用于计算缩放因子的 amax 历史窗口。默认为 0.0
,长度为 1024。pattern = r"(transformer).+(multi_head).+(query).+(scale|amax_history)"
for v in model.trainable_variables:
if re.findall(pattern, v.path):
print(v.path)
print(keras.ops.convert_to_numpy(v.value))
FP8 层的数据类型策略也已修改。
for layer in model._flatten_layers(recursive=True):
if "float8" in str(layer.dtype_policy):
print(f"{layer.name}: {layer.dtype_policy}")
feedforward_output_dense: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
feedforward_intermediate_dense: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
attention_output: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
value: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
key: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
query: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
dense_2: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
dense_3: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
让我们训练模型并查看结果。我们可以验证 FP8 训练不会降低精度,并且在拟合后包含 FP8 信息的变量会发生变化。
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"],
)
history = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
result = model.evaluate(test_ds)
print(f"Accuracy (float8): {result[1]:.2%}")
for v in model.trainable_variables:
if re.findall(pattern, v.path):
print(v.path)
print(keras.ops.convert_to_numpy(v.value))
Accuracy (float8): 74.16%