作者: Hongyu Chiu
创建日期 2024/05/14
最后修改日期 2024/05/14
描述: 使用 float8 量化训练一个简单的 Transformer 模型。
随着 Transformer 模型参数数量的持续增长,训练和推理变得非常占用内存和计算资源。因此,引入了 8 位浮点数(FP8),它在相比 16 位浮点数提供性能改进的同时,几乎不会降低准确率。
具体来说,FP8 有两种不同的类型:E4M3 和 E5M2,它们在训练的不同阶段有用。
通常,E4M3 最适合用于前向传播,因为激活值和权重需要更高的精度。然而,在反向传播中,使用 E5M2,因为梯度对精度损失不太敏感,但需要更高的动态范围。
值得注意的是,FP8 推理部署大大简化了,因为推理和训练使用相同的数据类型。这与使用 32 位或 16 位浮点数训练的网络进行 INT8 推理不同,后者需要训练后量化 (PTQ) 校准甚至量化感知训练 (QAT) 来保持模型准确率。
在此示例中,我们将构建一个简单的 Transformer 模型,并使用 FP16 和 FP8 精度进行训练。您将观察到,使用较低精度时准确率没有下降。
注意:您需要一个支持 FP8 Tensor Cores 的性能良好的 GPU,以获得预期的性能提升。
我们将使用 KerasHub 库来简化模型实现。此外,使用混合精度训练来减少训练时间。
注意:对 TensorFlow 的依赖仅用于数据处理。
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
os.environ["KERAS_BACKEND"] = "jax"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import re
import keras
import keras_hub
import tensorflow as tf
keras.config.set_dtype_policy("mixed_bfloat16")
定义一些超参数。
EPOCHS = 3
BATCH_SIZE = 32
VOCABULARY_SIZE = 20000
MAX_SEQUENCE_LENGTH = 200
MODEL_KWARGS = dict(
vocabulary_size=VOCABULARY_SIZE,
max_sequence_length=MAX_SEQUENCE_LENGTH,
hidden_dim=32, # Hidden size for each token
num_heads=2, # Number of attention heads
intermediate_dim=32, # Intermediate size in feedforward network
dropout=0.1, # Dropout rate
)
首先,让我们下载 IMDB 数据集并解压。
!mkdir -p datasets
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -q -O datasets/aclImdb_v1.tar.gz
!mkdir -p datasets/aclImdb
!tar -xzf datasets/aclImdb_v1.tar.gz -C datasets
!rm -rf datasets/aclImdb/train/unsup
我们将使用 keras.utils.text_dataset_from_directory
工具从文本文件生成带有标签的 tf.data.Dataset
数据集。
train_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/train",
batch_size=BATCH_SIZE,
validation_split=0.2,
subset="training",
seed=42,
)
val_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/train",
batch_size=BATCH_SIZE,
validation_split=0.2,
subset="validation",
seed=42,
)
test_ds = keras.utils.text_dataset_from_directory(
"datasets/aclImdb/test", batch_size=BATCH_SIZE
)
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
现在我们将文本转换为小写。
train_ds = train_ds.map(lambda x, y: (tf.strings.lower(x), y))
val_ds = val_ds.map(lambda x, y: (tf.strings.lower(x), y))
test_ds = test_ds.map(lambda x, y: (tf.strings.lower(x), y))
让我们打印一些样本。
for text_batch, label_batch in train_ds.take(1):
for i in range(3):
print(f"Text: {text_batch.numpy()[i]}")
print(f"Label: {label_batch.numpy()[i]}")
Text: b'"pandemonium" is a horror movie spoof that comes off more stupid than funny. believe me when i tell you, i love comedies. especially comedy spoofs. "airplane", "the naked gun" trilogy, "blazing saddles", "high anxiety", and "spaceballs" are some of my favorite comedies that spoof a particular genre. "pandemonium" is not up there with those films. most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. there are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. geez, "scream" had more laughs than this film and that was more of a horror film. how bizarre is that?<br /><br />*1/2 (out of four)'
Label: 0
Text: b"david mamet is a very interesting and a very un-equal director. his first movie 'house of games' was the one i liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer.<br /><br />so is 'homicide' which from the title tries to set the mind of the viewer to the usual crime drama. the principal characters are two cops, one jewish and one irish who deal with a racially charged area. the murder of an old jewish shop owner who proves to be an ancient veteran of the israeli independence war triggers the jewish identity in the mind and heart of the jewish detective.<br /><br />this is were the flaws of the film are the more obvious. the process of awakening is theatrical and hard to believe, the group of jewish militants is operatic, and the way the detective eventually walks to the final violent confrontation is pathetic. the end of the film itself is mamet-like smart, but disappoints from a human emotional perspective.<br /><br />joe mantegna and william macy give strong performances, but the flaws of the story are too evident to be easily compensated."
Label: 0
Text: b'great documentary about the lives of ny firefighters during the worst terrorist attack of all time.. that reason alone is why this should be a must see collectors item.. what shocked me was not only the attacks, but the"high fat diet" and physical appearance of some of these firefighters. i think a lot of doctors would agree with me that,in the physical shape they were in, some of these firefighters would not of made it to the 79th floor carrying over 60 lbs of gear. having said that i now have a greater respect for firefighters and i realize becoming a firefighter is a life altering job. the french have a history of making great documentary\'s and that is what this is, a great documentary.....'
Label: 1
我们将使用 keras_hub.tokenizers.WordPieceTokenizer
层对文本进行分词。keras_hub.tokenizers.WordPieceTokenizer
接受一个 WordPiece 词汇表,并具有用于对文本进行分词以及将 token 序列进行反分词的功能。
在定义分词器之前,我们首先需要使用现有数据集训练它。WordPiece 分词算法是一种子词分词算法;在语料库上训练它会给我们一个子词词汇表。子词分词器是单词分词器(单词分词器需要非常大的词汇表才能很好地覆盖输入单词)和字符分词器(字符不像单词那样真正编码意义)之间的一种折衷。幸运的是,KerasHub 使使用 keras_hub.tokenizers.compute_word_piece_vocabulary
工具在语料库上训练 WordPiece 变得非常简单。
def train_word_piece(ds, vocab_size, reserved_tokens):
word_piece_ds = ds.unbatch().map(lambda x, y: x)
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
word_piece_ds.batch(1000).prefetch(2),
vocabulary_size=vocab_size,
reserved_tokens=reserved_tokens,
)
return vocab
每个词汇表都有一些特殊的保留 token。我们有两个这样的 token:
"[PAD]"
- 填充 token。当输入序列长度短于最大序列长度时,会将填充 token 追加到输入序列末尾。"[UNK]"
- 未知 token。reserved_tokens = ["[PAD]", "[UNK]"]
train_sentences = [element[0] for element in train_ds]
vocab = train_word_piece(train_ds, VOCABULARY_SIZE, reserved_tokens)
让我们看看一些 token!
print("Tokens: ", vocab[100:110])
Tokens: ['à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é']
现在,让我们定义分词器。我们将使用上面训练的词汇表配置分词器。我们将定义一个最大序列长度,以便如果序列长度小于指定长度,所有序列都会被填充到相同长度。否则,序列将被截断。
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
vocabulary=vocab,
lowercase=False,
sequence_length=MAX_SEQUENCE_LENGTH,
)
让我们尝试从数据集中分词一个样本!为了验证文本是否被正确分词,我们还可以将 token 列表反分词回原始文本。
input_sentence_ex = train_ds.take(1).get_single_element()[0][0]
input_tokens_ex = tokenizer(input_sentence_ex)
print("Sentence: ", input_sentence_ex)
print("Tokens: ", input_tokens_ex)
print("Recovered text after detokenizing: ", tokenizer.detokenize(input_tokens_ex))
Sentence: tf.Tensor(b'great movie - especially the music - etta james - "at last". this speaks volumes when you have finally found that special someone.', shape=(), dtype=string)
Tokens:
[ 218 150 14 393 137 356 14 4917 2941 719 14 3
164 370 3 15 145 2705 11670 186 155 160 557 391
146 452 416 15 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0]
Recovered text after detokenizing: tf.Tensor(b'great movie - especially the music - etta james - " at last " . this speaks volumes when you have finally found that special someone . [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]', shape=(), dtype=string)
接下来,我们将数据集格式化为将要馈送到模型的形式。我们需要对文本进行分词。
def format_dataset(sentence, label):
sentence = tokenizer(sentence)
return ({"input_ids": sentence}, label)
def make_dataset(dataset):
dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
return dataset.shuffle(512).prefetch(tf.data.AUTOTUNE).cache()
train_ds = make_dataset(train_ds)
val_ds = make_dataset(val_ds)
test_ds = make_dataset(test_ds)
让我们构建一个简单的 Transformer 模型。我们将使用 KerasHub 库中的 TokenAndPositionEmbedding
和 TransformerDecoder
。TokenAndPositionEmbedding
表示单词及其在句子中的顺序,而 TransformerDecoder
为输入序列的每个时间步输出一个向量。在这里,我们对所有时间步取平均值,并在其之上使用前馈网络来对文本进行分类。
def build_model(
vocabulary_size=20000,
max_sequence_length=200,
hidden_dim=32,
num_heads=2,
intermediate_dim=32,
dropout=0.1,
):
token_id_input = keras.layers.Input(shape=(None,), dtype="int32", name="input_ids")
x = keras_hub.layers.TokenAndPositionEmbedding(
vocabulary_size=vocabulary_size,
sequence_length=max_sequence_length,
embedding_dim=hidden_dim,
)(token_id_input)
x = keras.layers.Dropout(rate=dropout)(x)
x = keras_hub.layers.TransformerDecoder(
intermediate_dim=intermediate_dim,
num_heads=num_heads,
dropout=dropout,
)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(dropout)(x)
x = keras.layers.Dense(intermediate_dim, activation="relu")(x)
x = keras.layers.Dropout(dropout)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
return keras.Model(inputs=token_id_input, outputs=outputs)
首先,我们使用混合精度 ("mixed_bfloat16"
) 训练和评估模型。然后,我们将结果与 FP8 训练/推理进行比较。
model = build_model(**MODEL_KWARGS)
model.summary()
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"],
)
history = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
result = model.evaluate(test_ds)
print(f"Accuracy (mixed_bfloat16): {result[1]:.2%}")
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ input_ids (InputLayer) │ (None, None) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ token_and_position_embedding │ (None, None, 32) │ 646,400 │ │ (TokenAndPositionEmbedding) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout (Dropout) │ (None, None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ transformer_decoder │ (None, None, 32) │ 6,464 │ │ (TransformerDecoder) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ global_average_pooling1d │ (None, 32) │ 0 │ │ (GlobalAveragePooling1D) │ │ │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_2 (Dropout) │ (None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense (Dense) │ (None, 32) │ 1,056 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_3 (Dropout) │ (None, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 33 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 653,953 (2.49 MB)
Trainable params: 653,953 (2.49 MB)
Non-trainable params: 0 (0.00 B)
Accuracy (mixed_bfloat16): 75.56%
我们可以通过一行代码的 API 启用 FP8 训练/推理:model.quantize("float8")
。
model = build_model(**MODEL_KWARGS)
model.quantize("float8")
为了检查 FP8 训练是否发生,我们可以打印出一些与 FP8 训练相关的变量:
*_scale
:将输入、权重和梯度的分布平移到 FP8 可表示范围内的缩放因子。默认值为 1.0
*_amax_history
:用于缩放因子计算的 amax 历史窗口。默认值为 0.0
,长度为 1024。pattern = r"(transformer).+(multi_head).+(query).+(scale|amax_history)"
for v in model.trainable_variables:
if re.findall(pattern, v.path):
print(v.path)
print(keras.ops.convert_to_numpy(v.value))
FP8 层的 dtype 策略也已修改。
for layer in model._flatten_layers(recursive=True):
if "float8" in str(layer.dtype_policy):
print(f"{layer.name}: {layer.dtype_policy}")
feedforward_output_dense: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
feedforward_intermediate_dense: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
attention_output: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
value: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
key: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
query: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
dense_2: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
dense_3: <QuantizedFloat8DTypePolicy "float8_from_mixed_bfloat16">
让我们训练模型并查看结果。我们可以验证在使用 FP8 训练时准确率没有下降,以及包含 FP8 信息的变量在拟合后发生了变化。
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"],
)
history = model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
result = model.evaluate(test_ds)
print(f"Accuracy (float8): {result[1]:.2%}")
for v in model.trainable_variables:
if re.findall(pattern, v.path):
print(v.path)
print(keras.ops.convert_to_numpy(v.value))
Accuracy (float8): 74.16%