Authors: Matthew Carrigan, Sayak Paul
Date created: 2023/05/21
Last modified: 2023/05/21
Description: Train a masked language model on TPU using 🤗 Transformers.
In this example, we cover how to train a masked language model using TensorFlow, 🤗 Transformers, and TPUs.

TPU training is a useful skill to have: TPU pods are high-performance and extremely scalable, making it easy to train models at any scale, from a few tens of millions of parameters up to truly enormous sizes: Google's PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.

We've previously written a tutorial as well as a Colab example showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. However, our Colab example doesn't contain all the steps needed to train a language model from scratch, such as training the tokenizer. So, we wanted to provide a consolidated example that walks you through every critical step involved.
As in our Colab example, we're taking advantage of TensorFlow's very clean TPU support via XLA and TPUStrategy. We'll also benefit from the fact that the majority of the TensorFlow models in 🤗 Transformers are fully XLA-compatible. So, surprisingly, little extra work is needed to get them to run on TPU.
This example is designed to be scalable and much closer to a realistic training run: although we only use a BERT-sized model by default, the code could be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.
The following diagram gives you a pictorial overview of the steps involved in training a language model with 🤗 Transformers using TensorFlow and TPUs:

(Contents of this example overlap with this blog post.)
We use the WikiText dataset (v1). You can head over to its dataset page on the Hugging Face Hub to explore the dataset.
Since the dataset is already available on the Hub in a compatible format, we can easily load and interact with it using 🤗 datasets. However, training a language model from scratch also requires a separate tokenizer training step. We skip that part in this example for brevity, but here is the gist of the approach: train a tokenizer from scratch on the train split of the dataset. You can find the tokenizer training code here and the tokenizer here. The script also allows you to run it with any compatible dataset from the Hub.
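For reference, a minimal sketch of such a tokenizer training step is shown below, assuming the 🤗 tokenizers library and a Unigram model (matching the tf-tpu/unigram-tokenizer-wikitext tokenizer used later). The WikiText configuration name, vocabulary size, and special-token set here are illustrative assumptions, not necessarily what was used for the hosted tokenizer:

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Load the train split; the exact WikiText configuration is an assumption here.
raw_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# A Unigram tokenizer, trained from scratch on the raw text.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=25_000,  # illustrative value
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "<pad>", "<unk>"],  # assumed set
    unk_token="<unk>",
)


def batch_iterator(batch_size=1_000):
    # Stream the text column in batches to keep memory usage low.
    for i in range(0, len(raw_dataset), batch_size):
        yield raw_dataset[i : i + batch_size]["text"]


tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(raw_dataset))
tokenizer.save("unigram-tokenizer-wikitext.json")
# The saved tokenizer can then be wrapped in transformers.PreTrainedTokenizerFast
# and pushed to the Hub.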
Once the tokenizer is trained, we can use it on all the dataset splits (train, validation, and test in this case) and create TFRecord shards out of them. Spreading each dataset split across multiple TFRecord shards helps with massively parallel processing, as opposed to having each split in a single TFRecord file.
We tokenize the samples individually. We then take a batch of samples, concatenate them together, and split them into several chunks of a fixed size (128 in our case). We follow this strategy rather than tokenizing a batch of samples with a fixed length, to avoid aggressively discarding text content (because of truncation).
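To make the chunking strategy concrete, here is a minimal sketch using the 🤗 datasets map() API. It assumes raw_dataset is the split loaded earlier with a "text" column, and that tokenizer is the trained tokenizer loaded through transformers.AutoTokenizer; the helper names tokenize_function and group_texts are hypothetical and not taken from the actual preparation script:

block_size = 128  # the fixed chunk size mentioned above


def tokenize_function(examples):
    # Tokenize each sample on its own, with no padding and no truncation.
    return tokenizer(examples["text"])


def group_texts(examples):
    # Concatenate a batch of tokenized samples, then split the result into
    # fixed-size blocks, dropping the small remainder at the very end.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


tokenized_dataset = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)
chunked_dataset = tokenized_dataset.map(group_texts, batched=True)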
We then take these tokenized samples in batches and serialize those batches as multiple TFRecord shards, where the total dataset length and the individual shard size determine the number of shards. Finally, these shards are pushed to a Google Cloud Storage (GCS) bucket.
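As a rough illustration of that serialization step (not the exact code from the preparation script), each chunk can be wrapped in a tf.train.Example using the same input_ids and attention_mask feature names that decode_fn expects later, and written out with tf.io.TFRecordWriter. The write_shard helper and the shard file name below are hypothetical:

import tensorflow as tf


def to_tf_example(input_ids, attention_mask):
    # Wrap one tokenized chunk as a tf.train.Example with int64 features.
    features = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(
            int64_list=tf.train.Int64List(value=attention_mask)
        ),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))


def write_shard(samples, path):
    # samples: an iterable of dicts holding "input_ids" and "attention_mask".
    with tf.io.TFRecordWriter(path) as writer:
        for sample in samples:
            example = to_tf_example(sample["input_ids"], sample["attention_mask"])
            writer.write(example.SerializeToString())


# write_shard(shard_samples, "train-0-5000.tfrecord")
# The trailing number encodes the shard's sample count, matching the pattern
# that the count_samples() utility parses later in this example.
# The resulting shard files can then be copied to a GCS bucket,
# for example with: gsutil cp *.tfrecord gs://<your-bucket>/train/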
If you're using a TPU node for training, the data needs to be streamed from a GCS bucket since the node host memory is very small. But for TPU VMs, we can use datasets locally or even attach persistent storage to those VMs. Since TPU nodes (which is what we have in Colab) are still quite heavily used, we're basing our example on using a GCS bucket for data storage.
You can see all of this in code in this script. For convenience, we have also hosted the resultant TFRecord shards in this repository on the Hub.
Once the data is tokenized and serialized into TFRecord shards, we can proceed to training.
Let's start by installing 🤗 Transformers.
!pip install transformers -q
Then, let's import the modules we need.
import os
import re
import tensorflow as tf
import transformers
Next, let's connect to our TPU and determine the distribution strategy:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.TPUStrategy(tpu)
print(f"Available number of replicas: {strategy.num_replicas_in_sync}")
Available number of replicas: 8
We then load the tokenizer. For more details on the tokenizer, check out its repository. For the model, we use RoBERTa (the base variant), introduced in this paper.
tokenizer = "tf-tpu/unigram-tokenizer-wikitext"
pretrained_model_config = "roberta-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer)
config = transformers.AutoConfig.from_pretrained(pretrained_model_config)
config.vocab_size = tokenizer.vocab_size
Downloading (…)okenizer_config.json: 0%| | 0.00/483 [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json: 0%| | 0.00/1.61M [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 0%| | 0.00/286 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 0%| | 0.00/481 [00:00<?, ?B/s]
We now load the TFRecord shards of the WikiText dataset (which the Hugging Face team prepared beforehand for this example):
train_dataset_path = "gs://tf-tpu-training-resources/train"
eval_dataset_path = "gs://tf-tpu-training-resources/validation"
training_records = tf.io.gfile.glob(os.path.join(train_dataset_path, "*.tfrecord"))
eval_records = tf.io.gfile.glob(os.path.join(eval_dataset_path, "*.tfrecord"))
Now, we will write a utility to count the number of training samples we have. We need to know this value in order to properly initialize our optimizer later:
def count_samples(file_list):
    num_samples = 0
    for file in file_list:
        filename = file.split("/")[-1]
        sample_count = re.search(r"-\d+-(\d+)\.tfrecord", filename).group(1)
        sample_count = int(sample_count)
        num_samples += sample_count
    return num_samples
num_train_samples = count_samples(training_records)
print(f"Number of total training samples: {num_train_samples}")
Number of total training samples: 300917
Now, let's prepare our datasets for training and evaluation. Let's start by writing our utilities. First, we need to be able to decode the TFRecords:
max_sequence_length = 512
def decode_fn(example):
    features = {
        "input_ids": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
        "attention_mask": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
    }
    return tf.io.parse_single_example(example, features)
Here, max_sequence_length needs to be the same as the one used when preparing the TFRecord shards. Refer to this script for more details.
Next, we have our masking utility, which is responsible for masking parts of the inputs and preparing labels for the masked language model to learn from. We leverage DataCollatorForLanguageModeling for this purpose.
# We use a standard masking probability of 0.15. `mlm_probability` denotes
# probability with which we mask the input tokens in a sequence.
mlm_probability = 0.15
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=mlm_probability, mlm=True, return_tensors="tf"
)


def mask_with_collator(batch):
    special_tokens_mask = (
        ~tf.cast(batch["attention_mask"], tf.bool)
        | (batch["input_ids"] == tokenizer.cls_token_id)
        | (batch["input_ids"] == tokenizer.sep_token_id)
    )
    batch["input_ids"], batch["labels"] = data_collator.tf_mask_tokens(
        batch["input_ids"],
        vocab_size=len(tokenizer),
        mask_token_id=tokenizer.mask_token_id,
        special_tokens_mask=special_tokens_mask,
    )
    return batch
Now it is time to write the final data preparation utility, which puts it all together in a tf.data.Dataset object:
auto = tf.data.AUTOTUNE
shuffle_buffer_size = 2**18
def prepare_dataset(
    records, decode_fn, mask_fn, batch_size, shuffle, shuffle_buffer_size=None
):
    num_samples = count_samples(records)
    dataset = tf.data.Dataset.from_tensor_slices(records)
    if shuffle:
        dataset = dataset.shuffle(len(dataset))
    dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=auto)
    # TF can't infer the total sample count because it doesn't read
    # all the records yet, so we assert it here.
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(num_samples))
    dataset = dataset.map(decode_fn, num_parallel_calls=auto)
    if shuffle:
        assert shuffle_buffer_size is not None
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(mask_fn, num_parallel_calls=auto)
    dataset = dataset.prefetch(auto)
    return dataset
Let's prepare our datasets with these utilities:
per_replica_batch_size = 16 # Change as needed.
batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
shuffle_buffer_size = 2**18 # Default corresponds to a 1GB buffer for seq_len 512
train_dataset = prepare_dataset(
    training_records,
    decode_fn=decode_fn,
    mask_fn=mask_with_collator,
    batch_size=batch_size,
    shuffle=True,
    shuffle_buffer_size=shuffle_buffer_size,
)

eval_dataset = prepare_dataset(
    eval_records,
    decode_fn=decode_fn,
    mask_fn=mask_with_collator,
    batch_size=batch_size,
    shuffle=False,
)
Now let's investigate what a single batch of the dataset looks like.
single_batch = next(iter(train_dataset))
print(single_batch.keys())
dict_keys(['attention_mask', 'input_ids', 'labels'])
input_ids denotes the tokenized versions of the input samples, which also contain the mask tokens. attention_mask denotes the mask to be used when performing attention operations. labels denotes the actual values of the masked tokens the model is supposed to learn from.

for k in single_batch:
    if k == "input_ids":
        input_ids = single_batch[k]
        print(f"Input shape: {input_ids.shape}")
    if k == "labels":
        labels = single_batch[k]
        print(f"Label shape: {labels.shape}")
Input shape: (128, 512)
Label shape: (128, 512)
Now, we can leverage our tokenizer to investigate the values of the tokens. Let's start with input_ids:
idx = 0
print("Taking the first sample:\n")
print(tokenizer.decode(input_ids[idx].numpy()))
Taking the first sample:
they called the character of Tsugum[MASK] one of the[MASK] tragic heroines[MASK] had encountered in a game. Chandran ranked the game as the third best role @[MASK][MASK] playing game from the sixth generation of video[MASK] consoles, saying that it was his favorite in the[MASK]Infinity[MASK], and one his favorite[MASK] games overall[MASK].[MASK]
[SEP][CLS][SEP][CLS][SEP][CLS] =[MASK] Sea party 1914[MASK]– 16 =
[SEP][CLS][SEP][CLS] The Ross Sea party was a component of Sir[MASK] Shackleton's Imperial Trans @-@ Antarctic Expedition 1914 garde 17.[MASK] task was to lay a series of supply depots across the Great Ice Barrier from the Ross Sea to the Beardmore Glacier, along the[MASK] route established by earlier Antarctic expeditions[MASK]. The expedition's main party, under[MASK], was to land[MASK]on the opposite, Weddell Sea coast of Antarctica [MASK] and to march across the continent via the South[MASK] to the Ross Sea. As the main party would be un[MASK] to carry[MASK] fuel and supplies for the whole distance[MASK], their survival depended on the Ross Sea party's depots[MASK][MASK][MASK] would cover the[MASK] quarter of their journey.
[SEP][CLS][MASK] set sail from London on[MASK] ship Endurance, bound[MASK] the Weddell Sea in August 1914. Meanwhile, the Ross Sea party[MASK] gathered in Australia, prior[MASK] Probabl for the Ross Sea in[MASK] second expedition ship, SY Aurora. Organisational and financial problems[MASK]ed their[MASK] until December 1914, which shortened their first depot @-@[MASK] season.[MASK][MASK] arrival the inexperienced party struggle[MASK] to master the art of Antarctic travel, in the[MASK] losing most of their sledge dogs [MASK]อ greater misfortune[MASK]ed when, at the onset of the southern winter, Aurora[MASK] torn from its [MASK]ings during [MASK] severe storm and was un[MASK] to return, leaving the shore party stranded.
[SEP][CLS] Crossroadspite[MASK] setbacks, the Ross Sea party survived inter @-@ personnel disputes, extreme weather[MASK], illness, and Pay deaths of three of its members to carry[MASK] its[MASK] in full during its[MASK] Antarctic season. This success proved ultimate[MASK] without purpose, because Shackleton's Grimaldi expedition was un
As expected, the decoded tokens contain the special tokens, including the mask tokens. Let's now investigate the mask tokens:
# Taking the first 30 tokens of the first sequence.
print(labels[0].numpy()[:30])
[-100 -100 -100 -100 -100 -100 -100 -100 -100 43 -100 -100 -100 -100
351 -100 -100 -100 99 -100 -100 -100 -100 -100 -100 -100 -100 -100
-100 -100]
Here, -100 means that the corresponding tokens in the input_ids are NOT masked, and values other than -100 denote the actual values of the masked tokens.
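As an optional sanity check (a small sketch using the labels tensor from the batch above), we can compute the fraction of positions that carry a label; it should be roughly mlm_probability (0.15), a bit lower in practice because special and padded tokens are never masked:

# Count how many positions in the batch were selected for masking.
masked_positions = tf.cast(labels != -100, tf.float32)
fraction_masked = tf.reduce_sum(masked_positions) / tf.cast(tf.size(labels), tf.float32)
print(f"Fraction of masked tokens: {fraction_masked.numpy():.3f}")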
With the datasets prepared, we now initialize and compile our model and optimizer within the strategy.scope():
# For this example, we keep this value to 10. But for a realistic run, start with 500.
num_epochs = 10
steps_per_epoch = num_train_samples // (
    per_replica_batch_size * strategy.num_replicas_in_sync
)
total_train_steps = steps_per_epoch * num_epochs
learning_rate = 0.0001
weight_decay_rate = 1e-3
with strategy.scope():
    model = transformers.TFAutoModelForMaskedLM.from_config(config)
    model(
        model.dummy_inputs
    )  # Pass some dummy inputs through the model to ensure all the weights are built
    optimizer, schedule = transformers.create_optimizer(
        num_train_steps=total_train_steps,
        num_warmup_steps=total_train_steps // 20,
        init_lr=learning_rate,
        weight_decay_rate=weight_decay_rate,
    )
    model.compile(optimizer=optimizer, metrics=["accuracy"])
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
A couple of things to note here:
* The create_optimizer() function creates an Adam optimizer with a learning rate schedule that uses a warmup phase followed by a linear decay. Since we're using weight decay here, under the hood, create_optimizer() instantiates the right variant of Adam to enable weight decay.
* While compiling the model, we're NOT using any loss argument. This is because the TensorFlow models internally compute the loss when the expected labels are provided. Based on the model type and the labels being used, transformers will automatically infer the loss to use.
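If you want to see the warmup-then-linear-decay behaviour for yourself, the schedule object returned by create_optimizer() can be called with a step number to get the learning rate at that step. A small, optional sketch:

# The peak learning rate is reached at the end of warmup (total_train_steps // 20),
# after which it decays linearly towards zero.
for step in [0, total_train_steps // 40, total_train_steps // 20, total_train_steps - 1]:
    print(f"step {step}: learning rate = {schedule(step).numpy():.2e}")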
Next, we set up a handy callback to push the intermediate training checkpoints to the Hugging Face Hub. To be able to use this callback, we need to log in to our Hugging Face account (if you don't have one, you can create one here for free). Execute the code below to log in:
from huggingface_hub import notebook_login
notebook_login()
Let's now define the PushToHubCallback:
hub_model_id = output_dir = "masked-lm-tpu"
callbacks = []
callbacks.append(
    transformers.PushToHubCallback(
        output_dir=output_dir, hub_model_id=hub_model_id, tokenizer=tokenizer
    )
)
Cloning https://hugging-face.cn/sayakpaul/masked-lm-tpu into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://hugging-face.cn/sayakpaul/masked-lm-tpu into local empty directory.
Download file tf_model.h5: 0%| | 15.4k/477M [00:00<?, ?B/s]
Clean file tf_model.h5: 0%| | 1.00k/477M [00:00<?, ?B/s]
And now, we're ready to put our TPUs to work!
# In the interest of the runtime of this example,
# we limit the number of batches to just 2.
model.fit(
    train_dataset.take(2),
    validation_data=eval_dataset.take(2),
    epochs=num_epochs,
    callbacks=callbacks,
)
# After training we also serialize the final model.
model.save_pretrained(output_dir)
Epoch 1/10
2/2 [==============================] - 96s 35s/step - loss: 10.2116 - accuracy: 0.0000e+00 - val_loss: 10.1957 - val_accuracy: 2.2888e-05
Epoch 2/10
2/2 [==============================] - 9s 2s/step - loss: 10.2017 - accuracy: 0.0000e+00 - val_loss: 10.1798 - val_accuracy: 0.0000e+00
Epoch 3/10
2/2 [==============================] - ETA: 0s - loss: 10.1890 - accuracy: 7.6294e-06
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 9.1679s). Check your callbacks.
2/2 [==============================] - 35s 27s/step - loss: 10.1890 - accuracy: 7.6294e-06 - val_loss: 10.1604 - val_accuracy: 1.5259e-05
Epoch 4/10
2/2 [==============================] - 8s 2s/step - loss: 10.1733 - accuracy: 1.5259e-05 - val_loss: 10.1145 - val_accuracy: 7.6294e-06
Epoch 5/10
2/2 [==============================] - 34s 26s/step - loss: 10.1336 - accuracy: 1.5259e-05 - val_loss: 10.0666 - val_accuracy: 7.6294e-06
Epoch 6/10
2/2 [==============================] - 10s 2s/step - loss: 10.0906 - accuracy: 6.1035e-05 - val_loss: 10.0200 - val_accuracy: 5.4169e-04
Epoch 7/10
2/2 [==============================] - 33s 25s/step - loss: 10.0360 - accuracy: 6.1035e-04 - val_loss: 9.9646 - val_accuracy: 0.0049
Epoch 8/10
2/2 [==============================] - 8s 2s/step - loss: 9.9830 - accuracy: 0.0038 - val_loss: 9.8938 - val_accuracy: 0.0155
Epoch 9/10
2/2 [==============================] - 33s 26s/step - loss: 9.9067 - accuracy: 0.0116 - val_loss: 9.8225 - val_accuracy: 0.0198
Epoch 10/10
2/2 [==============================] - 8s 2s/step - loss: 9.8302 - accuracy: 0.0196 - val_loss: 9.7454 - val_accuracy: 0.0215
Once your training is complete, you can easily perform inference like so:
from transformers import pipeline
# Replace your `model_id` here.
# Here, we're using a model that the Hugging Face team trained for longer.
model_id = "tf-tpu/roberta-base-epochs-500-no-wd"
unmasker = pipeline("fill-mask", model=model_id, framework="tf")
print(unmasker("Goal of my life is to [MASK]."))
Downloading (…)lve/main/config.json: 0%| | 0.00/649 [00:00<?, ?B/s]
Downloading tf_model.h5: 0%| | 0.00/500M [00:00<?, ?B/s]
All model checkpoint layers were used when initializing TFRobertaForMaskedLM.
All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at tf-tpu/roberta-base-epochs-500-no-wd.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
Downloading (…)okenizer_config.json: 0%| | 0.00/683 [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json: 0%| | 0.00/1.61M [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 0%| | 0.00/286 [00:00<?, ?B/s]
[{'score': 0.10031876713037491, 'token': 52, 'token_str': 'be', 'sequence': 'Goal of my life is to be.'}, {'score': 0.032648470252752304, 'token': 5, 'token_str': '', 'sequence': 'Goal of my life is to .'}, {'score': 0.02152678370475769, 'token': 138, 'token_str': 'work', 'sequence': 'Goal of my life is to work.'}, {'score': 0.019547568634152412, 'token': 984, 'token_str': 'act', 'sequence': 'Goal of my life is to act.'}, {'score': 0.01939115859568119, 'token': 73, 'token_str': 'have', 'sequence': 'Goal of my life is to have.'}]
And that's it!