
Training a language model from scratch with 🤗 Transformers and TPUs

Authors: Matthew Carrigan, Sayak Paul
Date created: 2023/05/21
Last modified: 2023/05/21
Description: Train a masked language model on TPU using 🤗 Transformers.

ⓘ This example uses Keras 2

View in Colab • GitHub source


Introduction

In this example, we cover how to train a masked language model using TensorFlow, 🤗 Transformers, and TPUs.

TPU training is a useful skill to have: TPU pods are high-performance and extremely scalable, making it easy to train models at any scale, from a few tens of millions of parameters up to truly enormous sizes: Google's PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.

We've previously written a tutorial and a Colab example showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. However, our Colab example doesn't contain all the steps needed to train a language model from scratch, such as training the tokenizer. So, we wanted to provide a consolidated example that walks you through every critical step involved.

As in our Colab example, we take advantage of TensorFlow's very clean TPU support via XLA and TPUStrategy. We also benefit from the fact that the majority of the TensorFlow models in 🤗 Transformers are fully XLA-compatible. So, surprisingly little extra work is needed to get them to run on TPU.

This example is designed to be scalable and much closer to a realistic training run: although we only use a BERT-sized model by default, the code can be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.

The following diagram gives you a pictorial overview of the steps involved in training a language model with 🤗 Transformers using TensorFlow and TPUs:

https://hugging-face.cn/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png

(Content of this example overlaps with this blog post).


Data

We use the WikiText dataset (v1). You can head over to the dataset page on the Hugging Face Hub to explore the dataset.

(Figure: a preview of the WikiText dataset on the Hugging Face Hub)

Since the dataset is already available on the Hub in a compatible format, we can easily load and interact with it using 🤗 datasets. However, training a language model from scratch also requires a separate tokenizer training step. We skip that part in this example for brevity, but here's an outline of how the tokenizer can be trained from scratch.

You can find the tokenizer training code here and the tokenizer here. The script also allows you to run it with any compatible dataset from the Hub.
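
For reference, here is a minimal sketch of what that tokenizer-training step could look like with the 🤗 tokenizers library. The vocabulary size, special tokens, and normalization pipeline below are illustrative assumptions rather than the exact settings of the linked script:

from datasets import load_dataset
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram

# Load the raw WikiText training split from the Hub.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# A Unigram tokenizer with a simple normalization / pre-tokenization pipeline.
# All of these settings are illustrative; see the linked script for the real ones.
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=25_000,  # assumed value
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)


def batch_iterator(batch_size=1000):
    # Stream the raw text to the trainer in batches.
    for i in range(0, len(wikitext), batch_size):
        yield wikitext[i : i + batch_size]["text"]


tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(wikitext))
tokenizer.save("unigram-tokenizer-wikitext.json")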


Tokenize the data and create TFRecords

Once the tokenizer is trained, we can use it on all the dataset splits (train, validation, and test in this case) and create TFRecord shards out of them. Having the dataset splits spread across multiple TFRecord shards helps with massively parallel processing as opposed to keeping each split in a single TFRecord file.

We tokenize the samples individually. We then take a batch of samples, concatenate them together, and split them into several chunks of a fixed size (128 in our case). We follow this strategy rather than tokenizing a batch of samples with a fixed length, in order to avoid aggressively discarding text content (because of truncation).
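
To make that concrete, here is a minimal sketch of the concatenate-and-chunk step, assuming a trained tokenizer loaded with AutoTokenizer and a 🤗 datasets split with a "text" column (note that the hosted TFRecord shards used later in this example contain 512-token sequences):

block_size = 128


def tokenize_function(examples):
    # Tokenize each raw sample individually, without padding or truncation.
    return tokenizer(examples["text"])


def group_texts(examples):
    # Concatenate all tokenized samples in the batch...
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # ...drop the small remainder and split into fixed-size blocks.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Usage (illustrative):
# tokenized = raw_split.map(tokenize_function, batched=True, remove_columns=["text"])
# chunked = tokenized.map(group_texts, batched=True)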

We then take these tokenized samples in batches and serialize those batches as multiple TFRecord shards, where the total dataset length and the individual shard size determine the number of shards. Finally, these shards are pushed to a Google Cloud Storage (GCS) bucket.
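
Below is a minimal sketch of how one chunked sample could be serialized into a TFRecord shard; the feature names match the decode_fn used later in this example, while the shard naming scheme and the upload to GCS are omitted:

import tensorflow as tf


def serialize_example(input_ids, attention_mask):
    # Wrap one tokenized sample as a tf.train.Example with two int64 features,
    # using the same feature names that `decode_fn` expects further below.
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(
            int64_list=tf.train.Int64List(value=attention_mask)
        ),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_shard(samples, path):
    # Write one shard; the number of shards is roughly
    # total_samples / samples_per_shard.
    with tf.io.TFRecordWriter(path) as writer:
        for input_ids, attention_mask in samples:
            writer.write(serialize_example(input_ids, attention_mask))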

If you're using a TPU node for training, the data needs to be streamed from a GCS bucket since the node host memory is very small. With TPU VMs, however, we can use datasets locally or even attach persistent storage to those VMs. Since TPU nodes (which is what we have in Colab) are still quite heavily used, we base our example on using a GCS bucket for data storage.

You can see all of this in code in this script. For convenience, we have also hosted the resulting TFRecord shards in this repository on the Hub.

Once the data is tokenized and serialized into TFRecord shards, we can proceed to training.


Training

Setup and imports

Let's start by installing 🤗 Transformers.

!pip install transformers -q

Then, let's import the modules we need.

import os
import re

import tensorflow as tf

import transformers

Initialize the TPU

Let's connect to our TPU and determine the distribution strategy:

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

strategy = tf.distribute.TPUStrategy(tpu)

print(f"Available number of replicas: {strategy.num_replicas_in_sync}")
Available number of replicas: 8

We then load the tokenizer. For more details on the tokenizer, check out its repository. For the model, we use RoBERTa (the base variant), introduced in this paper.

Initialize the tokenizer

tokenizer = "tf-tpu/unigram-tokenizer-wikitext"
pretrained_model_config = "roberta-base"

tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer)
config = transformers.AutoConfig.from_pretrained(pretrained_model_config)
config.vocab_size = tokenizer.vocab_size
Downloading (…)okenizer_config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.61M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Prepare the datasets

We now load the TFRecord shards of the WikiText dataset (which the Hugging Face team prepared beforehand for this example):

train_dataset_path = "gs://tf-tpu-training-resources/train"
eval_dataset_path = "gs://tf-tpu-training-resources/validation"

training_records = tf.io.gfile.glob(os.path.join(train_dataset_path, "*.tfrecord"))
eval_records = tf.io.gfile.glob(os.path.join(eval_dataset_path, "*.tfrecord"))

Now, we will write a utility to count the number of training samples we have. We need to know this value in order to properly initialize our optimizer later:

def count_samples(file_list):
    num_samples = 0
    for file in file_list:
        filename = file.split("/")[-1]
        sample_count = re.search(r"-\d+-(\d+)\.tfrecord", filename).group(1)
        sample_count = int(sample_count)
        num_samples += sample_count

    return num_samples


num_train_samples = count_samples(training_records)
print(f"Number of total training samples: {num_train_samples}")
Number of total training samples: 300917

Let's now prepare our datasets for training and evaluation. We start by writing our utilities. First, we need to be able to decode the TFRecords:

max_sequence_length = 512


def decode_fn(example):
    features = {
        "input_ids": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
        "attention_mask": tf.io.FixedLenFeature(
            dtype=tf.int64, shape=(max_sequence_length,)
        ),
    }
    return tf.io.parse_single_example(example, features)

Here, max_sequence_length needs to be the same as the one used when preparing the TFRecord shards. Refer to this script for more details.

Next up, we have our masking utility, which is responsible for masking parts of the inputs and preparing labels for the masked language model to learn from. We leverage DataCollatorForLanguageModeling for this purpose.

# We use a standard masking probability of 0.15. `mlm_probability` denotes
# probability with which we mask the input tokens in a sequence.
mlm_probability = 0.15
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=mlm_probability, mlm=True, return_tensors="tf"
)


def mask_with_collator(batch):
    special_tokens_mask = (
        ~tf.cast(batch["attention_mask"], tf.bool)
        | (batch["input_ids"] == tokenizer.cls_token_id)
        | (batch["input_ids"] == tokenizer.sep_token_id)
    )
    batch["input_ids"], batch["labels"] = data_collator.tf_mask_tokens(
        batch["input_ids"],
        vocab_size=len(tokenizer),
        mask_token_id=tokenizer.mask_token_id,
        special_tokens_mask=special_tokens_mask,
    )
    return batch

Now is the time to write the final data preparation utility that puts it all together into a tf.data.Dataset object:

auto = tf.data.AUTOTUNE
shuffle_buffer_size = 2**18


def prepare_dataset(
    records, decode_fn, mask_fn, batch_size, shuffle, shuffle_buffer_size=None
):
    num_samples = count_samples(records)
    dataset = tf.data.Dataset.from_tensor_slices(records)
    if shuffle:
        dataset = dataset.shuffle(len(dataset))
    dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=auto)
    # TF can't infer the total sample count because it doesn't read
    #  all the records yet, so we assert it here.
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(num_samples))
    dataset = dataset.map(decode_fn, num_parallel_calls=auto)
    if shuffle:
        assert shuffle_buffer_size is not None
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(mask_fn, num_parallel_calls=auto)
    dataset = dataset.prefetch(auto)
    return dataset

Let's prepare our datasets using these utilities:

per_replica_batch_size = 16  # Change as needed.
batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
shuffle_buffer_size = 2**18  # Default corresponds to a 1GB buffer for seq_len 512

train_dataset = prepare_dataset(
    training_records,
    decode_fn=decode_fn,
    mask_fn=mask_with_collator,
    batch_size=batch_size,
    shuffle=True,
    shuffle_buffer_size=shuffle_buffer_size,
)

eval_dataset = prepare_dataset(
    eval_records,
    decode_fn=decode_fn,
    mask_fn=mask_with_collator,
    batch_size=batch_size,
    shuffle=False,
)

Let's now investigate what a single batch of the dataset looks like.

single_batch = next(iter(train_dataset))
print(single_batch.keys())
dict_keys(['attention_mask', 'input_ids', 'labels'])
  • input_ids holds the tokenized versions of the input samples, which also contain the mask tokens.
  • attention_mask denotes the mask to be used when performing attention operations.
  • labels denotes the actual values of the masked tokens that the model is supposed to learn from.
for k in single_batch:
    if k == "input_ids":
        input_ids = single_batch[k]
        print(f"Input shape: {input_ids.shape}")
    if k == "labels":
        labels = single_batch[k]
        print(f"Label shape: {labels.shape}")
Input shape: (128, 512)
Label shape: (128, 512)

Now, we can leverage our tokenizer to investigate the values of the tokens. Let's start with input_ids:

idx = 0
print("Taking the first sample:\n")
print(tokenizer.decode(input_ids[idx].numpy()))
Taking the first sample:
they called the character of Tsugum[MASK] one of the[MASK] tragic heroines[MASK] had encountered in a game. Chandran ranked the game as the third best role @[MASK][MASK] playing game from the sixth generation of video[MASK] consoles, saying that it was his favorite in the[MASK]Infinity[MASK], and one his favorite[MASK] games overall[MASK].[MASK]
[SEP][CLS][SEP][CLS][SEP][CLS] =[MASK] Sea party 1914[MASK]– 16 = 
[SEP][CLS][SEP][CLS] The Ross Sea party was a component of Sir[MASK] Shackleton's Imperial Trans @-@ Antarctic Expedition 1914  garde 17.[MASK] task was to lay a series of supply depots across the Great Ice Barrier from the Ross Sea to the Beardmore Glacier, along the[MASK] route established by earlier Antarctic expeditions[MASK]. The expedition's main party, under[MASK], was to land[MASK]on the opposite, Weddell Sea coast of Antarctica [MASK] and to march across the continent via the South[MASK] to the Ross Sea. As the main party would be un[MASK] to carry[MASK] fuel and supplies for the whole distance[MASK], their survival depended on the Ross Sea party's depots[MASK][MASK][MASK] would cover the[MASK] quarter of their journey. 
[SEP][CLS][MASK] set sail from London on[MASK] ship Endurance, bound[MASK] the Weddell Sea in August 1914. Meanwhile, the Ross Sea party[MASK] gathered in Australia, prior[MASK] Probabl for the Ross Sea in[MASK] second expedition ship, SY Aurora. Organisational and financial problems[MASK]ed their[MASK] until December 1914, which shortened their first depot @-@[MASK] season.[MASK][MASK] arrival the inexperienced party struggle[MASK] to master the art of Antarctic travel, in the[MASK] losing most of their sledge dogs [MASK]อ greater misfortune[MASK]ed when, at the onset of the southern winter, Aurora[MASK] torn from its [MASK]ings during [MASK] severe storm and was un[MASK] to return, leaving the shore party stranded. 
[SEP][CLS] Crossroadspite[MASK] setbacks, the Ross Sea party survived inter @-@ personnel disputes, extreme weather[MASK], illness, and Pay deaths of three of its members to carry[MASK] its[MASK] in full during its[MASK] Antarctic season. This success proved ultimate[MASK] without purpose, because Shackleton's Grimaldi expedition was un

As expected, the decoded tokens contain the special tokens, including the mask tokens. Let's now investigate the mask tokens:

# Taking the first 30 tokens of the first sequence.
print(labels[0].numpy()[:30])
[-100 -100 -100 -100 -100 -100 -100 -100 -100   43 -100 -100 -100 -100
  351 -100 -100 -100   99 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100]

Here, -100 means that the corresponding tokens in the input_ids are **not** masked, and non -100 values denote the actual values of the masked tokens.
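
As a quick, illustrative sanity check (not part of the original script), the fraction of positions carrying a real label should be roughly equal to mlm_probability:

# Roughly 15% of the positions should carry a label (i.e. not be -100),
# matching `mlm_probability` up to the exclusion of special/padding tokens.
masked_fraction = tf.reduce_mean(tf.cast(labels != -100, tf.float32))
print(f"Observed masking rate: {masked_fraction.numpy():.3f}")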


Initialize the model and the optimizer

With the datasets prepared, we now initialize and compile our model and optimizer within strategy.scope():

# For this example, we keep this value to 10. But for a realistic run, start with 500.
num_epochs = 10
steps_per_epoch = num_train_samples // (
    per_replica_batch_size * strategy.num_replicas_in_sync
)
total_train_steps = steps_per_epoch * num_epochs
learning_rate = 0.0001
weight_decay_rate = 1e-3

with strategy.scope():
    model = transformers.TFAutoModelForMaskedLM.from_config(config)
    model(
        model.dummy_inputs
    )  # Pass some dummy inputs through the model to ensure all the weights are built
    optimizer, schedule = transformers.create_optimizer(
        num_train_steps=total_train_steps,
        num_warmup_steps=total_train_steps // 20,
        init_lr=learning_rate,
        weight_decay_rate=weight_decay_rate,
    )
    model.compile(optimizer=optimizer, metrics=["accuracy"])
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

A couple of things to note here:

  • The create_optimizer() function creates an Adam optimizer with a learning rate schedule that uses a warmup phase followed by a linear decay. Since we're using weight decay, under the hood create_optimizer() instantiates the right variant of Adam to enable weight decay.
  • While compiling the model, we're **not** using any loss argument. This is because the TensorFlow models internally compute the loss when expected labels are provided. Based on the model type and the labels being used, transformers will automatically infer the loss to use.
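
To see the warmup-then-linear-decay behaviour for yourself, you can query the returned schedule at a few step indices (an illustrative check, not part of the original example):

# `schedule` is a Keras LearningRateSchedule, so it can be called with a step index.
for step in [0, total_train_steps // 20, total_train_steps // 2, total_train_steps - 1]:
    print(f"step {step:>6}: learning rate = {schedule(step).numpy():.2e}")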

Start training!

Next, we set up a handy callback to push the intermediate training checkpoints to the Hugging Face Hub. To be able to use this callback, we need to log in to our Hugging Face account (if you don't have one, you can create one here for free). Execute the code below for logging in:

from huggingface_hub import notebook_login

notebook_login()

Let's now define the PushToHubCallback:

hub_model_id = output_dir = "masked-lm-tpu"

callbacks = []
callbacks.append(
    transformers.PushToHubCallback(
        output_dir=output_dir, hub_model_id=hub_model_id, tokenizer=tokenizer
    )
)
Cloning https://hugging-face.cn/sayakpaul/masked-lm-tpu into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://hugging-face.cn/sayakpaul/masked-lm-tpu into local empty directory.

Download file tf_model.h5:   0%|          | 15.4k/477M [00:00<?, ?B/s]

Clean file tf_model.h5:   0%|          | 1.00k/477M [00:00<?, ?B/s]

And now, we're ready to make the most of our TPUs.

# In the interest of the runtime of this example,
# we limit the number of batches to just 2.
model.fit(
    train_dataset.take(2),
    validation_data=eval_dataset.take(2),
    epochs=num_epochs,
    callbacks=callbacks,
)

# After training we also serialize the final model.
model.save_pretrained(output_dir)
Epoch 1/10
2/2 [==============================] - 96s 35s/step - loss: 10.2116 - accuracy: 0.0000e+00 - val_loss: 10.1957 - val_accuracy: 2.2888e-05
Epoch 2/10
2/2 [==============================] - 9s 2s/step - loss: 10.2017 - accuracy: 0.0000e+00 - val_loss: 10.1798 - val_accuracy: 0.0000e+00
Epoch 3/10
2/2 [==============================] - ETA: 0s - loss: 10.1890 - accuracy: 7.6294e-06

WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0045s vs `on_train_batch_end` time: 9.1679s). Check your callbacks.

2/2 [==============================] - 35s 27s/step - loss: 10.1890 - accuracy: 7.6294e-06 - val_loss: 10.1604 - val_accuracy: 1.5259e-05
Epoch 4/10
2/2 [==============================] - 8s 2s/step - loss: 10.1733 - accuracy: 1.5259e-05 - val_loss: 10.1145 - val_accuracy: 7.6294e-06
Epoch 5/10
2/2 [==============================] - 34s 26s/step - loss: 10.1336 - accuracy: 1.5259e-05 - val_loss: 10.0666 - val_accuracy: 7.6294e-06
Epoch 6/10
2/2 [==============================] - 10s 2s/step - loss: 10.0906 - accuracy: 6.1035e-05 - val_loss: 10.0200 - val_accuracy: 5.4169e-04
Epoch 7/10
2/2 [==============================] - 33s 25s/step - loss: 10.0360 - accuracy: 6.1035e-04 - val_loss: 9.9646 - val_accuracy: 0.0049
Epoch 8/10
2/2 [==============================] - 8s 2s/step - loss: 9.9830 - accuracy: 0.0038 - val_loss: 9.8938 - val_accuracy: 0.0155
Epoch 9/10
2/2 [==============================] - 33s 26s/step - loss: 9.9067 - accuracy: 0.0116 - val_loss: 9.8225 - val_accuracy: 0.0198
Epoch 10/10
2/2 [==============================] - 8s 2s/step - loss: 9.8302 - accuracy: 0.0196 - val_loss: 9.7454 - val_accuracy: 0.0215

Once your training is complete, you can easily perform inference like so:

from transformers import pipeline

# Replace your `model_id` here.
# Here, we're using a model that the Hugging Face team trained for longer.
model_id = "tf-tpu/roberta-base-epochs-500-no-wd"
unmasker = pipeline("fill-mask", model=model_id, framework="tf")
print(unmasker("Goal of my life is to [MASK]."))
Downloading (…)lve/main/config.json:   0%|          | 0.00/649 [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/500M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.
All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at tf-tpu/roberta-base-epochs-500-no-wd.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.

Downloading (…)okenizer_config.json:   0%|          | 0.00/683 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.61M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

[{'score': 0.10031876713037491, 'token': 52, 'token_str': 'be', 'sequence': 'Goal of my life is to be.'}, {'score': 0.032648470252752304, 'token': 5, 'token_str': '', 'sequence': 'Goal of my life is to .'}, {'score': 0.02152678370475769, 'token': 138, 'token_str': 'work', 'sequence': 'Goal of my life is to work.'}, {'score': 0.019547568634152412, 'token': 984, 'token_str': 'act', 'sequence': 'Goal of my life is to act.'}, {'score': 0.01939115859568119, 'token': 73, 'token_str': 'have', 'sequence': 'Goal of my life is to have.'}]

And that's it!

If you liked this example, we encourage you to check out the full codebase here and the accompanying blog post here.