Author: Sreyan Ghosh
Date created: 2022/07/01
Last modified: 2022/08/27
Description: Pretraining BERT using Hugging Face Transformers on NSP and MLM.
In the field of computer vision, researchers have repeatedly shown the value of transfer learning: pretraining a neural network model on a known task/dataset (for instance, ImageNet classification) and then fine-tuning it, using the trained network as the basis of a new, purpose-specific model. In recent years, researchers have shown that a similar technique is useful in many natural language tasks.
BERT makes use of a Transformer, an attention mechanism that learns contextual relations between words (or subwords) in a text. In its vanilla form, a Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to produce a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in a paper by Google.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional, though it would be more accurate to say it is non-directional. This characteristic allows the model to learn the context of a word from all of its surroundings (to the left and right of the word).
When training language models, one challenge is defining a prediction goal. Many models predict the next word in a sequence (e.g. "The child came home from _"), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
Masked Language Modeling (MLM): before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
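To make the masking step concrete, here is a tiny illustrative sketch (not part of the original notebook) of the corruption applied to a token sequence; the full BERT recipe also sometimes substitutes a random token or keeps the original, which we omit here.
import random
tokens = ["the", "child", "came", "home", "from", "school", "."]
# Replace roughly 15% of the tokens with [MASK]; the model must recover the originals.
masked = [tok if random.random() >= 0.15 else "[MASK]" for tok in tokens]
print(masked)  # e.g. ['the', 'child', 'came', '[MASK]', 'from', 'school', '.']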
Next Sentence Prediction (NSP): during BERT training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence really is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first one.
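As a rough sketch of the pairing logic (the notebook's actual implementation is the prepare_train_features function further below), assuming a document already split into sentences and a pool of unrelated sentences:
import random
document = ["BERT uses two training objectives.", "One of them is next sentence prediction."]
random_pool = ["The weather was unusually warm.", "He ordered a second coffee."]
sentence_a = document[0]
if random.random() < 0.5:
    sentence_b, label = document[1], 0  # B really follows A (label 0 in the Hugging Face convention)
else:
    sentence_b, label = random.choice(random_pool), 1  # B is a random sentence (label 1)
print(sentence_a, "|", sentence_b, "| next_sentence_label =", label)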
Although Google provides pretrained BERT checkpoints for English, you may often need to pretrain the model from scratch for a different language, or do continued pretraining to adapt the model to a new domain. In this notebook, we pretrain BERT from scratch, optimizing both the MLM and NSP objectives with 🤗 Transformers on the WikiText English dataset loaded from 🤗 Datasets.
pip install git+https://github.com/huggingface/transformers.git
pip install datasets
pip install huggingface-hub
pip install nltk
import nltk
import random
import logging
import tensorflow as tf
from tensorflow import keras
nltk.download("punkt")
# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
# Set random seed
tf.keras.utils.set_random_seed(42)
[nltk_data] Downloading package punkt to /speech/sreyan/nltk_data...
[nltk_data] Package punkt is already up-to-date!
TOKENIZER_BATCH_SIZE = 256 # Batch-size to train the tokenizer on
TOKENIZER_VOCABULARY = 25000 # Total number of unique subwords the tokenizer can have
BLOCK_SIZE = 128 # Maximum number of tokens in an input sample
NSP_PROB = 0.50 # Probability that the next sentence is the actual next sentence in NSP
SHORT_SEQ_PROB = 0.1 # Probability of generating shorter sequences to minimize the mismatch between pretraining and fine-tuning.
MAX_LENGTH = 512 # Maximum number of tokens in an input sample after padding
MLM_PROB = 0.2 # Probability with which tokens are masked in MLM
TRAIN_BATCH_SIZE = 2 # Batch-size for pretraining the model on
MAX_EPOCHS = 1 # Maximum number of epochs to train the model for
LEARNING_RATE = 1e-4 # Learning rate for training the model
MODEL_CHECKPOINT = "bert-base-cased" # Name of pretrained model from 🤗 Model Hub
We now download the WikiText language modeling dataset. It is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia.
We load the dataset from 🤗 Datasets. For demonstration purposes in this notebook, we only use the train split of the dataset. This can be easily done with the load_dataset function.
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 0%| | 0.00/4.72M [00:00<?, ?B/s]
Generating test split: 0%| | 0/4358 [00:00<?, ? examples/s]
Generating train split: 0%| | 0/36718 [00:00<?, ? examples/s]
Generating validation split: 0%| | 0/3760 [00:00<?, ? examples/s]
Dataset wikitext downloaded and prepared to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
0%| | 0/3 [00:00<?, ?it/s]
The dataset has just one column, the raw text, which is all we need for pretraining BERT!
print(dataset)
DatasetDict({
test: Dataset({
features: ['text'],
num_rows: 4358
})
train: Dataset({
features: ['text'],
num_rows: 36718
})
validation: Dataset({
features: ['text'],
num_rows: 3760
})
})
First, we train our own tokenizer from scratch on our corpus, so that we can use it to train our language model from scratch.
But why would you need to train a tokenizer? Transformer models very often use subword tokenization algorithms, and these need to be trained to identify the parts of words that occur frequently in the corpus you are working with.
The 🤗 Transformers Tokenizer (as the name indicates) will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs the model requires.
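To get a feel for what subword tokenization looks like, the short snippet below (an illustration only; it loads the same bert-base-cased tokenizer that we also load further down) tokenizes a sentence containing words that are unlikely to appear in the vocabulary as whole units:
from transformers import AutoTokenizer
demo_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Rare words are broken into smaller pieces; continuation pieces carry the "##" prefix.
print(demo_tokenizer.tokenize("Pretraining tokenizers on WikiText is straightforward"))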
First, we make a list of all the raw documents from the WikiText corpus:
all_texts = [
doc for doc in dataset["train"]["text"] if len(doc) > 0 and not doc.startswith(" =")
]
Next, we create a batch_iterator function that will help us train the tokenizer.
def batch_iterator():
for i in range(0, len(all_texts), TOKENIZER_BATCH_SIZE):
yield all_texts[i : i + TOKENIZER_BATCH_SIZE]
In this notebook, we train a tokenizer with the exact same algorithm and parameters as an existing one. For example, we train a new version of the BERT-CASED tokenizer on Wikitext-2, using the same tokenization algorithm.
First, we need to load the tokenizer we want to base our new tokenizer on:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 52 files to the new cache system
0%| | 0/52 [00:00<?, ?it/s]
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
Now we train our tokenizer on the entire train split of the Wikitext-2 dataset.
tokenizer = tokenizer.train_new_from_iterator(
batch_iterator(), vocab_size=TOKENIZER_VOCABULARY
)
We are now done training our new tokenizer! Next, we move on to the data pre-processing steps.
For the sake of demonstrating the workflow, in this notebook we only use small subsets of the train and validation splits of WikiText.
dataset["train"] = dataset["train"].select([i for i in range(1000)])
dataset["validation"] = dataset["validation"].select([i for i in range(1000)])
Before we can feed these texts to our model, we need to pre-process them and get them ready for the task. As mentioned earlier, BERT pretraining consists of two tasks, the NSP task and the MLM task. For MLM, 🤗 Transformers provides an easy-to-use collator called DataCollatorForLanguageModeling. For NSP, however, we need to prepare the data manually.
Next, we write a simple function called prepare_train_features that helps us with the pre-processing and is compatible with 🤗 Datasets. To summarize, our pre-processing function should (1) get the dataset ready for the NSP task by creating pairs of sentences (A, B), where half of the time B actually follows A and half of the time B is a random sentence sampled from elsewhere in the corpus, together with the corresponding label; (2) tokenize the text into the token IDs that will be used for embedding look-up in BERT; and (3) create the additional inputs the model needs, such as token_type_ids, attention_mask, etc.
# We define the maximum number of tokens after tokenization that each training sample
# will have
max_num_tokens = BLOCK_SIZE - tokenizer.num_special_tokens_to_add(pair=True)
def prepare_train_features(examples):
"""Function to prepare features for NSP task
Arguments:
examples: A dictionary with 1 key ("text")
text: List of raw documents (str)
Returns:
examples: A dictionary with 4 keys
input_ids: List of tokenized, concatenated, and batched
sentences from the individual raw documents (int)
token_type_ids: List of integers (0 or 1) corresponding
to: 0 for sentence no. 1 and padding, 1 for sentence
no. 2
attention_mask: List of integers (0 or 1) corresponding
to: 1 for non-padded tokens, 0 for padded
next_sentence_label: List of integers (0 or 1) corresponding
to: 0 if the second sentence actually follows the first,
1 if the second sentence is sampled from elsewhere in the corpus
"""
# Remove unwanted samples from the training set
examples["document"] = [
d.strip() for d in examples["text"] if len(d) > 0 and not d.startswith(" =")
]
# Split the documents from the dataset into their individual sentences
examples["sentences"] = [
nltk.tokenize.sent_tokenize(document) for document in examples["document"]
]
# Convert the tokens into ids using the trained tokenizer
examples["tokenized_sentences"] = [
[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)) for sent in doc]
for doc in examples["sentences"]
]
# Define the outputs
examples["input_ids"] = []
examples["token_type_ids"] = []
examples["attention_mask"] = []
examples["next_sentence_label"] = []
for doc_index, document in enumerate(examples["tokenized_sentences"]):
current_chunk = [] # a buffer stored current working segments
current_length = 0
i = 0
# We *usually* want to fill up the entire sequence since we are padding
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
if random.random() < SHORT_SEQ_PROB:
target_seq_length = random.randint(2, max_num_tokens)
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = random.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
if len(current_chunk) == 1 or random.random() < NSP_PROB:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = random.randint(
0, len(examples["tokenized_sentences"]) - 1
)
if random_document_index != doc_index:
break
random_document = examples["tokenized_sentences"][
random_document_index
]
random_start = random.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
input_ids = tokenizer.build_inputs_with_special_tokens(
tokens_a, tokens_b
)
# add token type ids, 0 for sentence a, 1 for sentence b
token_type_ids = tokenizer.create_token_type_ids_from_sequences(
tokens_a, tokens_b
)
padded = tokenizer.pad(
{"input_ids": input_ids, "token_type_ids": token_type_ids},
padding="max_length",
max_length=MAX_LENGTH,
)
examples["input_ids"].append(padded["input_ids"])
examples["token_type_ids"].append(padded["token_type_ids"])
examples["attention_mask"].append(padded["attention_mask"])
examples["next_sentence_label"].append(1 if is_random_next else 0)
current_chunk = []
current_length = 0
i += 1
# We delete all the unnecessary columns from our dataset
del examples["document"]
del examples["sentences"]
del examples["text"]
del examples["tokenized_sentences"]
return examples
tokenized_dataset = dataset.map(
prepare_train_features, batched=True, remove_columns=["text"], num_proc=1,
)
Parameter 'function'=<function prepare_train_features at 0x7fd4a214cb90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
0%| | 0/5 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
For MLM, we use the same pre-processing for our dataset as before, with one additional step: we randomly mask some tokens (replacing them with [MASK]), and the labels are adjusted to include only the masked tokens (we do not have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the [MASK] token is among the special tokens you passed during training!
To get the data ready for MLM, we simply use the collator called DataCollatorForLanguageModeling, provided by the 🤗 Transformers library, on our dataset that is already prepared for the NSP task. The collator expects certain parameters; in this notebook we use the defaults from the original BERT paper. return_tensors='tf' ensures that we get tf.Tensor objects back.
from transformers import DataCollatorForLanguageModeling
collater = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROB, return_tensors="tf"
)
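To see what the collator actually produces, you can run it by hand on a couple of processed examples (a quick inspection sketch, not in the original notebook). The returned labels tensor holds the original token IDs at the masked positions and -100 everywhere else, so the loss ignores non-masked tokens.
# Keep only the keys the collator needs to mask and pad.
features = [
    {
        "input_ids": tokenized_dataset["train"][i]["input_ids"],
        "token_type_ids": tokenized_dataset["train"][i]["token_type_ids"],
        "attention_mask": tokenized_dataset["train"][i]["attention_mask"],
    }
    for i in range(2)
]
batch = collater(features)
print(batch["input_ids"].shape)  # (2, MAX_LENGTH), with some ids replaced by the [MASK] id
print(batch["labels"][0][:20])  # mostly -100; original ids appear only at masked positions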
Next, we define the training set with which we train our model. Again, 🤗 Datasets provides us with the to_tf_dataset method, which integrates our dataset with the collator defined above. The method expects certain parameters: columns (the columns that will serve as our independent variables), label_cols (the columns that will serve as our labels), batch_size (our batch size for training), shuffle (whether we want to shuffle the training dataset), and collate_fn (our collator function).
train = tokenized_dataset["train"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
validation = tokenized_dataset["validation"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
To define our model, we first need to define a config that specifies certain parameters of the model architecture, such as the number of Transformer layers, the number of attention heads, the hidden dimension, and so on. For this notebook, we try to define exactly the configuration described in the original BERT paper.
We can easily achieve this with the BertConfig class from the 🤗 Transformers library. The from_pretrained() method expects the name of a model. Here we use the simplest such model, bert-base-cased, which is the same checkpoint we used for the tokenizer.
from transformers import BertConfig
config = BertConfig.from_pretrained(MODEL_CHECKPOINT)
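If you ever want to deviate from the bert-base-cased architecture (we do not do this in the notebook), the same hyperparameters can also be set explicitly via the standard BertConfig arguments; in particular, if you want the embedding matrix to match the freshly trained tokenizer, you could set vocab_size accordingly. A hedged sketch:
from transformers import BertConfig
custom_config = BertConfig(
    vocab_size=TOKENIZER_VOCABULARY,  # assumption: align the embeddings with the new tokenizer
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
print(config.num_hidden_layers, config.num_attention_heads)  # inspect the loaded config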
To define our model, we use the TFBertForPreTraining class from the 🤗 Transformers library. This class handles everything internally, from defining the model to unpacking the inputs and computing the loss. So we need not do anything ourselves except define the model with the config we want!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining(config)
Now we define the optimizer and compile the model. The loss computation is handled internally, so we need not worry about it!
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
Finally, all the steps are done and we can start training our model!
model.fit(train, validation_data=validation, epochs=MAX_EPOCHS)
483/483 [==============================] - 96s 141ms/step - loss: 8.3765 - val_loss: 8.5572
<keras.callbacks.History at 0x7fd27c219790>
Our model has now been trained! We suggest training the model on the complete dataset for at least 50 epochs for decent performance. The pretrained model now acts as a language model and is meant to be fine-tuned on a downstream task. It can thus be fine-tuned on any downstream task, such as question answering or text classification!
You can now push this model to the 🤗 Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:
model.push_to_hub("pretrained-bert", organization="keras-io")
tokenizer.push_to_hub("pretrained-bert", organization="keras-io")
After you push your model, this is how you can load it in the future!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining.from_pretrained("your-username/my-awesome-model")
Or, since it is a pretrained model that you would generally use for fine-tuning on a downstream task, you can also load it for some other task, for example:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("your-username/my-awesome-model")
In this case, the pretraining head will be dropped and the model will be initialized with just the Transformer layers. A new task-specific head will be added with random weights.
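As a hypothetical end-to-end illustration (not part of the original notebook), fine-tuning such a classification model would then look roughly like this, assuming you have a tokenized, labeled tf.data.Dataset at hand:
from transformers import TFBertForSequenceClassification
# Load the pretrained encoder with a fresh, randomly initialized classification head.
clf_model = TFBertForSequenceClassification.from_pretrained(
    "your-username/my-awesome-model", num_labels=2
)
clf_model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-5))
# clf_model.fit(labeled_train_dataset, validation_data=labeled_val_dataset, epochs=3)  # hypothetical datasets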