Author: Sreyan Ghosh
Date created: 2022/07/01
Last modified: 2022/08/27
Description: Pretraining BERT using Hugging Face Transformers on NSP and MLM.
In the field of computer vision, researchers have repeatedly shown the value of transfer learning: pretraining a neural network model on a known task/dataset, for instance ImageNet classification, and then performing fine-tuning, using the trained neural network as the basis of a new, purpose-specific model. In recent years, researchers have shown that a similar technique can be useful in many natural language tasks.
BERT makes use of a Transformer, an attention mechanism that learns contextual relations between words (or subwords) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in a paper by Google.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional, though it would be more accurate to say that it is non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (to the left and right of the word).
When training language models, one challenge is defining a prediction goal. Many models predict the next word in a sequence (e.g., "The child came home from _"), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
Masked Language Modeling (MLM): Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
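As a rough, word-level illustration of the idea (real BERT masking operates on subword tokens and also sometimes keeps or randomly replaces the selected tokens), the masking step could look like this toy sketch:
import random

# Toy sketch only: replace ~15% of the words in a sentence with [MASK].
words = "The child came home from the park after playing all afternoon".split()
masked = [w if random.random() > 0.15 else "[MASK]" for w in words]
print(" ".join(masked))
# The model is then trained to predict the original word at each [MASK] position.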
Next Sentence Prediction (NSP): In the BERT training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will represent a break from the first sentence.
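A minimal sketch of how such a pair could be built (the labels follow the 🤗 Transformers convention used later in this notebook, where 0 means the second sentence really follows the first and 1 means it was sampled at random):
import random

# Toy sketch only: build one NSP training pair from a tiny "corpus".
doc = ["The man went to the store.", "He bought a gallon of milk."]
random_sentence = "Penguins are flightless birds."
if random.random() < 0.5:
    pair, label = (doc[0], doc[1]), 0  # actual next sentence
else:
    pair, label = (doc[0], random_sentence), 1  # random sentence
print(pair, label)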
Although Google provides pretrained BERT checkpoints for English, you may often need to either pretrain the model from scratch for a different language, or do continued pretraining to fit the model to a new domain. In this notebook, we pretrain BERT from scratch, optimizing both the MLM and NSP objectives, using 🤗 Transformers on the WikiText English dataset loaded from 🤗 Datasets.
pip install git+https://github.com/huggingface/transformers.git
pip install datasets
pip install huggingface-hub
pip install nltk
import nltk
import random
import logging
import tensorflow as tf
from tensorflow import keras
nltk.download("punkt")
# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
# Set random seed
tf.keras.utils.set_random_seed(42)
[nltk_data] Downloading package punkt to /speech/sreyan/nltk_data...
[nltk_data] Package punkt is already up-to-date!
TOKENIZER_BATCH_SIZE = 256 # Batch-size to train the tokenizer on
TOKENIZER_VOCABULARY = 25000 # Total number of unique subwords the tokenizer can have
BLOCK_SIZE = 128 # Maximum number of tokens in an input sample
NSP_PROB = 0.50 # Probability that the next sentence is the actual next sentence in NSP
SHORT_SEQ_PROB = 0.1 # Probability of generating shorter sequences to minimize the mismatch between pretraining and fine-tuning.
MAX_LENGTH = 512 # Maximum number of tokens in an input sample after padding
MLM_PROB = 0.2 # Probability with which tokens are masked in MLM
TRAIN_BATCH_SIZE = 2 # Batch-size for pretraining the model on
MAX_EPOCHS = 1 # Maximum number of epochs to train the model for
LEARNING_RATE = 1e-4 # Learning rate for training the model
MODEL_CHECKPOINT = "bert-base-cased" # Name of pretrained model from 🤗 Model Hub
We now download the WikiText language modeling dataset. It is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia.
We load the dataset from 🤗 Datasets. For the purpose of demonstrating the workflow in this notebook, we only work with the train split of the dataset. This can easily be done with the load_dataset function.
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 0%| | 0.00/4.72M [00:00<?, ?B/s]
Generating test split: 0%| | 0/4358 [00:00<?, ? examples/s]
Generating train split: 0%| | 0/36718 [00:00<?, ? examples/s]
Generating validation split: 0%| | 0/3760 [00:00<?, ? examples/s]
Dataset wikitext downloaded and prepared to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
0%| | 0/3 [00:00<?, ?it/s]
The dataset has just one column of raw text, which is all we need for pretraining BERT!
print(dataset)
DatasetDict({
test: Dataset({
features: ['text'],
num_rows: 4358
})
train: Dataset({
features: ['text'],
num_rows: 36718
})
validation: Dataset({
features: ['text'],
num_rows: 3760
})
})
First we train our own tokenizer from scratch on our corpus, so that we can use it to train our language model from scratch.
But why would you need to train a tokenizer? Because Transformer models very often use subword tokenization algorithms, and those need to be trained to identify the parts of words that frequently occur in the corpus you are using.
The 🤗 Transformers Tokenizer (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and puts them in a format the model expects, as well as generating the other inputs that the model requires.
First we make a list of all the raw documents from the WikiText corpus:
all_texts = [
doc for doc in dataset["train"]["text"] if len(doc) > 0 and not doc.startswith(" =")
]
Next we make a batch_iterator function that will help us train the tokenizer.
def batch_iterator():
for i in range(0, len(all_texts), TOKENIZER_BATCH_SIZE):
yield all_texts[i : i + TOKENIZER_BATCH_SIZE]
In this notebook, we train a tokenizer with the exact same algorithm and parameters as an existing one. For example, we train a new version of the BERT-CASED tokenizer on Wikitext-2 using the same tokenization algorithm.
First we need to load the tokenizer we want to use as a model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 52 files to the new cache system
0%| | 0/52 [00:00<?, ?it/s]
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
Now we train our tokenizer using the entire train split of the Wikitext-2 dataset.
tokenizer = tokenizer.train_new_from_iterator(
batch_iterator(), vocab_size=TOKENIZER_VOCABULARY
)
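As a quick, optional sanity check, we can run the newly trained tokenizer on a sample sentence; the exact subword splits depend on the vocabulary learned from Wikitext-2, so your output may differ:
# Optional sanity check of the retrained tokenizer. The exact subword splits
# depend on the vocabulary learned above, so the output may vary.
sample = "The quick brown fox jumps over the lazy dog."
print(tokenizer.tokenize(sample))
print(tokenizer(sample)["input_ids"])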
So now we are done training our new tokenizer! Next we move on to the data pre-processing steps.
For the sake of demonstrating the workflow, in this notebook we only take small subsets of the WikiText train and validation splits.
dataset["train"] = dataset["train"].select([i for i in range(1000)])
dataset["validation"] = dataset["validation"].select([i for i in range(1000)])
Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. As mentioned earlier, BERT pretraining consists of two tasks in total: the NSP task and the MLM task. 🤗 Transformers provides an easy-to-use collator called DataCollatorForLanguageModeling for MLM. However, we need to get the data ready for NSP manually.
Next we write a simple function called prepare_train_features that helps us with the pre-processing and is compatible with 🤗 Datasets. To summarize, our pre-processing function should build NSP sentence pairs from the raw documents and generate the other inputs the model expects, such as token_type_ids, attention_mask, etc.
# We define the maximum number of tokens after tokenization that each training sample
# will have
max_num_tokens = BLOCK_SIZE - tokenizer.num_special_tokens_to_add(pair=True)
def prepare_train_features(examples):
"""Function to prepare features for NSP task
Arguments:
examples: A dictionary with 1 key ("text")
text: List of raw documents (str)
Returns:
examples: A dictionary with 4 keys
input_ids: List of tokenized, concatenated, and batched
sentences from the individual raw documents (int)
token_type_ids: List of integers (0 or 1) corresponding
to: 0 for sentence no. 1 and padding, 1 for sentence
no. 2
attention_mask: List of integers (0 or 1) corresponding
to: 1 for non-padded tokens, 0 for padded
next_sentence_label: List of integers (0 or 1) corresponding
to: 0 if the second sentence actually follows the first,
1 if the second sentence is sampled from elsewhere in the corpus
"""
# Remove unwanted samples from the training set
examples["document"] = [
d.strip() for d in examples["text"] if len(d) > 0 and not d.startswith(" =")
]
# Split the documents from the dataset into their individual sentences
examples["sentences"] = [
nltk.tokenize.sent_tokenize(document) for document in examples["document"]
]
# Convert the tokens into ids using the trained tokenizer
examples["tokenized_sentences"] = [
[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)) for sent in doc]
for doc in examples["sentences"]
]
# Define the outputs
examples["input_ids"] = []
examples["token_type_ids"] = []
examples["attention_mask"] = []
examples["next_sentence_label"] = []
for doc_index, document in enumerate(examples["tokenized_sentences"]):
current_chunk = []  # a buffer storing the current working segments
current_length = 0
i = 0
# We *usually* want to fill up the entire sequence since we are padding
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
if random.random() < SHORT_SEQ_PROB:
target_seq_length = random.randint(2, max_num_tokens)
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = random.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
if len(current_chunk) == 1 or random.random() < NSP_PROB:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = random.randint(
0, len(examples["tokenized_sentences"]) - 1
)
if random_document_index != doc_index:
break
random_document = examples["tokenized_sentences"][
random_document_index
]
random_start = random.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
input_ids = tokenizer.build_inputs_with_special_tokens(
tokens_a, tokens_b
)
# add token type ids, 0 for sentence a, 1 for sentence b
token_type_ids = tokenizer.create_token_type_ids_from_sequences(
tokens_a, tokens_b
)
padded = tokenizer.pad(
{"input_ids": input_ids, "token_type_ids": token_type_ids},
padding="max_length",
max_length=MAX_LENGTH,
)
examples["input_ids"].append(padded["input_ids"])
examples["token_type_ids"].append(padded["token_type_ids"])
examples["attention_mask"].append(padded["attention_mask"])
examples["next_sentence_label"].append(1 if is_random_next else 0)
current_chunk = []
current_length = 0
i += 1
# We delete all the un-necessary columns from our dataset
del examples["document"]
del examples["sentences"]
del examples["text"]
del examples["tokenized_sentences"]
return examples
tokenized_dataset = dataset.map(
prepare_train_features, batched=True, remove_columns=["text"], num_proc=1,
)
Parameter 'function'=<function prepare_train_features at 0x7fd4a214cb90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
0%| | 0/5 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
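If you want to verify the pre-processing, you can optionally inspect the mapped dataset; it should now contain the four columns produced by prepare_train_features:
# Optional: the processed dataset should now expose input_ids, token_type_ids,
# attention_mask and next_sentence_label.
print(tokenized_dataset)
print(tokenized_dataset["train"][0].keys())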
For MLM we use the same pre-processed dataset as before, with one additional step: we randomly mask some tokens (by replacing them with [MASK]), and the labels are adjusted to only include the masked tokens (we do not need to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the [MASK] token is among the special tokens you passed during training!
To get the data ready for MLM, we simply apply the collator called DataCollatorForLanguageModeling, provided by the 🤗 Transformers library, to our dataset that is already prepared for the NSP task. The collator expects certain parameters; in this notebook we use the defaults from the original BERT paper. return_tensors='tf' ensures that we get tf.Tensor objects back.
from transformers import DataCollatorForLanguageModeling
collater = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROB, return_tensors="tf"
)
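To see what the collator does, you can optionally run it on a single tokenized sentence; roughly MLM_PROB of the tokens come back as the [MASK] id, and labels is -100 everywhere except at the masked positions, where it keeps the original token ids:
# Optional illustration of the MLM masking performed by the collator.
example = tokenizer("The quick brown fox jumps over the lazy dog.")
batch = collater([example])
print(batch["input_ids"])
print(batch["labels"])  # -100 for unmasked positions, original ids for masked ones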
Next we define the datasets with which we train our model. Again, 🤗 Datasets provides the to_tf_dataset method, which helps us integrate our dataset with the collator defined above. The method expects certain parameters: the columns that serve as model inputs, the label columns, the batch size, whether to shuffle, and the collate_fn (our collator).
train = tokenized_dataset["train"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
validation = tokenized_dataset["validation"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
To define our model, we first need to define a config, which specifies certain parameters of the model architecture. This includes parameters such as the number of Transformer layers, the number of attention heads, the hidden dimension, and so on. For this notebook, we try to define the exact config used in the original BERT paper.
We can easily achieve this with the BertConfig class from the 🤗 Transformers library. The from_pretrained() method expects the name of a model. Here we use the same base model we trained our tokenizer from, i.e., bert-base-cased.
from transformers import BertConfig
config = BertConfig.from_pretrained(MODEL_CHECKPOINT)
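The individual architecture hyperparameters can be inspected on this config and, if you want a smaller or cheaper model, overridden when creating it. The values below are only an illustration; the rest of this notebook keeps the default bert-base-cased configuration:
# Optional: inspect (and, if desired, override) architecture hyperparameters.
# The smaller values below are only an example and are not used further on.
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
small_config = BertConfig.from_pretrained(
    MODEL_CHECKPOINT, num_hidden_layers=6, num_attention_heads=6, hidden_size=384
)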
We define our model using the TFBertForPreTraining class from the 🤗 Transformers library. This class handles everything internally, from defining the model to unpacking the inputs and calculating the loss. So we need not do anything ourselves except define the model with the config we want!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining(config)
Now we define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
Finally, all steps are done and we can now start training our model!
model.fit(train, validation_data=validation, epochs=MAX_EPOCHS)
483/483 [==============================] - 96s 141ms/step - loss: 8.3765 - val_loss: 8.5572
<keras.callbacks.History at 0x7fd27c219790>
Our model has now been trained! We suggest training the model on the complete dataset for at least 50 epochs for decent performance. The pretrained model now acts as a language model and is meant to be fine-tuned on a downstream task. It can thus be fine-tuned on any downstream task such as Question Answering, Text Classification, and so on!
Now you can push this model to the 🤗 Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:
model.push_to_hub("pretrained-bert", organization="keras-io")
tokenizer.push_to_hub("pretrained-bert", organization="keras-io")
After you push your model, this is how you can load it in the future!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining.from_pretrained("your-username/my-awesome-model")
Or, since it is a pretrained model that you would typically fine-tune on a downstream task, you can also load it for some other task, for example:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("your-username/my-awesome-model")
In this case, the pretraining head is dropped and the model is initialized with just the Transformer layers. A new task-specific head with random weights is added on top.
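A minimal sketch of what such a fine-tuning step could look like (the number of labels and the tf_train_ds dataset below are placeholders and are not built in this notebook):
# Hypothetical fine-tuning sketch: a freshly initialized classification head is
# trained on your own labeled data. `tf_train_ds` is a placeholder tf.data.Dataset
# yielding (tokenized inputs, labels); it is not created in this notebook.
model = TFBertForSequenceClassification.from_pretrained(
    "your-username/my-awesome-model", num_labels=2
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-5))
# model.fit(tf_train_ds, epochs=3)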