Author: Sreyan Ghosh
Date created: 2022/07/01
Last modified: 2022/08/27
Description: Pretraining BERT using Hugging Face Transformers on NSP and MLM.
In computer vision, researchers have repeatedly shown the value of transfer learning: pretraining a neural network on a known task/dataset, such as ImageNet classification, and then fine-tuning it, using the trained network as the basis for a new, purpose-specific model. In recent years, researchers have shown that a similar technique can be useful in many natural language tasks.
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or subwords) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in a paper by Google.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it is non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (to the left and right of the word).
When training language models, one challenge is defining a prediction goal. Many models predict the next word in a sequence (e.g. "The child came home from _"), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
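To make the idea concrete, here is a tiny, purely illustrative sketch of random masking (a toy example, not BERT's exact scheme, which also sometimes keeps or randomly replaces the selected tokens):

import random

# Toy illustration only: mask roughly 15% of the tokens in a sentence.
tokens = "the child came home from school".split()
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(masked)  # e.g. ['the', 'child', 'came', '[MASK]', 'from', 'school']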
During BERT training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence is indeed the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will represent a disconnect from the first sentence.
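A rough sketch of how such sentence pairs could be built (a toy illustration; the actual preprocessing used later in this notebook is the prepare_train_features function):

import random

# Toy illustration only: build one NSP training pair.
document = ["Sentence A.", "Sentence B."]      # consecutive sentences from one document
random_pool = ["Some unrelated sentence."]     # sentences from elsewhere in the corpus

first = document[0]
if random.random() < 0.5:
    second, label = document[1], 0                 # 0 -> the second sentence really follows
else:
    second, label = random.choice(random_pool), 1  # 1 -> the second sentence is random
print(first, second, label)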
Although Google provides pretrained BERT checkpoints for English, you may often need to pretrain the model from scratch for a different language, or do continued pretraining to adapt the model to a new domain. In this notebook, we pretrain BERT from scratch, optimizing both the MLM and NSP objectives, on the WikiText English dataset (loaded from 🤗 Datasets) using 🤗 Transformers.
pip install git+https://github.com/huggingface/transformers.git
pip install datasets
pip install huggingface-hub
pip install nltk
import nltk
import random
import logging
import tensorflow as tf
from tensorflow import keras
nltk.download("punkt")
# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
# Set random seed
tf.keras.utils.set_random_seed(42)
[nltk_data] Downloading package punkt to /speech/sreyan/nltk_data...
[nltk_data] Package punkt is already up-to-date!
TOKENIZER_BATCH_SIZE = 256 # Batch-size to train the tokenizer on
TOKENIZER_VOCABULARY = 25000 # Total number of unique subwords the tokenizer can have
BLOCK_SIZE = 128 # Maximum number of tokens in an input sample
NSP_PROB = 0.50 # Probability of picking a random sentence (rather than the actual next one) as the second sentence in NSP
SHORT_SEQ_PROB = 0.1 # Probability of generating shorter sequences to minimize the mismatch between pretraining and fine-tuning.
MAX_LENGTH = 512 # Maximum number of tokens in an input sample after padding
MLM_PROB = 0.2 # Probability with which tokens are masked in MLM
TRAIN_BATCH_SIZE = 2 # Batch-size for pretraining the model on
MAX_EPOCHS = 1 # Maximum number of epochs to train the model for
LEARNING_RATE = 1e-4 # Learning rate for training the model
MODEL_CHECKPOINT = "bert-base-cased" # Name of pretrained model from 🤗 Model Hub
We now download the WikiText language modeling dataset. It is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia.
We load the dataset from 🤗 Datasets. For demonstration purposes in this notebook, we only use the train split of the dataset. This can easily be done with the load_dataset function.
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 0%| | 0.00/4.72M [00:00<?, ?B/s]
Generating test split: 0%| | 0/4358 [00:00<?, ? examples/s]
Generating train split: 0%| | 0/36718 [00:00<?, ? examples/s]
Generating validation split: 0%| | 0/3760 [00:00<?, ? examples/s]
Dataset wikitext downloaded and prepared to /speech/sreyan/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
0%| | 0/3 [00:00<?, ?it/s]
The dataset just has one column, the raw text, and this is all we need for pretraining BERT!
print(dataset)
DatasetDict({
test: Dataset({
features: ['text'],
num_rows: 4358
})
train: Dataset({
features: ['text'],
num_rows: 36718
})
validation: Dataset({
features: ['text'],
num_rows: 3760
})
})
First we train our own tokenizer from scratch on our corpus, so that we can use it to train our language model from scratch.
But why would you need to train a tokenizer? That's because Transformer models very often use subword tokenization algorithms, and they need to be trained to identify the parts of words that occur frequently in the corpus you are using.
The 🤗 Transformers Tokenizer (as the name indicates) will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.
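For example (an illustrative call using the stock bert-base-cased tokenizer, which we also load a few cells below), a single sentence is turned into input_ids, token_type_ids and attention_mask:

from transformers import AutoTokenizer

# Illustrative only: the notebook trains its own WikiText tokenizer below.
demo_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(demo_tokenizer("BERT is pretrained on WikiText."))
# -> {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}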
To begin, we list all the raw documents from the WikiText corpus:
all_texts = [
doc for doc in dataset["train"]["text"] if len(doc) > 0 and not doc.startswith(" =")
]
Next, we create a batch_iterator function that will aid us in training our tokenizer.
def batch_iterator():
for i in range(0, len(all_texts), TOKENIZER_BATCH_SIZE):
yield all_texts[i : i + TOKENIZER_BATCH_SIZE]
In this notebook, we train a tokenizer with the exact same algorithm and parameters as an existing one. For example, we train a new version of the BERT-CASED tokenizer on Wikitext-2, using the same tokenization algorithm.
First, we need to load the tokenizer we want to use as a starting point:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 52 files to the new cache system
0%| | 0/52 [00:00<?, ?it/s]
vocab_file vocab.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
Now we train our tokenizer using the entire train split of the Wikitext-2 dataset.
tokenizer = tokenizer.train_new_from_iterator(
batch_iterator(), vocab_size=TOKENIZER_VOCABULARY
)
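As a quick, optional sanity check (the exact subword splits depend on the training corpus and the vocabulary size chosen above), we can tokenize a sample document with the retrained tokenizer:

# Optional sanity check on the retrained tokenizer.
sample = all_texts[0][:100]
print(tokenizer.tokenize(sample))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sample)))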
And with that, we have finished training our new tokenizer! Next, we move on to the data preprocessing steps.
To demonstrate the workflow, in this notebook we only take small subsets of the entire WikiText train and validation splits.
dataset["train"] = dataset["train"].select([i for i in range(1000)])
dataset["validation"] = dataset["validation"].select([i for i in range(1000)])
Before we can feed those texts to our model, we need to preprocess them and get them ready for the task. As mentioned earlier, BERT pretraining consists of two tasks in total, the NSP task and the MLM task. 🤗 Transformers provides an easy-to-use collator for MLM called DataCollatorForLanguageModeling. However, we need to prepare the data for NSP manually.
Next, we write a simple function called prepare_train_features that helps us with the preprocessing and is compatible with 🤗 Datasets. To summarize, our preprocessing function should get the dataset ready for the NSP task (by creating sentence pairs together with the corresponding 0/1 label), tokenize the raw text into the token IDs that BERT uses for embedding look-up, and add the additional inputs the model requires, such as token_type_ids, attention_mask, etc.
# We define the maximum number of tokens after tokenization that each training sample will have
max_num_tokens = BLOCK_SIZE - tokenizer.num_special_tokens_to_add(pair=True)
def prepare_train_features(examples):
"""Function to prepare features for NSP task
Arguments:
examples: A dictionary with 1 key ("text")
text: List of raw documents (str)
Returns:
examples: A dictionary with 4 keys
input_ids: List of tokenized, concatenated, and batched
sentences from the individual raw documents (int)
token_type_ids: List of integers (0 or 1) corresponding
to: 0 for sentence no. 1 and padding, 1 for sentence
no. 2
attention_mask: List of integers (0 or 1) corresponding
to: 1 for non-padded tokens, 0 for padded
next_sentence_label: List of integers (0 or 1) corresponding
to: 0 if the second sentence actually follows the first,
1 if the second sentence is sampled from somewhere else in the corpus
"""
# Remove unwanted samples from the training set
examples["document"] = [
d.strip() for d in examples["text"] if len(d) > 0 and not d.startswith(" =")
]
# Split the documents from the dataset into their individual sentences
examples["sentences"] = [
nltk.tokenize.sent_tokenize(document) for document in examples["document"]
]
# Convert the tokens into ids using the trained tokenizer
examples["tokenized_sentences"] = [
[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)) for sent in doc]
for doc in examples["sentences"]
]
# Define the outputs
examples["input_ids"] = []
examples["token_type_ids"] = []
examples["attention_mask"] = []
examples["next_sentence_label"] = []
for doc_index, document in enumerate(examples["tokenized_sentences"]):
current_chunk = []  # a buffer storing the current working segments
current_length = 0
i = 0
# We *usually* want to fill up the entire sequence since we are padding
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
if random.random() < SHORT_SEQ_PROB:
target_seq_length = random.randint(2, max_num_tokens)
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = random.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
if len(current_chunk) == 1 or random.random() < NSP_PROB:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = random.randint(
0, len(examples["tokenized_sentences"]) - 1
)
if random_document_index != doc_index:
break
random_document = examples["tokenized_sentences"][
random_document_index
]
random_start = random.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
input_ids = tokenizer.build_inputs_with_special_tokens(
tokens_a, tokens_b
)
# add token type ids, 0 for sentence a, 1 for sentence b
token_type_ids = tokenizer.create_token_type_ids_from_sequences(
tokens_a, tokens_b
)
padded = tokenizer.pad(
{"input_ids": input_ids, "token_type_ids": token_type_ids},
padding="max_length",
max_length=MAX_LENGTH,
)
examples["input_ids"].append(padded["input_ids"])
examples["token_type_ids"].append(padded["token_type_ids"])
examples["attention_mask"].append(padded["attention_mask"])
examples["next_sentence_label"].append(1 if is_random_next else 0)
current_chunk = []
current_length = 0
i += 1
# We delete all the unnecessary columns from our dataset
del examples["document"]
del examples["sentences"]
del examples["text"]
del examples["tokenized_sentences"]
return examples
tokenized_dataset = dataset.map(
prepare_train_features, batched=True, remove_columns=["text"], num_proc=1,
)
Parameter 'function'=<function prepare_train_features at 0x7fd4a214cb90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
0%| | 0/5 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
0%| | 0/1 [00:00<?, ?ba/s]
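To verify the NSP preprocessing, you can optionally decode one processed example (the exact text and label vary with the random sampling above):

# Optional: inspect one preprocessed NSP example. The [CLS]/[SEP] tokens and
# the padding come from build_inputs_with_special_tokens and tokenizer.pad.
sample = tokenized_dataset["train"][0]
print(tokenizer.decode(sample["input_ids"])[:300])
print("next_sentence_label:", sample["next_sentence_label"])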
For MLM, we use the same preprocessing as before for our dataset with one additional step: we randomly mask some tokens (by replacing them with [MASK]), and the labels are adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the [MASK] token is among the special tokens you passed during training!
To get the data ready for MLM, we simply use the collator called DataCollatorForLanguageModeling provided by the 🤗 Transformers library on the dataset that is already prepared for the NSP task. The collator expects certain parameters; in this notebook we largely follow the setup of the original BERT paper. return_tensors='tf' ensures that we get tf.Tensor objects back.
from transformers import DataCollatorForLanguageModeling
collater = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROB, return_tensors="tf"
)
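To see what the collator produces, you can optionally run it on a couple of preprocessed examples (which tokens get masked is random on every call):

# Optional: apply the MLM collator to two preprocessed examples. Masked
# positions get a label equal to the original token id; every other label
# position is set to -100 and is ignored by the loss.
batch = collater([tokenized_dataset["train"][i] for i in range(2)])
print(batch.keys())
print(batch["labels"][0][:20])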
Next, we define the training set with which we train our model. Again, 🤗 Datasets provides us with the to_tf_dataset method, which helps integrate our dataset with the collator defined above. The method expects certain parameters: the columns that act as model inputs, the label_cols that act as labels, the batch_size, whether to shuffle, and the collate_fn to apply to each batch.
train = tokenized_dataset["train"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
validation = tokenized_dataset["validation"].to_tf_dataset(
columns=["input_ids", "token_type_ids", "attention_mask"],
label_cols=["labels", "next_sentence_label"],
batch_size=TRAIN_BATCH_SIZE,
shuffle=True,
collate_fn=collater,
)
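As a quick check of the resulting tf.data pipeline (optional; the shapes follow from TRAIN_BATCH_SIZE and MAX_LENGTH defined above, and with several label_cols the labels are expected to come back as a dict):

# Optional: pull one batch to confirm the structure the model will see.
for inputs, labels in train.take(1):
    print({k: v.shape for k, v in inputs.items()})
    print({k: v.shape for k, v in labels.items()})  # 'labels' (MLM) and 'next_sentence_label'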
To define our model, we first need to define a config that specifies certain parameters of the model architecture. This includes parameters such as the number of Transformer layers, the number of attention heads, the hidden dimension, and so on. In this notebook, we try to define the exact config used in the original BERT paper.
We can easily achieve this using the BertConfig class from the 🤗 Transformers library. The from_pretrained() method expects the name of a model. Here we use the same model we trained our tokenizer from, namely bert-base-cased.
from transformers import BertConfig
config = BertConfig.from_pretrained(MODEL_CHECKPOINT)
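It can be useful to inspect the architecture hyperparameters that the config carries (for bert-base-cased these are the standard base-size values); individual fields can also be overridden before the model is built, as sketched below (the smaller config is purely hypothetical and not used in this notebook):

# bert-base-cased: 12 Transformer layers, 12 attention heads, hidden size 768.
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)

# Hypothetical example of overriding fields for a smaller, faster model:
# small_config = BertConfig.from_pretrained(MODEL_CHECKPOINT, num_hidden_layers=6)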
To define our model, we use the TFBertForPreTraining class from the 🤗 Transformers library. This class internally handles everything from defining the model to unpacking the inputs and calculating the loss. So we need not do anything ourselves except define the model with the config we want!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining(config)
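As an optional sanity check on the freshly initialized (and still untrained) model, a dummy forward pass shows its two pretraining heads: the MLM logits over the vocabulary and the binary NSP logits.

# Optional: run a dummy input through the untrained model to confirm the
# shapes produced by the two pretraining heads.
dummy = tokenizer("A short test sentence.", return_tensors="tf")
outputs = model(dummy)
print(outputs.prediction_logits.shape)        # (1, seq_len, vocab_size) - MLM head
print(outputs.seq_relationship_logits.shape)  # (1, 2)                   - NSP head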
Now we define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
With all the steps done, we can now start training our model!
model.fit(train, validation_data=validation, epochs=MAX_EPOCHS)
483/483 [==============================] - 96s 141ms/step - loss: 8.3765 - val_loss: 8.5572
<keras.callbacks.History at 0x7fd27c219790>
Our model has now been trained! We suggest training the model on the complete dataset for at least 50 epochs for decent performance. The pretrained model now acts as a language model and is meant to be fine-tuned on a downstream task. Thus it can now be fine-tuned on any downstream task such as Question Answering, Text Classification, etc.!
Now you can push this model to the 🤗 Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:
model.push_to_hub("pretrained-bert", organization="keras-io")
tokenizer.push_to_hub("pretrained-bert", organization="keras-io")
After you push your model, this is how you can load it in the future!
from transformers import TFBertForPreTraining
model = TFBertForPreTraining.from_pretrained("your-username/my-awesome-model")
Or, since it's a pretrained model and you would generally use it for fine-tuning on a downstream task, you can also load it for some other task, such as:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("your-username/my-awesome-model")
In this case, the pretraining head will be dropped and the model will just be initialized with the Transformer layers. A new task-specific head will be added with random weights.
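A minimal fine-tuning sketch under these assumptions (the hub identifier, the two-class setup, and the downstream dataset are hypothetical placeholders, not part of this notebook):

import tensorflow as tf
from transformers import TFBertForSequenceClassification

# Hypothetical sketch: load the pretrained backbone with a fresh 2-class
# classification head and compile it for a downstream task.
clf = TFBertForSequenceClassification.from_pretrained(
    "your-username/my-awesome-model", num_labels=2
)
clf.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# clf.fit(downstream_train_dataset, epochs=3)  # hypothetical tf.data pipeline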