Abstractive Summarization with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 with Hugging Face Transformers for abstractive summarization.

ⓘ This example uses Keras 2



Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization is also known as Abstractive Summarization and has seen growing interest among researchers in various disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 shows impressive results in a variety of sequence-to-sequence tasks (sequence in this notebook refers to text) such as summarization, translation, etc.
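
To make the text-to-text format concrete, here is a small sketch for illustration (the input/output pairs below are hypothetical examples, not taken from the T5 paper) of how different tasks are cast as plain-text pairs with a task-specific prefix:

# Illustrative only: every T5 task is phrased as "input text -> output text",
# and a task prefix tells the model which task to perform.
task_examples = {
    "summarization": (
        "summarize: The quick brown fox jumped over the lazy dog near the river.",
        "A fox jumped over a dog.",
    ),
    "translation": (
        "translate English to German: That is good.",
        "Das ist gut.",
    ),
}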

In this notebook, we will fine-tune the pretrained T5 on the abstractive summarization task using Hugging Face Transformers on the XSum dataset loaded from Hugging Face Datasets.


Setup

Installing the requirements

!pip install transformers==4.20.0
!pip install keras-hub
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Importing the necessary libraries

import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Defining certain variables

# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

Load the dataset

We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary) that is typically written by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate our sequence-to-sequence abstractive summarization approach.
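
To get a feel for what ROUGE measures before we train anything, here is a minimal sketch using the rouge-score package installed above (the prediction sentence is made up for illustration):

from rouge_score import rouge_scorer

# ROUGE-L scores the longest-common-subsequence overlap between a
# reference summary and a generated summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Clean-up operations are continuing after flooding caused by Storm Frank."
prediction = "Clean-up continues after flooding from Storm Frank."
print(scorer.score(reference, prediction))
# e.g. {'rougeL': Score(precision=..., recall=..., fmeasure=...)}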

We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can be easily done with the load_dataset function.

from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

  • document: the original BBC article to be summarized
  • summary: the single-sentence summary of the BBC article
  • id: ID of the document-summary pair

print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

We will now see how the data looks:

print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on [email protected] or [email protected].', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}

For the sake of demonstrating the workflow, in this notebook we will only take a small stratified balanced split (10%) of the train set as our training and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column to stratify on.

raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)

Data pre-processing

Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT declared earlier, and we will just pass that.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
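
As a quick sanity check (not part of the original workflow), we can tokenize a short sentence and decode it back to see what the tokenizer produces:

sample = tokenizer("summarize: The weather in Scotland was terrible this week.")
print(sample["input_ids"])  # token IDs from the pretrained T5 vocabulary
print(sample["attention_mask"])  # 1 for every real token (no padding here)
print(tokenizer.decode(sample["input_ids"]))  # round-trips to the text plus the "</s>" end token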

If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple function that helps us in the pre-processing, compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

  • Tokenize the text dataset (input and targets) into its corresponding token IDs, which will be used for embedding look-ups in T5
  • Add the prefix to the tokens
  • Create additional inputs for the model like token_type_ids, attention_mask, etc.

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the pairs of sentences in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in dataset, so our training and test data will be pre-processed in one single command.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
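
If you want to verify what the pre-processing added (a quick check, not in the original example), you can inspect the keys of a processed example; the original columns are kept and the tokenized ones are appended:

print(tokenized_datasets["train"][0].keys())
# dict_keys(['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'])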

Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.

from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

For training sequence-to-sequence models, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. return_tensors='tf' ensures that we get tf.Tensor objects back.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
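
To see what the collator does, here is a minimal sketch on two hand-made feature dictionaries of different lengths (the token IDs are arbitrary, chosen only for illustration):

# Toy features of unequal length, for illustration only
features = [
    {"input_ids": [100, 200, 300], "labels": [4, 5]},
    {"input_ids": [100], "labels": [4, 5, 6, 7]},
]
batch = data_collator(features)
print(batch["input_ids"])  # inputs padded with the tokenizer's pad token (0 for T5)
print(batch["labels"])  # labels padded with -100 so the padding is ignored by the loss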

Next we define our training and test sets, with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which will help us integrate our dataset with the collator defined above. The method expects certain parameters:

  • columns: the columns which will serve as our independent variables
  • batch_size: our batch size for training
  • shuffle: whether we want to shuffle our dataset
  • collate_fn: our collator function

Additionally, we also define a relatively smaller generation_dataset to calculate ROUGE scores on the fly during training.

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Building and compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled internally by the model, so we need not worry about that!

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Training and evaluating the model

To evaluate our model on the fly while training, we will define metric_fn, which will calculate the ROUGE score between the ground truth and the predictions.

import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

And now we can finally start training our model!

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.

2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065

<keras.callbacks.History at 0x7f1d002f9810>

For best results, we recommend training the model for at least 5 epochs on the entire training dataset!
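
A minimal sketch of what that would look like, reusing the preprocess_function, data collator, and callbacks defined above (training on the full dataset takes considerably longer):

# Re-load the full dataset with its official train/validation splits
full_raw = load_dataset("xsum")
full_tokenized = full_raw.map(preprocess_function, batched=True)

full_train = full_tokenized["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
full_val = full_tokenized["validation"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)

model.fit(full_train, validation_data=full_val, epochs=5, callbacks=callbacks)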


Inference

Now we will try out the trained model on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides us with a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes in the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)

[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]

Now you can push this model to the Hugging Face Model Hub and also share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:

model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")

After you push your model to the Hub, you can load it in the future like this!

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")