
Abstractive Text Summarization with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 using Hugging Face Transformers for abstractive summarization.

ⓘ This example uses Keras 2



Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization, also known as Abstractive Summarization, has seen growing interest among researchers across disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. The Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 has shown impressive results on a variety of sequence-to-sequence tasks (sequence here refers to text), such as summarization and translation.
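
As a minimal, purely illustrative sketch of this text-to-text format, every task is cast as "task prefix + input text" -> "output text". The strings below are made up (the translation pair follows the convention shown in the T5 paper):

# Illustrative only: T5 casts every task into the same text-to-text format.
text_to_text_examples = [
    {
        "input": "summarize: The River Cree overflowed into Newton Stewart ...",
        "target": "Clean-up operations are continuing after flooding ...",
    },
    {
        "input": "translate English to German: The house is wonderful.",
        "target": "Das Haus ist wunderbar.",
    },
]

for example in text_to_text_examples:
    print(example["input"], "->", example["target"])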

In this notebook, we will fine-tune the pretrained T5 for the abstractive summarization task using Hugging Face Transformers on the XSum dataset loaded from Hugging Face Datasets.


Setup

Installing the requirements

!pip install transformers==4.20.0
!pip install keras_hub==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Importing the necessary libraries

import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Define certain variables

# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

Load the dataset

We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka the summary), which is usually professionally written by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we evaluate our sequence-to-sequence abstractive summarization approach with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.
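
As a quick, self-contained sketch of what ROUGE measures, the rouge-score package installed above can score a candidate summary against a reference (both sentences here are made up for illustration):

from rouge_score import rouge_scorer

# ROUGE-1 counts overlapping unigrams; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Clean-up operations are continuing after flooding caused by Storm Frank.",  # reference
    "Clean-up work continues after flooding from Storm Frank.",  # candidate
)
print(scores)  # precision, recall and F1 for each ROUGE variant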

We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can be easily done with the load_dataset function.

from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

  • document: the original BBC article to be summarized
  • summary: the single-sentence summary of the BBC article
  • id: the ID of the document-summary pair

print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

We will now see what the data looks like:

print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on [email protected] or [email protected].', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}

For the purpose of demonstrating the workflow, in this notebook we will only take a small stratified balanced split (10%) of the train set as our training and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column relative to which you want to stratify.

raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
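
You can verify the sizes of the resulting splits by printing the dataset (each split should hold roughly 10% of the original 204,045 training articles):

print(raw_datasets)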

Data Pre-processing

Before we can feed these texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in a format the model expects, and generates the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT declared earlier, and we will just pass that.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
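
As a quick sanity check (the sentence below is made up for illustration), calling the tokenizer on a string returns the token IDs and attention mask that the model expects:

sample = tokenizer("The full cost of damage is still being assessed.")
print(sample["input_ids"])
print(sample["attention_mask"])
# Map the IDs back to sub-word tokens to see how the vocabulary splits the text
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))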

If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple function that helps us with the pre-processing and that is compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

  • Tokenize the text dataset (inputs and targets) into its corresponding token IDs that will be used for embedding look-up by the model
  • Add the prefix to the inputs
  • Create additional inputs for the model like token_type_ids, attention_mask, etc.

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the pairs of sentences in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in the dataset, so our training and test data will be pre-processed in one single command.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
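
Optionally, you can inspect one pre-processed example to confirm that the new input_ids, attention_mask and labels fields were added alongside the original columns:

sample = tokenized_datasets["train"][0]
print(sample.keys())
print(sample["input_ids"][:10])  # token IDs of the prefixed article
print(sample["labels"][:10])  # token IDs of the summary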

Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.

from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

For training sequence-to-sequence models, we need a special kind of data collator, one that will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. The return_tensors='tf' argument ensures that we get tf.Tensor objects back.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
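
As a small illustrative check, calling the collator on a couple of examples shows that inputs and labels are padded to a common length in the batch and returned as tf.Tensor objects:

features = [
    {key: tokenized_datasets["train"][i][key] for key in ["input_ids", "attention_mask", "labels"]}
    for i in range(2)
]
batch = data_collator(features)
print({key: value.shape for key, value in batch.items()})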

Next we define our training and test sets with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which will help us integrate our dataset with the collator defined above. The method expects certain parameters:

  • columns: the columns which will serve as our independent variables
  • batch_size: our batch size for training
  • shuffle: whether we want to shuffle our dataset
  • collate_fn: our collator function

Additionally, we also define a relatively smaller generation_dataset to calculate ROUGE scores on the fly while training.

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Building and Compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Training and Evaluating the model

To evaluate our model on the fly while training, we will define metric_fn, which will calculate the ROUGE score between the ground-truth and the predictions.

import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

Now we can finally start training our model!

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.

2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065

<keras.callbacks.History at 0x7f1d002f9810>

For best results, we recommend training the model for at least 5 epochs on the entire training dataset!


Inference

Now we will try to run inference with the model we trained on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)

[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]
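
The pipeline is a convenience wrapper; the same inference can also be sketched directly with the tokenizer and model.generate(), using the same prefix convention and length constraints as above:

document = raw_datasets["test"][0]["document"]
inputs = tokenizer(
    prefix + document, max_length=MAX_INPUT_LENGTH, truncation=True, return_tensors="tf"
)
# Generate summary token IDs, then decode them back to text
generated = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))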

Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:

model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")

After you push your model, this is how you can load it in the future!

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")