
Abstractive Summarization with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 with Hugging Face Transformers for abstractive summarization.

ⓘ This example uses Keras 2



Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization is also known as Abstractive Summarization and has seen growing interest among researchers in various disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. The Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 shows impressive results on a variety of sequence-to-sequence tasks (sequence in this notebook refers to text), such as summarization and translation.

In this notebook, we will fine-tune the pretrained T5 on the abstractive summarization task using Hugging Face Transformers on the XSum dataset loaded from Hugging Face Datasets.


Setup

Installing the requirements

!pip install transformers==4.20.0
!pip install keras_hub==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Importing the necessary libraries

import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Define certain variables

# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

Load the dataset

We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (i.e. the summary) which is professionally written, typically by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

In keeping with most of the literature, we evaluate our sequence-to-sequence abstractive summarization approach using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.
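To make the metric concrete, here is a minimal, purely illustrative sketch using the rouge-score package installed above (the two sentences are made up):

from rouge_score import rouge_scorer

# ROUGE-L scores the longest common subsequence between a reference and a prediction
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "Flooding hit the Scottish Borders after Storm Frank.",  # reference summary
    "Storm Frank caused flooding across the Scottish Borders.",  # model prediction
)
print(scores["rougeL"])  # Score(precision=..., recall=..., fmeasure=...)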

We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can be easily done with the load_dataset function.

from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

  • document: the original BBC article to be summarized
  • summary: the single-sentence summary of the BBC article
  • id: the ID of the document-summary pair
print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

Let's now see what the data looks like:

print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on [email protected] or [email protected].', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}

For the sake of demonstrating the workflow, in this notebook we will only take small, balanced splits (10%) of the training set as our own training and test sets. We can easily split the dataset using the train_test_split method, which expects the split sizes and, optionally, the name of the column to stratify by.

raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
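As a quick sanity check (optional), printing the result should show a DatasetDict with two splits:

print(raw_datasets)
# Expect a DatasetDict with 'train' and 'test' splits of roughly 20,400 rows each
# (10% of the original 204,045 training examples per split)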

Data Pre-processing

Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in a format the model expects, and generate the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT declared earlier, and we will just pass that.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
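As a quick, purely illustrative check (the sample sentence is made up), you can call the tokenizer on a short string to see the token IDs and attention mask it produces:

sample = tokenizer("summarize: The weather in London is sunny today.")
print(sample["input_ids"])       # Token IDs from the T5 vocabulary
print(sample["attention_mask"])  # 1 for every real (non-padding) token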

If you are using one of the five available T5 checkpoints, you have to prefix the inputs with "summarize: " (the model can also translate, and it needs the prefix to know which task it has to perform).

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple function that helps us in the pre-processing, compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

  • Prepend the prefix to the inputs
  • Tokenize the text dataset (inputs and targets) into its corresponding token IDs, which will be used for embedding look-ups in the model
  • Create the additional inputs that the model needs, such as the attention_mask
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the pairs of sentences in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in the dataset, so our training and test data will be pre-processed in one single command.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
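To verify the pre-processing (optional), you can inspect one tokenized example; the new columns sit alongside the original ones:

example = tokenized_datasets["train"][0]
print(example.keys())             # document, summary, id, input_ids, attention_mask, labels
print(len(example["input_ids"]))  # At most MAX_INPUT_LENGTH tokens
print(len(example["labels"]))     # At most MAX_TARGET_LENGTH tokens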

Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.

from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

For training sequence-to-sequence models, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. The return_tensors='tf' argument ensures that we get tf.Tensor objects back.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
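To see the collator in action (a minimal, illustrative sketch), you can pad a couple of tokenized examples into one batch; since we passed the model, the collator also prepares decoder_input_ids:

features = [
    {k: tokenized_datasets["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
# input_ids and attention_mask are padded to the longest input in the batch;
# labels are padded with -100 so the padding is ignored by the loss
print({k: v.shape for k, v in batch.items()})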

Next we define our training and testing sets, with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which will help us integrate our dataset with the collator defined above. The method expects certain parameters:

  • columns: the columns that will serve as our independent variables
  • batch_size: our batch size for training
  • shuffle: whether we want to shuffle the dataset
  • collate_fn: our collator function

Additionally, we also define a relatively small generation_dataset, to calculate ROUGE scores on the fly during training.

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)
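As a final check (optional), pulling one batch from train_dataset should yield padded tensors with a leading dimension of BATCH_SIZE:

for batch in train_dataset.take(1):
    print({k: v.shape for k, v in batch.items()})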

Building and Compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Training and Evaluating the model

To evaluate our model on the fly while training, we will define metric_fn, which will calculate the ROUGE score between the ground truth and the predictions.

import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result
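As a quick sanity check of metric_fn (purely illustrative; note that rouge_l accumulates state across calls, so we reset it afterwards), identical predictions and labels should score close to 1.0:

sample_ids = np.array(tokenizer(["a short test summary"]).input_ids)
print(metric_fn((sample_ids, sample_ids.copy())))  # {'RougeL': ~1.0}
rouge_l.reset_state()  # Clear the accumulated state before training starts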

Now we can finally start training our model!

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.

2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065

<keras.callbacks.History at 0x7f1d002f9810>

For best results, we recommend training the model for at least 5 epochs on the entire training dataset!
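If you do train for more epochs, it can help to checkpoint the best weights and stop early. Here is a minimal sketch using standard Keras callbacks (the filename and patience values below are illustrative choices, not part of the original example):

extra_callbacks = [
    metric_callback,
    # Save the weights that achieve the best validation loss (illustrative filename)
    keras.callbacks.ModelCheckpoint(
        "t5_xsum_best.h5", save_best_only=True, save_weights_only=True
    ),
    # Stop if the validation loss does not improve for 2 consecutive epochs
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
]
# model.fit(train_dataset, validation_data=test_dataset,
#           epochs=5, callbacks=extra_callbacks)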


Inference

Now we will try to infer the model we trained on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)

[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]
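The same pipeline also works on any raw text you pass in; the short article below is made up for illustration:

article = (
    "The city council announced on Tuesday that the central library will "
    "close for six months of renovation work starting in January."
)
print(summarizer(article, min_length=MIN_TARGET_LENGTH, max_length=MAX_TARGET_LENGTH))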

Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", so for instance:

model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")

And after you push your model, this is how you can load it in the future!

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")
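To use the pushed model end-to-end, you would typically reload the matching tokenizer as well and rebuild the pipeline; a sketch, assuming the same identifier:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("your-username/my-awesome-model")
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")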