
Abstractive Text Summarization with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/04
Last modified: 2022/08/28
Description: Training T5 using Hugging Face Transformers for abstractive summarization.

ⓘ This example uses Keras 2



Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP). It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).

In this tutorial, we tackle the single-document summarization task with an abstractive modeling approach. The primary idea here is to generate a short, single-sentence news summary answering the question "What is the news article about?". This approach to summarization, also known as Abstractive Summarization, has seen growing interest among researchers across disciplines.

Following prior work, we aim to tackle this problem using a sequence-to-sequence model. The Text-to-Text Transfer Transformer (T5) is a Transformer-based model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 has shown impressive results on a variety of sequence-to-sequence tasks (sequence here refers to text), such as summarization and translation.
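
As a minimal, purely illustrative sketch of this text-to-text format, every task is cast as "task prefix + input text" -> "output text". The strings below are made up (the translation pair follows the convention shown in the T5 paper):

# Illustrative only: T5 casts every task into the same text-to-text format.
text_to_text_examples = [
    {
        "input": "summarize: The River Cree overflowed into Newton Stewart ...",
        "target": "Clean-up operations are continuing after flooding ...",
    },
    {
        "input": "translate English to German: The house is wonderful.",
        "target": "Das Haus ist wunderbar.",
    },
]

for example in text_to_text_examples:
    print(example["input"], "->", example["target"])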

In this notebook, we will fine-tune the pretrained T5 for the abstractive summarization task using Hugging Face Transformers on the XSum dataset loaded from Hugging Face Datasets.


Setup

Installing the requirements

!pip install transformers==4.20.0
!pip install keras_hub==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Importing the necessary libraries

import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Define certain variables

# The percentage of the dataset you want to split as train and test
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

Load the dataset

We will now download the Extreme Summarization (XSum) dataset. The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka the summary), which is usually professionally written by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we evaluate our sequence-to-sequence abstractive summarization approach with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.
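
As a quick, self-contained sketch of what ROUGE measures, the rouge-score package installed above can score a candidate summary against a reference (both sentences here are made up for illustration):

from rouge_score import rouge_scorer

# ROUGE-1 counts overlapping unigrams; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Clean-up operations are continuing after flooding caused by Storm Frank.",  # reference
    "Clean-up work continues after flooding from Storm Frank.",  # candidate
)
print(scores)  # precision, recall and F1 for each ROUGE variant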

We will use the Hugging Face Datasets library to download the data we need for training and evaluation. This can be easily done with the load_dataset function.

from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

  • document: the original BBC article to be summarized
  • summary: the single-sentence summary of the BBC article
  • id: the ID of the document-summary pair

print(raw_datasets)
Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

We will now see what the data looks like:

print(raw_datasets[0])
{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on [email protected] or [email protected].', 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.', 'id': '35232142'}

For the purpose of demonstrating the workflow, in this notebook we will only take a small stratified balanced split (10%) of the train set as our training and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column relative to which you want to stratify.

raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
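
You can verify the sizes of the resulting splits by printing the dataset (each split should hold roughly 10% of the original 204,045 training articles):

print(raw_datasets)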

Data Pre-processing

Before we can feed these texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in a format the model expects, and generates the other inputs that the model requires.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. This is exactly the MODEL_CHECKPOINT declared earlier, and we will just pass that.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
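
As a quick sanity check (the sentence below is made up for illustration), calling the tokenizer on a string returns the token IDs and attention mask that the model expects:

sample = tokenizer("The full cost of damage is still being assessed.")
print(sample["input_ids"])
print(sample["attention_mask"])
# Map the IDs back to sub-word tokens to see how the vocabulary splits the text
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))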

If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple function that helps us with the pre-processing and that is compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

  • Tokenize the text dataset (inputs and targets) into its corresponding token IDs that will be used for embedding look-up by the model
  • Add the prefix to the inputs
  • Create additional inputs for the model like token_type_ids, attention_mask, etc.

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the pairs of sentences in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in the dataset, so our training and test data will be pre-processed in one single command.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
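
Optionally, you can inspect one pre-processed example to confirm that the new input_ids, attention_mask and labels fields were added alongside the original columns:

sample = tokenized_datasets["train"][0]
print(sample.keys())
print(sample["input_ids"][:10])  # token IDs of the prefixed article
print(sample["labels"][:10])  # token IDs of the summary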

Defining the model

Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the TFAutoModelForSeq2SeqLM class from the Hugging Face Transformers library. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

The from_pretrained() method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the t5-small model checkpoint.

from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.
All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

For training sequence-to-sequence models, we need a special kind of data collator, one that will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the DataCollatorForSeq2Seq provided by the Hugging Face Transformers library on our dataset. The return_tensors='tf' argument ensures that we get tf.Tensor objects back.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
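
As a small illustrative check, calling the collator on a couple of examples shows that inputs and labels are padded to a common length in the batch and returned as tf.Tensor objects:

features = [
    {key: tokenized_datasets["train"][i][key] for key in ["input_ids", "attention_mask", "labels"]}
    for i in range(2)
]
batch = data_collator(features)
print({key: value.shape for key, value in batch.items()})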

Next we define our training and test sets with which we will train our model. Again, Hugging Face Datasets provides us with the to_tf_dataset method, which will help us integrate our dataset with the collator defined above. The method expects certain parameters:

  • columns: the columns which will serve as our independent variables
  • batch_size: our batch size for training
  • shuffle: whether we want to shuffle our dataset
  • collate_fn: our collator function

Additionally, we also define a relatively smaller generation_dataset to calculate ROUGE scores on the fly while training.

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Building and Compiling the model

Now we will define our optimizer and compile the model. The loss calculation is handled internally, so we need not worry about that!

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Training and Evaluating the model

To evaluate our model on the fly while training, we will define metric_fn, which will calculate the ROUGE score between the ground-truth and the predictions.

import keras_hub

rouge_l = keras_hub.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

Now we can finally start training our model!

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)
WARNING:root:No label_cols specified for KerasMetricCallback, assuming you want the 'labels' key.

2551/2551 [==============================] - 652s 250ms/step - loss: 2.9159 - val_loss: 2.5875 - RougeL: 0.2065

<keras.callbacks.History at 0x7f1d002f9810>

For best results, we recommend training the model for at least 5 epochs on the entire training dataset!


Inference

Now we will try to run inference with the model we trained on an arbitrary article. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
Your max_length is set to 128, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)

[{'summary_text': 'Boss Wagner says he is "a 100% professional and has a winning mentality to play on the pitch."'}]
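
The pipeline is a convenience wrapper; the same inference can also be sketched directly with the tokenizer and model.generate(), using the same prefix convention and length constraints as above:

document = raw_datasets["test"][0]["document"]
inputs = tokenizer(
    prefix + document, max_length=MAX_INPUT_LENGTH, truncation=True, return_tensors="tf"
)
# Generate summary token IDs, then decode them back to text
generated = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))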

Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:

model.push_to_hub("transformers-qa", organization="keras-io")
tokenizer.push_to_hub("transformers-qa", organization="keras-io")

After you push your model, this is how you can load it in the future!

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")