
Semantic Similarity with KerasNLP

Author: Anshuman Mishra
Date created: 2023/02/25
Last modified: 2023/02/25
Description: Use pretrained models from KerasNLP for the semantic similarity task.

ⓘ This example uses Keras 3

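Introduction

Semantic similarity is the task of determining how similar two sentences are in terms of their meaning. A previous example showed how to predict sentence semantic similarity on the SNLI (Stanford Natural Language Inference) corpus with the HuggingFace Transformers library. In this tutorial we perform the same task with KerasNLP, an extension of the core Keras API. Along the way we look at how KerasNLP reduces boilerplate code and simplifies building and using models. For more information on KerasNLP, see KerasNLP's official documentation.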

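This guide is divided into the following parts:

1. Setup, task definition, and establishing a baseline.
2. Establishing a baseline with BERT.
3. Saving and reloading the model.
4. Performing inference with the model.
5. Improving accuracy with RoBERTa.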

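Setup

This guide uses Keras 3 (previously released as Keras Core), which runs on any of tensorflow, jax or torch. Support for it is built into KerasNLP: the KERAS_BACKEND environment variable, set before Keras is imported, selects the backend. This guide uses the jax backend, which gives a particularly fast training step.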
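A minimal sketch of selecting the backend, assuming it runs before keras is imported for the first time (Keras reads KERAS_BACKEND at import time). The value "jax" matches the choice described above; "tensorflow" or "torch" could be substituted.

```python
import os

# Must be set before the first `import keras`; changing it afterwards
# does not switch the backend of an already imported Keras.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" / "torch"
```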

!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras  # Upgrade to Keras 3.

import numpy as np
import tensorflow as tf
import keras
import keras_nlp
import tensorflow_datasets as tfds

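We load the SNLI dataset with the tensorflow-datasets library. The dataset contains over 550,000 samples in total, but to keep this example quick to run we use only 20% of the training samples.

Overview of the SNLI dataset

Every sample in the dataset contains three components: hypothesis, premise and label. The premise is the original caption that was shown to the author of the pair, while the hypothesis is the caption that the author wrote in response. The label is assigned by annotators to indicate the similarity between the two sentences.

The dataset has three possible similarity labels: contradiction, entailment and neutral. Contradiction marks sentences with completely different meanings, entailment marks sentences with similar meanings, and neutral marks pairs for which no clear similarity or dissimilarity can be established.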
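The dataset stores labels as integers. The sketch below maps them to names, assuming the label order used by the tensorflow-datasets version of SNLI (0 = entailment, 1 = neutral, 2 = contradiction); this order and the helper label_name are assumptions for illustration, so check the dataset metadata before relying on them. A value of -1 marks samples without a usable label, which the preprocessing step later filters out.

```python
# Assumed TFDS label order for SNLI; verify against the dataset metadata.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def label_name(label_id):
    """Map an integer label to its name; negative ids mark unusable labels."""
    if label_id < 0:
        return "missing label"
    return LABEL_NAMES[label_id]


print([label_name(i) for i in [0, 1, 2, -1]])
```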

snli_train = tfds.load("snli", split="train[:20%]")
snli_val = tfds.load("snli", split="validation")
snli_test = tfds.load("snli", split="test")

# Here's what the samples look like. We take the first batch of four
# samples from the test split:
sample = snli_test.batch(4).take(1).get_single_element()
sample

{'hypothesis': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl is entertaining on stage',
        b'A group of people posing in front of a body of water.',
        b"The group of people aren't inide of the building.",
        b'The people are taking a carriage ride.'], dtype=object)>,
 'label': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 0, 0])>,
 'premise': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl in a blue leotard hula hoops on a stage with balloon shapes in the background.',
        b'A group of people taking pictures on a walkway in front of a large body of water.',
        b'Many people standing outside of a place talking to each other in front of a building that has a sign that says "HI-POINTE."',
        b'Three people are riding a carriage pulled by four horses.'],
       dtype=object)>}

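Preprocessing

Some samples in the dataset have missing or incorrectly labeled data, marked with a label of -1. To keep the model accurate and reliable, we simply filter these samples out of the dataset.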

def filter_labels(sample):
    return sample["label"] >= 0

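The following utility function splits each example into an (x, y) tuple suitable for model.fit(). By default, keras_nlp.models.BertClassifier tokenizes the raw strings and packs them together with a "[SEP]" token during training. This label splitting is therefore all the data preparation we need to do.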

def split_labels(sample):
    x = (sample["hypothesis"], sample["premise"])
    y = sample["label"]
    return x, y


train_ds = (
    snli_train.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
val_ds = (
    snli_val.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
test_ds = (
    snli_test.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
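To make the effect of the two helper functions concrete, here is a sketch that applies the same filter-then-split logic to plain Python dictionaries instead of a tf.data pipeline. The toy samples are invented for illustration and are not taken from SNLI.

```python
def filter_labels(sample):
    # Keep only samples with a valid (non-negative) label.
    return sample["label"] >= 0


def split_labels(sample):
    # Features become a (hypothesis, premise) tuple; the label is the target.
    x = (sample["hypothesis"], sample["premise"])
    y = sample["label"]
    return x, y


# Hypothetical toy samples, invented for illustration only.
toy_samples = [
    {"hypothesis": "A dog runs.", "premise": "A dog is running outside.", "label": 0},
    {"hypothesis": "A cat sleeps.", "premise": "A dog is running outside.", "label": -1},
    {"hypothesis": "No dog is outside.", "premise": "A dog is running outside.", "label": 2},
]

# The sample labeled -1 is dropped; the rest become (x, y) pairs.
pairs = [split_labels(s) for s in toy_samples if filter_labels(s)]
for x, y in pairs:
    print(x, y)
```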

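Establishing a baseline with BERT

We use the BERT model from KerasNLP to establish a baseline for our semantic similarity task. The keras_nlp.models.BertClassifier class attaches a classification head to the BERT backbone, mapping the backbone outputs to logits suitable for a classification task. This significantly reduces the need for custom code.

KerasNLP models have built-in tokenization that handles tokenization by default according to the selected model. Users can still apply custom preprocessing to fit their specific needs. If we pass a tuple as input, the model tokenizes all the strings and concatenates them with a "[SEP]" separator.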
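The packing of a sentence pair can be pictured with the toy sketch below. It is not KerasNLP's implementation: it splits on whitespace instead of using WordPiece tokens and skips truncation and padding. The helper pack_segments is hypothetical and only shows the usual BERT layout, with a "[CLS]" token at the start and a "[SEP]" token after each sentence.

```python
def pack_segments(first, second):
    """Toy BERT-style packing: [CLS] first [SEP] second [SEP]."""
    first_tokens = first.lower().split()
    second_tokens = second.lower().split()
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment id 0 covers [CLS], the first sentence and its [SEP];
    # segment id 1 covers the second sentence and the final [SEP].
    segment_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_segments(
    "The people are taking a carriage ride.",
    "Three people are riding a carriage pulled by four horses.",
)
print(tokens)
print(segment_ids)
```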

We load this model with pretrained weights through the from_preset() method, which also allows us to supply our own preprocessor. For the SNLI dataset, we set num_classes to 3.

bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=3
)

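Note that the BERT Tiny model has only 4,386,307 trainable parameters.

KerasNLP task models come with compilation defaults. We can now train the model we just instantiated by calling the fit() method.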

bert_classifier.fit(train_ds, validation_data=val_ds, epochs=1)

 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 61s 8ms/step - loss: 0.8732 - sparse_categorical_accuracy: 0.5864 - val_loss: 0.5900 - val_sparse_categorical_accuracy: 0.7602

<keras.src.callbacks.history.History at 0x7f4660171fc0>

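Our BERT classifier reached an accuracy of about 76% on the validation split. Now let's evaluate its performance on the test split.

Evaluate the performance of the trained model on test data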

bert_classifier.evaluate(test_ds)

 614/614 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.5815 - sparse_categorical_accuracy: 0.7628

[0.5895748734474182, 0.7618078589439392]

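Our baseline BERT model also reached about 76% accuracy on the test split. Now let's try to improve it by recompiling the model with a slightly higher learning rate.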

bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=3
)
bert_classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    metrics=["accuracy"],
)

bert_classifier.fit(train_ds, validation_data=val_ds, epochs=1)
bert_classifier.evaluate(test_ds)

 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 59s 8ms/step - accuracy: 0.6007 - loss: 0.8636 - val_accuracy: 0.7648 - val_loss: 0.5800
 614/614 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7700 - loss: 0.5692

[0.578984260559082, 0.7686278820037842]

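Tweaking the learning rate alone was not enough: test accuracy stayed at roughly 77%, essentially unchanged from the 76% baseline. Let's try again, this time with keras.optimizers.AdamW and a learning rate schedule.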

class TriangularSchedule(keras.optimizers.schedules.LearningRateSchedule):
    """Linear ramp up for `warmup` steps, then linear decay to zero at `total` steps."""

    def __init__(self, rate, warmup, total):
        self.rate = rate
        self.warmup = warmup
        self.total = total

    def get_config(self):
        config = {"rate": self.rate, "warmup": self.warmup, "total": self.total}
        return config

    def __call__(self, step):
        step = keras.ops.cast(step, dtype="float32")
        rate = keras.ops.cast(self.rate, dtype="float32")
        warmup = keras.ops.cast(self.warmup, dtype="float32")
        total = keras.ops.cast(self.total, dtype="float32")

        warmup_rate = rate * step / warmup
        cooldown_rate = rate * (total - step) / (total - warmup)
        triangular_rate = keras.ops.minimum(warmup_rate, cooldown_rate)
        return keras.ops.maximum(triangular_rate, 0.0)


bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=3
)

# Get the total count of training batches.
# This requires walking the dataset to filter all -1 labels.
epochs = 3
total_steps = sum(1 for _ in train_ds.as_numpy_iterator()) * epochs
warmup_steps = int(total_steps * 0.2)

bert_classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.AdamW(
        TriangularSchedule(1e-4, warmup_steps, total_steps)
    ),
    metrics=["accuracy"],
)

bert_classifier.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/3
 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 59s 8ms/step - accuracy: 0.5457 - loss: 0.9317 - val_accuracy: 0.7633 - val_loss: 0.5825
Epoch 2/3
 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 55s 8ms/step - accuracy: 0.7291 - loss: 0.6515 - val_accuracy: 0.7809 - val_loss: 0.5399
Epoch 3/3
 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 55s 8ms/step - accuracy: 0.7708 - loss: 0.5695 - val_accuracy: 0.7918 - val_loss: 0.5214

<keras.src.callbacks.history.History at 0x7f45645b3370>

Success! With the learning rate schedule and the AdamW optimizer, our validation accuracy improved to about 79%.
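To see the shape of the schedule used above, the same formula can be evaluated in plain Python. This is a sketch that mirrors TriangularSchedule without Keras; the function name triangular_rate is hypothetical, and the step count of 6867 batches per epoch is taken from the training log.

```python
def triangular_rate(step, rate, warmup, total):
    """Linear warmup to `rate` over `warmup` steps, then linear decay to 0 at `total`."""
    warmup_rate = rate * step / warmup
    cooldown_rate = rate * (total - step) / (total - warmup)
    return max(min(warmup_rate, cooldown_rate), 0.0)


# 6867 batches per epoch (from the training log) times 3 epochs.
total_steps = 6867 * 3
warmup_steps = int(total_steps * 0.2)

# The rate rises from 0 to its peak at `warmup_steps`, then falls back to 0.
for step in [0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps]:
    print(step, triangular_rate(step, 1e-4, warmup_steps, total_steps))
```

The peak learning rate of 1e-4 is reached exactly at the end of warmup, after the first 20% of the steps, and the rate is clamped to zero from the final step onward.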

Now let's evaluate our final model on the test set and see how it performs.

bert_classifier.evaluate(test_ds)

 614/614 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7956 - loss: 0.5128

[0.5245093703269958, 0.7890879511833191]

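With the learning rate schedule, our Tiny BERT model reached about 79% accuracy on the test set, two to three percentage points above our earlier results. Fine-tuning a pretrained BERT model can be a powerful tool for natural language processing tasks, and even a small model like Tiny BERT can achieve impressive results.

Let's save our model for now and move on to using it for inference.

Save and reload the model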

bert_classifier.save("bert_classifier.keras")
restored_model = keras.models.load_model("bert_classifier.keras")
restored_model.evaluate(test_ds)

 614/614 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.5128 - sparse_categorical_accuracy: 0.7956

[0.5245093703269958, 0.7890879511833191]

Performing inference with the model

Let's see how to perform inference with KerasNLP models.

# Convert to Hypothesis-Premise pair, for forward pass through model
sample = (sample["hypothesis"], sample["premise"])
sample

(<tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl is entertaining on stage',
        b'A group of people posing in front of a body of water.',
        b"The group of people aren't inide of the building.",
        b'The people are taking a carriage ride.'], dtype=object)>,
 <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl in a blue leotard hula hoops on a stage with balloon shapes in the background.',
        b'A group of people taking pictures on a walkway in front of a large body of water.',
        b'Many people standing outside of a place talking to each other in front of a building that has a sign that says "HI-POINTE."',
        b'Three people are riding a carriage pulled by four horses.'],
       dtype=object)>)

The default preprocessor in KerasNLP models handles input tokenization automatically, so we don't need to tokenize explicitly.

predictions = bert_classifier.predict(sample)


def softmax(x):
    # Normalize over the class axis (last axis) so that each row sums to 1.
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)


# Convert the logits into per-class probabilities
predictions = softmax(predictions)

 1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 711ms/step
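The classifier returns one row of logits per sample, so probabilities are obtained by normalizing each row across its three classes, and the predicted class is the index of the largest probability. A dependency-free sketch of that step follows; the logits are invented for illustration and are not real model outputs.

```python
import math


def softmax(logits):
    """Softmax over one row of class logits (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two samples and three classes.
batch_logits = [[2.0, 0.5, -1.0], [-0.3, 0.1, 1.2]]

batch_probs = [softmax(row) for row in batch_logits]
# The predicted class is the index with the highest probability in each row.
predicted_classes = [max(range(len(p)), key=p.__getitem__) for p in batch_probs]

for probs, predicted in zip(batch_probs, predicted_classes):
    print(predicted, [round(p, 3) for p in probs])
```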

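Improving accuracy with RoBERTa

Now that we have established a baseline, we can try to improve our results by experimenting with different models. Thanks to KerasNLP, fine-tuning a RoBERTa checkpoint on the same dataset takes only a few lines of code.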

# Initializing a RoBERTa classifier from a preset
roberta_classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en", num_classes=3
)

roberta_classifier.fit(train_ds, validation_data=val_ds, epochs=1)

roberta_classifier.evaluate(test_ds)

 6867/6867 ━━━━━━━━━━━━━━━━━━━━ 2049s 297ms/step - loss: 0.5509 - sparse_categorical_accuracy: 0.7740 - val_loss: 0.3292 - val_sparse_categorical_accuracy: 0.8789
 614/614 ━━━━━━━━━━━━━━━━━━━━ 56s 88ms/step - loss: 0.3307 - sparse_categorical_accuracy: 0.8784

[0.33771008253097534, 0.874796450138092]

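The RoBERTa base model has far more trainable parameters than the BERT Tiny model: 124,645,635, almost 30 times as many. As a result, training took approximately 1.5 hours on a P100 GPU. The performance gain is substantial, however, with accuracy rising to about 88% on both the validation and test splits. With RoBERTa, 16 was the largest batch size we could fit on the P100 GPU.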
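As a quick check of the size comparison, the ratio can be computed from the two parameter counts quoted in this guide:

```python
bert_tiny_params = 4_386_307
roberta_base_params = 124_645_635

# Ratio of trainable parameters between the two models.
ratio = roberta_base_params / bert_tiny_params
print(f"RoBERTa base has about {ratio:.1f}x the parameters of BERT Tiny")
```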

Despite using a different model, the steps for inference with RoBERTa are the same as with BERT!

predictions = roberta_classifier.predict(sample)
print(tf.math.argmax(predictions, axis=1).numpy())

 1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step
[0 0 0 0]

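We hope this tutorial has helped demonstrate the ease and effectiveness of using KerasNLP and BERT for semantic similarity tasks.

Throughout this tutorial, we showed how to use a pretrained BERT model to establish a baseline, and how to improve performance by training a larger RoBERTa model with just a few lines of code.

The KerasNLP toolbox provides a range of modular building blocks for text preprocessing, including pretrained state-of-the-art models and low-level Transformer encoder layers. We believe this makes experimenting with natural language solutions more accessible and efficient.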
