Author: Mohammed Abu El-Nasr
Date created: 2023/07/14
Last modified: 2023/07/14
Description: Fine-tune a RoBERTa model to generate sentence embeddings using KerasHub.
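BERT and RoBERTa can be used for semantic textual similarity tasks, where two sentences are passed to the model and the network predicts whether they are similar. But what if we have a large collection of sentences and want to find the most similar pairs in that collection? That requires n*(n-1)/2 inference computations, where n is the number of sentences in the collection. For example, if n = 10000, the required time is 65 hours on a V100 GPU.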
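To make the pair count concrete, here is a minimal sketch of the arithmetic in plain Python. The helper name `num_pairs` is only illustrative and is not part of the tutorial code.

```python
def num_pairs(n: int) -> int:
    """Number of unordered sentence pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n = 10_000
pairs = num_pairs(n)
print(f"{n:,} sentences -> {pairs:,} pairwise inferences")
# 10,000 sentences -> 49,995,000 pairwise inferences
```

The count grows quadratically, which is why running the full model on every pair becomes impractical for large collections.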
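A common way to overcome this time overhead is to pass a single sentence to the model, then average the model's output or take the first token (the [CLS] token) and use it as a sentence embedding. A vector similarity measure such as cosine similarity or Manhattan/Euclidean distance is then used to find close sentences (semantically similar sentences). This reduces the time needed to find the most similar pairs in a collection of 10,000 sentences from 65 hours to 5 seconds!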
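As a rough illustration of the embedding-based approach, the sketch below assumes each sentence has already been encoded once into a vector and then searches for the closest pair using cosine similarity. The toy vectors and the helper names `cosine_similarity` and `most_similar_pair` are hypothetical, chosen only for this example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return the index pair (i, j) whose embeddings have the highest cosine similarity."""
    return max(
        combinations(range(len(embeddings)), 2),
        key=lambda pair: cosine_similarity(embeddings[pair[0]], embeddings[pair[1]]),
    )


# Hypothetical 3-dimensional sentence embeddings (real ones have 768 dimensions here).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

best = most_similar_pair(toy_embeddings)
print(best)  # (0, 1)
```

The expensive model runs only n times (once per sentence); the pairwise comparison that remains is cheap vector arithmetic.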
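If we use RoBERTa directly, it yields rather poor sentence embeddings. But if we fine-tune RoBERTa using a Siamese network, it generates semantically meaningful sentence embeddings. This enables RoBERTa to be used for new tasks, such as the semantic search and clustering demonstrated later in this example.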
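In this example, we will show how to fine-tune a RoBERTa model using a Siamese network so that it can generate semantically meaningful sentence embeddings, and we will use them in a semantic search and a clustering example. This fine-tuning method was proposed in Sentence-BERT.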
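Let's install and import the libraries we need. We will use the KerasHub library in this example.
We will also enable mixed precision training, which helps reduce the training time.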
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras # Upgrade to Keras 3.
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import keras_hub
import tensorflow as tf
import tensorflow_datasets as tfds
import sklearn.cluster as cluster
keras.mixed_precision.set_global_policy("mixed_float16")
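A Siamese network is a neural network architecture that contains two or more subnetworks. The subnetworks share the same weights. It is used to generate a feature vector for each input and then compare the vectors for similarity.
In our example, the subnetwork is a RoBERTa model with a pooling layer on top of it that produces the embeddings of the input sentences. These embeddings are then compared to each other so that the model learns to produce semantically meaningful embeddings.
The pooling strategies used are mean, max, and CLS pooling. Mean pooling produces the best results, so we will use it in our examples.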
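The three pooling strategies can be sketched on a tiny example in plain Python. This is only an illustration under stated assumptions: the token vectors are made up, the helper names are hypothetical, and the padding mask uses 1 for real tokens and 0 for padding.

```python
def mean_pool(token_vectors, padding_mask):
    """Average the vectors of non-padding tokens, dimension by dimension."""
    kept = [vec for vec, keep in zip(token_vectors, padding_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(token_vectors, padding_mask):
    """Take the maximum over non-padding tokens, dimension by dimension."""
    kept = [vec for vec, keep in zip(token_vectors, padding_mask) if keep]
    return [max(column) for column in zip(*kept)]


def cls_pool(token_vectors):
    """Use the vector of the first token (the [CLS] position) as the embedding."""
    return token_vectors[0]


# Hypothetical per-token outputs for one sentence: two real tokens and one padding token.
tokens = [
    [1.0, 4.0],  # first token
    [3.0, 0.0],  # second token
    [9.0, 9.0],  # padding token, must not influence the result
]
mask = [1, 1, 0]

print(mean_pool(tokens, mask))  # [2.0, 2.0]
print(max_pool(tokens, mask))   # [3.0, 4.0]
print(cls_pool(tokens))         # [1.0, 4.0]
```

Without the mask, the padding vector would pull the mean toward `[9.0, 9.0]`, which is why the padding mask matters for mean pooling.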
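To build a Siamese network with a regression objective function, the Siamese network is asked to predict the cosine similarity between the embeddings of the two input sentences.
Cosine similarity indicates the angle between the sentence embeddings. A high cosine similarity means a small angle between the embeddings, and hence the sentences are semantically similar.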
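The relationship between cosine similarity and angle can be checked numerically. The following is a small sketch in plain Python with made-up 2-dimensional vectors; the helper names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between two vectors, recovered from their cosine similarity."""
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))


reference = [1.0, 0.0]
for other in ([2.0, 0.0], [1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]):
    cos = cosine_similarity(reference, other)
    print(f"{other}: cosine = {cos:.3f}, angle = {angle_degrees(reference, other):.1f} degrees")
```

Vector length does not affect the result: `[2.0, 0.0]` points in the same direction as the reference, so its cosine similarity is 1 and the angle is 0 degrees, while the opposite direction gives -1 and 180 degrees.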
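We will use the STSB dataset to fine-tune the model for the regression objective. STSB consists of a collection of sentence pairs labeled in the range [0, 5]. 0 indicates the least semantic similarity between the two sentences, and 5 indicates the most.
The cosine similarity, which is the output of the Siamese network, ranges over [-1, 1], but the labels in the dataset range over [0, 5]. We need to bring the two to the same range, so while preparing the dataset we divide the labels by 2.5 and subtract 1.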
TRAIN_BATCH_SIZE = 6
VALIDATION_BATCH_SIZE = 8
TRAIN_NUM_BATCHES = 300
VALIDATION_NUM_BATCHES = 40
AUTOTUNE = tf.data.AUTOTUNE


def change_range(x):
    # Rescale labels from [0, 5] to [-1, 1] to match the cosine similarity range.
    return (x / 2.5) - 1


def prepare_dataset(dataset, num_batches, batch_size):
    # Each example becomes ([sentence1, sentence2], [rescaled_label]).
    dataset = dataset.map(
        lambda z: (
            [z["sentence1"], z["sentence2"]],
            [tf.cast(change_range(z["label"]), tf.float32)],
        ),
        num_parallel_calls=AUTOTUNE,
    )
    dataset = dataset.batch(batch_size)
    dataset = dataset.take(num_batches)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset


stsb_ds = tfds.load(
    "glue/stsb",
)
stsb_train, stsb_valid = stsb_ds["train"], stsb_ds["validation"]
stsb_train = prepare_dataset(stsb_train, TRAIN_NUM_BATCHES, TRAIN_BATCH_SIZE)
stsb_valid = prepare_dataset(stsb_valid, VALIDATION_NUM_BATCHES, VALIDATION_BATCH_SIZE)
Let's look at examples from the dataset: two sentences and their similarity.
for x, y in stsb_train:
    for i, example in enumerate(x):
        print(f"sentence 1 : {example[0]} ")
        print(f"sentence 2 : {example[1]} ")
        print(f"similarity : {y[i]} \n")
    break
sentence 1 : b"A young girl is sitting on Santa's lap."
sentence 2 : b"A little girl is sitting on Santa's lap"
similarity : [0.9200001]
sentence 1 : b'A women sitting at a table drinking with a basketball picture in the background.'
sentence 2 : b'A woman in a sari drinks something while sitting at a table.'
similarity : [0.03999996]
sentence 1 : b'Norway marks anniversary of massacre'
sentence 2 : b"Norway Marks Anniversary of Breivik's Massacre"
similarity : [0.52]
sentence 1 : b'US drone kills six militants in Pakistan: officials'
sentence 2 : b'US missiles kill 15 in Pakistan: officials'
similarity : [-0.03999996]
sentence 1 : b'On Tuesday, the central bank left interest rates steady, as expected, but also declared that overall risks were weighted toward weakness and warned of deflation risks.'
sentence 2 : b"The central bank's policy board left rates steady for now, as widely expected, but surprised the market by declaring that overall risks were weighted toward weakness."
similarity : [0.6]
sentence 1 : b'At one of the three sampling sites at Huntington Beach, the bacteria reading came back at 160 on June 16 and at 120 on June 23.'
sentence 2 : b'The readings came back at 160 on June 16 and 120 at June 23 at one of three sampling sites at Huntington Beach.'
similarity : [0.29999995]
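Now we will build the encoder model that produces the sentence embeddings. It consists of a preprocessor that tokenizes the sentences and generates their padding masks, a RoBERTa backbone that produces a contextual representation for each token, a mean pooling layer that produces the embeddings, and a normalization layer that rescales each embedding to unit length, since we are using cosine similarity.
For the pooling layer we use keras.layers.GlobalAveragePooling1D to apply mean pooling to the backbone outputs. We pass the padding mask to this layer so that padding tokens are excluded from the average.
preprocessor = keras_hub.models.RobertaPreprocessor.from_preset("roberta_base_en")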
backbone = keras_hub.models.RobertaBackbone.from_preset("roberta_base_en")
inputs = keras.Input(shape=(1,), dtype="string", name="sentence")
x = preprocessor(inputs)
h = backbone(x)
embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
    h, x["padding_mask"]
)
n_embedding = keras.layers.UnitNormalization(axis=1)(embedding)
roberta_normal_encoder = keras.Model(inputs=inputs, outputs=n_embedding)
roberta_normal_encoder.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃ Param # ┃ Connected to         ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ sentence            │ (None, 1)         │       0 │ -                    │
│ (InputLayer)        │                   │         │                      │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ roberta_preprocess… │ [(None, 512),     │       0 │ sentence[0][0]       │
│ (RobertaPreprocess… │ (None, 512)]      │         │                      │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ roberta_backbone    │ (None, 512, 768)  │ 124,05… │ roberta_preprocesso… │
│ (RobertaBackbone)   │                   │         │ roberta_preprocesso… │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ pooling_layer       │ (None, 768)       │       0 │ roberta_backbone[0]… │
│ (GlobalAveragePool… │                   │         │ roberta_preprocesso… │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ unit_normalization  │ (None, 768)       │       0 │ pooling_layer[0][0]  │
│ (UnitNormalization) │                   │         │                      │
└─────────────────────┴───────────────────┴─────────┴──────────────────────┘
Total params: 124,052,736 (473.22 MB)
Trainable params: 124,052,736 (473.22 MB)
Non-trainable params: 0 (0.00 B)
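As described above, a Siamese network has two or more subnetworks, and for this Siamese model we need two encoders. We do not have two encoders, though; we have a single encoder and pass both sentences through it. This gives us two paths for obtaining the embeddings, with the weights shared between the two paths.
After passing the two sentences to the model and obtaining the normalized embeddings, we multiply the two normalized embeddings to get the cosine similarity between the two sentences. Because both embeddings have unit length, their dot product is the cosine similarity.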
class RegressionSiamese(keras.Model):
    def __init__(self, encoder, **kwargs):
        inputs = keras.Input(shape=(2,), dtype="string", name="sentences")
        sen1, sen2 = keras.ops.split(inputs, 2, axis=1)
        u = encoder(sen1)
        v = encoder(sen2)
        cosine_similarity_scores = keras.ops.matmul(u, keras.ops.transpose(v))

        super().__init__(
            inputs=inputs,
            outputs=cosine_similarity_scores,
            **kwargs,
        )

        self.encoder = encoder

    def get_encoder(self):
        return self.encoder
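To see what multiplying normalized embeddings produces, the following plain-Python sketch uses made-up 2-dimensional vectors for a batch of two sentence pairs. It compares the full matrix product (one score for every combination of a first and a second sentence) with the row-wise dot product (one score per actual pair). All names and values here are hypothetical and not part of the model above.

```python
import math


def l2_normalize(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings for a batch of two sentence pairs.
first_sentences = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0])]
second_sentences = [l2_normalize(v) for v in ([4.0, 3.0], [0.0, 2.0])]

# Matrix product u @ v^T: entry [i][j] compares first sentence i with second sentence j.
all_scores = [[dot(u, v) for v in second_sentences] for u in first_sentences]

# Row-wise dot product: entry [i] compares first sentence i with its own partner.
paired_scores = [dot(u, v) for u, v in zip(first_sentences, second_sentences)]

print(all_scores)     # approximately [[0.96, 0.8], [0.8, 0.0]]
print(paired_scores)  # approximately [0.96, 0.0]
```

The per-pair scores are the diagonal of the full matrix; the off-diagonal entries compare sentences that belong to different pairs in the batch.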
Let's try this example before training and compare it to the output after training.
sentences = [
    "Today is a very sunny day.",
    "I am hungry, I will get my meal.",
    "The dog is eating his food.",
]
query = ["The dog is enjoying his meal."]
encoder = roberta_normal_encoder
sentence_embeddings = encoder(tf.constant(sentences))
query_embedding = encoder(tf.constant(query))
cosine_similarity_scores = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
for i, sim in enumerate(cosine_similarity_scores[0]):
    print(f"cosine similarity score between sentence {i+1} and the query = {sim} ")
cosine similarity score between sentence 1 and the query = 0.96630859375
cosine similarity score between sentence 2 and the query = 0.97607421875
cosine similarity score between sentence 3 and the query = 0.99365234375
For training, we will use MeanSquaredError() as the loss function and the Adam() optimizer with a learning rate of 2e-5.
roberta_regression_siamese = RegressionSiamese(roberta_normal_encoder)
roberta_regression_siamese.compile(
    loss=keras.losses.MeanSquaredError(),
    optimizer=keras.optimizers.Adam(2e-5),
    jit_compile=False,
)
roberta_regression_siamese.fit(stsb_train, validation_data=stsb_valid, epochs=1)
300/300 ━━━━━━━━━━━━━━━━━━━━ 115s 297ms/step - loss: 0.4751 - val_loss: 0.4025
<keras.src.callbacks.history.History at 0x7f5a78392140>
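Let's try the model after training. We will notice a large difference in the output, which means the fine-tuned model is able to produce semantically meaningful embeddings: semantically similar sentences have a small angle between them, while semantically dissimilar sentences have a large angle between them.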
sentences = [
    "Today is a very sunny day.",
    "I am hungry, I will get my meal.",
    "The dog is eating his food.",
]
query = ["The dog is enjoying his food."]
encoder = roberta_regression_siamese.get_encoder()
sentence_embeddings = encoder(tf.constant(sentences))
query_embedding = encoder(tf.constant(query))
cosine_similarities = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
for i, sim in enumerate(cosine_similarities[0]):
    print(f"cosine similarity between sentence {i+1} and the query = {sim} ")
cosine similarity between sentence 1 and the query = 0.10986328125
cosine similarity between sentence 2 and the query = 0.53466796875
cosine similarity between sentence 3 and the query = 0.83544921875
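For a Siamese network with a triplet objective function, three sentences are passed to the Siamese network: an anchor, a positive, and a negative sentence. The anchor and positive sentences are semantically similar, and the anchor and negative sentences are semantically dissimilar. The objective is to minimize the distance between the anchor sentence and the positive sentence, and to maximize the distance between the anchor sentence and the negative sentence.
We will use the Wikipedia-sections-triplets dataset for fine-tuning. This dataset consists of sentences derived from the Wikipedia website. Each example contains three sentences: anchor, positive, and negative. The anchor and the positive come from the same section. The anchor and the negative come from different sections.
This dataset has 1.8 million training triplets and 220,000 test triplets. In this example, we only use 1,200 triplets for training (200 batches of 6) and 450 triplets for testing (75 batches of 6).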
!wget https://sbert.net/datasets/wikipedia-sections-triplets.zip -q
!unzip wikipedia-sections-triplets.zip -d wikipedia-sections-triplets
NUM_TRAIN_BATCHES = 200
NUM_TEST_BATCHES = 75
AUTOTUNE = tf.data.AUTOTUNE


def prepare_wiki_data(dataset, num_batches):
    # The triplet loss ignores labels, so a constant 0 is attached to each example.
    dataset = dataset.map(
        lambda z: ((z["Sentence1"], z["Sentence2"], z["Sentence3"]), 0)
    )
    dataset = dataset.batch(6)
    dataset = dataset.take(num_batches)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset


wiki_train = tf.data.experimental.make_csv_dataset(
    "wikipedia-sections-triplets/train.csv",
    batch_size=1,
    num_epochs=1,
)
wiki_test = tf.data.experimental.make_csv_dataset(
    "wikipedia-sections-triplets/test.csv",
    batch_size=1,
    num_epochs=1,
)
wiki_train = prepare_wiki_data(wiki_train, NUM_TRAIN_BATCHES)
wiki_test = prepare_wiki_data(wiki_test, NUM_TEST_BATCHES)
Archive: wikipedia-sections-triplets.zip
inflating: wikipedia-sections-triplets/validation.csv
inflating: wikipedia-sections-triplets/Readme.txt
inflating: wikipedia-sections-triplets/test.csv
inflating: wikipedia-sections-triplets/train.csv
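For this encoder model, we will use RoBERTa with mean pooling, and we will not normalize the output embeddings. The encoder model consists of a preprocessor that tokenizes the sentences and generates their padding masks, a RoBERTa backbone that produces a contextual representation for each token, and a mean pooling layer that produces the embeddings.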
preprocessor = keras_hub.models.RobertaPreprocessor.from_preset("roberta_base_en")
backbone = keras_hub.models.RobertaBackbone.from_preset("roberta_base_en")
# Named `inputs` rather than `input` to avoid shadowing the Python builtin.
inputs = keras.Input(shape=(1,), dtype="string", name="sentence")
x = preprocessor(inputs)
h = backbone(x)
embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
    h, x["padding_mask"]
)
roberta_encoder = keras.Model(inputs=inputs, outputs=embedding)
roberta_encoder.summary()
Model: "functional_3"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃ Param # ┃ Connected to         ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ sentence            │ (None, 1)         │       0 │ -                    │
│ (InputLayer)        │                   │         │                      │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ roberta_preprocess… │ [(None, 512),     │       0 │ sentence[0][0]       │
│ (RobertaPreprocess… │ (None, 512)]      │         │                      │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ roberta_backbone_1  │ (None, 512, 768)  │ 124,05… │ roberta_preprocesso… │
│ (RobertaBackbone)   │                   │         │ roberta_preprocesso… │
├─────────────────────┼───────────────────┼─────────┼──────────────────────┤
│ pooling_layer       │ (None, 768)       │       0 │ roberta_backbone_1[… │
│ (GlobalAveragePool… │                   │         │ roberta_preprocesso… │
└─────────────────────┴───────────────────┴─────────┴──────────────────────┘
Total params: 124,052,736 (473.22 MB)
Trainable params: 124,052,736 (473.22 MB)
Non-trainable params: 0 (0.00 B)
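For the Siamese network with the triplet objective function, we will build the model around a single encoder and pass the three sentences through that encoder. We get an embedding for each sentence and compute positive_dist and negative_dist, which are passed to the loss function described below.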
class TripletSiamese(keras.Model):
    def __init__(self, encoder, **kwargs):
        anchor = keras.Input(shape=(1,), dtype="string")
        positive = keras.Input(shape=(1,), dtype="string")
        negative = keras.Input(shape=(1,), dtype="string")

        ea = encoder(anchor)
        ep = encoder(positive)
        en = encoder(negative)

        # Euclidean distances from the anchor to the positive and to the negative.
        positive_dist = keras.ops.sum(keras.ops.square(ea - ep), axis=1)
        negative_dist = keras.ops.sum(keras.ops.square(ea - en), axis=1)

        positive_dist = keras.ops.sqrt(positive_dist)
        negative_dist = keras.ops.sqrt(negative_dist)

        output = keras.ops.stack([positive_dist, negative_dist], axis=0)

        super().__init__(inputs=[anchor, positive, negative], outputs=output, **kwargs)

        self.encoder = encoder

    def get_encoder(self):
        return self.encoder
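We will use a custom loss function for the triplet objective. The loss function receives the distance between the anchor and positive embeddings, positive_dist, and the distance between the anchor and negative embeddings, negative_dist, stacked together in y_pred.
We will use positive_dist and negative_dist to compute the loss such that negative_dist is larger than positive_dist by at least a specific margin. Mathematically, we minimize this loss function: max(positive_dist - negative_dist + margin, 0).
y_true is not used in this loss function. Note that we set the labels in the dataset to zero, but they are never used.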
class TripletLoss(keras.losses.Loss):
    def __init__(self, margin=1, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, y_true, y_pred):
        # y_pred stacks [positive_dist, negative_dist] along axis 0; y_true is unused.
        positive_dist, negative_dist = tf.unstack(y_pred, axis=0)

        losses = keras.ops.relu(positive_dist - negative_dist + self.margin)
        return keras.ops.mean(losses, axis=0)
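The formula max(positive_dist - negative_dist + margin, 0) can be worked through by hand on a single triplet. The sketch below is plain Python with made-up 2-dimensional embeddings; `euclidean` and `triplet_loss` are illustrative helpers, not the Keras loss class above.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(positive_dist - negative_dist + margin, 0) for one triplet."""
    positive_dist = euclidean(anchor, positive)
    negative_dist = euclidean(anchor, negative)
    return max(positive_dist - negative_dist + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]  # distance 5.0 from the anchor
hard_negative = [0.0, 1.5]  # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # 0.5
```

The easy negative is already more than one margin farther away than the positive, so it contributes no loss; the hard negative is only 0.5 farther away, so the remaining 0.5 of the margin is penalized.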
For training, we will use the custom TripletLoss() loss function and the Adam() optimizer with a learning rate of 2e-5.
roberta_triplet_siamese = TripletSiamese(roberta_encoder)
roberta_triplet_siamese.compile(
    loss=TripletLoss(),
    optimizer=keras.optimizers.Adam(2e-5),
    jit_compile=False,
)
roberta_triplet_siamese.fit(wiki_train, validation_data=wiki_test, epochs=1)
200/200 ━━━━━━━━━━━━━━━━━━━━ 128s 467ms/step - loss: 0.7822 - val_loss: 0.7126
<keras.src.callbacks.history.History at 0x7f5c3636c580>
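Let's try this model in a clustering example. Here are 6 questions: the first 3 are about learning English, and the last 3 are about working online. Let's see whether the embeddings generated by our encoder cluster them correctly.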
questions = [
    "What should I do to improve my English writting?",
    "How to be good at speaking English?",
    "How can I improve my English?",
    "How to earn money online?",
    "How do I earn money online?",
    "How to work and earn money through internet?",
]
encoder = roberta_triplet_siamese.get_encoder()
embeddings = encoder(tf.constant(questions))
kmeans = cluster.KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)
for i, label in enumerate(kmeans.labels_):
    print(f"sentence ({questions[i]}) belongs to cluster {label}")
sentence (What should I do to improve my English writting?) belongs to cluster 1
sentence (How to be good at speaking English?) belongs to cluster 1
sentence (How can I improve my English?) belongs to cluster 1
sentence (How to earn money online?) belongs to cluster 0
sentence (How do I earn money online?) belongs to cluster 0
sentence (How to work and earn money through internet?) belongs to cluster 0