► 代码示例 / 计算机视觉 / 基于 SimSiam 的自监督对比学习

基于 SimSiam 的自监督对比学习

作者： Sayak Paul
创建日期 2021/03/19
最后修改日期 2023/12/29
描述： 在计算机视觉中实现一种自监督学习方法。

ⓘ 本示例使用 Keras 2

自监督学习（SSL）是表示学习领域一个有趣的研究分支。SSL 系统试图从未标注的数据点集合中构建监督信号。例如，我们可以训练一个深度神经网络来预测给定词序列中的下一个词。在文献中，这些任务被称为*借口任务（pretext tasks）*或*辅助任务（auxiliary tasks）*。如果我们在大型数据集（例如维基百科文本语料库）上训练这样的网络，它能学习到非常有效的表示，并且这些表示能够很好地迁移到下游任务。BERT、GPT-3、ELMo 等语言模型都得益于此。

与语言模型非常相似，我们可以使用类似的方法训练计算机视觉模型。为了让计算机视觉中的方法奏效，我们需要设计学习任务，使得底层模型（深度神经网络）能够理解视觉数据中存在的语义信息。其中一个任务是让模型*对比*同一图像的两个不同版本。希望通过这种方式，模型能够学习到将相似图像尽可能分到一起，而将不相似图像分得更远的表示。

在本示例中，我们将实现一种名为 SimSiam 的系统，该系统在《探索简单的 Siamese 表示学习》论文中提出。其实现方式如下：

我们使用随机数据增强流水线创建同一数据集的两个不同版本。请注意，在创建这些版本时，随机初始化种子需要相同。
我们使用一个没有分类头的 ResNet（**主干网络**），并在其顶部添加一个浅层全连接网络（**投影头**）。两者合起来称为**编码器**。
然后，我们将编码器的输出通过一个**预测器**，该预测器也是一个具有类似**自编码器**结构的浅层全连接网络。
然后，我们训练编码器以最大化数据集两个不同版本之间的余弦相似度。

设置

import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import keras_cv
from keras import ops

import matplotlib.pyplot as plt
import numpy as np

定义超参数

AUTO = tf.data.AUTOTUNE
BATCH_SIZE = 128
EPOCHS = 5
CROP_TO = 32
SEED = 26

PROJECT_DIM = 2048
LATENT_DIM = 512
WEIGHT_DECAY = 0.0005

加载 CIFAR-10 数据集

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
print(f"Total training examples: {len(x_train)}")
print(f"Total test examples: {len(x_test)}")

Total training examples: 50000
Total test examples: 10000

定义数据增强流水线

正如 SimCLR 研究所示，拥有正确的数据增强流水线对于 SSL 系统在计算机视觉中有效工作至关重要。其中两种似乎最重要的增强变换是：1.) 随机调整大小的裁剪和 2.) 颜色失真。大多数其他计算机视觉的 SSL 系统（如 BYOL、MoCoV2、SwAV 等）都在其训练流水线中包含这些。

strength = [0.4, 0.4, 0.4, 0.1]

random_flip = layers.RandomFlip(mode="horizontal_and_vertical")
random_crop = layers.RandomCrop(CROP_TO, CROP_TO)
random_brightness = layers.RandomBrightness(0.8 * strength[0])
random_contrast = layers.RandomContrast((1 - 0.8 * strength[1], 1 + 0.8 * strength[1]))
random_saturation = keras_cv.layers.RandomSaturation(
    (0.5 - 0.8 * strength[2], 0.5 + 0.8 * strength[2])
)
random_hue = keras_cv.layers.RandomHue(0.2 * strength[3], [0,255])
grayscale = keras_cv.layers.Grayscale()

def flip_random_crop(image):
    # With random crops we also apply horizontal flipping.
    image = random_flip(image)
    image = random_crop(image)
    return image


def color_jitter(x, strength=[0.4, 0.4, 0.3, 0.1]):
    x = random_brightness(x)
    x = random_contrast(x)
    x = random_saturation(x)
    x = random_hue(x)
    # Affine transformations can disturb the natural range of
    # RGB images, hence this is needed.
    x = ops.clip(x, 0, 255)
    return x


def color_drop(x):
    x = grayscale(x)
    x = ops.tile(x, [1, 1, 3])
    return x


def random_apply(func, x, p):
    if keras.random.uniform([], minval=0, maxval=1) < p:
        return func(x)
    else:
        return x


def custom_augment(image):
    # As discussed in the SimCLR paper, the series of augmentation
    # transformations (except for random crops) need to be applied
    # randomly to impose translational invariance.
    image = flip_random_crop(image)
    image = random_apply(color_jitter, image, p=0.8)
    image = random_apply(color_drop, image, p=0.2)
    return image

应该注意的是，数据增强流水线通常取决于我们处理的数据集的各种属性。例如，如果数据集中的图像高度以物体为中心，那么以非常高的概率进行随机裁剪可能会损害训练性能。

现在，我们将数据增强流水线应用于数据集，并可视化一些输出。

将数据转换为 TensorFlow Dataset 对象

在这里，我们创建了数据集的两个不同版本，没有任何真实标签。

ssl_ds_one = tf.data.Dataset.from_tensor_slices(x_train)
ssl_ds_one = (
    ssl_ds_one.shuffle(1024, seed=SEED)
    .map(custom_augment, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

ssl_ds_two = tf.data.Dataset.from_tensor_slices(x_train)
ssl_ds_two = (
    ssl_ds_two.shuffle(1024, seed=SEED)
    .map(custom_augment, num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

# We then zip both of these datasets.
ssl_ds = tf.data.Dataset.zip((ssl_ds_one, ssl_ds_two))

# Visualize a few augmented images.
sample_images_one = next(iter(ssl_ds_one))
plt.figure(figsize=(10, 10))
for n in range(25):
    ax = plt.subplot(5, 5, n + 1)
    plt.imshow(sample_images_one[n].numpy().astype("int"))
    plt.axis("off")
plt.show()

# Ensure that the different versions of the dataset actually contain
# identical images.
sample_images_two = next(iter(ssl_ds_two))
plt.figure(figsize=(10, 10))
for n in range(25):
    ax = plt.subplot(5, 5, n + 1)
    plt.imshow(sample_images_two[n].numpy().astype("int"))
    plt.axis("off")
plt.show()

png

请注意，samples_images_one 和 sample_images_two 中的图像本质上是相同的，但采用了不同的增强方式。

定义编码器和预测器

我们使用了专门为 CIFAR10 数据集配置的 ResNet20 实现。代码取自 keras-idiomatic-programmer 仓库。这些架构的超参数参考了原始论文的第 3 节和附录 A。

!wget -q https://git.io/JYx2x -O resnet_cifar10_v2.py

import resnet_cifar10_v2

N = 2
DEPTH = N * 9 + 2
NUM_BLOCKS = ((DEPTH - 2) // 9) - 1


def get_encoder():
    # Input and backbone.
    inputs = layers.Input((CROP_TO, CROP_TO, 3))
    x = layers.Rescaling(scale=1.0 / 127.5, offset=-1)(
        inputs
    )
    x = resnet_cifar10_v2.stem(x)
    x = resnet_cifar10_v2.learner(x, NUM_BLOCKS)
    x = layers.GlobalAveragePooling2D(name="backbone_pool")(x)

    # Projection head.
    x = layers.Dense(
        PROJECT_DIM, use_bias=False, kernel_regularizer=regularizers.l2(WEIGHT_DECAY)
    )(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Dense(
        PROJECT_DIM, use_bias=False, kernel_regularizer=regularizers.l2(WEIGHT_DECAY)
    )(x)
    outputs = layers.BatchNormalization()(x)
    return keras.Model(inputs, outputs, name="encoder")


def get_predictor():
    model = keras.Sequential(
        [
            # Note the AutoEncoder-like structure.
            layers.Input((PROJECT_DIM,)),
            layers.Dense(
                LATENT_DIM,
                use_bias=False,
                kernel_regularizer=regularizers.l2(WEIGHT_DECAY),
            ),
            layers.ReLU(),
            layers.BatchNormalization(),
            layers.Dense(PROJECT_DIM),
        ],
        name="predictor",
    )
    return model

定义（预）训练循环

采用这类方法训练网络的主要原因之一是利用学到的表示进行下游任务，如分类。因此，这个特定的训练阶段也称为*预训练*。

我们首先定义损失函数。

def compute_loss(p, z):
    # The authors of SimSiam emphasize the impact of
    # the `stop_gradient` operator in the paper as it
    # has an important role in the overall optimization.
    z = ops.stop_gradient(z)
    p = keras.utils.normalize(p, axis=1, order=2)
    z = keras.utils.normalize(z, axis=1, order=2)
    # Negative cosine similarity (minimizing this is
    # equivalent to maximizing the similarity).
    return -ops.mean(ops.sum((p * z), axis=1))

然后，通过重写 keras.Model 类的 train_step() 函数来定义我们的训练循环。

class SimSiam(keras.Model):
    def __init__(self, encoder, predictor):
        super().__init__()
        self.encoder = encoder
        self.predictor = predictor
        self.loss_tracker = keras.metrics.Mean(name="loss")

    @property
    def metrics(self):
        return [self.loss_tracker]

    def train_step(self, data):
        # Unpack the data.
        ds_one, ds_two = data

        # Forward pass through the encoder and predictor.
        with tf.GradientTape() as tape:
            z1, z2 = self.encoder(ds_one), self.encoder(ds_two)
            p1, p2 = self.predictor(z1), self.predictor(z2)
            # Note that here we are enforcing the network to match
            # the representations of two differently augmented batches
            # of data.
            loss = compute_loss(p1, z2) / 2 + compute_loss(p2, z1) / 2

        # Compute gradients and update the parameters.
        learnable_params = (
            self.encoder.trainable_variables + self.predictor.trainable_variables
        )
        gradients = tape.gradient(loss, learnable_params)
        self.optimizer.apply_gradients(zip(gradients, learnable_params))

        # Monitor loss.
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

预训练网络

为了本示例的目的，我们将模型仅训练 5 个周期（epoch）。实际上，这应该至少是 100 个周期。

# Create a cosine decay learning scheduler.
num_training_samples = len(x_train)
steps = EPOCHS * (num_training_samples // BATCH_SIZE)
lr_decayed_fn = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.03, decay_steps=steps
)

# Create an early stopping callback.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="loss", patience=5, restore_best_weights=True
)

# Compile model and start training.
simsiam = SimSiam(get_encoder(), get_predictor())
simsiam.compile(optimizer=keras.optimizers.SGD(lr_decayed_fn, momentum=0.6))
history = simsiam.fit(ssl_ds, epochs=EPOCHS, callbacks=[early_stopping])

# Visualize the training progress of the model.
plt.plot(history.history["loss"])
plt.grid()
plt.title("Negative Cosine Similairty")
plt.show()

Epoch 1/5
391/391 [==============================] - 33s 42ms/step - loss: -0.8973
Epoch 2/5
391/391 [==============================] - 16s 40ms/step - loss: -0.9129
Epoch 3/5
391/391 [==============================] - 16s 40ms/step - loss: -0.9165
Epoch 4/5
391/391 [==============================] - 16s 40ms/step - loss: -0.9176
Epoch 5/5
391/391 [==============================] - 16s 40ms/step - loss: -0.9182

png

如果您的解决方案在使用不同的数据集和不同的主干网络架构时，损失很快接近 -1（我们损失的最小值），这很可能是由于表示崩溃（representation collapse）引起的。这是一种编码器对所有图像产生相似输出的现象。在这种情况下，需要进行额外的超参数调优，尤其是在以下方面：

颜色失真的强度及其概率。
学习率及其调度。
主干网络及其投影头的架构。

评估我们的 SSL 方法

在计算机视觉中评估 SSL 方法（或任何其他此类预训练方法）最常用的方法是，在训练好的主干模型（本例中为 ResNet20）的冻结特征上训练一个线性分类器，并在未见过的图像上评估该分类器。其他方法包括在源数据集上或甚至在只有 5% 或 10% 标签的目标数据集上进行微调。实际上，我们可以将主干模型用于任何下游任务，例如语义分割、目标检测等，其中主干模型通常是使用纯监督学习进行预训练的。

# We first create labeled `Dataset` objects.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))

# Then we shuffle, batch, and prefetch this dataset for performance. We
# also apply random resized crops as an augmentation but only to the
# training set.
train_ds = (
    train_ds.shuffle(1024)
    .map(lambda x, y: (flip_random_crop(x), y), num_parallel_calls=AUTO)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)
test_ds = test_ds.batch(BATCH_SIZE).prefetch(AUTO)

# Extract the backbone ResNet20.
backbone = keras.Model(
    simsiam.encoder.input, simsiam.encoder.get_layer("backbone_pool").output
)

# We then create our linear classifier and train it.
backbone.trainable = False
inputs = layers.Input((CROP_TO, CROP_TO, 3))
x = backbone(inputs, training=False)
outputs = layers.Dense(10, activation="softmax")(x)
linear_model = keras.Model(inputs, outputs, name="linear_model")

# Compile model and start training.
linear_model.compile(
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
    optimizer=keras.optimizers.SGD(lr_decayed_fn, momentum=0.9),
)
history = linear_model.fit(
    train_ds, validation_data=test_ds, epochs=EPOCHS, callbacks=[early_stopping]
)
_, test_acc = linear_model.evaluate(test_ds)
print("Test accuracy: {:.2f}%".format(test_acc * 100))

Epoch 1/5
391/391 [==============================] - 7s 11ms/step - loss: 3.8072 - accuracy: 0.1527 - val_loss: 3.7449 - val_accuracy: 0.2046
Epoch 2/5
391/391 [==============================] - 3s 8ms/step - loss: 3.7356 - accuracy: 0.2107 - val_loss: 3.7055 - val_accuracy: 0.2308
Epoch 3/5
391/391 [==============================] - 3s 8ms/step - loss: 3.7036 - accuracy: 0.2228 - val_loss: 3.6874 - val_accuracy: 0.2329
Epoch 4/5
391/391 [==============================] - 3s 8ms/step - loss: 3.6893 - accuracy: 0.2276 - val_loss: 3.6808 - val_accuracy: 0.2334
Epoch 5/5
391/391 [==============================] - 3s 9ms/step - loss: 3.6845 - accuracy: 0.2305 - val_loss: 3.6798 - val_accuracy: 0.2339
79/79 [==============================] - 1s 7ms/step - loss: 3.6798 - accuracy: 0.2339
Test accuracy: 23.39%

注意事项

更多数据和更长的预训练计划通常对 SSL 有益。
当您只有非常有限的*带标签*训练数据，但能够构建大型的未标注数据语料库时，SSL 特别有用。最近，Facebook 的一组研究人员使用名为 SwAV 的 SSL 方法，在 20 亿张图像上训练了一个 RegNet。他们下游任务的性能非常接近于纯监督预训练达到的性能。对于某些下游任务，他们的方法甚至优于有监督的方法。您可以查看他们的论文了解详情。
如果您有兴趣了解为什么对比式 SSL 有助于网络学习有意义的表示，您可以查看以下资源：
- 自监督学习：智能的暗物质
- 使用已知结构的受控数据集理解自监督学习