Author: Sayak Paul
Date created: 2021/08/01
Last modified: 2021/08/01
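Knowledge distillation (Hinton et al.) is a technique for compressing a large model into a smaller one. It lets us reap the benefits of a high-performing large model while reducing storage and memory costs and achieving higher inference speed.
In Knowledge distillation: A good teacher is patient and consistent, Beyer et al. investigate various existing setups for performing knowledge distillation and show that all of them lead to sub-optimal performance. As a result, practitioners often settle for other alternatives (quantization, pruning, weight clustering, etc.) when developing resource-constrained production systems.
Beyer et al. investigate how to improve the student models that come out of the knowledge distillation process so that they consistently match the performance of their teacher models. In this example, we study the recipes they introduce, using the Flowers102 dataset. For reference, with these recipes the authors were able to produce a ResNet50 model that achieves 82.8% accuracy on the ImageNet-1k dataset.
If you need a refresher on knowledge distillation and want to see how it is implemented in Keras, you can refer to this example. You can also follow this example, which shows an extension of knowledge distillation applied to consistency training.
To follow this example, you will need TensorFlow 2.5 or higher as well as TensorFlow Addons, which can be installed with the command below: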
!pip install -q tensorflow-addons
from tensorflow import keras
import tensorflow_addons as tfa
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import tensorflow_datasets as tfds
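AUTO = tf.data.AUTOTUNE  # Used to dynamically adjust parallelism.
BATCH_SIZE = 64  # Assumed (not shown in the source); consistent with the 97 evaluation steps logged for 6149 test images.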
# Comes from Table 4 and "Training setup" section.
TEMPERATURE = 10 # Used to soften the logits before they go to softmax.
INIT_LR = 0.003 # Initial learning rate that will be decayed over the training period.
WEIGHT_DECAY = 0.001 # Used for regularization.
CLIP_THRESHOLD = 1.0 # Used for clipping the gradients by L2-norm.
# We will first resize the training images to a bigger size and then we will take
# random crops of a lower size.
BIGGER = 160
RESIZE = 128
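`CLIP_THRESHOLD` is later passed to the optimizer as `clipnorm`. As a rough illustration of what clipping by L2 norm means, here is a plain-Python sketch that needs no TensorFlow. The helper name `clip_by_l2_norm` is hypothetical and is not part of this example or of Keras; it only mirrors the idea of rescaling a gradient whose norm exceeds the threshold.

```python
import math


def clip_by_l2_norm(gradient, max_norm):
    """Rescale `gradient` (a list of floats) so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        # Already within the threshold: return an unchanged copy.
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


clipped = clip_by_l2_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5.0
small = clip_by_l2_norm([0.3, 0.4], max_norm=1.0)  # norm is about 0.5, left unchanged
```

Clipping keeps the direction of the gradient and only shortens it, which is why it is a common safeguard when training with aggressive augmentation.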
train_ds, validation_ds, test_ds = tfds.load(
    "oxford_flowers102", split=["train", "validation", "test"], as_supervised=True
)
print(f"Number of training examples: {train_ds.cardinality()}.")
print(f"Number of validation examples: {validation_ds.cardinality()}.")
print(f"Number of test examples: {test_ds.cardinality()}.")
Number of training examples: 1020.
Number of validation examples: 1020.
Number of test examples: 6149.
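As with any distillation technique, it is important to first train a well-performing teacher model, which is usually larger than the subsequent student model. The authors distill a BiT ResNet152x2 model (teacher) into a BiT ResNet50 model (student).
BiT stands for Big Transfer and was introduced in Big Transfer (BiT): General Visual Representation Learning. BiT variants of ResNets use Group Normalization (Wu et al.) and Weight Standardization (Qiao et al.) in place of Batch Normalization (Ioffe et al.). To limit the time it takes to run this example, we will use a BiT ResNet101x3 that has already been trained on the Flowers102 dataset. You can refer to this notebook to learn more about the training process. This model reaches 98.18% accuracy on the test set of Flowers102.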
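As a rough intuition for Weight Standardization, the sketch below normalizes the weights feeding each output unit to zero mean and unit variance. It is a plain-Python illustration under simplifying assumptions (a 2-D list of weights, a hypothetical helper name `standardize_weights`, and an `eps` chosen for the example), not the BiT implementation.

```python
import math


def standardize_weights(weights, eps=1e-10):
    """Standardize each row (the weights of one output unit) to zero mean, unit variance."""
    standardized = []
    for row in weights:
        mean = sum(row) / len(row)
        variance = sum((w - mean) ** 2 for w in row) / len(row)
        scale = math.sqrt(variance + eps)  # eps guards against division by zero
        standardized.append([(w - mean) / scale for w in row])
    return standardized


# Two output units with very different weight scales.
kernel = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
standardized_kernel = standardize_weights(kernel)
```

After standardization both rows have comparable statistics, regardless of how large the raw weights were.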
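The model weights are hosted on Kaggle as a dataset. To download the weights, follow these steps:
Create an API token from your Kaggle account. This triggers the download of a file containing your API credentials. Copy your username and key from that file, then run the following: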
import os
os.environ["KAGGLE_USERNAME"] = "" # TODO: enter your Kaggle user name here
os.environ["KAGGLE_KEY"] = "" # TODO: enter your Kaggle key here
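If you prefer not to paste the credentials by hand, the two variables can be filled from the downloaded token file. This is a minimal sketch, assuming the file is a JSON object with `username` and `key` fields; `load_kaggle_credentials` is a hypothetical helper, and the path you pass is wherever you saved the token.

```python
import json
import os


def load_kaggle_credentials(token_path):
    """Read a Kaggle API token file and export it as environment variables."""
    with open(token_path) as token_file:
        token = json.load(token_file)
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]
    return token["username"]
```

Call it once, before running the download commands, so that the Kaggle CLI can pick up the credentials from the environment.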
$ kaggle datasets download -d spsayakpaul/bitresnet101x3flowers102
$ unzip -qq bitresnet101x3flowers102.zip
This should generate a folder named T-r101x3-128, which is essentially the teacher SavedModel.
!kaggle datasets download -d spsayakpaul/bitresnet101x3flowers102
!unzip -qq bitresnet101x3flowers102.zip
# Since the teacher model is not going to be trained further we make
# it non-trainable.
teacher_model = keras.models.load_model("T-r101x3-128")  # Folder produced by the unzip step.
teacher_model.trainable = False
teacher_model.summary()
Model: "my_bi_t_model_1"
Layer (type)                 Output Shape              Param #
dense_1 (Dense)              multiple                  626790
keras_layer_1 (KerasLayer)   multiple                  381789888
Total params: 382,416,678
Trainable params: 0
Non-trainable params: 382,416,678
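MixUp is applied with its alpha parameter sampled from a uniform distribution, as in the mixup function below. MixUp is used here to help the student model capture the underlying function of the teacher model. MixUp linearly interpolates between different samples across the data manifold, so the rationale is that a student trained to fit these interpolated inputs should be able to match the teacher model better. To incorporate more invariance, MixUp is coupled with "Inception-style" cropping (Szegedy et al.). This is where the term "function matching" comes from in the original paper. In summary, one needs to be consistent and patient while training the student model.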
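To make the interpolation concrete, here is a plain-Python sketch on a toy batch: each sample is blended with the sample at the mirrored position in the batch, using a single mixing weight, which is the same pattern the `mixup` function that follows applies with `tf.reverse`. The name `mixup_batch` and the toy values are illustrative assumptions, and `alpha` is fixed rather than sampled so the result is easy to check by hand.

```python
def mixup_batch(images, alpha):
    """Blend every sample with the one at the mirrored batch position."""
    mirrored = images[::-1]
    return [
        [alpha * a + (1 - alpha) * b for a, b in zip(image, other)]
        for image, other in zip(images, mirrored)
    ]


# Three "images" with two pixels each.
batch = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mixed = mixup_batch(batch, alpha=0.25)
# mixed == [[1.5, 1.5], [1.0, 1.0], [0.5, 0.5]]
```

In the training pipeline `alpha` is drawn afresh for every batch, so the student sees a different blend at each step.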
def mixup(images, labels):
    alpha = tf.random.uniform([], 0, 1)
    mixedup_images = alpha * images + (1 - alpha) * tf.reverse(images, axis=[0])
    # The labels do not matter here since they are NOT used during
    # training.
    return mixedup_images, labels


def preprocess_image(image, label, train=True):
    image = tf.cast(image, tf.float32) / 255.0

    if train:
        image = tf.image.resize(image, (BIGGER, BIGGER))
        image = tf.image.random_crop(image, (RESIZE, RESIZE, 3))
        image = tf.image.random_flip_left_right(image)
    else:
        # Central fraction amount is from here:
        # https://git.io/J8Kda.
        image = tf.image.central_crop(image, central_fraction=0.875)
        image = tf.image.resize(image, (RESIZE, RESIZE))

    return image, label


def prepare_dataset(dataset, train=True, batch_size=BATCH_SIZE):
    if train:
        dataset = dataset.map(preprocess_image, num_parallel_calls=AUTO)
        dataset = dataset.shuffle(BATCH_SIZE * 10)
    else:
        dataset = dataset.map(
            lambda x, y: (preprocess_image(x, y, train)), num_parallel_calls=AUTO
        )
    dataset = dataset.batch(batch_size)

    # MixUp operates on whole batches, so it is applied after batching.
    if train:
        dataset = dataset.map(mixup, num_parallel_calls=AUTO)

    dataset = dataset.prefetch(AUTO)
    return dataset
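Note that for brevity, we used mild crops for the training set, but in practice "Inception-style" preprocessing should be applied. You can refer to this script for a closer implementation. Also, the ground-truth labels are not used for training the student.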
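For readers who want to move closer to "Inception-style" cropping, the sketch below samples a crop box whose area and aspect ratio are drawn at random. It is an assumption-laden illustration: the helper name `sample_inception_crop`, the area range of 8% to 100%, the aspect-ratio range of 3/4 to 4/3, and the retry count are choices made for this sketch (commonly used defaults, not values taken from this example).

```python
import math
import random


def sample_inception_crop(
    height,
    width,
    rng,
    area_range=(0.08, 1.0),  # assumed fraction of the image area to keep
    ratio_range=(3 / 4, 4 / 3),  # assumed width/height aspect-ratio range
    max_attempts=10,
):
    """Return (top, left, crop_height, crop_width) for a random crop box."""
    for _ in range(max_attempts):
        target_area = rng.uniform(*area_range) * height * width
        # Sample the aspect ratio uniformly in log space.
        log_ratio = rng.uniform(math.log(ratio_range[0]), math.log(ratio_range[1]))
        aspect = math.exp(log_ratio)
        crop_width = int(round(math.sqrt(target_area * aspect)))
        crop_height = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_width <= width and 0 < crop_height <= height:
            top = rng.randint(0, height - crop_height)
            left = rng.randint(0, width - crop_width)
            return top, left, crop_height, crop_width
    # Fall back to the full image if no valid box was found.
    return 0, 0, height, width


box = sample_inception_crop(160, 160, random.Random(0))
```

The returned box could then be cut out and resized to `RESIZE` x `RESIZE`, for example with `tf.image.crop_to_bounding_box` followed by `tf.image.resize`.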
train_ds = prepare_dataset(train_ds, True)
validation_ds = prepare_dataset(validation_ds, False)
test_ds = prepare_dataset(test_ds, False)
sample_images, _ = next(iter(train_ds))
plt.figure(figsize=(10, 10))
for n in range(25):
    ax = plt.subplot(5, 5, n + 1)
    plt.imshow(sample_images[n].numpy())
    plt.axis("off")
plt.show()
For the purpose of this example, we will use the standard ResNet50V2 (He et al.).
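def get_resnetv2():
    resnet_v2 = keras.applications.ResNet50V2(
        weights=None,  # Random initialization; ImageNet weights require classes=1000.
        input_shape=(RESIZE, RESIZE, 3),
        classes=102,  # Flowers102 has 102 categories.
        classifier_activation="linear",  # Output logits; softmax is applied in the distiller.
    )
    return resnet_v2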
Compared to the teacher model, this model has 358 million fewer parameters.
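The parameter counts in the model summaries can be sanity-checked with a little arithmetic. A dense layer with a bias has `features * classes + classes` parameters, so the teacher's 626,790-parameter head over 102 classes implies its feature width. This is only a back-of-the-envelope sketch: `dense_params` is a hypothetical helper, and the student size below is derived from the rounded 358 million figure, so it is approximate.

```python
NUM_CLASSES = 102  # Flowers102 categories


def dense_params(features, classes):
    """Parameter count of a dense layer with a bias term."""
    return features * classes + classes


teacher_total = 382_416_678  # "Total params" in the teacher summary
teacher_head = 626_790  # `dense_1` in the teacher summary

# Solve features * classes + classes = head for the feature width.
teacher_features = (teacher_head - NUM_CLASSES) // NUM_CLASSES

# Rough student size implied by "358 million fewer parameters".
approx_student_total = teacher_total - 358_000_000

# A 2048-wide feature vector gives the 208,998-parameter `head/dense`
# layer listed in the pre-trained student summary.
student_head = dense_params(2048, NUM_CLASSES)
```

Under these numbers the teacher's classifier sits on 6144 features, three times the 2048 features of a ResNet50-sized student.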
We will reuse some code from this example on knowledge distillation.
class Distiller(tf.keras.Model):
    def __init__(self, student, teacher):
        super().__init__()
        self.student = student
        self.teacher = teacher
        self.loss_tracker = keras.metrics.Mean(name="distillation_loss")

    @property
    def metrics(self):
        # Expose the loss tracker so that it is reset at the start of each epoch.
        metrics = super().metrics
        metrics.append(self.loss_tracker)
        return metrics

    def compile(
        self, optimizer, metrics, distillation_loss_fn, temperature=TEMPERATURE,
    ):
        super().compile(optimizer=optimizer, metrics=metrics)
        self.distillation_loss_fn = distillation_loss_fn
        self.temperature = temperature

    def train_step(self, data):
        # Unpack data
        x, _ = data

        # Forward pass of teacher
        teacher_predictions = self.teacher(x, training=False)

        with tf.GradientTape() as tape:
            # Forward pass of student
            student_predictions = self.student(x, training=True)

            # Compute loss
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_predictions / self.temperature, axis=1),
                tf.nn.softmax(student_predictions / self.temperature, axis=1),
            )

        # Compute gradients
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(distillation_loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Report progress
        self.loss_tracker.update_state(distillation_loss)
        return {"distillation_loss": self.loss_tracker.result()}

    def test_step(self, data):
        # Unpack data
        x, y = data

        # Forward passes
        teacher_predictions = self.teacher(x, training=False)
        student_predictions = self.student(x, training=False)

        # Calculate the loss
        distillation_loss = self.distillation_loss_fn(
            tf.nn.softmax(teacher_predictions / self.temperature, axis=1),
            tf.nn.softmax(student_predictions / self.temperature, axis=1),
        )

        # Report progress
        self.loss_tracker.update_state(distillation_loss)
        self.compiled_metrics.update_state(y, student_predictions)
        results = {m.name: m.result() for m in self.metrics}
        return results
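To see what the loss in `train_step` measures, the sketch below recomputes it by hand in plain Python: both sets of logits are softened with the temperature, then compared with a divergence (teacher first, student second). It assumes KL divergence as the `distillation_loss_fn`; the helper names and the toy logits are illustrative, and the `eps` clipping is a simplification.

```python
import math


def softened_softmax(logits, temperature):
    """Softmax over logits divided by `temperature` (higher means flatter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, prediction, eps=1e-7):
    """KL(target || prediction) for two discrete distributions."""
    return sum(
        t * math.log(max(t, eps) / max(p, eps))
        for t, p in zip(target, prediction)
    )


teacher_logits = [8.0, 2.0, 0.5]
student_logits = [1.0, 3.0, 0.2]

teacher_probs = softened_softmax(teacher_logits, temperature=10)
student_probs = softened_softmax(student_logits, temperature=10)
distillation_loss = kl_divergence(teacher_probs, student_probs)
```

The loss is zero only when the student's softened distribution matches the teacher's, which is the sense in which the student is trained to match the teacher's function rather than the labels.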
# Some code is taken from:
# https://www.kaggle.com/ashusma/training-rfcx-tensorflow-tpu-effnet-b2.
class WarmUpCosine(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(
        self, learning_rate_base, total_steps, warmup_learning_rate, warmup_steps
    ):
        super().__init__()

        self.learning_rate_base = learning_rate_base
        self.total_steps = total_steps
        self.warmup_learning_rate = warmup_learning_rate
        self.warmup_steps = warmup_steps
        self.pi = tf.constant(np.pi)

    def __call__(self, step):
        if self.total_steps < self.warmup_steps:
            raise ValueError("Total_steps must be larger or equal to warmup_steps.")

        # Cosine decay from the base rate (end of warmup) down to zero.
        cos_annealed_lr = tf.cos(
            self.pi
            * (tf.cast(step, tf.float32) - self.warmup_steps)
            / float(self.total_steps - self.warmup_steps)
        )
        learning_rate = 0.5 * self.learning_rate_base * (1 + cos_annealed_lr)

        if self.warmup_steps > 0:
            if self.learning_rate_base < self.warmup_learning_rate:
                raise ValueError(
                    "Learning_rate_base must be larger or equal to "
                    "warmup_learning_rate."
                )
            # Linear ramp from the warmup rate to the base rate.
            slope = (
                self.learning_rate_base - self.warmup_learning_rate
            ) / self.warmup_steps
            warmup_rate = slope * tf.cast(step, tf.float32) + self.warmup_learning_rate
            learning_rate = tf.where(
                step < self.warmup_steps, warmup_rate, learning_rate
            )
        return tf.where(
            step > self.total_steps, 0.0, learning_rate, name="learning_rate"
        )
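# Assumed settings (not shown in the source): TOTAL_STEPS mirrors the paper's
# setup of 1000 epochs at a batch size of 512, and the warmup values are
# placeholders to tune.
DATASET_NUM_TRAIN_EXAMPLES = 1020  # Size of the Flowers102 training split.
TOTAL_STEPS = int(DATASET_NUM_TRAIN_EXAMPLES / 512 * 1000)
scheduled_lrs = WarmUpCosine(
    learning_rate_base=INIT_LR,
    total_steps=TOTAL_STEPS,
    warmup_learning_rate=0.0,
    warmup_steps=1500,
)

lrs = [scheduled_lrs(step) for step in range(TOTAL_STEPS)]
plt.plot(lrs)
plt.xlabel("Step", fontsize=14)
plt.ylabel("LR", fontsize=14)
plt.show()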
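The original paper uses at least 1000 epochs and a batch size of 512 to perform "function matching". The objective of this example is to present a workflow for implementing the recipe, not to demonstrate the results obtained when it is applied at full scale. However, these recipes will transfer to the original settings from the paper. Please refer to this repository if you are interested in finding out more.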
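The shape of the schedule is easier to check on a small example. The function below is a scalar, plain-Python restatement of the warmup-then-cosine rule implemented by `WarmUpCosine`; the name `warmup_cosine_lr` and the step counts in the usage lines are made up for illustration and are not the settings used for training.

```python
import math


def warmup_cosine_lr(step, base_lr, total_steps, warmup_lr, warmup_steps):
    """Linear warmup to `base_lr`, then cosine decay to zero at `total_steps`."""
    if step > total_steps:
        return 0.0
    if warmup_steps > 0 and step < warmup_steps:
        slope = (base_lr - warmup_lr) / warmup_steps
        return slope * step + warmup_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


# Toy settings: 100 warmup steps out of 1000 total, peak rate 0.003.
lr_start = warmup_cosine_lr(0, 0.003, 1000, 0.0, 100)  # start of warmup
lr_peak = warmup_cosine_lr(100, 0.003, 1000, 0.0, 100)  # end of warmup
lr_end = warmup_cosine_lr(1000, 0.003, 1000, 0.0, 100)  # end of decay
```

The rate climbs linearly to its peak during warmup and then follows half a cosine wave down to zero, so a run that stops early never sees the low learning rates at the end of the schedule.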
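optimizer = tfa.optimizers.AdamW(
    weight_decay=WEIGHT_DECAY, learning_rate=scheduled_lrs, clipnorm=CLIP_THRESHOLD
)

student_model = get_resnetv2()

distiller = Distiller(student=student_model, teacher=teacher_model)
# Assumed settings (not shown in the source): KL divergence as the distillation
# loss and sparse categorical accuracy as the metric.
distiller.compile(
    optimizer,
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    distillation_loss_fn=keras.losses.KLDivergence(),
    temperature=TEMPERATURE,
)

history = distiller.fit(
    train_ds,
    steps_per_epoch=int(np.ceil(DATASET_NUM_TRAIN_EXAMPLES / BATCH_SIZE)),
    validation_data=validation_ds,
    epochs=30,  # This should be at least 1000.
)

student = distiller.student
# Compile with a metric only, so that `evaluate` returns (loss, accuracy).
student_model.compile(metrics=["accuracy"])
_, top1_accuracy = student.evaluate(test_ds)
print(f"Top-1 accuracy on the test set: {round(top1_accuracy * 100, 2)}%")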
Epoch 1/30
97/97 [==============================] - 7s 64ms/step - loss: 0.0000e+00 - accuracy: 0.0107
Top-1 accuracy on the test set: 1.07%
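With just 30 epochs of training, the results are nowhere near what is expected. This is where the benefits of patience (that is, a longer training schedule) come into play. Let's investigate what a model trained for 1000 epochs can do.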
# Download the pre-trained weights.
!wget https://git.io/JBO3Y -O S-r50x1-128-1000.tar.gz
!tar xf S-r50x1-128-1000.tar.gz
pretrained_student = keras.models.load_model("S-r50x1-128-1000")
pretrained_student.summary()
Model: "resnet"
Layer (type)                 Output Shape              Param #
root_block (Sequential)      (None, 32, 32, 64)        9408
block1 (Sequential)          (None, 32, 32, 256)       214912
block2 (Sequential)          (None, 16, 16, 512)       1218048
block3 (Sequential)          (None, 8, 8, 1024)        7095296
block4 (Sequential)          (None, 4, 4, 2048)        14958592
group_norm (GroupNormalizati multiple                  4096
re_lu_97 (ReLU)              multiple                  0
global_average_pooling2d_1 ( multiple                  0
head/dense (Dense)           multiple                  208998
Total params: 23,709,350
Trainable params: 23,709,350
Non-trainable params: 0
_, top1_accuracy = pretrained_student.evaluate(test_ds)
print(f"Top-1 accuracy on the test set: {round(top1_accuracy * 100, 2)}%")
97/97 [==============================] - 14s 131ms/step - loss: 0.0000e+00 - accuracy: 0.8102
Top-1 accuracy on the test set: 81.02%
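With 100000 epochs of training, this same model reaches a top-1 accuracy of 95.54%.
With TPU-based hardware infrastructure, we can train the model for 1000 epochs faster. This does not even require many changes to this codebase. You are encouraged to check out this repository, as it presents TPU-compatible training workflows for these recipes and can be run on Kaggle Kernels using their free TPU v3-8 hardware.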