
Audio Classification with Hugging Face Transformers

Author: Sreyan Ghosh
Date created: 2022/07/01
Last modified: 2022/08/27
Description: Training Wav2Vec 2.0 using Hugging Face Transformers for Audio Classification.

ⓘ This example uses Keras 2


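Introduction

Identification of speech commands, also known as *keyword spotting* (KWS), is important from an engineering perspective for a wide range of applications, from indexing audio databases and indexing keywords, to running speech models locally on microcontrollers. Currently, many human-computer interfaces (HCI) such as Google Assistant, Microsoft Cortana, Amazon Alexa, and Apple Siri rely on keyword spotting. All major companies, notably Google and Baidu, have carried out a significant amount of research in this field.

In the past decade, deep learning has led to significant performance gains on this task. Low-level audio features extracted from raw audio, such as MFCC or mel-filterbanks, have been used for decades, but these hand-designed features carry built-in biases. Moreover, deep learning models trained on these low-level features can easily overfit to noise or to signals irrelevant to the task. This makes it essential for any system to learn speech representations that expose the acoustic and linguistic content of the speech signal (high-level information such as phonemes, words, semantic meaning, tone, and speaker characteristics) so that the downstream task can be solved. Wav2Vec 2.0, which solves a self-supervised contrastive learning task to learn high-level speech representations, provides a great alternative to traditional low-level features for training deep learning models for KWS.

In this notebook, we train the Wav2Vec 2.0 (base) model, built on the Hugging Face Transformers library, in an end-to-end fashion on the keyword spotting task and achieve state-of-the-art results on the Google Speech Commands Dataset.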

Setup

Installing the requirements

pip install git+https://github.com/huggingface/transformers.git
pip install datasets
pip install huggingface-hub
pip install joblib
pip install librosa

Importing the necessary libraries

import random
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
# Set random seed
tf.keras.utils.set_random_seed(42)

Define certain variables

# Maximum duration of the input audio file we feed to our Wav2Vec 2.0 model.
MAX_DURATION = 1
# Sampling rate is the number of samples of audio recorded every second
SAMPLING_RATE = 16000
BATCH_SIZE = 32  # Batch-size for training and evaluating our model.
NUM_CLASSES = 10  # Number of classes our dataset will have (10 keyword classes in our case).
HIDDEN_DIM = 768  # Dimension of our model output (768 in case of Wav2Vec 2.0 - Base).
MAX_SEQ_LENGTH = MAX_DURATION * SAMPLING_RATE  # Maximum length of the input audio file.
# Wav2Vec 2.0 results in an output frequency with a stride of about 20ms.
MAX_FRAMES = 49
MAX_EPOCHS = 2  # Maximum number of training epochs.

MODEL_CHECKPOINT = "facebook/wav2vec2-base"  # Name of pretrained model from Hugging Face Model Hub
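
The value MAX_FRAMES = 49 can be derived from the convolutional feature encoder of Wav2Vec 2.0. The sketch below is a rough check, assuming the default base configuration of seven unpadded 1D convolutions with kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2). The helper name conv_output_frames is ours and not part of any library.

```python
# Sketch: number of output frames produced by the Wav2Vec 2.0 feature encoder.
# Assumption: default base configuration, seven unpadded 1D convolutions.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_frames(num_samples):
    """Apply the unpadded convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# The overall stride is the product of the per-layer strides.
total_stride = 1
for stride in CONV_STRIDES:
    total_stride *= stride

print(conv_output_frames(16000))  # 49 frames for 1 second at 16 kHz
print(total_stride * 1000 / 16000)  # 20.0 ms between consecutive frames
```

Under these assumptions a 1-second clip at 16 kHz yields 49 frames, one every 20 ms, which matches the constants defined above.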

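Load the Google Speech Commands Dataset

We now download the Google Speech Commands V1 Dataset, a popular benchmark for training and evaluating deep learning models built for solving the KWS task. The dataset consists of a total of 60,973 audio files, each 1 second long, divided into ten classes of keywords ("Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", and "Go"), a class for silence, and an unknown class to include the false positives. We load the dataset from Hugging Face Datasets. This can be done easily with the load_dataset function.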

from datasets import load_dataset

speech_commands_v1 = load_dataset("superb", "ks")

The dataset has the following fields:

file: the path to the raw .wav file of the audio
audio: the audio file sampled at 16kHz
label: label ID of the audio utterance

print(speech_commands_v1)

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    validation: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 6798
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})

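Data Pre-processing

For the sake of demonstrating the workflow, in this notebook we only take small stratified balanced splits (50%) of the train set as our training and test sets. We can easily split the dataset using the train_test_split method, which expects the split size and the name of the column you want to stratify by.

After splitting the dataset, we remove the unknown and silence classes and focus only on the ten main classes. The filter method does that easily for you.

Next we trim our train and test splits to a multiple of BATCH_SIZE to facilitate smooth training and inference (the pooling function defined later assumes full batches of exactly BATCH_SIZE examples). You can achieve that using the select method, which expects the indices of the samples you want to keep. All remaining samples are discarded.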

speech_commands_v1 = speech_commands_v1["train"].train_test_split(
    train_size=0.5, test_size=0.5, stratify_by_column="label"
)

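# Collect the IDs of the two classes to drop, then keep every example whose
# label is not one of them.
label_names = speech_commands_v1["train"].features["label"].names
excluded_ids = {label_names.index("_unknown_"), label_names.index("_silence_")}

speech_commands_v1 = speech_commands_v1.filter(
    lambda x: x["label"] not in excluded_ids
)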

speech_commands_v1["train"] = speech_commands_v1["train"].select(
    range((len(speech_commands_v1["train"]) // BATCH_SIZE) * BATCH_SIZE)
)
speech_commands_v1["test"] = speech_commands_v1["test"].select(
    range((len(speech_commands_v1["test"]) // BATCH_SIZE) * BATCH_SIZE)
)

print(speech_commands_v1)

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 896
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 896
    })
})
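
The trimming step is plain integer arithmetic. As a minimal sketch, using a hypothetical split of 900 examples (an invented figure, not one taken from the dataset) and a helper name of our own:

```python
BATCH_SIZE = 32


def largest_batch_multiple(num_examples, batch_size=BATCH_SIZE):
    """Largest count <= num_examples that is divisible by batch_size."""
    return (num_examples // batch_size) * batch_size


num_examples = 900  # hypothetical split size, for illustration only
kept = largest_batch_multiple(num_examples)
print(kept, num_examples - kept)  # 896 kept, 4 discarded
```

Keeping only full batches matters here because the mean_pool function defined later broadcasts to a fixed shape of (BATCH_SIZE, MAX_FRAMES, HIDDEN_DIM).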

Additionally, you can check the actual labels corresponding to each label ID.

labels = speech_commands_v1["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

print(id2label)

{'0': 'yes', '1': 'no', '2': 'up', '3': 'down', '4': 'left', '5': 'right', '6': 'on', '7': 'off', '8': 'stop', '9': 'go', '10': '_silence_', '11': '_unknown_'}

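Before we can feed the audio utterance samples to our model, we need to pre-process them. This is done by a Hugging Face Transformers "Feature Extractor", which prepares raw waveforms for the model: it checks that the sampling rate you pass matches the one the model expects, pads or truncates the inputs, and generates the other inputs that the model requires, such as the attention mask. Resampling itself, if needed, is handled by the audio column of Hugging Face Datasets.

To do all of this, we instantiate our Feature Extractor with AutoFeatureExtractor.from_pretrained, which ensures that:

We get a Feature Extractor that corresponds to the model architecture we want to use.
We download the config that was used when pretraining this specific checkpoint.

This is cached locally so that it is not downloaded again the next time we run the cell.

The from_pretrained() method expects the name of a model from the Hugging Face Hub. This is exactly the value of MODEL_CHECKPOINT, so we just pass that.

We write a simple function that handles the pre-processing in a way that is compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

Call the audio column to load and, if necessary, resample the audio file.
Check that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information on the Wav2Vec 2.0 model card.
Set a maximum input length so that inputs longer than MAX_SEQ_LENGTH are truncated and all inputs can be batched at a common length.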

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    MODEL_CHECKPOINT, return_attention_mask=True
)


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=MAX_SEQ_LENGTH,
        truncation=True,
        padding=True,
    )
    return inputs


# This line will pre-process our speech_commands_v1 dataset. We also remove the "audio"
# and "file" columns as they will be of no use to us while training.
processed_speech_commands_v1 = speech_commands_v1.map(
    preprocess_function, remove_columns=["audio", "file"], batched=True
)

# Load the whole dataset splits as a dict of numpy arrays
train = processed_speech_commands_v1["train"].shuffle(seed=42).with_format("numpy")[:]
test = processed_speech_commands_v1["test"].shuffle(seed=42).with_format("numpy")[:]
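
To make the role of max_length, truncation, and padding more concrete, here is a NumPy-only sketch of padding and truncating a batch of waveforms to a common length, together with the matching attention mask. It is an illustration under simplifying assumptions (zero padding on the right, no normalization), not the feature extractor's actual implementation, and pad_or_truncate is a hypothetical helper.

```python
import numpy as np


def pad_or_truncate(waveforms, max_length):
    """Right-pad with zeros or cut each 1D waveform to exactly max_length."""
    input_values = np.zeros((len(waveforms), max_length), dtype=np.float32)
    attention_mask = np.zeros((len(waveforms), max_length), dtype=np.int32)
    for i, wave in enumerate(waveforms):
        kept = np.asarray(wave, dtype=np.float32)[:max_length]
        input_values[i, : len(kept)] = kept
        attention_mask[i, : len(kept)] = 1  # 1 marks real samples, 0 padding
    return {"input_values": input_values, "attention_mask": attention_mask}


# Toy batch: one clip that is too short and one that is too long.
batch = pad_or_truncate([np.ones(3), np.ones(6)], max_length=4)
print(batch["input_values"].shape)  # (2, 4)
print(batch["attention_mask"].tolist())  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The attention mask is what later lets the model tell real audio samples apart from padding when pooling the output frames.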

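Defining the Wav2Vec 2.0 with Classification-Head

We now define our model. To be precise, we define a Wav2Vec 2.0 model and add a Classification-Head on top to output a probability distribution over all classes for each input audio sample. Since the model might get complex, we first define the Wav2Vec 2.0 model with Classification-Head as a Keras layer and then build the model using that.

We instantiate our main Wav2Vec 2.0 model using the TFWav2Vec2Model class. This instantiates a model that outputs 768- or 1024-dimensional embeddings, depending on the config you choose (BASE or LARGE). from_pretrained() additionally loads pre-trained weights from the Hugging Face Model Hub: it downloads the weights together with the config corresponding to the model name you pass when calling the method. For our task, we choose the BASE variant that has only been pre-trained, since we fine-tune on top of it.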

from transformers import TFWav2Vec2Model


def mean_pool(hidden_states, feature_lengths):
    attention_mask = tf.sequence_mask(
        feature_lengths, maxlen=MAX_FRAMES, dtype=tf.dtypes.int64
    )
    padding_mask = tf.cast(
        tf.reverse(tf.cumsum(tf.reverse(attention_mask, [-1]), -1), [-1]),
        dtype=tf.dtypes.bool,
    )
    hidden_states = tf.where(
        tf.broadcast_to(
            tf.expand_dims(~padding_mask, -1), (BATCH_SIZE, MAX_FRAMES, HIDDEN_DIM)
        ),
        0.0,
        hidden_states,
    )
    pooled_state = tf.math.reduce_sum(hidden_states, axis=1) / tf.reshape(
        tf.math.reduce_sum(tf.cast(padding_mask, dtype=tf.dtypes.float32), axis=1),
        [-1, 1],
    )
    return pooled_state


class TFWav2Vec2ForAudioClassification(layers.Layer):
    """Wav2Vec 2.0 encoder followed by pooling, dropout, and a Classification-Head."""

    def __init__(self, model_checkpoint, num_classes):
        super().__init__()
        # Instantiate the Wav2Vec 2.0 model without the Classification-Head
        self.wav2vec2 = TFWav2Vec2Model.from_pretrained(
            model_checkpoint, apply_spec_augment=False, from_pt=True
        )
        self.pooling = layers.GlobalAveragePooling1D()
        # Drop-out layer before the final Classification-Head
        self.intermediate_layer_dropout = layers.Dropout(0.5)
        # Classification-Head
        self.final_layer = layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        # We take only the first output in the returned dictionary corresponding to the
        # output of the last layer of Wav2vec 2.0
        hidden_states = self.wav2vec2(inputs["input_values"])[0]

        # If attention mask does exist then mean-pool only un-masked output frames
        if tf.is_tensor(inputs["attention_mask"]):
            # Get the length of each audio input by summing up the attention_mask
            # (attention_mask = (BATCH_SIZE x MAX_SEQ_LENGTH) ∈ {1,0})
            audio_lengths = tf.cumsum(inputs["attention_mask"], -1)[:, -1]
            # Get the number of Wav2Vec 2.0 output frames for each corresponding audio input
            # length
            feature_lengths = self.wav2vec2.wav2vec2._get_feat_extract_output_lengths(
                audio_lengths
            )
            pooled_state = mean_pool(hidden_states, feature_lengths)
        # If attention mask does not exist then mean-pool over all output frames
        else:
            pooled_state = self.pooling(hidden_states)

        intermediate_state = self.intermediate_layer_dropout(pooled_state)
        final_state = self.final_layer(intermediate_state)

        return final_state
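
The masked mean pooling performed by mean_pool can be illustrated on a tiny NumPy array. The sketch below uses made-up shapes (2 clips, 3 frames, 2 features) and a hypothetical helper, masked_mean. It averages only the first feature_lengths[i] frames of each clip, which is the effect the TensorFlow function above aims for.

```python
import numpy as np


def masked_mean(hidden_states, feature_lengths):
    """Average each sequence over its valid (non-padded) frames only."""
    num_frames = hidden_states.shape[1]
    # mask[i, t] is True when frame t of clip i is a real frame.
    mask = np.arange(num_frames)[None, :] < np.asarray(feature_lengths)[:, None]
    summed = (hidden_states * mask[:, :, None]).sum(axis=1)
    return summed / mask.sum(axis=1, keepdims=True)


hidden_states = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last frame is padding
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # all frames are valid
    ]
)
pooled = masked_mean(hidden_states, feature_lengths=[2, 3])
print(pooled.tolist())  # [[2.0, 3.0], [2.0, 2.0]]
```

Note how the padded frame with value 100 does not affect the first clip's average. A plain mean over all frames would have been skewed by it.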

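Building and Compiling the model

We now build and compile our model. We use SparseCategoricalCrossentropy to train our model since this is a classification task. Following much of the literature, we evaluate our model on the accuracy metric.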

def build_model():
    # Model's input
    inputs = {
        "input_values": tf.keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="float32"),
        "attention_mask": tf.keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="int32"),
    }
    # Instantiate the Wav2Vec 2.0 model with Classification-Head using the desired
    # pre-trained checkpoint
    wav2vec2_model = TFWav2Vec2ForAudioClassification(MODEL_CHECKPOINT, NUM_CLASSES)(
        inputs
    )
    # Model
    model = tf.keras.Model(inputs, wav2vec2_model)
    # Loss
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # Optimizer
    optimizer = keras.optimizers.Adam(learning_rate=1e-5)
    # Compile and return
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    return model


model = build_model()
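
As a reminder of what the loss computes, here is a small worked example in plain Python. With from_logits=False the model output is already a probability distribution, and the per-example loss is the negative log of the probability assigned to the true class. The probabilities below are invented for illustration, and the function is a simplified stand-in rather than the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(p[true class]) over the batch."""
    losses = [-math.log(probs[label]) for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# Two examples, three classes; each row sums to 1.
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
labels = [0, 2]  # integer class IDs, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probabilities)
print(round(loss, 4))  # 0.2899
```

For a uniform guess over 10 classes the same formula gives ln(10), roughly 2.30, which is about where the training loss starts in the log further down.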

Training the model

Before we start training our model, we split the inputs into dependent and independent variables.

# Remove targets from training dictionaries
train_x = {x: y for x, y in train.items() if x != "label"}
test_x = {x: y for x, y in test.items() if x != "label"}

And now we can finally start training our model.

model.fit(
    train_x,
    train["label"],
    validation_data=(test_x, test["label"]),
    batch_size=BATCH_SIZE,
    epochs=MAX_EPOCHS,
)

Epoch 1/2
28/28 [==============================] - 25s 338ms/step - loss: 2.3122 - accuracy: 0.1205 - val_loss: 2.2023 - val_accuracy: 0.2176
Epoch 2/2
28/28 [==============================] - 5s 189ms/step - loss: 2.0533 - accuracy: 0.2868 - val_loss: 1.8177 - val_accuracy: 0.5089

<keras.callbacks.History at 0x7fcee542dc50>

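Great! Now that we have trained our model, we predict the classes for audio samples in the test set using the model.predict() method! We see that the model predictions are not that great, as the model has been trained on a very small number of samples for just 2 epochs. For best results, we recommend training on the complete dataset for at least 5 epochs!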

preds = model.predict(test_x)

28/28 [==============================] - 4s 44ms/step
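
model.predict() returns one row of class probabilities per example, so the predicted class ID is the argmax of each row. Below is a minimal sketch of turning such an array into an accuracy figure. The small arrays are invented stand-ins for preds and test["label"], not real model output.

```python
import numpy as np

# Invented stand-ins for `preds` (probabilities) and `test["label"]` (class IDs).
toy_preds = np.array(
    [
        [0.1, 0.7, 0.2],
        [0.6, 0.3, 0.1],
        [0.2, 0.2, 0.6],
        [0.5, 0.4, 0.1],
    ]
)
toy_labels = np.array([1, 0, 2, 1])

# The most probable class per row is the model's prediction.
predicted_ids = np.argmax(toy_preds, axis=-1)
accuracy = float(np.mean(predicted_ids == toy_labels))
print(predicted_ids.tolist(), accuracy)  # [1, 0, 2, 0] 0.75
```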

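Now we try to run inference with our trained model on a randomly sampled audio file. We can listen to the audio file and then see how well our model predicts it!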

import IPython.display as ipd

# Pick a random example index (randint is inclusive at both ends)
rand_int = random.randint(0, len(test["label"]) - 1)

ipd.Audio(data=np.asarray(test_x["input_values"][rand_int]), autoplay=True, rate=16000)

print("Original Label is ", id2label[str(test["label"][rand_int])])
print("Predicted Label is ", id2label[str(np.argmax((preds[rand_int])))])

Original Label is  up
Predicted Label is  on

Now you can push this model to the Hugging Face Model Hub and share it with all your friends, family, and favorite pets: they can all load it with an identifier such as "your-username/the-name-you-picked".

model.push_to_hub("wav2vec2-ks", organization="keras-io")
# Share the feature extractor alongside the model so inputs are pre-processed the same way
feature_extractor.push_to_hub("wav2vec2-ks", organization="keras-io")

And after you push your model, this is how you can load it in the future!

from transformers import TFWav2Vec2Model

model = TFWav2Vec2Model.from_pretrained("your-username/my-awesome-model", from_pt=True)