
Structured data classification with FeatureSpace

Author: fchollet
Date created: 2022/11/09
Last modified: 2022/11/09
Description: Classify tabular data in a few lines of code.

ⓘ This example uses Keras 3

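Introduction

This example demonstrates how to do structured data classification (also known as tabular data classification), starting from a raw CSV file. Our data includes numerical features, integer categorical features, and string categorical features. We will use the utility keras.utils.FeatureSpace to index, preprocess, and encode our features.

The code is adapted from the example "Structured data classification from scratch". While the previous example managed its own low-level feature preprocessing and encoding with Keras preprocessing layers, this example delegates everything to FeatureSpace, which makes the workflow much shorter and simpler.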

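The dataset

Our dataset is provided by the Cleveland Clinic Foundation for Heart Disease. It is a CSV file with 303 rows. Each row contains information about one patient (a sample), and each column describes one attribute of the patient (a feature). We use the features to predict whether a patient has heart disease (binary classification).

Here is the description of each feature: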

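Column	Description	Feature type
age	Age in years	Numerical
sex	(1 = male; 0 = female)	Categorical
cp	Chest pain type (0, 1, 2, 3, 4)	Categorical
trestbps	Resting blood pressure on admission (in mm Hg)	Numerical
chol	Serum cholesterol (in mg/dl)	Numerical
fbs	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)	Categorical
restecg	Resting electrocardiogram results (0, 1, 2)	Categorical
thalach	Maximum heart rate achieved	Numerical
exang	Exercise-induced angina (1 = yes; 0 = no)	Categorical
oldpeak	ST depression induced by exercise relative to rest	Numerical
slope	Slope of the peak exercise ST segment	Numerical
ca	Number of major vessels (0-3) colored by fluoroscopy	Both numerical and categorical
thal	3 = normal; 6 = fixed defect; 7 = reversible defect	Categorical
target	Diagnosis of heart disease (1 = true; 0 = false)	Target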

Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import pandas as pd
import keras
from keras.utils import FeatureSpace

Preparing the data

Let's download the data and load it into a Pandas dataframe:

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

The dataset includes 303 samples with 14 columns per sample (13 features, plus the target label):

print(dataframe.shape)

(303, 14)

Here is a preview of a few samples:

dataframe.head()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	63	1	1	145	233	1	2	150	0	2.3	3	0	fixed	0
1	67	1	4	160	286	0	2	108	1	1.5	2	3	normal	1
2	67	1	4	120	229	0	2	129	1	2.6	2	2	reversible	0
3	37	1	3	130	250	0	0	187	0	3.5	3	0	normal	0
4	41	0	2	130	204	0	2	172	0	1.4	1	0	normal	0

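The last column, "target", indicates whether the patient has heart disease (1) or not (0).

Let's split the data into a training set and a validation set: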

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation

Let's generate tf.data.Dataset objects for each dataframe:

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

Each Dataset yields a tuple (input, target), where input is a dictionary of features and target is the value 0 or 1:

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'age': <tf.Tensor: shape=(), dtype=int64, numpy=65>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=138>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=282>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=174>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.4>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'normal'>}
Target: tf.Tensor(0, shape=(), dtype=int64)

Let's batch the datasets:

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

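Configuring a `FeatureSpace`

To configure how each feature should be preprocessed, we instantiate a keras.utils.FeatureSpace and pass it a dictionary that maps each feature name to a string describing the feature type.

We have a few "integer categorical" features such as "fbs", one "string categorical" feature ("thal"), and a few numerical features, which we would like to normalize. The exception is "age", which we would like to discretize into a number of bins.

We also use the crosses argument to capture feature interactions for some categorical features, that is, to create additional features that represent value co-occurrences for these categorical features. You can compute feature crosses like this for arbitrary sets of categorical features, not just pairs of features. Because the resulting co-occurrences are hashed into a fixed-size vector, you do not need to worry about whether the co-occurrence space is too large.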

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": "integer_categorical",
        "cp": "integer_categorical",
        "fbs": "integer_categorical",
        "restecg": "integer_categorical",
        "exang": "integer_categorical",
        "ca": "integer_categorical",
        # Categorical feature encoded as string
        "thal": "string_categorical",
        # Numerical features to discretize
        "age": "float_discretized",
        # Numerical features to normalize
        "trestbps": "float_normalized",
        "chol": "float_normalized",
        "thalach": "float_normalized",
        "oldpeak": "float_normalized",
        "slope": "float_normalized",
    },
    # We create additional features by hashing
    # value co-occurrences for the
    # following groups of categorical features.
    crosses=[("sex", "age"), ("thal", "ca")],
    # The hashing space for these co-occurrences
    # will be 32-dimensional.
    crossing_dim=32,
    # Our utility will one-hot encode all categorical
    # features and concat all features into a single
    # vector (one vector per sample).
    output_mode="concat",
)
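To make the hashing of value co-occurrences more concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper name `hashed_cross`, the `"_X_"` separator, and the use of MD5 are assumptions chosen only to keep the example deterministic and self-contained.

```python
import hashlib


def hashed_cross(values, crossing_dim):
    """One-hot encode the co-occurrence of several categorical values.

    The tuple of values is hashed into one of `crossing_dim` buckets.
    MD5 is an illustrative choice, not necessarily the hash Keras uses.
    """
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % crossing_dim
    one_hot = [0.0] * crossing_dim
    one_hot[bucket] = 1.0
    return one_hot


# Cross of ("thal", "ca") for one patient, hashed into 32 buckets.
vector = hashed_cross(("fixed", 0), crossing_dim=32)
print(len(vector), sum(vector))  # prints: 32 1.0
```

The output width is always `crossing_dim`, however many distinct value combinations exist. The trade-off is that two different combinations may collide in the same bucket.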

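Further customizing a `FeatureSpace`

Specifying the feature type via a string name is quick and easy, but sometimes you may want to further configure the preprocessing of each feature. In our case, the categorical features do not have a large set of possible values: there are only a handful of values per feature (e.g. 1 and 0 for the feature "fbs"), and all possible values are represented in the training set. As a result, we do not need to reserve an index to represent "out of vocabulary" values for these features, which would have been the default behavior. Below, we specify num_oov_indices=0 for each of these features to tell the feature preprocessor to skip "out of vocabulary" indexing.

Other customizations available to you include specifying the number of bins for discretizing features of type "float_discretized", or the dimensionality of the hashing space for feature crosses.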

feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "sex": FeatureSpace.integer_categorical(num_oov_indices=0),
        "cp": FeatureSpace.integer_categorical(num_oov_indices=0),
        "fbs": FeatureSpace.integer_categorical(num_oov_indices=0),
        "restecg": FeatureSpace.integer_categorical(num_oov_indices=0),
        "exang": FeatureSpace.integer_categorical(num_oov_indices=0),
        "ca": FeatureSpace.integer_categorical(num_oov_indices=0),
        # Categorical feature encoded as string
        "thal": FeatureSpace.string_categorical(num_oov_indices=0),
        # Numerical features to discretize
        "age": FeatureSpace.float_discretized(num_bins=30),
        # Numerical features to normalize
        "trestbps": FeatureSpace.float_normalized(),
        "chol": FeatureSpace.float_normalized(),
        "thalach": FeatureSpace.float_normalized(),
        "oldpeak": FeatureSpace.float_normalized(),
        "slope": FeatureSpace.float_normalized(),
    },
    # Specify feature cross with a custom crossing dim.
    crosses=[
        FeatureSpace.cross(feature_names=("sex", "age"), crossing_dim=64),
        FeatureSpace.cross(
            feature_names=("thal", "ca"),
            crossing_dim=16,
        ),
    ],
    output_mode="concat",
)
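The effect of num_oov_indices on the encoded width can be sketched in plain Python. This is a simplified model, not the Keras lookup layers: `one_hot_index` is a hypothetical helper, and the vocabulary order here is arbitrary, whereas a real lookup layer chooses its own ordering.

```python
def one_hot_index(value, vocabulary, num_oov_indices=1):
    """Return a one-hot vector for `value`.

    With num_oov_indices=1, slot 0 is reserved for unseen values.
    With num_oov_indices=0, no slot is reserved and unseen values raise.
    """
    size = num_oov_indices + len(vocabulary)
    vector = [0.0] * size
    if value in vocabulary:
        vector[num_oov_indices + vocabulary.index(value)] = 1.0
    elif num_oov_indices > 0:
        vector[0] = 1.0
    else:
        raise ValueError(f"{value!r} is not in the vocabulary")
    return vector


fbs_vocabulary = [0, 1]
print(one_hot_index(1, fbs_vocabulary, num_oov_indices=1))  # [0.0, 0.0, 1.0]
print(one_hot_index(1, fbs_vocabulary, num_oov_indices=0))  # [0.0, 1.0]
```

Dropping the reserved slot removes one column per categorical feature, which is only safe when every value seen at inference time was also present during adapt().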

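Adapt the `FeatureSpace` to the training data

Before we start using the FeatureSpace to build a model, we have to adapt it to the training data. During adapt(), the FeatureSpace will:

Index the set of possible values for categorical features.
Compute the mean and variance of numerical features to normalize.
Compute the value boundaries of the different bins for numerical features to discretize.

Note that adapt() should be called on a tf.data.Dataset that yields dicts of feature values, with no labels.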

train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)
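As a rough illustration of the three kinds of statistics gathered during adapt(), the sketch below computes them by hand with the standard library, using the five rows shown in the dataframe preview. It is an approximation of the idea only: the variable names are hypothetical, four bins are used instead of 30 to suit the tiny sample, and the real layers may compute bin boundaries differently.

```python
import statistics

# Values taken from the first five rows of the dataframe preview.
ages = [63, 67, 67, 37, 41]
chol = [233, 286, 229, 250, 204]
thal = ["fixed", "normal", "reversible", "normal", "normal"]

# 1. Index the set of possible values of a categorical feature.
thal_vocabulary = sorted(set(thal))

# 2. Mean and variance of a numerical feature, used for normalization.
chol_mean = statistics.fmean(chol)
chol_variance = statistics.pvariance(chol)
chol_normalized = [(v - chol_mean) / chol_variance**0.5 for v in chol]

# 3. Bin boundaries for a discretized feature (here, 4 quantile bins).
age_boundaries = statistics.quantiles(ages, n=4)

print(thal_vocabulary)  # ['fixed', 'normal', 'reversible']
print(chol_mean, chol_variance)
print(age_boundaries)
```

Because these statistics are learned from data, they should come from the training set only, which is why adapt() is called on the training dataset and not on the validation dataset.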

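At this point, the FeatureSpace can be called on a dict of raw feature values, and will return a single concatenated vector for each sample, combining encoded features and feature crosses.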

for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print("preprocessed_x.shape:", preprocessed_x.shape)
    print("preprocessed_x.dtype:", preprocessed_x.dtype)

preprocessed_x.shape: (32, 138)
preprocessed_x.dtype: <dtype: 'float32'>
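The width of 138 can be partly accounted for from the configuration alone. The arithmetic below is a sketch that assumes each normalized feature contributes one column and that the discretized "age" feature is one-hot encoded into exactly num_bins columns. The remainder is the combined one-hot width of the seven categorical features, which depends on the vocabularies found in the training data.

```python
# Widths that follow from the customized FeatureSpace configuration.
age_bins = 30            # float_discretized(num_bins=30), one-hot encoded
normalized_features = 5  # trestbps, chol, thalach, oldpeak, slope
cross_dims = 64 + 16     # crosses ("sex", "age") and ("thal", "ca")

total_width = 138        # second dimension of preprocessed_x.shape

# What is left is shared by sex, cp, fbs, restecg, exang, ca and thal.
categorical_width = total_width - age_bins - normalized_features - cross_dims
print(categorical_width)  # 23
```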

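Two ways to manage preprocessing: as part of the `tf.data` pipeline, or in the model itself

There are two ways in which you can leverage your FeatureSpace:

Asynchronous preprocessing in `tf.data`

You can make it part of your data pipeline, before the model. This enables asynchronous, parallel preprocessing of the data on CPU before it reaches the model. Do this if you are training on GPU or TPU, or if you want to speed up preprocessing. This is generally the right choice during training.

Synchronous preprocessing in the model

You can make it part of your model. This means that the model will expect dicts of raw feature values, and each batch will be preprocessed synchronously (in a blocking manner) before the rest of the forward pass. Do this if you want an end-to-end model that can process raw feature values, but keep in mind that your model will only be able to run on CPU, since most types of feature preprocessing (e.g. string preprocessing) are not GPU or TPU compatible.

Do not do this on GPU / TPU or in performance-sensitive settings. In general, in-model preprocessing is what you want when you do inference on CPU.

In our case, we will apply the FeatureSpace in the tf.data pipeline during training, but we will do inference with an end-to-end model that includes the FeatureSpace.

Let's create a training and a validation dataset of preprocessed batches: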

preprocessed_train_ds = train_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_train_ds = preprocessed_train_ds.prefetch(tf.data.AUTOTUNE)

preprocessed_val_ds = val_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
preprocessed_val_ds = preprocessed_val_ds.prefetch(tf.data.AUTOTUNE)

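Build a model

Time to build a model, or rather two models:

A training model that expects preprocessed features (one sample = one vector)
An inference model that expects raw features (one sample = dict of raw feature values)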

dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

x = keras.layers.Dense(32, activation="relu")(encoded_features)
x = keras.layers.Dropout(0.5)(x)
predictions = keras.layers.Dense(1, activation="sigmoid")(x)

training_model = keras.Model(inputs=encoded_features, outputs=predictions)
training_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

inference_model = keras.Model(inputs=dict_inputs, outputs=predictions)

Train the model

Let's train our model for 20 epochs. Note that feature preprocessing happens as part of the tf.data pipeline, not as part of the model.

training_model.fit(
    preprocessed_train_ds,
    epochs=20,
    validation_data=preprocessed_val_ds,
    verbose=2,
)

Epoch 1/20
8/8 - 3s - 352ms/step - accuracy: 0.5200 - loss: 0.7407 - val_accuracy: 0.6196 - val_loss: 0.6663
Epoch 2/20
8/8 - 0s - 20ms/step - accuracy: 0.5881 - loss: 0.6874 - val_accuracy: 0.7732 - val_loss: 0.6015
Epoch 3/20
8/8 - 0s - 19ms/step - accuracy: 0.6580 - loss: 0.6192 - val_accuracy: 0.7839 - val_loss: 0.5577
Epoch 4/20
8/8 - 0s - 19ms/step - accuracy: 0.7096 - loss: 0.5721 - val_accuracy: 0.7856 - val_loss: 0.5200
Epoch 5/20
8/8 - 0s - 18ms/step - accuracy: 0.7292 - loss: 0.5553 - val_accuracy: 0.7764 - val_loss: 0.4853
Epoch 6/20
8/8 - 0s - 19ms/step - accuracy: 0.7561 - loss: 0.5103 - val_accuracy: 0.7732 - val_loss: 0.4627
Epoch 7/20
8/8 - 0s - 19ms/step - accuracy: 0.7231 - loss: 0.5374 - val_accuracy: 0.7764 - val_loss: 0.4413
Epoch 8/20
8/8 - 0s - 19ms/step - accuracy: 0.7769 - loss: 0.4564 - val_accuracy: 0.7683 - val_loss: 0.4320
Epoch 9/20
8/8 - 0s - 18ms/step - accuracy: 0.7769 - loss: 0.4324 - val_accuracy: 0.7856 - val_loss: 0.4191
Epoch 10/20
8/8 - 0s - 19ms/step - accuracy: 0.7778 - loss: 0.4340 - val_accuracy: 0.7888 - val_loss: 0.4084
Epoch 11/20
8/8 - 0s - 19ms/step - accuracy: 0.7760 - loss: 0.4124 - val_accuracy: 0.7716 - val_loss: 0.3977
Epoch 12/20
8/8 - 0s - 19ms/step - accuracy: 0.7964 - loss: 0.4125 - val_accuracy: 0.7667 - val_loss: 0.3959
Epoch 13/20
8/8 - 0s - 18ms/step - accuracy: 0.8051 - loss: 0.3979 - val_accuracy: 0.7856 - val_loss: 0.3891
Epoch 14/20
8/8 - 0s - 19ms/step - accuracy: 0.8043 - loss: 0.3891 - val_accuracy: 0.7856 - val_loss: 0.3840
Epoch 15/20
8/8 - 0s - 18ms/step - accuracy: 0.8633 - loss: 0.3571 - val_accuracy: 0.7872 - val_loss: 0.3764
Epoch 16/20
8/8 - 0s - 19ms/step - accuracy: 0.8728 - loss: 0.3548 - val_accuracy: 0.7888 - val_loss: 0.3699
Epoch 17/20
8/8 - 0s - 19ms/step - accuracy: 0.8698 - loss: 0.3171 - val_accuracy: 0.7872 - val_loss: 0.3727
Epoch 18/20
8/8 - 0s - 18ms/step - accuracy: 0.8529 - loss: 0.3454 - val_accuracy: 0.7904 - val_loss: 0.3669
Epoch 19/20
8/8 - 0s - 17ms/step - accuracy: 0.8589 - loss: 0.3359 - val_accuracy: 0.7980 - val_loss: 0.3770
Epoch 20/20
8/8 - 0s - 17ms/step - accuracy: 0.8455 - loss: 0.3113 - val_accuracy: 0.8044 - val_loss: 0.3684

<keras.src.callbacks.history.History at 0x7f139bb4ed10>

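We quickly reach about 80% validation accuracy.

Inference on new data with the end-to-end model

Now, we can use our inference model (which includes the FeatureSpace) to make predictions based on dicts of raw feature values, as follows: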

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)

print(
    f"This particular patient had a {100 * predictions[0][0]:.2f}% probability "
    "of having a heart disease, as evaluated by our model."
)

 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 273ms/step
This particular patient had a 43.13% probability of having a heart disease, as evaluated by our model.
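The inference model takes one entry per feature rather than one dict per patient, so scoring several patients at once means regrouping the records by column first. The sketch below shows one way to do that in plain Python. `rows_to_columns` is a hypothetical helper, the patient values are made up for illustration, and only three of the thirteen features are shown to keep it short.

```python
def rows_to_columns(samples):
    """Turn a list of per-patient dicts into a dict of per-feature lists."""
    feature_names = samples[0].keys()
    return {name: [row[name] for row in samples] for name in feature_names}


# Illustrative records only; a real call needs all 13 features.
patients = [
    {"age": 60, "sex": 1, "thal": "fixed"},
    {"age": 45, "sex": 0, "thal": "normal"},
]

columns = rows_to_columns(patients)
print(columns)
# {'age': [60, 45], 'sex': [1, 0], 'thal': ['fixed', 'normal']}
```

Each list can then be wrapped with tf.convert_to_tensor, in the same way the single-sample code above wraps `[value]`, before being passed to inference_model.predict.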
