Classify structured data using Keras preprocessing layers

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

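This tutorial demonstrates how to classify structured data, such as tabular data, using a simplified version of the PetFinder dataset from a Kaggle competition, stored in a CSV file.

You will use Keras to define the model, and Keras preprocessing layers as a bridge that maps columns in the CSV file to the features used to train the model. The goal is to predict whether a pet will be adopted.

This tutorial contains complete code for:

Loading a CSV file into a DataFrame using pandas.
Building an input pipeline to batch and shuffle the rows using tf.data. (Visit tf.data: Build TensorFlow input pipelines for more details.)
Mapping columns in the CSV file to features used to train the model with Keras preprocessing layers.
Building, training, and evaluating a model using the built-in Keras methods.

Note: This tutorial is similar to Classify structured data with feature columns. This version uses Keras preprocessing layers instead of the tf.feature_column API, because the former are more intuitive and can be easily included inside a model to simplify deployment.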

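The PetFinder.my mini dataset#

The PetFinder.my mini CSV dataset file contains several thousand rows. Each row describes a pet (a dog or a cat), and each column describes an attribute, such as age, breed, or color.

In the dataset summary below, notice that most columns are numerical or categorical. In this tutorial, you will only work with those two feature types. During data preprocessing you will drop Description (a free-text feature) and AdoptionSpeed (a classification feature, used only to derive the target).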

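Column	Pet description	Feature type	Data type
`Type`	Type of animal (Dog, Cat)	Categorical	String
`Age`	Age	Numerical	Integer
`Breed1`	Primary breed	Categorical	String
`Color1`	Color 1	Categorical	String
`Color2`	Color 2	Categorical	String
`MaturitySize`	Size at maturity	Categorical	String
`FurLength`	Fur length	Categorical	String
`Vaccinated`	Pet has been vaccinated	Categorical	String
`Sterilized`	Pet has been sterilized	Categorical	String
`Health`	Health condition	Categorical	String
`Fee`	Adoption fee	Numerical	Integer
`Description`	Profile write-up	Text	String
`PhotoAmt`	Total uploaded photos	Numerical	Integer
`AdoptionSpeed`	Categorical speed of adoption	Classification	Integer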

Import TensorFlow and other libraries#

import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras import layers

tf.__version__

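Load the dataset and read it into a pandas DataFrame#

pandas is a Python library with many helpful utilities for loading and working with structured data. Use tf.keras.utils.get_file to download and extract the CSV file with the PetFinder.my mini dataset, and load it into a DataFrame with pandas.read_csv: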

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)

Inspect the dataset by checking the first five rows of the DataFrame:

dataframe.head()

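Create a target variable#

The original task in Kaggle's PetFinder.my Adoption Prediction competition was to predict the speed at which a pet will be adopted (for example, in the first week, the first month, or the first three months).

In this tutorial, you will simplify the task by transforming it into a binary classification problem, where you only have to predict whether a pet was adopted.

The code below derives a new target column from AdoptionSpeed: 0 indicates the pet was not adopted, and 1 indicates it was.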

# In the original dataset, `'AdoptionSpeed'` of `4` indicates
# a pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop unused features.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])

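Split the DataFrame into training, validation, and test sets#

The dataset is in a single pandas DataFrame. Split it into training, validation, and test sets using, for example, an 80:10:10 ratio, respectively: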

train, val, test = np.split(dataframe.sample(frac=1), [int(0.8*len(dataframe)), int(0.9*len(dataframe))])

print(len(train), 'training examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
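
To make the 80:10:10 arithmetic concrete, here is a minimal pure-Python sketch of the cut points that the split above computes. The helper names `split_boundaries` and `split_rows` and the row count of 1,000 are illustrative assumptions, not part of the tutorial. The shuffling done by `dataframe.sample(frac=1)` is left out.

```python
def split_boundaries(n_rows, cut_fractions=(0.8, 0.9)):
    """Return the row indices at which the data is cut (hypothetical helper)."""
    return [int(fraction * n_rows) for fraction in cut_fractions]


def split_rows(rows, cut_fractions=(0.8, 0.9)):
    """Slice an already shuffled sequence into train, validation, and test parts."""
    train_end, val_end = split_boundaries(len(rows), cut_fractions)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


rows = list(range(1000))  # stand-in for 1,000 shuffled DataFrame rows
train_rows, val_rows, test_rows = split_rows(rows)
print(len(train_rows), len(val_rows), len(test_rows))  # 800 100 100
```

Because the cut points are truncated with `int`, any leftover rows from rounding end up in the last (test) slice.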

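Create an input pipeline using tf.data#

Next, create a utility function that converts each training, validation, and test set DataFrame into a tf.data.Dataset, then shuffles and batches the data.

Note: If you were working with a very large CSV file (so large that it does not fit into memory), you would use the tf.data API to read it from disk directly. That is not covered in this tutorial.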
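
The idea behind reading from disk is that rows are pulled in batch by batch rather than loaded all at once. The sketch below illustrates that idea with the Python standard library only; it is not the tf.data API (TensorFlow offers `tf.data.experimental.make_csv_dataset` for this purpose). The function name `iter_csv_batches` and the tiny sample file are assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile


def iter_csv_batches(path, batch_size):
    """Lazily yield lists of row dicts so the whole file never sits in memory."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                return
            yield batch


# Write a tiny made-up CSV file and stream it back two rows at a time.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pets.csv')
    with open(path, 'w', newline='') as f:
        f.write('Type,Age\nCat,3\nDog,1\nDog,7\n')
    batch_sizes = [len(batch) for batch in iter_csv_batches(path, batch_size=2)]

print(batch_sizes)  # [2, 1]
```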

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  df = dataframe.copy()
  labels = df.pop('target')
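  # Convert each column to a NumPy array and add a trailing axis so that every
  # feature has shape (batch, 1). Iterate over `df` so the label is excluded.
  df = {key: value.to_numpy()[:, tf.newaxis] for key, value in df.items()}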
  ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds

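Now, call the newly created function (df_to_dataset) on the training data to check the format of the data that the input pipeline helper returns. Use a small batch size to keep the output readable: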

batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)

[(train_features, label_batch)] = train_ds.take(1)
print('Every feature:', list(train_features.keys()))
print('A batch of ages:', train_features['Age'])
print('A batch of targets:', label_batch )

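As the output demonstrates, the training set returns a dictionary that maps column names (from the DataFrame) to the column values of the rows in the batch.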
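
The shape of one batch can be pictured with plain Python containers. This is only a sketch of the structure, assuming a hypothetical helper `make_batches` and three made-up rows; the real pipeline yields tensors, not lists.

```python
def make_batches(rows, label_key, batch_size):
    """Group row dicts into (features, labels) pairs, organized column by column."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        features = {
            # Each value is wrapped in its own list, mirroring the (batch, 1) shape.
            key: [[row[key]] for row in chunk]
            for key in chunk[0]
            if key != label_key
        }
        labels = [row[label_key] for row in chunk]
        yield features, labels


rows = [
    {'Type': 'Cat', 'Age': 3, 'target': 1},
    {'Type': 'Dog', 'Age': 1, 'target': 0},
    {'Type': 'Dog', 'Age': 7, 'target': 1},
]
first_features, first_labels = next(make_batches(rows, 'target', batch_size=2))
print(first_features)  # {'Type': [['Cat'], ['Dog']], 'Age': [[3], [1]]}
print(first_labels)    # [1, 0]
```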

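Apply the Keras preprocessing layers#

Keras preprocessing layers let you build Keras-native input processing pipelines. These pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.

In this tutorial, you will use the following four preprocessing layers to demonstrate how to perform preprocessing, structured data encoding, and feature engineering:

tf.keras.layers.Normalization: Performs feature-wise normalization of input features.
tf.keras.layers.CategoryEncoding: Turns integer categorical features into one-hot, multi-hot, or tf-idf dense representations.
tf.keras.layers.StringLookup: Turns string categorical values into integer indices.
tf.keras.layers.IntegerLookup: Turns integer categorical values into integer indices.

You can learn more about the available layers in the Working with preprocessing layers guide.

For the numerical features of the PetFinder.my mini dataset, you will use a tf.keras.layers.Normalization layer to standardize the distribution of the data.
For categorical features, such as pet Type (the strings Dog and Cat), you will transform them to multi-hot encoded tensors with tf.keras.layers.CategoryEncoding.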

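Numerical columns#

Each numerical feature in the PetFinder.my mini dataset is standardized with its own tf.keras.layers.Normalization layer.

Define a new utility function that returns a layer which applies feature-wise normalization to a numerical feature using that Keras preprocessing layer: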

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for the feature.
  normalizer = layers.Normalization(axis=None)

  # Prepare a Dataset that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer

Next, test the new function by calling it on the total uploaded pet photos feature to normalize 'PhotoAmt':

photo_count_col = train_features['PhotoAmt']
layer = get_normalization_layer('PhotoAmt', train_ds)
layer(photo_count_col)
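
As a rough picture of what the adapted layer does, the sketch below learns a mean and a (population) variance from some values and then rescales them to zero mean and unit variance. It is plain Python, not the Keras implementation; the function names, the sample photo counts, and the `epsilon` guard value are assumptions.

```python
import math


def fit_normalizer(values):
    """Learn the statistics, analogous to what adapt() collects (hypothetical helper)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def normalize(values, mean, variance, epsilon=1e-7):
    """Shift by the mean and divide by the standard deviation."""
    scale = max(math.sqrt(variance), epsilon)  # guard against division by zero
    return [(v - mean) / scale for v in values]


photo_counts = [1.0, 2.0, 3.0, 6.0]  # made-up values
mean, variance = fit_normalizer(photo_counts)
normalized = normalize(photo_counts, mean, variance)
print(mean, variance)  # 3.0 3.5
```

After normalization the values average to roughly zero, which is why the statistics must come from the training data and then be reused unchanged for validation, test, and inference inputs.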

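Note: If you have many numerical features (hundreds or more), it is more efficient to concatenate them first and use a single tf.keras.layers.Normalization layer.

Categorical columns#

Pet Types in the dataset are represented as strings (Dog and Cat), which need to be multi-hot encoded before being fed into the model. The Age feature is stored as an integer, and this tutorial encodes it as a categorical feature as well.

Define another new utility function that returns a callable which maps values from a vocabulary to integer indices and multi-hot encodes the features using the tf.keras.layers.StringLookup, tf.keras.layers.IntegerLookup, and tf.keras.layers.CategoryEncoding preprocessing layers: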

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply multi-hot encoding to the indices. The lambda function captures the
  # layer, so you can use them, or include them in the Keras Functional model later.
  return lambda feature: encoder(index(feature))

Test the get_category_encoding_layer function by calling it on the pet 'Type' feature to turn it into a multi-hot encoded tensor:

test_type_col = train_features['Type']
test_type_layer = get_category_encoding_layer(name='Type',
                                              dataset=train_ds,
                                              dtype='string')
test_type_layer(test_type_col)
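
The two steps inside the helper (look up an index, then encode the index) can be sketched in plain Python. This is an illustration under stated assumptions, not the Keras code: index 0 is reserved for values that were never seen, the remaining slots are ordered by frequency, and ties are broken alphabetically, which is a choice made only for this sketch. The names `build_vocabulary` and `multi_hot` are hypothetical.

```python
from collections import Counter


def build_vocabulary(values, oov_token='[UNK]'):
    """Reserve index 0 for unseen values, then list known values by frequency."""
    counts = Counter(values)
    # Alphabetical tie-breaking is an assumption made for this sketch.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return [oov_token] + ordered


def multi_hot(sample_values, vocabulary):
    """Set a 1 at the index of every value present in the sample."""
    index_of = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for value in sample_values:
        vector[index_of.get(value, 0)] = 1  # unknown values fall into slot 0
    return vector


pet_types = ['Dog', 'Cat', 'Dog', 'Dog', 'Cat']  # made-up training values
vocabulary = build_vocabulary(pet_types)
print(vocabulary)                       # ['[UNK]', 'Dog', 'Cat']
print(multi_hot(['Cat'], vocabulary))   # [0, 0, 1]
print(multi_hot(['Bird'], vocabulary))  # [1, 0, 0]
```

With a single value per sample, as in this dataset, the multi-hot vector has exactly one nonzero entry, so it coincides with a one-hot encoding.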

Repeat the process on the pet 'Age' feature:

test_age_col = train_features['Age']
test_age_layer = get_category_encoding_layer(name='Age',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=5)
test_age_layer(test_age_col)
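
The effect of `max_tokens=5` can be pictured as follows: one slot goes to the out-of-vocabulary token, the four most frequent ages keep their own index, and every other age collapses into slot 0. The plain-Python sketch below illustrates that behavior; the helper names, the sample ages, and the alphabetical-style tie-breaking are assumptions.

```python
from collections import Counter


def capped_vocabulary(values, max_tokens, oov_token=-1):
    """Keep the most frequent values; one of the max_tokens slots is the OOV token."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return [oov_token] + ordered[:max_tokens - 1]


def lookup(value, vocabulary):
    """Return the index of a known value, or 0 for anything outside the vocabulary."""
    return vocabulary.index(value) if value in vocabulary[1:] else 0


# Made-up ages in months: 2 appears five times, 1 four times, and so on.
ages = [2] * 5 + [1] * 4 + [3] * 3 + [12] * 2 + [24, 36]
age_vocabulary = capped_vocabulary(ages, max_tokens=5)
print(age_vocabulary)              # [-1, 2, 1, 3, 12]
print(lookup(12, age_vocabulary))  # 4
print(lookup(36, age_vocabulary))  # 0
```

Capping the vocabulary keeps the encoded vector short, at the cost of treating all rare values as the same category.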

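Preprocess selected features to train the model on#

You have learned how to use several types of Keras preprocessing layers. Next, you will:

Apply the preprocessing utility functions defined earlier to 13 numerical and categorical features from the PetFinder.my mini dataset.
Add all the feature inputs to a list.

As mentioned at the beginning, to train the model you will use the PetFinder.my mini dataset's numerical ('PhotoAmt', 'Fee') and categorical ('Age', 'Type', 'Color1', 'Color2', 'Gender', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1') features.

Note: If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include and how they should be represented.

Earlier, you used a small batch size to demonstrate the input pipeline. Now create a new input pipeline with a larger batch size of 256: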

batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

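Normalize the numerical features (the number of pet photos and the adoption fee). Add the raw inputs to a list called all_inputs and the normalized outputs to a list called encoded_features: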

all_inputs = []
encoded_features = []

# Numerical features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

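Turn the integer categorical values from the dataset (the pet age) into integer indices, perform multi-hot encoding, and add the resulting feature to encoded_features: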

age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')

encoding_layer = get_category_encoding_layer(name='Age',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)

Repeat the same step for the string categorical values:

categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']

for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(name=header,
                                               dataset=train_ds,
                                               dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

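Create, compile, and train the model#

The next step is to create a model using the Keras Functional API. For the first layer in your model, merge the list of encoded features, encoded_features, into one vector via concatenation with tf.keras.layers.concatenate.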

all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)

Configure the model with Keras Model.compile:

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
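
Because the final Dense(1) layer has no activation, the model outputs raw logits, which is why the loss is configured with from_logits=True. The plain-Python sketch below shows, for a single example, that a cross-entropy computed directly from a logit matches the one computed from the sigmoid probability. The function names and the sample logit are assumptions; this is not the Keras implementation.

```python
import math


def sigmoid(logit):
    """Map a logit to a probability (fine for moderate logits)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(label, logit):
    """Binary cross-entropy from a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, probability):
    """Binary cross-entropy from a probability, the textbook definition."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1 - probability))


label, logit = 1, 1.5  # made-up example: an adopted pet with a positive logit
loss_a = bce_from_logit(label, logit)
loss_b = bce_from_probability(label, sigmoid(logit))
print(math.isclose(loss_a, loss_b))  # True
```

Working from logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.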

Let's visualize the connectivity graph:

# Use `rankdir='LR'` to make the graph horizontal.
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

Next, train and test the model:

model.fit(train_ds, epochs=10, validation_data=val_ds)

loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

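Perform inference#

The model you have developed can now classify a row from a CSV file directly, because the preprocessing layers are included inside the model itself.

You can now save and reload the Keras model with Model.save and tf.keras.models.load_model before performing inference on new data: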

model.save('my_pet_classifier.keras')
reloaded_model = tf.keras.models.load_model('my_pet_classifier.keras')

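To get a prediction for a new sample, call the Keras Model.predict method. There are just two things you need to do:

Wrap scalars into a list so as to have a batch dimension (a Model only processes batches of data, not single samples).
Call tf.convert_to_tensor on each feature.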

sample = {
    'Type': 'Cat',
    'Age': 3,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = reloaded_model.predict(input_dict)
# The model outputs a logit; apply a sigmoid and extract the single scalar value.
prob = float(tf.nn.sigmoid(predictions[0][0]))

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)
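
The last step above turns the model's logit into a probability. A minimal plain-Python sketch of that conversion follows, together with a yes/no decision. The 0.5 cutoff, the helper names, and the sample logits are assumptions; the tutorial itself only reports the probability.

```python
import math


def logit_to_probability(logit):
    """Sigmoid written so that large positive or negative logits do not overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def classify(logit, threshold=0.5):
    """Return 1 (adopted) when the probability reaches the assumed threshold."""
    return int(logit_to_probability(logit) >= threshold)


example_logits = [-2.0, 0.0, 0.8]  # made-up model outputs
probabilities = [logit_to_probability(z) for z in example_logits]
decisions = [classify(z) for z in example_logits]
print(decisions)  # [0, 1, 1]
```

A logit of 0 corresponds to a probability of exactly 0.5, so with this cutoff the sign of the logit alone decides the class.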

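Note: You will typically see better results with deep learning when using larger and more complex datasets. When working with a small dataset, such as the simplified PetFinder.my one, you can use a decision tree or a random forest as a strong baseline. The goal of this tutorial is to demonstrate the mechanics of working with structured data, so you have a starting point when working with your own datasets in the future.

Next steps#

To learn more about classifying structured data, try working with other datasets. To improve accuracy when training and testing your models, think carefully about which features to include in your model and how they should be represented.

Below are some suggestions for datasets:

TensorFlow Datasets: MovieLens: A set of movie ratings from a movie recommendation service.
TensorFlow Datasets: Wine Quality: Two datasets related to red and white variants of the Portuguese "Vinho Verde" wine. You can also find the Red Wine Quality dataset on Kaggle.
Kaggle: arXiv Dataset: A corpus of 1.7 million scholarly articles from arXiv, covering physics, computer science, mathematics, statistics, electrical engineering, quantitative biology, and economics.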