对结构化数据进行分类

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

对结构化数据进行分类#

在 TensorFlow.org 上查看

在 Google Colab 运行

在 Github 上查看源代码

下载笔记本

警告：不推荐为新代码使用本教程中介绍的 tf.feature_columns 模块。Keras 预处理层介绍了此功能，有关迁移说明，请参阅迁移特征列指南。tf.feature_columns 模块旨在与 TF1 Estimators 结合使用。它不在我们的兼容性保证范围内，除了安全漏洞修正外，不会获得其他修正。

我们将使用一个小型数据集，该数据集由克利夫兰心脏病诊所基金会（Cleveland Clinic Foundation for Heart Disease）提供。CSV 中有几百行数据。每行描述了一个病人（patient），每列描述了一个属性（attribute）。我们将使用这些信息来预测一位病人是否患有心脏病，这是在该数据集上的二分类任务。

用 Pandas 导入 CSV 文件。
用 tf.data 建立了一个输入流水线（pipeline），用于对行进行分批（batch）和随机排序（shuffle）。
用特征列将 CSV 中的列映射到用于训练模型的特征。
用 Keras 构建，训练并评估模型。

数据集#

下面是该数据集的描述。请注意，有数值（numeric）和类别（categorical）类型的列。

Following is a description of this dataset. Notice there are both numeric and categorical columns. There is a free text column which we will not use in this tutorial.

列	描述	特征类型	数据类型
Type	动物类型（狗、猫）	分类	字符串
Age	宠物年龄	数值	整数
Breed1	宠物的主要品种	分类	字符串
Color1	宠物的颜色 1	分类	字符串
Color2	宠物的颜色 2	分类	字符串
MaturitySize	成年个体大小	分类	字符串
FurLength	毛发长度	分类	字符串
Vaccinated	宠物已接种疫苗	分类	字符串
Sterilized	宠物已绝育	分类	字符串
Health	健康状况	分类	字符串
Fee	领养费	数值	整数
Description	关于此宠物的简介	文本	字符串
PhotoAmt	为该宠物上传的照片总数	数值	整数
AdoptionSpeed	领养速度	分类	整数

导入 TensorFlow 和其他库#

!pip install sklearn

import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

使用 Pandas 创建一个 dataframe#

Pandas 是一个 Python 库，它有许多有用的实用程序，用于加载和处理结构化数据。我们将使用 Pandas 从 URL下载数据集，并将其加载到 dataframe 中。

import pathlib

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)

dataframe.head()

创建目标变量#

原始数据集中的任务是预测宠物被领养的速度（例如，在第一周、第一个月、前三个月等）。我们针对教程进行一下简化。在这里，我们将把它转化为一个二元分类问题，并简单地预测宠物是否被领养。

修改标签列后，0 表示宠物未被领养，1 表示宠物已被领养。

# In the original dataset "4" indicates the pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])

将 dataframe 拆分为训练、验证和测试集#

我们下载的数据集是一个 CSV 文件。我们将其拆分为训练、验证和测试集。

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

用 tf.data 创建输入流水线#

接下来，我们将使用 tf.data 包装 dataframe。这让我们能将特征列作为一座桥梁，该桥梁将 Pandas dataframe 中的列映射到用于训练模型的特征。如果我们使用一个非常大的 CSV 文件（非常大以至于它不能放入内存），我们将使用 tf.data 直接从磁盘读取它。本教程不涉及这一点。

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 5 # A small batch sized is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

理解输入流水线#

现在我们已经创建了输入流水线，让我们调用它来查看它返回的数据的格式。我们使用了一小批量大小来保持输出的可读性。

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['Age'])
  print('A batch of targets:', label_batch )

我们可以看到数据集返回了一个字典，该字典从列名称（来自 dataframe）映射到 dataframe 中行的列值。

演示几种特征列#

TensorFlow 提供了多种特征列。本节中，我们将创建几类特征列，并演示特征列如何转换 dataframe 中的列。

# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]

# A utility method to create a feature column
# and to transform a batch of data
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

数值列#

一个特征列的输出将成为模型的输入（使用上面定义的 demo 函数，我们将能准确地看到 dataframe 中的每列的转换方式）。数值列（numeric column）是最简单的列类型。它用于表示实数特征。使用此列时，模型将从 dataframe 中接收未更改的列值。

photo_count = feature_column.numeric_column('PhotoAmt')
demo(photo_count)

在这个心脏病数据集中，dataframe 中的大多数列都是数值列。

分桶列#

通常，您不希望将数字直接输入模型，而是根据数值范围将其值分成不同的类别。考虑代表一个人年龄的原始数据。我们可以用分桶列（bucketized column）将年龄分成几个分桶（buckets），而不是将年龄表示成数值列。请注意下面的 one-hot 数值表示每行匹配的年龄范围。

age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 3, 5])
demo(age_buckets)

分类列#

在此数据集中，thal 用字符串表示（如 ‘fixed’，‘normal’，或 ‘reversible’）。我们无法直接将字符串提供给模型。相反，我们必须首先将它们映射到数值。分类词汇列（categorical vocabulary columns）提供了一种用 one-hot 向量表示字符串的方法（就像您在上面看到的年龄分桶一样）。词汇表可以用 categorical_column_with_vocabulary_list 作为 list 传递，或者用 categorical_column_with_vocabulary_file 从文件中加载。

animal_type = feature_column.categorical_column_with_vocabulary_list(
      'Type', ['Cat', 'Dog'])

animal_type_one_hot = feature_column.indicator_column(animal_type)
demo(animal_type_one_hot)

嵌入列#

假设我们不是只有几个可能的字符串，而是每个类别有数千（或更多）值。由于多种原因，随着类别数量的增加，使用 one-hot 编码训练神经网络变得不可行。我们可以使用嵌入列来克服此限制。嵌入列（embedding column）将数据表示为一个低维度密集向量，而非多维的 one-hot 向量，该低维度密集向量可以包含任何数，而不仅仅是 0 或 1。嵌入的大小（在下面的示例中为 8）是必须调整的参数。

关键点：当分类列具有许多可能的值时，最好使用嵌入列。我们在这里使用嵌入列用于演示目的，为此您有一个完整的示例，以在将来可以修改用于其他数据集。

# Notice the input to the embedding column is the categorical column
# we previously created
breed1 = feature_column.categorical_column_with_vocabulary_list(
      'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
demo(breed1_embedding)

经过哈希处理的特征列#

表示具有大量数值的分类列的另一种方法是使用 categorical_column_with_hash_bucket。该特征列计算输入的一个哈希值，然后选择一个 hash_bucket_size 分桶来编码字符串。使用此列时，您不需要提供词汇表，并且可以选择使 hash_buckets 的数量远远小于实际类别的数量以节省空间。

关键点：该技术的一个重要缺点是可能存在冲突，不同的字符串被映射到同一个范围。实际上，无论如何，经过哈希处理的特征列对某些数据集都有效。

breed1_hashed = feature_column.categorical_column_with_hash_bucket(
      'Breed1', hash_bucket_size=10)
demo(feature_column.indicator_column(breed1_hashed))

组合的特征列#

将多种特征组合到一个特征中，称为特征组合（feature crosses），它让模型能够为每种特征组合学习单独的权重。此处，我们将创建一个 age 和 thal 组合的新特征。请注意，crossed_column 不会构建所有可能组合的完整列表（可能非常大）。相反，它由 hashed_column 支持，因此您可以选择表的大小。

crossed_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

选择要使用的列#

我们已经了解了如何使用几种类型的特征列。现在我们将使用它们来训练模型。本教程的目标是向您展示使用特征列所需的完整代码（例如，机制）。我们任意地选择了几列来训练我们的模型。

关键点：如果您的目标是建立一个准确的模型，请尝试使用您自己的更大的数据集，并仔细考虑哪些特征最有意义，以及如何表示它们。

feature_columns = []

# numeric cols
for header in ['PhotoAmt', 'Fee', 'Age']:
  feature_columns.append(feature_column.numeric_column(header))

# bucketized cols
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])
feature_columns.append(age_buckets)

# indicator_columns
indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, dataframe[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)

# embedding columns
breed1 = feature_column.categorical_column_with_vocabulary_list(
      'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
feature_columns.append(breed1_embedding)

# crossed columns
age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)
feature_columns.append(feature_column.indicator_column(age_type_feature))

建立一个新的特征层#

现在我们已经定义了我们的特征列，我们将使用密集特征（DenseFeatures）层将特征列输入到我们的 Keras 模型中。

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

之前，我们使用一个小批量大小来演示特征列如何运转。我们将创建一个新的更大批量的输入流水线。

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

创建，编译和训练模型#

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dropout(.1),
  layers.Dense(1)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=10)

loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

关键点：通常使用更大更复杂的数据集进行深度学习，您将看到最佳结果。使用像这样的小数据集时，我们建议使用决策树或随机森林作为强有力的基准。本教程的目的不是训练一个准确的模型，而是演示处理结构化数据的机制，这样，在将来使用自己的数据集时，您有可以使用的代码作为起点。

后续步骤#

了解有关分类结构化数据的更多信息的最佳方法是亲自尝试。我们建议寻找另一个可以使用的数据集，并使用和上面相似的代码，训练一个模型，对其分类。要提高准确率，请仔细考虑模型中包含哪些特征，以及如何表示这些特征。