{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "rNdWfPXCjTjY" }, "outputs": [], "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "I1dUQ0GejU8N", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "c05P9g5WjizZ" }, "source": [ "# 对结构化数据进行分类" ] }, { "cell_type": "markdown", "metadata": { "id": "zofH_gCzgplN" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看\n", " 在 Google Colab 运行\n", " 在 Github 上查看源代码 下载笔记本\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "K1y4OHpGgss7" }, "source": [ "> 警告:不推荐为新代码使用本教程中介绍的 `tf.feature_columns` 模块。[Keras 预处理层](https://tensorflow.google.cn/tutorials/structured_data/preprocessing_layers)介绍了此功能,有关迁移说明,请参阅[迁移特征列](../../guide/migrate/migrating_feature_columns.ipynb)指南。`tf.feature_columns` 模块旨在与 TF1 `Estimators` 结合使用。它不在我们的[兼容性保证](https://tensorflow.org/guide/versions)范围内,除了安全漏洞修正外,不会获得其他修正。\n", "\n", "我们将使用一个小型 [数据集](https://archive.ics.uci.edu/ml/datasets/heart+Disease),该数据集由克利夫兰心脏病诊所基金会(Cleveland Clinic Foundation for Heart Disease)提供。CSV 中有几百行数据。每行描述了一个病人(patient),每列描述了一个属性(attribute)。我们将使用这些信息来预测一位病人是否患有心脏病,这是在该数据集上的二分类任务。\n", "\n", "- 用 [Pandas](https://pandas.pydata.org/) 导入 CSV 文件。\n", "- 用 [tf.data](https://tensorflow.google.cn/guide/datasets) 建立了一个输入流水线(pipeline),用于对行进行分批(batch)和随机排序(shuffle)。\n", "- 用特征列将 CSV 中的列映射到用于训练模型的特征。\n", "- 用 Keras 构建,训练并评估模型。\n", "\n", "## 数据集\n", "\n", "下面是该数据集的[描述](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names)。 请注意,有数值(numeric)和类别(categorical)类型的列。\n", "\n", "Following is a description of this dataset. Notice there are both numeric and categorical columns. There is a free text column which we will not use in this tutorial.\n", "\n", "列 | 描述 | 特征类型 | 数据类型\n", "--- | --- | --- | ---\n", "Type | 动物类型(狗、猫) | 分类 | 字符串\n", "Age | 宠物年龄 | 数值 | 整数\n", "Breed1 | 宠物的主要品种 | 分类 | 字符串\n", "Color1 | 宠物的颜色 1 | 分类 | 字符串\n", "Color2 | 宠物的颜色 2 | 分类 | 字符串\n", "MaturitySize | 成年个体大小 | 分类 | 字符串\n", "FurLength | 毛发长度 | 分类 | 字符串\n", "Vaccinated | 宠物已接种疫苗 | 分类 | 字符串\n", "Sterilized | 宠物已绝育 | 分类 | 字符串\n", "Health | 健康状况 | 分类 | 字符串\n", "Fee | 领养费 | 数值 | 整数\n", "Description | 关于此宠物的简介 | 文本 | 字符串\n", "PhotoAmt | 为该宠物上传的照片总数 | 数值 | 整数\n", "AdoptionSpeed | 领养速度 | 分类 | 整数" ] }, { "cell_type": "markdown", "metadata": { "id": "VxyBFc_kKazA" }, "source": [ "## 导入 TensorFlow 和其他库" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LuOWVJBz8a6G", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "!pip install sklearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9dEreb4QKizj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import tensorflow as tf\n", "\n", "from tensorflow import feature_column\n", "from tensorflow.keras import layers\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "id": "KCEhSZcULZ9n" }, "source": [ "## 使用 Pandas 创建一个 dataframe\n", "\n", "[Pandas](https://pandas.pydata.org/) 是一个 Python 库,它有许多有用的实用程序,用于加载和处理结构化数据。我们将使用 Pandas 从 URL下载数据集,并将其加载到 dataframe 中。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "REZ57BXCLdfG", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import pathlib\n", "\n", "dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'\n", "csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'\n", "\n", "tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,\n", " extract=True, cache_dir='.')\n", "dataframe = pd.read_csv(csv_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "o8QIi0_jT5LM", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "dataframe.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "awGiBeBWbQC8" }, "source": [ "## 创建目标变量\n", "\n", "原始数据集中的任务是预测宠物被领养的速度(例如,在第一周、第一个月、前三个月等)。我们针对教程进行一下简化。在这里,我们将把它转化为一个二元分类问题,并简单地预测宠物是否被领养。\n", 
"\n", "修改标签列后,0 表示宠物未被领养,1 表示宠物已被领养。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xcbTpEXWbMDz", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# In the original dataset \"4\" indicates the pet was not adopted.\n", "dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)\n", "\n", "# Drop un-used columns.\n", "dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])" ] }, { "cell_type": "markdown", "metadata": { "id": "u0zhLtQqMPem" }, "source": [ "## 将 dataframe 拆分为训练、验证和测试集\n", "\n", "我们下载的数据集是一个 CSV 文件。 我们将其拆分为训练、验证和测试集。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YEOpw7LhMYsI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train, test = train_test_split(dataframe, test_size=0.2)\n", "train, val = train_test_split(train, test_size=0.2)\n", "print(len(train), 'train examples')\n", "print(len(val), 'validation examples')\n", "print(len(test), 'test examples')" ] }, { "cell_type": "markdown", "metadata": { "id": "84ef46LXMfvu" }, "source": [ "## 用 tf.data 创建输入流水线\n", "\n", "接下来,我们将使用 [tf.data](https://tensorflow.google.cn/guide/datasets) 包装 dataframe。这让我们能将特征列作为一座桥梁,该桥梁将 Pandas dataframe 中的列映射到用于训练模型的特征。如果我们使用一个非常大的 CSV 文件(非常大以至于它不能放入内存),我们将使用 tf.data 直接从磁盘读取它。本教程不涉及这一点。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NkcaMYP-MsRe", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# A utility method to create a tf.data dataset from a Pandas Dataframe\n", "def df_to_dataset(dataframe, shuffle=True, batch_size=32):\n", " dataframe = dataframe.copy()\n", " labels = dataframe.pop('target')\n", " ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))\n", " if shuffle:\n", " ds = ds.shuffle(buffer_size=len(dataframe))\n", " ds = ds.batch(batch_size)\n", " return ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CXbbXkJvMy34", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "batch_size = 5 # A small batch sized is used for demonstration purposes\n", "train_ds = df_to_dataset(train, batch_size=batch_size)\n", "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "qRLGSMDzM-dl" }, "source": [ "## 理解输入流水线\n", "\n", "现在我们已经创建了输入流水线,让我们调用它来查看它返回的数据的格式。 我们使用了一小批量大小来保持输出的可读性。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CSBo3dUVNFc9", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for feature_batch, label_batch in train_ds.take(1):\n", " print('Every feature:', list(feature_batch.keys()))\n", " print('A batch of ages:', feature_batch['Age'])\n", " print('A batch of targets:', label_batch )" ] }, { "cell_type": "markdown", "metadata": { "id": "OT5N6Se-NQsC" }, "source": [ "我们可以看到数据集返回了一个字典,该字典从列名称(来自 dataframe)映射到 dataframe 中行的列值。" ] }, { "cell_type": "markdown", "metadata": { "id": "ttIvgLRaNoOQ" }, "source": [ "## 演示几种特征列\n", "\n", "TensorFlow 提供了多种特征列。本节中,我们将创建几类特征列,并演示特征列如何转换 dataframe 中的列。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mxwiHFHuNhmf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# We will use this batch to demonstrate several types of feature columns\n", "example_batch = next(iter(train_ds))[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0wfLB8Q3N3UH", "vscode": { "languageId": "python" } 
}, "outputs": [], "source": [ "# A utility method to create a feature column\n", "# and to transform a batch of data\n", "def demo(feature_column):\n", " feature_layer = layers.DenseFeatures(feature_column)\n", " print(feature_layer(example_batch).numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "Q7OEKe82N-Qb" }, "source": [ "### 数值列\n", "\n", "一个特征列的输出将成为模型的输入(使用上面定义的 demo 函数,我们将能准确地看到 dataframe 中的每列的转换方式)。 [数值列(numeric column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/numeric_column) 是最简单的列类型。它用于表示实数特征。使用此列时,模型将从 dataframe 中接收未更改的列值。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QZTZ0HnHOCxC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "photo_count = feature_column.numeric_column('PhotoAmt')\n", "demo(photo_count)" ] }, { "cell_type": "markdown", "metadata": { "id": "7a6ddSyzOKpq" }, "source": [ "在这个心脏病数据集中,dataframe 中的大多数列都是数值列。" ] }, { "cell_type": "markdown", "metadata": { "id": "IcSxUoYgOlA1" }, "source": [ "### 分桶列\n", "\n", "通常,您不希望将数字直接输入模型,而是根据数值范围将其值分成不同的类别。考虑代表一个人年龄的原始数据。我们可以用 [分桶列(bucketized column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/bucketized_column)将年龄分成几个分桶(buckets),而不是将年龄表示成数值列。请注意下面的 one-hot 数值表示每行匹配的年龄范围。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wJ4Wt3SAOpTQ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "age = feature_column.numeric_column('Age')\n", "age_buckets = feature_column.bucketized_column(age, boundaries=[1, 3, 5])\n", "demo(age_buckets)" ] }, { "cell_type": "markdown", "metadata": { "id": "r1tArzewPb-b" }, "source": [ "### 分类列\n", "\n", "在此数据集中,thal 用字符串表示(如 'fixed','normal',或 'reversible')。我们无法直接将字符串提供给模型。相反,我们必须首先将它们映射到数值。分类词汇列(categorical vocabulary columns)提供了一种用 one-hot 向量表示字符串的方法(就像您在上面看到的年龄分桶一样)。词汇表可以用 [categorical_column_with_vocabulary_list](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list) 作为 list 传递,或者用 [categorical_column_with_vocabulary_file](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file) 从文件中加载。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DJ6QnSHkPtOC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "animal_type = feature_column.categorical_column_with_vocabulary_list(\n", " 'Type', ['Cat', 'Dog'])\n", "\n", "animal_type_one_hot = feature_column.indicator_column(animal_type)\n", "demo(animal_type_one_hot)" ] }, { "cell_type": "markdown", "metadata": { "id": "LEFPjUr6QmwS" }, "source": [ "### 嵌入列\n", "\n", "假设我们不是只有几个可能的字符串,而是每个类别有数千(或更多)值。 由于多种原因,随着类别数量的增加,使用 one-hot 编码训练神经网络变得不可行。我们可以使用嵌入列来克服此限制。[嵌入列(embedding column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/embedding_column)将数据表示为一个低维度密集向量,而非多维的 one-hot 向量,该低维度密集向量可以包含任何数,而不仅仅是 0 或 1。嵌入的大小(在下面的示例中为 8)是必须调整的参数。\n", "\n", "关键点:当分类列具有许多可能的值时,最好使用嵌入列。我们在这里使用嵌入列用于演示目的,为此您有一个完整的示例,以在将来可以修改用于其他数据集。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hSlohmr2Q_UU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# Notice the input to the embedding column is the categorical column\n", "# we previously created\n", "breed1 = feature_column.categorical_column_with_vocabulary_list(\n", " 'Breed1', dataframe.Breed1.unique())\n", "breed1_embedding = feature_column.embedding_column(breed1, dimension=8)\n", "demo(breed1_embedding)" ] }, { "cell_type": "markdown", "metadata": { "id": "urFCAvTVRMpB" }, "source": [ "### 
{ "cell_type": "markdown", "metadata": { "id": "urFCAvTVRMpB" }, "source": [ "### Hashed feature columns\n", "\n", "Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode the string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space.\n", "\n", "Key point: an important downside of this technique is that there may be collisions, in which different strings are mapped to the same bucket. In practice, it can nevertheless work well for some datasets." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "YHU_Aj2nRRDC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "breed1_hashed = feature_column.categorical_column_with_hash_bucket(\n", "    'Breed1', hash_bucket_size=10)\n", "demo(feature_column.indicator_column(breed1_hashed))" ] },
{ "cell_type": "markdown", "metadata": { "id": "fB94M27DRXtZ" }, "source": [ "### Crossed feature columns\n", "\n", "Combining features into a single feature, better known as a [feature cross](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of Age and Type. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a `hashed_column`, so you can choose how large the table is." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "oaPVERd9Rep6", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "crossed_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=10)\n", "demo(feature_column.indicator_column(crossed_feature))" ] },
{ "cell_type": "markdown", "metadata": { "id": "ypkI9zx6Rj1q" }, "source": [ "## Choose which columns to use\n", "\n", "We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (i.e., the mechanics) needed to work with feature columns. We have selected a few columns below, somewhat arbitrarily, to train our model, and we will run a quick sanity check after defining them.\n", "\n", "Key point: if your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "4PlLY7fORuzA", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "feature_columns = []\n", "\n", "# numeric cols\n", "for header in ['PhotoAmt', 'Fee', 'Age']:\n", "  feature_columns.append(feature_column.numeric_column(header))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "jdF4rXkcDmBl", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# bucketized cols\n", "age = feature_column.numeric_column('Age')\n", "age_buckets = feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])\n", "feature_columns.append(age_buckets)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "RsteO7FGDmNc", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# indicator_columns\n", "indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',\n", "                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']\n", "for col_name in indicator_column_names:\n", "  categorical_column = feature_column.categorical_column_with_vocabulary_list(\n", "      col_name, dataframe[col_name].unique())\n", "  indicator_column = feature_column.indicator_column(categorical_column)\n", "  feature_columns.append(indicator_column)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "6MhdqQ5uDmYU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# embedding columns\n", "breed1 = feature_column.categorical_column_with_vocabulary_list(\n", "    'Breed1', dataframe.Breed1.unique())\n", "breed1_embedding = feature_column.embedding_column(breed1, dimension=8)\n", "feature_columns.append(breed1_embedding)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "qkzRNfCLDsQf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# crossed columns\n", "age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)\n", "feature_columns.append(feature_column.indicator_column(age_type_feature))" ] },
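{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (an addition, reusing `example_batch` from earlier), the sketch below confirms that the selected columns concatenate into a single dense vector per example, and prints its width." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# A quick sanity check (sketch): DenseFeatures concatenates every selected\n", "# feature column into one dense vector per example.\n", "all_features = layers.DenseFeatures(feature_columns)\n", "print('Combined feature vector width:', all_features(example_batch).shape[-1])" ] },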
{ "cell_type": "markdown", "metadata": { "id": "M-nDp8krS_ts" }, "source": [ "### Create a feature layer\n", "\n", "Now that we have defined our feature columns, we will use a [DenseFeatures](https://tensorflow.google.cn/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to input them to our Keras model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "6o-El1R2TGQP", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "feature_layer = tf.keras.layers.DenseFeatures(feature_columns)" ] },
{ "cell_type": "markdown", "metadata": { "id": "8cf6vKfgTH0U" }, "source": [ "Earlier, we used a small batch size to demonstrate how feature columns work. We will now create a new input pipeline with a larger batch size." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "gcemszoGSse_", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "batch_size = 32\n", "train_ds = df_to_dataset(train, batch_size=batch_size)\n", "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" ] },
{ "cell_type": "markdown", "metadata": { "id": "bBx4Xu0eTXWq" }, "source": [ "## Create, compile, and train the model" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "_YJPPb3xTPeZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "model = tf.keras.Sequential([\n", "  feature_layer,\n", "  layers.Dense(128, activation='relu'),\n", "  layers.Dense(128, activation='relu'),\n", "  layers.Dropout(.1),\n", "  layers.Dense(1)\n", "])\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'])\n", "\n", "model.fit(train_ds,\n", "          validation_data=val_ds,\n", "          epochs=10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "GnFmMOW0Tcaa", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "loss, accuracy = model.evaluate(test_ds)\n", "print(\"Accuracy\", accuracy)" ] },
{ "cell_type": "markdown", "metadata": { "id": "3bdfbq20V6zu" }, "source": [ "Key point: you will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future." ] },
{ "cell_type": "markdown", "metadata": { "id": "SotnhVWuHQCw" }, "source": [ "## Next steps\n", "\n", "The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented." ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "feature_columns.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }