{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "zwBCE43Cv3PH" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fOad0I2cv569" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "YQB7yiF6v9GR" }, "source": [ "# Load a pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": { "id": "Oqa952X4wQKK" }, "source": [ "View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "UmyEaf4Awl2v" }, "source": [ "This tutorial provides examples of how to load pandas DataFrames into TensorFlow.\n", "\n", "You will use a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. The CSV file contains several hundred rows. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which is a binary classification task." ] }, { "cell_type": "markdown", "metadata": { "id": "iiyC7HkqxlUD" }, "source": [ "## Reading data using pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5IoRbCA2n0_V" }, "outputs": [], "source": [ "import pandas as pd\n", "import tensorflow as tf\n", "\n", "SHUFFLE_BUFFER = 500\n", "BATCH_SIZE = 2" ] }, { "cell_type": "markdown", "metadata": { "id": "-2kBGy_pxn47" }, "source": [ "Download the CSV file containing the heart disease dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VS4w2LePn9g3" }, "outputs": [], "source": [ "csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "6BXRPD2-xtQ1" }, "source": [ "Read the CSV file using pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UEfJ8TcMpe-2" }, "outputs": [], "source": [ "df = pd.read_csv(csv_file)" ] }, { "cell_type": "markdown", "metadata": { "id": "4K873P-Pp8c7" }, "source": [ "This is what the data looks like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8FkK6QIRpjd4" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_MOAKz654CT5" }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": { "id": "jVyGjKvnqGlb" }, "source": [ "You will build models to predict the label contained in the `target` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2wwhILm1ycSp" }, "outputs": [], "source": [ "target = df.pop('target')" ] }, { "cell_type": "markdown", "metadata": { "id": "vFGv9fgjDeao" }, "source": [ "## Create and train a model" ] }, { "cell_type": 
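"markdown", "metadata": {}, "source": [ "The `target` column holds a binary label. As a quick sketch added to this tutorial (not part of the original flow; `Series.value_counts` is a standard pandas method), you can inspect the class balance before training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count how many examples belong to each class.\n", "target.value_counts()" ] }, { "cell_type": 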
"markdown", "metadata": { "id": "xNxJ41MafiB-" }, "source": [ "If your data has a uniform datatype, or `dtype`, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the `pandas.DataFrame` class supports the `__array__` protocol, and TensorFlow's `tf.convert_to_tensor` function accepts objects that support the protocol.\n", "\n", "Take the numeric features from the dataset (skip the categorical features for now):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b9VlFGAie3K0" }, "outputs": [], "source": [ "numeric_feature_names = ['age', 'thalach', 'trestbps', 'chol', 'oldpeak']\n", "numeric_features = df[numeric_feature_names]\n", "numeric_features.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "Xe1CMRvSpR_R" }, "source": [ "The DataFrame can be converted to a NumPy array using the `DataFrame.values` attribute or `numpy.array(df)`. To convert it to a tensor, use `tf.convert_to_tensor`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OVv6Nwc9oDBU" }, "outputs": [], "source": [ "tf.convert_to_tensor(numeric_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "7iRYvoTrr1_G" }, "source": [ "In general, if an object can be converted to a tensor with `tf.convert_to_tensor`, it can be passed anywhere you can pass a `tf.Tensor`." ] }, { "cell_type": "markdown", "metadata": { "id": "RVF7_Z-Mp-qD" }, "source": [ "### With Model.fit" ] }, { "cell_type": "markdown", "metadata": { "id": "Vqkc9gIapQNu" }, "source": [ "A DataFrame, interpreted as a single tensor, can be used directly as an argument to the `Model.fit` method.\n", "\n", "Below is an example of training a model on the numeric features of the dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "u8M3oYHZgH_t" }, "source": [ "The first step is to normalize the input ranges. Use a `tf.keras.layers.Normalization` layer for that.\n", "\n", "To set the layer's mean and standard deviation before running it, be sure to call the `Normalization.adapt` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "88XTmyEdgkJn" }, "outputs": [], "source": [ "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(numeric_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "_D7JqUtnYCnb" }, "source": [ "Call the layer on the first three rows of the DataFrame to visualize a sample of its output:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jOwzIG-DhB0y" }, "outputs": [], "source": [ "normalizer(numeric_features.iloc[:3])" ] }, { 
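"cell_type": "markdown", "metadata": {}, "source": [ "As a quick check (a sketch added to this tutorial, not part of the original flow), the adapted layer's output over the full dataset should have roughly zero mean and unit standard deviation per feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The adapted Normalization layer standardizes each column.\n", "normalized = normalizer(numeric_features)\n", "print(tf.reduce_mean(normalized, axis=0))     # close to 0 for every feature\n", "print(tf.math.reduce_std(normalized, axis=0)) # close to 1 for every feature" ] }, { 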
"cell_type": "markdown", "metadata": { "id": "KWKcuVZJh-HY" }, "source": [ "Use the normalization layer as the first layer of a simple model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lu-bni-nh6mX" }, "outputs": [], "source": [ "def get_basic_model():\n", "  model = tf.keras.Sequential([\n", "    normalizer,\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(1)\n", "  ])\n", "\n", "  model.compile(optimizer='adam',\n", "                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "                metrics=['accuracy'])\n", "  return model" ] }, { "cell_type": "markdown", "metadata": { "id": "ntGi6ngYitob" }, "source": [ "When you pass the DataFrame as the `x` argument to `Model.fit`, Keras treats the DataFrame as it would a NumPy array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XMjM-eddiNNT" }, "outputs": [], "source": [ "model = get_basic_model()\n", "model.fit(numeric_features, target, epochs=15, batch_size=BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "EjtQbsRPEoJT" }, "source": [ "### With tf.data" ] }, { "cell_type": "markdown", "metadata": { "id": "nSjV5gy3EsVv" }, "source": [ "If you want to apply `tf.data` transformations to a DataFrame of a uniform `dtype`, the `Dataset.from_tensor_slices` method will create a dataset that iterates over the rows of the DataFrame. Each row is initially a vector of values. To train a model, you need `(inputs, labels)` pairs, so pass `(features, labels)` and `Dataset.from_tensor_slices` will return the needed pairs of slices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FCphpgdRGikx" }, "outputs": [], "source": [ "numeric_dataset = tf.data.Dataset.from_tensor_slices((numeric_features, target))\n", "\n", "for row in numeric_dataset.take(3):\n", "  print(row)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lStkN86gEkCe" }, "outputs": [], "source": [ "numeric_batches = numeric_dataset.shuffle(1000).batch(BATCH_SIZE)\n", "\n", "model = get_basic_model()\n", "model.fit(numeric_batches, epochs=15)" ] }, { "cell_type": "markdown", "metadata": { "id": "NRASs9IIESWQ" }, "source": [ "## A DataFrame as a dictionary" ] 
}, { "cell_type": "markdown", "metadata": { "id": "NQcp7kiPF8TP" }, "source": [ "When you start dealing with heterogeneous data, it is no longer possible to treat the DataFrame as if it were a single array. TensorFlow tensors require that all elements have the same `dtype`.\n", "\n", "So, in this case, you need to start treating it as a dictionary of columns, where each column has a uniform `dtype`. A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict. Many important TensorFlow APIs support (nested-)dictionaries of arrays as inputs." ] }, { "cell_type": "markdown", "metadata": { "id": "9y5UMKL8bury" }, "source": [ "`tf.data` input pipelines handle this quite well. All `tf.data` operations handle dictionaries and tuples automatically. So, to make a dataset of dictionary-examples from a DataFrame, just cast it to a dict before slicing it with `Dataset.from_tensor_slices`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U3QDo-jwHYXc" }, "outputs": [], "source": [ "numeric_dict_ds = tf.data.Dataset.from_tensor_slices((dict(numeric_features), target))" ] }, { "cell_type": "markdown", "metadata": { "id": "yyEERK9ldIi_" }, "source": [ "Here are the first three examples from that dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q0tDwk0VdH6D" }, "outputs": [], "source": [ "for row in numeric_dict_ds.take(3):\n", "  print(row)" ] }, { "cell_type": "markdown", "metadata": { "id": "DEAM6HAFxlMy" }, "source": [ "### Dictionaries with Keras" ] }, { "cell_type": "markdown", "metadata": { "id": "dnoyoWLWx07i" }, "source": [ "Typically, Keras models and layers expect a single input tensor, but these classes can accept and return nested structures of dictionaries, tuples and tensors. These structures are known as \"nests\" (refer to the `tf.nest` module for details).\n", "\n", "There are two equivalent ways you can write a Keras model that accepts a dictionary as input." ] }, { "cell_type": "markdown", "metadata": { "id": "5xUTrm0apDTr" }, "source": [ "#### 1. 
The Model-subclass style\n", "\n", "You write a subclass of `tf.keras.Model` (or `tf.keras.Layer`). You directly handle the inputs and create the outputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Zc3HV99CFRWL" }, "outputs": [], "source": [ "def stack_dict(inputs, fun=tf.stack):\n", "  values = []\n", "  for key in sorted(inputs.keys()):\n", "    values.append(tf.cast(inputs[key], tf.float32))\n", "\n", "  return fun(values, axis=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Rz4Cg6WpzNzi" }, "outputs": [], "source": [ "#@title\n", "class MyModel(tf.keras.Model):\n", "  def __init__(self):\n", "    # Create all the internal layers in init.\n", "    super().__init__()\n", "\n", "    self.normalizer = tf.keras.layers.Normalization(axis=-1)\n", "\n", "    self.seq = tf.keras.Sequential([\n", "      self.normalizer,\n", "      tf.keras.layers.Dense(10, activation='relu'),\n", "      tf.keras.layers.Dense(10, activation='relu'),\n", "      tf.keras.layers.Dense(1)\n", "    ])\n", "\n", "  def adapt(self, inputs):\n", "    # Stack the inputs and `adapt` the normalization layer.\n", "    inputs = stack_dict(inputs)\n", "    self.normalizer.adapt(inputs)\n", "\n", "  def call(self, inputs):\n", "    # Stack the inputs\n", "    inputs = stack_dict(inputs)\n", "    # Run them through all the layers.\n", "    result = self.seq(inputs)\n", "\n", "    return result\n", "\n", "model = MyModel()\n", "\n", "model.adapt(dict(numeric_features))\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'],\n", "              run_eagerly=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "hMLXNEDF_tu2" }, "source": [ "This model can accept either a dictionary of columns or a dataset of dictionary-elements for training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v3xEjtHY8gZG" }, "outputs": [], "source": [ "model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "73wgiTaVAA2F" }, "outputs": [], "source": [ "numeric_dict_batches = 
numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)\n", "model.fit(numeric_dict_batches, epochs=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "-xDB3HLZGzAb" }, "source": [ "Here are the predictions for the first three examples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xtolTQA-GpBW" }, "outputs": [], "source": [ "model.predict(dict(numeric_features.iloc[:3]))" ] }, { "cell_type": "markdown", "metadata": { "id": "QIIdxIYm13Ik" }, "source": [ "#### 2. The Keras functional style" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DG_bmO0sS_G5" }, "outputs": [], "source": [ "inputs = {}\n", "for name, column in numeric_features.items():\n", "  inputs[name] = tf.keras.Input(\n", "      shape=(1,), name=name, dtype=tf.float32)\n", "\n", "inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9iXU9oem12dL" }, "outputs": [], "source": [ "x = stack_dict(inputs, fun=tf.concat)\n", "\n", "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(stack_dict(dict(numeric_features)))\n", "\n", "x = normalizer(x)\n", "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", "x = tf.keras.layers.Dense(1)(x)\n", "\n", "model = tf.keras.Model(inputs, x)\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'],\n", "              run_eagerly=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xrAxmuJrEwnf" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(model, rankdir=\"LR\", show_shapes=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "UYtoAOIzCFY1" }, "source": [ "You can train the functional model the same way as the model subclass:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yAwjPq7I_ehX" }, "outputs": [], "source": [ "model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "brwodxxVApO_" }, 
"outputs": [], "source": [ "numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)\n", "model.fit(numeric_dict_batches, epochs=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "xhn0Bt_Xw4nO" }, "source": [ "## Full example" ] }, { "cell_type": "markdown", "metadata": { "id": "zYQ5fDaRxRWQ" }, "source": [ "If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model. [Keras preprocessing layers](https://tensorflow.google.cn/guide/keras/preprocessing_layers) cover many common tasks." ] }, { "cell_type": "markdown", "metadata": { "id": "BFsDZeu-BQ-h" }, "source": [ "### Build the preprocessing head" ] }, { "cell_type": "markdown", "metadata": { "id": "C6aVQN4Gw-Va" }, "source": [ "In this dataset some of the \"integer\" features in the raw data are actually categorical indices. These indices are not really ordered numeric values (refer to the dataset description for details). Because these are unordered they are inappropriate to feed directly to the model; the model would interpret them as being ordered. To use these inputs you'll need to encode them, either as one-hot vectors or embedding vectors. The same applies to string-categorical features.\n", "\n", "Note: If you have many features that need identical preprocessing, it's more efficient to concatenate them together before applying the preprocessing.\n", "\n", "Binary features, on the other hand, do not generally need to be encoded or normalized.\n", "\n", "Start by creating a list of the features that fall into each group:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IH2VCyLBPYX8" }, "outputs": [], "source": [ "binary_feature_names = ['sex', 'fbs', 'exang']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pxh4FPucOpDz" }, "outputs": [], "source": [ "categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']" ] }, { "cell_type": "markdown", "metadata": { "id": "HRcC8WkyamJb" }, "source": [ "The next step is to build a preprocessing model that will apply appropriate preprocessing to each input and concatenate the results.\n", "\n", "This section uses the [Keras Functional API](https://tensorflow.google.cn/guide/keras/functional) to implement the preprocessing. You start by creating one `tf.keras.Input` for each column of the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D3OeiteJbWvI" }, "outputs": [], "source": [ "inputs = {}\n", "for name, column in df.items():\n", "  if type(column[0]) == str:\n", "    dtype = tf.string\n", "  elif (name in categorical_feature_names or\n", "        name in binary_feature_names):\n", "    dtype = tf.int64\n", "  else:\n", "    dtype = tf.float32\n", "\n", "  inputs[name] = 
tf.keras.Input(shape=(), name=name, dtype=dtype)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5N3vBMjidpx6" }, "outputs": [], "source": [ "inputs" ] }, { "cell_type": "markdown", "metadata": { "id": "_EEmzxinyhI4" }, "source": [ "For each input you'll apply some transformations using Keras layers and TensorFlow ops. Each feature starts as a batch of scalars (`shape=(batch,)`). The output for each should be a batch of `tf.float32` vectors (`shape=(batch, n)`). The last step will concatenate all those vectors together.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ubBDazjNFWiF" }, "source": [ "#### Binary inputs\n", "\n", "Because the binary inputs don't need any preprocessing, just add the vector axis, cast them to `float32` and add them to the list of preprocessed inputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tmAIkOIid-Mp" }, "outputs": [], "source": [ "preprocessed = []\n", "\n", "for name in binary_feature_names:\n", "  inp = inputs[name]\n", "  inp = inp[:, tf.newaxis]\n", "  float_value = tf.cast(inp, tf.float32)\n", "  preprocessed.append(float_value)\n", "\n", "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "ZHQcdtG1GN7E" }, "source": [ "#### Numeric inputs\n", "\n", "Like in the earlier section, you'll want to run these numeric inputs through a `tf.keras.layers.Normalization` layer before using them. The difference is that this time they're input as a dict. The code below collects the numeric features from the DataFrame, stacks them together and passes those to the `Normalization.adapt` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UC9LaIBNIK5V" }, "outputs": [], "source": [ "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(stack_dict(dict(numeric_features)))" ] }, { "cell_type": "markdown", "metadata": { "id": "S537tideIpeh" }, "source": [ "This code stacks the numeric features and runs them through the normalization layer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U8MJiFpPK5uD" }, "outputs": [], "source": [ "numeric_inputs = {}\n", "for name in numeric_feature_names:\n", "  numeric_inputs[name] = inputs[name]\n", "\n", "numeric_inputs = stack_dict(numeric_inputs)\n", "numeric_normalized = normalizer(numeric_inputs)\n", "\n", "preprocessed.append(numeric_normalized)\n", "\n", "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "G5f-VzASKPF7" }, "source": [ "#### Categorical features" 
] }, { "cell_type": "markdown", "metadata": { "id": "Z3wcFs1oKVao" }, "source": [ "To use categorical features you'll first need to encode them into either binary vectors or embeddings. Because these features only contain a small number of categories, convert the inputs directly to one-hot vectors using the `output_mode='one_hot'` option, supported by both the `tf.keras.layers.StringLookup` and `tf.keras.layers.IntegerLookup` layers.\n", "\n", "Here is an example of how these layers work:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vXleJfBRS9xr" }, "outputs": [], "source": [ "vocab = ['a','b','c']\n", "lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')\n", "lookup(['c','a','a','b','zzz'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kRnsFYJiSVmH" }, "outputs": [], "source": [ "vocab = [1,4,7,99]\n", "lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')\n", "\n", "lookup([-1,4,1])" ] }, { "cell_type": "markdown", "metadata": { "id": "est6aCFBZDVs" }, "source": [ "To determine the vocabulary for each input, create a layer to convert that vocabulary to a one-hot vector:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HELhoFlo0H9Q" }, "outputs": [], "source": [ "for name in categorical_feature_names:\n", "  vocab = sorted(set(df[name]))\n", "  print(f'name: {name}')\n", "  print(f'vocab: {vocab}\\n')\n", "\n", "  if type(vocab[0]) is str:\n", "    lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')\n", "  else:\n", "    lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')\n", "\n", "  x = inputs[name][:, tf.newaxis]\n", "  x = lookup(x)\n", "  preprocessed.append(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "PzMMkwNBa2pK" }, "source": [ "#### Assemble the preprocessing head" ] }, { "cell_type": "markdown", "metadata": { "id": "GaQ-_pEQbCE8" }, "source": [ "At this point `preprocessed` is just a Python list of all the preprocessing results, each with a shape of `(batch_size, depth)`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LlLaq_BVRlnO" }, "outputs": [], "source": [ "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "U9lYYHIXbYv-" }, "source": [ "Concatenate all the preprocessed features along the `depth` axis, so each dictionary-example is converted into a single vector. The vector contains binary features, numeric features, and categorical one-hot features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j2I8vpQh313w" }, "outputs": [], "source": [ "preprocessed_result = tf.concat(preprocessed, axis=-1)\n", "preprocessed_result" ] }, { "cell_type": "markdown", "metadata": { "id": "OBFowyJtb0WB" }, "source": [ "Now create a model out of that calculation so it can be reused:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rHQBFHwE37TO" }, "outputs": [], "source": [ "preprocessor = tf.keras.Model(inputs, preprocessed_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ViMARQ-f6zfx" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(preprocessor, rankdir=\"LR\", show_shapes=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "IURRtL_WZbht" }, "source": [ "To test the preprocessor, use the `DataFrame.iloc` accessor to slice out the first example from the DataFrame. Then convert it to a dictionary and pass the dictionary to the preprocessor. The result is a single vector containing the binary features, normalized numeric features and the one-hot categorical features, in that order:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QjBzCKsZUj0y" }, "outputs": [], "source": [ "preprocessor(dict(df.iloc[:1]))" ] }, { "cell_type": "markdown", "metadata": { "id": "bB9C0XJkyQEk" }, "source": [ "### Create and train a model" ] }, { "cell_type": "markdown", "metadata": { "id": "WfU_FFXMbKGM" }, "source": [ "Now build the main body of the model. Use the same configuration as in the previous example: a couple of `Dense` rectified-linear layers and a `Dense(1)` output layer for the classification." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "75OxXTnfboKN" }, "outputs": [], "source": [ "body = tf.keras.Sequential([\n", "  tf.keras.layers.Dense(10, activation='relu'),\n", "  tf.keras.layers.Dense(10, activation='relu'),\n", "  tf.keras.layers.Dense(1)\n", "])" ] }, { "cell_type": "markdown", "metadata": { "id": "MpD6WNX5_zh5" }, "source": [ "Now put the two pieces together using the Keras functional API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_TY_BuVMbNcB" }, "outputs": [], "source": [ "inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iin2kvA9bDpz" }, "outputs": [], "source": [ "x = 
preprocessor(inputs)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FQd9PcPRpkP4" }, "outputs": [], "source": [ "result = body(x)\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v_KerrXabhgP" }, "outputs": [], "source": [ "model = tf.keras.Model(inputs, result)\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "S1MR-XD9kC6C" }, "source": [ "This model expects a dictionary of inputs. The simplest way to pass it the data is to convert the DataFrame to a dict and pass that dict as the `x` argument to `Model.fit`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ybDzNUheqxJw" }, "outputs": [], "source": [ "history = model.fit(dict(df), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "dacoEIB_BSsL" }, "source": [ "Using `tf.data` works as well:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rYadV3wwE4G3" }, "outputs": [], "source": [ "ds = tf.data.Dataset.from_tensor_slices((\n", "    dict(df),\n", "    target\n", "))\n", "\n", "ds = ds.batch(BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2YIpp2r0bv-6" }, "outputs": [], "source": [ "import pprint\n", "\n", "for x, y in ds.take(1):\n", "  pprint.pprint(x)\n", "  print()\n", "  print(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NMT-AevGFmdu" }, "outputs": [], "source": [ "history = model.fit(ds, epochs=5)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "pandas_dataframe.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 0 }