{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "zwBCE43Cv3PH" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fOad0I2cv569" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "YQB7yiF6v9GR" }, "source": [ "# Load a pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": { "id": "Oqa952X4wQKK" }, "source": [ "View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "UmyEaf4Awl2v" }, "source": [ "This tutorial provides examples of how to load pandas DataFrames into TensorFlow.\n", "\n", "You will use a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. The CSV file contains several hundred rows. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which is a binary classification task." ] }, { "cell_type": "markdown", "metadata": { "id": "iiyC7HkqxlUD" }, "source": [ "## Reading data using pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5IoRbCA2n0_V" }, "outputs": [], "source": [ "import pandas as pd\n", "import tensorflow as tf\n", "\n", "SHUFFLE_BUFFER = 500\n", "BATCH_SIZE = 2" ] }, { "cell_type": "markdown", "metadata": { "id": "-2kBGy_pxn47" }, "source": [ "Download the CSV file containing the heart disease dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VS4w2LePn9g3" }, "outputs": [], "source": [ "csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "6BXRPD2-xtQ1" }, "source": [ "Read the CSV file using pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UEfJ8TcMpe-2" }, "outputs": [], "source": [ "df = pd.read_csv(csv_file)" ] }, { "cell_type": "markdown", "metadata": { "id": "4K873P-Pp8c7" }, "source": [ "This is what the data looks like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8FkK6QIRpjd4" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_MOAKz654CT5" }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": { "id": "jVyGjKvnqGlb" }, "source": [ "You will build models to predict the label contained in the `target` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2wwhILm1ycSp" }, "outputs": [], "source": [ "target = df.pop('target')" ] }, { "cell_type": "markdown", "metadata": { "id": "vFGv9fgjDeao" }, "source": [ "## Create and train a model" ] }, { "cell_type": 
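"markdown", "metadata": {}, "source": [ "The `target` column holds a binary label. As a quick sketch added to this tutorial (not part of the original flow; `Series.value_counts` is a standard pandas method), you can inspect the class balance before training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count how many examples belong to each class.\n", "target.value_counts()" ] }, { "cell_type": 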
"markdown", "metadata": { "id": "xNxJ41MafiB-" }, "source": [ "If your data has a uniform datatype, or `dtype`, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the `pandas.DataFrame` class supports the `__array__` protocol, and TensorFlow's `tf.convert_to_tensor` function accepts objects that support the protocol.\n", "\n", "Take the numeric features from the dataset (skip the categorical features for now):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b9VlFGAie3K0" }, "outputs": [], "source": [ "numeric_feature_names = ['age', 'thalach', 'trestbps', 'chol', 'oldpeak']\n", "numeric_features = df[numeric_feature_names]\n", "numeric_features.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "Xe1CMRvSpR_R" }, "source": [ "The DataFrame can be converted to a NumPy array using the `DataFrame.values` attribute or `numpy.array(df)`. To convert it to a tensor, use `tf.convert_to_tensor`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OVv6Nwc9oDBU" }, "outputs": [], "source": [ "tf.convert_to_tensor(numeric_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "7iRYvoTrr1_G" }, "source": [ "In general, if an object can be converted to a tensor with `tf.convert_to_tensor`, it can be passed anywhere you can pass a `tf.Tensor`." ] }, { "cell_type": "markdown", "metadata": { "id": "RVF7_Z-Mp-qD" }, "source": [ "### With Model.fit" ] }, { "cell_type": "markdown", "metadata": { "id": "Vqkc9gIapQNu" }, "source": [ "A DataFrame, interpreted as a single tensor, can be used directly as an argument to the `Model.fit` method.\n", "\n", "Below is an example of training a model on the numeric features of the dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "u8M3oYHZgH_t" }, "source": [ "The first step is to normalize the input ranges. Use a `tf.keras.layers.Normalization` layer for that.\n", "\n", "To set the layer's mean and standard deviation before running it, be sure to call the `Normalization.adapt` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "88XTmyEdgkJn" }, "outputs": [], "source": [ "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(numeric_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "_D7JqUtnYCnb" }, "source": [ "Call the layer on the first three rows of the DataFrame to visualize a sample of its output:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jOwzIG-DhB0y" }, "outputs": [], "source": [ "normalizer(numeric_features.iloc[:3])" ] }, { 
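"cell_type": "markdown", "metadata": {}, "source": [ "As a quick check (a sketch added to this tutorial, not part of the original flow), the adapted layer's output over the full dataset should have roughly zero mean and unit standard deviation per feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The adapted Normalization layer standardizes each column.\n", "normalized = normalizer(numeric_features)\n", "print(tf.reduce_mean(normalized, axis=0))     # close to 0 for every feature\n", "print(tf.math.reduce_std(normalized, axis=0)) # close to 1 for every feature" ] }, { 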
"cell_type": "markdown", "metadata": { "id": "KWKcuVZJh-HY" }, "source": [ "Use the normalization layer as the first layer of a simple model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lu-bni-nh6mX" }, "outputs": [], "source": [ "def get_basic_model():\n", "  model = tf.keras.Sequential([\n", "    normalizer,\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(1)\n", "  ])\n", "\n", "  model.compile(optimizer='adam',\n", "                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "                metrics=['accuracy'])\n", "  return model" ] }, { "cell_type": "markdown", "metadata": { "id": "ntGi6ngYitob" }, "source": [ "When you pass the DataFrame as the `x` argument to `Model.fit`, Keras treats the DataFrame as it would a NumPy array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XMjM-eddiNNT" }, "outputs": [], "source": [ "model = get_basic_model()\n", "model.fit(numeric_features, target, epochs=15, batch_size=BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "EjtQbsRPEoJT" }, "source": [ "### With tf.data" ] }, { "cell_type": "markdown", "metadata": { "id": "nSjV5gy3EsVv" }, "source": [ "If you want to apply `tf.data` transformations to a DataFrame of a uniform `dtype`, the `Dataset.from_tensor_slices` method will create a dataset that iterates over the rows of the DataFrame. Each row is initially a vector of values. To train a model, you need `(inputs, labels)` pairs, so pass `(features, labels)` and `Dataset.from_tensor_slices` will return the needed pairs of slices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FCphpgdRGikx" }, "outputs": [], "source": [ "numeric_dataset = tf.data.Dataset.from_tensor_slices((numeric_features, target))\n", "\n", "for row in numeric_dataset.take(3):\n", "  print(row)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lStkN86gEkCe" }, "outputs": [], "source": [ "numeric_batches = numeric_dataset.shuffle(1000).batch(BATCH_SIZE)\n", "\n", "model = get_basic_model()\n", "model.fit(numeric_batches, epochs=15)" ] }, { "cell_type": "markdown", "metadata": { "id": "NRASs9IIESWQ" }, "source": [ "## A DataFrame as a dictionary" ] 
}, { "cell_type": "markdown", "metadata": { "id": "NQcp7kiPF8TP" }, "source": [ "When you start dealing with heterogeneous data, it is no longer possible to treat the DataFrame as if it were a single array. TensorFlow tensors require that all elements have the same `dtype`.\n", "\n", "So, in this case, you need to start treating it as a dictionary of columns, where each column has a uniform `dtype`. A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict. Many important TensorFlow APIs support (nested-)dictionaries of arrays as inputs." ] }, { "cell_type": "markdown", "metadata": { "id": "9y5UMKL8bury" }, "source": [ "`tf.data` input pipelines handle this quite well. All `tf.data` operations handle dictionaries and tuples automatically. So, to make a dataset of dictionary-examples from a DataFrame, just cast it to a dict before slicing it with `Dataset.from_tensor_slices`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U3QDo-jwHYXc" }, "outputs": [], "source": [ "numeric_dict_ds = tf.data.Dataset.from_tensor_slices((dict(numeric_features), target))" ] }, { "cell_type": "markdown", "metadata": { "id": "yyEERK9ldIi_" }, "source": [ "Here are the first three examples from that dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q0tDwk0VdH6D" }, "outputs": [], "source": [ "for row in numeric_dict_ds.take(3):\n", "  print(row)" ] }, { "cell_type": "markdown", "metadata": { "id": "DEAM6HAFxlMy" }, "source": [ "### Dictionaries with Keras" ] }, { "cell_type": "markdown", "metadata": { "id": "dnoyoWLWx07i" }, "source": [ "Typically, Keras models and layers expect a single input tensor, but these classes can accept and return nested structures of dictionaries, tuples and tensors. These structures are known as \"nests\" (refer to the `tf.nest` module for details).\n", "\n", "There are two equivalent ways you can write a Keras model that accepts a dictionary as input." ] }, { "cell_type": "markdown", "metadata": { "id": "5xUTrm0apDTr" }, "source": [ "#### 1. 
The Model-subclass style\n", "\n", "You write a subclass of `tf.keras.Model` (or `tf.keras.Layer`). You directly handle the inputs and create the outputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Zc3HV99CFRWL" }, "outputs": [], "source": [ "def stack_dict(inputs, fun=tf.stack):\n", "  values = []\n", "  for key in sorted(inputs.keys()):\n", "    values.append(tf.cast(inputs[key], tf.float32))\n", "\n", "  return fun(values, axis=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Rz4Cg6WpzNzi" }, "outputs": [], "source": [ "#@title\n", "class MyModel(tf.keras.Model):\n", "  def __init__(self):\n", "    # Create all the internal layers in init.\n", "    super().__init__()\n", "\n", "    self.normalizer = tf.keras.layers.Normalization(axis=-1)\n", "\n", "    self.seq = tf.keras.Sequential([\n", "      self.normalizer,\n", "      tf.keras.layers.Dense(10, activation='relu'),\n", "      tf.keras.layers.Dense(10, activation='relu'),\n", "      tf.keras.layers.Dense(1)\n", "    ])\n", "\n", "  def adapt(self, inputs):\n", "    # Stack the inputs and `adapt` the normalization layer.\n", "    inputs = stack_dict(inputs)\n", "    self.normalizer.adapt(inputs)\n", "\n", "  def call(self, inputs):\n", "    # Stack the inputs\n", "    inputs = stack_dict(inputs)\n", "    # Run them through all the layers.\n", "    result = self.seq(inputs)\n", "\n", "    return result\n", "\n", "model = MyModel()\n", "\n", "model.adapt(dict(numeric_features))\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'],\n", "              run_eagerly=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "hMLXNEDF_tu2" }, "source": [ "This model can accept either a dictionary of columns or a dataset of dictionary-elements for training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v3xEjtHY8gZG" }, "outputs": [], "source": [ "model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "73wgiTaVAA2F" }, "outputs": [], "source": [ "numeric_dict_batches = 
numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)\n", "model.fit(numeric_dict_batches, epochs=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "-xDB3HLZGzAb" }, "source": [ "Here are the predictions for the first three examples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xtolTQA-GpBW" }, "outputs": [], "source": [ "model.predict(dict(numeric_features.iloc[:3]))" ] }, { "cell_type": "markdown", "metadata": { "id": "QIIdxIYm13Ik" }, "source": [ "#### 2. The Keras functional style" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DG_bmO0sS_G5" }, "outputs": [], "source": [ "inputs = {}\n", "for name, column in numeric_features.items():\n", "  inputs[name] = tf.keras.Input(\n", "      shape=(1,), name=name, dtype=tf.float32)\n", "\n", "inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9iXU9oem12dL" }, "outputs": [], "source": [ "x = stack_dict(inputs, fun=tf.concat)\n", "\n", "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(stack_dict(dict(numeric_features)))\n", "\n", "x = normalizer(x)\n", "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", "x = tf.keras.layers.Dense(1)(x)\n", "\n", "model = tf.keras.Model(inputs, x)\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'],\n", "              run_eagerly=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xrAxmuJrEwnf" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(model, rankdir=\"LR\", show_shapes=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "UYtoAOIzCFY1" }, "source": [ "You can train the functional model the same way as the model subclass:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yAwjPq7I_ehX" }, "outputs": [], "source": [ "model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "brwodxxVApO_" }, 
"outputs": [], "source": [ "numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)\n", "model.fit(numeric_dict_batches, epochs=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "xhn0Bt_Xw4nO" }, "source": [ "## Full example" ] }, { "cell_type": "markdown", "metadata": { "id": "zYQ5fDaRxRWQ" }, "source": [ "If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model. [Keras preprocessing layers](https://tensorflow.google.cn/guide/keras/preprocessing_layers) cover many common tasks." ] }, { "cell_type": "markdown", "metadata": { "id": "BFsDZeu-BQ-h" }, "source": [ "### Build the preprocessing head" ] }, { "cell_type": "markdown", "metadata": { "id": "C6aVQN4Gw-Va" }, "source": [ "In this dataset some of the \"integer\" features in the raw data are actually categorical indices. These indices are not really ordered numeric values (refer to the dataset description for details). Because these are unordered they are inappropriate to feed directly to the model; the model would interpret them as being ordered. To use these inputs you'll need to encode them, either as one-hot vectors or embedding vectors. The same applies to string-categorical features.\n", "\n", "Note: If you have many features that need identical preprocessing, it's more efficient to concatenate them together before applying the preprocessing.\n", "\n", "Binary features, on the other hand, do not generally need to be encoded or normalized.\n", "\n", "Start by creating a list of the features that fall into each group:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IH2VCyLBPYX8" }, "outputs": [], "source": [ "binary_feature_names = ['sex', 'fbs', 'exang']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pxh4FPucOpDz" }, "outputs": [], "source": [ "categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']" ] }, { "cell_type": "markdown", "metadata": { "id": "HRcC8WkyamJb" }, "source": [ "The next step is to build a preprocessing model that will apply appropriate preprocessing to each input and concatenate the results.\n", "\n", "This section uses the [Keras Functional API](https://tensorflow.google.cn/guide/keras/functional) to implement the preprocessing. You start by creating one `tf.keras.Input` for each column of the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D3OeiteJbWvI" }, "outputs": [], "source": [ "inputs = {}\n", "for name, column in df.items():\n", "  if type(column[0]) == str:\n", "    dtype = tf.string\n", "  elif (name in categorical_feature_names or\n", "        name in binary_feature_names):\n", "    dtype = tf.int64\n", "  else:\n", "    dtype = tf.float32\n", "\n", "  inputs[name] = 
tf.keras.Input(shape=(), name=name, dtype=dtype)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5N3vBMjidpx6" }, "outputs": [], "source": [ "inputs" ] }, { "cell_type": "markdown", "metadata": { "id": "_EEmzxinyhI4" }, "source": [ "For each input you'll apply some transformations using Keras layers and TensorFlow ops. Each feature starts as a batch of scalars (`shape=(batch,)`). The output for each should be a batch of `tf.float32` vectors (`shape=(batch, n)`). The last step will concatenate all those vectors together.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ubBDazjNFWiF" }, "source": [ "#### Binary inputs\n", "\n", "Because the binary inputs don't need any preprocessing, just add the vector axis, cast them to `float32` and add them to the list of preprocessed inputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tmAIkOIid-Mp" }, "outputs": [], "source": [ "preprocessed = []\n", "\n", "for name in binary_feature_names:\n", "  inp = inputs[name]\n", "  inp = inp[:, tf.newaxis]\n", "  float_value = tf.cast(inp, tf.float32)\n", "  preprocessed.append(float_value)\n", "\n", "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "ZHQcdtG1GN7E" }, "source": [ "#### Numeric inputs\n", "\n", "Like in the earlier section, you'll want to run these numeric inputs through a `tf.keras.layers.Normalization` layer before using them. The difference is that this time they're input as a dict. The code below collects the numeric features from the DataFrame, stacks them together and passes those to the `Normalization.adapt` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UC9LaIBNIK5V" }, "outputs": [], "source": [ "normalizer = tf.keras.layers.Normalization(axis=-1)\n", "normalizer.adapt(stack_dict(dict(numeric_features)))" ] }, { "cell_type": "markdown", "metadata": { "id": "S537tideIpeh" }, "source": [ "This code stacks the numeric features and runs them through the normalization layer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U8MJiFpPK5uD" }, "outputs": [], "source": [ "numeric_inputs = {}\n", "for name in numeric_feature_names:\n", "  numeric_inputs[name] = inputs[name]\n", "\n", "numeric_inputs = stack_dict(numeric_inputs)\n", "numeric_normalized = normalizer(numeric_inputs)\n", "\n", "preprocessed.append(numeric_normalized)\n", "\n", "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "G5f-VzASKPF7" }, "source": [ "#### Categorical features" 
] }, { "cell_type": "markdown", "metadata": { "id": "Z3wcFs1oKVao" }, "source": [ "To use categorical features you'll first need to encode them into either binary vectors or embeddings. Because these features only contain a small number of categories, convert the inputs directly to one-hot vectors using the `output_mode='one_hot'` option, supported by both the `tf.keras.layers.StringLookup` and `tf.keras.layers.IntegerLookup` layers.\n", "\n", "Here is an example of how these layers work:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vXleJfBRS9xr" }, "outputs": [], "source": [ "vocab = ['a','b','c']\n", "lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')\n", "lookup(['c','a','a','b','zzz'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kRnsFYJiSVmH" }, "outputs": [], "source": [ "vocab = [1,4,7,99]\n", "lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')\n", "\n", "lookup([-1,4,1])" ] }, { "cell_type": "markdown", "metadata": { "id": "est6aCFBZDVs" }, "source": [ "To determine the vocabulary for each input, create a layer to convert that vocabulary to a one-hot vector:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HELhoFlo0H9Q" }, "outputs": [], "source": [ "for name in categorical_feature_names:\n", "  vocab = sorted(set(df[name]))\n", "  print(f'name: {name}')\n", "  print(f'vocab: {vocab}\\n')\n", "\n", "  if type(vocab[0]) is str:\n", "    lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')\n", "  else:\n", "    lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')\n", "\n", "  x = inputs[name][:, tf.newaxis]\n", "  x = lookup(x)\n", "  preprocessed.append(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "PzMMkwNBa2pK" }, "source": [ "#### Assemble the preprocessing head" ] }, { "cell_type": "markdown", "metadata": { "id": "GaQ-_pEQbCE8" }, "source": [ "At this point `preprocessed` is just a Python list of all the preprocessing results, each with a shape of `(batch_size, depth)`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LlLaq_BVRlnO" }, "outputs": [], "source": [ "preprocessed" ] }, { "cell_type": "markdown", "metadata": { "id": "U9lYYHIXbYv-" }, "source": [ "Concatenate all the preprocessed features along the `depth` axis, so each dictionary-example is converted into a single vector. The vector contains binary features, numeric features, and categorical one-hot features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j2I8vpQh313w" }, "outputs": [], "source": [ "preprocessed_result = tf.concat(preprocessed, axis=-1)\n", "preprocessed_result" ] }, { "cell_type": "markdown", "metadata": { "id": "OBFowyJtb0WB" }, "source": [ "Now create a model out of that calculation so it can be reused:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rHQBFHwE37TO" }, "outputs": [], "source": [ "preprocessor = tf.keras.Model(inputs, preprocessed_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ViMARQ-f6zfx" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(preprocessor, rankdir=\"LR\", show_shapes=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "IURRtL_WZbht" }, "source": [ "To test the preprocessor, use the `DataFrame.iloc` accessor to slice out the first example from the DataFrame. Then convert it to a dictionary and pass the dictionary to the preprocessor. The result is a single vector containing the binary features, normalized numeric features and the one-hot categorical features, in that order:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QjBzCKsZUj0y" }, "outputs": [], "source": [ "preprocessor(dict(df.iloc[:1]))" ] }, { "cell_type": "markdown", "metadata": { "id": "bB9C0XJkyQEk" }, "source": [ "### Create and train a model" ] }, { "cell_type": "markdown", "metadata": { "id": "WfU_FFXMbKGM" }, "source": [ "Now build the main body of the model. Use the same configuration as in the previous example: a couple of `Dense` rectified-linear layers and a `Dense(1)` output layer for the classification." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "75OxXTnfboKN" }, "outputs": [], "source": [ "body = tf.keras.Sequential([\n", "  tf.keras.layers.Dense(10, activation='relu'),\n", "  tf.keras.layers.Dense(10, activation='relu'),\n", "  tf.keras.layers.Dense(1)\n", "])" ] }, { "cell_type": "markdown", "metadata": { "id": "MpD6WNX5_zh5" }, "source": [ "Now put the two pieces together using the Keras functional API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_TY_BuVMbNcB" }, "outputs": [], "source": [ "inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iin2kvA9bDpz" }, "outputs": [], "source": [ "x = 
preprocessor(inputs)\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FQd9PcPRpkP4" }, "outputs": [], "source": [ "result = body(x)\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v_KerrXabhgP" }, "outputs": [], "source": [ "model = tf.keras.Model(inputs, result)\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "S1MR-XD9kC6C" }, "source": [ "This model expects a dictionary of inputs. The simplest way to pass it the data is to convert the DataFrame to a dict and pass that dict as the `x` argument to `Model.fit`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ybDzNUheqxJw" }, "outputs": [], "source": [ "history = model.fit(dict(df), target, epochs=5, batch_size=BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "dacoEIB_BSsL" }, "source": [ "Using `tf.data` works as well:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rYadV3wwE4G3" }, "outputs": [], "source": [ "ds = tf.data.Dataset.from_tensor_slices((\n", "    dict(df),\n", "    target\n", "))\n", "\n", "ds = ds.batch(BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2YIpp2r0bv-6" }, "outputs": [], "source": [ "import pprint\n", "\n", "for x, y in ds.take(1):\n", "  pprint.pprint(x)\n", "  print()\n", "  print(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NMT-AevGFmdu" }, "outputs": [], "source": [ "history = model.fit(ds, epochs=5)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "pandas_dataframe.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 0 }