{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "rNdWfPXCjTjY" }, "outputs": [], "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "I1dUQ0GejU8N", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "c05P9g5WjizZ" }, "source": [ "# 对结构化数据进行分类" ] }, { "cell_type": "markdown", "metadata": { "id": "zofH_gCzgplN" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看\n", " 在 Google Colab 运行\n", " 在 Github 上查看源代码 下载笔记本\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "K1y4OHpGgss7" }, "source": [ "> 警告:不推荐为新代码使用本教程中介绍的 `tf.feature_columns` 模块。[Keras 预处理层](https://tensorflow.google.cn/tutorials/structured_data/preprocessing_layers)介绍了此功能,有关迁移说明,请参阅[迁移特征列](../../guide/migrate/migrating_feature_columns.ipynb)指南。`tf.feature_columns` 模块旨在与 TF1 `Estimators` 结合使用。它不在我们的[兼容性保证](https://tensorflow.org/guide/versions)范围内,除了安全漏洞修正外,不会获得其他修正。\n", "\n", "我们将使用一个小型 [数据集](https://archive.ics.uci.edu/ml/datasets/heart+Disease),该数据集由克利夫兰心脏病诊所基金会(Cleveland Clinic Foundation for Heart Disease)提供。CSV 中有几百行数据。每行描述了一个病人(patient),每列描述了一个属性(attribute)。我们将使用这些信息来预测一位病人是否患有心脏病,这是在该数据集上的二分类任务。\n", "\n", "- 用 [Pandas](https://pandas.pydata.org/) 导入 CSV 文件。\n", "- 用 [tf.data](https://tensorflow.google.cn/guide/datasets) 建立了一个输入流水线(pipeline),用于对行进行分批(batch)和随机排序(shuffle)。\n", "- 用特征列将 CSV 中的列映射到用于训练模型的特征。\n", "- 用 Keras 构建,训练并评估模型。\n", "\n", "## 数据集\n", "\n", "下面是该数据集的[描述](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names)。 请注意,有数值(numeric)和类别(categorical)类型的列。\n", "\n", "Following is a description of this dataset. Notice there are both numeric and categorical columns. There is a free text column which we will not use in this tutorial.\n", "\n", "列 | 描述 | 特征类型 | 数据类型\n", "--- | --- | --- | ---\n", "Type | 动物类型(狗、猫) | 分类 | 字符串\n", "Age | 宠物年龄 | 数值 | 整数\n", "Breed1 | 宠物的主要品种 | 分类 | 字符串\n", "Color1 | 宠物的颜色 1 | 分类 | 字符串\n", "Color2 | 宠物的颜色 2 | 分类 | 字符串\n", "MaturitySize | 成年个体大小 | 分类 | 字符串\n", "FurLength | 毛发长度 | 分类 | 字符串\n", "Vaccinated | 宠物已接种疫苗 | 分类 | 字符串\n", "Sterilized | 宠物已绝育 | 分类 | 字符串\n", "Health | 健康状况 | 分类 | 字符串\n", "Fee | 领养费 | 数值 | 整数\n", "Description | 关于此宠物的简介 | 文本 | 字符串\n", "PhotoAmt | 为该宠物上传的照片总数 | 数值 | 整数\n", "AdoptionSpeed | 领养速度 | 分类 | 整数" ] }, { "cell_type": "markdown", "metadata": { "id": "VxyBFc_kKazA" }, "source": [ "## 导入 TensorFlow 和其他库" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LuOWVJBz8a6G", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "!pip install sklearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9dEreb4QKizj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import tensorflow as tf\n", "\n", "from tensorflow import feature_column\n", "from tensorflow.keras import layers\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "id": "KCEhSZcULZ9n" }, "source": [ "## 使用 Pandas 创建一个 dataframe\n", "\n", "[Pandas](https://pandas.pydata.org/) 是一个 Python 库,它有许多有用的实用程序,用于加载和处理结构化数据。我们将使用 Pandas 从 URL下载数据集,并将其加载到 dataframe 中。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "REZ57BXCLdfG", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import pathlib\n", "\n", "dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'\n", "csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'\n", "\n", "tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,\n", " extract=True, cache_dir='.')\n", "dataframe = pd.read_csv(csv_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "o8QIi0_jT5LM", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "dataframe.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "awGiBeBWbQC8" }, "source": [ "## 创建目标变量\n", "\n", "原始数据集中的任务是预测宠物被领养的速度(例如,在第一周、第一个月、前三个月等)。我们针对教程进行一下简化。在这里,我们将把它转化为一个二元分类问题,并简单地预测宠物是否被领养。\n", 
"\n", "修改标签列后,0 表示宠物未被领养,1 表示宠物已被领养。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xcbTpEXWbMDz", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# In the original dataset \"4\" indicates the pet was not adopted.\n", "dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)\n", "\n", "# Drop un-used columns.\n", "dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])" ] }, { "cell_type": "markdown", "metadata": { "id": "u0zhLtQqMPem" }, "source": [ "## 将 dataframe 拆分为训练、验证和测试集\n", "\n", "我们下载的数据集是一个 CSV 文件。 我们将其拆分为训练、验证和测试集。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YEOpw7LhMYsI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train, test = train_test_split(dataframe, test_size=0.2)\n", "train, val = train_test_split(train, test_size=0.2)\n", "print(len(train), 'train examples')\n", "print(len(val), 'validation examples')\n", "print(len(test), 'test examples')" ] }, { "cell_type": "markdown", "metadata": { "id": "84ef46LXMfvu" }, "source": [ "## 用 tf.data 创建输入流水线\n", "\n", "接下来,我们将使用 [tf.data](https://tensorflow.google.cn/guide/datasets) 包装 dataframe。这让我们能将特征列作为一座桥梁,该桥梁将 Pandas dataframe 中的列映射到用于训练模型的特征。如果我们使用一个非常大的 CSV 文件(非常大以至于它不能放入内存),我们将使用 tf.data 直接从磁盘读取它。本教程不涉及这一点。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NkcaMYP-MsRe", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# A utility method to create a tf.data dataset from a Pandas Dataframe\n", "def df_to_dataset(dataframe, shuffle=True, batch_size=32):\n", " dataframe = dataframe.copy()\n", " labels = dataframe.pop('target')\n", " ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))\n", " if shuffle:\n", " ds = ds.shuffle(buffer_size=len(dataframe))\n", " ds = ds.batch(batch_size)\n", " return ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CXbbXkJvMy34", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "batch_size = 5 # A small batch sized is used for demonstration purposes\n", "train_ds = df_to_dataset(train, batch_size=batch_size)\n", "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "qRLGSMDzM-dl" }, "source": [ "## 理解输入流水线\n", "\n", "现在我们已经创建了输入流水线,让我们调用它来查看它返回的数据的格式。 我们使用了一小批量大小来保持输出的可读性。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CSBo3dUVNFc9", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for feature_batch, label_batch in train_ds.take(1):\n", " print('Every feature:', list(feature_batch.keys()))\n", " print('A batch of ages:', feature_batch['Age'])\n", " print('A batch of targets:', label_batch )" ] }, { "cell_type": "markdown", "metadata": { "id": "OT5N6Se-NQsC" }, "source": [ "我们可以看到数据集返回了一个字典,该字典从列名称(来自 dataframe)映射到 dataframe 中行的列值。" ] }, { "cell_type": "markdown", "metadata": { "id": "ttIvgLRaNoOQ" }, "source": [ "## 演示几种特征列\n", "\n", "TensorFlow 提供了多种特征列。本节中,我们将创建几类特征列,并演示特征列如何转换 dataframe 中的列。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mxwiHFHuNhmf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# We will use this batch to demonstrate several types of feature columns\n", "example_batch = next(iter(train_ds))[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0wfLB8Q3N3UH", "vscode": { "languageId": "python" } 
}, "outputs": [], "source": [ "# A utility method to create a feature column\n", "# and to transform a batch of data\n", "def demo(feature_column):\n", " feature_layer = layers.DenseFeatures(feature_column)\n", " print(feature_layer(example_batch).numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "Q7OEKe82N-Qb" }, "source": [ "### 数值列\n", "\n", "一个特征列的输出将成为模型的输入(使用上面定义的 demo 函数,我们将能准确地看到 dataframe 中的每列的转换方式)。 [数值列(numeric column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/numeric_column) 是最简单的列类型。它用于表示实数特征。使用此列时,模型将从 dataframe 中接收未更改的列值。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QZTZ0HnHOCxC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "photo_count = feature_column.numeric_column('PhotoAmt')\n", "demo(photo_count)" ] }, { "cell_type": "markdown", "metadata": { "id": "7a6ddSyzOKpq" }, "source": [ "在这个心脏病数据集中,dataframe 中的大多数列都是数值列。" ] }, { "cell_type": "markdown", "metadata": { "id": "IcSxUoYgOlA1" }, "source": [ "### 分桶列\n", "\n", "通常,您不希望将数字直接输入模型,而是根据数值范围将其值分成不同的类别。考虑代表一个人年龄的原始数据。我们可以用 [分桶列(bucketized column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/bucketized_column)将年龄分成几个分桶(buckets),而不是将年龄表示成数值列。请注意下面的 one-hot 数值表示每行匹配的年龄范围。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wJ4Wt3SAOpTQ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "age = feature_column.numeric_column('Age')\n", "age_buckets = feature_column.bucketized_column(age, boundaries=[1, 3, 5])\n", "demo(age_buckets)" ] }, { "cell_type": "markdown", "metadata": { "id": "r1tArzewPb-b" }, "source": [ "### 分类列\n", "\n", "在此数据集中,thal 用字符串表示(如 'fixed','normal',或 'reversible')。我们无法直接将字符串提供给模型。相反,我们必须首先将它们映射到数值。分类词汇列(categorical vocabulary columns)提供了一种用 one-hot 向量表示字符串的方法(就像您在上面看到的年龄分桶一样)。词汇表可以用 [categorical_column_with_vocabulary_list](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list) 作为 list 传递,或者用 [categorical_column_with_vocabulary_file](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file) 从文件中加载。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DJ6QnSHkPtOC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "animal_type = feature_column.categorical_column_with_vocabulary_list(\n", " 'Type', ['Cat', 'Dog'])\n", "\n", "animal_type_one_hot = feature_column.indicator_column(animal_type)\n", "demo(animal_type_one_hot)" ] }, { "cell_type": "markdown", "metadata": { "id": "LEFPjUr6QmwS" }, "source": [ "### 嵌入列\n", "\n", "假设我们不是只有几个可能的字符串,而是每个类别有数千(或更多)值。 由于多种原因,随着类别数量的增加,使用 one-hot 编码训练神经网络变得不可行。我们可以使用嵌入列来克服此限制。[嵌入列(embedding column)](https://tensorflow.google.cn/api_docs/python/tf/feature_column/embedding_column)将数据表示为一个低维度密集向量,而非多维的 one-hot 向量,该低维度密集向量可以包含任何数,而不仅仅是 0 或 1。嵌入的大小(在下面的示例中为 8)是必须调整的参数。\n", "\n", "关键点:当分类列具有许多可能的值时,最好使用嵌入列。我们在这里使用嵌入列用于演示目的,为此您有一个完整的示例,以在将来可以修改用于其他数据集。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hSlohmr2Q_UU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# Notice the input to the embedding column is the categorical column\n", "# we previously created\n", "breed1 = feature_column.categorical_column_with_vocabulary_list(\n", " 'Breed1', dataframe.Breed1.unique())\n", "breed1_embedding = feature_column.embedding_column(breed1, dimension=8)\n", "demo(breed1_embedding)" ] }, { "cell_type": "markdown", "metadata": { "id": "urFCAvTVRMpB" }, "source": [ "### 
{ "cell_type": "markdown", "metadata": { "id": "urFCAvTVRMpB" }, "source": [ "### Hashed feature columns\n", "\n", "Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://tensorflow.google.cn/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode the string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space.\n", "\n", "Key point: an important downside of this technique is that there may be collisions, in which different strings are mapped to the same bucket. In practice, it can nevertheless work well for some datasets." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "YHU_Aj2nRRDC", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "breed1_hashed = feature_column.categorical_column_with_hash_bucket(\n", "    'Breed1', hash_bucket_size=10)\n", "demo(feature_column.indicator_column(breed1_hashed))" ] },
{ "cell_type": "markdown", "metadata": { "id": "fB94M27DRXtZ" }, "source": [ "### Crossed feature columns\n", "\n", "Combining features into a single feature, better known as a [feature cross](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of Age and Type. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a `hashed_column`, so you can choose how large the table is." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "oaPVERd9Rep6", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "crossed_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=10)\n", "demo(feature_column.indicator_column(crossed_feature))" ] },
{ "cell_type": "markdown", "metadata": { "id": "ypkI9zx6Rj1q" }, "source": [ "## Choose which columns to use\n", "\n", "We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (i.e., the mechanics) needed to work with feature columns. We have selected a few columns below, somewhat arbitrarily, to train our model, and we will run a quick sanity check after defining them.\n", "\n", "Key point: if your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "4PlLY7fORuzA", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "feature_columns = []\n", "\n", "# numeric cols\n", "for header in ['PhotoAmt', 'Fee', 'Age']:\n", "  feature_columns.append(feature_column.numeric_column(header))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "jdF4rXkcDmBl", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# bucketized cols\n", "age = feature_column.numeric_column('Age')\n", "age_buckets = feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])\n", "feature_columns.append(age_buckets)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "RsteO7FGDmNc", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# indicator_columns\n", "indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',\n", "                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']\n", "for col_name in indicator_column_names:\n", "  categorical_column = feature_column.categorical_column_with_vocabulary_list(\n", "      col_name, dataframe[col_name].unique())\n", "  indicator_column = feature_column.indicator_column(categorical_column)\n", "  feature_columns.append(indicator_column)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "6MhdqQ5uDmYU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# embedding columns\n", "breed1 = feature_column.categorical_column_with_vocabulary_list(\n", "    'Breed1', dataframe.Breed1.unique())\n", "breed1_embedding = feature_column.embedding_column(breed1, dimension=8)\n", "feature_columns.append(breed1_embedding)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "qkzRNfCLDsQf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# crossed columns\n", "age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)\n", "feature_columns.append(feature_column.indicator_column(age_type_feature))" ] },
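{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (an addition, reusing `example_batch` from earlier), the sketch below confirms that the selected columns concatenate into a single dense vector per example, and prints its width." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# A quick sanity check (sketch): DenseFeatures concatenates every selected\n", "# feature column into one dense vector per example.\n", "all_features = layers.DenseFeatures(feature_columns)\n", "print('Combined feature vector width:', all_features(example_batch).shape[-1])" ] },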
{ "cell_type": "markdown", "metadata": { "id": "M-nDp8krS_ts" }, "source": [ "### Create a feature layer\n", "\n", "Now that we have defined our feature columns, we will use a [DenseFeatures](https://tensorflow.google.cn/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to input them to our Keras model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "6o-El1R2TGQP", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "feature_layer = tf.keras.layers.DenseFeatures(feature_columns)" ] },
{ "cell_type": "markdown", "metadata": { "id": "8cf6vKfgTH0U" }, "source": [ "Earlier, we used a small batch size to demonstrate how feature columns work. We will now create a new input pipeline with a larger batch size." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "gcemszoGSse_", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "batch_size = 32\n", "train_ds = df_to_dataset(train, batch_size=batch_size)\n", "val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)\n", "test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)" ] },
{ "cell_type": "markdown", "metadata": { "id": "bBx4Xu0eTXWq" }, "source": [ "## Create, compile, and train the model" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "_YJPPb3xTPeZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "model = tf.keras.Sequential([\n", "  feature_layer,\n", "  layers.Dense(128, activation='relu'),\n", "  layers.Dense(128, activation='relu'),\n", "  layers.Dropout(.1),\n", "  layers.Dense(1)\n", "])\n", "\n", "model.compile(optimizer='adam',\n", "              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "              metrics=['accuracy'])\n", "\n", "model.fit(train_ds,\n", "          validation_data=val_ds,\n", "          epochs=10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "GnFmMOW0Tcaa", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "loss, accuracy = model.evaluate(test_ds)\n", "print(\"Accuracy\", accuracy)" ] },
{ "cell_type": "markdown", "metadata": { "id": "3bdfbq20V6zu" }, "source": [ "Key point: you will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future." ] },
{ "cell_type": "markdown", "metadata": { "id": "SotnhVWuHQCw" }, "source": [ "## Next steps\n", "\n", "The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented." ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "feature_columns.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }