{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "b518b04cbfe0" }, "outputs": [], "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "906e07f6e562", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "8bd329a4bbca" }, "source": [ "# Keras 中的遮盖和填充" ] }, { "cell_type": "markdown", "metadata": { "id": "8b208d0913b8" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 中运行在 GitHub 上查看源代码下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "8d4ac441b1fc" }, "source": [ "## 设置" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ec52be14e686", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "from tensorflow.keras import layers" ] }, { "cell_type": "markdown", "metadata": { "id": "e94d7a46bda8" }, "source": [ "## 简介\n", "\n", "**遮盖**的作用是告知序列处理层输入中有某些时间步骤丢失,因此在处理数据时应将其跳过。\n", "\n", "**填充**是遮盖的一种特殊形式,其中被遮盖的步骤位于序列的起点或开头。填充是出于将序列数据编码成连续批次的需要:为了使批次中的所有序列适合给定的标准长度,有必要填充或截断某些序列。\n", "\n", "让我们仔细看看。" ] }, { "cell_type": "markdown", "metadata": { "id": "ac6b121d6be0" }, "source": [ "## 填充序列数据\n", "\n", "在处理序列数据时,各个样本常常具有不同长度。请考虑以下示例(文本被切分为单词):\n", "\n", "```\n", "[\n", " [\"Hello\", \"world\", \"!\"],\n", " [\"How\", \"are\", \"you\", \"doing\", \"today\"],\n", " [\"The\", \"weather\", \"will\", \"be\", \"nice\", \"tomorrow\"],\n", "]\n", "```\n", "\n", "进行词汇查询后,数据可能会被向量化为整数,例如:\n", "\n", "```\n", "[\n", " [71, 1331, 4231]\n", " [73, 8, 3215, 55, 927],\n", " [83, 91, 1, 645, 1253, 927],\n", "]\n", "```\n", "\n", "此数据是一个嵌套列表,其中各个样本的长度分别为 3、5 和 6。由于深度学习模型的输入数据必须为单一张量(例如在此例中形状为 `(batch_size, 6, vocab_size)`),短于最长条目的样本需要用占位符值进行填充(或者,也可以在填充短样本前截断长样本)。\n", "\n", "Keras 提供了一个效用函数来截断和填充 Python 列表,使其具有相同长度:`tf.keras.preprocessing.sequence.pad_sequences`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bb64fb185a05", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "raw_inputs = [\n", " [711, 632, 71],\n", " [73, 8, 3215, 55, 927],\n", " [83, 91, 1, 645, 1253, 927],\n", "]\n", "\n", "# By default, this will pad using 0s; it is configurable via the\n", "# \"value\" parameter.\n", "# Note that you could \"pre\" padding (at the beginning) or\n", "# \"post\" padding (at the end).\n", "# We recommend using \"post\" padding when working with RNN layers\n", "# (in order to be able to use the\n", "# CuDNN implementation of the layers).\n", "padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(\n", " raw_inputs, padding=\"post\"\n", ")\n", "print(padded_inputs)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "03092b2da690" }, "source": [ "## 遮盖\n", "\n", "既然所有样本现在都具有了统一长度,那就必须告知模型,数据的某些部分实际上是填充,应该忽略。这种机制就是**遮盖**。\n", "\n", "在 Keras 模型中引入输入掩码有三种方式:\n", "\n", "- 添加一个 `keras.layers.Masking` 层。\n", "- 使用 `mask_zero=True` 配置一个 `keras.layers.Embedding` 层。\n", "- 在调用支持 `mask` 参数的层(如 RNN 层)时,手动传递此参数。" ] }, { "cell_type": "markdown", "metadata": { "id": "6103601e5fff" }, "source": [ "## 掩码生成层:`Embedding` 和 `Masking`\n", "\n", "这些层将在后台创建一个掩码张量(形状为 `(batch, sequence_length)` 的二维张量),并将其附加到由 `Masking` 或 `Embedding` 层返回的张量输出上。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b2363b293483", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)\n", "masked_output = embedding(padded_inputs)\n", "\n", "print(masked_output._keras_mask)\n", "\n", "masking_layer = layers.Masking()\n", "# Simulate the embedding lookup by expanding the 2D input to 3D,\n", "# with embedding dimension of 10.\n", "unmasked_embedding = tf.cast(\n", " tf.tile(tf.expand_dims(padded_inputs, axis=-1), [1, 1, 10]), tf.float32\n", ")\n", "\n", "masked_embedding = masking_layer(unmasked_embedding)\n", "print(masked_embedding._keras_mask)" ] }, { "cell_type": "markdown", "metadata": { "id": "17e4bdb563b2" }, "source": [ "您可以在输出结果中看到,该掩码是一个形状为 `(batch_size, sequence_length)` 的二维布尔张量,其中每个 `False` 条目表示对应的时间步骤应在处理时忽略。" ] }, { "cell_type": "markdown", "metadata": { "id": "cf11a0399fcf" }, "source": [ "## 函数式 API 和序列式 API 中的掩码传播\n", "\n", "在使用函数式 API 或序列式 API 时,由 `Embedding` 或 `Masking` 层生成的掩码将通过网络传播给任何能够使用它们的层(如 RNN 层)。Keras 将自动提取与输入相对应的掩码,并将其传递给任何知道该掩码使用方法的层。\n", "\n", "例如,在下面的序贯模型中,`LSTM` 层将自动接收掩码,这意味着它将忽略填充的值:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0adbecda288a", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "model = keras.Sequential(\n", " [layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True), layers.LSTM(32),]\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "a8ac6481a1d5" }, "source": [ "对以下函数式 API 的情况也是如此:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f374ab06743d", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "inputs = keras.Input(shape=(None,), dtype=\"int32\")\n", "x = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)\n", "outputs = layers.LSTM(32)(x)\n", "\n", "model = keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "2f2c4b96ecb5" }, "source": [ "## 将掩码张量直接传递给层" ] }, { "cell_type": "markdown", "metadata": { "id": "11dccb581014" }, "source": [ "能够处理掩码的层(如 `LSTM` 层)在其 `__call__` 方法中有一个 `mask` 参数。\n", "\n", "同时,生成掩码的层(如 `Embedding`)会公开一个 `compute_mask(input, previous_mask)` 方法,供您调用。\n", "\n", "因此,您可以将掩码生成层的 `compute_mask()` 方法的输出传递给掩码使用层的 `__call__` 方法,如下所示:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1955aa63896b", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class MyLayer(layers.Layer):\n", " def __init__(self, **kwargs):\n", " super(MyLayer, self).__init__(**kwargs)\n", " self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)\n", " self.lstm = layers.LSTM(32)\n", "\n", " def call(self, inputs):\n", " x = self.embedding(inputs)\n", " # Note that you could also prepare a `mask` tensor manually.\n", " # It only needs to be a boolean tensor\n", " # with the right shape, i.e. (batch_size, timesteps).\n", " mask = self.embedding.compute_mask(inputs)\n", " output = self.lstm(x, mask=mask) # The layer will ignore the masked values\n", " return output\n", "\n", "\n", "layer = MyLayer()\n", "x = np.random.random((32, 10)) * 100\n", "x = x.astype(\"int32\")\n", "layer(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "b04dd330f848" }, "source": [ "## 在自定义层中支持遮盖" ] }, { "cell_type": "markdown", "metadata": { "id": "8451a1a8ff27" }, "source": [ "有时,您可能需要编写生成掩码的层(如 `Embedding`),或者需要修改当前掩码的层。\n", "\n", "例如,任何生成与其输入具有不同时间维度的张量的层(如在时间维度上进行连接的 `Concatenate` 层)都需要修改当前掩码,这样下游层才能正确顾及被遮盖的时间步骤。\n", "\n", "为此,您的层应实现 `layer.compute_mask()` 方法,该方法会根据输入和当前掩码生成新的掩码。\n", "\n", "以下是需要修改当前掩码的 `TemporalSplit` 层的示例。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a06fb2194c0d", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class TemporalSplit(keras.layers.Layer):\n", " \"\"\"Split the input tensor into 2 tensors along the time dimension.\"\"\"\n", "\n", " def call(self, inputs):\n", " # Expect the input to be 3D and mask to be 2D, split the input tensor into 2\n", " # subtensors along the time axis (axis 1).\n", " return tf.split(inputs, 2, axis=1)\n", "\n", " def compute_mask(self, inputs, mask=None):\n", " # Also split the mask into 2 if it presents.\n", " if mask is None:\n", " return None\n", " return tf.split(mask, 2, axis=1)\n", "\n", "\n", "first_half, second_half = TemporalSplit()(masked_embedding)\n", "print(first_half._keras_mask)\n", "print(second_half._keras_mask)" ] }, { "cell_type": "markdown", "metadata": { "id": "282b867dcd95" }, "source": [ "下面是关于 `CustomEmbedding` 层的另一个示例,该层能够根据输入值生成掩码:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e760655cd39c", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class CustomEmbedding(keras.layers.Layer):\n", " def __init__(self, input_dim, output_dim, mask_zero=False, **kwargs):\n", " super(CustomEmbedding, self).__init__(**kwargs)\n", " self.input_dim = input_dim\n", " self.output_dim = output_dim\n", " self.mask_zero = mask_zero\n", "\n", " def build(self, input_shape):\n", " self.embeddings = self.add_weight(\n", " shape=(self.input_dim, self.output_dim),\n", " initializer=\"random_normal\",\n", " dtype=\"float32\",\n", " )\n", "\n", " def call(self, inputs):\n", " return tf.nn.embedding_lookup(self.embeddings, inputs)\n", "\n", " def compute_mask(self, inputs, mask=None):\n", " if not self.mask_zero:\n", " return None\n", " return tf.not_equal(inputs, 0)\n", "\n", "\n", "layer = CustomEmbedding(10, 32, mask_zero=True)\n", "x = np.random.random((3, 10)) * 9\n", "x = x.astype(\"int32\")\n", "\n", "y = layer(x)\n", "mask = layer.compute_mask(x)\n", "\n", "print(mask)" ] }, { "cell_type": "markdown", "metadata": { "id": "bb34149eb837" }, "source": [ "## 在兼容层上选择启用掩码传播\n", "\n", "大多数层都不会修改时间维度,因此无需修改当前掩码。但是,这些层可能仍希望能够将当前掩码不加更改地**传播**到下一层。**这是一种可以选择启用的行为。**默认情况下,自定义层将破坏当前掩码(因为框架无法确定传播该掩码是否安全)。\n", "\n", "如果您有一个不会修改时间维度的自定义层,且您希望它能够传播当前的输入掩码,您应该在层构造函数中设置 `self.supports_masking = True`。在这种情况下,`compute_mask()` 的默认行为是仅传递当前掩码。\n", "\n", "下面是被列入掩码传播白名单的层的示例:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "895c35534d06", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class MyActivation(keras.layers.Layer):\n", " def __init__(self, **kwargs):\n", " super(MyActivation, self).__init__(**kwargs)\n", " # Signal that the layer is safe for mask propagation\n", " self.supports_masking = True\n", "\n", " def call(self, inputs):\n", " return tf.nn.relu(inputs)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "a2e1e0a81995" }, "source": [ "现在,您可以在掩码生成层(如 `Embedding`)和掩码使用层(如 `LSTM`)之间使用此自定义层,它会将掩码一路传递到掩码使用层。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "486e39e9a9a7", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "inputs = keras.Input(shape=(None,), dtype=\"int32\")\n", "x = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)\n", "x = MyActivation()(x) # Will pass the mask along\n", "print(\"Mask found:\", x._keras_mask)\n", "outputs = layers.LSTM(32)(x) # Will receive the mask\n", "\n", "model = keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "55ab9c02f4ba" }, "source": [ "## 编写需要掩码信息的层\n", "\n", "有些层是掩码*使用者*:他们会在 `call` 中接受 `mask` 参数,并使用该参数来决定是否跳过某些时间步骤。\n", "\n", "要编写这样的层,您只需在 `call` 签名中添加一个 `mask=None` 参数。与输入关联的掩码只要可用就会被传递到您的层。\n", "\n", "下面是一个简单示例:示例中的层在输入序列的时间维度(轴 1)上计算 Softmax,同时丢弃遮盖的时间步骤。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9774bcd59908", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class TemporalSoftmax(keras.layers.Layer):\n", " def call(self, inputs, mask=None):\n", " broadcast_float_mask = tf.expand_dims(tf.cast(mask, \"float32\"), -1)\n", " inputs_exp = tf.exp(inputs) * broadcast_float_mask\n", " inputs_sum = tf.reduce_sum(\n", " inputs_exp * broadcast_float_mask, axis=-1, keepdims=True\n", " )\n", " return inputs_exp / inputs_sum\n", "\n", "\n", "inputs = keras.Input(shape=(None,), dtype=\"int32\")\n", "x = layers.Embedding(input_dim=10, output_dim=32, mask_zero=True)(inputs)\n", "x = layers.Dense(1)(x)\n", "outputs = TemporalSoftmax()(x)\n", "\n", "model = keras.Model(inputs, outputs)\n", "y = model(np.random.randint(0, 10, size=(32, 100)), np.random.random((32, 100, 1)))" ] }, { "cell_type": "markdown", "metadata": { "id": "6373f43bbe18" }, "source": [ "## 总结\n", "\n", "以上是您需要了解的关于 Keras 中填充和遮盖的所有信息。回顾一下:\n", "\n", "- “遮盖”是层得知何时应该跳过/忽略序列输入中的某些时间步骤的方式。\n", "- 有些层是掩码生成者:`Embedding` 可以通过输入值来生成掩码(如果 `mask_zero=True`),`Masking` 层也可以。\n", "- 有些层是掩码使用者:它们会在其 `__call__` 方法中公开 `mask` 参数。RNN 层就是如此。\n", "- 在函数式 API 和序列式 API 中,掩码信息会自动传播。\n", "- 单独使用层时,您可以将 `mask` 参数手动传递给层。\n", "- 您可以轻松编写会修改当前掩码的层、生成新掩码的层,或使用与输入关联的掩码的层。" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "masking_and_padding.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }