{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "s_qNSzzyaCbD" }, "outputs": [], "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "jmjh290raIky", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "J0Qjg6vuaHNt" }, "source": [ "# 理解语言的 Transformer 模型" ] }, { "cell_type": "markdown", "metadata": { "id": "AOpGoE2T-YXS" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " 在 tensorflow.google.cn 上查看\n", " \n", " \n", " \n", " 在 Google Colab 运行\n", " \n", " \n", " \n", " 在 Github 上查看源代码\n", " \n", " 下载此 notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "7Saq5g1mnE5Y" }, "source": [ "Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为, 所以无法保证它们是最准确的,并且反映了最新的\n", "[官方英文文档](https://tensorflow.google.cn/?hl=en)。如果您有改进此翻译的建议, 请提交 pull request 到\n", "[tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文,请加入\n", "[docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)" ] }, { "cell_type": "markdown", "metadata": { "id": "M-f8TnGpE_ex" }, "source": [ "本教程训练了一个 Transformer 模型 用于将葡萄牙语翻译成英语。这是一个高级示例,假定您具备[文本生成(text generation)](text_generation.ipynb)和 [注意力机制(attention)](nmt_with_attention.ipynb) 的知识。\n", "\n", "Transformer 模型的核心思想是*自注意力机制(self-attention)*——能注意输入序列的不同位置以计算该序列的表示的能力。Transformer 创建了多层自注意力层(self-attetion layers)组成的堆栈,下文的*按比缩放的点积注意力(Scaled dot product attention)*和*多头注意力(Multi-head attention)*部分对此进行了说明。\n", "\n", "一个 transformer 模型用自注意力层而非 [RNNs](text_classification_rnn.ipynb) 或 [CNNs](../images/intro_to_cnns.ipynb) 来处理变长的输入。这种通用架构有一系列的优势:\n", "\n", "* 它不对数据间的时间/空间关系做任何假设。这是处理一组对象(objects)的理想选择(例如,[星际争霸单位(StarCraft units)](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii))。\n", "* 层输出可以并行计算,而非像 RNN 这样的序列计算。\n", "* 远距离项可以影响彼此的输出,而无需经过许多 RNN 步骤或卷积层(例如,参见[场景记忆 Transformer(Scene Memory Transformer)](https://arxiv.org/pdf/1903.03878.pdf))\n", "* 它能学习长距离的依赖。在许多序列任务中,这是一项挑战。\n", "\n", "该架构的缺点是:\n", "\n", "* 对于时间序列,一个单位时间的输出是从*整个历史记录*计算的,而非仅从输入和当前的隐含状态计算得到。这*可能*效率较低。 \n", "* 如果输入*确实*有时间/空间的关系,像文本,则必须加入一些位置编码,否则模型将有效地看到一堆单词。\n", "\n", "在此 notebook 中训练完模型后,您将能输入葡萄牙语句子,得到其英文翻译。\n", "\n", "\"Attention" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjJJyJTZYebt", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "\n", "import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "fd1NWMxjfsDd" }, "source": [ "## 设置输入流水线(input pipeline)" ] }, { "cell_type": "markdown", "metadata": { "id": "t4_Qt8W1hJE_" }, "source": [ "使用 [TFDS](https://tensorflow.google.cn/datasets) 来导入 [葡萄牙语-英语翻译数据集](https://github.com/neulab/word-embeddings-for-nmt),该数据集来自于 [TED 演讲开放翻译项目](https://www.ted.com/participate/translate).\n", "\n", "该数据集包含来约 50000 条训练样本,1100 条验证样本,以及 2000 条测试样本。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8q9t4FmN96eN", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,\n", " as_supervised=True)\n", "train_examples, val_examples = examples['train'], examples['validation']" ] }, { "cell_type": "markdown", "metadata": { "id": "RCEKotqosGfq" }, "source": [ "从训练数据集创建自定义子词分词器(subwords tokenizer)。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVBg5Q8tBk5z", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(\n", " (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)\n", "\n", "tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(\n", " (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4DYWukNFkGQN", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_string = 'Transformer is awesome.'\n", "\n", "tokenized_string = 
tokenizer_en.encode(sample_string)\n", "print ('Tokenized string is {}'.format(tokenized_string))\n", "\n", "original_string = tokenizer_en.decode(tokenized_string)\n", "print ('The original string: {}'.format(original_string))\n", "\n", "assert original_string == sample_string" ] }, { "cell_type": "markdown", "metadata": { "id": "o9KJWJjrsZ4Y" }, "source": [ "如果单词不在词典中,则分词器(tokenizer)通过将单词分解为子词来对字符串进行编码。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bf2ntBxjkqK6", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for ts in tokenized_string:\n", " print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bcRp7VcQ5m6g", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "BUFFER_SIZE = 20000\n", "BATCH_SIZE = 64" ] }, { "cell_type": "markdown", "metadata": { "id": "kGi4PoVakxdc" }, "source": [ "将开始和结束标记(token)添加到输入和目标。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UZwnPr4R055s", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def encode(lang1, lang2):\n", " lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(\n", " lang1.numpy()) + [tokenizer_pt.vocab_size+1]\n", "\n", " lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(\n", " lang2.numpy()) + [tokenizer_en.vocab_size+1]\n", " \n", " return lang1, lang2" ] }, { "cell_type": "markdown", "metadata": { "id": "6JrGp5Gek6Ql" }, "source": [ "Note:为了使本示例较小且相对较快,删除长度大于40个标记的样本。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2QEgbjntk6Yf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "MAX_LENGTH = 40" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c081xPGv1CPI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def filter_max_length(x, y, max_length=MAX_LENGTH):\n", " return tf.logical_and(tf.size(x) <= max_length,\n", " tf.size(y) <= max_length)" ] }, { "cell_type": "markdown", "metadata": { "id": "Tx1sFbR-9fRs" }, "source": [ "`.map()` 内部的操作以图模式(graph mode)运行,`.map()` 接收一个不具有 numpy 属性的图张量(graph tensor)。该`分词器(tokenizer)`需要将一个字符串或 Unicode 符号,编码成整数。因此,您需要在 `tf.py_function` 内部运行编码过程,`tf.py_function` 接收一个 eager 张量,该 eager 张量有一个包含字符串值的 numpy 属性。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mah1cS-P70Iz", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def tf_encode(pt, en):\n", " result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])\n", " result_pt.set_shape([None])\n", " result_en.set_shape([None])\n", "\n", " return result_pt, result_en" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9mk9AZdZ5bcS", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train_dataset = train_examples.map(tf_encode)\n", "train_dataset = train_dataset.filter(filter_max_length)\n", "# 将数据集缓存到内存中以加快读取速度。\n", "train_dataset = train_dataset.cache()\n", "train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)\n", "train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)\n", "\n", "\n", "val_dataset = val_examples.map(tf_encode)\n", "val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_fXvfYVfQr2n", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "pt_batch, en_batch = next(iter(val_dataset))\n", "pt_batch, en_batch" ] }, { 
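"cell_type": "markdown", "metadata": {}, "source": [ "作为补充检查(非教程原文,可跳过),可以把批次中的一条样本解码回文本,确认填充和开始/结束标记符合预期。下面的代码仅作演示,用到的是上文创建的 `tokenizer_pt`、`tokenizer_en` 以及刚取出的 `pt_batch`、`en_batch`:" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 仅作演示:将验证批次中的第一条样本解码回文本。\n",
"# 过滤掉填充值 0 以及词汇表之外的开始/结束标记。\n",
"pt_ids = [int(i) for i in pt_batch[0] if 0 < int(i) < tokenizer_pt.vocab_size]\n",
"en_ids = [int(i) for i in en_batch[0] if 0 < int(i) < tokenizer_en.vocab_size]\n",
"print(tokenizer_pt.decode(pt_ids))\n",
"print(tokenizer_en.decode(en_ids))" ] }, {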
"cell_type": "markdown", "metadata": { "id": "nBQuibYA4n0n" }, "source": [ "## 位置编码(Positional encoding)\n", "\n", "因为该模型并不包括任何的循环(recurrence)或卷积,所以模型添加了位置编码,为模型提供一些关于单词在句子中相对位置的信息。\n", "\n", "位置编码向量被加到嵌入(embedding)向量中。嵌入表示一个 d 维空间的标记,在 d 维空间中有着相似含义的标记会离彼此更近。但是,嵌入并没有对在一句话中的词的相对位置进行编码。因此,当加上位置编码后,词将基于*它们含义的相似度以及它们在句子中的位置*,在 d 维空间中离彼此更近。\n", "\n", "参看 [位置编码](https://github.com/tensorflow/examples/blob/master/community/en/position_encoding.ipynb) 的 notebook 了解更多信息。计算位置编码的公式如下:\n", "\n", "$$\\Large{PE_{(pos, 2i)} = sin(pos / 10000^{2i / d_{model}})} $$\n", "$$\\Large{PE_{(pos, 2i+1)} = cos(pos / 10000^{2i / d_{model}})} $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WhIOZjMNKujn", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def get_angles(pos, i, d_model):\n", " angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))\n", " return pos * angle_rates" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1Rz82wEs5biZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def positional_encoding(position, d_model):\n", " angle_rads = get_angles(np.arange(position)[:, np.newaxis],\n", " np.arange(d_model)[np.newaxis, :],\n", " d_model)\n", " \n", " # 将 sin 应用于数组中的偶数索引(indices);2i\n", " angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])\n", " \n", " # 将 cos 应用于数组中的奇数索引;2i+1\n", " angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])\n", " \n", " pos_encoding = angle_rads[np.newaxis, ...]\n", " \n", " return tf.cast(pos_encoding, dtype=tf.float32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1kLCla68EloE", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "pos_encoding = positional_encoding(50, 512)\n", "print (pos_encoding.shape)\n", "\n", "plt.pcolormesh(pos_encoding[0], cmap='RdBu')\n", "plt.xlabel('Depth')\n", "plt.xlim((0, 512))\n", "plt.ylabel('Position')\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "a_b4ou4TYqUN" }, "source": [ "## 遮挡(Masking)" ] }, { "cell_type": "markdown", "metadata": { "id": "s42Uydjkv0hF" }, "source": [ "遮挡一批序列中所有的填充标记(pad tokens)。这确保了模型不会将填充作为输入。该 mask 表明填充值 `0` 出现的位置:在这些位置 mask 输出 `1`,否则输出 `0`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U2i8-e1s8ti9", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_padding_mask(seq):\n", " seq = tf.cast(tf.math.equal(seq, 0), tf.float32)\n", " \n", " # 添加额外的维度来将填充加到\n", " # 注意力对数(logits)。\n", " return seq[:, tf.newaxis, tf.newaxis, :] # (batch_size, 1, 1, seq_len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A7BYeBCNvi7n", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n", "create_padding_mask(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "Z0hzukDBgVom" }, "source": [ "前瞻遮挡(look-ahead mask)用于遮挡一个序列中的后续标记(future tokens)。换句话说,该 mask 表明了不应该使用的条目。\n", "\n", "这意味着要预测第三个词,将仅使用第一个和第二个词。与此类似,预测第四个词,仅使用第一个,第二个和第三个词,依此类推。 " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dVxS8OPI9uI0", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_look_ahead_mask(size):\n", " mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)\n", " return mask # (seq_len, seq_len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxKGuXxaBeeE", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ 
"x = tf.random.uniform((1, 3))\n", "temp = create_look_ahead_mask(x.shape[1])\n", "temp" ] }, { "cell_type": "markdown", "metadata": { "id": "xluDl5cXYy4y" }, "source": [ "## 按比缩放的点积注意力(Scaled dot product attention)" ] }, { "cell_type": "markdown", "metadata": { "id": "vsxEE_-Wa1gF" }, "source": [ "\"scaled_dot_product_attention\"\n", "\n", "Transformer 使用的注意力函数有三个输入:Q(请求(query))、K(主键(key))、V(数值(value))。用于计算注意力权重的等式为:\n", "\n", "$$\\Large{Attention(Q, K, V) = softmax_k(\\frac{QK^T}{\\sqrt{d_k}}) V} $$\n", "\n", "点积注意力被缩小了深度的平方根倍。这样做是因为对于较大的深度值,点积的大小会增大,从而推动 softmax 函数往仅有很小的梯度的方向靠拢,导致了一种很硬的(hard)softmax。\n", "\n", "例如,假设 `Q` 和 `K` 的均值为0,方差为1。它们的矩阵乘积将有均值为0,方差为 `dk`。因此,*`dk` 的平方根*被用于缩放(而非其他数值),因为,`Q` 和 `K` 的矩阵乘积的均值本应该为 0,方差本应该为1,这样会获得一个更平缓的 softmax。\n", "\n", "遮挡(mask)与 -1e9(接近于负无穷)相乘。这样做是因为遮挡与缩放的 Q 和 K 的矩阵乘积相加,并在 softmax 之前立即应用。目标是将这些单元归零,因为 softmax 的较大负数输入在输出中接近于零。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LazzUq3bJ5SH", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def scaled_dot_product_attention(q, k, v, mask):\n", " \"\"\"计算注意力权重。\n", " q, k, v 必须具有匹配的前置维度。\n", " k, v 必须有匹配的倒数第二个维度,例如:seq_len_k = seq_len_v。\n", " 虽然 mask 根据其类型(填充或前瞻)有不同的形状,\n", " 但是 mask 必须能进行广播转换以便求和。\n", " \n", " 参数:\n", " q: 请求的形状 == (..., seq_len_q, depth)\n", " k: 主键的形状 == (..., seq_len_k, depth)\n", " v: 数值的形状 == (..., seq_len_v, depth_v)\n", " mask: Float 张量,其形状能转换成\n", " (..., seq_len_q, seq_len_k)。默认为None。\n", " \n", " 返回值:\n", " 输出,注意力权重\n", " \"\"\"\n", "\n", " matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., seq_len_q, seq_len_k)\n", " \n", " # 缩放 matmul_qk\n", " dk = tf.cast(tf.shape(k)[-1], tf.float32)\n", " scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n", "\n", " # 将 mask 加入到缩放的张量上。\n", " if mask is not None:\n", " scaled_attention_logits += (mask * -1e9) \n", "\n", " # softmax 在最后一个轴(seq_len_k)上归一化,因此分数\n", " # 相加等于1。\n", " attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) # (..., seq_len_q, seq_len_k)\n", "\n", " output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "markdown", "metadata": { "id": "FiqETnhCkoXh" }, "source": [ "当 softmax 在 K 上进行归一化后,它的值决定了分配到 Q 的重要程度。\n", "\n", "输出表示注意力权重和 V(数值)向量的乘积。这确保了要关注的词保持原样,而无关的词将被清除掉。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n90YjClyInFy", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def print_out(q, k, v):\n", " temp_out, temp_attn = scaled_dot_product_attention(\n", " q, k, v, None)\n", " print ('Attention weights are:')\n", " print (temp_attn)\n", " print ('Output is:')\n", " print (temp_out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yAzUAf2DPlNt", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "np.set_printoptions(suppress=True)\n", "\n", "temp_k = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_v = tf.constant([[ 1,0],\n", " [ 10,0],\n", " [ 100,5],\n", " [1000,6]], dtype=tf.float32) # (4, 2)\n", "\n", "# 这条 `请求(query)符合第二个`主键(key)`,\n", "# 因此返回了第二个`数值(value)`。\n", "temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zg6k-fGhgXra", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 这条请求符合重复出现的主键(第三第四个),\n", "# 因此,对所有的相关数值取了平均。\n", "temp_q = tf.constant([[0, 0, 10]], 
dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UAq3YOzUgXhb", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 这条请求符合第一和第二条主键,\n", "# 因此,对它们的数值去了平均。\n", "temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "markdown", "metadata": { "id": "aOz-4_XIhaTP" }, "source": [ "将所有请求一起*传递*。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6dlU8Tm-hYrF", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32) # (3, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "markdown", "metadata": { "id": "kmzGPEy64qmA" }, "source": [ "## 多头注意力(Multi-head attention)" ] }, { "cell_type": "markdown", "metadata": { "id": "fz5BMC8Kaoqo" }, "source": [ "\"multi-head\n", "\n", "\n", "多头注意力由四部分组成:\n", "* 线性层并分拆成多头。\n", "* 按比缩放的点积注意力。\n", "* 多头及联。\n", "* 最后一层线性层。" ] }, { "cell_type": "markdown", "metadata": { "id": "JPmbr6F1C-v_" }, "source": [ "每个多头注意力块有三个输入:Q(请求)、K(主键)、V(数值)。这些输入经过线性(Dense)层,并分拆成多头。 \n", "\n", "将上面定义的 `scaled_dot_product_attention` 函数应用于每个头(进行了广播(broadcasted)以提高效率)。注意力这步必须使用一个恰当的 mask。然后将每个头的注意力输出连接起来(用`tf.transpose` 和 `tf.reshape`),并放入最后的 `Dense` 层。\n", "\n", "Q、K、和 V 被拆分到了多个头,而非单个的注意力头,因为多头允许模型共同注意来自不同表示空间的不同位置的信息。在分拆后,每个头部的维度减少,因此总的计算成本与有着全部维度的单个注意力头相同。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BSV3PPKsYecw", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class MultiHeadAttention(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads):\n", " super(MultiHeadAttention, self).__init__()\n", " self.num_heads = num_heads\n", " self.d_model = d_model\n", " \n", " assert d_model % self.num_heads == 0\n", " \n", " self.depth = d_model // self.num_heads\n", " \n", " self.wq = tf.keras.layers.Dense(d_model)\n", " self.wk = tf.keras.layers.Dense(d_model)\n", " self.wv = tf.keras.layers.Dense(d_model)\n", " \n", " self.dense = tf.keras.layers.Dense(d_model)\n", " \n", " def split_heads(self, x, batch_size):\n", " \"\"\"分拆最后一个维度到 (num_heads, depth).\n", " 转置结果使得形状为 (batch_size, num_heads, seq_len, depth)\n", " \"\"\"\n", " x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n", " return tf.transpose(x, perm=[0, 2, 1, 3])\n", " \n", " def call(self, v, k, q, mask):\n", " batch_size = tf.shape(q)[0]\n", " \n", " q = self.wq(q) # (batch_size, seq_len, d_model)\n", " k = self.wk(k) # (batch_size, seq_len, d_model)\n", " v = self.wv(v) # (batch_size, seq_len, d_model)\n", " \n", " q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth)\n", " k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)\n", " v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)\n", " \n", " # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)\n", " # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)\n", " scaled_attention, attention_weights = scaled_dot_product_attention(\n", " q, k, v, mask)\n", " \n", " scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3]) # (batch_size, seq_len_q, num_heads, depth)\n", "\n", " concat_attention = tf.reshape(scaled_attention, \n", " (batch_size, -1, self.d_model)) # (batch_size, seq_len_q, d_model)\n", "\n", " output = self.dense(concat_attention) # (batch_size, seq_len_q, d_model)\n", " 
\n", " return output, attention_weights" ] }, { "cell_type": "markdown", "metadata": { "id": "0D8FJue5lDyZ" }, "source": [ "创建一个 `MultiHeadAttention` 层进行尝试。在序列中的每个位置 `y`,`MultiHeadAttention` 在序列中的所有其他位置运行所有8个注意力头,在每个位置y,返回一个新的同样长度的向量。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hu94p-_-2_BX", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_mha = MultiHeadAttention(d_model=512, num_heads=8)\n", "y = tf.random.uniform((1, 60, 512)) # (batch_size, encoder_sequence, d_model)\n", "out, attn = temp_mha(y, k=y, q=y, mask=None)\n", "out.shape, attn.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "RdDqGayx67vv" }, "source": [ "## 点式前馈网络(Point wise feed forward network)" ] }, { "cell_type": "markdown", "metadata": { "id": "gBqzJXGfHK3X" }, "source": [ "点式前馈网络由两层全联接层组成,两层之间有一个 ReLU 激活函数。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ET7xLt0yCT6Z", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def point_wise_feed_forward_network(d_model, dff):\n", " return tf.keras.Sequential([\n", " tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)\n", " tf.keras.layers.Dense(d_model) # (batch_size, seq_len, d_model)\n", " ])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mytb1lPyOHLB", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_ffn = point_wise_feed_forward_network(512, 2048)\n", "sample_ffn(tf.random.uniform((64, 50, 512))).shape" ] }, { "cell_type": "markdown", "metadata": { "id": "7e7hKcxn6-zd" }, "source": [ "## 编码与解码(Encoder and decoder)" ] }, { "cell_type": "markdown", "metadata": { "id": "yScbC0MUH8dS" }, "source": [ "\"transformer\"" ] }, { "cell_type": "markdown", "metadata": { "id": "MfYJG-Kvgwy2" }, "source": [ "Transformer 模型与标准的[具有注意力机制的序列到序列模型(sequence to sequence with attention model)](nmt_with_attention.ipynb),遵循相同的一般模式。\n", "\n", "* 输入语句经过 `N` 个编码器层,为序列中的每个词/标记生成一个输出。\n", "* 解码器关注编码器的输出以及它自身的输入(自注意力)来预测下一个词。" ] }, { "cell_type": "markdown", "metadata": { "id": "QFv-FNYUmvpn" }, "source": [ "### 编码器层(Encoder layer)\n", "\n", "每个编码器层包括以下子层:\n", "\n", "1. 多头注意力(有填充遮挡)\n", "2. 
点式前馈网络(Point wise feed forward networks)。\n", "\n", "每个子层在其周围有一个残差连接,然后进行层归一化。残差连接有助于避免深度网络中的梯度消失问题。\n", "\n", "每个子层的输出是 `LayerNorm(x + Sublayer(x))`。归一化是在 `d_model`(最后一个)维度完成的。Transformer 中有 N 个编码器层。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ncyS-Ms3i2x_", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class EncoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, rate=0.1):\n", " super(EncoderLayer, self).__init__()\n", "\n", " self.mha = MultiHeadAttention(d_model, num_heads)\n", " self.ffn = point_wise_feed_forward_network(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " \n", " self.dropout1 = tf.keras.layers.Dropout(rate)\n", " self.dropout2 = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, training, mask):\n", "\n", " attn_output, _ = self.mha(x, x, x, mask) # (batch_size, input_seq_len, d_model)\n", " attn_output = self.dropout1(attn_output, training=training)\n", " out1 = self.layernorm1(x + attn_output) # (batch_size, input_seq_len, d_model)\n", " \n", " ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)\n", " ffn_output = self.dropout2(ffn_output, training=training)\n", " out2 = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)\n", " \n", " return out2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzZRXdO0mI48", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_encoder_layer = EncoderLayer(512, 8, 2048)\n", "\n", "sample_encoder_layer_output = sample_encoder_layer(\n", " tf.random.uniform((64, 43, 512)), False, None)\n", "\n", "sample_encoder_layer_output.shape # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "6LO_48Owmx_o" }, "source": [ "### 解码器层(Decoder layer)\n", "\n", "每个解码器层包括以下子层:\n", "\n", "1. 遮挡的多头注意力(前瞻遮挡和填充遮挡)\n", "2. 多头注意力(用填充遮挡)。V(数值)和 K(主键)接收*编码器输出*作为输入。Q(请求)接收*遮挡的多头注意力子层的输出*。\n", "3. 
点式前馈网络\n", "\n", "每个子层在其周围有一个残差连接,然后进行层归一化。每个子层的输出是 `LayerNorm(x + Sublayer(x))`。归一化是在 `d_model`(最后一个)维度完成的。\n", "\n", "Transformer 中共有 N 个解码器层。\n", "\n", "当 Q 接收到解码器的第一个注意力块的输出,并且 K 接收到编码器的输出时,注意力权重表示根据编码器的输出赋予解码器输入的重要性。换一种说法,解码器通过查看编码器输出和对其自身输出的自注意力,预测下一个词。参看按比缩放的点积注意力部分的演示。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9SoX0-vd1hue", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class DecoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, rate=0.1):\n", " super(DecoderLayer, self).__init__()\n", "\n", " self.mha1 = MultiHeadAttention(d_model, num_heads)\n", " self.mha2 = MultiHeadAttention(d_model, num_heads)\n", "\n", " self.ffn = point_wise_feed_forward_network(d_model, dff)\n", " \n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " \n", " self.dropout1 = tf.keras.layers.Dropout(rate)\n", " self.dropout2 = tf.keras.layers.Dropout(rate)\n", " self.dropout3 = tf.keras.layers.Dropout(rate)\n", " \n", " \n", " def call(self, x, enc_output, training, \n", " look_ahead_mask, padding_mask):\n", " # enc_output.shape == (batch_size, input_seq_len, d_model)\n", "\n", " attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask) # (batch_size, target_seq_len, d_model)\n", " attn1 = self.dropout1(attn1, training=training)\n", " out1 = self.layernorm1(attn1 + x)\n", " \n", " attn2, attn_weights_block2 = self.mha2(\n", " enc_output, enc_output, out1, padding_mask) # (batch_size, target_seq_len, d_model)\n", " attn2 = self.dropout2(attn2, training=training)\n", " out2 = self.layernorm2(attn2 + out1) # (batch_size, target_seq_len, d_model)\n", " \n", " ffn_output = self.ffn(out2) # (batch_size, target_seq_len, d_model)\n", " ffn_output = self.dropout3(ffn_output, training=training)\n", " out3 = self.layernorm3(ffn_output + out2) # (batch_size, target_seq_len, d_model)\n", " \n", " return out3, attn_weights_block1, attn_weights_block2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ne2Bqx8k71l0", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_decoder_layer = DecoderLayer(512, 8, 2048)\n", "\n", "sample_decoder_layer_output, _, _ = sample_decoder_layer(\n", " tf.random.uniform((64, 50, 512)), sample_encoder_layer_output, \n", " False, None, None)\n", "\n", "sample_decoder_layer_output.shape # (batch_size, target_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "SE1H51Ajm0q1" }, "source": [ "### 编码器(Encoder)\n", "\n", "`编码器` 包括:\n", "1. 输入嵌入(Input Embedding)\n", "2. 位置编码(Positional Encoding)\n", "3. 
N 个编码器层(encoder layers)\n", "\n", "输入经过嵌入(embedding)后,该嵌入与位置编码相加。该加法结果的输出是编码器层的输入。编码器的输出是解码器的输入。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jpEox7gJ8FCI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Encoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,\n", " maximum_position_encoding, rate=0.1):\n", " super(Encoder, self).__init__()\n", "\n", " self.d_model = d_model\n", " self.num_layers = num_layers\n", " \n", " self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(maximum_position_encoding, \n", " self.d_model)\n", " \n", " \n", " self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) \n", " for _ in range(num_layers)]\n", " \n", " self.dropout = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, training, mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " \n", " # 将嵌入和位置编码相加。\n", " x = self.embedding(x) # (batch_size, input_seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", "\n", " x = self.dropout(x, training=training)\n", " \n", " for i in range(self.num_layers):\n", " x = self.enc_layers[i](x, training, mask)\n", " \n", " return x # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8QG9nueFQKXx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8, \n", " dff=2048, input_vocab_size=8500,\n", " maximum_position_encoding=10000)\n", "\n", "sample_encoder_output = sample_encoder(tf.random.uniform((64, 62)), \n", " training=False, mask=None)\n", "\n", "print (sample_encoder_output.shape) # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "p-uO6ls8m2O5" }, "source": [ "### 解码器(Decoder)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZtT7PKzrXkNr" }, "source": [ "`解码器`包括:\n", "1. 输出嵌入(Output Embedding)\n", "2. 位置编码(Positional Encoding)\n", "3. 
N 个解码器层(decoder layers)\n", "\n", "目标(target)经过一个嵌入后,该嵌入和位置编码相加。该加法结果是解码器层的输入。解码器的输出是最后的线性层的输入。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d5_d5-PLQXwY", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Decoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,\n", " maximum_position_encoding, rate=0.1):\n", " super(Decoder, self).__init__()\n", "\n", " self.d_model = d_model\n", " self.num_layers = num_layers\n", " \n", " self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)\n", " \n", " self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) \n", " for _ in range(num_layers)]\n", " self.dropout = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, enc_output, training, \n", " look_ahead_mask, padding_mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " attention_weights = {}\n", " \n", " x = self.embedding(x) # (batch_size, target_seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", " \n", " x = self.dropout(x, training=training)\n", "\n", " for i in range(self.num_layers):\n", " x, block1, block2 = self.dec_layers[i](x, enc_output, training,\n", " look_ahead_mask, padding_mask)\n", " \n", " attention_weights['decoder_layer{}_block1'.format(i+1)] = block1\n", " attention_weights['decoder_layer{}_block2'.format(i+1)] = block2\n", " \n", " # x.shape == (batch_size, target_seq_len, d_model)\n", " return x, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a1jXoAMRZyvu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8, \n", " dff=2048, target_vocab_size=8000,\n", " maximum_position_encoding=5000)\n", "\n", "output, attn = sample_decoder(tf.random.uniform((64, 26)), \n", " enc_output=sample_encoder_output, \n", " training=False, look_ahead_mask=None, \n", " padding_mask=None)\n", "\n", "output.shape, attn['decoder_layer2_block2'].shape" ] }, { "cell_type": "markdown", "metadata": { "id": "y54xnJnuYgJ7" }, "source": [ "## 创建 Transformer" ] }, { "cell_type": "markdown", "metadata": { "id": "uERO1y54cOKq" }, "source": [ "Transformer 包括编码器,解码器和最后的线性层。解码器的输出是线性层的输入,返回线性层的输出。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PED3bIpOYkBu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Transformer(tf.keras.Model):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, \n", " target_vocab_size, pe_input, pe_target, rate=0.1):\n", " super(Transformer, self).__init__()\n", "\n", " self.encoder = Encoder(num_layers, d_model, num_heads, dff, \n", " input_vocab_size, pe_input, rate)\n", "\n", " self.decoder = Decoder(num_layers, d_model, num_heads, dff, \n", " target_vocab_size, pe_target, rate)\n", "\n", " self.final_layer = tf.keras.layers.Dense(target_vocab_size)\n", " \n", " def call(self, inp, tar, training, enc_padding_mask, \n", " look_ahead_mask, dec_padding_mask):\n", "\n", " enc_output = self.encoder(inp, training, enc_padding_mask) # (batch_size, inp_seq_len, d_model)\n", " \n", " # dec_output.shape == (batch_size, tar_seq_len, d_model)\n", " dec_output, attention_weights = self.decoder(\n", " tar, enc_output, training, look_ahead_mask, dec_padding_mask)\n", " \n", " final_output = 
self.final_layer(dec_output) # (batch_size, tar_seq_len, target_vocab_size)\n", " \n", " return final_output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tJ4fbQcIkHW1", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_transformer = Transformer(\n", " num_layers=2, d_model=512, num_heads=8, dff=2048, \n", " input_vocab_size=8500, target_vocab_size=8000, \n", " pe_input=10000, pe_target=6000)\n", "\n", "temp_input = tf.random.uniform((64, 62))\n", "temp_target = tf.random.uniform((64, 26))\n", "\n", "fn_out, _ = sample_transformer(temp_input, temp_target, training=False, \n", " enc_padding_mask=None, \n", " look_ahead_mask=None,\n", " dec_padding_mask=None)\n", "\n", "fn_out.shape # (batch_size, tar_seq_len, target_vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "wsINyf1VEQLC" }, "source": [ "## 配置超参数(hyperparameters)" ] }, { "cell_type": "markdown", "metadata": { "id": "zVjWCxFNcgbt" }, "source": [ "为了让本示例小且相对较快,已经减小了*num_layers、 d_model 和 dff* 的值。 \n", "\n", "Transformer 的基础模型使用的数值为:*num_layers=6*,*d_model = 512*,*dff = 2048*。关于所有其他版本的 Transformer,请查阅[论文](https://arxiv.org/abs/1706.03762)。\n", "\n", "Note:通过改变以下数值,您可以获得在许多任务上达到最先进水平的模型。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lnJn5SLA2ahP", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "num_layers = 4\n", "d_model = 128\n", "dff = 512\n", "num_heads = 8\n", "\n", "input_vocab_size = tokenizer_pt.vocab_size + 2\n", "target_vocab_size = tokenizer_en.vocab_size + 2\n", "dropout_rate = 0.1" ] }, { "cell_type": "markdown", "metadata": { "id": "xYEGhEOtzn5W" }, "source": [ "## 优化器(Optimizer)" ] }, { "cell_type": "markdown", "metadata": { "id": "GOmWW--yP3zx" }, "source": [ "根据[论文](https://arxiv.org/abs/1706.03762)中的公式,将 Adam 优化器与自定义的学习速率调度程序(scheduler)配合使用。\n", "\n", "$$\\Large{lrate = d_{model}^{-0.5} * min(step{\\_}num^{-0.5}, step{\\_}num * warmup{\\_}steps^{-1.5})}$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iYQdOO1axwEI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):\n", " def __init__(self, d_model, warmup_steps=4000):\n", " super(CustomSchedule, self).__init__()\n", " \n", " self.d_model = d_model\n", " self.d_model = tf.cast(self.d_model, tf.float32)\n", "\n", " self.warmup_steps = warmup_steps\n", " \n", " def __call__(self, step):\n", " arg1 = tf.math.rsqrt(step)\n", " arg2 = step * (self.warmup_steps ** -1.5)\n", " \n", " return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7r4scdulztRx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "learning_rate = CustomSchedule(d_model)\n", "\n", "optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, \n", " epsilon=1e-9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f33ZCgvHpPdG", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_learning_rate_schedule = CustomSchedule(d_model)\n", "\n", "plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))\n", "plt.ylabel(\"Learning Rate\")\n", "plt.xlabel(\"Train Step\")" ] }, { "cell_type": "markdown", "metadata": { "id": "YgkDE7hzo8r5" }, "source": [ "## 损失函数与指标(Loss and metrics)" ] }, { "cell_type": "markdown", "metadata": { "id": "oxGJtoDuYIHL" }, "source": [ 
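"下面先用一个独立的小示例(非教程原文,张量为随意构造的演示数据)直观展示:填充位置(标记 0)上的逐位置损失应当被遮挡掉。正式使用的 `loss_object` 与 `loss_function` 在后文定义。" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 仅作演示(假设的小例子):观察填充位置上的损失如何被遮挡。\n",
"demo_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
"    from_logits=True, reduction='none')\n",
"\n",
"real = tf.constant([[3, 5, 0, 0]])       # 末尾两个 0 为填充\n",
"pred = tf.random.uniform((1, 4, 8))      # (batch_size, seq_len, vocab_size)\n",
"\n",
"per_position = demo_loss(real, pred)     # (1, 4),每个位置一个损失值\n",
"mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), per_position.dtype)\n",
"print(per_position.numpy())\n",
"print((per_position * mask).numpy())     # 填充位置的损失被置零" ] }, {
"cell_type": "markdown", "metadata": {}, "source": [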
"由于目标序列是填充(padded)过的,因此在计算损失函数时,应用填充遮挡非常重要。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MlhsJMm0TW_B", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n", " from_logits=True, reduction='none')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "67oqVHiT0Eiu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def loss_function(real, pred):\n", " mask = tf.math.logical_not(tf.math.equal(real, 0))\n", " loss_ = loss_object(real, pred)\n", "\n", " mask = tf.cast(mask, dtype=loss_.dtype)\n", " loss_ *= mask\n", " \n", " return tf.reduce_mean(loss_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "phlyxMnm-Tpx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train_loss = tf.keras.metrics.Mean(name='train_loss')\n", "train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='train_accuracy')" ] }, { "cell_type": "markdown", "metadata": { "id": "aeHumfr7zmMa" }, "source": [ "## 训练与检查点(Training and checkpointing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UiysUa--4tOU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "transformer = Transformer(num_layers, d_model, num_heads, dff,\n", " input_vocab_size, target_vocab_size, \n", " pe_input=input_vocab_size, \n", " pe_target=target_vocab_size,\n", " rate=dropout_rate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZOJUSB1T8GjM", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_masks(inp, tar):\n", " # 编码器填充遮挡\n", " enc_padding_mask = create_padding_mask(inp)\n", " \n", " # 在解码器的第二个注意力模块使用。\n", " # 该填充遮挡用于遮挡编码器的输出。\n", " dec_padding_mask = create_padding_mask(inp)\n", " \n", " # 在解码器的第一个注意力模块使用。\n", " # 用于填充(pad)和遮挡(mask)解码器获取到的输入的后续标记(future tokens)。\n", " look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])\n", " dec_target_padding_mask = create_padding_mask(tar)\n", " combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)\n", " \n", " return enc_padding_mask, combined_mask, dec_padding_mask" ] }, { "cell_type": "markdown", "metadata": { "id": "Fzuf06YZp66w" }, "source": [ "创建检查点的路径和检查点管理器(manager)。这将用于在每 `n` 个周期(epochs)保存检查点。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hNhuYfllndLZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "checkpoint_path = \"./checkpoints/train\"\n", "\n", "ckpt = tf.train.Checkpoint(transformer=transformer,\n", " optimizer=optimizer)\n", "\n", "ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)\n", "\n", "# 如果检查点存在,则恢复最新的检查点。\n", "if ckpt_manager.latest_checkpoint:\n", " ckpt.restore(ckpt_manager.latest_checkpoint)\n", " print ('Latest checkpoint restored!!')" ] }, { "cell_type": "markdown", "metadata": { "id": "0Di_Yaa1gf9r" }, "source": [ "目标(target)被分成了 tar_inp 和 tar_real。tar_inp 作为输入传递到解码器。`tar_real` 是位移了 1 的同一个输入:在 `tar_inp` 中的每个位置,`tar_real` 包含了应该被预测到的下一个标记(token)。\n", "\n", "例如,`sentence` = \"SOS A lion in the jungle is sleeping EOS\"\n", "\n", "`tar_inp` = \"SOS A lion in the jungle is sleeping\"\n", "\n", "`tar_real` = \"A lion in the jungle is sleeping EOS\"\n", "\n", "Transformer 是一个自回归(auto-regressive)模型:它一次作一个部分的预测,然后使用到目前为止的自身的输出来决定下一步要做什么。\n", "\n", "在训练过程中,本示例使用了 teacher-forcing 的方法(就像[文本生成教程](./text_generation.ipynb)中一样)。无论模型在当前时间步骤下预测出什么,teacher-forcing 方法都会将真实的输出传递到下一个时间步骤上。\n", "\n", "当 
transformer 预测每个词时,*自注意力(self-attention)*功能使它能够查看输入序列中前面的单词,从而更好地预测下一个单词。\n", "\n", "为了防止模型在期望的输出上达到峰值,模型使用了前瞻遮挡(look-ahead mask)。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LKpoA6q1sJFj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "EPOCHS = 20" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iJwmp9OE29oj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 该 @tf.function 将追踪-编译 train_step 到 TF 图中,以便更快地\n", "# 执行。该函数专用于参数张量的精确形状。为了避免由于可变序列长度或可变\n", "# 批次大小(最后一批次较小)导致的再追踪,使用 input_signature 指定\n", "# 更多的通用形状。\n", "\n", "train_step_signature = [\n", " tf.TensorSpec(shape=(None, None), dtype=tf.int64),\n", " tf.TensorSpec(shape=(None, None), dtype=tf.int64),\n", "]\n", "\n", "@tf.function(input_signature=train_step_signature)\n", "def train_step(inp, tar):\n", " tar_inp = tar[:, :-1]\n", " tar_real = tar[:, 1:]\n", " \n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)\n", " \n", " with tf.GradientTape() as tape:\n", " predictions, _ = transformer(inp, tar_inp, \n", " True, \n", " enc_padding_mask, \n", " combined_mask, \n", " dec_padding_mask)\n", " loss = loss_function(tar_real, predictions)\n", "\n", " gradients = tape.gradient(loss, transformer.trainable_variables) \n", " optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))\n", " \n", " train_loss(loss)\n", " train_accuracy(tar_real, predictions)" ] }, { "cell_type": "markdown", "metadata": { "id": "qM2PDWGDJ_8V" }, "source": [ "葡萄牙语作为输入语言,英语为目标语言。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bbvmaKNiznHZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for epoch in range(EPOCHS):\n", " start = time.time()\n", " \n", " train_loss.reset_states()\n", " train_accuracy.reset_states()\n", " \n", " # inp -> portuguese, tar -> english\n", " for (batch, (inp, tar)) in enumerate(train_dataset):\n", " train_step(inp, tar)\n", " \n", " if batch % 50 == 0:\n", " print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(\n", " epoch + 1, batch, train_loss.result(), train_accuracy.result()))\n", " \n", " if (epoch + 1) % 5 == 0:\n", " ckpt_save_path = ckpt_manager.save()\n", " print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,\n", " ckpt_save_path))\n", " \n", " print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, \n", " train_loss.result(), \n", " train_accuracy.result()))\n", "\n", " print ('Time taken for 1 epoch: {} secs\\n'.format(time.time() - start))" ] }, { "cell_type": "markdown", "metadata": { "id": "QfcsSWswSdGV" }, "source": [ "## 评估(Evaluate)" ] }, { "cell_type": "markdown", "metadata": { "id": "y6APsFrgImLW" }, "source": [ "以下步骤用于评估:\n", "\n", "* 用葡萄牙语分词器(`tokenizer_pt`)编码输入语句。此外,添加开始和结束标记,这样输入就与模型训练的内容相同。这是编码器输入。\n", "* 解码器输入为 `start token == tokenizer_en.vocab_size`。\n", "* 计算填充遮挡和前瞻遮挡。\n", "* `解码器`通过查看`编码器输出`和它自身的输出(自注意力)给出预测。\n", "* 选择最后一个词并计算它的 argmax。\n", "* 将预测的词连接到解码器输入,然后传递给解码器。\n", "* 在这种方法中,解码器根据它预测的之前的词预测下一个。\n", "\n", "Note:这里使用的模型具有较小的能力以保持相对较快,因此预测可能不太正确。要复现论文中的结果,请使用全部数据集,并通过修改上述超参数来使用基础 transformer 模型或者 transformer XL。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5buvMlnvyrFm", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def evaluate(inp_sentence):\n", " start_token = [tokenizer_pt.vocab_size]\n", " end_token = [tokenizer_pt.vocab_size + 1]\n", " \n", " # 输入语句是葡萄牙语,增加开始和结束标记\n", " inp_sentence = start_token + 
tokenizer_pt.encode(inp_sentence) + end_token\n", " encoder_input = tf.expand_dims(inp_sentence, 0)\n", " \n", " # 因为目标是英语,输入 transformer 的第一个词应该是\n", " # 英语的开始标记。\n", " decoder_input = [tokenizer_en.vocab_size]\n", " output = tf.expand_dims(decoder_input, 0)\n", " \n", " for i in range(MAX_LENGTH):\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(\n", " encoder_input, output)\n", " \n", " # predictions.shape == (batch_size, seq_len, vocab_size)\n", " predictions, attention_weights = transformer(encoder_input, \n", " output,\n", " False,\n", " enc_padding_mask,\n", " combined_mask,\n", " dec_padding_mask)\n", " \n", " # 从 seq_len 维度选择最后一个词\n", " predictions = predictions[: ,-1:, :] # (batch_size, 1, vocab_size)\n", "\n", " predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)\n", " \n", " # 如果 predicted_id 等于结束标记,就返回结果\n", " if predicted_id == tokenizer_en.vocab_size+1:\n", " return tf.squeeze(output, axis=0), attention_weights\n", " \n", " # 连接 predicted_id 与输出,作为解码器的输入传递到解码器。\n", " output = tf.concat([output, predicted_id], axis=-1)\n", "\n", " return tf.squeeze(output, axis=0), attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CN-BV43FMBej", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def plot_attention_weights(attention, sentence, result, layer):\n", " fig = plt.figure(figsize=(16, 8))\n", " \n", " sentence = tokenizer_pt.encode(sentence)\n", " \n", " attention = tf.squeeze(attention[layer], axis=0)\n", " \n", " for head in range(attention.shape[0]):\n", " ax = fig.add_subplot(2, 4, head+1)\n", " \n", " # 画出注意力权重\n", " ax.matshow(attention[head][:-1, :], cmap='viridis')\n", "\n", " fontdict = {'fontsize': 10}\n", " \n", " ax.set_xticks(range(len(sentence)+2))\n", " ax.set_yticks(range(len(result)))\n", " \n", " ax.set_ylim(len(result)-1.5, -0.5)\n", " \n", " ax.set_xticklabels(\n", " ['']+[tokenizer_pt.decode([i]) for i in sentence]+[''], \n", " fontdict=fontdict, rotation=90)\n", " \n", " ax.set_yticklabels([tokenizer_en.decode([i]) for i in result \n", " if i < tokenizer_en.vocab_size], \n", " fontdict=fontdict)\n", " \n", " ax.set_xlabel('Head {}'.format(head+1))\n", " \n", " plt.tight_layout()\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lU2_yG_vBGza", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def translate(sentence, plot=''):\n", " result, attention_weights = evaluate(sentence)\n", " \n", " predicted_sentence = tokenizer_en.decode([i for i in result \n", " if i < tokenizer_en.vocab_size]) \n", "\n", " print('Input: {}'.format(sentence))\n", " print('Predicted translation: {}'.format(predicted_sentence))\n", " \n", " if plot:\n", " plot_attention_weights(attention_weights, sentence, result, plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YsxrAlvFG8SZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"este é um problema que temos que resolver.\")\n", "print (\"Real translation: this is a problem we have to solve .\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7EH5y_aqI4t1", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"os meus vizinhos ouviram sobre esta ideia.\")\n", "print (\"Real translation: and my neighboring homes heard about this idea .\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J-hVCTSUMlkb", "vscode": { "languageId": "python" } }, "outputs": 
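[], "source": [
"# 仅作演示(非教程原文):查看 evaluate() 返回的注意力权重字典,\n",
"# 其中的键(例如 'decoder_layer4_block2')可以作为下文 translate() 的 plot 参数。\n",
"result, attention_weights = evaluate(\"este é um problema que temos que resolver.\")\n",
"for name, weights in attention_weights.items():\n",
"  print(name, weights.shape)" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 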
[], "source": [ "translate(\"vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.\")\n", "print (\"Real translation: so i 'll just share with you some stories very quickly of some magical things that have happened .\")" ] }, { "cell_type": "markdown", "metadata": { "id": "_1MxkSZvz0jX" }, "source": [ "您可以为 `plot` 参数传递不同的层和解码器的注意力模块。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t-kFyiOLH0xg", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"este é o primeiro livro que eu fiz.\", plot='decoder_layer4_block2')\n", "print (\"Real translation: this is the first book i've ever done.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RqQ1fIsLwkGE" }, "source": [ "## 总结\n", "\n", "在本教程中,您已经学习了位置编码,多头注意力,遮挡的重要性以及如何创建一个 transformer。\n", "\n", "尝试使用一个不同的数据集来训练 transformer。您可也可以通过修改上述的超参数来创建基础 transformer 或者 transformer XL。您也可以使用这里定义的层来创建 [BERT](https://arxiv.org/abs/1810.04805) 并训练最先进的模型。此外,您可以实现 beam search 得到更好的预测。" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "s_qNSzzyaCbD" ], "name": "transformer.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }