{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "s_qNSzzyaCbD" }, "outputs": [], "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "jmjh290raIky", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "J0Qjg6vuaHNt" }, "source": [ "# 理解语言的 Transformer 模型" ] }, { "cell_type": "markdown", "metadata": { "id": "AOpGoE2T-YXS" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " 在 tensorflow.google.cn 上查看\n", " \n", " \n", " \n", " 在 Google Colab 运行\n", " \n", " \n", " \n", " 在 Github 上查看源代码\n", " \n", " 下载此 notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "7Saq5g1mnE5Y" }, "source": [ "Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为, 所以无法保证它们是最准确的,并且反映了最新的\n", "[官方英文文档](https://tensorflow.google.cn/?hl=en)。如果您有改进此翻译的建议, 请提交 pull request 到\n", "[tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文,请加入\n", "[docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)" ] }, { "cell_type": "markdown", "metadata": { "id": "M-f8TnGpE_ex" }, "source": [ "本教程训练了一个 Transformer 模型 用于将葡萄牙语翻译成英语。这是一个高级示例,假定您具备[文本生成(text generation)](text_generation.ipynb)和 [注意力机制(attention)](nmt_with_attention.ipynb) 的知识。\n", "\n", "Transformer 模型的核心思想是*自注意力机制(self-attention)*——能注意输入序列的不同位置以计算该序列的表示的能力。Transformer 创建了多层自注意力层(self-attetion layers)组成的堆栈,下文的*按比缩放的点积注意力(Scaled dot product attention)*和*多头注意力(Multi-head attention)*部分对此进行了说明。\n", "\n", "一个 transformer 模型用自注意力层而非 [RNNs](text_classification_rnn.ipynb) 或 [CNNs](../images/intro_to_cnns.ipynb) 来处理变长的输入。这种通用架构有一系列的优势:\n", "\n", "* 它不对数据间的时间/空间关系做任何假设。这是处理一组对象(objects)的理想选择(例如,[星际争霸单位(StarCraft units)](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii))。\n", "* 层输出可以并行计算,而非像 RNN 这样的序列计算。\n", "* 远距离项可以影响彼此的输出,而无需经过许多 RNN 步骤或卷积层(例如,参见[场景记忆 Transformer(Scene Memory Transformer)](https://arxiv.org/pdf/1903.03878.pdf))\n", "* 它能学习长距离的依赖。在许多序列任务中,这是一项挑战。\n", "\n", "该架构的缺点是:\n", "\n", "* 对于时间序列,一个单位时间的输出是从*整个历史记录*计算的,而非仅从输入和当前的隐含状态计算得到。这*可能*效率较低。 \n", "* 如果输入*确实*有时间/空间的关系,像文本,则必须加入一些位置编码,否则模型将有效地看到一堆单词。\n", "\n", "在此 notebook 中训练完模型后,您将能输入葡萄牙语句子,得到其英文翻译。\n", "\n", "\"Attention" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjJJyJTZYebt", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "\n", "import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "fd1NWMxjfsDd" }, "source": [ "## 设置输入流水线(input pipeline)" ] }, { "cell_type": "markdown", "metadata": { "id": "t4_Qt8W1hJE_" }, "source": [ "使用 [TFDS](https://tensorflow.google.cn/datasets) 来导入 [葡萄牙语-英语翻译数据集](https://github.com/neulab/word-embeddings-for-nmt),该数据集来自于 [TED 演讲开放翻译项目](https://www.ted.com/participate/translate).\n", "\n", "该数据集包含来约 50000 条训练样本,1100 条验证样本,以及 2000 条测试样本。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8q9t4FmN96eN", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,\n", " as_supervised=True)\n", "train_examples, val_examples = examples['train'], examples['validation']" ] }, { "cell_type": "markdown", "metadata": { "id": "RCEKotqosGfq" }, "source": [ "从训练数据集创建自定义子词分词器(subwords tokenizer)。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVBg5Q8tBk5z", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(\n", " (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)\n", "\n", "tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(\n", " (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4DYWukNFkGQN", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_string = 'Transformer is awesome.'\n", "\n", "tokenized_string = 
tokenizer_en.encode(sample_string)\n", "print ('Tokenized string is {}'.format(tokenized_string))\n", "\n", "original_string = tokenizer_en.decode(tokenized_string)\n", "print ('The original string: {}'.format(original_string))\n", "\n", "assert original_string == sample_string" ] }, { "cell_type": "markdown", "metadata": { "id": "o9KJWJjrsZ4Y" }, "source": [ "如果单词不在词典中,则分词器(tokenizer)通过将单词分解为子词来对字符串进行编码。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bf2ntBxjkqK6", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for ts in tokenized_string:\n", " print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bcRp7VcQ5m6g", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "BUFFER_SIZE = 20000\n", "BATCH_SIZE = 64" ] }, { "cell_type": "markdown", "metadata": { "id": "kGi4PoVakxdc" }, "source": [ "将开始和结束标记(token)添加到输入和目标。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UZwnPr4R055s", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def encode(lang1, lang2):\n", " lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(\n", " lang1.numpy()) + [tokenizer_pt.vocab_size+1]\n", "\n", " lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(\n", " lang2.numpy()) + [tokenizer_en.vocab_size+1]\n", " \n", " return lang1, lang2" ] }, { "cell_type": "markdown", "metadata": { "id": "6JrGp5Gek6Ql" }, "source": [ "Note:为了使本示例较小且相对较快,删除长度大于40个标记的样本。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2QEgbjntk6Yf", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "MAX_LENGTH = 40" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c081xPGv1CPI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def filter_max_length(x, y, max_length=MAX_LENGTH):\n", " return tf.logical_and(tf.size(x) <= max_length,\n", " tf.size(y) <= max_length)" ] }, { "cell_type": "markdown", "metadata": { "id": "Tx1sFbR-9fRs" }, "source": [ "`.map()` 内部的操作以图模式(graph mode)运行,`.map()` 接收一个不具有 numpy 属性的图张量(graph tensor)。该`分词器(tokenizer)`需要将一个字符串或 Unicode 符号,编码成整数。因此,您需要在 `tf.py_function` 内部运行编码过程,`tf.py_function` 接收一个 eager 张量,该 eager 张量有一个包含字符串值的 numpy 属性。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mah1cS-P70Iz", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def tf_encode(pt, en):\n", " result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])\n", " result_pt.set_shape([None])\n", " result_en.set_shape([None])\n", "\n", " return result_pt, result_en" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9mk9AZdZ5bcS", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train_dataset = train_examples.map(tf_encode)\n", "train_dataset = train_dataset.filter(filter_max_length)\n", "# 将数据集缓存到内存中以加快读取速度。\n", "train_dataset = train_dataset.cache()\n", "train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)\n", "train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)\n", "\n", "\n", "val_dataset = val_examples.map(tf_encode)\n", "val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_fXvfYVfQr2n", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "pt_batch, en_batch = next(iter(val_dataset))\n", "pt_batch, en_batch" ] }, { 
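"cell_type": "markdown", "metadata": {}, "source": [ "作为补充检查(非教程原文,可跳过),可以把批次中的一条样本解码回文本,确认填充和开始/结束标记符合预期。下面的代码仅作演示,用到的是上文创建的 `tokenizer_pt`、`tokenizer_en` 以及刚取出的 `pt_batch`、`en_batch`:" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 仅作演示:将验证批次中的第一条样本解码回文本。\n",
"# 过滤掉填充值 0 以及词汇表之外的开始/结束标记。\n",
"pt_ids = [int(i) for i in pt_batch[0] if 0 < int(i) < tokenizer_pt.vocab_size]\n",
"en_ids = [int(i) for i in en_batch[0] if 0 < int(i) < tokenizer_en.vocab_size]\n",
"print(tokenizer_pt.decode(pt_ids))\n",
"print(tokenizer_en.decode(en_ids))" ] }, {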
"cell_type": "markdown", "metadata": { "id": "nBQuibYA4n0n" }, "source": [ "## 位置编码(Positional encoding)\n", "\n", "因为该模型并不包括任何的循环(recurrence)或卷积,所以模型添加了位置编码,为模型提供一些关于单词在句子中相对位置的信息。\n", "\n", "位置编码向量被加到嵌入(embedding)向量中。嵌入表示一个 d 维空间的标记,在 d 维空间中有着相似含义的标记会离彼此更近。但是,嵌入并没有对在一句话中的词的相对位置进行编码。因此,当加上位置编码后,词将基于*它们含义的相似度以及它们在句子中的位置*,在 d 维空间中离彼此更近。\n", "\n", "参看 [位置编码](https://github.com/tensorflow/examples/blob/master/community/en/position_encoding.ipynb) 的 notebook 了解更多信息。计算位置编码的公式如下:\n", "\n", "$$\\Large{PE_{(pos, 2i)} = sin(pos / 10000^{2i / d_{model}})} $$\n", "$$\\Large{PE_{(pos, 2i+1)} = cos(pos / 10000^{2i / d_{model}})} $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WhIOZjMNKujn", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def get_angles(pos, i, d_model):\n", " angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))\n", " return pos * angle_rates" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1Rz82wEs5biZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def positional_encoding(position, d_model):\n", " angle_rads = get_angles(np.arange(position)[:, np.newaxis],\n", " np.arange(d_model)[np.newaxis, :],\n", " d_model)\n", " \n", " # 将 sin 应用于数组中的偶数索引(indices);2i\n", " angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])\n", " \n", " # 将 cos 应用于数组中的奇数索引;2i+1\n", " angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])\n", " \n", " pos_encoding = angle_rads[np.newaxis, ...]\n", " \n", " return tf.cast(pos_encoding, dtype=tf.float32)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1kLCla68EloE", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "pos_encoding = positional_encoding(50, 512)\n", "print (pos_encoding.shape)\n", "\n", "plt.pcolormesh(pos_encoding[0], cmap='RdBu')\n", "plt.xlabel('Depth')\n", "plt.xlim((0, 512))\n", "plt.ylabel('Position')\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "a_b4ou4TYqUN" }, "source": [ "## 遮挡(Masking)" ] }, { "cell_type": "markdown", "metadata": { "id": "s42Uydjkv0hF" }, "source": [ "遮挡一批序列中所有的填充标记(pad tokens)。这确保了模型不会将填充作为输入。该 mask 表明填充值 `0` 出现的位置:在这些位置 mask 输出 `1`,否则输出 `0`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U2i8-e1s8ti9", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_padding_mask(seq):\n", " seq = tf.cast(tf.math.equal(seq, 0), tf.float32)\n", " \n", " # 添加额外的维度来将填充加到\n", " # 注意力对数(logits)。\n", " return seq[:, tf.newaxis, tf.newaxis, :] # (batch_size, 1, 1, seq_len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A7BYeBCNvi7n", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n", "create_padding_mask(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "Z0hzukDBgVom" }, "source": [ "前瞻遮挡(look-ahead mask)用于遮挡一个序列中的后续标记(future tokens)。换句话说,该 mask 表明了不应该使用的条目。\n", "\n", "这意味着要预测第三个词,将仅使用第一个和第二个词。与此类似,预测第四个词,仅使用第一个,第二个和第三个词,依此类推。 " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dVxS8OPI9uI0", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_look_ahead_mask(size):\n", " mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)\n", " return mask # (seq_len, seq_len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxKGuXxaBeeE", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ 
"x = tf.random.uniform((1, 3))\n", "temp = create_look_ahead_mask(x.shape[1])\n", "temp" ] }, { "cell_type": "markdown", "metadata": { "id": "xluDl5cXYy4y" }, "source": [ "## 按比缩放的点积注意力(Scaled dot product attention)" ] }, { "cell_type": "markdown", "metadata": { "id": "vsxEE_-Wa1gF" }, "source": [ "\"scaled_dot_product_attention\"\n", "\n", "Transformer 使用的注意力函数有三个输入:Q(请求(query))、K(主键(key))、V(数值(value))。用于计算注意力权重的等式为:\n", "\n", "$$\\Large{Attention(Q, K, V) = softmax_k(\\frac{QK^T}{\\sqrt{d_k}}) V} $$\n", "\n", "点积注意力被缩小了深度的平方根倍。这样做是因为对于较大的深度值,点积的大小会增大,从而推动 softmax 函数往仅有很小的梯度的方向靠拢,导致了一种很硬的(hard)softmax。\n", "\n", "例如,假设 `Q` 和 `K` 的均值为0,方差为1。它们的矩阵乘积将有均值为0,方差为 `dk`。因此,*`dk` 的平方根*被用于缩放(而非其他数值),因为,`Q` 和 `K` 的矩阵乘积的均值本应该为 0,方差本应该为1,这样会获得一个更平缓的 softmax。\n", "\n", "遮挡(mask)与 -1e9(接近于负无穷)相乘。这样做是因为遮挡与缩放的 Q 和 K 的矩阵乘积相加,并在 softmax 之前立即应用。目标是将这些单元归零,因为 softmax 的较大负数输入在输出中接近于零。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LazzUq3bJ5SH", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def scaled_dot_product_attention(q, k, v, mask):\n", " \"\"\"计算注意力权重。\n", " q, k, v 必须具有匹配的前置维度。\n", " k, v 必须有匹配的倒数第二个维度,例如:seq_len_k = seq_len_v。\n", " 虽然 mask 根据其类型(填充或前瞻)有不同的形状,\n", " 但是 mask 必须能进行广播转换以便求和。\n", " \n", " 参数:\n", " q: 请求的形状 == (..., seq_len_q, depth)\n", " k: 主键的形状 == (..., seq_len_k, depth)\n", " v: 数值的形状 == (..., seq_len_v, depth_v)\n", " mask: Float 张量,其形状能转换成\n", " (..., seq_len_q, seq_len_k)。默认为None。\n", " \n", " 返回值:\n", " 输出,注意力权重\n", " \"\"\"\n", "\n", " matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., seq_len_q, seq_len_k)\n", " \n", " # 缩放 matmul_qk\n", " dk = tf.cast(tf.shape(k)[-1], tf.float32)\n", " scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n", "\n", " # 将 mask 加入到缩放的张量上。\n", " if mask is not None:\n", " scaled_attention_logits += (mask * -1e9) \n", "\n", " # softmax 在最后一个轴(seq_len_k)上归一化,因此分数\n", " # 相加等于1。\n", " attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) # (..., seq_len_q, seq_len_k)\n", "\n", " output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "markdown", "metadata": { "id": "FiqETnhCkoXh" }, "source": [ "当 softmax 在 K 上进行归一化后,它的值决定了分配到 Q 的重要程度。\n", "\n", "输出表示注意力权重和 V(数值)向量的乘积。这确保了要关注的词保持原样,而无关的词将被清除掉。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n90YjClyInFy", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def print_out(q, k, v):\n", " temp_out, temp_attn = scaled_dot_product_attention(\n", " q, k, v, None)\n", " print ('Attention weights are:')\n", " print (temp_attn)\n", " print ('Output is:')\n", " print (temp_out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yAzUAf2DPlNt", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "np.set_printoptions(suppress=True)\n", "\n", "temp_k = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_v = tf.constant([[ 1,0],\n", " [ 10,0],\n", " [ 100,5],\n", " [1000,6]], dtype=tf.float32) # (4, 2)\n", "\n", "# 这条 `请求(query)符合第二个`主键(key)`,\n", "# 因此返回了第二个`数值(value)`。\n", "temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zg6k-fGhgXra", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 这条请求符合重复出现的主键(第三第四个),\n", "# 因此,对所有的相关数值取了平均。\n", "temp_q = tf.constant([[0, 0, 10]], 
dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UAq3YOzUgXhb", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 这条请求符合第一和第二条主键,\n", "# 因此,对它们的数值去了平均。\n", "temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32) # (1, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "markdown", "metadata": { "id": "aOz-4_XIhaTP" }, "source": [ "将所有请求一起*传递*。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6dlU8Tm-hYrF", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32) # (3, 3)\n", "print_out(temp_q, temp_k, temp_v)" ] }, { "cell_type": "markdown", "metadata": { "id": "kmzGPEy64qmA" }, "source": [ "## 多头注意力(Multi-head attention)" ] }, { "cell_type": "markdown", "metadata": { "id": "fz5BMC8Kaoqo" }, "source": [ "\"multi-head\n", "\n", "\n", "多头注意力由四部分组成:\n", "* 线性层并分拆成多头。\n", "* 按比缩放的点积注意力。\n", "* 多头及联。\n", "* 最后一层线性层。" ] }, { "cell_type": "markdown", "metadata": { "id": "JPmbr6F1C-v_" }, "source": [ "每个多头注意力块有三个输入:Q(请求)、K(主键)、V(数值)。这些输入经过线性(Dense)层,并分拆成多头。 \n", "\n", "将上面定义的 `scaled_dot_product_attention` 函数应用于每个头(进行了广播(broadcasted)以提高效率)。注意力这步必须使用一个恰当的 mask。然后将每个头的注意力输出连接起来(用`tf.transpose` 和 `tf.reshape`),并放入最后的 `Dense` 层。\n", "\n", "Q、K、和 V 被拆分到了多个头,而非单个的注意力头,因为多头允许模型共同注意来自不同表示空间的不同位置的信息。在分拆后,每个头部的维度减少,因此总的计算成本与有着全部维度的单个注意力头相同。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BSV3PPKsYecw", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class MultiHeadAttention(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads):\n", " super(MultiHeadAttention, self).__init__()\n", " self.num_heads = num_heads\n", " self.d_model = d_model\n", " \n", " assert d_model % self.num_heads == 0\n", " \n", " self.depth = d_model // self.num_heads\n", " \n", " self.wq = tf.keras.layers.Dense(d_model)\n", " self.wk = tf.keras.layers.Dense(d_model)\n", " self.wv = tf.keras.layers.Dense(d_model)\n", " \n", " self.dense = tf.keras.layers.Dense(d_model)\n", " \n", " def split_heads(self, x, batch_size):\n", " \"\"\"分拆最后一个维度到 (num_heads, depth).\n", " 转置结果使得形状为 (batch_size, num_heads, seq_len, depth)\n", " \"\"\"\n", " x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n", " return tf.transpose(x, perm=[0, 2, 1, 3])\n", " \n", " def call(self, v, k, q, mask):\n", " batch_size = tf.shape(q)[0]\n", " \n", " q = self.wq(q) # (batch_size, seq_len, d_model)\n", " k = self.wk(k) # (batch_size, seq_len, d_model)\n", " v = self.wv(v) # (batch_size, seq_len, d_model)\n", " \n", " q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth)\n", " k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)\n", " v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)\n", " \n", " # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)\n", " # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)\n", " scaled_attention, attention_weights = scaled_dot_product_attention(\n", " q, k, v, mask)\n", " \n", " scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3]) # (batch_size, seq_len_q, num_heads, depth)\n", "\n", " concat_attention = tf.reshape(scaled_attention, \n", " (batch_size, -1, self.d_model)) # (batch_size, seq_len_q, d_model)\n", "\n", " output = self.dense(concat_attention) # (batch_size, seq_len_q, d_model)\n", " 
\n", " return output, attention_weights" ] }, { "cell_type": "markdown", "metadata": { "id": "0D8FJue5lDyZ" }, "source": [ "创建一个 `MultiHeadAttention` 层进行尝试。在序列中的每个位置 `y`,`MultiHeadAttention` 在序列中的所有其他位置运行所有8个注意力头,在每个位置y,返回一个新的同样长度的向量。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hu94p-_-2_BX", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_mha = MultiHeadAttention(d_model=512, num_heads=8)\n", "y = tf.random.uniform((1, 60, 512)) # (batch_size, encoder_sequence, d_model)\n", "out, attn = temp_mha(y, k=y, q=y, mask=None)\n", "out.shape, attn.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "RdDqGayx67vv" }, "source": [ "## 点式前馈网络(Point wise feed forward network)" ] }, { "cell_type": "markdown", "metadata": { "id": "gBqzJXGfHK3X" }, "source": [ "点式前馈网络由两层全联接层组成,两层之间有一个 ReLU 激活函数。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ET7xLt0yCT6Z", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def point_wise_feed_forward_network(d_model, dff):\n", " return tf.keras.Sequential([\n", " tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)\n", " tf.keras.layers.Dense(d_model) # (batch_size, seq_len, d_model)\n", " ])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mytb1lPyOHLB", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_ffn = point_wise_feed_forward_network(512, 2048)\n", "sample_ffn(tf.random.uniform((64, 50, 512))).shape" ] }, { "cell_type": "markdown", "metadata": { "id": "7e7hKcxn6-zd" }, "source": [ "## 编码与解码(Encoder and decoder)" ] }, { "cell_type": "markdown", "metadata": { "id": "yScbC0MUH8dS" }, "source": [ "\"transformer\"" ] }, { "cell_type": "markdown", "metadata": { "id": "MfYJG-Kvgwy2" }, "source": [ "Transformer 模型与标准的[具有注意力机制的序列到序列模型(sequence to sequence with attention model)](nmt_with_attention.ipynb),遵循相同的一般模式。\n", "\n", "* 输入语句经过 `N` 个编码器层,为序列中的每个词/标记生成一个输出。\n", "* 解码器关注编码器的输出以及它自身的输入(自注意力)来预测下一个词。" ] }, { "cell_type": "markdown", "metadata": { "id": "QFv-FNYUmvpn" }, "source": [ "### 编码器层(Encoder layer)\n", "\n", "每个编码器层包括以下子层:\n", "\n", "1. 多头注意力(有填充遮挡)\n", "2. 
点式前馈网络(Point wise feed forward networks)。\n", "\n", "每个子层在其周围有一个残差连接,然后进行层归一化。残差连接有助于避免深度网络中的梯度消失问题。\n", "\n", "每个子层的输出是 `LayerNorm(x + Sublayer(x))`。归一化是在 `d_model`(最后一个)维度完成的。Transformer 中有 N 个编码器层。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ncyS-Ms3i2x_", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class EncoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, rate=0.1):\n", " super(EncoderLayer, self).__init__()\n", "\n", " self.mha = MultiHeadAttention(d_model, num_heads)\n", " self.ffn = point_wise_feed_forward_network(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " \n", " self.dropout1 = tf.keras.layers.Dropout(rate)\n", " self.dropout2 = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, training, mask):\n", "\n", " attn_output, _ = self.mha(x, x, x, mask) # (batch_size, input_seq_len, d_model)\n", " attn_output = self.dropout1(attn_output, training=training)\n", " out1 = self.layernorm1(x + attn_output) # (batch_size, input_seq_len, d_model)\n", " \n", " ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)\n", " ffn_output = self.dropout2(ffn_output, training=training)\n", " out2 = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)\n", " \n", " return out2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzZRXdO0mI48", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_encoder_layer = EncoderLayer(512, 8, 2048)\n", "\n", "sample_encoder_layer_output = sample_encoder_layer(\n", " tf.random.uniform((64, 43, 512)), False, None)\n", "\n", "sample_encoder_layer_output.shape # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "6LO_48Owmx_o" }, "source": [ "### 解码器层(Decoder layer)\n", "\n", "每个解码器层包括以下子层:\n", "\n", "1. 遮挡的多头注意力(前瞻遮挡和填充遮挡)\n", "2. 多头注意力(用填充遮挡)。V(数值)和 K(主键)接收*编码器输出*作为输入。Q(请求)接收*遮挡的多头注意力子层的输出*。\n", "3. 
点式前馈网络\n", "\n", "每个子层在其周围有一个残差连接,然后进行层归一化。每个子层的输出是 `LayerNorm(x + Sublayer(x))`。归一化是在 `d_model`(最后一个)维度完成的。\n", "\n", "Transformer 中共有 N 个解码器层。\n", "\n", "当 Q 接收到解码器的第一个注意力块的输出,并且 K 接收到编码器的输出时,注意力权重表示根据编码器的输出赋予解码器输入的重要性。换一种说法,解码器通过查看编码器输出和对其自身输出的自注意力,预测下一个词。参看按比缩放的点积注意力部分的演示。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9SoX0-vd1hue", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class DecoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, rate=0.1):\n", " super(DecoderLayer, self).__init__()\n", "\n", " self.mha1 = MultiHeadAttention(d_model, num_heads)\n", " self.mha2 = MultiHeadAttention(d_model, num_heads)\n", "\n", " self.ffn = point_wise_feed_forward_network(d_model, dff)\n", " \n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " \n", " self.dropout1 = tf.keras.layers.Dropout(rate)\n", " self.dropout2 = tf.keras.layers.Dropout(rate)\n", " self.dropout3 = tf.keras.layers.Dropout(rate)\n", " \n", " \n", " def call(self, x, enc_output, training, \n", " look_ahead_mask, padding_mask):\n", " # enc_output.shape == (batch_size, input_seq_len, d_model)\n", "\n", " attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask) # (batch_size, target_seq_len, d_model)\n", " attn1 = self.dropout1(attn1, training=training)\n", " out1 = self.layernorm1(attn1 + x)\n", " \n", " attn2, attn_weights_block2 = self.mha2(\n", " enc_output, enc_output, out1, padding_mask) # (batch_size, target_seq_len, d_model)\n", " attn2 = self.dropout2(attn2, training=training)\n", " out2 = self.layernorm2(attn2 + out1) # (batch_size, target_seq_len, d_model)\n", " \n", " ffn_output = self.ffn(out2) # (batch_size, target_seq_len, d_model)\n", " ffn_output = self.dropout3(ffn_output, training=training)\n", " out3 = self.layernorm3(ffn_output + out2) # (batch_size, target_seq_len, d_model)\n", " \n", " return out3, attn_weights_block1, attn_weights_block2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ne2Bqx8k71l0", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_decoder_layer = DecoderLayer(512, 8, 2048)\n", "\n", "sample_decoder_layer_output, _, _ = sample_decoder_layer(\n", " tf.random.uniform((64, 50, 512)), sample_encoder_layer_output, \n", " False, None, None)\n", "\n", "sample_decoder_layer_output.shape # (batch_size, target_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "SE1H51Ajm0q1" }, "source": [ "### 编码器(Encoder)\n", "\n", "`编码器` 包括:\n", "1. 输入嵌入(Input Embedding)\n", "2. 位置编码(Positional Encoding)\n", "3. 
N 个编码器层(encoder layers)\n", "\n", "输入经过嵌入(embedding)后,该嵌入与位置编码相加。该加法结果的输出是编码器层的输入。编码器的输出是解码器的输入。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jpEox7gJ8FCI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Encoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,\n", " maximum_position_encoding, rate=0.1):\n", " super(Encoder, self).__init__()\n", "\n", " self.d_model = d_model\n", " self.num_layers = num_layers\n", " \n", " self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(maximum_position_encoding, \n", " self.d_model)\n", " \n", " \n", " self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) \n", " for _ in range(num_layers)]\n", " \n", " self.dropout = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, training, mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " \n", " # 将嵌入和位置编码相加。\n", " x = self.embedding(x) # (batch_size, input_seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", "\n", " x = self.dropout(x, training=training)\n", " \n", " for i in range(self.num_layers):\n", " x = self.enc_layers[i](x, training, mask)\n", " \n", " return x # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8QG9nueFQKXx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8, \n", " dff=2048, input_vocab_size=8500,\n", " maximum_position_encoding=10000)\n", "\n", "sample_encoder_output = sample_encoder(tf.random.uniform((64, 62)), \n", " training=False, mask=None)\n", "\n", "print (sample_encoder_output.shape) # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "p-uO6ls8m2O5" }, "source": [ "### 解码器(Decoder)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZtT7PKzrXkNr" }, "source": [ "`解码器`包括:\n", "1. 输出嵌入(Output Embedding)\n", "2. 位置编码(Positional Encoding)\n", "3. 
N 个解码器层(decoder layers)\n", "\n", "目标(target)经过一个嵌入后,该嵌入和位置编码相加。该加法结果是解码器层的输入。解码器的输出是最后的线性层的输入。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d5_d5-PLQXwY", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Decoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,\n", " maximum_position_encoding, rate=0.1):\n", " super(Decoder, self).__init__()\n", "\n", " self.d_model = d_model\n", " self.num_layers = num_layers\n", " \n", " self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)\n", " \n", " self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) \n", " for _ in range(num_layers)]\n", " self.dropout = tf.keras.layers.Dropout(rate)\n", " \n", " def call(self, x, enc_output, training, \n", " look_ahead_mask, padding_mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " attention_weights = {}\n", " \n", " x = self.embedding(x) # (batch_size, target_seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", " \n", " x = self.dropout(x, training=training)\n", "\n", " for i in range(self.num_layers):\n", " x, block1, block2 = self.dec_layers[i](x, enc_output, training,\n", " look_ahead_mask, padding_mask)\n", " \n", " attention_weights['decoder_layer{}_block1'.format(i+1)] = block1\n", " attention_weights['decoder_layer{}_block2'.format(i+1)] = block2\n", " \n", " # x.shape == (batch_size, target_seq_len, d_model)\n", " return x, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a1jXoAMRZyvu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8, \n", " dff=2048, target_vocab_size=8000,\n", " maximum_position_encoding=5000)\n", "\n", "output, attn = sample_decoder(tf.random.uniform((64, 26)), \n", " enc_output=sample_encoder_output, \n", " training=False, look_ahead_mask=None, \n", " padding_mask=None)\n", "\n", "output.shape, attn['decoder_layer2_block2'].shape" ] }, { "cell_type": "markdown", "metadata": { "id": "y54xnJnuYgJ7" }, "source": [ "## 创建 Transformer" ] }, { "cell_type": "markdown", "metadata": { "id": "uERO1y54cOKq" }, "source": [ "Transformer 包括编码器,解码器和最后的线性层。解码器的输出是线性层的输入,返回线性层的输出。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PED3bIpOYkBu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class Transformer(tf.keras.Model):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, \n", " target_vocab_size, pe_input, pe_target, rate=0.1):\n", " super(Transformer, self).__init__()\n", "\n", " self.encoder = Encoder(num_layers, d_model, num_heads, dff, \n", " input_vocab_size, pe_input, rate)\n", "\n", " self.decoder = Decoder(num_layers, d_model, num_heads, dff, \n", " target_vocab_size, pe_target, rate)\n", "\n", " self.final_layer = tf.keras.layers.Dense(target_vocab_size)\n", " \n", " def call(self, inp, tar, training, enc_padding_mask, \n", " look_ahead_mask, dec_padding_mask):\n", "\n", " enc_output = self.encoder(inp, training, enc_padding_mask) # (batch_size, inp_seq_len, d_model)\n", " \n", " # dec_output.shape == (batch_size, tar_seq_len, d_model)\n", " dec_output, attention_weights = self.decoder(\n", " tar, enc_output, training, look_ahead_mask, dec_padding_mask)\n", " \n", " final_output = 
self.final_layer(dec_output) # (batch_size, tar_seq_len, target_vocab_size)\n", " \n", " return final_output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tJ4fbQcIkHW1", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "sample_transformer = Transformer(\n", " num_layers=2, d_model=512, num_heads=8, dff=2048, \n", " input_vocab_size=8500, target_vocab_size=8000, \n", " pe_input=10000, pe_target=6000)\n", "\n", "temp_input = tf.random.uniform((64, 62))\n", "temp_target = tf.random.uniform((64, 26))\n", "\n", "fn_out, _ = sample_transformer(temp_input, temp_target, training=False, \n", " enc_padding_mask=None, \n", " look_ahead_mask=None,\n", " dec_padding_mask=None)\n", "\n", "fn_out.shape # (batch_size, tar_seq_len, target_vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "wsINyf1VEQLC" }, "source": [ "## 配置超参数(hyperparameters)" ] }, { "cell_type": "markdown", "metadata": { "id": "zVjWCxFNcgbt" }, "source": [ "为了让本示例小且相对较快,已经减小了*num_layers、 d_model 和 dff* 的值。 \n", "\n", "Transformer 的基础模型使用的数值为:*num_layers=6*,*d_model = 512*,*dff = 2048*。关于所有其他版本的 Transformer,请查阅[论文](https://arxiv.org/abs/1706.03762)。\n", "\n", "Note:通过改变以下数值,您可以获得在许多任务上达到最先进水平的模型。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lnJn5SLA2ahP", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "num_layers = 4\n", "d_model = 128\n", "dff = 512\n", "num_heads = 8\n", "\n", "input_vocab_size = tokenizer_pt.vocab_size + 2\n", "target_vocab_size = tokenizer_en.vocab_size + 2\n", "dropout_rate = 0.1" ] }, { "cell_type": "markdown", "metadata": { "id": "xYEGhEOtzn5W" }, "source": [ "## 优化器(Optimizer)" ] }, { "cell_type": "markdown", "metadata": { "id": "GOmWW--yP3zx" }, "source": [ "根据[论文](https://arxiv.org/abs/1706.03762)中的公式,将 Adam 优化器与自定义的学习速率调度程序(scheduler)配合使用。\n", "\n", "$$\\Large{lrate = d_{model}^{-0.5} * min(step{\\_}num^{-0.5}, step{\\_}num * warmup{\\_}steps^{-1.5})}$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iYQdOO1axwEI", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):\n", " def __init__(self, d_model, warmup_steps=4000):\n", " super(CustomSchedule, self).__init__()\n", " \n", " self.d_model = d_model\n", " self.d_model = tf.cast(self.d_model, tf.float32)\n", "\n", " self.warmup_steps = warmup_steps\n", " \n", " def __call__(self, step):\n", " arg1 = tf.math.rsqrt(step)\n", " arg2 = step * (self.warmup_steps ** -1.5)\n", " \n", " return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7r4scdulztRx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "learning_rate = CustomSchedule(d_model)\n", "\n", "optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, \n", " epsilon=1e-9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f33ZCgvHpPdG", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "temp_learning_rate_schedule = CustomSchedule(d_model)\n", "\n", "plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))\n", "plt.ylabel(\"Learning Rate\")\n", "plt.xlabel(\"Train Step\")" ] }, { "cell_type": "markdown", "metadata": { "id": "YgkDE7hzo8r5" }, "source": [ "## 损失函数与指标(Loss and metrics)" ] }, { "cell_type": "markdown", "metadata": { "id": "oxGJtoDuYIHL" }, "source": [ 
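"下面先用一个独立的小示例(非教程原文,张量为随意构造的演示数据)直观展示:填充位置(标记 0)上的逐位置损失应当被遮挡掉。正式使用的 `loss_object` 与 `loss_function` 在后文定义。" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 仅作演示(假设的小例子):观察填充位置上的损失如何被遮挡。\n",
"demo_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
"    from_logits=True, reduction='none')\n",
"\n",
"real = tf.constant([[3, 5, 0, 0]])       # 末尾两个 0 为填充\n",
"pred = tf.random.uniform((1, 4, 8))      # (batch_size, seq_len, vocab_size)\n",
"\n",
"per_position = demo_loss(real, pred)     # (1, 4),每个位置一个损失值\n",
"mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), per_position.dtype)\n",
"print(per_position.numpy())\n",
"print((per_position * mask).numpy())     # 填充位置的损失被置零" ] }, {
"cell_type": "markdown", "metadata": {}, "source": [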
"由于目标序列是填充(padded)过的,因此在计算损失函数时,应用填充遮挡非常重要。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MlhsJMm0TW_B", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n", " from_logits=True, reduction='none')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "67oqVHiT0Eiu", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def loss_function(real, pred):\n", " mask = tf.math.logical_not(tf.math.equal(real, 0))\n", " loss_ = loss_object(real, pred)\n", "\n", " mask = tf.cast(mask, dtype=loss_.dtype)\n", " loss_ *= mask\n", " \n", " return tf.reduce_mean(loss_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "phlyxMnm-Tpx", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "train_loss = tf.keras.metrics.Mean(name='train_loss')\n", "train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='train_accuracy')" ] }, { "cell_type": "markdown", "metadata": { "id": "aeHumfr7zmMa" }, "source": [ "## 训练与检查点(Training and checkpointing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UiysUa--4tOU", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "transformer = Transformer(num_layers, d_model, num_heads, dff,\n", " input_vocab_size, target_vocab_size, \n", " pe_input=input_vocab_size, \n", " pe_target=target_vocab_size,\n", " rate=dropout_rate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZOJUSB1T8GjM", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def create_masks(inp, tar):\n", " # 编码器填充遮挡\n", " enc_padding_mask = create_padding_mask(inp)\n", " \n", " # 在解码器的第二个注意力模块使用。\n", " # 该填充遮挡用于遮挡编码器的输出。\n", " dec_padding_mask = create_padding_mask(inp)\n", " \n", " # 在解码器的第一个注意力模块使用。\n", " # 用于填充(pad)和遮挡(mask)解码器获取到的输入的后续标记(future tokens)。\n", " look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])\n", " dec_target_padding_mask = create_padding_mask(tar)\n", " combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)\n", " \n", " return enc_padding_mask, combined_mask, dec_padding_mask" ] }, { "cell_type": "markdown", "metadata": { "id": "Fzuf06YZp66w" }, "source": [ "创建检查点的路径和检查点管理器(manager)。这将用于在每 `n` 个周期(epochs)保存检查点。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hNhuYfllndLZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "checkpoint_path = \"./checkpoints/train\"\n", "\n", "ckpt = tf.train.Checkpoint(transformer=transformer,\n", " optimizer=optimizer)\n", "\n", "ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)\n", "\n", "# 如果检查点存在,则恢复最新的检查点。\n", "if ckpt_manager.latest_checkpoint:\n", " ckpt.restore(ckpt_manager.latest_checkpoint)\n", " print ('Latest checkpoint restored!!')" ] }, { "cell_type": "markdown", "metadata": { "id": "0Di_Yaa1gf9r" }, "source": [ "目标(target)被分成了 tar_inp 和 tar_real。tar_inp 作为输入传递到解码器。`tar_real` 是位移了 1 的同一个输入:在 `tar_inp` 中的每个位置,`tar_real` 包含了应该被预测到的下一个标记(token)。\n", "\n", "例如,`sentence` = \"SOS A lion in the jungle is sleeping EOS\"\n", "\n", "`tar_inp` = \"SOS A lion in the jungle is sleeping\"\n", "\n", "`tar_real` = \"A lion in the jungle is sleeping EOS\"\n", "\n", "Transformer 是一个自回归(auto-regressive)模型:它一次作一个部分的预测,然后使用到目前为止的自身的输出来决定下一步要做什么。\n", "\n", "在训练过程中,本示例使用了 teacher-forcing 的方法(就像[文本生成教程](./text_generation.ipynb)中一样)。无论模型在当前时间步骤下预测出什么,teacher-forcing 方法都会将真实的输出传递到下一个时间步骤上。\n", "\n", "当 
transformer 预测每个词时,*自注意力(self-attention)*功能使它能够查看输入序列中前面的单词,从而更好地预测下一个单词。\n", "\n", "为了防止模型在期望的输出上达到峰值,模型使用了前瞻遮挡(look-ahead mask)。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LKpoA6q1sJFj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "EPOCHS = 20" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iJwmp9OE29oj", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "# 该 @tf.function 将追踪-编译 train_step 到 TF 图中,以便更快地\n", "# 执行。该函数专用于参数张量的精确形状。为了避免由于可变序列长度或可变\n", "# 批次大小(最后一批次较小)导致的再追踪,使用 input_signature 指定\n", "# 更多的通用形状。\n", "\n", "train_step_signature = [\n", " tf.TensorSpec(shape=(None, None), dtype=tf.int64),\n", " tf.TensorSpec(shape=(None, None), dtype=tf.int64),\n", "]\n", "\n", "@tf.function(input_signature=train_step_signature)\n", "def train_step(inp, tar):\n", " tar_inp = tar[:, :-1]\n", " tar_real = tar[:, 1:]\n", " \n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)\n", " \n", " with tf.GradientTape() as tape:\n", " predictions, _ = transformer(inp, tar_inp, \n", " True, \n", " enc_padding_mask, \n", " combined_mask, \n", " dec_padding_mask)\n", " loss = loss_function(tar_real, predictions)\n", "\n", " gradients = tape.gradient(loss, transformer.trainable_variables) \n", " optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))\n", " \n", " train_loss(loss)\n", " train_accuracy(tar_real, predictions)" ] }, { "cell_type": "markdown", "metadata": { "id": "qM2PDWGDJ_8V" }, "source": [ "葡萄牙语作为输入语言,英语为目标语言。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bbvmaKNiznHZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "for epoch in range(EPOCHS):\n", " start = time.time()\n", " \n", " train_loss.reset_states()\n", " train_accuracy.reset_states()\n", " \n", " # inp -> portuguese, tar -> english\n", " for (batch, (inp, tar)) in enumerate(train_dataset):\n", " train_step(inp, tar)\n", " \n", " if batch % 50 == 0:\n", " print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(\n", " epoch + 1, batch, train_loss.result(), train_accuracy.result()))\n", " \n", " if (epoch + 1) % 5 == 0:\n", " ckpt_save_path = ckpt_manager.save()\n", " print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,\n", " ckpt_save_path))\n", " \n", " print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, \n", " train_loss.result(), \n", " train_accuracy.result()))\n", "\n", " print ('Time taken for 1 epoch: {} secs\\n'.format(time.time() - start))" ] }, { "cell_type": "markdown", "metadata": { "id": "QfcsSWswSdGV" }, "source": [ "## 评估(Evaluate)" ] }, { "cell_type": "markdown", "metadata": { "id": "y6APsFrgImLW" }, "source": [ "以下步骤用于评估:\n", "\n", "* 用葡萄牙语分词器(`tokenizer_pt`)编码输入语句。此外,添加开始和结束标记,这样输入就与模型训练的内容相同。这是编码器输入。\n", "* 解码器输入为 `start token == tokenizer_en.vocab_size`。\n", "* 计算填充遮挡和前瞻遮挡。\n", "* `解码器`通过查看`编码器输出`和它自身的输出(自注意力)给出预测。\n", "* 选择最后一个词并计算它的 argmax。\n", "* 将预测的词连接到解码器输入,然后传递给解码器。\n", "* 在这种方法中,解码器根据它预测的之前的词预测下一个。\n", "\n", "Note:这里使用的模型具有较小的能力以保持相对较快,因此预测可能不太正确。要复现论文中的结果,请使用全部数据集,并通过修改上述超参数来使用基础 transformer 模型或者 transformer XL。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5buvMlnvyrFm", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def evaluate(inp_sentence):\n", " start_token = [tokenizer_pt.vocab_size]\n", " end_token = [tokenizer_pt.vocab_size + 1]\n", " \n", " # 输入语句是葡萄牙语,增加开始和结束标记\n", " inp_sentence = start_token + 
tokenizer_pt.encode(inp_sentence) + end_token\n", " encoder_input = tf.expand_dims(inp_sentence, 0)\n", " \n", " # 因为目标是英语,输入 transformer 的第一个词应该是\n", " # 英语的开始标记。\n", " decoder_input = [tokenizer_en.vocab_size]\n", " output = tf.expand_dims(decoder_input, 0)\n", " \n", " for i in range(MAX_LENGTH):\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(\n", " encoder_input, output)\n", " \n", " # predictions.shape == (batch_size, seq_len, vocab_size)\n", " predictions, attention_weights = transformer(encoder_input, \n", " output,\n", " False,\n", " enc_padding_mask,\n", " combined_mask,\n", " dec_padding_mask)\n", " \n", " # 从 seq_len 维度选择最后一个词\n", " predictions = predictions[: ,-1:, :] # (batch_size, 1, vocab_size)\n", "\n", " predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)\n", " \n", " # 如果 predicted_id 等于结束标记,就返回结果\n", " if predicted_id == tokenizer_en.vocab_size+1:\n", " return tf.squeeze(output, axis=0), attention_weights\n", " \n", " # 连接 predicted_id 与输出,作为解码器的输入传递到解码器。\n", " output = tf.concat([output, predicted_id], axis=-1)\n", "\n", " return tf.squeeze(output, axis=0), attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CN-BV43FMBej", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def plot_attention_weights(attention, sentence, result, layer):\n", " fig = plt.figure(figsize=(16, 8))\n", " \n", " sentence = tokenizer_pt.encode(sentence)\n", " \n", " attention = tf.squeeze(attention[layer], axis=0)\n", " \n", " for head in range(attention.shape[0]):\n", " ax = fig.add_subplot(2, 4, head+1)\n", " \n", " # 画出注意力权重\n", " ax.matshow(attention[head][:-1, :], cmap='viridis')\n", "\n", " fontdict = {'fontsize': 10}\n", " \n", " ax.set_xticks(range(len(sentence)+2))\n", " ax.set_yticks(range(len(result)))\n", " \n", " ax.set_ylim(len(result)-1.5, -0.5)\n", " \n", " ax.set_xticklabels(\n", " ['']+[tokenizer_pt.decode([i]) for i in sentence]+[''], \n", " fontdict=fontdict, rotation=90)\n", " \n", " ax.set_yticklabels([tokenizer_en.decode([i]) for i in result \n", " if i < tokenizer_en.vocab_size], \n", " fontdict=fontdict)\n", " \n", " ax.set_xlabel('Head {}'.format(head+1))\n", " \n", " plt.tight_layout()\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lU2_yG_vBGza", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "def translate(sentence, plot=''):\n", " result, attention_weights = evaluate(sentence)\n", " \n", " predicted_sentence = tokenizer_en.decode([i for i in result \n", " if i < tokenizer_en.vocab_size]) \n", "\n", " print('Input: {}'.format(sentence))\n", " print('Predicted translation: {}'.format(predicted_sentence))\n", " \n", " if plot:\n", " plot_attention_weights(attention_weights, sentence, result, plot)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YsxrAlvFG8SZ", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"este é um problema que temos que resolver.\")\n", "print (\"Real translation: this is a problem we have to solve .\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7EH5y_aqI4t1", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"os meus vizinhos ouviram sobre esta ideia.\")\n", "print (\"Real translation: and my neighboring homes heard about this idea .\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J-hVCTSUMlkb", "vscode": { "languageId": "python" } }, "outputs": 
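[], "source": [
"# 仅作演示(非教程原文):查看 evaluate() 返回的注意力权重字典,\n",
"# 其中的键(例如 'decoder_layer4_block2')可以作为下文 translate() 的 plot 参数。\n",
"result, attention_weights = evaluate(\"este é um problema que temos que resolver.\")\n",
"for name, weights in attention_weights.items():\n",
"  print(name, weights.shape)" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 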
[], "source": [ "translate(\"vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.\")\n", "print (\"Real translation: so i 'll just share with you some stories very quickly of some magical things that have happened .\")" ] }, { "cell_type": "markdown", "metadata": { "id": "_1MxkSZvz0jX" }, "source": [ "您可以为 `plot` 参数传递不同的层和解码器的注意力模块。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t-kFyiOLH0xg", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "translate(\"este é o primeiro livro que eu fiz.\", plot='decoder_layer4_block2')\n", "print (\"Real translation: this is the first book i've ever done.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RqQ1fIsLwkGE" }, "source": [ "## 总结\n", "\n", "在本教程中,您已经学习了位置编码,多头注意力,遮挡的重要性以及如何创建一个 transformer。\n", "\n", "尝试使用一个不同的数据集来训练 transformer。您可也可以通过修改上述的超参数来创建基础 transformer 或者 transformer XL。您也可以使用这里定义的层来创建 [BERT](https://arxiv.org/abs/1810.04805) 并训练最先进的模型。此外,您可以实现 beam search 得到更好的预测。" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "s_qNSzzyaCbD" ], "name": "transformer.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }