{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "b518b04cbfe0" }, "outputs": [], "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "906e07f6e562", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "6e083398b477" }, "source": [ "# 使用预处理层" ] }, { "cell_type": "markdown", "metadata": { "id": "64010bd23c2e" }, "source": [ "
"
Embedding
模式组合的实际使用情况。\n",
"\n",
"请注意,在训练此类模型时,为了获得最佳性能,您应始终使用 `TextVectorization` 层作为输入流水线的一部分。"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "28c2f2ff61fb"
},
"source": [
"### 通过多热编码将文本编码为 ngram 的密集矩阵\n",
"\n",
"这是预处理要传递到 `Dense` 层的文本时应采用的方式。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7bae1c223cd8",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Define some text data to adapt the layer\n",
"adapt_data = tf.constant(\n",
" [\n",
" \"The Brain is wider than the Sky\",\n",
" \"For put them side by side\",\n",
" \"The one the other will contain\",\n",
" \"With ease and You beside\",\n",
" ]\n",
")\n",
"# Instantiate TextVectorization with \"multi_hot\" output_mode\n",
"# and ngrams=2 (index all bigrams)\n",
"text_vectorizer = layers.TextVectorization(output_mode=\"multi_hot\", ngrams=2)\n",
"# Index the bigrams via `adapt()`\n",
"text_vectorizer.adapt(adapt_data)\n",
"\n",
"# Try out the layer\n",
"print(\n",
" \"Encoded text:\\n\", text_vectorizer([\"The Brain is deeper than the sea\"]).numpy(),\n",
")\n",
"\n",
"# Create a simple model\n",
"inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))\n",
"outputs = layers.Dense(1)(inputs)\n",
"model = keras.Model(inputs, outputs)\n",
"\n",
"# Create a labeled dataset (which includes unknown tokens)\n",
"train_dataset = tf.data.Dataset.from_tensor_slices(\n",
" ([\"The Brain is deeper than the sea\", \"for if they are held Blue to Blue\"], [1, 0])\n",
")\n",
"\n",
"# Preprocess the string inputs, turning them into int sequences\n",
"train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))\n",
"# Train the model on the int sequences\n",
"print(\"\\nTraining model...\")\n",
"model.compile(optimizer=\"rmsprop\", loss=\"mse\")\n",
"model.fit(train_dataset)\n",
"\n",
"# For inference, you can export a model that accepts strings as input\n",
"inputs = keras.Input(shape=(1,), dtype=\"string\")\n",
"x = text_vectorizer(inputs)\n",
"outputs = model(x)\n",
"end_to_end_model = keras.Model(inputs, outputs)\n",
"\n",
"# Call the end-to-end model on test data (which includes unknown tokens)\n",
"print(\"\\nCalling end-to-end model on test string...\")\n",
"test_data = tf.constant([\"The one the other will absorb\"])\n",
"test_output = end_to_end_model(test_data)\n",
"print(\"Model output:\", test_output)"
]
},
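{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, you can inspect the layer's learned vocabulary with `get_vocabulary()`: each column of the multi-hot output corresponds to one vocabulary term."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Inspect the vocabulary learned by `adapt()` above: it holds the\n",
"# unigrams and bigrams of the adapt data, plus an out-of-vocabulary\n",
"# token that unknown terms fall into at inference time.\n",
"vocab = text_vectorizer.get_vocabulary()\n",
"print(\"Vocabulary size:\", len(vocab))\n",
"print(\"First few terms:\", vocab[:10])"
]
},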
{
"cell_type": "markdown",
"metadata": {
"id": "336a4d3426ed"
},
"source": [
"### 通过 TF-IDF 加权将文本编码为 ngram 的密集矩阵\n",
"\n",
"这是在将文本传递到 `Dense` 层之前对其进行预处理的另一种方式。"
]
},
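{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, with the `\"tf-idf\"` output mode each term's entry is its in-sample count scaled by an inverse document frequency weight learned during `adapt()`. Per the Keras documentation for this mode, the weight is computed as `idf = log(1 + num_documents / (1 + token_document_count))`: with the four adapt sentences below, a term appearing in all four documents gets `idf = log(1 + 4/5) ≈ 0.59`, while a term appearing in just one gets `idf = log(1 + 4/2) ≈ 1.10`, so rarer terms are weighted more heavily."
]
},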
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5b6c0fec928e",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Define some text data to adapt the layer\n",
"adapt_data = tf.constant(\n",
" [\n",
" \"The Brain is wider than the Sky\",\n",
" \"For put them side by side\",\n",
" \"The one the other will contain\",\n",
" \"With ease and You beside\",\n",
" ]\n",
")\n",
"# Instantiate TextVectorization with \"tf-idf\" output_mode\n",
"# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)\n",
"text_vectorizer = layers.TextVectorization(output_mode=\"tf-idf\", ngrams=2)\n",
"# Index the bigrams and learn the TF-IDF weights via `adapt()`\n",
"\n",
"with tf.device(\"CPU\"):\n",
" # A bug that prevents this from running on GPU for now.\n",
" text_vectorizer.adapt(adapt_data)\n",
"\n",
"# Try out the layer\n",
"print(\n",
" \"Encoded text:\\n\", text_vectorizer([\"The Brain is deeper than the sea\"]).numpy(),\n",
")\n",
"\n",
"# Create a simple model\n",
"inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))\n",
"outputs = layers.Dense(1)(inputs)\n",
"model = keras.Model(inputs, outputs)\n",
"\n",
"# Create a labeled dataset (which includes unknown tokens)\n",
"train_dataset = tf.data.Dataset.from_tensor_slices(\n",
" ([\"The Brain is deeper than the sea\", \"for if they are held Blue to Blue\"], [1, 0])\n",
")\n",
"\n",
"# Preprocess the string inputs, turning them into int sequences\n",
"train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))\n",
"# Train the model on the int sequences\n",
"print(\"\\nTraining model...\")\n",
"model.compile(optimizer=\"rmsprop\", loss=\"mse\")\n",
"model.fit(train_dataset)\n",
"\n",
"# For inference, you can export a model that accepts strings as input\n",
"inputs = keras.Input(shape=(1,), dtype=\"string\")\n",
"x = text_vectorizer(inputs)\n",
"outputs = model(x)\n",
"end_to_end_model = keras.Model(inputs, outputs)\n",
"\n",
"# Call the end-to-end model on test data (which includes unknown tokens)\n",
"print(\"\\nCalling end-to-end model on test string...\")\n",
"test_data = tf.constant([\"The one the other will absorb\"])\n",
"test_output = end_to_end_model(test_data)\n",
"print(\"Model output:\", test_output)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "143ce01c5558"
},
"source": [
"## 重要问题\n",
"\n",
"### 处理包含非常大的词汇的查找层\n",
"\n",
"您可能会在 `TextVectorization`、`StringLookup` 层或 `IntegerLookup` 层中处理非常大的词汇。通常,大于 500MB 的词汇就会被视为“非常大”。\n",
"\n",
"在这种情况下,为了获得最佳性能,您应避免使用 `adapt()`。相反,应提前预先计算您的词汇(可使用 Apache Beam 或 TF Transform 来实现)并将其存储在文件中。然后,在构建时将文件路径作为 `vocabulary` 参数传递,以将词汇加载到层中。\n",
"\n",
"### 在 TPU pod 上或与 `ParameterServerStrategy` 一起使用查找层。\n",
"\n",
"有一个未解决的问题,它会导致在 TPU pod 上或通过 `ParameterServerStrategy` 在多台计算机上进行训练时,使用 `TextVectorization`、`StringLookup` 或 `IntegerLookup` 层时出现性能下降。该问题预计将在 TensorFlow 2.7 中得到修正。"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "preprocessing_layers.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}