{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "oL9KopJirB2g" }, "outputs": [], "source": [ "##### Copyright 2018 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "SKaX3Hd3ra6C", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "AAK88XQ9Pm9N" }, "source": [ "# Unicode 字符串" ] }, { "cell_type": "markdown", "metadata": { "id": "0TD5ZrvEMbhZ" }, "source": [ "
\n",
" \n",
" ![]() | \n",
" \n",
" \n",
" ![]() | \n",
" \n",
" \n",
" ![]() | \n",
" \n",
" ![]() | \n",
"
0
和 `0x10FFFF` 之间的唯一整数[码位](https://en.wikipedia.org/wiki/Code_point)进行编码。*Unicode 字符串*是由零个或更多码位组成的序列。\n",
"\n",
"本教程介绍了如何在 TensorFlow 中表示 Unicode 字符串,以及如何使用标准字符串运算的 Unicode 等效项对其进行操作。它会根据字符体系检测将 Unicode 字符串划分为不同词例。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OIKHl5Lvn4gh",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n-LkcI-vtWNj"
},
"source": [
"## `tf.string` 数据类型\n",
"\n",
"您可以使用基本的 TensorFlow `tf.string` `dtype` 构建字节字符串张量。Unicode 字符串默认使用 UTF-8 编码。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3yo-Qv6ntaFr",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.constant(u\"Thanks 😊\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2kA1ziG2tyCT"
},
"source": [
"`tf.string` 张量可以容纳不同长度的字节字符串,因为字节字符串会被视为原子单元。字符串长度不包括在张量维度中。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eyINCmTztyyS",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.constant([u\"You're\", u\"welcome!\"]).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jsMPnjb6UDJ1"
},
"source": [
"注:使用 Python 构造字符串时,v2 和 v3 对 Unicode 的处理方式有所不同。在 v2 中,Unicode 字符串用前缀“u”表示(如上所示)。在 v3 中,字符串默认使用 Unicode 编码。"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hUFZ7B1Lk-uj"
},
"source": [
"## 表示 Unicode\n",
"\n",
"在 TensorFlow 中有两种表示 Unicode 字符串的标准方式:\n",
"\n",
"- `string` 标量 - 使用已知[字符编码](https://en.wikipedia.org/wiki/Character_encoding)对码位序列进行编码。\n",
"- `int32` 向量 - 每个位置包含单个码位。\n",
"\n",
"例如,以下三个值均表示 Unicode 字符串 `\"语言处理\"`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cjQIkfJWvC_u",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Unicode string, represented as a UTF-8 encoded string scalar.\n",
"text_utf8 = tf.constant(u\"语言处理\")\n",
"text_utf8"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yQqcUECcvF2r",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Unicode string, represented as a UTF-16-BE encoded string scalar.\n",
"text_utf16be = tf.constant(u\"语言处理\".encode(\"UTF-16-BE\"))\n",
"text_utf16be"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ExdBr1t7vMuS",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Unicode string, represented as a vector of Unicode code points.\n",
"text_chars = tf.constant([ord(char) for char in u\"语言处理\"])\n",
"text_chars"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B8czv4JNpBnZ"
},
"source": [
"### 在不同表示之间进行转换\n",
"\n",
"TensorFlow 提供了在下列不同表示之间进行转换的运算:\n",
"\n",
"- `tf.strings.unicode_decode`:将编码的字符串标量转换为码位的向量。\n",
"- `tf.strings.unicode_encode`:将码位的向量转换为编码的字符串标量。\n",
"- `tf.strings.unicode_transcode`:将编码的字符串标量转换为其他编码。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qb-UQ_oLpAJg",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_decode(text_utf8,\n",
" input_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kEBUcunnp-9n",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode(text_chars,\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0MLhWcLZrph-",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_transcode(text_utf8,\n",
" input_encoding='UTF8',\n",
" output_encoding='UTF-16-BE')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QVeLeVohqN7I"
},
"source": [
"### 批次维度\n",
"\n",
"解码多个字符串时,每个字符串中的字符数可能不相等。返回结果是 [`tf.RaggedTensor`](../../guide/ragged_tensor.ipynb),其中最里面的维度的长度会根据每个字符串中的字符数而变化:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "N2jVzPymr_Mm",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# A batch of Unicode strings, each represented as a UTF8-encoded string.\n",
"batch_utf8 = [s.encode('UTF-8') for s in\n",
" [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]\n",
"batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,\n",
" input_encoding='UTF-8')\n",
"for sentence_chars in batch_chars_ragged.to_list():\n",
" print(sentence_chars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iRh3n1hPsJ9v"
},
"source": [
"您可以直接使用此 `tf.RaggedTensor`,也可以使用 `tf.RaggedTensor.to_tensor` 和 `tf.RaggedTensor.to_sparse` 方法将其转换为带有填充的密集 `tf.Tensor` 或 `tf.SparseTensor`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yz17yeSMsUid",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)\n",
"print(batch_chars_padded.numpy())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kBjsPQp3rhfm",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"batch_chars_sparse = batch_chars_ragged.to_sparse()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GCCkZh-nwlbL"
},
"source": [
"在对多个具有相同长度的字符串进行编码时,可以将 `tf.Tensor` 用作输入:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_lP62YUAwjK9",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]],\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w58CMRg9tamW"
},
"source": [
"当对多个具有不同长度的字符串进行编码时,应将 `tf.RaggedTensor` 用作输入:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d7GtOtrltaMl",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T2Nh5Aj9xob3"
},
"source": [
"如果您的张量具有填充或稀疏格式的多个字符串,请在调用 `unicode_encode` 之前将其转换为 `tf.RaggedTensor`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R2bYCYl0u-Ue",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode(\n",
" tf.RaggedTensor.from_sparse(batch_chars_sparse),\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UlV2znh_u_zm",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode(\n",
" tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hQOOGkscvDpc"
},
"source": [
"## Unicode 运算"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NkmtsA_yvMB0"
},
"source": [
"### 字符长度\n",
"\n",
"`tf.strings.length` 运算具有 `unit` 参数,该参数表示计算长度的方式。`unit` 默认为 `\"BYTE\"`,但也可以将其设置为其他值(例如 `\"UTF8_CHAR\"` 或 `\"UTF16_CHAR\"`),以确定每个已编码 `string` 中的 Unicode 码位数量。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1ZzMe59mvLHr",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Note that the final character takes up 4 bytes in UTF8.\n",
"thanks = u'Thanks 😊'.encode('UTF-8')\n",
"num_bytes = tf.strings.length(thanks).numpy()\n",
"num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()\n",
"print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fHG85gxlvVU0"
},
"source": [
"### 字符子字符串\n",
"\n",
"类似地,`tf.strings.substr` 运算会接受 \"`unit`\" 参数,并用它来确定 \"`pos`\" 和 \"`len`\" 参数包含的偏移类型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WlWRLV-4xWYq",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# default: unit='BYTE'. With len=1, we return a single byte.\n",
"tf.strings.substr(thanks, pos=7, len=1).numpy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JfNUVDPwxkCS",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# Specifying unit='UTF8_CHAR', we return a single character, which in this case\n",
"# is 4 bytes.\n",
"print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zJUEsVSyeIa3"
},
"source": [
"### 拆分 Unicode 字符串\n",
"\n",
"`tf.strings.unicode_split` 运算会将 Unicode 字符串拆分为单个字符的子字符串:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dDjkh5G1ejMt",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_split(thanks, 'UTF-8').numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HQqEEZEbdG9O"
},
"source": [
"### 字符的字节偏移量\n",
"\n",
"为了将 `tf.strings.unicode_decode` 生成的字符张量与原始字符串对齐,了解每个字符开始位置的偏移量很有用。方法 `tf.strings.unicode_decode_with_offsets` 与 `unicode_decode` 类似,不同的是它会返回包含每个字符起始偏移量的第二张量。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Cug7cmwYdowd",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"codepoints, offsets = tf.strings.unicode_decode_with_offsets(u\"🎈🎉🎊\", 'UTF-8')\n",
"\n",
"for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):\n",
" print(\"At byte offset {}: codepoint {}\".format(offset, codepoint))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2ZnCNxOvx66T"
},
"source": [
"## Unicode 字符体系"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nRRHqkqNyGZ6"
},
"source": [
"每个 Unicode 码位都属于某个码位集合,这些集合被称作[字符体系](https://en.wikipedia.org/wiki/Script_%28Unicode%29)。某个字符的字符体系有助于确定该字符可能所属的语言。例如,已知 'Б' 属于西里尔字符体系,表明包含该字符的现代文本很可能来自某个斯拉夫语种(如俄语或乌克兰语)。\n",
"\n",
"TensorFlow 提供了 `tf.strings.unicode_script` 运算来确定某一给定码位使用的是哪个字符体系。字符体系代码是对应于[国际 Unicode 组件](http://site.icu-project.org/home) (ICU) [`UScriptCode`](http://icu-project.org/apiref/icu4c/uscript_8h.html) 值的 `int32` 值。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K7DeYHrRyFPy",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"uscript = tf.strings.unicode_script([33464, 1041]) # ['芸', 'Б']\n",
"\n",
"print(uscript.numpy()) # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2fW992a1lIY6"
},
"source": [
"`tf.strings.unicode_script` 运算还可以应用于码位的多维 `tf.Tensor` 或 `tf.RaggedTensor`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uR7b8meLlFnp",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"print(tf.strings.unicode_script(batch_chars_ragged))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mx7HEFpBzEsB"
},
"source": [
"## 示例:简单分词\n",
"\n",
"分词是将文本拆分为类似单词的单元的任务。当使用空格字符分隔单词时,这通常很容易,但是某些语言(如中文和日语)不使用空格,而某些语言(如德语)中存在长复合词,必须进行拆分才能分析其含义。在网页文本中,不同语言和字符体系常常混合在一起,例如“NY株価”(纽约证券交易所)。\n",
"\n",
"我们可以利用字符体系的变化进行粗略分词(不实现任何 ML 模型),从而估算词边界。这对类似上面“NY株価”示例的字符串都有效。这种方法对大多数使用空格的语言也都有效,因为各种字符体系中的空格字符都归类为 USCRIPT_COMMON,这是一种特殊的字符体系代码,不同于任何实际文本。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "grsvFiC4BoPb",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# dtype: string; shape: [num_sentences]\n",
"#\n",
"# The sentences to process. Edit this line to try out different inputs!\n",
"sentence_texts = [u'Hello, world.', u'世界こんにちは']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CapnbShuGU8i"
},
"source": [
"首先,我们将句子解码为字符码位,然后查找每个字符的字符体系标识符。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ReQVcDQh1MB8",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_codepoint[i, j] is the codepoint for the j'th character in\n",
"# the i'th sentence.\n",
"sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')\n",
"print(sentence_char_codepoint)\n",
"\n",
"# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_scripts[i, j] is the unicode script of the j'th character in\n",
"# the i'th sentence.\n",
"sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)\n",
"print(sentence_char_script)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O2fapF5UGcUc"
},
"source": [
"接下来,我们使用这些字符体系标识符来确定添加词边界的位置。我们在每个句子的开头添加一个词边界;如果某个字符与前一个字符属于不同的字符体系,也为该字符添加词边界。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7v5W6MOr1Rlc",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_starts_word[i, j] is True if the j'th character in the i'th\n",
"# sentence is the start of a word.\n",
"sentence_char_starts_word = tf.concat(\n",
" [tf.fill([sentence_char_script.nrows(), 1], True),\n",
" tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],\n",
" axis=1)\n",
"\n",
"# dtype: int64; shape: [num_words]\n",
"#\n",
"# word_starts[i] is the index of the character that starts the i'th word (in\n",
"# the flattened list of characters from all sentences).\n",
"word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)\n",
"print(word_starts)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LAwh-1QkGuC9"
},
"source": [
"然后,我们可以使用这些起始偏移量来构建 `RaggedTensor`,它包含了所有批次的单词列表:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bNiA1O_eBBCL",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# dtype: int32; shape: [num_words, (num_chars_per_word)]\n",
"#\n",
"# word_char_codepoint[i, j] is the codepoint for the j'th character in the\n",
"# i'th word.\n",
"word_char_codepoint = tf.RaggedTensor.from_row_starts(\n",
" values=sentence_char_codepoint.values,\n",
" row_starts=word_starts)\n",
"print(word_char_codepoint)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "66a2ZnYmG2ao"
},
"source": [
"最后,我们可以将词码位 `RaggedTensor` 划分回句子中:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NCfwcqLSEjZb",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"# dtype: int64; shape: [num_sentences]\n",
"#\n",
"# sentence_num_words[i] is the number of words in the i'th sentence.\n",
"sentence_num_words = tf.reduce_sum(\n",
" tf.cast(sentence_char_starts_word, tf.int64),\n",
" axis=1)\n",
"\n",
"# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]\n",
"#\n",
"# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character\n",
"# in the j'th word in the i'th sentence.\n",
"sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(\n",
" values=word_char_codepoint,\n",
" row_lengths=sentence_num_words)\n",
"print(sentence_word_char_codepoint)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xWaX8WcbHyqY"
},
"source": [
"为了使最终结果更易于阅读,我们可以将其重新编码为 UTF-8 字符串:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HSivquOgFr3C",
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"oL9KopJirB2g"
],
"name": "unicode.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}