Unicode strings

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


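Introduction

Models that process natural language often handle different languages with different character sets. Unicode is a standard encoding system used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.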
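To make the idea of a code point concrete, here is a minimal plain-Python sketch that does not involve TensorFlow. It relies only on the built-in `ord` and `chr` functions and on `sys.maxunicode`; the sample characters and variable names are chosen purely for illustration.

```python
import sys

# A few sample characters from different scripts (illustrative choices).
samples = ["A", "é", "语", "😊"]

# ord() maps a one-character string to its integer code point.
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # !a prints an ASCII-safe representation of the character.
    print(f"{ch!a} -> U+{cp:04X} ({cp})")

# chr() is the inverse of ord().
round_trip = [chr(cp) for cp in code_points]

# The largest code point Python accepts matches the Unicode upper bound.
max_code_point = sys.maxunicode
```

Every value in `code_points` falls in the range 0 to 0x10FFFF, which is the range described above.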

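This tutorial shows how to represent Unicode strings in TensorFlow and how to manipulate them using Unicode equivalents of standard string ops. It then separates Unicode strings into tokens based on script detection.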

import tensorflow as tf

The `tf.string` data type

The basic TensorFlow tf.string dtype lets you build tensors of byte strings. Unicode strings are UTF-8 encoded by default.

tf.constant(u"Thanks 😊")

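A tf.string tensor treats byte strings as atomic units, which lets it hold byte strings of varying lengths. The string length is not part of the tensor's dimensions.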

tf.constant([u"You're", u"welcome!"]).shape

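Note: When constructing strings in Python, Unicode handling differs between Python 2 and Python 3. In Python 2, Unicode strings are marked with a "u" prefix, as above. In Python 3, strings are Unicode by default, and the "u" prefix is accepted but optional.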
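The following plain-Python sketch illustrates the note above under Python 3. It shows that the "u" prefix makes no difference, and it encodes the text to the UTF-8 bytes that a byte-string tensor would hold. The variable names are illustrative only.

```python
# In Python 3 the "u" prefix is accepted but has no effect.
text = "Thanks 😊"
same_text = u"Thanks 😊"

# Encoding produces raw bytes; this is the form a byte string takes.
utf8_bytes = text.encode("utf-8")

# A str is measured in code points, a bytes object in bytes.
print(type(text).__name__, len(text))
print(type(utf8_bytes).__name__, len(utf8_bytes))
```

The two lengths differ because the emoji is a single code point that occupies four bytes in UTF-8.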

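Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:

string scalar - the sequence of code points is encoded using a known character encoding.
int32 vector - each position contains a single code point.

For example, the following three values all represent the Unicode string "语言处理" (Chinese for "language processing"):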

# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8

# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

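Converting between representations

TensorFlow provides operations to convert between these representations:

tf.strings.unicode_decode: Converts an encoded string scalar to a vector of code points.
tf.strings.unicode_encode: Converts a vector of code points to an encoded string scalar.
tf.strings.unicode_transcode: Converts an encoded string scalar to a different encoding.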

tf.strings.unicode_decode(text_utf8,
                          input_encoding='UTF-8')

tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')

tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF8',
                             output_encoding='UTF-16-BE')
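For comparison, the three conversions above can be mimicked in plain Python with `bytes.decode`, `str.encode`, `ord`, and `chr`. This is only a sketch of what the conversions compute, not how TensorFlow implements them; the helper names `decode`, `encode`, and `transcode` are hypothetical.

```python
def decode(encoded, encoding):
    """Encoded byte string -> list of integer code points."""
    return [ord(ch) for ch in encoded.decode(encoding)]


def encode(code_points, encoding):
    """List of integer code points -> encoded byte string."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


def transcode(encoded, input_encoding, output_encoding):
    """Re-encode a byte string from one encoding into another."""
    return encoded.decode(input_encoding).encode(output_encoding)


text_utf8 = "语言处理".encode("utf-8")
text_chars = decode(text_utf8, "utf-8")
text_utf16be = transcode(text_utf8, "utf-8", "utf-16-be")

print(text_chars)
print(len(text_utf8), "bytes in UTF-8;", len(text_utf16be), "bytes in UTF-16-BE")
```

Decoding either encoded form yields the same list of code points, which is why the code point vector is a convenient encoding-independent representation.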

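Batch dimensions

When decoding multiple strings, the number of characters in each string may differ. The result is a tf.RaggedTensor, in which the length of the innermost dimension varies with the number of characters in each string: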

# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo',  u'What is the weather tomorrow',  u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)

You can use this tf.RaggedTensor directly, or convert it to a dense, padded tf.Tensor or to a tf.SparseTensor using the tf.RaggedTensor.to_tensor and tf.RaggedTensor.to_sparse methods.

batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

batch_chars_sparse = batch_chars_ragged.to_sparse()
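To show what the padded and sparse forms contain, here is a plain-Python sketch that builds both from nested lists. The sample strings, the `PAD` sentinel, and the `indices`/`values`/`dense_shape` names are illustrative assumptions that loosely mirror a coordinate-style sparse layout; they are not TensorFlow objects.

```python
# Ragged batch: one list of code points per string, with differing lengths.
batch = ["hi", "hello", "😊"]
ragged = [[ord(ch) for ch in s] for s in batch]

# Dense form: pad every row up to the longest row with a sentinel value.
PAD = -1
max_len = max(len(row) for row in ragged)
padded = [row + [PAD] * (max_len - len(row)) for row in ragged]

# Sparse (coordinate-style) form: keep only the real entries and their positions.
indices = [[i, j] for i, row in enumerate(ragged) for j in range(len(row))]
values = [cp for row in ragged for cp in row]
dense_shape = [len(ragged), max_len]

for row in padded:
    print(row)
```

The padded form spends memory on sentinel values, while the sparse form stores one index pair per real character. Which one is preferable depends on how uneven the string lengths are.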

When encoding multiple strings of the same length, you can use a tf.Tensor as the input:

tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]],
                          output_encoding='UTF-8')

When encoding multiple strings of varying length, use a tf.RaggedTensor as the input:

tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')

If you have a tensor that holds multiple strings in padded or sparse format, convert it to a tf.RaggedTensor before calling unicode_encode:

tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')

tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')
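The reason for converting a padded tensor back to ragged form first is that padding values are not valid characters and must not be encoded. A minimal plain-Python sketch of that step follows, assuming a `-1` sentinel and removing only the trailing padding in each row. The helper name `strip_trailing_padding` is hypothetical.

```python
PAD = -1


def strip_trailing_padding(row, pad=PAD):
    """Drop the run of padding values at the end of a row."""
    end = len(row)
    while end > 0 and row[end - 1] == pad:
        end -= 1
    return row[:end]


# Padded batch of code points for "cat", "dog", and "bird".
padded = [
    [99, 97, 116, PAD],
    [100, 111, 103, PAD],
    [98, 105, 114, 100],
]

ragged = [strip_trailing_padding(row) for row in padded]
encoded = ["".join(chr(cp) for cp in row).encode("utf-8") for row in ragged]
print(encoded)
```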

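Unicode operations

Character length

The tf.strings.length op has a unit parameter that indicates how lengths should be computed. unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR", to determine the number of Unicode code points in each encoded string.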

# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
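The gap between byte length and character length can be reproduced in plain Python. The sketch below counts bytes, counts code points by decoding, and also counts code points directly from the bytes by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The last approach is shown only as an illustration and assumes the input is valid UTF-8.

```python
text = "Thanks 😊"
utf8 = text.encode("utf-8")

# Length in bytes.
num_bytes = len(utf8)

# Length in code points, obtained by decoding first.
num_chars = len(utf8.decode("utf-8"))

# Same count without decoding: every character contributes exactly one
# byte that is NOT a continuation byte (0b10xxxxxx).
num_chars_from_bytes = sum((b & 0xC0) != 0x80 for b in utf8)

# Per-character width in bytes.
widths = [len(ch.encode("utf-8")) for ch in text]

print(f"{num_bytes} bytes; {num_chars} UTF-8 characters")
print(widths)
```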

Character substrings

Similarly, the tf.strings.substr op accepts a unit parameter and uses it to determine what kind of offsets the pos and len parameters contain.

# default: unit='BYTE'. With len=1, we return a single byte.
tf.strings.substr(thanks, pos=7, len=1).numpy()

# Specifying unit='UTF8_CHAR', we return a single character, which in this case
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())
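The difference between byte offsets and character offsets can also be seen with ordinary Python slicing. In this sketch, a `bytes` slice stands in for byte units and a `str` slice for character units; it is an analogy, not TensorFlow's implementation.

```python
text = "Thanks 😊"
utf8 = text.encode("utf-8")
pos, length = 7, 1

# Byte units: the slice cuts into the middle of the 4-byte emoji.
byte_slice = utf8[pos:pos + length]

# Character units: decode, slice by code point, then re-encode.
char_slice = utf8.decode("utf-8")[pos:pos + length].encode("utf-8")

# A lone leading byte is not valid UTF-8 on its own.
try:
    byte_slice.decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

print(byte_slice, char_slice, byte_slice_is_valid)
```

Slicing by bytes is cheap but can split a character, so character units are the safer choice whenever the result must remain valid text.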

Splitting Unicode strings

The tf.strings.unicode_split operation splits Unicode strings into substrings of individual characters:

tf.strings.unicode_split(thanks, 'UTF-8').numpy()

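Byte offsets for characters

To align the character tensor generated by tf.strings.unicode_decode with the original string, it is useful to know the offset at which each character begins. The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.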

codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))
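As a cross-check on what the offsets mean, the following plain-Python sketch computes the same kind of result by accumulating the UTF-8 width of each character. The helper `decode_with_offsets` is a hypothetical stand-in written for this illustration.

```python
def decode_with_offsets(encoded):
    """Return (code_points, byte_offsets) for a UTF-8 byte string."""
    code_points, offsets = [], []
    offset = 0
    for ch in encoded.decode("utf-8"):
        code_points.append(ord(ch))
        offsets.append(offset)
        # Advance by the number of bytes this character occupies.
        offset += len(ch.encode("utf-8"))
    return code_points, offsets


encoded = "🎈🎉🎊".encode("utf-8")
code_points, offsets = decode_with_offsets(encoded)

for cp, off in zip(code_points, offsets):
    print(f"At byte offset {off}: codepoint {cp}")

# Offsets can be used to slice the original bytes at character boundaries.
second_char = encoded[offsets[1]:offsets[2]].decode("utf-8")
```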

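Unicode scripts

Each Unicode code point belongs to a single collection of code points known as a script. A character's script helps determine which language the character might be in. For example, knowing that 'Б' is in the Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.

TensorFlow provides the tf.strings.unicode_script operation to determine which script a given code point uses. The script codes are int32 values corresponding to International Components for Unicode (ICU) UScriptCode values.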

uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

The tf.strings.unicode_script operation can also be applied to multidimensional tf.Tensors or tf.RaggedTensors of code points:

print(tf.strings.unicode_script(batch_chars_ragged))
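Raw script codes are hard to read, so a small lookup table can help when inspecting results. The sketch below maps the handful of ICU UScriptCode values that appear in this tutorial to their names. The `SCRIPT_NAMES` table and the `label_scripts` helper are hypothetical conveniences, and the table is deliberately incomplete. The input is assumed to be a nested Python list, such as the result of calling `.numpy().tolist()` or `.to_list()` on the op's output.

```python
# Subset of ICU UScriptCode values used in this tutorial (not exhaustive).
SCRIPT_NAMES = {
    0: "USCRIPT_COMMON",
    8: "USCRIPT_CYRILLIC",
    17: "USCRIPT_HAN",
    20: "USCRIPT_HIRAGANA",
    25: "USCRIPT_LATIN",
}


def label_scripts(codes):
    """Recursively replace integer script codes with readable names."""
    if isinstance(codes, list):
        return [label_scripts(code) for code in codes]
    return SCRIPT_NAMES.get(codes, f"UNKNOWN({codes})")


labels = label_scripts([17, 8])
nested = label_scripts([[25, 0], [17, 20, 99]])
print(labels)
print(nested)
```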

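Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters separate words, but some languages (such as Chinese and Japanese) do not use spaces, and some languages (such as German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (roughly, "New York stock prices").

We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This works for strings like the "NY株価" example above. It also works for most languages that use spaces, because the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.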

# dtype: string; shape: [num_sentences]
#
# The sentences to process.  Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']

First, we decode the sentences into character code points and look up the script identifier of each character.

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)

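Next, we use those script identifiers to determine where word boundaries should be added. We add a word boundary at the beginning of each sentence, and at each character whose script differs from that of the previous character.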

# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

We can then use those start offsets to build a RaggedTensor containing the list of words from all batches:

# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
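The boundary and row-start steps above can be traced by hand in plain Python. The sketch below starts from per-character script codes for the two example sentences, written out as literal lists on the assumption that Latin is 25, common is 0, Han is 17, and Hiragana is 20 (the ICU values). It then derives the word-start flags, the flattened start indices, and the resulting words.

```python
sentences = ["Hello, world.", "世界こんにちは"]

# Script code per character (assumed ICU values, typed in by hand).
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

# A character starts a word if it is first in its sentence or if its
# script differs from the script of the previous character.
starts_word = [
    [True] + [cur != prev for prev, cur in zip(row, row[1:])]
    for row in sentence_char_script
]

# Indices of word starts in the flattened character list.
flat_flags = [flag for row in starts_word for flag in row]
word_starts = [i for i, flag in enumerate(flat_flags) if flag]

# Cut the flattened characters at those indices.
flat_chars = [ch for s in sentences for ch in s]
bounds = word_starts + [len(flat_chars)]
words = ["".join(flat_chars[a:b]) for a, b in zip(bounds, bounds[1:])]

print(word_starts)
print([w.encode("utf-8") for w in words])
```

At this stage the words from all sentences sit in one flat list; regrouping them by sentence requires the per-sentence word counts.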

Finally, we can segment the word code point RaggedTensor back into sentences:

# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)

To make the final result easier to read, we can encode it back into UTF-8 strings:

tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
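For readers who want to experiment with the idea without TensorFlow, here is a compact plain-Python sketch of script-change segmentation. Python's standard library does not expose the Unicode script property, so the hypothetical `rough_script` helper approximates it: non-letters are treated as "COMMON", and letters are labeled with the first word of their Unicode character name. This is a crude heuristic and not ICU's script assignment, so results can differ from tf.strings.unicode_script on other inputs.

```python
import itertools
import unicodedata


def rough_script(ch):
    """Heuristic script label (an assumption, NOT the ICU script property)."""
    if not unicodedata.category(ch).startswith("L"):
        return "COMMON"  # spaces, punctuation, digits, symbols, ...
    name = unicodedata.name(ch, "")
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "HIRAGANA LETTER KO" -> "HIRAGANA"
    return name.split(" ")[0] if name else "UNKNOWN"


def segment(sentence):
    """Split a sentence wherever the rough script label changes."""
    return [
        "".join(group)
        for _, group in itertools.groupby(sentence, key=rough_script)
    ]


segmented = [segment(s) for s in ["Hello, world.", "世界こんにちは", "NY株価"]]
for words in segmented:
    print([w.encode("utf-8") for w in words])
```

Using `itertools.groupby` keeps runs of equal labels together, which plays the same role as the boundary flags and row starts in the tensor-based version.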