compute_sentence_piece_proto function

keras_nlp.tokenizers.compute_sentence_piece_proto(
    data, vocabulary_size, model_type="unigram", proto_output_file=None, lowercase=False
)

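A utility to train a SentencePiece vocabulary.

Trains a SentencePiece vocabulary from an input dataset or a list of filenames.

If data is a list of filenames, each file must be a plain text file; the text is read line by line during training.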
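The plain-text requirement above can be met with a small helper like the sketch below. The name `write_corpus` is hypothetical and not part of KerasNLP, and the UTF-8 encoding is an assumption rather than something this page states.

```python
import os
import tempfile


def write_corpus(sentences, path):
    """Write one sentence per line as plain text (hypothetical helper)."""
    # Assumption: UTF-8 encoding; this page only says "plain text".
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # Collapse runs of whitespace (including embedded newlines)
            # so that each sentence occupies exactly one line.
            f.write(" ".join(sentence.split()) + "\n")
    return path


corpus_path = write_corpus(
    ["Drifting Along", "the quick brown fox."],
    os.path.join(tempfile.mkdtemp(), "corpus.txt"),
)
```

The resulting path could then be passed as `data=[corpus_path]`.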

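Arguments

data: A tf.data.Dataset, or a list of filenames.
vocabulary_size: int. The maximum size of the vocabulary to be trained.
model_type: str. The model algorithm; must be one of "unigram", "bpe", "word" or "char". Defaults to "unigram".
proto_output_file: str. If provided, it is used as the model_file passed to model_writer. If None, the model_file is an io.BytesIO object. Defaults to None.
lowercase: bool. If True, the input text is lowercased before tokenization. Defaults to False.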

Returns

A bytes object containing the serialized SentencePiece proto, or None if proto_output_file is provided.
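The return contract described above (bytes when no output file is given, None when one is) can be illustrated with a standard-library sketch. `write_model` is a hypothetical name, the payload is a dummy byte string, and this is not KerasNLP's actual implementation; it only mirrors the documented behavior.

```python
import io
import os
import tempfile
from typing import Optional


def write_model(payload: bytes, proto_output_file: Optional[str] = None) -> Optional[bytes]:
    """Hypothetical helper mirroring the documented return contract."""
    if proto_output_file is not None:
        # A file path was given: write to disk and return nothing.
        with open(proto_output_file, "wb") as f:
            f.write(payload)
        return None
    # No path given: collect the bytes in memory and return them.
    buffer = io.BytesIO()
    buffer.write(payload)
    return buffer.getvalue()


in_memory = write_model(b"dummy-proto")
model_path = os.path.join(tempfile.mkdtemp(), "model.spm")
on_disk = write_model(b"dummy-proto", model_path)
```

In practice, callers either keep the returned bytes and pass them as `proto=...`, or pass the file path, as the examples on this page do.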

Examples

Basic usage (from a dataset).

>>> import tensorflow as tf
>>> import keras_nlp
>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_nlp.tokenizers.compute_sentence_piece_proto(inputs, vocabulary_size=15)
>>> tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
...     print(output)
tf.Tensor([ 4  8 12  5  9 14  5  6 13  4  7 10 11  6 13],
shape=(15,), dtype=int32)

Basic usage (with files).

with open("test.txt", "w+") as f:
    f.write("Drifting Along\n")
inputs = ["test.txt"]
proto = keras_nlp.tokenizers.compute_sentence_piece_proto(
     inputs, vocabulary_size=15, proto_output_file="model.spm")
tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto="model.spm")
ds = tf.data.Dataset.from_tensor_slices(["the quick brown fox."])
ds = ds.map(tokenizer)

Usage with lowercase.

>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_nlp.tokenizers.compute_sentence_piece_proto(
...     inputs, vocabulary_size=15, lowercase=True)
>>> tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
...     print(output)
tf.Tensor([ 4  8 12  5  9 14  5  6 13  4  7 10 11  6 13],
shape=(15,), dtype=int32)
