
T5GemmaSeq2SeqLMPreprocessor layer

`T5GemmaSeq2SeqLMPreprocessor` class

keras_hub.models.T5GemmaSeq2SeqLMPreprocessor(
    tokenizer,
    encoder_sequence_length=512,
    decoder_sequence_length=512,
    add_start_token=False,
    add_end_token=True,
    **kwargs
)

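T5Gemma Seq2Seq LM preprocessor.

This preprocessing layer is meant for use with keras_hub.models.T5GemmaSeq2SeqLM. By default, it takes in batches of strings and returns outputs in a (x, y, sample_weight) format, where the y label is the next token ID in the x sequence.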
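The next-token relationship between x and y can be illustrated without the library. The following is a minimal sketch in plain Python, not KerasHub's implementation: the token IDs, the padding ID of 0, and the helper name `shift_for_training` are all assumptions made for illustration.

```python
def shift_for_training(token_ids, padding_mask):
    """Split one padded decoder sequence into (x, y, sample_weight).

    Hypothetical helper: x holds every position but the last and y holds
    every position but the first, so y[i] is the token that follows x[i].
    """
    x = token_ids[:-1]
    y = token_ids[1:]
    # A label counts toward the loss only where the label is a real token.
    sample_weight = padding_mask[1:]
    return x, y, sample_weight


# Made-up IDs: four real tokens followed by two padding positions.
token_ids = [714, 4320, 8426, 1, 0, 0]
padding_mask = [1, 1, 1, 1, 0, 0]

x, y, sample_weight = shift_for_training(token_ids, padding_mask)
print(x)              # [714, 4320, 8426, 1, 0]
print(y)              # [4320, 8426, 1, 0, 0]
print(sample_weight)  # [1, 1, 1, 0, 0]
```

In this sketch the sample weights are simply the padding mask shifted by one position, so labels that fall on padding contribute nothing.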

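For use with generation, the layer also exposes two methods, generate_preprocess() and generate_postprocess(). When this preprocessor is attached to a keras_hub.models.T5GemmaSeq2SeqLM instance, these methods are called implicitly in generate(). They can also be called standalone (for example, to precompute preprocessed inputs for generation in a separate process).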

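Arguments

tokenizer: A keras_hub.models.T5GemmaTokenizer instance.
encoder_sequence_length: The length of the padded encoder inputs.
decoder_sequence_length: The length of the padded decoder inputs.
add_start_token: If True, the preprocessor prepends the tokenizer start token to each input sequence. For T5Gemma models, this should be False. Defaults to False.
add_end_token: If True, the preprocessor appends the tokenizer end token to each input sequence. For T5Gemma models, this should be True. Defaults to True.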
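How the sequence-length arguments and the start/end flags shape a single sequence can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the layer's code: `pack` is a hypothetical helper, the IDs 2, 1 and 0 for start, end and padding are assumed, and the truncation policy used here (cut the text, keep the end token) is one plausible choice that the real layer may not share.

```python
def pack(token_ids, sequence_length, add_start_token=False,
         add_end_token=True, start_id=2, end_id=1, pad_id=0):
    """Return (ids, mask), each with exactly `sequence_length` entries."""
    ids = list(token_ids)
    if add_start_token:
        ids = [start_id] + ids
    if add_end_token:
        ids = ids + [end_id]
    if len(ids) > sequence_length:
        # Assumed truncation policy: cut the text, keep the end token.
        ids = ids[:sequence_length]
        if add_end_token and ids:
            ids[-1] = end_id
    mask = [1] * len(ids)
    padding = sequence_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


ids, mask = pack([714, 4320, 8426], sequence_length=6)
print(ids)   # [714, 4320, 8426, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With the defaults shown in the class signature (no start token, end token enabled), three input tokens occupy four positions and the remaining two are padding.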

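Call arguments

x: A dictionary with two keys, "encoder_text" and "decoder_text". The values can be a string, a tf.Tensor, or a list of Python strings.
y: Label data. Should always be None, as the layer generates labels.
sample_weight: Label weights. Should always be None, as the layer generates label weights.
encoder_sequence_length: Pass to override the layer's configured encoder_sequence_length.
decoder_sequence_length: Pass to override the layer's configured decoder_sequence_length.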

Examples

import keras_hub
import numpy as np
import tensorflow as tf

# Load the preprocessor from a preset.
preprocessor = keras_hub.models.T5GemmaSeq2SeqLMPreprocessor.from_preset(
    "t5gemma_b_b_prefixlm_it"
)

# Tokenize a batch of sentences. The layer expects a dictionary that
# provides both encoder and decoder text.
preprocessor({
    "encoder_text": ["The quick brown fox jumped.", "Call me Ishmael."],
    "decoder_text": ["The fast fox.", "I am Ishmael."],
})
# Tokenize a dictionary with separate encoder and decoder inputs.
preprocessor({
    "encoder_text": "The quick brown fox jumped.",
    "decoder_text": "The fast fox."
})

# Apply tokenization to a `tf.data.Dataset`.
encoder_features = tf.constant(["The quick brown fox.", "Call me Ishmael."])
decoder_features = tf.constant(["The fast fox.", "I am Ishmael."])
ds = tf.data.Dataset.from_tensor_slices(
    {"encoder_text": encoder_features, "decoder_text": decoder_features}
)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Prepare tokens for generation.
preprocessor.generate_preprocess({
    "encoder_text": "The quick brown fox jumped.",
    "decoder_text": "The fast fox."
})

# Map generation outputs back to strings.
preprocessor.generate_postprocess({
    'decoder_token_ids': np.array([[2, 714, 4320, 8426, 25341, 1, 0, 0]]),
    'decoder_padding_mask': np.array([[1, 1, 1, 1, 1, 1, 0, 0]]),
})

`from_preset` method

T5GemmaSeq2SeqLMPreprocessor.from_preset(
    preset, config_file="preprocessor.json", **kwargs
)

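Instantiate a keras_hub.models.Preprocessor from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

a built-in preset identifier like 'bert_base_en'
a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
a Hugging Face handle like 'hf://user/bert_base_en'
a path to a local preset directory like './bert_base_en'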
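The four accepted forms can be told apart by their prefix. The sketch below only illustrates that distinction; `describe_preset` is a hypothetical helper and does not reflect how KerasHub resolves presets internally.

```python
def describe_preset(preset):
    """Classify a preset string by its form (illustrative only)."""
    if preset.startswith("kaggle://"):
        return "Kaggle Models handle"
    if preset.startswith("hf://"):
        return "Hugging Face handle"
    # Assumption: local paths are written with an explicit path prefix.
    if preset.startswith(("./", "../", "/", "~")):
        return "local preset directory"
    return "built-in preset identifier"


for preset in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(f"{preset!r}: {describe_preset(preset)}")
```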

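For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_hub.models.BertTextClassifierPreprocessor.from_preset().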

Arguments

preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.

Examples

import keras_hub

# Load a preprocessor for Gemma generation.
preprocessor = keras_hub.models.CausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_hub.models.TextClassifierPreprocessor.from_preset(
    "bert_base_en",
)

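Preset	Parameters	Description
t5gemma_s_s_ul2	312.52M	T5Gemma S/S model with a small encoder and small decoder, adapted as a UL2 model.
t5gemma_s_s_prefixlm	312.52M	T5Gemma S/S model with a small encoder and small decoder, adapted as a prefix language model.
t5gemma_s_s_ul2_it	312.52M	T5Gemma S/S model with a small encoder and small decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_s_s_prefixlm_it	312.52M	T5Gemma S/S model with a small encoder and small decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_b_b_ul2	591.49M	T5Gemma B/B model with a base encoder and base decoder, adapted as a UL2 model.
t5gemma_b_b_prefixlm	591.49M	T5Gemma B/B model with a base encoder and base decoder, adapted as a prefix language model.
t5gemma_b_b_ul2_it	591.49M	T5Gemma B/B model with a base encoder and base decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_b_b_prefixlm_it	591.49M	T5Gemma B/B model with a base encoder and base decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_l_l_ul2	1.24B	T5Gemma L/L model with a large encoder and large decoder, adapted as a UL2 model.
t5gemma_l_l_prefixlm	1.24B	T5Gemma L/L model with a large encoder and large decoder, adapted as a prefix language model.
t5gemma_l_l_ul2_it	1.24B	T5Gemma L/L model with a large encoder and large decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_l_l_prefixlm_it	1.24B	T5Gemma L/L model with a large encoder and large decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_ml_ml_ul2	2.20B	T5Gemma ML/ML model with a medium-large encoder and medium-large decoder, adapted as a UL2 model.
t5gemma_ml_ml_prefixlm	2.20B	T5Gemma ML/ML model with a medium-large encoder and medium-large decoder, adapted as a prefix language model.
t5gemma_ml_ml_ul2_it	2.20B	T5Gemma ML/ML model with a medium-large encoder and medium-large decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_ml_ml_prefixlm_it	2.20B	T5Gemma ML/ML model with a medium-large encoder and medium-large decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_xl_xl_ul2	3.77B	T5Gemma XL/XL model with an extra-large encoder and extra-large decoder, adapted as a UL2 model.
t5gemma_xl_xl_prefixlm	3.77B	T5Gemma XL/XL model with an extra-large encoder and extra-large decoder, adapted as a prefix language model.
t5gemma_xl_xl_ul2_it	3.77B	T5Gemma XL/XL model with an extra-large encoder and extra-large decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_xl_xl_prefixlm_it	3.77B	T5Gemma XL/XL model with an extra-large encoder and extra-large decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_2b_2b_ul2	5.60B	T5Gemma 2B/2B model with a 2-billion-parameter encoder and 2-billion-parameter decoder, adapted as a UL2 model.
t5gemma_2b_2b_prefixlm	5.60B	T5Gemma 2B/2B model with a 2-billion-parameter encoder and 2-billion-parameter decoder, adapted as a prefix language model.
t5gemma_2b_2b_ul2_it	5.60B	T5Gemma 2B/2B model with a 2-billion-parameter encoder and 2-billion-parameter decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_2b_2b_prefixlm_it	5.60B	T5Gemma 2B/2B model with a 2-billion-parameter encoder and 2-billion-parameter decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_9b_2b_ul2	12.29B	T5Gemma 9B/2B model with a 9-billion-parameter encoder and 2-billion-parameter decoder, adapted as a UL2 model.
t5gemma_9b_2b_prefixlm	12.29B	T5Gemma 9B/2B model with a 9-billion-parameter encoder and 2-billion-parameter decoder, adapted as a prefix language model.
t5gemma_9b_2b_ul2_it	12.29B	T5Gemma 9B/2B model with a 9-billion-parameter encoder and 2-billion-parameter decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_9b_2b_prefixlm_it	12.29B	T5Gemma 9B/2B model with a 9-billion-parameter encoder and 2-billion-parameter decoder, adapted as a prefix language model and fine-tuned for instruction following.
t5gemma_9b_9b_ul2	20.33B	T5Gemma 9B/9B model with a 9-billion-parameter encoder and 9-billion-parameter decoder, adapted as a UL2 model.
t5gemma_9b_9b_prefixlm	20.33B	T5Gemma 9B/9B model with a 9-billion-parameter encoder and 9-billion-parameter decoder, adapted as a prefix language model.
t5gemma_9b_9b_ul2_it	20.33B	T5Gemma 9B/9B model with a 9-billion-parameter encoder and 9-billion-parameter decoder, adapted as a UL2 model and fine-tuned for instruction following.
t5gemma_9b_9b_prefixlm_it	20.33B	T5Gemma 9B/9B model with a 9-billion-parameter encoder and 9-billion-parameter decoder, adapted as a prefix language model and fine-tuned for instruction following.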
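The preset names in the table follow a visible pattern: `t5gemma`, the encoder size, the decoder size, the adaptation objective (`ul2` or `prefixlm`), and an optional `it` suffix for instruction-tuned variants. As a convenience, the hypothetical helper below splits a name into those parts; it is inferred from the table alone and is not a KerasHub API.

```python
def parse_t5gemma_preset(name):
    """Split a T5Gemma preset name into its components.

    Assumes the pattern t5gemma_<encoder>_<decoder>_<objective>[_it]
    and raises ValueError for anything else.
    """
    parts = name.split("_")
    if len(parts) not in (4, 5) or parts[0] != "t5gemma":
        raise ValueError(f"Unexpected preset name: {name!r}")
    if parts[3] not in ("ul2", "prefixlm"):
        raise ValueError(f"Unknown objective in preset name: {name!r}")
    if len(parts) == 5 and parts[4] != "it":
        raise ValueError(f"Unknown suffix in preset name: {name!r}")
    return {
        "encoder": parts[1],
        "decoder": parts[2],
        "objective": parts[3],
        "instruction_tuned": len(parts) == 5,
    }


print(parse_t5gemma_preset("t5gemma_9b_2b_prefixlm_it"))
```

A helper like this can be handy for filtering a list of preset names, for example to keep only the instruction-tuned variants.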

`generate_preprocess` method

T5GemmaSeq2SeqLMPreprocessor.generate_preprocess(
    x, encoder_sequence_length=None, decoder_sequence_length=None, sequence_length=None
)

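Convert input strings to integer token inputs for generation.

Similar to calling the layer for training, this method takes in a dictionary containing "encoder_text" and "decoder_text", whose values can be strings or string tensors. It tokenizes and pads the inputs and computes a padding mask that marks every position holding a real token rather than a padding value.

Unlike calling the layer for training, this method does not compute labels and never appends a tokenizer.end_token_id to the end of the decoder sequence (as generation is expected to continue from the end of the supplied decoder prompt).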
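The practical effect of leaving off the end token can be sketched as follows: the padded decoder prompt ends in padding rather than an end token, so the first padded slot is where generated tokens would go. The helper `pad_prompt`, the token IDs, and the padding ID of 0 are assumptions for illustration; this is not the library's implementation.

```python
def pad_prompt(prompt_ids, sequence_length, pad_id=0):
    """Pad a tokenized decoder prompt without appending an end token."""
    ids = list(prompt_ids)[:sequence_length]
    mask = [1] * len(ids)
    padding = sequence_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


ids, mask = pad_prompt([714, 4320, 8426], sequence_length=6)
# The number of real tokens is also the index of the first free slot.
next_position = sum(mask)
print(ids)            # [714, 4320, 8426, 0, 0, 0]
print(mask)           # [1, 1, 1, 0, 0, 0]
print(next_position)  # 3
```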

`generate_postprocess` method

T5GemmaSeq2SeqLMPreprocessor.generate_postprocess(x)

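Convert integer token outputs to strings for generation.

This method reverses generate_preprocess() by first removing all padding and start/end tokens, and then converting the integer sequence back to a string.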
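The filtering step can be sketched in plain Python using the token IDs from the generate_postprocess() call in the class examples earlier on this page. The special-token IDs (2 for start, 1 for end, 0 for padding) are assumptions, `strip_special_tokens` is a hypothetical helper, and the final detokenization step is left out because it needs the actual tokenizer.

```python
def strip_special_tokens(token_ids, padding_mask, special_ids=(0, 1, 2)):
    """Keep only unmasked positions that are not special tokens."""
    return [
        token
        for token, keep in zip(token_ids, padding_mask)
        if keep and token not in special_ids
    ]


token_ids = [2, 714, 4320, 8426, 25341, 1, 0, 0]
padding_mask = [1, 1, 1, 1, 1, 1, 0, 0]
content_ids = strip_special_tokens(token_ids, padding_mask)
print(content_ids)  # [714, 4320, 8426, 25341]
```

Only the remaining content IDs would then be passed to the tokenizer to recover text.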

`tokenizer` property

keras_hub.models.T5GemmaSeq2SeqLMPreprocessor.tokenizer

The tokenizer used to tokenize strings.
