Author: fchollet
Date created: 2020/05/05
Last modified: 2020/05/05
Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.
import os
# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"
import pathlib
import numpy as np
import tensorflow.data as tf_data
import keras
from keras import layers
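We'll be working with the Newsgroup20 dataset, a collection of 20,000 message board posts belonging to 20 different topic categories.
For the pre-trained word embeddings, we'll use GloVe embeddings.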
data_path = keras.utils.get_file(
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)
fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])
Number of directories: 20
Directory names: ['comp.sys.ibm.pc.hardware', 'comp.os.ms-windows.misc', 'comp.windows.x', 'sci.space', 'sci.crypt', 'sci.med', 'alt.atheism', 'rec.autos', 'rec.sport.hockey', 'talk.politics.misc', 'talk.politics.mideast', 'rec.motorcycles', 'talk.politics.guns', 'misc.forsale', 'sci.electronics', 'talk.religion.misc', 'comp.graphics', 'soc.religion.christian', 'comp.sys.mac.hardware', 'rec.sport.baseball']
Number of files in comp.graphics: 1000
Some example filenames: ['39638', '38747', '38242', '39057', '39031']
print(open(data_dir / "comp.graphics" / "38987").read())
Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: [email protected] (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <[email protected]>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: [email protected]
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7
Jasen Mabus
RPI student
I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to [email protected].
Thank you in advance,
Jasen Mabus
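As you can see, there are header lines that leak the file's category, either explicitly (the first line is literally the category name) or implicitly, e.g. via the `Organization` field. Let's get rid of the headers: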
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # latin-1 can decode any byte sequence, so odd characters don't raise
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        # Drop the first 10 lines, which hold the header
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1
print("Classes:", class_names)
print("Number of samples:", len(samples))
Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Number of samples: 19997
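The loop above drops a fixed number of leading lines. As an illustrative alternative (not what this example does), the sketch below assumes each post follows the common convention of a header block terminated by the first blank line, and keeps only the text after it. The helper name `strip_header` and the sample post are hypothetical.

```python
def strip_header(raw_post: str) -> str:
    """Return the body of a post, assuming headers end at the first blank line."""
    header, separator, body = raw_post.partition("\n\n")
    # If there is no blank line at all, treat the whole text as the body.
    return body if separator else raw_post


example_post = (
    "Newsgroups: comp.graphics\n"
    "Subject: Looking for Brain in CAD\n"
    "Lines: 7\n"
    "\n"
    "I am looking for a brain model in any CAD format.\n"
    "Thank you in advance,\n"
)

body = strip_header(example_post)
print(body)
```

Whether this is preferable to a fixed line count depends on how consistently the files are formatted, which this sketch does not check.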
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
# Re-seed so that labels get the same permutation as samples
rng = np.random.RandomState(seed)
rng.shuffle(labels)
# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
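Samples and labels stay aligned after shuffling only if both lists receive the same permutation, which two generators started from the same seed will produce for lists of equal length. A minimal sketch of that idea with toy data, using the standard-library `random` module instead of NumPy to stay dependency-free (the toy lists are hypothetical):

```python
import random

toy_samples = ["post a", "post b", "post c", "post d", "post e"]
toy_labels = [0, 1, 2, 3, 4]
# Remember which label belongs to which sample before shuffling.
original_pairs = dict(zip(toy_samples, toy_labels))

seed = 1337
random.Random(seed).shuffle(toy_samples)
# A fresh generator with the same seed replays the same permutation.
random.Random(seed).shuffle(toy_labels)

# Hold out the last 20% as validation data (at least one item here).
num_val = max(1, int(0.2 * len(toy_samples)))
train_pairs = list(zip(toy_samples[:-num_val], toy_labels[:-num_val]))
val_pairs = list(zip(toy_samples[-num_val:], toy_labels[-num_val:]))
print(train_pairs, val_pairs)
```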
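Let's use `TextVectorization` to index the vocabulary found in the dataset. Later, we'll use the same layer instance to vectorize the samples.
Our layer will only consider the top 20,000 words, and will truncate or pad sequences to be exactly 200 tokens long.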
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
# Build the vocabulary from the training text
vectorizer.adapt(text_ds)
You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's print the top 5 words:
vectorizer.get_vocabulary()[:5]
['', '[UNK]', 'the', 'to', 'of']
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]
array([ 2, 3480, 1818, 15, 2, 5830])
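As you can see, "the" is represented as "2". Why not 0, given that "the" is the first word in the vocabulary? That's because index 0 is reserved for padding and index 1 is reserved for "out of vocabulary" tokens.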
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]
[2, 3480, 1818, 15, 2, 5830]
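To make the reserved indices concrete, here is a simplified pure-Python stand-in for what the layer does conceptually: index 0 for padding, index 1 for out-of-vocabulary tokens, most frequent words first, and every sequence truncated or right-padded to a fixed length. It is only a sketch; the real `TextVectorization` layer also lowercases text and strips punctuation by default, and the helper names here are hypothetical.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Most frequent words first, after the two reserved entries."""
    counts = Counter(word for text in texts for word in text.split())
    # Reserve two slots: "" (padding) and "[UNK]" (out of vocabulary).
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocabulary, sequence_length):
    """Map words to indices, then truncate or right-pad with 0."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    ids = [index_of.get(word, 1) for word in text.split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["the cat sat", "the cat sat on", "the cat", "the"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=5)
print(toy_vocab)
print(vectorize("the dog sat on", toy_vocab, sequence_length=6))
```

Here "dog" was never seen and "on" fell outside the 5-token budget, so both map to index 1, and the trailing zeros are padding.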
Let's download pre-trained GloVe embeddings (an 822M zip file).
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip
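The archive contains text-encoded vectors of various sizes: 50-dimensional, 100-dimensional, 200-dimensional, and 300-dimensional. We'll use the 100D ones.
Let's make a dict mapping words (strings) to their NumPy vector representation: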
path_to_glove_file = "glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
print("Found %s word vectors." % len(embeddings_index))
Found 400000 word vectors.
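Each line of the GloVe text file is a word followed by its space-separated coefficients. The sketch below parses two made-up lines with 4-dimensional vectors (the real file has 100 values per line) into plain Python floats, to show the shape of the resulting dictionary without needing the download.

```python
import io

# Two hypothetical lines in the same "word v1 v2 ... vn" layout.
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3 0.4\n"
    "cat 0.5 0.6 -0.7 0.8\n"
)

toy_embeddings_index = {}
for line in fake_glove_file:
    # Split once: the word, then the rest of the line as one string.
    word, coefs = line.split(maxsplit=1)
    toy_embeddings_index[word] = [float(value) for value in coefs.split()]

print(toy_embeddings_index["cat"])
```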
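Now, let's prepare a corresponding embedding matrix that we can use in a Keras `Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained vector for the word of index `i` in our `vectorizer`'s vocabulary.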
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0
# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
Converted 18021 words (1979 misses)
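The same bookkeeping on a toy vocabulary makes the row layout easier to see. This sketch uses nested Python lists in place of a NumPy array, and the vocabulary and vectors are made up for illustration.

```python
toy_voc = ["", "[UNK]", "the", "cat", "zzyzx"]
toy_word_index = dict(zip(toy_voc, range(len(toy_voc))))
toy_pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
dim = 2

# Start from all zeros so that words without a vector keep a zero row.
toy_matrix = [[0.0] * dim for _ in range(len(toy_voc))]
hits, misses = 0, 0
for word, i in toy_word_index.items():
    vector = toy_pretrained.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
        hits += 1
    else:
        misses += 1

print(hits, misses)
print(toy_matrix)
```

The padding entry, the out-of-vocabulary entry, and the unknown word all count as misses and keep their zero rows.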
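Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.
Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to update them during training).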
from keras.layers import Embedding
embedding_layer = Embedding(
A simple 1D convnet with global max pooling and a classifier at the end.
int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, None)              │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ embedding (Embedding)           │ (None, None, 100)         │  2,000,200 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ conv1d (Conv1D)                 │ (None, None, 128)         │     64,128 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling1d (MaxPooling1D)    │ (None, None, 128)         │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ conv1d_1 (Conv1D)               │ (None, None, 128)         │     82,048 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ max_pooling1d_1 (MaxPooling1D)  │ (None, None, 128)         │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ conv1d_2 (Conv1D)               │ (None, None, 128)         │     82,048 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ global_max_pooling1d            │ (None, 128)               │          0 │
│ (GlobalMaxPooling1D)            │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, 128)               │     16,512 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dropout (Dropout)               │ (None, 128)               │          0 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_1 (Dense)                 │ (None, 20)                │      2,580 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 2,247,516 (8.57 MB)
Trainable params: 2,247,516 (8.57 MB)
Non-trainable params: 0 (0.00 B)
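The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas (an embedding table has `rows * dim` weights; a `Conv1D` layer has `kernel_size * in_channels * filters` weights plus one bias per filter; a `Dense` layer has `inputs * units` weights plus one bias per unit). It also traces how a 200-token sequence would shrink through the convolution and pooling layers, assuming the default "valid" padding and pooling strides equal to the pool size.

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    return inputs * units + units


param_counts = {
    "embedding": 20002 * 100,
    "conv1d": conv1d_params(5, 100, 128),
    "conv1d_1": conv1d_params(5, 128, 128),
    "conv1d_2": conv1d_params(5, 128, 128),
    "dense": dense_params(128, 128),
    "dense_1": dense_params(128, 20),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)

# Sequence length after each conv/pool layer for a 200-token input.
length = 200
lengths = []
for layer, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    if layer == "conv":
        length = length - size + 1
    else:
        length = (length - size) // size + 1
    lengths.append(length)
print(lengths)
```

The total matches the 2,247,516 parameters reported above, and only 3 time steps remain before the global max pooling layer collapses them into a single 128-dimensional vector.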
First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays are right-padded.
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()
y_train = np.array(train_labels)
y_val = np.array(val_labels)
We use categorical cross-entropy as our loss since we're doing softmax classification. More precisely, we use `sparse_categorical_crossentropy` since our labels are integers.
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))
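For a single example, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class. The integer label simply selects that probability, so no one-hot vector has to be built. A small sketch with made-up probabilities, comparing it against the one-hot formulation (the function names mirror the loss names but are plain Python written for illustration, not the Keras implementations):

```python
import math


def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example with an integer class label."""
    return -math.log(probabilities[label])


def categorical_crossentropy(one_hot, probabilities):
    """Loss for one example with a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probabilities))


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for 3 classes
sparse_loss = sparse_categorical_crossentropy(0, probs)
dense_loss = categorical_crossentropy([1, 0, 0], probs)
print(sparse_loss, dense_loss)
```

Both calls give the same value; the sparse form just avoids materializing a 20-wide one-hot vector per label.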
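Now, we may want to export a `Model` object that takes a string of arbitrary length as input, rather than a sequence of indices. This would make the model much more portable, since you wouldn't have to worry about the input preprocessing pipeline.
Our `vectorizer` is actually a Keras layer, so it's simple: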
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)
probabilities = end_to_end_model(
    keras.ops.convert_to_tensor(
        [["this message is about computer graphics and 3D modeling"]]
    )
)
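The model returns one probability per class for each input, so the predicted topic is the class name at the index of the largest value. A dependency-free sketch with a shortened, hypothetical class list and made-up probabilities:

```python
toy_class_names = ["alt.atheism", "comp.graphics", "sci.space"]
# Hypothetical model output: one row of probabilities per input string.
toy_probabilities = [[0.05, 0.85, 0.10]]

row = toy_probabilities[0]
# Index of the largest probability in the row (a plain-Python argmax).
predicted_index = max(range(len(row)), key=row.__getitem__)
predicted_class = toy_class_names[predicted_index]
print(predicted_class)
```

With the real model, `np.argmax(probabilities[0])` would play the role of `predicted_index`, indexing into `class_names`.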