Keshawn_lu's Blog

Andrew Ng's Team NLP: C3_W2_Assignment

2021/08/28

Task: explore recurrent neural networks (RNNs)

Part 1: Convert every character in a line of text to its Unicode integer; the resulting list is called a tensor

def line_to_tensor(line, EOS_int=1):
    """Turns a line of text into a tensor

    Args:
        line (str): A single line of text.
        EOS_int (int, optional): End-of-sentence integer. Defaults to 1.

    Returns:
        list: a list of integers (unicode values) for the characters in the `line`.
    """

    tensor = []
    for c in line:
        c_int = ord(c)
        tensor.append(c_int)

    tensor.append(EOS_int)  # end-of-sentence marker

    return tensor
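As a quick check of the conversion, the sketch below restates the function compactly so that it runs on its own and applies it to a short string. The sample string is an arbitrary choice for illustration, not part of the assignment data.

```python
def line_to_tensor(line, EOS_int=1):
    """Compact restatement: Unicode code points followed by the EOS integer."""
    return [ord(c) for c in line] + [EOS_int]


sample = "abc xyz"  # arbitrary sample text, not from the assignment data
tensor = line_to_tensor(sample)
print(tensor)  # [97, 98, 99, 32, 120, 121, 122, 1]
```

The tensor is always one element longer than the input line, because of the end-of-sentence integer.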

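1.2 Implement the batch data generator

Convert each line of text into a NumPy array of integers, padding with the integer 0 so that every line in a batch has the same length.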

import random as rnd

import trax.fastmath.numpy as np  # assumed, as in the assignment notebook


def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    """Generator function that yields batches of data

    Args:
        batch_size (int): number of examples (in this case, sentences) per batch.
        max_length (int): maximum length of the output tensor.
            NOTE: max_length includes the end-of-sentence character that will be added
            to the tensor.
            Keep in mind that the length of the tensor is always 1 + the length
            of the original line of characters.
        data_lines (list): list of the sentences to group into batches.
        line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
        shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
        tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
        NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    index = 0

    cur_batch = []

    # total number of text lines
    num_lines = len(data_lines)

    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]

    if shuffle:
        rnd.shuffle(lines_index)

    while True:
        # start a new pass (and reshuffle) once every line has been visited
        if index >= num_lines:
            index = 0
            if shuffle:
                rnd.shuffle(lines_index)

        line = data_lines[lines_index[index]]

        # keep only lines that still fit after the EOS integer is appended
        if len(line) < max_length:
            cur_batch.append(line)

        index += 1

        if len(cur_batch) == batch_size:

            batch = []
            mask = []

            for li in cur_batch:
                tensor = line_to_tensor(li)  # convert the text line to a tensor

                # pad with zeros up to max_length
                pad = [0] * (max_length - len(tensor))
                tensor_pad = tensor + pad

                batch.append(tensor_pad)

                # A mask for tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                example_mask = [0 if t == 0 else 1 for t in tensor_pad]
                mask.append(example_mask)

            # convert to numpy arrays
            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)

            # inputs, targets, mask
            # the second value is the same as the first and is used for evaluation
            yield batch_np_arr, batch_np_arr, mask_np_arr

            # reset for the next batch
            cur_batch = []
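The padding and mask step can be tried in isolation. The sketch below is plain Python; `pad_and_mask` is a hypothetical helper introduced only for this illustration and is not part of the assignment.

```python
def pad_and_mask(tensor, max_length):
    """Right-pad `tensor` with 0 up to `max_length` and build the matching mask."""
    padded = tensor + [0] * (max_length - len(tensor))
    mask = [0 if t == 0 else 1 for t in padded]
    return padded, mask


# "hi" as Unicode integers followed by the end-of-sentence marker 1.
tensor = [ord("h"), ord("i"), 1]
padded, mask = pad_and_mask(tensor, max_length=6)
print(padded)  # [104, 105, 1, 0, 0, 0]
print(mask)    # [1, 1, 1, 0, 0, 0]
```

The mask marks which positions hold real tokens, so padded positions can be left out of the loss.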

1.3 Repeating the batch generator

Training loops over the dataset many times; this is implemented with itertools.cycle.

import itertools

# tmp_lines is a list of text lines defined elsewhere in the notebook
infinite_data_generator = itertools.cycle(
    data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))
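To see what `itertools.cycle` does, here is a small sketch with a generator that ends after one pass. `finite_batches` is a hypothetical stand-in with made-up contents, used only so the replay is visible.

```python
import itertools


def finite_batches():
    """Hypothetical stand-in for a generator that stops after one pass."""
    for batch in (["a", "b"], ["c", "d"]):
        yield batch


repeated = itertools.cycle(finite_batches())
first_five = list(itertools.islice(repeated, 5))
print(first_five)
# [['a', 'b'], ['c', 'd'], ['a', 'b'], ['c', 'd'], ['a', 'b']]
```

`itertools.cycle` keeps a copy of every item it has yielded so that it can replay them once the source is exhausted. With a generator that never ends, such as `data_generator` with its `while True` loop, the replay never happens and the batches simply pass through.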

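Part 2: Define the GRU model

  • tl.ShiftRight: shifts the input tensor one position to the right by inserting padding at the start, so that each position is predicted from the tokens before it
  • tl.Embedding: initializes the embedding that maps token ids to vectors of size d_feature
  • tl.GRU: builds a standard GRU layer
  • tl.Dense: dense (fully connected) layer
  • tl.LogSoftmax: outputs log-probabilities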
from trax import layers as tl


def GRULM(vocab_size=256, d_model=512, n_layers=2, mode='train'):
    """Returns a GRU language model.

    Args:
        vocab_size (int, optional): Size of the vocabulary. Defaults to 256.
        d_model (int, optional): Depth of embedding (n_units in the GRU cell). Defaults to 512.
        n_layers (int, optional): Number of GRU layers. Defaults to 2.
        mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. Defaults to "train".

    Returns:
        trax.layers.combinators.Serial: A GRU language model as a layer that maps from a tensor of tokens to activations over a vocab set.
    """

    model = tl.Serial(
        tl.ShiftRight(mode=mode),
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
        [tl.GRU(n_units=d_model) for _ in range(n_layers)],
        tl.Dense(n_units=vocab_size),
        tl.LogSoftmax()
    )
    return model
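The role of the right shift is easier to see on a tiny example. The sketch below is a plain-Python illustration of the idea on a single list of tokens; `shift_right` is a hypothetical helper written for this note and is not trax's implementation.

```python
def shift_right(tokens, n_positions=1, pad_id=0):
    """List-based sketch: prepend padding and drop the tail, keeping the length."""
    return [pad_id] * n_positions + tokens[: len(tokens) - n_positions]


tokens = [104, 105, 1]          # "hi" plus the end-of-sentence marker
model_input = shift_right(tokens)
print(model_input)              # [0, 104, 105]

# At each position the model reads the shifted input and is scored
# against the original token at that same position.
for seen, target in zip(model_input, tokens):
    print(f"input {seen:>3} -> target {target}")
```

This is why the data generator can yield the same array as both input and target: the shift inside the model keeps it from seeing the token it is asked to predict.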

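Part 3: Train the model

  • trax.supervised.training.TrainTask: bundles the training data, loss, optimizer and so on into one object
    • labeled_data: the labeled data to train on
    • loss_layer: the loss function
    • optimizer: the optimizer
  • trax.supervised.training.EvalTask: bundles the evaluation data and the metrics
    • labeled_data: the labeled data to evaluate on
    • metrics: the metrics to compute
  • trax.supervised.training.Loop: puts everything together and runs the training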
import itertools

import trax
from trax import layers as tl
from trax.supervised import training


def train_model(model, data_generator, batch_size=32, max_length=64, lines=lines, eval_lines=eval_lines, n_steps=1, output_dir='model/'):
    """Function that trains the model

    Args:
        model (trax.layers.combinators.Serial): GRU model.
        data_generator (function): Data generator function.
        batch_size (int, optional): Number of lines per batch. Defaults to 32.
        max_length (int, optional): Maximum length allowed for a line to be processed. Defaults to 64.
        lines (list, optional): List of lines to use for training. Defaults to lines.
        eval_lines (list, optional): List of lines to use for evaluation. Defaults to eval_lines.
        n_steps (int, optional): Number of steps to train. Defaults to 1.
        output_dir (str, optional): Relative path of directory to save model. Defaults to "model/".

    Returns:
        trax.supervised.training.Loop: Training loop for the model.
    """

    # Training data generator
    bare_train_generator = data_generator(batch_size, max_length, data_lines=lines)

    # Cycle so that training can iterate over the data repeatedly
    infinite_train_generator = itertools.cycle(bare_train_generator)

    # Evaluation data
    bare_eval_generator = data_generator(batch_size, max_length, data_lines=eval_lines)
    infinite_eval_generator = itertools.cycle(bare_eval_generator)

    train_task = training.TrainTask(
        labeled_data=infinite_train_generator,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.0005)
    )

    eval_task = training.EvalTask(
        labeled_data=infinite_eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=3
    )

    training_loop = training.Loop(model,
                                  train_task,
                                  eval_tasks=eval_task,  # in trax==1.3.9 the parameter is named eval_tasks
                                  output_dir=output_dir)

    training_loop.run(n_steps=n_steps)

    # We return this because it contains a handle to the model, which has the weights etc.
    return training_loop

Part 4: Evaluation

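Perplexity measures how well a probability model predicts a sample:

$$\mathrm{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}$$

Taking the log:

$$\log \mathrm{PP}(W) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})$$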

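  • tl.one_hot: converts the targets to one-hot vectors whose last dimension matches that of the prediction tensor
  • I did not fully understand the code below…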
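import trax.fastmath.numpy as np  # assumed, as in the assignment notebook
from trax import layers as tl


def test_model(preds, target):
    """Function to test the model.

    Args:
        preds (jax.interpreters.xla.DeviceArray): Predictions of a list of batches of tensors corresponding to lines of text.
        target (jax.interpreters.xla.DeviceArray): Actual list of batches of tensors corresponding to lines of text.

    Returns:
        float: log_perplexity of the model.
    """

    # preds holds log-probabilities; the one-hot product keeps only the
    # log-probability assigned to the actual target token at each position
    total_log_ppx = np.sum(preds * tl.one_hot(target, preds.shape[-1]), axis=-1)

    # 1.0 for real tokens, 0.0 for padding (token id 0)
    non_pad = 1.0 - np.equal(target, 0)

    # zero out the padded positions
    ppx = total_log_ppx * non_pad

    # average over the non-padding tokens only
    log_ppx = np.sum(ppx) / np.sum(non_pad)

    return -log_ppx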
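The same computation can be followed by hand on a toy example. The sketch below uses only the standard library and loops instead of array operations; `log_perplexity`, the 3-token vocabulary and the probabilities are all made up for illustration.

```python
import math


def log_perplexity(log_probs, targets, pad_id=0):
    """Average negative log-probability of the target tokens, skipping padding.

    log_probs[t][k] is the predicted log-probability of token k at position t.
    """
    total, count = 0.0, 0
    for step_log_probs, target in zip(log_probs, targets):
        if target == pad_id:
            continue  # plays the role of the non-padding mask
        total += step_log_probs[target]  # the entry a one-hot product would select
        count += 1
    return -total / count


# Toy vocabulary of 3 tokens (id 0 is padding); the values are made up.
probs = [[0.1, 0.7, 0.2], [0.1, 0.4, 0.5], [0.8, 0.1, 0.1]]
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [1, 2, 0]  # the last position is padding and is ignored

result = log_perplexity(log_probs, targets)
print(round(result, 4))  # 0.5249
```

Only the first two positions count, so the result is -(log 0.7 + log 0.5) / 2. A model that assigned probability 1 to every target token would score 0.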

Part 5: Forward pass of vanilla RNNs and GRUs

5.1 Vanilla RNNs: see the left side of the figure below:

https://pic.imgdb.cn/item/61161d385132923bf8d32371.png

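import numpy as np


def sigmoid(x):
    # logistic function, assumed to be the helper used in the original
    return 1.0 / (1.0 + np.exp(-x))


def forward_V_RNN(inputs, weights):
    x, h_t = inputs

    # weights: only the first weight matrix and bias are used here
    wh, _, _, bh, _, _ = weights

    # new hidden state
    h_t = np.dot(wh, np.concatenate([h_t, x])) + bh
    h_t = sigmoid(h_t)

    return h_t, h_t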

5.2 GRU forward pass: see the right side of the figure above

import numpy as np

# sigmoid is the logistic helper 1 / (1 + np.exp(-x)) also used by forward_V_RNN


def forward_GRU(inputs, weights):
    x, h_t = inputs

    # weights.
    wu, wr, wc, bu, br, bc = weights

    # Update gate
    u = np.dot(wu, np.concatenate([h_t, x])) + bu
    u = sigmoid(u)

    # Relevance gate
    r = np.dot(wr, np.concatenate([h_t, x])) + br
    r = sigmoid(r)  # fixed: the original applied sigmoid to u here

    # Candidate hidden state
    c = np.dot(wc, np.concatenate([r * h_t, x])) + bc
    c = np.tanh(c)

    # New Hidden state h_t
    h_t = u * c + (1 - u) * h_t
    return h_t, h_t
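To make the shapes concrete, here is a self-contained NumPy sketch of a single GRU step on random weights, with each gate applying the sigmoid to its own pre-activation. The function name `gru_step`, the sizes and the random seed are arbitrary choices for this illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, weights):
    """One GRU step: update gate, relevance gate, candidate state, interpolation."""
    wu, wr, wc, bu, br, bc = weights
    hx = np.concatenate([h, x])
    u = sigmoid(np.dot(wu, hx) + bu)  # update gate
    r = sigmoid(np.dot(wr, hx) + br)  # relevance gate
    c = np.tanh(np.dot(wc, np.concatenate([r * h, x])) + bc)  # candidate state
    return u * c + (1 - u) * h


hidden, features = 4, 3  # sizes chosen only for this illustration
rng = np.random.default_rng(0)
wu, wr, wc = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
bu, br, bc = (rng.normal(size=hidden) for _ in range(3))
weights = (wu, wr, wc, bu, br, bc)

x = rng.normal(size=features)
h = rng.normal(size=hidden)

h_next = gru_step(x, h, weights)
print(h_next.shape)  # (4,)
```

Each weight matrix has shape (hidden, hidden + features) because it multiplies the concatenation of the hidden state and the input. Since the update gate lies between 0 and 1, every element of the new state is a blend of the old state and the candidate.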