Keshawn_lu's Blog

Andrew Ng's NLP Specialization: C3_W3_Assignment

2021/08/28


Task: Named Entity Recognition (NER)

  • French: a geopolitical entity
  • Morocco: a geographic entity
  • Christmas: a time indicator
  • Everything else is not treated as a named entity

https://pic.imgdb.cn/item/611893025132923bf8d4fd26.png
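
For illustration, here is a hypothetical token/tag alignment consistent with the three entity types above. The tag names and the actual tag_map come from the assignment's dataset, so treat this as a sketch rather than the real labels:

# Hypothetical example; the real tags and indices come from the dataset's tag_map.
sentence = ["Many", "French", "citizens", "are", "in", "Morocco", "for", "Christmas", "."]
tags     = ["O",    "B-gpe",  "O",        "O",   "O",  "B-geo",   "O",   "B-tim",    "O"]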

Part 1: Data Generator

  • The benefit of shuffling this way: we do not index into the sentence list directly. Instead, we shuffle a list of indices and use it to pick positions, so the traversal order changes while the original list stays untouched (see the sketch below).
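
A minimal sketch of that index-shuffling idea before the full generator (the toy list below is illustrative):

import random as rnd

sentences = ["sent0", "sent1", "sent2", "sent3"]   # the original list is never reordered
lines_index = [*range(len(sentences))]             # indices into the list
rnd.shuffle(lines_index)                           # shuffle the indices instead of the data
for idx in lines_index:
    print(sentences[idx])                          # traversal order changes, data does not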
import numpy as np
import random as rnd

def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    '''
    Input:
        batch_size - size of one batch
        x - list of sentences, with words represented as integers
        y - list of tag sequences associated with the sentences
        shuffle - whether to shuffle the traversal order
        pad - the integer used for padding
        verbose - print information during runtime
    Output:
        a tuple containing 2 elements:
        X - np.ndarray of dim (batch_size, max_len) of padded sentences
        Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    '''

    num_lines = len(x)
    lines_index = [*range(num_lines)]

    if shuffle:
        rnd.shuffle(lines_index)

    index = 0
    while True:
        buffer_x = [0] * batch_size
        buffer_y = [0] * batch_size

        max_len = 0
        for i in range(batch_size):
            if index >= num_lines:
                index = 0
                if shuffle:
                    rnd.shuffle(lines_index)

            buffer_x[i] = x[lines_index[index]]
            buffer_y[i] = y[lines_index[index]]

            lenx = len(x[lines_index[index]])
            if lenx > max_len:
                max_len = lenx  # track the longest sentence, used for padding below

            index += 1

        # Build (batch_size, max_len) arrays filled entirely with the pad value.
        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)

        for i in range(batch_size):
            x_i = buffer_x[i]
            y_i = buffer_y[i]

            for j in range(len(x_i)):
                X[i, j] = x_i[j]
                Y[i, j] = y_i[j]

        if verbose:
            print("index=", index)
        yield (X, Y)
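
A quick smoke test of the generator on toy data. The word/tag ids and the pad value 35180 below are made up for illustration; it reuses data_generator and numpy from above:

x_toy = [[1, 2, 3], [4, 5]]          # two "sentences" of word ids
y_toy = [[0, 1, 0], [1, 0]]          # matching tag ids
gen = data_generator(batch_size=2, x=x_toy, y=y_toy, pad=35180, shuffle=False)
X, Y = next(gen)
print(X)   # [[    1     2     3]
           #  [    4     5 35180]]  -- the shorter sentence is padded to max_len
print(Y)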

Part 2: Building the Model

https://pic.imgdb.cn/item/6118a2915132923bf837d1c5.png

  1. Feed the batches produced by the data generator into the input layer
  2. Pass them through an embedding layer
  3. Feed the embeddings into an LSTM
  4. Project the LSTM outputs through a dense (linear) layer
  5. Finally, apply a log-softmax to get a prediction for each word
from trax import layers as tl

def NER(vocab_size=35181, d_model=50, tags=tag_map):
    '''
    Input:
        vocab_size - integer containing the size of the vocabulary
        d_model - integer describing the embedding size
    Output:
        model - a trax serial model
    '''

    model = tl.Serial(
        tl.Embedding(vocab_size, d_model),  # d_model: number of elements in each word embedding
        tl.LSTM(d_model),                   # LSTM layer
        tl.Dense(len(tags)),                # Dense layer with len(tags) units
        tl.LogSoftmax()                     # LogSoftmax layer
    )

    return model
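
A minimal sketch of instantiating the model and checking the output shape, assuming trax is installed and tag_map has already been built by the notebook (the dummy batch of token ids is illustrative):

import numpy as np
import trax

model = NER(vocab_size=35181, d_model=50, tags=tag_map)
print(model)   # Serial[ Embedding_35181_50, LSTM_50, Dense_..., LogSoftmax ]

# Initialize weights from an input signature, then run a dummy batch of token ids.
model.init(trax.shapes.ShapeDtype((1, 10), dtype=np.int32))
dummy_batch = np.zeros((1, 10), dtype=np.int32)
log_probs = model(dummy_batch)
print(log_probs.shape)   # (1, 10, len(tag_map)) -- one log-probability vector per token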

An introduction to LSTMs:

人人都能看懂的LSTM ("An LSTM anyone can understand", a Chinese explainer)

Part 3: Training the Model

import trax
from trax import layers as tl
from trax.supervised import training

def train_model(NER, train_generator, eval_generator, train_steps=1, output_dir='model'):
    '''
    Input:
        NER - the model you are building
        train_generator - the data generator for training examples
        eval_generator - the data generator for validation examples
        train_steps - number of training steps
        output_dir - folder to save your model
    Output:
        training_loop - a trax supervised training Loop
    '''

    train_task = training.TrainTask(
        train_generator,                        # labeled training data
        loss_layer=tl.CrossEntropyLoss(),       # loss function
        optimizer=trax.optimizers.Adam(0.01),   # optimizer with learning rate 0.01
    )

    eval_task = training.EvalTask(
        labeled_data=eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=10
    )

    training_loop = training.Loop(
        NER,                      # a model to train
        train_task,               # a train task
        eval_tasks=eval_task,     # the evaluation task
        output_dir=output_dir)    # the output directory

    training_loop.run(n_steps=train_steps)

    return training_loop
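
A hedged sketch of how the training call might be wired up. The names t_sentences, t_labels, v_sentences, v_labels, vocab and batch_size below are assumptions standing in for the notebook's preprocessed data; the add_loss_weights wrapper masks the pad id out of the loss, as the course notebook does:

import trax

# Assumed to exist from the notebook's preprocessing:
#   t_sentences, t_labels  - training split (lists of int sequences)
#   v_sentences, v_labels  - validation split
#   vocab['<PAD>']         - integer id of the padding token
batch_size = 64

train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, t_sentences, t_labels, vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])          # exclude padding tokens from the loss

eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, v_sentences, v_labels, vocab['<PAD>'], shuffle=False),
    id_to_mask=vocab['<PAD>'])

training_loop = train_model(NER(), train_generator, eval_generator,
                            train_steps=100, output_dir='model')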

Part 4: Computing Accuracy

import numpy as np

def evaluate_prediction(pred, labels, pad):
    """
    Inputs:
        pred: prediction array with shape
              (num examples, max sentence length in batch, num of classes)
        labels: array of size (batch_size, seq_len)
        pad: integer representing pad character
    Outputs:
        accuracy: float
    """
    ## step 1 ##
    # Pick the most likely class for every token.
    outputs = np.argmax(pred, axis=2)
    print("outputs shape:", outputs.shape)

    ## step 2 ##
    # Mark the non-padded positions; only these count towards the total.
    mask = labels != pad
    print("mask shape:", mask.shape, "mask[0][20:30]:", mask[0][20:30])

    ## step 3 ##
    # Padded positions never match, since pad lies outside the class-index range,
    # so the numerator only counts correctly predicted real tokens.
    accuracy = np.sum(outputs == labels) / float(np.sum(mask))

    return accuracy

np.argmax(pred, axis): returns the indices of the maximum values along the given axis of a numpy array; when several elements tie for the maximum, the index of the first one is returned.

For multi-dimensional arrays, a few examples help build intuition:

  • axis = 0: compare along the outer dimension (down the columns)
  • axis = 1: compare along the inner dimension (across each row)
two_dim_array = np.array([[1, 3, 5], [0, 4, 3]])
max_index_axis0 = np.argmax(two_dim_array, axis = 0)
max_index_axis1 = np.argmax(two_dim_array, axis = 1)
print(max_index_axis0)
print(max_index_axis1)
[0 1 0] 
[2 1]

For a three-dimensional array of shape m * n * p:

  • axis = 0: collapses m, returning an n × p matrix
  • axis = 1: collapses n, returning an m × p matrix
  • axis = 2: collapses p, returning an m × n matrix
three_dim_array = [[[1, 2, 3, 4],  [-1, 0, 3, 5]],
                   [[2, 7, -1, 3], [0, 3, 12, 4]],
                   [[5, 1, 0, 19], [4, 2, -2, 13]]]  # shape 3 * 2 * 4
a = np.argmax(three_dim_array, axis=0)
print(a)
b = np.argmax(three_dim_array, axis=1)
print(b)
c = np.argmax(three_dim_array, axis=2)
print(c)
[[2 1 0 2]
 [2 1 1 2]]  # 2 * 4: for each position, the index of the maximum value across the 3 blocks

[[0 0 0 1]
 [0 0 1 1]
 [0 1 0 0]]  # 3 * 4: the two rows inside each block are compared column by column

[[3 3]
 [1 2]
 [3 3]]  # 3 * 2: each row picks the index of its own maximum
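
Finally, a toy check of evaluate_prediction with made-up numbers: two sentences of three tokens each, three classes, and the pad id set to 35180. Only the five non-pad positions count, and one of them is predicted wrong:

pad = 35180
# Fake (log-)probabilities with shape (2 examples, 3 tokens, 3 classes).
pred = np.array([[[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],
                 [[0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]])
labels = np.array([[1, 0, 1],
                   [0, 1, pad]])   # the last token of the second sentence is padding

print(evaluate_prediction(pred, labels, pad))   # 0.8 -> 4 of the 5 real tokens are correct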