Keshawn_lu's Blog

Andrew Ng's NLP Specialization: C3_W1_Assignment

2021/08/28

Task: sentiment analysis with a deep neural network

Part 1: Preparing the data

1.1 Prepare the training and validation sets with an 8:2 split

import numpy as np

# load_tweets and process_tweet are helper functions shipped with the assignment
all_positive_tweets, all_negative_tweets = load_tweets()

print(f"The number of positive tweets: {len(all_positive_tweets)}")
print(f"The number of negative tweets: {len(all_negative_tweets)}")

val_pos = all_positive_tweets[4000:]    # generating validation set for positive tweets
train_pos = all_positive_tweets[:4000]  # generating training set for positive tweets

val_neg = all_negative_tweets[4000:]    # generating validation set for negative tweets
train_neg = all_negative_tweets[:4000]  # generating training set for negative tweets

train_x = train_pos + train_neg
val_x = val_pos + val_neg

# Set the labels for the training set (1 for positive, 0 for negative)
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

# Set the labels for the validation set (1 for positive, 0 for negative)
val_y = np.append(np.ones(len(val_pos)), np.zeros(len(val_neg)))

print(f"length of train_x {len(train_x)}")
print(f"length of val_x {len(val_x)}")
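
With the NLTK twitter_samples corpus the course uses (5,000 tweets per class), the prints above should give:

The number of positive tweets: 5000
The number of negative tweets: 5000
length of train_x 8000
length of val_x 2000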

1.2 Build the vocabulary, mapping each word to a unique integer ID

Vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}

# Note that we build the vocab using only the training data
for tweet in train_x:
    processed_tweet = process_tweet(tweet)
    for word in processed_tweet:
        if word not in Vocab:
            Vocab[word] = len(Vocab)

print("Total words in vocab are", len(Vocab))
display(Vocab)
The dictionary Vocab will look like this:

{'__PAD__': 0,
'__</e>__': 1,
'__UNK__': 2,
'followfriday': 3,
'top': 4,
'engag': 5,
...
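
process_tweet is provided by the assignment's utility module and is not reproduced in this post. Below is a minimal sketch of the standard course preprocessing (strip stock tickers, retweet marks, links, and the hash sign; tokenize; lowercase; drop stopwords and punctuation; stem), assuming NLTK is available:

import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Clean a raw tweet and return a list of stemmed tokens."""
    tweet = re.sub(r'\$\w*', '', tweet)                # remove stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # remove old-style retweet text "RT"
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # remove the hash sign only
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    return [stemmer.stem(token) for token in tokenizer.tokenize(tweet)
            if token not in stopwords_english and token not in string.punctuation]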

1.3 Convert each tweet to a tensor

Example

Given an input tweet:

'@happypuppy, is Maria happy?'

the relevant words are extracted into a list:

['maria', 'happy']

which is then mapped to vocabulary IDs:

[2, 56]

  • Since 'maria' is not in the vocabulary, it is mapped to __UNK__ (ID 2).
def tweet_to_tensor(tweet, vocab_dict, unk_token='__UNK__', verbose=False):
    '''
    Input:
        tweet - A string containing a tweet
        vocab_dict - The words dictionary
        unk_token - The special string for unknown tokens
        verbose - Print info during runtime
    Output:
        tensor_l - A python list of unique integer IDs representing the processed tweet
    '''

    word_l = process_tweet(tweet)

    if verbose:
        print("List of words from the processed tweet:")
        print(word_l)

    tensor_l = []
    unk_ID = vocab_dict[unk_token]

    if verbose:
        print(f"The unique integer ID for the unk_token is {unk_ID}")

    # Map each word to its vocabulary ID, falling back to __UNK__
    for word in word_l:
        if word not in vocab_dict:
            word_ID = unk_ID
        else:
            word_ID = vocab_dict[word]

        tensor_l.append(word_ID)

    return tensor_l
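
Applying the function to the example tweet from above (the exact ID for 'happy' depends on your vocabulary; 2 and 56 match the course notebook):

print(tweet_to_tensor('@happypuppy, is Maria happy?', vocab_dict=Vocab))
# [2, 56]  -- 'maria' is out of vocabulary, so it maps to __UNK__ (ID 2)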

1.4 Create the batch generator

import random as rnd  # used for shuffling the index lists

def data_generator(data_pos, data_neg, batch_size, loop, vocab_dict, shuffle=False):
    '''
    Input:
        data_pos - Set of positive examples
        data_neg - Set of negative examples
        batch_size - number of samples per batch. Must be even
        loop - True or False
        vocab_dict - The words dictionary
        shuffle - Shuffle the data order
    Yield:
        inputs - Subset of positive and negative examples
        targets - The corresponding labels for the subset
        example_weights - An array specifying the importance of each example
    '''
    # Each batch is half positive and half negative, so batch_size must be even
    assert batch_size % 2 == 0

    # Number of examples to take from each class (// is integer division)
    n_to_take = batch_size // 2

    pos_index = 0
    neg_index = 0

    len_data_pos = len(data_pos)
    len_data_neg = len(data_neg)

    pos_index_lines = list(range(len_data_pos))
    neg_index_lines = list(range(len_data_neg))

    # Shuffle the index lists if requested
    if shuffle:
        rnd.shuffle(pos_index_lines)
        rnd.shuffle(neg_index_lines)

    stop = False

    while not stop:

        batch = []

        # First part: Pack n_to_take positive examples
        for i in range(n_to_take):

            # If we have run past the end of the positive data
            if pos_index >= len_data_pos:
                if not loop:
                    stop = True
                    break

                # To keep looping over the data, reset the index to 0
                pos_index = 0

                if shuffle:
                    rnd.shuffle(pos_index_lines)

            # Fetch the tweet at the (possibly shuffled) index
            tweet = data_pos[pos_index_lines[pos_index]]

            # Convert the tweet into its vocabulary IDs
            tensor = tweet_to_tensor(tweet, vocab_dict)

            batch.append(tensor)

            pos_index = pos_index + 1

        # Second part: Pack n_to_take negative examples
        for i in range(n_to_take):
            if neg_index >= len_data_neg:
                if not loop:
                    stop = True
                    break

                neg_index = 0

                if shuffle:
                    rnd.shuffle(neg_index_lines)

            tweet = data_neg[neg_index_lines[neg_index]]

            tensor = tweet_to_tensor(tweet, vocab_dict)

            batch.append(tensor)

            neg_index = neg_index + 1

        if stop:
            break

        # Note: pos_index and neg_index were already advanced inside the loops
        # above; the course template adds another n_to_take to each index here,
        # which silently skips half of the data, so that step is dropped.

        # Longest tweet in the batch, so the others can be padded to match
        max_len = max([len(t) for t in batch])

        tensor_pad_l = []

        # Pad each tensor with 0 (the __PAD__ token) up to max_len
        for tensor in batch:
            n_pad = max_len - len(tensor)  # number of pad tokens needed

            pad_l = [0] * n_pad  # the padding itself

            tensor_pad = tensor + pad_l
            tensor_pad_l.append(tensor_pad)

        # Convert the padded batch to a numpy array
        inputs = np.array(tensor_pad_l)

        # Targets for the positive examples are all 1
        target_pos = [1] * n_to_take

        # Targets for the negative examples are all 0
        target_neg = [0] * n_to_take

        target_l = target_pos + target_neg

        targets = np.array(target_l)

        # Give every example the same weight
        example_weights = np.ones_like(targets)

        yield inputs, targets, example_weights
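
Parts 3 and 5 below call train_generator, val_generator, and test_generator. These are thin wrappers around data_generator; the sketch below follows the course notebook, so treat the exact signatures as an assumption:

# train/val generators loop forever (loop=True); the test generator
# makes a single pass (loop=False) so that iteration terminates.
def train_generator(batch_size, shuffle=False):
    return data_generator(train_pos, train_neg, batch_size, True, Vocab, shuffle)

def val_generator(batch_size, shuffle=False):
    return data_generator(val_pos, val_neg, batch_size, True, Vocab, shuffle)

def test_generator(batch_size, shuffle=False):
    return data_generator(val_pos, val_neg, batch_size, False, Vocab, shuffle)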

Part 2: Writing your own layers

2.1 Implementing the Relu class

https://pic.imgdb.cn/item/6112207c5132923bf817cb37.png

from trax.layers.base import Layer  # base class for custom trax layers

class Relu(Layer):
    """Relu activation function implementation"""
    def forward(self, x):
        '''
        Input:
            - x (a numpy array): the input
        Output:
            - activation (numpy array): all positive or 0 version of x
        '''
        activation = np.maximum(x, 0)
        return activation
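
A quick sanity check of the forward pass:

x = np.array([[-2., -1., 0., 1., 2.]])
print(Relu().forward(x))  # [[0. 0. 0. 1. 2.]]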

2.2 Implementing the Dense class

https://pic.imgdb.cn/item/611220665132923bf8179dfd.png

Where:

  • the random values have standard deviation stdev = 0.1
  • n_units: the number of units
  • the weight matrix is generated by trax.fastmath.random.normal(key, shape, dtype=np.float32)
    • key can be generated with trax.fastmath.random.get_prng(seed=...)
    • shape is a tuple (n_rows, n_cols)
      • n_rows is the number of columns of x (so the two can be multiplied); x may be (row, col) or (batch_size, row, col), so taking the last dimension (input_shape[-1]) covers both cases
      • n_cols is n_units, the number of units
    • dtype is the numeric type of the matrix entries
class Dense(Layer):
    """
    A dense (fully-connected) layer.
    """
    def __init__(self, n_units, init_stdev=0.1):
        self._n_units = n_units
        self._init_stdev = init_stdev  # was hard-coded to 0.1, which ignored the argument

    def forward(self, x):
        dense = np.dot(x, self.weights)
        return dense

    def init_weights_and_state(self, input_signature, random_key):
        # The input_signature has a .shape attribute that gives the shape as a tuple
        input_shape = input_signature.shape

        # Draw the weight matrix from a normal distribution
        # and scale it by the standard deviation 'init_stdev'
        w = self._init_stdev * trax.fastmath.random.normal(
            key=random_key, shape=(input_shape[-1], self._n_units))

        self.weights = w
        return self.weights
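
A small smoke test, using trax.shapes.signature to build the input signature (the seed and input values here are arbitrary):

from trax import shapes

z = np.array([[2.0, 7.0, 25.0]])  # input of shape (1, 3)
dense_layer = Dense(n_units=10)
random_key = trax.fastmath.random.get_prng(seed=0)

dense_layer.init_weights_and_state(shapes.signature(z), random_key)
print(dense_layer.forward(z).shape)  # (1, 10)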

2.3 Implementing the classifier

https://pic.imgdb.cn/item/6112240c5132923bf81f8935.png

from trax import layers as tl

def classifier(vocab_size=len(Vocab), embedding_dim=256, output_dim=2, mode='train'):

    # create the embedding layer
    embed_layer = tl.Embedding(
        vocab_size=vocab_size,    # Size of the vocabulary
        d_feature=embedding_dim)  # Embedding dimension

    # Create a mean layer, to create an "average" word embedding
    mean_layer = tl.Mean(axis=1)

    # Create a dense layer, one unit for each output
    dense_output_layer = tl.Dense(n_units=output_dim)

    # Create the log softmax layer (no parameters needed)
    log_softmax_layer = tl.LogSoftmax()

    # Combine the layers into a Serial model
    model = tl.Serial(
        embed_layer,          # embedding layer
        mean_layer,           # mean layer
        dense_output_layer,   # dense output layer
        log_softmax_layer     # log softmax layer
    )

    # return the assembled model
    return model
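
Instantiating and printing the model shows the layer stack; the 9088 in the embedding name is the vocabulary size the course notebook reports, so your number may differ:

tmp_model = classifier()
print(tmp_model)
# Serial[
#   Embedding_9088_256
#   Mean
#   Dense_2
#   LogSoftmax
# ]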

Part 3: Training the model

3.1 Define the TrainTask, EvalTask, and Loop

import trax
from trax.supervised import training

batch_size = 16
rnd.seed(271)

train_task = training.TrainTask(
    labeled_data=train_generator(batch_size=batch_size, shuffle=True),
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=10,
)

eval_task = training.EvalTask(
    labeled_data=val_generator(batch_size=batch_size, shuffle=True),
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
)

model = classifier()

3.2 Implement the training function

  • In practice, with trax==1.3.9 the keyword argument must be named eval_tasks rather than eval_task.
def train_model(classifier, train_task, eval_task, n_steps, output_dir):
    '''
    Input:
        classifier - the model you are building
        train_task - Training task
        eval_task - Evaluation task
        n_steps - the number of training steps to run
        output_dir - folder to save your files
    Output:
        training_loop - a trax training.Loop
    '''
    training_loop = training.Loop(
        classifier,             # The learning model
        train_task,             # The training task
        eval_tasks=eval_task,   # The evaluation task (some trax versions expect a list: [eval_task])
        output_dir=output_dir)  # The output directory

    training_loop.run(n_steps=n_steps)

    return training_loop
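
A typical call, mirroring the notebook; the checkpoint directory is an arbitrary choice:

import os

output_dir = os.path.expanduser('~/model/')  # hypothetical checkpoint directory
training_loop = train_model(model, train_task, eval_task, 100, output_dir)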

Part 4: Computing accuracy

def compute_accuracy(preds, y, y_weights):
    """
    Input:
        preds: a tensor of shape (dim_batch, output_dim)
        y: a tensor of shape (dim_batch,) with the true labels
        y_weights: an np.ndarray with a weight for each example
    Output:
        accuracy: a float between 0-1
        weighted_num_correct (np.float32): Sum of the weighted correct predictions
        sum_weights (np.float32): Sum of the weights
    """

    # True where the positive score exceeds the negative score
    is_pos = preds[:, 1] > preds[:, 0]

    # Convert the booleans to ints (0 or 1)
    is_pos_int = is_pos.astype(np.int32)

    # Compare the predictions against the true labels
    correct = (is_pos_int == y)

    # Count the sum of the weights.
    sum_weights = np.sum(y_weights)

    # Convert the booleans to floats so they can be weighted
    correct_float = correct.astype(np.float32)

    # Multiply each prediction with its corresponding weight.
    weighted_correct_float = correct_float * y_weights

    weighted_num_correct = np.sum(weighted_correct_float)

    # Weighted fraction of correct predictions
    accuracy = weighted_num_correct / sum_weights

    return accuracy, weighted_num_correct, sum_weights
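
A toy check of the function, assuming log-softmax style scores (two examples, both classified correctly):

preds = np.array([[-2.3, -0.1],   # column 1 > column 0: predicts positive
                  [-0.1, -2.3]])  # column 0 > column 1: predicts negative
y = np.array([1, 0])
y_weights = np.ones_like(y)

acc, num_correct, total = compute_accuracy(preds, y, y_weights)
print(acc)  # 1.0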

Part 5: Testing the model on the validation set

  • Each batch has the form (X, Y, weights):

    • X is the tweet tensors
    • Y is the true labels, positive or negative
    • weights are the corresponding example weights

    Test the model and return its accuracy:

def test_model(generator, model):
    '''
    Input:
        generator: an iterator instance that provides batches of inputs and targets
        model: a model instance
    Output:
        accuracy: float corresponding to the accuracy
    '''

    accuracy = 0.
    total_num_correct = 0
    total_num_pred = 0

    for batch in generator:

        # X: padded tweet tensors
        inputs = batch[0]

        # Y: true labels
        targets = batch[1]

        # weights: per-example weights
        example_weight = batch[2]

        # Let the model make its predictions
        pred = model(inputs)

        # Accumulate the weighted correct counts and weights for this batch
        batch_accuracy, batch_num_correct, batch_num_pred = compute_accuracy(
            pred, targets, example_weight)

        total_num_correct += batch_num_correct
        total_num_pred += batch_num_pred

    # Overall accuracy across all batches
    accuracy = total_num_correct / total_num_pred

    return accuracy
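
Evaluation on the held-out split then looks like this; test_generator is the single-pass wrapper sketched at the end of section 1.4 (so the loop in test_model terminates), and training_loop.eval_model is trax's copy of the trained model in evaluation mode:

model = training_loop.eval_model
accuracy = test_model(test_generator(batch_size=16), model)
print(f'The accuracy of your model on the validation set is {accuracy:.4f}')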