Keshawn_lu's Blog

Andrew Ng's Team NLP: C1_W1_Assignment

Word count: 2.1k | Reading time: 11 min
2021/07/25

Course link:

Coursera | Online Courses & Credentials From Top Educators. Join for Free | Coursera

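Task:

Learn logistic regression and use it for sentiment analysis: given any tweet, classify it as positive or negative.

Steps:

  1. Split the dataset into a training set and a test set at a ratio of 8 : 2
  2. Combine the positive and negative examples into two label matrices, one for the training set and one for the test set. Positive tweets are labeled 1 and negative tweets are labeled 0.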
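import numpy as np
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
  3. Build a dictionary that records how many times each word appears in positive tweets and in negative tweets, for example:
(sad, 0) = 6 # "sad" appears 6 times in negative tweets
(sad, 1) = 1 # "sad" appears 1 time in positive tweets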
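
The frequency dictionary from step 3 can be sketched as follows. This is a minimal illustration rather than the course's own helper: the function name `build_freqs_sketch` and the toy tweets are assumptions, and the tweets are taken to be already tokenized.

```python
from collections import defaultdict


def build_freqs_sketch(tokenized_tweets, labels):
    """Count how often each (word, label) pair occurs.

    tokenized_tweets: list of token lists (already cleaned)
    labels: list of 0/1 sentiment labels, one per tweet
    """
    freqs = defaultdict(int)
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            freqs[(word, int(label))] += 1
    return dict(freqs)


# Toy data, made up for illustration (not from the course dataset)
toy_tweets = [["sad", "day"], ["happi", "day"], ["sad", "sad"]]
toy_labels = [0, 1, 0]
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs[("sad", 0)])  # 3
```

A pair that never occurs, such as ("sad", 1) in this toy data, is simply absent from the dictionary, which is why later lookups need a membership check.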

Each tweet is cleaned beforehand: the process_tweet() function splits the tweet into individual words, stems them, and removes stop words and some punctuation. The effect looks like this:

This is an example of a positive tweet: 
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet:
['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']

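Part 1: Logistic Regression

1.1 Sigmoid

The sigmoid function is defined as:

$$h(z) = \frac{1}{1 + e^{-z}}$$

It maps any input z to a value in (0, 1). Its graph is shown below: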

https://pic.imgdb.cn/item/60f81dcc5132923bf8c77f61.png

The implementation is as follows:

def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # calculate the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    ### END CODE HERE ###

    return h
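
As a quick sanity check of the properties mentioned above, the sketch below evaluates the same formula at a few points. The input values are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(z):
    """Same formula as above: maps any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))


# Works on scalars and on arrays alike
values = sigmoid(np.array([-5.0, 0.0, 5.0]))
print(sigmoid(0))  # 0.5
print(values)
```

The outputs for -5 and 5 add up to 1, reflecting the symmetry sigmoid(-z) = 1 - sigmoid(z).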

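Logistic regression starts from an ordinary linear regression, where $\theta$ holds the weights:

$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

and then applies the sigmoid to the result. Putting the two together:

$$h(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$$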

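Part 1.2 Cost Function and Gradient

The cost function can be written as:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h(z(\theta)^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h(z(\theta)^{(i)})\right) \right]$$

  • m is the number of training samples
  • $y^{(i)}$ is the actual label of the i-th training sample
  • $h(z(\theta)^{(i)})$ is the model's prediction for the i-th sample

tips:

  • The leading minus sign is there because $h(z(\theta)^{(i)})$ lies in (0, 1), so $log(h(z(\theta)^{(i)}))$ is always negative (and so is $log(1 - h(z(\theta)^{(i)}))$). Negating the sum makes the cost positive.
  • When the actual label is 1 and the prediction is close to 0, i.e. $y^{(i)}=1, h(z(\theta)^{(i)})≈0$, the term $-log(h(z(\theta)^{(i)}))$ grows toward ∞, so the cost is very large and the prediction has failed. Conversely, when the prediction is close to the actual label, the term approaches 0 and the prediction has succeeded.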
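
The tips above can be checked numerically. The sketch below evaluates the per-sample term of the cost for a fixed label of 1 and three predictions. The helper name `log_loss_single` and the chosen probabilities are assumptions for illustration.

```python
import numpy as np


def log_loss_single(y, h):
    """Cost contribution of one sample with label y and predicted probability h."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))


# The label is 1 in every case; only the prediction changes.
for h in (0.9999, 0.5, 0.0001):
    print(f"y=1, h={h}: loss={log_loss_single(1, h):.4f}")
```

The loss is close to 0 for the confident correct prediction and grows rapidly as the prediction moves toward 0.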

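To improve the model's accuracy, we repeatedly update the weights $\theta$. The gradient of the cost with respect to one weight $\theta_j$ is:

$$\nabla_{\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h^{(i)} - y^{(i)}\right) x_j^{(i)}$$

where $h^{(i)} = h(z(\theta)^{(i)})$.

The detailed derivation is as follows. For a single sample, write $h = h(z)$ with $z = \sum_j \theta_j x_j$ and use $\frac{dh}{dz} = h(1 - h)$:

$$\frac{\partial}{\partial \theta_j}\Big(-\big[y \log h + (1 - y)\log(1 - h)\big]\Big) = \left(-\frac{y}{h} + \frac{1 - y}{1 - h}\right) h(1 - h)\, x_j = (h - y)\, x_j$$

Averaging this over the m samples gives the gradient formula.

The weights are then updated with:

$$\theta_j := \theta_j - \alpha \, \nabla_{\theta_j} J(\theta)$$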
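
One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation of the cost. The sketch below does this on random toy data, using the matrix form of the same formula. The function names, the random data, and the step size `eps` are all assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(theta, x, y):
    """Logistic regression cost J(theta) as a Python float."""
    m = x.shape[0]
    h = sigmoid(x @ theta)
    total = y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)
    return -total.item() / m


def analytic_gradient(theta, x, y):
    """Gradient from the formula: (1/m) * x^T (h - y)."""
    m = x.shape[0]
    return x.T @ (sigmoid(x @ theta) - y) / m


def numeric_gradient(theta, x, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
x = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # bias column + 2 features
y = np.array([[1.0], [0.0], [1.0], [0.0], [1.0]])
theta = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(analytic_gradient(theta, x, y) - numeric_gradient(theta, x, y)))
print(max_diff)
```

If the formula is implemented correctly, the largest difference between the two gradients should be tiny compared with the gradient values themselves.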

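Substituting the data in vectorized form:

$\theta$ is a vector of shape (n + 1, 1), where n is the number of features. $\theta_{0}$ is the bias term, which plays the role of b in y = wx + b, and its corresponding feature is $x_{0}=1$. With this, $z = x\theta$.

  • x has shape (m, n + 1)
  • $\theta$ has shape (n + 1, 1)
  • z has shape (m, 1)

This gives:

$$h = sigmoid(z)$$

$$J = -\frac{1}{m}\left(y^{T} \log(h) + (1 - y)^{T} \log(1 - h)\right)$$

$$\theta = \theta - \frac{\alpha}{m}\, x^{T}(h - y)$$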

The gradient descent implementation is as follows:

def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # get 'm', the number of rows in matrix x
    m = np.shape(x)[0]

    for i in range(0, num_iters):

        # get z, the dot product of x and theta
        z = np.dot(x, theta)

        # get the sigmoid of z
        h = sigmoid(z)

        # calculate the cost function
        vector_1 = np.ones((np.shape(y)[0], 1))  # (m, 1) array filled with ones

        J = (-1/m) * (np.dot(y.T, np.log(h)) + np.dot((vector_1 - y).T, np.log(vector_1 - h)))

        # update the weights theta
        theta = theta - (alpha / m) * np.dot(x.T, (h - y))

    ### END CODE HERE ###
    print(J)  # J is still a (1, 1) array at this point
    J = float(np.squeeze(J))  # convert the (1, 1) array to a Python float
    return J, theta
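
To see the cost going down, as the docstring hint suggests, the sketch below applies the same update rule to a tiny made-up dataset and records the cost at every iteration. The function name `gradient_descent_sketch`, the toy data, and the values of `alpha` and `num_iters` are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def gradient_descent_sketch(x, y, theta, alpha, num_iters):
    """Same update rule as above, but keeps the cost of every iteration."""
    m = x.shape[0]
    costs = []
    for _ in range(num_iters):
        h = sigmoid(x @ theta)
        cost = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)).item() / m
        costs.append(cost)
        theta = theta - (alpha / m) * (x.T @ (h - y))
    return costs, theta


# Toy data: a bias column plus one feature; a larger feature means label 1
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])
costs, theta = gradient_descent_sketch(x, y, np.zeros((2, 1)), alpha=0.1, num_iters=200)
print(f"first cost: {costs[0]:.4f}, last cost: {costs[-1]:.4f}")
```

With all weights at zero every prediction is 0.5, so the first recorded cost equals log(2). The later costs should be lower.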

Part 2: Feature Extraction

For each tweet, we extract two features: the total positive count and the total negative count of its words, looked up in the frequency dictionary. We train the model on these features and finally validate it on the test set.

The implementation is as follows:

def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0,0] = 1

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        if (word, 1) in freqs:
            x[0,1] += freqs[(word, 1)]

        # increment the word count for the negative label 0
        if (word, 0) in freqs:
            x[0,2] += freqs[(word, 0)]

    ### END CODE HERE ###
    assert(x.shape == (1, 3))
    return x
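
A small worked example may make the feature vector more concrete. The sketch below skips process_tweet() and starts from tokens, so `extract_features_from_tokens` is a hypothetical variant. The counts for "sad" reuse the example values from the steps, and the count for "happi" is made up.

```python
import numpy as np


def extract_features_from_tokens(tokens, freqs):
    """Hypothetical variant of extract_features that takes tokens directly."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in tokens:
        x[0, 1] += freqs.get((word, 1), 0)  # count under the positive label
        x[0, 2] += freqs.get((word, 0), 0)  # count under the negative label
    return x


# Toy frequency dictionary for illustration
toy_freqs = {("sad", 0): 6, ("sad", 1): 1, ("happi", 1): 8}
features = extract_features_from_tokens(["sad", "happi", "unseen"], toy_freqs)
print(features)  # [[1. 9. 6.]]
```

The positive feature is 1 + 8 = 9 and the negative feature is 6. The word "unseen" is not in the dictionary, so it contributes nothing.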

Part 3: Training the Model

We use the functions written above to train the model.

# collect the features of every training tweet into one (m, 3) matrix
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

The results are as follows:

[[0.24216529]]
The cost after training is 0.24216529.
The resulting vector of weights is [7e-08, 0.0005239, -0.00055517]

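Part 4: Testing the Model

  • For a tweet, first extract its features to obtain x
  • Multiply x by the corresponding weights
  • Finally, apply sigmoid() to get the prediction, $y_{pred} = sigmoid(\mathbf{x} \cdot \theta)$
  • If the result is > 0.5, the tweet is classified as positive; otherwise it is classified as negative.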
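
The steps above can be sketched with the rounded weights printed in the training results. The two feature vectors are hypothetical, chosen only to show one tweet dominated by positive counts and one dominated by negative counts.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Weights as printed in the training results (rounded)
theta = np.array([[7e-08], [0.0005239], [-0.00055517]])

# Hypothetical feature vectors: [bias, positive count, negative count]
x_pos = np.array([[1.0, 3000.0, 100.0]])
x_neg = np.array([[1.0, 100.0, 3000.0]])

for name, x in (("x_pos", x_pos), ("x_neg", x_neg)):
    prob = sigmoid(x @ theta).item()
    label = "positive" if prob > 0.5 else "negative"
    print(f"{name}: p={prob:.3f} -> {label}")
```

Because the weight on the positive count is positive and the weight on the negative count is negative, whichever count dominates pushes the probability to its side of 0.5.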

The implementation is as follows:

def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))

    ### END CODE HERE ###

    return y_pred

Finally, we write a test function that evaluates the model by its accuracy on the test set.

def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)  # positive
        else:
            # append 0 to the list
            y_hat.append(0)  # negative

    len_y = len(y_hat)  # m, the total number of test tweets

    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert y_hat to an array and compare it with test_y element by element

    y_hat = np.array(y_hat)

    correct = 0
    print(y_hat)
    for i in range(len_y):
        if y_hat[i] == test_y[i][0]:  # compare prediction and label one by one
            correct += 1

    accuracy = correct / len_y

    ### END CODE HERE ###

    return accuracy
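
The element-by-element comparison can also be written without an explicit loop. The sketch below is one possible vectorized form, not the assignment's reference solution. The name `accuracy_sketch` and the toy probabilities and labels are assumptions.

```python
import numpy as np


def accuracy_sketch(y_prob, test_y):
    """y_prob: (m, 1) predicted probabilities; test_y: (m, 1) labels in {0, 1}."""
    # threshold at 0.5, then flatten both arrays to shape (m,)
    y_hat = (np.squeeze(y_prob, axis=1) > 0.5).astype(float)
    return float(np.mean(y_hat == np.squeeze(test_y, axis=1)))


y_prob = np.array([[0.9], [0.2], [0.6], [0.4]])
test_y = np.array([[1.0], [0.0], [0.0], [0.0]])
print(accuracy_sketch(y_prob, test_y))  # 0.75
```

Three of the four toy predictions match their labels, so the accuracy is 3 / 4.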