
Andrew Ng's NLP Course: C1_W2_Assignment

2021/07/25


Task:

Perform sentiment analysis on tweets with a Naive Bayes classifier.

Part 1: Processing the Data

First, preprocess the data: strip out noise (retweet marks, hyperlinks, hashtag signs), tokenize each tweet, and reduce the remaining words to their stems.

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

result: ['hello', 'great', 'day', ':)', 'good', 'morn']
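
process_tweet is provided by the course's utils module. A minimal sketch of an equivalent cleaner, assuming NLTK with the stopwords corpus downloaded (nltk.download('stopwords')):

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    # remove retweet marks, hyperlinks, and the '#' symbol
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)

    # lowercase, strip @handles, shorten repeated characters
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

    # drop stopwords and punctuation, then stem what remains
    stemmer = PorterStemmer()
    stopwords_english = set(stopwords.words('english'))
    return [stemmer.stem(w) for w in tokenizer.tokenize(tweet)
            if w not in stopwords_english and w not in string.punctuation]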

Part 1.1: Counting Word Occurrences

Write the function count_tweets to count how often each (word, sentiment) pair occurs across the tweets.

def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word, y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1
            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1
    ### END CODE HERE ###

    return result
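
A quick check on a hypothetical toy input (the keys are stemmed, and stopwords like "i"/"am" are dropped by process_tweet):

freqs = count_tweets({}, ['i am happy', 'i am tricked', 'i am sad'], [1, 0, 0])
# expected: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1}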

Part 2: Training a Model with Naive Bayes

First, we determine what fraction of our data is positive and what fraction is negative (the class proportions):

$$P(D_{pos}) = \frac{D_{pos}}{D}, \qquad P(D_{neg}) = \frac{D_{neg}}{D}$$

where $D$ is the total number of tweets and $D_{pos}$, $D_{neg}$ are the counts of positive and negative tweets.

Next, we compute the log prior: if we blindly pick a tweet, how likely is it to be positive rather than negative?

$$\text{logprior} = \log \frac{P(D_{pos})}{P(D_{neg})} = \log D_{pos} - \log D_{neg}$$

Then, for every word in the vocabulary, we compute the probability of that word appearing in each class:

$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}, \qquad P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}$$

  • $V$ is the number of unique words in the vocabulary
  • $freq_{pos}$ and $freq_{neg}$ are the word's counts in positive and negative tweets; $N_{pos}$ and $N_{neg}$ are the total word counts in each class
  • the +1 avoids zero probabilities (Laplace smoothing)

Finally, we compute the word's log likelihood:

$$\text{loglikelihood}(w) = \log \frac{P(W_{pos})}{P(W_{neg})}$$

If the word is more likely under the positive class than the negative one, the result is > 0; otherwise it is < 0.
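
For example (made-up numbers), suppose a word has $freq_{pos} = 2$ and $freq_{neg} = 9$, with $N_{pos} = N_{neg} = 100$ and $V = 50$. Then $P(W_{pos}) = 3/150 = 0.02$, $P(W_{neg}) = 10/150 \approx 0.067$, and the log likelihood is $\log(0.02/0.067) \approx -1.2$, so the word leans negative.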

The implementation:

import numpy as np

def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior (see the formula above)
        loglikelihood: a dictionary mapping each word to its log likelihood (see the formula above)
    '''
    loglikelihood = {}
    logprior = 0

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]
        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word, label) pair
            N_neg += freqs[pair]

    # calculate D, the number of documents
    D = len(train_y)

    # calculate D_pos, the number of positive documents
    D_pos = 0
    for num in train_y:
        if num == 1:
            D_pos += 1

    # calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        if (word, 1) in freqs:
            freq_pos = freqs[(word, 1)]
        else:
            freq_pos = 0

        if (word, 0) in freqs:
            freq_neg = freqs[(word, 0)]
        else:
            freq_neg = 0

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    ### END CODE HERE ###

    return logprior, loglikelihood

Tips:

  • $N_{pos}$ and $N_{neg}$ are computed from the data in freqs
  • $D_{pos}$ and $D_{neg}$ are computed from the data in train_y
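
With those pieces in place, training looks like this (a sketch, assuming train_x and train_y are the tweet list and label array from the course notebook):

freqs = count_tweets({}, train_x, train_y)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)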

Part 3: Making Predictions and Testing the Model

We score a tweet's sentiment with the formula below (> 0 means positive, < 0 means negative):

$$p = \text{logprior} + \sum_{i} \text{loglikelihood}(w_i)$$

where the sum runs over the words $w_i$ of the processed tweet. The implementation:

def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize probability to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]

    ### END CODE HERE ###

    return p

Next, we compare the predictions against the true labels to compute the error rate and, from it, the accuracy.

def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: a list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    accuracy = 0  # return this properly

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # the error is the mean absolute difference between y_hats and test_y
    # (test_y is a numpy array, so the subtraction broadcasts over the list)
    error = np.mean(np.absolute(y_hats - test_y))

    # accuracy is 1 minus the error
    accuracy = 1 - error

    ### END CODE HERE ###

    return accuracy
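
Evaluating on the held-out set is then a one-liner (assuming the notebook's test_x and test_y split):

print('accuracy =', test_naive_bayes(test_x, test_y, logprior, loglikelihood))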

Part 4: Computing Each Word's Positive/Negative Ratio

The formula:

$$\text{ratio} = \frac{freq_{pos} + 1}{freq_{neg} + 1}$$

The implementation:

def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary containing the words
        word: string to look up
    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # use lookup() to find positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = lookup(freqs, word, 1)

    # use lookup() to find negative counts for the word (denoted by the integer 0)
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1) / (pos_neg_ratio['negative'] + 1)
    ### END CODE HERE ###
    return pos_neg_ratio
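
lookup comes from the course's utils module; an equivalent helper might look like this (a sketch):

def lookup(freqs, word, label):
    # count of the (word, label) pair, or 0 if the pair was never seen
    return freqs.get((word, label), 0)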

By setting a threshold on this ratio, we can find the words that meet a condition:

def get_words_by_threshold(freqs, label, threshold):
    '''
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
        word_list: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
        Example of a key value pair:
        {'happi':
            {'positive': 10, 'negative': 20, 'ratio': 0.5}
        }
    '''
    word_list = {}

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    for key in freqs.keys():
        word, _ = key

        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)

        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio["ratio"] >= threshold:
            # add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # if the label is 0 and the ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio["ratio"] <= threshold:
            # add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # otherwise, do not include this word in the list (do nothing)

    ### END CODE HERE ###
    return word_list
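
Hypothetical calls (the threshold values are just for illustration):

# strongly positive words: positive/negative ratio of at least 10
get_words_by_threshold(freqs, label=1, threshold=10)

# strongly negative words: positive/negative ratio of at most 0.05
get_words_by_threshold(freqs, label=0, threshold=0.05)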

Part 5: Error Analysis

Let's look at what the tweets the model misclassified have in common:

Truth Predicted Tweet
1 0 b’’
1 0 b’truli later move know queen bee upward bound movingonup’
1 0 b’new report talk burn calori cold work harder warm feel better weather :p’
1 0 b’harri niall 94 harri born ik stupid wanna chang :D’
1 0 b’park get sunlight’
0 1 b’hello info possibl interest jonatha close join beti :( great’
0 1 b’u prob fun david’
0 1 b’pat jay’
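
A loop like the following could produce the table above (a sketch; the b'...' strings come from ascii-encoding the joined tokens of each processed tweet):

print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    if y != (y_hat > 0):
        # show the processed tokens of each misclassified tweet
        print('%d\t%d\t%s' % (y, y_hat > 0,
                              ' '.join(process_tweet(x)).encode('ascii', 'ignore')))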

Part 6: Testing Your Own Tweet

We can write any tweet we like and predict its sentiment:

my_tweet = 'I am happy because I am learning :)'

p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(p)

result: 9.533333142270019 (a positive score)