Keshawn_lu's Blog

吴恩达团队NLP C2_W2_Assignment

字数统计: 1.7k阅读时长: 9 min
2021/08/08 Share

吴恩达团队NLP C2_W2_Assignment

任务:词性标注(POS)

Part1:Training

1.1 构建 transition_counts, emission_counts, tag_counts

  • transition_counts 计算的是每个tag与另一个tag相邻的次数 → (prev_tag, tag), 便于之后计算$P(ti |t{i-1})$
  • emission_counts 计算的是word与相应tag出现的次数 → (tag, word),便于之后计算$P(w_i|t_i)$
  • tag_counts 计算的是每个tag出现的次数
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
def create_dictionaries(training_corpus, vocab):
"""
Input:
training_corpus: a corpus where each line has a word followed by its tag.
vocab: a dictionary where keys are words in vocabulary and value is an index
Output:
emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
tag_counts: a dictionary where the keys are the tags and the values are the counts
"""

#初始化为defaultdict
emission_counts = defaultdict(int)
transition_counts = defaultdict(int)
tag_counts = defaultdict(int)

# Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
prev_tag = '--s--'

i = 0 # 统计总数

for word_tag in training_corpus:

i += 1

# Every 50,000 words, print the word count
if i % 50000 == 0:
print(f"word count = {i}")

word, tag = get_word_tag (word_tag, vocab)

# Increment the transition count for the previous word and tag
transition_counts[(prev_tag, tag)] += 1

# Increment the emission count for the tag and word
emission_counts[(tag, word)] += 1

# Increment the tag count
tag_counts[tag] += 1

# Set the previous tag to this tag (for the next iteration of the loop)
prev_tag = tag


return emission_counts, transition_counts, tag_counts

1.2 预测pos,并计算准确度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
def predict_pos(prep, y, emission_counts, vocab, states):
'''
Input:
prep: a preprocessed version of 'y'. A list with the 'word' component of the tuples.
y: a corpus composed of a list of tuples where each tuple consists of (word, POS)
emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count
vocab: a dictionary where keys are words in vocabulary and value is an index
states: a sorted list of all possible tags for this assignment
Output:
accuracy: Number of times you classified a word correctly
'''

num_correct = 0

# (tag, word)
all_words = set(emission_counts.keys())

total = len(y)
for word, y_tup in zip(prep, y):

# Split the (word, POS) string into a list of two items
y_tup_l = y_tup.split()

# Verify that y_tup contain both word and POS
if len(y_tup_l) == 2:

# Set the true POS label for this word
true_label = y_tup_l[1]

else:
continue

count_final = 0
pos_final = ''

if word in vocab:
for pos in states:

# define the key as the tuple containing the POS and word
key = (pos, word)

# check if the (pos, word) key exists in the emission_counts dictionary
if key in emission_counts:
count = emission_counts[key]

if count > count_final:
count_final = count
pos_final = pos

# If the final POS (with the largest count) matches the true POS:
if pos_final == true_label:
num_correct += 1

accuracy = num_correct / total

return accuracy

Part2:pos的隐马尔科夫模型

2.1 构建A矩阵

  • A:转移概率模型(使用transition_counts
  • 并做平滑处理,最终公式如下:
  • N:tag的总数
  • $C(t{i-1}, t{i})$:(prev_pos, current_pos)transition_counts中的数量
  • $C(t_{i-1})$:pre_postag_counts中的数量
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def create_transition_matrix(alpha, tag_counts, transition_counts):
'''
Input:
alpha: number used for smoothing
tag_counts: a dictionary mapping each tag to its respective count
transition_counts: transition count for the previous word and tag
Output:
A: matrix of dimension (num_tags,num_tags)
'''

all_tags = sorted(tag_counts.keys())

num_tags = len(all_tags)

# 初始化A
A = np.zeros((num_tags,num_tags))

# 得到(previous POS, current POS)
trans_keys = set(transition_counts.keys())

# 遍历A的行
for i in range(num_tags):

# 遍历A的列
for j in range(num_tags):

# 初始化(prev POS, current POS)
count = 0

# (prev POS, current POS)
key = (all_tags[i], all_tags[j])

if key in transition_counts.keys():
count = transition_counts[key]

count_prev_tag = tag_counts[key[0]]

A[i,j] = (count + alpha) / (count_prev_tag + alpha * num_tags)


return A

2.2 构建B矩阵

  • B:发射概率模型(使用emission_counts
  • 平滑处理,最终公式如下:
  • $C(t_i, word_i)$:(tag, word)emission_counts 中的相应数量
  • $C(t_{i})$:$tag_i$在tag_counts中的数量
  • N:单词的总数
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
'''
Input:
alpha: tuning parameter used in smoothing
tag_counts: a dictionary mapping each tag to its respective count
emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
vocab: a dictionary where keys are words in vocabulary and value is an index.
within the function it'll be treated as a list
Output:
B: a matrix of dimension (num_tags, len(vocab))
'''

num_tags = len(tag_counts)

all_tags = sorted(tag_counts.keys())

num_words = len(vocab)

# 初始化B
B = np.zeros((num_tags, num_words))

# (POS, word)
emis_keys = set(list(emission_counts.keys()))

# 行
for i in range(num_tags):

# 列
for j in range(num_words):

# 初始化(POS tag, word)数量
count = 0

# (POS tag, word)
key = (all_tags[i], vocab[j])

if key in emission_counts.keys():
count = emission_counts[key]

count_tag = tag_counts[key[0]]

B[i,j] = (count + alpha) / (count_tag + alpha * num_words)

return B

Part3:维特比算法

3.1 初始化矩阵

  • best_probs:每个单元格为一个pos_tag到一个word的概率
  • best_paths:目的是为了找到最佳路径
  • 从起始处到特定的索引i,最佳路线的概率为$bestprobs[s_{idx}, i]$
  • 为了避免一些问题,我们使用log的加法来取代乘法,并且当$A[s_{idx}, i] == 0$时,我们将其设置为-∞,即log(0)

综上,初始化方程为:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def initialize(states, tag_counts, A, B, corpus, vocab):
'''
Input:
states: a list of all possible parts-of-speech
tag_counts: a dictionary mapping each tag to its respective count
A: Transition Matrix of dimension (num_tags, num_tags)
B: Emission Matrix of dimension (num_tags, len(vocab))
corpus: a sequence of words whose POS is to be identified in a list
vocab: a dictionary where keys are words in vocabulary and value is an index
Output:
best_probs: matrix of dimension (num_tags, len(corpus)) of floats
best_paths: matrix of dimension (num_tags, len(corpus)) of integers
'''

num_tags = len(tag_counts)

# 初始化best_probs矩阵
best_probs = np.zeros((num_tags, len(corpus)))

# 初始化best_paths矩阵
best_paths = np.zeros((num_tags, len(corpus)), dtype=int)

# 定义起始处
s_idx = states.index("--s--")

# 遍历每个pos_tag
for i in range(num_tags):

if A[s_idx, i] == 0:
best_probs[i,0] = float('-inf')
else:
best_probs[i,0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])

return best_probs, best_paths

3.2 维特比 Forward

  • 对于每个单元格:$prob=𝐛𝐞𝐬𝐭𝐩𝐫𝐨𝐛{𝑘,𝑖−1}+log(𝐀{𝑘,𝑗})+log(𝐁{𝑗,𝑣𝑜𝑐𝑎𝑏(𝑐𝑜𝑟𝑝𝑢𝑠𝑖)})$
  • 然后从中选出概率最大的作为最后的结果
  • $\mathrm{path} = k$

https://pic.imgdb.cn/item/61079ac95132923bf85227e4.png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
'''
Input:
A, B: The transition and emission matrices respectively
test_corpus: a list containing a preprocessed corpus
best_probs: an initilized matrix of dimension (num_tags, len(corpus))
best_paths: an initilized matrix of dimension (num_tags, len(corpus))
vocab: a dictionary where keys are words in vocabulary and value is an index
Output:
best_probs: a completed matrix of dimension (num_tags, len(corpus))
best_paths: a completed matrix of dimension (num_tags, len(corpus))
'''

num_tags = best_probs.shape[0]

# 从word_1开始遍历
for i in range(1, len(test_corpus)):

if i % 5000 == 0:
print("Words processed: {:>8}".format(i))

# 对于每个word,遍历tag的可能性
for j in range(num_tags):

# 初始化
best_prob_i = float('-inf')

# 初始化
best_path_i = None

# For each POS tag that the previous word can be:
for k in range(num_tags):
prob = best_probs[k, i - 1] + math.log(A[k, j]) + math.log(B[j, vocab[test_corpus[i]]])

if prob > best_prob_i:
best_prob_i = prob
best_path_i = k

# 记录该tag -> word 的最佳概率
best_probs[j,i] = best_prob_i

# 将前一个POS标记的唯一整数ID保存到best_paths矩阵中
best_paths[j,i] = best_path_i

return best_probs, best_paths

3.3 维特比 Backward

https://pic.imgdb.cn/item/61079a025132923bf84ef241.png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def viterbi_backward(best_probs, best_paths, corpus, states):
'''
This function returns the best path.

'''
m = best_paths.shape[1]

# Initialize array z, same length as the corpus
z = [None] * m

num_tags = best_probs.shape[0]

# 初始化
best_prob_for_last_word = float('-inf')

# Initialize pred array, same length as corpus
pred = [None] * m

## Step 1 ##

# 寻找概率最大的行
for k in range(num_tags):

if best_probs[k, -1] > best_prob_for_last_word: # 从最后一列开始

best_prob_for_last_word = best_probs[k, -1]
z[m - 1] = k

# 通过行号K转化为tag
pred[m - 1] = states[k]

## Step 2 ##
for i in range(len(corpus) - 1, -1, -1): # 从后往前

pos_tag_for_word_i = best_paths[np.argmax(best_probs[:,i]),i]

z[i - 1] = best_paths[pos_tag_for_word_i, i]

pred[i - 1] =states[pos_tag_for_word_i]

return pred

3.4 计算准确率

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def compute_accuracy(pred, y):
'''
Input:
pred: a list of the predicted parts-of-speech
y: a list of lines where each word is separated by a '\t' (i.e. word \t tag)
Output:

'''
num_correct = 0
total = 0

# Zip together the prediction and the labels
for prediction, y in zip(pred, y):

# (word, tag)
word_tag_tuple = y.split()

if len(word_tag_tuple) < 2:
continue

word, tag = word_tag_tuple

if prediction == tag:
num_correct += 1

total += 1

return num_correct/total

tips:

CATALOG
  1. 1. 吴恩达团队NLP C2_W2_Assignment
    1. 1.1. 任务:词性标注(POS)
    2. 1.2. Part1:Training
      1. 1.2.1. 1.1 构建 transition_counts, emission_counts, tag_counts
      2. 1.2.2. 1.2 预测pos,并计算准确度
    3. 1.3. Part2:pos的隐马尔科夫模型
      1. 1.3.1. 2.1 构建A矩阵
      2. 1.3.2. 2.2 构建B矩阵
    4. 1.4. Part3:维特比算法
      1. 1.4.1. 3.1 初始化矩阵
      2. 1.4.2. 3.2 维特比 Forward
      3. 1.4.3. 3.3 维特比 Backward
      4. 1.4.4. 3.4 计算准确率