Keshawn_lu's Blog

吴恩达团队NLP C2_W1_Assignment

字数统计: 1.9k阅读时长: 10 min
2021/08/08 Share

吴恩达团队NLP C2_W1_Assignment

任务:自动校正单词拼写

Part1:处理数据

将字符串数据读入,并将字符串都转化为小写,最后通过正则表达式将其中的单词找出来。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def process_data(file_name):
"""
Input:
A file_name which is found in your current directory. You just have to read it in.
Output:
words: a list containing all the words in the corpus (text file you read) in lower case.
"""
words = [] # return this variable correctly

with open(file_name, 'r') as f:
data = f.read()

data = data.lower()

words = re.findall('\w+', data) # \w匹配字符 +一次或多次

return words

1.1 获得每个单词的出现次数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def get_count(word_l):
'''
Input:
word_l: a set of words representing the corpus.
Output:
word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
'''

word_count_dict = {} # fill this with word counts

# for word in word_l:
# if word not in word_count_dict.keys():
# word_count_dict[word] = 1
# else:
# word_count_dict[word] += 1


word_count_dict = Counter(word_l)


return word_count_dict

1.2 获得每个单词的出现频率

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def get_probs(word_count_dict):
'''
Input:
word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
Output:
probs: A dictionary where keys are the words and the values are the probability that a word will occur.
'''
probs = {} # return this variable correctly

M = sum(word_count_dict.values())

for word in word_count_dict.items():
probs[word[0]] = word[1] / M


return probs

Part2:修改字符串,来实现错误单词的修正

其中包括:

  • 删除某个单词
  • 更换单词中两个相邻字母
  • 更换某个字母
  • 插入某个字母

2.1 删除某个字符

  1. 首先获得单词可切割的位置
  2. 通过这些可切割的位置得到删除某个字母后的单词结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def delete_letter(word, verbose=False):
'''
Input:
word: the string/word for which you will generate all possible words
in the vocabulary which have 1 missing character
Output:
delete_l: a list of all possible strings obtained by deleting 1 character from word
'''

delete_l = []
split_l = []

for i in range(len(word)):
split_l.append((word[:i], word[i:]))

for word in split_l:
delete_l.append(word[0] + word[1][1:])

if verbose: print(f"input word {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

return delete_l

2.2 更换相邻的两个字符

  1. 首先获得单词可切割的位置
  2. 再通过切割位置的前后部分来获得可更换的字符
  3. 最后获得更换后的新单词
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
def switch_letter(word, verbose=False):
'''
Input:
word: input string
Output:
switches: a list of all possible strings with one adjacent charater switched
'''

switch_l = []
split_l = []

for i in range(len(word)):
split_l.append((word[:i], word[i:]))

for item in split_l:
if item[0]: # 即前半部分的最后一个字母与后半部分的第一个字母进行交换
switch_l.append(item[0][:-1] + item[1][0] + item[0][-1] + item[1][1:])


if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}")

return switch_l

2.3 更换单词中的某个字母

  1. 首先获得单词可切割的位置
  2. 并通过切割位置来获得可替换的字母位置
  3. 最后获得新单词
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def replace_letter(word, verbose=False):
'''
Input:
word: the input string/word
Output:
replaces: a list of all possible strings where we replaced one letter from the original word.
'''

letters = 'abcdefghijklmnopqrstuvwxyz'
replace_l = []
split_l = []


for i in range(len(word)):
split_l.append((word[:i], word[i:]))

for item in split_l:
for c in letters:
new_word = item[0] + c + item[1][1:]

if new_word != word:
replace_l.append(new_word)

replace_set = set(replace_l)

# turn the set back into a list and sort it, for easier viewing
replace_l = sorted(list(replace_set))

if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")

return replace_l

2.4 插入新字母

  1. 首先获得单词可切割的位置
  2. 并通过切割位置来获得可插入新字母的位置
  3. 最后获得新单词
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def insert_letter(word, verbose=False):
'''
Input:
word: the input string/word
Output:
inserts: a set of all possible strings with one new letter inserted at every offset
'''
letters = 'abcdefghijklmnopqrstuvwxyz'
insert_l = []
split_l = []

for i in range(len(word) + 1):
split_l.append((word[:i], word[i:]))


for item in split_l:
for c in letters:
new_word = item[0] + c + item[1]
insert_l.append(new_word)


if verbose: print(f"Input word {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")

return insert_l

Part3:组合修改函数

将上述的几个函数组合起来,获得一个单词修改一个字母后的结果组合。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def edit_one_letter(word, allow_switches = True):
"""
Input:
word: the string/word for which we will generate all possible wordsthat are one edit away.
Output:
edit_one_set: a set of words with one possible edit. Please return a set. and not a list.
"""

edit_one_set = set()

delete_set = set(delete_letter(word))
insert_set = set(insert_letter(word))
replace_set = set(replace_letter(word))

edit_one_set.update(delete_set)
edit_one_set.update(insert_set)
edit_one_set.update(replace_set)

if allow_switches == True:
switch_set = switch_letter(word)
edit_one_set.update(switch_set)

return edit_one_set

3.2 在修改一个字母的基础上,再进行修改,便获得了修改两个字母的结果组合

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def edit_two_letters(word, allow_switches = True):
'''
Input:
word: the input string/word
Output:
edit_two_set: a set of strings with all possible two edits
'''

edit_two_set = set()

edit_one_set = edit_one_letter(word, allow_switches)

for word in edit_one_set:
delete_set = set(delete_letter(word))
insert_set = set(insert_letter(word))
replace_set = set(replace_letter(word))

edit_two_set.update(delete_set)
edit_two_set.update(insert_set)
edit_two_set.update(replace_set)

if allow_switches == True:
switch_set = switch_letter(word)
edit_two_set.update(switch_set)


return edit_two_set

3.3 给出修改建议

我们按照以下的顺序来获得修改后单词的结果组合:

  • 原始单词已在单词表中
  • 修改一个字母后的单词组合中有单词在单词表中
  • 修改两个个字母后的单词组合中有单词在单词表中
  • 啥也不管,依然返回原始单词

获得结果组合后,我们将按单词概率进行排序,最后返回最有可能的n个单词:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
def get_corrections(word, probs, vocab, n=2, verbose = False):
'''
Input:
word: a user entered string to check for suggestions
probs: a dictionary that maps each word to its probability in the corpus
vocab: a set containing all the vocabulary
n: number of possible word corrections you want returned in the dictionary
Output:
n_best: a list of tuples with the most probable n corrected words and their probabilities.
'''

suggestions = []
n_best = []


if word in vocab:
suggestions = [word]
elif edit_one_letter(word).intersection(vocab): # 交集判断是否存在
suggestions = edit_one_letter(word).intersection(vocab)
elif edit_two_letters(word).intersection(vocab):
suggestions = edit_two_letters(word).intersection(vocab)
else:
suggestions = [word]


word_probs = {}
for item in suggestions:
if item not in probs.keys():
word_probs[item] = 0
else:
word_probs[item] = probs[item]

# 按概率从大到小排序
word_probs_order = sorted(word_probs.items(),key=lambda x:x[1],reverse=True)

for item in word_probs_order:
if len(n_best) < n:
n_best.append([item[0], item[1]])

if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)

return n_best

Part4:最小化修改路径

我们想知道一个单词转化成另一个单词需要花费的最小代价是多少,这时候便要通过动态规划的方法来进行计算。

首先进行初始化:

并给出转移方程:

D[i, j]代表源单词的前i个字母转化为结果单词的前j个字母,所需要的最小代价。

https://pic.imgdb.cn/item/6103ade75132923bf8cfed21.png

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
def min_edit_distance(source, target, ins_cost = 1, del_cost = 1, rep_cost = 2):
'''
Input:
source: a string corresponding to the string you are starting with
target: a string corresponding to the string you want to end with
ins_cost: an integer setting the insert cost
del_cost: an integer setting the delete cost
rep_cost: an integer setting the replace cost
Output:
D: a matrix of len(source)+1 by len(target)+1 containing minimum edit distances
med: the minimum edit distance (med) required to convert the source string to the target
'''
# use deletion and insert cost as 1
m = len(source)
n = len(target)
#initialize cost matrix with zeros and dimensions (m+1,n+1)
D = np.zeros((m+1, n+1), dtype=int)

# Fill in column 0, from row 1 to row m, both inclusive
for row in range(1, m + 1): # Replace None with the proper range
D[row,0] = D[row - 1, 0] + del_cost

# Fill in row 0, for all columns from 1 to n, both inclusive
for col in range(1, n + 1): # Replace None with the proper range
D[0,col] = D[0, col - 1] + ins_cost

# Loop through row 1 to row m, both inclusive
for row in range(1, m + 1):

# Loop through column 1 to column n, both inclusive
for col in range(1, n + 1):

# Intialize r_cost to the 'replace' cost that is passed into this function
r_cost = rep_cost

# Check to see if source character at the previous row
# matches the target character at the previous column,
if source[row - 1] == target[col - 1]:
# Update the replacement cost to 0 if source and target are the same
r_cost = 0

# Update the cost at row, col based on previous entries in the cost matrix
# Refer to the equation calculate for D[i,j] (the minimum of three calculated costs)
D[row,col] = min(min(D[row, col - 1] + ins_cost, D[row - 1, col] + del_cost),
D[row - 1, col - 1] + r_cost)

# Set the minimum edit distance with the cost found at row m, column n
med = D[m, n]

return D, med
CATALOG
  1. 1. 吴恩达团队NLP C2_W1_Assignment
    1. 1.1. 任务:自动校正单词拼写
    2. 1.2. Part1:处理数据
      1. 1.2.1. 1.1 获得每个单词的出现次数
      2. 1.2.2. 1.2 获得每个单词的出现频率
    3. 1.3. Part2:修改字符串,来实现错误单词的修正
      1. 1.3.1. 2.1 删除某个字符
      2. 1.3.2. 2.2 更换相邻的两个字符
      3. 1.3.3. 2.3 更换单词中的某个字母
      4. 1.3.4. 2.4 插入新字母
    4. 1.4. Part3:组合修改函数
      1. 1.4.1. 3.2 在修改一个字母的基础上,再进行修改,便获得了修改两个字母的结果组合
      2. 1.4.2. 3.3 给出修改建议
    5. 1.5. Part4:最小化修改路径