吴恩达团队NLP C2_W1_Assignment

字数统计: 1.9k阅读时长: 10 min

 2021/08/08   Share

吴恩达团队NLP C2_W1_Assignment

任务：自动校正单词拼写

Part1：处理数据

将字符串数据读入，并将字符串都转化为小写，最后通过正则表达式将其中的单词找出来。

def process_data(file_name):
    """
    Input: 
        A file_name which is found in your current directory. You just have to read it in. 
    Output: 
        words: a list containing all the words in the corpus (text file you read) in lower case. 
    """
    words = [] # return this variable correctly 
    
    with open(file_name, 'r') as f:
        data = f.read()
        
    data = data.lower()
    
    words = re.findall('\w+', data) # \w匹配字符   +一次或多次
    
    return words

1.1 获得每个单词的出现次数

def get_count(word_l):
    '''
    Input:
        word_l: a set of words representing the corpus. 
    Output:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    '''
    
    word_count_dict = {}  # fill this with word counts
    
#     for word in word_l:
#         if word not in word_count_dict.keys():
#             word_count_dict[word] = 1
#         else:
#             word_count_dict[word] += 1
            
    
    word_count_dict = Counter(word_l)
    
    
    return word_count_dict

1.2 获得每个单词的出现频率

def get_probs(word_count_dict):
    '''
    Input:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    Output:
        probs: A dictionary where keys are the words and the values are the probability that a word will occur. 
    '''
    probs = {}  # return this variable correctly
    
    M = sum(word_count_dict.values())
    
    for word in word_count_dict.items():
        probs[word[0]] = word[1] / M
    
    
    return probs

Part2：修改字符串，来实现错误单词的修正

其中包括：

删除某个单词
更换单词中两个相邻字母
更换某个字母
插入某个字母

2.1 删除某个字符

首先获得单词可切割的位置
通过这些可切割的位置得到删除某个字母后的单词结果

def delete_letter(word, verbose=False):
    '''
    Input:
        word: the string/word for which you will generate all possible words 
                in the vocabulary which have 1 missing character
    Output:
        delete_l: a list of all possible strings obtained by deleting 1 character from word
    '''
    
    delete_l = []
    split_l = []
    
    for i in range(len(word)):
        split_l.append((word[:i], word[i:]))
    
    for word in split_l:
        delete_l.append(word[0] + word[1][1:])

    if verbose: print(f"input word {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return delete_l

2.2 更换相邻的两个字符

首先获得单词可切割的位置
再通过切割位置的前后部分来获得可更换的字符
最后获得更换后的新单词

def switch_letter(word, verbose=False):
    '''
    Input:
        word: input string
     Output:
        switches: a list of all possible strings with one adjacent charater switched
    ''' 
    
    switch_l = []
    split_l = []
    
    for i in range(len(word)):
        split_l.append((word[:i], word[i:]))
        
    for item in split_l:
        if item[0]: # 即前半部分的最后一个字母与后半部分的第一个字母进行交换
            switch_l.append(item[0][:-1] + item[1][0] + item[0][-1] + item[1][1:])
    
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}") 

    return switch_l

2.3 更换单词中的某个字母

首先获得单词可切割的位置
并通过切割位置来获得可替换的字母位置
最后获得新单词

def replace_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        replaces: a list of all possible strings where we replaced one letter from the original word. 
    ''' 
    
    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    split_l = []

    
    for i in range(len(word)):
        split_l.append((word[:i], word[i:]))
    
    for item in split_l:
        for c in letters:
            new_word = item[0] + c + item[1][1:]
            
            if new_word != word:
                replace_l.append(new_word)
    
    replace_set = set(replace_l)
    
    # turn the set back into a list and sort it, for easier viewing
    replace_l = sorted(list(replace_set))
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")   
    
    return replace_l

2.4 插入新字母

首先获得单词可切割的位置
并通过切割位置来获得可插入新字母的位置
最后获得新单词

def insert_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        inserts: a set of all possible strings with one new letter inserted at every offset
    ''' 
    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []
    
    for i in range(len(word) + 1):
        split_l.append((word[:i], word[i:]))
    
    
    for item in split_l:
        for c in letters:
            new_word = item[0] + c + item[1]
            insert_l.append(new_word)
    

    if verbose: print(f"Input word {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")
    
    return insert_l

Part3：组合修改函数

将上述的几个函数组合起来，获得一个单词修改一个字母后的结果组合。

def edit_one_letter(word, allow_switches = True):
    """
    Input:
        word: the string/word for which we will generate all possible wordsthat are one edit away.
    Output:
        edit_one_set: a set of words with one possible edit. Please return a set. and not a list.
    """
    
    edit_one_set = set()
    
    delete_set = set(delete_letter(word))
    insert_set = set(insert_letter(word))
    replace_set = set(replace_letter(word))
    
    edit_one_set.update(delete_set)
    edit_one_set.update(insert_set)
    edit_one_set.update(replace_set)
    
    if allow_switches == True:
        switch_set = switch_letter(word)
        edit_one_set.update(switch_set)

    return edit_one_set

3.2 在修改一个字母的基础上，再进行修改，便获得了修改两个字母的结果组合

def edit_two_letters(word, allow_switches = True):
    '''
    Input:
        word: the input string/word 
    Output:
        edit_two_set: a set of strings with all possible two edits
    '''
    
    edit_two_set = set()
    
    edit_one_set = edit_one_letter(word, allow_switches)
    
    for word in edit_one_set:
        delete_set = set(delete_letter(word))
        insert_set = set(insert_letter(word))
        replace_set = set(replace_letter(word))
    
        edit_two_set.update(delete_set)
        edit_two_set.update(insert_set)
        edit_two_set.update(replace_set)
    
        if allow_switches == True:
            switch_set = switch_letter(word)
            edit_two_set.update(switch_set)
    
    
    return edit_two_set

3.3 给出修改建议

我们按照以下的顺序来获得修改后单词的结果组合：

原始单词已在单词表中
修改一个字母后的单词组合中有单词在单词表中
修改两个个字母后的单词组合中有单词在单词表中
啥也不管，依然返回原始单词

获得结果组合后，我们将按单词概率进行排序，最后返回最有可能的n个单词：

def get_corrections(word, probs, vocab, n=2, verbose = False):
    '''
    Input: 
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned in the dictionary
    Output: 
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
    '''
    
    suggestions = []
    n_best = []
    
    
    if word in vocab:
        suggestions = [word]
    elif edit_one_letter(word).intersection(vocab): # 交集判断是否存在
        suggestions = edit_one_letter(word).intersection(vocab)
    elif edit_two_letters(word).intersection(vocab):
        suggestions = edit_two_letters(word).intersection(vocab)
    else:
        suggestions = [word]
        
    
    word_probs = {}
    for item in suggestions:
        if item not in probs.keys():
            word_probs[item] = 0
        else:
            word_probs[item] = probs[item]
    
    # 按概率从大到小排序
    word_probs_order = sorted(word_probs.items(),key=lambda x:x[1],reverse=True)
    
    for item in word_probs_order:
        if len(n_best) < n:
            n_best.append([item[0], item[1]])
    
    if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)

    return n_best

Part4：最小化修改路径

我们想知道一个单词转化成另一个单词需要花费的最小代价是多少，这时候便要通过动态规划的方法来进行计算。

首先进行初始化：

$\text{Initialization}\\ \begin{align}D[0,0] &= 0 \\D[i,0] &= D[i-1,0] + del\_cost(source[i]) \\D[0,j] &= D[0,j-1] + ins\_cost(target[j]) \\\end{align}$

并给出转移方程：

$\text{Per Cell Operations}\\\begin{align} \\D[i,j] =min\begin{cases}D[i-1,j] + del\_cost\\D[i,j-1] + ins\_cost\\D[i-1,j-1] + \left\{\begin{matrix}rep\_cost; & if src[i]\neq tar[j]\\0 ; & if src[i]=tar[j]\end{matrix}\right.\end{cases}\end{align}$

D[i, j]代表源单词的前i个字母转化为结果单词的前j个字母，所需要的最小代价。

def min_edit_distance(source, target, ins_cost = 1, del_cost = 1, rep_cost = 2):
    '''
    Input: 
        source: a string corresponding to the string you are starting with
        target: a string corresponding to the string you want to end with
        ins_cost: an integer setting the insert cost
        del_cost: an integer setting the delete cost
        rep_cost: an integer setting the replace cost
    Output:
        D: a matrix of len(source)+1 by len(target)+1 containing minimum edit distances
        med: the minimum edit distance (med) required to convert the source string to the target
    '''
    # use deletion and insert cost as  1
    m = len(source) 
    n = len(target) 
    #initialize cost matrix with zeros and dimensions (m+1,n+1) 
    D = np.zeros((m+1, n+1), dtype=int) 
    
    # Fill in column 0, from row 1 to row m, both inclusive
    for row in range(1, m + 1): # Replace None with the proper range
        D[row,0] = D[row - 1, 0] + del_cost
        
    # Fill in row 0, for all columns from 1 to n, both inclusive
    for col in range(1, n + 1): # Replace None with the proper range
        D[0,col] = D[0, col - 1] + ins_cost
        
    # Loop through row 1 to row m, both inclusive
    for row in range(1, m + 1): 
        
        # Loop through column 1 to column n, both inclusive
        for col in range(1, n + 1):
            
            # Intialize r_cost to the 'replace' cost that is passed into this function
            r_cost = rep_cost
            
            # Check to see if source character at the previous row
            # matches the target character at the previous column, 
            if source[row - 1] == target[col - 1]:
                # Update the replacement cost to 0 if source and target are the same
                r_cost = 0
                
            # Update the cost at row, col based on previous entries in the cost matrix
            # Refer to the equation calculate for D[i,j] (the minimum of three calculated costs)
            D[row,col] = min(min(D[row, col - 1] + ins_cost, D[row - 1, col] + del_cost), 
                             D[row - 1, col - 1] + r_cost)
          
    # Set the minimum edit distance with the cost found at row m, column n
    med = D[m, n]
    
    return D, med

Next Post

吴恩达团队NLP C2_W2_Assignment
Previous Post

吴恩达团队NLP C1_W4_Assignment

CATALOG

1. 吴恩达团队NLP C2_W1_Assignment



Total : 424

2023

2022

2021

2020

2019

缺失模块。
1、请确保node版本大于6.2
2、在博客根目录（注意不是archer根目录）执行以下命令：
npm i hexo-generator-json-content --save
3、在根目录_config.yml里添加配置：

jsonContent:
  meta: false
  pages: false
  posts:
    title: true
    date: true
    path: true
    text: false
    raw: false
    content: false
    slug: false
    updated: false
    comments: false
    link: false
    permalink: false
    excerpt: false
    categories: true
    tags: true