吴恩达团队NLP C2_W2_Assignment

吴恩达团队NLP

字数统计: 1.7k阅读时长: 9 min

 2021/08/08   Share

吴恩达团队NLP C2_W2_Assignment

任务：词性标注(POS)

Part1：Training

1.1 构建 `transition_counts, emission_counts, tag_counts`

transition_counts 计算的是每个tag与另一个tag相邻的次数 → (prev_tag, tag), 便于之后计算$P(ti |t{i-1})$
emission_counts 计算的是word与相应tag出现的次数 → (tag, word)，便于之后计算$P(w_i|t_i)$
tag_counts 计算的是每个tag出现的次数

def create_dictionaries(training_corpus, vocab):
    """
    Input: 
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output: 
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    
    #初始化为defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    
    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--' 
    
    i = 0 # 统计总数 
    
    for word_tag in training_corpus:
        
        i += 1
        
        # Every 50,000 words, print the word count
        if i % 50000 == 0:
            print(f"word count = {i}")
            
        word, tag = get_word_tag (word_tag, vocab)
        
        # Increment the transition count for the previous word and tag
        transition_counts[(prev_tag, tag)] += 1
        
        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1

        # Increment the tag count
        tag_counts[tag] += 1

        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag
        
        
    return emission_counts, transition_counts, tag_counts

1.2 预测pos，并计算准确度

def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input: 
        prep: a preprocessed version of 'y'. A list with the 'word' component of the tuples.
        y: a corpus composed of a list of tuples where each tuple consists of (word, POS)
        emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value is an index
        states: a sorted list of all possible tags for this assignment
    Output: 
        accuracy: Number of times you classified a word correctly
    '''
    
    num_correct = 0
    
    # (tag, word)
    all_words = set(emission_counts.keys())
    
    total = len(y)
    for word, y_tup in zip(prep, y): 

        # Split the (word, POS) string into a list of two items
        y_tup_l = y_tup.split()
        
        # Verify that y_tup contain both word and POS
        if len(y_tup_l) == 2:
            
            # Set the true POS label for this word
            true_label = y_tup_l[1]

        else:
            continue
    
        count_final = 0
        pos_final = ''
   
        if word in vocab:
            for pos in states:
                        
                # define the key as the tuple containing the POS and word
                key = (pos, word)

                # check if the (pos, word) key exists in the emission_counts dictionary
                if key in emission_counts:
                    count = emission_counts[key]

                    if count > count_final:
                        count_final = count
                        pos_final = pos

            # If the final POS (with the largest count) matches the true POS:
            if pos_final == true_label:
                num_correct += 1
            
    accuracy = num_correct / total
    
    return accuracy

Part2：pos的隐马尔科夫模型

2.1 构建A矩阵

A：转移概率模型（使用transition_counts）
并做平滑处理，最终公式如下：

$P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_{i}) + \alpha }{C(t_{i-1}) +\alpha * N}$

N：tag的总数
$C(t{i-1}, t{i})$：(prev_pos, current_pos) 在transition_counts中的数量
$C(t_{i-1})$：pre_pos在tag_counts中的数量

def create_transition_matrix(alpha, tag_counts, transition_counts):
    ''' 
    Input: 
        alpha: number used for smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        transition_counts: transition count for the previous word and tag
    Output:
        A: matrix of dimension (num_tags,num_tags)
    '''

    all_tags = sorted(tag_counts.keys())
    
    num_tags = len(all_tags)
    
    # 初始化A
    A = np.zeros((num_tags,num_tags))
    
    # 得到(previous POS, current POS)
    trans_keys = set(transition_counts.keys())
    
    # 遍历A的行
    for i in range(num_tags):
        
        # 遍历A的列
        for j in range(num_tags):

            # 初始化(prev POS, current POS)
            count = 0
        
            # (prev POS, current POS)
            key = (all_tags[i], all_tags[j])

            if key in transition_counts.keys():
                count = transition_counts[key]
                
            count_prev_tag = tag_counts[key[0]]
            
            A[i,j] = (count + alpha) / (count_prev_tag + alpha * num_tags)

    
    return A

2.2 构建B矩阵

B：发射概率模型（使用emission_counts）
平滑处理，最终公式如下：

$P(w_i | t_i) = \frac{C(t_i, word_i)+ \alpha}{C(t_{i}) +\alpha * N}$

$C(t_i, word_i)$：(tag, word)在emission_counts 中的相应数量
$C(t_{i})$：$tag_i$在tag_counts中的数量
N：单词的总数

def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
    '''
    Input: 
        alpha: tuning parameter used in smoothing 
        tag_counts: a dictionary mapping each tag to its respective count
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        vocab: a dictionary where keys are words in vocabulary and value is an index.
               within the function it'll be treated as a list
    Output:
        B: a matrix of dimension (num_tags, len(vocab))
    '''
    
    num_tags = len(tag_counts)
 
    all_tags = sorted(tag_counts.keys())
    
    num_words = len(vocab)
    
    # 初始化B
    B = np.zeros((num_tags, num_words))
    
    # (POS, word)
    emis_keys = set(list(emission_counts.keys()))
    
    # 行
    for i in range(num_tags):
        
        # 列
        for j in range(num_words):

            # 初始化(POS tag, word)数量
            count = 0
                    
            # (POS tag, word)
            key =  (all_tags[i], vocab[j])

            if key in emission_counts.keys():
                count = emission_counts[key]
              
            count_tag = tag_counts[key[0]]
                
            B[i,j] = (count + alpha) / (count_tag + alpha * num_words)
            
    return B

Part3：维特比算法

3.1 初始化矩阵

best_probs：每个单元格为一个pos_tag到一个word的概率
best_paths：目的是为了找到最佳路径
从起始处到特定的索引i，最佳路线的概率为$bestprobs[s_{idx}, i]$
为了避免一些问题，我们使用log的加法来取代乘法，并且当$A[s_{idx}, i] == 0$时，我们将其设置为-∞，即log(0) 。

综上，初始化方程为：

$if A[s_{idx}, i] <> 0 : best\_probs[i,0] = log(A[s_{idx}, i]) + log(B[i, vocab[corpus[0]]])\\if A[s_{idx}, i] == 0 : best\_probs[i,0] = float('-inf')$

def initialize(states, tag_counts, A, B, corpus, vocab):
    '''
    Input: 
        states: a list of all possible parts-of-speech
        tag_counts: a dictionary mapping each tag to its respective count
        A: Transition Matrix of dimension (num_tags, num_tags)
        B: Emission Matrix of dimension (num_tags, len(vocab))
        corpus: a sequence of words whose POS is to be identified in a list 
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        best_probs: matrix of dimension (num_tags, len(corpus)) of floats
        best_paths: matrix of dimension (num_tags, len(corpus)) of integers
    '''

    num_tags = len(tag_counts)
    
    # 初始化best_probs矩阵
    best_probs = np.zeros((num_tags, len(corpus)))
    
    # 初始化best_paths矩阵
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    
    # 定义起始处
    s_idx = states.index("--s--")
    
    # 遍历每个pos_tag
    for i in range(num_tags):
        
        if A[s_idx, i] == 0:
            best_probs[i,0] = float('-inf')
        else:
            best_probs[i,0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])
                        
    return best_probs, best_paths

3.2 维特比 Forward

对于每个单元格：$prob=𝐛𝐞𝐬𝐭𝐩𝐫𝐨𝐛{𝑘,𝑖−1}+log(𝐀{𝑘,𝑗})+log(𝐁{𝑗,𝑣𝑜𝑐𝑎𝑏(𝑐𝑜𝑟𝑝𝑢𝑠𝑖)})$
然后从中选出概率最大的作为最后的结果
$\mathrm{path} = k$

def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
    '''
    Input: 
        A, B: The transition and emission matrices respectively
        test_corpus: a list containing a preprocessed corpus
        best_probs: an initilized matrix of dimension (num_tags, len(corpus))
        best_paths: an initilized matrix of dimension (num_tags, len(corpus))
        vocab: a dictionary where keys are words in vocabulary and value is an index 
    Output: 
        best_probs: a completed matrix of dimension (num_tags, len(corpus))
        best_paths: a completed matrix of dimension (num_tags, len(corpus))
    '''
   
    num_tags = best_probs.shape[0]
    
    # 从word_1开始遍历
    for i in range(1, len(test_corpus)): 
        
        if i % 5000 == 0:
            print("Words processed: {:>8}".format(i))
            
        # 对于每个word,遍历tag的可能性
        for j in range(num_tags):
            
            # 初始化
            best_prob_i = float('-inf')
            
            # 初始化
            best_path_i = None

            # For each POS tag that the previous word can be:
            for k in range(num_tags):
                prob = best_probs[k, i - 1] + math.log(A[k, j]) + math.log(B[j, vocab[test_corpus[i]]])

                if prob > best_prob_i:
                    best_prob_i = prob
                    best_path_i = k

						# 记录该tag -> word 的最佳概率
            best_probs[j,i] = best_prob_i
            
						# 将前一个POS标记的唯一整数ID保存到best_paths矩阵中
            best_paths[j,i] = best_path_i

    return best_probs, best_paths

3.3 维特比 Backward

def viterbi_backward(best_probs, best_paths, corpus, states):
    '''
    This function returns the best path.
    
    '''
    m = best_paths.shape[1] 
    
    # Initialize array z, same length as the corpus
    z = [None] * m
    
    num_tags = best_probs.shape[0]
    
    # 初始化
    best_prob_for_last_word = float('-inf')
    
    # Initialize pred array, same length as corpus
    pred = [None] * m
    
    ## Step 1 ##
    
    # 寻找概率最大的行
    for k in range(num_tags):

        if best_probs[k, -1] > best_prob_for_last_word: # 从最后一列开始
            
            best_prob_for_last_word = best_probs[k, -1]
            z[m - 1] = k
            
		# 通过行号K转化为tag
    pred[m - 1] = states[k]
    
    ## Step 2 ##
    for i in range(len(corpus) - 1, -1, -1): # 从后往前
        
        pos_tag_for_word_i = best_paths[np.argmax(best_probs[:,i]),i]
        
        z[i - 1] = best_paths[pos_tag_for_word_i, i]
        
        pred[i - 1] =states[pos_tag_for_word_i]
        
    return pred

3.4 计算准确率

def compute_accuracy(pred, y):
    '''
    Input: 
        pred: a list of the predicted parts-of-speech 
        y: a list of lines where each word is separated by a '\t' (i.e. word \t tag)
    Output: 
        
    '''
    num_correct = 0
    total = 0
    
    # Zip together the prediction and the labels
    for prediction, y in zip(pred, y):

        # (word, tag)
        word_tag_tuple = y.split()
        
        if len(word_tag_tuple) < 2: 
            continue 

        word, tag = word_tag_tuple
        
        if prediction == tag:
            num_correct += 1
            
        total += 1
        
    return num_correct/total

tips：

维特比算法参考：https://www.zhihu.com/question/20136144

Next Post

吴恩达团队NLP C2_W3_Assignment
Previous Post

吴恩达团队NLP C2_W1_Assignment

CATALOG

1. 吴恩达团队NLP C2_W2_Assignment



Total : 424

2023

2022

2021

2020

2019

缺失模块。
1、请确保node版本大于6.2
2、在博客根目录（注意不是archer根目录）执行以下命令：
npm i hexo-generator-json-content --save
3、在根目录_config.yml里添加配置：

jsonContent:
  meta: false
  pages: false
  posts:
    title: true
    date: true
    path: true
    text: false
    raw: false
    content: false
    slug: false
    updated: false
    comments: false
    link: false
    permalink: false
    excerpt: false
    categories: true
    tags: true

吴恩达团队NLP C2_W2_Assignment

吴恩达团队NLP C2_W2_Assignment

任务：词性标注(POS)

Part1：Training

1.1 构建 transition_counts, emission_counts, tag_counts

1.2 预测pos，并计算准确度

Part2：pos的隐马尔科夫模型

2.1 构建A矩阵

2.2 构建B矩阵

Part3：维特比算法

3.1 初始化矩阵

3.2 维特比 Forward

3.3 维特比 Backward

3.4 计算准确率

1.1 构建 `transition_counts, emission_counts, tag_counts`