Keshawn_lu's Blog

Andrew Ng's NLP Specialization: C3_W3_Assignment

2021/08/28


Task: Named Entity Recognition (NER)

  • French: a geopolitical entity
  • Morocco: a geographic entity
  • Christmas: a time indicator
  • Everything else is not treated as a named entity

https://pic.imgdb.cn/item/611893025132923bf8d4fd26.png
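
For illustration, here is a hypothetical token/tag alignment consistent with the three entity types above. The tag names and the actual tag_map come from the assignment's dataset, so treat this as a sketch rather than the real labels:

# Hypothetical example; the real tags and indices come from the dataset's tag_map.
sentence = ["Many", "French", "citizens", "are", "in", "Morocco", "for", "Christmas", "."]
tags     = ["O",    "B-gpe",  "O",        "O",   "O",  "B-geo",   "O",   "B-tim",    "O"]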

Part 1: Data Generator

  • The benefit of shuffling this way: we do not index into the sentence list directly. Instead, we shuffle a list of indices and use it to pick positions, so the traversal order changes while the original list stays untouched (see the sketch below).
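
A minimal sketch of that index-shuffling idea before the full generator (the toy list below is illustrative):

import random as rnd

sentences = ["sent0", "sent1", "sent2", "sent3"]   # the original list is never reordered
lines_index = [*range(len(sentences))]             # indices into the list
rnd.shuffle(lines_index)                           # shuffle the indices instead of the data
for idx in lines_index:
    print(sentences[idx])                          # traversal order changes, data does not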
import numpy as np
import random as rnd

def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    '''
    Input:
        batch_size - size of one batch
        x - list of sentences, with words represented as integers
        y - list of tag sequences associated with the sentences
        shuffle - whether to shuffle the traversal order
        pad - the integer used for padding
        verbose - print information during runtime
    Output:
        a tuple containing 2 elements:
        X - np.ndarray of dim (batch_size, max_len) of padded sentences
        Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    '''

    num_lines = len(x)
    lines_index = [*range(num_lines)]

    if shuffle:
        rnd.shuffle(lines_index)

    index = 0
    while True:
        buffer_x = [0] * batch_size
        buffer_y = [0] * batch_size

        max_len = 0
        for i in range(batch_size):
            if index >= num_lines:
                index = 0
                if shuffle:
                    rnd.shuffle(lines_index)

            buffer_x[i] = x[lines_index[index]]
            buffer_y[i] = y[lines_index[index]]

            lenx = len(x[lines_index[index]])
            if lenx > max_len:
                max_len = lenx  # track the longest sentence, used for padding below

            index += 1

        # Build (batch_size, max_len) arrays filled entirely with the pad value.
        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)

        for i in range(batch_size):
            x_i = buffer_x[i]
            y_i = buffer_y[i]

            for j in range(len(x_i)):
                X[i, j] = x_i[j]
                Y[i, j] = y_i[j]

        if verbose:
            print("index=", index)
        yield (X, Y)
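
A quick smoke test of the generator on toy data. The word/tag ids and the pad value 35180 below are made up for illustration; it reuses data_generator and numpy from above:

x_toy = [[1, 2, 3], [4, 5]]          # two "sentences" of word ids
y_toy = [[0, 1, 0], [1, 0]]          # matching tag ids
gen = data_generator(batch_size=2, x=x_toy, y=y_toy, pad=35180, shuffle=False)
X, Y = next(gen)
print(X)   # [[    1     2     3]
           #  [    4     5 35180]]  -- the shorter sentence is padded to max_len
print(Y)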

Part 2: Building the Model

https://pic.imgdb.cn/item/6118a2915132923bf837d1c5.png

  1. Feed the batches produced by the data generator into the input layer
  2. Pass them through an embedding layer
  3. Feed the embeddings into an LSTM
  4. Project the LSTM outputs through a dense (linear) layer
  5. Finally, apply a log-softmax to get a prediction for each word
from trax import layers as tl

def NER(vocab_size=35181, d_model=50, tags=tag_map):
    '''
    Input:
        vocab_size - integer containing the size of the vocabulary
        d_model - integer describing the embedding size
    Output:
        model - a trax serial model
    '''

    model = tl.Serial(
        tl.Embedding(vocab_size, d_model),  # d_model: number of elements in each word embedding
        tl.LSTM(d_model),                   # LSTM layer
        tl.Dense(len(tags)),                # Dense layer with len(tags) units
        tl.LogSoftmax()                     # LogSoftmax layer
    )

    return model
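
A minimal sketch of instantiating the model and checking the output shape, assuming trax is installed and tag_map has already been built by the notebook (the dummy batch of token ids is illustrative):

import numpy as np
import trax

model = NER(vocab_size=35181, d_model=50, tags=tag_map)
print(model)   # Serial[ Embedding_35181_50, LSTM_50, Dense_..., LogSoftmax ]

# Initialize weights from an input signature, then run a dummy batch of token ids.
model.init(trax.shapes.ShapeDtype((1, 10), dtype=np.int32))
dummy_batch = np.zeros((1, 10), dtype=np.int32)
log_probs = model(dummy_batch)
print(log_probs.shape)   # (1, 10, len(tag_map)) -- one log-probability vector per token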

An introduction to LSTMs:

人人都能看懂的LSTM ("An LSTM anyone can understand", a Chinese explainer)

Part 3: Training the Model

import trax
from trax import layers as tl
from trax.supervised import training

def train_model(NER, train_generator, eval_generator, train_steps=1, output_dir='model'):
    '''
    Input:
        NER - the model you are building
        train_generator - the data generator for training examples
        eval_generator - the data generator for validation examples
        train_steps - number of training steps
        output_dir - folder to save your model
    Output:
        training_loop - a trax supervised training Loop
    '''

    train_task = training.TrainTask(
        train_generator,                        # labeled training data
        loss_layer=tl.CrossEntropyLoss(),       # loss function
        optimizer=trax.optimizers.Adam(0.01),   # optimizer with learning rate 0.01
    )

    eval_task = training.EvalTask(
        labeled_data=eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=10
    )

    training_loop = training.Loop(
        NER,                      # a model to train
        train_task,               # a train task
        eval_tasks=eval_task,     # the evaluation task
        output_dir=output_dir)    # the output directory

    training_loop.run(n_steps=train_steps)

    return training_loop
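
A hedged sketch of how the training call might be wired up. The names t_sentences, t_labels, v_sentences, v_labels, vocab and batch_size below are assumptions standing in for the notebook's preprocessed data; the add_loss_weights wrapper masks the pad id out of the loss, as the course notebook does:

import trax

# Assumed to exist from the notebook's preprocessing:
#   t_sentences, t_labels  - training split (lists of int sequences)
#   v_sentences, v_labels  - validation split
#   vocab['<PAD>']         - integer id of the padding token
batch_size = 64

train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, t_sentences, t_labels, vocab['<PAD>'], shuffle=True),
    id_to_mask=vocab['<PAD>'])          # exclude padding tokens from the loss

eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, v_sentences, v_labels, vocab['<PAD>'], shuffle=False),
    id_to_mask=vocab['<PAD>'])

training_loop = train_model(NER(), train_generator, eval_generator,
                            train_steps=100, output_dir='model')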

Part 4: Computing Accuracy

import numpy as np

def evaluate_prediction(pred, labels, pad):
    """
    Inputs:
        pred: prediction array with shape
              (num examples, max sentence length in batch, num of classes)
        labels: array of size (batch_size, seq_len)
        pad: integer representing pad character
    Outputs:
        accuracy: float
    """
    ## step 1 ##
    # Pick the most likely class for every token.
    outputs = np.argmax(pred, axis=2)
    print("outputs shape:", outputs.shape)

    ## step 2 ##
    # Mark the non-padded positions; only these count towards the total.
    mask = labels != pad
    print("mask shape:", mask.shape, "mask[0][20:30]:", mask[0][20:30])

    ## step 3 ##
    # Padded positions never match, since pad lies outside the class-index range,
    # so the numerator only counts correctly predicted real tokens.
    accuracy = np.sum(outputs == labels) / float(np.sum(mask))

    return accuracy

np.argmax(pred, axis): returns the indices of the maximum values along the given axis of a numpy array; when several elements tie for the maximum, the index of the first one is returned.

For multi-dimensional arrays, a few examples help build intuition:

  • axis = 0: compare along the outer dimension (down the columns)
  • axis = 1: compare along the inner dimension (across each row)
two_dim_array = np.array([[1, 3, 5], [0, 4, 3]])
max_index_axis0 = np.argmax(two_dim_array, axis = 0)
max_index_axis1 = np.argmax(two_dim_array, axis = 1)
print(max_index_axis0)
print(max_index_axis1)
[0 1 0] 
[2 1]

For a three-dimensional array of shape m * n * p:

  • axis = 0: collapses m, returning an n × p matrix
  • axis = 1: collapses n, returning an m × p matrix
  • axis = 2: collapses p, returning an m × n matrix
three_dim_array = [[[1, 2, 3, 4],  [-1, 0, 3, 5]],
                   [[2, 7, -1, 3], [0, 3, 12, 4]],
                   [[5, 1, 0, 19], [4, 2, -2, 13]]]  # shape 3 * 2 * 4
a = np.argmax(three_dim_array, axis=0)
print(a)
b = np.argmax(three_dim_array, axis=1)
print(b)
c = np.argmax(three_dim_array, axis=2)
print(c)
[[2 1 0 2]
 [2 1 1 2]]  # 2 * 4: for each position, the index of the maximum value across the 3 blocks

[[0 0 0 1]
 [0 0 1 1]
 [0 1 0 0]]  # 3 * 4: the two rows inside each block are compared column by column

[[3 3]
 [1 2]
 [3 3]]  # 3 * 2: each row picks the index of its own maximum
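
Finally, a toy check of evaluate_prediction with made-up numbers: two sentences of three tokens each, three classes, and the pad id set to 35180. Only the five non-pad positions count, and one of them is predicted wrong:

pad = 35180
# Fake (log-)probabilities with shape (2 examples, 3 tokens, 3 classes).
pred = np.array([[[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],
                 [[0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]])
labels = np.array([[1, 0, 1],
                   [0, 1, pad]])   # the last token of the second sentence is padding

print(evaluate_prediction(pred, labels, pad))   # 0.8 -> 4 of the 5 real tokens are correct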