Keshawn_lu's Blog

Andrew Ng Team's NLP C4_W2_Assignment

2021/08/28

Andrew Ng Team's NLP C4_W2_Assignment

Task: implement a Transformer from scratch

Part 1: Data Preparation

1.1 Preparing the libraries

import sys
import os

import numpy as np

import textwrap
wrapper = textwrap.TextWrapper(width=70)

import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp

# to print the entire np array
np.set_printoptions(threshold=sys.maxsize)

1.2 Importing the data

# Importing CNN/DailyMail articles dataset
train_stream_fn = trax.data.TFDS('cnn_dailymail',
                                 data_dir='data/',
                                 keys=('article', 'highlights'),
                                 train=True)

# This should be much faster as the data is downloaded already.
eval_stream_fn = trax.data.TFDS('cnn_dailymail',
                                data_dir='data/',
                                keys=('article', 'highlights'),
                                train=False)
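
Before any preprocessing, it helps to peek at one raw example. The snippet below is a minimal sketch (assuming the dataset has been downloaded into data/ as configured above); it pulls the first (article, highlights) pair from the training stream and prints a short preview:

raw_train_stream = train_stream_fn()
article, highlights = next(raw_train_stream)

# Show the first few hundred characters of the article and its reference summary.
print(wrapper.fill(str(article)[:300]), '\n')
print(wrapper.fill(str(highlights)[:300]))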

1.3 Tokenization and detokenization helper functions

def tokenize(input_str, EOS=1):
    """Input str to features dict, ready for inference"""

    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_dir='vocab_dir/',
                                     vocab_file='summarize32k.subword.subwords'))

    # Append the EOS token at the end
    return list(inputs) + [EOS]

def detokenize(integers):
    """List of ints to str"""

    s = trax.data.detokenize(integers,
                             vocab_dir='vocab_dir/',
                             vocab_file='summarize32k.subword.subwords')

    return wrapper.fill(s)
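
A quick round trip through the two helpers, as a sketch that assumes the vocab_dir/ files shipped with the assignment are available:

sentence = 'Transformers are great for summarization.'
tokens = tokenize(sentence)

print(tokens)              # subword ids, ending with the EOS token (1)
print(detokenize(tokens))  # should recover the original sentence (possibly followed by the EOS marker)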

1.4 Preprocessing the data

Concatenate the input and the target (the article and the summary), inserting a separator between them. At the same time, create a mask where 0 marks input positions and 1 marks target positions.

# Special tokens
SEP = 0  # Padding or separator token
EOS = 1  # End of sentence token

# Concatenate tokenized inputs and targets using 0 as separator.
def preprocess(stream):
    for (article, summary) in stream:
        joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
        mask = [0] * (len(list(article)) + 2) + [1] * (len(list(summary)) + 1)  # Accounting for EOS and SEP
        yield joint, joint, np.array(mask)

# You can combine a few data preprocessing steps into a pipeline like this.
input_pipeline = trax.data.Serial(
    # Tokenizes
    trax.data.Tokenize(vocab_dir='vocab_dir/',
                       vocab_file='summarize32k.subword.subwords'),
    # Uses function defined above
    preprocess,
    # Filters out examples longer than 2048
    trax.data.FilterByLength(2048)
)

# Apply preprocessing to data streams.
train_stream = input_pipeline(train_stream_fn())
eval_stream = input_pipeline(eval_stream_fn())

train_input, train_target, train_mask = next(train_stream)

assert sum((train_input - train_target)**2) == 0  # They are the same in Language Model (LM).
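
A small sanity check on the preprocessed example above (a sketch reusing train_input and train_mask): the mask must be as long as the joint sequence, and its 1s should cover exactly the summary plus its EOS token.

print(len(train_input), len(train_mask))  # joint sequence and mask have equal length
print(int(np.sum(train_mask)))            # number of target (summary + EOS) positions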

1.5 Bucketing

We group sentences of similar length together so that only minimal padding is needed, as shown in the figure below:

https://pic.imgdb.cn/item/612097e74907e2d39c475e43.png

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 16 sentences of length < 128 , 8 of length < 256,
# 4 of length < 512. And so on.
boundaries = [128, 256, 512, 1024]
batch_sizes = [16, 8, 4, 2, 1]

# Create the streams.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(eval_stream)

After the processing above, the data now looks like this:

[Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS> -> (possibly) multiple <pad>
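
To confirm the bucketing behaves as expected, here is a minimal sketch that pulls a single batch from the bucketed training stream; the shapes vary from batch to batch because different buckets use different padded lengths and batch sizes.

input_batch, target_batch, mask_batch = next(train_batch_stream)

# All three tensors in a batch share the same (batch_size, padded_length) shape.
print(input_batch.shape, target_batch.shape, mask_batch.shape)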

Part 2: Building the Model

https://pic.imgdb.cn/item/612489a844eaada739deced5.png

2.1 Helper functions

def create_tensor(t):
    """Create tensor from list of lists"""
    return jnp.array(t)

def display_tensor(t, name):
    """Display shape and tensor"""
    print(f'{name} shape: {t.shape}\n')
    print(f'{t}\n')

2.2 Dot product attention

The attention formula is as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}} + M\right)V$$

  • $d_{k}$: the dimensionality of the queries and keys (used to scale down the dot products)
  • $Q$: queries
  • $K$: keys
  • $V$: values
  • $M$: mask

Run some quick tests:

q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])
display_tensor(v, 'value')
m = create_tensor([[0, 0], [-1e9, 0]])
display_tensor(m, 'mask')
Expected Output:

query shape: (2, 3)

[[1 0 0]
 [0 1 0]]

key shape: (2, 3)

[[1 2 3]
 [4 5 6]]

value shape: (2, 3)

[[0 1 0]
 [1 0 1]]

mask shape: (2, 2)

[[ 0.e+00  0.e+00]
 [-1.e+09  0.e+00]]
q_dot_k = q @ k.T / jnp.sqrt(3)
display_tensor(q_dot_k, 'query dot key')
Expected Output:

query dot key shape: (2, 2)

[[0.57735026 2.309401  ]
 [1.1547005  2.8867514 ]]
masked = q_dot_k + m
display_tensor(masked, 'masked query dot key')
Expected Output:

masked query dot key shape: (2, 2)

[[ 5.7735026e-01  2.3094010e+00]
 [-1.0000000e+09  2.8867514e+00]]
display_tensor(masked @ v, 'masked query dot key dot value')
Expected Output:

masked query dot key dot value shape: (2, 3)

[[ 2.3094010e+00  5.7735026e-01  2.3094010e+00]
 [ 2.8867514e+00 -1.0000000e+09  2.8867514e+00]]

Add a batch dimension:

q_with_batch = q[None,:]
display_tensor(q_with_batch, 'query with batch dim')
k_with_batch = k[None,:]
display_tensor(k_with_batch, 'key with batch dim')
v_with_batch = v[None,:]
display_tensor(v_with_batch, 'value with batch dim')
m_bool = create_tensor([[True, True], [False, True]])
display_tensor(m_bool, 'boolean mask')
Expected Output:

query with batch dim shape: (1, 2, 3)

[[[1 0 0]
  [0 1 0]]]

key with batch dim shape: (1, 2, 3)

[[[1 2 3]
  [4 5 6]]]

value with batch dim shape: (1, 2, 3)

[[[0 1 0]
  [1 0 1]]]

boolean mask shape: (2, 2)

[[ True  True]
 [False  True]]

Implement the formula above with trax:

def DotProductAttention(query, key, value, mask):
    """Dot product self-attention.
    Args:
        query (jax.interpreters.xla.DeviceArray): array of query representations with shape (L_q by d)
        key (jax.interpreters.xla.DeviceArray): array of key representations with shape (L_k by d)
        value (jax.interpreters.xla.DeviceArray): array of value representations with shape (L_k by d) where L_v = L_k
        mask (jax.interpreters.xla.DeviceArray): attention-mask, gates attention with shape (L_q by L_k)

    Returns:
        jax.interpreters.xla.DeviceArray: Self-attention array for q, k, v arrays. (L_q by d)
    """

    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    depth = query.shape[-1]

    # swapaxes() swaps the last two axes, i.e. transposes the key matrix
    dots = jnp.matmul(query, jnp.swapaxes(key, -1, -2)) / jnp.sqrt(depth)

    if mask is not None:
        dots = jnp.where(mask, dots, jnp.full_like(dots, -1e9))  # keep dots where mask is True, use -1e9 elsewhere

    # Softmax formula implementation
    logsumexp = trax.fastmath.logsumexp(dots, axis=-1, keepdims=True)
    dots = jnp.exp(dots - logsumexp)

    attention = jnp.matmul(dots, value)

    return attention
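
A quick check of DotProductAttention, reusing the batched tensors and the boolean mask defined earlier (a sketch; the exact values come from the softmax over the masked scores):

result = DotProductAttention(q_with_batch, k_with_batch, v_with_batch, m_bool)
display_tensor(result, 'dot product attention')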

2.3 Causal Attention

As shown in the figure below, a word can see the words before it, but not the words after it.

https://pic.imgdb.cn/item/61248a2544eaada739dfcd7f.png

compute_attention_heads: take an input X with shape (batch_size, seqlen, n_heads × d_head) and fold the heads from the last dimension into the batch dimension, obtaining (batch_size × n_heads, seqlen, d_head).

def compute_attention_heads_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_heads function
    """

    def compute_attention_heads(x):
        """ Compute the attention heads.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size, seqlen, n_heads X d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size X n_heads, seqlen, d_head).
        """

        # Size of the x's batch dimension
        batch_size = x.shape[0]

        # Length of the sequence
        seqlen = x.shape[1]

        # batch_size, seqlen, n_heads*d_head -> batch_size, seqlen, n_heads, d_head
        x = jnp.reshape(x, (batch_size, seqlen, n_heads, d_head))

        # batch_size, seqlen, n_heads, d_head -> batch_size, n_heads, seqlen, d_head
        # jnp.transpose() permutes the axes
        x = jnp.transpose(x, (0, 2, 1, 3))

        # batch_size, n_heads, seqlen, d_head -> batch_size*n_heads, seqlen, d_head
        # -1 lets the size of this dimension be inferred automatically
        x = jnp.reshape(x, (-1, seqlen, d_head))

        return x

    return compute_attention_heads

Test:
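
The test below refers to tensor3dc3b, which is not defined earlier in this post. The following is a hypothetical reconstruction, chosen so that it matches the expected output shape (3, 2, 6):

# Hypothetical reconstruction of tensor3dc3b: a batch of 3 copies of the same (2, 6) pattern.
tensor3dc3b = create_tensor([[[1, 0, 0, 1, 0, 0],
                              [0, 1, 0, 0, 1, 0]]] * 3)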

display_tensor(tensor3dc3b, "input tensor")
result_cah = compute_attention_heads_closure(2,3)(tensor3dc3b)
display_tensor(result_cah, "output tensor")
Expected Output:

input tensor shape: (3, 2, 6)

[[[1 0 0 1 0 0]
  [0 1 0 0 1 0]]

 [[1 0 0 1 0 0]
  [0 1 0 0 1 0]]

 [[1 0 0 1 0 0]
  [0 1 0 0 1 0]]]

output tensor shape: (6, 2, 3)

[[[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]]

dot_product_self_attention: create a mask matrix that is True on and below the diagonal (a position can attend to itself and to earlier positions) and False above it.

def dot_product_self_attention(q, k, v):
    """ Masked dot product self attention.
    Args:
        q (jax.interpreters.xla.DeviceArray): queries.
        k (jax.interpreters.xla.DeviceArray): keys.
        v (jax.interpreters.xla.DeviceArray): values.
    Returns:
        jax.interpreters.xla.DeviceArray: masked dot product self attention tensor.
    """

    # Hint: mask size should be equal to L_q. Remember that q has shape (batch_size, L_q, d)
    mask_size = q.shape[-2]

    # Creates a matrix with ones on and below the diagonal and 0s above. It should have shape (1, mask_size, mask_size)
    # Notice that 1's and 0's get casted to True/False by setting dtype to jnp.bool_
    # Use jnp.tril() - Lower triangle of an array and jnp.ones()
    mask = jnp.tril(jnp.ones((1, mask_size, mask_size), dtype=jnp.bool_), k=0)

    return DotProductAttention(q, k, v, mask)
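
A quick causal self-attention check on the batched tensors from earlier (a sketch; with the lower-triangular mask the first position can only attend to itself):

result = dot_product_self_attention(q_with_batch, k_with_batch, v_with_batch)
display_tensor(result, 'masked dot product self attention')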

compute_attention_output: undo what compute_attention_heads did, converting (batch_size × n_heads, seqlen, d_head) back to (batch_size, seqlen, n_heads × d_head).

def compute_attention_output_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_output function
    """

    def compute_attention_output(x):
        """ Compute the attention output.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size X n_heads, seqlen, d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size, seqlen, n_heads X d_head).
        """

        # Length of the sequence
        seqlen = x.shape[1]

        # Reshape x using jnp.reshape() to shape (batch_size, n_heads, seqlen, d_head)
        x = jnp.reshape(x, (-1, n_heads, seqlen, d_head))  # -1 lets the batch dimension be inferred automatically

        # Transpose x using jnp.transpose() to shape (batch_size, seqlen, n_heads, d_head)
        x = jnp.transpose(x, (0, 2, 1, 3))

        # Reshape to allow to concatenate the heads
        return jnp.reshape(x, (-1, seqlen, n_heads * d_head))

    return compute_attention_output

Test:

display_tensor(result_cah, "input tensor")
result_cao = compute_attention_output_closure(2,3)(result_cah)
display_tensor(result_cao, "output tensor")
Expected Output:

input tensor shape: (6, 2, 3)

[[[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]]]

output tensor shape: (3, 2, 6)

[[[1 0 0 1 0 0]
  [0 1 0 0 1 0]]

 [[1 0 0 1 0 0]
  [0 1 0 0 1 0]]

 [[1 0 0 1 0 0]
  [0 1 0 0 1 0]]]

Put the functions above together to implement the causal attention function:

def CausalAttention(d_feature,
                    n_heads,
                    compute_attention_heads_closure=compute_attention_heads_closure,
                    dot_product_self_attention=dot_product_self_attention,
                    compute_attention_output_closure=compute_attention_output_closure,
                    mode='train'):
    """Transformer-style multi-headed causal attention.

    Args:
        d_feature (int): dimensionality of feature embedding.
        n_heads (int): number of attention heads.
        compute_attention_heads_closure (function): Closure around compute_attention_heads.
        dot_product_self_attention (function): dot_product_self_attention function.
        compute_attention_output_closure (function): Closure around compute_attention_output.
        mode (str): 'train' or 'eval'.

    Returns:
        trax.layers.combinators.Serial: Multi-headed self-attention model.
    """

    assert d_feature % n_heads == 0
    d_head = d_feature // n_heads

    ComputeAttentionHeads = tl.Fn('AttnHeads', compute_attention_heads_closure(n_heads, d_head), n_out=1)

    return tl.Serial(
        tl.Branch(
            [tl.Dense(d_feature), ComputeAttentionHeads],  # queries
            [tl.Dense(d_feature), ComputeAttentionHeads],  # keys
            [tl.Dense(d_feature), ComputeAttentionHeads],  # values
        ),

        tl.Fn('DotProductAttn', dot_product_self_attention, n_out=1),  # takes QKV

        tl.Fn('AttnOutput', compute_attention_output_closure(n_heads, d_head), n_out=1),  # to allow for parallel
        tl.Dense(d_feature)  # Final dense layer
    )
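
Building the layer and printing it is a cheap way to check the resulting combinator structure (a sketch with hypothetical sizes):

print(CausalAttention(d_feature=512, n_heads=8))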

2.4 Implementing the Transformer decoder block

https://pic.imgdb.cn/item/61248f7544eaada739ea9209.png

def DecoderBlock(d_model, d_ff, n_heads,
                 dropout, mode, ff_activation):
    """Returns a list of layers that implements a Transformer decoder block.

    The input is an activation tensor.

    Args:
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """

    # Create masked multi-head attention block using CausalAttention function
    causal_attention = CausalAttention(
        d_model,
        n_heads=n_heads,
        mode=mode
    )

    # Create feed-forward block (list) with two dense layers with dropout and input normalized
    feed_forward = [
        tl.LayerNorm(),
        tl.Dense(d_ff),
        ff_activation(),  # Generally ReLU
        tl.Dropout(rate=dropout, mode=mode),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode)
    ]

    # Add list of two Residual blocks: the attention with normalization and dropout and feed-forward blocks
    return [
        tl.Residual(
            tl.LayerNorm(),
            causal_attention,
            tl.Dropout(rate=dropout, mode=mode)
        ),
        tl.Residual(
            feed_forward
        ),
    ]
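
A single decoder block can be inspected the same way (hypothetical hyperparameters, just to print the list of layers):

print(DecoderBlock(d_model=512, d_ff=2048, n_heads=8,
                   dropout=0.1, mode='train', ff_activation=tl.Relu))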

2.5 Transformer Language Model

Put together everything implemented so far:

https://pic.imgdb.cn/item/6124905944eaada739ec3eb0.png

def TransformerLM(vocab_size=33300,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """Returns a Transformer language model.

    The input to the model is a tensor of tokens. (This model uses only the
    decoder part of the overall Transformer.)

    Args:
        vocab_size (int): vocab size.
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """

    # Embedding inputs and positional encoder
    positional_encoder = [
        tl.Embedding(vocab_size, d_model),
        tl.Dropout(rate=dropout, mode=mode),
        tl.PositionalEncoding(max_len=max_len, mode=mode)]

    # Create stack (list) of decoder blocks with n_layers with necessary parameters
    decoder_blocks = [
        DecoderBlock(d_model, d_ff, n_heads,
                     dropout, mode, ff_activation) for _ in range(n_layers)]

    return tl.Serial(
        tl.ShiftRight(mode=mode),  # Specify the mode!
        positional_encoder,
        decoder_blocks,
        tl.LayerNorm(),
        tl.Dense(vocab_size),
        tl.LogSoftmax(),
    )
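
As before, instantiating a tiny model and printing it verifies the overall stack (a sketch with a single decoder layer):

print(TransformerLM(n_layers=1))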

Part 3: Training the Model

from trax.supervised import training

def training_loop(TransformerLM, train_gen, eval_gen, output_dir="~/model"):
    '''
    Input:
        TransformerLM (trax.layers.combinators.Serial): The model you are building.
        train_gen (generator): Training stream of data.
        eval_gen (generator): Evaluation stream of data.
        output_dir (str): folder to save your file.

    Returns:
        trax.supervised.training.Loop: Training loop.
    '''
    output_dir = os.path.expanduser(output_dir)
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.01)

    train_task = training.TrainTask(
        labeled_data=train_gen,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule,
        n_steps_per_checkpoint=10
    )

    eval_task = training.EvalTask(
        labeled_data=eval_gen,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]
    )

    loop = training.Loop(TransformerLM(d_model=4,
                                       d_ff=16,
                                       n_layers=1,
                                       n_heads=2,
                                       mode='train'),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)

    return loop
loop = training_loop(TransformerLM, train_batch_stream, eval_batch_stream)
loop.run(10)

Part 4: Evaluation

4.1 Loading the model

# Get the model architecture
model = TransformerLM(mode='eval')

# Load the pre-trained weights
model.init_from_file('model.pkl.gz', weights_only=True)

4.2 Predicting the next word and returning its index

def next_symbol(cur_output_tokens, model):
    """Returns the next symbol for a given sentence.

    Args:
        cur_output_tokens (list): tokenized sentence with EOS and PAD tokens at the end.
        model (trax.layers.combinators.Serial): The transformer model.

    Returns:
        int: tokenized symbol.
    """
    # current output tokens length
    token_length = len(cur_output_tokens)

    padded_length = 2**int(np.ceil(np.log2(token_length + 1)))

    # Fill cur_output_tokens with 0's until it reaches padded_length
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    padded_with_batch = np.array(padded)[None, :]

    # model expects a tuple containing two padded tensors (with batch)
    output, _ = model((padded_with_batch, padded_with_batch))

    log_probs = output[0, token_length, :]

    return int(np.argmax(log_probs))
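
The padded_length computation rounds the current length up to the next power of two, so only a handful of distinct padded shapes are ever fed to the model. A tiny illustration with arbitrary example lengths:

for token_length in [3, 8, 20, 100]:
    padded_length = 2**int(np.ceil(np.log2(token_length + 1)))
    print(f'{token_length} tokens -> padded to {padded_length}')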

4.3 Implementing greedy decoding

def greedy_decode(input_sentence, model):
    """Greedy decode function.

    Args:
        input_sentence (string): a sentence or article.
        model (trax.layers.combinators.Serial): Transformer model.

    Returns:
        string: summary of the input.
    """

    # Use tokenize()
    cur_output_tokens = tokenize(input_sentence) + [0]
    generated_output = []
    cur_output = 0
    EOS = 1

    while cur_output != EOS:
        # Get next symbol
        cur_output = next_symbol(cur_output_tokens, model)
        # Append next symbol to original sentence
        cur_output_tokens.append(cur_output)
        # Append next symbol to generated sentence
        generated_output.append(cur_output)
        print(detokenize(generated_output))

    return detokenize(generated_output)

Test:

# Test it out on a sentence!
test_sentence = "It padded_with_batch was a sunny day when I went to the market to buy some flowers. But I only found roses, not tulips."
print(wrapper.fill(test_sentence), '\n')
print(greedy_decode(test_sentence, model))
# Strange output, but the autograder does not mark it incorrect
It padded_with_batch was a sunny day when I went to the market to buy
some flowers. But I only found roses, not tulips.

:
: I
: I just
: I just found
: I just found ros
: I just found roses
: I just found roses,
: I just found roses, not
: I just found roses, not tu
: I just found roses, not tulips
: I just found roses, not tulips
: I just found roses, not tulips.
: I just found roses, not tulips.<EOS>
: I just found roses, not tulips.<EOS>