
Andrew Ng Team's NLP Course: C2_W3_Assignment

2021/08/08


Task: build an autocomplete system

Part 1: Preprocessing the Data

1.1 Split the text data into one string per line

def split_to_sentences(data):
    """
    Split data by linebreak "\n"

    Args:
        data: str

    Returns:
        A list of sentences
    """
    sentences = data.split('\n')

    # Additional cleaning (this part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences
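
A quick sanity check, as a minimal usage sketch with a made-up string:

x = "I have a pen.\nI have an apple. \nAh\nApple pen.\n"
print(split_to_sentences(x))
# ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']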

1.2 Lowercase each sentence and split it into words (tokens)

def tokenize_sentences(sentences):
    """
    Tokenize sentences into tokens (words)

    Args:
        sentences: List of strings

    Returns:
        List of lists of tokens
    """

    tokenized_sentences = []

    for sentence in sentences:

        sentence = sentence.lower()

        # Split the sentence into words
        tokenized = nltk.word_tokenize(sentence)

        tokenized_sentences.append(tokenized)

    return tokenized_sentences
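
A minimal usage sketch; since nltk.word_tokenize is used above, the nltk package and its Punkt tokenizer data must be available:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models used by nltk.word_tokenize

sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
print(tokenize_sentences(sentences))
# [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green', '.'], ['roses', 'are', 'red', '.']]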

1.3 Count how many times each word appears

def count_words(tokenized_sentences):
    """
    Count the number of appearances of each word in the tokenized sentences

    Args:
        tokenized_sentences: List of lists of strings

    Returns:
        dict that maps word (str) to the frequency (int)
    """

    word_counts = {}

    for sentence in tokenized_sentences:
        for token in sentence:

            if token not in word_counts:
                word_counts[token] = 1
            else:
                word_counts[token] += 1

    return word_counts
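
As a side note, the same counting can be done with collections.Counter from the standard library; this is an equivalent alternative (count_words_alt is my own name), not part of the assignment:

from collections import Counter

def count_words_alt(tokenized_sentences):
    # Flatten the list of token lists and let Counter tally the tokens
    return Counter(token for sentence in tokenized_sentences for token in sentence)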

1.4 Collect the words whose count is at least a given threshold

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    """
    Find the words that appear N times or more

    Args:
        tokenized_sentences: List of lists of strings
        count_threshold: minimum number of occurrences

    Returns:
        List of words that appear N times or more
    """
    closed_vocab = []

    word_counts = count_words(tokenized_sentences)

    for word, cnt in word_counts.items():
        if cnt >= count_threshold:
            closed_vocab.append(word)

    return closed_vocab
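
A minimal usage sketch with toy sentences; only tokens that appear at least twice survive:

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
print(get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2))
# ['.', 'are']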

1.5 Replace words that are not in the vocabulary above with <unk>, leaving the rest unchanged

def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """
    Replace words not in the given vocabulary with '<unk>' token.

    Args:
        tokenized_sentences: List of lists of strings
        vocabulary: List of strings that we will use
        unknown_token: A string representing unknown (out-of-vocabulary) words

    Returns:
        List of lists of strings, with words not in the vocabulary replaced
    """

    vocabulary = set(vocabulary)

    replaced_tokenized_sentences = []

    for sentence in tokenized_sentences:

        replaced_sentence = []

        for token in sentence:
            if token in vocabulary:
                replaced_sentence.append(token)
            else:
                replaced_sentence.append(unknown_token)

        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences
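
A minimal usage sketch with a toy vocabulary:

tokenized_sentences = [['dogs', 'run'], ['cats', 'sleep']]
vocabulary = ['dogs', 'sleep']
print(replace_oov_words_by_unk(tokenized_sentences, vocabulary))
# [['dogs', '<unk>'], ['<unk>', 'sleep']]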

1.6 Apply the above processing to both train_data and test_data

def preprocess_data(train_data, test_data, count_threshold):
    """
    Preprocess data, i.e.,
    - Find tokens that appear at least N times in the training data.
    - Replace tokens that appear less than N times by "<unk>" both for training and test data.
    Args:
        train_data, test_data: List of lists of strings.
        count_threshold: Words whose count is less than this are
                         treated as unknown.

    Returns:
        Tuple of
        - training data with low-frequency words replaced by "<unk>"
        - test data with low-frequency words replaced by "<unk>"
        - vocabulary of words that appear N times or more in the training data
    """

    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)

    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary, "<unk>")

    # For the test data, replace less common words with "<unk>"
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary, "<unk>")

    return train_data_replaced, test_data_replaced, vocabulary
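
A minimal usage sketch; with count_threshold=1 every training token stays in the vocabulary, so only the unseen test words become <unk>:

tmp_train = [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]

train_repl, test_repl, vocab = preprocess_data(tmp_train, tmp_test, count_threshold=1)
print(train_repl)  # unchanged
print(test_repl)   # [['<unk>', 'are', '<unk>', '.']]
print(vocab)       # all words from the training data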

Part 2: Develop an n-gram Language Model

Assume that the probability of the next word depends only on the previous n-gram. The probability is then estimated from counts as:

$$P(w_t \mid w_{t-n} \dots w_{t-1}) = \frac{C(w_{t-n} \dots w_{t-1}\, w_t)}{C(w_{t-n} \dots w_{t-1})}$$

2.1 Count the occurrences of n-gram word sequences

  • First prepend start tokens <s> to each sentence to mark where it begins (the implementation below prepends n of them)
  • Likewise append an end token <e> to mark where the sentence ends
  • Then count how many times each sequence of n consecutive words appears
def count_n_grams(data, n, start_token='<s>', end_token='<e>'):
    """
    Count all n-grams in the data

    Args:
        data: List of lists of words
        n: number of words in a sequence

    Returns:
        A dictionary that maps a tuple of n-words to its frequency
    """

    n_grams = {}

    for sentence in data:

        sentence = [start_token] * n + sentence + [end_token]

        sentence = tuple(sentence)

        for i in range(len(sentence) - n + 1):

            n_gram = sentence[i : i + n]

            if n_gram in n_grams:
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1

    return n_grams
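
A minimal usage sketch with two toy sentences (the same ones used in the tests later on):

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print(count_n_grams(sentences, 2))
# {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2,
#  ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1,
#  ('dog', 'is'): 1, ('is', 'like'): 1}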

2.2 Improve the formula above with k-smoothing when estimating probabilities

$$\hat{P}(w_t \mid w_{t-n} \dots w_{t-1}) = \frac{C(w_{t-n} \dots w_{t-1}\, w_t) + k}{C(w_{t-n} \dots w_{t-1}) + k|V|}$$

  • |V| is the number of words in the vocabulary
def estimate_probability(word, previous_n_gram,
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing

    Args:
        word: next word
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of words in the vocabulary
        k: positive constant, smoothing parameter

    Returns:
        A probability
    """

    previous_n_gram = tuple(previous_n_gram)

    # Count of the previous n-gram
    if previous_n_gram in n_gram_counts:
        previous_n_gram_count = n_gram_counts[previous_n_gram]
    else:
        previous_n_gram_count = 0

    # Denominator
    denominator = previous_n_gram_count + k * vocabulary_size

    # The previous n-gram followed by the current word forms an (n+1)-gram
    n_plus1_gram = previous_n_gram + (word,)

    # Count of the (n+1)-gram
    if n_plus1_gram in n_plus1_gram_counts:
        n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram]
    else:
        n_plus1_gram_count = 0

    # Numerator
    numerator = n_plus1_gram_count + k

    probability = numerator / denominator

    return probability
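
A minimal usage sketch: with k = 1 and a 7-word vocabulary, P('cat' | 'a') = (C('a cat') + 1) / (C('a') + 1 * 7) = (2 + 1) / (2 + 7):

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))  # 7 unique words
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

print(estimate_probability('cat', ['a'], unigram_counts, bigram_counts,
                           len(unique_words), k=1))
# 0.3333...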

2.3 Estimate the probability of every word in the vocabulary

def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """
    Estimate the probabilities of next words using the n-gram counts with k-smoothing

    Args:
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter

    Returns:
        A dictionary mapping from next words to the probability.
    """

    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # add <e> <unk> to the vocabulary
    # <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities
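
The same toy data as above, now scoring every word in the vocabulary at once; because the vocabulary is extended with <e> and <unk>, the denominator becomes 2 + 1 * 9 = 11:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

print(estimate_probabilities(['a'], unigram_counts, bigram_counts, unique_words, k=1))
# 'cat' gets (2 + 1) / 11 ≈ 0.2727; every unseen next word gets 1 / 11 ≈ 0.0909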

2.4 Represent the n-gram counts as a matrix

import numpy as np
import pandas as pd

def make_count_matrix(n_plus1_gram_counts, vocabulary):
    # add <e> <unk> to the vocabulary
    # <s> is omitted since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]

    # obtain unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))

    # mapping from n-gram to row
    row_index = {n_gram: i for i, n_gram in enumerate(n_grams)}
    # mapping from next word to column
    col_index = {word: j for j, word in enumerate(vocabulary)}

    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count

    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix

Test:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

print('bigram counts')
display(make_count_matrix(bigram_counts, unique_words))

https://pic.imgdb.cn/item/610a27de5132923bf8d81ffb.png

2.5 Build the probability matrix

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix

Test:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
display(make_probability_matrix(bigram_counts, unique_words, k=1))

https://pic.imgdb.cn/item/610a294c5132923bf8dd2d60.png

Part 3: Perplexity

We will use perplexity to evaluate how good the model is, with the following formula:

$$PP(W) = \sqrt[N]{\prod_{t=n+1}^{N} \frac{1}{P(w_t \mid w_{t-n} \dots w_{t-1})}}$$

Since indices start at 0 in the code, the equivalent 0-indexed form is used:

$$PP(W) = \sqrt[N]{\prod_{t=n}^{N-1} \frac{1}{P(w_t \mid w_{t-n} \dots w_{t-1})}}$$

The higher the probabilities, the lower the perplexity and the better the model.

3.1 Compute the perplexity

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """
    Calculate perplexity for a single sentence

    Args:
        sentence: List of strings
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of unique words in the vocabulary
        k: Positive smoothing constant

    Returns:
        Perplexity score
    """
    # length of previous words
    n = len(list(n_gram_counts.keys())[0])

    # prepend <s> and append <e>
    sentence = ["<s>"] * n + sentence + ["<e>"]

    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)

    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)

    # The variable product_pi will hold the product
    # that is calculated inside the n-root
    product_pi = 1.0

    # Index t ranges from n to N - 1, inclusive on both ends
    for t in range(n, N):

        # get the n-gram preceding the word at position t
        n_gram = sentence[t - n : t]

        # get the word at position t
        word = sentence[t]

        # Estimate the probability of the word given the n-gram
        # using the n-gram counts, (n+1)-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(word, n_gram, n_gram_counts,
                                           n_plus1_gram_counts, vocabulary_size, k=k)

        # Accumulate the product of the (1 / P) factors
        product_pi *= 1 / probability

    # Take the Nth root of the product
    perplexity = product_pi ** float(1 / N)

    return perplexity
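
A minimal usage sketch on the toy corpus: a sentence taken from the training data should get a lower (better) perplexity than a scrambled version of the same words:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

print(calculate_perplexity(['i', 'like', 'a', 'cat'],
                           unigram_counts, bigram_counts, len(unique_words), k=1.0))  # ≈ 2.8
print(calculate_perplexity(['cat', 'a', 'like', 'i'],
                           unigram_counts, bigram_counts, len(unique_words), k=1.0))  # noticeably higher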

Part 4: Build the Autocomplete System

Compute the probability of every word and suggest the most likely one.

def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    """
    Get suggestion for the next word

    Args:
        previous_tokens: The sentence you input where each token is a word. Must have length > n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
        start_with: If not None, specifies the first few letters of the next word

    Returns:
        A tuple of
        - string of the most likely next word
        - corresponding probability
    """

    # length of previous words
    n = len(list(n_gram_counts.keys())[0])

    # take the last n words as the context
    previous_n_gram = previous_tokens[-n:]

    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)

    suggestion = None
    max_prob = 0

    for word, prob in probabilities.items():

        # if a prefix is required, filter out words that do not start with it
        if start_with is not None:
            if not word.startswith(start_with):
                continue

        if prob > max_prob:
            suggestion = word
            max_prob = prob

    return suggestion, max_prob
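
A minimal usage sketch on the toy corpus, with and without the start_with filter:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ['i', 'like']
print(suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0))
# ('a', 0.2727...)

print(suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words,
                     k=1.0, start_with='c'))
# ('cat', 0.0909...)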