Keshawn_lu's Blog

Andrew Ng's Team NLP: C1_W3_Assignment

2021/07/25

Course link:

Coursera | Online Courses & Credentials From Top Educators. Join for Free | Coursera

Task: work with word vectors to predict word analogies, reduce dimensionality with PCA, and compare words using cosine similarity.

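Part1: Cosine similarity

Cosine similarity is given by the following formula, where A and B are word vectors and θ is the angle between them:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

The implementation is as follows: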

import numpy as np

def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: a numpy array which corresponds to a word vector
    Output:
        cos: the cosine similarity between A and B
    '''
    dot = np.dot(A, B)           # dot product of A and B
    norma = np.linalg.norm(A)    # L2 norm of A
    normb = np.linalg.norm(B)    # L2 norm of B
    cos = dot / (norma * normb)

    return cos
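
As a quick sanity check, the sketch below evaluates the same formula on three hand-picked 2-D vectors. The vectors are hypothetical and chosen only so the results are easy to verify by hand: parallel vectors should give 1, orthogonal vectors 0, and opposite vectors -1.

```python
import numpy as np

def cosine_similarity(A, B):
    """Cosine of the angle between vectors A and B."""
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Hypothetical 2-D vectors, chosen only for illustration
a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # parallel to a, different length
orthogonal = np.array([0.0, 2.0])       # at 90 degrees to a
opposite = np.array([-2.0, 0.0])        # points the other way

print(cosine_similarity(a, same_direction))  # 1.0
print(cosine_similarity(a, orthogonal))      # 0.0
print(cosine_similarity(a, opposite))        # -1.0
```

Note that the length of the vectors does not affect the result; only their direction does.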

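1.1 Euclidean distance

The Euclidean distance between two n-dimensional vectors A and B is given by the following formula:

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$

The more similar two words are, the smaller the Euclidean distance between their vectors.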

def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: a numpy array which corresponds to a word vector
    Output:
        d: the Euclidean distance between A and B
    """
    # sum of the squared element-wise differences
    d = np.sum((A - B) * (A - B))

    # square root of that sum
    d = np.sqrt(d)

    return d
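
The following sketch checks the formula on a 3-4-5 triangle and compares it with `np.linalg.norm`, which computes the same quantity when applied to the difference vector. The input vectors are hypothetical.

```python
import numpy as np

def euclidean(A, B):
    """Euclidean distance between vectors A and B."""
    return np.sqrt(np.sum((A - B) ** 2))

# Hypothetical vectors whose difference is (3, 4)
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(euclidean(a, b))         # 5.0
print(np.linalg.norm(a - b))   # 5.0, same value from NumPy's built-in norm
```

Unlike cosine similarity, this measure depends on vector length: a vector and a scaled copy of itself point in the same direction but are a nonzero distance apart.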

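1.2 Finding the capital of each country

The similarity measures above can be used to relate countries and their capitals through vector arithmetic, in the same way as King - Man + Woman = Queen.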

def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of city1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
    Output:
        country: a tuple (word, similarity) with the most likely country and its similarity score
    """
    # store city1, country1, and city2 in a set called group
    group = set((city1, country1, city2))

    # get the embeddings of city1, country1, and city2
    city1_emb = embeddings[city1]
    country1_emb = embeddings[country1]
    city2_emb = embeddings[city2]

    # estimate the embedding of country2 by combining the other three embeddings
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb

    # initialize the similarity to -1 (it will be replaced by similarities closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # skip the three input words
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # cosine similarity between the estimated country2 vector and this word
            cur_similarity = cosine_similarity(vec, word_emb)

            # keep the word if it is more similar than the best one so far
            if cur_similarity > similarity:
                similarity = cur_similarity

                # store the country as a tuple of the word and its similarity
                country = (word, similarity)

    return country
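
To see the analogy arithmetic in isolation, here is a minimal sketch on made-up 2-D embeddings. The helper name `closest_word` and the `toy` dictionary are assumptions for illustration, not part of the assignment. Each toy country is placed exactly [0, 2] away from its capital, so the analogy holds by construction; real embeddings only satisfy it approximately.

```python
import numpy as np

def cosine_similarity(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

def closest_word(vec, embeddings, exclude):
    """Return (word, similarity) for the word most similar to vec, skipping `exclude`."""
    best = ("", -1.0)
    for word, emb in embeddings.items():
        if word in exclude:
            continue
        sim = cosine_similarity(vec, emb)
        if sim > best[1]:
            best = (word, sim)
    return best

# Hypothetical 2-D embeddings (not real word vectors): each country sits
# exactly [0, 2] away from its capital, so the analogy holds by construction.
toy = {
    "Athens": np.array([1.0, 1.0]),   "Greece": np.array([1.0, 3.0]),
    "Cairo": np.array([4.0, 1.0]),    "Egypt": np.array([4.0, 3.0]),
    "Bangkok": np.array([-3.0, 1.0]), "Thailand": np.array([-3.0, 3.0]),
}

# Greece - Athens + Cairo should land on Egypt
vec = toy["Greece"] - toy["Athens"] + toy["Cairo"]
result = closest_word(vec, toy, exclude={"Athens", "Greece", "Cairo"})
print(result[0])  # Egypt
```

Excluding the three input words matters: without it, the nearest word to the combined vector is often one of the inputs themselves.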

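1.3 Model accuracy

The accuracy of the model is computed with the following formula:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$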

def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs

    Output:
        accuracy: the accuracy of the model
    '''
    # initialize the number of correct predictions to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # read the four words of the analogy from the row
        city1 = row["city1"]
        country1 = row["country1"]
        city2 = row["city2"]
        country2 = row["country2"]

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)

        # count the prediction if it matches the actual country2
        if predicted_country2 == country2:
            num_correct += 1

    # get the number of rows in the dataframe
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m

    return accuracy
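
The accuracy computation itself does not depend on pandas or on the embeddings. The sketch below applies the same ratio to two plain Python lists; the function name `accuracy` and the example predictions are made up for illustration and are not results from the assignment.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the corresponding labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical predictions: three of the four match the true country
predicted = ["Egypt", "Thailand", "Japan", "Peru"]
actual = ["Egypt", "Thailand", "China", "Peru"]

print(accuracy(predicted, actual))  # 0.75
```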

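Part2: Dimensionality reduction with PCA

The PCA concept:

PCA (Principal Component Analysis) is one of the most widely used dimensionality reduction algorithms. Its main idea is to map n-dimensional features onto k dimensions. These k dimensions are new, mutually orthogonal features, called principal components, constructed from the original n features.

PCA finds a sequence of mutually orthogonal coordinate axes in the original space, and the choice of axes depends closely on the data itself. The first new axis is the direction of greatest variance in the original data. The second is the direction of greatest variance among all directions orthogonal to the first axis, and the third is the direction of greatest variance among all directions orthogonal to the first two. Continuing in this way yields n such axes.

For many datasets, most of the variance is captured by the first k axes, while the remaining axes carry almost none. We can therefore drop the remaining axes and keep only the first k. In effect, this keeps the feature dimensions that hold most of the variance and discards those whose variance is close to zero, which reduces the dimensionality of the data.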
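
The idea that a few axes can hold most of the variance can be checked numerically. The sketch below is an illustration on synthetic data, not part of the assignment: it builds 3-D points whose spread is deliberately concentrated on one axis, then prints the cumulative share of variance explained by the sorted eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 3-D with a large spread on the first
# axis and progressively smaller spread on the other two
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])

X_demeaned = X - X.mean(axis=0)
cov = np.cov(X_demeaned, rowvar=False)

# eigvalsh returns eigenvalues in ascending order, so reverse them
eigen_vals = np.linalg.eigvalsh(cov)[::-1]

# share of the total variance carried by the first 1, 2, 3 axes
cumulative = np.cumsum(eigen_vals / eigen_vals.sum())
print(cumulative)
```

With this construction the first entry is close to 1, so keeping a single axis would already preserve almost all of the variance. Real word embeddings are usually far less extreme.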

The following steps reduce the data to k dimensions:

  • Mean-center the data
  • Compute the covariance matrix
  • Compute the eigenvalues and eigenvectors of the covariance matrix
  • Sort the eigenvalues in descending order and select the k largest. Use the corresponding k eigenvectors as the columns of an eigenvector matrix.
  • Multiply the mean-centered data by this matrix to obtain the reduced data

def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: number of components you want to keep.
    Output:
        X_reduced: data transformed to n_components dims/columns
    """
    # mean center the data (subtract the mean of each column)
    X_demeaned = X - np.mean(X, axis=0)

    # compute the covariance matrix, treating each column as a variable
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # compute the eigenvalues and eigenvectors
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

    # indices that sort the eigenvalues from lowest to highest
    idx_sorted = np.argsort(eigen_vals)

    # reverse the order so that it's from highest to lowest
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigenvalues by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # reorder the eigenvector columns with the same indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]

    # select the first n_components eigenvectors (columns)
    eigen_vecs_subset = eigen_vecs_sorted[:, 0:n_components]

    # transform the data by multiplying the transpose of the eigenvectors
    # with the transpose of the de-meaned data,
    # then take the transpose of that product
    X_reduced = np.dot(eigen_vecs_subset.transpose(), X_demeaned.transpose()).transpose()

    return X_reduced
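
One way to gain confidence in an eigendecomposition-based PCA is to compare it with a version built on singular value decomposition, which yields the same principal directions. The sketch below is a cross-check under stated assumptions: the function names `pca_eigh` and `pca_svd` and the random test data are hypothetical. Because each principal axis is only defined up to sign, the comparison uses absolute values.

```python
import numpy as np

def pca_eigh(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    X_demeaned = X - X.mean(axis=0)
    eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_demeaned, rowvar=False))
    # indices of the n_components largest eigenvalues, largest first
    order = np.argsort(eigen_vals)[::-1][:n_components]
    return X_demeaned @ eigen_vecs[:, order]

def pca_svd(X, n_components=2):
    """PCA via singular value decomposition of the mean-centered data."""
    X_demeaned = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(X_demeaned, full_matrices=False)
    return X_demeaned @ Vt[:n_components].T

# Hypothetical data: 50 random 5-D points with clearly different spreads per axis
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

A = pca_eigh(X)
B = pca_svd(X)
print(A.shape)  # (50, 2)
print(np.allclose(np.abs(A), np.abs(B)))
```

The sign ambiguity also means that a 2-D plot produced by one implementation may appear mirrored relative to another without either being wrong.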

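2.1 Plotting

When the word vectors for ['oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful'] are reduced to two dimensions with PCA, we obtain the following plot:

https://pic.imgdb.cn/item/60fd136f5132923bf899478b.png

As the plot shows, related words lie very close to one another.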
