
Andrew Ng Deep Learning: C2_W1_Assignment

2021/09/16


Task 1: Weight initialization

Part 0: Preparing the data

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()

https://pic.imgdb.cn/item/613b1ca944eaada739dc6f2e.png

We need a classifier to separate the blue dots from the red dots.

Part 1: The neural network model

A three-layer neural network is already provided below; we will try the following three initialization methods:

  • Zeros initialization
  • Random initialization
  • He initialization
def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros","random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """

    grads = {}
    costs = [] # to keep track of the loss
    m = X.shape[1] # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

Exercise 1: Zero initialization

Initialize the following parameters to zero:

  • the weight matrices $W^{[1]}, W^{[2]}, W^{[3]}, \dots, W^{[L-1]}, W^{[L]}$
  • the bias vectors $b^{[1]}, b^{[2]}, b^{[3]}, \dots, b^{[L-1]}, b^{[L]}$
def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims) # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters["W" + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
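
As a quick sanity check (the layer sizes below are chosen purely for illustration, they are not the ones used by model()), we can call the function and inspect the result:

parameters = initialize_parameters_zeros([3, 2, 1])
print("W1 = " + str(parameters["W1"]))  # a (2, 3) matrix of zeros
print("b1 = " + str(parameters["b1"]))  # a (2, 1) vector of zeros
print("W2 = " + str(parameters["W2"]))  # a (1, 2) matrix of zeros
print("b2 = " + str(parameters["b2"]))  # a (1, 1) vector of zeros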

We train the model for 15,000 iterations and observe the results:

https://pic.imgdb.cn/item/613b1e9d44eaada739df13db.png

On the train set:
Accuracy: 0.5
On the test set:
Accuracy: 0.5

The performance is very poor and the cost does not really decrease. Let's visualize the classification result:

plt.title("Model with Zeros initialization")
axes = plt.gca()
axes.set_xlim([-1.5,1.5])
axes.set_ylim([-1.5,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

https://pic.imgdb.cn/item/613b1f1444eaada739dfbe19.png

The model predicts 0 for every example. In general, initializing all the weights to zero fails to break symmetry: every neuron in a given layer ends up learning exactly the same thing.
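
A minimal sketch of why the symmetry is never broken (the tiny two-unit example below is my own illustration, not part of the assignment): with all-zero weights, both hidden units compute the same activation, therefore receive the same gradient, and so stay identical after every update.

W1 = np.zeros((2, 2))             # two hidden units, two input features
x = np.array([[1.0], [2.0]])      # one input example
z1 = np.dot(W1, x)                # both units compute the same pre-activation
a1 = np.maximum(0, z1)            # and therefore the same activation
print(z1.ravel(), a1.ravel())     # [0. 0.] [0. 0.]
# Since both rows of W1 see identical activations and identical upstream gradients,
# gradient descent updates them identically, so they never diverge.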

Exercise 2: Random initialization

To break symmetry, we initialize the weights randomly.

def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) # integer representing the number of layers

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters

Again, let's look at the results after 15,000 iterations:

https://pic.imgdb.cn/item/613b20b044eaada739e1d711.png

On the train set:
Accuracy: 0.83
On the test set:
Accuracy: 0.86

We get a better result than before. Again, let's visualize the classification:

plt.title("Model with large random initialization")
axes = plt.gca()
axes.set_xlim([-1.5,1.5])
axes.set_ylim([-1.5,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

https://pic.imgdb.cn/item/613b21b044eaada739e32aa0.png

Exercise 3: He initialization

When initializing each W, multiply the random values by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$.

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
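
A quick empirical check of the scaling (the layer sizes here are chosen for illustration only): the entries of each weight matrix should have a standard deviation close to $\sqrt{2 / n^{[l-1]}}$.

parameters = initialize_parameters_he([4, 1000, 1])
print(np.std(parameters["W1"]))   # roughly sqrt(2/4)    ≈ 0.71
print(np.std(parameters["W2"]))   # roughly sqrt(2/1000) ≈ 0.045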

Again, let's look at the results after 15,000 iterations:

https://pic.imgdb.cn/item/613b226944eaada739e41b42.png

On the train set:
Accuracy: 0.9933333333333333
On the test set:
Accuracy: 0.96

Let's look at the classification result:

https://pic.imgdb.cn/item/613b22fb44eaada739e4e478.png

He initialization separates the two classes very well.

Task 2: Using regularization in a deep learning model

Part 0: Preparing the libraries

# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

Problem statement: recommend positions where the French goalkeeper should kick the ball so that the French team's players can then hit it with their heads.

https://pic.imgdb.cn/item/613b249844eaada739e70d87.png

The data is a 2D dataset from France's past ten matches:

train_X, train_Y, test_X, test_Y = load_2D_dataset()

https://pic.imgdb.cn/item/613b24d044eaada739e756f9.png

Each dot corresponds to a position on the field where the ball was played. A blue dot means a French player managed to hit the ball with their head; a red dot means a player from the other team hit it.

Our goal: figure out where on the field the goalkeeper should kick the ball.

Part 1: The non-regularized model

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = [] # to keep track of the cost
    m = X.shape[1] # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1)  # it is possible to use both L2 regularization and dropout,
                                              # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

We first train the model without any regularization and check the accuracy on the training and test sets:

parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

https://pic.imgdb.cn/item/613b270744eaada739ea4f63.png

On the training set:
Accuracy: 0.9478672985781991
On the test set:
Accuracy: 0.915

Let's look at the decision boundary:

plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

https://pic.imgdb.cn/item/613b274744eaada739eaa3e6.png

The model is clearly overfitting the training set. Next, let's look at two techniques for reducing overfitting.

Exercise 1: L2 regularization

The regularized cost adds an L2 penalty to the cross-entropy cost:

$J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[3](i)} + (1 - y^{(i)}) \log\left(1 - a^{[3](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m} \sum_{l} \sum_{k} \sum_{j} \left( W_{k,j}^{[l]} \right)^{2}}_{\text{L2 regularization cost}}$

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * lambd / (2 * m)
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Exercise 2: Backward propagation with regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1 / m * np.dot(dZ3, A2.T) + lambd * W3 / m
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))

    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1 / m * np.dot(dZ2, A1.T) + lambd * W2 / m
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))

    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1 / m * np.dot(dZ1, X.T) + lambd * W1 / m
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Now let's run the model with L2 regularization and observe the effect:

parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

https://pic.imgdb.cn/item/613b2f6a44eaada739f5f66b.png

On the train set:
Accuracy: 0.9383886255924171
On the test set:
Accuracy: 0.93

The test-set accuracy has improved. Let's plot the decision boundary:

plt.title("Model with L2-regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

https://pic.imgdb.cn/item/613b2fc244eaada739f67532.png

  • $\lambda$ is a hyperparameter that can be tuned on a dev set (see the sketch below).
  • L2 regularization makes the decision boundary smoother. If $\lambda$ is too large, it can over-smooth and the model ends up with high bias.
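
A minimal sketch of how $\lambda$ could be tuned (the candidate values, and using the test set as a stand-in dev set, are my own assumptions for illustration):

for lambd in [0.01, 0.1, 0.7, 3.0]:     # candidate values, chosen arbitrarily
    parameters = model(train_X, train_Y, lambd = lambd, print_cost = False)
    print("lambda = " + str(lambd))
    predictions = predict(test_X, test_Y, parameters)   # test set used here as a stand-in dev set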

Exercise 3: Forward propagation with dropout

At each iteration, keep_prob controls the probability with which neurons are kept (each neuron is shut down with probability 1 - keep_prob). The idea behind dropout is that at every iteration you train a different model that uses only a subset of the neurons, so neurons become less sensitive to the activation of any one specific neuron.

The steps are:

  • Create a matrix $D^{[1]} = [d^{[1](1)} \, d^{[1](2)} \dots d^{[1](m)}]$ with the same dimensions as $A^{[1]}$
  • Threshold the entries of $D^{[1]}$ with keep_prob so they become 0 or 1
  • Multiply $A^{[1]}$ by $D^{[1]}$ to shut down the corresponding neurons
  • Divide $A^{[1]}$ by keep_prob so the result keeps the same expected value as without dropout (a small numeric check follows below)
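
A small numeric check of the last point (the random "activations" below are purely illustrative): inverted dropout keeps the expected value of the activations roughly unchanged.

np.random.seed(0)                                          # illustrative values only
A = np.random.rand(5, 10000)                               # fake activations
keep_prob = 0.8
D = (np.random.rand(A.shape[0], A.shape[1]) < keep_prob)   # keep each unit with probability keep_prob
A_drop = (A * D) / keep_prob                               # mask, then rescale (inverted dropout)
print(A.mean(), A_drop.mean())                             # the two means are close
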
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                  W1 -- weight matrix of shape (20, 2)
                  b1 -- bias vector of shape (20, 1)
                  W2 -- weight matrix of shape (3, 20)
                  b2 -- bias vector of shape (3, 1)
                  W3 -- weight matrix of shape (1, 3)
                  b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)
    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    D1 = (D1 < keep_prob)
    # Step 3: shut down some neurons of A1
    A1 = A1 * D1
    # Step 4: scale the value of neurons that haven't been shut down
    A1 = A1 / keep_prob
    ### END CODE HERE ###

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    D2 = (D2 < keep_prob)
    # Step 3: shut down some neurons of A2
    A2 = A2 * D2
    # Step 4: scale the value of neurons that haven't been shut down
    A2 = A2 / keep_prob
    ### END CODE HERE ###

    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

Exercise 4: Backward propagation with dropout

The steps are:

  • In forward propagation we used the mask to shut down some neurons, so in backward propagation we must shut down the same neurons by applying the same mask to dA1 (and dA2).
  • In forward propagation we divided A1 by keep_prob, so in backward propagation we must also divide dA1 by keep_prob.
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = D2 * dA2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###

    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = D1 * dA1
    # Step 2: Scale the value of neurons that haven't been shut down
    dA1 = dA1 / keep_prob
    ### END CODE HERE ###

    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Now let's run the model with dropout and observe the effect:

parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

https://pic.imgdb.cn/item/613b362244eaada739ff69bf.png

On the train set:
Accuracy: 0.8957345971563981
On the test set:
Accuracy: 0.92

Unfortunately, the result I obtained does not match the expected one: the test-set accuracy did not reach the expected 95%. (Thresholding the dropout masks at a hard-coded 0.5 instead of keep_prob produces exactly this kind of discrepancy.)

Let's look at the decision boundary:

plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

https://pic.imgdb.cn/item/613b36a244eaada73900208f.png

Summary:

  • Regularization helps reduce overfitting.
  • Regularization drives the weights to lower values (see the sketch below).
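
A hedged sketch of the second point (re-training both models just to compare weight norms is my own illustration, not something the assignment asks for):

params_plain = model(train_X, train_Y, print_cost = False)
params_l2 = model(train_X, train_Y, lambd = 0.7, print_cost = False)
for name in ["W1", "W2", "W3"]:
    # the Frobenius norms of the L2-regularized weights come out noticeably smaller
    print(name, np.linalg.norm(params_plain[name]), np.linalg.norm(params_l2[name]))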

Task 3: Implementing and using gradient checking

Part 0: Preparing the libraries

# Packages
import numpy as np
from testCases import *
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector

How does gradient checking work?

Let's first recall the definition of the derivative (gradient):

$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$

What we need to do is make sure that the gradient computed by backward propagation matches this value.

Exercise 1: One-dimensional gradient checking

Consider the one-dimensional linear function $J(\theta) = \theta x$. We will compute $J(\cdot)$ and the derivative $\frac{\partial J}{\partial \theta}$, and then use gradient checking to make sure the derivative is computed correctly.

https://pic.imgdb.cn/item/613b4bc544eaada739204c04.png

Let's implement its forward and backward propagation:

def forward_propagation(x, theta):
    """
    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    J -- the value of function J, computed using the formula J(theta) = theta * x
    """

    ### START CODE HERE ### (approx. 1 line)
    J = theta * x
    ### END CODE HERE ###

    return J

def backward_propagation(x, theta):
    """
    Computes the derivative of J with respect to theta (see Figure 1).

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    dtheta -- the gradient of the cost with respect to theta
    """

    ### START CODE HERE ### (approx. 1 line)
    dtheta = x
    ### END CODE HERE ###

    return dtheta

The steps to implement gradient checking are:

  • Compute gradapprox:
    1. $\theta^{+} = \theta + \varepsilon$
    2. $\theta^{-} = \theta - \varepsilon$
    3. $J^{+} = J(\theta^{+})$
    4. $J^{-} = J(\theta^{-})$
    5. $gradapprox = \frac{J^{+} - J^{-}}{2 \varepsilon}$
  • Then compute the gradient grad using backward propagation
  • Finally, compute the relative difference: $difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2}$

Tip:

  • Use np.linalg.norm(...) to compute the norms.
def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon
    thetaminus = theta - epsilon
    J_plus = forward_propagation(x, thetaplus)
    J_minus = forward_propagation(x, thetaminus)
    gradapprox = (J_plus - J_minus) / (2 * epsilon)
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    ### END CODE HERE ###

    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")

    return difference
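
A quick usage sketch (the scalar inputs are chosen for illustration):

x, theta = 2, 4                           # illustrative scalar inputs
difference = gradient_check(x, theta)     # prints "The gradient is correct!"
print("difference = " + str(difference))  # a very small value (roughly on the order of 1e-10)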

Exercise 2: N-dimensional gradient checking

https://pic.imgdb.cn/item/613b4dda44eaada73923cb0e.png

Implement the forward and backward propagation:

def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                  W1 -- weight matrix of shape (5, 4)
                  b1 -- bias vector of shape (5, 1)
                  W2 -- weight matrix of shape (3, 5)
                  b2 -- bias vector of shape (3, 1)
                  W3 -- weight matrix of shape (1, 3)
                  b3 -- bias vector of shape (1, 1)

    Returns:
    cost -- the cost function (logistic cost for one example)
    """

    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1./m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache

def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.

    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) # * 2
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

The derivative formula is still

$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$

but $\theta$ is no longer a scalar: it is a dictionary of parameters. A helper function dictionary_to_vector() is therefore provided, which reshapes all the parameters into vectors and concatenates them, along with its inverse vector_to_dictionary().

https://pic.imgdb.cn/item/613b4f9f44eaada739268660.png
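
The real helpers live in gc_utils, but a rough sketch of what dictionary_to_vector() does (my own simplified version; the actual helper also returns the list of keys) could look like this:

def dictionary_to_vector_sketch(parameters):
    # Flatten each parameter into a column vector and stack them in a fixed order.
    vectors = []
    for key in ["W1", "b1", "W2", "b2", "W3", "b3"]:
        vectors.append(parameters[key].reshape(-1, 1))
    return np.concatenate(vectors, axis=0)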

Computation steps:

  • Compute J_plus[i]:
    • Set $\theta^{+}$ to np.copy(parameters_values)
    • Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    • Compute $J^{+}_i$ with forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$))
  • Compute J_minus[i] in the same way
  • Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$
  • Compute $difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2}$
def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because forward_propagation_n outputs two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)
        thetaplus[i][0] += epsilon
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)
        thetaminus[i][0] -= epsilon
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    ### END CODE HERE ###

    if difference > 1e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
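
To run the check end to end, the test cell looks roughly like the following (gradient_check_n_test_case is assumed to come from the imported testCases module, as in the course materials):

X, Y, parameters = gradient_check_n_test_case()
cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)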