This is an example of a positive tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
This is an example of the processed version of the tweet: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
Part 1: Logistic Regression
1.1 Sigmoid
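The sigmoid function is defined as follows:

$$h(z) = \frac{1}{1 + e^{-z}}$$

It maps any input z to a value in (0, 1). [Figure: graph of the sigmoid function]

The implementation is as follows: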
    import numpy as np

    def sigmoid(z):
        '''
        Input:
            z: is the input (can be a scalar or an array)
        Output:
            h: the sigmoid of z
        '''
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # calculate the sigmoid of z
        h = 1 / (1 + np.exp(-z))
        ### END CODE HERE ###
        return h
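As a quick sanity check, the sketch below evaluates the function at a few points. It redefines `sigmoid` so the snippet runs on its own; the test inputs are illustrative choices and not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid of a scalar or a NumPy array (same formula as above)."""
    return 1 / (1 + np.exp(-z))

# sigmoid(0) is exactly 0.5, the midpoint of the (0, 1) range
mid = sigmoid(0)

# applied element-wise to an array; large |z| saturates toward 0 or 1
values = sigmoid(np.array([-10.0, 0.0, 10.0]))

# symmetry property: sigmoid(-z) == 1 - sigmoid(z)
z = np.array([-2.0, 0.5, 3.0])
symmetric = np.allclose(sigmoid(-z), 1 - sigmoid(z))
```

Because the output crosses 0.5 exactly at z = 0, thresholding the probability at 0.5 is the same as checking the sign of z.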
    def gradientDescent(x, y, theta, alpha, num_iters):
        '''
        Input:
            x: matrix of features which is (m,n+1)
            y: corresponding labels of the input matrix x, dimensions (m,1)
            theta: weight vector of dimension (n+1,1)
            alpha: learning rate
            num_iters: number of iterations you want to train your model for
        Output:
            J: the final cost
            theta: your final weight vector
        Hint: you might want to print the cost to make sure that it is going down.
        '''
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # get 'm', the number of rows in matrix x
        m = np.shape(x)[0]
        for i in range(0, num_iters):
            # get z, the dot product of x and theta
            z = np.dot(x, theta)
            # get the sigmoid of z
            h = sigmoid(z)
            # calculate the cost function
            vector_1 = np.ones((np.shape(y)[0], 1))  # (m, 1) array of ones
            J = (-1/m) * (np.dot(y.T, np.log(h)) + np.dot((vector_1 - y).T, np.log(vector_1 - h)))

            # update the weights theta
            theta = theta - (alpha / m) * np.dot(x.T, (h - y))
        ### END CODE HERE ###
        print(J)
        # J has shape (1, 1); squeeze it before converting to a Python float
        J = float(np.squeeze(J))
        return J, theta
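For reference, the quantities computed in each iteration of the loop can be written in matrix form. This is a transcription of the code above, with X the (m, n+1) feature matrix, y the (m, 1) label vector, and \mathbf{1} the all-ones vector the code calls `vector_1`; the symbol \sigma for the sigmoid is notation introduced here.

```latex
h = \sigma(X\theta), \qquad
J(\theta) = -\frac{1}{m}\left[\, y^{\top}\log h + (\mathbf{1}-y)^{\top}\log(\mathbf{1}-h) \,\right], \qquad
\theta \leftarrow \theta - \frac{\alpha}{m}\, X^{\top}(h - y)
```

The logarithms are applied element-wise, so J is a (1, 1) array in the code, which is why it is converted to a float before being returned.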
    def extract_features(tweet, freqs):
        '''
        Input:
            tweet: a string containing one tweet
            freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        Output:
            x: a feature vector of dimension (1,3)
        '''
        # process_tweet tokenizes, stems, and removes stopwords
        word_l = process_tweet(tweet)

        # 3 elements in the form of a 1 x 3 vector
        x = np.zeros((1, 3))

        # bias term is set to 1
        x[0,0] = 1

        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # loop through each word in the list of words
        for word in word_l:
            # increment the word count for the positive label 1
            if (word, 1) in freqs:
                x[0,1] += freqs[(word, 1)]
            # increment the word count for the negative label 0
            if (word, 0) in freqs:
                x[0,2] += freqs[(word, 0)]
        ### END CODE HERE ###
        assert(x.shape == (1, 3))
        return x
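To make the three features concrete, here is a minimal sketch on made-up data. `toy_extract_features` and `toy_freqs` are hypothetical names introduced for illustration; the function takes tokens that are already processed, so it does not depend on `process_tweet`.

```python
import numpy as np

def toy_extract_features(words, freqs):
    """Hypothetical variant of extract_features that takes pre-processed tokens."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1), 0)  # count under the positive label
        x[0, 2] += freqs.get((word, 0), 0)  # count under the negative label
    return x

# made-up frequency table: (word, label) -> count
toy_freqs = {
    ("happi", 1): 12,
    ("happi", 0): 2,
    ("sad", 0): 9,
    (":)", 1): 30,
}

# "unseen" is absent from the table, so it contributes nothing
x = toy_extract_features(["happi", ":)", "unseen"], toy_freqs)
```

Here the positive feature is 12 + 30 = 42 and the negative feature is 2, so the resulting vector is [1, 42, 2].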
Part 3: Training the Model
We use the functions written above to train the model.
    X = np.zeros((len(train_x), 3))
    for i in range(len(train_x)):
        X[i, :] = extract_features(train_x[i], freqs)

    # training labels corresponding to X
    Y = train_y

    # Apply gradient descent
    J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
    print(f"The cost after training is {J:.8f}.")
    print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
The output is:
    [[0.24216529]]
    The cost after training is 0.24216529.
    The resulting vector of weights is [7e-08, 0.0005239, -0.00055517]
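To see what these weights mean in practice, the sketch below applies them to a single feature vector. The weights are copied from the rounded output above; the feature values (3000 and 100) are hypothetical numbers chosen for illustration, not taken from the dataset.

```python
import numpy as np

# weights copied from the training output (rounded to 8 decimals)
theta = np.array([[7e-08], [0.0005239], [-0.00055517]])

# hypothetical feature vector: [bias, sum of positive counts, sum of negative counts]
x = np.array([[1.0, 3000.0, 100.0]])

z = np.dot(x, theta)           # linear score, shape (1, 1)
prob = 1 / (1 + np.exp(-z))    # sigmoid turns the score into a probability
prob_positive = prob.item()    # plain Python float, roughly 0.82 for these inputs
```

The positive-count weight is positive and the negative-count weight is negative, so a tweet whose words appear mostly in positive training tweets is pushed above the 0.5 threshold.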
    def predict_tweet(tweet, freqs, theta):
        '''
        Input:
            tweet: a string
            freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
            theta: (3,1) vector of weights
        Output:
            y_pred: the probability of a tweet being positive or negative
        '''
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # extract the features of the tweet and store it into x
        x = extract_features(tweet, freqs)

        # make the prediction using x and theta
        y_pred = sigmoid(np.dot(x, theta))
        ### END CODE HERE ###
        return y_pred
    def test_logistic_regression(test_x, test_y, freqs, theta):
        """
        Input:
            test_x: a list of tweets
            test_y: (m, 1) vector with the corresponding labels for the list of tweets
            freqs: a dictionary with the frequency of each pair (or tuple)
            theta: weight vector of dimension (3, 1)
        Output:
            accuracy: (# of tweets classified correctly) / (total # of tweets)
        """
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # the list for storing predictions
        y_hat = []
        for tweet in test_x:
            # get the label prediction for the tweet
            y_pred = predict_tweet(tweet, freqs, theta)
            if y_pred > 0.5:
                # append 1.0 to the list
                y_hat.append(1.0)  # positive
            else:
                # append 0 to the list
                y_hat.append(0)  # negative

        len_y = len(y_hat)  # m, the total number of examples
        # With the above implementation, y_hat is a list, but test_y is (m,1) array
        # convert both to one-dimensional arrays in order to compare them using the '==' operator
        y_hat = np.array(y_hat)
        correct = 0
        print(y_hat)
        for i in range(len_y):
            if y_hat[i] == test_y[i][0]:  # compare element by element
                correct += 1
        accuracy = correct / len_y