def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output:
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
    """
    # X_l and Y_l are lists of the English and French word embeddings
    X_l = list()
    Y_l = list()

    # the English and French words that actually have an embedding
    english_set = set(english_vecs.keys())
    french_set = set(french_vecs.keys())

    # keep only the pairs where both words have an embedding
    for en_word, fr_word in en_fr.items():
        if en_word in english_set and fr_word in french_set:
            X_l.append(english_vecs[en_word])
            Y_l.append(french_vecs[fr_word])

    # stack the lists of vectors into matrices, one embedding per row
    X = np.vstack(X_l)
    Y = np.vstack(Y_l)

    return X, Y
def compute_loss(X, Y, R):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from
           English to French vector space embeddings.
    Outputs:
        L: a scalar value - the value of the loss function for given X, Y and R.
    '''
    m = X.shape[0]            # number of rows (word pairs)
    diff = np.dot(X, R) - Y   # XR - Y

    # square every entry of the difference and sum them up
    diff_squared = np.square(diff)
    sum_diff_squared = np.sum(diff_squared)

    # average over the number of examples
    loss = sum_diff_squared / m
    return loss
1.3 Computing the gradient of the loss with respect to the transformation matrix R
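For reference, compute_loss above evaluates L = (1/m) ||XR - Y||_F^2; differentiating with respect to R gives the gradient that compute_gradient implements:

\[
\frac{\partial L}{\partial R} = \frac{2}{m} X^\top (XR - Y)
\]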
def compute_gradient(X, Y, R):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from
           English to French vector space embeddings.
    Outputs:
        g: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
    '''
    m = X.shape[0]

    # gradient of the loss with respect to R: (2/m) * X^T (XR - Y)
    gradient = np.dot(X.T, np.dot(X, R) - Y) * (2 / m)
    return gradient
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        train_steps: positive int - number of gradient descent steps to take.
        learning_rate: positive float - the step size used by gradient descent.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R - Y||^2
    '''
    np.random.seed(129)
    # initialize R randomly with dimension (n,n)
    R = np.random.rand(X.shape[1], X.shape[1])
    for i in range(train_steps):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        gradient = compute_gradient(X, Y, R)
        R -= learning_rate * gradient

    return R
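A minimal usage sketch of the training step; the dictionary names (en_fr_train, en_embeddings_subset, fr_embeddings_subset) and the hyperparameter values are assumptions, not part of the code above:

X_train, Y_train = get_matrices(en_fr_train, fr_embeddings_subset, en_embeddings_subset)  # hypothetical inputs
R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)          # illustrative hyperparameters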
def nearest_neighbor(v, candidates, k=1):
    """
    Input:
        - v, the vector you are going to find the nearest neighbor for
        - candidates: a set of vectors where we will find the neighbors
        - k: top k nearest neighbors to find
    Output:
        - k_idx: the indices of the top k closest vectors in sorted form
    """
    similarity_l = []

    # compute the cosine similarity between v and every candidate vector
    for row in candidates:
        cos_similarity = cosine_similarity(row, v)
        similarity_l.append(cos_similarity)

    # sort the similarities in ascending order and keep the indices of the k largest
    sorted_ids = np.argsort(similarity_l)
    k_idx = sorted_ids[-k:]

    return k_idx
def test_vocabulary(X, Y, R):
    '''
    Input:
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
        R: the transform matrix which translates word embeddings from
           English to French word vector space.
    Output:
        accuracy: the fraction of English words whose projected embedding
        is closest to the correct French embedding.
    '''
    # project every English embedding into the French vector space
    pred = np.dot(X, R)

    num_correct = 0
    for i in range(len(pred)):
        # index of the French embedding closest to the i-th prediction
        pred_idx = nearest_neighbor(pred[i], Y)

        # the translation is correct when the closest vector sits at the same row index
        if pred_idx == i:
            num_correct += 1

    accuracy = num_correct / pred.shape[0]
    return accuracy
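The transformation can then be scored on a held-out dictionary; en_fr_test is a hypothetical test dictionary and R_train comes from the sketch above:

X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)  # hypothetical test split
acc = test_vocabulary(X_val, Y_val, R_train)
print(f"accuracy on test set: {acc:.3f}")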
Part 2: LSH and document search
2.1 Computing document embeddings
Preprocess each document to remove noise.
Then the document embedding is simply the sum of the embeddings of the words in the document.
def get_document_embedding(tweet, en_embeddings):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    # every word embedding is a 300-dimensional vector
    doc_embedding = np.zeros(300)

    # clean the tweet and sum the embeddings of the remaining words;
    # words without an embedding contribute 0
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        doc_embedding += en_embeddings.get(word, 0)

    return doc_embedding
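A quick sanity check on a sample tweet; the tweet text and the dictionary name en_embeddings_subset are placeholders:

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning"  # example text
tweet_embedding = get_document_embedding(custom_tweet, en_embeddings_subset)
print(tweet_embedding.shape)  # expected: (300,)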
def get_document_vecs(all_docs, en_embeddings):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''
    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is the list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):
        # get the document embedding of the tweet and store it in both containers
        doc_embedding = get_document_embedding(doc, en_embeddings)
        ind2Doc_dict[i] = doc_embedding
        document_vec_l.append(doc_embedding)

    # stack the list of document vectors into a 2D array, one tweet per row
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict
def hash_value_of_vector(v, planes):
    """Create a hash for a vector; hash_id says which random hash to use.
    Input:
        - v: vector of tweet. Its dimension is (1, N_DIMS)
        - planes: matrix of dimension (N_DIMS, N_PLANES) - the set of planes that divide up the region
    Output:
        - res: a number which is used as a hash for your vector
    """
    # for the set of planes,
    # calculate the dot product between the vector and the matrix containing the planes
    # remember that planes has shape (300, 10)
    # the dot product will have the shape (1, 10)
    dot_product = np.dot(v, planes)

    # the sign of each dot product tells on which side of each plane the vector lies;
    # turn it into a 0/1 indicator per plane
    sign_of_dot_product = np.sign(dot_product)
    h = sign_of_dot_product >= 0
    h = np.squeeze(h)

    # combine the indicators into a single integer: hash = sum_i 2^i * h_i
    hash_value = 0
    n_planes = planes.shape[1]
    for i in range(n_planes):
        hash_value += np.power(2, i) * h[i]

    return int(hash_value)
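A toy illustration of the hashing rule with 3 planes in 3 dimensions (all values are made up):

np.random.seed(0)
toy_planes = np.random.normal(size=(3, 3))  # 3-dimensional space, 3 random planes
toy_v = np.array([[1.0, 2.0, 3.0]])
print(hash_value_of_vector(toy_v, toy_planes))  # prints an integer in [0, 2**3)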
def make_hash_table(vecs, planes):
    """
    Input:
        - vecs: list of vectors to be hashed.
        - planes: the matrix of planes in a single "universe", with shape
          (embedding dimensions, number of planes).
    Output:
        - hash_table: dictionary - keys are hashes, values are lists of vectors (hash buckets)
        - id_table: dictionary - keys are hashes, values are list of vectors id's
          (it's used to know which tweet corresponds to the hashed vector)
    """
    # number of planes determines the number of buckets: 2^num_of_planes
    num_of_planes = planes.shape[1]
    num_buckets = 2 ** num_of_planes

    # create an empty bucket for every possible hash value
    hash_table = {}
    for i in range(num_buckets):
        hash_table[i] = []

    id_table = {}
    for i in range(num_buckets):
        id_table[i] = []

    for i, v in enumerate(vecs):
        # compute the hash value of the vector and place it in the matching bucket
        h = hash_value_of_vector(v, planes)
        hash_table[h].append(v)
        id_table[h].append(i)

    return hash_table, id_table
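approximate_knn below assumes that planes_l, hash_tables and id_tables have already been built. A minimal sketch of that setup, where the constant values and the document_vecs name are assumptions:

N_DIMS = 300      # embedding dimension (assumed)
N_PLANES = 10     # planes per universe (assumed)
N_UNIVERSES = 25  # number of independent sets of planes (assumed)

np.random.seed(0)
planes_l = [np.random.normal(size=(N_DIMS, N_PLANES)) for _ in range(N_UNIVERSES)]

hash_tables = []
id_tables = []
for universe_id in range(N_UNIVERSES):
    planes = planes_l[universe_id]
    hash_table, id_table = make_hash_table(document_vecs, planes)  # document_vecs from get_document_vecs
    hash_tables.append(hash_table)
    id_tables.append(id_table)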
def approximate_knn(doc_id, v, planes_l, k=1, num_universes_to_use=N_UNIVERSES):
    """Search for k-NN using hashes."""
    assert num_universes_to_use <= N_UNIVERSES

    # vectors that will be checked as possible nearest neighbors
    vecs_to_consider_l = list()

    # list of document IDs
    ids_to_consider_l = list()

    # create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()

    # loop through the universes of planes
    for universe_id in range(num_universes_to_use):
        # get the set of planes for this particular universe_id
        planes = planes_l[universe_id]

        # get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)

        # get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]

        # get the list of document vectors stored under this hash value
        document_vectors_l = hash_table[hash_value]

        # get the id_table for this particular universe_id
        id_table = id_tables[universe_id]

        # get the subset of document ids to consider as nearest neighbors
        new_ids_to_consider = id_table[hash_value]

        # remove the doc_id we are searching for, so the document is not returned as its own neighbor
        if doc_id in new_ids_to_consider:
            new_ids_to_consider.remove(doc_id)
            print(f"removed doc_id {doc_id} of input vector from new_ids_to_search")

        # loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):
            # only add a document once, even if it appears in several universes
            if new_id not in ids_to_consider_set:
                # get the document vector and add it to the list of candidate vectors
                document_vector_at_i = document_vectors_l[i]
                vecs_to_consider_l.append(document_vector_at_i)

                # append the new_id (the index of the document) to the list of ids to consider
                ids_to_consider_l.append(new_id)

                # also add the new_id to the set
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)

    # now run k-NN on the smaller set of vecs-to-consider
    print("Fast considering %d vecs" % len(vecs_to_consider_l))

    # convert the list of candidate vectors to a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)

    # use the nearest neighbor index list as indices into the ids to consider
    # to build the list of nearest-neighbor document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx] for idx in nearest_neighbor_idx_l]

    return nearest_neighbor_ids
print(f"Nearest neighbors for document {doc_id}") print(f"Document contents: {doc_to_search}") print("")
for neighbor_id in nearest_neighbor_ids: print(f"Nearest neighbor at document id {neighbor_id}") print(f"document contents: {all_tweets[neighbor_id]}")
result:
removed doc_id 0 of input vector from new_ids_to_search
removed doc_id 0 of input vector from new_ids_to_search
removed doc_id 0 of input vector from new_ids_to_search
removed doc_id 0 of input vector from new_ids_to_search
removed doc_id 0 of input vector from new_ids_to_search
Fast considering 77 vecs

Nearest neighbors for document 0
Document contents: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Nearest neighbor at document id 2140
document contents: @PopsRamjet come one, every now and then is not so bad :)
Nearest neighbor at document id 701
document contents: With the top cutie of Bohol :) https://t.co/Jh7F6U46UB
Nearest neighbor at document id 51
document contents: #FollowFriday @France_Espana @reglisse_menthe @CCI_inter for being top engaged members in my community this week :)