国产毛片一区二区三区,操逼小动漫,www.国产乱伦,天天日天天碰,欧美精品系列,免费看黄色视频,天天爱夜夜爱,亚洲色色导航

作者：楊夕
本文鏈接：https://github.com/km1994/nlp_paper_study
論文名稱：Billion-scale similarity search with GPUs
論文鏈接：https://arxiv.org/abs/1702.08734
官方源碼地址：https://github.com/facebookresearch/Faiss
個(gè)人介紹：大佬們好，我叫楊夕，該項(xiàng)目主要是本人在研讀頂會(huì)論文和復(fù)現(xiàn)經(jīng)典論文過程中，所見、所思、所想、所聞，可能存在一些理解錯(cuò)誤，希望大佬們多多指正。
【注：手機(jī)閱讀可能圖片打不開！！！】

一、動(dòng)機(jī)篇

1.1 傳統(tǒng)的相似度算法所存在的問題？

傳統(tǒng)的相似度算法（歐幾里得距離、曼哈頓距離、明可夫斯基距離、余弦相似度等[1]）在計(jì)算向量相似度時(shí)，效率低下；
傳統(tǒng)的相似度算法（歐幾里得距離、曼哈頓距離、明可夫斯基距離、余弦相似度等[1]）在計(jì)算向量相似度時(shí)，不支持 GPU 計(jì)算；

二、介紹篇

2.1 什么是 Faiss ？

Faiss是針對稠密向量進(jìn)行相似性搜索和聚類的一個(gè)高效類庫。
它包含可搜索任意大小的向量集的算法，這些向量集的大小甚至都不適合RAM。
它還包含用于評估和參數(shù)調(diào)整的支持代碼。
Faiss用C ++編寫，并且有python2與python3的封裝代碼。
一些最有用的算法在GPU上有實(shí)現(xiàn)。
Faiss是由Facebook AI Research開發(fā)的。

2.2 Faiss 如何使用？

Faiss 的使用方式可以分為三個(gè)步驟：

構(gòu)建訓(xùn)練數(shù)據(jù)以矩陣的形式表示，比如我們現(xiàn)在經(jīng)常使用的embedding，embedding出來的向量就是矩陣的一行。
為數(shù)據(jù)集選擇合適的index，index是整個(gè)Faiss的核心部分，將第一步得到的訓(xùn)練數(shù)據(jù)add到index當(dāng)中。
search，或者說query，搜索到最終結(jié)果。

2.3 Faiss原理與核心算法

Faiss的主要功能是對向量進(jìn)行相似搜索。
具體介紹：就是給定一個(gè)向量，在所有已知的向量庫中找出與其相似度最高的一些向量；
本質(zhì)：是一個(gè)KNN(K近鄰)問題，比如google的以圖找圖功能。根據(jù)上面的描述不難看出，F(xiàn)aiss本質(zhì)是一個(gè)向量(矢量)數(shù)據(jù)庫，這個(gè)數(shù)據(jù)庫在進(jìn)行向量查詢的時(shí)候有其獨(dú)到之處，因此速度比較快，同時(shí)占用的空間也比較小。

三、Faiss 實(shí)戰(zhàn)篇

3.1 Faiss 如何安裝？

# 更新conda
    conda update conda
    # 先安裝mkl
    conda install mkl
    # faiss提供gpu和cpu版，根據(jù)服務(wù)選擇
    # cpu版本
    conda install faiss-cpu -c pytorch
    # gpu版本 -- 記得根據(jù)自己安裝的cuda版本安裝對應(yīng)的faiss版本，不然會(huì)出異常。使用命令：nvcc -V 查看
    conda install faiss-gpu cudatoolkit=8.0 -c pytorch # For CUDA8
    conda install faiss-gpu cudatoolkit=9.0 -c pytorch # For CUDA9
    conda install faiss-gpu cudatoolkit=10.0 -c pytorch # For CUDA10
    # 校驗(yàn)是否安裝成功
    python -c "import faiss"

【注：這里小編嘗試過多次 window 安裝，最后都失敗，最后Google了一下，發(fā)現(xiàn)Faiss不支持 window 系統(tǒng)！】

3.2 Faiss 的索引Index有哪些？

Faiss中最重要的是索引 Index

為什么要?jiǎng)?chuàng)建索引？

Faiss 創(chuàng)建索引對向量預(yù)處理，提高查詢效率

Faiss中的稠密向量各種索引都是基于 Index實(shí)現(xiàn)的，主要的索引方法包括：IndexFlatL2、IndexFlatIP、IndexHNSWFlat、IndexIVFFlat、IndexLSH、IndexScalarQuantizer、IndexPQ、IndexIVFScalarQuantizer、IndexIVFPQ、IndexIVFPQR等。每個(gè)方法的具體介紹見：

3.3 Faiss 的索引Index都怎么用？

3.3.1 數(shù)據(jù)預(yù)備

創(chuàng)建訓(xùn)練數(shù)據(jù)和測試數(shù)據(jù)
代碼：

import numpy as np
    d = 64                           # dimension
    nb = 100000                      # database size
    nq = 10000                       # nb of queries
    np.random.seed(1234)             # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.
    xq = np.random.random((nq, d)).astype('float32')
    xq[:, 0] += np.arange(nq) / 1000.

解析：

訓(xùn)練集 xb：[100000,64]
查詢數(shù)據(jù)集 xq：[10000,64]

3.3.2 暴力美學(xué) IndexFlatL2

介紹：暴力檢索 L2 距離的索引；
方式：全量搜索
流程：

創(chuàng)建索引

import faiss                   # make faiss available
    index = faiss.IndexFlatL2(d)   # build the index
    print(index.is_trained)

注：創(chuàng)建索引時(shí)必須指定向量的維度d。大部分索引需要訓(xùn)練的步驟。IndexFlatL2 跳過這一步。

當(dāng)索引創(chuàng)建好并訓(xùn)練(如果需要)之后，我們就可以執(zhí)行 add 方法，add方法一般添加訓(xùn)練時(shí)的樣本；

添加訓(xùn)練集

index.add(xb)                  # add vectors to the index
    print(index.ntotal)

尋找相似相似向量

包含向量的索引后，就可以傳入搜索向量查找相似向量，search就是尋找相似相似向量了。

k = 4                          # we want to see 4 nearest neighbors
    D, I = index.search(xq, k)     # actual search
    print(I[:5])                   # neighbors of the 5 first queries
    print(D[-5:])                  # neighbors of the 5 last queries
    >>>
    [[  0 393 363  78]
    [  1 555 277 364]
    [  2 304 101  13]
    [  3 173  18 182]
    [  4 288 370 531]]

    [[ 0.          7.17517328  7.2076292   7.25116253]
    [ 0.          6.32356453  6.6845808   6.79994535]
    [ 0.          5.79640865  6.39173603  7.28151226]
    [ 0.          7.27790546  7.52798653  7.66284657]
    [ 0.          6.76380348  7.29512024  7.36881447]]

注：
D：numpy array對象，表示與相似向量的距離(distance)，維度
I：numpy array對象，表示相似用戶的ID

存在問題：雖然查詢速度高于傳統(tǒng)相似度計(jì)算方法，但是速度還是太慢

3.3.3 閃電俠 IndexIVFFlat

引言：暴力美學(xué) IndexFlatL2 查詢速度太慢了
介紹：IndexIVFFlat 是一種加速索引方法，其所用的方法為倒排法；
方式：
先聚類再搜索，可以加快檢索速度，先將xb中的數(shù)據(jù)進(jìn)行聚類（聚類的數(shù)目是超參）

nlist: 聚類的數(shù)目；
nprobe: 在多少個(gè)聚類中進(jìn)行搜索，默認(rèn)為1, nprobe越大，結(jié)果越精確，但是速度越慢

流程：

使用K-means建立聚類中心；
然后通過查詢最近的聚類中心；
最后比較聚類中的所有向量得到相似的向量

代碼：

定義索引和聚類簇?cái)?shù)

nlist = 100                       #聚類中心的個(gè)數(shù)
    k = 4
    quantizer = faiss.IndexFlatL2(d)  # the other index
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
       # here we specify METRIC_L2, by default it performs inner-product search

注：創(chuàng)建IndexIVFFlat時(shí)需要指定一個(gè)其他的索引作為量化器(quantizer)來計(jì)算距離或相似度。
參數(shù)介紹：
faiss.METRIC_L2: faiss定義了兩種衡量相似度的方法(metrics)，分別為faiss.METRIC_L2、faiss.METRIC_INNER_PRODUCT。一個(gè)是歐式距離，一個(gè)是向量內(nèi)積。
nlist：聚類中心的個(gè)數(shù)

添加訓(xùn)練集

index.train(xb)
    assert index.is_trained

    index.add(xb)                  # add may be a bit slower as well

注：與 IndexFlatL2 對比，在 add 方法之前需要先訓(xùn)練

尋找相似相似向量

包含向量的索引后，就可以傳入搜索向量查找相似向量，search就是尋找相似相似向量了。

D, I = index.search(xq, k)     # actual search
    print(I[-5:])                  # neighbors of the 5 last queries
    index.nprobe = 10              # default nprobe is 1, try a few more
    D, I = index.search(xq, k)
    print(I[-5:])                  # neighbors of the 5 last queries

參數(shù)介紹：
k：查找最相似的k個(gè)向量
index.nprobe：查找聚類中心的個(gè)數(shù)，默認(rèn)為1個(gè)。

3.3.4 內(nèi)存管家 IndexIVFPQ

動(dòng)機(jī)：索引IndexFlatL2和IndexIVFFlat都會(huì)全量存儲(chǔ)所有的向量在內(nèi)存中，如果數(shù)據(jù)量是海量級別的時(shí)候，怎么辦呢？
介紹：IndexIVFPQ 基于Product Quantizer(乘積量化)的壓縮算法編碼向量大小到指定的字節(jié)數(shù)的索引算法，存儲(chǔ)的向量時(shí)壓縮過的，查詢的距離也是近似的。關(guān)于乘積量化的算法可自行搜索。
方式：基于乘積量化（product quantizers）對存儲(chǔ)向量進(jìn)行壓縮，節(jié)省存儲(chǔ)空間

m：乘積量化中，將原來的向量維度平均分成多少份，d必須為m的整數(shù)倍
bits: 每個(gè)子向量用多少個(gè)bits表示

代碼：

定義索引和聚類簇?cái)?shù)

nlist = 100
    m = 8                             # number of bytes per vector
    k = 4
    quantizer = faiss.IndexFlatL2(d)  # this remains the same
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                        # 8 specifies that each sub-vector is encoded a

注：之前我們定義的維度為d = 64，向量的數(shù)據(jù)類型為float32。這里壓縮成了8個(gè)字節(jié)。所以壓縮比率為 (64*32/8) / 8 = 32

添加訓(xùn)練集

index.train(xb)
    index.add(xb)

注：與 IndexFlatL2 對比，在 add 方法之前需要先訓(xùn)練

尋找相似相似向量

包含向量的索引后，就可以傳入搜索向量查找相似向量，search就是尋找相似相似向量了。

D, I = index.search(xb[:5], k) # sanity check
    print(I)
    print(D)
    index.nprobe = 10              # make comparable with experiment above
    D, I = index.search(xq, k)     # search
    print(I[-5:])
    >>>
    [[   0  608  220  228]
    [   1 1063  277  617]
    [   2   46  114  304]
    [   3  791  527  316]
    [   4  159  288  393]]

    [[ 1.40704751  6.19361687  6.34912491  6.35771513]
    [ 1.49901485  5.66632462  5.94188499  6.29570007]
    [ 1.63260388  6.04126883  6.18447495  6.26815748]
    [ 1.5356375   6.33165455  6.64519501  6.86594009]
    [ 1.46203303  6.5022912   6.62621975  6.63154221]]

3.4 Faiss 然后使用 GPU？

注：并不是所有的索引都支持 GPU，所以在使用之前建議查閱一下 Basic indexes

可通過faiss.get_num_gpus()查詢有多少個(gè)gpu

ngpus = faiss.get_num_gpus()
    print("number of GPUs:", ngpus)

單 GPU

res = faiss.StandardGpuResources()  # use a single GPU, 這個(gè)命令需要安裝Faiss GPU 版本
    # build a flat (CPU) index
    index_flat = faiss.IndexFlatL2(d)
    # make it into a gpu index
    gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
    gpu_index_flat.add(xb)         # add vectors to the index
    print(gpu_index_flat.ntotal)

    k = 4                          # we want to see 4 nearest neighbors
    D, I = gpu_index_flat.search(xq, k)  # actual search
    print(I[:5])                   # neighbors of the 5 first queries
    print(I[-5:])                  # neighbors of the 5 last queries

多 GPU

ngpus = faiss.get_num_gpus()
    print("number of GPUs:", ngpus)
    cpu_index = faiss.IndexFlatL2(d)
    gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)   # build the index
    gpu_index.add(xb)              # add vectors to the index
    print(gpu_index.ntotal)
    k = 4                          # we want to see 4 nearest neighbors
    D, I = gpu_index.search(xq, k) # actual search
    print(I[:5])                   # neighbors of the 5 first queries
    print(I[-5:])                  # neighbors of the 5 last queries

四、 Faiss 對比篇

4.1 sklearn cosine_similarity 和 Faiss 哪家強(qiáng)

方法一：使用 sklearn 中的 cosine_similarity

# encoding utf8
    import pandas as pd
    import csv
    import numpy as np
    import sys
    from sklearn.metrics.pairwise import cosine_similarity
    # sys.path.append('../')
    # 自定義 包
    from tools.loader import loadDict,saveDict,dictCutBatchSave,dictLoadMerge
    from tools.bert_tools import Bert_Class
    bertModel = Bert_Class()
    from tools.common_tools import timer

    commonPath = "data/"
    inf =  commonPath+"resource/"
    dataType = ""
    fileName = f"medQA.valid.txt"
    QueAnsFileName = f"QueAns{dataType}"
    query = "睡前練瑜伽好嗎睡覺之前練習(xí)40分鐘的瑜伽好嗎、能起到瘦身的作用嗎？"
    topN = 20

    # 使用 sklearn cosine_similarity 方法 計(jì)算相似度
    def use_sklearn_get_sim_query(query,training_vectors,topN):
        # step 1：測試文本 轉(zhuǎn) bert sent
        test_vecs = bertModel.get_vec_to_sent(query).astype('float32')
        # step 2：計(jì)算 相似度
        print("------------sklearn-------------")
        ag=cosine_similarity(test_vecs,training_vectors)
        # step 3：排序
        fe=np.sort(ag,axis=1)
        fe_index = np.argsort(ag,axis=1) 
        score_list = fe[0].tolist()
        index_list = fe_index[0].tolist()
        score_list.reverse()
        index_list.reverse()
        # step 4：取 Top N
        return index_list[:topN],score_list[:topN]

    training_vectors = np.load(f"{inf}training_vectors.npy")
    id2QusAns = loadDict(inf,QueAnsFileName)
    index_list,score_list = use_sklearn_get_sim_query(query,training_vectors,topN)

    for index,score in zip(index_list,score_list):
        print(f"index:{index} => query:{id2QusAns[index]}:{score}\n")

方法二：Faiss

# encoding utf8
    import pandas as pd
    import csv
    import numpy as np
    import sys
    import faiss
    from faiss import normalize_L2
    # sys.path.append('../')
    # 自定義 包
    from tools.loader import loadDict,saveDict,dictCutBatchSave,dictLoadMerge
    from tools.bert_tools import Bert_Class
    bertModel = Bert_Class()
    from tools.common_tools import timer

    commonPath = "data/"
    inf =  commonPath+"resource/"
    dataType = ""
    fileName = f"medQA.valid.txt"
    QueAnsFileName = f"QueAns{dataType}"
    query = "睡前練瑜伽好嗎睡覺之前練習(xí)40分鐘的瑜伽好嗎、能起到瘦身的作用嗎？"
    topN = 20

    # 使用 sklearn cosine_similarity 方法 計(jì)算相似度
    def use_faiss_get_sim_query(query,training_vectors,topN,d=768):
        # step 1：測試文本 轉(zhuǎn) bert sent
        test_vecs = bertModel.get_vec_to_sent(query).astype('float32')
        # step 2：計(jì)算 相似度
        print("------------faiss-------------")
        print('normalize_L2')
        normalize_L2(training_vectors)
        normalize_L2(test_vecs)
        print('IndexFlatIP')
        index=faiss.IndexFlatIP(d)        # the other index，需要以其他index作為基礎(chǔ)
        index.train(training_vectors) 
        print(f"training_vectors.shape:{training_vectors.shape}")
        print(index)
        print('train')
        print(index.is_trained)
        print('add')
        print(index)
        index.add(training_vectors)
        print('search')
        print(f"test_vecs.shape:{test_vecs.shape}")
        D, I =index.search(test_vecs, topN)
        print(f"I:{I}")                   # 表示最相近的前5個(gè)的index
        print(f"D:{D}")                     # 表示最相近的前5個(gè)的相似度的值
        # step 3：排序
        score_list = D.tolist()[0]
        index_list = I.tolist()[0]
        # step 4：取 Top N
        return index_list,score_list

    training_vectors = np.load(f"{inf}training_vectors.npy")
    id2QusAns = loadDict(inf,QueAnsFileName)
    index_list,score_list = use_faiss_get_sim_query(query,training_vectors,topN)

    for index,score in zip(index_list,score_list):
        print(f"index:{index} => query:{id2QusAns[index]}:{score}\n")

分析：

從預(yù)測結(jié)果角度看，兩者結(jié)果雷同；
從計(jì)算速度角度看：當(dāng) 數(shù)據(jù)集特別大時(shí)，F(xiàn)aiss 秒殺 sklearn cosine_similarity

參考資料

常用的相似度計(jì)算方法原理及實(shí)現(xiàn)
Faiss從入門到實(shí)戰(zhàn)精通
Faiss 教程
Faiss 用法
Basic indexes

【關(guān)于 Faiss 】那些的你不知道的事

目錄

一、動(dòng)機(jī)篇

1.1 傳統(tǒng)的相似度算法所存在的問題？

二、介紹篇

2.1 什么是 Faiss ？

2.2 Faiss 如何使用？

2.3 Faiss原理與核心算法

三、Faiss 實(shí)戰(zhàn)篇

3.1 Faiss 如何安裝？

3.2 Faiss 的索引Index有哪些？

3.3 Faiss 的索引Index都怎么用？

3.3.1 數(shù)據(jù)預(yù)備

3.3.2 暴力美學(xué) IndexFlatL2

3.3.3 閃電俠 IndexIVFFlat

3.3.4 內(nèi)存管家 IndexIVFPQ

3.4 Faiss 然后使用 GPU？

四、 Faiss 對比篇

4.1 sklearn cosine_similarity 和 Faiss 哪家強(qiáng)

參考資料

【關(guān)于 Faiss 】 那些的你不知道的事

目錄

一、動(dòng)機(jī)篇

1.1 傳統(tǒng)的相似度算法所存在的問題？

二、介紹篇

2.1 什么是 Faiss ？

2.2 Faiss 如何使用？

2.3 Faiss原理與核心算法

三、Faiss 實(shí)戰(zhàn)篇

3.1 Faiss 如何安裝？

3.2 Faiss 的索引Index有哪些？

3.3 Faiss 的索引Index都怎么用？

3.3.1 數(shù)據(jù)預(yù)備

3.3.2 暴力美學(xué) IndexFlatL2

3.3.3 閃電俠 IndexIVFFlat

3.3.4 內(nèi)存管家 IndexIVFPQ

3.4 Faiss 然后使用 GPU？

四、 Faiss 對比篇

4.1 sklearn cosine_similarity 和 Faiss 哪家強(qiáng)

參考資料

【關(guān)于 Faiss 】那些的你不知道的事