操逼视频欧美国产,91美女被艹,久久成年人精品視频,黄色网址国产,人人人人操,成人性生交大片免费看网站2023年 ,日本黄色视频免费在线观看,超碰成人福利

作者 | Audhi Aprilliant

編譯 | VK
來源 | Towards Data Science

在處理真實(shí)數(shù)據(jù)時(shí)，最大的問題主要是數(shù)據(jù)預(yù)處理。

匹配可能是許多分析師面臨的最大挑戰(zhàn)之一。例如，當(dāng)我們談?wù)搯讨巍とA盛頓和G·華盛頓時(shí)，當(dāng)然，我們談?wù)摰氖且粋€(gè)人，即美國的第一任總統(tǒng)。幸運(yùn)的是，研究人員已經(jīng)開發(fā)出了概率數(shù)據(jù)匹配算法或眾所周知的模糊匹配。

研究表明，94%的企業(yè)承認(rèn)有重復(fù)的數(shù)據(jù)，而且這些重復(fù)的數(shù)據(jù)大部分是不完全匹配的，因此通常不會被發(fā)現(xiàn)

什么是模糊字符串匹配?

概率數(shù)據(jù)匹配，通常稱為模糊字符串匹配，是一種算法或計(jì)算技術(shù)，用于尋找與模式近似(而不是精確)匹配的某些字符串。

概率數(shù)據(jù)匹配中的概率明確指示輸出必須是概率(在0到1的范圍內(nèi)或相似度的百分比)，而不是精確的數(shù)字。

有很多方法可以執(zhí)行模糊字符串匹配，但是隨著數(shù)據(jù)大小的增加，它們的性能會受到很大的影響，例如Levenshtein距離。原因是每個(gè)記錄都要與數(shù)據(jù)中的所有其他記錄進(jìn)行比較。

隨著數(shù)據(jù)大小的增加，執(zhí)行模糊字符串匹配所花費(fèi)的時(shí)間呈二次增長。這種現(xiàn)象被稱為二次時(shí)間復(fù)雜度。二次時(shí)間復(fù)雜度表示一種算法，其性能與輸入數(shù)據(jù)的平方大小成正比。

隨著數(shù)據(jù)的增長，對速度的需求也在增長。

TF-IDF和KNN如何加速計(jì)算時(shí)間?

詞頻-逆文檔頻率(TF-IDF)是一種自然語言處理(NLP)技術(shù)，用于測量文檔集合中一個(gè)單詞對文檔的重要性。

例如，從一個(gè)文檔集合中，單詞“flower”有多重要?TF-IDF值與單詞出現(xiàn)的次數(shù)成比例增加(術(shù)語頻率)，但隨著包含該單詞的文檔數(shù)量減少(與文檔頻率相反)。

IDF部分有助于解釋這樣一個(gè)事實(shí)，即某些詞通常更常見(例如單詞“The”沒有任何信息)。

TF-IDF是使用n個(gè)字母組來實(shí)現(xiàn)的，這些字母組可能是由拼寫錯(cuò)誤或打字錯(cuò)誤引起的。例如，independence這個(gè)詞被分成以下形式，取決于n的個(gè)數(shù)。

1-gram
['independence']
2-gram
['in','nd','de','ep','pe','en','nd','de','en','nc','ce']
3-gram
['ind','nde','dep','epe','pen','end','nde','den','enc','nce']

因此，n-gram是從用于字符串匹配的字符串列表中創(chuàng)建的。

TF-IDF背后的思想是，它將文檔表示為數(shù)據(jù)，并且選擇用k最近鄰和余弦相似度，而不是Levenshtein距離(插入、刪除或替換)匹配。

余弦相似度是一種相似性度量，可用于比較文檔，或者根據(jù)給定的查詢詞向量給出文檔的排序。x和y是兩個(gè)比較的向量。用余弦度量作為相似函數(shù)，我們得到

之后，最匹配的候選數(shù)據(jù)將與主數(shù)據(jù)進(jìn)行比較。它的目的是計(jì)算主數(shù)據(jù)中最匹配的候選字符串之間的相似度。

讓我們來練習(xí)一下

對于本例，我將通過從Room Type Kaggle數(shù)據(jù)來演示TF-IDF模糊字符串匹配方法

https://www.kaggle.com/leandrodoze/room-type

老實(shí)說，我選擇這個(gè)數(shù)據(jù)是因?yàn)?(1)數(shù)據(jù)非常小，只有103行，(2)列表示要比較的混亂字符串和主字符串。在實(shí)際情況中，這兩種數(shù)據(jù)(雜亂的和干凈的)可能來自不同的來源。

請先運(yùn)行下面的命令，使用TF-IDF和KNN創(chuàng)建一個(gè)模糊字符串匹配函數(shù)。

# 導(dǎo)入數(shù)據(jù)操作模塊
import pandas as pd
# 導(dǎo)入線性代數(shù)模塊
import numpy as np
# 導(dǎo)入模糊字符串匹配模塊
from fuzzywuzzy import fuzz, process
# 導(dǎo)入正則表達(dá)式模塊
import re
# 導(dǎo)入迭代模塊
import itertools
# 導(dǎo)入函數(shù)開發(fā)模塊
from typing import Union, List, Tuple
# 導(dǎo)入TF-IDF模塊
from sklearn.feature_extraction.text import TfidfVectorizer
# 導(dǎo)入余弦相似度模塊
from sklearn.metrics.pairwise import cosine_similarity
# 導(dǎo)入KNN模塊
from sklearn.neighbors import NearestNeighbors

# 字符串預(yù)處理
def preprocess_string(s):
    # 去掉字符串之間的空格
    s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
    return s

# 字符串匹配- TF-IDF
def build_vectorizer(
    clean: pd.Series,
    analyzer: str = 'char', 
    ngram_range: Tuple[int, int] = (1, 4), 
    n_neighbors: int = 1, 
    **kwargs
    ) -> Tuple:
    # 創(chuàng)建vectorizer
    vectorizer = TfidfVectorizer(analyzer = analyzer, ngram_range = ngram_range, **kwargs)
    X = vectorizer.fit_transform(clean.values.astype('U'))

    # 最近鄰
    nbrs = NearestNeighbors(n_neighbors = n_neighbors, metric = 'cosine').fit(X)
    return vectorizer, nbrs

# 字符串匹配- KNN
def tfidf_nn(
    messy, 
    clean, 
    n_neighbors = 1, 
    **kwargs
    ):
    # 擬合干凈的數(shù)據(jù)和轉(zhuǎn)換凌亂的數(shù)據(jù)
    vectorizer, nbrs = build_vectorizer(clean, n_neighbors = n_neighbors, **kwargs)
    input_vec = vectorizer.transform(messy)

    # 確定可能的最佳匹配
    distances, indices = nbrs.kneighbors(input_vec, n_neighbors = n_neighbors)
    nearest_values = np.array(clean)[indices]
    return nearest_values, distances

# 字符串匹配-匹配模糊
def find_matches_fuzzy(
    row, 
    match_candidates, 
    limit = 5
    ):
    row_matches = process.extract(
        row, dict(enumerate(match_candidates)), 
        scorer = fuzz.token_sort_ratio, 
        limit = limit
        )
    result = [(row, match[0], match[1]) for match in row_matches]
    return result

# 字符串匹配- TF-IDF
def fuzzy_nn_match(
    messy,
    clean,
    column,
    col,
    n_neighbors = 100,
    limit = 5, **kwargs):
    nearest_values, _ = tfidf_nn(messy, clean, n_neighbors, **kwargs)

    results = [find_matches_fuzzy(row, nearest_values[i], limit) for i, row in enumerate(messy)]
    df = pd.DataFrame(itertools.chain.from_iterable(results),
        columns = [column, col, 'Ratio']
        )
    return df

# 字符串匹配-模糊
def fuzzy_tf_idf(
    df: pd.DataFrame,
    column: str,
    clean: pd.Series,
    mapping_df: pd.DataFrame,
    col: str,
    analyzer: str = 'char',
    ngram_range: Tuple[int, int] = (1, 3)
    ) -> pd.Series:
    # 創(chuàng)建vectorizer
    clean = clean.drop_duplicates().reset_index(drop = True)
    messy_prep = df[column].drop_duplicates().dropna().reset_index(drop = True).astype(str)
    messy = messy_prep.apply(preprocess_string)
    result = fuzzy_nn_match(messy = messy, clean = clean, column = column, col = col, n_neighbors = 1)
    # 混亂到干凈
    return result

請將數(shù)據(jù)保存在特定的文件夾中，這樣我們可以很容易的加載到j(luò)upyter上。

# 導(dǎo)入計(jì)時(shí)模塊
import time

# 加載數(shù)據(jù)
df = pd.read_csv('room_type.csv')

# 檢查數(shù)據(jù)的維度
print('Dimension of data: {} rows and {} columns'.format(len(df), len(df.columns)))

# 打印前5行
df.head()

在這種情況下，Expedia將成為混亂的數(shù)據(jù)，而Booking.com將成為干凈的主數(shù)據(jù)。為了清楚地理解，我將演示如何運(yùn)行代碼并顯示結(jié)果。

# 運(yùn)行模糊字符串匹配算法
start = time.time()
df_result = (df.pipe(fuzzy_tf_idf, # 函數(shù)和混亂的數(shù)據(jù)
                     column = 'Expedia', # 混亂數(shù)據(jù)列
                     clean = df['Booking.com'], # 主數(shù)據(jù)(列表)
                     mapping_df = df, # 主數(shù)據(jù)
                     col = 'Result') # 可以自定義
            )
end = time.time()

# 打印計(jì)算時(shí)間
print('Fuzzy string matching in {} seconds'.format(end - start))

# 查看模糊字符串匹配的結(jié)果
df_result.head()

該算法只需要0.050秒就可以將103行雜亂數(shù)據(jù)與103行主數(shù)據(jù)進(jìn)行比較。

它可以在你自己的更大的數(shù)據(jù)上運(yùn)行。但是當(dāng)我們使用Levenshtein距離時(shí)，計(jì)算過程有多長?

我們必須進(jìn)行比較，以確保TF-IDF和k -最近鄰能夠有效地提高計(jì)算時(shí)間。

與Levenshtein 距離的比較

對于本教程，我制作了一個(gè)與前一個(gè)函數(shù)具有類似輸入和輸出的函數(shù)(與TF-IDF和k近鄰進(jìn)行模糊字符串匹配)。

首先，運(yùn)行以下代碼來實(shí)現(xiàn)Levenshtein距離模糊字符串匹配。

# 導(dǎo)入數(shù)據(jù)操作模塊
import pandas as pd
# 導(dǎo)入線性代數(shù)模塊
import numpy as np
# 導(dǎo)入模糊字符串匹配模塊
from fuzzywuzzy import fuzz, process
# 導(dǎo)入二分搜索模塊

def stringMatching(
    df: pd.DataFrame,
    column: str,
    clean: pd.Series,
    mapping_df: pd.DataFrame,
    col: str
    ):
    # 創(chuàng)建vectorizer
    categoryClean = clean.drop_duplicates().reset_index(drop = True)
    categoryMessy = df[column].drop_duplicates().dropna().reset_index(drop = True).astype(str)

    categoryFuzzy = {}
    ratioFuzzy = {}
    for i in range(len(categoryMessy)):
        resultFuzzy = process.extractOne(categoryMessy[i].lower(), categoryClean)
        # 映射
        catFuzzy = {categoryMessy[i]:resultFuzzy[0]}
        ratFuzzy = {categoryMessy[i]:resultFuzzy[1]}
        # 保存結(jié)果
        categoryFuzzy.update(catFuzzy)
        #保存比率
        ratioFuzzy.update(ratFuzzy)

    # 創(chuàng)建列名
    catCol = col
    ratCol = 'Ratio'
    # 合并結(jié)果
    df[catCol] = df[column]
    df[ratCol] = df[column]
    # 映射結(jié)果
    df[catCol] = df[catCol].map(categoryFuzzy)
    df[ratCol] = df[ratCol].map(ratioFuzzy)
    return df

現(xiàn)在，是時(shí)候使用Levenshtein距離實(shí)現(xiàn)模糊字符串匹配算法了!

# 運(yùn)行模糊字符串匹配算法
start = time.time()
df_result = (df.pipe(stringMatching, # 函數(shù)和混亂的數(shù)據(jù)
                     column = 'Expedia', # 混亂數(shù)據(jù)列
                     clean = df['Booking.com'], # 主數(shù)據(jù)
                     mapping_df = df, # 主數(shù)據(jù)
                     col = 'Result') # 可以自定義
            )
end = time.time()

# 打印計(jì)算時(shí)間
print('Fuzzy string matching in {} seconds'.format(end - start))

# 查看模糊字符串匹配的結(jié)果
df_result.head()

與使用TF-IDF和KNN的模糊字符串匹配算法相比，Levenshtein距離需要1.216秒，長24.32倍。

需要注意的是，這個(gè)計(jì)算時(shí)間將隨著數(shù)據(jù)的數(shù)量而增加。在這里了解更多關(guān)于TF-IDF和Levenshtein距離計(jì)算時(shí)間的比較。

https://medium.com/tim-black/fuzzy-string-matching-at-scale-41ae6ac452c2

結(jié)論

對于小數(shù)據(jù)，Levenshtein距離(fuzzy-wuzzy模塊)是執(zhí)行模糊字符串匹配的好選擇。然而，隨著數(shù)據(jù)大小的增長，計(jì)算時(shí)間也會增加。

另一種方法是使用TF-IDF結(jié)合k近鄰和n-gram的NLP技術(shù)來查找匹配的字符串。FAISS和HSNW是其他的一些算法，可以用來提高尋找最近鄰居的性能。

參考文獻(xiàn)

[1] J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques (2012). Elsevier Inc.

[2] C. van den Berg. Super Fast String Matching in Python (2019). https://bergvca.github.io/2017/10/14/super-fast-string-matching.html.

[3] E. Elahi. Fuzzy Matching 101: Cleaning and Linking Messy Data Across the Enterprise(2019). https://dataladder.com/fuzzy-matching-101/.

[4] A. Pearl. Fuzzy String Matching with TF-IDF (2019). https://adrianpearl.github.io/fuzzystring.html.

往期精彩回顧




適合初學(xué)者入門人工智能的路線及資料下載
機(jī)器學(xué)習(xí)及深度學(xué)習(xí)筆記等資料打印
機(jī)器學(xué)習(xí)在線手冊
深度學(xué)習(xí)筆記專輯
《統(tǒng)計(jì)學(xué)習(xí)方法》的代碼復(fù)現(xiàn)專輯
AI基礎(chǔ)下載
機(jī)器學(xué)習(xí)的數(shù)學(xué)基礎(chǔ)專輯
溫州大學(xué)《機(jī)器學(xué)習(xí)課程》視頻
本站qq群851320808，加入微信群請掃碼：

【NLP】基于TF-IDF和KNN的模糊字符串匹配優(yōu)化

讓我們來練習(xí)一下