[Machine Learning] Building a Recommender System with Singular Value Decomposition (SVD)
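Today we will learn how to build a recommender system using nothing but singular value decomposition. If you are not yet familiar with SVD, we recommend first reading "Finally Understanding Singular Value Decomposition (SVD): Principles and Applications".
Singular value decomposition is a very popular linear algebra technique for breaking a matrix down into a product of a few smaller matrices. It has many uses. One of them is to use SVD to discover relationships between items, and a recommender system can be built on top of that.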
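This article covers:
How to compute the singular value decomposition of a matrix
How to interpret the result of the decomposition
What data a recommender system needs, and how to analyze it with SVD
How to use the result of SVD to make recommendations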
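Introduction to Singular Value Decomposition
Just as the integer 24 can be written as a product of factors, 24=2×3×4, a matrix can be expressed as a product of other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication, and therefore several different ways of being factored, known as decompositions. QR decomposition and LU decomposition are common examples. Another is singular value decomposition, which places no restriction on the shape or properties of the matrix being decomposed.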
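Suppose a matrix $M$ (say, an $m \times n$ matrix) is decomposed as

$M = U \Sigma V^T$

where $U$ is an $m \times m$ matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $V^T$ is an $n \times n$ matrix. The diagonal matrix $\Sigma$ may be non-square, but only the entries on its diagonal can be nonzero. The matrices $U$ and $V^T$ are orthogonal matrices, meaning that the columns of $U$ and the rows of $V^T$ are unit vectors that are orthogonal to each other. Two vectors are orthogonal if their dot product is zero, and a vector is a unit vector if its L2 norm is 1. An orthogonal matrix has the property that its transpose is its inverse. In other words, since $U$ is an orthogonal matrix, $U^T = U^{-1}$, or $U U^T = U^T U = I$, where $I$ is the identity matrix.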
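The properties above can be checked numerically. The following is a minimal sketch using NumPy's `numpy.linalg.svd`; the 4×3 matrix `M` is a made-up example chosen only for illustration, and `Sigma` is rebuilt by hand because NumPy returns the singular values as a vector.

```python
import numpy as np

# Hypothetical 4x3 matrix, used only for illustration.
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 0.0, 2.0]])

# Full SVD: U is 4x4, Vt is 3x3, s is the vector of singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Rebuild the non-square 4x3 diagonal matrix Sigma from the vector s.
Sigma = np.zeros(M.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(U @ Sigma @ Vt, M))      # the product reconstructs M
print(np.allclose(U.T @ U, np.eye(4)))     # transpose of U is its inverse
print(np.allclose(Vt @ Vt.T, np.eye(3)))   # same for V^T
```

Each of the three checks should report `True` up to floating-point rounding.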
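Singular value decomposition gets its name from the diagonal entries of $\Sigma$, which are called the singular values of the matrix $M$. The nonzero singular values are in fact the square roots of the nonzero eigenvalues of $M M^T$. Just as factoring a number into primes reveals its structure, the singular value decomposition of a matrix reveals the structure of that matrix.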
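What is described above is called the full SVD. There is another version called the reduced SVD or compact SVD. The formula is still $M = U \Sigma V^T$, but now $\Sigma$ is an $r \times r$ square diagonal matrix, where $r$ is the rank of $M$, which is at most the smaller of $m$ and $n$. The matrix $U$ is then an $m \times r$ matrix and $V^T$ is an $r \times n$ matrix. Because $U$ and $V^T$ are non-square, they are called semi-orthogonal: $U^T U = I$ and $V^T V = I$, where $I$ in both cases is the $r \times r$ identity matrix.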
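As a rough illustration of the reduced form, the sketch below builds a made-up 4×3 matrix of rank 2 and trims NumPy's output down to the rank. Note one assumption about the tooling: `numpy.linalg.svd(..., full_matrices=False)` keeps min(m, n) singular values rather than exactly r, so the slicing by `r` is done by hand here.

```python
import numpy as np

# Hypothetical 4x3 matrix of rank 2:
# row 1 is twice row 0, and row 0 equals row 2 plus twice row 3.
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

r = np.linalg.matrix_rank(M)
U, s, Vt = np.linalg.svd(M, full_matrices=False)   # shapes (4, 3), (3,), (3, 3)

# Keep only the r nonzero singular values and the matching vectors.
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

print(r, U_r.shape, Vt_r.shape)
print(np.allclose(U_r @ np.diag(s_r) @ Vt_r, M))   # still reconstructs M
print(np.allclose(U_r.T @ U_r, np.eye(r)))         # U^T U = I (r x r)
print(np.allclose(Vt_r @ Vt_r.T, np.eye(r)))       # V^T V = I (r x r)
```

The products taken in the other order, such as `U_r @ U_r.T`, are generally not identity matrices, which is why the term semi-orthogonal is used.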
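What SVD Means for a Recommender System
If the rank of the matrix $M$ is $r$, it can be shown that the matrices $M M^T$ and $M^T M$ both have rank $r$. In the singular value decomposition (the reduced SVD), the columns of $U$ are eigenvectors of $M M^T$ and the rows of $V^T$ are eigenvectors of $M^T M$. Interestingly, $M M^T$ and $M^T M$ may have different sizes (because $M$ can be non-square), yet they share the same set of nonzero eigenvalues, which are the squares of the values on the diagonal of $\Sigma$.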
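The eigenvalue and eigenvector relationships can be verified on a small case. This is only a sketch on a made-up 4×3 matrix; `numpy.linalg.eigvalsh` is used because both products are symmetric.

```python
import numpy as np

# Hypothetical 4x3 matrix, used only for illustration.
M = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigenvalues of the two symmetric products, sorted in descending order.
eig_big = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]     # 4 values
eig_small = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]   # 3 values

print(np.allclose(eig_small, s**2))      # eigenvalues of M^T M are s squared
print(np.allclose(eig_big[:3], s**2))    # M M^T shares the same values
print(np.allclose(eig_big[3:], 0.0))     # its extra eigenvalue is zero

# Columns of U are eigenvectors of M M^T; rows of Vt are eigenvectors of M^T M.
print(np.allclose(M @ M.T @ U, U * s**2))
print(np.allclose(M.T @ M @ Vt.T, Vt.T * s**2))
```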
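This is why the result of a singular value decomposition can reveal so much about the matrix $M$.
Suppose we have collected some book reviews, with books as columns, people as rows, and each entry the rating a person gave to a book. In that case, $M M^T$ is a person-to-person table whose entry for a pair of people is the sum, over all books, of the product of the ratings the two people gave. Similarly, $M^T M$ is a book-to-book table whose entry for a pair of books is the sum, over all people, of the product of the ratings the two books received. What could the hidden connection between people and books be? It could be the genre, the author, or something of a similar nature.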
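To make the two tables concrete, here is a small sketch with an invented ratings matrix (3 people, 4 books, 0 meaning "not rated"). The numbers are hypothetical and serve only to show how the entries arise.

```python
import numpy as np

# Hypothetical ratings: rows are people, columns are books, 0 = not rated.
M = np.array([[5, 4, 0, 0],
              [4, 5, 0, 1],
              [0, 0, 5, 4]])

person_to_person = M @ M.T   # 3x3 table
book_to_book = M.T @ M       # 4x4 table

print(person_to_person)
print(book_to_book)
```

People 0 and 1 rated the same two books highly, so their entry is large (5×4 + 4×5 = 40), while people 0 and 2 share no rated book and get 0. In the book-to-book table, books 2 and 3 are linked only through person 2 (5×4 = 20).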
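Building the Recommender System
The dataset
Let's see how the result of SVD can be used to build a recommender system. First, download the dataset archive, lthing_data.tar.gz, which the code below reads (note: it is about 600 MB).
This dataset is the "Social Recommendation Data" [2] from "Recommender Systems and Personalization Datasets" [1]. It contains reviews that users gave to books on Librarything [3]. What we are interested in is the number of "stars" a user gave to a book.
If you open the tar file you will see a large file named "reviews.json". You can extract it, or read the file inside the archive on the fly: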
import tarfile

# Read reviews.json straight from the archive, without extracting it to disk
with tarfile.open("lthing_data.tar.gz") as tar:
    print("Files in tar archive:")
    tar.list()
    with tar.extractfile("lthing_data/reviews.json") as file:
        # Print only the first four records
        count = 0
        for line in file:
            print(line)
            count += 1
            if count > 3:
                break
The above prints:
Files in tar archive:
?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/
?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt
?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json
b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n"
b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n"
b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n"
b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'
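Extracting the data
Each line in reviews.json is one record. We will extract the "user", "work", and "stars" fields of each record, as long as none of the three is missing. Despite the file name, the records are not well-formed JSON strings; most notably, they use single quotes rather than double quotes. So we cannot use Python's json package here and instead use ast to decode each line: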
import ast
import tarfile

reviews = []
with tarfile.open("lthing_data.tar.gz") as tar:
    with tar.extractfile("lthing_data/reviews.json") as file:
        for line in file:
            # Each line is a Python-style dict literal, not strict JSON
            record = ast.literal_eval(line.decode("utf8"))
            # Skip records that lack any of the three fields we need
            if any(x not in record for x in ['user', 'work', 'stars']):
                continue
            reviews.append([record['user'], record['work'], record['stars']])
print(len(reviews), "records retrieved")

1387209 records retrieved
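The difference between the two parsers can be seen on a single line. The sketch below uses a shortened, hand-written record in the same single-quoted style as reviews.json (the abbreviated record is an assumption for illustration, not a line copied from the file).

```python
import ast
import json

# A shortened record in the same single-quoted style as reviews.json.
line = b"{'work': '3206242', 'flags': [], 'stars': 5.0, 'user': 'van_stef'}\n"
text = line.decode("utf8")

# Strict JSON requires double-quoted strings, so json.loads rejects the line.
try:
    json.loads(text)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval reads the line as a Python dict literal instead.
record = ast.literal_eval(text)

print(parsed_as_json)
print(record["user"], record["work"], record["stars"])
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and so on), so it does not execute arbitrary code the way `eval` would.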
Building the DataFrame
Now we need a matrix of how different users rated each book. We use the pandas library to turn the collected data into a table:
import pandas as pd

reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"])
print(reviews.head())

            user      work  stars
0       van_stef   3206242    5.0
1       dwatson2  12198649    5.0
2       amdrane2  12981302    4.0
3  Lila_Gustavus   5231009    3.0
4      skinglist    184318    2.0
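Filtering the data
To save time and memory, we do not use all of the data here. We only consider users who reviewed at least 50 books and books that were reviewed by at least 50 users. This trims the dataset to less than 15% of its original size: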
# Find the users who reviewed at least 50 books
usercount = reviews[["work","user"]].groupby("user").count()
usercount = usercount[usercount["work"] >= 50]
print(usercount.head())

            work
user
              84
-Eva-        602
06nwingert   370
1983mk        63
1dragones    194
# Find the books that were reviewed by at least 50 users
workcount = reviews[["work","user"]].groupby("work").count()
workcount = workcount[workcount["user"] >= 50]
print(workcount.head())

          user
work
10000      106
10001       53
1000167    186
10001797    53
10005525   134
# Keep only the popular books and the active users
reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)]
print(reviews)

                user     work  stars
0           van_stef  3206242    5.0
6            justine     3067    4.5
18           stephmo  1594925    4.0
19         Eyejaybee  2849559    5.0
35       LisaMaria_C   452949    4.5
...              ...      ...    ...
1387161     connie53     1653    4.0
1387177  BruderBane    24623    4.5
1387192  StuartAston  8282225    4.0
1387202      danielx  9759186    4.0
1387206     jclark88  8253945    3.0

[205110 rows x 3 columns]
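The three filtering steps can be wrapped into one helper and tried on a toy table. This is only a sketch: the function name `filter_active`, its `min_count` parameter, and the toy data are hypothetical, introduced here for illustration.

```python
import pandas as pd

def filter_active(reviews, min_count):
    """Keep rows whose user and whose work each appear at least min_count times.

    Hypothetical helper: counts are taken once on the unfiltered table,
    mirroring the single-pass filtering used in the article.
    """
    usercount = reviews.groupby("user")["work"].count()
    workcount = reviews.groupby("work")["user"].count()
    active_users = usercount[usercount >= min_count].index
    popular_works = workcount[workcount >= min_count].index
    keep = reviews["user"].isin(active_users) & reviews["work"].isin(popular_works)
    return reviews[keep]

# Made-up toy data: three users, three books.
toy = pd.DataFrame(
    [["ann", "b1", 5.0], ["ann", "b2", 4.0], ["bob", "b1", 3.0],
     ["bob", "b2", 2.0], ["cat", "b1", 4.5], ["cat", "b3", 1.0]],
    columns=["user", "work", "stars"])

filtered = filter_active(toy, min_count=2)
print(filtered)
```

With `min_count=2`, book `b3` has a single review and is dropped. Because the counts are computed before filtering, user `cat` keeps one row even though only one of her reviews survives; repeating the filter until nothing changes would be a stricter alternative.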
Transforming the data
Next we use the pivot function in pandas to convert the table into a matrix:
reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)
The result is a matrix of 5593 rows and 2898 columns.
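What the pivot does is easier to see on a few rows. The sketch below uses an invented toy table; the user and book names are hypothetical.

```python
import pandas as pd

# Made-up toy reviews: one row per (user, work) pair.
toy = pd.DataFrame(
    [["ann", "b1", 5.0], ["ann", "b2", 4.0],
     ["bob", "b1", 3.0], ["cat", "b3", 1.0]],
    columns=["user", "work", "stars"])

# Users become rows, books become columns, missing ratings become 0.
toymatrix = toy.pivot(index="user", columns="work", values="stars").fillna(0)

print(toymatrix)
print(toymatrix.shape)
```

Two caveats worth keeping in mind: `fillna(0)` makes "not rated" indistinguishable from a rating of zero, and `DataFrame.pivot` raises a `ValueError` if the same (user, work) pair appears more than once, in which case `pivot_table` with an aggregation function is one possible alternative.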

Applying Singular Value Decomposition
We now have 5593 users and 2898 books represented in one matrix. Next we apply SVD (this takes a while):
from numpy.linalg import svd

matrix = reviewmatrix.values
u, s, vh = svd(matrix, full_matrices=False)
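By default, svd() returns a full singular value decomposition. Passing full_matrices=False selects a reduced version, so that smaller matrices are used and memory is saved. The columns of vh correspond to the books, so we can use a vector space model to find which book is most similar to the one we are looking at: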
import numpy as np

def cosine_similarity(v, u):
    # Cosine of the angle between vectors v and u
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))

# Compare column 0 with every other column and keep the best match
highest_similarity = -np.inf
highest_sim_col = -1
for col in range(1, vh.shape[1]):
    similarity = cosine_similarity(vh[:,0], vh[:,col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col
print("Column %d is most similar to column 0" % highest_sim_col)

Column 906 is most similar to column 0
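The example above looks for the book that best matches the first column (column 0); the result is column 906.
In a recommender system, when a user picks a book, we can use the cosine similarity computed above to show her a few other books that are most similar to the one she picked.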
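Showing "a few other books" means ranking all columns rather than keeping only the best one. The following is a minimal sketch of that idea; the helper `top_k_similar` and the small feature matrix are hypothetical and not part of the original code, and the demo uses hand-picked numbers instead of the real vh.

```python
import numpy as np

def top_k_similar(vh, col, k=3):
    """Return the indices of the k columns of vh most similar to column col.

    Hypothetical helper: similarity is the cosine between column vectors.
    """
    norms = np.linalg.norm(vh, axis=0)
    sims = (vh[:, col] @ vh) / (norms[col] * norms)
    sims[col] = -np.inf              # never recommend the book itself
    return np.argsort(sims)[::-1][:k]

# Made-up feature matrix: 2 latent features (rows) for 4 books (columns).
features = np.array([[1.0, 0.9, 0.0, 0.5],
                     [0.0, 0.1, 1.0, 0.5]])

recommended = top_k_similar(features, col=0, k=2)
print(recommended)
```

Here book 1 points in almost the same direction as book 0 and book 3 lies halfway between the two features, so they are ranked first and second, while book 2 is orthogonal to book 0 and is left out.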
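Depending on the dataset, we can use a truncated SVD to reduce the dimension of the matrix vh. In essence, before computing the similarity we drop the rows of vh whose corresponding singular values in s are small. This is likely to make the predictions more accurate, because the less important features of a book are left out of consideration. Truncation matters here for a second reason: since the matrix has more users than books, the reduced vh is a 2898×2898 orthogonal matrix, so its full columns are mutually orthogonal and their cosine similarities are all close to zero. Comparing only the leading rows gives a more meaningful ranking.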
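The effect of truncation can be sketched on a small invented ratings matrix. The data and the choice `k = 2` are assumptions for illustration only; on the real dataset the number of rows to keep would need to be tuned.

```python
import numpy as np

# Made-up ratings: 5 users (rows) by 4 books (columns).
# Books 0 and 1 are liked by the same users, as are books 2 and 3.
ratings = np.array([[5, 4, 0, 0],
                    [4, 5, 0, 1],
                    [0, 1, 5, 4],
                    [0, 0, 4, 5],
                    [5, 5, 1, 0]], dtype=float)

u, s, vh = np.linalg.svd(ratings, full_matrices=False)

# Keep only the rows of vh that belong to the k largest singular values.
k = 2
vh_k = vh[:k, :]

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

full_sim = cosine_similarity(vh[:, 0], vh[:, 1])        # all 4 rows
trunc_sim = cosine_similarity(vh_k[:, 0], vh_k[:, 1])   # top 2 rows only

print(s)
print(full_sim, trunc_sim)
```

With all rows kept, vh is a square orthogonal matrix and the similarity between books 0 and 1 is zero up to rounding. After truncation the two books, which were rated alike, come out as highly similar.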
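Note that in the decomposition $M = U \Sigma V^T$ we know that the rows of $U$ are the users and the columns of $V^T$ are the books, but we cannot say what the columns of $U$ or the rows of $V^T$ (or, equivalently, the entries of $\Sigma$) mean. They could be genres, for example, providing some underlying connection between users and books, but we cannot be sure exactly what they are. This does not prevent us from using them as features in a recommender system.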
References
[1] Recommender Systems and Personalization Datasets: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[2] Social Recommendation Data: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[3] Librarything: https://www.librarything.com/