
Preface

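In text analysis, word clouds are hard to avoid, and in Python, making a word cloud almost always means using wordcloud. But do you really know how to use it? You may already have followed an online tutorial and produced a nice-looking word cloud; this article aims to make the principles behind wordcloud clear.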

A Quick Try

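First, install this third-party library with pip. Then let's take a quick look at how making a word cloud differs between English and Chinese text.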

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
wc.generate(text)

plt.imshow(wc)

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = '我叫羅攀，他叫張三，我叫羅攀'

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')  # set a Chinese font
wc.generate(text)

plt.imshow(wc)

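You will notice that the Chinese word cloud is not what we wanted. That is because wordcloud cannot segment Chinese text into words. The source-code walkthrough of wordcloud below should make the reason clear.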

WordCloud Source Code Analysis

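We mainly need to look at the WordCloud class. I will not reproduce the full source here; instead, the focus is on the overall flow of building a word cloud.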

class WordCloud(object):

    def __init__(self,):
        '''Mainly initializes some parameters.
        '''
        pass

    def fit_words(self, frequencies):
        return self.generate_from_frequencies(frequencies)

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        '''Normalize the word frequencies and create the drawing object.
        '''
        pass

    def process_text(self, text):
        """Tokenize and preprocess the text.
        """

        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
                 else 0)
        pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
        regexp = self.regexp if self.regexp is not None else pattern

        words = re.findall(regexp, text, flags)
        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # remove short words
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]

        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts

    def generate_from_text(self, text):
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    def generate(self, text):
        return self.generate_from_text(text)

When we call the generate method, the call order is:

generate_from_text
process_text  # preprocess the text
generate_from_frequencies  # normalize word frequencies and create the drawing object

Note: when making a word cloud, it does not matter whether you call generate or generate_from_text; generate simply delegates to generate_from_text.
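The delegation chain can be traced with a small stand-in class. This is only a sketch: CallOrderSketch is a hypothetical class that mirrors the method names from the excerpt above, and its whitespace-based counting is a placeholder, not the library's logic.

```python
class CallOrderSketch:
    """Hypothetical stand-in that mirrors only the delegation in the excerpt."""

    def __init__(self):
        self.calls = []  # records the order in which methods run

    def process_text(self, text):
        self.calls.append("process_text")
        # Placeholder: count whitespace-separated tokens once each.
        return {word: 1 for word in text.split()}

    def generate_from_frequencies(self, frequencies):
        self.calls.append("generate_from_frequencies")
        return self

    def generate_from_text(self, text):
        self.calls.append("generate_from_text")
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    def generate(self, text):
        self.calls.append("generate")
        return self.generate_from_text(text)


sketch = CallOrderSketch().generate("luopan zhangshan")
print(sketch.calls)
```

The recorded list shows generate first, then generate_from_text, process_text, and generate_from_frequencies, matching the order listed above.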

So the two functions that matter most here are process_text and generate_from_frequencies. Let's go through them one by one.

The process_text Function

The process_text function tokenizes the text, cleans the tokens, and finally returns a dictionary of token counts. Let's try it out:

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'luopan': 1, 'zhangshan': 1}

text = '我叫羅攀，他叫張三，我叫羅攀'

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'我叫羅攀': 2, '他叫張三': 1}

So process_text clearly cannot segment Chinese properly. Setting aside how process_text cleans the tokens, let's focus on how it splits the text into tokens.

def process_text(self, text):
    """Tokenize and preprocess the text.
    """

    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
             else 0)
    pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
    regexp = self.regexp if self.regexp is not None else pattern

    words = re.findall(regexp, text, flags)

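The key point is that tokenization is done with a regular expression (r"\w[\w']+", or r"\w[\w']*" when min_word_length <= 1). Anyone who has studied regular expressions knows that \w[\w']+ matches a run of two or more word characters: letters, digits, underscores, and Chinese characters (in Python regular expressions, \w also matches Chinese characters), with apostrophes allowed after the first character.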

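So Chinese text cannot be split into words; it is only cut at punctuation marks, which does not match how Chinese word segmentation should work. English text, in contrast, is already separated by spaces, so English words are extracted easily.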
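The difference is easy to reproduce with the re module alone. The sketch below applies the two-or-more-character pattern from the excerpt to the two sample sentences used in this article; it demonstrates only the regular expression, not the rest of process_text (stop-word removal, collocations, and so on).

```python
import re

# The pattern from the process_text excerpt (the min_word_length > 1 branch).
PATTERN = r"\w[\w']+"

english = "my is luopan. he is zhangshan"
chinese = "我叫羅攀，他叫張三，我叫羅攀"

# Spaces and the period separate the English words.
english_tokens = re.findall(PATTERN, english)

# Only the full-width commas separate the Chinese text, so each clause
# comes back as a single "word".
chinese_tokens = re.findall(PATTERN, chinese)

print(english_tokens)
print(chinese_tokens)
```

The English sentence yields one token per word, while the Chinese sentence yields whole clauses, which is why the earlier Chinese word cloud showed phrases instead of words.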

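In short, wordcloud was designed for building word clouds from English text. To build a word cloud from Chinese text, you need to segment the Chinese text first.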

The generate_from_frequencies Function

Finally, a brief word on this function: it normalizes the word frequencies and creates the drawing object.

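The drawing code is long and is not today's focus; we only need to understand what data is required to draw the word cloud. Below is the frequency-normalization code, which should be easy to follow.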

from operator import itemgetter

def generate_from_frequencies(frequencies):
    # sort (word, count) pairs by count, highest first
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))

    max_frequency = float(frequencies[0][1])

    # scale every count relative to the most frequent word
    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]
    return frequencies

test = generate_from_frequencies({'我叫羅攀': 2, '他叫張三': 1})
print(test)

# [('我叫羅攀', 1.0), ('他叫張三', 0.5)]

The Right Way to Make a Word Cloud from Chinese Text

First segment the text with jieba and join the tokens with spaces; process_text can then return a correct dictionary of token counts.

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba

text = '我叫羅攀，他叫張三，我叫羅攀'
cut_word = " ".join(jieba.cut(text))

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate(cut_word)

plt.imshow(wc)
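To see why space-joining helps, the sketch below runs the same regular expression over a space-joined token list. The token list is written by hand as an assumption about what a segmenter might return; jieba's actual output may differ, and the real process_text performs further steps that are not reproduced here.

```python
import re
from collections import Counter

# Hand-written segmentation standing in for a segmenter's output
# (an assumption; jieba's actual tokens may differ).
tokens = ["我", "叫", "羅攀", "，", "他", "叫", "張三", "，", "我", "叫", "羅攀"]
joined = " ".join(tokens)

# The two-or-more-character pattern from the process_text excerpt.
# Spaces now act as word boundaries, and single characters do not match.
words = re.findall(r"\w[\w']+", joined)
counts = Counter(words)

print(words)
print(counts)
```

Once spaces mark the boundaries, the pattern picks out the two names rather than whole clauses.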

Of course, if you already have a dictionary of token counts, there is no need to call generate; call generate_from_frequencies directly.

text = {
    '羅攀': 2,
    '張三': 1
}

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate_from_frequencies(text)

plt.imshow(wc)
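If you start from a list of tokens instead of a ready-made dictionary, one way to build the counts yourself is collections.Counter. This is a minimal sketch: both the token list and the stop-word set are hypothetical, hand-written values for this toy sentence.

```python
from collections import Counter

# Hand-written tokens standing in for a segmenter's output (an assumption).
tokens = ["我", "叫", "羅攀", "他", "叫", "張三", "我", "叫", "羅攀"]

# Hypothetical stop-word set for this toy sentence.
stop_words = {"我", "他", "叫"}

# Count every token that is not a stop word, then convert to a plain dict.
frequencies = dict(Counter(t for t in tokens if t not in stop_words))

print(frequencies)
```

The resulting dictionary has the same shape as the one passed to generate_from_frequencies above, so you control segmentation and filtering yourself instead of relying on process_text.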