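Scraping 500,000 fund forum posts with Python to analyze investor sentiment
Hello everyone, and welcome to Crossin的編程教室!
01
Web page analysis
We start by picking a baijiu (Chinese liquor) fund and looking at the posts in its Eastmoney fund forum (guba). The URL of the first list page is given below, followed by the URL of the second page: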
http://guba.eastmoney.com/list,of161725.html

http://guba.eastmoney.com/list,of161725_2.html
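The two URLs suggest a simple paging scheme in which the page number is appended to the fund code as `_<page>`. A minimal sketch of a URL builder under that assumption follows; the helper name `build_page_url` is hypothetical, and page 1 is written in the `_1` form, which is also what the scraping code in this article requests.

```python
BASE_URL = "http://guba.eastmoney.com/list,of{fundcode}_{page}.html"

def build_page_url(fundcode, page=1):
    """Return the list-page URL for a fund code and a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL.format(fundcode=fundcode, page=page)

# Print the URLs of the first two pages of fund 161725
for page in (1, 2):
    print(build_page_url(161725, page))
```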
02
Data scraping
import csv
import time
import random
import requests
import traceback
from time import sleep
from fake_useragent import UserAgent
from lxml import etree
page = 1  # page number to scrape
fundcode = 161725    # can be replaced with any fund code
sleep(random.uniform(1, 2))  # pause for a random 1-2 seconds (fractional values included)
headers = {"User-Agent": UserAgent(verify_ssl=False).random}
url = f'http://guba.eastmoney.com/list,of{fundcode}_{page}.html'
response = requests.get(url, headers=headers, timeout=10)
print(response)

parse = etree.HTML(response.text)  # parse the page
items = parse.xpath('//*[@id="articlelistnew"]/div')[1:91]  # skip the first div, keep up to 90 posts
rows = []
for item in items:
    row = {
             '閱讀': ''.join(item.xpath('./span[1]/text()')).strip(),  # views
             '評論': ''.join(item.xpath('./span[2]/text()')).strip(),  # comments
             '標題': ''.join(item.xpath('./span[3]/a/text()')).strip(),  # title
             '作者': ''.join(item.xpath('./span[4]/a/font/text()')).strip(),  # author
             '時間': ''.join(item.xpath('./span[5]/text()')).strip()  # time
            }
    print(row)
    rows.append(row)
# Append every post on this page to the CSV file (no header row is written)
with open(f'./{fundcode}.csv', 'a', encoding='utf_8_sig', newline='') as fp:
    fieldnames = ['閱讀', '評論', '標題', '作者', '時間']
    writer = csv.DictWriter(fp, fieldnames)
    writer.writerows(rows)
# Main function
# get_fund and parse_fund are not defined in this article; they presumably wrap
# the request step and the parse-and-save step shown above.
def main(page):
    fundcode = 161725    # can be replaced with any fund code
    url = f'http://guba.eastmoney.com/list,of{fundcode}_{page}.html'
    html = get_fund(url)
    parse_fund(html, fundcode)
if __name__ == '__main__':
    for page in range(1, 6372):   # scrape all pages (6,371 in total)
        main(page)
        time.sleep(random.uniform(1, 2))
        print(f"Page {page} done")
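The main function relies on `get_fund` and `parse_fund`, whose bodies do not appear in the article. The sketch below is one plausible way to write them, assuming they simply wrap the request and parsing steps; `extract_rows` and `save_rows` are hypothetical helper names, and the fixed User-Agent string is a placeholder for the random one produced by `fake_useragent`.

```python
import csv

FIELDNAMES = ['閱讀', '評論', '標題', '作者', '時間']  # views, comments, title, author, time
XPATHS = ['./span[1]/text()', './span[2]/text()', './span[3]/a/text()',
          './span[4]/a/font/text()', './span[5]/text()']

def get_fund(url):
    """Download one list page and return its HTML text."""
    import requests  # third-party; imported here so the CSV helper loads without it
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value, an assumption
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text

def extract_rows(html):
    """Turn one list page into a list of dicts, one per post."""
    from lxml import etree  # third-party; imported here for the same reason
    tree = etree.HTML(html)
    rows = []
    for item in tree.xpath('//*[@id="articlelistnew"]/div')[1:91]:
        values = [''.join(item.xpath(xp)).strip() for xp in XPATHS]
        rows.append(dict(zip(FIELDNAMES, values)))
    return rows

def save_rows(rows, path):
    """Append rows to a CSV file without writing a header row."""
    with open(path, 'a', encoding='utf_8_sig', newline='') as fp:
        csv.DictWriter(fp, FIELDNAMES).writerows(rows)

def parse_fund(html, fundcode):
    """Parse one page and append its posts to ./<fundcode>.csv."""
    save_rows(extract_rows(html), f'./{fundcode}.csv')
```

Because the file is opened in append mode and has no header, the analysis step has to supply the column names itself when reading the CSV.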
03
Investor sentiment
import pandas as pd
import numpy as np
df = pd.read_csv("/菜J學Python/金融/天天基金/161725.csv",
                 names=['閱讀', '評論', '標題', '作者', '時間'])
# Drop duplicate and missing rows
df = df.drop_duplicates()
df = df.dropna()
# Type conversion
# The character for "ten thousand" is a unit, so "1.2万" must become 12000 rather than 1.2
reads = df['閱讀'].astype(str)
scale = np.where(reads.str.contains('[万萬]'), 10000, 1)
df['閱讀'] = reads.str.replace('[万萬]', '', regex=True).astype('float') * scale
# errors='ignore' leaves unparseable values unchanged (deprecated in newer pandas releases)
df['時間'] = pd.to_datetime(df['時間'], errors='ignore')
# Mechanical compression: collapse consecutively repeated substrings
def yasuo(st):
    for i in range(1, int(len(st)/2)+1):
        for j in range(len(st)):
            if st[j:j+i] == st[j+i:j+2*i]:
                k = j + i
                while st[k:k+i] == st[k+i:k+2*i] and k < len(st):
                    k = k + i
                st = st[:j] + st[k:]
    return st
yasuo(st="J哥J哥J哥J哥J哥")  # returns 'J哥'
df["標題"] = df["標題"].apply(yasuo)
# Filter out emoji: keep only the first run of Chinese characters in each title
df['標題'] = df['標題'].str.extract(r"([\u4e00-\u9fa5]+)", expand=False)
df = df.dropna()  # titles that are pure emoji are dropped
# Filter out short titles
df = df[df["標題"].apply(len) >= 3]
df = df.dropna()
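The mechanical compression step removes immediate repetitions such as a phrase typed five times in a row. As a point of comparison, here is a regular-expression sketch of the same idea. It is a different technique from `yasuo`, not a drop-in replacement: it makes a single left-to-right pass and is not guaranteed to agree with `yasuo` on every input. The name `collapse_repeats` is hypothetical.

```python
import re

# (.+?) captures the shortest possible unit; \1+ matches one or more immediate repeats of it
_REPEAT = re.compile(r'(.+?)\1+')

def collapse_repeats(text):
    """Replace each run of a consecutively repeated unit with a single copy."""
    return _REPEAT.sub(r'\1', text)

print(collapse_repeats("J哥J哥J哥J哥J哥"))  # J哥
print(collapse_repeats("加仓加仓白酒"))      # 加仓白酒
```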
First, we draw a word cloud to see what people think of this fund:
import jieba
import stylecloud
from IPython.display import Image
# Draw the word cloud
# get_cut_words is not defined in this article; it is assumed to segment the
# titles with jieba and return a list of words
text1 = get_cut_words(content_series=df['標題'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=200,
                          collocations=False,
                          font_path='simhei.ttf',
                          icon_name='fas fa-heart',
                          size=653,
                          #palette='matplotlib.Inferno_9',
                          output_name='./基金.png')
Image(filename='./基金.png')
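The word cloud code calls `get_cut_words`, which the article does not define. A minimal sketch of such a helper is given here, assuming it segments each title with `jieba.lcut` and filters out noise. The `cut` and `stop_words` parameters are additions for illustration, and the rule of dropping single-character words is an assumption rather than the author's stated choice.

```python
def get_cut_words(content_series, cut=None, stop_words=()):
    """Segment every title and return one flat list of words."""
    if cut is None:
        import jieba  # third-party segmenter used in the article
        cut = jieba.lcut
    stop = set(stop_words)
    words = []
    for title in content_series:
        for word in cut(str(title)):
            word = word.strip()
            # Drop blanks, single characters and stop words, which add noise to a word cloud
            if len(word) >= 2 and word not in stop:
                words.append(word)
    return words
```

Passing a different `cut` function (for example `str.split` on pre-segmented text) makes it possible to try the helper without loading the jieba dictionary.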

It is hard to read the investors' mood from the word cloud alone...
So we turn to a more quantitative method and compute a sentiment score for each post title:
import paddlehub as hub
senta = hub.Module(name="senta_bilstm")
texts = df['標題'].tolist()
input_data = {'text': texts}
res = senta.sentiment_classify(data=input_data)
df['投資者情緒'] = [x['positive_probs'] for x in res]  # probability that the title is positive
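Next, resample the data:
# Resample to 15-minute intervals
df['時間'] = pd.to_datetime(df['時間'])
# Average only the numeric columns; text columns such as the title cannot be averaged
data = df.set_index('時間').resample('15min').mean(numeric_only=True).reset_index()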
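To make explicit what the resampling step computes, the following plain-Python sketch floors each timestamp to the start of its 15-minute bucket and averages the scores in each bucket. It mirrors the left-labelled buckets that `resample('15min')` uses by default, except that empty buckets are simply omitted here instead of appearing as NaN rows. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

def mean_by_15min(records):
    """Average (timestamp, score) pairs within 15-minute buckets.

    Each timestamp is floored to the start of its bucket (:00, :15, :30, :45).
    """
    buckets = defaultdict(list)
    for ts, score in records:
        start = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
        buckets[start].append(score)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

# Made-up sample: two posts fall in the 09:30 bucket, one in the 09:45 bucket
sample = [
    (datetime(2021, 4, 30, 9, 31), 0.9),
    (datetime(2021, 4, 30, 9, 44), 0.5),
    (datetime(2021, 4, 30, 9, 45), 0.2),
]
for start, avg in mean_by_15min(sample).items():
    print(start, round(avg, 2))
```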
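We fetch intraday data for the Shanghai Composite Index through AkShare, an open-source Python library of financial data interfaces. It supports fast collection and cleaning of fundamental data and historical market data for stocks, futures, options, funds, foreign exchange, bonds, indices, cryptocurrencies and other financial products.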
import akshare as ak
import matplotlib
import matplotlib.pyplot as plt
sz_index = ak.stock_zh_a_minute(symbol='sh000001', period='15', adjust="qfq")
sz_index['日期'] = pd.to_datetime(sz_index['day'])
sz_index['收盤價'] = sz_index['close'].astype('float')
data = data.merge(sz_index, left_on='時間', right_on='日期', how='inner')
matplotlib.use('Qt5Agg')
data.index = data['時間']
data[['投資者情緒', '收盤價']].plot(secondary_y=['收盤價'])  # closing price on the right-hand axis
plt.show()

The chart suggests that investor sentiment lags the Shanghai Composite Index.
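The lag is read off the chart by eye. One way to check it numerically would be to correlate the index at time t with sentiment at time t + lag for several lags and see where the correlation peaks. The sketch below does this in plain Python on made-up numbers, so it says nothing about the real data. With pandas, the same quantity could be computed as `data['投資者情緒'].shift(-lag).corr(data['收盤價'])`. The function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lagged_correlations(index, sentiment, max_lag):
    """Correlate index[t] with sentiment[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        result[lag] = pearson(index[:len(index) - lag], sentiment[lag:])
    return result

# Made-up series in which sentiment repeats the index two steps later
index = [1, 3, 2, 5, 4, 6, 8, 7]
sentiment = [0, 0, 1, 3, 2, 5, 4, 6]
corr = lagged_correlations(index, sentiment, max_lag=3)
best = max(corr, key=corr.get)
print(best)  # 2
```

A peak at a positive lag would be consistent with sentiment following the index; a peak at lag 0 would not support the lag reading.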
That is the whole process of cleaning and analyzing the posts from the fund forum.
If this article helped you, feel free to share, like or bookmark it~
Author: J哥
