
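Financial markets have been choppy lately, and anyone who bought in at the start of the year may still be waiting for a rebound. In such a tangled market, every fund investor who can code (or wants to learn) shares one goal: using technical tools to read market shifts faster and more accurately, and to act ahead of them.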

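This article walks through, step by step, how to use Python to scrape 500,000+ posts from the Tiantian Fund (天天基金) forum and analyze investor sentiment, so you can spot changes in the market sooner.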

Page analysis

We start by picking a baijiu (liquor) fund and looking at the data in its forum. The URL is:

http://guba.eastmoney.com/list,of161725.html

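As the page shows, the fund has 6371 pages holding 509,669 discussion posts in total, and the count keeps growing. The fields are views, comments, title, author, and last updated (time of the latest comment). Clicking through to the next page changes the URL to: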

http://guba.eastmoney.com/list,of161725_2.html

This is clearly a simple static page: concatenating the fund code and the page number into the URL is enough to scrape the forum of any fund.
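As a minimal sketch of that URL scheme, the list URL can be assembled from the two parameters. The helper name `build_list_url` is hypothetical, and the sketch assumes page 1 is also reachable under the `_1` form, as the article's later code does.

```python
# URL pattern taken from the two example URLs above.
BASE = "http://guba.eastmoney.com/list,of{code}_{page}.html"

def build_list_url(fundcode, page=1):
    """Return the forum list-page URL for a fund code and page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE.format(code=fundcode, page=page)

print(build_list_url(161725, 2))
```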

Data scraping

The scraper was written in PyCharm. First, import the packages it needs:

import csv
import time
import random
import requests
import traceback
from time import sleep
from fake_useragent import UserAgent
from lxml import etree

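Try requesting a single page. Whenever possible, sleep for a random interval and send randomly generated headers: this is basic good manners for anyone writing a scraper, and also the simplest defence against anti-scraping measures: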

page = 1  # page number to fetch
fundcode = 161725    # can be replaced with any fund code
sleep(random.uniform(1, 2))  # pause for a random float between 1 and 2 seconds
headers = {"User-Agent":UserAgent(verify_ssl=False).random}
url = f'http://guba.eastmoney.com/list,of{fundcode}_{page}.html'
response = requests.get(url, headers=headers, timeout=10)
print(response)

Press F12 to inspect the page source:

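The page structure is simple: the data sits under the div whose id is articlelistnew. The first div inside it is the header row, so parsing starts from the second div. This article uses XPath; other parsing approaches are just as easy.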

parse = etree.HTML(response.text)  # parse the page
items = parse.xpath('//*[@id="articlelistnew"]/div')[1:91]
for item in items:
    item = {
             '閱讀': ''.join(item.xpath('./span[1]/text()')).strip(),
             '評論': ''.join(item.xpath('./span[2]/text()')).strip(),
             '標(biāo)題': ''.join(item.xpath('./span[3]/a/text()')).strip(),
             '作者': ''.join(item.xpath('./span[4]/a/font/text()')).strip(),
             '時間': ''.join(item.xpath('./span[5]/text()')).strip()
            }
    print(item)

With the data scraped, store it in CSV format:

with open(f'./{fundcode}.csv', 'a', encoding='utf_8_sig', newline='') as fp:
    fieldnames = ['閱讀', '評論', '標(biāo)題', '作者', '時間']
    writer = csv.DictWriter(fp, fieldnames)
    writer.writerow(item)
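The snippet above writes a single row. When a whole page of parsed rows has been collected into a list, they can be appended in one call. The sketch below shows that with the standard `csv` module; the helper name `append_rows` and the idea of collecting rows into a list first are assumptions, not part of the original code.

```python
import csv

def append_rows(path, rows, fieldnames):
    """Append a list of dict rows to a CSV file, one line per row."""
    # utf_8_sig keeps the file readable in Excel; newline='' avoids blank lines on Windows.
    with open(path, "a", encoding="utf_8_sig", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames)
        writer.writerows(rows)
```

Inside the parsing loop, each dict would be added to a list, and `append_rows` would be called once per page after the loop finishes.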

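Scrape multiple pages and wrap the scraper code in functions. It is also advisable to add exception handling to each part so the program does not exit midway: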

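# Main function
# get_fund (fetch) and parse_fund (parse + save) are not defined in this article;
# they wrap the request and parsing steps shown above.
def main(page):
    fundcode = 161725    # can be replaced with any fund code
    url = f'http://guba.eastmoney.com/list,of{fundcode}_{page}.html'
    html = get_fund(url)
    parse_fund(html,fundcode)


if __name__ == '__main__':
    for page in range(1,6372):   # scrape every page (6371 in total)
        main(page)
        time.sleep(random.uniform(1, 2))
        print(f"Page {page} done")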
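The exception handling recommended above is not shown in the article. One possible shape for it is a small retry wrapper, sketched here with only the standard library. The name `call_with_retries` and its parameters are hypothetical; it logs the traceback, waits, retries, and finally gives up with `None` so a long crawl keeps going.

```python
import time
import traceback

def call_with_retries(func, *args, retries=3, delay=1.0):
    """Call func(*args); on any exception, log it, wait, and try again.

    Returns func's result, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception:
            traceback.print_exc()      # keep a record of what went wrong
            if attempt < retries:
                time.sleep(delay)      # back off before the next attempt
    return None
```

In the loop above, `main(page)` could then be replaced by `call_with_retries(main, page)`, so that one failed page does not stop the remaining ones.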

OK, the scraping is done.

Investor sentiment

The data processing and analysis were done in Jupyter Notebook. With the scrape finished, we can start analyzing. First, load the data:

import pandas as pd
import numpy as np

df = pd.read_csv("/菜J學(xué)Python/金融/天天基金/161725.csv",
                 names=['閱讀', '評論', '標(biāo)題', '作者', '時間'])

Do some basic data cleaning:

# Duplicates and missing values
df = df.drop_duplicates()
df = df.dropna()

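# Type conversion: '萬' means 10,000, so rescale those rows instead of only dropping the unit
reads = df['閱讀'].astype(str)
df['閱讀'] = reads.str.replace('萬','').astype('float') * np.where(reads.str.contains('萬'), 10000, 1)
# errors='ignore' leaves unparseable values unchanged (deprecated in recent pandas versions)
df['時間'] = pd.to_datetime(df['時間'],errors='ignore') 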

# Mechanical compression: collapse consecutively repeated substrings
def yasuo(st):
    for i in range(1,int(len(st)/2)+1):
        for j in range(len(st)):
            if st[j:j+i] == st[j+i:j+2*i]:
                k = j + i
                while st[k:k+i] == st[k+i:k+2*i] and k<len(st):   
                    k = k + i
                st = st[:j] + st[k:]    
    return st
yasuo(st="J哥J哥J哥J哥J哥")
df["標(biāo)題"] = df["標(biāo)題"].apply(yasuo)

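# Filter out emoji: keep the first run of Chinese characters in each title
df['標(biāo)題'] = df['標(biāo)題'].str.extract(r"([\u4e00-\u9fa5]+)", expand=False)
df = df.dropna()  # titles with no Chinese characters (emoji only) are dropped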

# Filter out short titles
df = df[df["標(biāo)題"].apply(len)>=3]
df = df.dropna()
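The mechanical compression step above collapses titles such as a phrase typed five times in a row. As an alternative sketch (not the author's `yasuo` function, and not guaranteed to match it on every input), the same idea can be expressed with a regular-expression backreference. The name `collapse_repeats` is hypothetical.

```python
import re

# (.+?) lazily captures the shortest unit; \1+ matches one or more
# immediate repetitions of that same unit.
_REPEAT = re.compile(r"(.+?)\1+")

def collapse_repeats(text):
    """Replace each run of consecutive repeats with a single copy of the unit."""
    return _REPEAT.sub(r"\1", text)

print(collapse_repeats("J哥J哥J哥J哥J哥"))  # prints: J哥
```

Like the loop-based version, this also shortens legitimate doubled characters, so it suits noisy forum titles better than general text.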

Start with a word cloud to see what people think of this fund:

import jieba
import stylecloud
from IPython.display import Image 

# Draw the word cloud
text1 = get_cut_words(content_series=df['標(biāo)題'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=200,
                          collocations=False,
                          font_path='simhei.ttf',
                          icon_name='fas fa-heart',
                          size=653,
                          #palette='matplotlib.Inferno_9',
                          output_name='./基金.png')
Image(filename='./基金.png')
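The `get_cut_words` helper called above is not defined in the article. The sketch below is one guess at what such a helper might do: tokenize each title, drop stop words and very short tokens, and return a flat word list. The `cut`, `stop_words`, and `min_len` parameters are assumptions; by default it falls back to `jieba.lcut`, which the article imports.

```python
def get_cut_words(content_series, cut=None, stop_words=frozenset(), min_len=2):
    """Tokenize every title and return one flat list of the words that are kept."""
    if cut is None:
        import jieba  # imported lazily so the helper can be tried without jieba installed
        cut = jieba.lcut
    words = []
    for text in content_series:
        for word in cut(str(text)):
            word = word.strip()
            # keep tokens that are long enough and not in the stop-word set
            if len(word) >= min_len and word not in stop_words:
                words.append(word)
    return words
```

Passing a custom `cut` function makes it easy to swap tokenizers or to test the filtering logic on its own.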

It is hard to read the investors' mood from this...

So we move to a more quantitative approach and compute a sentiment score for every post:

import paddlehub as hub
senta = hub.Module(name="senta_bilstm")
texts = df['標(biāo)題'].tolist()
input_data = {'text':texts}
res = senta.sentiment_classify(data=input_data)
df['投資者情緒'] = [x['positive_probs'] for x in res]
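With roughly half a million titles, passing everything to the model in one call may use a lot of memory. A possible workaround, sketched here under that assumption, is to score the titles in slices. The helper name `score_in_batches` and the `classify` callable are hypothetical; the only thing taken from the code above is that each result is a dict with a `positive_probs` key.

```python
def score_in_batches(texts, classify, batch_size=1000):
    """Run `classify` on successive slices of `texts` and collect the positive probabilities."""
    scores = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results = classify(batch)  # expected: one dict per input text
        scores.extend(r["positive_probs"] for r in results)
    return scores
```

With the model above, `classify` could be `lambda batch: senta.sentiment_classify(data={'text': batch})`.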

Resample the data:

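# Resample to 15-minute bins
df['時間'] = pd.to_datetime(df['時間']) 
df.index = df['時間']
# average only the numeric columns; text columns such as the title cannot be averaged
data = df.select_dtypes('number').resample('15min').mean().reset_index()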
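To make the resampling step concrete, the sketch below reproduces the core of a 15-minute mean using only the standard library: each timestamp is rounded down to the start of its bucket and the scores in each bucket are averaged. The function names are hypothetical, and unlike pandas this version simply omits empty buckets instead of emitting NaN rows.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_bucket(ts, minutes=15):
    """Round a datetime down to the start of its N-minute bucket (N must divide 60)."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def bucket_means(records, minutes=15):
    """records: iterable of (datetime, score). Returns {bucket_start: mean score}."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [running total, count]
    for ts, score in records:
        key = floor_to_bucket(ts, minutes)
        sums[key][0] += score
        sums[key][1] += 1
    return {k: total / n for k, (total, n) in sorted(sums.items())}
```

A post at 09:44:59 therefore lands in the 09:30 bucket, while one at 09:45:00 starts the next bucket.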

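The intraday data for the Shanghai Composite Index comes from AkShare, an open-source API. AkShare is a Python-based financial data library that supports fast collection and cleaning of fundamental and historical market data for stocks, futures, options, funds, foreign exchange, bonds, indices, cryptocurrencies and other financial products.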

import akshare as ak
import matplotlib
matplotlib.use('Qt5Agg')  # choose the backend before importing pyplot
import matplotlib.pyplot as plt

sz_index = ak.stock_zh_a_minute(symbol='sh000001', period='15', adjust="qfq")
sz_index['日期'] = pd.to_datetime(sz_index['day'])
sz_index['收盤價'] = sz_index['close'].astype('float')
data = data.merge(sz_index,left_on='時間',right_on='日期',how='inner')
data.index = data['時間']
# put the closing price on the right-hand axis; it must be named as it appears in the plotted columns
data[['投資者情緒','收盤價']].plot(secondary_y=['收盤價'])
plt.show()

The plot suggests that investor sentiment lags behind the Shanghai Composite Index.
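A lag read off a chart is easier to trust with a number attached. One common way to check it, sketched here as an illustration and not as the author's method, is to correlate sentiment at time t with the index at time t - k for several values of k and see which k fits best. The function names are hypothetical, and the inputs are assumed to be two equally long, time-aligned lists such as the two plotted columns.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def lagged_corr(sentiment, price, lag):
    """Correlation between sentiment[t] and price[t - lag], for lag >= 0."""
    if lag == 0:
        return pearson(sentiment, price)
    return pearson(sentiment[lag:], price[:-lag])

def best_lag(sentiment, price, max_lag):
    """Lag (in bars) at which sentiment correlates most strongly with earlier prices."""
    return max(range(max_lag + 1), key=lambda k: lagged_corr(sentiment, price, k))
```

With 15-minute bars, a best lag of 2 would mean sentiment follows the index by about half an hour. Correlating price levels can overstate the relationship when both series trend, so returns may be a safer input.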


A step-by-step guide to scraping 500,000 fund forum posts and analyzing investor sentiment!