爬蟲:使用 MySQL 維護 IP 代理池
點擊藍色“Python交流圈”關(guān)注我丫
加個“星標(biāo)”,每天一起進步一點點

來源:?夏悠然然
https://blog.csdn.net/qq_42730750/article/details/108026476
【導(dǎo)語】:做爬蟲的時候,難免會快速多次地訪問某個網(wǎng)站,觸發(fā)網(wǎng)站的反爬蟲機制,就會“封IP”。解決方法之一就是用代理池,本文就向大家介紹使用Mysql維護一個代理池的方法。
1. 配置PyCharm
??為了方便寫SQL代碼及實時關(guān)注數(shù)據(jù)庫的信息,我們先配置一下PyCharm。
??找到PyCharm右邊欄的Database,點擊它,然后它的界面會彈出來,點擊+號,選擇數(shù)據(jù)庫。

如圖示操作,找到我們的小鯨魚MySQL,點擊進入。

通過這個界面就可以配置MySQL了,這里有幾個填寫的我已經(jīng)作了標(biāo)注,簡單介紹一下:
Name: 本次的配置的名字,這里我填的是spider,也可以使用默認(rèn)名@localhost。
Host:?數(shù)據(jù)庫的IP地址,因為我的MySQL在本地,所以這里我填的就是localhost。
User:?數(shù)據(jù)庫的用戶名。
Password:?數(shù)據(jù)庫的密碼。
Database:?數(shù)據(jù)庫的名字,我這里提前建了一個名為spider的數(shù)據(jù)庫。
URL: 這里我們在后面加上?serverTimezone=UTC,否則的話等會兒我們連接時會出現(xiàn)Server returns invalid timezone. Go to ‘Advanced’ tab and set‘serverTimezone’ property manually.錯誤,或者按照錯誤提示,去Advanced選項配置一下Advanced。
配置完后就點擊Test Connection按鈕,如果提示缺少驅(qū)動文件,直接在彈出的對話框點擊下載即可,不出意外的話,就會在下面出現(xiàn)連接成功的信息。

插入數(shù)據(jù)后,按圖示操作刷新一下,然后雙擊數(shù)據(jù)表,就可以看到數(shù)據(jù)表中的信息了,美滋滋ヾ(@▽@)ノ。


2. 函數(shù)介紹
??這里我們通過pymysql庫來操作MySQL數(shù)據(jù)庫,我的數(shù)據(jù)庫版本是8.0.16,還是去年安裝的,這里不再敘述其安裝步驟了,問問度娘。
??維護我們代理IP池的大致流程就是:先建立一個數(shù)據(jù)表ipproxy,包含有ip字段、score字段,因為有些IP有時候可以用,有時候不可以,所以這里對每個要存入數(shù)據(jù)庫的IP設(shè)置一個分?jǐn)?shù),我這里設(shè)置的最高分是5,也就是質(zhì)量最高。如果我們在使用過程中發(fā)現(xiàn)IP不能用了,就將其分?jǐn)?shù)減1;如果可以用,且分?jǐn)?shù)小于5,就加1,然后定期清理分?jǐn)?shù)為0的IP。

3. 代碼實現(xiàn)
這里只貼出了增加的數(shù)據(jù)庫操作代碼及修改后的IP測試代碼。
import pymysqlimport requestsfrom bs4 import BeautifulSoupimport pickleimport aiohttpimport asyncioimport timeimport randomasync def test_newip(ip_, url, ip_ok):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ''AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}conn = aiohttp.TCPConnector(verify_ssl=False)async with aiohttp.ClientSession(connector=conn) as session:print('正在測試ip: ' + ip_)try:proxy_ip = 'http://' + ip_async with session.get(url=url, headers=headers, proxy=proxy_ip, timeout=15) as response:if response.status == 200:print('代理可用: ' + ip_)ip_ok.append((ip_, 5))else:print('請求響應(yīng)碼不合法 ' + ip_)except:ip_ok.append((ip_, 4))print('代理請求失敗', ip_)async def test_mysqlip(ip_, url, ip_ok):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ''AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}conn = aiohttp.TCPConnector(verify_ssl=False)async with aiohttp.ClientSession(connector=conn) as session:print('正在測試ip: ' + ip_[0])try:proxy_ip = 'http://' + ip_[0]async with session.get(url=url, headers=headers, proxy=proxy_ip, timeout=15) as response:if response.status == 200:print('ip可用: ' + ip_[0])new_score = 5 if ip_[1] == 5 else ip_[1] + 1ip_ok.append((ip_[0], new_score))else:print('請求響應(yīng)碼不合法 ' + ip_[0])except:new_score = 0 if ip_[1] == 0 else ip_[1] - 1ip_ok.append((ip_[0], new_score))print('代理請求失敗', ip_[0])def get_mysqlip():db = pymysql.connect(host='localhost', port=3306, user='用戶名', password='密碼',database='數(shù)據(jù)庫名', charset='utf8')cursor = db.cursor()sql = 'select ip, score from ipproxy'try:cursor.execute(sql)mysql_ip = list(cursor.fetchall())return mysql_ipexcept Exception as err:print('查詢錯誤!!!')print(err)def update_ipscore(ip_list):db = pymysql.connect(host='localhost', port=3306, user='用戶名', password='密碼',database='數(shù)據(jù)庫名', charset='utf8')cursor = db.cursor()for ip_ in ip_list:sql = 'update ipproxy set score=%s where ip=%s'cursor.execute(sql, (ip_[1], ip_[0]))db.commit()cursor.close()db.close()def delete_ip():db = pymysql.connect(host='localhost', port=3306, user='用戶名', password='密碼',database='數(shù)據(jù)庫名', charset='utf8')cursor = db.cursor()sql = 'delete from ipproxy where score=0'try:cursor.execute(sql)except Exception as err:print('刪除錯誤!!!')print(err)db.commit()cursor.close()db.close()def delete_ideticalip():db = pymysql.connect(host='localhost', port=3306, user='用戶名', password='密碼',database='數(shù)據(jù)庫名', charset='utf8')cursor = db.cursor()sql = 'delete from ipproxy where ip in (select ip from (select ip from ipproxy group by ip having count(*)>1) s1)' \'and id not in (select id from (select id from ipproxy group by ip having count(*)>1) s2)'try:cursor.execute(sql)except Exception as err:print('刪除錯誤!!!')print(err)db.commit()cursor.close()db.close()def insert_ip(ip_list):# 新爬取的ip直接插入數(shù)據(jù)庫db = pymysql.connect(host='localhost', port=3306, user='用戶名', password='密碼',database='數(shù)據(jù)庫名', charset='utf8')cursor = db.cursor()sql = 'create table if not exists ipproxy(' \'id int not null primary key auto_increment, ' \'ip char(21) not null , ' \'score int not null ) default charset utf8'cursor.execute(sql)try:sql = 'insert into ipproxy (ip, score) values (%s, %s)'cursor.executemany(sql, ip_list)# cursor.execute('drop table ipproxy')except Exception as err:print('插入錯誤!!!')print(err)db.commit()cursor.close()db.close()def insret_mysqlip(urls):ip_list1 = get_66ip()ip_list2 = get_kaixinip()ip_list3 = get_goubanjiaip()ip_list = list(set(ip_list1 + ip_list2 + ip_list3))print('已做去重處理!')ip_ok = []print('開始測試新爬取的ip: ')try:loop = asyncio.get_event_loop()for i in range(0, len(ip_list), 10):proxies_ip = ip_list[i: i + 10]tasks = [test_newip(proxy_ip, random.choice(urls), ip_ok) for proxy_ip in proxies_ip]loop.run_until_complete(asyncio.wait(tasks))time.sleep(3)except Exception as err:print('發(fā)生錯誤:', err.args)insert_ip(ip_ok)print('數(shù)據(jù)保存完畢!')def update_mysqlip(urls):ip_list = get_mysqlip()ip_ok = []print('開始測試新爬取的ip: ')try:loop = asyncio.get_event_loop()for i in range(0, len(ip_list), 10):proxies_ip = ip_list[i: i + 10]tasks = [test_mysqlip(proxy_ip, random.choice(urls), ip_ok) for proxy_ip in proxies_ip]loop.run_until_complete(asyncio.wait(tasks))time.sleep(3)except Exception as err:print('發(fā)生錯誤:', err.args)update_ipscore(ip_ok)print('數(shù)據(jù)更新完畢!')delete_ip()print('已刪除score為0的ip!')delete_ideticalip()print('已做去重處理!')if __name__ == '__main__':urls = ['https://blog.csdn.net/qq_42730750/article/details/107868879','https://blog.csdn.net/qq_42730750/article/details/107931738','https://blog.csdn.net/qq_42730750/article/details/107869022','https://blog.csdn.net/qq_42730750/article/details/108016855','https://blog.csdn.net/qq_42730750/article/details/107703589','https://blog.csdn.net/qq_42730750/article/details/107869233','https://blog.csdn.net/qq_42730750/article/details/107869944','https://blog.csdn.net/qq_42730750/article/details/107919690']insret_mysqlip(urls)update_mysqlip(urls)
--End--
近期熱門推薦? 1、大廠的 404 頁面都長啥樣?看到最后一個,我笑了。。。
4、用 Python 使用 Google Colab?豈止是炫酷
5、Ubuntu被曝嚴(yán)重漏洞:切換系統(tǒng)語言+輸入幾行命令,就能獲取root權(quán)限
關(guān)注公眾號,回復(fù)“001” 領(lǐng)取Python入門+進階+實戰(zhàn)開發(fā)92天全套視頻教程
點贊最大的支持?



