Scraping Lagou job listings with Selenium


This example scrapes data-analyst job postings.
Environment:
1. Python 3
2. Anaconda3 (Spyder)
3. Windows 10
Source code:
```python
from selenium import webdriver
import time
import logging
import random
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['job_name', 'company_name', 'city', 'industry', 'salary',
              'experience_edu', 'welfare', 'job_label'])
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')


def search_product(key_word):
    browser.find_element_by_id('cboxClose').click()  # close the city-selection popup
    time.sleep(2)
    browser.find_element_by_id('search_input').send_keys(key_word)  # type the keyword into the search box
    browser.find_element_by_class_name('search_button').click()  # click the search button
    browser.maximize_window()
    time.sleep(2)
    browser.execute_script("scroll(0,2500)")  # scroll down the page
    get_data()  # scrape the first page
    # Click "next page" to paginate. Sleep after each page to throttle the
    # crawl and reduce the chance of triggering the anti-bot CAPTCHA.
    for i in range(4):
        browser.find_element_by_class_name('pager_next').click()
        time.sleep(1)
        browser.execute_script("scroll(0,2300)")
        get_data()
        time.sleep(random.randint(3, 5))


def get_data():
    items = browser.find_elements_by_xpath('//*[@id="s_position_list"]/ul/li')
    for item in items:
        job_name = item.find_element_by_xpath('.//div[@class="p_top"]/a/h3').text
        company_name = item.find_element_by_xpath('.//div[@class="company_name"]').text
        city = item.find_element_by_xpath('.//div[@class="p_top"]/a/span[@class="add"]/em').text
        industry = item.find_element_by_xpath('.//div[@class="industry"]').text
        salary = item.find_element_by_xpath('.//span[@class="money"]').text
        experience_edu = item.find_element_by_xpath('.//div[@class="p_bot"]/div[@class="li_b_l"]').text
        welfare = item.find_element_by_xpath('.//div[@class="li_b_r"]').text
        job_label = item.find_element_by_xpath('.//div[@class="list_item_bot"]/div[@class="li_b_l"]').text
        data = f'{job_name},{company_name},{city},{industry},{salary},{experience_edu},{welfare},{job_label}'
        logging.info(data)
        sheet.append([job_name, company_name, city, industry, salary,
                      experience_edu, welfare, job_label])


def main():
    browser.get('https://www.lagou.com/')
    time.sleep(random.randint(1, 3))
    search_product(keyword)
    wb.save('C:/Users/liz/job_info.xlsx')


if __name__ == '__main__':
    keyword = 'Python 數(shù)據(jù)分析'
    chrome_driver = r'C:/Users/liz/chromedriver.exe'  # path to the chromedriver binary
    options = webdriver.ChromeOptions()
    # Hide the "Chrome is being controlled by automated test software" banner
    options.add_experimental_option('useAutomationExtension', False)
    options.add_experimental_option("excludeSwitches", ['enable-automation'])
    # Note: find_element_by_* and executable_path are Selenium 3 APIs
    browser = webdriver.Chrome(options=options, executable_path=chrome_driver)
    main()
    browser.quit()
```
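The scraped rows above are appended to an Excel sheet via openpyxl. If openpyxl is not available, the same header-plus-rows flow can be sketched with the standard-library csv module (the file name `job_info.csv`, the helper `save_rows`, and the sample row are all invented here for illustration, not from the article):

```python
import csv

# Column header matching the openpyxl sheet in the article
HEADER = ['job_name', 'company_name', 'city', 'industry', 'salary',
          'experience_edu', 'welfare', 'job_label']


def save_rows(path, rows):
    """Write the header plus one row per job posting, mirroring sheet.append()."""
    # utf-8-sig adds a BOM so Excel opens the CSV with correct encoding
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


# Hypothetical sample row, standing in for one scraped listing
save_rows('job_info.csv', [
    ['Data Analyst', 'ExampleCo', 'Beijing', 'Internet', '15k-25k',
     '3-5 yrs / Bachelor', 'flexible hours', 'SQL,Python'],
])
```

One row per `get_data()` item would be passed in the same way the article passes lists to `sheet.append()`.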
Screenshot of a run:

Notes:
1. The chromedriver version must match the version of the installed Chrome browser.
2. Not entirely original; adapted from code found online.
3. This approach defeats common anti-scraping measures: many sites now load their data via JavaScript specifically to block crawlers, so fetching the page with Python's requests library returns no data. Selenium drives a real browser, which opens the page and executes the JavaScript before the data is extracted, so the crawler sees the fully rendered page.
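Beyond JS rendering, the code also throttles itself with `time.sleep(random.randint(3, 5))` between pages. That idea can be factored into a small reusable helper (a sketch; the name `polite_sleep` is invented here, not from the article):

```python
import random
import time


def polite_sleep(lo=3, hi=5):
    """Pause a random whole number of seconds in [lo, hi] between page
    loads, pacing the crawler to lower the odds of a CAPTCHA challenge."""
    delay = random.randint(lo, hi)
    time.sleep(delay)
    return delay
```

In the pagination loop above, `polite_sleep()` would replace the inline `time.sleep(random.randint(3, 5))` call after each `get_data()`.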



