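jparser: a Python library for web page transcoding
jparser is a Python library for web page transcoding, that is, extracting the main content of a page from its HTML source as structured data: text paragraphs and images. It is currently optimized mainly for news and information pages.
Usage: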
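# Python 2 example (uses urllib2 and print statements)
import urllib2
from jparser import PageModel

# Fetch the page and decode the raw bytes (GB18030) into a unicode string
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict with a 'type' ('text' or 'image') and its 'data'
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']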
Example:
Dependencies: lxml
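The usage snippet above suggests that `extract()` returns a dict with a `title` string and a `content` list whose items carry a `type` (`text` or `image`) and a `data` field. The sketch below relies only on that shape, as inferred from the snippet, and shows one way to flatten such a result into plain-text lines. The names `render_content` and `sample` are hypothetical and are not part of jparser; the sample dict is hand-written, not real extractor output.

```python
def render_content(result):
    """Flatten an extract()-style result dict into a list of plain-text lines.

    Assumes the shape shown in the usage snippet: a 'title' string and a
    'content' list of {'type': 'text' | 'image', 'data': ...} items.
    """
    lines = ["==title==", result.get("title", ""), "==content=="]
    for block in result.get("content", []):
        if block["type"] == "text":
            # Text blocks carry the paragraph string directly in 'data'
            lines.append(block["data"])
        elif block["type"] == "image":
            # Image blocks carry a dict in 'data'; only 'src' is used here
            lines.append("[IMAGE] " + block["data"]["src"])
    return lines


# Hand-written stand-in for pm.extract() output (hypothetical data)
sample = {
    "title": "Sample headline",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
        {"type": "text", "data": "Second paragraph."},
    ],
}

if __name__ == "__main__":
    for line in render_content(sample):
        print(line)
```

Returning a list of lines instead of printing directly keeps the helper usable under both Python 2 and Python 3, and makes it easy to join the lines or write them to a file.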
