pyrailgun: a web scraping tool
This is a very simple, easy-to-use scraping tool.
How do you use it? First, create a rule file for the site you want to scrape, for example test.json:
{
  "name": "bing searcher",
  "action": "main",
  "subaction": [
    {
      "action": "fetcher",
      "url": "http://www.bing.com/search?q=${@q}",
      "timeout": 1,
      "subaction": [
        {
          "action": "parser",
          "subaction": [
            {
              "action": "shell",
              "subaction": [
                {
                  "action": "parser",
                  "setField": "title",
                  "pos": 0,
                  "rule": "a",
                  "strip": "true"
                },
                {
                  "action": "parser",
                  "setField": "description",
                  "pos": 0,
                  "rule": "p"
                }
              ],
              "group": "default"
            }
          ],
          "rule": "#results .sa_wr"
        }
      ]
    }
  ]
}
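The rule file is a tree of actions nested through "subaction". As a rough illustration of that nesting, the sketch below parses a condensed copy of the rule and prints one line per action. It uses only the standard json module and does not call pyrailgun at all; the outline helper is a hypothetical name introduced here, not part of the library.

```python
import json

# Condensed copy of the rule file above (same keys, same nesting).
RULE_TEXT = """
{
  "name": "bing searcher",
  "action": "main",
  "subaction": [{
    "action": "fetcher",
    "url": "http://www.bing.com/search?q=${@q}",
    "timeout": 1,
    "subaction": [{
      "action": "parser",
      "rule": "#results .sa_wr",
      "subaction": [{
        "action": "shell",
        "group": "default",
        "subaction": [
          {"action": "parser", "setField": "title", "pos": 0, "rule": "a", "strip": "true"},
          {"action": "parser", "setField": "description", "pos": 0, "rule": "p"}
        ]
      }]
    }]
  }]
}
"""


def outline(node, depth=0):
    """Return one indented line per action in the tree (hypothetical helper)."""
    # Pick the most descriptive key that this node happens to carry.
    detail = ""
    for key in ("setField", "rule", "url", "group", "name"):
        if key in node:
            detail = node[key]
            break
    lines = ["{}{}: {}".format("  " * depth, node["action"], detail)]
    # Recurse into nested actions, if any.
    for child in node.get("subaction", []):
        lines.extend(outline(child, depth + 1))
    return lines


rule = json.loads(RULE_TEXT)
tree_lines = outline(rule)
print("\n".join(tree_lines))
```

Read top to bottom, the outline shows the chain main, fetcher, parser, shell, and then the two field parsers that fill in title and description.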
Then, in your code, add it to railgun as a task:
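from railgun import RailGun

railgun = RailGun()
# Load the rule file created above and register it as the task
railgun.setTask(open("test.json"))
# Run the task
railgun.fire()
# Collect the shells stored under the "default" group
nodes = railgun.getShells('default')
print(nodes)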
You then get a list of nodes containing all the parsed data, in a form such as [{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx},{img:xxx,src:xxx,score:xxx,dest:xxx,description:xxx}]
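As a minimal sketch of consuming such a list, the snippet below flattens nodes into rows. It assumes each node behaves like a dict keyed by the "setField" names from the rule file (title and description here). The sample_nodes data and the summarize helper are hypothetical stand-ins so the example runs without pyrailgun or network access.

```python
# Hypothetical stand-in for the value returned by railgun.getShells('default').
# Assumption: each node behaves like a dict keyed by the "setField" names.
sample_nodes = [
    {"title": "First result", "description": "Snippet for the first result"},
    {"title": "Second result"},  # assumed case: a field is absent from a node
]


def summarize(nodes, fields=("title", "description")):
    """Flatten nodes into rows, using an empty string for any missing field."""
    return [[node.get(field, "") for field in fields] for node in nodes]


rows = summarize(sample_nodes)
for title, description in rows:
    print("{} | {}".format(title, description))
```

Using node.get with a default keeps the loop from failing if a node lacks one of the fields.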
It also supports fetching pages by running their JavaScript in a WebKit engine, and selecting DOM elements with CSS-style selectors.
Cross-platform; Windows is supported.
