
          woody: HTML parser/extractor

          Co-authored · 2023-09-21 23:34

          woody is a Java-based HTML parser/extractor. Its usage closely resembles webmagic's, and it is a complete rewrite of webmagic's extraction module.

          Features:

          • Multiple result data types (String, char, byte, short, int, long, double, float, String[], Set, List, Date)
          • User-defined script processing functions (currently configured as JavaScript functions)
          • Swappable CSS and XPath engines
          • Filter support
          • Caching of CSS and XPath engine objects
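The caching feature can be pictured roughly as follows. This is only a sketch of the general idea, not woody's actual cache: the class name `ExpressionCache` and its methods are hypothetical, and `java.util.regex.Pattern` stands in for whatever compiled CSS or XPath object woody keeps internally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical helper: compiles each expression string once and reuses the result.
final class ExpressionCache<T> {

    private final Map<String, T> compiled = new ConcurrentHashMap<>();
    private final Function<String, T> compiler;
    private final AtomicInteger compileCount = new AtomicInteger();

    ExpressionCache(Function<String, T> compiler) {
        this.compiler = compiler;
    }

    // Returns the cached object for expr, compiling it only on first use.
    T get(String expr) {
        return compiled.computeIfAbsent(expr, e -> {
            compileCount.incrementAndGet();
            return compiler.apply(e);
        });
    }

    // Number of times the compiler was actually invoked.
    int compileCount() {
        return compileCount.get();
    }

    public static void main(String[] args) {
        // Pattern is a stand-in for a compiled CSS/XPath expression.
        ExpressionCache<Pattern> cache = new ExpressionCache<>(Pattern::compile);
        Pattern first = cache.get("(\\d+) comments");
        Pattern second = cache.get("(\\d+) comments");
        System.out.println(first == second);      // true: the same instance is reused
        System.out.println(cache.compileCount()); // 1: compiled only once
    }
}
```

Keying the cache on the raw expression string fits annotation-driven extraction, where the same small set of expressions is applied to every page.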

          A complete example:

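          // woody's own imports (AnnotationExtractor, Model, the annotations, ExprType, OP)
          // are omitted because the source does not give their package names.
          import java.util.Date;
          import java.util.List;

          import org.jsoup.Jsoup;
          import org.jsoup.nodes.Document;

          public class OsChinaBlog {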
          
          	public static void main(String[] args) throws Exception {
          		Document doc = Jsoup.connect("http://www.oschina.net/news/43879/webmagic-0-3-0").timeout(60000)
          				.userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0").get();
          		String html = doc.html();
          		OsChinaBlogModel model = AnnotationExtractor.me().process(html, OsChinaBlogModel.class);
          		System.out.println(model.toJson());
          	}
          
          	public static class OsChinaBlogModel extends Model {
          
          		public OsChinaBlogModel() {
          			// no-arg constructor, needed for reflective instantiation
          		}
          
          		@Inject
          		@ComboExtract(value = { @ExtractBy(value = "h1.OSCTitle", type = ExprType.CSS),
          				@ExtractBy(value = "//title/text()", type = ExprType.XPATH) }, op = OP.OR)
          		public String title;
          
          		@Inject
          		@ExtractBy(value = "div.PubDate a[href~=http://my\\.oschina\\.net/]", type = ExprType.CSS)
          		public String author;
          
          		@Inject
          		@ExtractBy(value = "发布于.\\s*(\\d+年\\d+月\\d+日)", type = ExprType.REGEX)
          		public Date publishDate;
          
          		@Inject
          		@ComboExtract(value = {
          				@ExtractBy(value = "div.PubDate", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
          				@ExtractBy(value = "(\\d+)评", type = ExprType.REGEX) }, op = OP.AND)
          		public int commentNum;
          
          		@Inject
          		@ExtractBy(value = "span#p_favor_count", type = ExprType.CSS, setting = @Setting(function = @Function(value = "replace", args = {
          				"+", "" })))
          		public int collectNum;
          
          		@Inject
          		@ComboExtract(value = {
          				@ExtractBy(value = "div[id=userComments]", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
          				@ExtractBy(value = "div.TextContent", type = ExprType.CSS) }, op = OP.AND, multi = true)
          		public List<String> commentContents;
          
          		@Inject
          		@ExtractBy(value = "div[id=toolbar_wrapper]", setting = @Setting(fliters = { "b", "span" }), type = ExprType.CSS, impl = Document.class)
          		public String weibo;
          
          	}
          }
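
The `op = OP.OR` and `op = OP.AND` combinations in the example can be read as two ways of chaining extractors. The sketch below illustrates that reading with JDK regexes only. It assumes that OR means "the first extractor that matches wins" and that AND means "each extractor runs on the previous extractor's output"; the source does not confirm these semantics, and every name here (`ComboSketch`, `Extractor`, `regex`, `or`, `and`) is hypothetical rather than part of woody.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComboSketch {

    // An extractor maps input text to an optional extracted string.
    interface Extractor extends Function<String, Optional<String>> {}

    // Regex extractor: returns group 1 of the first match, if any.
    static Extractor regex(String expr) {
        Pattern p = Pattern.compile(expr);
        return text -> {
            Matcher m = p.matcher(text);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        };
    }

    // Assumed OR semantics: every extractor sees the same input; first hit wins.
    static Extractor or(Extractor... steps) {
        return text -> {
            for (Extractor step : steps) {
                Optional<String> result = step.apply(text);
                if (result.isPresent()) {
                    return result;
                }
            }
            return Optional.empty();
        };
    }

    // Assumed AND semantics: a pipeline; each step runs on the previous result.
    static Extractor and(Extractor... steps) {
        return text -> {
            Optional<String> current = Optional.of(text);
            for (Extractor step : steps) {
                current = current.flatMap(step);
            }
            return current;
        };
    }

    // Mirrors the title field: try the h1 first, fall back to <title>.
    static Optional<String> title(String html) {
        return or(regex("<h1 class=\"OSCTitle\">(.*?)</h1>"),
                  regex("<title>(.*?)</title>")).apply(html);
    }

    // Mirrors the commentNum field: narrow to the PubDate block, then read the
    // number and convert it to int (0 if nothing matched).
    static int commentCount(String html) {
        return and(regex("<div class=\"PubDate\">(.*?)</div>"),
                   regex("(\\d+) comments")).apply(html)
                .map(Integer::parseInt)
                .orElse(0);
    }

    public static void main(String[] args) {
        // Made-up sample markup, in English to keep the sketch ASCII-only.
        String html = "<title>woody demo</title><p>99 comments</p>"
                + "<div class=\"PubDate\">Posted 2013-08-01 (12 comments)</div>";
        System.out.println(title(html).orElse("")); // woody demo
        System.out.println(commentCount(html));     // 12
    }
}
```

In the sample, the AND pipeline returns 12 rather than 99 because the second regex only sees the text inside the `PubDate` block, which is the point of narrowing the input before applying a loose pattern.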