
          crawler-java: a Java crawler framework

          Collaborative post · 2023-09-29 13:38

          A simple, flexible, and powerful Java crawler framework.

          Features:

          1. Simple, readable code that is easy to customize
          2. A simple and easy-to-use API
          3. Support for file download and chunked (partial) fetching
          4. Rich request/response content and options; each request is highly customizable
          5. Custom actions can run before and after each network request
          6. Selenium + PhantomJS support
          7. Redis support
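The feature list mentions chunked (partial) fetching. The framework's own chunking API is not shown in this document, so the sketch below only illustrates the general technique such a feature is built on: splitting a download of known size into contiguous byte ranges, each of which would become an HTTP `Range: bytes=start-end` request. The class and method names here are hypothetical, not part of crawler-java.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; ChunkPlanner is not a crawler-java class.
public class ChunkPlanner {
    /** Splits totalBytes into `parts` contiguous [start, end] ranges (inclusive). */
    public static List<long[]> ranges(long totalBytes, int parts) {
        List<long[]> out = new ArrayList<>();
        long chunk = totalBytes / parts;
        for (int i = 0; i < parts; i++) {
            long start = i * chunk;
            // The last chunk absorbs the division remainder.
            long end = (i == parts - 1) ? totalBytes - 1 : start + chunk - 1;
            out.add(new long[] {start, end});
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] r : ranges(100, 3)) {
            // Each range would become a "Range: bytes=start-end" request header.
            System.out.println("bytes=" + r[0] + "-" + r[1]);
        }
    }
}
```

Fetching the ranges concurrently and reassembling them in order is what makes chunked fetching faster than a single sequential download for large files.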

          Future:

          1. Complete the code comments and tests

          Demo:

          import java.nio.file.Paths;
          import java.util.List;
          import java.util.Map;
          import java.util.UUID;

          import org.apache.http.client.methods.CloseableHttpResponse;
          import org.apache.http.client.methods.HttpUriRequest;
          import org.apache.http.client.protocol.HttpClientContext;
          import org.apache.http.impl.client.BasicCookieStore;
          import org.apache.http.impl.client.CloseableHttpClient;

          import com.github.xbynet.crawler.http.DefaultDownloader;
          import com.github.xbynet.crawler.http.FileDownloader;
          import com.github.xbynet.crawler.http.HttpClientFactory;
          import com.github.xbynet.crawler.parser.JsoupParser;
          import com.github.xbynet.crawler.scheduler.DefaultScheduler;
          // Core types (Processor, Response, Request, Site, Spider, Const,
          // RequestAction) are assumed to live in the framework's root package.

          public class GithubCrawler extends Processor {
          	@Override
          	public void process(Response resp) {
          		String currentUrl = resp.getRequest().getUrl();
          		System.out.println("CurrentUrl:" + currentUrl);
          		int respCode = resp.getCode();
          		System.out.println("ResponseCode:" + respCode);
          		System.out.println("type:" + resp.getRespType().name());
          		String contentType = resp.getContentType();
          		System.out.println("ContentType:" + contentType);
          		Map<String, List<String>> headers = resp.getHeaders();
          		System.out.println("ResponseHeaders:");
          		for (String key : headers.keySet()) {
          			List<String> values = headers.get(key);
          			for (String str : values) {
          				System.out.println(key + ":" + str);
          			}
          		}
          		JsoupParser parser = resp.html();
          		// Supports chunked fetching: the part responses are linked by a parent response.
          		// System.out.println("isParted:"+resp.isPartResponse());
          		// Response parent=resp.getParentResponse();
          		// resp.addPartRequest(null);
          		//Map extras=resp.getRequest().getExtras();
          
          		if (currentUrl.equals("https://github.com/xbynet")) {
          			String avatar = parser.single("img.avatar", "src");
          			String dir = System.getProperty("java.io.tmpdir");
          			String savePath = Paths.get(dir, UUID.randomUUID().toString())
          					.toString();
          			boolean avatarDownloaded = download(avatar, savePath);
          			System.out.println("avatar:" + avatar + ", saved:" + savePath);
          			// System.out.println("avtar downloaded status:"+avatarDownloaded);
          			String name = parser.single(".vcard-names > .vcard-fullname",
          					"text");
          			System.out.println("name:" + name);
          		List<String> reponames = parser.list(
          					".pinned-repos-list .repo.js-repo", "text");
          			List<String> repoUrls = parser.list(
          					".pinned-repo-item .d-block > a", "href");
          			System.out.println("reponame:url");
          			if (reponames != null) {
          				for (int i = 0; i < reponames.size(); i++) {
          					String tmpUrl="https://github.com"+repoUrls.get(i);
          					System.out.println(reponames.get(i) + ":"+tmpUrl);
          					Request req=new Request(tmpUrl).putExtra("name", reponames.get(i));
          					resp.addRequest(req);
          				}
          			}
          		} else {
          			Map<String, Object> extras = resp.getRequest().getExtras();
          			String name = extras.get("name").toString();
          			System.out.println("repoName:"+name);
          			String shortDesc=parser.single(".repository-meta-content","allText");
          			System.out.println("shortDesc:"+shortDesc);
          		}
          	}
          
          	public void start() {
          		Site site = new Site();
          		Spider spider = Spider.builder(this).threadNum(5).site(site)
          				.urls("https://github.com/xbynet").build();
          		spider.run();
          	}
            
          	public static void main(String[] args) {
          		new GithubCrawler().start();
          	}
            
            
          	public void startCompleteConfig() {
          		String pcUA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
          		String androidUA = "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36";
          
          		Site site = new Site();
          		site.setEncoding("UTF-8").setHeader("Referer", "https://github.com/")
          				.setRetry(3).setRetrySleep(3000).setSleep(50).setTimeout(30000)
          				.setUa(pcUA);
          
          		Request request = new Request("https://github.com/xbynet");
          		HttpClientContext ctx = new HttpClientContext();
          		BasicCookieStore cookieStore = new BasicCookieStore();
          		ctx.setCookieStore(cookieStore);
          		request.setAction(new RequestAction() {
          			@Override
          			public void before(CloseableHttpClient client, HttpUriRequest req) {
          				System.out.println("before-haha");
          			}
          
          			@Override
          			public void after(CloseableHttpClient client,
          					CloseableHttpResponse resp) {
          				System.out.println("after-haha");
          			}
          		}).setCtx(ctx).setEncoding("UTF-8")
          				.putExtra("somekey", "I can use in the response by your own")
          				.setHeader("User-Agent", pcUA).setMethod(Const.HttpMethod.GET)
          				.setPartRequest(null).setEntity(null)
          				.setParams("appkeyqqqqqq", "1213131232141").setRetryCount(5)
          				.setRetrySleepTime(10000);
          
          		Spider spider = Spider.builder(this).threadNum(5)
          				.name("Spider-github-xbynet")
          				.defaultDownloader(new DefaultDownloader())
          				.fileDownloader(new FileDownloader())
          				.httpClientFactory(new HttpClientFactory()).ipProvider(null)
          				.listener(null).pool(null).scheduler(new DefaultScheduler())
          				.shutdownOnComplete(true).site(site).build();
          		spider.run();
          	}
          
          
          }
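The `startCompleteConfig` example above sets `setRetry(3)` and `setRetrySleep(3000)` on the `Site`. The framework's actual retry loop is not shown in this document; the sketch below only illustrates the retry-with-fixed-sleep pattern those settings imply. `RetrySupport` and `withRetry` are hypothetical names, not crawler-java API.

```java
import java.util.concurrent.Callable;

// Illustrative sketch only; RetrySupport is not a crawler-java class.
public class RetrySupport {
    /** Runs task, retrying up to `retries` more times, sleeping between attempts. */
    public static <T> T withRetry(Callable<T> task, int retries, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < retries) {
                    Thread.sleep(sleepMs); // back off before the next attempt
                }
            }
        }
        throw last; // every attempt failed; surface the last error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transient failure: fails twice, succeeds on the third call.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("transient");
            return "ok";
        }, 3, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A fixed sleep between attempts, as configured here, is the simplest policy; exponential backoff is a common refinement when crawling rate-limited sites.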
          

          Examples:

          • Github (GitHub user profile info)
          • OSChinaTweets (OSChina tweets)
          • Qiushibaike (Qiushibaike posts)
          • Neihanshequ (Neihanshequ posts)
          • ZihuRecommend (Zhihu recommendations)

          More Examples: Please see here

          Thanks:

          webmagic: this project borrows code from webmagic in many places and its design draws heavily on webmagic; many thanks.
          xsoup: used as the underlying XPath processor.
          JsonPath: used as the underlying JSONPath processor.
          Jsoup: used as the underlying HTML/XML processor.
          HttpClient: used as the underlying HTTP client.
