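Site-wide full-text search with ElasticSearch
Source: blog.csdn.net/weixin_44671737/article/details/114456257
Contents
Abstract
1 Technology selection
1.1 ElasticSearch
1.2 Spring Boot
1.3 ik analyzer
2 Environment setup
3 Project architecture
4 Results
4.1 Search page
4.2 Search results page
5 Code walkthrough
5.1 Entity for full-text search
5.2 Client configuration
5.3 Business logic
5.4 External API
5.5 Pages
6 Summary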

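Abstract
For any company, the volume of data keeps growing, and finding information quickly becomes a hard problem. Computer science has a dedicated field, information retrieval (IR), that studies how to obtain information. Search engines such as Baidu in China belong to this field. Building a search engine from scratch is very hard, yet information lookup matters to every company, so developers can pick one of the open-source projects available to build their own site search. This article builds such an information retrieval project with ElasticSearch.
1 Technology selection
The search engine service uses ElasticSearch; the web service exposed to the outside is built with Spring Boot Web.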
1.1 ElasticSearch
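Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is written in Java, was released as open source under the Apache license, and is a popular enterprise search engine. It is used in cloud computing and offers near-real-time search that is stable, reliable, fast, and easy to install and use.
Official clients are available for Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also built on Lucene. [1]
The open-source search engines seen most often today are ElasticSearch and Solr, both built on Lucene. ElasticSearch is the more heavyweight of the two and performs better in distributed environments; choosing between them depends on the business scenario and the data volume. When the data volume is small, there is no need for a Lucene-based search service at all: querying a relational database is enough.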
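1.2 Spring Boot
Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can “just run”. [2]
Spring Boot is by far the mainstream choice for web development today. Its advantages go beyond development: it also performs well in deployment and operations, and the Spring ecosystem is so influential that mature solutions exist for almost every need.
1.3 ik analyzer
Out of the box, Elasticsearch does not segment Chinese text into words, so a Chinese analysis plugin is required; word segmentation is the foundation of Chinese information retrieval. This project uses ik: download it and place it in the plugins directory of the Elasticsearch installation.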
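One way to see what the plugin does is to ask Elasticsearch's `_analyze` API to tokenize a sample sentence with `ik_max_word` or `ik_smart`. The sketch below only builds such a request with the JDK's `java.net.http` classes (Java 11+). The base URL `http://localhost:9200`, the class name `IkAnalyzeRequest`, and the sample text are assumptions for illustration, not part of the original project.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) a request to Elasticsearch's _analyze API. */
final class IkAnalyzeRequest {

    /** Minimal escaping: backslashes and double quotes only. */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** JSON body asking the named analyzer to tokenize the text. */
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + jsonEscape(analyzer)
                + "\",\"text\":\"" + jsonEscape(text) + "\"}";
    }

    /** POST <baseUrl>/_analyze with a JSON body. */
    static HttpRequest build(String baseUrl, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                .build();
    }

    public static void main(String[] args) {
        // Assumed local node; adjust to your own installation.
        HttpRequest request = build("http://localhost:9200", "ik_max_word", "站内全文搜索");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("ik_max_word", "站内全文搜索"));
    }
}
```

Sending the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a node that has ik installed should return the token list; comparing the output for `ik_max_word` and `ik_smart` shows how fine-grained each analyzer is.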
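2 Environment setup
Install Elasticsearch and, optionally, Kibana; the ik analysis plugin is also required.
Install Elasticsearch from the official Elasticsearch website. The author uses 7.5.1.
Download the ik plugin from its GitHub page. Make sure the ik plugin version matches the Elasticsearch version you downloaded.
Under the plugins directory of the Elasticsearch installation, create a directory named ik and unzip the downloaded plugin into it. Elasticsearch loads the plugin automatically on startup.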
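The directory layout described above can be checked before starting the node. This is a small sketch using only `java.nio.file`; the class name `IkPluginCheck` is hypothetical, and the check only confirms that `plugins/ik` exists and is not empty, not that the plugin version matches.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Checks that an Elasticsearch home directory contains an unpacked ik plugin. */
final class IkPluginCheck {

    /** True if <esHome>/plugins/ik is a directory with at least one entry. */
    static boolean isInstalled(Path esHome) throws IOException {
        Path ikDir = esHome.resolve("plugins").resolve("ik");
        if (!Files.isDirectory(ikDir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(ikDir)) {
            return entries.findAny().isPresent();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the Elasticsearch installation directory as the first argument.
        Path esHome = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("ik plugin present: " + isInstalled(esHome));
    }
}
```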

Create the Spring Boot project: IDEA -> New Project -> Spring Initializr

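3 Project architecture
Fetch the data and analyze it with the ik plugin.
Store the data in the es engine.
Search the stored data through es queries.
Expose the service to the outside through the es Java client.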

4 Results
4.1 Search page
A simple Baidu-style search box is enough.

4.2 Search results page

The first search result, when clicked, opens one of my own blog posts. To avoid copyright issues, the es engine holds only the author's own blog data.

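5 Code walkthrough
5.1 Entity for full-text search
The entity class below is defined from the basic information of a blog post. The key field is each post's url: to read a retrieved article, the user jumps to that url.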
package com.lbh.es.entity;

import com.fasterxml.jackson.annotation.JsonIgnore;
import javax.persistence.*;

/**
 * PUT articles
 * {
 * "mappings":
 * {"properties":{
 * "author":{"type":"text"},
 * "content":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"},
 * "title":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"},
 * "createDate":{"type":"date","format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"},
 * "url":{"type":"text"}
 * } },
 * "settings":{
 *     "index":{
 *       "number_of_shards":1,
 *       "number_of_replicas":2
 *     }
 *   }
 * }
 * ---------------------------------------------------------------------------------------------------------------------
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Entity
@Table(name = "es_article")
public class ArticleEntity {
    // Database primary key; hidden from JSON responses
    @Id
    @JsonIgnore
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private long id;
    @Column(name = "author")
    private String author;
    @Column(name = "content", columnDefinition = "TEXT")
    private String content;
    @Column(name = "title")
    private String title;
    @Column(name = "createDate")
    private String createDate;
    // Link to the original post; search results jump to this url
    @Column(name = "url")
    private String url;

    public String getAuthor() {
        return author;
    }
    public void setAuthor(String author) {
        this.author = author;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getCreateDate() {
        return createDate;
    }
    public void setCreateDate(String createDate) {
        this.createDate = createDate;
    }
    public String getUrl() {
        return url;
    }
    public void setUrl(String url) {
        this.url = url;
    }
}
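Because `createDate` is stored as a plain `String`, its value has to match one of the two formats the mapping lists (`yyyy-MM-dd HH:mm:ss` or `yyyy-MM-dd`). The sketch below shows one way to produce and pre-check such strings with `java.time`. The class name `CreateDateFormat` is hypothetical, and the check is a Java-side approximation, not Elasticsearch's own date parser.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Formats and pre-checks createDate strings for the article mapping. */
final class CreateDateFormat {
    private static final DateTimeFormatter DATE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter DATE_ONLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd");

    /** Renders a timestamp in the first format listed in the mapping. */
    static String format(LocalDateTime value) {
        return DATE_TIME.format(value);
    }

    /** True if the string parses with either of the two mapping formats. */
    static boolean isAccepted(String value) {
        try {
            LocalDateTime.parse(value, DATE_TIME);
            return true;
        } catch (DateTimeParseException ignored) {
            // Not a date-time; fall through and try the date-only format.
        }
        try {
            LocalDate.parse(value, DATE_ONLY);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDateTime.of(2021, 3, 3, 9, 5, 0)));
        System.out.println(isAccepted("2021-03-03"));
        System.out.println(isAccepted("2021/03/03"));
    }
}
```

Calling `article.setCreateDate(CreateDateFormat.format(LocalDateTime.now()))` before indexing would keep the stored value consistent with the mapping.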
5.2 Client configuration
Configure the es client in Java.
package com.lbh.es.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.util.ArrayList;
import java.util.List;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Configuration
public class EsConfig {
    @Value("${elasticsearch.schema}")
    private String schema;
    @Value("${elasticsearch.address}")
    private String address;
    @Value("${elasticsearch.connectTimeout}")
    private int connectTimeout;
    @Value("${elasticsearch.socketTimeout}")
    private int socketTimeout;
    @Value("${elasticsearch.connectionRequestTimeout}")
    private int tryConnTimeout;
    @Value("${elasticsearch.maxConnectNum}")
    private int maxConnNum;
    @Value("${elasticsearch.maxConnectPerRoute}")
    private int maxConnectPerRoute;

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        // Split the comma-separated address list into host:port entries
        List<HttpHost> hostLists = new ArrayList<>();
        String[] hostList = address.split(",");
        for (String addr : hostList) {
            String host = addr.split(":")[0];
            String port = addr.split(":")[1];
            hostLists.add(new HttpHost(host, Integer.parseInt(port), schema));
        }
        // Convert to an HttpHost array
        HttpHost[] httpHost = hostLists.toArray(new HttpHost[]{});
        // Build the connection object
        RestClientBuilder builder = RestClient.builder(httpHost);
        // Timeout settings for the async HTTP client
        builder.setRequestConfigCallback(requestConfigBuilder -> {
            requestConfigBuilder.setConnectTimeout(connectTimeout);
            requestConfigBuilder.setSocketTimeout(socketTimeout);
            requestConfigBuilder.setConnectionRequestTimeout(tryConnTimeout);
            return requestConfigBuilder;
        });
        // Connection pool limits for the async HTTP client
        builder.setHttpClientConfigCallback(httpClientBuilder -> {
            httpClientBuilder.setMaxConnTotal(maxConnNum);
            httpClientBuilder.setMaxConnPerRoute(maxConnectPerRoute);
            return httpClientBuilder;
        });
        return new RestHighLevelClient(builder);
    }
}
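The configuration above expects `elasticsearch.address` to be a comma-separated list of `host:port` entries, for example `127.0.0.1:9200` (an assumed value; the article does not show its properties file). An entry without a port would fail inside the bean with an `ArrayIndexOutOfBoundsException`. The sketch below is a JDK-only variant of the same parsing step that rejects malformed entries with a clearer message; `EsAddressParser` is a hypothetical helper, not part of the original project.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Parses "host1:port1,host2:port2" into unresolved socket addresses. */
final class EsAddressParser {

    static List<InetSocketAddress> parse(String address) {
        List<InetSocketAddress> hosts = new ArrayList<>();
        for (String entry : address.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing comma
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected host:port but got '" + trimmed + "'");
            }
            String host = trimmed.substring(0, colon);
            // NumberFormatException (an IllegalArgumentException) if the port is not numeric
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            hosts.add(InetSocketAddress.createUnresolved(host, port));
        }
        return hosts;
    }

    public static void main(String[] args) {
        for (InetSocketAddress a : parse("127.0.0.1:9200, es-node-2:9201")) {
            System.out.println(a.getHostString() + " -> " + a.getPort());
        }
    }
}
```

Each parsed entry could then be turned into `new HttpHost(a.getHostString(), a.getPort(), schema)` inside the bean method.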
5.3 Business logic
This layer creates and deletes the index, adds articles, and retrieves them; articles can be searched by title, content, and author.
package com.lbh.es.service;

import com.google.gson.Gson;
import com.lbh.es.entity.ArticleEntity;
import com.lbh.es.repository.ArticleRepository;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.stereotype.Service;
import javax.annotation.Resource;
import java.io.IOException;
import java.util.*;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Service
public class ArticleService {
    private static final String ARTICLE_INDEX = "article";
    @Resource
    private RestHighLevelClient client;
    @Resource
    private ArticleRepository articleRepository;

    /** Creates the article index with its settings and ik-based mapping. */
    public boolean createIndexOfArticle(){
        Settings settings = Settings.builder()
                .put("index.number_of_shards", 1)
                .put("index.number_of_replicas", 1)
                .build();
// {"properties":{"author":{"type":"text"},
// "content":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"}
// ,"title":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"}
// ,"createDate":{"type":"date","format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"}
// ,"url":{"type":"text"}
// }}
        // "url" must sit inside "properties", next to the other fields
        String mapping = "{\"properties\":{\"author\":{\"type\":\"text\"},\n" +
                "\"content\":{\"type\":\"text\",\"analyzer\":\"ik_max_word\",\"search_analyzer\":\"ik_smart\"}\n" +
                ",\"title\":{\"type\":\"text\",\"analyzer\":\"ik_max_word\",\"search_analyzer\":\"ik_smart\"}\n" +
                ",\"createDate\":{\"type\":\"date\",\"format\":\"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd\"}\n" +
                ",\"url\":{\"type\":\"text\"}\n" +
                "}}";
        CreateIndexRequest indexRequest = new CreateIndexRequest(ARTICLE_INDEX)
                .settings(settings).mapping(mapping, XContentType.JSON);
        CreateIndexResponse response = null;
        try {
            response = client.indices().create(indexRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (response != null) {
            System.err.println(response.isAcknowledged() ? "success" : "failure");
            return response.isAcknowledged();
        } else {
            return false;
        }
    }

    /** Deletes the whole article index. */
    public boolean deleteArticle(){
        DeleteIndexRequest request = new DeleteIndexRequest(ARTICLE_INDEX);
        try {
            AcknowledgedResponse response = client.indices().delete(request, RequestOptions.DEFAULT);
            return response.isAcknowledged();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return false;
    }

    /** Indexes one article as a JSON document. */
    public IndexResponse addArticle(ArticleEntity article){
        Gson gson = new Gson();
        String s = gson.toJson(article);
        // Create the index request
        IndexRequest indexRequest = new IndexRequest(ARTICLE_INDEX);
        // Document content
        indexRequest.source(s, XContentType.JSON);
        // Send the HTTP request through the client
        IndexResponse re = null;
        try {
            re = client.index(indexRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return re;
    }

    /** Copies every article stored in MySQL into the index. */
    public void transferFromMysql(){
        articleRepository.findAll().forEach(this::addArticle);
    }

    /** Full-text search over author, content, and title. */
    public List<ArticleEntity> queryByKey(String keyword) {
        // Restrict the search to the article index
        SearchRequest request = new SearchRequest(ARTICLE_INDEX);
        /*
         * SearchSourceBuilder holds the search parameters.
         * Compared with matchQuery, multiMatchQuery targets several fields: with a single
         * field name it behaves like matchQuery;
         * with several, such as field1 and field2, a hit contains the text in field1 or in field2.
         */
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders
                .multiMatchQuery(keyword, "author", "content", "title"));
        request.source(searchSourceBuilder);
        List<ArticleEntity> result = new ArrayList<>();
        try {
            SearchResponse search = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : search.getHits()){
                Map<String, Object> map = hit.getSourceAsMap();
                ArticleEntity item = new ArticleEntity();
                item.setAuthor((String) map.get("author"));
                item.setContent((String) map.get("content"));
                item.setTitle((String) map.get("title"));
                item.setUrl((String) map.get("url"));
                result.add(item);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Empty list when nothing matched or the request failed
        return result;
    }

    /** Fetches a single document by its index id. */
    public ArticleEntity queryById(String indexId){
        GetRequest request = new GetRequest(ARTICLE_INDEX, indexId);
        GetResponse response = null;
        try {
            response = client.get(request, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (response != null && response.isExists()){
            Gson gson = new Gson();
            return gson.fromJson(response.getSourceAsString(), ArticleEntity.class);
        }
        return null;
    }
}
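It can help to see what `queryByKey` amounts to on the wire. The sketch below builds, as a plain string, a `multi_match` request body that is roughly equivalent to `QueryBuilders.multiMatchQuery(keyword, "author", "content", "title")`; the body the client actually sends also carries default options, so treat this as an approximation. `MultiMatchBody` is a hypothetical helper with minimal JSON escaping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Builds a multi_match search body for pasting into Kibana or curl. */
final class MultiMatchBody {

    /** Minimal escaping: backslashes and double quotes only. */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** {"query":{"multi_match":{"query":<keyword>,"fields":[...]}}} */
    static String build(String keyword, List<String> fields) {
        String fieldArray = fields.stream()
                .map(f -> "\"" + jsonEscape(f) + "\"")
                .collect(Collectors.joining(","));
        return "{\"query\":{\"multi_match\":{\"query\":\"" + jsonEscape(keyword)
                + "\",\"fields\":[" + fieldArray + "]}}}";
    }

    public static void main(String[] args) {
        // Same three fields the service searches
        System.out.println(build("elasticsearch", Arrays.asList("author", "content", "title")));
    }
}
```

Sending the printed body to `GET article/_search` in Kibana is a quick way to debug relevance without going through the Spring service.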
5.4 External API
This is the same as developing any web program with Spring Boot.
package com.lbh.es.controller;

import com.lbh.es.entity.ArticleEntity;
import com.lbh.es.service.ArticleService;
import org.elasticsearch.action.index.IndexResponse;
import org.springframework.web.bind.annotation.*;
import javax.annotation.Resource;
import java.util.List;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@RestController
@RequestMapping("article")
public class ArticleController {
    @Resource
    private ArticleService articleService;

    // Create the article index
    @GetMapping("/create")
    public boolean create(){
        return articleService.createIndexOfArticle();
    }

    // Delete the article index
    @GetMapping("/delete")
    public boolean delete() {
        return articleService.deleteArticle();
    }

    // Index a single article posted as JSON
    @PostMapping("/add")
    public IndexResponse add(@RequestBody ArticleEntity article){
        return articleService.addArticle(article);
    }

    // Copy all articles from MySQL into the index
    @GetMapping("/transfer")
    public String transfer(){
        articleService.transferFromMysql();
        return "successful";
    }

    // Full-text search by keyword
    @GetMapping("/query")
    public List<ArticleEntity> query(String keyword) {
        return articleService.queryByKey(keyword);
    }
}
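When calling the `/article/query` endpoint from another program, the keyword has to be URL-encoded, which matters especially for Chinese search terms. The sketch below builds the request URI with the JDK (`URLEncoder.encode(String, Charset)` needs Java 10+). The base URL `http://localhost:8080` assumes Spring Boot's default port, and `ArticleQueryUrl` is a hypothetical helper.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds the URI for GET /article/query?keyword=... */
final class ArticleQueryUrl {

    static URI build(String baseUrl, String keyword) {
        // Form-style encoding: spaces become '+', non-ASCII becomes %XX UTF-8 bytes
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/article/query?keyword=" + encoded);
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8080", "elastic search"));
        System.out.println(build("http://localhost:8080", "搜索"));
    }
}
```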
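5.5 Pages
The pages use thymeleaf, mainly because the author really does not know front-end development and only knows a little basic HTML5, so this is just a page that is good enough for a demo.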
Search page
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>YiyiDu</title>
</head>
<body style="font-size: 0px;">
    <div align="center" style="margin-top: 0px;">
        <img src="../static/img/yyd.png" th:src="@{/static/img/yyd.png}" alt="一億度" width="280px" class="pic" />
    </div>
    <div align="center">
        <!-- search input and submit button: markup not preserved in the source -->
    </div>
</body>
</html>
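Search results page
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
    <link rel="stylesheet" href="https://cdn.staticfile.org/twitter-bootstrap/4.3.1/css/bootstrap.min.css">
    <meta charset="UTF-8">
    <title>xx-manager</title>
</head>
<body>
<header th:replace="search.html"></header>
<div class="container my-2">
    <ul th:each="article : ${articles}">
        <a th:href="${article.url}"><li th:text="${article.author}+${article.content}"></li></a>
    </ul>
</div>
</body>
</html>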
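6 Summary
Writing code at work, then more code and blog posts after work, I spent two days studying es, and it turned out to be quite interesting. The foundations of IR today are still statistical, which is why search engines like es perform well on large data sets. Every time I write a hands-on piece I feel a little lost because I don't know what to build, so I'd welcome interesting ideas and will turn them into hands-on projects.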
