1. Hands-on project

What happens when you write the Three Hundred Tang Poems into Elasticsearch?

2. Project description

This is a small project condensed from a real production project, and it covers almost every topic explained so far.

Working through it lets you connect those topics and apply them in practice. It also gives you an end-to-end view of requirements analysis, overall design, data modeling, ingest pipelines, choosing between search and aggregation, and visual analysis in Kibana.

3. Requirements

Data source: https://github.com/xuchunyang/300

Note a bug in the data source: the "id": 178 on line 1753 must be changed by hand to "id": 252.
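
A quick scan for repeated ids can catch this kind of problem before indexing, because a second document with the same _id would overwrite the first one. The sketch below is a minimal illustration; find_duplicate_ids is a hypothetical helper (not part of the original project) that works on the list json.load returns for 300.json, and the sample data is made up.

```python
from collections import Counter


def find_duplicate_ids(records):
    """Return the ids that occur more than once, in ascending order."""
    counts = Counter(record["id"] for record in records)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


# Tiny made-up sample standing in for the parsed 300.json array.
sample = [{"id": 177}, {"id": 178}, {"id": 178}, {"id": 251}]
print(find_duplicate_ids(sample))
```

An empty result means every poem would get its own document.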

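3.1 Data requirements

Points to consider:

1) Dictionary selection
2) Analyzer selection
3) Mapping settings
4) The analysis dimensions that need to be supported
5) Setting the insert time (added dynamically and automatically, not by hand)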

3.2 Write requirements

Points to consider:

1) Cleaning special characters
2) Adding the insert time
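
The article does not say which characters need cleaning, so the following is only a sketch under an assumption: that "special characters" means control characters and stray spaces (including non-breaking and full-width spaces) inside the poem text. clean_text is a hypothetical helper name.

```python
import re

# Assumption: "special characters" here means control characters and stray
# whitespace (including non-breaking and full-width spaces) inside the text.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
_SPACES = re.compile(r"[ \u00a0\u3000]+")


def clean_text(text):
    """Strip control characters and all space characters from a poem field."""
    text = _CONTROL_CHARS.sub("", text)
    return _SPACES.sub("", text)


print(clean_text("打起黄莺儿，\u3000莫教枝上啼。\n"))
```

Cleaning on the client side before the bulk write keeps the length field computed at ingest time free of invisible characters.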

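3.3 Analysis requirements

Search analysis with the query DSL

1) Feihualing (飞花令, a verse-recital game) round: which poems contain each of the terms in 铭毅天下 (matched separately)? How many poems for each?
2) How many poems are by Li Bai (李白)? Sort them by length, shortest first.
3) List the authors of the 10 longest and the 10 shortest poems.

Aggregation analysis and visualization

1) Who has the most poems among the three hundred? Show the top 10.
2) The shares of five-character quatrains (五言绝句) and seven-character regulated verse (七言律诗), with per-author shares for each.
3) A ranking of poems that share the same title.
4) What word cloud do the three hundred poems produce after tokenization?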

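4. Requirements analysis and design

4.1 Requirements analysis

The guiding principle: design before you code.

A common failing among developers is to start typing code as soon as the requirements arrive. Whether a new project is simple or complex, first sort out the requirements, work out the logical architecture, and do the design, so that you have the whole picture before writing any code.

The project covers these core topics:

Elasticsearch data modeling
Elasticsearch bulk writes
Elasticsearch preprocessing (ingest)
Elasticsearch search
Elasticsearch aggregations
Kibana Visualize
Kibana Dashboard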

4.2 Logical architecture

A picture is worth a thousand words.

The requirements lead to the logical architecture below; keep its data flow in mind during development.

4.3 Data modeling

This was covered earlier, but the importance of data modeling is worth repeating.

The data model underpins the systems and the data, and those in turn underpin the business.

A good data model:

Makes systems easier to integrate and simplifies interfaces.
Reduces data redundancy, saves disk space, and improves transfer efficiency.
Accommodates more kinds of data, so adding a new data type does not force changes to the implementation logic.
Opens up more business opportunities and improves business efficiency.
Reduces business risk and lowers business cost.

In Elasticsearch, the core of data modeling is building the mapping.

Given raw JSON data such as:

    {
        "id": 251,
        "contents": "打起黄莺儿，莫教枝上啼。啼时惊妾梦，不得到辽西。",
        "type": "五言绝句",
        "author": "金昌绪",
        "title": "春怨"
    }

our modeling logic is as follows:

Field name     Field type        Notes
_id                              Taken from the auto-increment id
contents       text & keyword    Tokenized; note that fielddata: true must be enabled
type           text & keyword
author         text & keyword
title          text & keyword
timestamp      date              Insert time
cont_length    long              Length of contents, used for sorting

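Because Chinese word segmentation is involved, the choice of analyzer matters a great deal.

The recommendation is again the ik analyzer.

On choosing ik dictionaries: the bundled dictionary is incomplete, so supplement it with dictionaries of common internet vocabulary and domain dictionaries (for example, poetry-related ones) found online.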
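
The field table above can also be written down as a mapping body. The sketch below is one possible rendering rather than the article's own code: text_with_keyword and build_properties are hypothetical helper names, and the sub-field name "keyword" is an assumption that must match whatever name the index template actually uses.

```python
import json


def text_with_keyword(analyzer="ik_max_word", fielddata=False):
    """Build a text field with a keyword sub-field for exact matching."""
    field = {
        "type": "text",
        "analyzer": analyzer,
        # Assumption: the sub-field is called "keyword".
        "fields": {"keyword": {"type": "keyword"}},
    }
    if fielddata:
        # fielddata lets the analyzed text be aggregated (e.g. for a word cloud).
        field["fielddata"] = True
    return field


def build_properties():
    """Mapping properties mirroring the field table."""
    return {
        "contents": text_with_keyword(fielddata=True),
        "type": text_with_keyword(),
        "author": text_with_keyword(),
        "title": text_with_keyword(),
        "timestamp": {"type": "date"},
        "cont_length": {"type": "long"},
    }


print(json.dumps({"mappings": {"properties": build_properties()}}, indent=2))
```

Generating the repeated text-plus-keyword block from one helper keeps the four string fields consistent when the analyzer changes.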

4.4 Outline design

Batch reading and writing of the raw JSON documents combines the low-level Elasticsearch Python API (elasticsearch) with the high-level API (elasticsearch-dsl).
Preprocessing is done with an ingest pipeline. The preprocessing step here: when each poem's JSON is written, a timestamp field is inserted.
The template and mapping are built in Kibana.
Analyzer choice: ik_max_word, which segments at the finest granularity, to get a more fine-grained word cloud.
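
To see what "fine-grained" means for a given line of verse, the _analyze API can be called with each ik analyzer in turn (the ik plugin also ships ik_smart, its coarse-grained counterpart). The following is a hedged sketch using only the standard library; build_analyze_body and analyze are hypothetical helpers, and the node URL in the usage note is an assumption.

```python
import json
import urllib.request


def build_analyze_body(text, analyzer="ik_max_word"):
    """Request body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def analyze(base_url, text, analyzer="ik_max_word"):
    """POST to <base_url>/_analyze and return the list of token strings."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=json.dumps(build_analyze_body(text, analyzer)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Only the request body is printed here; nothing is sent to a cluster.
print(json.dumps(build_analyze_body("举头望明月"), ensure_ascii=False))
```

Against an unsecured local node with the ik plugin installed, analyze("http://localhost:9200", "举头望明月", "ik_smart") and the same call with "ik_max_word" could be compared side by side.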

5. Implementation

5.1 Preprocessing with ingest

Create a pipeline named indexed_at that:

Adds an insert-timestamp field whenever a document is added.
Adds a length field for later sorting.

PUT _ingest/pipeline/indexed_at
{
  "description": "Adds timestamp to documents",
  "processors": [
    {
      "set": {
        "field": "_source.timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "script": {
        "source": "ctx.cont_length = ctx.contents.length();"
      }
    }
  ]
}
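
As a way to reason about what the two processors produce, here is a local Python stand-in for the pipeline. It is an illustration, not something Elasticsearch runs: apply_indexed_at is a hypothetical function, and it assumes Python's len() and the script's contents.length() agree, which holds for characters in the Basic Multilingual Plane such as those in these poems.

```python
from datetime import datetime, timezone


def apply_indexed_at(doc, now=None):
    """Local stand-in for the indexed_at pipeline: returns an enriched copy."""
    enriched = dict(doc)
    # The set processor writes the ingest time into "timestamp".
    enriched["timestamp"] = (now or datetime.now(timezone.utc)).isoformat()
    # The script processor stores the character count of "contents".
    enriched["cont_length"] = len(doc["contents"])
    return enriched


poem = {"contents": "打起黄莺儿，莫教枝上啼。啼时惊妾梦，不得到辽西。", "title": "春怨"}
print(apply_indexed_at(poem)["cont_length"])  # 24: punctuation is counted too
```

The count includes punctuation, which is why a five-character quatrain comes out at 24 rather than 20.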

5.2 Building the mapping and template

The DSL below creates the template my_template.

It specifies the basic settings, alias, and mapping.

The benefits and convenience of templates were explained in detail in an earlier chapter.

PUT _template/my_template
{
  "index_patterns": [
    "some_index*"
  ],
  "aliases": {
    "some_index": {}
  },
  "settings": {
    "index.default_pipeline": "indexed_at",
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "cont_length": {
        "type": "long"
      },
      "author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      },
      "contents": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word",
        "fielddata": true
      },
      "timestamp": {
        "type": "date"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      },
      "type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      }
    }
  }
}

PUT some_index_01

5.3 Reading and writing the data

This is done with the Python code below. Note:

Bulk writes perform far better than writing documents one at a time.
For large files in particular, prefer bulk processing.

import json
import sys

from elasticsearch import Elasticsearch, helpers

# Assumption: an unsecured local node; adjust the URL for your environment.
client = Elasticsearch("http://localhost:9200")


def read_and_write_index():
    # Load the whole JSON array of poems from the data file.
    with open('300.json', encoding="utf8", errors='ignore') as input_file:
        json_array = json.load(input_file)

    # Build one Elasticsearch doc (a dict) per poem.
    doc_list = []
    for item in json_array:
        try:
            doc_list.append({
                # "_id" sets the document ID explicitly from the source id.
                "_id": item['id'],
                "contents": item['contents'],
                "type": item['type'],
                "author": item['author'],
                "title": item['title'],
            })
        except KeyError as err:
            # Report records that lack an expected field and keep going.
            print("ERROR for num:", item.get('id'), "-- missing field:", err)
            print("Dict docs length:", len(doc_list))

    try:
        print("\nAttempting to index the list of docs using helpers.bulk()")
        # Use the helpers library's Bulk API to index the list of docs.
        resp = helpers.bulk(client, doc_list, index="some_index")
        # resp is a (success_count, errors) tuple returned by the helper.
        print("helpers.bulk() RESPONSE:", resp)
        print("helpers.bulk() RESPONSE:", json.dumps(resp, indent=4))
    except Exception as err:
        # Print any errors returned while making the helpers.bulk() API call.
        print("Elasticsearch helpers.bulk() ERROR:", err)
        sys.exit(1)
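
The function above holds every document in a list before sending it. For larger files, a generator that yields one bulk action at a time avoids that. The sketch below is a hedged variation, not the article's code: generate_actions is a hypothetical name, and the explicit "_index" and "_source" keys follow the action format accepted by the helpers module.

```python
def generate_actions(records, index="some_index"):
    """Yield one bulk action per poem instead of building a full list."""
    for record in records:
        yield {
            "_index": index,
            "_id": record["id"],
            "_source": {
                "contents": record["contents"],
                "type": record["type"],
                "author": record["author"],
                "title": record["title"],
            },
        }


# One record copied from the data-modeling example, used as sample input.
sample = [{"id": 251, "contents": "打起黄莺儿，莫教枝上啼。啼时惊妾梦，不得到辽西。",
           "type": "五言绝句", "author": "金昌绪", "title": "春怨"}]
for action in generate_actions(sample):
    print(action["_id"], action["_source"]["title"])
```

With a connected client, the generator could be passed directly as helpers.bulk(client, generate_actions(json_array), chunk_size=500), where chunk_size sets how many actions go into each bulk request.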

5.4 Data analysis

5.5 Search analysis

5.5.1 Feihualing round: which poems contain each of the terms in 铭毅天下 (matched separately)? How many poems for each?

GET some_index/_search
{
  "query": {
    "match": {
      "contents": "铭"
    }
  }
}

GET some_index/_search
{
  "query": {
    "match": {
      "contents": "毅"
    }
  }
}

GET some_index/_search
{
  "query": {
    "match": {
      "contents": "天下"
    }
  }
}

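In practice the results are:

铭: 0 poems
毅: 1 poem
天下: 114 poems

One cannot help remarking that the Tang poets, too, held the world (天下) in their hearts and worried for country and people!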
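
The three requests above differ only in the search term, so from Python they can be generated in a loop. A minimal sketch, assuming the bodies are later sent to some_index/_search (or to _count) by whatever client is in use; build_match_query is a hypothetical helper.

```python
import json


def build_match_query(term, field="contents"):
    """Search body equivalent to one of the match queries above."""
    return {"query": {"match": {field: term}}}


# One request body per term: 铭, 毅 and 天下 are searched separately.
for term in ("铭", "毅", "天下"):
    print(json.dumps(build_match_query(term), ensure_ascii=False))
```

The number of matching poems for each term is reported under hits.total in the search response.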

5.5.2 How many poems are by Li Bai (李白)? Sort them by length, shortest first

POST some_index/_search
{
  "size": 50,
  "query": {
    "match_phrase": {
      "author": "李白"
    }
  },
  "sort": [
    {
      "cont_length": {
        "order": "asc"
      }
    }
  ]
}

POST some_index/_search
{
  "aggs": {
    "genres": {
      "terms": {
        "field": "author.keyword"
      }
    }
  }
}

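Li Bai has 33 poems among the three hundred (second only to Du Fu's 39); the longest is 蜀道难, at 353 characters.

Li Bai and Du Fu truly deserve the titles Poet Immortal (诗仙) and Poet Sage (诗圣), and both were prolific!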
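
Counts like these are easy to cross-check locally against the parsed 300.json without a cluster. The sketch below is an assumed equivalent in plain Python, not what Elasticsearch executes: it compares author names exactly instead of running an analyzed phrase match, and the sample data is made up.

```python
from collections import Counter


def poems_by_author(records, author):
    """Poems by one author, shortest first (mirrors sorting on cont_length asc)."""
    own = [r for r in records if r["author"] == author]
    return sorted(own, key=lambda r: len(r["contents"]))


def top_authors(records, size=10):
    """(author, count) pairs, most prolific first (mirrors a terms aggregation)."""
    return Counter(r["author"] for r in records).most_common(size)


# Made-up miniature data set; the real input would be the parsed 300.json.
sample = [
    {"author": "李白", "title": "A", "contents": "x" * 40},
    {"author": "李白", "title": "B", "contents": "x" * 24},
    {"author": "杜甫", "title": "C", "contents": "x" * 56},
]
print([p["title"] for p in poems_by_author(sample, "李白")])
print(top_authors(sample))
```

If the local numbers and the index disagree, duplicate ids in the source file are a likely first suspect.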

5.5.3 Authors of the 10 longest and 10 shortest poems

POST some_index/_search
{
  "sort": [
    {
      "cont_length": {
        "order": "desc"
      }
    }
  ]
}

POST some_index/_search
{
  "sort": [
    {
      "cont_length": {
        "order": "asc"
      }
    }
  ]
}

Longest poem: 长恨歌 by Bai Juyi (白居易), 960 characters.

Shortest poem: 鹿柴 by Wang Wei (王维), 24 characters (tied with many others).
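
The two sort queries rely on the default page size of 10 and return whole documents. Since only an author list is needed, the request could state the size explicitly and limit the returned fields. This is a sketch of such a body, with top_by_length as a hypothetical helper; the size and _source options are standard search parameters, but their use here is an addition to the article's queries.

```python
def top_by_length(order, size=10):
    """Search body returning only author, title and length, sorted by length."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "size": size,
        "_source": ["author", "title", "cont_length"],
        "sort": [{"cont_length": {"order": order}}],
    }


longest = top_by_length("desc")
shortest = top_by_length("asc")
print(longest)
print(shortest)
```

Because many poems tie at 24 characters, the ten shortest returned are only one possible selection among the ties.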

5.6 Aggregation analysis

The screenshots below were produced in Kibana. The details were covered in the earlier material on Kibana visualization.
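
Kibana visualizations of this kind are backed by aggregation requests. As a rough guide to what subsections 5.6.1 to 5.6.3 compute, the sketch below builds comparable request bodies. They are assumed equivalents, not the requests Kibana generated for the article; the function names are hypothetical, and sub_field must match the name the template gives the keyword sub-field.

```python
def top_authors_agg(sub_field="keyword", size=10):
    """5.6.1: the most prolific authors."""
    return {
        "size": 0,
        "aggs": {
            "authors": {"terms": {"field": f"author.{sub_field}", "size": size}}
        },
    }


def type_share_agg(sub_field="keyword"):
    """5.6.2: poems per verse form, with the authors inside each form."""
    return {
        "size": 0,
        "aggs": {
            "types": {
                "terms": {"field": f"type.{sub_field}"},
                "aggs": {
                    "authors": {"terms": {"field": f"author.{sub_field}"}}
                },
            }
        },
    }


def same_title_agg(sub_field="keyword", size=10):
    """5.6.3: titles shared by at least two poems."""
    return {
        "size": 0,
        "aggs": {
            "titles": {
                "terms": {
                    "field": f"title.{sub_field}",
                    "size": size,
                    "min_doc_count": 2,
                }
            }
        },
    }


print(top_authors_agg())
```

Setting "size": 0 at the top level suppresses the document hits so that only the buckets come back.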

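5.6.1 Who has the most poems among the three hundred? Top 10

5.6.2 Shares of five-character quatrains and seven-character regulated verse, with per-author shares

5.6.3 Ranking of poems sharing the same title

5.6.4 What word cloud do the three hundred poems produce after tokenization?

5.6.5 Global view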

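6. Summary

Using the Three Hundred Tang Poems as the business scenario, and working through the requirements, design, and implementation stages of this small project, you build an overall understanding of the core features of Elasticsearch and Kibana.

The core aim: practice on a small project to improve your ability to deliver real company projects and to develop products.

Something to think about: the word cloud in this article does not look good. Why?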
