Isn't ES good enough? Why are big companies abandoning it and migrating to ClickHouse?
Architecture and Design Comparison

- Client Node: handles API and data-access requests; it does not store or process data.
- Data Node: stores and indexes the data.
- Master Node: the management node, responsible for coordinating the nodes in the cluster; it does not store data.
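As a sketch, the three node types map onto `elasticsearch.yml` role settings in ES 7.x (later versions replaced these flags with a single `node.roles` list); the file names here are illustrative:

```yaml
# master.yml -- dedicated master node: coordinates the cluster, stores no data
node.master: true
node.data: false
node.ingest: false

# data.yml -- data node: stores and indexes data
node.master: false
node.data: true
node.ingest: true

# client.yml -- coordinating-only ("client") node: routes API/search requests
node.master: false
node.data: false
node.ingest: false
```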


To support search, ClickHouse likewise supports bloom filters.
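As an illustration (assuming a MergeTree table with a String column named `message`, matching the `syslog` table defined later in this article), a token-based bloom-filter skip index can be declared like this:

```sql
-- Bloom filter over the tokens of the message column:
-- a 10240-byte filter with 3 hash functions, one filter per 4 granules
ALTER TABLE syslog
    ADD INDEX idx_message_tokens message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Build the index for data that already exists in the table
ALTER TABLE syslog MATERIALIZE INDEX idx_message_tokens;
```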
Query Comparison in Practice

The test architecture consists of four parts:

- ES stack: a single-node Elasticsearch container and a Kibana container. Elasticsearch is one of the systems under test; Kibana serves as a verification and auxiliary tool.
The deployment code is as follows:

```yaml
version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.4.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch
volumes:
  elasticsearch-data:
    driver: local
```
- ClickHouse stack: a single-node ClickHouse server container and a TabixUI container that serves as the ClickHouse client.
The deployment code is as follows:

```yaml
version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M
```
- Data-import stack: data import uses vector, developed by Vector.dev. Similar to fluentd, it provides flexible, pipeline-style data ingestion.
- Test-control stack: for test control I used Jupyter, driving the query tests through the Python SDKs of ES and ClickHouse.
```sql
CREATE TABLE default.syslog
(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + toIntervalMonth(1);
```
```toml
[sources.in]
type = "generator"
format = "syslog"
interval = 0.01
count = 100000

[transforms.clone_message]
type = "add_fields"
inputs = ["in"]
fields.raw = "{{ message }}"

[transforms.parser]
# General
type = "regex_parser"
inputs = ["clone_message"]
field = "message" # optional, default
patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']

[transforms.coercer]
type = "coercer"
inputs = ["parser"]
types.timestamp = "timestamp"
types.version = "int"
types.priority = "int"

[sinks.out_console]
# General
type = "console"
inputs = ["coercer"]
target = "stdout"
# Encoding
encoding.codec = "json"

[sinks.out_clickhouse]
host = "http://host.docker.internal:8123"
inputs = ["coercer"]
table = "syslog"
type = "clickhouse"
encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
encoding.timestamp_format = "unix"

[sinks.out_es]
# General
type = "elasticsearch"
inputs = ["coercer"]
compression = "none"
endpoint = "http://host.docker.internal:9200"
index = "syslog-%F"
# Healthcheck
healthcheck.enabled = true
```
A brief walkthrough of this pipeline:

- sources.in generates simulated syslog data: 100,000 records at an interval of 0.01 seconds
- transforms.clone_message copies the original message into a separate field, so the extracted fields can be kept alongside the raw message
- transforms.parser uses a regular expression to extract the application, hostname, message, mid, pid, priority, timestamp, and version fields according to the syslog format
- transforms.coercer converts the field data types
- sinks.out_console prints the generated data to the console for development and debugging
- sinks.out_clickhouse sends the generated data to ClickHouse
- sinks.out_es sends the generated data to ES
Run the pipeline with the following Docker command:

```shell
docker run \
  -v $(mkfile_path)/vector.toml:/etc/vector/vector.toml:ro \
  -p 18383:8383 \
  timberio/vector:nightly-alpine
```
Return all records

```
# ES
{"query":{"match_all":{}}}

# ClickHouse
SELECT * FROM syslog
```
Match a single field

```
# ES
{"query":{"match":{"hostname":"for.org"}}}

# ClickHouse
SELECT * FROM syslog WHERE hostname='for.org'
```
Match multiple fields

```
# ES
{"query":{"multi_match":{"query":"up.com ahmadajmi","fields":["hostname","application"]}}}

# ClickHouse
SELECT * FROM syslog WHERE hostname='up.com' OR application='ahmadajmi'
```
Term search: find records whose field contains a specific word

```
# ES
{"query":{"term":{"message":"pretty"}}}

# ClickHouse
SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%'
```
范圍查詢, 查找版本大于2的記錄
# ES{"query":{"range":{"version":{"gte":2}}}}# Clickhouse"SELECT * FROM syslog WHERE version >= 2"
Find records where a given field exists

```
# ES
{"query":{"exists":{"field":"application"}}}

# ClickHouse
SELECT * FROM syslog WHERE application IS NOT NULL
```
Regular-expression query: find records matching a given regex

```
# ES
{"query":{"regexp":{"hostname":{"value":"up.*","flags":"ALL","max_determinized_states":10000,"rewrite":"constant_score"}}}}

# ClickHouse
SELECT * FROM syslog WHERE match(hostname, 'up.*')
```
Aggregation count: count the occurrences of a field

```
# ES
{"aggs":{"version_count":{"value_count":{"field":"version"}}}}

# ClickHouse
SELECT count(version) FROM syslog
```
Aggregate distinct values: count the number of distinct values of a field

```
# ES
{"aggs":{"my-agg-name":{"cardinality":{"field":"priority"}}}}

# ClickHouse
SELECT count(distinct(priority)) FROM syslog
```
Using the Python SDKs, I ran each of the queries above 10 times against both stacks and collected the performance statistics.
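The measurement loop can be sketched roughly as follows; the helper name `bench` and the connection details in the usage comment are my illustrative assumptions, not the original notebook code:

```python
import time
import statistics

def bench(run_query, runs=10):
    """Call run_query() `runs` times and return timing stats in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "total": sum(samples),
    }

# Hypothetical usage (requires `pip install elasticsearch clickhouse-driver`
# and both stacks running locally):
#
#   from elasticsearch import Elasticsearch
#   from clickhouse_driver import Client
#   es = Elasticsearch("http://localhost:9200")
#   ch = Client(host="localhost")
#   print(bench(lambda: es.search(index="syslog-*",
#                                 body={"query": {"match_all": {}}})))
#   print(bench(lambda: ch.execute("SELECT * FROM syslog")))
```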
The distribution of response times for all the queries is plotted below:

The total query times compare as follows:

The test data shows that ClickHouse clearly outperforms Elasticsearch on most of the queries, and it holds its own even in scenarios typical of search workloads, such as regex queries and term queries.
Summary
本文通過(guò)對(duì)于一些基本查詢的測(cè)試,對(duì)比了 Clickhouse 和 Elasticsearch 的功能和性能,測(cè)試結(jié)果表明,Clickhouse在這些基本場(chǎng)景表現(xiàn)非常優(yōu)秀,性能優(yōu)于ES,這也解釋了為什么用很多的公司應(yīng)從 ES 切換到 Clickhouse 之上。

