Replacing Prometheus alerting with vmalert
Earlier we showed how vmagent can replace Prometheus for scraping metrics. To replace Prometheus completely, one more important piece is needed: the alerting module. Until now we defined alerting rules in Prometheus, which evaluated them and sent the resulting alerts to Alertmanager. VictoriaMetrics has a dedicated component for the same job: vmalert.
vmalert executes the configured alerting or recording rules against the address given by -datasource.url. Alerts are sent to the Alertmanager configured with -notifier.url, and recording rule results are stored through the remote write protocol, which is why -remoteWrite.url also has to be configured.
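The data flow described above can be pictured with a small conceptual sketch. This is a simplified model, not vmalert's implementation: the Rule class, the evaluate_once function, and the stubbed fake_query data source are all hypothetical, and the `for` pending period is ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# One query result sample: (labels, value).
Sample = Tuple[Dict[str, str], float]


@dataclass
class Rule:
    name: str
    expr: str
    kind: str  # "alert" or "record"


def evaluate_once(rules: List[Rule], query: Callable[[str], List[Sample]]):
    """Evaluate every rule once against the data source.

    Alerting rule results go to the notifier (Alertmanager);
    recording rule results go to remote write storage.
    """
    to_notifier = []
    to_remote_write = []
    for rule in rules:
        for labels, value in query(rule.expr):
            if rule.kind == "alert":
                to_notifier.append({"alertname": rule.name, **labels})
            else:
                to_remote_write.append(({"__name__": rule.name, **labels}, value))
    return to_notifier, to_remote_write


def fake_query(expr: str) -> List[Sample]:
    """Stand-in for the -datasource.url endpoint (hypothetical data)."""
    data = {
        "up == 0": [({"instance": "node1"}, 0.0)],
        "sum(up)": [({}, 3.0)],
    }
    return data.get(expr, [])


rules = [
    Rule("InstanceDown", "up == 0", "alert"),
    Rule("job:up:sum", "sum(up)", "record"),
]
alerts, samples = evaluate_once(rules, fake_query)
print(alerts)
print(samples)
```

In the real component this cycle repeats at a fixed evaluation interval, and alerts additionally pass through a pending state before firing.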
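Features
- Integration with the VictoriaMetrics TSDB
- VictoriaMetrics MetricsQL support and expression validation
- Support for the Prometheus alerting rule definition format
- Integration with Alertmanager
- Alert state can be preserved across restarts
- Graphite data sources can be used for alerting and recording rules
- Replay of recording and alerting rules
- Very lightweight, with no extra dependencies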
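To start using vmalert, you need the following:
- A list of rules: the PromQL/MetricsQL expressions to execute
- A data source address: a reachable VictoriaMetrics instance against which the rules are executed
- A notifier address: a reachable Alertmanager instance that processes and aggregates alerts and sends notifications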
Installation
First we need an Alertmanager to receive the alerts. It was covered in detail in earlier chapters, so we will not repeat that here; the resource manifest is as follows:
# alertmanager.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: kube-vm
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:465'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: ''  # use the authorization code of the NetEase (163) mailbox
      smtp_hello: '163.com'
      smtp_require_tls: false
    route:
      group_by: ['severity', 'source']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      receiver: email
    receivers:
    - name: 'email'
      email_configs:
      - to: '[email protected]'
        send_resolved: true
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-vm
  labels:
    app: alertmanager
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - name: web
      port: 9093
      targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-vm
  labels:
    app: alertmanager
spec:
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      volumes:
        - name: cfg
          configMap:
            name: alert-config
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.21.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/alertmanager/config.yml"
          ports:
            - containerPort: 9093
              name: http
          volumeMounts:
            - mountPath: "/etc/alertmanager"
              name: cfg
Here Alertmanager is configured with only a default route: alerts are grouped by the severity and source labels, and the triggered alerts are then sent to the email receiver.
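To make the grouping concrete, here is a minimal sketch of how alerts sharing the same severity and source values end up in the same notification group. It is a simplified model of group_by, not Alertmanager's code; the group_alerts helper and the sample alerts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[dict], group_by: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Bucket alerts by the values of the group_by labels.

    Alerts with identical values for every listed label land in one
    group and would be delivered together in a single notification.
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# Hypothetical alerts carrying the labels used by the rules in this article.
alerts = [
    {"labels": {"alertname": "NodeMemoryUsage", "severity": "critical",
                "source": "node", "instance": "node1"}},
    {"labels": {"alertname": "NodeMemoryUsage", "severity": "critical",
                "source": "node", "instance": "node2"}},
    {"labels": {"alertname": "PodMemoryUsage", "severity": "warning",
                "source": "pod", "pod": "demo"}},
]

groups = group_alerts(alerts, ["severity", "source"])
for key, names in groups.items():
    print(key, names)
```

With this route the two node alerts share one group and the pod alert forms another, so the email receiver would get two notifications rather than three.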
Next we add the rule configuration used for alerting; the format is the same as in Prometheus:
# vmalert-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmalert-config
  namespace: kube-vm
data:
  record.yaml: |
    groups:
    - name: record
      rules:
      - record: job:node_memory_MemFree_bytes:percent  # name of the recording rule
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
  pod.yaml: |
    groups:
    - name: pod
      rules:
      - alert: PodMemoryUsage
        expr: sum(container_memory_working_set_bytes{pod!=""}) BY (instance, pod) / sum(container_spec_memory_limit_bytes{pod!=""} > 0) BY (instance, pod) * 100 > 60
        for: 2m
        labels:
          severity: warning
          source: pod
        annotations:
          summary: "Pod {{ $labels.pod }} High Memory usage detected"
          description: "{{$labels.instance}}: Pod {{ $labels.pod }} Memory usage is above 60% (current value is: {{ $value }})"
  node.yaml: |
    groups:
    - name: node
      rules:  # the alerting rules themselves
      - alert: NodeMemoryUsage  # name of the alerting rule
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 30
        for: 1m
        labels:
          source: node
          severity: critical
        annotations:
          summary: "Node {{$labels.instance}} High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 30% (current value is: {{ $value }})"
Here we added one recording rule and two alerting rules. More alerting rule examples can be found at https://awesome-prometheus-alerts.grep.to/.
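To see what these expressions compute, the arithmetic of the recording rule and of the NodeMemoryUsage rule can be worked through by hand. The sketch below does so in Python with made-up memory figures (an 8 GiB node); the function names and sample values are hypothetical and only mirror the expressions above.

```python
GIB = 1024 ** 3


def record_rule_value(mem_free: int, mem_total: int) -> float:
    """Mirror of the recording rule: 100 - (100 * MemFree / MemTotal)."""
    return 100 - (100 * mem_free / mem_total)


def node_memory_usage(mem_total: int, mem_free: int, buffers: int, cached: int) -> float:
    """Mirror of the NodeMemoryUsage expression, without the threshold."""
    return (mem_total - (mem_free + buffers + cached)) / mem_total * 100


# Hypothetical node: 8 GiB total, 2 GiB free, 1 GiB buffers, 2 GiB cached.
total, free, buffers, cached = 8 * GIB, 2 * GIB, 1 * GIB, 2 * GIB

recorded = record_rule_value(free, total)
usage = node_memory_usage(total, free, buffers, cached)
breaches_threshold = usage > 30  # the rule compares against 30

print(recorded, usage, breaches_threshold)
```

Note that, despite its name, the recording rule stores the share of memory that is not free (75 for this sample node), since it subtracts the free percentage from 100. The alerting expression also discounts buffers and cache, so it yields a lower figure (37.5 here), which still exceeds the 30 threshold.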
Now the vmalert component can be deployed:
# vmalert.yaml
apiVersion: v1
kind: Service
metadata:
  name: vmalert
  namespace: kube-vm
  labels:
    app: vmalert
spec:
  ports:
    - name: vmalert
      port: 8080
      targetPort: 8080
  type: NodePort
  selector:
    app: vmalert
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmalert
  namespace: kube-vm
  labels:
    app: vmalert
spec:
  selector:
    matchLabels:
      app: vmalert
  template:
    metadata:
      labels:
        app: vmalert
    spec:
      containers:
        - name: vmalert
          image: victoriametrics/vmalert:v1.77.0
          imagePullPolicy: IfNotPresent
          args:
            - -rule=/etc/ruler/*.yaml
            - -datasource.url=http://vmselect.kube-vm.svc.cluster.local:8481/select/0/prometheus
            - -notifier.url=http://alertmanager.kube-vm.svc.cluster.local:9093
            - -remoteWrite.url=http://vminsert.kube-vm.svc.cluster.local:8480/insert/0/prometheus
            - -evaluationInterval=15s
            - -httpListenAddr=0.0.0.0:8080
          volumeMounts:
            - mountPath: /etc/ruler/
              name: ruler
              readOnly: true
      volumes:
        - configMap:
            name: vmalert-config
          name: ruler
The manifest above mounts the rule files into the container as a volume and points -rule at them. -datasource.url is the vmselect address, -notifier.url is the Alertmanager address, and -evaluationInterval sets how often the rules are evaluated. Because we added a recording rule, -remoteWrite.url is also needed to specify the remote write address.
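The three URL flags are base addresses, not full endpoints. As a rough sketch, the snippet below derives the endpoints that vmalert would end up calling, assuming it appends the Prometheus query path, the Alertmanager v2 alerts path, and the remote write path to the respective base URLs. That path mapping is our understanding of the behaviour and should be checked against the documentation of the vmalert version in use; the vmalert_endpoints helper is hypothetical.

```python
from typing import Dict


def vmalert_endpoints(datasource_url: str, notifier_url: str, remote_write_url: str) -> Dict[str, str]:
    """Derive the full endpoints from the three base URLs.

    Assumed suffixes (verify for your vmalert version):
      datasource  -> /api/v1/query   (Prometheus-compatible instant query)
      notifier    -> /api/v2/alerts  (Alertmanager v2 API)
      remoteWrite -> /api/v1/write   (Prometheus remote write)
    """
    def join(base: str, path: str) -> str:
        return base.rstrip("/") + path

    return {
        "query": join(datasource_url, "/api/v1/query"),
        "notify": join(notifier_url, "/api/v2/alerts"),
        "remote_write": join(remote_write_url, "/api/v1/write"),
    }


# Base URLs taken from the Deployment args above.
endpoints = vmalert_endpoints(
    "http://vmselect.kube-vm.svc.cluster.local:8481/select/0/prometheus",
    "http://alertmanager.kube-vm.svc.cluster.local:9093",
    "http://vminsert.kube-vm.svc.cluster.local:8480/insert/0/prometheus",
)
for name, url in endpoints.items():
    print(f"{name}: {url}")
```

This also shows why the cluster URLs carry the /select/0/prometheus and /insert/0/prometheus prefixes: the tenant ID (0 here) is part of the base address, and the Prometheus-compatible path is added after it.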
Apply the manifests above to complete the deployment.
$ kubectl apply -f https://p8s.io/docs/victoriametrics/manifests/alertmanager.yaml
$ kubectl apply -f https://p8s.io/docs/victoriametrics/manifests/vmalert-config.yaml
$ kubectl apply -f https://p8s.io/docs/victoriametrics/manifests/vmalert.yaml
$ kubectl get pods -n kube-vm -l app=alertmanager
NAME                           READY   STATUS    RESTARTS   AGE
alertmanager-d88d95b4f-z2j8g   1/1     Running   0          30m
$ kubectl get svc -n kube-vm -l app=alertmanager
NAME           TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
alertmanager   NodePort   10.100.230.2   <none>        9093:31282/TCP   31m
$ kubectl get pods -n kube-vm -l app=vmalert
NAME                       READY   STATUS    RESTARTS   AGE
vmalert-866674b966-675nb   1/1     Running   0          7m17s
$ kubectl get svc -n kube-vm -l app=vmalert
NAME      TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
vmalert   NodePort   10.104.193.183   <none>        8080:30376/TCP   22m
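Once the deployment succeeds, any alerting rule that reaches its threshold triggers an alert, and the triggered alerts can be viewed on the Alertmanager page.
vmalert also provides a simple page of its own, where all rule groups can be viewed.
The same page shows the state of every rule in the alerting rule list, as well as the details of any individual alerting rule.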
How a triggered alert is delivered, and to which receiver, is decided by Alertmanager. Likewise, the results of the recording rule we added are passed to vminsert via remote write and stored, so they can also be queried through vmselect.
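As a minimal sketch of such a query, the snippet below builds an instant query URL for the recorded series against the vmselect address used earlier and parses a response in the Prometheus HTTP API format. The helper names are hypothetical, and sample_body is a hand-written example response, not real output from the cluster.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

VMSELECT = "http://vmselect.kube-vm.svc.cluster.local:8481/select/0/prometheus"
RECORD_NAME = "job:node_memory_MemFree_bytes:percent"


def instant_query_url(base: str, expr: str) -> str:
    """Build a Prometheus-compatible instant query URL."""
    return base.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


def parse_vector(body: str) -> List[Tuple[str, float]]:
    """Extract (instance, value) pairs from an instant query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    return [
        (item["metric"].get("instance", ""), float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def fetch(base: str, expr: str) -> List[Tuple[str, float]]:
    """Run the query over HTTP; only works where vmselect is reachable."""
    with urlopen(instant_query_url(base, expr), timeout=10) as resp:
        return parse_vector(resp.read().decode("utf-8"))


url = instant_query_url(VMSELECT, RECORD_NAME)
print(url)

# Hand-written sample response, for illustration only.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": RECORD_NAME, "instance": "node1"},
             "value": [1650000000, "75"]},
        ],
    },
})
parsed = parse_vector(sample_body)
print(parsed)
```

Calling fetch(VMSELECT, RECORD_NAME) from inside the cluster would run the real query; from outside, the base URL would have to point at an address where vmselect is exposed.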
At this point we have essentially replaced Prometheus with VictoriaMetrics for monitoring and alerting: vmagent scrapes the metrics, vmalert evaluates the rules and raises alerts, vmstorage stores the metric data, vminsert receives it, and vmselect queries it. Prometheus is no longer needed at all, and this setup performs very well while requiring far fewer resources than Prometheus.