Author: 马哥Linux运维
Source: book.open-falcon.org

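1. Basic Linux operations metrics

In operations work, a problem is not the worst thing that can happen. The worst thing is a problem with no record of what happened, leaving you in the dark. That is why it matters to rely on a strong monitoring system and collect as many metrics as possible. Which metrics are actually meaningful, though? Following the principle that knowledge comes from practice, the experience engineers have accumulated through long hands-on work is the most valuable guide.

Drawing on that long-term practice, we have summarized the metrics that are frequently consulted in system operations. They fall into the following categories: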

CPU
Load
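Memory
Disk
IO
Network
Kernel parameters
ss statistics output
Port collection
Liveness of core service processes
Resource consumption of key business processes
NTP offset collection
DNS resolution collection

The specific metrics in each category are listed below. All of them are supported directly by open-falcon's agent component. falcon-agent collects the metrics at a fixed interval (currently 60 seconds) and reports them to the server side.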

2. CPU metrics

Calculation method: derived from /proc/stat. The statistics output of the sar command is a useful reference for understanding these values.

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: The complement of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: Number of CPU cores.
cpu.switches: Number of CPU context switches; counter type.
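
To make the /proc/stat calculation concrete, here is a minimal Python sketch that turns two samples of the aggregate cpu line into percentages. It is an illustration, not falcon-agent's actual code: the function names and the sample counter values are made up, and guest time is left out of the total because the kernel already counts it inside user time.

```python
# Columns of the aggregate "cpu" line in /proc/stat, in order.
# guest is omitted on purpose: the kernel already includes it in user.
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal"]

def parse_cpu_line(stat_text):
    """Return the aggregate cpu counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            # Very old kernels report fewer columns; treat missing ones as 0.
            values += [0] * (len(CPU_FIELDS) - len(values))
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate cpu line found")

def cpu_percentages(before, after):
    """Convert the counter deltas between two samples into percentages."""
    deltas = {k: after[k] - before[k] for k in CPU_FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    pct = {"cpu." + k: 100.0 * v / total for k, v in deltas.items()}
    pct["cpu.busy"] = 100.0 - pct["cpu.idle"]
    return pct

# Hypothetical samples taken some seconds apart.
SAMPLE_BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0"
SAMPLE_AFTER = "cpu  130 0 60 850 60 0 0 0 0 0\ncpu0 130 0 60 850 60 0 0 0 0 0"

result = cpu_percentages(parse_cpu_line(SAMPLE_BEFORE),
                         parse_cpu_line(SAMPLE_AFTER))
print(result["cpu.idle"], result["cpu.busy"])
```

On a real host the two text samples would come from reading /proc/stat twice with a pause in between.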

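3. Disk metrics

Calculation method: first read /proc/mounts to get all mount points, then use syscall.Statfs_t to get block and inode usage. Every metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point, such as /home, and $fstype is the file system type, such as ext4.

df.bytes.free: available disk space, int64
df.bytes.free.percent: available disk space as a percentage of the total, float64, for example 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used disk space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of available inodes, int64
df.inodes.free.percent: percentage of inodes available, float64
df.inodes.used: number of inodes in use, int64
df.inodes.used.percent: percentage of inodes in use, float64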
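
The statfs-based calculation can be sketched in Python with os.statvfs, which exposes the same block and inode counters as Go's syscall.Statfs_t. This is a rough illustration only: the function names are hypothetical, and the choice to report f_bavail as "free" and to compute percentages against the total are assumptions, not confirmed agent behavior.

```python
import os
from types import SimpleNamespace

def percent(part, whole):
    """Percentage of whole, returning 0.0 for empty pseudo-filesystems."""
    return 100.0 * part / whole if whole else 0.0

def usage_from_statvfs(st):
    """Build df.* style values from a statvfs-like object."""
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize            # space usable by ordinary users
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    inodes_used = st.f_files - st.f_ffree
    return {
        "df.bytes.total": total,
        "df.bytes.free": free,
        "df.bytes.used": used,
        "df.bytes.free.percent": percent(free, total),
        "df.bytes.used.percent": percent(used, total),
        "df.inodes.total": st.f_files,
        "df.inodes.free": st.f_ffree,
        "df.inodes.used": inodes_used,
        "df.inodes.free.percent": percent(st.f_ffree, st.f_files),
        "df.inodes.used.percent": percent(inodes_used, st.f_files),
    }

def df_metrics(mount):
    """Query a real mount point, for example df_metrics('/home')."""
    return usage_from_statvfs(os.statvfs(mount))

# Made-up counters so the arithmetic can be checked by hand.
FAKE_STAT = SimpleNamespace(f_blocks=1000, f_bfree=400, f_bavail=300,
                            f_frsize=4096, f_files=200, f_ffree=50)
print(usage_from_statvfs(FAKE_STAT)["df.bytes.used.percent"])
```

Because some blocks are reserved for root, free and used do not have to add up to the total in this sketch.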

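4. megacli tool output

The megacli tool is used to read RAID information. Every metric carries a set of tags identifying the PD or VD it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID; for example, PD=32:0 denotes the first disk, and VD=0 denotes the first logical disk.

sys.disk.lsiraid.pd.Media_Error_Count: this metric and the three below are currently collected as data only; they do not necessarily mean the disk is damaged (they only indicate a higher probability of damage)
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: a nonzero value means this physical disk has a problem
sys.disk.lsiraid.vd.cache_policy: a nonzero value means this logical disk's cache policy does not match the configured setting
sys.disk.lsiraid.vd.state: a nonzero value means this logical disk has a problem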

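5. SMART tool output

The smartctl tool is used to read disk SMART information. All of these metrics are currently collected as data only; they do not necessarily mean the disk is damaged (they only indicate a higher probability). Every metric carries a set of tags identifying the drive, for example device=/dev/sda.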

sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius

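6. Partition read/write monitoring

Tests whether every mounted partition can be read and written. Every metric carries a set of tags identifying the mount point, for example mount=/home.

sys.disk.rw: a nonzero value means reads or writes on this partition are failing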
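
One plausible way to implement such a check is to write a small probe file under the mount point and read it back. The sketch below assumes that approach; the source does not say how the agent performs the test, and partition_rw_status is a hypothetical name.

```python
import os
import tempfile

def partition_rw_status(mount_point):
    """Return 0 if a probe file can be written and read back, else 1."""
    payload = b"rw-check"
    try:
        # The probe file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=mount_point) as probe:
            probe.write(payload)
            probe.flush()
            os.fsync(probe.fileno())
            probe.seek(0)
            return 0 if probe.read() == payload else 1
    except OSError:
        # Read-only file system, missing directory, I/O error, and so on.
        return 1

with tempfile.TemporaryDirectory() as scratch:
    print(partition_rw_status(scratch))
```

Following the convention above, 0 means healthy and any other value signals a problem.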

7. IO metrics

Calculation method: /proc/diskstats is sampled once per second and the differences are computed; all of these are counter type. Every metric carries a tag of the form device=$device identifying the specific device, such as sda1 or sdb. See the iostat documentation for the meaning of each metric.

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: a value in bytes
disk.io.write_bytes: a value in bytes
disk.io.avgrq_sz: this and the following values are the ones shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%
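
The sample-and-diff approach can be sketched as follows. This is an assumption-laden illustration: the output key names are invented to avoid implying they match the agent's exact values, the sample lines are made up, and the await and util formulas follow the usual iostat definitions. Sectors in /proc/diskstats are always 512 bytes.

```python
# The eleven counters that follow major, minor and device name in /proc/diskstats.
DISKSTATS_FIELDS = [
    "read_requests", "read_merged", "read_sectors", "msec_read",
    "write_requests", "write_merged", "write_sectors", "msec_write",
    "ios_in_progress", "msec_total", "msec_weighted_total",
]

def parse_diskstats(text):
    """Return {device: {field: value}} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue
        values = [int(v) for v in parts[3:14]]
        stats[parts[2]] = dict(zip(DISKSTATS_FIELDS, values))
    return stats

def io_rates(before, after, interval_sec):
    """Derive iostat-like figures from two samples of one device."""
    d = {k: after[k] - before[k] for k in DISKSTATS_FIELDS}
    ios = d["read_requests"] + d["write_requests"]
    return {
        "read_bytes_per_sec": d["read_sectors"] * 512 / interval_sec,
        "write_bytes_per_sec": d["write_sectors"] * 512 / interval_sec,
        # Average time per completed request, in ms.
        "await_ms": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Share of the interval during which the device was busy.
        "util_percent": 100.0 * d["msec_total"] / (interval_sec * 1000),
    }

# Hypothetical samples taken one second apart.
SAMPLE_T0 = "   8       0 sda 1000 10 80000 500 2000 20 160000 1500 0 900 2000"
SAMPLE_T1 = "   8       0 sda 1100 10 82000 600 2100 20 164000 1700 0 1400 2300"

rates = io_rates(parse_diskstats(SAMPLE_T0)["sda"],
                 parse_diskstats(SAMPLE_T1)["sda"], 1)
print(rates)
```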

8. Load metrics

Calculation method: read /proc/loadavg; all of these are raw-value type:

load.1min
load.5min
load.15min
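
Reading the three load averages takes only a few lines. The sketch below parses the text of /proc/loadavg; the function name and the sample line are illustrative.

```python
def parse_loadavg(text):
    """Map the first three fields of /proc/loadavg to load.* metrics."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}

# Hypothetical content of /proc/loadavg.
SAMPLE_LOADAVG = "0.42 0.35 0.30 2/345 12345\n"
print(parse_loadavg(SAMPLE_LOADAVG))
```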

9. Memory metrics

Calculation method: read the contents of /proc/meminfo. Here mem.memfree is free+buffers+cached, and mem.memused=mem.memtotal-mem.memfree. See the output and documentation of the free command for the meaning of each metric.

mem.memtotal: total memory size
mem.memused: amount of memory in use
mem.memused.percent: percentage of memory in use
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap in use
mem.swapused.percent: percentage of swap in use
mem.swapfree
mem.swapfree.percent
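
The formulas above translate directly into code. The following sketch applies them to /proc/meminfo text; it keeps values in kB as the file reports them (the unit the agent reports is not stated in the source), and the sample numbers are invented for easy arithmetic.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for sized entries)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info

def percent(part, whole):
    return 100.0 * part / whole if whole else 0.0

def mem_metrics(info):
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]  # free+buffers+cached
    used = total - free
    swap_total = info["SwapTotal"]
    swap_free = info["SwapFree"]
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memfree.percent": percent(free, total),
        "mem.memused.percent": percent(used, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": percent(swap_free, swap_total),
        "mem.swapused.percent": percent(swap_used, swap_total),
    }

# Hypothetical excerpt of /proc/meminfo.
SAMPLE_MEMINFO = """MemTotal:        8000 kB
MemFree:         1000 kB
Buffers:          500 kB
Cached:          2500 kB
SwapTotal:       2000 kB
SwapFree:        1500 kB
"""
print(mem_metrics(parse_meminfo(SAMPLE_MEMINFO)))
```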

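10. Network metrics

Calculation method: read the contents of /proc/net/dev. Every metric carries a tag of the form iface=$iface identifying the specific interface, such as eth0. Metrics containing in describe inbound traffic, those containing out describe outbound traffic, and total is the sum in+out. The supported metrics are: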

net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes
net.if.total.dropped
net.if.total.errors
net.if.total.packets
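
As an illustration of how these names line up with the sixteen columns of /proc/net/dev, here is a small Python sketch. The mapping from columns to metric names is inferred from the list above and the file's header row, not taken from the agent's source, and the sample data is invented.

```python
# Receive and transmit columns of /proc/net/dev, in file order.
RX_FIELDS = ["bytes", "packets", "errors", "dropped", "fifo.errs",
             "frame.errs", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errors", "dropped", "fifo.errs",
             "collisions", "carrier.errs", "compressed"]

def parse_net_dev(text):
    """Return {iface: {metric: value}} from /proc/net/dev text."""
    result = {}
    for line in text.splitlines():
        iface, sep, rest = line.partition(":")
        values = rest.split()
        if not sep or len(values) < 16:
            continue  # header rows contain no colon
        nums = [int(v) for v in values[:16]]
        metrics = {}
        for name, value in zip(RX_FIELDS, nums[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(TX_FIELDS, nums[8:]):
            metrics["net.if.out." + name] = value
        for name in ("bytes", "packets", "errors", "dropped"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result

# Hypothetical /proc/net/dev content.
SAMPLE_NET_DEV = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0
  eth0: 1000 10 1 2 0 0 0 0 2000 20 3 4 0 0 0 0
"""
print(parse_net_dev(SAMPLE_NET_DEV)["eth0"]["net.if.total.bytes"])
```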

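11. Port metrics

Calculation method: ss -ln is used to determine whether the specified port is in the listen state. This is a raw-value type; the value is either 1 (listening) or 0 (not listening). Every metric carries a tag of the form port=$port, where $port is the specific port.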

net.port.listen
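
A rough sketch of the check, operating on captured ss -ln output, might look like this. It assumes the output has a leading Netid column and only considers TCP sockets in the LISTEN state; the sample text and function names are made up, and the agent's real parsing may differ.

```python
def listening_ports(ss_output):
    """Collect TCP ports in LISTEN state from `ss -ln` style output."""
    ports = set()
    for line in ss_output.splitlines():
        cols = line.split()
        if len(cols) < 5 or cols[0] != "tcp" or cols[1] != "LISTEN":
            continue
        # The local address looks like 0.0.0.0:22, *:80 or [::]:8080.
        _, sep, port = cols[4].rpartition(":")
        if sep and port.isdigit():
            ports.add(int(port))
    return ports

def port_listen(ss_output, port):
    """Return 1 if the port is listening, otherwise 0."""
    return 1 if port in listening_ports(ss_output) else 0

# Hypothetical `ss -ln` output.
SAMPLE_SS = """Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str LISTEN 0      128    /run/dbus/system_bus_socket 12345 * 0
udp   UNCONN 0      0      0.0.0.0:68         0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
tcp   LISTEN 0      128    [::]:8080          [::]:*
"""
print(port_listen(SAMPLE_SS, 22), port_listen(SAMPLE_SS, 3306))
```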

12. Kernel configuration

kernel.maxfiles: read from /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: value = kernel.maxfiles - kernel.files.allocated
kernel.maxproc: read from /proc/sys/kernel/pid_max
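
The relationship between the three file-handle metrics is simple arithmetic, sketched below on the text of the two /proc files. The function name and sample values are illustrative.

```python
def kernel_file_metrics(file_max_text, file_nr_text):
    """Derive kernel.* file-handle metrics from file-max and file-nr text."""
    maxfiles = int(file_max_text.split()[0])
    allocated = int(file_nr_text.split()[0])   # first field of file-nr
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
    }

# Hypothetical contents of /proc/sys/fs/file-max and /proc/sys/fs/file-nr.
SAMPLE_FILE_MAX = "811234\n"
SAMPLE_FILE_NR = "3456\t0\t811234\n"
print(kernel_file_metrics(SAMPLE_FILE_MAX, SAMPLE_FILE_NR))
```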

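13. NTP metrics

ntpq -pn is used to obtain the local machine's time offset relative to the NTP server.

sys.ntp.offset: local time offset in ms; a value that is too large, or a value of 0, indicates an anomaly and should trigger an alert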
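
As a sketch of how the offset could be extracted, the code below parses captured ntpq -pn output and returns the offset column of the peer marked with *, which ntpq uses for the currently selected system peer. Picking that peer is an assumption about what the agent does; the sample output and function name are made up.

```python
def ntp_offset_ms(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if not synced."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # only the system peer carries the * tally code
        cols = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(cols) >= 10:
            return float(cols[8])
    return None

# Hypothetical `ntpq -pn` output.
SAMPLE_NTPQ = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        192.168.1.1      2 u   33   64  377    0.512   -1.234   0.087
+10.0.0.2        192.168.1.1      2 u   40   64  377    0.634    0.456   0.102
"""
print(ntp_offset_ms(SAMPLE_NTPQ))
```

Returning None when no peer is selected lets the caller tell "not synchronized" apart from a genuine offset reading.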

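14. Process monitoring

proc.num: the number of instances of a given process. There are two cases. One matches by process name, for example name=sshd. The other matches by cmdline: Java application processes, for example, may all be named java, so matching by name cannot tell them apart. In that case configure cmdline instead, for example cmdline=./falcon_agent-c./cfg.ini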
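
The two matching modes can be sketched as follows. The matching rules here (exact match on name, substring match on cmdline) are assumptions for illustration, as are the function names and the sample process list.

```python
import os

def read_processes(proc_root="/proc"):
    """List (name, cmdline) pairs for running processes on a Linux host."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                raw = f.read().replace(b"\0", b" ")
                cmdline = raw.decode(errors="replace").strip()
        except OSError:
            continue  # the process exited while it was being read
        procs.append((name, cmdline))
    return procs

def count_processes(procs, name=None, cmdline=None):
    """Count processes whose name equals `name` or whose cmdline contains `cmdline`."""
    count = 0
    for proc_name, proc_cmdline in procs:
        if name is not None and proc_name != name:
            continue
        if cmdline is not None and cmdline not in proc_cmdline:
            continue
        count += 1
    return count

# Hypothetical process list.
SAMPLE_PROCS = [
    ("sshd", "/usr/sbin/sshd -D"),
    ("java", "java -jar /opt/app-a.jar"),
    ("java", "java -jar /opt/app-b.jar"),
]
print(count_processes(SAMPLE_PROCS, name="java"),
      count_processes(SAMPLE_PROCS, cmdline="app-a.jar"))
```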

15. Process resource monitoring

process.cpu.all: sys+user CPU used by the process and its child processes, in jiffies
process.cpu.sys: sys CPU used by the process and its child processes, in jiffies
process.cpu.user: user CPU used by the process and its child processes, in jiffies
process.swap: swap used by the process and its child processes, in pages
process.fd: number of file descriptors used by the process
process.mem: memory occupied by the process, in bytes
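
For the CPU figures, /proc/<pid>/stat exposes utime, stime, cutime and cstime in jiffies (fields 14 to 17). The sketch below adds the process's own counters to those of its waited-for children, which is one reading of "the process and its child processes"; whether the agent sums them exactly this way is an assumption, and the sample line is invented.

```python
def process_cpu_jiffies(stat_line):
    """Extract user/sys CPU time (jiffies) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split only what follows the last closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field n lives at rest[n - 3].
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }

# Hypothetical /proc/<pid>/stat content (truncated after a few fields).
SAMPLE_STAT = ("1234 (my app) S 1 1234 1234 0 -1 4194304 500 0 0 0 "
               "120 30 5 2 20 0 1 0 100 1000000 200")
print(process_cpu_jiffies(SAMPLE_STAT))
```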

16. ss command output

ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab
