The Principles and Implementation of pprof
wziww is the friend who helps me keep golang-notes up to date; this chapter on the principles and implementation of pprof was written by him. If this article earns any tips, they will be forwarded to him in full.
This chapter does not cover how to use pprof or its surrounding tools. Instead, it analyzes how runtime pprof is implemented, with the goal of giving readers a reference for using it well. Before we dive in, let's look at three questions. They are probably the biggest sources of confusion for most pprof users, so keep them in mind throughout the analysis that follows:
1. How much pressure does enabling pprof put on the runtime?
2. Can pprof be selectively turned on and off at suitable times for an application running in production?
3. How does pprof actually work?
Go's built-in pprof API lives in the runtime/pprof package. It lets us interact with the runtime and analyze various metrics of a running application to help with performance tuning and troubleshooting. Alternatively, you can simply import _ "net/http/pprof" and use the built-in HTTP endpoints; the pprof code in the net module is just a set of wrappers that Go provides around runtime/pprof, and of course you can also call runtime/pprof directly yourself.
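For orientation, here is a minimal sketch of the two usage styles just mentioned (the port is illustrative, and a real program would keep running); pprof.Profiles() enumerates exactly the built-in profiles registered in the map shown below.
package main

import (
    "fmt"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
    "runtime/pprof"
)

func main() {
    // Style 1: expose the HTTP endpoints, then inspect them with e.g.
    // go tool pprof http://localhost:6060/debug/pprof/heap
    go http.ListenAndServe("localhost:6060", nil)

    // Style 2: call runtime/pprof directly; here we just list the built-in profiles.
    for _, p := range pprof.Profiles() {
        fmt.Println(p.Name(), p.Count())
    }
}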
// src/runtime/pprof/pprof.go
// available profile categories
profiles.m = map[string]*Profile{
"goroutine": goroutineProfile,
"threadcreate": threadcreateProfile,
"heap": heapProfile,
"allocs": allocsProfile,
"block": blockProfile,
"mutex": mutexProfile,
}
allocs
var allocsProfile = &Profile{
name: "allocs",
count: countHeap, // identical to heap profile
write: writeAlloc,
}
writeAlloc mainly involves the following APIs: ReadMemStats(m *MemStats) and MemProfile(p []MemProfileRecord, inuseZero bool)
// ReadMemStats populates m with memory allocator statistics.
//
// The returned memory allocator statistics are up to date as of the
// call to ReadMemStats. This is in contrast with a heap profile,
// which is a snapshot as of the most recently completed garbage
// collection cycle.
func ReadMemStats(m *MemStats) {
// stop-the-world operation
stopTheWorld("read mem stats")
// switch to the system stack
systemstack(func() {
// copy the global memstats into m
readmemstats_m(m)
})
startTheWorld()
}
// MemProfile returns a profile of memory allocated and freed per allocation
// site.
//
// MemProfile returns n, the number of records in the current memory profile.
// If len(p) >= n, MemProfile copies the profile into p and returns n, true.
// If len(p) < n, MemProfile does not change p and returns n, false.
//
// If inuseZero is true, the profile includes allocation records
// where r.AllocBytes > 0 but r.AllocBytes == r.FreeBytes.
// These are sites where memory was allocated, but it has all
// been released back to the runtime.
//
// The returned profile may be up to two garbage collection cycles old.
// This is to avoid skewing the profile toward allocations; because
// allocations happen in real time but frees are delayed until the garbage
// collector performs sweeping, the profile only accounts for allocations
// that have had a chance to be freed by the garbage collector.
//
// Most clients should use the runtime/pprof package or
// the testing package's -test.memprofile flag instead
// of calling MemProfile directly.
func MemProfile(p []MemProfileRecord, inuseZero bool) (n int, ok bool) {
lock(&proflock)
// If we're between mProf_NextCycle and mProf_Flush, take care
// of flushing to the active profile so we only have to look
// at the active profile below.
mProf_FlushLocked()
clear := true
/*
* Remember this mbuckets -- memory profile buckets.
* The allocs samples are all recorded in this global variable; it is analyzed in detail below.
* -------------------------------------------------
* (gdb) info variables mbuckets
* All variables matching regular expression "mbuckets":
* File runtime:
* runtime.bucket *runtime.mbuckets;
* (gdb)
*/
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
n++
}
if mp.active.allocs != 0 || mp.active.frees != 0 {
clear = false
}
}
if clear {
// Absolutely no data, suggesting that a garbage collection
// has not yet happened. In order to allow profiling when
// garbage collection is disabled from the beginning of execution,
// accumulate all of the cycles, and recount buckets.
n = 0
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
for c := range mp.future {
mp.active.add(&mp.future[c])
mp.future[c] = memRecordCycle{}
}
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
n++
}
}
}
if n <= len(p) {
ok = true
idx := 0
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
// copy data out of mbuckets
record(&p[idx], b)
idx++
}
}
}
unlock(&proflock)
return
}
To sum up, the operations involved in pprof/allocs:
a brief STW and a systemstack switch to collect runtime information; then the global mbuckets object is copied and its values are returned to the user
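For completeness, here is a minimal sketch of calling runtime.MemProfile directly, using the grow-and-retry pattern its documentation describes (most programs should go through runtime/pprof instead; the slack of 50 records is arbitrary):
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Ask for the record count first, then grow the slice with some slack and
    // retry, since more records may appear between the two calls.
    var records []runtime.MemProfileRecord
    n, ok := runtime.MemProfile(nil, true)
    for {
        records = make([]runtime.MemProfileRecord, n+50)
        n, ok = runtime.MemProfile(records, true)
        if ok {
            records = records[:n]
            break
        }
    }
    for _, r := range records {
        fmt.Printf("in use: %d bytes, allocated: %d bytes\n", r.InUseBytes(), r.AllocBytes)
    }
}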
mbuckets
As mentioned above, the core of pprof/allocs is the handling of mbuckets; the diagram below gives a simple picture of how mbuckets is maintained.
var mbuckets *bucket // memory profile buckets
type bucket struct {
next *bucket
allnext *bucket
typ bucketType // memBucket or blockBucket (includes mutexProfile)
hash uintptr
size uintptr
nstk uintptr
}
---------------
| user access |
---------------
|
------------------ |
| mbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into mbuckets
|
|
--------------------------------------
| func stkbucket & typ == memProfile |
--------------------------------------
|
----------------
| mProf_Malloc | // record stack trace and related info
----------------
|
----------------
| profilealloc | // next_sample calculation
----------------
|
| /*
| * if rate := MemProfileRate; rate > 0 {
| * if rate != 1 && size < c.next_sample {
| * c.next_sample -= size
| sample * } else {
| record * mp := acquirem()
| * profilealloc(mp, x, size)
| * releasem(mp)
| * }
| * }
| */
|
------------ not sampled
| mallocgc |-----------...
------------
From the diagram above we can clearly see that when allocating memory the runtime samples allocations according to a certain policy and records them into mbuckets, so that users can analyze them later. The sampling algorithm has one important dependency: MemProfileRate.
// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.
//
// To include every allocated block in the profile, set MemProfileRate to 1.
// To turn off profiling entirely, set MemProfileRate to 0.
//
// The tools that process the memory profiles assume that the
// profile rate is constant across the lifetime of the program
// and equal to the current value. Programs that change the
// memory profiling rate should do so just once, as early as
// possible in the execution of the program (for example,
// at the beginning of main).
var MemProfileRate int = 512 * 1024
The default is 512 KB, and it can be configured by the user.
It is worth noting that, since enabling pprof adds some extra sampling pressure and overhead, the Go team has, in newer toolchains, started configuring this variable selectively in order to change[1] the enabled-by-default situation.
Concretely, if the code never references the related symbols, the initial value is set to 0 at link time; otherwise it stays at the default (512 KB).
(This article is based on Go 1.14.3; if you observe differences, check your version.)
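A minimal sketch of tuning the rate yourself; the value 64 KB is purely illustrative:
package main

import "runtime"

func init() {
    // Change MemProfileRate only once, as early as possible, per its documentation.
    // 1 records every allocation (expensive); 0 disables memory profiling entirely.
    runtime.MemProfileRate = 64 * 1024 // illustrative: sample roughly one allocation per 64 KB allocated
}

func main() {}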
pprof/allocs summary
When enabled, it puts extra pressure on the runtime: sampling records extra information at runtime malloc time for later analysis. You can decide whether to enable it, and at what sampling rate, by setting runtime.MemProfileRate. Go versions differ in whether it is enabled by default, which depends on whether user code references (as seen by the linker) the related modules/variables; the default rate is 512 KB.
The allocs profile also contains an approximate view of the heap, which is analyzed in the next section.
heap
allocs: A sampling of all past memory allocations
heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.
Comparing the official descriptions of allocs and heap: one analyzes all past memory allocations, the other the allocations of live objects currently on the heap. The heap profile can additionally take a parameter that runs a GC before the sample is taken.
That sounds like a big difference... but at the code level, apart from the fact that a GC can be triggered manually and that the type of the generated file differs (when debug == 0), there is essentially no difference between the two.
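A small sketch of how the two look from code (file names are illustrative); the runtime.GC() call mimics what net/http/pprof's ?gc=1 parameter does before a heap sample:
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    // allocs: cumulative allocation profile since program start.
    fa, _ := os.Create("allocs.out")
    defer fa.Close()
    pprof.Lookup("allocs").WriteTo(fa, 0)

    // heap: the same underlying samples; optionally run a GC first so the
    // in-use numbers reflect only live objects.
    runtime.GC()
    fh, _ := os.Create("heap.out")
    defer fh.Close()
    pprof.Lookup("heap").WriteTo(fh, 0)
}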
heap sampling (pseudo)
// p holds the MemProfileRecord samples mentioned above
for _, r := range p {
hideRuntime := true
for tries := 0; tries < 2; tries++ {
stk := r.Stack()
// For heap profiles, all stack
// addresses are return PCs, which is
// what appendLocsForStack expects.
if hideRuntime {
for i, addr := range stk {
if f := runtime.FuncForPC(addr); f != nil && strings.HasPrefix(f.Name(), "runtime.") {
continue
}
// Found non-runtime. Show any runtime uses above it.
stk = stk[i:]
break
}
}
locs = b.appendLocsForStack(locs[:0], stk)
if len(locs) > 0 {
break
}
hideRuntime = false // try again, and show all frames next time.
}
// rate is runtime.MemProfileRate
values[0], values[1] = scaleHeapSample(r.AllocObjects, r.AllocBytes, rate)
values[2], values[3] = scaleHeapSample(r.InUseObjects(), r.InUseBytes(), rate)
var blockSize int64
if r.AllocObjects > 0 {
blockSize = r.AllocBytes / r.AllocObjects
}
b.pbSample(values, locs, func() {
if blockSize != 0 {
b.pbLabel(tagSample_Label, "bytes", "", blockSize)
}
})
}
// scaleHeapSample adjusts the data from a heap Sample to
// account for its probability of appearing in the collected
// data. heap profiles are a sampling of the memory allocations
// requests in a program. We estimate the unsampled value by dividing
// each collected sample by its probability of appearing in the
// profile. heap profiles rely on a poisson process to determine
// which samples to collect, based on the desired average collection
// rate R. The probability of a sample of size S to appear in that
// profile is 1-exp(-S/R).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
if count == 0 || size == 0 {
return 0, 0
}
if rate <= 1 {
// if rate==1 all samples were collected so no adjustment is needed.
// if rate<1 treat as unknown and skip scaling.
return count, size
}
avgSize := float64(size) / float64(count)
scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))
return int64(float64(count) * scale), int64(float64(size) * scale)
}
Why the "pseudo" in the title? As the code above shows, building the profile does not actually scan all the memory on the heap (which would hardly be realistic); instead, the previously sampled data is used to compute an estimate of the heap state (from live object counts, sizes, the sampling rate, and so on). As a reference this is generally good enough.
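As a quick, made-up illustration of scaleHeapSample: with the default rate of 512 KB and a bucket whose average sample size is 4 KB, each collected sample is scaled up to stand in for roughly 128 real allocations.
package main

import (
    "fmt"
    "math"
)

func main() {
    rate := float64(512 * 1024) // default MemProfileRate
    avgSize := float64(4096)    // hypothetical average allocation size in one bucket
    // Same formula as scaleHeapSample: the probability of being sampled is 1-exp(-S/R).
    scale := 1 / (1 - math.Exp(-avgSize/rate))
    fmt.Printf("each sample represents ~%.1f allocations\n", scale) // prints ~128.5
}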
goroutine
When debug >= 2, the goroutine stacks are dumped directly; see the stack[2] chapter for details.
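A minimal sketch of fetching the goroutine profile at the different debug levels (writing to stdout is just for illustration):
package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    p := pprof.Lookup("goroutine")
    // debug=1: counts aggregated per unique stack, symbolized.
    p.WriteTo(os.Stdout, 1)
    // debug=2: a full dump of every goroutine's stack, in the same format as an
    // unrecovered panic -- this is the "dump stacks directly" path mentioned above.
    p.WriteTo(os.Stdout, 2)
}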
// fetch == runtime.GoroutineProfile
func writeRuntimeProfile(w io.Writer, debug int, name string, fetch func([]runtime.StackRecord) (int, bool)) error {
// Find out how many records there are (fetch(nil)),
// allocate that many records, and get the data.
// There's a race—more records might be added between
// the two calls—so allocate a few extra records for safety
// and also try again if we're very unlucky.
// The loop should only execute one iteration in the common case.
var p []runtime.StackRecord
n, ok := fetch(nil)
for {
// Allocate room for a slightly bigger profile,
// in case a few more entries have been added
// since the call to ThreadProfile.
p = make([]runtime.StackRecord, n+10)
n, ok = fetch(p)
if ok {
p = p[0:n]
break
}
// Profile grew; try again.
}
return printCountProfile(w, debug, name, runtimeProfile(p))
}
// GoroutineProfile returns n, the number of records in the active goroutine stack profile.
// If len(p) >= n, GoroutineProfile copies the profile into p and returns n, true.
// If len(p) < n, GoroutineProfile does not change p and returns n, false.
//
// Most clients should use the runtime/pprof package instead
// of calling GoroutineProfile directly.
func GoroutineProfile(p []StackRecord) (n int, ok bool) {
gp := getg()
isOK := func(gp1 *g) bool {
// Checking isSystemGoroutine here makes GoroutineProfile
// consistent with both NumGoroutine and Stack.
return gp1 != gp && readgstatus(gp1) != _Gdead && !isSystemGoroutine(gp1, false)
}
// a familiar flavor: STW again
stopTheWorld("profile")
// count the goroutines
n = 1
for _, gp1 := range allgs {
if isOK(gp1) {
n++
}
}
// once the provided slice is large enough, collect each goroutine's info; the overall approach is almost identical to the stack API
if n <= len(p) {
ok = true
r := p
// Save current goroutine.
sp := getcallersp()
pc := getcallerpc()
systemstack(func() {
saveg(pc, sp, gp, &r[0])
})
r = r[1:]
// Save other goroutines.
for _, gp1 := range allgs {
if isOK(gp1) {
if len(r) == 0 {
// Should be impossible, but better to return a
// truncated profile than to crash the entire process.
break
}
saveg(^uintptr(0), ^uintptr(0), gp1, &r[0])
r = r[1:]
}
}
}
startTheWorld()
return n, ok
}
To sum up pprof/goroutine:
It stops the world, so be aware of the risk this API carries if you need to inspect details. The overall flow is essentially the same as stack-dumping every goroutine; there is little difference and not much more to add, so read the corresponding stack chapter if you are unfamiliar with it.
pprof/threadcreate
You might ask: usually keeping an eye on goroutines is enough, so why would we also want to track threads? Cases such as blocking system calls that cannot be preempted[3], cgo-related threads, and so on can all be given a quick first-pass analysis with it. That said, most thread problems people worry about (such as leaks) are usually caused by misuse at a higher level.
// a simple experiment, reusing the non-preemptible blocking syscall from earlier
package main
import (
"fmt"
"net/http"
_ "net/http/pprof"
"os"
"syscall"
"unsafe"
)
const (
SYS_futex = 202
_FUTEX_PRIVATE_FLAG = 128
_FUTEX_WAIT = 0
_FUTEX_WAKE = 1
_FUTEX_WAIT_PRIVATE = _FUTEX_WAIT | _FUTEX_PRIVATE_FLAG
_FUTEX_WAKE_PRIVATE = _FUTEX_WAKE | _FUTEX_PRIVATE_FLAG
)
func main() {
fmt.Println(os.Getpid())
go func() {
b := make([]byte, 1<<20)
_ = b
}()
for i := 1; i < 13; i++ {
go func() {
var futexVar int = 0
for {
// Syscall vs RawSyscall: see the syscall chapter for an analysis of the difference
fmt.Println(syscall.Syscall6(
SYS_futex, // trap AX 202
uintptr(unsafe.Pointer(&futexVar)), // a1 DI 1
uintptr(_FUTEX_WAIT), // a2 SI 0
0, // a3 DX
0, //uintptr(unsafe.Pointer(&ts)), // a4 R10
0, // a5 R8
0))
}
}()
}
http.ListenAndServe("0.0.0.0:8899", nil)
}
# GET /debug/pprof/threadcreate?debug=1
threadcreate profile: total 18
17 @
# 0x0
1 @ 0x43b818 0x43bfa3 0x43c272 0x43857d 0x467fb1
# 0x43b817 runtime.allocm+0x157 /usr/local/go/src/runtime/proc.go:1414
# 0x43bfa2 runtime.newm+0x42 /usr/local/go/src/runtime/proc.go:1736
# 0x43c271 runtime.startTemplateThread+0xb1 /usr/local/go/src/runtime/proc.go:1805
# 0x43857c runtime.main+0x18c /usr/local/go/src/runtime/proc.go:186
# now combine this with a tool such as pstack
ps -efT | grep 22298 # pid = 22298
root 22298 22298 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22299 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22300 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22301 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22302 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22303 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22304 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22305 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22306 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22307 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22308 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22309 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22310 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22311 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22312 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22316 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22317 13767 0 16:59 pts/4 00:00:00 ./mstest
pstack 22299
Thread 1 (process 22299):
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:568
#1 0x00000000004326f4 in runtime.futexsleep (addr=0xb2fd78 <runtime.sched+280>, val=0, ns=60000000000) at /usr/local/go/src/runtime/os_linux.go:51
#2 0x000000000040cb3e in runtime.notetsleep_internal (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:193
#3 0x000000000040cc11 in runtime.notetsleep (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:216
#4 0x00000000004433b2 in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:4558
#5 0x000000000043af33 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1112
#6 0x000000000043ae4e in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1077
#7 0x0000000000401893 in runtime/cgo(.text) ()
#8 0x00007fb1e2d53700 in ?? ()
#9 0x0000000000000000 in ?? ()
If you are interested, you can inspect the other threads in the same way.
The implementation of pprof/threadcreate is similar to pprof/goroutine; the difference is that the former walks the global allm while the latter walks allgs, and pprof/threadcreate => ThreadCreateProfile does not stop the world.
pprof/mutex
Mutex sampling is off by default; it is enabled or disabled by configuring the rate via runtime.SetMutexProfileFraction(int).
Similar to the mbuckets analyzed above, the sampled data here is recorded in xbuckets; each bucket records the stack that held the lock, the (sampled) counts, and other information for the user to inspect.
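A minimal sketch of turning mutex profiling on and dumping the result (the fraction 5, the artificial contention, and the file name are all illustrative):
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
    "sync"
)

func main() {
    // On average, 1 out of every 5 contention events is sampled; 0 switches it
    // off again, and the call returns the previous setting.
    runtime.SetMutexProfileFraction(5)
    defer runtime.SetMutexProfileFraction(0)

    var mu sync.Mutex
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                mu.Lock()
                mu.Unlock()
            }
        }()
    }
    wg.Wait()

    f, _ := os.Create("mutex.out")
    defer f.Close()
    pprof.Lookup("mutex").WriteTo(f, 0)
}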
//go:linkname mutexevent sync.event
func mutexevent(cycles int64, skip int) {
if cycles < 0 {
cycles = 0
}
rate := int64(atomic.Load64(&mutexprofilerate))
// TODO(pjw): measure impact of always calling fastrand vs using something
// like malloc.go:nextSample()
// sampling is likewise driven by a rate; here the rate is stored in the mutexprofilerate variable
if rate > 0 && int64(fastrand())%rate == 0 {
saveblockevent(cycles, skip+1, mutexProfile)
}
}
---------------
| user access |
---------------
|
------------------ |
| xbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into xbuckets
|
|
--------------------------------------
| func stkbucket & typ == mutexProfile |
--------------------------------------
|
------------------
| saveblockevent | // record stack trace and related info
------------------
|
|
| /*
| * //go:linkname mutexevent sync.event
| * func mutexevent(cycles int64, skip int) {
| * if cycles < 0 {
| * cycles = 0
| * }
| sample * rate := int64(atomic.Load64(&mutexprofilerate))
| record * // TODO(pjw): measure impact of always calling fastrand vs using something
| * // like malloc.go:nextSample()
| * if rate > 0 && int64(fastrand())%rate == 0 {
| * saveblockevent(cycles, skip+1, mutexProfile)
| * }
| *
| */
|
------------ not sampled
| mutexevent | ----------....
------------
|
|
------------
| semrelease1 |
------------
|
|
------------------------
| runtime_Semrelease |
------------------------
|
|
------------
| unlockSlow |
------------
|
|
------------
| Unlock |
------------
pprof/block
Same idea as above; here we mainly look at bbuckets.
---------------
| user access |
---------------
|
------------------ |
| bbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into bbuckets
|
|
--------------------------------------
| func stkbucket & typ == blockProfile |
--------------------------------------
|
------------------
| saveblockevent | // record stack trace and related info
------------------
|
|
| /*
| * func blocksampled(cycles int64) bool {
| * rate := int64(atomic.Load64(&blockprofilerate))
| * if rate <= 0 || (rate > cycles && int64(fastrand())%rate > cycles) {
| * return false
| sample * }
| record * return true
| * }
| */
|
------------ not sampled
| blockevent | ----------....
------------
|----------------------------------------------------------------------------
| | |
------------ ----------------------------------------------- ------------
| semrelease1 | | chansend / chanrecv && mysg.releasetime > 0 | | selectgo |
------------ ----------------------------------------------- ------------
Compared with mutex sampling, the instrumentation points for block additionally live in the channel code, and each block event records the difference (cycles) between two CPU tick readings taken before and after. Note that cputicks can be problematic on some systems[4]; we will not go into that here.
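Block profiling is enabled in a similar fashion via runtime.SetBlockProfileRate; a minimal sketch follows (the rate and file name are illustrative), where the blocking channel receive is exactly the kind of event instrumented at the chansend/chanrecv/selectgo sites in the diagram above:
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

func main() {
    // Record every blocking event of at least 1 ns; 0 disables block profiling.
    runtime.SetBlockProfileRate(1)
    defer runtime.SetBlockProfileRate(0)

    ch := make(chan struct{})
    go func() {
        time.Sleep(100 * time.Millisecond)
        ch <- struct{}{}
    }()
    <-ch // this blocking receive shows up in the block profile

    f, _ := os.Create("block.out")
    defer f.Close()
    pprof.Lookup("block").WriteTo(f, 0)
}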
pprof/profile
Everything analyzed above is data the runtime samples and stores automatically as it runs, for the user to inspect afterwards; profile, by contrast, is CPU profiling over a period chosen by the user.
-------------------------
| pprof.StartCPUProfile |
-------------------------
|
|
|
-------------------------
| sleep(time.Duration) |
-------------------------
|
|
|
-------------------------
| pprof.StopCPUProfile |
-------------------------
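In code this boils down to the following sketch (the output file and the 30-second window are illustrative):
package main

import (
    "os"
    "runtime/pprof"
    "time"
)

func main() {
    f, err := os.Create("cpu.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // StartCPUProfile configures the sampling rate (100 Hz by default) and
    // installs the SIGPROF-based sampler described below.
    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
    time.Sleep(30 * time.Second) // the window you want to profile
    pprof.StopCPUProfile()
}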
The core of pprof.StartCPUProfile and pprof.StopCPUProfile is runtime.SetCPUProfileRate(hz int), which controls the CPU profiling frequency. Setting the frequency here differs from the previous profiles, though: it is not just a matter of a rate, it also involves allocating the log buffer of the global cpuprof object.
var cpuprof cpuProfile
type cpuProfile struct {
lock mutex
on bool // profiling is on
log *profBuf // profile events written here
// extra holds extra stacks accumulated in addNonGo
// corresponding to profiling signals arriving on
// non-Go-created threads. Those stacks are written
// to log the next time a normal Go thread gets the
// signal handler.
// Assuming the stacks are 2 words each (we don't get
// a full traceback from those threads), plus one word
// size for framing, 100 Hz profiling would generate
// 300 words per second.
// Hopefully a normal Go thread will get the profiling
// signal at least once every few seconds.
extra [1000]uintptr
numExtra int
lostExtra uint64 // count of frames lost because extra is full
lostAtomic uint64 // count of frames lost because of being in atomic64 on mips/arm; updated racily
}
The log buffer is allocated with a fixed size each time and cannot be tuned.
cpuprof.add
This writes the stack trace information into cpuprof's log buffer.
// add adds the stack trace to the profile.
// It is called from signal handlers and other limited environments
// and cannot allocate memory or acquire locks that might be
// held at the time of the signal, nor can it use substantial amounts
// of stack.
//go:nowritebarrierrec
func (p *cpuProfile) add(gp *g, stk []uintptr) {
// Simple cas-lock to coordinate with setcpuprofilerate.
for !atomic.Cas(&prof.signalLock, 0, 1) {
osyield()
}
if prof.hz != 0 { // implies cpuprof.log != nil
if p.numExtra > 0 || p.lostExtra > 0 || p.lostAtomic > 0 {
p.addExtra()
}
hdr := [1]uint64{1}
// Note: write "knows" that the argument is &gp.labels,
// because otherwise its write barrier behavior may not
// be correct. See the long comment there before
// changing the argument here.
cpuprof.log.write(&gp.labels, nanotime(), hdr[:], stk)
}
atomic.Store(&prof.signalLock, 0)
}
Now let's look at the flow that calls cpuprof.add:
------------------------
| cpu profile start |
------------------------
|
|
| start timer (setitimer syscall / ITIMER_PROF)
| every interval (rate), a SIGPROF signal is sent to the thread the current P is running on --
| |
| |
------------------------ loop |
| sighandler |----------------------------------------------
------------------------ |
| |
| /* |
| * if sig == _SIGPROF { |
| * sigprof(c.sigpc(), c.sigsp(), c.siglr(), gp, _g_.m)
| * return |
| */ } |
| |
---------------------------- | stop
| sigprof(stack strace) | |
---------------------------- |
| |
| |
| |
---------------------- |
| cpuprof.add | |
---------------------- ----------------------
| | cpu profile stop |
| ----------------------
|
----------------------
| cpuprof.log buffer |
----------------------
| --------------------- ---------------
----------------------------------------| cpuprof.read |----------------| user access |
--------------------- ---------------
Thanks to the GMP model, in the vast majority of cases this timer + signal + current-thread scheme, together with the preemptive scheduling now supported, samples a CPU profile of the whole runtime quite well. Still, some extreme cases cannot be ruled out as uncovered, since it is, after all, based only on the current M.
Summary
pprof does indeed put extra pressure on the runtime, and how much depends on the various *_rate settings in use. When collecting pprof data, use each interface as your actual situation warrants, since the extra pressure each one generates differs. Different versions have different policies on whether profiling is enabled by default, so confirm this for your own environment. The data pprof returns should be treated as a reference only: it depends on the configured sampling rates, and things like the heap view are approximate estimates rather than an actual scan of the heap.
Availability:
The runtime's built-in pprof already strikes a fairly balanced and comprehensive trade-off for us in terms of data-collection accuracy, coverage, overhead, and so on.
In the vast majority of scenarios, the only performance knobs you need to think about are the few rate settings.
Whether profiling is enabled by default differs across versions; check the default values of these parameters yourself, because sometimes you think pprof is not enabled when in fact it already is.
With suitable parameters, pprof is nowhere near as "heavy" as people imagine.
Limitations:
The data you get is only a sample (determined by the rate) or an estimate.
It cannot cover every scenario; special or extreme cases require you to choose and refine suitable techniques of your own.
Security:
pprof can be used in production, but take care not to expose the endpoints directly; operations such as STW are involved, which makes them a potential risk.
Open-source pprof usage for reference
nsq[5] and etcd[6] use a configuration option[7] to choose whether to enable it.
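A minimal sketch of that configuration-gated approach (the flag name, address, and route set are illustrative): the pprof handlers are only registered, on an internal-only listener, when the flag is set.
package main

import (
    "net/http"
    "net/http/pprof"
)

var enablePprof = false // would normally come from a config file or flag

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    if enablePprof {
        // only register the pprof handlers when explicitly enabled
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
        mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    }
    // bind to loopback (or an internal interface) rather than exposing it publicly
    http.ListenAndServe("127.0.0.1:6060", mux)
}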
References
[1] change: https://go-review.googlesource.com/c/go/+/299671 (diff: https://go-review.googlesource.com/c/go/+/299671/8/src/runtime/mprof.go)
[2] stack: runtime_stack.md
[3] syscall: syscall.md
[4] issue: https://github.com/golang/go/issues/8976
[5] nsq: https://github.com/nsqio/nsq/blob/v1.2.0/nsqd/http.go#L78-L88
[6] etcd: https://github.com/etcd-io/etcd/blob/release-3.4/pkg/debugutil/pprof.go#L23
[7] configuration: https://github.com/etcd-io/etcd/blob/release-3.4/etcd.conf.yml.sample#L76
