The Principles and Implementation of pprof
wziww is the friend who helps me keep golang-notes up to date; this chapter on the principles and implementation of pprof was written by him. If this article earns any tips, they will be forwarded to him in full.
This chapter does not cover how to use pprof or its surrounding tools. Instead, it analyzes how runtime pprof is implemented, with the goal of giving readers a reference for using it well. Before we dive in, let's look at three questions. They are probably the biggest sources of confusion for most pprof users, so keep them in mind throughout the analysis that follows:
1. How much pressure does enabling pprof put on the runtime?
2. Can pprof be selectively turned on and off at suitable times for an application running in production?
3. How does pprof actually work?
Go's built-in pprof API lives in the runtime/pprof package. It lets us interact with the runtime and analyze various metrics of a running application to help with performance tuning and troubleshooting. Alternatively, you can simply import _ "net/http/pprof" and use the built-in HTTP endpoints; the pprof code in the net module is just a set of wrappers that Go provides around runtime/pprof, and of course you can also call runtime/pprof directly yourself.
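For orientation, here is a minimal sketch of the two usage styles just mentioned (the port is illustrative, and a real program would keep running); pprof.Profiles() enumerates exactly the built-in profiles registered in the map shown below.
package main

import (
    "fmt"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
    "runtime/pprof"
)

func main() {
    // Style 1: expose the HTTP endpoints, then inspect them with e.g.
    // go tool pprof http://localhost:6060/debug/pprof/heap
    go http.ListenAndServe("localhost:6060", nil)

    // Style 2: call runtime/pprof directly; here we just list the built-in profiles.
    for _, p := range pprof.Profiles() {
        fmt.Println(p.Name(), p.Count())
    }
}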
// src/runtime/pprof/pprof.go
// available profile categories
profiles.m = map[string]*Profile{
"goroutine": goroutineProfile,
"threadcreate": threadcreateProfile,
"heap": heapProfile,
"allocs": allocsProfile,
"block": blockProfile,
"mutex": mutexProfile,
}
allocs
var allocsProfile = &Profile{
name: "allocs",
count: countHeap, // identical to heap profile
write: writeAlloc,
}
writeAlloc mainly involves the following APIs: ReadMemStats(m *MemStats) and MemProfile(p []MemProfileRecord, inuseZero bool)
// ReadMemStats populates m with memory allocator statistics.
//
// The returned memory allocator statistics are up to date as of the
// call to ReadMemStats. This is in contrast with a heap profile,
// which is a snapshot as of the most recently completed garbage
// collection cycle.
func ReadMemStats(m *MemStats) {
// stop-the-world operation
stopTheWorld("read mem stats")
// switch to the system stack
systemstack(func() {
// copy the global memstats into m
readmemstats_m(m)
})
startTheWorld()
}
// MemProfile returns a profile of memory allocated and freed per allocation
// site.
//
// MemProfile returns n, the number of records in the current memory profile.
// If len(p) >= n, MemProfile copies the profile into p and returns n, true.
// If len(p) < n, MemProfile does not change p and returns n, false.
//
// If inuseZero is true, the profile includes allocation records
// where r.AllocBytes > 0 but r.AllocBytes == r.FreeBytes.
// These are sites where memory was allocated, but it has all
// been released back to the runtime.
//
// The returned profile may be up to two garbage collection cycles old.
// This is to avoid skewing the profile toward allocations; because
// allocations happen in real time but frees are delayed until the garbage
// collector performs sweeping, the profile only accounts for allocations
// that have had a chance to be freed by the garbage collector.
//
// Most clients should use the runtime/pprof package or
// the testing package's -test.memprofile flag instead
// of calling MemProfile directly.
func MemProfile(p []MemProfileRecord, inuseZero bool) (n int, ok bool) {
lock(&proflock)
// If we're between mProf_NextCycle and mProf_Flush, take care
// of flushing to the active profile so we only have to look
// at the active profile below.
mProf_FlushLocked()
clear := true
/*
* Remember this mbuckets -- memory profile buckets.
* The allocs samples are all recorded in this global variable; it is analyzed in detail below.
* -------------------------------------------------
* (gdb) info variables mbuckets
* All variables matching regular expression "mbuckets":
* File runtime:
* runtime.bucket *runtime.mbuckets;
* (gdb)
*/
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
n++
}
if mp.active.allocs != 0 || mp.active.frees != 0 {
clear = false
}
}
if clear {
// Absolutely no data, suggesting that a garbage collection
// has not yet happened. In order to allow profiling when
// garbage collection is disabled from the beginning of execution,
// accumulate all of the cycles, and recount buckets.
n = 0
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
for c := range mp.future {
mp.active.add(&mp.future[c])
mp.future[c] = memRecordCycle{}
}
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
n++
}
}
}
if n <= len(p) {
ok = true
idx := 0
for b := mbuckets; b != nil; b = b.allnext {
mp := b.mp()
if inuseZero || mp.active.alloc_bytes != mp.active.free_bytes {
// copy data out of mbuckets
record(&p[idx], b)
idx++
}
}
}
unlock(&proflock)
return
}
To sum up, the operations involved in pprof/allocs:
a brief STW and a systemstack switch to collect runtime information; then the global mbuckets object is copied and its values are returned to the user
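For completeness, here is a minimal sketch of calling runtime.MemProfile directly, using the grow-and-retry pattern its documentation describes (most programs should go through runtime/pprof instead; the slack of 50 records is arbitrary):
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Ask for the record count first, then grow the slice with some slack and
    // retry, since more records may appear between the two calls.
    var records []runtime.MemProfileRecord
    n, ok := runtime.MemProfile(nil, true)
    for {
        records = make([]runtime.MemProfileRecord, n+50)
        n, ok = runtime.MemProfile(records, true)
        if ok {
            records = records[:n]
            break
        }
    }
    for _, r := range records {
        fmt.Printf("in use: %d bytes, allocated: %d bytes\n", r.InUseBytes(), r.AllocBytes)
    }
}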
mbuckets
As mentioned above, the core of pprof/allocs is the handling of mbuckets; the diagram below gives a simple picture of how mbuckets is maintained.
var mbuckets *bucket // memory profile buckets
type bucket struct {
next *bucket
allnext *bucket
typ bucketType // memBucket or blockBucket (includes mutexProfile)
hash uintptr
size uintptr
nstk uintptr
}
---------------
| user access |
---------------
|
------------------ |
| mbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into mbuckets
|
|
--------------------------------------
| func stkbucket & typ == memProfile |
--------------------------------------
|
----------------
| mProf_Malloc | // record stack trace and related info
----------------
|
----------------
| profilealloc | // next_sample calculation
----------------
|
| /*
| * if rate := MemProfileRate; rate > 0 {
| * if rate != 1 && size < c.next_sample {
| * c.next_sample -= size
| sample * } else {
| record * mp := acquirem()
| * profilealloc(mp, x, size)
| * releasem(mp)
| * }
| * }
| */
|
------------ not sampled
| mallocgc |-----------...
------------
From the diagram above we can clearly see that when allocating memory the runtime samples allocations according to a certain policy and records them into mbuckets, so that users can analyze them later. The sampling algorithm has one important dependency: MemProfileRate.
// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.
//
// To include every allocated block in the profile, set MemProfileRate to 1.
// To turn off profiling entirely, set MemProfileRate to 0.
//
// The tools that process the memory profiles assume that the
// profile rate is constant across the lifetime of the program
// and equal to the current value. Programs that change the
// memory profiling rate should do so just once, as early as
// possible in the execution of the program (for example,
// at the beginning of main).
var MemProfileRate int = 512 * 1024
The default is 512 KB, and it can be configured by the user.
It is worth noting that, since enabling pprof adds some extra sampling pressure and overhead, the Go team has, in newer toolchains, started configuring this variable selectively in order to change[1] the enabled-by-default situation.
Concretely, if the code never references the related symbols, the initial value is set to 0 at link time; otherwise it stays at the default (512 KB).
(This article is based on Go 1.14.3; if you observe differences, check your version.)
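A minimal sketch of tuning the rate yourself; the value 64 KB is purely illustrative:
package main

import "runtime"

func init() {
    // Change MemProfileRate only once, as early as possible, per its documentation.
    // 1 records every allocation (expensive); 0 disables memory profiling entirely.
    runtime.MemProfileRate = 64 * 1024 // illustrative: sample roughly one allocation per 64 KB allocated
}

func main() {}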
pprof/allocs summary
When enabled, it puts extra pressure on the runtime: sampling records extra information at runtime malloc time for later analysis. You can decide whether to enable it, and at what sampling rate, by setting runtime.MemProfileRate. Go versions differ in whether it is enabled by default, which depends on whether user code references (as seen by the linker) the related modules/variables; the default rate is 512 KB.
The allocs profile also contains an approximate view of the heap, which is analyzed in the next section.
heap
allocs: A sampling of all past memory allocations
heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.
Comparing the official descriptions of allocs and heap: one analyzes all past memory allocations, the other the allocations of live objects currently on the heap. The heap profile can additionally take a parameter that runs a GC before the sample is taken.
That sounds like a big difference... but at the code level, apart from the fact that a GC can be triggered manually and that the type of the generated file differs (when debug == 0), there is essentially no difference between the two.
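A small sketch of how the two look from code (file names are illustrative); the runtime.GC() call mimics what net/http/pprof's ?gc=1 parameter does before a heap sample:
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    // allocs: cumulative allocation profile since program start.
    fa, _ := os.Create("allocs.out")
    defer fa.Close()
    pprof.Lookup("allocs").WriteTo(fa, 0)

    // heap: the same underlying samples; optionally run a GC first so the
    // in-use numbers reflect only live objects.
    runtime.GC()
    fh, _ := os.Create("heap.out")
    defer fh.Close()
    pprof.Lookup("heap").WriteTo(fh, 0)
}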
heap sampling (pseudo)
// p holds the MemProfileRecord samples mentioned above
for _, r := range p {
hideRuntime := true
for tries := 0; tries < 2; tries++ {
stk := r.Stack()
// For heap profiles, all stack
// addresses are return PCs, which is
// what appendLocsForStack expects.
if hideRuntime {
for i, addr := range stk {
if f := runtime.FuncForPC(addr); f != nil && strings.HasPrefix(f.Name(), "runtime.") {
continue
}
// Found non-runtime. Show any runtime uses above it.
stk = stk[i:]
break
}
}
locs = b.appendLocsForStack(locs[:0], stk)
if len(locs) > 0 {
break
}
hideRuntime = false // try again, and show all frames next time.
}
// rate is runtime.MemProfileRate
values[0], values[1] = scaleHeapSample(r.AllocObjects, r.AllocBytes, rate)
values[2], values[3] = scaleHeapSample(r.InUseObjects(), r.InUseBytes(), rate)
var blockSize int64
if r.AllocObjects > 0 {
blockSize = r.AllocBytes / r.AllocObjects
}
b.pbSample(values, locs, func() {
if blockSize != 0 {
b.pbLabel(tagSample_Label, "bytes", "", blockSize)
}
})
}
// scaleHeapSample adjusts the data from a heap Sample to
// account for its probability of appearing in the collected
// data. heap profiles are a sampling of the memory allocations
// requests in a program. We estimate the unsampled value by dividing
// each collected sample by its probability of appearing in the
// profile. heap profiles rely on a poisson process to determine
// which samples to collect, based on the desired average collection
// rate R. The probability of a sample of size S to appear in that
// profile is 1-exp(-S/R).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
if count == 0 || size == 0 {
return 0, 0
}
if rate <= 1 {
// if rate==1 all samples were collected so no adjustment is needed.
// if rate<1 treat as unknown and skip scaling.
return count, size
}
avgSize := float64(size) / float64(count)
scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))
return int64(float64(count) * scale), int64(float64(size) * scale)
}
Why the "pseudo" in the title? As the code above shows, building the profile does not actually scan all the memory on the heap (which would hardly be realistic); instead, the previously sampled data is used to compute an estimate of the heap state (from live object counts, sizes, the sampling rate, and so on). As a reference this is generally good enough.
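As a quick, made-up illustration of scaleHeapSample: with the default rate of 512 KB and a bucket whose average sample size is 4 KB, each collected sample is scaled up to stand in for roughly 128 real allocations.
package main

import (
    "fmt"
    "math"
)

func main() {
    rate := float64(512 * 1024) // default MemProfileRate
    avgSize := float64(4096)    // hypothetical average allocation size in one bucket
    // Same formula as scaleHeapSample: the probability of being sampled is 1-exp(-S/R).
    scale := 1 / (1 - math.Exp(-avgSize/rate))
    fmt.Printf("each sample represents ~%.1f allocations\n", scale) // prints ~128.5
}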
goroutine
When debug >= 2, the goroutine stacks are dumped directly; see the stack[2] chapter for details.
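A minimal sketch of fetching the goroutine profile at the different debug levels (writing to stdout is just for illustration):
package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    p := pprof.Lookup("goroutine")
    // debug=1: counts aggregated per unique stack, symbolized.
    p.WriteTo(os.Stdout, 1)
    // debug=2: a full dump of every goroutine's stack, in the same format as an
    // unrecovered panic -- this is the "dump stacks directly" path mentioned above.
    p.WriteTo(os.Stdout, 2)
}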
// fetch == runtime.GoroutineProfile
func writeRuntimeProfile(w io.Writer, debug int, name string, fetch func([]runtime.StackRecord) (int, bool)) error {
// Find out how many records there are (fetch(nil)),
// allocate that many records, and get the data.
// There's a race—more records might be added between
// the two calls—so allocate a few extra records for safety
// and also try again if we're very unlucky.
// The loop should only execute one iteration in the common case.
var p []runtime.StackRecord
n, ok := fetch(nil)
for {
// Allocate room for a slightly bigger profile,
// in case a few more entries have been added
// since the call to ThreadProfile.
p = make([]runtime.StackRecord, n+10)
n, ok = fetch(p)
if ok {
p = p[0:n]
break
}
// Profile grew; try again.
}
return printCountProfile(w, debug, name, runtimeProfile(p))
}
// GoroutineProfile returns n, the number of records in the active goroutine stack profile.
// If len(p) >= n, GoroutineProfile copies the profile into p and returns n, true.
// If len(p) < n, GoroutineProfile does not change p and returns n, false.
//
// Most clients should use the runtime/pprof package instead
// of calling GoroutineProfile directly.
func GoroutineProfile(p []StackRecord) (n int, ok bool) {
gp := getg()
isOK := func(gp1 *g) bool {
// Checking isSystemGoroutine here makes GoroutineProfile
// consistent with both NumGoroutine and Stack.
return gp1 != gp && readgstatus(gp1) != _Gdead && !isSystemGoroutine(gp1, false)
}
// a familiar flavor: STW again
stopTheWorld("profile")
// count the goroutines
n = 1
for _, gp1 := range allgs {
if isOK(gp1) {
n++
}
}
// once the provided slice is large enough, collect each goroutine's info; the overall approach is almost identical to the stack API
if n <= len(p) {
ok = true
r := p
// Save current goroutine.
sp := getcallersp()
pc := getcallerpc()
systemstack(func() {
saveg(pc, sp, gp, &r[0])
})
r = r[1:]
// Save other goroutines.
for _, gp1 := range allgs {
if isOK(gp1) {
if len(r) == 0 {
// Should be impossible, but better to return a
// truncated profile than to crash the entire process.
break
}
saveg(^uintptr(0), ^uintptr(0), gp1, &r[0])
r = r[1:]
}
}
}
startTheWorld()
return n, ok
}
To sum up pprof/goroutine:
It stops the world, so be aware of the risk this API carries if you need to inspect details. The overall flow is essentially the same as stack-dumping every goroutine; there is little difference and not much more to add, so read the corresponding stack chapter if you are unfamiliar with it.
pprof/threadcreate
You might ask: usually keeping an eye on goroutines is enough, so why would we also want to track threads? Cases such as blocking system calls that cannot be preempted[3], cgo-related threads, and so on can all be given a quick first-pass analysis with it. That said, most thread problems people worry about (such as leaks) are usually caused by misuse at a higher level.
// a simple experiment, reusing the non-preemptible blocking syscall from earlier
package main
import (
"fmt"
"net/http"
_ "net/http/pprof"
"os"
"syscall"
"unsafe"
)
const (
SYS_futex = 202
_FUTEX_PRIVATE_FLAG = 128
_FUTEX_WAIT = 0
_FUTEX_WAKE = 1
_FUTEX_WAIT_PRIVATE = _FUTEX_WAIT | _FUTEX_PRIVATE_FLAG
_FUTEX_WAKE_PRIVATE = _FUTEX_WAKE | _FUTEX_PRIVATE_FLAG
)
func main() {
fmt.Println(os.Getpid())
go func() {
b := make([]byte, 1<<20)
_ = b
}()
for i := 1; i < 13; i++ {
go func() {
var futexVar int = 0
for {
// Syscall vs RawSyscall: see the syscall chapter for an analysis of the difference
fmt.Println(syscall.Syscall6(
SYS_futex, // trap AX 202
uintptr(unsafe.Pointer(&futexVar)), // a1 DI 1
uintptr(_FUTEX_WAIT), // a2 SI 0
0, // a3 DX
0, //uintptr(unsafe.Pointer(&ts)), // a4 R10
0, // a5 R8
0))
}
}()
}
http.ListenAndServe("0.0.0.0:8899", nil)
}
# GET /debug/pprof/threadcreate?debug=1
threadcreate profile: total 18
17 @
# 0x0
1 @ 0x43b818 0x43bfa3 0x43c272 0x43857d 0x467fb1
# 0x43b817 runtime.allocm+0x157 /usr/local/go/src/runtime/proc.go:1414
# 0x43bfa2 runtime.newm+0x42 /usr/local/go/src/runtime/proc.go:1736
# 0x43c271 runtime.startTemplateThread+0xb1 /usr/local/go/src/runtime/proc.go:1805
# 0x43857c runtime.main+0x18c /usr/local/go/src/runtime/proc.go:186
# now combine this with a tool such as pstack
ps -efT | grep 22298 # pid = 22298
root 22298 22298 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22299 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22300 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22301 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22302 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22303 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22304 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22305 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22306 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22307 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22308 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22309 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22310 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22311 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22312 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22316 13767 0 16:59 pts/4 00:00:00 ./mstest
root 22298 22317 13767 0 16:59 pts/4 00:00:00 ./mstest
pstack 22299
Thread 1 (process 22299):
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:568
#1 0x00000000004326f4 in runtime.futexsleep (addr=0xb2fd78 <runtime.sched+280>, val=0, ns=60000000000) at /usr/local/go/src/runtime/os_linux.go:51
#2 0x000000000040cb3e in runtime.notetsleep_internal (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:193
#3 0x000000000040cc11 in runtime.notetsleep (n=0xb2fd78 <runtime.sched+280>, ns=60000000000, ~r2=<optimized out>) at /usr/local/go/src/runtime/lock_futex.go:216
#4 0x00000000004433b2 in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:4558
#5 0x000000000043af33 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1112
#6 0x000000000043ae4e in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1077
#7 0x0000000000401893 in runtime/cgo(.text) ()
#8 0x00007fb1e2d53700 in ?? ()
#9 0x0000000000000000 in ?? ()
If you are interested, you can inspect the other threads in the same way.
The implementation of pprof/threadcreate is similar to pprof/goroutine; the difference is that the former walks the global allm while the latter walks allgs, and pprof/threadcreate => ThreadCreateProfile does not stop the world.
pprof/mutex
Mutex sampling is off by default; it is enabled or disabled by configuring the rate via runtime.SetMutexProfileFraction(int).
Similar to the mbuckets analyzed above, the sampled data here is recorded in xbuckets; each bucket records the stack that held the lock, the (sampled) counts, and other information for the user to inspect.
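A minimal sketch of turning mutex profiling on and dumping the result (the fraction 5, the artificial contention, and the file name are all illustrative):
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
    "sync"
)

func main() {
    // On average, 1 out of every 5 contention events is sampled; 0 switches it
    // off again, and the call returns the previous setting.
    runtime.SetMutexProfileFraction(5)
    defer runtime.SetMutexProfileFraction(0)

    var mu sync.Mutex
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                mu.Lock()
                mu.Unlock()
            }
        }()
    }
    wg.Wait()

    f, _ := os.Create("mutex.out")
    defer f.Close()
    pprof.Lookup("mutex").WriteTo(f, 0)
}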
//go:linkname mutexevent sync.event
func mutexevent(cycles int64, skip int) {
if cycles < 0 {
cycles = 0
}
rate := int64(atomic.Load64(&mutexprofilerate))
// TODO(pjw): measure impact of always calling fastrand vs using something
// like malloc.go:nextSample()
// sampling is likewise driven by a rate; here the rate is stored in the mutexprofilerate variable
if rate > 0 && int64(fastrand())%rate == 0 {
saveblockevent(cycles, skip+1, mutexProfile)
}
}
---------------
| user access |
---------------
|
------------------ |
| xbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into xbuckets
|
|
--------------------------------------
| func stkbucket & typ == mutexProfile |
--------------------------------------
|
------------------
| saveblockevent | // record stack trace and related info
------------------
|
|
| /*
| * //go:linkname mutexevent sync.event
| * func mutexevent(cycles int64, skip int) {
| * if cycles < 0 {
| * cycles = 0
| * }
| sample * rate := int64(atomic.Load64(&mutexprofilerate))
| record * // TODO(pjw): measure impact of always calling fastrand vs using something
| * // like malloc.go:nextSample()
| * if rate > 0 && int64(fastrand())%rate == 0 {
| * saveblockevent(cycles, skip+1, mutexProfile)
| * }
| *
| */
|
------------ not sampled
| mutexevent | ----------....
------------
|
|
------------
| semrelease1 |
------------
|
|
------------------------
| runtime_Semrelease |
------------------------
|
|
------------
| unlockSlow |
------------
|
|
------------
| Unlock |
------------
pprof/block
Same idea as above; here we mainly look at bbuckets.
---------------
| user access |
---------------
|
------------------ |
| bbuckets list | copy |
| (global) | -------------------------------------
------------------
|
|
| create_or_get && insert_or_update bucket into bbuckets
|
|
--------------------------------------
| func stkbucket & typ == blockProfile |
--------------------------------------
|
------------------
| saveblockevent | // record stack trace and related info
------------------
|
|
| /*
| * func blocksampled(cycles int64) bool {
| * rate := int64(atomic.Load64(&blockprofilerate))
| * if rate <= 0 || (rate > cycles && int64(fastrand())%rate > cycles) {
| * return false
| sample * }
| record * return true
| * }
| */
|
------------ not sampled
| blockevent | ----------....
------------
|----------------------------------------------------------------------------
| | |
------------ ----------------------------------------------- ------------
| semrelease1 | | chansend / chanrecv && mysg.releasetime > 0 | | selectgo |
------------ ----------------------------------------------- ------------
Compared with mutex sampling, the instrumentation points for block additionally live in the channel code, and each block event records the difference (cycles) between two CPU tick readings taken before and after. Note that cputicks can be problematic on some systems[4]; we will not go into that here.
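Block profiling is enabled in a similar fashion via runtime.SetBlockProfileRate; a minimal sketch follows (the rate and file name are illustrative), where the blocking channel receive is exactly the kind of event instrumented at the chansend/chanrecv/selectgo sites in the diagram above:
package main

import (
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

func main() {
    // Record every blocking event of at least 1 ns; 0 disables block profiling.
    runtime.SetBlockProfileRate(1)
    defer runtime.SetBlockProfileRate(0)

    ch := make(chan struct{})
    go func() {
        time.Sleep(100 * time.Millisecond)
        ch <- struct{}{}
    }()
    <-ch // this blocking receive shows up in the block profile

    f, _ := os.Create("block.out")
    defer f.Close()
    pprof.Lookup("block").WriteTo(f, 0)
}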
pprof/profile
Everything analyzed above is data the runtime samples and stores automatically as it runs, for the user to inspect afterwards; profile, by contrast, is CPU profiling over a period chosen by the user.
-------------------------
| pprof.StartCPUProfile |
-------------------------
|
|
|
-------------------------
| sleep(time.Duration) |
-------------------------
|
|
|
-------------------------
| pprof.StopCPUProfile |
-------------------------
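In code this boils down to the following sketch (the output file and the 30-second window are illustrative):
package main

import (
    "os"
    "runtime/pprof"
    "time"
)

func main() {
    f, err := os.Create("cpu.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // StartCPUProfile configures the sampling rate (100 Hz by default) and
    // installs the SIGPROF-based sampler described below.
    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
    time.Sleep(30 * time.Second) // the window you want to profile
    pprof.StopCPUProfile()
}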
The core of pprof.StartCPUProfile and pprof.StopCPUProfile is runtime.SetCPUProfileRate(hz int), which controls the CPU profiling frequency. Setting the frequency here differs from the previous profiles, though: it is not just a matter of a rate, it also involves allocating the log buffer of the global cpuprof object.
var cpuprof cpuProfile
type cpuProfile struct {
lock mutex
on bool // profiling is on
log *profBuf // profile events written here
// extra holds extra stacks accumulated in addNonGo
// corresponding to profiling signals arriving on
// non-Go-created threads. Those stacks are written
// to log the next time a normal Go thread gets the
// signal handler.
// Assuming the stacks are 2 words each (we don't get
// a full traceback from those threads), plus one word
// size for framing, 100 Hz profiling would generate
// 300 words per second.
// Hopefully a normal Go thread will get the profiling
// signal at least once every few seconds.
extra [1000]uintptr
numExtra int
lostExtra uint64 // count of frames lost because extra is full
lostAtomic uint64 // count of frames lost because of being in atomic64 on mips/arm; updated racily
}
The log buffer is allocated with a fixed size each time and cannot be tuned.
cpuprof.add
This writes the stack trace information into cpuprof's log buffer.
// add adds the stack trace to the profile.
// It is called from signal handlers and other limited environments
// and cannot allocate memory or acquire locks that might be
// held at the time of the signal, nor can it use substantial amounts
// of stack.
//go:nowritebarrierrec
func (p *cpuProfile) add(gp *g, stk []uintptr) {
// Simple cas-lock to coordinate with setcpuprofilerate.
for !atomic.Cas(&prof.signalLock, 0, 1) {
osyield()
}
if prof.hz != 0 { // implies cpuprof.log != nil
if p.numExtra > 0 || p.lostExtra > 0 || p.lostAtomic > 0 {
p.addExtra()
}
hdr := [1]uint64{1}
// Note: write "knows" that the argument is &gp.labels,
// because otherwise its write barrier behavior may not
// be correct. See the long comment there before
// changing the argument here.
cpuprof.log.write(&gp.labels, nanotime(), hdr[:], stk)
}
atomic.Store(&prof.signalLock, 0)
}
Now let's look at the flow that calls cpuprof.add:
------------------------
| cpu profile start |
------------------------
|
|
| start timer (setitimer syscall / ITIMER_PROF)
| every interval (rate), a SIGPROF signal is sent to the thread the current P is running on --
| |
| |
------------------------ loop |
| sighandler |----------------------------------------------
------------------------ |
| |
| /* |
| * if sig == _SIGPROF { |
| * sigprof(c.sigpc(), c.sigsp(), c.siglr(), gp, _g_.m)
| * return |
| */ } |
| |
---------------------------- | stop
| sigprof(stack strace) | |
---------------------------- |
| |
| |
| |
---------------------- |
| cpuprof.add | |
---------------------- ----------------------
| | cpu profile stop |
| ----------------------
|
----------------------
| cpuprof.log buffer |
----------------------
| --------------------- ---------------
----------------------------------------| cpuprof.read |----------------| user access |
--------------------- ---------------
Thanks to the GMP model, in the vast majority of cases this timer + signal + current-thread scheme, together with the preemptive scheduling now supported, samples a CPU profile of the whole runtime quite well. Still, some extreme cases cannot be ruled out as uncovered, since it is, after all, based only on the current M.
Summary
pprof does indeed put extra pressure on the runtime, and how much depends on the various *_rate settings in use. When collecting pprof data, use each interface as your actual situation warrants, since the extra pressure each one generates differs. Different versions have different policies on whether profiling is enabled by default, so confirm this for your own environment. The data pprof returns should be treated as a reference only: it depends on the configured sampling rates, and things like the heap view are approximate estimates rather than an actual scan of the heap.
Availability:
The runtime's built-in pprof already strikes a fairly balanced and comprehensive trade-off for us in terms of data-collection accuracy, coverage, overhead, and so on.
In the vast majority of scenarios, the only performance knobs you need to think about are the few rate settings.
Whether profiling is enabled by default differs across versions; check the default values of these parameters yourself, because sometimes you think pprof is not enabled when in fact it already is.
With suitable parameters, pprof is nowhere near as "heavy" as people imagine.
Limitations:
The data you get is only a sample (determined by the rate) or an estimate.
It cannot cover every scenario; special or extreme cases require you to choose and refine suitable techniques of your own.
Security:
pprof can be used in production, but take care not to expose the endpoints directly; operations such as STW are involved, which makes them a potential risk.
Open-source pprof usage for reference
nsq[5] and etcd[6] use a configuration option[7] to choose whether to enable it.
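A minimal sketch of that configuration-gated approach (the flag name, address, and route set are illustrative): the pprof handlers are only registered, on an internal-only listener, when the flag is set.
package main

import (
    "net/http"
    "net/http/pprof"
)

var enablePprof = false // would normally come from a config file or flag

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    if enablePprof {
        // only register the pprof handlers when explicitly enabled
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
        mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    }
    // bind to loopback (or an internal interface) rather than exposing it publicly
    http.ListenAndServe("127.0.0.1:6060", mux)
}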
References
[1] change: https://go-review.googlesource.com/c/go/+/299671 (diff: https://go-review.googlesource.com/c/go/+/299671/8/src/runtime/mprof.go)
[2] stack: runtime_stack.md
[3] syscall: syscall.md
[4] issue: https://github.com/golang/go/issues/8976
[5] nsq: https://github.com/nsqio/nsq/blob/v1.2.0/nsqd/http.go#L78-L88
[6] etcd: https://github.com/etcd-io/etcd/blob/release-3.4/pkg/debugutil/pprof.go#L23
[7] configuration: https://github.com/etcd-io/etcd/blob/release-3.4/etcd.conf.yml.sample#L76
