Uncovering the Culprit Behind Slow Container Startup
With dockershim entering its deprecation window, adopting the containerd project has moved onto the agenda. During the migration we observed abnormal container startup behavior; in this article we dig into containerd's internals to track down the root cause.

While recently testing the in-house replacement of dockershim with containerd, we observed that business containers took a long time after startup before running normally: the init container finished within one second, but the main container sometimes took more than a minute to begin executing.
Checking the kubelet logs, we found that right after the node booted, PLEG's Relisting method (which normally runs once per second) took more than 30 seconds to complete. After a few minutes, however, the phenomenon disappeared and "Relisting" ran at its expected once-per-second cadence.
This observation suggests that something goes wrong right after node startup and delays container execution, but the problem resolves itself as the node keeps running, and containers eventually run normally.
kubelet_init[1680]: I0602 05:48:06.350695 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:07.352436 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:49.669827 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:50.672822 1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:51.676857 1679 generic.go:191] "GenericPLEG: Relisting"
The containerd configuration, config.toml, was as follows:
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
address = "/run/containerd/containerd.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
NOTE:
containerd: 1.6.21, K8s: 1.19, kernel: 5.15.0

Technical Background
dockershim and CRI
In May 2022, Kubernetes 1.24 was officially released. Most notably, this release formally removed the dockershim component from the kubelet. From this version on, kubelet users should prefer containerd or CRI-O as the container runtime. The evolution of containerd's architecture is shown below; for more detail, see Containerd深度剖析-runtime篇.

PLEG
PLEG (Pod Lifecycle Event Generator) is the pod lifecycle event generator. In Kubernetes (a distributed architecture), every node runs a daemon, the kubelet, which manages the containers on that node, continually adjusting their actual state to match the state defined in the spec (eventual consistency). Concretely, the kubelet must react promptly to changes in two places:
- the state defined in the Pod spec
- the state in the container runtime
For Pods, the kubelet watches for changes to the Pod spec from multiple data sources (api, file, and http). For containers, the kubelet periodically polls the container runtime to get the latest state of all containers. As the number of Pods and containers grows, polling becomes expensive: the resulting periodic bursts of concurrent requests cause high CPU usage spikes, degrade node performance, and reduce node reliability. To lower the cost of Pod management and improve the kubelet's performance and scalability, PLEG was introduced, improving on the previous approach in two ways:
- less unnecessary work during idle periods (e.g. when neither the Pod definition nor container states have changed);
- fewer concurrent requests for container status.
Note: to reduce overhead even further, the community has since introduced an event-based PLEG, which of course requires support from the CRI runtime.
ImagePull

A ctr image pull goes through the following steps:
- resolve the image the user wants to download;
- pull the image from the registry, saving the layer contents and config into the content service, and the image metadata into the images metadata service;
- unpack into the snapshot service.
Note: the content, images, and other services here refer to gRPC services provided by containerd. When unpacking each layer, containerd temporarily mounts and unmounts all parent snapshots.

Troubleshooting
Given the technical background above, we know that GenericPLEG: Relisting calls containerd's CRI (Container Runtime Interface) to get the list of all running containers.
// relist queries the container runtime for list of pods/containers, compare
// with the internal pods/containers, and generates events accordingly.
func (g *GenericPLEG) relist() {
klog.V(5).Infof("GenericPLEG: Relisting")
if lastRelistTime := g.getRelistTime(); !lastRelistTime.IsZero() {
metrics.PLEGRelistInterval.Observe(metrics.SinceInSeconds(lastRelistTime))
}
timestamp := g.clock.Now()
defer func() {
metrics.PLEGRelistDuration.Observe(metrics.SinceInSeconds(timestamp))
}()
// Get all the pods.
podList, err := g.runtime.GetPods(true)
Correspondingly, we can check the containerd logs, which show that containerd itself, or the shim, ran into trouble while stopping a container:
containerd[2206]: {"error":"failed to stop container: failed to delete task: context deadline exceeded:
unknown","level":"error","msg":"failed to handle container TaskExit event \u0026TaskExit{ContainerID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,ID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,Pid:3658,ExitStatus:0,ExitedAt:2023-06-15 06:39:36.566238047 +0000 UTC,XXX_unrecognized:[],}","time":"2023-06-15T06:39:46.568049942Z"}
We therefore captured a goroutine dump, which shows a goroutine waiting on a Delete call:
goroutine 1654 [select]:
github.com/containerd/ttrpc.(*Client).dispatch(0xc001ae1780, {0x56190174ac08?, 0xc000ae0180}, 0xc000ae05a0, 0xc0001183c0)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x2ad
github.com/containerd/ttrpc.defaultClientInterceptor({0x56190174ac08?, 0xc000ae0180?}, 0x5619023eb940?, 0xc000f982a0?, 0x6993c8a9f?, 0x5619023eb940?)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:49 +0x26
github.com/containerd/ttrpc.(*Client).Call(0xc001ae1780, {0x56190174ac08, 0xc000ae0180}, {0x561900ea8b7f, 0x17}, {0x561900e8851e, 0x6}, {0x56190167d8a0?, 0xc000e6a880?}, {0x56190167d9e0, ...})
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:132 +0x3ac
github.com/containerd/containerd/runtime/v2/task.(*taskClient).Delete(0xc00011a578, {0x56190174ac08, 0xc000ae0180}, 0x5618fff5830f?)
/go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3609 +0x98
github.com/containerd/containerd/runtime/v2.(*shimTask).delete(0xc00165e870, {0x56190174ac08, 0xc000ae0180}, 0xc0009ab988)
/go/src/github.com/containerd/containerd/runtime/v2/shim.go:275 +0xc2
github.com/containerd/containerd/runtime/v2.(*TaskManager).Delete(0xc00040a0f8, {0x56190174ac08, 0xc000ae0180}, {0xc0013aaa40?, 0x40?})
/go/src/github.com/containerd/containerd/runtime/v2/manager.go:419 +0x8c
github.com/containerd/containerd/services/tasks.(*local).Delete(0xc00070a5a0, {0x56190174ac08, 0xc000ae0180}, 0xc000d7cd50, {0x561901620fe0?, 0xc0009abab8?, 0x5618ffbcf13f?})
/go/src/github.com/containerd/containerd/services/tasks/local.go:291 +0x11a
github.com/containerd/containerd.(*task).Delete(0xc0006ba440, {0x56190174ac08, 0xc000ae0180}, {0xc000d08570, 0x2, 0x56190174b1f0?})
/go/src/github.com/containerd/containerd/task.go:332 +0x355
github.com/containerd/containerd/pkg/cri/server.handleContainerExit({0x56190174ac08, 0xc000ae0180}, 0xc000ae0120, {{{0xc0013aaa40, 0x40}, {0xc000dfd7a0, 0x52}, {0xc001a8e240, 0x40}, 0xc001240380, ...}, ...})
/go/src/github.com/containerd/containerd/pkg/cri/server/events.go:387 +0x202
github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor.func1.1(0xc000089240, 0xc000ae0120)
/go/src/github.com/containerd/containerd/pkg/cri/server/events.go:190 +0x310
github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor.func1()
/go/src/github.com/containerd/containerd/pkg/cri/server/events.go:198 +0x37f
created by github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor
/go/src/github.com/containerd/containerd/pkg/cri/server/events.go:162 +0x12b
In containerd-shim-runc-v2, we find a goroutine stuck in the umount syscall:
goroutine 104 [syscall]:
syscall.Syscall(0x7c?, 0xc000161800?, 0xc0000c2200?, 0x7d?)
/usr/local/go/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Unmount({0xc0000c2200?, 0x0?}, 0xc00029c448?)
/go/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux.go:1680 +0x7d
github.com/containerd/containerd/mount.unmount({0xc0000c2200, 0x7c}, 0x9ed280?)
/go/src/github.com/containerd/containerd/mount/mount_linux.go:155 +0x8c
github.com/containerd/containerd/mount.UnmountAll({0xc0000c2200, 0x7c}, 0xc000065080?)
/go/src/github.com/containerd/containerd/mount/mount_linux.go:185 +0x8c
github.com/containerd/containerd/pkg/process.(*Init).delete(0xc0000d2c60, {0x9f1460, 0xc000065080})
/go/src/github.com/containerd/containerd/pkg/process/init.go:312 +0x13a
github.com/containerd/containerd/pkg/process.(*stoppedState).Delete(0xc000012ab8, {0x9f1460?, 0xc000065080?})
/go/src/github.com/containerd/containerd/pkg/process/init_state.go:396 +0x2e
github.com/containerd/containerd/pkg/process.(*Init).Delete(0xc00002d180?, {0x9f1460?, 0xc000065080?})
/go/src/github.com/containerd/containerd/pkg/process/init.go:288 +0xb6
github.com/containerd/containerd/runtime/v2/runc.(*Container).Delete(0xc000132380?, {0x9f1460, 0xc000065080}, 0xc0003071c0)
/go/src/github.com/containerd/containerd/runtime/v2/runc/container.go:390 +0x62
github.com/containerd/containerd/runtime/v2/runc/task.(*service).Delete(0xc000132380, {0x9f1460, 0xc000065080}, 0xc0003071c0)
/go/src/github.com/containerd/containerd/runtime/v2/runc/task/service.go:216 +0x75
github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func4({0x9f1460, 0xc000065080}, 0xc00002ee00)
/go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3477 +0x98
github.com/containerd/ttrpc.defaultServerInterceptor({0x9f1460?, 0xc000065080?}, 0x17?, 0xc000134760?, 0x6?)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:45 +0x26
github.com/containerd/ttrpc.(*serviceSet).dispatch(0xc000118650, {0x9f1460, 0xc000065080}, {0xc00002aba0, 0x17}, {0xc00027e22a, 0x6}, {0xc000180280, 0x42, 0x50})
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:95 +0x1be
github.com/containerd/ttrpc.(*serviceSet).call(0xd0e330?, {0x9f1460?, 0xc000065080?}, {0xc00002aba0?, 0x84eac0?}, {0xc00027e22a?, 0xc000308000?}, {0xc000180280, 0x42, 0x50})
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:64 +0x71
github.com/containerd/ttrpc.(*serverConn).run.func2(0x1b)
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:439 +0xe5
created by github.com/containerd/ttrpc.(*serverConn).run
/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:435 +0x6b1
At this point our investigation returns to containerd.log. Looking through it, we find an UpdateContainerResources request waiting on the bbolt lock:
goroutine 1723 [semacquire]:
sync.runtime_SemacquireMutex(0xc001a78300?, 0x9?, 0xc001500c00?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0002f9178)
/usr/local/go/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:90
go.etcd.io/bbolt.(*DB).beginRWTx(0xc0002f8fc0)
/go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:640 +0x77
go.etcd.io/bbolt.(*DB).Begin(0xc001739a70?, 0x87?)
/go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:589 +0x1d
go.etcd.io/bbolt.(*DB).Update(0x20?, 0xc000b44be0)
/go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:725 +0x3e
github.com/containerd/containerd/metadata.(*DB).Update(0xc0001e92d0, 0x56190174ac40?)
/go/src/github.com/containerd/containerd/metadata/db.go:246 +0x9e
github.com/containerd/containerd/services/containers.(*local).withStoreUpdate(0xc0003e8840, {0x56190174ac40?, 0xc0017390b0?}, 0x0?)
/go/src/github.com/containerd/containerd/services/containers/local.go:212 +0x3c
github.com/containerd/containerd/services/containers.(*local).Update(0xc0003e8840, {0x56190174ac40, 0xc0017390b0}, 0xc000eb0540, {0xedc1ca928?, 0x0?, 0xc0017399b0?})
/go/src/github.com/containerd/containerd/services/containers/local.go:156 +0x24e
github.com/containerd/containerd.(*remoteContainers).Update(_, {_, _}, {{0xc00136b280, 0x40}, 0xc0017398f0, {0xc0016cc280, 0x96}, {{0xc00145a9f0, 0x15}, ...}, ...}, ...)
/go/src/github.com/containerd/containerd/containerstore.go:129 +0x2ce
github.com/containerd/containerd.(*container).Update(0xc0006e4a80, {0x56190174ac40, 0xc0017390b0}, {0xc00040a0f0, 0x1, 0xc0016f0101?})
/go/src/github.com/containerd/containerd/container.go:318 +0x214
github.com/containerd/containerd/pkg/cri/server.updateContainerSpec({0x56190174ac40, 0xc0017390b0}, {0x561901754a58, 0xc0006e4a80}, 0xc0016f4b80)
The implementation of this code path lives in containerd/pkg/cri/server/container_update_resources.go:
func (c *criService) UpdateContainerResources(ctx context.Context, r *runtime.UpdateContainerResourcesRequest) (retRes *runtime.UpdateContainerResourcesResponse, retErr error) {
container, err := c.containerStore.Get(r.GetContainerId())
if err != nil {
return nil, fmt.Errorf("failed to find container: %w", err)
}
// Update resources in status update transaction, so that:
// 1) There won't be race condition with container start.
// 2) There won't be concurrent resource update to the same container.
if err := container.Status.UpdateSync(func(status containerstore.Status) (containerstore.Status, error) {
return c.updateContainerResources(ctx, container, r, status)
}); err != nil {
return nil, fmt.Errorf("failed to update resources: %w", err)
}
return &runtime.UpdateContainerResourcesResponse{}, nil
}
UpdateContainerResources holds the container status lock, which ListContainers also needs, and that is what makes PLEG (Pod Lifecycle Event Generator) take so long.
goroutine 1665 [semacquire]:
sync.runtime_SemacquireMutex(0xe69?, 0xae?, 0x1768c1d3f53d5f79?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
/usr/local/go/src/sync/rwmutex.go:71
github.com/containerd/containerd/pkg/cri/store/container.(*statusStorage).Get(0xc00058a880?)
/go/src/github.com/containerd/containerd/pkg/cri/store/container/status.go:205 +0x9c
github.com/containerd/containerd/pkg/cri/server.toCRIContainer({{{0xc00136b280, 0x40}, {0xc00186b6d0, 0x4d}, {0xc001a8e6c0, 0x40}, 0xc0007129a0, {0xc00050eb40, 0x47}, {0xc000dfcd80, ...}, ...}, ...})
/go/src/github.com/containerd/containerd/pkg/cri/server/container_list.go:48 +0x3f
github.com/containerd/containerd/pkg/cri/server.(*criService).ListContainers(0xc000548400, {0xc0006027b0?, 0x561900e89376?}, 0xc000ef6130)
/go/src/github.com/containerd/containerd/pkg/cri/server/container_list.go:37 +0x205
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).ListContainers(0xc000708168, {0x56190174ac40, 0xc0006027b0}, 0xc000ef6130)
/go/src/github.com/containerd/containerd/pkg/cri/server/instrumented_service.go:515 +0x1d3
The bbolt database lock that UpdateContainerResources is waiting for is held by a writer trying to sync data to storage, so we suspected the host was under I/O pressure at the time. With cgroup v2 enabled, this can be checked via the io pressure file (we will cover PSI in a later post), or with the iostat command:
goroutine 1268 [syscall]:
syscall.Syscall(0x0?, 0x0?, 0xc000984000?, 0x686?)
/usr/local/go/src/syscall/syscall_linux.go:68 +0x27
syscall.Fsync(0x0?)
/usr/local/go/src/syscall/zsyscall_linux_amd64.go:423 +0x2a
internal/poll.(*FD).Fsync.func1(...)
/usr/local/go/src/internal/poll/fd_fsync_posix.go:18
internal/poll.ignoringEINTR(...)
/usr/local/go/src/internal/poll/fd_posix.go:74
internal/poll.(*FD).Fsync(0x0?)
/usr/local/go/src/internal/poll/fd_fsync_posix.go:17 +0xf4
os.(*File).Sync(0xc00011a6b8)
/usr/local/go/src/os/file_posix.go:168 +0x4e
github.com/containerd/containerd/content/local.(*writer).Commit(0xc0008c7880, {0x56190174ac40, 0xc0006028a0}, 0x181ebe95, {0xc001476780, 0x47}, {0x0, 0x0, 0x5619023eb940?})
/go/src/github.com/containerd/containerd/content/local/writer.go:95 +0x24e
github.com/containerd/containerd/metadata.(*namespacedWriter).commit(0xc0006f5320, {0x56190174ac40, 0xc0006028a0}, 0x0?, 0x181ebe95, {0xc001476780, 0x47}, {0x0, 0x0, 0x0})
/go/src/github.com/containerd/containerd/metadata/content.go:641 +0x5bf
github.com/containerd/containerd/metadata.(*namespacedWriter).Commit.func1(0x0?)
/go/src/github.com/containerd/containerd/metadata/content.go:578 +0x8f
go.etcd.io/bbolt.(*DB).Update(0x0?, 0xc000bb01e0)
/go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:741 +0x82
github.com/containerd/containerd/metadata.(*DB).Update(0xc0001e92d0, 0x5619014f3000?)
/go/src/github.com/containerd/containerd/metadata/db.go:246 +0x9e
github.com/containerd/containerd/metadata.update({0x56190174ac40?, 0xc0006028a0?}, {0x561901738b90, 0xc0001e92d0}, 0xc000
To validate this line of reasoning, we took another node, removed dockershim, and rebooted the machine. We then observed a large number of image pulls (containerd's image storage is not shared with docker's, so the kubelet has to pull images before starting containers), and iostat showed very high I/O. Why?
Frequent image pulls and unpacks cause disk contention: as the ImagePull section in the technical background explains, ImagePull itself does not force all dirty pages to disk, but umount does.
So how do we solve this?
If we simply avoid the forced flush on umount, the symptom is alleviated no matter how many images are being pulled.

The Fix
We found the corresponding fix in the community: PR #8676 adds a mount option to the overlay plugin to enable volatile, a feature that requires overlayfs support in the kernel: https://docs.kernel.org/filesystems/overlayfs.html#volatile-mount.
Note: volatile skips sync calls, so umount no longer stalls waiting for data to be synced to disk.
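Assuming the configuration key added by that PR (the exact key name here is our assumption and should be verified against your containerd version's documentation), enabling volatile for the overlayfs snapshotter would look roughly like this in config.toml:

```toml
# Hypothetical sketch -- verify the key name for your containerd release.
[plugins."io.containerd.snapshotter.v1.overlayfs"]
  mount_options = ["volatile"]
```

Keep in mind that volatile trades crash-safety for speed: after an unclean shutdown, volatile overlay snapshots may be inconsistent and need to be discarded and re-unpacked.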
Given the author's limited time, perspective, and knowledge, this article inevitably contains errors and omissions; corrections and discussion from readers and industry experts are welcome. The troubleshooting details above have been adapted to community content.
References
1. https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/pleg
2. https://github.com/containerd/containerd/issues/2886
3. [overlay] add configurable mount options to overlay snapshotter
4. Containers slow to start after init container finishes
