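Why the clock source affects performance
A few days ago, while helping a colleague debug an issue, I stumbled on a case where the clock source hurt performance. It is fairly typical, so I am writing it down. Others have run into it as well; see Shopee's [Go] Time.Now function abnormal CPU usage[1] and Two frequently used system calls are ~77% slower on AWS EC2[2].
In both cases the root cause is the same: the vdso falls back to a real system call, which is slow. What triggers the fallback differs, though. My analysis below may contain mistakes; discussion and corrections are welcome.
Also, the cover image means: knowing this much is enough, and reading further is not much use :(
Symptoms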

The figure above is a perf profile. The __clock_gettime system call path accounts for the largest share of the time, which is very odd.
// time_demo.go
// strace -ce clock_gettime go run time_demo.go
package main

import (
	"fmt"
	"time"
)

func main() {
	for i := 0; i < 10; i++ {
		t1 := time.Now()
		t2 := time.Now()
		fmt.Printf("Time taken: %v\n", t2.Sub(t1))
	}
}
The code above is a minimal reproduction that measures the cost of time.Now() directly. strace -ce prints a summary of the system calls made:
~# strace -ce clock_gettime go run time_demo.go
Time taken: 1.983μs
Time taken: 1.507μs
Time taken: 2.247μs
Time taken: 2.993μs
Time taken: 2.703μs
Time taken: 1.927μs
Time taken: 2.091μs
Time taken: 2.16μs
Time taken: 2.085μs
Time taken: 2.234μs
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.001342          13       105           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.001342                   105           total
The output above is from the problematic machine: a large number of clock_gettime system calls are made.
~# strace -ce clock_gettime go run time_demo.go
Time taken: 138ns
Time taken: 94ns
Time taken: 73ns
Time taken: 88ns
Time taken: 87ns
Time taken: 83ns
Time taken: 93ns
Time taken: 78ns
Time taken: 93ns
Time taken: 99ns
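The output above is from a machine with normal performance: each call takes tens of nanoseconds, roughly 20 times faster, and no system call is made at all. Imagine a service where, for every request, many modules record timings for P99 statistics: if time.Now itself is this expensive, the service is basically unusable.
On the problematic machine the system call looks like this: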
clock_gettime(CLOCK_MONOTONIC, {tv_sec=857882, tv_nsec=454310014}) = 0
The test kernel is 5.4.0-1038.
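To check a machine quickly without strace, the cost of time.Now() can be averaged over many calls. The following is a minimal sketch, not part of the original investigation: the helper name avgNowCost, the call count and the 500ns threshold are assumptions made for illustration, with the threshold picked only to sit between the two sets of numbers shown above.

```go
package main

import (
	"fmt"
	"time"
)

// avgNowCost calls time.Now n times and returns the average cost per call.
// It returns 0 when n is not positive.
func avgNowCost(n int) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = time.Now()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	const calls = 1000000 // assumed sample size
	avg := avgNowCost(calls)
	fmt.Printf("average time.Now() cost over %d calls: %v\n", calls, avg)

	// Hypothetical threshold: the healthy machine above stays under 150ns,
	// the problematic one is above 1.5µs.
	if avg > 500*time.Nanosecond {
		fmt.Println("time.Now() looks slow; the vdso may be falling back to a system call")
	}
}
```

Averaging over a large number of calls smooths out scheduler noise, which a single pair of time.Now() calls cannot do.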
time.Now()
Let's look at how Go implements time.Now:
// src/runtime/timestub.go
//go:linkname time_now time.now
func time_now() (sec int64, nsec int32, mono int64) {
	sec, nsec = walltime()
	return sec, nsec, nanotime()
}
The time package only exposes the function declaration; the implementation lives in platform-specific assembly underneath. Focusing on amd64 for now, here is the assembly:
// src/runtime/sys_linux_amd64.s
// func walltime1() (sec int64, nsec int32)
// non-zero frame-size means bp is saved and restored
TEXT runtime·walltime1(SB),NOSPLIT,$8-12
	......
noswitch:
	SUBQ	$16, SP		// Space for results
	ANDQ	$~15, SP	// Align for C code
	MOVQ	runtime·vdsoClockgettimeSym(SB), AX
	......
So the question is: what is the vdso?
System calls
Everyone knows system calls are slow: they trap into the kernel and pay for the switch. But how slow, exactly?

The figure above compares the cost of system calls with ordinary function calls; see [Measurements of system call performance and overhead](http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead). getpid made as a real system call costs far more than a call served through the vdso, and far more than an ordinary function call.
vdso stands for virtual dynamic shared object; see vdso man7[3]. It exists because system calls are slow and involve a switch into the kernel, while a handful of frequently used system calls account for most of the time spent. So the ones that are not security-sensitive are mapped from kernel space into user space.
x86-64 functions
    The table below lists the symbols exported by the vDSO.  All of
    these symbols are also available without the "__vdso_" prefix,
    but you should ignore those and stick to the names below.

    symbol                 version
    ─────────────────────────────────
    __vdso_clock_gettime   LINUX_2.6
    __vdso_getcpu          LINUX_2.6
    __vdso_gettimeofday    LINUX_2.6
    __vdso_time            LINUX_2.6
Those are the functions the vdso supports on x86, only 4 in total? That seems too few. Let's see what a real production machine has:
~# uname -a
Linux 5.4.0-1041-aws #43~18.04.1-Ubuntu SMP Sat Mar 20 15:47:52 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
~# cat /proc/self/maps | grep -i vdso
7fff2edff000-7fff2ee00000 r-xp 00000000 00:00 0 [vdso]
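The kernel version is 5.4.0. /proc/self/maps shows the vdso mapping of the current process; its permissions are r-xp, that is, readable and executable but not writable. We can dump it and take a look. First run cat in another session and leave it waiting for input, then attach with gdb: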
~# ps aux | grep cat
root 9869 0.0 0.0 9360 792 pts/1 S+ 02:18 0:00 cat
root 9931 0.0 0.0 16152 1100 pts/0 S+ 02:18 0:00 grep --color=auto cat
~# cat /proc/9869/maps | grep -i vdso
7ffe717e6000-7ffe717e7000 r-xp 00000000 00:00 0 [vdso]
~# gdb /bin/cat 9869
...........
(gdb) dump memory /tmp/vdso.so 0x7ffe717e6000 0x7ffe717e7000
(gdb) quit
Then inspect the symbol table:
~# file /tmp/vdso.so
/tmp/vdso.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=17d65245b85cd032de7ab130d053551fb0bd284a, stripped
~# objdump -T /tmp/vdso.so
/tmp/vdso.so: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
0000000000000950 w DF .text 00000000000000a1 LINUX_2.6 clock_gettime
00000000000008a0 g DF .text 0000000000000083 LINUX_2.6 __vdso_gettimeofday
0000000000000a00 w DF .text 000000000000000a LINUX_2.6 clock_getres
0000000000000a00 g DF .text 000000000000000a LINUX_2.6 __vdso_clock_getres
00000000000008a0 w DF .text 0000000000000083 LINUX_2.6 gettimeofday
0000000000000930 g DF .text 0000000000000015 LINUX_2.6 __vdso_time
0000000000000930 w DF .text 0000000000000015 LINUX_2.6 time
0000000000000950 g DF .text 00000000000000a1 LINUX_2.6 __vdso_clock_gettime
0000000000000000 g DO *ABS* 0000000000000000 LINUX_2.6 LINUX_2.6
0000000000000a10 g DF .text 000000000000002a LINUX_2.6 __vdso_getcpu
0000000000000a10 w DF .text 000000000000002a LINUX_2.6 getcpu
Why go to all this trouble? Because vdso.so is maintained in memory and, unlike other shared libraries, has no corresponding file on disk.
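The address range passed to gdb's dump memory was read off /proc/<pid>/maps by hand. As a rough sketch, the same lookup can be done in Go for the current process; parseVDSORange is a hypothetical helper written for this post, and the program only works on Linux.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVDSORange extracts the start and end addresses from one line of
// /proc/<pid>/maps. The bool result is false unless the line is the
// [vdso] mapping and its address range parses as hexadecimal.
func parseVDSORange(line string) (uint64, uint64, bool) {
	fields := strings.Fields(line)
	if len(fields) < 6 || fields[len(fields)-1] != "[vdso]" {
		return 0, 0, false
	}
	addrs := strings.SplitN(fields[0], "-", 2)
	if len(addrs) != 2 {
		return 0, 0, false
	}
	start, err := strconv.ParseUint(addrs[0], 16, 64)
	if err != nil {
		return 0, 0, false
	}
	end, err := strconv.ParseUint(addrs[1], 16, 64)
	if err != nil {
		return 0, 0, false
	}
	return start, end, true
}

func main() {
	f, err := os.Open("/proc/self/maps")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot open /proc/self/maps (Linux only):", err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if start, end, ok := parseVDSORange(sc.Text()); ok {
			fmt.Printf("vdso: 0x%x-0x%x (%d bytes)\n", start, end, end-start)
			return
		}
	}
	fmt.Println("no [vdso] mapping found")
}
```

The two addresses it prints are the values that the gdb session above passes to dump memory.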
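After all that, the question remains: with the vdso in place, why does getting the time still go through a system call?
Clock sources
On clock sources, the following quotation is from muahao:
During boot the kernel picks a clock source according to a fixed priority order, ranked by precision and access speed. The TSC register in the CPU is the most precise (it ticks at the CPU's maximum frequency) and the fastest to read (a single instruction, one clock cycle), so the kernel prefers the TSC as its timekeeping clock source. Other clock sources such as HPET, ACPI-PM and PIT serve as alternatives. Unlike HPET and the others, however, the TSC's frequency is not known in advance, so during initialization the kernel must calibrate the TSC frequency against HPET, PIT or another clock. If two calibration runs disagree significantly, the kernel considers the TSC unstable, uses another clock source, and logs: Clocksource tsc unstable.
Normally the TSC frequency is stable and unaffected by CPU frequency scaling (if the CPU supports constant-tsc), so the kernel should not detect it as unstable. However, there is a kind of interrupt called SMI (System Management Interrupt) that the operating system can neither observe nor mask. If the kernel's TSC calibration routine quick_pit_calibrate() is disturbed by an SMI, the result can be off by a large margin (more than 1%), leaving the TSC base frequency inaccurate. In the end, timestamps on the machine become inaccurate and may run slow or fast.
When the kernel decides the TSC is unstable and switches to HPET or another clock, the impact on your system is not severe, although clock precision and read speed do suffer. In experiments, reading HPET costs about 7 times as much as reading the TSC. If your system cannot tolerate that, you can try adding the boot parameter tsc=reliable to the kernel command line.
Kernel implementation
1. Registering clock sources
See the timers chapter of linux insides[4]. Each clock source registers itself by calling clocksource_register_khz (or clocksource_register_hz, as xen does below). Let's look at tsc and xen in turn: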
static int __init init_tsc_clocksource(void)
{
	......
	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
		if (boot_cpu_has(X86_FEATURE_ART))
			art_related_clocksource = &clocksource_tsc;
		clocksource_register_khz(&clocksource_tsc, tsc_khz);
		......
}

static struct clocksource clocksource_tsc = {
	.name			= "tsc",
	.rating			= 300,
	.read			= read_tsc,
	.mask			= CLOCKSOURCE_MASK(64),
	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
				  CLOCK_SOURCE_VALID_FOR_HRES |
				  CLOCK_SOURCE_MUST_VERIFY,
	.archdata		= { .vclock_mode = VCLOCK_TSC },
	.resume			= tsc_resume,
	.mark_unstable		= tsc_cs_mark_unstable,
	.tick_stable		= tsc_cs_tick_stable,
	.list			= LIST_HEAD_INIT(clocksource_tsc.list),
};
The vclock_mode of the clocksource_tsc clock source is VCLOCK_TSC.
static void __init xen_time_init(void)
{
	......
	clocksource_register_hz(&xen_clocksource, NSEC_PER_SEC);
	......
}

static void xen_setup_vsyscall_time_info(void)
{
	......
	xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK;
}
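The vclock_mode of the xen clock source is VCLOCK_PVCLOCK.
2. Clock sources and the timekeeper
So how is a clocksource tied to vdso_data? This part is fairly involved. Following Timers and time management in the Linux kernel[5] and an article on how the vdso data segment is updated leads to the timekeeping_update function in kernel/time/timekeeping.c, which is responsible for propagating timekeeping updates to the vdso area visible to user space.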
/* must hold timekeeper_lock */
static void timekeeping_update(struct timekeeper *tk, unsigned int action)
{
	......
	update_vsyscall(tk);
	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
	......
}

void update_vsyscall(struct timekeeper *tk)
{
	struct vdso_data *vdata = __arch_get_k_vdso_data();
	struct vdso_timestamp *vdso_ts;
	s32 clock_mode;
	u64 nsec;

	/* copy vsyscall data */
	vdso_write_begin(vdata);

	clock_mode = tk->tkr_mono.clock->vdso_clock_mode;
	vdata[CS_HRES_COARSE].clock_mode	= clock_mode;
	vdata[CS_RAW].clock_mode		= clock_mode;

	/* CLOCK_REALTIME also required for time() */
	vdso_ts		= &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME];
	vdso_ts->sec	= tk->xtime_sec;
	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec;

	/* CLOCK_REALTIME_COARSE */
	vdso_ts		= &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
	vdso_ts->sec	= tk->xtime_sec;
	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;

	/* CLOCK_MONOTONIC_COARSE */
	vdso_ts		= &vdata[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);

	/*
	 * Read without the seqlock held by clock_getres().
	 * Note: No need to have a second copy.
	 */
	WRITE_ONCE(vdata[CS_HRES_COARSE].hrtimer_res, hrtimer_resolution);

	/*
	 * If the current clocksource is not VDSO capable, then spare the
	 * update of the high reolution parts.
	 */
	if (clock_mode != VDSO_CLOCKMODE_NONE)
		update_vdso_data(vdata, tk);

	__arch_update_vsyscall(vdata, tk);

	vdso_write_end(vdata);

	__arch_sync_vdso_data(vdata);
}

static void update_pvclock_gtod(struct timekeeper *tk)
{
	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
	u64 boot_ns;

	boot_ns = ktime_to_ns(ktime_add(tk->tkr_mono.base, tk->offs_boot));

	write_seqcount_begin(&vdata->seq);

	/* copy pvclock gtod data */
	vdata->clock.vclock_mode	= tk->tkr_mono.clock->archdata.vclock_mode;
	vdata->clock.cycle_last		= tk->tkr_mono.cycle_last;
	vdata->clock.mask		= tk->tkr_mono.mask;
	vdata->clock.mult		= tk->tkr_mono.mult;
	vdata->clock.shift		= tk->tkr_mono.shift;

	vdata->boot_ns			= boot_ns;
	vdata->nsec_base		= tk->tkr_mono.xtime_nsec;

	vdata->wall_time_sec		= tk->xtime_sec;

	write_seqcount_end(&vdata->seq);
}

The code above comes from the arm vdso implementation; the x86 one is similar.
Next, how does the timekeeper get paired with a clocksource? That happens in timekeeping_init:
void __init timekeeping_init(void)
{
	struct timespec64 wall_time, boot_offset, wall_to_mono;
	struct timekeeper *tk = &tk_core.timekeeper;
	struct clocksource *clock;
	......
	clock = clocksource_default_clock();
	if (clock->enable)
		clock->enable(clock);
	tk_setup_internals(tk, clock);
	...
}
That is the initialization path. Whenever the clock source changes, change_clocksource is called to switch.
3. How the time functions are called
// linux/lib/vdso/gettimeofday.c
static __maybe_unused int
__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
{
	int ret = __cvdso_clock_gettime_common(clock, ts);

	if (unlikely(ret))
		return clock_gettime_fallback(clock, ts);
	return 0;
}

static __always_inline
long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
	long ret;

	asm ("syscall" : "=a" (ret), "=m" (*_ts) :
	     "0" (__NR_clock_gettime), "D" (_clkid), "S" (_ts) :
	     "rcx", "r11");

	return ret;
}
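Start with the fallback logic: it is simply a syscall instruction in inline assembly. Note that this assembly is platform-specific; this is the x86 version. unlikely is a branch-prediction hint, meaning the branch is expected to be taken rarely. A non-zero ret means the vdso failed to obtain the time, so let's see when __cvdso_clock_gettime_common fails.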
static __maybe_unused int
__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
{
	const struct vdso_data *vd = __arch_get_vdso_data();
	u32 msk;

	/* Check for negative values or invalid clocks */
	if (unlikely((u32) clock >= MAX_CLOCKS))
		return -1;

	/*
	 * Convert the clockid to a bitmask and use it to check which
	 * clocks are handled in the VDSO directly.
	 */
	msk = 1U << clock;
	if (likely(msk & VDSO_HRES)) {
		return do_hres(&vd[CS_HRES_COARSE], clock, ts);
	} else if (msk & VDSO_COARSE) {
		do_coarse(&vd[CS_HRES_COARSE], clock, ts);
		return 0;
	} else if (msk & VDSO_RAW) {
		return do_hres(&vd[CS_RAW], clock, ts);
	}
	return -1;
}
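The bitmask dispatch in __cvdso_clock_gettime_common can be sketched in Go as follows. This is an illustration only: the clock ID values are the ones from the Linux UAPI headers, the three masks follow my reading of include/vdso/datapage.h and should be treated as an assumption, and classify is a hypothetical helper that returns a label instead of reading a clock.

```go
package main

import "fmt"

// Clock IDs from the Linux UAPI (include/uapi/linux/time.h).
const (
	clockRealtime         = 0
	clockMonotonic        = 1
	clockProcessCPUTimeID = 2
	clockMonotonicRaw     = 4
	clockRealtimeCoarse   = 5
	clockMonotonicCoarse  = 6
	clockBoottime         = 7
	clockTAI              = 11
	maxClocks             = 16
)

// Masks of the clocks the vdso handles itself (assumed to match
// VDSO_HRES, VDSO_COARSE and VDSO_RAW in the kernel).
const (
	vdsoHres   = 1<<clockRealtime | 1<<clockMonotonic | 1<<clockBoottime | 1<<clockTAI
	vdsoCoarse = 1<<clockRealtimeCoarse | 1<<clockMonotonicCoarse
	vdsoRaw    = 1 << clockMonotonicRaw
)

// classify reports which vdso path a clock ID would take, or "fallback"
// when the vdso does not handle it and a system call is required.
func classify(clock int32) string {
	// Negative IDs become huge values after the unsigned conversion,
	// so one comparison rejects both negative and out-of-range IDs.
	if uint32(clock) >= maxClocks {
		return "fallback"
	}
	msk := uint32(1) << uint32(clock)
	switch {
	case msk&vdsoHres != 0:
		return "hres"
	case msk&vdsoCoarse != 0:
		return "coarse"
	case msk&vdsoRaw != 0:
		return "raw"
	}
	return "fallback"
}

func main() {
	fmt.Println(classify(clockMonotonic))        // hres
	fmt.Println(classify(clockMonotonicCoarse))  // coarse
	fmt.Println(classify(clockMonotonicRaw))     // raw
	fmt.Println(classify(clockProcessCPUTimeID)) // fallback
	fmt.Println(classify(-1))                    // fallback
}
```

CLOCK_MONOTONIC, the clock seen in the strace output earlier, lands on the "hres" path, which is why do_hres is the function worth reading.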
Here we only look at do_hres:
static int do_hres(const struct vdso_data *vd, clockid_t clk,
		   struct __kernel_timespec *ts)
{
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
	u64 cycles, last, sec, ns;
	u32 seq;

	do {
		seq = vdso_read_begin(vd);
		cycles = __arch_get_hw_counter(vd->clock_mode);
		ns = vdso_ts->nsec;
		last = vd->cycle_last;
		if (unlikely((s64)cycles < 0))
			return -1;

		ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult);
		ns >>= vd->shift;
		sec = vdso_ts->sec;
	} while (unlikely(vdso_read_retry(vd, seq)));

	/*
	 * Do this outside the loop: a race inside the loop could result
	 * in __iter_div_u64_rem() being extremely slow.
	 */
	ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;

	return 0;
}
__arch_get_hw_counter computes cycles from clock_mode. cycles is a u64; if it is negative when cast to s64, do_hres returns -1, which triggers the fallback to the system call.
static inline u64 __arch_get_hw_counter(s32 clock_mode)
{
	if (clock_mode == VCLOCK_TSC)
		return (u64)rdtsc_ordered();
	/*
	 * For any memory-mapped vclock type, we need to make sure that gcc
	 * doesn't cleverly hoist a load before the mode check. Otherwise we
	 * might end up touching the memory-mapped page even if the vclock in
	 * question isn't enabled, which will segfault. Hence the barriers.
	 */
#ifdef CONFIG_PARAVIRT_CLOCK
	if (clock_mode == VCLOCK_PVCLOCK) {
		barrier();
		return vread_pvclock();
	}
#endif
#ifdef CONFIG_HYPERV_TIMER
	if (clock_mode == VCLOCK_HVCLOCK) {
		barrier();
		return vread_hvclock();
	}
#endif
	return U64_MAX;
}

static u64 vread_pvclock(void)
{
	......
	do {
		version = pvclock_read_begin(pvti);

		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
			return U64_MAX;

		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
	} while (pvclock_read_retry(pvti, version));

	return ret;
}
If flags does not have PVCLOCK_TSC_STABLE_BIT set, this returns U64_MAX. So when is that bit missing?
static int kvm_guest_time_update(struct kvm_vcpu *v)
{
	......
	u64 tsc_timestamp, host_tsc;
	struct kvm_arch *ka = &v->kvm->arch;
	u8 pvclock_flags;
	bool use_master_clock;
	......
	use_master_clock = ka->use_master_clock;
	......
	if (use_master_clock)
		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
}

/*
 *
 * Assuming a stable TSC across physical CPUS, and a stable TSC
 * across virtual CPUs, the following condition is possible.
 * Each numbered line represents an event visible to both
 * CPUs at the next numbered event.
 */
static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
{
	......
	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
				&& !ka->backwards_tsc_observed
				&& !ka->boot_vcpu_runs_old_kvmclock;
	......
}
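In other words, use_master_clock is set to true only when the host uses the tsc clocksource, the vCPUs match, no backwards TSC has been observed, and the boot vCPU is not running the old kvmclock; otherwise it is false.
So here is the thing: this machine is an AWS p3.2xlarge machine-learning instance, and I suspect the problem is tied to the host. I tried other instances in the c5 family, and they no longer offer the xen clocksource (only tsc, kvm-clock and acpi_pm); in my tests the kvm-clock source also works with the vdso. According to the official blog post on GPU instances[6], the newer Nitro virtualization technology no longer has this problem.
After all this analysis, I may have analyzed nothing at all...
Fix
Of course, for older hardware or older kernels a fix is still necessary.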
~# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
~# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
xen
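The current clock source is xen; writing tsc into current_clocksource is all it takes.
~# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
There is one more case: the kernel has marked the tsc as untrustworthy (Clocksource tsc unstable). Then the only options are to reboot, or to specify tsc=reliable on the kernel command line at boot; see manage-ec2-linux-clock-source[7].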
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200 clocksource=tsc tsc=reliable"
Then generate the grub.cfg configuration file with grub2-mkconfig -o /boot/grub2/grub.cfg.
Summary
That is all for this post. More will follow in later posts.
References
[1] [Go] Time.Now function abnormal CPU usage (Shopee): https://mp.weixin.qq.com/s/D2ulLXDFpi0FwVRwSQJ0nA
[2] Two frequently used system calls are ~77% slower on AWS EC2: https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
[3] vdso man7: https://man7.org/linux/man-pages/man7/vdso.7.html
[4] linux insides: https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html
[5] Timers and time management in the Linux kernel: https://garlicspace.com/2020/06/07/linux%E5%86%85%E6%A0%B8%E4%B8%AD%E7%9A%84%E5%AE%9A%E6%97%B6%E5%99%A8%E5%92%8C%E6%97%B6%E9%97%B4%E7%AE%A1%E7%90%86-part-7/
[6] Official blog post on GPU instances: https://aws.amazon.com/cn/blogs/china/using-rekognition-realize-serverless-intelligent-album-playing-with-gpu-instance-iii-system-optimization/
[7] manage-ec2-linux-clock-source: https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/
