Comparing the Server-Side Performance of Go's Standard net/http and fasthttp
1. Background
After writing the classic "hello, world" program, a Go beginner will probably be eager to try out Go's powerful standard library, for example by writing a fully functional web server like the one below in just a few lines of code:
// from https://tip.golang.org/pkg/net/http/#example_ListenAndServe
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    helloHandler := func(w http.ResponseWriter, req *http.Request) {
        io.WriteString(w, "Hello, world!\n")
    }
    http.HandleFunc("/hello", helloHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Go's net/http package is a fairly balanced, general-purpose implementation that covers more than 90% of the scenarios most gophers will ever face, and it has the following advantages:
- It is a standard library package, so no third-party dependencies are needed;
- It tracks the HTTP specifications well;
- It delivers relatively high performance without any tuning;
- It supports HTTP proxies;
- It supports HTTPS;
- It supports HTTP/2 seamlessly.
However, precisely because the http package is a "balanced" general-purpose implementation, in domains with strict performance requirements net/http may not be fast enough, and it does not leave much room for tuning. That is when we turn our attention to third-party http server frameworks.
Among those third-party frameworks, one whose name says it all, fasthttp[1], is mentioned and adopted most often. The fasthttp project claims its performance is ten times that of net/http (based on go test benchmark results).
fasthttp applies many performance-optimization best practices[2]. In particular, it reuses memory objects aggressively, making heavy use of sync.Pool[3] to reduce pressure on the Go GC.
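As a reminder of what that pattern looks like, here is a minimal illustrative sketch of typical sync.Pool usage (not fasthttp's actual code):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool reuses bytes.Buffer objects across requests so that
// short-lived allocations do not pile up work for the GC.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func handle(name string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset()      // clear contents before returning the buffer
        bufPool.Put(buf) // make it available for reuse
    }()
    fmt.Fprintf(buf, "Hello, %s!", name)
    return buf.String()
}

func main() {
    fmt.Println(handle("Go"))
}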
So how much faster than net/http is fasthttp in a real environment? I happen to have two reasonably powerful servers at hand, so in this article we will look at their actual performance in this real environment.
2. Performance Test
We implement two test targets with almost no business logic, one with net/http and one with fasthttp:
- nethttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/nethttp/main.go
package main

import (
    _ "expvar"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"
)

func main() {
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()
    http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Hello, Go!"))
    }))
    log.Fatal(http.ListenAndServe(":8080", nil))
}
- fasthttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/fasthttp/main.go
package main

import (
    _ "expvar"
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "time"

    "github.com/valyala/fasthttp"
)

type HelloGoHandler struct {
}

func fastHTTPHandler(ctx *fasthttp.RequestCtx) {
    fmt.Fprintln(ctx, "Hello, Go!")
}

func main() {
    go func() {
        http.ListenAndServe(":6060", nil)
    }()
    go func() {
        for {
            log.Println("current goroutine count:", runtime.NumGoroutine())
            time.Sleep(time.Second)
        }
    }()
    s := &fasthttp.Server{
        Handler: fastHTTPHandler,
    }
    s.ListenAndServe(":8081")
}
On the client side, we apply load to the test targets with the http load-testing tool hey[4]. To make it easy to adjust the load level, we wrap hey in the shell script below (it only runs on Linux):
# github.com/bigwhite/experiments/blob/master/http-benchmark/client/http_client_load.sh
# ./http_client_load.sh 3 10000 10 GET http://10.10.195.181:8080
echo "$0 task_num count_per_hey conn_per_hey method url"
task_num=$1
count_per_hey=$2
conn_per_hey=$3
method=$4
url=$5
start=$(date +%s%N)
for((i=1; i<=$task_num; i++)); do {
    tm=$(date +%T.%N)
    echo "$tm: task $i start"
    hey -n $count_per_hey -c $conn_per_hey -m $method $url > hey_$i.log
    tm=$(date +%T.%N)
    echo "$tm: task $i done"
} & done
wait
end=$(date +%s%N)
count=$(( $task_num * $count_per_hey ))
runtime_ns=$(( $end - $start ))
runtime=`echo "scale=2; $runtime_ns / 1000000000" | bc`
echo "runtime: "$runtime
speed=`echo "scale=2; $count / $runtime" | bc`
echo "speed: "$speed
An example run of the script looks like this:
bash http_client_load.sh 8 1000000 200 GET http://10.10.195.134:8080
http_client_load.sh task_num count_per_hey conn_per_hey method url
16:58:09.146948690: task 1 start
16:58:09.147235080: task 2 start
16:58:09.147290430: task 3 start
16:58:09.147740230: task 4 start
16:58:09.147896010: task 5 start
16:58:09.148314900: task 6 start
16:58:09.148446030: task 7 start
16:58:09.148930840: task 8 start
16:58:45.001080740: task 3 done
16:58:45.241903500: task 8 done
16:58:45.261501940: task 1 done
16:58:50.032383770: task 4 done
16:58:50.985076450: task 7 done
16:58:51.269099430: task 5 done
16:58:52.008164010: task 6 done
16:58:52.166402430: task 2 done
runtime: 43.02
speed: 185960.01
Judging from the arguments passed in, this run launches 8 tasks in parallel (each task starts one hey); each task opens 200 concurrent connections to http://10.10.195.134:8080 and sends a total of 1,000,000 HTTP GET requests.
We place the test target and the load-generation script on two separate servers:
- Server hosting the target programs: 10.10.195.181 (bare metal, Intel x86-64 CPU, 40 cores, 128GB RAM, CentOS 7.6)
$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.000
CPU max MHz:           2201.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d
- Server hosting the load tool: 10.10.195.133 (bare metal, Kunpeng arm64 CPU, 96 cores, 80GB RAM, CentOS 7.9)
# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (AltArch)
# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              49152K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
I used dstat to monitor resource usage on the host running the test target (dstat -tcdngym), with particular attention to CPU load, and used expvarmon to watch memstats[5] to see which resources the target program consumed most.
The table below was compiled from data gathered over multiple test runs:

3. A Brief Analysis of the Results
Given the specific scenario, the accuracy of the test tool and script, and the load-test environment, the results above have their limitations, but they do reflect the real performance trend of the test targets. We can see that under the same load, fasthttp is not 10x faster than net/http; in this particular scenario it does not even reach 2x: in the cases where CPU consumption on the target host approached 70%, fasthttp's performance was only about 30% to 70% higher than net/http's.
So why did fasthttp's performance fall short of expectations? To answer that, we need to look at how net/http and fasthttp are each implemented. Let's start with a diagram of how net/http works on the server side:

The principle of the http package on the server side is simple: once a connection (conn) is accepted, the conn is handed off to a worker goroutine, which keeps running until the conn's life cycle ends, i.e. until the connection is closed.
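Conceptually, that model boils down to something like the following sketch (a simplification for illustration, not the actual net/http source):

package main

import (
    "log"
    "net"
)

// serveConn stands in for the per-connection worker: in net/http this
// goroutine reads requests and writes responses until the conn closes.
func serveConn(c net.Conn) {
    defer c.Close()
    // ... read requests and write responses for the lifetime of c ...
}

func main() {
    ln, err := net.Listen("tcp", ":8080")
    if err != nil {
        log.Fatal(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            continue
        }
        go serveConn(c) // one goroutine per accepted connection
    }
}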
And here is a diagram of how fasthttp works:

fasthttp, by contrast, is built around a mechanism whose goal is to reuse goroutines as much as possible instead of creating a new one each time. After fasthttp's Server accepts a conn, it tries to take a channel out of the ready slice in its workerpool; each such channel corresponds to exactly one worker goroutine. Once a channel has been obtained, the accepted conn is written into it, and the worker goroutine at the other end of the channel handles the reads and writes on that conn. When it is done with the conn, the worker goroutine does not exit; instead it puts its channel back into the workerpool's ready slice, waiting to be taken out again.
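The sketch below captures the gist of that worker-reuse mechanism (simplified types, no idle-worker cleanup or capacity limits; it is an illustration of the pattern, not fasthttp's actual code):

package main

import (
    "net"
    "sync"
)

type workerPool struct {
    mu    sync.Mutex
    ready []chan net.Conn // each channel belongs to one idle worker goroutine
}

// getCh returns a channel bound to a worker goroutine, creating a new
// worker only when no idle one is available in ready.
func (wp *workerPool) getCh() chan net.Conn {
    wp.mu.Lock()
    defer wp.mu.Unlock()
    if n := len(wp.ready); n > 0 {
        ch := wp.ready[n-1]
        wp.ready = wp.ready[:n-1]
        return ch
    }
    ch := make(chan net.Conn, 1)
    go func() {
        for c := range ch {
            serveConn(c) // serve every request on this conn until it closes
            wp.mu.Lock()
            wp.ready = append(wp.ready, ch) // return to the pool instead of exiting
            wp.mu.Unlock()
        }
    }()
    return ch
}

func serveConn(c net.Conn) {
    defer c.Close()
    // ... read requests and write responses until the client closes the conn ...
}

func main() {
    wp := &workerPool{}
    ln, err := net.Listen("tcp", ":8081")
    if err != nil {
        panic(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            continue
        }
        wp.getCh() <- c // hand the conn to a (possibly reused) worker goroutine
    }
}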
fasthttp's goroutine-reuse strategy is well-intentioned, but in this test scenario its effect is not obvious, as the results show: under the same client concurrency and load, net/http uses roughly the same number of goroutines as fasthttp. This is a consequence of the test model: in our test, the hey in each task opens a fixed number of long-lived (keep-alive) connections to the target and then issues "saturating" requests on each of them. So once a goroutine in the fasthttp workerpool receives a conn, it can only be returned to the pool after communication on that conn ends, and the conn is not closed until the test finishes. A scenario like this effectively makes fasthttp "degenerate" into the net/http model, and it also picks up net/http's "defect": once the number of goroutines grows large, the overhead of the Go runtime's own scheduling becomes non-negligible and can even exceed the share of resources spent on business logic. Below are fasthttp's CPU profiles under 200, 8000, and 16000 long-lived connections respectively:
200 long-lived connections:
(pprof) top -cum
Showing nodes accounting for 88.17s, 55.35% of 159.30s total
Dropped 150 nodes (cum <= 0.80s)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
     0.46s  0.29%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*Server).serveConn
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.04s 0.025%  0.31%     89.46s 56.16%  internal/poll.ignoringEINTRIO (inline)
    87.38s 54.85% 55.17%     89.27s 56.04%  syscall.Syscall
     0.12s 0.075% 55.24%     60.39s 37.91%  bufio.(*Writer).Flush
         0     0% 55.24%     60.22s 37.80%  net.(*conn).Write
     0.08s  0.05% 55.29%     60.21s 37.80%  net.(*netFD).Write
     0.09s 0.056% 55.35%     60.12s 37.74%  internal/poll.(*FD).Write
         0     0% 55.35%     59.86s 37.58%  syscall.Write (inline)
(pprof)
8000 long-lived connections:
(pprof) top -cum
Showing nodes accounting for 108.51s, 54.46% of 199.23s total
Dropped 204 nodes (cum <= 1s)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.69s  0.35%  0.35%    119.05s 59.76%  github.com/valyala/fasthttp.(*Server).serveConn
     0.04s  0.02%  0.37%    104.22s 52.31%  internal/poll.ignoringEINTRIO (inline)
   101.58s 50.99% 51.35%    103.95s 52.18%  syscall.Syscall
     0.10s  0.05% 51.40%     79.95s 40.13%  runtime.mcall
     0.06s  0.03% 51.43%     79.85s 40.08%  runtime.park_m
     0.23s  0.12% 51.55%     79.30s 39.80%  runtime.schedule
     5.67s  2.85% 54.39%     77.47s 38.88%  runtime.findrunnable
     0.14s  0.07% 54.46%     68.96s 34.61%  bufio.(*Writer).Flush
16000 long-lived connections:
(pprof) top -cum
Showing nodes accounting for 239.60s, 87.07% of 275.17s total
Dropped 190 nodes (cum <= 1.38s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     0.04s  0.015% 0.015%    153.38s 55.74%  runtime.mcall
     0.01s 0.0036% 0.018%    153.34s 55.73%  runtime.park_m
     0.12s  0.044% 0.062%       153s 55.60%  runtime.schedule
     0.66s   0.24%   0.3%    152.66s 55.48%  runtime.findrunnable
     0.15s  0.055%  0.36%    127.53s 46.35%  runtime.netpoll
   127.04s  46.17% 46.52%    127.04s 46.17%  runtime.epollwait
         0      0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0      0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.41s   0.15% 46.67%    120.18s 43.67%  github.com/valyala/fasthttp.(*Server).serveConn
   111.17s  40.40% 87.07%    111.99s 40.70%  syscall.Syscall
(pprof)
Comparing these profiles, we can see that as the number of long-lived connections grows (that is, as the number of goroutines in the workerpool grows), the share of CPU taken by Go runtime scheduling gradually rises; at 16000 connections, the runtime scheduling functions already occupy the top four spots.
4. Optimization Paths
From the test results above, we can see that fasthttp's model is not a good fit for scenarios where connections are established and then carry a continuous "saturating" request load. It is better suited to short-lived connections, or to long-lived connections without continuously saturating requests; only in such scenarios can its goroutine-reuse model really pay off.
Even so, when fasthttp "degenerates" into the net/http model, it still performs slightly better than net/http. Why? The gain comes mainly from fasthttp's optimization tricks at the memory-allocation level, such as the heavy use of sync.Pool and avoiding conversions between []byte and string.
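One example of the latter is the zero-copy []byte-to-string conversion widely used in this kind of code. The sketch below only illustrates the idea; it relies on unsafe and is safe only when the byte slice is never modified afterwards:

package main

import (
    "fmt"
    "unsafe"
)

// b2s converts a []byte to a string without copying the underlying data,
// saving one allocation per conversion. The caller must not modify b after
// the call, because the returned string shares its memory.
func b2s(b []byte) string {
    return *(*string)(unsafe.Pointer(&b))
}

func main() {
    b := []byte("Hello, Go!")
    fmt.Println(b2s(b))
}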
So, in a scenario with continuously "saturating" requests, how can we keep the number of goroutines in the fasthttp workerpool from growing linearly with the number of conns? fasthttp itself does not offer an answer, but one path worth considering is to use the OS's I/O multiplexing (implemented on Linux as epoll), i.e. the same mechanism the Go runtime netpoller uses. With multiplexing, each goroutine in the workerpool can handle multiple connections at the same time, which lets us size the workerpool according to the scale of the business instead of letting the goroutine count grow almost without bound, as it does now. Of course, introducing epoll at the user level may also bring problems of its own, such as a higher share of system calls and increased response latency. Whether this path is viable ultimately depends on a concrete implementation and its test results.
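To make the idea concrete, here is a rough, heavily simplified sketch of a single event-loop goroutine driving many connections through epoll via golang.org/x/sys/unix. It is only a toy illustration of the direction, not a production server: it ignores partial reads, request parsing, keep-alive semantics, and most error handling, and the port 8082 is an arbitrary value for the sketch. In a real design one would run a small, fixed number of such loops:

//go:build linux

package main

import (
    "log"

    "golang.org/x/sys/unix"
)

func main() {
    // create a listening socket on :8082
    lfd, _ := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, 0)
    unix.SetsockoptInt(lfd, unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
    if err := unix.Bind(lfd, &unix.SockaddrInet4{Port: 8082}); err != nil {
        log.Fatal(err)
    }
    unix.Listen(lfd, 1024)

    // register the listener with epoll
    epfd, _ := unix.EpollCreate1(0)
    unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, lfd,
        &unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(lfd)})

    resp := []byte("HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\nHello, Go!")
    events := make([]unix.EpollEvent, 128)
    buf := make([]byte, 4096)

    // one goroutine, many connections: every readable fd is handled here
    for {
        n, err := unix.EpollWait(epfd, events, -1)
        if err != nil {
            continue // e.g. EINTR
        }
        for i := 0; i < n; i++ {
            fd := int(events[i].Fd)
            if fd == lfd {
                // new connection: add it to the same epoll instance
                cfd, _, _ := unix.Accept(lfd)
                unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, cfd,
                    &unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(cfd)})
                continue
            }
            if m, _ := unix.Read(fd, buf); m <= 0 {
                unix.Close(fd) // peer closed or read error
                continue
            }
            unix.Write(fd, resp) // assume one full request was read
        }
    }
}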
Note: the Concurrency field of fasthttp.Server can be used to cap the number of goroutines handling requests concurrently in the workerpool, but because each goroutine serves only one connection, setting Concurrency too low means that subsequent connections may be rejected by fasthttp. That is why fasthttp's default Concurrency is:
const DefaultConcurrency = 256 * 1024
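Setting the field is straightforward; a minimal sketch (the cap of 10000 here is just an arbitrary illustrative value):

package main

import (
    "fmt"
    "log"

    "github.com/valyala/fasthttp"
)

func main() {
    s := &fasthttp.Server{
        Handler: func(ctx *fasthttp.RequestCtx) {
            fmt.Fprintln(ctx, "Hello, Go!")
        },
        // Concurrency caps how many connections are served concurrently
        // (one workerpool goroutine per connection); connections beyond
        // the cap are rejected by the server.
        Concurrency: 10000,
    }
    log.Fatal(s.ListenAndServe(":8081"))
}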
The source code involved in this article can be downloaded here[6]: github.com/bigwhite/experiments/blob/master/http-benchmark.
References
[1] fasthttp: https://github.com/valyala/fasthttp
[2] performance-optimization best practices: https://github.com/valyala/fasthttp#fasthttp-best-practices
[3] sync.Pool: https://www.imooc.com/read/87/article/2432
[4] hey: https://github.com/rakyll/hey
[5] monitoring memstats with expvarmon: https://mp.weixin.qq.com/s/cr2JeUq5HOYQC0qji_Ip5g
[6] here: github.com/bigwhite/experiments/blob/master/http-benchmark