記一次 .NET 某電商定向爬蟲 內(nèi)存碎片化分析
一:背景
1. 講故事
上個(gè)月有位朋友wx找到我,說他的程序存在內(nèi)存泄漏問題,尋求如何解決?如下圖所示:

從截圖中可以看出,這位朋友對(duì) windbg 的操作還是有些熟悉的,可能缺乏一定的實(shí)操經(jīng)驗(yàn),所以用了幾個(gè)命令之后就不知道怎么排查下去了。
既然找到我,那就以我的個(gè)人經(jīng)驗(yàn)在他的dump上繼續(xù)分析尋找罪魁禍?zhǔn)祝e話不多說,上windbg說話。
二:Windbg 分析
1. 真的存在內(nèi)存泄漏嗎?
追這個(gè)系列的朋友應(yīng)該知道,我無數(shù)次的用 !address -summary 和 !eeheap -gc 這兩個(gè)命令來判斷當(dāng)前的內(nèi)存泄漏是屬于托管層還是非托管層。
0:000>?!address?-summary
---?Usage?Summary?----------------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
Free????????????????????????????????????358?????7dfc`67f60000?(?125.986?TB)???????????98.43%
??????????????????????????????1087??????203`88b6e000?(???2.014?TB)??99.99%????1.57%
Image??????????????????????????????????1532????????0`09f11000?(?159.066?MB)???0.01%????0.00%
Heap????????????????????????????????????249????????0`03453000?(??52.324?MB)???0.00%????0.00%
Stack????????????????????????????????????66????????0`01fc0000?(??31.750?MB)???0.00%????0.00%
Other????????????????????????????????????10????????0`001d1000?(???1.816?MB)???0.00%????0.00%
TEB??????????????????????????????????????22????????0`0002c000?(?176.000?kB)???0.00%????0.00%
PEB???????????????????????????????????????1????????0`00001000?(???4.000?kB)???0.00%????0.00%
---?Type?Summary?(for?busy)?------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
MEM_MAPPED??????????????????????????????183??????200`00d06000?(???2.000?TB)??99.30%????1.56%
MEM_PRIVATE????????????????????????????1252????????3`8d479000?(??14.207?GB)???0.69%????0.01%
MEM_IMAGE??????????????????????????????1532????????0`09f11000?(?159.066?MB)???0.01%????0.00%
---?State?Summary?----------------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal
MEM_FREE????????????????????????????????358?????7dfc`67f60000?(?125.986?TB)???????????98.43%
MEM_RESERVE?????????????????????????????749??????200`28a9b000?(???2.001?TB)??99.33%????1.56%
MEM_COMMIT?????????????????????????????2218????????3`6f5f5000?(??13.740?GB)???0.67%????0.01%
0:000>?!eeheap?-gc
Number?of?GC?Heaps:?1
generation?0?starts?at?0x00000026DA8DA928
generation?1?starts?at?0x00000026DA7FC348
generation?2?starts?at?0x00000024C4691000
ephemeral?segment?allocation?context:?none
?????????segment?????????????begin?????????allocated??????????????size
00000024C4690000??00000024C4691000??00000024D468FF28??0xfffef28(268431144)
00000024EECF0000??00000024EECF1000??00000024FECF0000??0xffff000(268431360)
000000248D6F0000??000000248D6F1000??000000249D6EFEF8??0xfffeef8(268431096)
...
00000026D66D0000??00000026D66D1000??00000026DBA3CA30??0x536ba30(87472688)
Large?object?heap?starts?at?0x00000024D4691000
?????????segment?????????????begin?????????allocated??????????????size
00000024D4690000??00000024D4691000??00000024DC67C318??0x7feb318(134132504)
00000024E60F0000??00000024E60F1000??00000024EE0637C8??0x7f727c8(133638088)
0000002482140000??0000002482141000??000000248A08F338??0x7f4e338(133489464)
00000024A6770000??00000024A6771000??00000024AE76F6C0??0x7ffe6c0(134211264)
...
000000278E6D0000??000000278E6D1000??000000279635F2D0??0x7c8e2d0(130605776)
00000029233E0000??00000029233E1000??000000292AF672F8??0x7b862f8(129524472)
000000292B3E0000??000000292B3E1000??0000002931A5ED60??0x667dd60(107470176)
000000299B3E0000??000000299B3E1000??00000029A20095B0??0x6c285b0(113411504)
000000281E6D0000??000000281E6D1000??0000002825CD3F58??0x7602f58(123744088)
00000028266D0000??00000028266D1000??000000282D5CAD50??0x6ef9d50(116366672)
000000282E6D0000??000000282E6D1000??0000002833CA0880??0x55cf880(89979008)
00000029A33E0000??00000029A33E1000??00000029A684D300??0x346c300(54969088)
Total?Size:??????????????Size:?0x353f96d88?(14293757320)?bytes.
------------------------------
GC?Heap?Size:????Size:?0x353f96d88?(14293757320)?bytes.
從輸出看,當(dāng)前進(jìn)程占用 MEM_COMMIT=13.7G,托管堆內(nèi)存占用 14293757320 = 13.3G,很明顯這屬于簡(jiǎn)單模式的 托管內(nèi)存泄漏,根據(jù)經(jīng)驗(yàn),托管堆上可能有什么大對(duì)象,這里用 !dumpheap -stat 命令。
0:000>?!dumpheap?-stat
Statistics:
??????????????MT????Count????TotalSize?Class?Name
00007ff9ed6ea268??3956842?????94964208?System.Collections.Generic.Dictionary`2+KeyCollection[[System.String,?System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue,?Serilog]]
00007ff9ed5e6d28??3842435????166405016?Serilog.Parsing.MessageTemplateToken[]
00007ff9ed5e5e28??3842434????184436832?Serilog.Events.MessageTemplate
00007ff9ecccf090??4011012????203304420?System.Int32[]
00007ff9ed647078??3956849????253238336?Serilog.Events.LogEvent
00007ff9ed6e7b48??3956849????284893128?System.Collections.Generic.Dictionary`2[[System.String,?System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue,?Serilog]]
00007ff9ed5e74e8??9259598????296307136?Serilog.Parsing.TextToken
00007ff9ed6471b0?12551808????301243392?Serilog.Events.ScalarValue
00007ff9ed6e8308??3956849????729078048?System.Collections.Generic.Dictionary`2+Entry[[System.String,?System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue,?Serilog]][]
00007ff9eccb1e18?16546412???3987811940?System.String
00000024c3b8faf0????82904???7382993568??????Free
我去,托管堆最大的對(duì)象居然是 Free,大概占用 7.3G, 這就????了,不按套路出牌哈,為了更好的理解,先來簡(jiǎn)要解釋下 free 對(duì)象。
2. 簡(jiǎn)單解釋 free
其實(shí)簡(jiǎn)而言之,free對(duì)象是被gc標(biāo)記成已回收的空閑塊但并未提交給操作系統(tǒng)釋放的對(duì)象,那怎么去證明呢?可以先到托管堆上找一個(gè)free塊。
0:000>?!dumpheap?-type?Free
?????????Address???????????????MT?????Size
00000024c4691000?00000024c3b8faf0???????24?Free
00000024c46a2448?00000024c3b8faf0???????24?Free
00000024c46b26d8?00000024c3b8faf0???????40?Free
00000024c47e4418?00000024c3b8faf0???????40?Free
00000024c4925680?00000024c3b8faf0???????40?Free
00000024c49284a8?00000024c3b8faf0???????64?Free
00000024c4947a90?00000024c3b8faf0??????192?Free
00000024c4951f70?00000024c3b8faf0???????48?Free?????????
000000249d6ea3a8?00000024c3b8faf0??????640?Free
用 !do 命令查看是否標(biāo)記為free塊
0:000>?!do?000000249d6ea3a8
Free?Object
Size:????????640(0x280)?bytes
用 dc 看一下這個(gè) free 塊的內(nèi)容,是否有g(shù)c回收的痕跡。
0:000>?dc?000000249d6ea3a8?000000249d6ea3a8+0x280
00000024`9d6ea3a8??c3b8faf0?00000024?00000268?00000000??....$...h.......
00000024`9d6ea3b8??9d6ea6d0?00000024?00000000?00000000??..n.$...........
00000024`9d6ea3c8??ed3ae2b8?00007ff9?9d6ea3a8?00000024??..:.......n.$...
00000024`9d6ea3d8??00000000?00000000?edcc9118?00007ff9??................
00000024`9d6ea3e8??00000000?00000000?00000000?00000000??................
00000024`9d6ea3f8??00000000?00000000?00000000?00000000??................
00000024`9d6ea408??eeb07a50?00007ff9?9d6ea3c8?00000024??Pz........n.$...
00000024`9d6ea418??00000000?00000000?ef292ee8?00007ff9??..........).....
00000024`9d6ea428??9d6ea408?00000024?00000000?00000000??..n.$...........
...
00000024`9d6ea4a8??eeb0a158?00007ff9?9d6ea420?00000024??X.......?.n.$...
00000024`9d6ea4b8??00000000?00000000?00000000?00000000??................
00000024`9d6ea4c8??ef293818?00007ff9?9d6ea4a8?00000024??.8).......n.$...
00000024`9d6ea4d8??00000000?00000000?ee8357e0?00007ff9??.........W......
...
00000024`9d6ea508??eed37b40?00007ff9?00000000?00000000??@{..............
00000024`9d6ea518??00000000?00000000?00000000?00000000??................
00000024`9d6ea528??c4699b48?00000024?00000000?00000000??H.i.$...........
00000024`9d6ea538??00000000?07000440?00000001?00000000??....@...........
00000024`9d6ea548??00000000?00000000?00000000?00000000??................
00000024`9d6ea558??00000000?00000000?ef2af6e0?00007ff9??..........*.....
00000024`9d6ea568??00000000?00000000?00000000?00000000??................
00000024`9d6ea578??00000000?00000000?c4699b48?00000024??........H.i.$...
00000024`9d6ea588??00000000?00000000?00000000?07000400??................
...
00000024`9d6ea628??ef2afd08?????????????????????????????..*.
可以看到,這個(gè)free塊生前是有一些殘留的內(nèi)容字符,好了,對(duì) free 塊有基本了解后,接下來我們繼續(xù)探究下。
3. 到底是什么阻止了free塊的合并?
按照正常邏輯,大多free塊會(huì)在gc回收完之后合并內(nèi)存時(shí),那些被清空后的segment會(huì)被操作系統(tǒng)釋放的,但這個(gè)dump并沒有,也就說明這里有什么東西阻止了free塊的合并,那到底是什么呢?有經(jīng)驗(yàn)的朋友會(huì)說,可以觀察下 gc 的句柄表,命令為 !gchandles -stat 。
0:000>?!gchandles?-stat
Statistics:
??????????????MT????Count????TotalSize?Class?Name
...
00007ff9ed15c0f0?????1008????????72576?System.Reflection.Emit.DynamicResolver
00007ff9ecbf6618???????38???????409344?System.Object[]
Total?1784?objects
Handles:
????Strong?Handles:???????233
????Pinned?Handles:???????16
????Async?Pinned?Handles:?18
????Ref?Count?Handles:????1
????Weak?Long?Handles:????1327
????Weak?Short?Handles:???144
????Dependent?Handles:????45
從輸出看,這里并沒有什么可疑的地方,那怎么辦呢?實(shí)操經(jīng)驗(yàn)多的話,這里還是有一些經(jīng)驗(yàn)值得分享的,比如觀察 free 在 heap 上的布局特征,往往就有重大發(fā)現(xiàn)。
4. 查看 free 塊的布局特征
為了簡(jiǎn)化輸出結(jié)果,我把范圍限定到 heap 上某一個(gè) segment 上,比如這里的:00000029233E0000 000000292AF672F8,所以命令就是 !dumpheap 00000029233E0000 000000292AF672F8
0:000>?!dumpheap?00000029233E0000??000000292AF672F8
?????????Address???????????????MT?????Size
00000029233e1000?00000024c3b8faf0??8291896?Free
0000002923bc9638?00007ff9eccb1e18???108448?????
0000002923be3dd8?00000024c3b8faf0?29931248?Free
000000292586f4c8?00007ff9eccb1e18???301328?????
00000029258b8dd8?00000024c3b8faf0?41384784?Free
0000002928030928?00007ff9eccb1e18???301328?????
000000292807a238?00000024c3b8faf0??2542664?Free
00000029282e6e80?00007ff9eccb1e18???108448?????
0000002928301620?00000024c3b8faf0?29915032?Free
0000002929f88db8?00007ff9eccb1e18???301328?????
0000002929fd26c8?00000024c3b8faf0??2746688?Free
000000292a271008?00007ff9eccb1e18???291304?????
000000292a2b81f0?00000024c3b8faf0??1019600?Free
000000292a3b10c0?00007ff9eccb1e18???108448?????
000000292a3cb860?00000024c3b8faf0?10601048?Free
000000292ade7ab8?00007ff9eccb1e18???301328?????
000000292ae313c8?00000024c3b8faf0???280808?Free
000000292ae75cb0?00007ff9eccb1e18???280854?????
000000292aeba5c8?00000024c3b8faf0???416584?Free
000000292af20110?00007ff9eccb1e18???291304?????
Statistics:
??????????????MT????Count????TotalSize?Class?Name
00007ff9eccb1e18???????10??????2394118?System.String
00000024c3b8faf0???????10????127130352??????Free
Total?20?objects
真是一看嚇一跳,free 和 object 呈交替狀,這就是為什么free塊不能被合并的真正原因,說實(shí)話這種教科書式的 內(nèi)存碎片化 dump,真是可遇不可求,接下來就抽幾個(gè) free 之間的 object 對(duì)象,看看到底是被什么引用著導(dǎo)致gc回收不掉。
5. 尋找 object 的引用鏈
要想查看 object 到底被誰引用著,可以用 !gcroot 命令,這里我抽二個(gè)看看。
0:000>?!gcroot?0000002923bc9638
Thread?1878:
????00000024C39BE4B0?00007FFA4C0B3522?System.Threading.Tasks.Task.SpinThenBlockingWait(Int32,?System.Threading.CancellationToken)?[/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs?@?2922]
????????rbp+10:?00000024c39be520
????????????->??00000024C48AD6E0?System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult,?System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+d__4,?Microsoft.Extensions.Hosting.Abstractions]]
????????????->??00000024C48AD200?Microsoft.Extensions.Hosting.Internal.Host
????????????->??00000024C48AC538?Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host,?Microsoft.Extensions.Hosting]]
????????????->??00000024C48AC658?Microsoft.Extensions.Logging.Logger
????????????->??00000024C48AC680?Microsoft.Extensions.Logging.LoggerInformation[]
????????????->??00000024C48AC6E0?Serilog.Extensions.Logging.SerilogLogger
????????????->??00000024C48AC808?Serilog.Core.Logger
????????????->??00000024C48AC760?Serilog.Core.Logger
????????????->??00000024C47AD468?Serilog.Core.Logger
????????????->??00000024C47ABF08?Serilog.Core.Sinks.SafeAggregateSink
????????????->??00000024C47ABF20?Serilog.Core.ILogEventSink[]
????????????->??00000024C479C548?Serilog.Sinks.Grafana.Loki.LokiSink
????????????->??00000024C479C778?Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent,?Serilog]]
????????????->??00000024C479C7B8?System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent,?Serilog]]
????????????->??00000026E8C1A800?Serilog.Events.LogEvent[]
????????????->??00000026148D3308?Serilog.Events.LogEvent
????????????->??00000026148D4EF0?Serilog.Events.MessageTemplate
????????????->??0000002923BC9638?System.String
Found?1?unique?roots?(run?'!gcroot?-all'?to?see?all?roots).
0:000>?!gcroot?000000292586f4c8
Thread?1878:
????00000024C39BE4B0?00007FFA4C0B3522?System.Threading.Tasks.Task.SpinThenBlockingWait(Int32,?System.Threading.CancellationToken)?[/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs?@?2922]
????????rbp+10:?00000024c39be520
????????????->??00000024C48AD6E0?System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult,?System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+d__4,?Microsoft.Extensions.Hosting.Abstractions]]
????????????->??00000024C48AD200?Microsoft.Extensions.Hosting.Internal.Host
????????????->??00000024C48AC538?Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host,?Microsoft.Extensions.Hosting]]
????????????->??00000024C48AC658?Microsoft.Extensions.Logging.Logger
????????????->??00000024C48AC680?Microsoft.Extensions.Logging.LoggerInformation[]
????????????->??00000024C48AC6E0?Serilog.Extensions.Logging.SerilogLogger
????????????->??00000024C48AC808?Serilog.Core.Logger
????????????->??00000024C48AC760?Serilog.Core.Logger
????????????->??00000024C47AD468?Serilog.Core.Logger
????????????->??00000024C47ABF08?Serilog.Core.Sinks.SafeAggregateSink
????????????->??00000024C47ABF20?Serilog.Core.ILogEventSink[]
????????????->??00000024C479C548?Serilog.Sinks.Grafana.Loki.LokiSink
????????????->??00000024C479C778?Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent,?Serilog]]
????????????->??00000024C479C7B8?System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent,?Serilog]]
????????????->??00000026E8C1A800?Serilog.Events.LogEvent[]
????????????->??0000002614BB7AC8?Serilog.Events.LogEvent
????????????->??0000002616D3CC40?Serilog.Events.MessageTemplate
????????????->??000000292586F4C8?System.String
Found?1?unique?roots?(run?'!gcroot?-all'?to?see?all?roots).
從引用鏈看,這些 object 都是通過 Serilog 日志組件發(fā)送給 Grafana.Loki ,通過對(duì)引用鏈對(duì)象的追蹤,我發(fā)現(xiàn) System.Collections.Generic.Queue 有重大問題,截圖如下:

對(duì),這個(gè) queue 居然有 395w 的積壓,到底積壓了什么東西,可以用 !wdo 看一下string內(nèi)容。

看樣子這是一個(gè)失敗重試,分析到這里思路大概就清晰了,就是 Grafana.Loki 或者 Serilog 組件出了什么問題,導(dǎo)致日志發(fā)送不到 Loki 中或者發(fā)送速度過慢,然后不斷的積壓所致,接下來把分析到的這些信息和朋友做了一個(gè)溝通,截圖如下:

三:總結(jié)
本次內(nèi)存碎片化的主要原因在于 Serilog 對(duì)接 Loki 的過程中產(chǎn)生的395w的queue積壓所致,但我也只能分析到這里了,至于為什么有積壓,這個(gè)還得朋友進(jìn)一步調(diào)試分析,我相信這個(gè)問題很快就能得到解決??????
END
