.NET Core 中如何調(diào)試 死鎖 ?
這篇文章,我們研究一下如何調(diào)試一個 死鎖問題,可以下載一下 https://github.com/dotnet/samples/tree/main/core/diagnostics/DiagnosticScenarios 源碼,程序運行后,你會發(fā)現(xiàn)api無響應而且線程會不斷增長,然后你會學習到用不同的工具去分析這個死鎖問題。
生成 core dump
為了能找出程序無響應的原因,core dump 提供了當時線程的狀態(tài)以及各種存在線程競爭的鎖,接下來運行如下命令:
dotnet?run
為了能找到 進程ID,使用如下命令。
dotnet-trace?ps
可以自己查看一下 shell 輸出,我們的進程是 4807,你的進程號可能不一樣,接下來訪問URL: https://localhost:5001/api/diagscenario/deadlock ,你會發(fā)現(xiàn)這個api請求會一直得不到response,等 10-15s 后,使用下面的命令收集一個 dump 文件。
dotnet-dump?collect?-p?4807
分析 core dump
為了能夠?qū)ι珊玫?dump 進行分析,使用如下命令 dotnet-dump analyze 命令,參考如下:
dotnet-dump?analyze??~/.dotnet/tools/core_20190513_143916
因為是一個莫名的無響應問題,所以你可能想先對進程中的所有線程有一個整體的感知。
>?threads
*0?0x1DBFF?(121855)
?1?0x1DC01?(121857)
?2?0x1DC02?(121858)
?3?0x1DC03?(121859)
?4?0x1DC04?(121860)
?5?0x1DC05?(121861)
?6?0x1DC06?(121862)
?7?0x1DC07?(121863)
?8?0x1DC08?(121864)
?9?0x1DC09?(121865)
?10?0x1DC0A?(121866)
?11?0x1DC0D?(121869)
?12?0x1DC0E?(121870)
?13?0x1DC10?(121872)
?14?0x1DC11?(121873)
?15?0x1DC12?(121874)
?16?0x1DC13?(121875)
?17?0x1DC14?(121876)
?18?0x1DC15?(121877)
?19?0x1DC1C?(121884)
?20?0x1DC1D?(121885)
?21?0x1DC1E?(121886)
?22?0x1DC21?(121889)
?23?0x1DC22?(121890)
?24?0x1DC23?(121891)
?25?0x1DC24?(121892)
?26?0x1DC25?(121893)
...
...
?317?0x1DD48?(122184)
?318?0x1DD49?(122185)
?319?0x1DD4A?(122186)
?320?0x1DD4B?(122187)
?321?0x1DD4C?(122188)
輸出中展示了進程中的所有線程ID,以及相關(guān)的 調(diào)試器線程ID 和 操作系統(tǒng)線程ID。
接下來我們有必要了解下每個線程都在做什么?可以用 clrstack 查看線程棧,為了能查看所有的線程棧,使用如下命令:
clrstack?-all
??...
??...
??...
????????Child?SP???????????????IP?Call?Site
00007F2AE37B5680?00007f305abc6360?[GCFrame:?00007f2ae37b5680]
00007F2AE37B5770?00007f305abc6360?[GCFrame:?00007f2ae37b5770]
00007F2AE37B57D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae37b57d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE37B5920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE37B5950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE37B5CA0?00007f30593044af?[GCFrame:?00007f2ae37b5ca0]
00007F2AE37B5D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae37b5d70]
OS?Thread?Id:?0x1dc82
????????Child?SP???????????????IP?Call?Site
00007F2AE2FB4680?00007f305abc6360?[GCFrame:?00007f2ae2fb4680]
00007F2AE2FB4770?00007f305abc6360?[GCFrame:?00007f2ae2fb4770]
00007F2AE2FB47D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae2fb47d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE2FB4920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE2FB4950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE2FB4CA0?00007f30593044af?[GCFrame:?00007f2ae2fb4ca0]
00007F2AE2FB4D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae2fb4d70]
OS?Thread?Id:?0x1dc83
????????Child?SP???????????????IP?Call?Site
00007F2AE27B3680?00007f305abc6360?[GCFrame:?00007f2ae27b3680]
00007F2AE27B3770?00007f305abc6360?[GCFrame:?00007f2ae27b3770]
00007F2AE27B37D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae27b37d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE27B3920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE27B3950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE27B3CA0?00007f30593044af?[GCFrame:?00007f2ae27b3ca0]
00007F2AE27B3D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae27b3d70]
OS?Thread?Id:?0x1dc84
????????Child?SP???????????????IP?Call?Site
00007F2AE1FB2680?00007f305abc6360?[GCFrame:?00007f2ae1fb2680]
00007F2AE1FB2770?00007f305abc6360?[GCFrame:?00007f2ae1fb2770]
00007F2AE1FB27D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae1fb27d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE1FB2920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE1FB2950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE1FB2CA0?00007f30593044af?[GCFrame:?00007f2ae1fb2ca0]
00007F2AE1FB2D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae1fb2d70]
OS?Thread?Id:?0x1dc85
????????Child?SP???????????????IP?Call?Site
00007F2AE17B1680?00007f305abc6360?[GCFrame:?00007f2ae17b1680]
00007F2AE17B1770?00007f305abc6360?[GCFrame:?00007f2ae17b1770]
00007F2AE17B17D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae17b17d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE17B1920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE17B1950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE17B1CA0?00007f30593044af?[GCFrame:?00007f2ae17b1ca0]
00007F2AE17B1D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae17b1d70]
OS?Thread?Id:?0x1dc86
????????Child?SP???????????????IP?Call?Site
00007F2AE0FB0680?00007f305abc6360?[GCFrame:?00007f2ae0fb0680]
00007F2AE0FB0770?00007f305abc6360?[GCFrame:?00007f2ae0fb0770]
00007F2AE0FB07D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae0fb07d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE0FB0920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE0FB0950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE0FB0CA0?00007f30593044af?[GCFrame:?00007f2ae0fb0ca0]
00007F2AE0FB0D70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae0fb0d70]
OS?Thread?Id:?0x1dc87
????????Child?SP???????????????IP?Call?Site
00007F2AE07AF680?00007f305abc6360?[GCFrame:?00007f2ae07af680]
00007F2AE07AF770?00007f305abc6360?[GCFrame:?00007f2ae07af770]
00007F2AE07AF7D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2ae07af7d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2AE07AF920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2AE07AF950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2AE07AFCA0?00007f30593044af?[GCFrame:?00007f2ae07afca0]
00007F2AE07AFD70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2ae07afd70]
OS?Thread?Id:?0x1dc88
????????Child?SP???????????????IP?Call?Site
00007F2ADFFAE680?00007f305abc6360?[GCFrame:?00007f2adffae680]
00007F2ADFFAE770?00007f305abc6360?[GCFrame:?00007f2adffae770]
00007F2ADFFAE7D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2adffae7d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2ADFFAE920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2ADFFAE950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2ADFFAECA0?00007f30593044af?[GCFrame:?00007f2adffaeca0]
00007F2ADFFAED70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2adffaed70]
...
...
Observing?the?callstacks?for?all?300+?threads?shows?a?pattern?where?a?majority?of?the?threads?share?a?common?callstack:
Console
Copy
OS?Thread?Id:?0x1dc88
????????Child?SP???????????????IP?Call?Site
00007F2ADFFAE680?00007f305abc6360?[GCFrame:?00007f2adffae680]
00007F2ADFFAE770?00007f305abc6360?[GCFrame:?00007f2adffae770]
00007F2ADFFAE7D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2adffae7d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2ADFFAE920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2ADFFAE950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2ADFFAECA0?00007f30593044af?[GCFrame:?00007f2adffaeca0]
00007F2ADFFAED70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2adffaed70]
觀察這超300+ 的線程棧,發(fā)現(xiàn)很多線程都是類似下面的線程棧。
OS?Thread?Id:?0x1dc88
????????Child?SP???????????????IP?Call?Site
00007F2ADFFAE680?00007f305abc6360?[GCFrame:?00007f2adffae680]
00007F2ADFFAE770?00007f305abc6360?[GCFrame:?00007f2adffae770]
00007F2ADFFAE7D0?00007f305abc6360?[HelperMethodFrame_1OBJ:?00007f2adffae7d0]?System.Threading.Monitor.ReliableEnter(System.Object,?Boolean?ByRef)
00007F2ADFFAE920?00007F2FE392B31F?testwebapi.Controllers.DiagScenarioController.b__3_1()?[/home/marioh/webapi/Controllers/diagscenario.cs?@?36]
00007F2ADFFAE950?00007F2FE392B46D?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
00007F2ADFFAECA0?00007f30593044af?[GCFrame:?00007f2adffaeca0]
00007F2ADFFAED70?00007f30593044af?[DebuggerU2MCatchHandlerFrame:?00007f2adffaed70]
從調(diào)用棧看,api請求已經(jīng)進去了 deadlock 方法,然后在 Monitor.ReliableEnter 處卡住,這就說明有其他線程已經(jīng)進去了這個臨界區(qū)。
接下來我們要尋找到底是哪一個線程正在持有鎖?可以用 syncblk 來獲取更多線程鎖信息。
>?syncblk
Index?????????SyncBlock?MonitorHeld?Recursion?Owning?Thread?Info??????????SyncBlock?Owner
???43?00000246E51268B8??????????603?????????1?0000024B713F4E30?5634??28???00000249654b14c0?System.Object
???44?00000246E5126908????????????3?????????1?0000024B713F47E0?51d4??29???00000249654b14d8?System.Object
-----------------------------
Total???????????344
CCW?????????????1
RCW?????????????2
ComClassFactory?0
Free????????????0
這里有兩個列需要注意下:
MonitorHeld
它記錄了這個持有鎖當前有多少 持有線程 和 等待線程。
Owning Thread Info
它表示當前持有鎖的線程信息。
到這里,我們看到有兩個線程 0x5634 和 0x51d4 持有了一個鎖,接下來可以切到各自線程看看里面到底做了什么?可以用 setthread 和 clrstack 命令。
首先看一下第一個線程 0x5634,我們發(fā)現(xiàn)它正在獲取一個鎖,但通過 syncblk 我們知道這個線程已經(jīng)持有了一個鎖,這問題就大了,可以想象所有等待這個持有鎖的線程將會被無限期等待。
>?setthread?28
>?clrstack
OS?Thread?Id:?0x5634?(28)
????????Child?SP???????????????IP?Call?Site
0000004E46AFEAA8?00007fff43a5cbc4?[GCFrame:?0000004e46afeaa8]
0000004E46AFEC18?00007fff43a5cbc4?[GCFrame:?0000004e46afec18]
0000004E46AFEC68?00007fff43a5cbc4?[HelperMethodFrame_1OBJ:?0000004e46afec68]?System.Threading.Monitor.Enter(System.Object)
0000004E46AFEDC0?00007FFE5EAF9C80?testwebapi.Controllers.DiagScenarioController.DeadlockFunc()?[C:\Users\dapine\Downloads\Diagnostic_scenarios_sample_debug_target\Controllers\DiagnosticScenarios.cs?@?58]
0000004E46AFEE30?00007FFE5EAF9B8C?testwebapi.Controllers.DiagScenarioController.b__3_0()?[C:\Users\dapine\Downloads\Diagnostic_scenarios_sample_debug_target\Controllers\DiagnosticScenarios.cs?@?26]
0000004E46AFEE80?00007FFEBB251A5B?System.Threading.ThreadHelper.ThreadStart_Context(System.Object)?[/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs?@?44]
0000004E46AFEEB0?00007FFE5EAEEEEC?System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext,?System.Threading.ContextCallback,?System.Object)?[/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs?@?201]
0000004E46AFEF20?00007FFEBB234EAB?System.Threading.ThreadHelper.ThreadStart()?[/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs?@?93]
0000004E46AFF138?00007ffebdcc6b63?[GCFrame:?0000004e46aff138]
0000004E46AFF3A0?00007ffebdcc6b63?[DebuggerU2MCatchHandlerFrame:?0000004e46aff3a0]
第二個線程情況類似,剩下的 300+ 線程也差不多都是在這些 死鎖 的 lock 上無限期的等待。
