記一次 .NET 某藥品倉儲(chǔ)管理系統(tǒng) 卡死分析
一:背景
1. 講故事
這個(gè)月初,有位朋友wx上找到我,說他的api過一段時(shí)間后,就會(huì)出現(xiàn)只有請(qǐng)求,沒有響應(yīng)的情況,截圖如下:
從朋友的描述中看樣子程序是被什么東西卡住了,這種卡死的問題解決起來相對(duì)簡單,接下來我就用 windbg 給大家分析一下。
二:Windbg 分析
1. Request 請(qǐng)求正在干嘛?
既然朋友說 api 有 request 無 response,那怎么去驗(yàn)證朋友的話對(duì)不對(duì)呢?我們都知道 .NET 用 HttpContext 來表示一個(gè)請(qǐng)求,言外之意就是可以去抓 HttpContext 下的時(shí)長屬性,Netext 中有一個(gè) !whttp 命令可以幫助我們。
- 0:000> !whttp
- HttpContext Thread Time Out Running Status Verb Url
- 000000563bf803b0 42 00:01:50 00:01:24 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
- 000000563bf84660 -- 00:01:50 Finished 200 GET http://xxx.com:30003/
- 000000563c4a0470 51 00:01:50 00:00:12 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
- 00000056bbf63590 30 00:01:50 00:02:41 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
- 00000056bc82a038 -- 00:01:50 Finished 200 GET http://localhost:30003/
- 00000056bc84a3e8 44 00:01:50 00:00:51 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
- 00000056bc8671c8 46 00:01:50 00:00:45 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
- 000000573bf44698 35 00:01:50 00:02:39 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
- 000000573bf483c0 33 00:01:50 00:02:41 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
- 000000573bf97e80 40 00:01:50 00:02:32 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=ZJB2C
- 000000573c583b08 -- 00:01:50 Finished 200 GET http://localhost:30003/
- 000000573c589ec8 -- 00:01:50 Finished 200 GET http://xxx.com:30003/Wms/xxx/xxx/xxx
- 000000573c760e28 -- 00:01:50 Finished 200 POST http://xxx.com:30003/Wms/xxx/xxx/xxx
- 000000573c95f990 48 00:01:50 00:00:31 200 POST http://xxx.com:30003/Wms/Common/xxx?xxx=xxx&xxx=x-HN
- 00000057bbf4f8e8 31 00:01:50 00:02:12 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
- 00000057bc080340 50 00:01:50 00:00:19 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
- 000000583c4aee80 43 00:01:50 00:01:11 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
- 000000583c4d0c50 53 00:01:50 00:00:01 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
- 00000058bbf8f1a0 34 00:01:50 00:02:22 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
- 000000593bfe1758 41 00:01:50 00:01:22 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
- 000000593c892160 -- 00:01:50 Finished 200 GET http://xxx.com:30003/Wms/xxx/xxx/xxxJob
- 000000593ca813b0 45 00:01:50 00:00:30 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-HN
- 000000593caa45d8 -- 00:01:50 Finished 200 GET http://xxx.com:30003/
- 00000059bc1ad808 32 00:01:50 00:01:45 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
- 00000059bc1c3d70 36 00:01:50 00:01:29 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
- 25 HttpContext object(s) found matching criteria
從 Running 列可以看到大多請(qǐng)求都已經(jīng)達(dá)到1分鐘以上,這也驗(yàn)證了朋友所說的卡死問題,按照經(jīng)驗(yàn),可以取 Running 列中最大的 httpContext 所在的線程,也就是上面的 30 和 33 號(hào)線程, 看看它們都在干什么?
2. 探究 Running 最長線程
接下來切到 30 和 33 號(hào)線程,看看它們的線程棧。
- 0:000> ~30s
- ntdll!NtWaitForSingleObject+0xa:
- 00007ffd`b81f024a c3 ret
- 0:030> !clrstack
- OS Thread Id: 0x29d0 (30)
- Child SP IP Call Site
- 0000005acc3ac590 00007ffdb81f024a [PrestubMethodFrame: 0000005acc3ac590] xxx.xxx.RedisConnectionHelp.get_Instance()
- 0000005acc3ac850 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
- 0000005acc3ac8c0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
- 0000005acc3ac968 00007ffdabef1f7c [StubHelperFrame: 0000005acc3ac968]
- 0000005acc3ac9c0 00007ffd4dd77f18 xxx.xxx.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)
- ...
- 0:030> ~33s
- ntdll!NtWaitForMultipleObjects+0xa:
- 00007ffd`b81f07ba c3 ret
- 0:033> !clrstack
- OS Thread Id: 0x3ad4 (33)
- Child SP IP Call Site
- 0000005accabae90 00007ffdb81f07ba [GCFrame: 0000005accabae90]
- 0000005accabafb8 00007ffdb81f07ba [HelperMethodFrame_1OBJ: 0000005accabafb8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
- 0000005accabb0d0 00007ffdaac60d64 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
- 0000005accabb160 00007ffdaac5b4bb System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
- 0000005accabb1d0 00007ffdab5a01d1 System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)
- 0000005accabb2a0 00007ffdab59cfa7 System.Threading.Tasks.Task`1[[System.__Canon, mscorlib]].GetResultxxx(Boolean)
- 0000005accabb2e0 00007ffd4d8d338f xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].GetConfig(xxx.Config.Model.ConfigListener, System.Func`2<xxx.Config.Request.GetConfigRequest,System.Threading.Tasks.Task`1<System.String>>)
- 0000005accabb340 00007ffd4d8d2f40 xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].get_Item(System.String, System.String)
- 0000005accabb3c0 00007ffd4dd78f7f xxx.Util.BaseConfig.get_GetRedisConn()
- 0000005accabb440 00007ffd4dd78e9c xxx.xxx.RedisConnectionHelp.GetConnectionString()
- 0000005accabb4a0 00007ffd4dd789cb xxx.xxx.RedisConnectionHelp..cctor()
- 0000005accabb940 00007ffdabef6953 [GCFrame: 0000005accabb940]
- 0000005accabc5b0 00007ffdabef6953 [PrestubMethodFrame: 0000005accabc5b0] xxx.xxx.RedisConnectionHelp.get_Instance()
- 0000005accabc870 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
- 0000005accabc8e0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
- 0000005accabc988 00007ffdabef1f7c [StubHelperFrame: 0000005accabc988]
- 0000005accabc9e0 00007ffd4dd77f18 xxx.Core.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)
- ...
上面的信息不難發(fā)現(xiàn) 30 號(hào)線程正卡在 RedisConnectionHelp.get_Instance() 處,33 號(hào)線已經(jīng)進(jìn)入了 RedisConnectionHelp.get_Instance() 方法中,最后在 GetConfig() 處等待 Result 的結(jié)果,按經(jīng)驗(yàn)來說,30 號(hào)線程看樣子正在鎖等待, 33 號(hào)正在等待異步結(jié)果,接下來的突破點(diǎn)就是探究下 RedisConnectionHelp.Instance 處代碼。
3. 尋找問題代碼
接下來用反編譯工具 ILSpy 找到問題代碼。
- public static class RedisConnectionHelp
- {
- public static ConnectionMultiplexer Instance
- {
- get
- {
- if (_instance == null)
- {
- lock (Locker)
- {
- if (_instance == null || !_instance.IsConnected)
- {
- _instance = GetManager();
- }
- }
- }
- return _instance;
- }
- }
- }
30 號(hào)線程果然是卡在 Locker 處,接下來深挖下 33 號(hào)線程所執(zhí)行的 GetManager() 方法,簡化后代碼如下:
- public T this[string dataId, string key = ""]
- {
- get
- {
- try
- {
- string config = GetConfig(configListener, new NacosConfigClient(Base.NacosConfiguration).DoGetConfigAsync);
- return JsonConvert.DeserializeObject<T>(config);
- }
- catch (Exception ex)
- {
- return default(T);
- }
- }
- }
- private string GetConfig(ConfigListener listener, Func<GetConfigRequest, Task<string>> action)
- {
- var text2 = action(new GetConfigRequest
- {
- DataId = listener.DataId,
- Group = listener.Group,
- Tenant = text
- }).Result;
- return text2;
- }
- internal async Task<string> DoGetConfigAsync(GetConfigRequest request)
- {
- IRestResponse restResponse = await HttpUtil.Request(currentServerAddr, Method.GET, request.ParamValues(), xxx);
- return restResponse.Content;
- }
可以看到代碼卡在了 Result 上無限期等待,到這里我就想到了 同步上下文 ,我看他這個(gè)程序是 .NET 4.8 下的 ASP.NET MVC,記得不錯(cuò)上下文應(yīng)該是 AspNetSynchronizationContext,具體死鎖原因可參見我的這篇文章:一句 Task.Result 就死鎖, 這代碼還怎么寫?,解決辦法大概有四種。
- 使用 .ConfigureAwait(false)
- 改成全異步
- 用 Task 再包一層。
- 改成全同步
三:總結(jié)
其實(shí)本次事故主要還是因?yàn)樵?同步代碼 中做了 異步代碼.Result 導(dǎo)致的死鎖問題,有非常多的文章都在抨擊這種現(xiàn)象,在 asp.net core 中已經(jīng)移除了這種同步上下文的大坑,給到朋友的建議是改成全同步,死鎖問題也隨之消失。
哈哈,真替朋友開心!
本文轉(zhuǎn)載自微信公眾號(hào)「 一線碼農(nóng)聊技術(shù)」,可以通過以下二維碼關(guān)注。轉(zhuǎn)載本文請(qǐng)聯(lián)系一線碼農(nóng)聊技術(shù)公眾號(hào)。