記一次 .NET 某電商定向爬蟲的內(nèi)存碎片化分析
一、背景
1. 講故事
上個月有位朋友wx找到我,說他的程序存在內(nèi)存泄漏問題,尋求如何解決? 如下圖所示:
從截圖中可以看出,這位朋友對 windbg 的操作還是有些熟悉的,可能缺乏一定的實操經(jīng)驗,所以用了幾個命令之后就不知道怎么排查下去了。
既然找到我,那就以我的個人經(jīng)驗在他的dump上繼續(xù)分析尋找罪魁禍?zhǔn)?,閑話不多說,上windbg說話。
二、Windbg 分析
1. 真的存在內(nèi)存泄漏嗎?
追這個系列的朋友應(yīng)該知道,我無數(shù)次的用 !address -summary 和 !eeheap -gc 這兩個命令來判斷當(dāng)前的內(nèi)存泄漏是屬于托管層還是非托管層。
- 0:000> !address -summary
 - --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
 - Free 358 7dfc`67f60000 ( 125.986 TB) 98.43%
 - <unknown> 1087 203`88b6e000 ( 2.014 TB) 99.99% 1.57%
 - Image 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
 - Heap 249 0`03453000 ( 52.324 MB) 0.00% 0.00%
 - Stack 66 0`01fc0000 ( 31.750 MB) 0.00% 0.00%
 - Other 10 0`001d1000 ( 1.816 MB) 0.00% 0.00%
 - TEB 22 0`0002c000 ( 176.000 kB) 0.00% 0.00%
 - PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
 - --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
 - MEM_MAPPED 183 200`00d06000 ( 2.000 TB) 99.30% 1.56%
 - MEM_PRIVATE 1252 3`8d479000 ( 14.207 GB) 0.69% 0.01%
 - MEM_IMAGE 1532 0`09f11000 ( 159.066 MB) 0.01% 0.00%
 - --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
 - MEM_FREE 358 7dfc`67f60000 ( 125.986 TB) 98.43%
 - MEM_RESERVE 749 200`28a9b000 ( 2.001 TB) 99.33% 1.56%
 - MEM_COMMIT 2218 3`6f5f5000 ( 13.740 GB) 0.67% 0.01%
 - 0:000> !eeheap -gc
 - Number of GC Heaps: 1
 - generation 0 starts at 0x00000026DA8DA928
 - generation 1 starts at 0x00000026DA7FC348
 - generation 2 starts at 0x00000024C4691000
 - ephemeral segment allocation context: none
 - segment begin allocated size
 - 00000024C4690000 00000024C4691000 00000024D468FF28 0xfffef28(268431144)
 - 00000024EECF0000 00000024EECF1000 00000024FECF0000 0xffff000(268431360)
 - 000000248D6F0000 000000248D6F1000 000000249D6EFEF8 0xfffeef8(268431096)
 - ...
 - 00000026D66D0000 00000026D66D1000 00000026DBA3CA30 0x536ba30(87472688)
 - Large object heap starts at 0x00000024D4691000
 - segment begin allocated size
 - 00000024D4690000 00000024D4691000 00000024DC67C318 0x7feb318(134132504)
 - 00000024E60F0000 00000024E60F1000 00000024EE0637C8 0x7f727c8(133638088)
 - 0000002482140000 0000002482141000 000000248A08F338 0x7f4e338(133489464)
 - 00000024A6770000 00000024A6771000 00000024AE76F6C0 0x7ffe6c0(134211264)
 - ...
 - 000000278E6D0000 000000278E6D1000 000000279635F2D0 0x7c8e2d0(130605776)
 - 00000029233E0000 00000029233E1000 000000292AF672F8 0x7b862f8(129524472)
 - 000000292B3E0000 000000292B3E1000 0000002931A5ED60 0x667dd60(107470176)
 - 000000299B3E0000 000000299B3E1000 00000029A20095B0 0x6c285b0(113411504)
 - 000000281E6D0000 000000281E6D1000 0000002825CD3F58 0x7602f58(123744088)
 - 00000028266D0000 00000028266D1000 000000282D5CAD50 0x6ef9d50(116366672)
 - 000000282E6D0000 000000282E6D1000 0000002833CA0880 0x55cf880(89979008)
 - 00000029A33E0000 00000029A33E1000 00000029A684D300 0x346c300(54969088)
 - Total Size: Size: 0x353f96d88 (14293757320) bytes.
 - ------------------------------
 - GC Heap Size: Size: 0x353f96d88 (14293757320) bytes.
 
從輸出看,當(dāng)前進(jìn)程占用 MEM_COMMIT=13.7G,托管堆內(nèi)存占用 14293757320 = 13.3G,很明顯這屬于簡單模式的 托管內(nèi)存泄漏,根據(jù)經(jīng)驗,托管堆上可能有什么大對象,這里用 !dumpheap -stat 命令。
- 0:000> !dumpheap -stat
 - Statistics:
 - MT Count TotalSize Class Name
 - 00007ff9ed6ea268 3956842 94964208 System.Collections.Generic.Dictionary`2+KeyCollection[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
 - 00007ff9ed5e6d28 3842435 166405016 Serilog.Parsing.MessageTemplateToken[]
 - 00007ff9ed5e5e28 3842434 184436832 Serilog.Events.MessageTemplate
 - 00007ff9ecccf090 4011012 203304420 System.Int32[]
 - 00007ff9ed647078 3956849 253238336 Serilog.Events.LogEvent
 - 00007ff9ed6e7b48 3956849 284893128 System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]]
 - 00007ff9ed5e74e8 9259598 296307136 Serilog.Parsing.TextToken
 - 00007ff9ed6471b0 12551808 301243392 Serilog.Events.ScalarValue
 - 00007ff9ed6e8308 3956849 729078048 System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[Serilog.Events.LogEventPropertyValue, Serilog]][]
 - 00007ff9eccb1e18 16546412 3987811940 System.String
 - 00000024c3b8faf0 82904 7382993568 Free
 
我去,托管堆最大的對象居然是 Free,大概占用 7.3G, 這就牛了,不按套路出牌哈,為了更好的理解,先來簡要解釋下 free 對象。
2. 簡單解釋 free
其實簡而言之,free對象是被gc標(biāo)記成已回收的空閑塊但并未提交給操作系統(tǒng)釋放的對象,那怎么去證明呢? 可以先到托管堆上找一個free塊。
- 0:000> !dumpheap -type Free
 - Address MT Size
 - 00000024c4691000 00000024c3b8faf0 24 Free
 - 00000024c46a2448 00000024c3b8faf0 24 Free
 - 00000024c46b26d8 00000024c3b8faf0 40 Free
 - 00000024c47e4418 00000024c3b8faf0 40 Free
 - 00000024c4925680 00000024c3b8faf0 40 Free
 - 00000024c49284a8 00000024c3b8faf0 64 Free
 - 00000024c4947a90 00000024c3b8faf0 192 Free
 - 00000024c4951f70 00000024c3b8faf0 48 Free
 - 000000249d6ea3a8 00000024c3b8faf0 640 Free
 
(1)用 !do 命令查看是否標(biāo)記為free塊
- 0:000> !do 000000249d6ea3a8
 - Free Object
 - Size: 640(0x280) bytes
 
(2)用 dc 看一下這個 free 塊的內(nèi)容,是否有g(shù)c回收的痕跡。
- 0:000> dc 000000249d6ea3a8 000000249d6ea3a8+0x280
 - 00000024`9d6ea3a8 c3b8faf0 00000024 00000268 00000000 ....$...h.......
 - 00000024`9d6ea3b8 9d6ea6d0 00000024 00000000 00000000 ..n.$...........
 - 00000024`9d6ea3c8 ed3ae2b8 00007ff9 9d6ea3a8 00000024 ..:.......n.$...
 - 00000024`9d6ea3d8 00000000 00000000 edcc9118 00007ff9 ................
 - 00000024`9d6ea3e8 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea3f8 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea408 eeb07a50 00007ff9 9d6ea3c8 00000024 Pz........n.$...
 - 00000024`9d6ea418 00000000 00000000 ef292ee8 00007ff9 ..........).....
 - 00000024`9d6ea428 9d6ea408 00000024 00000000 00000000 ..n.$...........
 - ...
 - 00000024`9d6ea4a8 eeb0a158 00007ff9 9d6ea420 00000024 X....... .n.$...
 - 00000024`9d6ea4b8 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea4c8 ef293818 00007ff9 9d6ea4a8 00000024 .8).......n.$...
 - 00000024`9d6ea4d8 00000000 00000000 ee8357e0 00007ff9 .........W......
 - ...
 - 00000024`9d6ea508 eed37b40 00007ff9 00000000 00000000 @{..............
 - 00000024`9d6ea518 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea528 c4699b48 00000024 00000000 00000000 H.i.$...........
 - 00000024`9d6ea538 00000000 07000440 00000001 00000000 ....@...........
 - 00000024`9d6ea548 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea558 00000000 00000000 ef2af6e0 00007ff9 ..........*.....
 - 00000024`9d6ea568 00000000 00000000 00000000 00000000 ................
 - 00000024`9d6ea578 00000000 00000000 c4699b48 00000024 ........H.i.$...
 - 00000024`9d6ea588 00000000 00000000 00000000 07000400 ................
 - ...
 - 00000024`9d6ea628 ef2afd08 ..*.
 
可以看到,這個free塊生前是有一些殘留的內(nèi)容字符,好了,對 free 塊有基本了解后,接下來我們繼續(xù)探究下。
3. 到底是什么阻止了free塊的合并?
按照正常邏輯,大多free塊會在gc回收完之后合并內(nèi)存時,那些被清空后的segment會被操作系統(tǒng)釋放的,但這個dump并沒有,也就說明這里有什么東西阻止了free塊的合并,那到底是什么呢? 有經(jīng)驗的朋友會說,可以觀察下 gc 的句柄表,命令為 !gchandles -stat 。
- 0:000> !gchandles -stat
 - Statistics:
 - MT Count TotalSize Class Name
 - ...
 - 00007ff9ed15c0f0 1008 72576 System.Reflection.Emit.DynamicResolver
 - 00007ff9ecbf6618 38 409344 System.Object[]
 - Total 1784 objects
 - Handles:
 - Strong Handles: 233
 - Pinned Handles: 16
 - Async Pinned Handles: 18
 - Ref Count Handles: 1
 - Weak Long Handles: 1327
 - Weak Short Handles: 144
 - Dependent Handles: 45
 
從輸出看,這里并沒有什么可疑的地方,那怎么辦呢? 實操經(jīng)驗多的話,這里還是有一些經(jīng)驗值得分享的,比如觀察 free 在 heap 上的布局特征,往往就有重大發(fā)現(xiàn)。
4. 查看 free 塊的布局特征
為了簡化輸出結(jié)果,我把范圍限定到 heap 上某一個 segment 上,比如這里的: 00000029233E0000 000000292AF672F8,所以命令就是 !dumpheap 00000029233E0000 000000292AF672F8。
- 0:000> !dumpheap 00000029233E0000 000000292AF672F8
 - Address MT Size
 - 00000029233e1000 00000024c3b8faf0 8291896 Free
 - 0000002923bc9638 00007ff9eccb1e18 108448
 - 0000002923be3dd8 00000024c3b8faf0 29931248 Free
 - 000000292586f4c8 00007ff9eccb1e18 301328
 - 00000029258b8dd8 00000024c3b8faf0 41384784 Free
 - 0000002928030928 00007ff9eccb1e18 301328
 - 000000292807a238 00000024c3b8faf0 2542664 Free
 - 00000029282e6e80 00007ff9eccb1e18 108448
 - 0000002928301620 00000024c3b8faf0 29915032 Free
 - 0000002929f88db8 00007ff9eccb1e18 301328
 - 0000002929fd26c8 00000024c3b8faf0 2746688 Free
 - 000000292a271008 00007ff9eccb1e18 291304
 - 000000292a2b81f0 00000024c3b8faf0 1019600 Free
 - 000000292a3b10c0 00007ff9eccb1e18 108448
 - 000000292a3cb860 00000024c3b8faf0 10601048 Free
 - 000000292ade7ab8 00007ff9eccb1e18 301328
 - 000000292ae313c8 00000024c3b8faf0 280808 Free
 - 000000292ae75cb0 00007ff9eccb1e18 280854
 - 000000292aeba5c8 00000024c3b8faf0 416584 Free
 - 000000292af20110 00007ff9eccb1e18 291304
 - Statistics:
 - MT Count TotalSize Class Name
 - 00007ff9eccb1e18 10 2394118 System.String
 - 00000024c3b8faf0 10 127130352 Free
 - Total 20 objects
 
真是一看嚇一跳,free 和 object 呈交替狀,這就是為什么free塊不能被合并的真正原因,說實話這種教科書式的 內(nèi)存碎片化 dump,真是可遇不可求,接下來就抽幾個 free 之間的 object 對象,看看到底是被什么引用著導(dǎo)致gc回收不掉。
5. 尋找 object 的引用鏈
要想查看 object 到底被誰引用著,可以用 !gcroot 命令,這里我抽二個看看。
- 0:000> !gcroot 0000002923bc9638
 - Thread 1878:
 - 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
 - rbp+10: 00000024c39be520
 - -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
 - -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
 - -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
 - -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
 - -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
 - -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
 - -> 00000024C48AC808 Serilog.Core.Logger
 - -> 00000024C48AC760 Serilog.Core.Logger
 - -> 00000024C47AD468 Serilog.Core.Logger
 - -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
 - -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
 - -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
 - -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
 - -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
 - -> 00000026E8C1A800 Serilog.Events.LogEvent[]
 - -> 00000026148D3308 Serilog.Events.LogEvent
 - -> 00000026148D4EF0 Serilog.Events.MessageTemplate
 - -> 0000002923BC9638 System.String
 - Found 1 unique roots (run '!gcroot -all' to see all roots).
 - 0:000> !gcroot 000000292586f4c8
 - Thread 1878:
 - 00000024C39BE4B0 00007FFA4C0B3522 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
 - rbp+10: 00000024c39be520
 - -> 00000024C48AD6E0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions+<RunAsync>d__4, Microsoft.Extensions.Hosting.Abstractions]]
 - -> 00000024C48AD200 Microsoft.Extensions.Hosting.Internal.Host
 - -> 00000024C48AC538 Microsoft.Extensions.Logging.Logger`1[[Microsoft.Extensions.Hosting.Internal.Host, Microsoft.Extensions.Hosting]]
 - -> 00000024C48AC658 Microsoft.Extensions.Logging.Logger
 - -> 00000024C48AC680 Microsoft.Extensions.Logging.LoggerInformation[]
 - -> 00000024C48AC6E0 Serilog.Extensions.Logging.SerilogLogger
 - -> 00000024C48AC808 Serilog.Core.Logger
 - -> 00000024C48AC760 Serilog.Core.Logger
 - -> 00000024C47AD468 Serilog.Core.Logger
 - -> 00000024C47ABF08 Serilog.Core.Sinks.SafeAggregateSink
 - -> 00000024C47ABF20 Serilog.Core.ILogEventSink[]
 - -> 00000024C479C548 Serilog.Sinks.Grafana.Loki.LokiSink
 - -> 00000024C479C778 Serilog.Sinks.Grafana.Loki.Infrastructure.BoundedQueue`1[[Serilog.Events.LogEvent, Serilog]]
 - -> 00000024C479C7B8 System.Collections.Generic.Queue`1[[Serilog.Events.LogEvent, Serilog]]
 - -> 00000026E8C1A800 Serilog.Events.LogEvent[]
 - -> 0000002614BB7AC8 Serilog.Events.LogEvent
 - -> 0000002616D3CC40 Serilog.Events.MessageTemplate
 - -> 000000292586F4C8 System.String
 - Found 1 unique roots (run '!gcroot -all' to see all roots).
 
從引用鏈看,這些 object 都是通過 Serilog 日志組件發(fā)送給 Grafana.Loki ,通過對引用鏈對象的追蹤,我發(fā)現(xiàn) System.Collections.Generic.Queue 有重大問題,截圖如下:
對,這個 queue 居然有 395w 的積壓,到底積壓了什么東西,可以用 !wdo 看一下string內(nèi)容。
看樣子這是一個失敗重試,分析到這里思路大概就清晰了,就是 Grafana.Loki 或者 Serilog 組件出了什么問題,導(dǎo)致日志發(fā)送不到 Loki 中或者發(fā)送速度過慢,然后不斷的積壓所致,接下來把分析到的這些信息和朋友做了一個溝通,截圖如下:
三、總結(jié)
本次內(nèi)存碎片化的主要原因在于 Serilog 對接 Loki 的過程中產(chǎn)生的395w的queue積壓所致,但我也只能分析到這里了,至于為什么有積壓,這個還得朋友進(jìn)一步調(diào)試分析,我相信這個問題很快就能得到解決。
本文轉(zhuǎn)載自微信公眾號「一線碼農(nóng)聊技術(shù)」,可以通過以下二維碼關(guān)注。轉(zhuǎn)載本文請聯(lián)系一線碼農(nóng)聊技術(shù)公眾號。




















 
 
 

















 
 
 
 