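Changes to the Intel PAUSE Instruction Hurt MySQL Performance: How Do You Fix It?
Thanks to its open-source nature, mature commercial backing, well-run community, and steady iteration on features, MySQL has become the standard relational database for internet companies. x86 servers and Linux, as infrastructure, together with MySQL form the foundation of internet data storage services, and the three complement each other. This article shares a case from our daily work: a change in the latency of the Intel PAUSE instruction caused a MySQL performance bottleneck, and the Meituan MySQL DBA team analyzed, located, and optimized the problem step by step across all three layers. We hope the approach is useful to others.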
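1. Background
In 2017, Intel released its new server platform, Purley, and reorganized the Intel Xeon Scalable Processor family into four tiers: Platinum, Gold, Silver, and Bronze. Product positioning and the overall lineup became clearer as a result.
Meituan's back-end services for high-volume online transactions and storage depend on a large fleet of high-performance servers. As some of the production Grantley-platform E-series servers approached end of life, and as the product line itself moved forward, the back-end storage (MySQL) of our RDS (relational database service) began rolling out large numbers of Purley-platform Skylake servers in 2019, including the Silver 4110.
Compared with the previous-generation E5-2620 v4, the Silver 4110 supports a higher memory frequency, more memory channels, a larger L2 cache, and a faster bus transfer rate. Intel's official figures show the Silver 4110 performing 10% better than the E5-2620 v4.
However, as the number of Skylake servers in production grew and more services moved onto them, the Meituan MySQL DBA team found that some MySQL instances did not perform as expected, and in some cases performance dropped considerably. After sustained analysis, we determined that the Skylake servers had a performance bottleneck:
- CPU load was relatively high.
- Throughput metrics such as TPS dropped.
Below, we examine the problem from three angles (the Intel CPU, the ut_delay function, and the PAUSE instruction) and explore possible optimizations.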
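2. Performance analysis
2.1 Performance differences between Grantley and Purley CPUs
First, we benchmarked CPUs from the two platform generations (Grantley and Purley) and compared their performance under different operating systems.
The benchmark data can be summarized as follows:
1. In the oltp_write_only (write-only) scenario, the Purley 4110 shows a noticeable performance drop.
2. On the same Purley 4110, CentOS 7 performs better than CentOS 6 in the oltp_write_only (write-only) scenario.
A line chart shows the differences:
In the chart above, CentOS 7 outperforms CentOS 6 on the same Purley 4110. The reasons for that improvement are outside the main topic of this article, so we only outline them here.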
New MCS-based Locking Mechanism
Red Hat Enterprise Linux 7.1 introduces a new locking mechanism, MCS locks. This new locking mechanism significantly reduces spinlock overhead in large systems, which makes spinlocks generally more efficient in Red Hat Enterprise Linux 7.1.
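According to the Red Hat release notes, a new locking mechanism, the MCS lock, was introduced starting with kernel 3.10.0-229. It reduces spinlock overhead so that spinlocks run more efficiently. With an ordinary spinlock on a multi-core system, only one CPU can hold the lock variable at a time while the others spin on it. To keep the data correct, the cache coherence protocol has to synchronize and invalidate the cache line state and data across all CPUs, which degrades performance. An MCS lock gives each CPU its own local "spinlock" variable and spins only on that, avoiding cache-line synchronization and improving performance. That said, spinlock optimization is still fairly contentious in the community; kernel developers later built qspinlock on top of MCS and merged it into the 4.x kernels. For implementation details, see: MCS locks and qspinlocks.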
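To make the "spin on a local variable" idea concrete, here is a minimal sketch of an MCS-style queue lock in C11 atomics. It is an illustration only, not the kernel's implementation; the names mcs_node, mcs_lock, mcs_lock_init, mcs_acquire, and mcs_release are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting thread; each waiter spins only on its own node. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

/* The lock itself is just a pointer to the tail of the waiter queue. */
typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

static inline void mcs_lock_init(mcs_lock *lock) {
    atomic_init(&lock->tail, NULL);
}

static inline void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Append ourselves to the queue; the previous tail is our predecessor. */
    mcs_node *prev = atomic_exchange_explicit(&lock->tail, me,
                                              memory_order_acq_rel);
    if (prev == NULL) {
        return; /* queue was empty: lock acquired immediately */
    }
    atomic_store_explicit(&prev->next, me, memory_order_release);

    /* Spin on our own flag only, so waiting stays on a private cache line. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
        /* a real implementation would issue PAUSE (cpu_relax) here */
    }
}

static inline void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire)) {
            return;
        }
        /* A successor is enqueuing; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
        }
    }
    /* Hand the lock over by clearing the successor's private flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each thread passes its own mcs_node (typically on its stack) to mcs_acquire and the same node to mcs_release. The only shared write on the contended path is the single exchange on tail, which is why this design scales better than having every waiter poll one shared word.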
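Having outlined the improvements in CentOS 7, we next dig into why the Skylake Silver 4110 causes performance to drop.
3. CPU performance tracing
3.1 Locating the hot function
We located the 4110 bottleneck in the following steps:
- First, use perf top to observe CPU overhead on Linux.
- Then, use perf record to capture each function's share of CPU cycles.
- Finally, use a flame graph to confirm the hot function.
The function consuming the largest share of CPU is ut_delay.
Let's dig further into the call chain: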
- # Children Self Command Shared Object Symbol
- # ........ ........ ....... ................... ..................................................................................................................................................................................
- #
- 93.54% 0.00% mysqld libpthread-2.17.so [.] start_thread
- |
- ---start_thread
- |
- |--77.07%--pfs_spawn_thread
- | |
- | --77.05%--handle_connection
- | |
- | --76.97%--do_command
- | |
- | |--74.30%--dispatch_command
- | | |
- | | |--71.16%--mysqld_stmt_execute
- | | | |
- | | | --70.74%--Prepared_statement::execute_loop
- | | | |
- | | | |--69.53%--Prepared_statement::execute
- | | | | |
- | | | | |--67.90%--mysql_execute_command
- | | | | | |
- | | | | | |--23.43%--trans_commit_stmt
- | | | | | | |
- | | | | | | --23.30%--ha_commit_trans
- | | | | | | |
- | | | | | | |--18.86%--MYSQL_BIN_LOG::commit
- | | | | | | | |
- | | | | | | | --18.18%--MYSQL_BIN_LOG::ordered_commit
- | | | | | | | |
- | | | | | | | |--8.02%--MYSQL_BIN_LOG::change_stage
- | | | | | | | | |
- | | | | | | | | |--2.35%--__lll_unlock_wake
- | | | | | | | | | |
- | | | | | | | | | --2.24%--system_call_fastpath
- | | | | | | | | | |
- | | | | | | | | | --2.24%--sys_futex
- | | | | | | | | | |
- | | | | | | | | | --2.23%--do_futex
- | | | | | | | | | |
- | | | | | | | | | --2.14%--futex_wake
- | | | | | | | | | |
- | | | | | | | | | --1.38%--wake_up_q
- | | | | | | | | | |
- | | | | | | | | | --1.33%--try_to_wake_up
- ...
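The flame graph shows the same call chain visually:
At this point it is fairly clear that, across all these call paths, most of the cost ends up in ut_delay.
3.2 The relationship between ut_delay and PAUSE, and its performance impact
3.2.1 The MySQL ut_delay implementation
Next, let's look at what ut_delay does in the MySQL source: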
- /*************************************************************//**
- Runs an idle loop on CPU. The argument gives the desired delay
- in microseconds on 100 MHz Pentium + Visual C++.
- @return dummy value */
- ulint
- ut_delay(
- /*=====*/
- ulint delay) /*!< in: delay in microseconds on 100 MHz Pentium */
- {
- ulint i, j;
-
- UT_LOW_PRIORITY_CPU();
-
- j = 0;
-
- for (i = 0; i < delay * 50; i++) {
- j += i;
- UT_RELAX_CPU();
- }
-
- UT_RESUME_PRIORITY_CPU();
-
- return(j);
- }
- ...
-
- # define UT_RELAX_CPU() asm ("pause" )
- # define UT_RELAX_CPU() __asm__ __volatile__ ("pause")
As the source shows, MySQL executes the PAUSE instruction while spinning, which is intended to improve the performance of spin-wait loops.
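For readers less familiar with the pattern, here is a minimal sketch of a bounded spin-wait loop that issues PAUSE between polls, assuming a GCC or Clang toolchain. The names cpu_relax and spin_wait are hypothetical and are not part of MySQL; on non-x86 targets the sketch falls back to a plain compiler barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hint to the CPU that we are in a spin-wait loop (PAUSE on x86). */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory"); /* compiler barrier only */
#endif
}

/* Poll *flag up to max_spins times, pausing between polls.
   Returns true if the flag was observed set, false if the budget ran out. */
static inline bool spin_wait(atomic_bool *flag, unsigned long max_spins) {
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire)) {
            return true;
        }
        cpu_relax();
    }
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

Note that the loop's wall-clock duration is max_spins multiplied by the cost of one PAUSE, so the same max_spins value waits much longer on a CPU where PAUSE is slower.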
3.2.2 How PAUSE latency has evolved
Intel's documentation also describes the change to PAUSE in the new platform architecture:
Pause Latency in Skylake Microarchitecture
The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is better to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling an OS synchronization API function, such as WaitForSingleObject on Windows* OS or futex on Linux.
…
The latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.
The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions. There’s also a small power benefit in 2-core and 4-core systems.
As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.
…
- In the previous-generation architecture (Grantley platform, E series), PAUSE takes about 10 cycles; in the new Skylake architecture it takes up to 140 cycles.
- If a program uses a fixed number of looped PAUSE instructions to implement a delay and hold up execution for a while, it may see unexpectedly long delays.
- Because PAUSE latency has increased, PAUSE-sensitive applications will lose some performance.
A simplified formula for program execution time:
ExecutionTime (T) = InstructionCount × TimePerCycle × CPI
That is: execution time = total instruction count × time per CPU clock cycle × average clock cycles per instruction (CPI).
MySQL's internal spin is implemented as exactly this kind of fixed-count PAUSE loop. When PAUSE latency goes up, each spin takes longer, program execution time grows accordingly, and overall system throughput suffers.
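As a rough worked example of the formula, assume a single ut_delay call with delay = 6 (the default innodb_spin_wait_delay, taken here as an upper bound because the actual argument is randomized) and count only the PAUSE instructions, ignoring loop overhead. The latencies of 10 and 140 cycles are the figures from the Intel text quoted above.

```latex
\begin{aligned}
N_{\text{PAUSE}} &= \text{delay} \times 50 = 6 \times 50 = 300 \\
C_{\text{prior}} &= N_{\text{PAUSE}} \times 10 = 3\,000 \ \text{cycles} \\
C_{\text{Skylake}} &= N_{\text{PAUSE}} \times 140 = 42\,000 \ \text{cycles} \\
\frac{C_{\text{Skylake}}}{C_{\text{prior}}} &= 14
\end{aligned}
```

With the instruction count and clock period unchanged, the higher CPI of PAUSE alone makes the same spin about 14 times longer.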
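Intel's documentation makes it clear that PAUSE latency differs across platforms and CPU architectures.
Below is a small test program that roughly verifies and compares the number of cycles PAUSE takes on the old and new architectures: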
- #include <stdio.h>
- #define TIMES 5
-
- static inline unsigned long long rdtsc(void)
- {
- unsigned long low, high;
- asm volatile("rdtsc" : "=a" (low), "=d" (high) );
- return ((low) | (high) << 32);
- }
-
- void pause_test()
- {
- int i = 0;
- for (i = 0; i < TIMES; i++) {
- asm(
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"\
- "pause\n"
- ::
- :);
- }
- }
-
- unsigned long pause_cycle()
- {
- unsigned long start, finish, elapsed;
- start = rdtsc();
- pause_test();
- finish = rdtsc();
- elapsed = finish - start;
- printf("Approximate cycles per PAUSE: %lu\n", elapsed / 100); /* 100 = TIMES * 20 PAUSE instructions */
- return 0;
- }
-
- int main()
- {
- pause_cycle();
- return 0;
- }
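The results are summarized below:
- The 4110 and 5118 have high PAUSE latency, over 100 cycles each; both belong to the first Purley generation, Skylake.
- The 4210 and 5218 have lower PAUSE latency than the previous generation because both belong to the second Purley generation, Cascade Lake, in which the PAUSE instruction was optimized.
3.2.3 Why Intel may have lengthened PAUSE
Our guess is that Intel increased PAUSE latency to reduce the probability of spinlock contention and to lower power consumption. The side effect is that each PAUSE takes longer to execute, which lowers overall throughput for software that spins a fixed number of times.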
The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions.
3.3 How PAUSE causes the write bottleneck
Next, we analyze why the PAUSE instruction causes a write bottleneck in MySQL.
First, look at the InnoDB semaphore data from MySQL's internal statistics:
- SEMAPHORES
- ----------
- OS WAIT ARRAY INFO: reservation count 153720
- --Thread 139868617205504 has waited at row0row.cc line 1075 for 0.00 seconds the semaphore:
- X-lock on RW-latch at 0x7f4298084250 created in file buf0buf.cc line 1425
- a writer (thread id 139869284108032) has reserved it in mode SX
- number of readers 0, waiters flag 1, lock_word: 10000000
- Last time read locked in file not yet reserved line 0
- Last time write locked in file /mnt/workspace/percona-server-5.7-redhat-binary-rocks-new/label_exp/min-centos-7-x64/test/rpmbuild/BUILD/percona-server-5.7.26-29/percona-server-5.7.26-29/storage/innobase/buf/buf0flu.cc line 1216
- OS WAIT ARRAY INFO: signal count 441329
- RW-shared spins 0, rounds 1498677, OS waits 111991
- RW-excl spins 0, rounds 717200, OS waits 9012
- RW-sx spins 47596, rounds 366136, OS waits 4100
- Spin rounds per wait: 1498677.00 RW-shared, 717200.00 RW-excl, 7.69 RW-sx
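The "Spin rounds per wait" line in the output above can be reproduced from the counters printed just before it. The sketch below assumes the ratio is rounds divided by spins, with a zero spin count treated as 1; that assumption is consistent with the 1498677.00 shown for RW-shared, but the helper name spin_rounds_per_wait is hypothetical and not an InnoDB function.

```c
#include <stdint.h>

/* Average spin rounds per spin-wait. A zero wait count is treated as 1
   (an assumption that matches the "1498677.00 RW-shared" line above). */
static inline double spin_rounds_per_wait(uint64_t rounds, uint64_t spins) {
    return (double)rounds / (double)(spins ? spins : 1);
}
```

For the RW-sx counters above, spin_rounds_per_wait(366136, 47596) gives roughly 7.69, matching the printed value. Every one of those rounds is a ut_delay call, and therefore a burst of PAUSE instructions.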
The output shows that writes are blocked on the call at storage/innobase/buf/buf0flu.cc line 1216.
Tracing the source where the wait occurs, buf0flu.cc line 1216:
- if (flush_type == BUF_FLUSH_LIST
- && is_uncompressed
- && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) { // check for a lock conflict before locking
- if (!fsp_is_system_temporary(bpage->id.space())) {
- /* avoiding deadlock possibility involves
- doublewrite buffer, should flush it, because
- it might hold the another block->lock. */
- buf_dblwr_flush_buffered_writes(
- buf_parallel_dblwr_partition(bpage,
- flush_type));
- } else {
- buf_dblwr_sync_datafiles();
- }
- rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE); // acquire the SX lock
- }
- ...
- #define rw_lock_sx_lock_nowait(M, P) \
- rw_lock_sx_lock_low((M), (P), __FILE__, __LINE__)
- ...
-
- rw_lock_sx_lock_func( // function that acquires the SX lock
- /*=================*/
- rw_lock_t* lock, /*!< in: pointer to rw-lock */
- ulint pass, /*!< in: pass value; != 0, if the lock will
- be passed to another thread to unlock */
- const char* file_name,/*!< in: file name where lock requested */
- ulint line) /*!< in: line where requested */
-
- {
- ulint i = 0;
- sync_array_t* sync_arr;
- ulint spin_count = 0;
- uint64_t count_os_wait = 0;
- ulint spin_wait_count = 0;
-
- ut_ad(rw_lock_validate(lock));
- ut_ad(!rw_lock_own(lock, RW_LOCK_S));
-
- lock_loop:
-
- if (rw_lock_sx_lock_low(lock, pass, file_name, line)) {
-
- if (count_os_wait > 0) {
- lock->count_os_wait +=
- static_cast<uint32_t>(count_os_wait);
- rw_lock_stats.rw_sx_os_wait_count.add(count_os_wait);
- }
-
- rw_lock_stats.rw_sx_spin_round_count.add(spin_count);
- rw_lock_stats.rw_sx_spin_wait_count.add(spin_wait_count);
-
- /* Locking succeeded */
- return;
-
- } else {
-
- ++spin_wait_count;
-
- /* Spin waiting for the lock_word to become free */
- os_rmb;
- while (i < srv_n_spin_wait_rounds
- && lock->lock_word <= X_LOCK_HALF_DECR) {
-
- if (srv_spin_wait_delay) {
- ut_delay(ut_rnd_interval(
- 0, srv_spin_wait_delay)); // lock attempt failed, call ut_delay
- }
-
- i++;
- }
-
- spin_count += i;
-
- if (i >= srv_n_spin_wait_rounds) {
-
- os_thread_yield();
-
- } else {
-
- goto lock_loop;
- }
- ...
- ulong srv_n_spin_wait_rounds = 30;
- ulong srv_spin_wait_delay = 6;
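The source above shows that MySQL implements lock waiting by calling ut_delay to run an idle loop.
InnoDB has three rw-lock modes: S (shared), X (exclusive), and SX (shared-exclusive). SX conflicts with both SX and X. Taking an SX lock does not block reads; it only blocks writes. Under a heavy write load this therefore produces a large number of lock waits, which means a large number of PAUSE instructions.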
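To get a feel for the scale, here is a hedged back-of-the-envelope sketch of the upper bound on PAUSE instructions in one spin phase before the thread yields, using the defaults shown in the source excerpt (30 rounds, delay 6, loop factor 50). The helper names are hypothetical; the bound ignores loop overhead and the fact that the delay argument is randomized, so real spins are shorter on average.

```c
#include <stdint.h>

/* Defaults taken from the source excerpt above. */
enum {
    SPIN_WAIT_ROUNDS  = 30, /* srv_n_spin_wait_rounds */
    SPIN_WAIT_DELAY   = 6,  /* srv_spin_wait_delay */
    PAUSE_LOOP_FACTOR = 50  /* hard-coded "delay * 50" in the 5.7 ut_delay */
};

/* Upper bound on PAUSE instructions executed before os_thread_yield(). */
static inline uint64_t max_pauses_per_spin_phase(uint64_t rounds,
                                                 uint64_t delay,
                                                 uint64_t factor) {
    return rounds * delay * factor;
}

/* Cycles spent in PAUSE alone, for a given per-PAUSE latency. */
static inline uint64_t pause_cycles(uint64_t pauses,
                                    uint64_t cycles_per_pause) {
    return pauses * cycles_per_pause;
}
```

With the defaults this is at most 9,000 PAUSE instructions per spin phase: about 90,000 cycles at 10 cycles per PAUSE, versus about 1,260,000 cycles at 140 cycles per PAUSE.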
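At this point we can summarize the two factors that affect throughput:
- The spin length, which the source of MySQL 5.7 and earlier fixes at spin_wait_delay * 50 loop iterations.
- The latency of the Intel CPU's PAUSE instruction.
Next, we approach the problem from these two angles and evaluate the room for optimization and its effect.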
4. Tuning and exploring the PAUSE instruction and spin parameters
4.1 MySQL spin parameter tuning
4.1.1 MySQL 5.7 spin parameter tuning
We can look for optimizations based on the existing MySQL version and hardware.
MySQL exposes the following parameters for spin control, which can be tuned according to their characteristics:
innodb_spin_wait_delay
The unit of innodb_spin_wait_delay is the time a 100 MHz Pentium processor takes for 1 microsecond of work (the ut_delay source comment gives the unit as microseconds). The default value is 6, meaning a spin of at most 6 microseconds on a 100 MHz Pentium.
innodb_sync_spin_loops
When an InnoDB thread tries to acquire a mutex and fails, it retries up to innodb_sync_spin_loops times.
Of these, innodb_spin_wait_delay affects how long the PAUSE loop runs, so we ran tuning tests on this parameter.
As before, we used benchmarks to compare performance with different settings:
The results can be summarized as follows:
- Adjusting innodb_spin_wait_delay has some effect on TPS and QPS: smaller values improve MySQL performance, and larger values reduce it.
- The gain from tuning innodb_spin_wait_delay is limited and still falls short of what production workloads need.
4.2 Backporting the MySQL 8.0 spin feature
4.2.1 Backporting spin_wait_pause_multiplier
For the throughput drop caused by PAUSE on Skylake CPUs, tuning the MySQL 5.7 spin parameter innodb_spin_wait_delay did not produce a significant improvement.
We therefore turned to a new feature in MySQL 8.0: the source adds a spin_wait_pause_multiplier parameter for PAUSE, replacing the previously hard-coded loop count.
4.2.2 How spin_wait_pause_multiplier is implemented
In the MySQL 8.0 source, the logic that used to loop a fixed 50 times per unit of delay now uses an adjustable parameter, spin_wait_pause_multiplier:
- ulint ut_delay(ulint delay) {
- ulint i, j;
- /* We don't expect overflow here, as ut::spin_wait_pause_multiplier is limited
- to 100, and values of delay are not larger than @@innodb_spin_wait_delay
- which is limited by 1 000. Anyway, in case an overflow happened, the program
- would still work (as iterations is unsigned). */
- const ulint iterations = delay * ut::spin_wait_pause_multiplier;
- UT_LOW_PRIORITY_CPU();
-
- j = 0;
-
- for (i = 0; i < iterations; i++) {
- j += i;
- UT_RELAX_CPU();
- }
-
- UT_RESUME_PRIORITY_CPU();
-
- return (j);
- }
- ...
- namespace ut {
- ulong spin_wait_pause_multiplier = 50;
- }
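4.2.3 Optimizing with the backported spin_wait_pause_multiplier patch
Since the MySQL 8.0 parameter spin_wait_pause_multiplier controls how long the PAUSE loop runs, lowering it reduces the overall impact of PAUSE.
After studying the relevant MySQL 8.0 code, we backported the patch to the stable version we run in production: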
- MySQL>select version();
- +------------------+
- | version() |
- +------------------+
- | 5.7.26-29-mt-log |
- +------------------+
- 1 row in set (0.00 sec)
-
- MySQL>show global variables like '%spin%';
- +-----------------------------------+-------+
- | Variable_name | Value |
- +-----------------------------------+-------+
- | innodb_spin_wait_delay | 6 |
- | innodb_spin_wait_pause_multiplier | 5 |
- | innodb_sync_spin_loops | 30 |
- +-----------------------------------+-------+
- 3 rows in set (0.00 sec)
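As shown earlier, the PAUSE latency of the Silver 4110 is about 14 times that of the E5-2620 v4. On that basis, we set innodb_spin_wait_pause_multiplier to roughly 1/14 of its default and chose a slightly larger value, 5. In other words, the parameter changes from the default 50 to 5.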
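The arithmetic behind that choice can be sketched as follows. The helper scaled_multiplier is hypothetical (it is not part of MySQL); it scales the default multiplier by the ratio of old to new PAUSE latency and rounds up, and it assumes new_pause_cycles is non-zero.

```c
/* Scale a spin multiplier so the total PAUSE time stays roughly constant
   when PAUSE latency changes. Rounds up and never returns 0.
   Precondition: new_pause_cycles > 0. */
static inline unsigned long scaled_multiplier(unsigned long default_multiplier,
                                              unsigned long old_pause_cycles,
                                              unsigned long new_pause_cycles) {
    unsigned long scaled =
        (default_multiplier * old_pause_cycles + new_pause_cycles - 1)
        / new_pause_cycles;
    return scaled > 0 ? scaled : 1;
}
```

With a default of 50 and latencies of 10 and 140 cycles, the exact ratio is about 3.6, which this helper rounds up to 4; the value of 5 chosen above is slightly more conservative. At multiplier 5 and delay 6, one ut_delay call costs at most 6 × 5 × 140 = 4,200 PAUSE cycles, compared with 6 × 50 × 10 = 3,000 on the older CPU.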
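Finally, a line chart compares the benchmark data after tuning with the patch:
- After backporting the spin_wait_pause_multiplier patch to the Silver 4110 and tuning it, the 4110 (patch) performs considerably better.
- The Silver 4110 (patch) outperforms tuning innodb_spin_wait_delay alone.
- In the write-only scenario with more than 64 concurrent threads, the Silver 4110 (patch) is slightly slower than the E5-2620 v4; in all other cases it is faster.
- At our real production read/write ratio, the 4110 (patch) restores throughput to its previous level.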
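4.3 PAUSE latency optimization in hardware
In the earlier section we measured that PAUSE latency dropped on Cascade Lake CPUs. Intel engineers confirmed that, starting with Cascade Lake, the second Purley generation, Intel reduced PAUSE latency to 44 cycles. (Presumably Intel also noticed the performance bottleneck caused by the increased PAUSE latency in the first generation.)
We continued benchmarking the second-generation CPUs to see how they perform:
We then used perf diff to compare the ut_delay overhead on the 4110 and the 4210:
- The share of ut_delay on the 4210 is 8% lower than on the 4110.
- Because PAUSE latency is still several times that of the E5-series CPUs, PAUSE overhead still has a significant effect on MySQL throughput on the 4210 under high load. Below 128 concurrent threads, however, performance is considerably better than on the 4110 and should in principle meet production needs (the result curve matches the benchmark data for the backported spin_wait_pause_multiplier patch).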
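5. Summary
To wrap up, here is a brief summary of this article:
On its new platform CPUs, Intel increased the latency of the PAUSE instruction. Under heavy spinlock contention at high concurrency, this can cause a significant performance loss, especially for programs that execute a fixed number of PAUSE instructions.
For Skylake CPUs (such as the 4110), the performance problems caused by the longer PAUSE latency can be addressed as follows:
- Backport the MySQL 8.0 innodb_spin_wait_pause_multiplier patch to your stable production version (or upgrade to MySQL 8.0) and raise throughput by shortening the PAUSE loop.
- If the OS is CentOS 6, upgrade to CentOS 7; its spinlock optimizations also give MySQL some performance gain.
- The simplest and most direct option is to switch to Cascade Lake CPUs.
For Cascade Lake CPUs, Intel has already reduced PAUSE latency, which largely fixes the performance problem. The optimizations above can still be applied to push performance a step further.
6. About the author
Chunlin joined Meituan in 2017 and is mainly responsible for MySQL operations development and optimization.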