Watchdog Mechanism: Source Code Analysis
Preface
Linux provides a watchdog. In the Linux kernel, once the watchdog is started a timer is armed; if nothing writes to /dev/watchdog before the timeout expires, the system reboots. A watchdog implemented with a timer works at the software level.
Android has its own software-level Watchdog that protects a number of important system services. When a failure occurs it usually restarts the Android system. Because of this mechanism, a common problem is the system_server process being killed by Watchdog, followed by the phone restarting.
This article walks through how it works.
I. WatchDog startup in detail
The ANR mechanism targets applications. For system processes, Android uses the WatchDog mechanism to handle long periods of unresponsiveness: once the unresponsive period exceeds the timeout, WatchDog triggers a self-kill.
Watchdog is a thread (it extends Thread). SystemServer.java obtains the Watchdog object through getInstance.
1. Startup in SystemServer.java
- private void startOtherServices() {
- ······
- traceBeginAndSlog("InitWatchdog");
- final Watchdog watchdog = Watchdog.getInstance();
- watchdog.init(context, mActivityManagerService);
- traceEnd();
- ······
- traceBeginAndSlog("StartWatchdog");
- Watchdog.getInstance().start();
- traceEnd();
- }
Because Watchdog is a thread, calling start() is all it takes to get it running.
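The getInstance/init/start sequence above can be sketched outside Android as a singleton that is also a thread. This is only an illustration: MiniWatchdog and its members are hypothetical names, not the AOSP class, and its run method returns immediately instead of looping forever.

```java
// Hypothetical stand-in for the Watchdog singleton; not the AOSP implementation.
final class MiniWatchdog extends Thread {
    private static MiniWatchdog sInstance;
    private volatile boolean initialized;

    private MiniWatchdog() {
        super("mini-watchdog");   // thread name, like super("watchdog") in the original
        setDaemon(true);          // assumption: do not keep the JVM alive for the demo
    }

    // Lazily creates the single instance; every caller gets the same object.
    static synchronized MiniWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new MiniWatchdog();
        }
        return sInstance;
    }

    // Placeholder for init(context, activityManagerService) in the original.
    void init() {
        initialized = true;
    }

    boolean isInitialized() {
        return initialized;
    }

    @Override
    public void run() {
        // The real watchdog loops forever here; the sketch only reports that it ran.
        System.out.println(getName() + " running, initialized=" + initialized);
    }

    public static void main(String[] args) throws InterruptedException {
        MiniWatchdog watchdog = MiniWatchdog.getInstance();
        watchdog.init();
        MiniWatchdog.getInstance().start();   // same object, so this starts the thread once
        watchdog.join();
    }
}
```

Because the constructor is private, getInstance is the only way to reach the object, which is why SystemServer can call it twice and still start a single thread.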
2. The WatchDog constructor
- private Watchdog() {
- super("watchdog");
- // Initialize handler checkers for each common thread we want to check. Note
- // that we are not currently checking the background thread, since it can
- // potentially hold longer running operations with no guarantees about the timeliness
- // of operations there.
- // The shared foreground thread is the main checker. It is where we
- // will also dispatch monitor checks and do other work.
- mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
- "foreground thread", DEFAULT_TIMEOUT);
- mHandlerCheckers.add(mMonitorChecker);
- // Add checker for main thread. We only do a quick check since there
- // can be UI running on the thread.
- mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
- "main thread", DEFAULT_TIMEOUT));
- // Add checker for shared UI thread.
- mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
- "ui thread", DEFAULT_TIMEOUT));
- // And also check IO thread.
- mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
- "i/o thread", DEFAULT_TIMEOUT));
- // And the display thread.
- mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
- "display thread", DEFAULT_TIMEOUT));
- // Initialize monitor for Binder threads.
- addMonitor(new BinderThreadMonitor());
- mOpenFdMonitor = OpenFdMonitor.create();
- // See the notes on DEFAULT_TIMEOUT.
- assert DB ||
- DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
- // mtk enhance
- exceptionHWT = new ExceptionLog();
- }
Two objects matter here: mMonitorChecker and mHandlerCheckers.
Sources of the entries in the mHandlerCheckers list:
Added in the constructor: FgThread, the main thread, UiThread, IoThread, and DisplayThread
Added externally: Watchdog.getInstance().addThread(handler);
Sources of the entries in mMonitorChecker's monitor list (mMonitors):
Added externally: Watchdog.getInstance().addMonitor(monitor);
Special case: the constructor itself calls addMonitor(new BinderThreadMonitor());
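To make the two registration paths concrete, here is a minimal sketch without Android dependencies. CheckerRegistry, Checker, and the use of plain thread names in place of Handler objects are assumptions made for illustration; only the method names addMonitor and addThread and the checker names come from the constructor above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how checkers and monitors are registered.
final class CheckerRegistry {
    interface Monitor {
        void monitor();
    }

    // One checker per watched thread; only the monitor checker carries monitors.
    static final class Checker {
        final String name;
        final List<Monitor> monitors = new ArrayList<>();

        Checker(String name) {
            this.name = name;
        }
    }

    private final List<Checker> handlerCheckers = new ArrayList<>();
    private final Checker monitorChecker;

    CheckerRegistry() {
        // The foreground-thread checker doubles as the monitor checker.
        monitorChecker = new Checker("foreground thread");
        handlerCheckers.add(monitorChecker);
        handlerCheckers.add(new Checker("main thread"));
        handlerCheckers.add(new Checker("ui thread"));
        handlerCheckers.add(new Checker("i/o thread"));
        handlerCheckers.add(new Checker("display thread"));
        // Stand-in for addMonitor(new BinderThreadMonitor()): the list is never empty.
        addMonitor(() -> { });
    }

    // Monitors always land on the monitor checker.
    synchronized void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    // Each extra watched thread gets its own checker with no monitors.
    synchronized void addThread(String threadName) {
        handlerCheckers.add(new Checker(threadName));
    }

    synchronized int checkerCount() {
        return handlerCheckers.size();
    }

    synchronized int monitorCount() {
        return monitorChecker.monitors.size();
    }

    public static void main(String[] args) {
        CheckerRegistry registry = new CheckerRegistry();
        registry.addThread("some service thread");
        registry.addMonitor(() -> { });
        System.out.println(registry.checkerCount() + " checkers, "
                + registry.monitorCount() + " monitors");
    }
}
```

The point of the split is that addThread widens the set of threads being watched, while addMonitor only adds work to the one checker that runs on the foreground thread.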
3. The WatchDog run method
- public void run() {
- boolean waitedHalf = false;
- boolean mSFHang = false;
- while (true) {
- ······
- synchronized (this) {
- ······
- for (int i=0; i<mHandlerCheckers.size(); i++) {
- HandlerChecker hc = mHandlerCheckers.get(i);
- hc.scheduleCheckLocked();
- }
- ······
- }
- ······
- }
- }
The loop schedules a check on every entry in mHandlerCheckers.
4. HandlerChecker's scheduleCheckLocked
- public void scheduleCheckLocked() {
- if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
- // If the target looper has recently been polling, then
- // there is no reason to enqueue our checker on it since that
- // is as good as it not being deadlocked. This avoid having
- // to do a context switch to check the thread. Note that we
- // only do this if mCheckReboot is false and we have no
- // monitors, since those would need to be executed at this point.
- mCompleted = true;
- return;
- }
- if (!mCompleted) {
- // we already have a check in flight, so no need
- return;
- }
- mCompleted = false;
- mCurrentMonitor = null;
- mStartTime = SystemClock.uptimeMillis();
- mHandler.postAtFrontOfQueue(this);
- }
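When mMonitors.size() == 0, the checker only needs to establish whether its own thread (an entry in mHandlerCheckers) has stalled. It does that with mHandler.getLooper().getQueue().isPolling(): if the looper is idle and polling for messages, the check is marked complete right away.
For mMonitorChecker the monitor list is never empty, so the part to focus on is mHandler.postAtFrontOfQueue(this), which queues the HandlerChecker's own run method on the target thread: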
- public void run() {
- final int size = mMonitors.size();
- for (int i = 0 ; i < size ; i++) {
- synchronized (Watchdog.this) {
- mCurrentMonitor = mMonitors.get(i);
- }
- mCurrentMonitor.monitor();
- }
- synchronized (Watchdog.this) {
- mCompleted = true;
- mCurrentMonitor = null;
- }
- }
run calls monitor() on every entry in mMonitors. Only mMonitorChecker has a non-empty list; for example, the following services add themselves to it through addMonitor:
- ActivityManagerService.java
- Watchdog.getInstance().addMonitor(this);
- InputManagerService.java
- Watchdog.getInstance().addMonitor(this);
- PowerManagerService.java
- Watchdog.getInstance().addMonitor(this);
- WindowManagerService.java
- Watchdog.getInstance().addMonitor(this);
The monitor method that gets executed is very simple. ActivityManagerService, for example:
- public void monitor() {
- synchronized (this) { }
- }
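All this does is check whether the system service's lock is held: if another thread holds it, monitor() blocks until the lock is released.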
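The empty synchronized block works as a probe because it cannot return while another thread owns the lock. The sketch below reproduces that effect with plain java.util.concurrent classes. LockProbeDemo and its methods are hypothetical, and a single-thread executor stands in for the handler thread on which the real check is posted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo of a monitor()-style lock probe with a timeout.
final class LockProbeDemo {
    // Returns true if the probe entered and left synchronized(lock) within timeoutMs.
    static boolean probe(Object lock, long timeoutMs) {
        ExecutorService checker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "checker");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<?> check = checker.submit(() -> {
                synchronized (lock) {
                    // Empty on purpose: acquiring the lock is the whole test.
                }
            });
            check.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;   // the lock holder did not let go in time
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            checker.shutdown();
        }
    }

    // Runs the probe while a second thread keeps the lock held.
    static boolean probeWhileHeld(Object lock, long timeoutMs) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await();   // simulate a service stuck inside its lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await();              // make sure the lock is really taken first
            return probe(lock, timeoutMs);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();       // let the holder finish
        }
    }

    public static void main(String[] args) {
        System.out.println("free lock probed in time: " + probe(new Object(), 1000));
        System.out.println("held lock probed in time: " + probeWhileHeld(new Object(), 200));
    }
}
```

The real Watchdog does not wait on a Future; it records a start time and later compares it against the timeout. The blocking behaviour of the probe itself is the same.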
BinderThreadMonitor is an inner class of Watchdog:
- private static final class BinderThreadMonitor implements Watchdog.Monitor {
- @Override
- public void monitor() {
- Binder.blockUntilThreadAvailable();
- }
- }
- android.os.Binder.java
- public static final native void blockUntilThreadAvailable();
- android_util_Binder.cpp
- static void android_os_Binder_blockUntilThreadAvailable(JNIEnv* env, jobject clazz)
- {
- return IPCThreadState::self()->blockUntilThreadAvailable();
- }
- IPCThreadState.cpp
- void IPCThreadState::blockUntilThreadAvailable()
- {
- pthread_mutex_lock(&mProcess->mThreadCountLock);
- while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
- ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
- static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
- static_cast<unsigned long>(mProcess->mMaxThreads));
- pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
- }
- pthread_mutex_unlock(&mProcess->mThreadCountLock);
- }
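This only checks that the number of executing binder threads in the process stays below mMaxThreads. Once the count reaches the maximum (31 for system_server), the call has to wait.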
- ProcessState.cpp
- #define DEFAULT_MAX_BINDER_THREADS 15
- SystemServer.java, however, overrides this default
- // maximum number of binder threads used for system_server
- // will be higher than the system default
- private static final int sMaxBinderThreads = 31;
- private void run() {
- ······
- BinderInternal.setMaxThreads(sMaxBinderThreads);
- ······
- }
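The wait in blockUntilThreadAvailable is a standard condition-variable loop. A rough Java equivalent is sketched below, assuming a hypothetical ThreadBudget class. Unlike the native code, which blocks indefinitely, this version takes a timeout so that the blocked case can be observed.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of the mutex/condition pair used in IPCThreadState.cpp.
final class ThreadBudget {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executingThreads;

    ThreadBudget(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a transaction.
    void begin() {
        lock.lock();
        try {
            executingThreads++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finishes and wakes up anyone waiting for capacity.
    void end() {
        lock.lock();
        try {
            executingThreads--;
            countDecremented.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits while executingThreads >= maxThreads; returns false if timeoutMs elapses first.
    boolean awaitAvailable(long timeoutMs) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (executingThreads >= maxThreads) {
                if (remainingNanos <= 0) {
                    return false;
                }
                remainingNanos = countDecremented.awaitNanos(remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ThreadBudget budget = new ThreadBudget(31);   // 31 matches sMaxBinderThreads above
        budget.begin();
        System.out.println("capacity available: " + budget.awaitAvailable(100));
    }
}
```

If every binder thread is busy, the monitor never returns, the foreground-thread checker never completes, and Watchdog eventually treats that as a hang.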
5. Exiting after a timeout
- public void run() {
- ······
- Process.killProcess(Process.myPid());
- System.exit(10);
- ······
- }
Watchdog kills the process it runs in (system_server) and exits.
II. How it works
1. Every service that needs monitoring either calls Watchdog's addMonitor to add a Monitor Checker to the mMonitors list, or calls addThread to add a Looper Checker to the mHandlerCheckers list.
2. Once the Watchdog thread is started, its run method begins executing an infinite loop:
- First, it calls HandlerChecker#scheduleCheckLocked on every entry in mHandlerCheckers.
- Second, it periodically checks for a timeout. The interval between checks is set by the CHECK_INTERVAL constant, which is 30 seconds. Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
- COMPLETED: the check has finished.
- WAITING and WAITED_HALF: still waiting but not yet timed out. A trace is dumped once when WAITED_HALF is reached.
- OVERDUE: timed out. The default timeout is 1 minute.
3. If the timeout expires and a HandlerChecker is still incomplete (OVERDUE), getBlockedCheckersLocked() retrieves the blocked HandlerCheckers, a description is generated, and logs are saved, including runtime stack traces.
4. Finally, the SystemServer process is killed.
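The four states can be read as a function of how long a check has been pending. The sketch below is an assumption based on the 30-second and 60-second figures given in this article: CompletionState, stateOf, and evaluate are hypothetical names, and taking the worst state across all checkers is an assumed reading of what evaluateCheckerCompletionLocked() does.

```java
// Hypothetical model of the completion states; thresholds follow the 30 s / 60 s figures.
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static final long DEFAULT_TIMEOUT_MS = 60_000;
    static final long CHECK_INTERVAL_MS = DEFAULT_TIMEOUT_MS / 2;

    // State of one checker, given when its check was posted and the current time.
    static int stateOf(boolean completed, long startMs, long nowMs, long timeoutMs) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMs - startMs;
        if (latency < timeoutMs / 2) {
            return WAITING;        // less than half the timeout has passed
        }
        if (latency < timeoutMs) {
            return WAITED_HALF;    // past the halfway mark: time to dump a trace
        }
        return OVERDUE;            // full timeout reached
    }

    // Overall state is the worst state of any checker.
    static int evaluate(int[] checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        int afterOneInterval = stateOf(false, 0, CHECK_INTERVAL_MS, DEFAULT_TIMEOUT_MS);
        int afterTwoIntervals = stateOf(false, 0, 2 * CHECK_INTERVAL_MS, DEFAULT_TIMEOUT_MS);
        System.out.println("after one interval: " + afterOneInterval);
        System.out.println("after two intervals: " + afterTwoIntervals);
    }
}
```

Ordering the constants from best to worst is what lets a single Math.max fold decide the outcome: one OVERDUE checker is enough to send the loop down the kill path.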
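Summary
Watchdog is a thread that monitors whether the system services are running normally and have not deadlocked.
HandlerChecker checks both Handlers and monitors.
A monitor detects a deadlock by trying to acquire a lock.
After 30 seconds without completion a log is written; after 60 seconds the system restarts.
Watchdog kills its own process, so the system_server process ID changes afterwards.
Reposted from the WeChat official account 「Android开发编程」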