This article explains how online ANR monitoring works on Android. It starts from the framework's own Watchdog and then looks at how the same idea can be applied inside an application.
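Watchdog in Android
In Android, the Watchdog checks whether key system services have deadlocked. If they have, it kills the process, which restarts SystemServer.
The Watchdog is initialized in SystemServer, so it runs inside the SystemServer process.
The Watchdog runs on its own thread. After each 30 s wait it starts one round of checks. If the system goes to sleep, the Watchdog's wait is suspended as well, and checking resumes only after the system wakes up.
An object that wants to be watched implements the monitor() method of the Watchdog.Monitor interface and then calls addMonitor().
Besides thread deadlocks, the framework Watchdog can also detect stalled threads: addMonitor() is for deadlock detection, and addThread() is for stall detection.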
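To make the monitor() contract concrete, here is a minimal plain-Java sketch that runs without the Android framework. The Monitor interface below is a hypothetical stand-in for Watchdog.Monitor, and CacheService and MonitorDemo are invented names for illustration only. It shows why an empty synchronized block works as a probe: the call returns only if the lock can be acquired.

```java
/** Hypothetical stand-in for Watchdog.Monitor (assumption, not the framework type). */
interface Monitor {
    void monitor();
}

/** Hypothetical service that guards its state with its own intrinsic lock. */
class CacheService implements Monitor {
    private int hits;

    synchronized void recordHit() {
        hits++;
    }

    @Override
    public void monitor() {
        // Empty on purpose: entering the block proves the lock can be acquired.
        synchronized (this) { }
    }
}

class MonitorDemo {
    /** Returns true if monitor() returned within timeoutMillis. */
    static boolean probe(Monitor m, long timeoutMillis) {
        Thread t = new Thread(m::monitor, "probe");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Still alive means monitor() is blocked waiting for the lock.
        return !t.isAlive();
    }
}
```

If another thread holds the CacheService lock for longer than the timeout, probe() reports false, which is the situation the Watchdog treats as a suspected deadlock.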
Watchdog deadlock monitoring implementation
For deadlock monitoring, the watched object implements the monitor() method of the Watchdog.Monitor interface and then calls addMonitor(). ActivityManagerService is one example:
public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

    public ActivityManagerService(Context systemContext) {
        Watchdog.getInstance().addMonitor(this);
    }

    public void monitor() {
        synchronized (this) { }
    }

    // ...
}
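The ActivityManagerService excerpt shows only the code that lets the Watchdog monitor the ActivityManagerService object lock. The monitoring itself is implemented as follows. Watchdog is a Thread; once started, it waits 30 s, runs a check, and repeats this in an endless loop: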
public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i = 0; i < mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }

            // something is overdue!
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Pull our own kernel thread stacks as well if we're configured for that
        if (RECORD_KERNEL_THREADS) {
            dumpKernelStackTraces();
        }

        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        String traceFileNameAmendment = "_SystemServer_WDT" + mTraceDateFormat.format(new Date());

        if (tracesPath != null && tracesPath.length() != 0) {
            File traceRenameFile = new File(tracesPath);
            String newTracesPath;
            int lpos = tracesPath.lastIndexOf(".");
            if (-1 != lpos)
                newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment
                        + tracesPath.substring(lpos);
            else
                newTracesPath = tracesPath + traceFileNameAmendment;
            traceRenameFile.renameTo(new File(newTracesPath));
            tracesPath = newTracesPath;
        }

        final File newFd = new File(tracesPath);

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, newFd, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        // At times, when user space watchdog traces don't give an indication on
        // which component held a lock, because of which other threads are blocked,
        // (thereby causing Watchdog), crash the device to analyze RAM dumps
        boolean crashOnWatchdog = SystemProperties
                .getBoolean("persist.sys.crashOnWatchdog", false);
        if (crashOnWatchdog) {
            // Trigger the kernel to dump all blocked threads, and backtraces
            // on all CPUs to the kernel log
            Slog.e(TAG, "Triggering SysRq for system_server watchdog");
            doSysRq('w');
            doSysRq('l');

            // wait until the above blocked threads be dumped into kernel log
            SystemClock.sleep(3000);

            // now try to crash the target
            doSysRq('c');
        }

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            for (int i = 0; i < blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element : stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
First, ActivityManagerService calls addMonitor() to add itself to the Watchdog's mMonitorChecker. This is a member field of Watchdog that is created in the Watchdog constructor and added there to mHandlerCheckers, the ArrayList<HandlerChecker> of everything being watched. mMonitorChecker is an instance of the HandlerChecker class:
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) {
        mMonitors.add(monitor);
    }

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked. This avoid having
            // to do a context switch to check the thread. Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

    public boolean isOverdueLocked() {
        return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    }

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

    public Thread getThread() {
        return mHandler.getLooper().getThread();
    }

    public String getName() {
        return mName;
    }

    public String describeBlockedStateLocked() {
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                    + " on " + mName + " (" + getThread().getName() + ")";
        }
    }

    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
mMonitors inside HandlerChecker is also a list of watched objects: it holds every object that implements the Watchdog.Monitor interface. Threads that are watched without a Monitor get their own HandlerChecker, which is added to the Watchdog's mHandlerCheckers list. When the Watchdog thread starts a round of monitoring, it iterates over mHandlerCheckers and calls scheduleCheckLocked() on each HandlerChecker:
public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked. This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}
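HandlerChecker has a few important fields. mCompleted indicates whether the current check has finished, mStartTime records when the current check started, and mHandler is the Handler of the thread being watched. scheduleCheckLocked() starts one check of that thread, so it sets mCompleted to false and records the start time. The principle is to post a task, the HandlerChecker itself, to the front of the message queue of the watched thread's Handler. The HandlerChecker then runs on the watched thread as part of that queue. If the queue is stuck on some other task, the HandlerChecker cannot run in time, and once the configured time has passed the watched thread is considered hung, either because of a deadlock or because a long-running task is blocking it. The HandlerChecker task itself looks like this: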
@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
run() first iterates over the objects in mMonitors and calls monitor() on each one. A watched object usually implements monitor() like this:
public void monitor() {
    synchronized (this) { }
}
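That is, each monitor() call probes one specific lock for a deadlock. Once every monitor has returned, the check is complete and mCompleted is set to true. After scheduleCheckLocked() has been called on every checker, the Watchdog starts to wait, and it must wait the full 30 s. There is an implementation detail here: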
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
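When I first read this code, I noticed that SystemClock.uptimeMillis() does not advance while the device is asleep. So I wondered whether the loop exists because of device sleep: suppose the device goes to sleep 15 s into the Watchdog's wait and stays asleep for 30 minutes. Does wait() return as soon as the device wakes up? The answer is that normally it does not: the wait continues until the remaining 15 s have elapsed. That left me puzzled about why the loop is needed, so I checked the documentation of Object.wait() and found this explanation: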
A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops, like this one:

    synchronized (obj) {
        while (<condition does not hold>)
            obj.wait(timeout);
        ... // Perform action appropriate to condition
    }
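The loop that the Javadoc describes can be combined with a deadline, which is what the Watchdog's wait loop does. The sketch below is a plain-Java illustration under assumptions: DeadlineWait and waitFullInterval are hypothetical names, and System.nanoTime() is used in place of SystemClock.uptimeMillis() so that it runs outside Android.

```java
class DeadlineWait {
    /**
     * Blocks on lock until at least intervalMillis have elapsed, going back to
     * wait after any early wakeup (notify, interrupt, or spurious wakeup).
     * Returns the elapsed time in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the Watchdog loop, keep waiting; other callers may prefer to stop.
                }
                // Recompute what is left instead of trusting that wait() ran to the end.
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

The condition being tested is "the interval has elapsed", so an early wakeup only shortens the next wait() call and never the total interval.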
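The Javadoc says that a waiting thread can be woken by a notification (notify or notifyAll), by an interrupt, or by the timeout expiring, and that it can also wake up spuriously. Spurious wakeups are rare in practice, but a program has to guard against them by re-checking the wakeup condition and going back to wait if the condition does not hold. This kind of detail matters in real development. It may never show up 99% of the time, but when you hit the remaining 1% the bug is very obscure and hard to track down: a thread that was waiting normally suddenly wakes up for no apparent reason, and you may even start doubting what you knew about how wait() behaves while the device is asleep.

Back to the Watchdog mechanism. After waiting 30 s, the Watchdog calls evaluateCheckerCompletionLocked() to check the state of everything it watches: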
private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i = 0; i < mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
It calls getCompletionStateLocked() on each HandlerChecker to get that checker's state:
public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
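getCompletionStateLocked() shows that the mCompleted flag is what distinguishes the state before and after the 30 s wait. Before the wait, a HandlerChecker task was posted to the message queue of the watched thread's Handler and mCompleted was set to false. After the wait, mCompleted == true means the HandlerChecker ran in time, and mCompleted == false means it still has not finished. In that case the Watchdog looks at how long the check has been pending, counting only time while the device is awake. If it is less than 30 s, the Watchdog simply waits for another interval (the check is still in flight, so scheduleCheckLocked() does not post it again). If it is between 30 s and 60 s, the Watchdog dumps stack traces once and then keeps waiting. If it exceeds 60 s, the situation is treated as abnormal, caused either by a deadlock or by a task that runs too long. The Watchdog then collects diagnostic information such as stack traces, kernel thread stacks and CPU information, writes a trace file, saves the report to the dropbox, and kills the process. That completes the monitoring cycle.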
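The thresholds described above can be written as a small pure function, which makes the boundaries easy to check. This is a sketch only: the class and method names are hypothetical, and the numeric values of the four constants are an assumption (the excerpts do not show them). Their ordering does matter, because evaluateCheckerCompletionLocked() uses Math.max() to pick the worst state across all checkers.

```java
class CompletionState {
    // Assumed values; only the ordering COMPLETED < WAITING < WAITED_HALF < OVERDUE matters.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors the branching of getCompletionStateLocked() for one checker. */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) return COMPLETED;
        if (latencyMillis < waitMaxMillis / 2) return WAITING;
        if (latencyMillis < waitMaxMillis) return WAITED_HALF;
        return OVERDUE;
    }

    /** Mirrors evaluateCheckerCompletionLocked(): the worst state wins. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) worst = Math.max(worst, s);
        return worst;
    }
}
```

With a 60 s limit, a check that has been pending for 10 s is WAITING, one pending for 30 s is WAITED_HALF, and one pending for 60 s or more is OVERDUE.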
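Watchdog thread stall monitoring implementation
As described earlier, the Watchdog monitors a thread by posting a HandlerChecker to the message queue of that thread's Handler, and the objects watched for deadlocks are stored in the HandlerChecker's mMonitors list. Every external call to addMonitor() therefore ends up in the list held by the Watchdog's mMonitorChecker field, so mMonitorChecker is responsible for all deadlock monitoring. For long-running tasks on a thread, the Watchdog uses addThread():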
public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}
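addThread() creates a new HandlerChecker that watches the thread for long-running tasks. Its mMonitors list is empty, so when the task runs it does not call any monitor() method and just sets the mCompleted flag. In other words, the HandlerChecker is the actual monitor inside the Watchdog, and it covers both deadlock detection and long-task detection. With Monitor objects registered it detects both deadlocks and long-running tasks; without any Monitor it only detects stalls caused by long-running tasks on the thread.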
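The two modes described above can be imitated in plain Java. The sketch below is an illustration under assumptions, not the framework code: QueueChecker is a hypothetical class, a java.util.concurrent.Executor stands in for the Handler, and Runnable stands in for Watchdog.Monitor. With no monitors it only proves that the queue is draining; with monitors it also probes each lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

/** Plain-Java imitation of HandlerChecker; an Executor stands in for the Handler. */
class QueueChecker implements Runnable {
    private final Executor target;                       // the queue being watched
    private final List<Runnable> monitors = new ArrayList<>();
    private boolean completed = true;
    private long startNanos;

    QueueChecker(Executor target) {
        this.target = target;
    }

    synchronized void addMonitor(Runnable monitor) {
        monitors.add(monitor);
    }

    /** Posts this checker to the watched queue unless a check is already in flight. */
    synchronized void scheduleCheck() {
        if (!completed) return;
        completed = false;
        startNanos = System.nanoTime();
        target.execute(this);
    }

    /** Runs on the watched thread: probe every monitor, then mark the check done. */
    @Override
    public void run() {
        List<Runnable> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (Runnable m : snapshot) {
            m.run();                                     // blocks here if a watched lock is stuck
        }
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    /** Milliseconds the current check has been pending, or 0 if none is in flight. */
    synchronized long pendingMillis() {
        return completed ? 0 : (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

A caller would invoke scheduleCheck() at a fixed interval and treat a growing pendingMillis() as the signal that the watched queue, or one of its locks, is stuck.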
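Watchdog monitoring flow
Having understood the Watchdog's monitoring flow, can we apply the same mechanism in our own projects, to detect deadlocks on important threads in multithreaded code and to monitor ANRs on the main thread in real time? Yes. In the framework, the Watchdog's main job is to detect deadlocks or stalls in key system services such as ActivityManagerService. If something goes wrong, the Watchdog kills the process so that it restarts, which lets important system services recover from this kind of problem. The Watchdog is effectively a last line of defense: it dumps diagnostic information in time and restores the process to a working state.
In an application, deadlock monitoring for important threads can follow the same principle as the Watchdog.
ANR and stall monitoring in an application can also borrow from the Watchdog, although the details differ slightly. An ANR is raised after about 5 s for an Activity (input dispatch), 10 s for a Broadcast and 20 s for a Service, but all four component types run on the main thread. So, like the Watchdog, the monitor can wait 30 s and then start a check: use an mCompleted flag to detect whether the task posted to the MessageQueue was blocked and did not run in time, and use mStartTime to compute how long the task took to run. That duration tells you whether other tasks in the MessageQueue are taking too long. If it exceeds 5 s, the queue contains a long-running task and there is a risk of an ANR, so the monitor should dump and save the thread stacks right away and report them to a backend for analysis. The time must be measured only while the device is awake. When the device is asleep the MessageQueue is paused anyway, and that is neither a deadlock nor a stall.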
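An application-level version of this idea might look like the following sketch. Everything here is hypothetical: StallDetector is not an Android API, and an Executor is used so that the code runs on a plain JVM. On Android one would presumably pass something like `new Handler(Looper.getMainLooper())::post` as the executor and `Looper.getMainLooper().getThread()` as the watched thread; that wiring is an assumption and is not shown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/** Hypothetical app-level stall detector; an Executor stands in for the main-thread Handler. */
class StallDetector {
    private final Executor watched;
    private final Thread watchedThread;
    private final long thresholdMillis;

    StallDetector(Executor watched, Thread watchedThread, long thresholdMillis) {
        this.watched = watched;
        this.watchedThread = watchedThread;
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Posts one probe and waits up to thresholdMillis for it to run.
     * Returns null if the queue responded in time (or the caller was interrupted),
     * otherwise the watched thread's current stack for reporting.
     */
    StackTraceElement[] checkOnce() {
        CountDownLatch ran = new CountDownLatch(1);
        watched.execute(ran::countDown);
        boolean respondedInTime;
        try {
            respondedInTime = ran.await(thresholdMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        return respondedInTime ? null : watchedThread.getStackTrace();
    }
}
```

A background thread would call checkOnce() in a loop with a threshold such as 5000 ms and upload any non-null stack trace. Unlike the Watchdog, this sketch never kills the process; it only reports.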
Article: How online ANR monitoring works on Android
Source: http://chinadenli.net/article46/ihjieg.html