Debugging a system reboot caused by a watchdog bite

Background

During a stress test, the ST team found a bug: the DUT reboots after running for a while, and the reboot reason is a watchdog bite.

[Thu Sep 26 09:21:59.734 2019] Watchdog bark! Now = 831425.568038
[Thu Sep 26 09:21:59.734 2019] Causing a watchdog bite!
[Thu Sep 26 09:21:59.734 2019] Configuring Watchdog Timer
[Thu Sep 26 09:21:59.734 2019] Wa

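However, the log alone does not reveal the root cause, that is, what caused the deadlock that kept the watchdog from being fed in time. To find out, we need to enable the relevant kernel debugging facilities to collect more information, and then analyze the crashdump and the console log in depth.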

Kernel customization

Enable ftrace

Ftrace is an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space.

In short, ftrace is the kernel's built-in tracer. According to the documentation provided by QCA, it can be enabled with the following steps.

  1. From "make menuconfig", go to --> "Global Build Settings" then go to --> "Compile kernel with tracing support"
  2. In the "Compile kernel with tracing support", enable the below options :-
    • Enable/disable function tracing dynamically
    • Trace process context switches and events
    • Function tracer
  3. Once above changes are done, then from the "make kernel_menuconfig", go to --> "Kernel hacking" --> then enable the below options in "Tracers":-
    • Kernel Function Tracer
    • enable/disable function tracing dynamically

As the steps show, the configuration is done in two parts: make menuconfig and make kernel_menuconfig.

make menuconfig

Configure menuconfig through its menu interface as described above. After saving, the corresponding .config file changes accordingly:

 # CONFIG_KERNEL_PROFILING is not set
 CONFIG_KERNEL_KALLSYMS=y
 # CONFIG_KERNEL_KALLSYMS_ALL is not set
-# CONFIG_KERNEL_FTRACE is not set
+CONFIG_KERNEL_FTRACE=y
+# CONFIG_KERNEL_FTRACE_SYSCALLS is not set
+CONFIG_KERNEL_ENABLE_DEFAULT_TRACERS=y
 # CONFIG_KERNEL_DEBUG_KMEMLEAK is not set
+CONFIG_KERNEL_FUNCTION_TRACER=y
+# CONFIG_KERNEL_FUNCTION_GRAPH_TRACER is not set
+CONFIG_KERNEL_DYNAMIC_FTRACE=y
+# CONFIG_KERNEL_FUNCTION_PROFILER is not set
 # CONFIG_KERNEL_IRQSOFF_TRACER is not set
 # CONFIG_KERNEL_PREEMPT_TRACER is not set
 CONFIG_KERNEL_DEBUG_KERNEL=y

However, in OpenWrt, running make kernel_menuconfig scrambles the existing configuration options, so the kernel configuration file has to be edited by hand:

@@ -3301,8 +3304,26 @@ CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
 CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
 CONFIG_HAVE_C_RECORDMCOUNT=y
 CONFIG_TRACING_SUPPORT=y
-# CONFIG_FTRACE is not set
-
+CONFIG_FTRACE=y
+CONFIG_FUNCTION_TRACER=y
+CONFIG_DYNAMIC_FTRACE=y
+# CONFIG_NET_DROP_MONITOR is not set
+# CONFIG_FUNCTION_GRAPH_TRACER is not set
+# CONFIG_IRQSOFF_TRACER is not set
+# CONFIG_PREEMPT_TRACER is not set
+# CONFIG_SCHED_TRACER is not set
+# CONFIG_FTRACE_SYSCALLS is not set
+# CONFIG_TRACER_SNAPSHOT is not set
+CONFIG_BRANCH_PROFILE_NONE=y
+# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
+# CONFIG_PROFILE_ALL_BRANCHES is not set
+# CONFIG_STACK_TRACER is not set
+# CONFIG_BLK_DEV_IO_TRACE is not set
+# CONFIG_FUNCTION_PROFILER is not set
+# CONFIG_FTRACE_STARTUP_TEST is not set
+# CONFIG_RING_BUFFER_STARTUP_TEST is not set
+# CONFIG_RING_BUFFER_BENCHMARK is not set
+#
 #
 # Runtime Testing
 #

The equivalent settings in the kernel_menuconfig interface are shown below:

[Figure: kernel_menuconfig ftrace]
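Since the kernel options are edited by hand, it is easy to miss one. The following is a minimal sketch of a check that could be run against the edited configuration file. The helper name check_opts is hypothetical and not part of OpenWrt or the QCA SDK; it only looks for lines of the form OPTION=y.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenWrt): report whether each option
# is set to "y" in the given config file.
# Usage: check_opts <config-file> <OPTION>...
check_opts() {
    cfg="$1"; shift
    missing=0
    for opt in "$@"; do
        if grep -q "^${opt}=y\$" "$cfg"; then
            echo "ok      $opt"
        else
            echo "MISSING $opt"
            missing=1
        fi
    done
    # Non-zero exit status if at least one option is not enabled.
    return "$missing"
}
```

For the ftrace options above, a call could look like `check_opts <kernel-config-file> CONFIG_FTRACE CONFIG_FUNCTION_TRACER CONFIG_DYNAMIC_FTRACE`, where the file is whichever kernel configuration file was edited.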

Enable lockup debugging

To get diagnostic output when a lockup occurs, enable the corresponding configuration options:

@@ -3249,9 +3249,9 @@ CONFIG_HAVE_DEBUG_KMEMLEAK=y
 # Debug Lockups and Hangs
 #
 CONFIG_LOCKUP_DETECTOR=y
-# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
+CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
 CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
-# CONFIG_DETECT_HUNG_TASK is not set
+CONFIG_DETECT_HUNG_TASK=y
 # CONFIG_PANIC_ON_OOPS is not set
 CONFIG_PANIC_ON_OOPS_VALUE=0
 CONFIG_PANIC_TIMEOUT=3
@@ -3259,20 +3259,21 @@ CONFIG_PANIC_TIMEOUT=3
 # CONFIG_SCHEDSTATS is not set
 # CONFIG_TIMER_STATS is not set
 # CONFIG_DEBUG_PREEMPT is not set
-
+CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
+CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
 #
 # Lock Debugging (spinlocks, mutexes, etc...)
 #
-# CONFIG_DEBUG_RT_MUTEXES is not set
-# CONFIG_RT_MUTEX_TESTER is not set
-# CONFIG_DEBUG_SPINLOCK is not set
-# CONFIG_DEBUG_MUTEXES is not set
-# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
-# CONFIG_DEBUG_LOCK_ALLOC is not set
-# CONFIG_PROVE_LOCKING is not set
-# CONFIG_LOCK_STAT is not set
-# CONFIG_DEBUG_ATOMIC_SLEEP is not set
-# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_RT_MUTEX_TESTER=y
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_LOCK_STAT=y
+CONFIG_DEBUG_ATOMIC_SLEEP=y
+CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
 # CONFIG_DEBUG_KOBJECT is not set
 CONFIG_DEBUG_BUGVERBOSE=y
 # CONFIG_DEBUG_WRITECOUNT is not set
@@ -3280,6 +3281,8 @@ CONFIG_DEBUG_BUGVERBOSE=y
 # CONFIG_DEBUG_SG is not set
 # CONFIG_DEBUG_NOTIFIERS is not set
 # CONFIG_DEBUG_CREDENTIALS is not set
+CONFIG_DEBUG_LOCKDEP=y
+# CONFIG_PROVE_RCU is not set

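Some seemingly unrelated options (such as CONFIG_PROVE_RCU) are explicitly marked "is not set". This is required by the dependencies of some of the lockup options: if they are not set explicitly, every build stops and asks whether to enable them. When the menu interface below is used instead, they do not have to be added by hand, because it resolves the dependencies automatically.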

[Figure: kernel_menuconfig lockup]

Remove unused kernel modules

With this much debugging enabled, loading the wireless modules at boot fails with an out-of-memory error, so the wireless modules do not work.

ath_dev: Copyright (c) 2001-2007 Atheros Communications, Inc, All Rights Reserved
ath_da_pci:  (Atheros/multi-bss)
DHCPv6 client is not running! Return
vmap allocation for size 1064960 failed: use vmalloc=<size> to increase size.
vmalloc: allocation failure: 1060629 bytes
insmod: page allocation failure: order:0, mode:0xd0
......
ath_dev: driver unloaded
ath_tx99: driver unloaded
ath_rate_atheros: driver unloaded
ath_hal: driver unloaded
ath_spectral: driver unloaded
ath_dfs: driver unloaded
phy for wifi device wifi0 not found
wifi0(qcawifi): enable failed
qcawifi: enable radio wifi1

To work around this, some unused kernel modules can be removed:

  1. Bluetooth driver
  2. SOUND driver
  3. IDE & SCSI driver
  4. Filesystems not used

The trimming can be done by editing the kernel configuration file target/linux/ipq806x/config_dni-3.14 directly.
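As an illustration of that kind of edit, here is a small sketch that rewrites an enabled option into its "is not set" form. The helper name disable_opt is hypothetical, and the option names in the usage note (CONFIG_BT, CONFIG_SOUND, CONFIG_IDE, CONFIG_SCSI) are the usual top-level kernel symbols for those subsystems, not a list taken from this project; which symbols can actually be dropped depends on the board.

```shell
#!/bin/sh
# Hypothetical helper: turn "CONFIG_FOO=y" (or "=m") into
# "# CONFIG_FOO is not set" in a kernel config file.
# Usage: disable_opt <config-file> <OPTION>...
disable_opt() {
    cfg="$1"; shift
    for opt in "$@"; do
        # Write to a temporary file first so the edit does not rely on "sed -i".
        sed "s/^${opt}=.*/# ${opt} is not set/" "$cfg" > "$cfg.tmp" &&
            mv "$cfg.tmp" "$cfg"
    done
}
```

A call such as `disable_opt target/linux/ipq806x/config_dni-3.14 CONFIG_BT CONFIG_SOUND CONFIG_IDE CONFIG_SCSI` only touches the named top-level symbols; options that depend on them are expected to be dropped when the kernel configuration is regenerated during the build.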

Add the kernel test module kmod-dead

To test watchdog issues, QCA provides a kernel test module that can be used to trigger a system crash manually.

The directory layout of kmod-dead is as follows:

kmod-dead
├── Makefile
└── src
    ├── dead.c
    ├── Kconfig
    └── Makefile

Makefile

The Makefile in the module's root directory is modeled on that of another kernel module (kmod-urlblock).

include $(TOPDIR)/rules.mk
include $(INCLUDE_DIR)/kernel.mk

PKG_RELEASE:=1
PKG_NAME:=kmod-dead

PKG_BUILD_DIR:=$(KERNEL_BUILD_DIR)/$(PKG_NAME)

include $(INCLUDE_DIR)/package.mk

define KernelPackage/dead
  SUBMENU:=Other modules
  TITLE:=kernel dead
  VERSION:=$(LINUX_VERSION)-$(BOARD)-$(PKG_RELEASE)
  FILES:= $(PKG_BUILD_DIR)/dead.$(LINUX_KMOD_SUFFIX)
# AUTOLOAD:=$(call AutoLoad,46,dead)
endef

define Build/Prepare
    mkdir -p $(PKG_BUILD_DIR)
    $(CP) ./src/* $(PKG_BUILD_DIR)
endef

define Build/Compile
    $(MAKE) -C "$(LINUX_DIR)" \
        CROSS_COMPILE="$(TARGET_CROSS)" \
        ARCH="$(LINUX_KARCH)" \
        SUBDIRS="$(PKG_BUILD_DIR)" \
        EXTRA_CFLAGS="$(BUILDFLAGS)" \
        modules
endef

define KernelPackage/dead/install
    $(INSTALL_DIR) $(1)/lib/network/
endef

$(eval $(call KernelPackage,dead))

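The AUTOLOAD line in the Makefile is commented out so that the module is not loaded automatically. Otherwise the system would crash right after booting and then reboot endlessly.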

src/dead.c

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(test_lock);

/* Take the spinlock and spin forever without ever releasing it. */
static int process_test(void *info)
{
    spin_lock(&test_lock);
    while (1)
        ;
    return 0;
}

static int __init init_3mb_alloc(void)
{
    struct task_struct *p1, *p2, *p3, *p4;
    printk("MODULE\tINITIALIZED\n");

    p1 = kthread_create(process_test, NULL, "TEST_CPU_1_THREAD");
    kthread_bind(p1,0);

    p2 = kthread_create(process_test, NULL, "TEST_CPU_2_THREAD");
    kthread_bind(p2,0);

    p3 = kthread_create(process_test, NULL, "TEST_CPU_3_THREAD");
    kthread_bind(p3,0);

    p4 = kthread_create(process_test, NULL, "TEST_CPU_4_THREAD");
    kthread_bind(p4,0);

    wake_up_process(p1);
    wake_up_process(p2);
    wake_up_process(p3);
    wake_up_process(p4);

    return 0;
}

static void __exit exit_3mb_alloc(void)
{
    printk("MODULE\tTERMINATED\n");
}

module_init(init_3mb_alloc);
module_exit(exit_3mb_alloc);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("QUALCOMM");

src/Makefile

# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version
# 2 of the License, or (at your option) any later version.

obj-m := dead.o

ifeq ($(MAKING_MODULES),1)

-include $(TOPDIR)/Rules.make
endif

src/Kconfig

config DEAD
    tristate "This is a Module_DEAD"
    default y
    help
        This is a dead module, for debugging kernel crash.
        If unsure, say N.

Print watchdog ping logs

Edit the following file so that a log line is printed whenever the watchdog performs a ping:

build_dir/target-arm_cortex-a7_uClibc-1.0.14_eabi/linux-ipq806x/linux-3.14.77/drivers/watchdog/qcom-wdt.c

static int qcom_wdt_ping(struct watchdog_device *wdd)
{
    struct qcom_wdt *wdt = to_qcom_wdt(wdd);

    printk("********************qcom-wdt-ping1*****************\n");
    writel(1, wdt->wdt_reset);
    printk("********************qcom-wdt-ping2*****************\n");
    return 0;
}

The printk lines above produce the log output. From this change, a kernel patch can be generated as follows:

File to modify: linux-3.14.77/drivers/watchdog/qcom-wdt.c
Patch directory: target/linux/ipq806x/patches_dni-3.14
Patch: 0101-print-watch-ping-log.patch

--- linux-3.14.77.orig/drivers/watchdog/qcom-wdt.c 2019-11-01 10:0:18.332348721 +0800
+++ linux-3.14.77/drivers/watchdog/qcom-wdt.c 2019-11-01 09:39:58.689575974 +0800
@@ -253,7 +253,9 @@
 {
     struct qcom_wdt *wdt = to_qcom_wdt(wdd);

+    printk("********************qcom-wdt-ping1*****************\n");
     writel(1, wdt->wdt_reset);
+    printk("********************qcom-wdt-ping2*****************\n");
     return 0;
 }

Enable DTS debug

Edit the DTS (Device Tree Source) file to enable init_debug:

File to modify: linux-3.14.77/arch/arm/boot/dts/qcom-ipq40xx-ap.dk04.1.dtsi
Patch directory: target/linux/ipq806x/patches_dni-3.14
Patch: 0102-add-init_debug=4-in-dts.patch

--- linux-3.14.77.orig/arch/arm/boot/dts/qcom-ipq4xx-ap.dk04.1.dtsi 2019-11-01 16:07:53.063327477 +0800
+++ linux-3.14.77/arch/arm/boot/dts/qcom-ipq4xx-ap.dk04.1.dtsi 2019-11-01 14:05:33.923495413 +0800
@@ -50,7 +50,7 @@
+       };
+
+       chosen {
+-              bootargs-append = " clk_ignore_unused user_debug=0xff";
++              bootargs-append = " clk_ignore_unused user_debug=0xff init_debug=4";
+       };
+
+ };

init_debug turns on debug logging. The patch above takes effect on the kernel command line, which can be checked in /proc/cmdline:

$ cat /proc/cmdline
 rootwait clk_ignore_unused user_debug=0xff init_debug=4
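To confirm on the device that the new bootargs were picked up, the parameter can be extracted from the command line instead of being read by eye. This is a minimal sketch; cmdline_get is a hypothetical helper name, and the optional second argument exists only so that the function can be tried on a saved copy of the command line.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key=value parameter from a
# kernel command line. The file defaults to /proc/cmdline.
# Usage: cmdline_get <key> [file]
cmdline_get() {
    key="$1"
    file="${2:-/proc/cmdline}"
    # One parameter per line, then keep only the value of the first match.
    value=$(tr ' ' '\n' < "$file" | sed -n "s/^${key}=//p" | head -n 1)
    # Succeeds only when the parameter is present with a non-empty value.
    [ -n "$value" ] && echo "$value"
}
```

With the command line shown above, `cmdline_get init_debug` prints 4, and the call fails when the parameter is absent.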

Adjust kernel_size & rootfs_size

Enabling this much debug information makes the kernel larger, so the kernel size has to be corrected.

Calculation

The kernel size can be read from the following file.

$ ls -l bin/ipq806x/openwrt-ipq806x-qcom-ipq40xx-ap.dkxx-fit-uImage.itb
-rw-r--r-- 1 guangtao.wu guangtao.wu 4364244 Oct 31 16:06 bin/ipq806x/openwrt-ipq806x-qcom-ipq40xx-ap.dkxx-fit-uImage.itb

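Because flash requires 128 KiB alignment, the kernel size has to be recalculated: divide the file size above (4364244) by 128 KiB and round up to the next whole number of blocks.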

4364244 / ( 128 * 1024 ) = 33.2965

# kernel size
34 * 128 * 1024 = 4456448 = 0x440000

# rootfs size
0x2800000 - 0x440000 = 0x23c0000
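The same arithmetic can be scripted so that it is easy to repeat whenever the image size changes. This is only a sketch of the calculation above; align_up is a hypothetical helper name, and the three input values are the ones used in this section.

```shell
#!/bin/sh
# Round a size up to the next multiple of the flash block size.
align_up() {
    size="$1"; block="$2"
    echo $(( (size + block - 1) / block * block ))
}

image_size=4364244            # size of the fit-uImage.itb file above
block=$((128 * 1024))         # 128 KiB flash alignment
firmware_size=$((0x2800000))  # total size shared by kernel and rootfs

kernel_size=$(align_up "$image_size" "$block")
rootfs_size=$((firmware_size - kernel_size))

printf 'kernel: %d (0x%x)\n' "$kernel_size" "$kernel_size"
printf 'rootfs: %d (0x%x)\n' "$rootfs_size" "$rootfs_size"
# prints:
#   kernel: 4456448 (0x440000)
#   rootfs: 37486592 (0x23c0000)
```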

Update dni_home config/defconfig-orbi

CONFIG_DGC_FW_KERNEL_SIZE="4456448"
CONFIG_DGC_FW_ROOTFS_SIZE="37486592"
CONFIG_DGC_FW_KERNEL_SIZE_CC="4456448"
CONFIG_DGC_FW_ROOTFS_SIZE_CC="37486592"

Update the kernel patch

patch: kernel/patches_dni/0012-fix-issue-after-disable-usb-rootfs-checksum-error.patch

      { 0x0a600000, 0x02800000, "firmware" },
 -    { 0x0a600000, 0x003c0000, "kernel" },
 -    { 0x0a9c0000, 0x02440000, "rootfs" },
-+    { 0x0a600000, 0x003c0000, "kernel" },
-+    { 0x0a9c0000, 0x02440000, "rootfs" },
++    { 0x0a600000, 0x00440000, "kernel" },
++    { 0x0aa40000, 0x023c0000, "rootfs" },
      { 0x0ce00000, 0x03200000, "reserved" },
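When the sizes change, the rootfs offset has to move with them, and a mismatch is easy to overlook in hex. The following sketch checks that kernel and rootfs exactly fill the firmware partition. check_layout is a hypothetical helper name, and its arguments are the offset and size pairs from the table above.

```shell
#!/bin/sh
# Check that kernel and rootfs exactly fill the firmware partition.
# Arguments: fw_base fw_size kernel_base kernel_size rootfs_base rootfs_size
check_layout() {
    fw_base=$(($1)); fw_size=$(($2))
    k_base=$(($3));  k_size=$(($4))
    r_base=$(($5));  r_size=$(($6))
    [ "$k_base" -eq "$fw_base" ] ||
        { echo "kernel does not start at the firmware base"; return 1; }
    [ "$r_base" -eq $((k_base + k_size)) ] ||
        { echo "rootfs does not start where the kernel ends"; return 1; }
    [ $((r_base + r_size)) -eq $((fw_base + fw_size)) ] ||
        { echo "rootfs does not end where the firmware ends"; return 1; }
    echo "layout ok"
}

# New values from the patch above.
check_layout 0x0a600000 0x02800000 0x0a600000 0x00440000 0x0aa40000 0x023c0000
```

If the kernel size is enlarged but the rootfs entry is left at its old offset 0x0a9c0000, the second check fails, which is the kind of mistake this is meant to catch.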

Watchdog commands

Disable the watchdog's hardware reset

devmem 0x0B017008 w 0x0

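Note: disabling the watchdog's hardware reset keeps the device from rebooting automatically after a kernel crash. Combined with ftrace and the other debug output, this makes the problem easier to analyze.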

Watchdog software control commands

# To query watchdog status
ubus call system watchdog

# To stop watchdog
ubus call system watchdog '{"stop": true}'

# To start watchdog
ubus call system watchdog '{"stop": false}'

# To configure watchdog timeout as 20 seconds (default is 30 seconds)
ubus call system watchdog '{"timeout": 20}'

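Note: if the watchdog hardware function has not been disabled and the watchdog is stopped through ubus, the system reboots automatically once the configured timeout (30 s by default) expires.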

Test script

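Besides the configuration changes, ftrace only takes effect after the files under /sys/kernel/debug/tracing/ have been set up accordingly. Below is the test script used when testing the trial FW; it collects the necessary debug information.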

#!/bin/sh
# use for watchdog cause reboot issue

#********************************************#
# watchdog settings
#********************************************#
# disable the APCS_KPSS watchdog
devmem 0x0B017008 w 0x0

# check watchdog status
ubus call system watchdog

# other related commands
#   ubus call system watchdog '{"stop": true}'
#   ubus call system watchdog '{"stop": false}'
#   ubus call system watchdog '{"timeout": 20}'

#********************************************#
# enable ftrace
#********************************************#
# To enable function tracing during run time
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on

# To enable specific events to be traced
echo tlet_entry >> /sys/kernel/debug/tracing/set_event
echo tlet_exit >> /sys/kernel/debug/tracing/set_event
echo irq_handler_entry >> /sys/kernel/debug/tracing/set_event
echo irq_handler_exit >> /sys/kernel/debug/tracing/set_event
echo softirq_entry >> /sys/kernel/debug/tracing/set_event
echo softirq_exit >> /sys/kernel/debug/tracing/set_event
echo timer_expire_entry >> /sys/kernel/debug/tracing/set_event
echo timer_expire_exit >> /sys/kernel/debug/tracing/set_event
echo sched_switch >> /sys/kernel/debug/tracing/set_event

# To Dump the FTrace to the console in case of oops
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops

# To increase the size of Ftrace buffer per CPU
echo 2048 > /sys/kernel/debug/tracing/buffer_size_kb

# To enable tracing of signals
echo SyS_reboot:dump > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/events/signal/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
sysctl -w kernel.ftrace_dump_on_oops=1


#********************************************#
# loop to cat memory info and process info
#********************************************#
while true
do
    free
    ps ww
    cat /proc/meminfo
    cat /proc/slabinfo
    ubus call system watchdog
    sleep 300
done
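Before leaving the device to run, it is worth checking that the tracing settings in the script actually took effect. The sketch below reads back the files the script writes to. show_tracing_state is a hypothetical helper name, and the optional directory argument is only there so that the function can be tried against a copy of the files.

```shell
#!/bin/sh
# Print the ftrace settings that the test script is expected to have changed.
# Usage: show_tracing_state [tracing-dir]
show_tracing_state() {
    dir="${1:-/sys/kernel/debug/tracing}"
    for f in current_tracer tracing_on buffer_size_kb set_event; do
        if [ -r "$dir/$f" ]; then
            # Join multi-line files (such as set_event) into a single line.
            printf '%s: %s\n' "$f" "$(tr '\n' ' ' < "$dir/$f" | sed 's/ *$//')"
        else
            printf '%s: (not readable)\n' "$f"
        fi
    done
}
```

If every file is reported as not readable, debugfs may not be mounted; on most systems it can be mounted with `mount -t debugfs none /sys/kernel/debug`.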

Update (2020.01.14)

A watchdog bite causing a system reboot showed up again in a new project, this time with a very high reproduction rate, so a new round of debugging began.

Steps to reproduce

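Starting from the reproduction steps provided by SQA and simplifying them over many test runs, the following steps reproduce the problem 100% of the time: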

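  1. Reset the router
  2. Reconfigure the router
  3. Enable the block site feature and add a keyword, for example yam
  4. Enable the Email feature and add a gmail address, so that a mail is sent automatically when a user visits a blocked site
  5. Visit a blocked site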

Steps 1 to 4 are only needed the first time. After reflashing the FW, only step 5 has to be repeated.

debug

enable lockup

While the situation was still unclear, the following two log lines alone were nowhere near enough to pinpoint the problem:

Causing a watchdog bite!
Configuring Watchdog Timer

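So the first assumption was that some processes had deadlocked, and the kernel's lockup debugging was enabled in the same way as described above.

With it enabled, a large amount of log output appeared during boot, but no lockup log showed up when the watchdog problem occurred, so we moved on to the next step.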

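The watchdog works roughly like this: procd periodically sends a ping to the kernel, and the kernel's watchdog driver, on receiving the ping, feeds the hardware watchdog.

To tell whether the problem is in user space or in kernel space, the ping has to be logged on both sides. The kernel patch is the same as the one above; procd needs one additional patch: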

+--- procd.old/watchdog.c    2020-01-13 16:03:08.671893208 +0800
++++ procd/watchdog.c    2020-01-13 16:06:32.966771640 +0800
+@@ -32,9 +33,26 @@
+ static int wdt_fd = -1;
+ static int wdt_frequency = 5;
+
++void __nprintf(const char *fmt, ...);
++/* use this '__nprintf' to print message */
++void __nprintf(const char *fmt, ...)
++{
++    va_list ap;
++    static FILE *filp;
++
++    if ((filp == NULL) && (filp = fopen("/dev/console", "a")) == NULL)
++            return;
++
++    va_start(ap, fmt);
++    vfprintf(filp, fmt, ap);
++    fputs("\n", filp);
++    va_end(ap);
++}
++
+ void watchdog_ping(void)
+ {
+     DEBUG(4, "Ping\n");
++    __nprintf("[%s:%d] Ping\n", __func__, __LINE__);
+     if (wdt_fd >= 0 && write(wdt_fd, "X", 1) < 0)
+         ERROR("WDT failed to write: %s\n", strerror(errno));
+ }

Reproducing the problem again gives the following console log:

124809:[Mon Jan 13 18:12:48.754 2020] [watchdog_ping:54] Ping
124810:[Mon Jan 13 18:12:48.754 2020] *************qcom-wdt-ping1*******************
124811:[Mon Jan 13 18:12:48.754 2020] **************qcom-wdt-ping2*******************
124814:[Mon Jan 13 18:12:52.811 2020] [watchdog_ping:54] Ping
124815:[Mon Jan 13 18:12:53.766 2020] *************qcom-wdt-ping1*******************
124816:[Mon Jan 13 18:12:53.766 2020] **************qcom-wdt-ping2*******************
124820:[Mon Jan 13 18:12:55.021 2020] cat: can't open '/tmp/gl_task_name': No such file or directory
124821:[Mon Jan 13 18:12:55.021 2020] cat: can't open '/var/log/block-site-messages': No such file or directory
124880:[Mon Jan 13 18:13:53.391 2020] Causing a watchdog bite!

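The log shows that after the block site page is refreshed, the console stops printing watchdog logs. Since the user-space ping from procd no longer appears, the problem is in user space, that is, in procd.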
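With a console log of more than a hundred thousand lines, the silence before the bite is easier to see as a number. The sketch below prints how many seconds passed between the last procd ping and the bite. wdt_gap is a hypothetical helper name, and the script assumes the timestamp format of the capture above, with the time of day in the fourth whitespace-separated field and no midnight rollover in between.

```shell
#!/bin/sh
# Print the time in seconds between the last user-space ping and the
# watchdog bite in a captured console log.
# Usage: wdt_gap <console-log>
wdt_gap() {
    awk '
        # Convert an HH:MM:SS.mmm time of day to seconds.
        function secs(t,    p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /watchdog_ping/           { last = secs($4); seen = 1 }
        /Causing a watchdog bite/ { if (seen) printf "%.1f\n", secs($4) - last }
    ' "$1"
}
```

For the excerpt above, the last procd ping is at 18:12:52.811 and the bite at 18:13:53.391, so the function prints 60.6.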

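When the block site page is refreshed, a kernel module sends a signal to procd (this key piece of information came from my team lead). Together with the two missing files shown in the log, this leads to the script that gets executed, send_email_alert. After receiving the signal, procd calls the script through system(), and the script in turn sends the mail through ssmtp.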

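To find out what went wrong while the script was running, I added a lot of logging and narrowed the problem down to ssmtp taking too long to run. That led to the final step.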

strace ssmtp

Modify the send_email_alert script to run ssmtp under strace and trace its execution.

[Tue Jan 14 16:08:44.046 2020] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
[Tue Jan 14 16:08:44.046 2020] connect(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("139.175.54.240")}, 16

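The process is stuck in connect(). The final analysis: the connection times out when reaching gmail, which hangs procd so that it can no longer send watchdog pings, and that triggers the reboot.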

Root Cause

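With the block site and send_email_alert features enabled, visiting a site blocked by the router makes a kernel module send a signal to procd. On receiving it, procd runs a mail-sending command through system(). When the recipient is gmail, the connection cannot be established for a long time, because Google is not directly reachable from networks in China. procd hangs in the meantime and cannot send watchdog pings, so the watchdog eventually times out and reboots the device.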

In short: the mail-sending timeout hangs procd.

Solution

The simplest fix is to run the command in the background when calling system():

-   system("/etc/email/send_email_alert");
+   system("/etc/email/send_email_alert &");
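The effect of the trailing & can be reproduced in a shell, because system() hands its argument to sh -c and waits for that shell to exit. The sketch below measures how long the caller is blocked in each case. blocked_for is a hypothetical helper name, and sleep stands in for a slow send_email_alert run; both are assumptions made for illustration.

```shell
#!/bin/sh
# Measure for how many whole seconds the caller is blocked by a command
# line that is run the same way system() runs it: through "sh -c".
blocked_for() {
    start=$(date +%s)
    sh -c "$1"
    end=$(date +%s)
    echo $((end - start))
}

# "sleep 2" stands in for a slow send_email_alert run (illustration only).
blocked_for 'sleep 2'                     # caller waits for the whole run
blocked_for 'sleep 2 >/dev/null 2>&1 &'   # caller continues almost at once
```

In the second call the shell exits as soon as it has started the job, which is why procd would get back to its event loop, and to its watchdog pings, without waiting for the mail to be sent. The output redirection is only there so that the background job does not hold the caller's output open in this demonstration.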

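There may well be better solutions, such as not letting the kernel module signal procd directly. After all, procd is the special process with PID 1 and is not a good place to handle lots of miscellaneous work. The kernel module could signal a dedicated email daemon instead, but that would require far more changes. For now, running the command in the background is simple and effective.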

OK, this issue is finally closed. Time to celebrate ✿✿ヽ(°▽°)ノ✿~

Summary

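  1. A watchdog exists to catch system anomalies. When it reboots the device, the cause is in most cases a deadlocked or hung process.
  2. For watchdog problems, first determine whether the fault is in user space or in kernel space, so that debugging can be targeted.
  3. Sometimes a log line that looks completely unrelated to the problem turns out to be the key breakthrough.
  4. Accurate reproduction steps make debugging far more efficient.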