OpenWrt swconfig stack trace analysis

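This follows up on my previous post, "Debugging a system reboot caused by a watchdog bite". After enabling the debugging options I started stress testing, and during the test the DUT printed the following error every 2 s: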

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 9465, name: swconfig
INFO: lockdep is turned off.
CPU: 2 PID: 9465 Comm: swconfig Tainted: P        W    3.14.77 #1
[<c021561c>] (unwind_backtrace) from [<c0211d44>] (show_stack+0x18/0x1c)
[<c0211d44>] (show_stack) from [<c062ea98>] (dump_stack+0x9c/0xd4)
[<c062ea98>] (dump_stack) from [<c06312d8>] (mutex_lock_nested+0x2c/0x450)
[<c06312d8>] (mutex_lock_nested) from [<c0499df8>] (swconfig_get_dev+0x70/0x88)
[<c0499df8>] (swconfig_get_dev) from [<c049a808>] (swconfig_list_attrs+0x20/0x20c)
[<c049a808>] (swconfig_list_attrs) from [<c054fde8>] (genl_rcv_msg+0x260/0x2e0)
[<c054fde8>] (genl_rcv_msg) from [<c054f2d0>] (netlink_rcv_skb+0x60/0xbc)
[<c054f2d0>] (netlink_rcv_skb) from [<c054fb74>] (genl_rcv+0x28/0x3c)
[<c054fb74>] (genl_rcv) from [<c054ec94>] (netlink_unicast+0x11c/0x1d0)
[<c054ec94>] (netlink_unicast) from [<c054f114>] (netlink_sendmsg+0x30c/0x368)
[<c054f114>] (netlink_sendmsg) from [<c050fb78>] (sock_sendmsg+0x78/0x8c)
[<c050fb78>] (sock_sendmsg) from [<c0511310>] (___sys_sendmsg.part.3+0x184/0x20c)
[<c0511310>] (___sys_sendmsg.part.3) from [<c0512340>] (__sys_sendmsg+0x54/0x78)
[<c0512340>] (__sys_sendmsg) from [<c020df40>] (ret_fast_syscall+0x0/0x50)

Problem analysis

The 2 s interval comes from the main loop of the detcable module, whose while loop runs the following code once every 2 s:

system("/sbin/swconfig dev switch0 show | grep \"link: port\" > /tmp/switch");

The first line of the log points to kernel/locking/mutex.c:616, which leads to the relevant code:

mutex.c:616

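That line is a might_sleep() call, since the "sleeping function called from invalid context" message reports the file and line of the might_sleep() call site. Searching the kernel source for might_sleep shows that it is defined in include/linux/kernel.h: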

kernel.h might_sleep

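As its comment explains, these stack traces mean that the swconfig process, once it has entered kernel mode, calls a function that may sleep in a context where sleeping is not allowed. The messages are printed only when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. That option was turned on together with the lockup debugging options, so disabling it would stop the output.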

The real cause, however, is the swconfig kernel driver. The relevant code is:

swconfig drivers

The critical section of a spinlock must not call any function that can sleep, and mutex_lock is exactly such a function. That is what triggers this stack trace.

mutex_lock — acquire the mutex
Lock the mutex exclusively for this task. If the mutex is not available right now, it will sleep until it can get it.
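The interaction between the two lock types can be modeled in user space. The following is a minimal sketch, not kernel code: every sim_* name is made up for illustration, the preemption counter is reduced to a single integer, and the lookup is assumed to take a per-device mutex while holding the list lock, as the trace (swconfig_get_dev calling mutex_lock_nested) suggests.

```c
#include <assert.h>
#include <stdio.h>

/* User-space model only; none of these sim_* names exist in the kernel. */
static int sim_preempt_count;   /* nonzero models "in atomic context" */
static int sim_warnings;        /* number of reports printed so far */

static int sim_in_atomic(void)
{
    return sim_preempt_count != 0;
}

/* Taking a spinlock disables preemption, modeled as a counter bump. */
static void sim_spin_lock(void)
{
    sim_preempt_count++;
}

static void sim_spin_unlock(void)
{
    assert(sim_preempt_count > 0);
    sim_preempt_count--;
}

/* Models the debug check: complain if a sleeping call is made while atomic. */
static void sim_might_sleep(const char *file, int line)
{
    if (!sim_in_atomic())
        return;
    sim_warnings++;
    fprintf(stderr,
            "BUG: sleeping function called from invalid context at %s:%d\n",
            file, line);
    fprintf(stderr, "in_atomic(): %d\n", sim_in_atomic());
}

/* A mutex may block, so it runs the check before trying to take the lock. */
static void sim_mutex_lock(void)
{
    sim_might_sleep(__FILE__, __LINE__);
    /* a real mutex could put the task to sleep here */
}

/*
 * Models a device lookup that takes a per-device mutex while holding the
 * list lock. Returns how many warnings this lookup produced.
 */
static int sim_lookup(int list_lock_is_spinlock)
{
    int before = sim_warnings;

    if (list_lock_is_spinlock)
        sim_spin_lock();        /* list lock is a spinlock: atomic from here */
    else
        sim_mutex_lock();       /* list lock is a mutex: sleeping is allowed */

    sim_mutex_lock();           /* per-device mutex taken under the list lock */

    if (list_lock_is_spinlock)
        sim_spin_unlock();

    return sim_warnings - before;
}
```

In this model sim_lookup(1) produces one warning, because the mutex is taken while the counter is nonzero, and sim_lookup(0) produces none.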

To fix this, change the lock from spin_lock to mutex_lock. My team lead found this solution on Google, so I am only passing along someone else's work here, haha.

index 78569a9..e8a6847 100644
--- a/target/linux/generic/files/drivers/net/phy/swconfig.c
+++ b/target/linux/generic/files/drivers/net/phy/swconfig.c
@@ -36,7 +36,7 @@ MODULE_LICENSE("GPL");

 static int swdev_id;
 static struct list_head swdevs;
-static DEFINE_SPINLOCK(swdevs_lock);
+static DEFINE_MUTEX(swdevs_lock);
 struct swconfig_callback;

 struct swconfig_callback {
@@ -296,13 +296,13 @@ static struct nla_policy link_policy[SWITCH_LINK_ATTR_MAX] = {
 static inline void
 swconfig_lock(void)
 {
-       spin_lock(&swdevs_lock);
+       mutex_lock(&swdevs_lock);
 }

 static inline void
 swconfig_unlock(void)
 {
-       spin_unlock(&swdevs_lock);
+       mutex_unlock(&swdevs_lock);
 }

 static struct switch_dev *

After applying the patch, the problem was completely resolved.
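The locking pattern after the patch can be sketched in user space with pthread mutexes. This is only an illustration under stated assumptions: demo_dev, demo_devs, demo_list_lock, demo_get_dev and demo_put_dev are hypothetical names, and the lookup body is a guess at the general shape (find a device under the list lock, then take its per-device lock), not the driver's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a switch device; not the driver's real struct. */
struct demo_dev {
    int id;
    pthread_mutex_t lock;       /* per-device lock, may block */
};

static struct demo_dev demo_devs[] = {
    { 0, PTHREAD_MUTEX_INITIALIZER },
    { 1, PTHREAD_MUTEX_INITIALIZER },
};

/* List lock as a mutex, mirroring DEFINE_MUTEX(swdevs_lock) in the patch. */
static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find a device by id and return it with its per-device lock held. */
static struct demo_dev *demo_get_dev(int id)
{
    struct demo_dev *found = NULL;
    size_t i;

    pthread_mutex_lock(&demo_list_lock);
    for (i = 0; i < sizeof(demo_devs) / sizeof(demo_devs[0]); i++) {
        if (demo_devs[i].id == id) {
            found = &demo_devs[i];
            break;
        }
    }
    if (found)
        pthread_mutex_lock(&found->lock);   /* blocking is fine under a mutex */
    pthread_mutex_unlock(&demo_list_lock);

    return found;
}

/* Release the per-device lock taken by demo_get_dev(). */
static void demo_put_dev(struct demo_dev *dev)
{
    assert(dev != NULL);
    pthread_mutex_unlock(&dev->lock);
}
```

Because both locks are allowed to sleep, taking the inner one while holding the outer one no longer happens in atomic context. The trade-off is that a mutex, unlike a spinlock, can only be taken from code that is itself allowed to sleep.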

References