使用 gdb 解析 ppp driver crash log

由于项目整合,经过一次大版本升级后的项目,继承了大家族中许多新的特性,然而在正常功能测试中崩溃了,而且还是kernel crash, 会导致reboot的那种。本文就此问题重现、调试分析过程予以归纳总结。

问题描述

PPTP 拨号上网模式下,Router 在添加特定静态路由后crash. 添加路由的原因在本文讨论内容中不重要,因此略过。

问题再现

实际测试过程中寻找重现方法耗时较长,经过大量实验后简化如下:

  1. 连接 WAN Port
  2. PPTP 拨号上网

拨号成功后,查看 WAN 口及 ppp0 的 IP.

root@model:$ ifconfig brwan
brwan     Link encap:Ethernet  HWaddr 00:13:2F:34:42:59
          inet addr:10.0.0.84  Bcast:10.156.23.255  Mask:255.255.254.0
          inet6 addr: fe80::203:7fff:fe94:229/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3547 errors:0 dropped:541 overruns:0 frame:0
          TX packets:174 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:354037 (345.7 KiB)  TX bytes:16378 (15.9 KiB)
root@model:$ ifconfig ppp0
ppp0      Link encap:Point-to-Point Protocol
          inet addr:192.168.0.234  P-t-P:192.168.0.1  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1436  Metric:1
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:3
          RX bytes:84 (84.0 B)  TX bytes:233 (233.0 B)

确认拨号成功,同时查看一下当前路由。

root@model:$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         0.0.0.0         U     0      0        0 ppp0
10.0.0.1        10.0.23.254     255.255.255.255 UGH   0      0        0 brwan
10.0.0.2        10.0.23.254     255.255.255.255 UGH   0      0        0 brwan
10.0.22.0       0.0.0.0         255.255.254.0   U     0      0        0 brwan
192.168.0.1     0.0.0.0         255.255.255.255 UH    0      0        0 ppp0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 br0
  1. 添加以下静态路由
route add -net 10.0.23.10 netmask 255.255.255.255 dev ppp0

再次查看路由,可以看到静态路由已经添加成功。

root@model:$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         0.0.0.0         U     0      0        0 ppp0
10.0.0.1        10.0.23.254     255.255.255.255 UGH   0      0        0 brwan
10.0.0.2        10.0.23.254     255.255.255.255 UGH   0      0        0 brwan
10.0.22.0       0.0.0.0         255.255.254.0   U     0      0        0 brwan
10.0.23.10      0.0.0.0         255.255.255.255 UH    0      0        0 ppp0
192.168.0.1     0.0.0.0         255.255.255.255 UH    0      0        0 ppp0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 br0

前面为什么说是特定路由呢?因为 10.0.23.10 正是 PPTP Server 的 IP 地址,不加该路由前,访问 PPTP Server 的数据包会从 brwan 出去,添加后将从 ppp0 出去,这是个关键点。

添加路由后,等待10秒左右,kernel 将会 crash, 并且会打印出以下 crash log, 这也是本文的研究对象。

INFO: rcu_preempt self-detected stall on CPU { 0}  (t=2100 jiffies g=4282090 c=4282089 q=40763)
CPU: 0 PID: 10879 Comm: pppd Tainted: P             3.14.77 #2
[<c021eb54>] (unwind_backtrace) from [<c021bc4c>] (show_stack+0x10/0x14)
[<c021bc4c>] (show_stack) from [<c03b2b08>] (dump_stack+0x78/0x98)
INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 1, t=2102 jiffies, g=4282090, c=4282089, q=40778)
Task dump for CPU 0:
pppd            R running      0 10879      1 0x00000002
[<c020fc4c>] (__schedule) from [<d922e000>] (0xd922e000)
[<c03b2b08>] (dump_stack) from [<c026e390>] (rcu_check_callbacks+0x230/0x6a0)
[<c026e390>] (rcu_check_callbacks) from [<c02391c0>] (update_process_times+0x38/0x58)
[<c02391c0>] (update_process_times) from [<c0276a58>] (tick_sched_timer+0x44/0x74)
[<c0276a58>] (tick_sched_timer) from [<c024a874>] (__run_hrtimer+0x50/0xc8)
[<c024a874>] (__run_hrtimer) from [<c024b0b4>] (hrtimer_interrupt+0x12c/0x288)
[<c024b0b4>] (hrtimer_interrupt) from [<c04e2828>] (arch_timer_handler_virt+0x28/0x30)
[<c04e2828>] (arch_timer_handler_virt) from [<c0268d28>] (handle_percpu_devid_irq+0x68/0x84)
[<c0268d28>] (handle_percpu_devid_irq) from [<c0265698>] (generic_handle_irq+0x20/0x30)
[<c0265698>] (generic_handle_irq) from [<c0218e74>] (handle_IRQ+0x68/0x90)
[<c0218e74>] (handle_IRQ) from [<c0208514>] (gic_handle_irq+0x3c/0x5c)
[<c0208514>] (gic_handle_irq) from [<c0209580>] (__irq_svc+0x40/0x70)
Exception stack(0xd922fd58 to 0xd922fda0)
fd40:                                                       de01cadc 00000000
fd60: 0000013c 0000013b d7e444c0 de01cae8 de01cadc 00000000 ffffffff ffffffff
fd80: 00000001 d7e44514 00000000 d922fda0 bf11d598 c0212a44 20000013 ffffffff
[<c0209580>] (__irq_svc) from [<c0212a44>] (_raw_spin_lock_bh+0x48/0x5c)
[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])
[<bf11d598>] (ppp_release_channels [ppp_generic]) from [<bf11f91c>] (ppp_input+0x590/0x658 [ppp_generic])
[<bf11f91c>] (ppp_input [ppp_generic]) from [<bf11fb4c>] (ppp_start_xmit+0x168/0x174 [ppp_generic])
[<bf11fb4c>] (ppp_start_xmit [ppp_generic]) from [<c05305dc>] (dev_hard_start_xmit+0x2d4/0x420)
[<c05305dc>] (dev_hard_start_xmit) from [<c0546d2c>] (sch_direct_xmit+0x6c/0x1b0)
[<c0546d2c>] (sch_direct_xmit) from [<c0530918>] (__dev_queue_xmit+0x1f0/0x418)
[<c0530918>] (__dev_queue_xmit) from [<c0571da0>] (ip_finish_output+0x43c/0x49c)
[<c0571da0>] (ip_finish_output) from [<bf064f7c>] (pptp_xmit+0x49c/0x51c [pptp])
[<bf064f7c>] (pptp_xmit [pptp]) from [<bf120ea4>] (ppp_ioctl+0xe20/0xe7c [ppp_generic])
[<bf120ea4>] (ppp_ioctl [ppp_generic]) from [<bf120ff0>] (ppp_write+0xe0/0x104 [ppp_generic])
[<bf120ff0>] (ppp_write [ppp_generic]) from [<c02c1698>] (vfs_write+0xcc/0x160)
[<c02c1698>] (vfs_write) from [<c02c1ba0>] (SyS_write+0x40/0x80)
[<c02c1ba0>] (SyS_write) from [<c0208c80>] (ret_fast_syscall+0x0/0x38)
BUG: soft lockup - CPU#1 stuck for 22s! [ifconfig:6333]

解析 crash log

针对以上crash log, 我们要找出 Root Cause,需要分析其中的关键信息,然后从代码中找到对应位置。

CPU: 0 PID: 10879 Comm: pppd Tainted: P             3.14.77 #2

首先,最简单是查看用户空间的信息,可以看到,Comm 后输出的是 pppd, 显然与 pppd 有关,pppd 作为用户空间的一个 Daemon, 用于处理 pptp, l2tp 以及 pppoe 相关功能,并且会与 kernel 中的 ppp driver 通信。

但是使用 strace pid-of-pppd 并不能很好的查找到问题所在,因为 crash 发生时,strace 也无法打印信息。

继续查看后面的log,下面这段log像是进程调度相关的信息,并不能直观看出是什么问题。

[<c020fc4c>] (__schedule) from [<d922e000>] (0xd922e000)
[<c03b2b08>] (dump_stack) from [<c026e390>] (rcu_check_callbacks+0x230/0x6a0)
[<c026e390>] (rcu_check_callbacks) from [<c02391c0>] (update_process_times+0x38/0x58)
[<c02391c0>] (update_process_times) from [<c0276a58>] (tick_sched_timer+0x44/0x74)
[<c0276a58>] (tick_sched_timer) from [<c024a874>] (__run_hrtimer+0x50/0xc8)
[<c024a874>] (__run_hrtimer) from [<c024b0b4>] (hrtimer_interrupt+0x12c/0x288)
[<c024b0b4>] (hrtimer_interrupt) from [<c04e2828>] (arch_timer_handler_virt+0x28/0x30)
[<c04e2828>] (arch_timer_handler_virt) from [<c0268d28>] (handle_percpu_devid_irq+0x68/0x84)
[<c0268d28>] (handle_percpu_devid_irq) from [<c0265698>] (generic_handle_irq+0x20/0x30)
[<c0265698>] (generic_handle_irq) from [<c0218e74>] (handle_IRQ+0x68/0x90)
[<c0218e74>] (handle_IRQ) from [<c0208514>] (gic_handle_irq+0x3c/0x5c)
[<c0208514>] (gic_handle_irq) from [<c0209580>] (__irq_svc+0x40/0x70)
Exception stack(0xd922fd58 to 0xd922fda0)

继续往后看,下面这部分出现了 ppp 相关的信息,其中 ppp_generic 正是 ppp driver 对应的内核模块 ppp_generic.ko , 而 pptp 对应的是内核模块 pptp.ko. 所以可以大致确定是 ppp driver 引发的 crash.

fd40:                                                       de01cadc 00000000
fd60: 0000013c 0000013b d7e444c0 de01cae8 de01cadc 00000000 ffffffff ffffffff
fd80: 00000001 d7e44514 00000000 d922fda0 bf11d598 c0212a44 20000013 ffffffff
[<c0209580>] (__irq_svc) from [<c0212a44>] (_raw_spin_lock_bh+0x48/0x5c)
[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])
[<bf11d598>] (ppp_release_channels [ppp_generic]) from [<bf11f91c>] (ppp_input+0x590/0x658 [ppp_generic])
[<bf11f91c>] (ppp_input [ppp_generic]) from [<bf11fb4c>] (ppp_start_xmit+0x168/0x174 [ppp_generic])
[<bf11fb4c>] (ppp_start_xmit [ppp_generic]) from [<c05305dc>] (dev_hard_start_xmit+0x2d4/0x420)
[<c05305dc>] (dev_hard_start_xmit) from [<c0546d2c>] (sch_direct_xmit+0x6c/0x1b0)
[<c0546d2c>] (sch_direct_xmit) from [<c0530918>] (__dev_queue_xmit+0x1f0/0x418)
[<c0530918>] (__dev_queue_xmit) from [<c0571da0>] (ip_finish_output+0x43c/0x49c)
[<c0571da0>] (ip_finish_output) from [<bf064f7c>] (pptp_xmit+0x49c/0x51c [pptp])
[<bf064f7c>] (pptp_xmit [pptp]) from [<bf120ea4>] (ppp_ioctl+0xe20/0xe7c [ppp_generic])
[<bf120ea4>] (ppp_ioctl [ppp_generic]) from [<bf120ff0>] (ppp_write+0xe0/0x104 [ppp_generic])
[<bf120ff0>] (ppp_write [ppp_generic]) from [<c02c1698>] (vfs_write+0xcc/0x160)
[<c02c1698>] (vfs_write) from [<c02c1ba0>] (SyS_write+0x40/0x80)
[<c02c1ba0>] (SyS_write) from [<c0208c80>] (ret_fast_syscall+0x0/0x38)

然后从其中的 _raw_spin_lock_bh 可以猜出可能是死锁(dead lock)问题.

[<c0212a44>] (_raw_spin_lock_bh) from [<bf11d598>] (ppp_release_channels+0xa4/0x5cc [ppp_generic])

但是具体怎么造成死锁的,以及和添加的静态路由有没有关系?继续 debug.

printk

根据以上 log 可知,ppp_generic 会通过 ppp_write 调用 ppp_ioctl 继而调用 pptp_xmit, 那么先来看看内核驱动代码中的 ppp_write.

  • drivers/net/ppp/ppp_generic.c
static ssize_t ppp_write(struct file *file, const char __user *buf,
                         size_t count, loff_t *ppos)
{
    struct ppp_file *pf = file->private_data;
    struct sk_buff *skb;
    ssize_t ret;

    if (!pf)
        return -ENXIO;
    ret = -ENOMEM;
    skb = alloc_skb(count + pf->hdrlen, GFP_KERNEL);
    if (!skb)
        goto out;
    skb_reserve(skb, pf->hdrlen);
    ret = -EFAULT;
    if (copy_from_user(skb_put(skb, count), buf, count)) {
        kfree_skb(skb);
        goto out;
    }

    skb_queue_tail(&pf->xq, skb);

    switch (pf->kind) {
    case INTERFACE:
        ppp_xmit_process(PF_TO_PPP(pf));
        break;
    case CHANNEL:
        ppp_channel_push(PF_TO_CHANNEL(pf));
        break;
    }

    ret = count;

 out:
    return ret;
}

这里并没有直接调用 ppp_ioctl, 但是可以知道发送数据需要通过 ppp_xmit_process 或者 ppp_channel_push 完成。

    switch (pf->kind) {
    case INTERFACE:
        printk("Using ppp_xmit_process...\n");
        ppp_xmit_process(PF_TO_PPP(pf));
        break;
    case CHANNEL:
        printk("Using ppp_channel_push...\n");
        ppp_channel_push(PF_TO_CHANNEL(pf));
        break;
    }

为此,使用 printk 打印 log 确定是经过 ppp_channel_push. 那么再来看看这个函数长啥样。

/*
 * Try to send data out on a channel.
 */
static void
ppp_channel_push(struct channel *pch)
{
    struct sk_buff *skb;
    struct ppp *ppp;

    spin_lock_bh(&pch->downl);
    if (pch->chan) {
        while (!skb_queue_empty(&pch->file.xq)) {
            skb = skb_dequeue(&pch->file.xq);
            if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
                /* put the packet back and try again later */
                skb_queue_head(&pch->file.xq, skb);
                break;
            }
        }
    } else {
        /* channel got deregistered */
        skb_queue_purge(&pch->file.xq);
    }
    spin_unlock_bh(&pch->downl);
    /* see if there is anything from the attached unit to be sent */
    if (skb_queue_empty(&pch->file.xq)) {
        read_lock_bh(&pch->upl);
        ppp = pch->ppp;
        if (ppp)
            ppp_xmit_process(ppp);
        read_unlock_bh(&pch->upl);
    }
}

注意这里有个加锁操作 spin_lock_bh(&pch->downl), 这里先不管。继续加printk,确定是从下面的语句访问的 pptp_xmit.

            if (!pch->chan->ops->start_xmit(pch->chan, skb)) {

按这个思路可以把执行顺序列出来。

  1. ppp_write [ppp_generic] <- ppp_channel_push [ppp_generic]
  2. ppp_channel_push [ppp_generic] <- start_xmit [ppp_generic]
  3. start_xmit [ppp_generic] <- pptp_xmit [pptp]
  4. pptp_xmit [pptp] <- ip_local_out []
  5. ip_local_out [] <- ...
  6. ... <- ppp_start_xmit [ppp_generic]

但是其中有部分不是 ppp driver 的code, 所以中间部分就没细究。最终是又回到 ppp_generic. 到这还是无法看出哪里出的问题。当时着急出release,又有了临时的workaroud,就把这个问题搁置了。

gdb

最后还是组长通过 gdb 破案了,不过注意需要使用 toolchain 里的 gdb.

  • gdb: ./staging_dir/toolchain-.../bin/arm-openwrt-linux-dgb
  • ppp_generic.ko: build_dir/target.../linux-ipq806x/linux-.../drivers/net/ppp/ppp_generic.ko

使用 gdb 中的 "list" 指令,简写 "l" 也可以,后面接地址、函数名或者行数

(gdb) list *Addr
(gdb) list FUNCTION
(gdb) list LINE

使用该指令,以 crash log 中给出的地址为参数,执行结果如下:

$ ./staging_dir/toolchain-.../bin/arm-openwrt-linux-dgb \
  build_dir/target.../linux-ipq806x/linux-.../drivers/net/ppp/ppp_generic.ko
(gdb) list *(ppp_write+0xe0)
0x3ff0 is in ppp_write (drivers/net/ppp/ppp_generic.c:568).
563             case INTERFACE:
564                     ppp_xmit_process(PF_TO_PPP(pf));
565                     break;
566             case CHANNEL:
567                     ppp_channel_push(PF_TO_CHANNEL(pf));
568                     break;
569             }
570
571             ret = count;
572
(gdb) list *(ppp_ioctl+0xe20)
0x3ea4 is in ppp_channel_push (drivers/net/ppp/ppp_generic.c:1811).
1806
1807            spin_lock_bh(&pch->downl);
1808            if (pch->chan) {
1809                    while (!skb_queue_empty(&pch->file.xq)) {
1810                            skb = skb_dequeue(&pch->file.xq);
1811                            if (!pch->chan->ops->start_xmit(pch->chan, skb)) {
1812                                    /* put the packet back and try again later */
1813                                    skb_queue_head(&pch->file.xq, skb);
1814                                    break;
1815                            
}
(gdb) list *(ppp_release_channels+0xa4)
0x598 is in ppp_push (drivers/net/ppp/ppp_generic.c:1540).
1535                    /* not doing multilink: send it down the first channel */
1536                    list = list->next;
1537                    pch = list_entry(list, struct channel, clist);
1538
1539                    spin_lock_bh(&pch->downl);
1540                    if (pch->chan) {
1541                            if (pch->chan->ops->start_xmit(pch->chan, skb))
1542                                    ppp->xmit_pending = NULL;
1543                    } else {
1544                            /* channel got unregistered */
(gdb)

使用 gdb 可以快速定位到出错代码,从上面的解析结果可以看出,在以下两个函数

  • ppp_channel_push
  • ppp_push

中都调用了 spin_lock_bh(&pch->downl) , ppp_channel_push 加锁后正在等待数据发送完成,而发送过程中又调用了 ppp_push ,但是该锁已经锁住了, ppp_push 只能一直等待,最终导致两处都在无限等待中从而陷入死锁。

ok,到此基本真相大白了,结合设定的静态路由来说。当用户向 PPTP server 请求数据时, 驱动会调用 ppp_channel_push 发送数据, 由于添加了静态路由,数据会走 ppp0, 于是又重新调用了 ppp_push, 导致形成了一个环。

当然这并不是linux kernel 的bug,因为所用kernel经过了 qualcomm 的修改,事实证明公板也存在这个问题,所以应该是某部分修改引入的。于是乎,甩锅~

总结

经过这次的 crash 调试,学习了 ppp driver 的部分代码,积累了处理此类 crash 问题的经验,进一步学习了 gdb 调试方法。

此外,

  1. crash log 指示的函数未必就是真正出现问题的函数,编译优化可能导致某些Symbols 乱序,具体以 gdb 结果为准;
  2. FW 中的 binary 通常不包含 Symbols, 所以使用 gdb 时需要对编译目录中的 binary 进行调试;
  3. 触发 crash 的是定时发送的 LCP echo request 数据包;
  4. PPTP 使用 GRE(Generic Routing Encapsulation) 封装 ppp0 数据包,经由以太网络接口 brwan 进行IP网络数据收发。

参考