Using gdb to Analyze Coredumps on OpenWrt

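I recently debugged a bug in which a daemon crashed after running a particular piece of code under specific conditions. To find the cause, I used several debugging tools and techniques, including printf, strace tracing, and gdb. This article explains in detail how to enable gdb on the OpenWrt embedded system and how to use it.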

Debugging background

Using strace, I first found that the process was killed by a SIGABRT signal, which it sent to itself via tgkill.

pipe([8, 9])                            = 0
fcntl64(8, F_GETFL)                     = 0 (flags O_RDONLY)
ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbe90a454) = -1 EINVAL (Invalid argument)
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(14189, 14189, SIGABRT)             = 0
--- SIGABRT (Aborted) @ 0 (0) ---
Process 14189 detached

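The output shows that the process was killed while performing a pipe-related operation, and a pipe call usually comes from popen. Running grep -rn popen over the source and comparing the hits with the strace output just before the failure narrowed down the likely locations. I then added printf logging and, from the point where the log stopped when the bug was reproduced, pinpointed the failing line in the source.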

// config.c
char * config_get(char *name) {
    FILE *fp;
    char cmd[128]={0};
    snprintf(cmd, sizeof(cmd)-1, "config get %s", name);
    fp = popen(cmd, "r");   // the process aborts here
    // ...
}

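At this point we only know where the program was when it aborted. The code at that location does not reveal the root cause, so we also need the state of the process at the moment of the crash, including the call stack inside the C library. This is where GDB comes in: we use it to analyze the coredump file that the system generates when the process crashes.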
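For orientation, the popen pattern where the abort surfaced can be sketched as below. This is only an illustrative sketch, not the original config_get: the helper name read_first_line and its return convention are assumptions made for the example. It shows the usual checks around popen, fgets, and pclose.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original source): run `cmd` through
 * popen() and copy the first line of its output into `out`.
 * Returns 0 on success, -1 on failure. */
int read_first_line(const char *cmd, char *out, size_t outlen)
{
    FILE *fp;

    if (cmd == NULL || out == NULL || outlen == 0)
        return -1;
    out[0] = '\0';

    /* popen() creates a pipe and a child process; it also allocates a
     * FILE stream, which is why malloc shows up beneath it in a backtrace. */
    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    if (fgets(out, (int)outlen, fp) != NULL)
        out[strcspn(out, "\n")] = '\0';   /* strip the trailing newline */

    /* pclose() waits for the child; it returns -1 on error. */
    return pclose(fp) == -1 ? -1 : 0;
}
```

Note that careful error handling at this call site would not have prevented the crash described here; as the later analysis shows, the heap was already damaged before popen was reached.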

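Building gdb and a binary with symbols

Before the analysis, the binary of the process being debugged must contain the symbols GDB needs. What are symbols? Roughly speaking, they are a mapping table embedded in the binary that records the variables, function names, line numbers, and similar information from the source code. See GDB - Debugging Symbols for details.

Configuring the build options

OpenWrt's build options are stored in the .config file. By default, OpenWrt enables neither gdb nor debug builds; we can select the options through make menuconfig or edit the configuration file by hand.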

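# .config
# Enable debugging; packages are then built with the symbols GDB needs
CONFIG_DEBUG=y
# Disable strip so that the symbols are not removed from the binaries
CONFIG_NO_STRIP=y
#CONFIG_USE_SSTRIP=y
# Enable the optional cross-toolchain settings; this is the master switch for building GDB
CONFIG_TOOLCHAINOPTS=y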

After configuring, rebuild the cross toolchain to obtain the gdb tool:

make toolchain/{compile,install} V=s

Building a single package

Following the official OpenWrt documentation, the command below adds debug symbols to a single package:

make package/traffic_meter/{clean,compile,install} V=99 CONFIG_DEBUG=y

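Full build

If debug is enabled globally and a full build is run, the image becomes too large (>300M) and the build fails. The failure is harmless, though: only the image file is missing, while the required packages and shared libraries are all built normally, so coredump analysis is not affected.

That said, a full build takes too long and is not recommended; building only the package that needs debugging is faster and more convenient.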

Obtaining the coredump

With a binary that carries symbols and the cross-compiled gdb in hand, what remains is to obtain the coredump file.

Configuring coredump parameters

$ sudo vi /etc/profile
# Append the following line to the end of the file to remove the coredump size limit
ulimit -c unlimited
$ source /etc/profile

# Set the naming pattern for coredump files
# %e - process name; %p - pid; %t - time
$ echo "core-%e-%p-%t" > /proc/sys/kernel/core_pattern

For the format specifiers available in coredump file names, see Core dump file.
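The ulimit setting above only affects shells that read /etc/profile, so a daemon launched by an init script may not inherit it. One possible alternative, sketched here as an assumption and not as something the original setup used, is to raise the limit from inside the process with setrlimit. The helper name enable_core_dumps is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise this process's soft core-file size limit
 * to its hard limit. Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* A process may always raise its soft limit up to the hard limit;
     * going beyond the hard limit would need extra privileges. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

A daemon would call this once, early in main, before the code path that is suspected to crash. If the hard limit is 0, the call succeeds but no core file is written, so the hard limit still has to be checked in the environment that starts the daemon.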

Reproducing the bug and collecting the coredump file

First reproduce the bug, then locate the coredump file and transfer it to the build server:

$ cd /
$ find . -name "core-*" |grep traffic_meter
./sbin/core-traffic_meter-14189-2895
$ cd sbin
$ tftp -pl core-traffic_meter-14189-2895 192.168.1.10

Debugging with GDB

Common commands

Here are a few commonly used gdb commands:

(gdb) help
(gdb) where
(gdb) bt    # backtrace
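(gdb) list  # [l] show the source code around the current frame
(gdb) up [num]  # move up 1 or num stack frames
(gdb) down [num]    # move down 1 or num stack frames
(gdb) print [variable]  # [p] print the value of a variable in the current frame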

A worked example

$ cd repo.git
$ cd build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x
$ ../../toolchain-arm_v7-a_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/gdb-linaro-7.2-2011.03-0/gdb/gdb sbin/traffic_meter ~/core-traffic_meter-14189-2895
GNU gdb (Linaro GDB) 7.2-2011.03-0
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=arm-openwrt-linux-uclibcgnueabi".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/sbin/traffic_meter...done.

warning: exec file is newer than core file.
[New Thread 14189]
Reading symbols from /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/libgcc_s.so.1...done.
Loaded symbols for /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/libgcc_s.so.1
Reading symbols from /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/libc.so.0...done.
Loaded symbols for /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/libc.so.0
Reading symbols from /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/ld-uClibc.so.0...done.
Loaded symbols for /home/litreily/R7500v2-Fortify.git/build_dir/target-arm_v7-a_uClibc-0.9.33.2_eabi/root-ipq806x/lib/ld-uClibc.so.0
Core was generated by `traffic_meter -w brwan -p ppp0 -m /dev/mtd15`.
Program terminated with signal 6, Aborted.
#0  0x402fb4fc in raise (sig=6) at libpthread/nptl/sysdeps/unix/sysv/linux/raise.c:67
67        int res = INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x402fb4fc in raise (sig=6) at libpthread/nptl/sysdeps/unix/sysv/linux/raise.c:67
#1  0x402f579c in abort () at libc/stdlib/abort.c:89
#2  0x402f5060 in __malloc_consolidate (av=0x4030b3e8) at libc/stdlib/malloc-standard/free.c:234
#3  __malloc_consolidate (av=0x4030b3e8) at libc/stdlib/malloc-standard/free.c:170
#4  0x402f4854 in malloc (bytes=<value optimized out>) at libc/stdlib/malloc-standard/malloc.c:908
#5  0x402d6250 in _stdio_fopen (fname_or_mode=<value optimized out>, mode=<value optimized out>, stream=0x8ca0e8, filedes=8) at libc/stdio/_fopen.c:177
#6  0x402d4fb4 in popen (command=0x8 <Address 0x8 out of bounds>, modes=0xfc08 "r") at libc/stdio/popen.c:83
#7  0x0000f488 in config_get (name=<value optimized out>) at config.c:11
#8  0x0000ed48 in get_bogus_time_region (ct=60744, st=0xbe9f8994, btr=0xbe9f896c) at util.c:184
#9  0x0000e8f0 in get_traffic_from_flash (tfm=0xbe9f8738, ct=2894) at spi_flash.c:912
#10 0x0000ccb0 in restart_traffic_counter (tfm=0xbe9f8738, ct=52400) at trafficmeter.c:976
#11 0x0000a1f4 in main (argc=<value optimized out>, argv=<value optimized out>) at trafficmeter.c:1798
(gdb)

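BackTrace (bt) output
#num memory_addr in function (arg1=val1, arg2=val2,...) at file.c:line
The number at the start of each bt line is the stack frame number: frame #0 is the innermost frame, that is, the function executing when the process died, and each higher number is its caller. The frame number is followed by a code address. In this dump the address ranges separate the shared-library frames (0x402..., uClibc) from the frames in the program itself (0x0000...); all of them are user-space frames.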
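To make the line format concrete, here is a small sketch that splits one bt line into its fields. The struct bt_frame and the function parse_bt_line are hypothetical names invented for this example; the parser handles only the full form shown above and rejects lines that lack an address or an "at file:line" part (such as frame #3 in the listing, or frames printed as ?? ()).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one backtrace line. */
struct bt_frame {
    int num;             /* frame number, 0 = innermost */
    unsigned long addr;  /* code address of the frame */
    char func[64];       /* function name */
    char file[128];      /* source file path */
    int line;            /* source line number */
};

/* Parse "#num addr in function (args) at file.c:line".
 * Returns 0 on success, -1 if the line does not have the full form. */
int parse_bt_line(const char *s, struct bt_frame *f)
{
    const char *at = NULL;
    const char *p;

    /* %lx accepts the optional 0x prefix that gdb prints. */
    if (sscanf(s, "#%d %lx in %63s", &f->num, &f->addr, f->func) != 3)
        return -1;

    /* Use the last " at " so that argument text cannot confuse the parse. */
    for (p = strstr(s, " at "); p != NULL; p = strstr(p + 1, " at "))
        at = p;
    if (at == NULL)
        return -1;

    if (sscanf(at, " at %127[^:]:%d", f->file, &f->line) != 2)
        return -1;
    return 0;
}
```

Applied to frame #1 of the listing, this yields num 1, addr 0x402f579c, func abort, file libc/stdlib/abort.c, and line 89.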

Note: because of compiler optimization, some variables are shown as value optimized out. To see their real values, compile with -O0 to disable optimization.

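The bt output shows that the process detected an error inside the dynamic memory allocation function malloc (in __malloc_consolidate) and called abort, which raised SIGABRT and terminated it. This points to a memory problem, most likely heap corruption caused by freeing the same memory more than once or by freeing through an invalid pointer.

Following this lead, I reviewed the code that allocates and frees memory. The culprit turned out to be a place that took a wrong pointer reference and later released it with free, while the memory it pointed to was reallocated and freed in several other places, leaving the heap in an unpredictable state.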
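One common defensive habit against this class of bug is to reset a pointer to NULL as soon as it is freed. The sketch below is a generic illustration, not the fix applied to traffic_meter; the macro FREE_AND_NULL and the struct buffer type are hypothetical names.

```c
#include <stdlib.h>

/* Free a pointer variable and reset it, so that a second release through
 * the same variable becomes free(NULL), which is defined to do nothing. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Hypothetical owner of a heap allocation. */
struct buffer {
    char *data;
    size_t len;
};

/* Release the buffer's storage; safe to call more than once. */
void buffer_reset(struct buffer *b)
{
    FREE_AND_NULL(b->data);
    b->len = 0;
}
```

This only protects releases that go through the same variable. If another pointer still holds a copy of the old address, as appears to have been the case in the bug described here, freeing through that copy still corrupts the heap, so ownership of each allocation has to be clear as well.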

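Notes

When analyzing a coredump, keep the following in mind:

  1. The cross-compiled GDB executable is located at build_dir/toolchain-arm_v7-a_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/gdb-linaro-7.2-2011.03-0/gdb/gdb
  2. The working directory should preferably be root-ipq806x, the root filesystem directory produced by the build. Otherwise GDB may fail to locate the shared libraries, and therefore the symbols of the library functions, in which case you may see the following: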
warning: exec file is newer than core file.
[New Thread 14189]

warning: Could not load shared library symbols for 3 libraries, e.g. /lib/libgcc_s.so.1.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
Core was generated by `traffic_meter -w brwan -p ppp0 -m /dev/mtd15`.
Program terminated with signal 6, Aborted.
#0  0x402fb4fc in ?? ()
Setting up the environment for debugging gdb.
Function "internal_error" not defined.
Make breakpoint pending on future shared library load? (y or [n]) [answered N; input not from terminal]
Function "info_command" not defined.
Make breakpoint pending on future shared library load? (y or [n]) [answered N; input not from terminal]
.gdbinit:8: Error in sourced command file:
Argument required (one or more breakpoint numbers).
(gdb) info sharedlibrary
From        To          Syms Read   Shared Object Library
                        No          /lib/libgcc_s.so.1
                        No          /lib/libc.so.0
                        No          /lib/ld-uClibc.so.0
(gdb) bt
#0  0x402fb4fc in ?? ()
#1  0x402f579c in ?? ()
#2  0x402f579c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

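In that case we get no useful information. As the warning suggests, you can set the library path or the root directory manually with set solib-search-path or set sysroot. Even so, I recommend changing to the root directory before debugging.

References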

  1. Linux coredump troubleshooting workflow
  2. GNU Debugger
  3. GDB - Debugging Symbols
  4. Core dump file