I recently had to debug a memory corruption issue, and used my new favorite tool for the job. rr is a time-traveling debugger that lets you step backward through your program to find the source of the bug.
rr runs on Linux. Windows has a tool "WinDbg" with similar capabilities. I don't know of anything like this on Mac.
It all started with this panic:
$ zig build sdl-test
thread 656166 panic: reached unreachable code
src/gui.zig:473:59: 0x46c55b in textSizeRaw (sdl-test)
var utf8 = (std.unicode.Utf8View.init(text) catch unreachable).iterator();
^
src/gui.zig:452:41: 0x423043 in textSizeEx (sdl-test)
const s = sized_font.textSizeRaw(text, max_width_sized, end_idx);
^
src/gui.zig:439:31: 0x41c229 in textSize (sdl-test)
return self.textSizeEx(text, null, null);
^
src/gui.zig:5060:50: 0x3d636b in initNoFormat (sdl-test)
const size = options.font().textSize(self.label_str);
^
src/gui.zig:5088:38: 0x370872 in labelNoFormat (sdl-test)
var lw = LabelWidget.initNoFormat(src, id_extra, str, opts);
^
src/gui.zig:3196:22: 0x37cc26 in windowHeader (sdl-test)
gui.labelNoFormat(@src(), 0, str, .{ .gravity = .center, .expand = .horizontal, .font_style = .heading });
^
src/gui.zig:6640:29: 0x4432f4 in dialogDisplay (sdl-test)
gui.windowHeader(title, "", &header_openflag);
^
src/gui.zig:2455:38: 0x3d9442 in dialogsShow (sdl-test)
if (dialog.display(dialog.id)) {
^
src/gui.zig:2464:25: 0x37187f in end (sdl-test)
self.dialogsShow();
^
/home/dvanderson/temp/gui/sdl-test.zig:321:35: 0x36a7c0 in main (sdl-test)
const end_micros = win.end();
^
/home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:614:37: 0x373301 in main (sdl-test)
const result = root.main() catch |err| {
Some quick printf-style debugging showed me a slice ([]u8) of text was mysteriously changing from okay to garbage after some unrelated code ran. Looks like memory corruption, so let's break out the big guns:
$ rr record zig-out/bin/sdl-test
rr needs /proc/sys/kernel/perf_event_paranoid <= 1, but it is 2.
Change it to 1, or use 'rr record -n' (slow).
Consider putting 'kernel.perf_event_paranoid = 1' in /etc/sysctl.d/10-rr.conf.
See 'man 8 sysctl', 'man 5 sysctl.d' (systemd systems)
and 'man 5 sysctl.conf' (non-systemd systems) for more details.
Oh right, you have to tweak a kernel thing to get good performance:
$ sudo sysctl -w kernel.perf_event_paranoid=1
$ rr record zig-out/bin/sdl-test
rr: Saving execution to trace directory `/home/dvanderson/.local/share/rr/sdl-test-2'.
thread 656934 panic: reached unreachable code
src/gui.zig:473:59: 0x46c55b in textSizeRaw (sdl-test)
var utf8 = (std.unicode.Utf8View.init(text) catch unreachable).iterator();
^
[same stacktrace as before]
Aborted
$ rr replay
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/dvanderson/.local/share/rr/sdl-test-2/mmap_hardlink_3_sdl-test...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:1588
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/61/ef896a699bb1c2e4e231642b2e1688b2f1a61e.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007eff55bb52b0 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)
We are now debugging a recording of our program's execution. This is not a live execution of the program. We can run this multiple times, go forward, go backward. Everything including memory locations will be the same each time.
First we'll run to the end:
(rr) cont
Continuing.
thread 656934 panic: reached unreachable code
src/gui.zig:473:59: 0x46c55b in textSizeRaw (sdl-test)
var utf8 = (std.unicode.Utf8View.init(text) catch unreachable).iterator();
^
src/gui.zig:452:41: 0x423043 in textSizeEx (sdl-test)
const s = sized_font.textSizeRaw(text, max_width_sized, end_idx);
^
src/gui.zig:439:31: 0x41c229 in textSize (sdl-test)
return self.textSizeEx(text, null, null);
^
src/gui.zig:5060:50: 0x3d636b in initNoFormat (sdl-test)
const size = options.font().textSize(self.label_str);
^
src/gui.zig:5088:38: 0x370872 in labelNoFormat (sdl-test)
var lw = LabelWidget.initNoFormat(src, id_extra, str, opts);
^
src/gui.zig:3196:22: 0x37cc26 in windowHeader (sdl-test)
gui.labelNoFormat(@src(), 0, str, .{ .gravity = .center, .expand = .horizontal, .font_style = .heading });
^
src/gui.zig:6640:29: 0x4432f4 in dialogDisplay (sdl-test)
gui.windowHeader(title, "", &header_openflag);
^
src/gui.zig:2455:38: 0x3d9442 in dialogsShow (sdl-test)
if (dialog.display(dialog.id)) {
^
src/gui.zig:2464:25: 0x37187f in end (sdl-test)
self.dialogsShow();
^
/home/dvanderson/temp/gui/sdl-test.zig:321:35: 0x36a7c0 in main (sdl-test)
const end_micros = win.end();
^
/home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:614:37: 0x373301 in main (sdl-test)
const result = root.main() catch |err| {
^
[New Thread 656934.656935]
Thread 1 received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=139635116586816) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(rr)
Now we'll search up through the stack to identify the variable that got overwritten. In gdb (which rr enhances), pressing enter with an empty command will repeat the last command.
(rr) up
#1 __pthread_kill_internal (signo=6, threadid=139635116586816) at ./nptl/pthread_kill.c:78
78 in ./nptl/pthread_kill.c
(rr)
#2 __GI___pthread_kill (threadid=139635116586816, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
89 in ./nptl/pthread_kill.c
(rr)
#3 0x00007eff558a0476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26 ../sysdeps/posix/raise.c: No such file or directory.
(rr)
#4 0x00007eff558867f3 in __GI_abort () at ./stdlib/abort.c:79
79 ./stdlib/abort.c: No such file or directory.
(rr)
#5 0x00000000003fa0c9 in os.abort () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/os.zig:560
560 system.abort();
(rr)
#6 0x0000000000373c7a in debug.panicImpl () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/debug.zig:384
384 os.abort();
(rr)
#7 0x0000000000368524 in builtin.default_panic () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/builtin.zig:840
840 std.debug.panicImpl(error_return_trace, first_trace_addr, msg);
(rr)
#8 0x000000000046c55c in src.gui.Font.textSizeRaw () at src/gui.zig:473
473 var utf8 = (std.unicode.Utf8View.init(text) catch unreachable).iterator();
(rr)
#9 0x0000000000423044 in src.gui.Font.textSizeEx () at src/gui.zig:452
452 const s = sized_font.textSizeRaw(text, max_width_sized, end_idx);
(rr) p text
$1 = {ptr = 0x7eff55757b34 '\252' <repeats 200 times>..., len = 140727813082528}
We found the problem memory. Here's where the magic happens. We put a watch on that memory location so debugging will break on any change, and reverse-continue back through the program. For some reason I had to do "rc" (reverse-continue) twice. Then "bt" to see the backtrace for where it changed:
(rr) watch *0x7eff55757b34
Hardware watchpoint 1: *0x7eff55757b34
(rr) rc
Continuing.
Thread 1 received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=139635116586816) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(rr) rc
Continuing.
Thread 1 hit Hardware watchpoint 1: *0x7eff55757b34
Old value = -1431655766
New value = 1835626049
0x0000000000a79a67 in memset ()
(rr) bt
#0 0x0000000000a79a67 in memset ()
#1 0x000000000051cfe9 in mem.Allocator.reallocAdvancedWithRetAddr__anon_39189 ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/mem/Allocator.zig:419
#2 0x00000000004fa70d in mem.Allocator.reallocAtLeast__anon_34276 ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/mem/Allocator.zig:356
#3 0x00000000004b6a4b in array_list.ArrayListAligned(u8,null).ensureTotalCapacityPrecise ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/array_list.zig:353
#4 0x0000000000471c34 in array_list.ArrayListAligned(u8,null).ensureTotalCapacity ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/array_list.zig:338
#5 0x000000000046606c in array_list.ArrayListAligned(u8,null).ensureUnusedCapacity ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/array_list.zig:364
#6 0x000000000041a6d0 in array_list.ArrayListAligned(u8,null).appendSlice ()
at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/array_list.zig:209
#7 0x00000000003d79c3 in src.gui.dataSet__anon_21677 () at src/gui.zig:1510
#8 0x00000000003d3801 in src.gui.FloatingWindowWidget.saveRect () at src/gui.zig:2952
#9 0x0000000000443056 in src.gui.examples.AnimatingDialog.dialogDisplay () at src/gui.zig:6607
#10 0x00000000003d9443 in src.gui.Window.dialogsShow () at src/gui.zig:2455
#11 0x0000000000371880 in src.gui.Window.end () at src/gui.zig:2464
#12 0x000000000036a7c1 in sdl-test.main () at sdl-test.zig:321
#13 0x0000000000373302 in start.callMain () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:614
#14 start.initEventLoopAndCallMain () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:548
#15 start.callMainWithArgs () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:498
#16 main () at /home/dvanderson/apps/zig-linux-x86_64-0.11.0-dev.178+27e63bb59/lib/std/start.zig:513
(rr)
And we've identified the culprit. In this case the slice pointed to memory inside an ArrayList. Some other code added items to the ArrayList, which caused it to realloc to get more memory. This invalidates all previous memory. Whoops!
The solution in this case was to copy the slice when retrieving it from the ArrayList.
Using rr this way has saved me lots of unhappy hours tracking down memory corruption, and I hope it helps you too!
Top comments (2)
Didn't know about rr. Thanks!
Neat article, rr is a really awesome tool.