Victory! I've just hunted down a bug that has haunted me, and had me baffled, for a couple of weeks.
I often enjoy bug hunting war stories, so I figured I'd write this one down, while running a full build and regression test cycle with the fix.
Context: I've been porting GCC and binutils for the Libre-SOC project. We're designing and building a PowerPC processor with various extensions and lots more registers to make it efficient as a CPU, GPU, VPU, APU... I'm calling it hapPU, if you get one.
Adding the hundreds of new registers required renumbering some preexisting registers in GCC's internal register file. Since several parts of GCC used numeric literals instead of symbolic names to refer to certain registers, one of my first tasks was to hunt those down and adjust them for the new numbering. Some of that amounted to grepping for suspicious constants, some of that was only caught with regression testing.
Eventually, I was down to a handful of stack-check fails in the Ada testsuite. The tests and the failures were similar: create a task to run a subprogram that recursed infinitely, to make sure that the stack overflow was detected, handled, turned into an Ada exception out of the signal handler, caught by the task subsystem and made available for the task initiator. All really simple stuff, right?
At first I thought I'd made some mistake while adjusting rs6000_dbx_register_number, the function that maps GCC's internal notion of register numbers to the machine ABI-defined numbering that goes in debug information, and in call frame information used for backtraces, debugging, and exception handling. Much staring and head scratching later, nothing jumped out at me, so I jumped onto a debugger, figuring that if C++ exceptions worked, if other Ada tests exercising exceptions worked, and only Ada tests that raised exceptions from within signal handlers failed, that's where the problem had to be.
(Ada is a great language for testing GCC. It exercises various corners of GCC that other languages don't, and has various strict runtime requirements that other languages don't have, some of which involve catching such execution errors as null pointer dereferences, divides by zero, and stack overflows, usually delivered as signals, and turning them into exceptions that get raised from within signal handlers, and caught by exception handlers written by users, or installed by the runtime. AFAIK Ada is the only language supported by GCC that does this, so that pointed at problems with unwinding across signal frames, which only Ada tests did.)
So I started single-stepping in the optimized stack unwinder, trying to find something wrong, particularly as it went over the signal handling frame. I was reasonably familiar with the stack unwinder code in libgcc. I'd even tracked down and worked around a kernel bug that broke unwinding across signal handlers, not long before, just on a different target architecture and operating system. But now the system was GNU/Linux, and I didn't know much about unwinding signal frames on GNU/Linux, in general or on PowerPC, so I had a bit of learning to do.
Much debugging later, I couldn't see anything wrong. I ended up building another toolchain with pristine sources, built the test program with it, and started comparing its behavior with that of the failing program. They both seemed to get past the signal frame without trouble, but the pristine test found a handler for the exception, whereas the failing program got to the end of the stack without finding it, and thus terminated.
Clearly it had to be some fallout from the unwinding of the signal frame, so I went back to that point to try to figure out how it worked. Just by looking at the code, I found in libgcc/unwind-dw2.c that uw_frame_state_for called MD_FALLBACK_FRAME_STATE_FOR, which resolved to ppc_fallback_frame_state in libgcc/config/rs6000/, and there were plenty of register numbers in there, including explicit uses of ARG_POINTER_REGNUM, that I'd renumbered, and of various other registers by number. Caught ya!, I thought.
But then, confusion set in. That file hadn't been updated after the last PowerPC register renumbering, so the register numbers there were all off, unless they referred to remapped numbers. But if that was the case, ARG_POINTER_REGNUM wouldn't be right. I tried changing some register numbers in there, just to see whether it made any difference, and it did. It made it worse, but it showed me I was getting close!
The corresponding fallback routines for AIX and Darwin also referenced ARG_POINTER_REGNUM, and a comment in the Darwin file explained the choice, to some extent, suggesting what was wanted at that point was really a GCC-internal register number, not an ABI-defined number. So it had to undergo mapping somewhere, and I couldn't figure out where. As if shooting at random, blinded in the dark, I restored the file to its pristine state, and tried changing only the uses of the arg pointer to... 291, the GCC-internal number it should have now; 99, its GCC-internal number prior to the renumbering; 67, the number the internal register mapped to... And IIRC I got some slightly different failure modes out of each. Confused and exhausted, I figured I wasn't getting anywhere with this random shooting in the dark, and had to seek some light. One of those numbers had to do, and since the file clearly used register numbers that matched those remapped for exception handling, I had to figure out why restoring the original number didn't work. Maybe the answer was elsewhere.
I compared the object files of the test in the pristine vs modified case, and there were no changes. I compared the runtime libraries, suspecting there might be some change induced by the register renumbering in the Ada runtime, but they were identical. Only libgcc changed and, indeed, relinking the failing test program with the pristine libgcc, it worked.
The differences in libgcc were limited to unwinding files, but they were too big to compare their object code manually. Much of that had to do with the renumbering. With the additional registers, data structures in the unwinder that had one entry per GCC-internal register were much bigger, and that caused plenty of differences in object code. Fortunately, there was a way to mask out that change. DWARF_FRAME_REGISTERS was not set in gcc/config/rs6000/rs6000.h, so libgcc was falling back to FIRST_PSEUDO_REGISTER, but by setting the missing macro, I could restore it to its original, pre-renumbering value, and then any codegen differences in the unwinder, that were presumably causing the problem, would become far more apparent.
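For readers who haven't met this kind of macro fallback, here is a minimal sketch in C of the pattern described above. It is not the libgcc source: struct frame_table and its helpers are hypothetical names made up for illustration, and the literal counts (425 after the renumbering, 111 before) are simply the figures reported further down in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the compiler's register count after the renumbering.
   425 is the element count this post reports; in GCC the value comes
   from the target headers, not from a literal like this.  */
#define FIRST_PSEUDO_REGISTER 425

/* A target header may pin the unwinder's register count.  Uncommenting
   this line models the change: 111 is the pre-renumbering count.  */
/* #define DWARF_FRAME_REGISTERS 111 */

/* The fallback: if the target says nothing, use every hard register.  */
#ifndef DWARF_FRAME_REGISTERS
#define DWARF_FRAME_REGISTERS FIRST_PSEUDO_REGISTER
#endif

/* Hypothetical per-frame table with one slot per register; the real
   unwinder structures are more involved.  */
struct frame_table {
  void *saved[DWARF_FRAME_REGISTERS];
};

/* Number of slots the table ends up with.  */
size_t frame_table_slots(void) {
  return sizeof(((struct frame_table *) 0)->saved) / sizeof(void *);
}

/* Bounds-checked access to one slot.  */
void **frame_table_slot(struct frame_table *t, size_t regno) {
  assert(regno < frame_table_slots());
  return &t->saved[regno];
}
```

With the define commented out, frame_table_slots() reports 425; uncommenting it shrinks every such table to 111 slots without touching any other code, which is what makes the macro a convenient switch for this kind of comparison.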
I returned to it about a week later. Turns out there weren't any differences. I figured maybe I'd failed to rebuild everything, and decided to start from a clean build.
Still no differences. The pristine test, and the newly-built test program that used the newly-built libgcc, were identical, and worked the same. WTF? I could use this as a work-around, since at least for now none of the SVP64 registers are call-saved, so the unwinder doesn't need to restore them. As if! If you're like me, you wouldn't be able to get your mind off of it until you understood what was going on. So I reversed the changes to rs6000.h and to the fallback function, and got back to debugging.
I tried to set a breakpoint in the fallback function, since we got there before for the signal frame, and... didn't we?!? The program just terminated with the unexpectedly escaped exception. Well, line numbers in debug info for the inlined fallback function seemed to be missing much of the function, so I tried a breakpoint at the caller, right after the test that determines that _Unwind_Find_FDE failed. Still no hits. Before the test, with a condition to stop only if it failed. Weird, still nothing. Disabling the breakpoint condition, I'd stop at each frame, so I did, until I got to the frame that would return to the signal handler, and then one more for the signal frame.
And, surprise, an FDE was found for it! That's how it works on modern GNU/Linux systems: the kernel itself offers a small set of system call wrappers, as well as the signal return trampoline with corresponding unwind information, in a virtual dynamic shared object (is that what VDSO stands for?) that GNU libc attempts to map into dynamically-linked programs. The fallback function was not used at all. (But wait, weren't there visible behavior changes when it was modified? Weird!, I shall look into that later.)
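If you want to see that kernel-provided object for yourself, the following is a small sketch, assuming a GNU/Linux system whose libc provides getauxval. The helper names vdso_base and has_elf_magic are mine, not part of any library. The kernel passes the vDSO's load address to the program in the auxiliary vector under AT_SYSINFO_EHDR, and what sits at that address is an ordinary ELF image.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Address at which the kernel mapped the vDSO into this process, or 0
   if the auxiliary vector has no AT_SYSINFO_EHDR entry.  */
uintptr_t vdso_base(void) {
  return (uintptr_t) getauxval(AT_SYSINFO_EHDR);
}

/* Nonzero if the four bytes at P are the ELF magic number.  */
int has_elf_magic(const void *p) {
  assert(p != NULL);
  return memcmp(p, "\177ELF", 4) == 0;
}
```

Calling has_elf_magic on a nonzero vdso_base() should succeed. Locating the unwind information inside that image takes further ELF parsing, which this sketch does not attempt.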
Hmm, maybe the FDE uses incompatible register numbers? No, only remapped register numbers appear there. Hmm, maybe the VDSO's FDE is skipped when register numbers or count are found to be different? No, it just got used with a different register count. Hmm, what if the problem is not the signal unwinding, after all? It looked like it had to be, but if it's provided by the kernel, it's the same in both working and failing programs...
Still puzzled, I figured I'd continue unwinding and see if I hit another surprise that made sense. Next frame is the infinite-recursion subprogram, at the early stack check. Next is at a recursive call, and so are the next... how many? Well, it doesn't matter. Unwind until the caller PC is different. And so we get to the end of the stack.
Hey, but that's not right. I recall seeing other task and thread functions at the top of the stack in past runs. Ok, start over, get to the signal frame, backtrace to count how many frames to skip, and... No, the backtrace ends at what I could tell (from call count passed as argument) was not even the first call of the function. Aha, there's something wrong with the stack. It's getting corrupted somehow. Maybe there is some register number that's wrong somewhere, and somehow some pointer with an early stack value causes something to be scribbled around the earliest stack frames.
Start the program one more time, and inspect the backtrace at the point the signal is delivered. All the 250+ recursive call frames are there, and then the GNAT task and GNU libc thread frames I recalled from other runs. So far so good. Step into the signal handler. Backtrace still good. Step a little further. Still good. Advance to the unwinder at the signal frame. Not good any more.
Something fishy for sure. Checking stack pointers near the corruption point, they progress regularly until the abrupt end. Back to the latest calls, the stack pointer is far away, as expected. Then, within the unwinder. Hey, that looks familiar. It's back at the other end, very close to the stack pointer of the earliest calls. Huh?!? Up the stack a little, back to the signal handler, and the stack progresses as expected, but it's now clear that the signal handler stack grew into the task's stack.
That made sense. Checking the task initialization code, starting from the code at the very earliest frame in the uncorrupted stack, I saw that an automatic array was set up as the alternate stack to handle stack overflow signals. I tracked down the size of that variable to a system-specific Ada specification file, and found that on GNU/Linux that was 16KiB.
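As a rough illustration of that setup, here is a sketch in C, not the GNAT runtime code. It uses the POSIX sigaltstack call to register a caller-supplied buffer as the alternate signal stack and to read the registration back. The function names and the ALT_STACK_SIZE macro are hypothetical; only the 16KiB figure comes from the paragraph above.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose sigaltstack under strict ISO C modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* 16 KiB, the size reported above for GNU/Linux.  */
#define ALT_STACK_SIZE (16 * 1024)

/* Register BUF as the alternate signal stack of the calling thread.
   Returns 0 on success, -1 on failure, like sigaltstack itself.  */
int install_alt_stack(void *buf, size_t size) {
  stack_t ss;
  assert(buf != NULL);
  ss.ss_sp = buf;
  ss.ss_size = size;
  ss.ss_flags = 0;
  return sigaltstack(&ss, NULL);
}

/* Size of the currently registered alternate stack, or 0 if none.  */
size_t current_alt_stack_size(void) {
  stack_t cur;
  if (sigaltstack(NULL, &cur) != 0 || (cur.ss_flags & SS_DISABLE))
    return 0;
  return cur.ss_size;
}
```

A handler only runs on this stack if it is installed with SA_ONSTACK. Nothing guards the buffer's edges, so when it is an automatic array, a handler that needs more than ALT_STACK_SIZE bytes silently runs into whatever lies next to it on the enclosing stack.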
Checking how much stack was used in the handler, I confirmed more than that had been used. Oopsie. Ok, that made sense. Some automatic objects in the unwinder had arrays indexed by register number that had grown from 111 to 425 elements. That required a lot more stack, and we just weren't getting enough. That's why lowering the register count to 111 with DWARF_FRAME_REGISTERS solved the problem.
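To put rough numbers on that, here is a back-of-the-envelope sketch in C. The 16-byte slot size is purely an assumption for illustration (say, a pointer-sized value plus a tag, padded, on a 64-bit target); the actual layout of the unwinder's objects is not taken from libgcc here, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes taken by one object that holds SLOT_BYTES per register.  */
size_t per_register_bytes(size_t nregs, size_t slot_bytes) {
  return nregs * slot_bytes;
}

/* How many such objects fit in a stack of STACK_BYTES.  */
size_t objects_that_fit(size_t stack_bytes, size_t nregs, size_t slot_bytes) {
  assert(nregs > 0 && slot_bytes > 0);
  return stack_bytes / per_register_bytes(nregs, slot_bytes);
}
```

Under that assumption, one such object takes 1776 bytes with 111 registers and 6800 bytes with 425, so a 16KiB alternate stack has room for nine of the former but only two of the latter, before counting anything else the handler and the unwinder put on the stack.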
Just to be sure, I doubled the alternate stack size for signal handling in the Ada runtime, and the error was gone. Reverted that, and put DWARF_FRAME_REGISTERS back in, now that I knew why that was a proper fix.
Almost ready to celebrate victory! But there's still a loose end. How come changing the register numbers in the unused fallback function caused behavior changes? Well, one possibility that comes to mind involves changes to the system. Not likely: it was still the same kernel (same debug sessions running), and the same libc. Maybe the rebuild from scratch fixed some inconsistency, by rebuilding some file that partial rebuilds hadn't got to. Unlikely, GCC uses automatic dependency tracking, and libgcc gets fully rebuilt.
Fortunately, there's another sensible explanation: unwinding the corrupted stack eventually got to the corrupted range, and a corrupted caller address in there would likely not have an FDE found for it. We'd call the fallback function at that point, and then it would hardly matter which register slot was used to extract the trapping PC from the presumed signal frame that wasn't really there, things would go wrong one way or another.
Alas, I don't recall exactly how my shooting in the dark hit cases in which e.g. frob_update_context (in the same file as the fallback function) would crash, or some assert would fail, or something else would go wrong, so I don't have much hope of duplicating those failures. But now that the GCC regstrap (bootstrap plus regression testing) completed successfully, just as I reach this point in this bug hunter's report, I might as well try to duplicate an activation of the fallback function.
Unfortunately, after reversing the DWARF_FRAME_REGISTERS change in the updated source tree, I couldn't get ppc_fallback_frame_state called, though there is code inlined from it that gets executed unconditionally in the caller, and that is used as the inlined entry point for breakpoint-setting purposes. I don't think that was what I experienced, though.
Aah, but one of the R_ macros at the top of the file is used elsewhere. At some early point in my session of shooting in the dark, while believing the numbers should be in the GCC internal register numbering space, and that they were only used in ppc_fallback_frame_state, I experimented with changing all of those macros to symbolic _REGNO or _REGNUM macros from GCC. This would have broken the use of R_LR in frob_update_context and, indeed, changing R_LR that way, I get one of the different behaviors I recall.
That's enough to explain the false lead I created for myself. Compounded with other changes, and possibly with partial rebuilds, it may even have brought about the other variants I've observed. But I'm leaving it at that, satisfied that it all makes some sense now, and the bug is gone. Phew! Took me "just" some significant chunks of 3 weekends.
Happy hacking, and so blong...