Possible memory corruption after stack overflow


#1

This is on the PicoCalc firmware, version 4.8f, compiled for a Pi Pico 2 (RP2350), RISC-V with -O3 optimization level (in case it matters - though I also was able to reproduce it with the default -Os).

I’m observing the following behavior, which is reproducible for me, though I suspect it may depend on details such as the workspace size, compiler flags, and exact configuration:

(save-image)

^ succeeds just fine.

Then:

(defun r () (+ 1 (r)))
(r)
Error: stack overflow

So far so good.

But now when I try to access the SD card again:

(save-image)

it either gets stuck (requiring a power cycle), or I get “problem saving to SD card”.

This suggests to me that something in the SD card support got corrupted when I filled up the stack?

Bumping STACKDIFF from 580 to something like 2000 bytes seems to avoid the issue, though even 1000 was not enough for me. So I wonder if there’s something else that’s going on in this case.

Curious, how do you normally determine the STACKDIFF values that are in the code?


#2

As far as I know, none of the processors that uLisp supports provide built-in stack overflow detection, so I’ve had to periodically check the position of the stack pointer against the position of the bottom of the stack, which is usually the top of the static variables. I do this at the start of eval(), after the comment:

// Enough space?

But since it’s not known how much additional stack space the execution of eval() will need, it’s a bit of a guess how much safety net to leave. This is what the #define STACKDIFF is for. Make STACKDIFF too large and you won’t be able to recurse as much as you’d like to. Make it too small and your static variables may get overwritten.

Curious, how do you normally determine the STACKDIFF values that are in the code?

I uncomment the line:

// Serial.println((uint32_t)sp - (uint32_t)&ENDSTACK); // Find best STACKDIFF value

and then run a recursive program like your example r. The last value printed before uLisp hangs is a good candidate for STACKDIFF.

Probably the only safe advice is to assume your workspace may be corrupted after a stack overflow error. Try doing (pprintall).

By the way, the above explanation applies to ARM platforms. The mechanism of detecting stack overflows is platform specific.


#3

Update: I’ve corrected the “Find best STACKDIFF value” line in the above post.


#4

Thanks for the details on how you go about finding a suitable STACKDIFF value. I’ll check what I get when I run this with my configuration later.

I’m surprised that there would be more than 1000 bytes needed in between two eval calls, but I’m probably missing something.

Maybe I’ll also implement an additional protection:

  1. Declare a special cookie value (something like “int cookie = 0x12345678;”) at the bottom of the dynamic variables.
  2. When a stack overflow error is caught in eval, check whether this value is still intact. If it is, we can call errorsym2 as usual to abort execution, and the workspace and the rest of the system are known to still be intact. If, on the other hand, the cookie value has changed, we know that the dynamic variable region has been at least partially corrupted by the stack, and we can halt the system instead, to make sure we don’t write corrupted data to the SD card in a subsequent (save-image) or run into other undefined behavior…

This could provide a nice additional safeguard beyond the current stack overflow check…

Plus, for additional overflow proofing, I’m thinking I should also put a stack overflow check into the gc’s markobject routine, since that is also recursive…


#5

Plus, for additional overflow proofing, I’m thinking I should also put a stack overflow check into the gc’s markobject routine, since that is also recursive…

For the record, I’ve never encountered any problem with garbage collection overflowing the stack.


#6

Looking at the Arduino RP2040 FS.h file that is used by SD.h, I see some std::shared_ptr usages.

I’m not very familiar with how the Arduino libraries manage memory - but does this mean that the filesystem library will perform heap allocations of some form? I’d assume that the heap would be located just above the dynamic memory region?

If so, that might explain why the SD card functions get corrupted even before the stack overflow protection trips, and why I need to use such a big STACKDIFF to not affect SD card functionality?


#7

A stack overflow normally only occurs because of a bug, or a deliberate attempt to generate it by an infinite recursion test such as the one in your first post. The aim of the STACKDIFF test is to allow you to at least recover your program, with (pprintall) for example.


#8

I see. I think I probably had the wrong expectations about the stack overflow detection in uLisp. If I’m reading you correctly, it isn’t necessarily intended to guarantee that the system is left in a fully intact state, but is meant as more of a “best effort” check to make the error condition clear to the user, as well as prevent data loss in some cases?

I would say that stack overflows are mainly a problem when using uLisp as an interactive development environment (such as one might do on something like the PicoCalc). I ran into this issue organically while writing some code, because I had made a mistake in the base case of a recursive function. I kept working on the problem for a bit, but later found that I couldn’t save my program to the SD card anymore. To be fair, pprint was still working, so I could have copied the program out over the serial port, or copied it to some other medium manually.

I think it matters much less when pre-writing software and then using uLisp as just a runtime for running it on a microcontroller.

I was also under the impression that there wouldn’t be any heap / dynamic memory allocation when running uLisp. But while uLisp itself doesn’t make any heap allocations at runtime, some of the libraries used on the more powerful controllers, such as the Pi Pico (or Pico 2), do still use the heap (the filesystem implementation for SD card support, and apparently also some of the GFX routines). At least that’s the only explanation I can think of that is consistent with what I’ve observed when I write close to (but not below) ENDSTACK.

Either way, I’m perfectly happy with just setting STACKDIFF to something generous (like 2 or 4 KB) for my personal use. This gives me the additional bit of safety, and the space left for the stack on the Pi Pico 2 that I’m using is way more than I’d ever need for typical programs anyway.

Happy to close this thread.

(fwiw, I think if we wanted to have a more general solution for this problem, we could check if there’s a way to limit the heap allocator used by the RP2040 Arduino SDK to a maximum heap size - then we could use that limit to have a well defined upper bound of the heap address space, and could determine ENDSTACK accordingly…)


#9

Also, to close the loop on this:

I did implement this, but it turned out that the system became unusable before this value would get overwritten, so it was ineffective. SD card support would break first, and then the screen output on the PicoCalc would freeze entirely, well before the cookie value was touched (the stack pointer sp was still quite far away from ENDSTACK at that point). This is what led me down the path of suspecting that there might be some heap memory above the dynamic variable memory region that’s being used for the SD card functionality, and apparently also for some screen drawing routines on the Pico…

PS: Some useful reads I found after writing this:


#10

If I’m reading you correctly, it isn’t necessarily intended to guarantee that the system is left in a fully intact state, but is meant as more of a “best effort” check to make the error condition clear to the user, as well as prevent data loss in some cases?

Yes, that’s a good summary.

Declare a special cookie value (something like “int cookie = 0x12345678;”) at the bottom of the dynamic variables.

I think I tried this too at one stage and also found that it was ineffective.

One idea might be to do the check as follows in eval(): allocate one byte on the heap, compare the address of that with the stack pointer, and then free the test byte. It might slow down execution a bit though.


#11

Yeah. Also, you’d rely on the specific behavior of the heap allocator to allocate the value at the top of the heap. Presumably, once a heap gets fragmented after a free, it will at some point re-use the freed space, and this check would no longer be reliable for determining where the heap ends.

There’s probably some way to set an explicit end of the heap at link or compile time (maybe something like __HeapLimit), which the heap would then be guaranteed to stay below. But I haven’t come across a documented interface for that yet.

Given I can so easily work around this issue by increasing STACKDIFF, I think I’ll leave it at that for the time being.


#12

Also, you’d rely on the specific behavior of the heap allocator to allocate the value at the top of the heap.

Yes, good point.

