Fixing a Devious Memory Corruption Bug in Sandbox

We’ve just merged a fix in Concept for a particularly tricky memory corruption bug in the U-Boot sandbox environment. This bug was difficult to track down, so we wanted to share the story of the investigation and the solution.

The Symptom: Mysterious Heap Corruption

The problem first appeared as random and hard-to-reproduce memory corruption in the malloc() heap. These are often the most frustrating bugs to solve, as the place where the program crashes is rarely the place where the error actually occurred. After many hours of debugging, the root cause was finally identified, and it was quite unexpected.

The Culprit: A Memory Map Collision 💥

It turns out that the PCI Enhanced Allocation (EA) driver, a feature used for testing within sandbox, was mapping its simulated device memory at physical address 0x100000 (1MB).

This might seem harmless, but there’s a problem. When using the vbe_abrec_os boot method in sandbox, the Linux kernel is also loaded and executed from that very same 1MB address.

The collision was triggered by a specific feature: measured boot. When it is enabled, the bootm_measure() function needs to read the kernel’s data to calculate its hash. To do this, it calls map_sysmem() on the kernel’s memory region at 1MB. However, because the PCI EA driver had claimed that address space, the call was “hijacked.” Instead of mapping and reading the RAM where the kernel was supposed to be, U-Boot was unknowingly reading from the simulated PCI device!

This led to the measurement being performed on incorrect data and, through a complex chain of events, corrupted the heap, causing crashes later on.
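To make the shadowing concrete, here is a minimal standalone model of the lookup order described above. It is a sketch, not the sandbox implementation: model_map(), struct dev_region, the 2MB RAM buffer and the 0x2000-byte device window are all hypothetical names and sizes chosen for illustration. The only thing it borrows from the story is the rule that a device window is consulted before RAM.

```c
#include <stddef.h>
#include <stdint.h>

#define RAM_SIZE	0x200000u	/* 2MB of pretend system RAM (assumption) */
#define KERNEL_ADDR	0x100000u	/* where the kernel is loaded, per the post */

/* Hypothetical stand-in for one emulated device window */
struct dev_region {
	uint32_t base;
	uint32_t size;
	uint8_t *mem;
};

static uint8_t ram[RAM_SIZE];
static uint8_t dev_mem[0x2000];	/* window size is an assumption */

/* Device windows are checked first, so they shadow any RAM beneath them */
static void *model_map(const struct dev_region *dev, uint32_t addr)
{
	if (addr >= dev->base && addr - dev->base < dev->size)
		return dev->mem + (addr - dev->base);
	if (addr < RAM_SIZE)
		return ram + addr;
	return NULL;
}

/* Returns 1 if a read of the kernel address is served from RAM, else 0 */
static int kernel_read_hits_ram(uint32_t dev_base)
{
	struct dev_region dev = { dev_base, sizeof(dev_mem), dev_mem };
	uint8_t *p = model_map(&dev, KERNEL_ADDR);

	return p == ram + KERNEL_ADDR;
}
```

In this model, kernel_read_hits_ram(0x100000) returns 0 because the device window sits on top of the kernel, while any window base away from the kernel address lets the read reach RAM.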

The Fix: A Multi-Pronged Approach

The solution was implemented in a series of five patches that not only fix the bug but also add safeguards to prevent similar issues in the future.

1. Move the PCI Space

The core of the fix was to move the PCI EA test space out of the way. Patch 2/5 relocates it from 0x100000 (1MB) to a much safer address at 0x20000000, well clear of any system RAM used by sandbox. This resolves the memory collision entirely.
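The effect of the move can be checked with a simple range-overlap test. The sketch below is illustrative only: the 256MB RAM size and the 0x2000-byte window size are assumptions made for the example, not values taken from the patches, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration only; neither size comes from the patches */
#define ASSUMED_RAM_SIZE	0x10000000ull	/* pretend RAM covers [0, 256MB) */
#define ASSUMED_WINDOW_SIZE	0x2000ull	/* pretend device window size */

/* Half-open ranges overlap when each one starts before the other ends */
static bool ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
	return a < b + blen && b < a + alen;
}

/* True if a device window at window_base would sit on top of RAM */
static bool window_hits_ram(uint64_t window_base)
{
	return ranges_overlap(0, ASSUMED_RAM_SIZE,
			      window_base, ASSUMED_WINDOW_SIZE);
}
```

Under these assumptions, window_hits_ram(0x100000) is true and window_hits_ram(0x20000000) is false, which is the property the relocation is after.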

2. Make Partial Maps a Fatal Error

A key clue during debugging was this log message:

    map_physmem: Warning: partial map at 100000, wanted 4, got 2000

This warning indicated that a driver asked to map a certain number of bytes but the underlying device mapping had a different size. While this might sometimes be harmless, in this case, it was a giant red flag that something was wrong.

Patch 5/5 promotes this warning to a fatal error. It now prints a “Fatal: partial map…” message and calls os_abort(). This makes the system fail loudly and immediately, which is much better than silently continuing with a corrupted state. This change will make similar bugs far easier to spot in the future.
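The fail-fast policy can be sketched in a few lines. This is not the patched U-Boot code: map_len_ok() and require_exact_map() are hypothetical helpers, the message layout after “partial map” is simply borrowed from the warning quoted earlier, and abort() stands in for os_abort().

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* A mapping is acceptable only if it is exactly the size requested */
static bool map_len_ok(unsigned long wanted, unsigned long got)
{
	return wanted == got;
}

/* Fail loudly instead of carrying on with a mapping of the wrong size */
static void require_exact_map(unsigned long addr, unsigned long wanted,
			      unsigned long got)
{
	if (map_len_ok(wanted, got))
		return;

	/* Message layout is an assumption modelled on the old warning */
	fprintf(stderr, "Fatal: partial map at %lx, wanted %lx, got %lx\n",
		addr, wanted, got);
	abort();	/* stand-in for os_abort() */
}
```

A call such as require_exact_map(0x100000, 4, 4) returns quietly, while any mismatch between the two sizes stops the program at the point where the bad mapping was made rather than at some later, unrelated crash.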

3. Clean up Mapping Sizes

With the new fatal error in place, we had to clean up a few callers that were requesting inexact mapping sizes. Patches 3/5 and 4/5 update the ACPI PMC driver and the PCI EA tests to request the exact size of the memory they intend to map, so that they no longer trigger the new error and the code is more precise.

4. A Quick Logging Fix

Finally, patch 1/5 is a small but helpful cleanup: the “wanted” and “got” values in the log message were printed in reverse order, and the patch swaps them to avoid any future confusion.

This was a great example of how a seemingly innocuous test configuration can have far-reaching and destructive consequences. By moving the test region and hardening our memory mapping code, we have made the sandbox environment more robust. Happy hacking!