Hardware Debugging is Hard

Motivation

I’m the core firmware developer in the Trusted-V Team, working on different parts of the low-level firmware stack like the RTOS, board bringup, and PAC/HAL crates for a couple of RISC-V MCU boards, using Rust as the primary language of choice for every layer.

Recently I was porting our Trusted-V RTOS to one of our client’s MCU boards (name withheld for privacy) and found that it worked without any issues under QEMU emulation but not on the real hardware. So I set out to debug and fix the issue.

Setup

I will keep the details about the MCU brief and generic: which debugging tools I had access to, how I was flashing the firmware, and so on.

QEMU: QEMU is the default target on which we run and test our RTOS before we move on to real MCU boards. QEMU supports RISC-V firmware, which made it an easy first RISC-V target.
FTDI FT2232H: The board has an onboard FTDI FT2232H chip whose two independent channels are configured for JTAG and UART.
DC Barrel Jack: This is used to power up the MCU board.
USB-C Cable: A Type-C USB cable is used for connecting the MCU to the PC for programming and debugging purposes.
OpenOCD and GDB: OpenOCD and GDB are used for flashing/loading the firmware onto the MCU and debugging.

Problem

We are writing a Real-Time Operating System (RTOS) from scratch in Rust, targeting both 32-bit and 64-bit RISC-V MCUs that could be used in IoT, industrial and consumer electronics. I won’t go into any details about this unreleased OS (maybe a topic for the future). One thing any OS needs, though, is bare-metal startup code that performs the necessary initialization for the RISC-V cores on the MCU and brings the board up to a safe state. After that we can run bare-metal applications or add support for our OS on top of it.

In the Rust embedded world, we have the riscv-rt crate which has startup code and a minimal runtime for RISC-V CPUs. We could make use of this because it has generic support for many types of RISC-V cores present in the market. We do have the option of writing the assembly code ourselves for board bringup, though.

Different MCUs have different requirements and do unique things at startup to support their unique features, so we decided to write our own startup assembly instead of depending on the riscv-rt crate. That said, the riscv-rt source code is a good starting point for understanding the basics of startup code specific to RISC-V cores.

Our RTOS was running fine under QEMU targeting the exact same machine type as our MCU hardware, printing necessary info over UART. But when run on real hardware, nothing was printing over the UART terminal.

Exploration

This MCU configures UART0 in its bootloader (the boot ROM image), so whenever we reset the MCU, a text banner about the company is printed by default. That told me the bootloader was successfully setting up UART0 for serial communication.

So an easy way to debug embedded firmware issues is to first get the UART working and then print something over it at every step (printf debugging). The UART register map gives us the base address, the status register and the transmit FIFO register, which is the minimum needed to write a small UART driver that sends text. The driver polls the status register until the TX_FULL bit is clear and then writes a byte to the transmit FIFO register. The exact details vary depending on which UART you are using (refer to the datasheet).

QEMU models the same UART, so the same driver (with a tiny variation) was used on both the MCU and QEMU.

A simple UART driver in Rust could look something like this:

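#[inline]
pub fn console_putc(byte: u8) {
    // Register offsets from the UART base; these are UART-specific (see the datasheet).
    const TX_OFFSET: usize = 8;
    const STATUS_OFFSET: usize = 4;
    const TX_FULL_MASK: u16 = 1 << 1; // TX_FULL flag: set while the transmit FIFO is full

    // SAFETY: UART0_BASE must be the address of the memory-mapped UART0 block,
    // which the bootloader has already configured.
    unsafe {
        let base = peripherals::UART0_BASE;
        let status = (base + STATUS_OFFSET) as *const u16;
        // Busy-wait until the transmit FIFO has room for one more byte.
        while core::ptr::read_volatile(status) & TX_FULL_MASK != 0 {
            core::hint::spin_loop();
        }
        core::ptr::write_volatile((base + TX_OFFSET) as *mut u8, byte);
    }
}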
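
For printf-style debugging it helps to have more than a single-byte primitive. The following is a minimal sketch, not our actual RTOS code, of how a byte sink such as console_putc could be wrapped in core::fmt::Write so that write! and writeln! work. The names Console and log_stage are hypothetical, and the byte sink is passed in as a closure so the same code can be exercised on a host machine.

```rust
use core::fmt::{self, Write};

/// Hypothetical wrapper: turns any byte sink (for example `console_putc`)
/// into a `core::fmt::Write` target.
pub struct Console<F: FnMut(u8)> {
    putc: F,
}

impl<F: FnMut(u8)> Console<F> {
    pub fn new(putc: F) -> Self {
        Console { putc }
    }
}

impl<F: FnMut(u8)> Write for Console<F> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for byte in s.bytes() {
            // Serial terminals usually expect a carriage return before a line feed.
            if byte == b'\n' {
                (self.putc)(b'\r');
            }
            (self.putc)(byte);
        }
        Ok(())
    }
}

/// Prints one boot-stage marker line: the stage name and an address in hex.
pub fn log_stage<F: FnMut(u8)>(con: &mut Console<F>, stage: &str, addr: usize) -> fmt::Result {
    writeln!(con, "[boot] {} @ {:#x}", stage, addr)
}
```

On the target, the closure would simply forward to the driver, for example Console::new(|b| console_putc(b)). This only works once the stack is set up, so the very first startup stages still need single characters printed from assembly.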

Firmware Execution models

There are different ways to flash or load the binary onto the MCU. Two common ones are listed below:

RAM Execution

In this model, the firmware is loaded to RAM, the program counter is set to the start address, and then we run it.
This mode is preferred during development: loading and debugging are faster because nothing depends on flash reads or writes.
Be aware that RAM is volatile memory, so the firmware is lost on every power cycle and has to be reloaded.
We can do this either with OpenOCD alone or with GDB on top of OpenOCD.
With OpenOCD alone, the command looks something like:

openocd -f <CFG_FILE> -c "init; halt; load_image <ELF_FILE>; reg pc <START_ADDR>; resume"

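With GDB, start OpenOCD in one terminal and attach GDB from another, something like:

openocd -f <CFG_FILE>
riscv64-unknown-elf-gdb <ELF_FILE> -ex "target remote :3333" -ex "load" -ex 'set $pc = <START_ADDR>' -ex "continue"

The single quotes around the set command keep the shell from expanding $pc before GDB sees it.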

FLASH/XIP Execution

In this model, the firmware is written to flash and executed directly from there, which works when the flash supports Execute In Place (XIP).
The firmware survives a power cycle because flash is non-volatile memory.
Hence this model is usually used when deploying the final firmware.
In OpenOCD terms, the command looks something like:

openocd -f <CFG_FILE> -c "init; halt; flash write_image erase <BIN_FILE> <START_ADDR>; reset run"

Both models rely on linker scripts to place the .text, .data, .bss and other required sections precisely in RAM or flash. The choice between them also depends on the size of the firmware and whether it fits completely into RAM or flash.

I was using RAM execution because loading my RTOS into flash took too long and setting it up correctly is kind of difficult.
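
The size consideration can be made concrete with a small host-side check. This is only a sketch under simplifying assumptions: the names Image, Model and pick_model are hypothetical, the section sizes would come from a tool such as size run on the ELF file, and stack and heap space are ignored.

```rust
/// Section sizes in bytes (hypothetical input, e.g. taken from the ELF file).
pub struct Image {
    pub text: usize,
    pub data: usize,
    pub bss: usize,
}

#[derive(Debug, PartialEq)]
pub enum Model {
    Ram,
    Xip,
    TooLarge,
}

/// Picks an execution model from the section sizes and the region lengths
/// that the linker script declares. Stack and heap are ignored for brevity.
pub fn pick_model(img: &Image, ram_len: usize, flash_len: usize) -> Model {
    if img.text + img.data + img.bss <= ram_len {
        // RAM execution: every section lives in RAM.
        Model::Ram
    } else if img.text + img.data <= flash_len && img.data + img.bss <= ram_len {
        // XIP: .text and the .data load image sit in flash,
        // while .data and .bss occupy RAM at run time.
        Model::Xip
    } else {
        Model::TooLarge
    }
}
```

Preferring RAM whenever the image fits mirrors the development workflow described above. A deployment build would pick XIP regardless of size.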

Okay, now coming back to my original problem, my RTOS had the basic things set up in the startup code which are necessary for RISC-V code execution, namely:

Setting the stack pointer and global pointer
Zeroing the .bss section
Copying the .data section from its LMA (Load Memory Address — FLASH/RAM) to its VMA (Virtual/Runtime Memory Address — RAM). Note that in the RAM Execution model the LMA and VMA are both in RAM, so this copy is effectively a no-op; it only does real work in the XIP/flash model.
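
The last two steps above can be sketched in Rust as follows. Our real startup routine does this in assembly before any Rust code runs, so treat this purely as an illustration. The function names are hypothetical, and on the target the pointers would come from linker-script symbols, which are not shown here.

```rust
use core::ptr;

/// Zeroes every word in `[start, end)`.
///
/// # Safety
/// The range must be valid, word-aligned and writable.
pub unsafe fn zero_bss(mut start: *mut u32, end: *mut u32) {
    while start < end {
        unsafe {
            ptr::write_volatile(start, 0);
            start = start.add(1);
        }
    }
}

/// Copies words from the load address `src` into `[dst, dst_end)`.
///
/// # Safety
/// Both ranges must be valid and word-aligned, and must not partially overlap.
pub unsafe fn copy_data(mut src: *const u32, mut dst: *mut u32, dst_end: *mut u32) {
    // With RAM execution the LMA equals the VMA, so there is nothing to move.
    if src == dst as *const u32 {
        return;
    }
    while dst < dst_end {
        unsafe {
            ptr::write_volatile(dst, ptr::read_volatile(src));
            src = src.add(1);
            dst = dst.add(1);
        }
    }
}
```

Both loops are plain word-at-a-time stores, which is the shape of startup code that runs fine on generic cores and under QEMU.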

When I ran the firmware, it got stuck while zeroing the .bss section.

So I started trying different possible solutions to hunt down this bug.

Hunting

Printing Characters over UART

As a first step, I used the UART to print a single character (from assembly) at every stage of the startup routine described above. Interestingly, all the characters were printed on the console and the RTOS ran without issues. But here’s the twist: just when I thought I had solved the issue, and that all I had needed was a power cycle to get the RTOS running, NO… as soon as I removed the placeholder characters, it got stuck in the same .bss loop again.

I had been running the code successfully for a month because my firmware had those placeholder characters present in the startup routine; they were never deleted. The issue resurfaced as soon as I removed them.

So this bug went unnoticed for a very long time without an actual fix.

Using C SDK examples instead of Rust RTOS

We had access to a C SDK with a few working examples for testing the board and its various peripherals. The SDK shipped quite a few prebuilt static archives and object files (.a / .o) for startup, printing and miscellaneous support, which were linked into all the example applications.

Because of this, the application size was over 200 KB even for a simple hello world example. So I dropped the prebuilt startup object and wrote my own startup routine in assembly, which set the stack pointer and then jumped to the main function.

Believe it or not, this compiled and ran on the real RISC-V hardware without issues. Strictly speaking, RISC-V hardware only needs a valid program counter to start executing — sp, gp, .bss/.data are all software/ABI conventions, not hardware preconditions. To run C safely you also set up a stack pointer (the calling convention uses it for saved registers and locals) before jumping to main. A trivial main that fits entirely in registers — like this one, which just polls the UART and writes a string — could even run without a valid sp, but you don’t want to rely on that the moment main grows a function call or a local that spills to the stack.

Reference assembly code:

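    .section .text.init   /* @ADDR = <START_ADDR> */
    .globl _start
    .align 2
_start:
    la    sp, _stack      /* _stack: stack-top symbol, expected from the linker script */
    call  main
1:  j     1b              /* park here if main ever returns */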

Now that I had basic example code working, I changed the startup routine to be similar to what I had in Rust, and ran the code — which triggered the same issue. This hinted that Rust was never the problem. I had thought maybe the Rust target was wrong, or the compiler/linker/optimizer had mangled the Rust code somehow. But this was not the case: even the C firmware hit the same issue.

Other Misc Tries

First, I thought a single UART write before clearing the .bss section might fix the issue, since sprinkling UART characters over the startup routine had somehow worked fine. It did not.
Next, I tried masking interrupts before clearing .bss. That failed too.
Finally, I set up mtvec to catch any error raised while clearing .bss, but this did not help either, because clearing the .bss section never hit any exception in the first place.
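
Had the trap handler ever fired, the first thing to look at would have been the mcause CSR. The sketch below shows how such a value could be decoded before printing it over UART. The names Trap, decode_mcause and exception_name are hypothetical, and usize is assumed to match the core's XLEN, which holds when compiling for the target. The exception codes are the standard ones from the RISC-V privileged specification.

```rust
#[derive(Debug, PartialEq)]
pub enum Trap {
    Interrupt(usize),
    Exception(usize),
}

/// Splits an `mcause` value into its interrupt flag (the top bit) and cause code.
pub fn decode_mcause(mcause: usize) -> Trap {
    let interrupt_bit = 1usize << (usize::BITS - 1);
    let code = mcause & !interrupt_bit;
    if mcause & interrupt_bit != 0 {
        Trap::Interrupt(code)
    } else {
        Trap::Exception(code)
    }
}

/// Names the exception codes most likely to show up during early bring-up.
pub fn exception_name(code: usize) -> &'static str {
    match code {
        0 => "instruction address misaligned",
        1 => "instruction access fault",
        2 => "illegal instruction",
        3 => "breakpoint",
        4 => "load address misaligned",
        5 => "load access fault",
        6 => "store/AMO address misaligned",
        7 => "store/AMO access fault",
        _ => "other (see the privileged spec)",
    }
}
```

A store fault inside the .bss loop would have shown up as exception code 7 here. In my case the handler was never reached, so there was nothing to decode.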

Disassembling the C SDK startup library

After trying all of the above, I explored the C SDK for hints or gotchas, but I couldn’t find any in the source. Then I looked into the linked archive with objdump, and disassembling the startup object revealed something I had been overlooking: in the SDK’s reference startup, each iteration of the .bss zeroing and .data copying loops was followed by a fence.i instruction.

This is not something you typically see in startup code for generic RISC-V cores or under QEMU — for instance, the riscv-rt startup routine does a plain store loop with no per-store fence. The vendor’s own reference startup clearly fenced after every store for a reason, so I matched that behaviour in my own startup routine — and the issue was gone.
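
To illustrate the shape of the fix, here is a Rust sketch of a .bss zeroing loop with a fence after every store. The actual fix lives in our startup assembly, so this is an approximation, and the names fence_i and zero_bss_fenced are hypothetical. The fence is compiled only for RISC-V targets, which lets the loop be exercised on a host machine, where it degrades to a plain store loop.

```rust
/// Issues `fence.i` on RISC-V targets; compiles to nothing elsewhere.
#[inline(always)]
fn fence_i() {
    #[cfg(any(target_arch = "riscv32", target_arch = "riscv64"))]
    unsafe {
        core::arch::asm!("fence.i");
    }
}

/// Zeroes every word in `[start, end)`, fencing after each store.
///
/// # Safety
/// The range must be valid, word-aligned and writable.
pub unsafe fn zero_bss_fenced(mut start: *mut u32, end: *mut u32) {
    while start < end {
        unsafe {
            core::ptr::write_volatile(start, 0);
        }
        // Mirror the vendor startup: one fence per store, not one per loop.
        fence_i();
        start = unsafe { start.add(1) };
    }
}
```

The .data copy loop would get the same treatment, with the fence placed right after each store.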

I won’t speculate here about the exact micro-architectural reason this is required on this particular hardware; the takeaway is that the vendor’s reference startup encodes hard-won bring-up knowledge, and “rolling your own” means you have to respect what that reference code is doing and why. A software issue that had been hiding for a few weeks was solved by adding essentially one instruction — fence.i — in the right place.

I was a bit frustrated about how I could have missed this, or that I should have tried this approach first. But I was relieved to finally have it fixed, and because I stumbled onto this almost as a last resort, I learned quite a few things along the way about execution models, startup code, RISC-V targets and so on. So I guess it was a win-win situation.

Conclusion

Thanks to my friend Imran K for reviewing the draft version of this blog.
