Saturday, January 30, 2021

SoC Integration Testing: Higher-Level Software Debug Visibility

Debug is a key task in any development task. Whether debugging application-level software or a hardware design, a key to productive debug is getting a higher-level view of what is happening in the design. Blindly stepping around in source code or staring at low-level waveforms is rarely a productive approach to debugging. Debug-log messages provide a high-level view of what's happening in an software application, allowing us to better target what source we actually inspect. Testbench logging, coupled with a transaction-level view of interface activity, provides us that higher-level view when verifying IP-level designs. Much of this is lacking when it comes to verifying SoC integration.

Challenges at SoC Level
We face a few unique challenges when doing SoC-integration testing. Software (okay, really firmware) is an integral part of our test environment, but that software is running really really slowly since it is running at RTL-simulation speeds. That makes using debug messages impractical, since simulating execution of the code to produce messages makes our test software run excruciatingly slowly. In addition, the types of issues we are likely to find -- especially early on -- are not at the application-level anyway. 

Processor simulation models often provide some form of execution trace, such as ARM's Tarmac file, which provides us a window into what's happening in the software. The downsides, here, are that we end up having to manually correlate low-level execution with higher-level application execution and what's happening in the waveform. There are also some very nice commercial integrated hardware/software debug tools that dramatically simplify the task of debugging software at the source level and correlating that with what's happening in the hardware design -- well worth checking out if you have access.

At IP level, it's common to use Verification IP to relate the signal-level view of implementation with the more-abstract level we use when developing tests and debugging. It's highly desirable, of course, to be able to use Verification IP across multiple IPs and projects. This requires the existence of a common protocol that VIP can be developed to comprehend. 

If we want VIP that exposes a higher-level view of a processor's execution, we'll need just such a common protocol to interpret. The good news is that there is such a protocol for the RISC-V architecture: the RISC-V Formal Interface (RVFI). As its name suggests, the RISC-V Formal Interface was developed to enable a variety of RISC-V cores to be formally verified using the same library of formal properties. Using the RVFI as our common 'protocol' to understand the execution of a RISC-V processor enables us to develop a Verification IP that supports any processor that implements the RVFI.

The RISC-V Debug BFM is part of the PyBfms project and, like the other Bus-Functional Models within the project, implements low-level behavior in Verilog and higher-level behavior in Python. Like other PyBfms models, the RISC-V Debug BFM works nicely with cocotb testbench environments.

Instruction-Level Trace
Like other BFMs, the Verilog side of the RISC-V Debug BFM contains various mechanics for converting the input signals to a higher-level instruction trace. Consequently, the signals that expose the higher-level view of software execution are collected in a sub-module of the BFM instance.

The image above shows the elements within the debug BFM. The ctxt scope contains the higher-abstraction view of software execution, while the regs scope inside it contains the register state.

The first level of debug visibility that we receive is at the instruction level. The RISC-V Debug BFM exposes a simple disassembly of the executed instructions on the disasm signal within the ctxt scope. Note that you need to set the trace format to ASCII or String (depending on your waveform viewer) to see the disassembly. 

C-Level Execution Trace
Seeing instruction execution and register values is useful, but still leaves us looking at software execution at a very low level. This is very limiting, and especially so if we're attempting to understand the execution of software that we didn't write -- booting of an RTOS, for example. 

Fortunately, our BFM is connected to Python and there's a readily-available library (pyelftools) for accessing symbols and other information from the software image being executed by the processor core.

The code snippet above shows our testbench obtaining the path to the ELF file from cocotb, and passing this to the RISC-V Debug BFM. Now, what can we do with a stream of instruction-execution events and an ELF file? How about reconstructing the call stack?

The screenshot above shows the call stack of the Zephyr OS booting and running a short user program. If we need to debug a design failure, we can always correlate it to where the software was when the failure occurred. 

The screenshot above covers approximately 2ms of simulation time. At this scale, the signal-level details at the top of the waveform view are incomprehensible. The instruction-level view in the middle are difficult to interpret, though perhaps you could infer something from the register values. However, the C-level execution view at the bottom is still largely legible. Even when function execution is too brief to enable the function name to be legible, sweeping the cursor makes the execution flow easy to follow.

Current Status and Looking Forward
The RISC-V Debug BFM is still early in its development cycle, with additional opportunities for new features (stay tuned!) and a need for increased stability and documentation. That said, feel free to have a look and consider whether having access to the features described above would improve your SoC bring-up experience.

Looking forward in this series of blog posts, we'll be looking next at some of the additional things we can do with the information and events collected by the RISC-V Debug BFM. Among other things, these will allow us to more tightly connect the execution of our Python-based testbench with the execution of our test software.

Finally, the process of creating the RISC-V BFM has me thinking about the possibilities when assembling an SoC from IPs with integrated higher-level debug. What if not only the processor core but also the DMA engine, internal accelerators, and external communication IPs were all able to show a high-level view of what they were doing? It would certainly give the SoC integrator a better view of what was happening, and even facilitate discussions with the IP developer. How would IP with integrated high-level debug improve your SoC bring-up experience?

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, January 16, 2021

SoC Integration Testing: Intro and Challenges

As I mentioned in my end-of-year post, one of my 2020 projects was to develop a design for the Google/eFabless/SkyWater Multi-Project Wafer (MPW) fab run. One thing I looked forward to was applying elements of the Python-based verification flow that I've been developing. Doing so highlighted a gap in my verification toolkit: reusable infrastructure for SoC-level verification.

Caravel and the User Project Area
eFabless, the company developing the RTL to GDS flow and project-managing the MPW shuttle, developed the pad ring and some management circuitry that all projects made use of. The management circuitry includes a small processor, a few peripherals, and debug circuitry for observing and interacting with the user-project area (see image below). 

The entire thing is called the Caravel -- a carrier for the user project. To keep things simple, my project was, itself, a very small SoC with a RISC-V core, a few peripherals and some memory (shown below). 
So, essentially, the entire project is two SoCs back to back. 

IP Verification vs SoC Integration Testing

Much of my work recently has been with Python-based verification environments focused on IP-level verification. I've worked with constrained-random stimulus generation, functional coverage, and bus functional models. While IP-level verification isn't the only possible application of this work, my usage has all been firmly focused on verification of RTL IP-level designs.

Verifying the "payload" portion of the MPW design was fairly straightforward using this infrastructure. I was able to leverage some bus functional models (BFMs) from the PyBfms library, and wrote some Python tests to verify that the design IPs were properly integrated.

However, things got more complicated (and painful) when it came to verifying the integration by running software on the Caravel management processor. Lack of visibility into what the software was doing made debug difficult. Lack of synchronization between the running software and the testbench environment made automating regression tests difficult. Given some tight deadlines, I ended up focusing on verifying my project and largely tested the interface between the management processor and my project using interactive tests. But, the experience got me thinking about what reusable elements would have enabled more complete and comprehensive verification.

Verification Key Requirements
IP-level and SoC-level testbench environments are quite different. IP-level environments have a monolithic testbench ideally composed of reusable test infrastructure, while the test infrastructure is much more distributed in an SoC-level environment. Despite these differences, the key requirements for highly-productive verification are very similar in both of these test environments.

Synchronization and Control
All testbench environments need to synchronize execution of the various components. In a monolithic testbench, this is typically done with thread-synchronization primitives provided by the testbench language (eg fork/join and semaphores for SystemVerilog) or the testbench library (eg Event for cocotb). 

Synchronization and control have two primary roles: ensure the test only begins once everything in the testbench environment is running, and detect the end of the test and shut everything down. In a monolithic environment, this isn't so difficult. In an SoC environment, this becomes much more difficult because a key part of our testbench is the embedded software running on the processor core(s) in the design. Synchronizing the start and end of the test with this running software is a challenge. Unlike synchronization in an IP-level testbench, which is addressed in one way for a given language and library, synchronization and control in an SoC environment is often addressed in a custom manner. 

Debug visibility
In an IP-level testbench environment, debug typically leverages two sources of information: signal-level waveform trace and the debug log. We still have all of that data in an SoC environment, of course, but getting a sense of what the test software is doing at the point of a hardware failure is much more difficult. Often, it comes down to manually correlating the program counter from the waveform with a disassembly dump of the test program.

IP-level environments provide several sources of metrics for determining when verification is complete. Functional coverage metrics ensure that key test scenarios are executed, and that key conditions are exercised in the design. Code-coverage metrics alert us to areas of the design not being properly exercised by tests.
In an SoC-level environment, we would like to add software-centric metrics to help us understand whether our test software is exercising key scenarios. Lack of visibility into the operation of the software tends to get in the way of doing this.

Verification IP
Verification IP for external interfaces is present in both IP- and SoC-level environments. VIP simplifies the process of exercising design behavior via an interface. In an SoC-level environment, the IPs in the SoC take over the role that verification IP played for internal interfaces. It's often difficult to use these IPs as verification IP because appropriate low-level driver software isn't available -- either it hasn't been developed yet or it only exists in the context of a full operating system. Taking time to write low-level driver software for IPs in the SoC takes away time from writing test scenarios.

Looking Forward
My latest experience in both IP/subsystem-level and SoC-integration verification has emphasized that there's a hole in my verification toolbox. The existing tools in my verification toolbox work quite well for IP-level verification, and they're quite reusable. I'd like to have more reusable elements when approaching SoC integration testing. 

Over the next few blog posts I'll look at some SoC-level verification infrastructure that I'm creating. A key hope, of course, is that this is sufficiently general that it's more broadly useful than just for Caravel. I'll be focusing on approaches and methodology that can be applied whether you're a hardware hobbyist or in commercial practice. I'll also be continuing my focus on Python as the testbench methodology, but same approaches should work with SystemVerilog or SystemC as well if you're using these methodologies.

I'm always interested in feedback on whether these elements of methodology are useful, scalable, etc. So, please comment with your thoughts. 

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.