Matthew Ballance's blog -- Musings on hardware and embedded software design and verification, and the EDA tools and methodologies that support them.
Sunday, April 18, 2021
SoC Integration Testing: Hw/Sw Coordination (Part 2)
Sunday, March 28, 2021
SoC Integration Testing: Hw/Sw Test Coordination (Part 1)
IP- and subsystem-level testbenches are quite monolithic. There is a single entity (the testbench) that applies stimulus to the design, collects metrics, and checks results. In contrast, an SoC-level testbench is composed of at least two islands: the software running on the design’s processor and the external testbench connected to the design interfaces. Efficiently developing SoC tests involving both islands requires the ability to easily and efficiently coordinate their activity.
There are two points in time when it’s imperative that the behavior of the test island(s) inside the design and the test island outside the design be coordinated – specifically, the beginning and end of the test, when all islands must be in agreement. But there are many other points where it is advantageous to be able to communicate between the test islands:
- Especially when running in simulation, the ability to efficiently pass debug information from software out to the test harness dramatically speeds debug.
- It’s often useful to collect metrics on what’s happening in the software environment during the test – think of this as functional coverage for software.
- Verifying our design requires applying external stimulus to prove that the design (including firmware) reacts appropriately. This requires coordinating traffic initiated on external interfaces with the firmware running on the design processors that reacts to it – another excellent application of hardware/software coordination.
- Checking results often consumes a particularly large portion of the software test’s time. The ability to offload this work to the test harness (which runs on the host server) can shorten our simulation times significantly.
Key Care-Abouts
Efficient
When it comes to our key requirements for communication, one of the biggest is efficiency – at least while we’re in simulation. The key metric is how many clock cycles it takes to transfer data from software to the testbench. When we look at a simulation log, we want to see most activity (and simulation time) focused on actually testing our SoC, not on sending debug messages back to the test harness. A low-overhead mechanism allows us to collect more debug data, check more results, and generally gives us more flexibility and freedom in transferring data between the two islands.
Non-Invasive
One approach to efficiency is to use custom hardware for communication. Currently, though this may change, building the communication path into the design seems to be disfavored, so a communication path that is non-invasive to the design is a big plus.
Portable
Designs, of course, don’t stay in simulation forever. The end goal is to run them in emulation and prototyping for performance validation, then eventually on real silicon where validation continues -- just at much higher execution speed. Ideally, our communication path will be portable across these changes in environment. The low-level transport may change – for example, we may move from a shared-memory mailbox to using an external interface – but we shouldn’t need to fundamentally change our embedded software tests or the test behavior running on the test harness.
Scalable
A key consideration – which really has nothing to do with the communication medium at all – is how scalable the solution is in general. How much work is required to add a piece of data (message, function, etc.) that will be communicated? How much specialized expertise is required? The simpler the process is to incrementally enhance the data communicated, the greater the likelihood that it will be used.
Current Approaches
Of the approaches that I’ve seen in use, most involve either software-accessible memory or the use of an existing external interface as the transport mechanism between software and the external test harness. In fact, one of the earliest cases of hardware/software interaction that I used was the Arm Trickbox – a memory-mapped special-purpose hardware device that supported sending messages to the simulation transcript and terminating the test, among other actions.
In both of these cases, some amount of code will run on the processor to format messages and put them in the mailbox or send them via the interface.
Challenges
Using memory-based communication is generally possible in a simulation-based environment, provided we can snoop writes to memory and/or read memory contents directly from the test harness. That doesn’t mean that memory-based communication is efficient, though, and in simulation we care a lot about efficiency, given the limited speed of hardware simulators.
Our first challenge comes from the fact that all data coming from the software environment needs to be copied from its original location in memory into the shared-memory mailbox. This is because the test harness only has access to portions of the address space, and generally can’t piece together data stored in caches. The result is that we have to copy all data sent from software to the test harness out to main (non-cached) memory. Accessing main memory is slow, and thus communication between software and the test harness significantly lengthens our simulations.
Our second challenge comes from the fact that the mailbox is likely to be smaller than the largest message we wish to send, so the libraries on both sides of the mailbox need to synchronize data transmission with the space available in the mailbox. As a consequence, one of the first tasks we need to undertake when bringing up our SoC is to test the communication path between software and the test harness itself.
A final challenge, which really ought not to be a challenge, is that we’ll often end up custom-developing the communication mechanism since there aren’t readily-available reusable libraries that we can easily deploy. More about that later.
Making use of Execution Trace
In a previous post, I wrote about using processor-execution trace for enhanced debug. I've also used processor trace as a simple way to detect test termination – for example, ending the test as soon as the software invokes either 'test_pass' or 'test_fail'.
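Here's a minimal sketch of that Python test-harness code. The BFM-lookup and callback-registration calls (find_bfm, on_enter) are illustrative stand-ins, not necessarily the exact pybfms-riscv-debug API:

```python
import cocotb
import pybfms
from cocotb.triggers import Event

@cocotb.test()
async def smoke_test(dut):
    # Locate the processor-trace BFM (lookup call is illustrative)
    tracer = pybfms.find_bfm(".*u_tracer")

    done = Event()
    result = []

    def on_enter(sym_name):
        # Invoked by the trace BFM when software execution enters a function
        if sym_name in ("test_pass", "test_fail"):
            result.append(sym_name)
            done.set()

    tracer.on_enter(on_enter)   # callback registration assumed for illustration

    await done.wait()
    assert result[0] == "test_pass", "software signaled test failure"
```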
In order to support test-result checking, the processor-execution trace BFM has the ability to track both the register state and memory state as execution proceeds.
Our test harness has access to the processor core's view of register values and memory content at the point that a function is called. As it turns out, we can build on this to create a very efficient way of transferring data from software to the test harness.
In order to access the value of function parameters, we need to know the calling convention for our processor core. Here's the table describing register usage in the RISC-V calling convention:
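Summarizing the integer-register assignments from the linked ABI document:

| Register | ABI Name | Usage |
|----------|----------|-------|
| x0 | zero | Hard-wired zero |
| x1 | ra | Return address |
| x2 | sp | Stack pointer |
| x3 | gp | Global pointer |
| x4 | tp | Thread pointer |
| x5-x7, x28-x31 | t0-t6 | Temporaries |
| x8 | s0/fp | Saved register / frame pointer |
| x9, x18-x27 | s1-s11 | Saved registers |
| x10-x11 | a0-a1 | Function arguments / return values |
| x12-x17 | a2-a7 | Function arguments |

The key fact for our purposes is that the first eight integer arguments to a function are passed in a0-a7 (x10-x17). So, at the point the trace BFM reports entry to a known function, the test harness can read those registers to recover the parameter values. Here's a sketch of what that enables – the tracer.reg() register-read call (on the tracer handle from the earlier sketch) is an assumed illustrative API:

```python
# Sketch: recover function-call arguments from the tracer's register state
def on_enter(sym_name):
    if sym_name == "tb_check_result":   # hypothetical embedded-side function
        actual   = tracer.reg(10)       # a0: first argument
        expected = tracer.reg(11)       # a1: second argument
        assert actual == expected, "software-side result check failed"
```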
Next Steps
SoC integration tests are distributed tests carried out by islands of test behavior running on the processor(s) and on the test harness controlling the external interfaces. Testing more-interesting scenarios requires coordinating these islands of test functionality.
In this post, we’ve looked at using execution trace to implement a high-efficiency mechanism for communicating from embedded test software back to the test harness. While this mechanism is mostly specific to simulation, it has the advantage of simplifying communication, debug, and metrics collection at this early phase of integration testing when, arguably, we most need a high degree of visibility.
While we have an efficient mechanism, we don’t yet have one that makes it easy to add new APIs (scalable), nor one that is easily portable to environments that require a different transport mechanism.
In the next post, we’ll have a look at putting some structure and abstraction around communication that will help with both of these points.
- RISC-V Calling Conventions (ABI) -- https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
- pybfms-core-debug-common -- https://github.com/pybfms/pybfms-core-debug-common
- pybfms-riscv-debug -- https://github.com/pybfms/pybfms_riscv_debug
- pyhvl-rpc -- https://github.com/fvutils/pyhvl-rpc
Saturday, February 8, 2020
Selectively Muting your BFMs to Speed up Simulation
Have you ever had the misfortune to be on the CC list for a "lively" email discussion where you're a stakeholder but only care about the conclusion? You can't simply ignore the traffic, because you do care about the outcome of the discussion. But it would be a significant time saver if you could just "tune out" all the discussion and simply be notified when a conclusion is reached.
I've been working with Python-based testbench environments (specifically cocotb) since the middle of last year. My first foray into contributing to cocotb was to implement a task-based BFM interface between the HDL environment and the Python environment (related blog posts here and here). The motivation was to increase simulation speed by reducing the number of interactions between the HDL environment and the testbench environment. In other words, allow the Python testbench to "tune out" what was happening in the simulation until the BFM came back with some useful conclusions.
The performance benefits of abstracting up and interacting at the task-call level come from maximizing the amount of time the simulation engine can run before it needs to check in with the testbench environment. Getting good performance also requires having the HDL environment generate clocks for the design, in addition to having Python interact with BFMs at the task level to drive stimulus against the design.
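To make the contrast concrete, here's a sketch of the two interaction styles for sending a byte over a UART. The signal names and the BFM's send task are illustrative, not a specific BFM's API:

```python
from cocotb.triggers import RisingEdge

# Signal-level style: Python is re-entered on every clock edge, so each
# bit costs a simulator<->testbench round trip
async def send_byte_signal_level(dut, data):
    for i in range(8):
        await RisingEdge(dut.clock)
        dut.tx.value = (data >> i) & 1

# Task-level style: the BFM's HDL side handles the bit-level timing, and
# Python is re-entered only once per byte
async def send_byte_task_level(uart_bfm, data):
    await uart_bfm.send(data)
```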
The performance benefits of this approach do vary a bit. If all of your tests interact with the design at the signal level and still only run for a small number of seconds each, then changing the way you interact with the simulation is unlikely to improve performance. A significant percentage of the time in your simulation runs is likely taken up by environment startup, and you might not really care about the run time anyway (what's a few extra seconds, right?)
For tests that run longer, though, the performance benefits can be quite significant. One of my projects is Featherweight RISC (FWRISC), a small RISC-V implementation. One of the Zephyr-OS tests that runs as part of the regression suite tests the ability to synchronize between multiple threads. Unlike the simple unit tests or compliance tests, this test runs for quite a while. Here, we see a pretty significant difference in runtime based on whether we interact with the simulation via signals or BFM tasks.
| Scenario | Time (mm:ss) |
|----------|--------------|
| Icarus / Signal BFM | 17:06 |
| Icarus / Task BFM | 1:51 |
| Verilator / Task BFM | 0:06 |
The table above compares running the same test in Icarus Verilog with signal-level interaction, in Icarus Verilog with task-based interaction, and in Verilator with task-based interaction. Simply switching to task-based interaction reduces the runtime by a factor of 9.2x! Moving from interpreted event-driven simulation to Verilator's two-state, cycle-based simulation engine buys us even more speed.
Delegating Decisions to the BFM
This type of performance speed-up is quite nice, especially since the BFMs are quite simple. But, can we do better? In short, the answer is yes. We just need to continue down the path of reducing the number of interactions between simulation and testbench environment.
With task-based BFMs, quite a few decisions are already handled locally. For example, a UART BFM will typically handle the details of transmitting/receiving a byte, allowing the testbench to only interact with the BFM at byte boundaries.
There are cases, however, where a bit more flexibility is required. For example, an AMBA AXI BFM may handle all details of burst and phase timing under normal circumstances. Sometimes, however, the testbench might need to take fine-grained control for testing corner cases. Always interacting at the detail level will hurt performance of the common case. The solution is to have multiple (two in this case) operating modes for the BFM that allow us to have the highest performance for the common case, and fine-grained control (with lower performance) in the cases where this is needed.
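The FWRISC tracer BFM, shown below, is a concrete example. It reports three types of events back to the testbench: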
- Instruction executed
- Register write
- Memory write
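Each of these can be enabled or disabled via a control task on the BFM. Here's the Python side of the control method for register-write tracing: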
```python
@cocotb.bfm(hdl={
    bfm_vlog : bfm_hdl_path(__file__, "hdl/fwrisc_tracer_bfm.v"),
    bfm_sv   : bfm_hdl_path(__file__, "hdl/fwrisc_tracer_bfm.v")
    })
class FwriscTracerBfm():
    # ...

    @cocotb.bfm_import(cocotb.bfm_uint32_t)
    def set_trace_reg_writes(self, t):
        pass
```
```verilog
module fwrisc_tracer_bfm(
        input            clock,
        input            reset,
        // ...
        );

    reg trace_reg_writes = 1;

    task set_trace_reg_writes(reg t);
        trace_reg_writes = t;
    endtask

    // Only report register writes to the testbench when tracing is enabled
    always @(posedge clock) begin
        if (rd_write && rd_waddr != 0) begin
            if (trace_reg_writes) begin
                reg_write(rd_waddr, rd_wdata);
            end
        end
    end
    // ...
```
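Controlling instruction-execution events is slightly more involved, since we may still want to see jump and call instructions even when we mute the tracing of every individual instruction. The Python side takes a flag for each category: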
```python
@cocotb.bfm(hdl={
    bfm_vlog : bfm_hdl_path(__file__, "hdl/fwrisc_tracer_bfm.v"),
    bfm_sv   : bfm_hdl_path(__file__, "hdl/fwrisc_tracer_bfm.v")
    })
class FwriscTracerBfm():

    @cocotb.bfm_import(cocotb.bfm_uint32_t, cocotb.bfm_uint32_t, cocotb.bfm_uint32_t)
    def set_trace_instr(self, all_instr, jump_instr, call_instr):
        pass
```
```verilog
module fwrisc_tracer_bfm(
        input            clock,
        input            reset,
        // ...
        );
    // ...
    reg trace_instr_all  = 1;
    reg trace_instr_jump = 1;
    reg trace_instr_call = 1;

    task set_trace_instr(reg all, reg jumps, reg calls);
        trace_instr_all  = all;
        trace_instr_jump = jumps;
        trace_instr_call = calls;
    endtask

    always @(posedge clock) begin
        if (ivalid) begin
            last_instr <= instr;
            if (trace_instr_all
                || (trace_instr_jump && (
                        last_instr[6:0] == 7'b1101111 ||  // jal
                        last_instr[6:0] == 7'b1100111))   // jalr
                || (trace_instr_call && (
                        // JAL with a non-zero link target
                        last_instr[6:0] == 7'b1101111 ||
                        last_instr[6:0] == 7'b1100111) &&
                        last_instr[11:7] != 5'b0)
                ) begin
                instr_exec(pc, instr);
            end
        end
    end
    // ...
endmodule
```
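From the test's perspective, selectively muting the tracer then looks something like the following sketch (how the BFM handle is obtained is elided):

```python
import cocotb

@cocotb.test()
async def thread_sync_test(dut):
    tracer = ...  # obtain the FwriscTracerBfm handle (elided)

    # Mute the high-rate events for the bulk of the run: no register-write
    # callbacks, and only jump/call instructions reported
    tracer.set_trace_reg_writes(0)
    tracer.set_trace_instr(0, 1, 1)

    # ... run the long test phase ...

    # Restore full tracing around the region of interest
    tracer.set_trace_reg_writes(1)
    tracer.set_trace_instr(1, 1, 1)
```

The impact shows up directly in the runtime: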
| Scenario | Time (mm:ss) |
|----------|--------------|
| Verilator / Task BFM (Full) | 1:05 |
| Verilator / Task BFM (Min) | 0:24 |
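Selectively muting the BFM cuts the runtime from 1:05 to 0:24 – roughly a 2.7x speedup on top of the gains from task-based interaction.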
[Figure: trace activity waveforms comparing BFM (Full Activity) with BFM (Selectively-Muted Activity)]