Showing posts with label Design Verification. Show all posts
Showing posts with label Design Verification. Show all posts

Sunday, March 28, 2021

SoC Integration Testing: Hw/Sw Test Coordination (Part 1)

 



IP- and subsystem-level testbenches are quite monolithic. There is a single entity (the testbench) that applies stimulus to the design, collects metrics, and checks results. In contrast, an SoC-level testbench is composed of at least two islands: the software running on the design’s processor and the external testbench connected to the design interfaces. Efficiently developing SoC tests involving both islands requires the ability to easily and efficiently coordinate their activity.

There are a two times when it’s imperative that the behavior of the test island(s) inside the design and the test island outside the design are coordinated – specifically, the beginning and end of the test when all islands must be in agreement. But, there are many other points in time where it is advantageous to be able communicate between the test islands. 

Especially when running in simulation, the ability to efficiently pass debug information from software out to the test harness dramatically speeds debug. 

It’s often useful to collect metrics on what’s happening in the software environment during test – think of this as functional coverage for software. 

Verifying our design requires applying external stimulus to prove that the design (including firmware) reacts appropriately. This requires the ability to coordinate between initiating traffic on external interfaces and running firmware on the design processors to react – another excellent application of hardware/software coordination. 

Often, checking results consumes a particularly-large portion of the software-test’s time. The ability to offload this to the test harness (which runs on the host server) can shorten our simulation times significantly. 

Key Care-Abouts

When it comes to our key requirements for communication, one of the biggest is efficiency – at least while we’re in simulation. The key metric being how many clock cycles it takes to transfer data from software to testbench. When we look at a simulation log, we want to see most activity (and simulation time) focused on actually testing our SoC, and not on sending debug messages back to the test harness. A mechanism with a low overhead will allow us to collect more debug data, check more results, and generally have more flexibility and freedom in transferring data between the two islands.

Non-Invasive

One approach to efficiency is to use custom hardware for communication. Currently, though this may change, building the communication path into the design seems to be disfavored. So, having the communication path be non-invasive is a big plus.

Portable

Designs, of course, don’t stay in simulation forever. The end goal is to run them in emulation and prototyping for performance validation, then eventually on real silicon where validation continues -- just at much higher execution speed. Ideally, our communication path will be portable across these changes in environment. The low-level transport may change – for example, we may move from a shared-memory mailbox to using an external interface – but we shouldn’t need to fundamentally change our embedded software tests or the test behavior running on the test harness.

Scalable

A key consideration – which really has nothing to do with the communication medium at all – is how scalable the solution is in general. How much work is required to add a piece of data (message, function, etc) that will be communicated? How much specialized expertise is required? The simpler the process is to incrementally enhance the data communicated, the greater the likelihood that it will be used.

Current Approaches

Of the approaches that I’ve seen in use, most involve either software-accessible memory or the use of an existing external interface as the transport mechanism between software and the external test harness. In fact, one of the earliest cases of hardware/software interaction that I used was the Arm Trickbox – a memory-mapped special-purpose hardware device that supported sending messages to the simulation transcript and terminating the test, among other actions.

In both of these cases, some amount of code will run on the processor to format messages and put them in the mailbox or send them via the interface. 

Challenges

Using a memory-based communication is generally possible in a simulation-based environment – provided we can snoop writes to memory, and/or read memory contents directly from the test harness. That doesn’t mean that memory-based communication is efficient, though, and in simulation, we care a lot about efficiency due to the speed of hardware simulators.

Our first challenge comes from the fact that all data coming from the software environment needs to be copied from its original location in memory into the shared-memory mailbox. This is because the test harness only has access to portions of the address space, and generally can’t piece together data stored in caches. The result is that we have to copy all data sent from software to the test harness out to main (non-cached) memory. Accessing main memory is slow, and thus communication between software and the test harness significantly lengthens our simulations.

Our second challenge comes from the fact that the mailbox is likely to be smaller than the largest message we wish to send. This means that our libraries on both sides of the mailbox need to manage synchronizing data transmission with available space in the mailbox. This means that one of the first tasks we need to undertake when bringing up our SoC is to test the communication path between software and test harness.

A final challenge, which really ought not to be a challenge, is that we’ll often end up custom-developing the communication mechanism since there aren’t readily-available reusable libraries that we can easily deploy. More about that later.

Making use of Execution Trace

In a previous post, I wrote about using processor-execution trace for enhanced debug. I've also used processor trace as a simple way to detect test termination. For example, here is the Python test-harness code that terminates the test when one of 'test_pass' or 'test_fail' are invoked:

In order to support test-result checking, the processor-execution trace BFM has the ability to track both the register state and memory state as execution proceeds.


The memory mirror is a sparse memory model that contains only the data that the core is actively using. It's initialized from the software image loaded into simulation memory, and updated when the core performs a write. The memory mirror provides the view of memory from the processor core's perspective -- in other words, pre-cache. 

Our test harness has access to the processor core's view of register values and memory content at the point that a function is called. As it turns out, we can build on this to create a very efficient way to transferring data from software to the test harness.

In order to access the value of function parameters, we need to know the calling convention for our processor core. Here's the table describing register usage in the RISC-V calling convention:

Note that x10-17 are used to pass the first eight function arguments. 

Creating Abstraction
We could, of course, directly access registers and memory from our test-harness code to get the value of function parameters. But, a little abstraction will help us out in the long run.

The architecture-independent core-debug BFM defines a class API for accessing the value of function parameters. This is very similar to the varadic-argument API used in C programming:


Now, we just need to implement a RISC-V specific version of this API in order to simplify accessing function parameter values:

Here's how we use this implementation. Assume we have a embedded-software function like this:
When we detect that this function has been called, we can access the value of the string passed to the function from the test harness like this:


Advantages
There are several advantages to using a trace-driven approach to data communication between processor core and test harness. Because the trace BFM sees the processor's view of memory, there's no need to (slowly) copy data out to main memory in order for the test harness to see it. This allows data to stay in caches and avoids unnecessary copying.

Perhaps more importantly, our trace-based communication mechanism allow us to offload data processing to the test harness. Take, for example, the very-common debug printf:


The user passes a format string and then a variable number of arguments that will all be converted to string representations that can be displayed. If our communication mechanism is an in-memory mailbox or external interface, we need to perform the string formatting on the design's processor core. If, however, we use the trace-based mechanism for communication, the string formatting can all be done by the test harness in zero simulation time. This allows us to keep our simulations shorter and more-focused on the test at hand, while maximizing the debug and metrics data we collect.


Next Steps

SoC integration tests are distributed tests carried out by islands of test behavior running on the processor(s) and on the test harness controlling the external interfaces. Testing more-interesting scenarios requires coordinating these islands of test functionality. 

In this post, we’ve looked at using execution-trace to implement a high-efficiency mechanism for communicating from embedded test software back to the test harness. While this mechanism is mostly-specific to simulation, it has the advantage of simplifying communication, debug, and metrics collection at this early phase of integration testing when, arguably, we most-need a high degree of visibility. 

While we have an efficient mechanism, we don’t yet has a mechanism that makes it easy to add new APIs (scalable) nor a mechanism that is easily portable to environments that need to use a different transport mechanism.

In the next post, we’ll have a look at putting some structure and abstraction around communication that will help with both of these points.

References

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.


Saturday, January 30, 2021

SoC Integration Testing: Higher-Level Software Debug Visibility



Debug is a key task in any development task. Whether debugging application-level software or a hardware design, a key to productive debug is getting a higher-level view of what is happening in the design. Blindly stepping around in source code or staring at low-level waveforms is rarely a productive approach to debugging. Debug-log messages provide a high-level view of what's happening in an software application, allowing us to better target what source we actually inspect. Testbench logging, coupled with a transaction-level view of interface activity, provides us that higher-level view when verifying IP-level designs. Much of this is lacking when it comes to verifying SoC integration.

Challenges at SoC Level
We face a few unique challenges when doing SoC-integration testing. Software (okay, really firmware) is an integral part of our test environment, but that software is running really really slowly since it is running at RTL-simulation speeds. That makes using debug messages impractical, since simulating execution of the code to produce messages makes our test software run excruciatingly slowly. In addition, the types of issues we are likely to find -- especially early on -- are not at the application-level anyway. 

Processor simulation models often provide some form of execution trace, such as ARM's Tarmac file, which provides us a window into what's happening in the software. The downsides, here, are that we end up having to manually correlate low-level execution with higher-level application execution and what's happening in the waveform. There are also some very nice commercial integrated hardware/software debug tools that dramatically simplify the task of debugging software at the source level and correlating that with what's happening in the hardware design -- well worth checking out if you have access.

RISC-V VIP
At IP level, it's common to use Verification IP to relate the signal-level view of implementation with the more-abstract level we use when developing tests and debugging. It's highly desirable, of course, to be able to use Verification IP across multiple IPs and projects. This requires the existence of a common protocol that VIP can be developed to comprehend. 

If we want VIP that exposes a higher-level view of a processor's execution, we'll need just such a common protocol to interpret. The good news is that there is such a protocol for the RISC-V architecture: the RISC-V Formal Interface (RVFI). As its name suggests, the RISC-V Formal Interface was developed to enable a variety of RISC-V cores to be formally verified using the same library of formal properties. Using the RVFI as our common 'protocol' to understand the execution of a RISC-V processor enables us to develop a Verification IP that supports any processor that implements the RVFI.

RISC-V Debug BFM
The RISC-V Debug BFM is part of the PyBfms project and, like the other Bus-Functional Models within the project, implements low-level behavior in Verilog and higher-level behavior in Python. Like other PyBfms models, the RISC-V Debug BFM works nicely with cocotb testbench environments.

Instruction-Level Trace
Like other BFMs, the Verilog side of the RISC-V Debug BFM contains various mechanics for converting the input signals to a higher-level instruction trace. Consequently, the signals that expose the higher-level view of software execution are collected in a sub-module of the BFM instance.


The image above shows the elements within the debug BFM. The ctxt scope contains the higher-abstraction view of software execution, while the regs scope inside it contains the register state.


The first level of debug visibility that we receive is at the instruction level. The RISC-V Debug BFM exposes a simple disassembly of the executed instructions on the disasm signal within the ctxt scope. Note that you need to set the trace format to ASCII or String (depending on your waveform viewer) to see the disassembly. 


C-Level Execution Trace
Seeing instruction execution and register values is useful, but still leaves us looking at software execution at a very low level. This is very limiting, and especially so if we're attempting to understand the execution of software that we didn't write -- booting of an RTOS, for example. 

Fortunately, our BFM is connected to Python and there's a readily-available library (pyelftools) for accessing symbols and other information from the software image being executed by the processor core.


The code snippet above shows our testbench obtaining the path to the ELF file from cocotb, and passing this to the RISC-V Debug BFM. Now, what can we do with a stream of instruction-execution events and an ELF file? How about reconstructing the call stack?


The screenshot above shows the call stack of the Zephyr OS booting and running a short user program. If we need to debug a design failure, we can always correlate it to where the software was when the failure occurred. 



The screenshot above covers approximately 2ms of simulation time. At this scale, the signal-level details at the top of the waveform view are incomprehensible. The instruction-level view in the middle are difficult to interpret, though perhaps you could infer something from the register values. However, the C-level execution view at the bottom is still largely legible. Even when function execution is too brief to enable the function name to be legible, sweeping the cursor makes the execution flow easy to follow.

Current Status and Looking Forward
The RISC-V Debug BFM is still early in its development cycle, with additional opportunities for new features (stay tuned!) and a need for increased stability and documentation. That said, feel free to have a look and consider whether having access to the features described above would improve your SoC bring-up experience.

Looking forward in this series of blog posts, we'll be looking next at some of the additional things we can do with the information and events collected by the RISC-V Debug BFM. Among other things, these will allow us to more tightly connect the execution of our Python-based testbench with the execution of our test software.

Finally, the process of creating the RISC-V BFM has me thinking about the possibilities when assembling an SoC from IPs with integrated higher-level debug. What if not only the processor core but also the DMA engine, internal accelerators, and external communication IPs were all able to show a high-level view of what they were doing? It would certainly give the SoC integrator a better view of what was happening, and even facilitate discussions with the IP developer. How would IP with integrated high-level debug improve your SoC bring-up experience?

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, June 27, 2020

Arrays, Dynamic Arrays, Queues: One List to Rule them All


Randomizable lists are, of course, very important in modeling more-complex stimulus, and I've been working to support these within PyVSC recently. Thus far, PyVSC has attempted to stay as close as possible to both the feature set and, to the extent possible, the look and feel of SystemVerilog features for modeling constraints and coverage.  With randomizable lists, unlike other features, I've decided to diverge from the SystemVerilog. Keep reading to learn a bit more about the capabilities of randomizable lists in PyVSC and the reason from diverging from the SystemVerilog approach.

SystemVerilog: Three Lists with Different Capabilities
SystemVerilog is, of course, three or so languages in one. There's the synthesizable design subset used for capturing an RTL model of the design. There's the testbench subset that is an object-oriented language with classes, constraints, etc. There's also the assertion subset. These different subsets of the language have different requirements when it comes to data structures. These different requirements have led SystemVerilog to have three array- or list-like data structures:

Fixed-size arrays, as their name indicates, have a size specified as part of their declaration. A fixed-size array never changes size. Because  the array size is captured as part of the declaration, methods that operate on fixed-size arrays can only operate on a single-size array.

The size of dynamic-size arrays can change across a simulation. The size of a dynamic-size array is specified when it is created using the new operator. Once a dynamic-size array instance has been created, the only way to change its size is to re-create it with another new call. Well, actually, there is one other way. Randomizing a dynamic-size array also changes the size.

The size of a queue is changed by calling methods. Elements can be appended to the list, removed, etc. A queue is also re-sized when it is randomized.


PyVSC: One List with Three Options
If you've done a bit of Python programming, you're well aware that Python has a single list. Python's list is closest to SystemVerilog's queue data structure. My initial thought on supporting randomizable lists with PyVSC was just to create an equivalent to the list and be done. But then I thought a bit more about use models for arrays in verification. Each SystemVerilog array type represents a useful use model, but there's also another use model that I've never properly figured out how to easily represent in SystemVerilog. Fundamentally, there are two use cases for randomizable lists:
  • List with non-random elements
  • List with random elements, whose size is not random
  • List with random elements, whose size is random
When the size of a list whose size is not randomizable is modified by appending or removing elements, its size is preserved when the list is subsequently randomized.

Here are a few examples.

@vsc.randobj
class my_item_c(object):
    def __init__(self):
      self.my_l = vsc.rand_list_t(vsc.uint8_t(), 4)

The example above declares a list that initially contains four random elements.

@vsc.randobj
class my_item_c(object):
    def __init__(self):
      self.my_l = vsc.randsz_list_t(vsc.uint8_t())

    @vsc.constraint
    def my_l_c(self):
        self.my_l.size in vsc.rangelist((1,10))
The example above declares a list whose size will be randomized when the list is randomized. A list with randomized size must have a top-level constraint that specifies the maximum size of the list. Note that in this case the size of the list will be between 1 and 10.

If you wish to use a list of non-random values in constraints, you must store those values in an attribute of type list_t. This allows PyVSC to properly capture the constraints.
@vsc.randobj
class my_item_c(object):
    def __init__(self):
      self.a = vsc.rand_uint8_t()
      self.my_l = vsc.list_t(vsc.uint8_t(), 4)

      for i in range(10):
          self.my_l.append(i)

    @vsc.constraint
    def a_c(self):
      self.a in self.my_l

it = my_item_c()
it.my_l.append(20)

with it.randomize_with(): 
      it.a == 20 

In the example above, the class contains a non-random list with values 0..9. After an instance of the class is created, the list is modified to also contain 20. Then we randomize the class with an additional constraint that a must be 20. This randomization will succeed because the my_l list does contain the value 20.

Using Lists in Foreach Constraints 

PyVSC now also supports the foreach constraint. By default, a foreach constraint provides a reference to each element of the array. 
@vsc.randobj
class my_s(object):
    def __init__(self);
        self.my_l = vsc.rand_list_t(vsc.uint8_t(), 4)

    @vsc.constraint
    def my_l_c(self):
        with vsc.foreach(self.my_l) as it:
            it < 10
In the example above, we constrain each element of the list to have a value less then 10. However, it can also be useful to have an index to use in computing values. The foreach construct allows the user to request that an index variable be provided instead.
@vsc.randobj
class my_s(object):
    def __init__(self);
        self.my_l = vsc.rand_list_t(vsc.uint8_t(), 4)

    @vsc.constraint
    def my_l_c(self):
        with vsc.foreach(self.my_l, idx=True) as i:
            self.my_l[i] < 10
The example above is identical semantically to the previous one. However, in this case we refer to elements of the list by their index. But, what if we want both index and value iterator?
@vsc.randobj
class my_s(object):
    def __init__(self);
        self.my_l = vsc.rand_list_t(vsc.uint8_t(), 4)

    @vsc.constraint
    def my_l_c(self):
        with vsc.foreach(self.my_l, it=True, idx=True) as (i,it):
            it == (i+1)

Just specify both 'it=True' and 'idx=True' and both index and value-reference iterator will be provided.

One List to Rule them All
As of the 0.0.4 release (available now!) PyVSC supports lists of randomizable elements whose size is either fixed or variable with respect to randomization. Check it out and see how it helps in modeling more-complex verification scenarios in Python!

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.






Tuesday, December 31, 2019

2019 - The "Nights and Weekends Projects" Year in Review



It's almost the end of 2019, and I've been thinking back over the year as well as thinking ahead to 2020. In past years, I've often evaluated my "nights and weekends" projects using the same metrics I'm evaluated on at work: projects completed, and results obtained. This year, I've started looking my my "nights and weekends" efforts through a different lens focused more on the knowledge I've gained than just what I've produced.
As an aside, given the cover image, I do find it somewhat ironic that almost none of the knowledge I gained this year came from printed and bound books. Growing up with a love of libraries, and the fascinating collections of books they contained, it's both sad to think that knowledge is no longer concentrated there, and amazing to realize what a wealth of knowledge is now so easily-accessible just a short search away.

Looking back, there are two themes that run through several areas that I worked in across the year. The first of these is making software more modular, collaborative, and accessible. The second is Python. That's not all, though. So, let's get right to it!

Software Packaging and Distribution
Professionally, I come from a standard commercial-software background, and have often looked at open source through a similar lens. Specifically, I've often focused on software that can be packaged such that it's easily accessible to end users. This means bundling dependencies, providing installers, etc (see DVKit, a 'batteries-included' IDE for verification engineers).

This application-centric approach works well so long as the elements of functionality being distributed are relatively small in number, and the ways in which they need to be combined are fairly limited. This approach breaks down when the elements of functionality are relatively large in number, and need to be combined in many ways. In short, the more modular software becomes, the less feasible typical application-centric packaging becomes.

I've been dabbling for a few years in RTL design and verification. In this space, the verification environment for a given design will depend on many small elements of functionality -- utility libraries, reusable verification IP, etc. Bundling the dependencies with the verification environment quickly leads to projects that require lots of disk space. On the other hand, forcing users to download and install all the dependencies presents a significant barrier to new users.

One of the biggest reasons that I've spent so much time with Python this past year is that the Python ecosystem appears to provide a solution to this challenge of packaging and easily distributing small elements of functionality. Over the course of the year, I've spent time looking at Conda as a way of making application-level features more modular and easily-accessible. I've also spent time learning about how to package Python extension libraries (both with and without native library components) for distribution on PyPi, a repository for distributing Python packages.


New Approaches to Embedded DSLs
I've been involved in several projects over the years that have used C++ to provide a language-like user experience via C++ overloaded operators and macros. While there are certainly downsides to these embedded domain-specific languages in terms of error messaging and extensibility, an embedded domain-specific language can be a great way to prototype a language-based user interface before committing to the work of defining a first-class language and creating the parsing and processing infrastructure. It's also a very helpful approach for exploring new techniques in the context of existing languages.

C++ support for macros and operator overloading have been used for embedded DSLs from the beginning. However, using just these features tends to lead to somewhat awkward syntax, since operator overloading only supports expressions. C++11 (and beyond) brings new features, such as lambda expressions, and I spent time investigating these mechanisms and their impact on supporting expressing more-complex constructs in a more-natural way.

While the new C++11 features definitely showed promise, I started to wonder what support Python provided for implementing embedded domain-specific languages. As it turns out, Python provides some very powerful capabilities. Python supports overloading more operators than C++, and supports introspection into the code described by the user. I definitely intend to revisit embedded domain-specific languages captured in Python in 2020!

Constraint Solvers
Highly-capable constraint solvers that are available under permissive open-source licenses are becoming widely available, and I'm seeing these solvers applied to a range of interesting tasks. The CRAVE library for generating random stimulus has been around for some time. Several tools are leveraging available SMT solvers for model checking. Constraint solvers are even being applied for graphical layout of diagrams!

Given the range of applications to which solvers lend themselves, I thought it would be worth having a bit more hands-on knowledge. I spent some time learning about the Z3 solver API before concluding that, while the API is elegant and comprehensive, it's also more-complicated that what I need. I subsequently shifted to looking at the Boolector solver API, which is smaller and simpler.

The Boolector solver provides a Python binding, which is built along with the solver. This means that a user needs to manually build Boolector in order to use a Python package that uses the Boolector solver. Fortunately, I'd been learning about packaging and distributing Python extension libraries, and this this provided a perfect place to try this out. The Boolector Python library (PyBoolector) on PyPi is the result of this work.

Python for Verification
My background in verification is rooted in SystemC, SystemVerilog, and UVM. All very mainstream languages and methodologies in the commercial design and functional verification space. As I spent more time exploring Python and the modular and collaborative packaging it supports, I concluded that it made sense to investigate using Python for functional verification.

I spent time learning about cocotb, the most popular functional verification library in Python that I'm aware of. I also spent time learning about Python's back-end C API and how to structure bus-functional models to integrate at the procedure level with Python.

Actually, the more time I spend looking at Python for verification, the more possibilities I see. Definitely look for more on this topic in 2020!

In most areas, I've been quite happy with Python for verification. The object-oriented language features fit the requirements for high-level verification, and the easy availability of utility packages simplifies dealing with project dependencies. The one thing I've been dissatisfied with is support for static checking. I've used statically-typed languages for most application development. These languages have the advantage that the compiler can identify misuse of types before running the application. Dynamically-typed languages, such as Python and TCL, end up discovering type-misuse issues (eg passing an object to a method that expects an object of a different type) at runtime. One target for 2020 is learning more about what can be done to address this issue. Lint tools such as Pylint help, and my hope is to discover more tools and methodologies that help to close this gap.

RTL Design Skills
When I undertook the 2018 RISC-V Soft Core Contest, It had been quite a few years since I'd done any RTL design. Going through the design work for that project helped me brush up my skills quite a bit, but I knew I had quite a ways to go to be proficient. When the 2019 contest, centered around software security, came along, I knew it was a good opportunity to both learn more about software security vulnerabilities and improve my RTL design skills.

In addition to improving my RTL design skills, I learned a couple of things from initially attempting to add a few new features (multiplication, compressed instructions, security extensions) to my 2018 soft core. First, I had succeeded at writing some very good spaghetti RTL that wasn't modular enough to support extensibility. Furthermore, I didn't have sufficient tests to effectively and efficiently catch bugs introduced by adding new features.

Over the course of the 2019 project, I did a complete rewrite of the Featherweight RISC core. The more-modular structure of the rewritten core lends itself even better to bounded model checking, and I found this to be extremely helpful in catching and diagnosing bugs introduced during development and integration.

Going through this process also helped to improve my knowledge of RTL constructs that result in good efficient implementation, and which do not.


Looking Forward
2019 has been a great year for learning about more corners of the technical world. Looking forward to 2020, I see more work with Python, transitioning more of my existing projects over to cloud-based continuous integration, and more work with Python in the functional verification space. What will I learn along the way? Stay tuned for more blog posts across 2020 to find out!

As we come to the end of 2019 and the beginning of a new year (and new decade), I wish you happy holidays, a happy new year, and a 2020 ahead that is full of learning!

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 14, 2019

Writing a Task-Based Cocotb BFM



Background
The purpose of a Bus Functional Model (BFM) is to enable interacting with a design via a given protocol at a higher level of abstraction than the signal-level protocol, while knowing the bare minimum about the details of that protocol. Verification IP goes beyond these benefits to provide test plans, functional coverage, test sequence, and often protocol-specific benefits like compliance test suites.

In order to realize any these benefits of having a BFM, one first needs to exist. So, let's take a look at what it takes to create a task-based Cocotb BFM for a very simple ready/valid protocol.


The ready/valid protocol is simple, but useful. Data is exchanged between two blocks on a clock edge when both ready and valid are active (high). The initiator controls the valid signal, the target controls the ready signal. That's all there is to the protocol.

Before we get into the details, a few words about the structure of a task-based Cocotb BFM. There are two key components:

  • A Python class that provides the API used by the test writer, and defines the lower-level API used to actually interact with the HDL portion of the BFM
  • An HDL module that performs the conversion between the commands sent from Python and signals, and vice-versa.



Both of these aspects of a task-based BFM are collected together in a Python package that is typically named with the protocol implemented by the BFM (rv, or ready-valid, in this case).

Python API

The Python portion of a task-based Cocotb BFM is captured as a Python class. I use camel-case names for classes, so the ready/value output (initiator) BFM is ReadyValidDataOutBFM.

class ReadyValidDataOutBFM():

    def __init__(self):
        self.busy = Lock()
        self.ack_ev = Event()

The two core data items shown in the class initializer are likely to be found in every BFM. Cocotb supports multi-threaded tests using Python co-routines. Our BFM will use the 'busy' lock to ensure that only one Python thread can use the BFM at a time. It will use the 'ack_ev' event to interact with the HDL portion of the BFM.

As I mentioned earlier, there are typically two API layers that we need to define: the API that the user calls, and a lower-level API that is used to control the BFM's HDL code inside the simulation.

Let's start with the user layer, since that's quite simple. The user's test will use the ready/valid initiator BFM to write data to a target ready/valid interface. Let's call the public method 'write_c', to denote that this is a Python co-routine that writes data out from the BFM.

   @cocotb.coroutine
    def write_c(self, data):
        '''
        Writes the specified data word to the interface
        '''
        
        yield self.busy.acquire()
        self._write_req(data)

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()
        self.ack_ev.clear()

        self.busy.release()

Before getting into the implementation details, let's look at the second API layer -- the one that interacts directly with the BFM. The low-level API must be non-blocking in order to work with the full range of simulation and execution environments that must be supported. This means that we need to split the write operation into two pieces: an outbound call to initiate a write, and an inbound call from the BFM to notify that the write is complete.

    @cocotb.bfm_import(cocotb.bfm_uint32_t)
    def _write_req(self, d):
        pass
    
    @cocotb.bfm_export()
    def _write_ack(self):
        self.ack_ev.set()

Note that the low-level API functions are prefixed with '_', denoting that this is an internal API and not intended to be called directly by the user.



HDL BFM

The HDL portion of the BFM also has two aspects. One is synchronous synthesizable code, while the other implements the interface to the synchronous code. Both of these are contained within a Verilog module, shown below:

module rv_data_out_bfm #(
  parameter DATA_WIDTH = 8
) (
  input clock,
  input reset,
  output reg[DATA_WIDTH-1:0] data,
  output reg data_valid,
  input data_ready
);

reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;

This module is instantiated in the HDL testbench and connected to the signals on the appropriate design interface. The data_v and data_valid_v variables are used to interface between the synchronous and control code inside the BFM.

We'll look at the interface code first. The Verilog task below implements the Python _write_reg method shown above.

task _write_req(reg[63:0] d);
begin
data_v = d;
data_valid_v = 1;
end
endtask


Note that the interface task is non-blocking, and simply sets values on variables within the module -- in this case, storing the data to be written and indicating that there is new data to transfer.

The synchronous logic controls the module signals based on the variables set by the interface tasks. This code is shown below:

always @(posedge clock) begin
  if (reset) begin
    data_valid <= 0;
    data <= 0;
  end else begin
    data_valid <= data_valid_v;
    data <= data_v;
          if (data_valid && data_ready) begin
      _write_ack();
      data_valid_v = 0;
    end
  end
end

This synchronous logic propagates the variables that were set in the interface task. When both the data_valid and data_ready signals are high, the _write_ack() task is called to notify the Python environment that the write is completed. At the same time the data_valid_v variable is cleared to terminate the transfer. 

Note that the synchronous logic is likely very similar to the logic within an RTL implementation of a ready/valid initiator. This is an opportunity, since it means that Cocotb BFMs can leverage existing RTL implementations of interface protocols.



Publishing
Now that we have a ready/valid BFM implemented, what can we do with it? Well, in addition to using it to verify our current design, we can also share it with others that also have ready/valid interfaces on their designs. This is Python, after all, and it's very easy to share Python libraries with others via the PyPi repository (https://pypi.org/).

In order to do this, we need to setup a very basic 'setup.py' script in our project directory that identifies the Python package and related data (BFM RTL) that needs to be distributed. After that, it's a simple matter to publish to PyPi such that another project can make use of the BFM simply by adding rv_bfms to that project's requirements.txt file.

Next Steps
Hopefully the description above shows just how simple it is to setup a Python BFM that can interact at the task level with RTL. You can find the full code for this BFM in the rv_bfms Github repository here: https://github.com/pybfms/rv_bfms.

In my next post, we'll take a look at some more-advanced ways to structure BFMs to increase the overall performance even more, and see some ways for BFMs to add more value in debug.

Meanwhile, I'd be interested to hear what protocols are of high interest in your FOSSi (Free and Open-Source Silicon) projects. I'd be especially interested if you'd like to contribute a task-based Python BFM for one or more of those protocols!


Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.


Saturday, November 30, 2019

Adding Task-Based Bus Functional Models to Cocotb



Getting a project started -- even to a certain level of completeness -- is often pretty simple. A couple weekends of hacking often results in pretty good progress and results. Finishing things up, in contrast, is often a slow process. That has certainly been the case with some work I did back in May and June this year to prototype a task-based interface between Python and an HDL simulator. The proof-of-concept work I did at the time seemed quite promising, but the clear next step was to make that work more accessible to others. That "last little bit of work" has certainly turned out to take more time that I had originally assumed!

Just as a reminder, the motivation for interacting with an HDL environment at the task level is quite simple: performance. Communicating across language (especially interpreted language) boundaries tends to be expensive, so minimizing the number of cross-language communications is critical to achieving high throughput. Using a task-based interface between Python and an HDL environment boosts performance in three ways. First, a task-based interface groups data so fewer language-boundary crossings are required to communicate a given amount of data. Secondly, and more importantly, using a task-based interface enables the HDL environment to filter events and only interact with the Python environment when absolutely necessary. Finally, using a task-based interface enables integrations with high(er)-speed environments, such as emulation or the current release of Verilator, where signal-level integration isn't practical.

I started looking at Python as a testbench language for reasons that might initially seem strange: the Python ecosystem (primarily PyPi) that makes it easy to publish bits of library and utility code in a way that it is easily-accessible to others. Often, the ability to take advantage of the work of others is gated by the effort required to gather all the required software dependencies. The Python ecosystem promises to alleviate that challenge, and I was excited to explore the possibilities.

Where Does it Fit?
I'm aware of a few projects that use Python for verification, but it currently appears that Cocotb is the most visible Python-based testbench for doing hardware verification using Python testbench code. Consequently, it made very good sense to see whether the task-based integration I had prototyped could be integrated with the existing Cocotb library.

Cocotb is a Python library that supports light-weight concurrency via co-routines, and provides primitives for coordinating these co-routines with each other and with activity in a HDL simulation environment. In addition to the Python library, Cocotb provides native-compiled C/C++ libraries that integrate with the simulation environment via APIS implemented by the simulator (VPI, DPI, FLI, or VHPI depending on the simulator). Currently, Cocotb interacts with simulation environments at the signal level.

In considering how to add task-based interactions to Cocotb, there were a several requirements that I thought were quite important. First, the user should not be forced to choose between signal-level and task-based interactions between Python and HDL. It should be possible to introduce task-based interactions to a testbench currently interacting at the signal level, or add a few signal-based interactions to a testbench that primarily interacts at the task level. Secondly, a task-based integration must support a range of simulator APIs. I had prototyped a DPI-based integration, which is supported by SystemVerilog simulators, supporting Verilog and VHDL simulators as well was clearly important. Finally, achieving good performance was a key requirement, since performance is the primary motivation for using a task-based interface in the first place.

Task-Based BFM Cocotb Architecture


From a system perspective, the diagram above captures how task-based BFMs integrate with Cocotb. Each BFM instance is represented in the HDL environment by an instance of an HDL module. This module is special, in that it knows how to accept and make task calls and convert between signal-level information and those task calls.

Each BFM also knows how to register itself with a BFM Manager within Cocotb. When the HDL portion of a BFM registers with Cocotb, the BFM Manager creates an instance of a Python class that represents the BFM within the Python environment.

The BFM Manager provides methods to allow the user's test code to query the available BFMs and obtain a handle to the BFM instances required by the test. From there, the user's test simply calls methods on the Python class object and/or receives callbacks.

Task-Based BFM Architecture

Let's take a quick look at the work needed to support a task-based BFM. First off, the BFM author needs to a Python class to implement the Python side of the BFM. That class is decorated with a @cocotb.bfm decorator that associates HDL template files with the BFM class. Below is a BFM for a simple ready/valid protocol.

@cocotb.bfm(hdl={
    cocotb.bfm_vlog : cocotb.bfm_hdl_path(__file__,   "hdl/rv_data_out_bfm.v"),
    cocotb.bfm_sv   : cocotb.bfm_hdl_path(__file__, "hdl/rv_data_out_bfm.v")
})
class ReadyValidDataOutBFM():
    # ...


Next, the BFM author must specify the low-level interaction API with the HDL BFM. All calls must be non-blocking, so most interactions with the HDL environment are implemented as a request/acknowledge pair of API calls.

    @cocotb.bfm_import(cocotb.bfm_uint32_t)
    def write_req(self, d):
        pass
    
    @cocotb.bfm_export()
    def write_ack(self):
        self.ack_ev.set()

Calling a class methods decorated with @cocotb.bfm_import will result in a task call in the HDL BFM. Class methods decorated with @cocotb.bfm_export can be called from the HDL BFM.

Finally, on the Python side, the BFM author will likely provide a convenience API to simplify the testwriter's life:

    @cocotb.coroutine
    def write_c(self, data):
        '''
        Writes the specified data word to the interface
        '''
        
        yield self.busy.acquire()
        self.write_req(data)

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()
        self.ack_ev.clear()

        self.busy.release()


There's one piece left, and that's the HDL BFM. This is specified as a template:

module rv_data_out_bfm #(
parameter DATA_WIDTH = 8
) (
input clock,
input reset,
output reg[DATA_WIDTH-1:0] data,
output reg data_valid,
input data_ready
);
reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;
always @(posedge clock) begin
if (reset) begin
data_valid <= 0;
data <= 0;
end else begin
if (data_valid_v) begin
data_valid <= 1;
data <= data_v;
data_valid_v = 0;
end
if (data_valid && data_ready) begin
write_ack();

if (!data_valid_v) begin
data_valid <= 0;
end
end
end
end
task write_req(reg[63:0] d);
begin
data_v = d;
data_valid_v = 1;
end
endtask

// Auto-generated code to implement the BFM API
${cocotb_bfm_api_impl}

endmodule

The BFM author must implement tasks that will be called from the Python class. Task proxies that will invoke Python methods are implemented by the Cocotb automation, and substituted into the template where the ${cocotb_bfm_api_impl} tag is specified.

Current Integrations
Currently, task-based BFM integrations are implemented for Verilog via the VPI interface and for SystemVerilog via the DPI interface. A VHDL integration isn't currently supported, but that's on the roadmap. One complication with VHDL is that there are actually several interfaces that may need to be supported depending on the simulator -- VHPI, VHPI via VPI, Modelsim FLI. Here I could use some input from the community on priorities, so I'd definitely like to hear from you if you're using Cocotb with VHDL...

Results
As I mentioned at the beginning of this post, performance is the primary reason for using task-based interaction between Python and a HDL environment. So, how much improvement can you expect? Well, that entirely depends on how frequently your tests interact with the HDL environment and, to a certain extent, on how long your tests are. If your testbench needs to interact with the testbench every clock cycle, then you're unlikely to see much benefit. If, however, your testbench spends quite a few cycles waiting for the design to respond, then you're likely to see pretty significant benefits.

I'll use my FWRISC (Featherweight RISC) RISCV core as an example. In this environment, the bulk of the test is actually compiled code that executes on the processor. The Python testbench is primarily responsible for checking results and providing debug information when needed.
A diagram of the simulation-based testbench is shown above. The Tracer BFM is responsible for monitoring execution of the FWRISC core and sending events up to the high-level testbench as needed. These events include:
  • Instruction executed
  • Register written
  • Memory written
I've created two Cocotb implementations of this BFM: one that interacts at the signal level, and one that interacts at the task level. To compare the performance, I'm running a Zephyr test with Icarus Verilog for 10ms of simulation time.

Let's start with as close to a direct comparison as possible. Both the signal-level and task-based BFM will capture the same information and propagate it to the Python testbench.
  • Signal-Level BFM: 85s (wallclock)
  • Task-Level BFM: 33s (wallclock)
Okay, so already we're looking pretty good. This performance increase is simply because the task-based BFM doesn't need to call the Python environment every cycle. 

Another way we can benefit is to use a higher-performance simulator. Icarus Verilog is interpreted, and supports a full event-driven simulation environment. Verilator has a much more restricted set of features (synthesizable Verilog only, limited signal-level access, etc), but is also much faster. It also doesn't currently support the signal-level access to the extent necessary to do a direct comparison between a task-based BFM and a signal-level BFM. So, how do we look here? I actually had to increase the simulation time to 100ms (10x longer) to get a meaningful reading.
  • Task-Level BFM: 18s (wallclock)
So, coupling a fast execution platform with an efficient integration mechanism definitely brings benefits!

Next Steps
So, where do we go from here? Well, please stay tuned for my next blog post to get more details on how create task-based BFMs using these features. I also have an active pull request (#1217) to get this support merged into Cocotb directly. Until then, you can always access the code here



Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.