Showing posts with label HDL. Show all posts

Tuesday, December 31, 2019

2019 - The "Nights and Weekends Projects" Year in Review



It's almost the end of 2019, and I've been thinking back over the year as well as thinking ahead to 2020. In past years, I've often evaluated my "nights and weekends" projects using the same metrics I'm evaluated on at work: projects completed, and results obtained. This year, I've started looking at my "nights and weekends" efforts through a different lens, focused more on the knowledge I've gained than on what I've produced.
As an aside, given the cover image, I do find it somewhat ironic that almost none of the knowledge I gained this year came from printed and bound books. Growing up with a love of libraries, and the fascinating collections of books they contained, it's both sad to think that knowledge is no longer concentrated there, and amazing to realize what a wealth of knowledge is now so easily-accessible just a short search away.

Looking back, there are two themes that run through several areas that I worked in across the year. The first of these is making software more modular, collaborative, and accessible. The second is Python. That's not all, though. So, let's get right to it!

Software Packaging and Distribution
Professionally, I come from a standard commercial-software background, and have often looked at open source through a similar lens. Specifically, I've often focused on software that can be packaged such that it's easily accessible to end users. This means bundling dependencies, providing installers, etc (see DVKit, a 'batteries-included' IDE for verification engineers).

This application-centric approach works well so long as the elements of functionality being distributed are relatively small in number, and the ways in which they need to be combined are fairly limited. This approach breaks down when the elements of functionality are relatively large in number, and need to be combined in many ways. In short, the more modular software becomes, the less feasible typical application-centric packaging becomes.

I've been dabbling for a few years in RTL design and verification. In this space, the verification environment for a given design will depend on many small elements of functionality -- utility libraries, reusable verification IP, etc. Bundling the dependencies with the verification environment quickly leads to projects that require lots of disk space. On the other hand, forcing users to download and install all the dependencies presents a significant barrier to new users.

One of the biggest reasons that I've spent so much time with Python this past year is that the Python ecosystem appears to provide a solution to this challenge of packaging and easily distributing small elements of functionality. Over the course of the year, I've spent time looking at Conda as a way of making application-level features more modular and easily-accessible. I've also spent time learning about how to package Python extension libraries (both with and without native library components) for distribution on PyPi, a repository for distributing Python packages.


New Approaches to Embedded DSLs
I've been involved in several projects over the years that have used C++ to provide a language-like user experience via C++ overloaded operators and macros. While there are certainly downsides to these embedded domain-specific languages in terms of error messaging and extensibility, an embedded domain-specific language can be a great way to prototype a language-based user interface before committing to the work of defining a first-class language and creating the parsing and processing infrastructure. It's also a very helpful approach for exploring new techniques in the context of existing languages.

C++ support for macros and operator overloading has been used for embedded DSLs from the beginning. However, using just these features tends to lead to somewhat awkward syntax, since operator overloading only supports expressions. C++11 (and beyond) brings new features, such as lambda expressions, and I spent time investigating these mechanisms and how they help express more-complex constructs in a more-natural way.

While the new C++11 features definitely showed promise, I started to wonder what support Python provided for implementing embedded domain-specific languages. As it turns out, Python provides some very powerful capabilities. Python supports overloading more operators than C++, and supports introspection into the code described by the user. I definitely intend to revisit embedded domain-specific languages captured in Python in 2020!
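To give a flavor of what operator overloading buys an embedded DSL in Python, here's a minimal sketch. The class names (Expr, Var, Const, BinOp) and the helper wrap() are made up for illustration and don't come from any particular library. The overloaded operators build an expression tree instead of computing a value, which is the usual starting point for a DSL that later hands the tree to some other engine.

```python
class Expr:
    """Base class: operators build a tree instead of computing a value."""

    def __add__(self, other):
        return BinOp("+", self, wrap(other))

    def __lt__(self, other):
        return BinOp("<", self, wrap(other))

    def __and__(self, other):
        return BinOp("&", self, wrap(other))


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)


class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs

    def __str__(self):
        return f"({self.lhs} {self.op} {self.rhs})"


def wrap(value):
    """Promote plain Python values to Const nodes (hypothetical helper)."""
    return value if isinstance(value, Expr) else Const(value)


a, b = Var("a"), Var("b")
constraint = (a + 1 < b) & (b < 10)
print(constraint)  # (((a + 1) < b) & (b < 10))
```

The user writes ordinary-looking Python, and the library ends up holding a data structure it can walk, print, or translate.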

Constraint Solvers
Highly-capable constraint solvers that are available under permissive open-source licenses are becoming widely available, and I'm seeing these solvers applied to a range of interesting tasks. The CRAVE library for generating random stimulus has been around for some time. Several tools are leveraging available SMT solvers for model checking. Constraint solvers are even being applied for graphical layout of diagrams!

Given the range of applications to which solvers lend themselves, I thought it would be worth having a bit more hands-on knowledge. I spent some time learning about the Z3 solver API before concluding that, while the API is elegant and comprehensive, it's also more complicated than what I need. I subsequently shifted to looking at the Boolector solver API, which is smaller and simpler.

The Boolector solver provides a Python binding, which is built along with the solver. This means that a user needs to manually build Boolector in order to use a Python package that uses the Boolector solver. Fortunately, I'd been learning about packaging and distributing Python extension libraries, and this provided a perfect place to try that out. The Boolector Python library (PyBoolector) on PyPi is the result of this work.

Python for Verification
My background in verification is rooted in SystemC, SystemVerilog, and UVM: all very mainstream languages and methodologies in the commercial design and functional verification space. As I spent more time exploring Python and the modular and collaborative packaging it supports, I concluded that it made sense to investigate using Python for functional verification.

I spent time learning about cocotb, the most popular functional verification library in Python that I'm aware of. I also spent time learning about Python's back-end C API and how to structure bus-functional models to integrate at the procedure level with Python.

Actually, the more time I spend looking at Python for verification, the more possibilities I see. Definitely look for more on this topic in 2020!

In most areas, I've been quite happy with Python for verification. The object-oriented language features fit the requirements for high-level verification, and the easy availability of utility packages simplifies dealing with project dependencies. The one thing I've been dissatisfied with is support for static checking. I've used statically-typed languages for most application development. These languages have the advantage that the compiler can identify misuse of types before running the application. Dynamically-typed languages, such as Python and TCL, end up discovering type-misuse issues (e.g., passing an object to a method that expects an object of a different type) at runtime. One target for 2020 is learning more about what can be done to address this issue. Lint tools such as Pylint help, and my hope is to discover more tools and methodologies that help to close this gap.

RTL Design Skills
When I undertook the 2018 RISC-V Soft Core Contest, it had been quite a few years since I'd done any RTL design. Going through the design work for that project helped me brush up my skills quite a bit, but I knew I had quite a ways to go to be proficient. When the 2019 contest, centered around software security, came along, I knew it was a good opportunity to both learn more about software security vulnerabilities and improve my RTL design skills.

In addition to improving my RTL design skills, I learned a couple of things from initially attempting to add a few new features (multiplication, compressed instructions, security extensions) to my 2018 soft core. First, I had succeeded at writing some very good spaghetti RTL that wasn't modular enough to support extensibility. Furthermore, I didn't have sufficient tests to effectively and efficiently catch bugs introduced by adding new features.

Over the course of the 2019 project, I did a complete rewrite of the Featherweight RISC core. The more-modular structure of the rewritten core lends itself even better to bounded model checking, and I found this to be extremely helpful in catching and diagnosing bugs introduced during development and integration.

Going through this process also helped to improve my knowledge of RTL constructs that result in good efficient implementation, and which do not.


Looking Forward
2019 has been a great year for learning about more corners of the technical world. Looking forward to 2020, I see more work with Python, transitioning more of my existing projects over to cloud-based continuous integration, and more work with Python in the functional verification space. What will I learn along the way? Stay tuned for more blog posts across 2020 to find out!

As we come to the end of 2019 and the beginning of a new year (and new decade), I wish you happy holidays, a happy new year, and a 2020 ahead that is full of learning!

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 14, 2019

Writing a Task-Based Cocotb BFM



Background
The purpose of a Bus Functional Model (BFM) is to enable interacting with a design via a given protocol at a higher level of abstraction than the signal-level protocol, while knowing the bare minimum about the details of that protocol. Verification IP goes beyond these benefits to provide test plans, functional coverage, test sequences, and often protocol-specific benefits like compliance test suites.

In order to realize any of these benefits of having a BFM, one first needs to exist. So, let's take a look at what it takes to create a task-based Cocotb BFM for a very simple ready/valid protocol.


The ready/valid protocol is simple, but useful. Data is exchanged between two blocks on a clock edge when both ready and valid are active (high). The initiator controls the valid signal, the target controls the ready signal. That's all there is to the protocol.
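To make the handshake rule concrete, here's a tiny pure-Python model of it. This is only an illustration: the function name and the per-cycle lists are invented for the example, and there's no simulator involved. Each list entry stands for the value sampled on one clock edge.

```python
def ready_valid_transfers(valid, ready, data):
    """Return the data words transferred: one per clock edge where
    both valid and ready are high."""
    return [d for v, r, d in zip(valid, ready, data) if v and r]


# One entry per clock edge
valid = [0, 1, 1, 1, 0, 1]   # driven by the initiator
ready = [1, 0, 1, 1, 1, 1]   # driven by the target
data  = [0, 7, 7, 9, 0, 4]   # driven by the initiator alongside valid

print(ready_valid_transfers(valid, ready, data))  # [7, 9, 4]
```

Note that in cycle 1 the initiator offers the value 7 but the target isn't ready, so the initiator holds the value until cycle 2, when the transfer actually happens.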

Before we get into the details, a few words about the structure of a task-based Cocotb BFM. There are two key components:

  • A Python class that provides the API used by the test writer, and defines the lower-level API used to actually interact with the HDL portion of the BFM
  • An HDL module that performs the conversion between the commands sent from Python and signals, and vice-versa.



Both of these aspects of a task-based BFM are collected together in a Python package that is typically named with the protocol implemented by the BFM (rv, or ready-valid, in this case).

Python API

The Python portion of a task-based Cocotb BFM is captured as a Python class. I use camel-case names for classes, so the ready/valid output (initiator) BFM is ReadyValidDataOutBFM.

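import cocotb
from cocotb.triggers import Event, Lock

class ReadyValidDataOutBFM():

    def __init__(self):
        # Serializes access so only one co-routine drives the BFM at a time
        self.busy = Lock()
        # Set when the HDL side acknowledges a transfer
        self.ack_ev = Event()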

The two core data items shown in the class initializer are likely to be found in every BFM. Cocotb supports multi-threaded tests using Python co-routines. Our BFM will use the 'busy' lock to ensure that only one Python thread can use the BFM at a time. It will use the 'ack_ev' event to interact with the HDL portion of the BFM.

As I mentioned earlier, there are typically two API layers that we need to define: the API that the user calls, and a lower-level API that is used to control the BFM's HDL code inside the simulation.

Let's start with the user layer, since that's quite simple. The user's test will use the ready/valid initiator BFM to write data to a target ready/valid interface. Let's call the public method 'write_c', to denote that this is a Python co-routine that writes data out from the BFM.

    @cocotb.coroutine
    def write_c(self, data):
        '''
        Writes the specified data word to the interface
        '''
        
        yield self.busy.acquire()
        self._write_req(data)

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()
        self.ack_ev.clear()

        self.busy.release()

Before getting into the implementation details, let's look at the second API layer -- the one that interacts directly with the BFM. The low-level API must be non-blocking in order to work with the full range of simulation and execution environments that must be supported. This means that we need to split the write operation into two pieces: an outbound call to initiate a write, and an inbound call from the BFM to notify that the write is complete.

    @cocotb.bfm_import(cocotb.bfm_uint32_t)
    def _write_req(self, d):
        pass
    
    @cocotb.bfm_export()
    def _write_ack(self):
        self.ack_ev.set()

Note that the low-level API functions are prefixed with '_', denoting that this is an internal API and not intended to be called directly by the user.
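If you'd like to experiment with the request/acknowledge split without a simulator, the sketch below mimics it in plain asyncio. Everything here is a stand-in: asyncio replaces Cocotb's scheduler, the class name FakeReadyValidOutBFM is invented, and a short timer plays the role of the HDL side calling back with the acknowledge.

```python
import asyncio


class FakeReadyValidOutBFM:
    """Stand-in BFM: asyncio replaces Cocotb, a timer replaces the HDL side."""

    def __init__(self):
        self.busy = asyncio.Lock()
        self.ack_ev = asyncio.Event()
        self.sent = []

    async def write_c(self, data):
        # Only one caller may use the BFM at a time
        async with self.busy:
            self._write_req(data)
            # Wait for the (simulated) HDL side to acknowledge the transfer
            await self.ack_ev.wait()
            self.ack_ev.clear()

    def _write_req(self, d):
        # Non-blocking: record the request and return immediately.
        # The timer below stands in for the HDL completing the handshake.
        self.sent.append(d)
        asyncio.get_running_loop().call_later(0.01, self._write_ack)

    def _write_ack(self):
        self.ack_ev.set()


async def main():
    bfm = FakeReadyValidOutBFM()
    # Three concurrent writers; the lock serializes them
    await asyncio.gather(bfm.write_c(1), bfm.write_c(2), bfm.write_c(3))
    return bfm.sent


result = asyncio.run(main())
print(result)  # [1, 2, 3]
```

The structure is the same as the Cocotb version: the blocking convenience method is built from a non-blocking request plus an event that the acknowledge callback sets.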



HDL BFM

The HDL portion of the BFM also has two aspects. One is synchronous synthesizable code, while the other implements the interface to the synchronous code. Both of these are contained within a Verilog module, shown below:

module rv_data_out_bfm #(
  parameter DATA_WIDTH = 8
) (
  input clock,
  input reset,
  output reg[DATA_WIDTH-1:0] data,
  output reg data_valid,
  input data_ready
);

reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;

This module is instantiated in the HDL testbench and connected to the signals on the appropriate design interface. The data_v and data_valid_v variables are used to interface between the synchronous and control code inside the BFM.

We'll look at the interface code first. The Verilog task below implements the Python _write_req method shown above.

task _write_req(reg[63:0] d);
begin
data_v = d;
data_valid_v = 1;
end
endtask


Note that the interface task is non-blocking, and simply sets values on variables within the module -- in this case, storing the data to be written and indicating that there is new data to transfer.

The synchronous logic controls the module signals based on the variables set by the interface tasks. This code is shown below:

always @(posedge clock) begin
  if (reset) begin
    data_valid <= 0;
    data <= 0;
  end else begin
    data_valid <= data_valid_v;
    data <= data_v;
    if (data_valid && data_ready) begin
      _write_ack();
      data_valid_v = 0;
    end
  end
end

This synchronous logic propagates the variables that were set in the interface task. When both the data_valid and data_ready signals are high, the _write_ack() task is called to notify the Python environment that the write is completed. At the same time the data_valid_v variable is cleared to terminate the transfer. 

Note that the synchronous logic is likely very similar to the logic within an RTL implementation of a ready/valid initiator. This is an opportunity, since it means that Cocotb BFMs can leverage existing RTL implementations of interface protocols.



Publishing
Now that we have a ready/valid BFM implemented, what can we do with it? Well, in addition to using it to verify our current design, we can also share it with others that also have ready/valid interfaces on their designs. This is Python, after all, and it's very easy to share Python libraries with others via the PyPi repository (https://pypi.org/).

In order to do this, we need to set up a very basic 'setup.py' script in our project directory that identifies the Python package and related data (BFM RTL) that needs to be distributed. After that, it's a simple matter to publish to PyPi such that another project can make use of the BFM simply by adding rv_bfms to that project's requirements.txt file.
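As a rough sketch of what such a script needs to declare, the snippet below collects the arguments in a dictionary. The version string is a placeholder, and I'm assuming the importable package is named rv_bfms and keeps its Verilog under an hdl/ subdirectory; check the actual repository for the real values.

```python
# Sketch of the information a setup.py would pass to setuptools.
# Assumptions: package directory "rv_bfms", HDL sources under "rv_bfms/hdl/".
setup_args = dict(
    name="rv_bfms",
    version="0.0.1",  # placeholder, not the real release number
    packages=["rv_bfms"],
    # Ship the BFM's Verilog files alongside the Python code
    package_data={"rv_bfms": ["hdl/*.v"]},
)

# In an actual setup.py, these would be handed to setuptools:
#   from setuptools import setup
#   setup(**setup_args)
print(sorted(setup_args))
```

The key piece is package_data: without it, only the Python files would be packaged, and the HDL half of the BFM would be missing after installation.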

Next Steps
Hopefully the description above shows just how simple it is to set up a Python BFM that can interact at the task level with RTL. You can find the full code for this BFM in the rv_bfms GitHub repository here: https://github.com/pybfms/rv_bfms.

In my next post, we'll take a look at some more-advanced ways to structure BFMs to increase the overall performance even more, and see some ways for BFMs to add more value in debug.

Meanwhile, I'd be interested to hear what protocols are of high interest in your FOSSi (Free and Open-Source Silicon) projects. I'd be especially interested if you'd like to contribute a task-based Python BFM for one or more of those protocols!


Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.


Saturday, November 30, 2019

Adding Task-Based Bus Functional Models to Cocotb



Getting a project started -- even to a certain level of completeness -- is often pretty simple. A couple weekends of hacking often results in pretty good progress and results. Finishing things up, in contrast, is often a slow process. That has certainly been the case with some work I did back in May and June this year to prototype a task-based interface between Python and an HDL simulator. The proof-of-concept work I did at the time seemed quite promising, but the clear next step was to make that work more accessible to others. That "last little bit of work" has certainly turned out to take more time than I had originally assumed!

Just as a reminder, the motivation for interacting with an HDL environment at the task level is quite simple: performance. Communicating across language (especially interpreted language) boundaries tends to be expensive, so minimizing the number of cross-language communications is critical to achieving high throughput. Using a task-based interface between Python and an HDL environment boosts performance in three ways. First, a task-based interface groups data so fewer language-boundary crossings are required to communicate a given amount of data. Secondly, and more importantly, using a task-based interface enables the HDL environment to filter events and only interact with the Python environment when absolutely necessary. Finally, using a task-based interface enables integrations with high(er)-speed environments, such as emulation or the current release of Verilator, where signal-level integration isn't practical.
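A back-of-the-envelope model shows why the event filtering matters so much. The sketch below is purely illustrative: the function name and the traffic pattern are invented, and it assumes a signal-level integration calls into Python once per clock cycle while a task-level integration needs two crossings per transfer (one request, one acknowledge).

```python
def count_crossings(valid, ready):
    """Return (signal_level, task_level) language-boundary crossings."""
    # Signal level: Python is called on every clock cycle
    signal_level = len(valid)
    # Task level: one request and one acknowledge per completed transfer
    transfers = sum(1 for v, r in zip(valid, ready) if v and r)
    return signal_level, 2 * transfers


# 1000 cycles: the initiator always has data, the target is ready
# only once every 10 cycles
cycles = 1000
valid = [1] * cycles
ready = [1 if c % 10 == 9 else 0 for c in range(cycles)]

print(count_crossings(valid, ready))  # (1000, 200)
```

The slower the target responds, the larger the gap becomes, since the idle cycles cost nothing on the Python side in the task-level case.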

I started looking at Python as a testbench language for reasons that might initially seem strange: the Python ecosystem (primarily PyPi) that makes it easy to publish bits of library and utility code in a way that it is easily-accessible to others. Often, the ability to take advantage of the work of others is gated by the effort required to gather all the required software dependencies. The Python ecosystem promises to alleviate that challenge, and I was excited to explore the possibilities.

Where Does it Fit?
I'm aware of a few projects that use Python for verification, but Cocotb currently appears to be the most visible Python-based library for hardware verification. Consequently, it made very good sense to see whether the task-based integration I had prototyped could be integrated with the existing Cocotb library.

Cocotb is a Python library that supports light-weight concurrency via co-routines, and provides primitives for coordinating these co-routines with each other and with activity in an HDL simulation environment. In addition to the Python library, Cocotb provides native-compiled C/C++ libraries that integrate with the simulation environment via APIs implemented by the simulator (VPI, DPI, FLI, or VHPI depending on the simulator). Currently, Cocotb interacts with simulation environments at the signal level.

In considering how to add task-based interactions to Cocotb, there were several requirements that I thought were quite important. First, the user should not be forced to choose between signal-level and task-based interactions between Python and HDL. It should be possible to introduce task-based interactions to a testbench currently interacting at the signal level, or add a few signal-based interactions to a testbench that primarily interacts at the task level. Secondly, a task-based integration must support a range of simulator APIs. I had prototyped a DPI-based integration, which is supported by SystemVerilog simulators, but supporting Verilog and VHDL simulators as well was clearly important. Finally, achieving good performance was a key requirement, since performance is the primary motivation for using a task-based interface in the first place.

Task-Based BFM Cocotb Architecture


From a system perspective, the diagram above captures how task-based BFMs integrate with Cocotb. Each BFM instance is represented in the HDL environment by an instance of an HDL module. This module is special, in that it knows how to accept and make task calls and convert between signal-level information and those task calls.

Each BFM also knows how to register itself with a BFM Manager within Cocotb. When the HDL portion of a BFM registers with Cocotb, the BFM Manager creates an instance of a Python class that represents the BFM within the Python environment.

The BFM Manager provides methods to allow the user's test code to query the available BFMs and obtain a handle to the BFM instances required by the test. From there, the user's test simply calls methods on the Python class object and/or receives callbacks.

Task-Based BFM Architecture

Let's take a quick look at the work needed to support a task-based BFM. First off, the BFM author needs to write a Python class to implement the Python side of the BFM. That class is decorated with a @cocotb.bfm decorator that associates HDL template files with the BFM class. Below is a BFM for a simple ready/valid protocol.

@cocotb.bfm(hdl={
    cocotb.bfm_vlog : cocotb.bfm_hdl_path(__file__, "hdl/rv_data_out_bfm.v"),
    cocotb.bfm_sv   : cocotb.bfm_hdl_path(__file__, "hdl/rv_data_out_bfm.v")
})
class ReadyValidDataOutBFM():
    # ...


Next, the BFM author must specify the low-level interaction API with the HDL BFM. All calls must be non-blocking, so most interactions with the HDL environment are implemented as a request/acknowledge pair of API calls.

    @cocotb.bfm_import(cocotb.bfm_uint32_t)
    def write_req(self, d):
        pass
    
    @cocotb.bfm_export()
    def write_ack(self):
        self.ack_ev.set()

Calling a class method decorated with @cocotb.bfm_import will result in a task call in the HDL BFM. Class methods decorated with @cocotb.bfm_export can be called from the HDL BFM.

Finally, on the Python side, the BFM author will likely provide a convenience API to simplify the testwriter's life:

    @cocotb.coroutine
    def write_c(self, data):
        '''
        Writes the specified data word to the interface
        '''
        
        yield self.busy.acquire()
        self.write_req(data)

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()
        self.ack_ev.clear()

        self.busy.release()


There's one piece left, and that's the HDL BFM. This is specified as a template:

module rv_data_out_bfm #(
  parameter DATA_WIDTH = 8
) (
  input clock,
  input reset,
  output reg[DATA_WIDTH-1:0] data,
  output reg data_valid,
  input data_ready
);

reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;

always @(posedge clock) begin
  if (reset) begin
    data_valid <= 0;
    data <= 0;
  end else begin
    if (data_valid_v) begin
      data_valid <= 1;
      data <= data_v;
      data_valid_v = 0;
    end
    if (data_valid && data_ready) begin
      write_ack();

      if (!data_valid_v) begin
        data_valid <= 0;
      end
    end
  end
end

task write_req(reg[63:0] d);
begin
  data_v = d;
  data_valid_v = 1;
end
endtask

// Auto-generated code to implement the BFM API
${cocotb_bfm_api_impl}

endmodule

The BFM author must implement tasks that will be called from the Python class. Task proxies that will invoke Python methods are implemented by the Cocotb automation, and substituted into the template where the ${cocotb_bfm_api_impl} tag is specified.
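The ${...} tag syntax happens to match what Python's standard string.Template understands, so the substitution step can be sketched in a few lines. This is an illustration of the general mechanism, not the actual Cocotb automation: the module name demo_bfm and the generated task body are invented for the example.

```python
from string import Template

# A cut-down HDL template containing the substitution tag
hdl_template = Template("""\
module demo_bfm(input clock);

  task write_req(reg[63:0] d);
  begin
  end
  endtask

  // Auto-generated code to implement the BFM API
${cocotb_bfm_api_impl}

endmodule
""")

# Stand-in for the generated proxy task (hypothetical content)
generated = "\n".join([
    "  task write_ack;",
    "  begin",
    "    // the real implementation would notify Python here",
    "  end",
    "  endtask",
])

result = hdl_template.substitute(cocotb_bfm_api_impl=generated)
print(result)
```

One practical caveat with this particular mechanism: string.Template treats every "$" as the start of a placeholder, so a template containing Verilog system tasks such as $display would need those written as $$display.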

Current Integrations
Currently, task-based BFM integrations are implemented for Verilog via the VPI interface and for SystemVerilog via the DPI interface. A VHDL integration isn't currently supported, but that's on the roadmap. One complication with VHDL is that there are actually several interfaces that may need to be supported depending on the simulator -- VHPI, VHPI via VPI, Modelsim FLI. Here I could use some input from the community on priorities, so I'd definitely like to hear from you if you're using Cocotb with VHDL...

Results
As I mentioned at the beginning of this post, performance is the primary reason for using task-based interaction between Python and an HDL environment. So, how much improvement can you expect? Well, that entirely depends on how frequently your tests interact with the HDL environment and, to a certain extent, on how long your tests are. If your testbench needs to interact with the HDL environment every clock cycle, then you're unlikely to see much benefit. If, however, your testbench spends quite a few cycles waiting for the design to respond, then you're likely to see pretty significant benefits.

I'll use my FWRISC (Featherweight RISC) RISC-V core as an example. In this environment, the bulk of the test is actually compiled code that executes on the processor. The Python testbench is primarily responsible for checking results and providing debug information when needed.
A diagram of the simulation-based testbench is shown above. The Tracer BFM is responsible for monitoring execution of the FWRISC core and sending events up to the high-level testbench as needed. These events include:
  • Instruction executed
  • Register written
  • Memory written
I've created two Cocotb implementations of this BFM: one that interacts at the signal level, and one that interacts at the task level. To compare the performance, I'm running a Zephyr test with Icarus Verilog for 10ms of simulation time.

Let's start with as close to a direct comparison as possible. Both the signal-level and task-based BFM will capture the same information and propagate it to the Python testbench.
  • Signal-Level BFM: 85s (wallclock)
  • Task-Level BFM: 33s (wallclock)
Okay, so already we're looking pretty good. This performance increase is simply because the task-based BFM doesn't need to call the Python environment every cycle. 

Another way we can benefit is to use a higher-performance simulator. Icarus Verilog is interpreted, and supports a full event-driven simulation environment. Verilator has a much more restricted set of features (synthesizable Verilog only, limited signal-level access, etc), but is also much faster. It also doesn't currently support the signal-level access to the extent necessary to do a direct comparison between a task-based BFM and a signal-level BFM. So, how do we look here? I actually had to increase the simulation time to 100ms (10x longer) to get a meaningful reading.
  • Task-Level BFM: 18s (wallclock)
So, coupling a fast execution platform with an efficient integration mechanism definitely brings benefits!
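Since the Verilator run covers 10x more simulated time, it helps to normalize the three measurements to wallclock seconds per simulated millisecond. The snippet below just does that arithmetic on the numbers reported above. It assumes wallclock time scales roughly linearly with simulated time and ignores startup cost, so treat the ratios as ballpark figures.

```python
# (wallclock seconds, simulated milliseconds) from the results above
runs = {
    "Icarus, signal-level BFM": (85.0, 10.0),
    "Icarus, task-level BFM": (33.0, 10.0),
    "Verilator, task-level BFM": (18.0, 100.0),
}

# Wallclock seconds needed per simulated millisecond
cost = {name: wall / sim for name, (wall, sim) in runs.items()}
for name, c in cost.items():
    print(f"{name}: {c:.2f} s per simulated ms")

task_vs_signal = cost["Icarus, signal-level BFM"] / cost["Icarus, task-level BFM"]
verilator_vs_icarus = cost["Icarus, task-level BFM"] / cost["Verilator, task-level BFM"]
print(f"Task-level vs signal-level (Icarus): {task_vs_signal:.1f}x")
print(f"Verilator vs Icarus (task-level): {verilator_vs_icarus:.1f}x")
```

On these numbers, the task-level BFM is about 2.6x faster than the signal-level one on Icarus, and moving the task-level BFM to Verilator gains roughly another 18x.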

Next Steps
So, where do we go from here? Well, please stay tuned for my next blog post to get more details on how to create task-based BFMs using these features. I also have an active pull request (#1217) to get this support merged into Cocotb directly. Until then, you can always access the code here



Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Sunday, June 16, 2019

Py-HPI: Applying Python for Verification



Intro
In my last post, I talked about a prototype procedural interface between Python and HDL that enables cross-calling between Python and SystemVerilog. My primary motivation for investigating a procedural interface was its potential to maximize performance. In this post, I create a Python testbench for a small IP and compare it to the equivalent C++ testbench. I also look at the performance of Python for verification.

Creating a Python Testbench
My go-to IP for trying out new verification techniques is a small 32-bit RISC-V core named Featherweight RISC (FWRISC) that I created for a design contest last year. The original testbench was written in C++, so that will be my baseline for comparison. If you're interested in the structure of the testbench, have a look at this post.

Since I was keeping the testbench structure the same, I didn't expect much in terms of a reduction in lines of code. C++ is a bit verbose, in that it expects a header and implementation file for each class. This contributes to the fact that each C++ test is roughly twice as long as each Python test:

  • C++ Test: 328 lines
  • Python Test: 139 lines
Reducing the lines of code is a good thing, since more code statistically means more bugs, and spending time finding and fixing testbench bugs doesn't help us get our design verified. But, that's just the start.

The unit tests for FWRISC are all self-checking. This means that each assembly file contains the expected value for registers modified by the test. You can see the data embedded below between the start_expected and end_expected labels.


entry:
li x1, 5
add x3, x1, 6
j done
// Expected value for registers
start_expected:
.word 1, 5
.word 3, 11
end_expected:

Because I didn't want to need to install an ELF-reading library on every machine where I wanted to run the FWRISC regression, I wrote my own small ELF-reading classes for the FWRISC testbench. This amounted to ~400 lines of code, and required a certain amount of thought and effort.

When I started writing the Python testbench, I thought about writing another ELF-reader in Python based on the code I'd written in C++... But then I realized that there was already a Python library for doing this called pyelftools. All I needed to do was get it installed in my environment (more on that in a future post), and call the API:

from elftools.elf.elffile import ELFFile

with open(sw_image, "rb") as f:
    elffile = ELFFile(f)

    # Look up the addresses of the labels that bracket the expected-value table
    symtab = elffile.get_section_by_name('.symtab')
    start_expected = symtab.get_symbol_by_name("start_expected")[0]["st_value"]
    end_expected = symtab.get_symbol_by_name("end_expected")[0]["st_value"]

    # Find the section containing the table, and convert addresses to section offsets
    section = None
    for i in range(elffile.num_sections()):
        shdr = elffile._get_section_header(i)
        if (start_expected >= shdr['sh_addr']) and (end_expected <= (shdr['sh_addr'] + shdr['sh_size'])):
            start_expected -= shdr['sh_addr']
            end_expected -= shdr['sh_addr']
            section = elffile.get_section(i)
            break

    data = section.data()
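Once the section bytes and the two offsets are in hand, turning them into register/value pairs is short work with the standard struct module. The sketch below is an illustration rather than the actual testbench code: the function name decode_expected is invented, and it assumes each .word is a 32-bit little-endian value laid out as (register number, expected value) pairs, as in the assembly shown earlier.

```python
import struct


def decode_expected(data, start_expected, end_expected):
    """Return {register_number: expected_value} for the byte range
    between the start_expected and end_expected offsets."""
    table = data[start_expected:end_expected]
    # Each entry is two little-endian 32-bit words: register, value
    return {reg: val for reg, val in struct.iter_unpack("<II", table)}


# Stand-in bytes for ".word 1, 5" followed by ".word 3, 11"
demo = struct.pack("<IIII", 1, 5, 3, 11)
print(decode_expected(demo, 0, len(demo)))  # {1: 5, 3: 11}
```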

That's a pretty significant savings both in terms of code, and in terms of development and debug effort! So, definitely my Python testbench is looking pretty good in terms of productivity. But, what about performance?

Evaluating Performance
Testbench performance may not be the most important factor when evaluating a language for use in verification. In general, the time an engineer takes to develop, debug, and maintain a verification environment is far more expensive than the compute time taken to execute tests. That said, understanding the performance characteristics of any language enables us to make smarter tradeoffs in how we use the language.


I was fortunate enough to see David Patterson deliver his keynote A New Golden Age for Computer Architecture around a year ago at DAC 2018. The slide above comes from that presentation, and compares the performance of a variety of implementations of the computationally-intensive matrix multiply operation. As you can see from the slide, a C implementation is 50x faster than a Python implementation. Based on this slide and the anecdotal evidence of others, my pre-existing expectations were somewhat low when it came to Python performance. But, of course, having concrete data specific to functional verification is far more useful than a few anecdotes and rumors.

Spoiler alert: C++ is definitively faster than Python.

As with most languages, there are two aspects of performance to consider with Python: startup time and steady-state performance. Most of the FWRISC tests are quite short -- in fact, the suite of unit tests contains tests that execute fewer than 10 instructions. This gives us a good way to evaluate the startup overhead of Python. In order to evaluate the steady-state performance, I created a program that ran a tight loop with 10,000,000 instructions. The performance numbers below all come from Verilator-based simulations.

Startup Overhead
As I noted above, I evaluated the startup overhead of Python using the unit test suite. This suite contains 66 very short tests. 

  • C++ Testbench: 7s
  • Python Testbench: 18s
Based on the numbers above, Python does impose a noticeable overhead on the test suite -- it takes ~2.5x longer to run the suite with Python vs C++. That said, 18 seconds is still very reasonable to run a suite of smoke tests.

Steady-State Overhead
To evaluate the steady-state overhead of a Python testbench, I ran a long-loop test that ran a total of 10,000,000 instructions.

  • C++ Testbench: 11.6s
  • Python Testbench: 109.7s
Okay, this doesn't look so good. Our C++ testbench is roughly 9.5x faster than our Python testbench. What do we do about this?
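To put the two measurements side by side, here is a small calculation that uses only the timings quoted above. The per-test and per-instruction figures are rough derivations of my own. They assume the extra Python time is spread evenly across the 66 tests and the 10,000,000 instructions, which is a simplification.

```python
# Timings quoted in the post (seconds)
unit_cpp, unit_py = 7.0, 18.0        # 66-test unit suite
loop_cpp, loop_py = 11.6, 109.7      # 10,000,000-instruction loop

NUM_TESTS = 66
NUM_INSTRUCTIONS = 10_000_000

# Slowdown of the Python testbench relative to the C++ testbench
startup_ratio = unit_py / unit_cpp
steady_ratio = loop_py / loop_cpp

# Assumption: the extra time is spread evenly over tests / instructions
extra_per_test_s = (unit_py - unit_cpp) / NUM_TESTS
extra_per_instr_us = (loop_py - loop_cpp) / NUM_INSTRUCTIONS * 1e6

print(f"startup slowdown:           {startup_ratio:.2f}x")
print(f"steady-state slowdown:      {steady_ratio:.2f}x")
print(f"extra time per test:        {extra_per_test_s:.3f} s")
print(f"extra time per instruction: {extra_per_instr_us:.2f} us")
```

Under that assumption, each traced instruction costs on the order of 10 microseconds of extra time in the Python testbench, so the number of calls into Python is the obvious thing to look at.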

Adapting to Python's Performance
Initially, the FWRISC testbench didn't worry much about interaction between the design and testbench. The fwrisc_tracer BFM called the testbench on each executed instruction, register write, and memory access. This was, of course, simple. But, was it really necessary?

Actually, in most cases, the testbench only needs to be aware of the results of a simulation, or key events across the simulation. Given the cost of calling Python, I made a few optimizations to the frequency of events sent to the testbench:

  • Maintain the register state in the tracer BFM, instead of calling the testbench every time a write occurs. The testbench can read back the register state at the end of the test as needed.
  • Notify the testbench when a long-jump or jump-link instruction occurs, instead of on every instruction. This allows the testbench to detect end-of-test conditions and minimizes the frequency of calls.
With these two enhancements to both the C++ and Python testbenches, I re-ran the long-loop test and got new results:

  • C++ Testbench: 4s
  • Python Testbench: 5s
Notice that the C++ results have improved as well. My interpretation of these results is that most of the time is now spent by Verilator in simulating the design, and the results are more-or-less identical.
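The event-filtering idea behind these two optimizations can be illustrated with a small, self-contained Python model. This is only a sketch of the idea, not the FWRISC tracer BFM itself (which lives on the HDL side). The class FilteringTracer, its method names, and the on_jump callback are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict


class FilteringTracer:
    """Toy model of a tracer that filters events before calling the testbench.

    Hypothetical names throughout; the real tracer BFM is HDL code.
    """

    def __init__(self, on_jump: Callable[[int], None]):
        self._on_jump = on_jump            # testbench callback (the expensive call)
        self._regs: Dict[int, int] = {}    # register state kept locally
        self.testbench_calls = 0

    def reg_write(self, reg: int, value: int) -> None:
        # Optimization 1: record the write locally, do not call the testbench
        self._regs[reg] = value

    def instr_executed(self, pc: int, is_jump: bool) -> None:
        # Optimization 2: only jump instructions are reported to the testbench
        if is_jump:
            self.testbench_calls += 1
            self._on_jump(pc)

    def read_regs(self) -> Dict[int, int]:
        # The testbench reads back the final register state at end of test
        return dict(self._regs)


jumps_seen = []
tracer = FilteringTracer(on_jump=jumps_seen.append)

# Simulate 1,000 instructions; every 100th one is a jump
for i in range(1000):
    tracer.reg_write(reg=i % 32, value=i)
    tracer.instr_executed(pc=4 * i, is_jump=(i % 100 == 99))

final_regs = tracer.read_regs()
print(tracer.testbench_calls, "testbench calls for 1000 instructions")
```

In this toy run, the callback fires only for the jump instructions, while the register state is still fully available at the end of the test.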

Conclusions
The Python ecosystem brings definite benefits when applying Python for functional verification. The existing ecosystem of available libraries, and the infrastructure to easily access them, simplifies the effort needed to reuse existing code. It also minimizes the burden placed on users that want to try out an open source project that uses Python for verification.

Using Python does come with performance overhead. This means that it's more important to consider how the execution of the testbench relates to execution of the design. A testbench that interacts with the design frequently (e.g., every clock) will impose much greater overhead than a testbench that interacts with the design every 100 or 1000 cycles. There are typically many optimization opportunities that minimize the performance overhead of a Python testbench, while not adversely impacting verification results.

It's important to remember that engineer time is much more expensive than compute time, so making engineers more productive wins every time. So, from my perspective, the real question isn't whether C++ is faster than Python. The real questions are whether Python is sufficiently fast to be useful, and whether there are reasonable approaches to dealing with the performance bottlenecks. Based on my experience, the answer is a resounding Yes. 

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, June 8, 2019

Py-HPI: A Procedural HDL/Python Integration



As I mentioned in my last post, I've been looking at using Python for more tasks, including functional verification. My go-to languages for functional verification have traditionally been SystemVerilog for professional work, and C++ when I'm working on a personal project. I've started doing more of my small-application development in Python (often as an alternative to C++), and have wondered whether I could also migrate my testbench development from C++ to Python as well.

This blog post provides an introduction to Py-HPI (Python HDL Procedural Interface), an integration I created between Python and a hardware description language (HDL) simulation environment. I'm far from the first to create an integration between Python and an HDL simulator (I'm aware of at least one formal project, and several other users who have written about their integration work), so what is different about Py-HPI?

Well, two things, in my opinion:
  • Py-HPI integrates at the procedural level, which means Python can directly call tasks in the HDL environment instead of interacting with signals in the HDL environment. 
  • Py-HPI provides a high degree of automation for setting up this procedural-level integration.
In this blog post, I will be describing the user experience in using Py-HPI. In future blog posts, I'll walk through how Py-HPI integrates on my go-to project for playing with verification technologies, and I'll go more in-depth on how Bus Functional Models (BFMs) and testbench environments are developed for Py-HPI.

Py-HPI: The Big Picture


The structure of a Py-HPI enabled testbench is shown above. The key elements are described below:
  • Testbench (Python) -- This is Python code the user writes to interact with the design running within the HDL simulation environment.
  • Simulator Support -- This is C/C++ code generated by Py-HPI that implements the integration with a specific type of simulator. In general, this code is independent of the specific testbench.
  • Testbench Wrapper -- This is C code generated by Py-HPI that implements the testbench specifics of the integration between Python and the HDL environment.
  • Bus Functional Models (BFMs) -- BFMs written in HDL (e.g., SystemVerilog) implement the translation between task calls and signal activity and vice versa.
Currently, Py-HPI supports standard SystemVerilog-DPI simulators (eg Modelsim) as well as Verilator. More integrations are planned, including support for Verilog simulators like Icarus Verilog.
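To make the BFM's role a little more concrete, here is a toy Python model of what "translating a task call into signal activity" means. It is purely illustrative: real BFMs are written in HDL, and the valid/ready handshake, the function name xfer_to_signals, and the ready_after parameter are all assumptions of mine rather than anything taken from Py-HPI.

```python
from typing import Dict, List


def xfer_to_signals(data: int, ready_after: int = 2) -> List[Dict[str, int]]:
    """Expand one xfer(data) task call into per-clock signal values.

    Assumed protocol (illustrative only): the BFM holds `valid` and `data`
    until the design asserts `ready`, then deasserts `valid`.
    """
    cycles = []
    for clk in range(ready_after + 1):
        ready = 1 if clk == ready_after else 0
        cycles.append({"valid": 1, "data": data, "ready": ready})
    # One idle cycle after the transfer completes
    cycles.append({"valid": 0, "data": 0, "ready": 0})
    return cycles


trace = xfer_to_signals(0x2A)
for clk, sig in enumerate(trace):
    print(clk, sig)
```

The point of the sketch is that the testbench makes one call, and the BFM is responsible for the several clock cycles of signal activity that the call implies.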

Py-HPI: A Small Example


One easy way to get a sense for the user experience when using Py-HPI is to walk through the steps to run a very simple testbench environment. One of the Py-HPI examples provides just such a testbench.
The structure of this testbench environment is shown above. The Python portion of the testbench drives the SystemVerilog HDL testbench via two bus functional models that are instanced in the SystemVerilog environment.

Python Testbench

First, let's take a look at the Python testbench code, which you can find here:
import hpi

def thread_func_1():
    print("thread_func_1")
    # First BFM instance registered with Py-HPI
    my_bfm = hpi.rgy.bfm_list[0]
    for i in range(1000):
        my_bfm.xfer(i*2)

def thread_func_2():
    print("thread_func_2")
    # Second BFM instance registered with Py-HPI
    my_bfm = hpi.rgy.bfm_list[1]
    for i in range(1000):
        my_bfm.xfer(i)

@hpi.entry
def run_my_tb():
    print("run_my_tb - bfms: " + str(len(hpi.rgy.bfm_list)))

    # Start both threads, then wait for them to complete
    with hpi.fork() as f:
        f.task(lambda: thread_func_1())
        f.task(lambda: thread_func_2())

    print("end of run_my_tb")
Execution starts in the run_my_tb() function, which is marked with the hpi.entry decorator to identify it as a valid entry point. It starts two threads and waits for them to complete. Each of the thread functions (thread_func_1 and thread_func_2) obtains a handle to one of the BFM instances and calls the BFM's API to perform data transfers in the SystemVerilog testbench environment.
It's almost identical to what I would write in either C++ or SystemVerilog and, from my perspective, that's kind of the point.
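For readers unfamiliar with the fork/join pattern, the sketch below shows the semantics of the with-block using only the Python standard library. It is not Py-HPI's implementation, which may well schedule its threads differently in cooperation with the simulator. The _Fork class, the fork() helper, and the worker function here are hypothetical stand-ins that simply use OS threads.

```python
import threading
from contextlib import contextmanager


class _Fork:
    """Hypothetical stand-in for the object yielded by a fork() context."""

    def __init__(self):
        self._threads = []

    def task(self, fn):
        # Start the task immediately in its own thread
        t = threading.Thread(target=fn)
        self._threads.append(t)
        t.start()

    def join_all(self):
        for t in self._threads:
            t.join()


@contextmanager
def fork():
    f = _Fork()
    try:
        yield f
    finally:
        # Leaving the 'with' block waits for every started task
        f.join_all()


results = []
lock = threading.Lock()


def worker(name, count):
    for i in range(count):
        with lock:
            results.append((name, i))


with fork() as f:
    f.task(lambda: worker("a", 3))
    f.task(lambda: worker("b", 3))

# Both workers are guaranteed to have finished at this point
print(len(results), "items recorded after join")
```

The design choice worth noting is that the join happens implicitly at the end of the with-block, so the code after the block can rely on all forked work being done.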

Running the Testbench

Okay, now that we know what the Python side of the testbench looks like, let's see the commands used to create and compile the files necessary to run a simulation. These commands are in the runit_vl.sh script inside the example directory. In this case, I'll show the commands required to run Py-HPI with the Verilator simulator. The example also provides a script (runit_ms.vl) that runs the same example with Modelsim.

Create the Simulation Support Files

We first need to create the simulation-support files. Since we're targeting the Verilator simulator, we need to run the 'gen-launcher-vl' subcommand implemented by the Py-HPI library.
python3 -m hpi gen-launcher-vl top -clk clk=1ns
Verilator is a bit of an outlier, in that the simulation-support files are specific to the HDL design being simulated. Consequently, we need to specify the name of the top Verilog module and the clock name and period.

Create the Testbench Wrapper

Now, we need to create the Testbench wrapper file that will support the specific BFMs instantiated inside the testbench. 
python3 -m hpi -m my_tb gen-bfm-wrapper simple_bfm -type sv-dpi
python3 -m hpi -m my_tb gen-dpi

Because the Verilator simulator supports DPI, we generate a DPI-based testbench wrapper for our testbench, which uses a single BFM type (simple_bfm, instantiated twice). The resulting testbench wrapper is implemented in C and provides the connection between SystemVerilog and Python for our BFM.

Compile Everything

This step is very specific to the simulator being used. 
# Query required compilation/linker flags from Python
CFLAGS="${CFLAGS} `python3-config --cflags`"
LDFLAGS="${LDFLAGS} `python3-config --ldflags`"

verilator --cc --exe -Wno-fatal --trace \
 top.sv simple_bfm.sv \
 launcher_vl.cpp pyhpi_dpi.c \
 -CFLAGS "${CFLAGS}" -LDFLAGS "${LDFLAGS}"

make -C obj_dir -f Vtop.mk
Since we're using Verilator, we need to run Verilator to compile the HDL files and the simulator-support and testbench wrapper C/C++ files. Verilator generates C++ source and a Makefile to build the final simulator image. Our last step is to build the Verilator simulation image using the Verilator-created Makefile.
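If you want to check what the python3-config queries in the script resolve to on your machine, the standard-library sysconfig module exposes similar information from inside Python. This is a minimal sketch for inspection only and is not part of the Py-HPI flow. The mapping to the python3-config output is approximate.

```python
import sysconfig

# Directory containing Python.h (roughly the -I part of `python3-config --cflags`)
include_dir = sysconfig.get_paths()["include"]

# Library directory and library name (roughly what --ldflags points the linker at);
# either may be None on some builds, for example static or non-Unix ones
lib_dir = sysconfig.get_config_var("LIBDIR")
lib_name = sysconfig.get_config_var("LDLIBRARY")

print("include dir:", include_dir)
print("library dir:", lib_dir)
print("library    :", lib_name)
```

One related caveat: on Python 3.8 and newer, python3-config only emits the -lpython flag when the --embed option is also passed, so the LDFLAGS query above may need that option when linking fails with unresolved Python symbols.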

Run it!

Finally, we can run our simulation.
./obj_dir/Vtop +hpi.load=my_tb +vl.timeout=1ms +vl.trace
We pass a few additional plusargs to enable specific behavior:

  • The +hpi.load=my_tb argument specifies the Python module to load.
  • The +vl.timeout=1ms argument specifies that the simulation should run for a maximum of 1ms. Other simulators will, of course, provide different mechanisms for doing this.
  • The +vl.trace argument specifies that waveforms should be created. Other simulators will provide different ways of turning on tracing.
So, all in all, Py-HPI makes it quite easy to connect a Python testbench to an HDL simulator at the procedural level.
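As a side note on the plusarg convention used above, the sketch below shows one way such arguments could be split into keys and values. The parse_plusargs helper is hypothetical and written only for illustration. It is not how Py-HPI or Verilator actually process their command lines.

```python
from typing import Dict, List


def parse_plusargs(argv: List[str]) -> Dict[str, str]:
    """Collect +key=value and bare +key arguments into a dict.

    Bare flags (such as +vl.trace) map to an empty string.
    """
    plusargs = {}
    for arg in argv:
        if not arg.startswith("+"):
            continue  # not a plusarg; ignore it
        key, _, value = arg[1:].partition("=")
        plusargs[key] = value
    return plusargs


args = parse_plusargs(
    ["./obj_dir/Vtop", "+hpi.load=my_tb", "+vl.timeout=1ms", "+vl.trace"]
)
print(args)
```

Here the executable name is skipped, the two key=value plusargs are split on the first equals sign, and the bare +vl.trace flag is recorded with an empty value.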

Conclusion

In this blog post, I introduced Py-HPI, a procedural interface between Python and an HDL testbench environment along with an overview of the user experience when creating and running a testbench with Py-HPI. In my next post, I'll look at a Py-HPI testbench for my FWRISC RISC-V core and compare the new Python testbench with the existing C++ testbench. Until then, feel free to check out the Py-HPI library on GitHub (https://github.com/fvutils/py-hpi) and I'd be interested to hear your experiences in using Python for functional verification.

