Sunday, June 16, 2019

Py-HPI: Applying Python for Verification



Intro
In my last post, I talked about a prototype procedural interface between Python and HDL that enables cross-calling between Python and SystemVerilog. My primary motivation for investigating a procedural interface was its potential to maximize performance. In this post, I create a Python testbench for a small IP and compare it to the equivalent C++ testbench. I also look at the performance of Python for verification.

Creating a Python Testbench
My go-to IP for trying out new verification techniques is a small 32-bit RISC-V core named Featherweight RISC (FWRISC) that I created for a design contest last year. The original testbench was written in C++, so that will be my baseline for comparison. If you're interested in the structure of the testbench, have a look at this post.

Since I kept the testbench structure the same, I didn't expect a dramatic reduction in lines of code. C++ is a bit verbose, in that it expects a header and an implementation file for each class, and this contributes to each C++ test being roughly twice as long as its Python equivalent:

  • C++ Test: 328 lines
  • Python Test: 139 lines
Reducing the lines of code is a good thing, since more code statistically means more bugs, and spending time finding and fixing testbench bugs doesn't help us get our design verified. But, that's just the start.

The unit tests for FWRISC are all self-checking. This means that each assembly file contains the expected value for registers modified by the test. You can see the data embedded below between the start_expected and end_expected labels.


entry:
    li x1, 5
    add x3, x1, 6
    j done
// Expected value for registers
start_expected:
    .word 1, 5
    .word 3, 11
end_expected:

Because I didn't want to have to install an ELF-reading library on every machine where I wanted to run the FWRISC regression, I wrote my own small ELF-reading classes for the FWRISC testbench. This amounted to ~400 lines of code, and required a certain amount of thought and effort.

When I started writing the Python testbench, I thought about writing another ELF-reader in Python based on the code I'd written in C++... But then I realized that there was already a Python library for doing this called pyelftools. All I needed to do was get it installed in my environment (more on that in a future post), and call the API:

from elftools.elf.elffile import ELFFile

with open(sw_image, "rb") as f:
    elffile = ELFFile(f)
    symtab = elffile.get_section_by_name('.symtab')

    # Addresses of the labels that bracket the expected-value data
    start_expected = symtab.get_symbol_by_name("start_expected")[0]["st_value"]
    end_expected = symtab.get_symbol_by_name("end_expected")[0]["st_value"]

    # Find the section containing the expected-value data and convert
    # the label addresses to offsets within that section
    section = None
    for i in range(elffile.num_sections()):
        shdr = elffile._get_section_header(i)
        if (start_expected >= shdr['sh_addr']) and (end_expected <= (shdr['sh_addr'] + shdr['sh_size'])):
            start_expected -= shdr['sh_addr']
            end_expected -= shdr['sh_addr']
            section = elffile.get_section(i)
            break

    data = section.data()
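
From there, decoding the expected register values is straightforward. Here's a rough sketch (not taken from the actual testbench) of how the (register, value) pairs could be unpacked from the section data, assuming each entry is a pair of little-endian 32-bit words as written by the .word directives above:

import struct

# Each entry between start_expected and end_expected is two little-endian
# 32-bit words: the register number and its expected value
expected = []
for off in range(start_expected, end_expected, 8):
    reg, value = struct.unpack_from("<II", data, off)
    expected.append((reg, value))

for reg, value in expected:
    print("expect x%d == 0x%08x" % (reg, value))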

That's a pretty significant savings both in terms of code, and in terms of development and debug effort! So, definitely my Python testbench is looking pretty good in terms of productivity. But, what about performance?

Evaluating Performance
Testbench performance may not be the most important factor when evaluating a language for use in verification. In general, the time an engineer takes to develop, debug, and maintain a verification environment is far more expensive than the compute time taken to execute tests. That said, understanding the performance characteristics of any language enables us to make smarter tradeoffs in how we use it.


I was fortunate enough to see David Patterson deliver his keynote A New Golden Age for Computer Architecture around a year ago at DAC 2018. The slide above comes from that presentation, and compares the performance of a variety of implementations of the computationally-intensive matrix multiply operation. As you can see from the slide, a C implementation is 50x faster than a Python implementation. Based on this slide and the anecdotal evidence of others, my pre-existing expectations were somewhat low when it came to Python performance. But, of course, having concrete data specific to functional verification is far more useful than a few anecdotes and rumors.

Spoiler alert: C++ is definitively faster than Python.

As with most languages, there are two aspects of performance to consider with Python: startup time and steady-state performance. Most of the FWRISC tests are quite short -- in fact, the suite of unit tests contains tests that execute fewer than 10 instructions. This gives us a good way to evaluate the startup overhead of Python. To evaluate steady-state performance, I created a program that runs a tight loop for a total of 10,000,000 instructions. The performance numbers below all come from Verilator-based simulations.

Startup Overhead
As I noted above, I evaluated the startup overhead of Python using the unit test suite. This suite contains 66 very short tests. 

  • C++ Testbench: 7s
  • Python Testbench: 18s
Based on the numbers above, Python does impose a noticeable overhead on the test suite -- it takes ~2.5x longer to run the suite with Python vs C++. That said, 18 seconds is still very reasonable to run a suite of smoke tests.

Steady-State Overhead
To evaluate the steady-state overhead of a Python testbench, I ran a long-loop test that ran a total of 10,000,000 instructions.

  • C++ Testbench: 11.6s
  • Python Testbench: 109.7s
Okay, this doesn't look so good. Our C++ testbench is 9.45x faster than our Python testbench. What do we do about this?

Adapting to Python's Performance
Initially, the FWRISC testbench didn't worry much about how often the design and the testbench interacted. The fwrisc_tracer BFM called the testbench on each executed instruction, register write, and memory access. This was, of course, simple. But, was it really necessary?

Actually, in most cases, the testbench only needs to be aware of the results of a simulation, or key events across the simulation. Given the cost of calling Python, I made a few optimizations to the frequency of events sent to the testbench:

  • Maintain the register state in the tracer BFM, instead of calling the testbench every time a write occurs. The testbench can read back the register state at the end of the test as needed.
  • Notify the testbench when a long-jump or jump-link instruction occurs, instead of on every instruction. This allows the testbench to detect end-of-test conditions while minimizing the frequency of calls into Python (a rough sketch of this style is shown below).
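
To make this concrete, here's a rough sketch of what the reduced-interaction style might look like on the Python side. The names used here (on_jump, read_registers, and so on) are purely illustrative and are not the actual FWRISC testbench or tracer-BFM API:

class FwriscRegressionTest(object):
    """End-of-test checking driven by infrequent BFM callbacks (illustrative)."""

    def __init__(self, tracer_bfm, expected, end_of_test_addr):
        self.tracer = tracer_bfm                  # handle to the tracer BFM (hypothetical API)
        self.expected = expected                  # (reg, value) pairs read from the ELF file
        self.end_of_test_addr = end_of_test_addr

    def on_jump(self, pc):
        # Called only for long-jump/jump-link instructions, not on every instruction
        if pc == self.end_of_test_addr:
            self.check_results()

    def check_results(self):
        # Read back the register state maintained by the tracer BFM
        regs = self.tracer.read_registers()
        for reg, value in self.expected:
            assert regs[reg] == value, \
                "x%d: expected 0x%08x, actual 0x%08x" % (reg, value, regs[reg])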
With these two enhancements to both the C++ and Python testbenches, I re-ran the long-loop test and got new results:

  • C++ Testbench: 4s
  • Python Testbench: 5s
Notice that the C++ results have improved as well. My interpretation of these results is that most of the time is now spent by Verilator in simulating the design, and the results are more-or-less identical.

Conclusions
The Python ecosystem brings definite benefits when applying Python for functional verification. The existing ecosystem of available libraries, and the infrastructure to easily access them, simplifies the effort needed to reuse existing code. It also minimizes the burden placed on users that want to try out an open source project that uses Python for verification.

Using Python does come with performance overhead. This means that it's more important to consider how the execution of the testbench relates to execution of the design. A testbench that interacts with the design frequently (eg every clock) will impose much greater overhead compared to a testbench that interacts with the design every 100 or 1000 cycles. There are typically many optimization opportunities that minimize the performance overhead of a Python testbench, while not adversely impacting verification results.

It's important to remember that engineer time is much more expensive than compute time, so making engineers more productive wins every time. So, from my perspective, the real question isn't whether C++ is faster than Python. The real questions are whether Python is sufficiently fast to be useful, and whether there are reasonable approaches to dealing with the performance bottlenecks. Based on my experience, the answer is a resounding Yes. 

Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, June 8, 2019

Py-HPI: A Procedural HDL/Python Integration



As I mentioned in my last post, I've been looking at using Python for more tasks, including functional verification. My go-to languages for functional verification have traditionally been SystemVerilog for professional work, and C++ when I'm working on a personal project. I've started doing more of my small-application development in Python (often as an alternative to C++), and have wondered whether I could also migrate my testbench development from C++ to Python as well.

This blog post provides an introduction to Py-HPI (Python HDL Procedural Interface), an integration I created between Python and a hardware description language (HDL) simulation environment. I'm far from the first to create an integration between Python and an HDL simulator (I'm aware of at least one formal project, and several other users that have written about their integration work), so what is different about Py-HPI?

Well, in my opinion, two things:
  • Py-HPI integrates at the procedural level, which means Python can directly call tasks in the HDL environment instead of interacting with signals in the HDL environment. 
  • Py-HPI provides a high degree of automation for setting up this procedural-level integration.
In this blog post, I describe the user experience of using Py-HPI. In future blog posts, I'll walk through how I integrated Py-HPI with my go-to project for trying out verification technologies, and I'll go more in-depth on how Bus Functional Models (BFMs) and testbench environments are developed for Py-HPI.

Py-HPI: The Big Picture


The structure of a Py-HPI enabled testbench is shown above. The key elements are described below:
  • Testbench (Python) -- This is Python code the user writes to interact with the design running within the HDL simulation environment
  • Simulator Support -- This is C/C++ code generated by Py-HPI that implements the integration with a specific type of simulator. In general, this code is independent of the specific testbench
  • Testbench Wrapper -- This is C code generated by Py-HPI that implements the testbench specifics of the integration between Python and the HDL environment
  • Bus Functional Models (BFMs) -- BFMs written in HDL (eg SystemVerilog) implement the translation between task calls and signal activity and vice versa.
Currently, Py-HPI supports standard SystemVerilog-DPI simulators (eg Modelsim) as well as Verilator. More integrations are planned, including support for Verilog simulators like Icarus Verilog.

Py-HPI: A Small Example


One easy way to get a sense for the user experience when using Py-HPI is to walk through the steps to run a very simple testbench environment. One of the Py-HPI examples provides just such a testbench.
The structure of this testbench environment is shown above. The Python portion of the testbench drives the SystemVerilog HDL testbench via two bus functional models that are instantiated in the SystemVerilog environment.

Python Testbench

First, let's take a look at the Python testbench code, which you can find here:
def thread_func_1():
    print("thread_func_1")
    my_bfm = hpi.rgy.bfm_list[0]
    for i in range(1000):
        my_bfm.xfer(i*2)

def thread_func_2():
    print("thread_func_2")
    my_bfm = hpi.rgy.bfm_list[1]
    for i in range(1000):
        my_bfm.xfer(i)

@hpi.entry
def run_my_tb():
    print("run_my_tb - bfms: " + str(len(hpi.rgy.bfm_list)))

    with hpi.fork() as f:
        f.task(lambda: thread_func_1())
        f.task(lambda: thread_func_2())

    print("end of run_my_tb")
Execution starts in the run_my_tb() method, which is marked with the special hpi.entry decorator to identify it as a valid entry point. This method starts two threads and waits for them to complete. Each of the thread methods (thread_func_1 and thread_func_2) obtains a handle to one of the BFM instances and calls the BFM's API to perform data transfers in the SystemVerilog testbench environment.
It's almost identical to what I would write in either C++ or SystemVerilog, and in a way, that's exactly the point.

Running the Testbench

Okay, now that we know what the Python side of the testbench looks like, let's see the commands used to create and compile the files necessary to run a simulation. These commands are in the runit_vl.sh script inside the example directory. In this case, I'll show the commands required to run Py-HPI with the Verilator simulator. The example also provides a script (runit_ms.vl) that runs the same example with Modelsim.

Create the Simulation Support Files

We first need to create the simulation-support files. Since we're targeting the Verilator simulator, we need to run the 'gen-launcher-vl' subcommand implemented by the Py-HPI library.
python3 -m hpi gen-launcher-vl top -clk clk=1ns
Verilator is a bit of an outlier, in that the simulation-support files are specific to the HDL design being simulated. Consequently, we need to specify the name of the top Verilog module and the clock name and period.

Create the Testbench Wrapper

Now, we need to create the Testbench wrapper file that will support the specific BFMs instantiated inside the testbench. 
python3 -m hpi -m my_tb gen-bfm-wrapper simple_bfm -type sv-dpi
python3 -m hpi -m my_tb gen-dpi

Because the Verilator simulator supports DPI, we generate a DPI-based testbench wrapper for our testbench, which uses a single BFM type (simple_bfm). The resulting testbench wrapper is implemented in C and provides the connection between SystemVerilog and Python for our BFM.

Compile Everything

This step is very specific to the simulator being used. 
# Query required compilation/linker flags from Python
CFLAGS="${CFLAGS} `python3-config --cflags`"
LDFLAGS="${LDFLAGS} `python3-config --ldflags`"

verilator --cc --exe -Wno-fatal --trace \
 top.sv simple_bfm.sv \
 launcher_vl.cpp pyhpi_dpi.c \
 -CFLAGS "${CFLAGS}" -LDFLAGS "${LDFLAGS}"

make -C obj_dir -f Vtop.mk
Since we're using Verilator, we need to run Verilator to compile the HDL files and the simulator-support and testbench wrapper C/C++ files. Verilator generates C++ source and a Makefile to build the final simulator image. Our last step is to build the Verilator simulation image using the Verilator-created Makefile.

Run it!

Finally, we can run our simulation.
./obj_dir/Vtop +hpi.load=my_tb +vl.timeout=1ms +vl.trace
We pass a few additional plusargs to enable specific behavior:

  • The +hpi.load=my_tb plusarg specifies the Python module to load.
  • The +vl.timeout=1ms plusarg specifies that the simulation should run for a maximum of 1ms. Other simulators will, of course, provide different mechanisms for doing this.
  • The +vl.trace argument specifies that waveforms should be created. Other simulators will provide different ways of turning on tracing.
So, all in all, Py-HPI makes it quite easy to connect a Python testbench to an HDL simulator at the procedural level.

Conclusion

In this blog post, I introduced Py-HPI, a procedural interface between Python and an HDL testbench environment, along with an overview of the user experience of creating and running a testbench with Py-HPI. In my next post, I'll look at a Py-HPI testbench for my FWRISC RISC-V core and compare the new Python testbench with the existing C++ testbench. Until then, feel free to check out the Py-HPI library on GitHub (https://github.com/fvutils/py-hpi). I'd be interested to hear about your experiences using Python for functional verification.


Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.


Sunday, June 2, 2019

Functional Verification and the Ecosystem Argument

I've been involved in the functional verification space for quite some time -- both personally and professionally. On the personal side, I've recently been experimenting with using Python as a functional verification language. The simplest reason? The ecosystem.

The Ecosystem Argument
I've been involved in many discussions over the years that bring up The Ecosystem Argument. It goes something like this: If we use programming language X for Y (a plug-in language, a tool extension language, an implementation language, etc), we'll benefit from the ecosystem around language X. An ecosystem is a very powerful thing, and encompasses the language tools (editors, compilers, linting tools), the users that know a given language, and the available libraries for the language. So, the argument isn't entirely out of line. Using a popular language rather than an obscure language is often not a bad thing. That said, there's a hole in the generic form of The Ecosystem Argument: a programming language is used for many purposes, and much of the specialized knowledge is domain-specific and not language-specific. To put it another way, just because I know Python doesn't mean that I'm a Machine Learning guru -- despite the fact that Python is heavily used in this space.

All of these elements of a language ecosystem are important to consider. However, depending on the circumstances, some of these factors take on greater importance than others. If users will be using the language to create larger applications, the popularity of the language and availability of relevant libraries may take on increasing importance. If users will be using a domain-specific library (eg SystemC for modeling hardware with C++), the language may be less important than the semantics of the underlying domain because users will be more focused on expressing the semantics supported by the library than the semantics supported by the programming language.  

Libraries and Availability
In my experience, it's not a good assumption that selecting a particular programming language will bring a new group of users into a specific domain. It's possible, but not probable. What's more important is whether a group of users can easily be productive given the ecosystem around a given language.

A key aspect of language ecosystem and productivity is library availability. Libraries are present in all language ecosystems. In some cases (eg Java), the language specifies a rich library that satisfies the requirements of many users. In all cases, there are external libraries that serve more specialized needs. 

Getting access to these external libraries poses another challenge, and it's well worth noting any language whose ecosystem simplifies the process of acquiring and publishing libraries. If we're using C/C++ or Java, there are several steps we need to go through to acquire a new library:
  • Locate the project
  • Determine whether the project requires other libraries, and go find those
  • Determine where to install the library, and how our software will get access to it (eg co-locate it with our project, modify the library path, etc)
This model is pretty workable if we are building a large-scale application or library. The overhead of setting up libraries is typically low compared to the value provided by the application or library. This model becomes much more challenging when we're developing smaller pieces of functionality. In these cases, the overhead of acquiring and setting up the external library may equal or exceed the utility we get from the library. In these situations, it quickly starts to appear less expensive to just build something small that serves our needs rather than try to use an external library.

These situations come up all the time in functional verification. Here's just one example. One of the projects I worked on is a small RISC-V core. My testbench needed to get symbol information from the test files that the core executes as part of the test suite. Ideally, I could have used something like libelf or its successor elftools. However, that would require anyone that wanted to run the test suite for my core to also get and install this library. To avoid adding complication, I simply wrote my own simple ELF-file reader that could be included in the testbench.

Python and other languages address this lost opportunity for reuse by providing mechanisms to easily publish and acquire external libraries. Python has PyPI, the Python Package Index. JavaScript has npm. And, at least with Python, users don't need to have administrator privileges in order to install packages for their own use.

In a functional verification environment, there are many other more-specialized cases where reuse is desired. The case described above is more generic, since many software projects have reasons to inspect ELF files. What about Bus Functional Models (BFMs)? What about specialized protocol generators? The possibility of being able to easily share and reuse some of these elements by leveraging Python's existing ecosystem is exciting!
 

Looking Forward
In my next few blogs, I plan to expand on how I'm using Python for functional verification and the benefits (and challenges) that I've experienced. So, stay tuned!


Disclaimer
The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.