Tuesday, December 31, 2019

2019 - The "Nights and Weekends Projects" Year in Review

It's almost the end of 2019, and I've been thinking back over the year as well as thinking ahead to 2020. In past years, I've often evaluated my "nights and weekends" projects using the same metrics I'm evaluated on at work: projects completed, and results obtained. This year, I've started looking my my "nights and weekends" efforts through a different lens focused more on the knowledge I've gained than just what I've produced.
As an aside, given the cover image, I do find it somewhat ironic that almost none of the knowledge I gained this year came from printed and bound books. Growing up with a love of libraries, and the fascinating collections of books they contained, it's both sad to think that knowledge is no longer concentrated there, and amazing to realize what a wealth of knowledge is now so easily-accessible just a short search away.

Looking back, there are two themes that run through several areas that I worked in across the year. The first of these is making software more modular, collaborative, and accessible. The second is Python. That's not all, though. So, let's get right to it!

Software Packaging and Distribution
Professionally, I come from a standard commercial-software background, and have often looked at open source through a similar lens. Specifically, I've often focused on software that can be packaged such that it's easily accessible to end users. This means bundling dependencies, providing installers, etc (see DVKit, a 'batteries-included' IDE for verification engineers).

This application-centric approach works well so long as the elements of functionality being distributed are relatively small in number, and the ways in which they need to be combined are fairly limited. This approach breaks down when the elements of functionality are relatively large in number, and need to be combined in many ways. In short, the more modular software becomes, the less feasible typical application-centric packaging becomes.

I've been dabbling for a few years in RTL design and verification. In this space, the verification environment for a given design will depend on many small elements of functionality -- utility libraries, reusable verification IP, etc. Bundling the dependencies with the verification environment quickly leads to projects that require lots of disk space. On the other hand, forcing users to download and install all the dependencies presents a significant barrier to new users.

One of the biggest reasons that I've spent so much time with Python this past year is that the Python ecosystem appears to provide a solution to this challenge of packaging and easily distributing small elements of functionality. Over the course of the year, I've spent time looking at Conda as a way of making application-level features more modular and easily-accessible. I've also spent time learning about how to package Python extension libraries (both with and without native library components) for distribution on PyPi, a repository for distributing Python packages.

New Approaches to Embedded DSLs
I've been involved in several projects over the years that have used C++ to provide a language-like user experience via C++ overloaded operators and macros. While there are certainly downsides to these embedded domain-specific languages in terms of error messaging and extensibility, an embedded domain-specific language can be a great way to prototype a language-based user interface before committing to the work of defining a first-class language and creating the parsing and processing infrastructure. It's also a very helpful approach for exploring new techniques in the context of existing languages.

C++ support for macros and operator overloading have been used for embedded DSLs from the beginning. However, using just these features tends to lead to somewhat awkward syntax, since operator overloading only supports expressions. C++11 (and beyond) brings new features, such as lambda expressions, and I spent time investigating these mechanisms and their impact on supporting expressing more-complex constructs in a more-natural way.

While the new C++11 features definitely showed promise, I started to wonder what support Python provided for implementing embedded domain-specific languages. As it turns out, Python provides some very powerful capabilities. Python supports overloading more operators than C++, and supports introspection into the code described by the user. I definitely intend to revisit embedded domain-specific languages captured in Python in 2020!

Constraint Solvers
Highly-capable constraint solvers that are available under permissive open-source licenses are becoming widely available, and I'm seeing these solvers applied to a range of interesting tasks. The CRAVE library for generating random stimulus has been around for some time. Several tools are leveraging available SMT solvers for model checking. Constraint solvers are even being applied for graphical layout of diagrams!

Given the range of applications to which solvers lend themselves, I thought it would be worth having a bit more hands-on knowledge. I spent some time learning about the Z3 solver API before concluding that, while the API is elegant and comprehensive, it's also more-complicated that what I need. I subsequently shifted to looking at the Boolector solver API, which is smaller and simpler.

The Boolector solver provides a Python binding, which is built along with the solver. This means that a user needs to manually build Boolector in order to use a Python package that uses the Boolector solver. Fortunately, I'd been learning about packaging and distributing Python extension libraries, and this this provided a perfect place to try this out. The Boolector Python library (PyBoolector) on PyPi is the result of this work.

Python for Verification
My background in verification is rooted in SystemC, SystemVerilog, and UVM. All very mainstream languages and methodologies in the commercial design and functional verification space. As I spent more time exploring Python and the modular and collaborative packaging it supports, I concluded that it made sense to investigate using Python for functional verification.

I spent time learning about cocotb, the most popular functional verification library in Python that I'm aware of. I also spent time learning about Python's back-end C API and how to structure bus-functional models to integrate at the procedure level with Python.

Actually, the more time I spend looking at Python for verification, the more possibilities I see. Definitely look for more on this topic in 2020!

In most areas, I've been quite happy with Python for verification. The object-oriented language features fit the requirements for high-level verification, and the easy availability of utility packages simplifies dealing with project dependencies. The one thing I've been dissatisfied with is support for static checking. I've used statically-typed languages for most application development. These languages have the advantage that the compiler can identify misuse of types before running the application. Dynamically-typed languages, such as Python and TCL, end up discovering type-misuse issues (eg passing an object to a method that expects an object of a different type) at runtime. One target for 2020 is learning more about what can be done to address this issue. Lint tools such as Pylint help, and my hope is to discover more tools and methodologies that help to close this gap.

RTL Design Skills
When I undertook the 2018 RISC-V Soft Core Contest, It had been quite a few years since I'd done any RTL design. Going through the design work for that project helped me brush up my skills quite a bit, but I knew I had quite a ways to go to be proficient. When the 2019 contest, centered around software security, came along, I knew it was a good opportunity to both learn more about software security vulnerabilities and improve my RTL design skills.

In addition to improving my RTL design skills, I learned a couple of things from initially attempting to add a few new features (multiplication, compressed instructions, security extensions) to my 2018 soft core. First, I had succeeded at writing some very good spaghetti RTL that wasn't modular enough to support extensibility. Furthermore, I didn't have sufficient tests to effectively and efficiently catch bugs introduced by adding new features.

Over the course of the 2019 project, I did a complete rewrite of the Featherweight RISC core. The more-modular structure of the rewritten core lends itself even better to bounded model checking, and I found this to be extremely helpful in catching and diagnosing bugs introduced during development and integration.

Going through this process also helped to improve my knowledge of RTL constructs that result in good efficient implementation, and which do not.

Looking Forward
2019 has been a great year for learning about more corners of the technical world. Looking forward to 2020, I see more work with Python, transitioning more of my existing projects over to cloud-based continuous integration, and more work with Python in the functional verification space. What will I learn along the way? Stay tuned for more blog posts across 2020 to find out!

As we come to the end of 2019 and the beginning of a new year (and new decade), I wish you happy holidays, a happy new year, and a 2020 ahead that is full of learning!

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 14, 2019

Writing a Task-Based Cocotb BFM

The purpose of a Bus Functional Model (BFM) is to enable interacting with a design via a given protocol at a higher level of abstraction than the signal-level protocol, while knowing the bare minimum about the details of that protocol. Verification IP goes beyond these benefits to provide test plans, functional coverage, test sequence, and often protocol-specific benefits like compliance test suites.

In order to realize any these benefits of having a BFM, one first needs to exist. So, let's take a look at what it takes to create a task-based Cocotb BFM for a very simple ready/valid protocol.

The ready/valid protocol is simple, but useful. Data is exchanged between two blocks on a clock edge when both ready and valid are active (high). The initiator controls the valid signal, the target controls the ready signal. That's all there is to the protocol.

Before we get into the details, a few words about the structure of a task-based Cocotb BFM. There are two key components:

  • A Python class that provides the API used by the test writer, and defines the lower-level API used to actually interact with the HDL portion of the BFM
  • An HDL module that performs the conversion between the commands sent from Python and signals, and vice-versa.

Both of these aspects of a task-based BFM are collected together in a Python package that is typically named with the protocol implemented by the BFM (rv, or ready-valid, in this case).

Python API

The Python portion of a task-based Cocotb BFM is captured as a Python class. I use camel-case names for classes, so the ready/value output (initiator) BFM is ReadyValidDataOutBFM.

class ReadyValidDataOutBFM():

    def __init__(self):
        self.busy = Lock()
        self.ack_ev = Event()

The two core data items shown in the class initializer are likely to be found in every BFM. Cocotb supports multi-threaded tests using Python co-routines. Our BFM will use the 'busy' lock to ensure that only one Python thread can use the BFM at a time. It will use the 'ack_ev' event to interact with the HDL portion of the BFM.

As I mentioned earlier, there are typically two API layers that we need to define: the API that the user calls, and a lower-level API that is used to control the BFM's HDL code inside the simulation.

Let's start with the user layer, since that's quite simple. The user's test will use the ready/valid initiator BFM to write data to a target ready/valid interface. Let's call the public method 'write_c', to denote that this is a Python co-routine that writes data out from the BFM.

    def write_c(self, data):
        Writes the specified data word to the interface
        yield self.busy.acquire()

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()


Before getting into the implementation details, let's look at the second API layer -- the one that interacts directly with the BFM. The low-level API must be non-blocking in order to work with the full range of simulation and execution environments that must be supported. This means that we need to split the write operation into two pieces: an outbound call to initiate a write, and an inbound call from the BFM to notify that the write is complete.

    def _write_req(self, d):
    def _write_ack(self):

Note that the low-level API functions are prefixed with '_', denoting that this is an internal API and not intended to be called directly by the user.


The HDL portion of the BFM also has two aspects. One is synchronous synthesizable code, while the other implements the interface to the synchronous code. Both of these are contained within a Verilog module, shown below:

module rv_data_out_bfm #(
  parameter DATA_WIDTH = 8
) (
  input clock,
  input reset,
  output reg[DATA_WIDTH-1:0] data,
  output reg data_valid,
  input data_ready

reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;

This module is instantiated in the HDL testbench and connected to the signals on the appropriate design interface. The data_v and data_valid_v variables are used to interface between the synchronous and control code inside the BFM.

We'll look at the interface code first. The Verilog task below implements the Python _write_reg method shown above.

task _write_req(reg[63:0] d);
data_v = d;
data_valid_v = 1;

Note that the interface task is non-blocking, and simply sets values on variables within the module -- in this case, storing the data to be written and indicating that there is new data to transfer.

The synchronous logic controls the module signals based on the variables set by the interface tasks. This code is shown below:

always @(posedge clock) begin
  if (reset) begin
    data_valid <= 0;
    data <= 0;
  end else begin
    data_valid <= data_valid_v;
    data <= data_v;
          if (data_valid && data_ready) begin
      data_valid_v = 0;

This synchronous logic propagates the variables that were set in the interface task. When both the data_valid and data_ready signals are high, the _write_ack() task is called to notify the Python environment that the write is completed. At the same time the data_valid_v variable is cleared to terminate the transfer. 

Note that the synchronous logic is likely very similar to the logic within an RTL implementation of a ready/valid initiator. This is an opportunity, since it means that Cocotb BFMs can leverage existing RTL implementations of interface protocols.

Now that we have a ready/valid BFM implemented, what can we do with it? Well, in addition to using it to verify our current design, we can also share it with others that also have ready/valid interfaces on their designs. This is Python, after all, and it's very easy to share Python libraries with others via the PyPi repository (https://pypi.org/).

In order to do this, we need to setup a very basic 'setup.py' script in our project directory that identifies the Python package and related data (BFM RTL) that needs to be distributed. After that, it's a simple matter to publish to PyPi such that another project can make use of the BFM simply by adding rv_bfms to that project's requirements.txt file.

Next Steps
Hopefully the description above shows just how simple it is to setup a Python BFM that can interact at the task level with RTL. You can find the full code for this BFM in the rv_bfms Github repository here: https://github.com/pybfms/rv_bfms.

In my next post, we'll take a look at some more-advanced ways to structure BFMs to increase the overall performance even more, and see some ways for BFMs to add more value in debug.

Meanwhile, I'd be interested to hear what protocols are of high interest in your FOSSi (Free and Open-Source Silicon) projects. I'd be especially interested if you'd like to contribute a task-based Python BFM for one or more of those protocols!

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, November 30, 2019

Adding Task-Based Bus Functional Models to Cocotb

Getting a project started -- even to a certain level of completeness -- is often pretty simple. A couple weekends of hacking often results in pretty good progress and results. Finishing things up, in contrast, is often a slow process. That has certainly been the case with some work I did back in May and June this year to prototype a task-based interface between Python and an HDL simulator. The proof-of-concept work I did at the time seemed quite promising, but the clear next step was to make that work more accessible to others. That "last little bit of work" has certainly turned out to take more time that I had originally assumed!

Just as a reminder, the motivation for interacting with an HDL environment at the task level is quite simple: performance. Communicating across language (especially interpreted language) boundaries tends to be expensive, so minimizing the number of cross-language communications is critical to achieving high throughput. Using a task-based interface between Python and an HDL environment boosts performance in three ways. First, a task-based interface groups data so fewer language-boundary crossings are required to communicate a given amount of data. Secondly, and more importantly, using a task-based interface enables the HDL environment to filter events and only interact with the Python environment when absolutely necessary. Finally, using a task-based interface enables integrations with high(er)-speed environments, such as emulation or the current release of Verilator, where signal-level integration isn't practical.

I started looking at Python as a testbench language for reasons that might initially seem strange: the Python ecosystem (primarily PyPi) that makes it easy to publish bits of library and utility code in a way that it is easily-accessible to others. Often, the ability to take advantage of the work of others is gated by the effort required to gather all the required software dependencies. The Python ecosystem promises to alleviate that challenge, and I was excited to explore the possibilities.

Where Does it Fit?
I'm aware of a few projects that use Python for verification, but it currently appears that Cocotb is the most visible Python-based testbench for doing hardware verification using Python testbench code. Consequently, it made very good sense to see whether the task-based integration I had prototyped could be integrated with the existing Cocotb library.

Cocotb is a Python library that supports light-weight concurrency via co-routines, and provides primitives for coordinating these co-routines with each other and with activity in a HDL simulation environment. In addition to the Python library, Cocotb provides native-compiled C/C++ libraries that integrate with the simulation environment via APIS implemented by the simulator (VPI, DPI, FLI, or VHPI depending on the simulator). Currently, Cocotb interacts with simulation environments at the signal level.

In considering how to add task-based interactions to Cocotb, there were a several requirements that I thought were quite important. First, the user should not be forced to choose between signal-level and task-based interactions between Python and HDL. It should be possible to introduce task-based interactions to a testbench currently interacting at the signal level, or add a few signal-based interactions to a testbench that primarily interacts at the task level. Secondly, a task-based integration must support a range of simulator APIs. I had prototyped a DPI-based integration, which is supported by SystemVerilog simulators, supporting Verilog and VHDL simulators as well was clearly important. Finally, achieving good performance was a key requirement, since performance is the primary motivation for using a task-based interface in the first place.

Task-Based BFM Cocotb Architecture

From a system perspective, the diagram above captures how task-based BFMs integrate with Cocotb. Each BFM instance is represented in the HDL environment by an instance of an HDL module. This module is special, in that it knows how to accept and make task calls and convert between signal-level information and those task calls.

Each BFM also knows how to register itself with a BFM Manager within Cocotb. When the HDL portion of a BFM registers with Cocotb, the BFM Manager creates an instance of a Python class that represents the BFM within the Python environment.

The BFM Manager provides methods to allow the user's test code to query the available BFMs and obtain a handle to the BFM instances required by the test. From there, the user's test simply calls methods on the Python class object and/or receives callbacks.

Task-Based BFM Architecture

Let's take a quick look at the work needed to support a task-based BFM. First off, the BFM author needs to a Python class to implement the Python side of the BFM. That class is decorated with a @cocotb.bfm decorator that associates HDL template files with the BFM class. Below is a BFM for a simple ready/valid protocol.

    cocotb.bfm_vlog : cocotb.bfm_hdl_path(__file__,   "hdl/rv_data_out_bfm.v"),
    cocotb.bfm_sv   : cocotb.bfm_hdl_path(__file__, "hdl/rv_data_out_bfm.v")
class ReadyValidDataOutBFM():
    # ...

Next, the BFM author must specify the low-level interaction API with the HDL BFM. All calls must be non-blocking, so most interactions with the HDL environment are implemented as a request/acknowledge pair of API calls.

    def write_req(self, d):
    def write_ack(self):

Calling a class methods decorated with @cocotb.bfm_import will result in a task call in the HDL BFM. Class methods decorated with @cocotb.bfm_export can be called from the HDL BFM.

Finally, on the Python side, the BFM author will likely provide a convenience API to simplify the testwriter's life:

    def write_c(self, data):
        Writes the specified data word to the interface
        yield self.busy.acquire()

        # Wait for acknowledge of the transfer
        yield self.ack_ev.wait()


There's one piece left, and that's the HDL BFM. This is specified as a template:

module rv_data_out_bfm #(
parameter DATA_WIDTH = 8
) (
input clock,
input reset,
output reg[DATA_WIDTH-1:0] data,
output reg data_valid,
input data_ready
reg[DATA_WIDTH-1:0] data_v = 0;
reg data_valid_v = 0;
always @(posedge clock) begin
if (reset) begin
data_valid <= 0;
data <= 0;
end else begin
if (data_valid_v) begin
data_valid <= 1;
data <= data_v;
data_valid_v = 0;
if (data_valid && data_ready) begin

if (!data_valid_v) begin
data_valid <= 0;
task write_req(reg[63:0] d);
data_v = d;
data_valid_v = 1;

// Auto-generated code to implement the BFM API


The BFM author must implement tasks that will be called from the Python class. Task proxies that will invoke Python methods are implemented by the Cocotb automation, and substituted into the template where the ${cocotb_bfm_api_impl} tag is specified.

Current Integrations
Currently, task-based BFM integrations are implemented for Verilog via the VPI interface and for SystemVerilog via the DPI interface. A VHDL integration isn't currently supported, but that's on the roadmap. One complication with VHDL is that there are actually several interfaces that may need to be supported depending on the simulator -- VHPI, VHPI via VPI, Modelsim FLI. Here I could use some input from the community on priorities, so I'd definitely like to hear from you if you're using Cocotb with VHDL...

As I mentioned at the beginning of this post, performance is the primary reason for using task-based interaction between Python and a HDL environment. So, how much improvement can you expect? Well, that entirely depends on how frequently your tests interact with the HDL environment and, to a certain extent, on how long your tests are. If your testbench needs to interact with the testbench every clock cycle, then you're unlikely to see much benefit. If, however, your testbench spends quite a few cycles waiting for the design to respond, then you're likely to see pretty significant benefits.

I'll use my FWRISC (Featherweight RISC) RISCV core as an example. In this environment, the bulk of the test is actually compiled code that executes on the processor. The Python testbench is primarily responsible for checking results and providing debug information when needed.
A diagram of the simulation-based testbench is shown above. The Tracer BFM is responsible for monitoring execution of the FWRISC core and sending events up to the high-level testbench as needed. These events include:
  • Instruction executed
  • Register written
  • Memory written
I've created two Cocotb implementations of this BFM: one that interacts at the signal level, and one that interacts at the task level. To compare the performance, I'm running a Zephyr test with Icarus Verilog for 10ms of simulation time.

Let's start with as close to a direct comparison as possible. Both the signal-level and task-based BFM will capture the same information and propagate it to the Python testbench.
  • Signal-Level BFM: 85s (wallclock)
  • Task-Level BFM: 33s (wallclock)
Okay, so already we're looking pretty good. This performance increase is simply because the task-based BFM doesn't need to call the Python environment every cycle. 

Another way we can benefit is to use a higher-performance simulator. Icarus Verilog is interpreted, and supports a full event-driven simulation environment. Verilator has a much more restricted set of features (synthesizable Verilog only, limited signal-level access, etc), but is also much faster. It also doesn't currently support the signal-level access to the extent necessary to do a direct comparison between a task-based BFM and a signal-level BFM. So, how do we look here? I actually had to increase the simulation time to 100ms (10x longer) to get a meaningful reading.
  • Task-Level BFM: 18s (wallclock)
So, coupling a fast execution platform with an efficient integration mechanism definitely brings benefits!

Next Steps
So, where do we go from here? Well, please stay tuned for my next blog post to get more details on how create task-based BFMs using these features. I also have an active pull request (#1217) to get this support merged into Cocotb directly. Until then, you can always access the code here

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, July 27, 2019

Embedded Languages: The Space Between Language and API

We're all familiar with general-purpose programming language for capturing general algorithms, but there are also a sizeable group of domain-specific languages that exist to efficiently capture reasoning in a specific domain -- whether that's hardware design (Verilog, VHDL), database manipulations (SQL), or models at a high level of abstraction (UML/xtUML). These languages exist because the overhead is enormous for a domain expert to capture a problem in their given domain using a general-purpose programming language and APIs.

One of my favorite examples showing the motivation for domains-specific languages is spreadsheets. A spreadsheet is a language based around a namespace (table) where elements (cells) in the namespace are addressable by their coordinates, and whose values are represented by equations that may include references to other elements in the namespace. Just think how easy it is to setup a simple spreadsheet to do some what-if analysis, and how difficult it would be if you had to write a program to perform those calculations instead!

Simplistic though it may be, the spreadsheet perfectly captures the motivation behind domain-specific languages: focus on capturing the what of a given domain -- the key attributes, key relationships, and key operations -- and not on the how of the mechanics of how these elements would be represented in a general-purpose programming language. In short, a domain-specific language provides a user interface to complex algorithms phrased in familiar terms -- at least to someone knowledgeable in a that specific domain.

Taking the step of capturing domain knowledge in a new domain-specific language is a big step, though. There are a variety of reasons to defer taking that step or, perhaps, to not take that step at all.  Sometimes an entire language isn't required to implement the desired user interface. Sometimes it's desirable to have some benefits of a general-purpose language without the overhead of designing an entirely new all-in-one domain-specific and general-purpose language. The embedded domain-specific language is one approach that has been used to bring some benefits of a domain-specific language into an existing general-purpose programming language. The general approach is to use existing general-purpose language constructs, such as pre-processor macros and operator overloading, to build constructs with a domain-specific language feel within an existing language.

Within the set of embedded domain-specific languages that I'm aware of, I'm actually aware of three key styles of embedded a domain-specific language inside an existing general-purpose programming language.

Decorations and Annotations
One of the simplest domain-specific language integration techniques that I'm aware of is the decorator/annotation pattern. This style of domain-specific language is used to statically register classes or functions with a library framework.
class slave_address_map_info extends uvm_object;
  protected int min_addr;
  protected int max_addr;
  function new(string name = "slave_address_map_info");
    `uvm_field_int(min_addr, UVM_DEFAULT)
    `uvm_field_int(max_addr, UVM_DEFAULT)

  // ...

While there are many examples of a decorator/annotation eDSLs, the example that came to mind first for me was the Universal Verification Methodology (UVM). UVM is a class library for functional verification built on top of the SystemVerilog domain-specific language. Two common operations that users of the UVM need to perform is registration of key user-defined types with the class library, and writing functions to clone, compare, and print class instances. Performing these operations in plain old code is time-consuming and error-prone. UVM provides a set of macros that allow the user to declare the existence of their user-defined class type and the fields within it (shown above highlighted in blue). 
The macros (SystemVerilog's key feature supporting embedded domain-specific languages) above cause the class type to be registered with the UVM class library, and implement functions for comparing, displaying, and cloning an object of this type. All from a high-level specification.

Enmeshed eDSL
Our next level of eDSL integration starts to look a bit more like a language. An Enmeshed eDSL provides the user statements that look a bit like a programming language, but are really driving algorithms behind the scenes. I call this style of integration Enmeshed because the user's general-purpose programming language code interacts closely with the algorithms driven by the eDSL as program runs.
class item : public rand_obj {
item(rand_obj* parent = 0) : rand_obj(parent), src_addr(this), dest_addr(this) {
src_addr.addRange(0, 9);
src_addr.addRange(90, 99);
constraint(dest_addr() % 4 == 0);
constraint(dest_addr() <= reference(src_addr) + 3); 

randv<uint> src_addr;
randv<uint> dest_addr;
Our example of an Enmeshed eDSL comes courtesy of CRAVE, a constrained-random data generation package for the C++-based SystemC library. As you can see, the highlighted sections above look a bit more like a language. In this case, these are constraint expressions that control a constraint solver such that the values of src_addr and dst_addr obey the relationships established by the expressions.
When the user's program runs, it creates instances of classes like the one shown above, calls an API to create new random values for the random fields, and uses the values from those fields directly. In short, I consider the eDSL enmeshed with the host language because execution of the host language is interleaved with (effective) execution of the eDSL. The host language takes a primary role, and calls the eDSL code to provide specific services to the primary application.

Encapsulated eDSL
Our final level of eDSL integration is an embedded DSL that defines a new domain within the host language. There are several hardware-description languages embedded in general-purpose programming languages that fit this definition.

import chisel3._

class GCD extends Module {
  val io = IO(new Bundle {
    val a  = Input(UInt(32.W))
    val b  = Input(UInt(32.W))
    val e  = Input(Bool())
    val z  = Output(UInt(32.W))
    val v  = Output(Bool())
  val x = Reg(UInt(32.W))
  val y = Reg(UInt(32.W))
  when (x > y)   { x := x -% y }
  .otherwise     { y := y -% x }
  when (io.e) { x := io.a; y := io.b }
  io.z := x
  io.v := y === 0.U

I've selected CHISEL (Constructing Hardware in a Scala-Embedded Language) as the example. What makes an encapsulated eDSL different is that the description made using the eDSL is monolithic and executed to create a single model -- in this case, Verilog. The GCD design show above might be used within a larger CHISEL-based design, but would never be used within a user's program to provide a useful service to the program. In a sense, an encapsulated eDSL description takes on a primary role within the host application. 

Embedding a DSL in Python
As we've seen, an embedded domain-specific language can provide a domain-specific interface to complex algorithms inside the confines of an existing general-purpose programming language. We've looked at several styles in which an embedded domain-specific language can be integrated into its host language -- all with different tradeoffs in terms of benefits and usability.
I've personally worked with embedded domain-specific languages in nearly every programming language I've used -- from C/C++ to TCL to Java. Most recently, though, I've been learning Python and (naturally) exploring the capabilities that Python offers for supporting an eDSL. Over the next few posts I'll look at Python's features that enable eDSL integration using a small eDSL I've been working on as an example.
In the meantime, what has your experience been with embedded domain-specific languages? Helpful or frustrating? Any notable examples -- either good or bad?

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, July 13, 2019

The Toolmaker's Dilemma: Visionaries Have Always Created Their Own Tools

I recently saw a quote (on Twitter, in a photo no less) to roughly the same effect as the title above: that innovators and visionaries have always created their own tools. I'd attribute the quote if I could, but the dynamic nature of Twitter has ensured that my chances of locating the original post are slim to none. Nonetheless, I've been thinking quite a bit about the concept over the last few weeks. First, whether the claim is true. Secondly, if true, whether this is a good, bad, or neutral thing. And thirdly, what, if anything, could or should be done to change the situation.

Who Are the Visionaries?

One of my favorite Geoffrey Moore books is Crossing the Chasm, which I've read this multiple times over the course of my career. Despite some of the example in the book being a bit dated, much of the advice is still relevant and I feel that I come away with new insights every time I read the book.

Moore identifies Visionaries as the Early Adopters in his book.
Visionaries are that rare breed of people who have the insight to match an emerging technology to a strategic opportunity, the temperament to translate that insight into a high-visibility, high-risk project, and the charisma to get the rest of their organization to buy into that project. 
Visionaries drive the high-tech industry because they see the potential for an "order-of-magnitude" return on investment and willingly take high risks to pursue that goal. 
Moore sees Visionaries as seeking monetary return on investment, and I've encountered many of these. However, I've also encountered Visionaries that are looking for different types of return on investment -- for example, fundamentally altering the way an industry approaches functional verification methodology.

The Opportunities and Challenges of Visionaries

Engaging with Visionaries, as with all of the personalities across the technology adoption curve, has its opportunities and challenges. Visionaries have a clear and ambition vision of the future, and are highly motivated to realize that vision. On the flip side, Visionaries make for demanding customers. Quoting Moore:
As a buying group, Visionaries are easy to sell to but very hard to please. This is because they are buying a dream -- which, to some degree will always be a dream.
Visionaries are also, by definition, out of step with the majority and early majority portions of the market where you hope to get your tool adopted, and often 5-10 years ahead. This can be a good thing if you buy into their vision of the future. However, it's often unclear whether their ideas are crazy good or just crazy.

Moore advises an approach of partnership with Visionaries. This can be a good approach for truly early stage technologies, provided the Visionary is backed by an organization with deep pockets (either financial or people resources), and provided that you buy into their vision of the future. When these criteria are not met, it's very likely a "business decision" will be made to not engage with the Visionary since engagement won't help the product get to its intended destination.

Why do Visionaries Build their Own Tools?

The decision to not engage with a Visionary is typically done for good and rational reasons. After all, the market is optimized to target the majority -- and this is true both for commercial and non-commercial endeavors. However, the story often doesn't end here. Remember, Visionaries are highly-motivated people and driven to realize their view of the future.

A not-infrequent result is that Visionaries collect pieces of available technology and proceed to implement a version of the tool of their dreams. So, the short answer is that Visionaries build their own tools because they feel they haven't any viable alternatives.

What's Wrong with this Picture?

So, is this really a problem, or just the way technology needs to evolve? After all, the best way to determine whether an idea is crazy-good or just crazy is to wait and see if it gains traction or withers away. Just like biological evolution, this makes sense in large ecosystems. Losing a few crazy-good ideas along with the plain crazy ideas in a large ecosystem is just a cost of doing business.

However, many of us find ourselves, by choice or circumstance, in much smaller niche ecosystems. Here, fostering innovation is key to growing the ecosystem. Losing a few crazy-good ideas makes a big difference here. And, the fact that Visionaries are off building out their visions on the fringes of the ecosystem has the effect of further subdividing an already-small ecosystem.

A Venture Capital Approach to Visionaries

If you find yourself in a niche ecosystem -- whether commercial or open source -- I would encourage you to take a venture capital approach to supporting the work of Visionaries. What do I mean by this? Make some bets to enable the work of Visionaries in the tool you provide to the market, and provide ways for Visionaries to interact with your tool's community. Much like the investments venture capital makes, the assumption is that most of these bets won't pay off, but that a few will pay off in a big way.

One big thing that has changed since Geoffrey Moore wrote Crossing the Chasm in 1991 is just how easy it is for communities (even niche ones) to connect and interact. This provides fundamental new opportunities when it comes to propagating new technology to technology ecosystems, and merits a rethink in the way we approach the relationship between new technology and visionaries.

In some sense, the suggestion above changes relatively little about the way we build and promote tools. We often accidentally build in capabilities that visionaries can use. However, I believe we can get even better results by deliberately and intentionally creating these visionary-friendly mechanisms. Plan and develop extension mechanisms that don't impact usability for mainstream users, but allow visionaries to implement their visions in the context of your tool instead of needing to create a completely-separate tool. Keep visionaries close to your community of existing users so they have a ready audience to help accelerate the process of identifying crazy-good ideas and weeding out plain-crazy ideas. Be on the lookout for those truly crazy-good ideas that resonate with your community and have potential to be the next big thing.

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Sunday, June 16, 2019

Py-HPI: Applying Python for Verification

In my last post, I talked about a prototype procedural interface between Python and HDL that enables cross-calling between Python and SystemVerilog. My primary motivation for investigating a procedural interface was its potential to maximize performance. In this post, I create a Python testbench for a small IP and compare it to the equivalent C++ testbench. I also look at the performance of Python for verification.

Creating a Python Testbench
My go-to IP for trying out new verification techniques is a small 32-bit RISC-V core named Featherweight RISC (FWRISC) that I created for a design contest last year. The original testbench was written in C++, so that will be my baseline for comparison. If you're interested in the structure of the testbench, have a look at this post.

Since I was keeping the testbench structure the same, I didn't expect much in terms of a reduction in lines of code. C++ is a bit verbose, in that it expects a header and implementation file for each class. This contributes to the fact that each C++ test is roughly twice as long as each Python test:

  • C++ Test: 328 lines
  • Python Test: 139 lines
Reducing the lines of code is a good thing, since more code statistically means more bugs, and spending time finding and fixing testbench bugs doesn't help us get our design verified. But, that's just the start.

The unit tests for FWRISC are all self-checking. This means that each assembly file contains the expected value for registers modified by the test. You can see the data embedded below between the start_expected and end_expected labels.

li x1, 5
add x3, x1, 6
j done
// Expected value for registers
.word 1, 5
.word 3, 11

Because I didn't want to need to install an ELF-reading library on every machine where I wanted to run the FWRISC regression, I wrote my own small ELF-reading classes for the FWRISC testbench. This amounted to ~400 lines of code, and required a certain amount of thought and effort.

When I started writing the Python testbench, I thought about writing another ELF-reader in Python based on the code I'd written in C++... But then I realized that there was already a Python library for doing this called pyelftools. All I needed to do was get it installed in my environment (more on that in a future post), and call the API:

with open(sw_image, "rb") as f:
elffile = ELFFile(f)
symtab = elffile.get_section_by_name('.symtab')
start_expected = symtab.get_symbol_by_name("start_expected")[0]["st_value"]
end_expected = symtab.get_symbol_by_name("end_expected")[0]["st_value"]
section = None
for i in range(elffile.num_sections()):
shdr = elffile._get_section_header(i)
if (start_expected >= shdr['sh_addr']) and (end_expected <= (shdr['sh_addr'] + shdr['sh_size'])):
start_expected -= shdr['sh_addr']
end_expected -= shdr['sh_addr']
section = elffile.get_section(i)
data = section.data()

That's a pretty significant savings both in terms of code, and in terms of development and debug effort! So, definitely my Python testbench is looking pretty good in terms of productivity. But, what about performance?

Evaluating Performance
Testbench performance may not be the most important factor when evaluating a language for use in verification. In general, the time an engineer takes to develop, debug, and maintain a verification environment is far more expensive than the compute time taken to execute tests. That said, understanding that performance characteristics of any language enables us to make smarter tradeoffs in how we use the language. 

I was fortunate enough to see David Patterson deliver his keynote A New Golden Age for Computer Architecture around a year ago at DAC 2018. The slide above comes from that presentation, and compares the performance of a variety of implementations of the computationally-intensive matrix multiply operation. As you can see from the slide, a C implementation is 50x faster than a Python implementation. Based on this slide and the anecdotal evidence of others, my pre-existing expectations were somewhat low when it came to Python performance. But, of course, having concrete data specific to functional verification is far more useful than a few anecdotes and rumors.

Spoiler alert: C++ is definitively faster than Python.

As with most languages, there are two aspects of performance to consider with Python: startup time and steady-state performance. Most of the FWRISC tests are quite short -- in fact, the suite of unit tests contains tests that execute less than 10 instructions.This gives us a good way to evaluate the startup overhead of Python. In order to evaluate the steady-state performance, I created a program that ran a tight loop with 10,000,000 instructions. The performance numbers below all come from Verilator-based simulations.

Startup Overhead
As I noted above, I evaluated the startup overhead of Python using the unit test suite. This suite contains 66 very short tests. 

  • C++ Testbench: 7s
  • Python Testbench: 18s
Based on the numbers above, Python does impose a noticeable overhead on the test suite -- it takes ~2.5x longer to run the suite with Python vs C++. That said, 18 seconds is still very reasonable to run a suite of smoke tests.

Steady-State Overhead
To evaluate the steady-state overhead of a Python testbench, I ran a long-loop test that ran a total of 10,000,000 instructions.

  • C++ Testbench: 11.6s
  • Python Testbench: 109.7s
Okay, this doesn't look so good. Our C++ testbench is 9.45x faster than our Python testbench. What do we do about this?

Adapting to Python's Performance
Initially, the FWRISC testbench didn't worry much about interaction between the design and testbench. The fwrisc_tracer BFM called the testbench on each executed instruction, register write, and memory access. This was, of course, simple. But, was it really necessary?

Actually, in most cases, the testbench only needs to be aware of the results of a simulation, or key events across the simulation. Given the cost of calling Python, I made a few optimizations to the frequency of events sent to the testbench:

  • Maintain the register state in the tracer BFM, instead of calling the testbench every time a write occurs. The testbench can read back the register state at the end of the test as needed.
  • Notify the testbench when a long-jump or jump-link instruction occurs, instead of on every instruction. This allows the testbench to detect end-of-test conditions and minimizes the frequency of calls
With these two enhancements to both the C++ and Python testbenches, I re-ran the long-loop test and got new results:

  • C++ Testbench: 4s
  • Python Testbench: 5s
Notice that the C++ results have improved as well. My interpretation of these results is that most of the time is now spent by Verilator in simulating the design, and the results are more-or-less identical.

The Python ecosystem brings definite benefits when applying Python for functional verification. The existing ecosystem of available libraries, and the infrastructure to easily access them, simplifies the effort needed to reuse existing code. It also minimizes the burden placed on users that want to try out an open source project that uses Python for verification.

Using Python does come with performance overhead. This means that it's more important to consider how the execution of the testbench relates to execution of the design. A testbench that interacts with the design frequently (eg every clock) will impose much greater overhead compared to a testbench that interacts with the design every 100 or 1000 cycles. There are typically many optimization opportunities that minimize the performance overhead of a Python testbench, while not adversely impacting verification results.

It's important to remember that engineer time is much more expensive than compute time, so making engineers more productive wins every time. So, from my perspective, the real question isn't whether C++ is faster than Python. The real questions are whether Python is sufficiently fast to be useful, and whether there are reasonable approaches to dealing with the performance bottlenecks. Based on my experience, the answer is a resounding Yes. 

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.