Monday, August 28, 2017

Chisel Sharpening: In the end, it's all about results

At the end of the day, it's all about results, of course. A productivity improvement is great, as long as the implementation results at least stay constant. Any decrease in implementation results definitely detracts from any productivity improvements.

So, after all the work thus far describing and verifying the wishbone interconnect, we still have to answer the question: how are the results? How does the Chisel description compare, in terms of synthesis results, to the results from the hand-coded description?

In this post, I'll be using Intel/Altera Quartus for synthesis (since I have access to an Altera prototype board and Quartus). If anyone wishes to donate a Zynq prototype board, I'd be happy to report the results from Xilinx Vivado as well.

There are two things that we care about when it comes to implementation: speed and size. What's the maximum frequency at which the implementation operates? How many logic elements are required? How many registers?

Prerequisites

Synthesis tools are very good, by design, at eliminating unused logic. Any I/Os in our design that aren't connected will be optimized out of the design. Consequently, synthesizing the interconnect will only work under one of two conditions: if all the interconnect I/Os are connected to FPGA device I/Os, or if we build a design around the interconnect that utilizes all the interconnect I/Os.

Given the number of I/Os the Wishbone interconnect has, the second path is the more feasible.

We already have a synthesizable Wishbone target memory device from the verification environment. Now, we just need something to drive the master interfaces. Just to further explore Chisel, I created a Wishbone initiator using the LFSR Chisel module to randomize the address (to select the target device), read/write, and write data.

I connected up the LEDs to counters that toggle the LED every time the respective target device is accessed 16,384 times -- just for kicks, and because I like blinking lights...

Results

For comparison, I'll leave everything the same about the design except for the interconnect. Here are the relevant details:

Design	Registers	ALMs	Fmax (MHz)
Hand-coded	264	239	160.64
Chisel	227	275	201.37

So, Chisel uses:

15% more ALMs than the hand-coded RTL
14% fewer registers than the hand-coded RTL

So, it's a bit of a wash in terms of size.

The performance difference is fairly significant, though: the Chisel design is 25% faster than the hand-coded RTL. So, the (slightly) higher-level description certainly doesn't hurt the results! Now, I'm sure I could achieve the same results with the hand-coded description that the Chisel description achieved. I'm still a bit surprised, though, that the Chisel description achieved better results out of the box with a less-than-expert user. So, definitely a promising conclusion to my initial Chisel exploration!
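
The relative differences can be recomputed directly from the table, taking the hand-coded design as the baseline. The small Scala sketch below does just that; the object and function names are made up for illustration, and only the six numbers come from the table.

```scala
object SynthesisDeltas {
  // Relative change of the Chisel result versus the hand-coded baseline, in percent.
  // Negative means the Chisel design uses less (or runs slower).
  def pctChange(handCoded: Double, chisel: Double): Double =
    (chisel - handCoded) / handCoded * 100.0

  val registers: Double = pctChange(264.0, 227.0)
  val alms: Double      = pctChange(239.0, 275.0)
  val fmax: Double      = pctChange(160.64, 201.37)

  def main(args: Array[String]): Unit = {
    println(f"Registers: $registers%.1f%%")
    println(f"ALMs:      $alms%.1f%%")
    println(f"Fmax:      $fmax%.1f%%")
  }
}
```

Using the hand-coded numbers as the denominator keeps all three comparisons on the same footing.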

Now, just for fun, here is the design running on the Cyclone V prototype board.

Stay tuned for more details on what I'm learning about Chisel constructs. And, you can find the code used in my experiments here: https://github.com/mballance/wb_sys_ip

Saturday, August 19, 2017

Chisel Sharpening: If it's not tested, it's broken

"If it's not tested, it's broken."

-- Bruce Eckel

I'm a big believer in the quote above, and cite it somewhat frequently -- perhaps to the tedium of my colleagues. In my last post, I showed a Chisel3-based description of a Wishbone interconnect. While it might have looked cool, without tests I had no idea whether it worked correctly or not. After adding a simple UVM testbench around my interconnect I can, yet again, confirm the truth of the quote above. But, enough pontificating, let's dig into the specifics.

Error Types

Especially when implementing something while learning a new language or technique, I find that I make four types of errors:

Implementation errors and oversights - These are your run-of-the-mill bugs. For example, I neglected to implement an intended feature, or I implemented the logic incorrectly. Good planning helps to minimize these errors, but they are why we place such value on good verification.
Errors in description - These are learning mistakes related to the new language or technique. I structured a description around my understanding of the language/technique, only to find that it resulted in unexpected behavior. The ease or difficulty in avoiding and/or diagnosing errors in description is a key determinant for me in deciding how easy a new technique is to adopt.
Errors in reuse - This type of error occurs when reusing existing IP, only to find that it functions differently from my understanding.
Tool or library issues - These are errors that you hope not to encounter, but do crop up from time to time.

Not surprisingly, I encountered the first three categories of errors while verifying my Wishbone interconnect. I did encounter one tool/library issue, but I'll get to that later...

UVM Testbench

I decided to verify a 2x4 configuration of the Wishbone interconnect. I created a very (very) basic UVM testbench around the interconnect, instantiating two Wishbone master agents and four memory target devices mapped at 0x00000000, 0x00001000, 0x00002000, and 0x00003000, respectively.

And, a very basic write/read test:

I ran this in Questa (the Altera Modelsim Starter Edition, to be precise). On the first test run, nothing worked and quite a few signals were driven to X. In order to accelerate progress, I turned on register randomization (something supported by the Verilog code generated by Chisel).

Reuse Error: Chisel Arbiter

My simple write/read test causes both masters to perform a write to the same target device as the first operation. I had assumed that the Chisel-provided arbiter would grant the output to the selected master until that master dropped its request. However, this turned out to not be the case. It took browsing the source code (for me at least) to understand the mechanism the library developer had provided for controlling the arbiter locking behavior. Once I understood that mechanism, it was very simple to customize the locking behavior.

The arbiter is such a useful and reusable construct that I'll devote a future post to it, rather than delve into the details here.

Description Error: Multiple Assignments

My original interconnect description used a nested loop structure across the masters and slaves, with a conditional assignment to propagate back the slave response.

  for (i <- 0 until p.N_SLAVES) {
    for (j <- 0 until p.N_MASTERS) {
      // ...
      
      // Propagate slave response back to active master
      when (out_arb(i).io.in(j).ready) {
          in_rsp(j) := out_rsp(i);
      } .otherwise {
          in_rsp(j).park_rsp();
      }
    }
    out_arb(i).io.out.bits.assign_req2p(io.s(i));
  }

As it turns out, this code trips over one of the corner cases of Chisel: when procedural code (e.g. a for loop) makes multiple assignments to the same signal, the last assignment is taken. In this case, that means that both masters were being fed the response from the last slave device.

Investigating description errors like these unfortunately involves digging into the generated Verilog code. This is both tedious and not quite as bad as it sounds. Chisel picks sensible names for module I/O signals, so these are easy to track. However, Chisel also generates lots of anonymously-named internal signals (e.g. _T_51) that are used to implement the logic within the module.

I don't have a concrete proposal for the Chisel authors on how to improve this situation, but I would like to think a graphical view, such as a schematic, might be helpful in relating the input Scala code to the resulting Verilog.
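
The last-assignment-wins behavior is easy to reproduce in a plain Scala model, with no Chisel involved. In the sketch below, all names (`LastConnectModel`, `Parked`, `grant`, `slaveRsp`) are hypothetical stand-ins: `grant(i)(j)` plays the role of the arbiter's ready signal for slave i and master j, and an `Int` stands in for a response bundle.

```scala
object LastConnectModel {
  val Parked = -1 // stand-in for a parked (idle) response

  // Mirrors the nested loop: each slave iteration reassigns rsp(j) in both
  // branches, so only the assignment made for the last slave survives.
  def nestedLoop(grant: Seq[Seq[Boolean]], slaveRsp: Seq[Int]): Seq[Int] = {
    val nMasters = grant.head.length
    val rsp = Array.fill(nMasters)(Parked)
    for (i <- grant.indices; j <- 0 until nMasters) {
      rsp(j) = if (grant(i)(j)) slaveRsp(i) else Parked
    }
    rsp.toList
  }

  // One selection per master: pick the slave (if any) that granted it.
  def perMaster(grant: Seq[Seq[Boolean]], slaveRsp: Seq[Int]): Seq[Int] = {
    val nMasters = grant.head.length
    (0 until nMasters).map { j =>
      grant.indices.find(i => grant(i)(j)).map(i => slaveRsp(i)).getOrElse(Parked)
    }
  }
}
```

With master 0 granted by slave 0 and master 1 granted by slave 1 in a four-slave system, `nestedLoop` parks both responses because slave 3 granted neither master, while `perMaster` returns the responses from slaves 0 and 1.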

Reuse Error: Mux Arguments

After better understanding Chisel's behavior with respect to multiple assignments, I decided that using Chisel's 1-hot Mux primitive would be the best way to handle the response data. Here I bumped into a limitation of the Mux that the Arbiter primitive allowed me to ignore: multiplexing bundles with signals of different I/O directions is not supported (and, sadly, only uncovered very late in the transformation process from Chisel to Verilog). It all makes a lot of sense once you think it through.
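
For readers who haven't met a one-hot mux before, here is a plain Scala model of the idea: each input is gated by its select bit and the gated values are OR-ed together. This is only a sketch of the concept on `Int` values, not Chisel's implementation, and `OneHotMuxModel` is a made-up name. It also hints at why mixed directions are a problem: data only flows from the inputs to the single output.

```scala
object OneHotMuxModel {
  // Gate each input with its select bit, then OR the results together.
  // Assumes at most one select is true; if several are, their values are OR-ed.
  def mux1H(sel: Seq[Boolean], in: Seq[Int]): Int =
    sel.zip(in).map { case (s, d) => if (s) d else 0 }.reduce(_ | _)
}
```

When no select is active the model returns 0, which is convenient for parking an idle response.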

Understanding this limitation drove me to redefine the I/O bundles I used to describe the Wishbone interface. It was a fairly straightforward process, and one that I'll describe in more depth in a future post on structuring I/Os for standard interfaces with Chisel.

Understanding Data Manipulation Techniques

Chisel encourages descriptions that involve collections of data. In some ways, this isn't so different from other hardware-description languages. What's different is the set of operators Chisel provides for manipulating these data collections. One early example I ran across was implementing address decode for the masters. I had arrays of target base/limit addresses, and wanted to determine which target device each master was selecting. This was very easily (and compactly) described with the following code:

  val slave_req = io.addr_base.zip(io.addr_limit).map(e => (
    io.m(i).ADR >= e._1 && io.m(i).ADR <= e._2) &&
    io.m(i).CYC && io.m(i).STB)

This code describes the following:

Combine the addr_base and addr_limit arrays into an array of (addr_base,addr_limit) tuples using the 'zip' operation
Convert this array of tuples into an array of Bool where the entry is 'true' if the target is selected
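
The same zip-then-map pattern works on ordinary Scala collections, which makes it easy to try out in a REPL. The sketch below uses `Long` addresses and `Boolean` strobes in place of Chisel types; `AddressDecodeModel` and its parameter names are illustrative only.

```scala
object AddressDecodeModel {
  // One Boolean per target: true when the address falls inside that target's
  // [base, limit] window and the master has an active request (CYC and STB).
  def slaveReq(base: Seq[Long], limit: Seq[Long],
               adr: Long, cyc: Boolean, stb: Boolean): Seq[Boolean] =
    base.zip(limit).map(e => (adr >= e._1 && adr <= e._2) && cyc && stb)
}
```

For example, with the four base addresses from the testbench and assumed limits of base + 0xFFF, address 0x1004 selects only the second target.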

These techniques also apply nicely to selecting fields from composite data structures, as shown below. In the case below, we want to determine whether a given target device is actively selected by any master.

  when (out_arb(j).io.in.map((f) => f.valid).reduceLeft(_|_)) {
    out_arb(j).io.out.bits.assign_b2(io.s(j));
  } .otherwise {
    // If no master is requesting, deactivate the slave requests
    io.s(j).park_req()
  }

In this case, the code does the following:

out_arb(j).io.in is an array of composite data going into the per-target arbiters. The map() operation selects just the 'valid' field from each array element
Then, the reduceLeft() operation performs a reduction across the array
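
Here is the same map-then-reduce shape on a plain Scala collection. `ArbIn` is a hypothetical stand-in for one arbiter input (a valid flag plus some payload); it is not a Chisel type.

```scala
object AnyValidModel {
  // Stand-in for one arbiter input: a valid flag plus a payload.
  final case class ArbIn(valid: Boolean, bits: Int)

  // Select the 'valid' field of every element, then OR them all together.
  def anyValid(ins: Seq[ArbIn]): Boolean =
    ins.map(f => f.valid).reduceLeft(_ | _)
}
```

Note that `reduceLeft` throws on an empty collection, so this shape assumes at least one input, which is always the case for an arbiter with one or more masters.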

Now, both of these operations can be described in other hardware description languages. But both likely would involve several layers of temporary data fields. It's actually really nice to be able to describe the high-level view of the manipulation to be performed, and be confident that a sensible implementation of this description will be inferred (and, after having to dig into the implementation for other reasons, I can state that the implementation is sensible).

Tool Issue: Register without Reset

I mentioned earlier that I had turned on register initial-value randomization when I first started simulations. After getting my test running correctly, I had hoped this would not be needed. However, it turns out that Chisel's Arbiter primitive contains a register without a reset value. Perhaps this hasn't created an issue for many Chisel users because the Verilator 2-state simulator is often used. However, with a 4-state simulator like Questa/Modelsim, an uninitialized register is a fatal issue that results in X propagation and a non-functioning design.
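
To see why a 2-state simulator can hide this kind of problem, consider a toy model in which an unknown value (X) is represented by `None`. The sketch below is an illustration only: it models a generic register that inverts itself every cycle, not the register inside Chisel's Arbiter, and all names are made up.

```scala
object FourStateModel {
  // None models X (unknown); Some(b) models a known 0 or 1.
  type Logic = Option[Boolean]

  // Inverting an unknown value yields an unknown value.
  def not(a: Logic): Logic = a.map(v => !v)

  // A register whose next value is the inverse of its current value,
  // stepped for the given number of clock cycles.
  def toggle(init: Logic, cycles: Int): Logic =
    (0 until cycles).foldLeft(init)((q, _) => not(q))
}
```

Starting from `None`, the register never leaves X no matter how many cycles pass. Starting from a known value, as a 2-state simulator or register randomization effectively does, it behaves normally.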

I strongly recommend using registers that are reset, and will provide this feedback to the Chisel team.

Conclusions

I'd class many of the issues I faced as all part of the learning curve for a new tool or technique. Challenging (and sometimes time-consuming) to surmount, perhaps, but issues I'd likely not face in the future. I've also gained a new-found appreciation for the descriptive power that Chisel's support for Scala's collection-manipulation operators brings to the description of hardware.

For now, I have a working Wishbone interconnect described with Chisel. And, despite a few hiccups along the way, I'm still feeling pretty good about the expressive power that Chisel brings to hardware description.

Next, I'm curious to see how synthesis results compare for a hand-coded Wishbone interconnect and the Chisel-generated one.

As always, you can find the source for my experiments here:

https://github.com/mballance/wb_sys_ip.git

Wednesday, August 2, 2017

Chisel Sharpening: Initial impressions

Recently, RiscV (https://riscv.org/) has been all the rage. The free (as in speech and beer) instruction-set architecture has experienced an explosion of interest in the last couple of years, after being more-or-less an academic curiosity since 2010 or so.

This post isn't about RiscV, though. It's about Chisel (https://github.com/freechipsproject/chisel3/wiki), the design language that the UC Berkeley team working on RiscV uses to implement Rocket Chip, their proof-of-concept RiscV implementation. Claims of the productivity benefits of using Chisel for hardware design are substantial. For example, one data point compares two implementations of a RiscV architecture -- one using Chisel and one in Verilog (https://riscv.org/wp-content/uploads/2015/01/riscv-chisel-tutorial-bootcamp-jan2015.pdf).

3x fewer lines of code -- even (but maybe especially) if the savings was in wiring -- sounds pretty good to me! So, I decided a while ago that I should dig in and learn more about Chisel.

So, what's Chisel?

Chisel certainly isn't a one-for-one replacement for a hardware description language (HDL) like Verilog or VHDL. Chisel is a class library, written in Scala (https://www.scala-lang.org/). Descriptions written using that class library are compiled, executed, and converted to Verilog. Now, your first thought when a high-level language like Scala is mentioned might be "High-Level Synthesis", but that's not what Chisel is all about. Chisel is very much focused on RTL (register-transfer level) design.

A learning project

I learn almost everything by doing, so I decided to assign myself a learning project to see if I could make Chisel go. A few years back, I coded a parameterized Wishbone interconnect in SystemVerilog. I was curious to see how a similar project coded in Chisel would compare.

All about the interfaces

The first thing to do is to capture a Wishbone interface. As with many hardware interfaces, Wishbone is parameterized with address, data, and tag widths. Quite sensibly, Chisel provides a class for specifying reusable collections of signals called Bundle. Below, you can see the code used to declare the parameters for a Wishbone interface, and the Bundle (WishboneMaster) that specifies the signals for a Master interface.

import chisel3._

class WishboneParameters (
    val ADDR_WIDTH : Int=32,
    val DATA_WIDTH : Int=32,
    val TGA_WIDTH : Int=1,
    val TGD_WIDTH : Int=1,
    val TGC_WIDTH : Int=1) {

  def cloneType() = (new WishboneParameters(ADDR_WIDTH, DATA_WIDTH,
      TGA_WIDTH, TGD_WIDTH, TGC_WIDTH)).asInstanceOf[this.type]
}

class WishboneMaster(val p : WishboneParameters) extends Bundle {
  val ADR = Output(UInt(p.ADDR_WIDTH.W));
  val TGA = Output(UInt(p.TGA_WIDTH.W));
  val CTI = Output(UInt(3.W));
  val BTE = Output(UInt(2.W));
  val DAT_W = Output(UInt(p.DATA_WIDTH.W));
  val TGD_W = Output(UInt(p.TGD_WIDTH.W));
  val DAT_R = Input(UInt(p.DATA_WIDTH.W));
  val TGD_R = Input(UInt(p.TGD_WIDTH.W));
  val CYC = Output(Bool());
  val TGC = Output(UInt(p.TGC_WIDTH.W));
  val ERR = Input(Bool());
  // One byte-select bit per data byte; the width needs .W like the other fields
  val SEL = Output(UInt((p.DATA_WIDTH/8).W));
  val STB = Output(Bool());
  val ACK = Input(Bool());
  val WE = Output(Bool());

  // ...
}

Coding up the interconnect

As with many learning projects, the Wishbone interconnect involved a series of iterations. Eventually, I arrived at the code below, broken up a bit to support comments:

class WishboneInterconnectParameters(
    val N_MASTERS : Int=1,
    val N_SLAVES : Int=1,
    val wb_p : WishboneParameters) {
}

class WishboneInterconnect(
    val p : WishboneInterconnectParameters,
    val typename : String = "WishboneInterconnect") extends Module {

  val io = IO(new Bundle {
    val addr_base = Input(Vec(p.N_SLAVES, UInt(p.wb_p.ADDR_WIDTH.W)))
    val addr_limit = Input(Vec(p.N_SLAVES, UInt(p.wb_p.ADDR_WIDTH.W)))
    val m = Vec(p.N_MASTERS, Flipped(new WishboneMaster(p.wb_p)))
    val s = Vec(p.N_SLAVES, new WishboneMaster(p.wb_p))
  });

  override def desiredName() : String = typename;

Here's the interface declaration. Note that we have vectors of address base and limit for address decode, then vectors of master and slave interfaces. Note the 'Flipped' method that reverses the Input/Output direction of elements within a bundle.

Now, what we're building is effectively shown below. Each slave interface has an associated arbiter that is connected to all masters. Fortunately, Chisel provides an Arbiter as a built-in element of the class library.

  val in_rsp = Seq.fill(p.N_MASTERS) ( Wire(new WishboneMaster(p.wb_p) ))
  val out_rsp = Seq.fill(p.N_SLAVES) ( Wire(new WishboneMaster(p.wb_p) ))

  for (i <- 0 until p.N_MASTERS) {
    // Drive back to master
    in_rsp(i).assign_rsp2p(io.m(i));
  }

  val out_arb = Seq.fill(p.N_SLAVES) ( Module(new RRArbiter(
      new WishboneMaster(p.wb_p), p.N_MASTERS)) )

This code creates a couple of temp arrays for routing the response back from the slave to the master, as well as an array of per-slave interface arbiters.

  // For each slave, hook up all masters
  for (i <- 0 until p.N_SLAVES) {
    for (j <- 0 until p.N_MASTERS) {
      val m_sel = io.addr_base.indexWhere((p:UInt) => (io.m(j).ADR >= p))
      val m_ex = (io.addr_base.exists((p:UInt) => (io.m(j).ADR >= p)) &&
          io.addr_limit.exists((p:UInt) => (io.m(j).ADR <= p)));

      out_arb(i).io.in(j).bits.assign_p2req(io.m(j))

      when (m_ex && m_sel === i.asUInt()) {
        out_arb(i).io.in(j).valid := Bool(true);
      } .otherwise {
        out_arb(i).io.in(j).valid := Bool(false);
      }

      // Propagate slave response back to active master
      when (out_arb(i).io.in(j).ready /* out_arb(i).io.in(j).valid &&
          out_arb(i).io.chosen === j.asUInt() */) {
        in_rsp(j) := out_rsp(i);
      } .otherwise {
        in_rsp(j).park_rsp();
      }

      out_arb(i).io.out.bits.assign_req2p(io.s(i));
    }
  } // end of per-slave loop
} // end of WishboneInterconnect

Finally, we do the address decode to determine which slave a master's request address selects, and connect everything up to the arbiters. Note how simple it is to query the base/limit address arrays! The 'm_ex' field is true if the master is selecting a valid slave, while the 'm_sel' field holds the target index.
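
If the collection queries are unfamiliar, they can be tried on plain Scala sequences first. The sketch below is an assumption about the intent rather than a copy of the code above: it tests base and limit together for each window, so the index it returns is the window that actually contains the address. Names such as `WindowLookupModel` are hypothetical, and on a plain `Seq` `indexWhere` returns -1 when nothing matches, which is not how a hardware `Vec` reports a miss.

```scala
object WindowLookupModel {
  // Returns (hit, index): hit is true when some [base, limit] window contains
  // the address; index is the first such window, or -1 when there is none.
  def select(base: Seq[Long], limit: Seq[Long], adr: Long): (Boolean, Int) = {
    val windows = base.zip(limit)
    val inWindow = (w: (Long, Long)) => adr >= w._1 && adr <= w._2
    (windows.exists(inWindow), windows.indexWhere(inWindow))
  }
}
```

Checking both bounds in a single predicate avoids matching the first window whose base happens to be at or below the address.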

Generating RTL

One of the things I spent far too much time on was finding out how to generate Verilog from my Chisel description. Turns out the incantation is quite simple once you know what it is:

object WishboneInterconnectDriver extends App {
  var N_MASTERS = 2;
  var N_SLAVES = 4;
  var ADDR_WIDTH = 32;
  var DATA_WIDTH = 32;

  var typename = "wishbone_ic_%d_%d_%dx%d".format(
      ADDR_WIDTH, DATA_WIDTH, N_MASTERS, N_SLAVES);

  chisel3.Driver.execute(args, () => new WishboneInterconnect(
      new WishboneInterconnectParameters(N_MASTERS, N_SLAVES,
          wb_p=new WishboneParameters(ADDR_WIDTH, DATA_WIDTH)
      ), typename)
  )
}

The code above calls the Chisel 'Driver', passing in an instance of the WishboneInterconnect class. The result of running this code is a set of files, one of which is the Verilog RTL. The output RTL is somewhat low-level -- and 1656 lines long (!). When it comes to debugging this, I'll be interested to see how much this gets in the way. But, it's all sensible RTL at the end of the day...

Results

Okay, so the hand-coded SystemVerilog interconnect took a total of 326 lines of SystemVerilog code. But, 120 of those were the per-slave arbiter. If we ignore those lines, we have 206 lines of SystemVerilog. The Chisel description is 56 lines of code. So, roughly 3.7x to 5.8x less code (206/56 and 326/56), depending on whether you ignore or count the arbiter implementation. Not bad, and I'm definitely feeling more comfortable with Chisel after working through an example like this.

If you're interested, you can find the complete code on GitHub:

https://github.com/mballance/wb_sys_ip

This repository contains both the hand-coded SystemVerilog and the Chisel representation.

So, what did we learn?

Well, initial experiments certainly seem to bear out the productivity benefits of Chisel. Library elements, such as the arbiter module, are a great productivity boost! Array operations raise the abstraction level.

Figuring out the basics can be a bit challenging. Even figuring out how to run the conversion to Verilog process took some digging. Because Chisel is embedded in another language, semantic errors tend to show up as Java exception errors, rather than nice high-level error messages.

So, thus far, some good and some bad. Over the next couple of posts, I plan to dig into a couple of other areas of comparison -- including verification of the RTL, and how efficiently hand-coded and Chisel-generated implementations synthesize. So, stay tuned for more.

Have you experimented at all with Chisel? Or with other HDL alternatives, for that matter? What has your experience been?