Bits, Bytes, and Gates: RISC-V

Showing posts with label RISC-V. Show all posts

Sunday, March 28, 2021

SoC Integration Testing: Hw/Sw Test Coordination (Part 1)

IP- and subsystem-level testbenches are quite monolithic. There is a single entity (the testbench) that applies stimulus to the design, collects metrics, and checks results. In contrast, an SoC-level testbench is composed of at least two islands: the software running on the design’s processor and the external testbench connected to the design interfaces. Efficiently developing SoC tests involving both islands requires the ability to easily and efficiently coordinate their activity.

There are a two times when it’s imperative that the behavior of the test island(s) inside the design and the test island outside the design are coordinated – specifically, the beginning and end of the test when all islands must be in agreement. But, there are many other points in time where it is advantageous to be able communicate between the test islands.

Especially when running in simulation, the ability to efficiently pass debug information from software out to the test harness dramatically speeds debug.

It’s often useful to collect metrics on what’s happening in the software environment during test – think of this as functional coverage for software.

Verifying our design requires applying external stimulus to prove that the design (including firmware) reacts appropriately. This requires the ability to coordinate between initiating traffic on external interfaces and running firmware on the design processors to react – another excellent application of hardware/software coordination.

Often, checking results consumes a particularly-large portion of the software-test’s time. The ability to offload this to the test harness (which runs on the host server) can shorten our simulation times significantly.

Key Care-Abouts

When it comes to our key requirements for communication, one of the biggest is efficiency – at least while we’re in simulation. The key metric being how many clock cycles it takes to transfer data from software to testbench. When we look at a simulation log, we want to see most activity (and simulation time) focused on actually testing our SoC, and not on sending debug messages back to the test harness. A mechanism with a low overhead will allow us to collect more debug data, check more results, and generally have more flexibility and freedom in transferring data between the two islands.

Non-Invasive

One approach to efficiency is to use custom hardware for communication. Currently, though this may change, building the communication path into the design seems to be disfavored. So, having the communication path be non-invasive is a big plus.

Portable

Designs, of course, don’t stay in simulation forever. The end goal is to run them in emulation and prototyping for performance validation, then eventually on real silicon where validation continues -- just at much higher execution speed. Ideally, our communication path will be portable across these changes in environment. The low-level transport may change – for example, we may move from a shared-memory mailbox to using an external interface – but we shouldn’t need to fundamentally change our embedded software tests or the test behavior running on the test harness.

Scalable

A key consideration – which really has nothing to do with the communication medium at all – is how scalable the solution is in general. How much work is required to add a piece of data (message, function, etc) that will be communicated? How much specialized expertise is required? The simpler the process is to incrementally enhance the data communicated, the greater the likelihood that it will be used.

Current Approaches

Of the approaches that I’ve seen in use, most involve either software-accessible memory or the use of an existing external interface as the transport mechanism between software and the external test harness. In fact, one of the earliest cases of hardware/software interaction that I used was the Arm Trickbox – a memory-mapped special-purpose hardware device that supported sending messages to the simulation transcript and terminating the test, among other actions.

In both of these cases, some amount of code will run on the processor to format messages and put them in the mailbox or send them via the interface.

Challenges

Using a memory-based communication is generally possible in a simulation-based environment – provided we can snoop writes to memory, and/or read memory contents directly from the test harness. That doesn’t mean that memory-based communication is efficient, though, and in simulation, we care a lot about efficiency due to the speed of hardware simulators.

Our first challenge comes from the fact that all data coming from the software environment needs to be copied from its original location in memory into the shared-memory mailbox. This is because the test harness only has access to portions of the address space, and generally can’t piece together data stored in caches. The result is that we have to copy all data sent from software to the test harness out to main (non-cached) memory. Accessing main memory is slow, and thus communication between software and the test harness significantly lengthens our simulations.

Our second challenge comes from the fact that the mailbox is likely to be smaller than the largest message we wish to send. This means that our libraries on both sides of the mailbox need to manage synchronizing data transmission with available space in the mailbox. This means that one of the first tasks we need to undertake when bringing up our SoC is to test the communication path between software and test harness.

A final challenge, which really ought not to be a challenge, is that we’ll often end up custom-developing the communication mechanism since there aren’t readily-available reusable libraries that we can easily deploy. More about that later.

Making use of Execution Trace

In a previous post, I wrote about using processor-execution trace for enhanced debug. I've also used processor trace as a simple way to detect test termination. For example, here is the Python test-harness code that terminates the test when one of 'test_pass' or 'test_fail' are invoked:

In order to support test-result checking, the processor-execution trace BFM has the ability to track both the register state and memory state as execution proceeds.

The memory mirror is a sparse memory model that contains only the data that the core is actively using. It's initialized from the software image loaded into simulation memory, and updated when the core performs a write. The memory mirror provides the view of memory from the processor core's perspective -- in other words, pre-cache.

Our test harness has access to the processor core's view of register values and memory content at the point that a function is called. As it turns out, we can build on this to create a very efficient way to transferring data from software to the test harness.

In order to access the value of function parameters, we need to know the calling convention for our processor core. Here's the table describing register usage in the RISC-V calling convention:

Note that x10-17 are used to pass the first eight function arguments.

Creating Abstraction

We could, of course, directly access registers and memory from our test-harness code to get the value of function parameters. But, a little abstraction will help us out in the long run.

The architecture-independent core-debug BFM defines a class API for accessing the value of function parameters. This is very similar to the varadic-argument API used in C programming:

Now, we just need to implement a RISC-V specific version of this API in order to simplify accessing function parameter values:

Here's how we use this implementation. Assume we have a embedded-software function like this:

When we detect that this function has been called, we can access the value of the string passed to the function from the test harness like this:

Advantages

There are several advantages to using a trace-driven approach to data communication between processor core and test harness. Because the trace BFM sees the processor's view of memory, there's no need to (slowly) copy data out to main memory in order for the test harness to see it. This allows data to stay in caches and avoids unnecessary copying.

Perhaps more importantly, our trace-based communication mechanism allow us to offload data processing to the test harness. Take, for example, the very-common debug printf:

The user passes a format string and then a variable number of arguments that will all be converted to string representations that can be displayed. If our communication mechanism is an in-memory mailbox or external interface, we need to perform the string formatting on the design's processor core. If, however, we use the trace-based mechanism for communication, the string formatting can all be done by the test harness in zero simulation time. This allows us to keep our simulations shorter and more-focused on the test at hand, while maximizing the debug and metrics data we collect.

Next Steps

SoC integration tests are distributed tests carried out by islands of test behavior running on the processor(s) and on the test harness controlling the external interfaces. Testing more-interesting scenarios requires coordinating these islands of test functionality.

In this post, we’ve looked at using execution-trace to implement a high-efficiency mechanism for communicating from embedded test software back to the test harness. While this mechanism is mostly-specific to simulation, it has the advantage of simplifying communication, debug, and metrics collection at this early phase of integration testing when, arguably, we most-need a high degree of visibility.

While we have an efficient mechanism, we don’t yet has a mechanism that makes it easy to add new APIs (scalable) nor a mechanism that is easily portable to environments that need to use a different transport mechanism.

In the next post, we’ll have a look at putting some structure and abstraction around communication that will help with both of these points.

References

RISC-V Calling Conventions (ABI) – https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
pybfms-core-debug-common – https://github.com/pybfms/pybfms-core-debug-common
pybfms-riscv-debug – https://github.com/pybfms/pybfms_riscv_debug
pyhvl-rpc -- https://github.com/fvutils/pyhvl-rpc

Disclaimer

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Sunday, June 16, 2019

Py-HPI: Applying Python for Verification

Intro

In my last post, I talked about a prototype procedural interface between Python and HDL that enables cross-calling between Python and SystemVerilog. My primary motivation for investigating a procedural interface was its potential to maximize performance. In this post, I create a Python testbench for a small IP and compare it to the equivalent C++ testbench. I also look at the performance of Python for verification.

Creating a Python Testbench

My go-to IP for trying out new verification techniques is a small 32-bit RISC-V core named Featherweight RISC (FWRISC) that I created for a design contest last year. The original testbench was written in C++, so that will be my baseline for comparison. If you're interested in the structure of the testbench, have a look at this post.

Since I was keeping the testbench structure the same, I didn't expect much in terms of a reduction in lines of code. C++ is a bit verbose, in that it expects a header and implementation file for each class. This contributes to the fact that each C++ test is roughly twice as long as each Python test:

C++ Test: 328 lines
Python Test: 139 lines

Reducing the lines of code is a good thing, since more code statistically means more bugs, and spending time finding and fixing testbench bugs doesn't help us get our design verified. But, that's just the start.

The unit tests for FWRISC are all self-checking. This means that each assembly file contains the expected value for registers modified by the test. You can see the data embedded below between the start_expected and end_expected labels.


	entry:
	li x1, 5
	add x3, x1, 6
	j done

	// Expected value for registers
	start_expected:
	.word 1, 5
	.word 3, 11
	end_expected:

Because I didn't want to need to install an ELF-reading library on every machine where I wanted to run the FWRISC regression, I wrote my own small ELF-reading classes for the FWRISC testbench. This amounted to ~400 lines of code, and required a certain amount of thought and effort.

When I started writing the Python testbench, I thought about writing another ELF-reader in Python based on the code I'd written in C++... But then I realized that there was already a Python library for doing this called pyelftools. All I needed to do was get it installed in my environment (more on that in a future post), and call the API:

with open(sw_image, "rb") as f:
	elffile = ELFFile(f)

	symtab = elffile.get_section_by_name('.symtab')
	start_expected = symtab.get_symbol_by_name("start_expected")[0]["st_value"]
	end_expected = symtab.get_symbol_by_name("end_expected")[0]["st_value"]

	section = None
	for i in range(elffile.num_sections()):
	shdr = elffile._get_section_header(i)
	if (start_expected >= shdr['sh_addr']) and (end_expected <= (shdr['sh_addr'] + shdr['sh_size'])):
	start_expected -= shdr['sh_addr']
	end_expected -= shdr['sh_addr']
	section = elffile.get_section(i)
	break

	data = section.data()

That's a pretty significant savings both in terms of code, and in terms of development and debug effort! So, definitely my Python testbench is looking pretty good in terms of productivity. But, what about performance?

Evaluating Performance

Testbench performance may not be the most important factor when evaluating a language for use in verification. In general, the time an engineer takes to develop, debug, and maintain a verification environment is far more expensive than the compute time taken to execute tests. That said, understanding that performance characteristics of any language enables us to make smarter tradeoffs in how we use the language.

I was fortunate enough to see David Patterson deliver his keynote A New Golden Age for Computer Architecture around a year ago at DAC 2018. The slide above comes from that presentation, and compares the performance of a variety of implementations of the computationally-intensive matrix multiply operation. As you can see from the slide, a C implementation is 50x faster than a Python implementation. Based on this slide and the anecdotal evidence of others, my pre-existing expectations were somewhat low when it came to Python performance. But, of course, having concrete data specific to functional verification is far more useful than a few anecdotes and rumors.

Spoiler alert: C++ is definitively faster than Python.

As with most languages, there are two aspects of performance to consider with Python: startup time and steady-state performance. Most of the FWRISC tests are quite short -- in fact, the suite of unit tests contains tests that execute less than 10 instructions.This gives us a good way to evaluate the startup overhead of Python. In order to evaluate the steady-state performance, I created a program that ran a tight loop with 10,000,000 instructions. The performance numbers below all come from Verilator-based simulations.

Startup Overhead
As I noted above, I evaluated the startup overhead of Python using the unit test suite. This suite contains 66 very short tests.

C++ Testbench: 7s
Python Testbench: 18s

Based on the numbers above, Python does impose a noticeable overhead on the test suite -- it takes ~2.5x longer to run the suite with Python vs C++. That said, 18 seconds is still very reasonable to run a suite of smoke tests.

Steady-State Overhead
To evaluate the steady-state overhead of a Python testbench, I ran a long-loop test that ran a total of 10,000,000 instructions.

C++ Testbench: 11.6s
Python Testbench: 109.7s

Okay, this doesn't look so good. Our C++ testbench is 9.45x faster than our Python testbench. What do we do about this?

Adapting to Python's Performance

Initially, the FWRISC testbench didn't worry much about interaction between the design and testbench. The fwrisc_tracer BFM called the testbench on each executed instruction, register write, and memory access. This was, of course, simple. But, was it really necessary?

Actually, in most cases, the testbench only needs to be aware of the results of a simulation, or key events across the simulation. Given the cost of calling Python, I made a few optimizations to the frequency of events sent to the testbench:

Maintain the register state in the tracer BFM, instead of calling the testbench every time a write occurs. The testbench can read back the register state at the end of the test as needed.
Notify the testbench when a long-jump or jump-link instruction occurs, instead of on every instruction. This allows the testbench to detect end-of-test conditions and minimizes the frequency of calls

With these two enhancements to both the C++ and Python testbenches, I re-ran the long-loop test and got new results:

C++ Testbench: 4s
Python Testbench: 5s

Notice that the C++ results have improved as well. My interpretation of these results is that most of the time is now spent by Verilator in simulating the design, and the results are more-or-less identical.

Conclusions

The Python ecosystem brings definite benefits when applying Python for functional verification. The existing ecosystem of available libraries, and the infrastructure to easily access them, simplifies the effort needed to reuse existing code. It also minimizes the burden placed on users that want to try out an open source project that uses Python for verification.

Using Python does come with performance overhead. This means that it's more important to consider how the execution of the testbench relates to execution of the design. A testbench that interacts with the design frequently (eg every clock) will impose much greater overhead compared to a testbench that interacts with the design every 100 or 1000 cycles. There are typically many optimization opportunities that minimize the performance overhead of a Python testbench, while not adversely impacting verification results.

It's important to remember that engineer time is much more expensive than compute time, so making engineers more productive wins every time. So, from my perspective, the real question isn't whether C++ is faster than Python. The real questions are whether Python is sufficiently fast to be useful, and whether there are reasonable approaches to dealing with the performance bottlenecks. Based on my experience, the answer is a resounding Yes.

Disclaimer

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 15, 2018

FWRISC: Creating a Unit-test Safety Net

When developing software, I've become very comfortable with test-driven development -- a methodology that calls for tests to be developed along with, or even before, functionality. It's quite common for me to develop a test first, which obviously fails initially, and implement functionality until the test passes.
I have typically approached hardware verification from a very different perspective. When doing verification, I've typically started with a hardware block that is largely functional, or at least is believed to be largely functional. My work then begins with test plan development and some detective work to ferret out the potentially-problematic areas of the design that require targeted tests.

When developing the FWRISC RISC-V core, I initially started off taking a verification approach. One of the requirements for the RISC-V contest was that cores must pass the RISC-V Compliance Tests. So, I started off by attempting to run one of the compliance tests. I quickly realized that there were some challenges with this approach:

Even for testing a simple feature, the RISC-V compliance tests are quite complicated. All involve exception instructions. All involve multiple instructions that are unrelated to the instruction that is ostensibly the target of the test. This isn't uncommon for verification, of course.
Taking a verification approach early in the development cycle means that debugging is incredibly painful. Tracking down a small issue with the core by looking at the test result of a test that consists of a hundred or so instructions is incredibly painful -- just see the waveform at the top of this blog post which comes from the test for the 'add' instruction.
In contrast, see the following waveform that shows the entire 'add' unit test. While it's longer than a single instruction, this test consists of a total of 6 instructions. This means that there's much less to look at when a problem needs to be debugged.

Creating a Testbench

My verification background is SystemVerilog and UVM, so one early complication I faced when developing tests for the Featherweight RISC core was that the RISC-V soft-core context explicitly required the use of Verilator as a simulator. Verilator is, in some senses, a simulator for the synthesizable subset of SystemVerilog. In other senses, it's a Verilog to C++ translator that can be used as to create a C++ version of a synthesizable Verilog description to bind in to a C++ program. Compared to a simulator, think of it, not as a house, but as a pile of lumber, tools, and a blueprint from which you can build a house (which, by the way, is in no way an attempt to minimize the value of Verilator -- just point out some of the required legwork).

The fact that Verilator produces a C++ model of the SystemVerilog RTL brought to mind the unit-testing library that I invariably use when testing C++ code. Googletest is a very handy library for organizing and executing a suite of C++ unit tests:

Collects and categorizes tests
Enables common functionality to be centralized across tests
Provides result-checking macros/assertions
Executes an entire test suite, a subset, or a single test
Reports test status

In addition to using Googletest for standard C++ applications, I had used the Googletest framework to manage some early device firmware written in C when testing it against Verilog RTL being verified in a UVM testbench. This approach seemed close enough to what I was looking to do with Featherweight RISC and Verilator that I ended up extending that project (named googletest-hdl) to support Verilator. While the contest required the use of Verilator, I wanted to setup my basic test suite such that it could run with other simulators. Since the googletest-hdl project already supported running with SystemVerilog and UVM, I was covered there.

The Featherweight RISC testbench block diagram is shown below:

The design is instantiated in the HDL portion of the testbench. This testbench portion will run in Verilator or another Verilog simulator. The Googletest portion of the testbench contains the test suite and code needed to check test results. There are two points at which the environments communicate: the run-control path by which the Googletest environment instructs the HDL environment to run, and the tracer API path by which events are sent from the processor to the testbench environment.

The tracer API path is the key to checking test results. The following events are sent to the Googletest environment as SystemVerilog DPI calls:

Instruction-execution event
Register-write event
Data-access event

Now that we have this testbench structure, we can create some tests.

Creating Tests

When it comes to unit tests, I usually find that the initial test structure is successively refined as I discover more and more commonality between the tests. Frankly, the ideal unit test is almost-entirely data driven, and the Featherweight RISC tests are very close to this ideal.

The unique aspects of the instruction unit tests are stored in the test files themselves. Here is test for the add instruction:

Note the instruction sequence to load registers with literals and perform the 'add'. This is the actual test.
The data between 'start_expected' and 'end_expected' contains the registers that are expected to be written. The test harness will read this data from the compiled test.

This data-driven test allows our test harness to be fairly simple and completely data driven, as shown below:

First, the test harness executes the simulation by calling GoogletestHdl::run()
Next, the test harness reads in the compiled test file, which has been specified with the +SW_IMAGE=<path> option
The start_expected..end_expected range is located and read in
Finally, the register contents are checked against the expected values specified in the test.

Results

After getting the unit-test structure in place, I almost-literally went down the list of RISC-V RV32I instructions and created a test for each as I implemented the instruction. The completed tests provided a nice safety net for catching issues introduced as new instructions were implemented, and as bugs were corrected. More issues were found as the RISC-V compliance tests were brought up and these issues prompted creation of new unit tests.

The unit-test safety net has been incredibly helpful in quickly identifying regressions in the processor implementation, and helping to identify exactly which instructions are impacted. The unit-test experience also got me thinking about formal verification, and whether this could also be used as a form of unit-test suite. I'm getting ready to take another pass at shrinking the area required for the Featherweight RISC-V, and am thinking about creating a Formal unit-test suite to help in catching bugs introduced by that work. I'll talk about those experiments in a future post.

For now, I have a suite of 67 unit tests and 55 RISC-V compliance tests that provide a very good safety net to catch regressions in the Featherweight RISC implementation.

Disclaimer

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 8, 2018

FWRISC: Sizing up the RISC-V Architecture

After deciding on October 22nd to create a RISC-V implementation to enter in the 2018 RISC-V soft-core contest (with entries due November 26th), I needed to gather more information of the RISC-V ISA in general, and the RV32I subset of the ISA specifically. I had previously done some work in RISC-V assembly -- mostly writing boot code, interrupt handlers, and thread-management code. But I certainly hadn't explored the full ISA, and certainly not from the perspective of implementing it. Bottom line, I needed a better understanding of the ISA I needed to implement.

Fundamentals of the RISC-V ISA
The first thing to understand about the RISC-V architecture is that it came from academia. If you took a computer architecture course and read the Patterson and Hennesy book, you read about some aspects of one of the RISC-X family of instruction sets (RISC-V is, quite literally, the 5th iteration of the RISC architecture developed at UC Berkeley).

Due in part to its academic background, the ISA has both been extended and refined (restricted) over time -- sometimes in significant and sometimes in insignificant ways. This ability to both extend and change the ISA is somewhat unique when it comes to instruction sets. I'm sure many of you reading this are well-aware of some of the baggage still hanging around in the x86 instruction set (string-manipulation instructions, for example). While many internal protocols, such as the AMBA bus protocol, often take a path of complex early specification versions followed by simpler follow-on versions, instruction set architectures often remain more fixed. In my opinion, the fact that the RISC-V ISA had a longer time to incubate in a context that did not penalize backwards-incompatible changes has resulted in an architecture that is cleaner and easier to implement.

The RISC-V ISA is actually a base instruction-set architecture, and a family of extensions. The RV32I (32-bit integer) instruction set forms the core of the instruction-set architecture. Extensions add on capabilities such as multiply and divide, floating-point instructions, compressed instructions, and atomic instructions. Having this modular structure defined is very helpful in enabling a variety of implementations, while maintaining a single compiler toolchain that understands how to create code for a variety of implementations.
The RISC-V soft-core contest called for an RV32I implementation, though implementations could choose to include other extensions. The RV32I instruction set is actually very simple -- much simpler than other ISAs I've looked at in the past:

32 32-bit general-purpose registers
Integer add, subtract, and logical-manipulation instructions
Control-flow instructions
Load/store instructions
Exceptions, caused by a system-call instruction and address misalignment
Control and status registers (CSRs)
Cycle and instruction-counting registers
Interestingly enough, interrupts are not required

In total, the instruction-set specification states that there are 47 instructions. I consider the RV32I subset to actually contain 48 instructions, since ERET (return from exception) is effectively required by most RV32I software, despite the fact that it isn't formally included in the RV32I subset.

On inspection, the instruction-set encoding seemed fairly straightforward. So, where were the implementation challenges?

CSR manipulation seemed a bit tricky in terms of atomic operations to read the current CSR value, while clearing/setting bits.
Exceptions always pose interesting challenges
The performance counters pose a size challenge, since they don't nicely fit in FPGA-friendly memory blocks

Despite the challenges, the RV32I architectural subset is quite small and simple. This simplicity, in my opinion, is the primary reason it was possible for me to create an implementation in a month of my spare time.

Implementation Game Plan
For a couple or reasons, I elected to use a simple approach to implementation of the RISC-V ISA. First, the deadline for the contest was very close and I wanted to be sure to actually have an entry. Secondly, my thinking was that a simple implementation would result in a smaller implementation.
Since I was interested in evolving Featherweight RISC after the contest, a second-level goal with the initial implementation was to build and prove-out a test suite that could be used to validate later enhancements.

The implementation approach I settled on was state-machine based -- not the standard RISC pipelined architecture. Given that I was targeting an FPGA, I also planned to move as many registers as possible to memory blocks.

Next Steps
With those decisions made, I was off create an implementation of the RISC-V RV32I instruction set! In my next post, I'll discuss the test-driven development approach I took to implementing the Featherweight-RISC core.

Disclaimer

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.

Saturday, December 1, 2018

FWRISC: Designing an FPGA-friendly Core in 30 Days

Designing a processor is often considered to be a large and complex undertaking, so how did I decide to design and implement in a month? For a few reasons, really. For one, my background is in hardware design, despite having worked in the EDA (Electronic Design Automation) software industry for many years. The last time I did a full hardware design was quite a few years ago using an i386EX embedded processor and other packaged ICs. Recently, though, I've been looking for opportunities to brush up on my digital-design skills. The primary reason, however, was that I saw the call for contestants in the 2018 RISC-V soft-core processor contest. I've found contests to be a fun way to learn because the organizers' criteria often cause me to learn something I otherwise wouldn't have thought to investigate. This contest was certainly no different!

The 2018 RISC-V contest certainly had some unique criteria. The contest required that verification be done using the Verilator "simulator", an open-source Verilog to C++ translator that is very fast and powerful, but also has some interesting quirks. Also required was support for Zephyr, a real-time operating system (RTOS) that I certainly wasn't aware of before the contest. Most interesting to me, though, was the contest category for smallest RISC-V FPGA implementation.

Small, you say?

When thinking about processor design, I often think about maximizing performance. However, there are many applications -- especially in the IoT space -- where having a small amount of processing power that requires little resources is very important. Often these applications are dominated today by older processor architectures, such as the venerable 8051. Despite it's somewhat-small size in an FPGA implementation, the 8051 processor isn't terribly friendly to C compilers, and is very slow. What if a modern architecture, such as the RISC-V ISA, could take the place of these older architectures while matching, or even improving, on their small size?

Despite seeing the value of having small RISC-V implementations, my first reaction when seeing the contest announcement was puzzlement. Weren't there already several small RISC-V implementations? Well, as it turns out, yes and no. There were several existing small implementations. However, the ones I found were not truly compliant with the RV32I architecture specification. The tradeoffs taken were often taken to reduce the implementation size by removing features that required resources, but were not needed for the author's intended application. These tradeoffs often meant that a special compiler toolchain was needed, or that users needed to be cautious when attempting to reuse existing software written for the RISC-V ISA.

Results?

Well, bottom line, I was able to design, verify, and implement a 32-bit RV32I RISC-V core in 30 days, and you can find the code on GitHub. A netlist of the design is shown at the beginning of this post. Early results are quite promising with respect to the balance between performance and size, and there are several known areas for improvement. Through the process, I've learned a lot -- rediscovering RTL design, gaining a much deeper appreciation of the RISC-V ISA, and learning about new tools like Verilator and infrastructure like Zephyr. Over the next few weeks, I'll be writing more about specific details of the design and verification process and what I learned. So, stay tuned for future posts!

Disclaimer

The views and opinions expressed above are solely those of the author and do not represent those of my employer or any other party.