// Musings of a compiler gremlin
void blog(void) {
RISCY-V02: Outline
I've been continuing work on RISCY-V02, and it's been going well. I think it's about time to start writing about it in more detail.
First, the project's explicit goals. RISCY-V02, if successful, would be a CMOS VLSI CPU design that fits within the 11,500 CMOS transistors
of the WDC 65C02. Additionally, it must be a solid expression of modern RISC principles, a straightforward compiler target, broadly electrically
compatible with the 65C02, and, when taken as a whole, superior to 65C02 for the home computer and game console use cases.
To review, the purpose of these goals is to gain insight into whether modern CPU design principles could have produced a superior CPU to the 6502
for around the same price at the time of its introduction, 1975.
Now, I'd really like to be able to compare to the 6502 itself; the 65C02 launched in 1983, just one year before the 68000 Macintosh. A company that
attempted to launch a RISCY-V02 in 1983 would be squeezed out by the 68000 on the high end, and the backwards compatible 65C02 on the low end. A more
useful RISCY-V02 design would instead fit within the 3510 NMOS transistors (with 1018 pull-up transistors) of the original NMOS 6502.
Unfortunately, I don't have the expertise needed to answer that question. I have no background in VLSI design; I minored in Electrical and Computer Engineering
in college, which exposed me to basic digital design techniques using static CMOS logic, but nothing more. I have access to only the open source "qflow" VLSI workflow,
and I can use it to produce static CMOS designs using its standard cell library, but I don't have the expertise to rework such designs using the dynamic NMOS logic used
by the actual 6502, particularly while meeting the design constraints of their silicon processes.
Accordingly, many aspects of RISCY-V02's design, like the 65C02 itself, will use more transistors than strictly needed by the 6502. These additional transistors remove invalid
opcodes, allow the processor to be single stepped, and vastly decrease its power consumption. It's my hope that fitting a design to the 65C02's footprint provides some likelihood that the
design could have been made to fit within the 6502's footprint. But even if not, the RISCY-V02 is also shaping up to be dramatically superior to the 8080/Z80, and that was also a wildly popular,
if considerably more expensive, pair of chips.
A few more caveats: the standard cell library I'm using, OSU035, doesn't have anything that can be used to create an SRAM. That's the usual way that register files are implemented in modern processors,
since it allows using only 6 CMOS transistors per bit, while a D latch (the next smallest memory cell) takes 8. I'll aim to have the design fit using a larger-than-necessary latch-based register file, and
I'll provide an estimate of the transistor count using a proper SRAM design with bit lines and sense amplifiers. Unfortunately, sense amplifiers are nonlinear analog circuits, so I'm out of my depth there too.
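For rough scale (my arithmetic, using the per-bit figures above and the 8 16-bit registers settled on below): a latch-based register file costs about 8 x 16 x 8 = 1024 transistors, versus about 8 x 16 x 6 = 768 for 6T SRAM cells, before counting decoders, bit lines, or sense amplifiers in either case.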
This article is already starting to get pretty long, so at this point I'll just sketch out the basics of the motivation and decisions behind the ISA design itself.
First, how large should instructions be? It's broadly desirable in modern RISC designs that most instructions complete within a single cycle; that vastly simplifies the CPU's sequencing, since you don't need
a complex internal state machine to move the CPU from one cycle to the next. For that to work on an 8-bit bus like the 6502's, it would need to have 8-bit instructions. I spent some time noodling with this,
and while it's possible to have extremely tiny instructions that stack up work to do, followed by another instruction that executes that work, there's very little that's truly useful you could tell a CPU to do in only 8 bits.
Most of the 6502's 8-bit instructions wouldn't apply to a RISC design, and the vast vast majority of 6502 instructions are 16 bits or larger. It's also the case that a 6502 can't execute an instruction in fewer than
2 cycles anyway. So 16-bit instructions seem ideal.
It takes two cycles to fetch a 16-bit instruction, and one of the core techniques of modern RISC design is pipelining. That means that while the next instruction is being fetched, the current instruction has two
cycles to execute, completely in parallel with the fetch. We can set our processor up with a two-stage pipeline, where each stage takes two cycles. So long as an instruction can execute within two cycles, and
so long as it doesn't need to read or write from memory, then it can execute concurrently with the next fetch, and that in turn means that an instruction finishes executing every two cycles. This fact has quite a few
implications on the design.
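Sketching the steady state (my own diagram, not anything from the design itself), with F for a fetch cycle and X for an execute cycle:

    cycle:   1   2   3   4   5   6   7   8
    insn 1:  F   F   X   X
    insn 2:          F   F   X   X
    insn 3:                  F   F   X   X

Each instruction takes four cycles end to end, but one completes every two.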
First, we have two cycles to execute arithmetic operations, and it takes two cycles to perform a 16-bit operation using an 8-bit ALU roughly equivalent to the one in the 6502. That means that we can have our architecture
be 16-bit without imposing too much additional cost beyond the 6502. If we can find cuts to pay for it, then we're good.
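Here's that two-cycle add modeled in C++ (a sketch of the dataflow, not the actual RTL; the function is mine):

    #include <cstdint>

    // A 16-bit add performed as two 8-bit ALU operations: low bytes on the
    // first cycle, high bytes plus the saved carry on the second.
    uint16_t add16(uint16_t a, uint16_t b) {
        unsigned lo = (a & 0xFF) + (b & 0xFF);   // cycle 1: low bytes
        unsigned c  = lo >> 8;                   // carry out of cycle 1
        unsigned hi = (a >> 8) + (b >> 8) + c;   // cycle 2: high bytes + carry
        return (uint16_t)(((hi & 0xFF) << 8) | (lo & 0xFF));
    }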
Second, if an instruction does need to load or store, then we can't fetch the next instruction at the same time. We have to stall the fetch until the load or store completes. Thus, every load or store
incurs a cycle penalty of at least the number of bytes loaded or stored. That means it pays to reduce the number of loads and stores as much as possible, which means keeping as much data in the CPU registers as
we can.
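For a sense of scale (my arithmetic from the penalty rule above, ignoring any other stalls): a two-cycle instruction that also does a 16-bit load or store occupies the bus for four cycles instead of two, so code that spills to memory on every instruction runs at roughly half the speed of code that stays in registers.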
CPU registers are expensive, but they're not *that* expensive. We can also make them more general purpose than the ones in the 6502, allowing us to convert some of its hardware state into more registers. From the above,
we want our registers to be 16-bit, but how many can we support? The answer to this question comes not from the transistor count, but from the instructions themselves.
It's essentially impossible to fit a reasonable-sized immediate, a 4-bit register number, and an opcode into a 16-bit encoding. There are too many opcodes needed, and the whole thing just won't be crammed that small. But
3-bit register numbers can just barely be made to fit; this allows for 8 16-bit registers. That's what RISCY-V02 is going with.
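For a feel for the budget (a hypothetical split, not necessarily RISCY-V02's actual encoding): a two-register ALU instruction can spend 16 = 7 (opcode) + 3 (rd) + 3 (rs) + 3 (spare) bits, and a register-plus-immediate form 16 = 5 (opcode) + 3 (rd) + 8 (imm). Redo that with 4-bit register numbers and the spare bits vanish while the immediate shrinks to 7 bits, with the opcode space getting squeezed from there.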
Next, the semantics of the registers. RISC-V has a zero register, x0, which always reads 0, and where storing to it does nothing. This robs us of a register, but it lowers the transistor count by reducing the instruction count:
many operations fall out of existing instructions for free. To see why, consider the RISC-V unconditional branch: JAL x0, imm. This stores PC+4 into the named register, as if performing a subroutine call, then adds the
sign-extended immediate imm to the program counter. But since that register is x0, the store does nothing, turning this into an unconditional jump instead, with no extra logic. That being said, we really only need to add 2 instructions to get most
of the benefit of the x0 register, so it doesn't seem worth losing a register over.
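Modeled in C++ (a sketch of the semantics; the function and its signature are mine):

    #include <cstdint>

    // RISC-V JAL: rd <- pc+4, then pc <- pc+imm. The guard on rd models x0's
    // hardwired zero; in silicon it costs nothing, since x0's cell simply
    // has no write port.
    void jal(unsigned rd, int32_t imm, uint32_t regs[32], uint32_t &pc) {
        if (rd != 0)
            regs[rd] = pc + 4;   // save the return address (discarded for x0)
        pc += imm;               // imm arrives sign-extended from the decoder
    }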
That leaves us with three special registers: our new x0, x1, and x7.
x0 is the "link register": calling a subroutine places the return address in this register. If your subroutine doesn't call anything (it's a "leaf"), then you
can leave x0 alone and return by jumping through it. If you do call something, then you save and restore x0 on the stack in the function prologue/epilogue. This removes the need for any stack manipulation logic
in the CPU (quite a hefty bit of work in the 6502!), at the cost of 4-8 extra bytes ({[dec SP], load/store word}x2) in non-leaf subroutines. But leaf subroutines (most!) get to be faster, since they don't need to access
memory to stack the return address.
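To account for that byte range (my arithmetic, assuming 2-byte instructions): the prologue is an SP decrement plus a word store, and the epilogue a word load plus an SP increment; that's four instructions, 8 bytes, shrinking to two instructions, 4 bytes, if the SP adjustment can fold into the load/store.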
x1 is the stack pointer. Having it as a general purpose register allows manipulating it using regular arithmetic and loading/storing through it as if it were any other pointer.
x7 is a callee saved register; by convention, a procedure needs to preserve the caller's value. Having at least one of these makes it easy to keep a key value alive across many procedure calls.
Okay, we've covered the project goals, the broad shape of the ISA (8x16-bit registers, 16-bit instructions), and specifics about how the registers are used. That seems enough for one entry. Ta ta for now!
RISCY-V02 (Risky Five Oh Two)
Last year, at the Vintage Computer Festival West, I had a chance to ask Bill Mensch, one of the designers of the MOS 6502, a question.
I had been wondering why the chip was so seemingly hostile towards higher level languages like C. The answer I got was largely the one
I expected: that higher level languages weren't really in the expected requirements of that kind of chip at the time it was made, and that
the successor chip, the 65816, was designed at least in part with C in mind. This satisfied me, but Bill tacked on a line, which I'm not sure I
remember quite well enough to quote, but the gist of which was that looking back, he wouldn't have designed the 6502 any other way than
the way it ended up.
I believe him! Given the constraints of the time, I don't know that there were any real mistakes made in the development of the 6502. It's a
tremendous chip that has stood the test of time and had a lasting impact on computing. But that quote has stuck with me. Could things
have actually been done better than they were? Have we learned anything in the last 30 years of R&D on chip design that might have changed how the
6502 would have been developed for the role it ended up taking on?
I've been interested in this question for a while, and for the same while, I've noodled with little CPU designs that have tried to bring more modern
CPU design concepts to the same design space that the 6502 occupied. I've recently had some considerable success, so this is the first in a series
where I'll introduce the project, its goals, and the progress I've made.
First the project. I'm calling it "RISCY-V02", pronounced Risky Five Oh Two. Pronounced like 6502, but RISC-V at its heart. My goal is to create a
design for a CPU with the same broad footprint as the 65c02 (transistor count, pins, etc.). The chip would have a completely new ISA and internal
architecture built using modern principles, based on RISC-V.
The 6502 gained success by throwing out wasteful elements of the 6800 and replacing them with smaller, faster, more useful elements. My thesis is this:
that we've learned that the 6502 didn't go far enough in reducing the instruction set, and that it could have gone further towards a modern RISC design
without increasing its transistor count. The purpose of this project is to explore first whether this might be true, and if so, to provide a nice exposition
of exactly what it is that we have learned about CPU design since the 70's against the nice simple backdrop of the 65c02.
The prototype I'm building is a 16-bit CPU, to the 6502's 8 bits. It has a register file of 8 16-bit registers,
to the 6502's 3 8-bit registers, and almost all instructions (all 16 bits wide!) have a latency of two cycles, which is the latency of the very fastest
8-bit 6502 instructions. Predictable branches have no penalty, subroutine returns and mispredicted branches have a two cycle penalty, and loads and stores
have a 1 or 2 cycle penalty for 8 or 16-bit variants, respectively. It's not finished yet, so it's difficult to tell what I'll have to end up sacrificing
from the above, but so far, it's still quite a bit under the 65c02's transistor budget. If these performance characteristics hold up, it would kick the
pants off the 6502 in code size, speed, and ease of compilation.
Prima facie, the above sounds a bit unlikely, even to my ears. But, I spent some time digging through a Verilog 65c02 implementation, and the reasons became
a lot clearer. The 65c02 spends a lot on things that are just as inessential as the 6800 before it: binary coded decimal, way too many addressing modes, push
and pop instructions, JSR and RTS instead of a link register, etc. A modern RISC like RISC-V finds more parsimonious ways to represent all of this, usually
involving slightly more clever compilers and considerably more difficult hand-written assembly. But, the payoff is that you can fit a lot more into the chip
of the kinds of things you need to make it really really fast. An example: if it takes, say, two cycles to fetch the next instruction, why not hide a
16-bit add using an 8-bit ALU behind that fetch?
I'm wiring up everything using the open source qflow VLSI workflow; that should give me a real estimate for transistor count and maximum stable frequency.
I'm also going to compare the end result with an off the shelf Verilog model for a 65c02, using the same workflow.
A few words first. I'm not a chip designer; I minored in electrical and computer engineering, and I'm a professional compiler engineer. But that goes
so far, and no farther. I have hardware design as a hobby, and I'm not even great at it. So I'm likely to make mistakes, and the results here should
be taken with an enormous grain of salt.
Next post, I'll start to expand on the broad strokes of the design I ended up with, and what motivated it. Until next time!
New blog and NZ woes
At the urging of my wife (and vague gentle nudgings from the llvm-mos community throughout the years), I'm starting a blog!
This will be a place for me to doodle in my down time. When I get around to it, I intend to write some rambly deep dives on
little things I'm working towards, or exposés on the more interesting bits of how llvm-mos (or other side projects of mine) work.
As for what I'm working on now, that's mostly trying to get llvm-mos's release unstuck. We're currently in a really nasty state
of "revealed check"; the most recent merge from upstream uncovered a fundamental flaw in how we handle branches, and we can't push
a new release until we've addressed it. Business as usual.
The issue is with how llvm-mos manages the 6502's N (negative) and Z (zero) flags. Unlike most modern architectures, the 6502
doesn't have a separate set of instructions (or modifier for opcodes) that cause flags to be set; an instruction just always
sets whatever flags it sets. Almost every instruction on the 6502 sets the N and Z flags according to whether the result it
produces is negative or zero. This means it's insanely difficult to keep N and Z alive as code is being moved around.
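To make that concrete, here's the update that nearly every 6502 instruction performs on the P register, sketched in C++ (the helper is mine; the flag bit positions are the 6502's):

    #include <cstdint>

    // N is bit 7 of P, Z is bit 1. Almost every instruction funnels its
    // result through an update like this, clobbering both flags.
    void set_nz(uint8_t result, uint8_t &p) {
        p &= ~uint8_t(0x82);             // clear N (0x80) and Z (0x02)
        p |= result & 0x80;              // N: copy the result's sign bit
        if (result == 0) p |= 0x02;      // Z: set if the result is zero
    }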
To avoid this insanity, llvm-mos doesn't directly model N and Z at all for most of the code generation pipeline.
Instead, llvm-mos models N and Z through the use of CMPTerm instructions. These shadow the logical CMP instructions, but with the
added caveat that they are terminators. A terminator is an instruction that has to appear at the end of a basic block. As the name
implies, it's illegal for a non-terminator instruction to appear after a terminator in the same basic block.
Making CMPTerm instructions terminators ensures that they stay adjacent to the branches that use the N and Z flags they generate.
That then means that nothing can clobber N and Z; the compiler generally doesn't insert things into a block's terminator sequence when it's
shuffling code around, because usually only branches are terminators, and they intrinsically end basic blocks anyway. So far so good.
The issue comes with what happens when LLVM starts manipulating the branches. There are a few hooks into our target that it uses to do so.
Generally, the flow works like this: first, call analyzeBranch(); this fills out a little data structure that tells you the branch structure
at the end of a basic block. Then, call removeBranch() to strip out the existing branch structure. If you like, call reverseBranchCondition().
Then, call insertBranch() to insert a new branch structure.
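In C++ against the real TargetInstrInfo hooks, the round trip looks roughly like this (a fragment: the enclosing machine-function pass is assumed, and error handling is elided):

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/TargetInstrInfo.h"
    using namespace llvm;

    // Analyze a block's branch structure, strip it, and re-emit it unchanged.
    // reverseBranchCondition(Cond) could flip the condition between the
    // remove and the insert.
    static void roundTrip(const TargetInstrInfo *TII, MachineBasicBlock &MBB) {
      MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
      SmallVector<MachineOperand, 4> Cond;
      if (TII->analyzeBranch(MBB, TBB, FBB, Cond))
        return;                               // true means "couldn't analyze"
      if (!TBB)
        return;                               // pure fallthrough; nothing to redo
      TII->removeBranch(MBB);                 // strip the old terminators...
      TII->insertBranch(MBB, TBB, FBB, Cond,  // ...and put them back
                        MBB.findDebugLoc(MBB.end()));
    }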
Our problem comes from analyzeBranch(); there's at present no way to represent the compare instruction in the branch data structure. Accordingly, removeBranch()
strips off only the branch itself, which leaves the CMPTerm dangling.
This isn't as big of a problem as it might seem; the only real thing you can do afterwards is insert branches, possibly using the inverse of the same condition. That means
that the CMPTerm can just be picked up again.
The issue occurs when no branch is actually inserted; then the CMPTerm is left dangling in perpetuity. Still not a huge problem; we strip out dead CMPTerms.
But, there's also an LLVM pass that folds the successor basic block into the previous one! So we can end up (in surprisingly rare cases) with a terminator instruction
followed by a non-terminator (the start of the next block). This breaks our verifier CI test suite, and generally could cause all hell to break loose. (Friends don't
let friends violate compiler invariants.)
The root of the issue is that comparisons are not really expected to be terminators. If they are, then they're expected to also be branches; that's the way LLVM likes it.
But, it seems that the CMPTerm instructions can just be transformed into Compare and Branch on NZ instructions, since they're always paired up with branches anyway. That would
allow analyzeBranch() to include the full comparison in its analysis, and LLVM will be happy.
That's more or less what I'm fiddling around with now. Our branch is very simple: BR to $target when $flag has value $imm (either 1 or 0). So we
can just fold those arguments into CMPTerm without increasing the number of pseudo instructions in llvm-mos.
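The shape of the change, in the post's own notation (the compare operands a and b are illustrative, not the real operand list):

    Before (two terminators):           After (one terminator):
      CMPTerm a, b      ; defs N,Z        CMPTerm a, b, $target, $flag, $imm
      BR $target, $flag, $imm

Since the pair always travels together anyway, analyzeBranch() can then describe the whole thing, and removeBranch()/insertBranch() can round-trip it safely.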
Nothing is that simple, and I expect something surprising to break, but eh, business as usual.