Verilog 6502

Back to projects


Here, with full source code, is a cycle-accurate 6502 microprocessor core in Verilog HDL, which was automatically generated from a transistor-level netlist published by the Visual 6502 project. The 6502 has a two phase clock and this core is not only cycle but phase (half-cycle) accurate. The state of internal nodes is faithfully reproduced. All external address, data and control signals including RDY, SO, RES, IRQ, NMI, RW and SYNC are supported. The core runs 10 times faster than a real 6502 and occupies only 8% of the flops and 7% of the LUTs in the Xilinx xc3s500e FPGA on a Spartan 3E Starter Kit.

This project began in 1982 when my friend Alan Fothergill and I learnt the maths of colliding spheres at school. Alan wrote a Sinclair Spectrum pool game for Bug Byte Software Ltd. I tried to write pool for the Commodore PET, which lacked high-resolution graphics, so I made a vector display on my oscilloscope using digital-to-analog converters. I never got vector pool working properly at the time, but still have my original 5¼" floppy disks and recently recovered the data off them using Kryoflux. I thought it would be fun to get my pool code running again on a 6502 core in FPGA.

The 6502 has been studied and reverse-engineered more than any other microprocessor: in 1982, Donald Hanson created a detailed block diagram from original blueprints; more recently, Balázs Beregnyei drew transistor-level schematics by hand from his own die photographs; and the Visual 6502 project extracted vector polygon models of metal and silicon layers from die photographs, and published a transistor-level netlist. Very recently, Eric Schlaepfer designed the MOnSter 6502, a transistor-scale replica, using discrete MOSFETs. This web page is about how I converted the Visual6502 netlist into Verilog.

Verilog generation was a semi-automatic, guided process of sub-circuit recognition and extraction. The definition of sub-circuits and extraction order was manual; however, that done, the process ran without intervention. It was like an audit of the chip, in which every last transistor was accounted for. Repeated sub-circuits only needed defining once and all instances were extracted. Verilog fragments were hand-crafted to implement each class of sub-circuit. Mostly, these are simple assign statements, one line per logic gate. By far the most complex sub-circuit was a bitslice of the internal data paths, of which there are eight instances in the chip.

NMOS to Verilog

The 6502 contains 3,510 enhancement-mode and 1,018 depletion-mode NMOS transistors - the latter all acting as passive pull-ups at gate outputs. Here are the simplest logic gates and equivalent Verilog fragments. I follow the MOS Technology convention of labelling the +5V power supply Vcc, although Vdd seems more appropriate. They used Vcc and Vss.


    assign o[2] = ~i[1];
    assign o[5] = ~(i[3] | i[4]);
    assign o[8] = ~(i[6] & i[7]);

Modern CMOS logic typically uses edge-triggered D-type flip-flops as storage elements, synchronous to a single phase clock. The older NMOS technology of the 6502 used transparent latches for storage, and a two phase clock. Data was stored as charge on the gate capacitance of a transistor. A pass transistor or transmission gate, controlled by one or other phase of the clock, was the latch enable. Below, Q1 is the transmission gate and charge is stored on the gate of Q2, which is an inverter, but could be part of a more complex gate:


Xilinx Spartan-3 FPGA storage registers can emulate either D-type flip-flops or transparent latches; however, toolchain support for latches is limited. Data transitions can pass through more than one cascaded latch at a time, making timing closure tricky. Zero-page operands pass along the internal data bus (IDB) and internal address bus (ADL) to the address bus latch (ABL) in half a cycle; however, consecutive latches usually open on opposite clock phases. Whilst edge-triggered flops are synchronous elements; latches are asynchronous, best emulated in FPGA using a simple 2-input MUX. Registers are inserted to break combinatorial loops:



    assign o[789] = i[456] ? i[123] : i[789];

Combinatorial loops do not work well in FPGA logic. Transparent latches are not the only problem. The 6502 contains S-R (set-reset) type latches made from cross-coupled gates; and numerous potential feedback loops in the data path. All circular paths are broken by inserting registers after every node. The minimum FPGA to 6502 clock ratio depends on the number of logic levels through which signals must propagate on each 6502 clock. Some registers could possibly be removed; but keeping them eases timing closure. The next state of every node in the chip is a combinatorial function of their previous states and those of external stimuli:


The auto-generated Verilog is a single `include file containing combinatorial assign statements for logic gates; plus a few instantiations of a combinatorial helper module to cope with data paths. The helper is a custom multiplexer with parameterised input width. The select inputs are "one-or-more-hot" encoded. If there is contention when multiple data inputs are enabled, low inputs win. If no inputs are enabled, the output becomes a capacitive storage node, like the transparent latches. This emulates an NMOS structure comprising uni-directional pass transistors with a common output node:

module DATAPATH_MUX #(
    parameter N=4
) (
    output wire out,
    input  wire in,
    input  wire [N-1:0] s,
    input  wire [N-1:0] d);
    
    assign out = (|s) ? &(d|(~s)) : in;
endmodule

The most complex sub-circuit is a bitslice of the datapath. This includes precharge transistors, bus bars, multiplexors and two bi-directional pass transistors. There are 8 instances of this sub-circuit in the chip. To the right below is it encoded as a netlist. Arbitrary numeric identifiers are assigned to each transistor and node. 100-101 are outputs to the PCL and PCH latches. 200-207 are inputs from IDL, X, Y, A, S, PCL, PCH and the ALU. 300-343 are data path control (dpc) lines. 400-443, 700 and 800-803 are transistors. 500-503 are the special bus (SB), internal data bus (IDB), address low (ADL) bus and address high (ADH) bus respectively:

AddTran(400, 300, 202, 500);        // dpc0_YSB                   
AddTran(402, 302, 201, 500);        // dpc2_XSB                   
AddTran(404, 304, 204, 500);        // dpc4_SSB                   
AddTran(405, 305, 204, 502);        // dpc5_SADL                  
AddTran(419, 319, 207, 500);        // dpc19_ADDSB7, dpc20_ADDSB06
AddTran(421, 321, 207, 502);        // dpc21_ADDADL               
AddTran(424, 324, 203, 500);        // dpc24_ACSB                 
AddTran(425, 325, 500, 501);        // dpc25_SBDB                 
AddTran(426, 326, 203, 501);        // dpc26_ACDB                 
AddTran(427, 327, 500, 503);        // dpc27_SBADH                
AddTran(428, 328, 503, NODE_vss);   // dpc28_0ADH0, dpc29_0ADH17  
AddTran(430, 330, 503, 101);        // dpc30_ADHPCH               
AddTran(431, 331, 206, 101);        // dpc31_PCHPCH               
AddTran(432, 332, 206, 503);        // dpc32_PCHADH               
AddTran(433, 333, 206, 501);        // dpc33_PCHDB                
AddTran(437, 337, 205, 501);        // dpc37_PCLDB                
AddTran(438, 338, 205, 502);        // dpc38_PCLADL               
AddTran(439, 339, 205, 100);        // dpc39_PCLPCL               
AddTran(440, 340, 502, 100);        // dpc40_ADLPCL               
AddTran(441, 341, 600, 502);        // dpc41_DL/ADL               
AddTran(442, 342, 600, 503);        // dpc42_DL/ADH               
AddTran(443, 343, 600, 501);        // dpc43_DL/DB                
AddTran(700, NODE_cp1, 200, 600);
AddTran(800, NODE_cp2, 500, NODE_vcc);
AddTran(801, NODE_cp2, 501, NODE_vcc);
AddTran(802, NODE_cp2, 502, NODE_vcc);
AddTran(803, NODE_cp2, 503, NODE_vcc);
Each slice is emulated using a fragment of Verilog like that below. Transistors 425 and 427, which are controlled by nodes 325 (dpc25_SBDB) and 327 (dpc27_SBADH) respectively, are bidirectional pass transistors and require special handling. The first three lines determine if inter-connected buses are driven locally i.e. not through a pass transistor. The next four lines determine if a local state should be propagated through to another bus. The last six lines are instantiations of the custom uni-directional multiplexer class described above:

wire sb_local_500 = i[300]|i[302]|i[304]|i[319]|i[324];
wire idb_local_501 = i[326]|i[333]|i[337]|i[cp1]&i[343]|i[?];
wire adh_local_503 = i[328]|i[332]|i[cp1]&i[342];
wire db2sb_500 = idb_local_501 & i[325];
wire sb2db_501 = sb_local_500 & i[325];
wire adh2sb_500 = adh_local_503 & i[327];
wire sb2adh_503 = sb_local_500 & i[327];
DATAPATH_MUX #(8) mux_sb_500  (.o(o[500]), .i(i[500]), .s({i[cp2],i[300],i[302],i[304],i[319],i[324],db2sb_500,adh2sb_500}), .d({i[vcc],i[202],i[201],i[204],i[207],i[203],i[501],i[503]}));
DATAPATH_MUX #(7) mux_idb_501 (.o(o[501]), .i(i[501]), .s({i[cp2],i[326],i[333],i[337],i[cp1]&i[343],sb2db_501,i[?]}),       .d({i[vcc],i[203],i[206],i[205],i[200],i[500],i[?]}));
DATAPATH_MUX #(6) mux_adl_502 (.o(o[502]), .i(i[502]), .s({i[cp2],i[305],i[321],i[338],i[cp1]&i[341],i[?]}),                 .d({i[vcc],i[204],i[207],i[205],i[200],i[vss]}));
DATAPATH_MUX #(5) mux_adh_503 (.o(o[503]), .i(i[503]), .s({i[cp2],i[328],i[332],i[cp1]&i[342], sb2adh_503}),                 .d({i[vcc],i[vss],i[206],i[200],i[500]}));
DATAPATH_MUX #(2) mux_pcl_100 (.o(o[100]), .i(i[100]), .s({i[339],i[340]}),                                                  .d({i[205],i[502]}));
DATAPATH_MUX #(2) mux_pch_101 (.o(o[101]), .i(i[101]), .s({i[330],i[331]}),                                                  .d({i[503],i[206]}));

The helper MUX retains its previous state when no inputs are selected. This emulates capacitive storage, which the 6502 designers exploit to decrement registers. Internal busses are precharged to $FF during Φ2 and then connected to one of the ALU input latches in the immediately following Φ1, during which the nodes are floating. This forces $FF onto the ALU input by swamping it with charge from physically larger nodes. A register (X, Y, SP or PCH) is routed to the other ALU input and decremented by adding $FF (-1).

SubGemini

First described in a 1993 paper [1] by Ohlrich et al, SubGemini is a fast, technology-independent algorithm for finding instances of sub-circuits in a larger netlist. Netlists describe electronic circuits containing devices inter-connected by wires or nets. In SubGemini, both the main circuit and sub-circuits are represented by data structures comprising vertices and edges. Vertices are either devices or nets. Edges are the connections between them i.e. at device ports. Mathematicians call the structure a bipartite graph of two disjoint sets. Traversing edges, one encounters devices and nets alternately:

The problem of finding sub-graphs in larger graphs is called Subgraph Isomorphism and is a difficult one to solve. All so-called invariant properties must be exploited: devices have a type e.g. enhancement or depletion mode MOSFETs in NMOS, P and N-channel in CMOS; nets have an order, which is the number of ports they connect to; and edges have port type e.g. input, output e.t.c. That's all there is to go on. Here is a sub-circuit schematic and its representation within SubGemini, labelled with all the distinguishing features of its edges and vertices:

Notice the net types: internal, external and special. Special nets are used for power supplies and global clocks. External nets are those at the boundaries of a sub-circuit. There is only one internal net (of order 3) in this example. Any matching nets in the main circuit will also be of order 3 and will be connected, by their drains, to two enhancement mode MOSFETs, and by its source to a third. We could easily make a shortlist of all such candidate vertices in the main circuit. It would then be fairly trivial in this simple case to check beyond the vertices immediately neighbouring the candidates to verify matches of the entire sub-circuit.

The process described in the previous paragraph is close to how SubGemini actually works, in cases as trivial as the above example. The starting point, that order 3 internal net in the sub-circuit, is called the key vertex; and the shortlist is called the candidate vector. Key vertices can be devices or internal nets. Clearly, one could choose any key vertex at random, make a candidate vector shortlist, and use a brute-force recursive walk algorithm to verify sub-circuit matches; however, this would be inefficient.

The recursive walk is a depth-first approach. It explores many wrong paths leading to dead-ends, sometimes descending way down the call stack before back-tracking. SubGemini uses a far more efficient breadth-first technique that effectively searches outwards from every vertex through its surrounding neighbours, one edge at a time in all directions! Initially, all vertices are labelled using their invariant properties of type and order. There follows an iterative process of re-labelling, combining previous labels with connecting edge types and labels of immediate neighbours. All labels, types and orders are simply 32-bit integers.

The remarkable result of this re-labelling process is that vertices rapidly acquire large and relatively unique labels, which are like a hash of the surrounding structure. Re-labelling occurs in parallel in both main and sub-circuits; and matching vertices have equal labels. Of course false duplicates can also arise in non-matching vertices; however, this is relatively uncommon. SubGemini uses a two phase process to first build a candidate vector and then validate the candidates. Both phases involve iterative re-labelling. In SubGemini terminology, grouping vertices with matching labels is called partitioning.

Sub-circuits are bounded with external nets. But matching instances in the main circuit extend seemlessly beyond. At first, labels in the sub-circuit match those in the main circuit; however, after some iterations, labels within matching instances will be affected by vertices outside. Iteration stops just before the innermost labels are corrupted. Each label has a flag to indicate if it is still trustworthy. Devices and internal nets are initially valid; external nets are not. Invalidity propagates along edges with each iteration. Special nets have globally unique labels and are never re-labelled.

Hopefully, the above will provide some insight into how SubGemini works. Of course, there are additional complexities which have not been mentioned. Highly symmetrical circuits lead to ambiguities, which can only be resolved by back-tracking in the last resort. In my application I also cheat by giving node matching hints in a few special cases. Readers interested in learning more can refer to the original paper [1] or my source code, linked at the bottom of this page.

Performance

Inserting registers at the outputs of the auto-generated logic effectively imposed a propagation delay through the 6502 logic of one FPGA clock per 6502 gate. Performance was significantly improved by optimising away non-inverting buffers and inverters. That helped; but this core still pays a high performance price for node-level accuracy. There are smaller and faster Verilog 6502 cores out there, such as the one by Arlet Ottens. This core manages performance equivalent to a 10 MHz 6502 using less than 700 LUTs in a Xilinx xc3s500e Spartan-3E FPGA.

The core passes Klaus Dormann's test suite and AllSuiteA. Since it emulates the 6502 at transistor level, I would expect it to handle most of the so-called undocumented opcodes correctly. I am aware of the work by Peter Monta in which he used a 6-bit code to model voltages and currents at certain nodes. My core uses a 1-bit (Boolean) state for all nodes and resolves contention with a rule that pull downs to logic '0' (Vss) always win. It is possible that my core may not correctly emulate some internal conflicts that might arise whilst executing undefined opcodes.

Clock ratio

How do I get away with an FPGA to 6502 clock ratio as low as 16:1, for program counter increment? The active-low PC increment, dpc36_IPC, is asserted in Φ1 and remains so through Φ2. The PC+1 result is latched using Φ2. Although rippling starts in Φ1, the result is not required until the end of Φ2. With a 16:1 clock ratio, rippling spreads into Φ2, which might be what happens in the real chip. Having a fast carry look-ahead from PCL and a half-carry in the middle of PCH also helps. There's a 3 FPGA clock margin with inverter/buffer optimisation (removal) enabled, reducing to 1 without it, on the rollover from $FFFF to $0000.

The above rollover waves are from a simulation of the synthesizable core. The effect of registering the auto-generated logic output is evident in the staggering of edges as they propagate. Registers breaking combinatorial feedback paths are required for synthesis, but not for simulation. Much nicer waves with zero propagation delays were obtained from a (100% combinatorial) behavioural model. This is a very useful tool for studying 6502 internals. Buffers and inverters are not optimised away, so more nodes can be probed. See source code links below.

Earlier attempt

This is not my first 6502 core. Before adopting the SubGemini (guided) method, I generated a core from the net list using a completely different, fully automatic approach. It was a recursive walk, starting from every node, stepping through transistor channels, finding routes to Vcc or Vss, writing out Boolean expressions in terms of gate nodes of transistors that had to be conducting for paths to exist. When no paths existed, the starting node was a storage latch. Everything worked apart from register decrement. The model didn't recognise charge sharing. But that was easy to fix for the specific scenario between SB and the ALU input. I abandoned the recursive walk because it generated redundant logic. I had to arbitrarily limit the search depth, or else it would exponentially bloat the Verilog.

Pool

My pool code was originally developed on a Commodore PET 3032 personal computer with homemade vector graphics display using an oscilloscope in X-Y mode, connected via a pair of 8-bit digital-to-analogue converters to the PET's user port and IEEE-488 (GPIB) data lines. The code (2016 working version linked below) isn't actually a game, it just displays 7 colliding circles, but is quite pleasing to watch in dim light. There was a collision detection bug in the original version, which was hard to spot. Finding it took me back to staring at the same spaghetti code 34 years earlier.

The 19" rack behind my head in this 1980 photo is a microprocessor-controlled Home Telephone Exchange.

Source code

Links

References

1. Miles Ohlrich, Carl Ebeling, Eka Ginting and Lisa Sather. "SubGemini: Identifying Subcircuits Using a Fast Subgraph Isomorphism Algorithm," In Proceedings of the 30th IEEE/ACM Design Automation Conference, June 1993.