From 27da4958d27f3ab2c0e44c9518dc9080847b2ba5 Mon Sep 17 00:00:00 2001
From: David Anderson
Date: Tue, 13 Aug 2024 20:53:47 -0700
Subject: [PATCH] experiments/rmw_ram: document failed/paused memory trickery experiments

---
 experiments/rmw_ram/README.md | 140 ++++++++++++++++++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 experiments/rmw_ram/README.md

diff --git a/experiments/rmw_ram/README.md b/experiments/rmw_ram/README.md
new file mode 100644
index 0000000..eeed85d
--- /dev/null
+++ b/experiments/rmw_ram/README.md
@@ -0,0 +1,140 @@
+No code here, just a record of a brief detour to experiment with RAM
+layouts.
+
+The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
+iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so can do that
+easily. The ECP5 on the other hand has no large SPRAMs. What it does
+have is a varying amount of EBRs, embedded block RAMs. These are
+18Kbits each of dual-ported memory, which can be arrayed to form
+larger memories.
+
+For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
+around $20 in unit quantities. It has plenty of logic tiles for our
+needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
+we can make up the difference with LUT RAMs.
+
+Unfortunately no: the EBRs have configurable data width, so you can
+set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
+you select one of the power of two widths, you only get to access
+16Kbits! IOW, the layout options you have for an EBR are:
+
+ - 16384x1b (16Kbits)
+ - 8192x2b (16Kbits)
+ - 4096x4b (16Kbits)
+ - 2048x9b (**18Kbits**)
+ - 1024x18b (**18Kbits**)
+
+We want a byte-addressed memory, so if we configure things the obvious
+way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
+we only get 112KBytes total, because 14KBytes worth of bits have been
+stranded by the layout.
+
+## The terrible plan
+
+This led to a horrible line of thinking: what if we configure the
+memory blocks for 9 or 18 bits, to get access to all the bits, and
+then try to adapt that to an 8-bit external interface? How would that
+even work?
+
+Well, if you array a bunch of 18-bit words side by side, you could
+chop those up into 8-bit chunks, with some chunks straddling an 18-bit
+boundary:
+
+```
+[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
+ aaBBccDD                  DDeeFFgg                  ggHHii..
+         eeFFggHH                  HHii..aaBB
+                 ii..aaBBcc                  ccDDeeFF
+```
+
+So, given a byte address, that byte would end up stored in either 1 or
+2 EBR blocks, and writes on individual EBRs would have to read out the
+appropriate 18-bit word, replace a subset of the bytes, and write back
+the changed word.
+
+Conceptually, the flow for a memory write from the system bus:
+ - Inputs: byte address `BA`, byte value to write `BV`
+ - Translate the byte address to a pair of word addresses and bit
+   ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit,
+   W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which
+   happen to fall within a single 18b word.
+ - Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
+ - Fan out the write to the appropriate 1 or 2 EBRs
+ - Each EBR does a read-modify-write cycle to update the appropriate
+   bit ranges.
+
+## Oh no
+
+The difficulties are several here, but the big one is the address
+translation step: to turn a byte address into the pair of word
+addresses plus bit ranges, we need to divide the byte address by 2.25
+(`*8/18`). The ECP5 has no floating point hardware, and no hardware
+divider, so division would suck.
+
+There's well-known theory here though: multiply by 8 is easy, that's
+just a left-shift. Divide by 18 we can break down to divide by 2
+(shift-right), followed by a division by 9. To implement division by 9
+there's a bunch of tricks that effectively turn the division into a
+multiplication by a magic number followed by a power of 2 division
+(shift-right again).
+
+The ECP5 has DSP hardware that, among other things, provides an 18x18
+bit multiplier (36-bit output). So, we could do that. However, couple
+problems: for the multiplier to run fast, you need to pipeline it,
+which increases overall memory access latency. If we want to hit the
+timings for "fast" memory access from a 65C816, assuming the FPGA
+design can run at 100MHz, we have 4 cycles to turn a read around. The
+memories themselves take 1 or 2 cycles, so the entire address
+translation and reassembly has to somehow be jammed into 2 cycles,
+_and_ the combinatorial paths can't be very long because otherwise it
+won't be able to run at 100MHz.
+
+Another issue is that the ECP5's DSP block is currently a bit of an
+unknown in Project Trellis, so if we want to use OSS tools, we don't
+get access to the full cosmic power of the ECP5 DSP blocks, we only
+get a basic 18x18 multiplier with no frills.
+
+Overall, this just seems like too much computation to jam into the
+number of cycles available.
+
+Finally, talking with an expert hardware designer, the plan to make up
+the memory shortfall of EBR using LUT RAM may also not work: in their
+experience it's very difficult for large chunks of LUT RAM to meet
+sensible timing constraints. Even 2K in the original plan would likely
+be a problem.
+
+## What now?
+
+For now, I'm going to build GARY without any memory trickery. That
+means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be
+able to claw some of that back by tweaking the data structure layout
+and memory mapping, to effectively have compression compared to what
+VERA stores.
+
+If we really want 128KBytes of VRAM, we have two main options:
+ - Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
+   with the stranded bits from running 4-bit wide memory, that's
+   216KBytes available. We also get roughly double the LUTs and DSP
+   units, although the -25 already has more than we need of
+   both.
+   - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
+     instead of $20/ea. Some BGA form factors offer the -45 for "only"
+     $33/ea, which still hurts but a bit less... in exchange for
+     having to learn how to BGA.
+ - Use an external RAM. This would let GARY have several megabytes of
+   VRAM easily.
+   - Downside: more BOM cost for the extra chip, though we can maybe
+     compensate by dropping back down to the smallest LFE5U-12 FPGA.
+   - Downside: to meet timing requirements, this needs to be a
+     parallel RAM, which will consume a couple dozen IOs on the FPGA
+     and require some painful routing of >100MHz traces (length
+     matching, worrying about signal integrity, maybe being forced
+     into more board layers...).
+
+## References
+
+For the tricksy address translation stuff, the references are Granlund
+& Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249 )
+for implementing division by a constant in various slick ways. Chapter
+10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery
+for decomposing divisions into a pipeline of adds and shifts.
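Note on the address translation step from "Oh no": as a rough software model only, not HDL and not anything taken from the README, the translation from a byte address to the `(WA, startbit, bitlen)` tuples could be sketched as below. The names `MAGIC`, `SHIFT`, `word_index` and `translate` are hypothetical, and the constant 233017 is one choice that happens to work for 17-bit byte addresses (128KBytes); it folds the `*8/18` into a single multiply and shift. This says nothing about whether the logic would close timing on an ECP5.

```python
# Software model of the byte-address -> 18-bit-word translation.
# All names and constants here are illustrative assumptions.

WORD_BITS = 18
BYTE_BITS = 8

# floor(ba * 8 / 18) == floor(ba * 4 / 9). For 0 <= ba < 2**17 this equals
# (ba * MAGIC) >> SHIFT with MAGIC = ceil(2**19 * 4 / 9) = 233017.
SHIFT = 19
MAGIC = 233017


def word_index(ba: int) -> int:
    """Index of the 18-bit word holding the first bit of byte `ba`."""
    return (ba * MAGIC) >> SHIFT


def translate(ba: int):
    """Return ((WA1, start1, len1), second), where second is None when
    the byte fits in one word, else (WA2, start2, len2)."""
    wa1 = word_index(ba)
    # Bit offset within the word; * 18 is two shifts and an add in hardware.
    start1 = ba * BYTE_BITS - wa1 * WORD_BITS
    len1 = min(BYTE_BITS, WORD_BITS - start1)
    first = (wa1, start1, len1)
    if len1 == BYTE_BITS:
        return first, None
    # The remaining bits spill into the start of the next word.
    return first, (wa1 + 1, 0, BYTE_BITS - len1)
```

Checked against the diagram in "The terrible plan": byte 2 comes out as 2 bits at the end of word 0 plus 6 bits at the start of word 1 (`ii..aaBBcc`), and byte 3 sits entirely inside word 1 starting at bit 6 (`DDeeFFgg`). Both multiplier operands stay below 2^18 for this address range, but whether that maps cleanly onto the ECP5 multiplier under OSS tools is not something this sketch establishes.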