experiments/rmw_ram: document failed/paused memory trickery experiments

2024-08-13 20:53:47 -07:00 · 2024-08-13 20:53:47 -07:00 · 27da4958d2
parent da6ea4cf42
commit 27da4958d2
1 changed files with 140 additions and 0 deletions
--- a/experiments/rmw_ram/README.md
+++ b/experiments/rmw_ram/README.md
@@ -0,0 +1,140 @@
+No code here, just a record of a brief detour to experiment with RAM
+layouts.
+
+The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
+iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so can do that
+easily. The ECP5 on the other hand has no large SPRAMs. What it does
+have is a varying amount of EBRs, embedded block RAMs. These are
+18Kbits each of dual-ported memory, which can be arrayed to form
+larger memories.
+
+For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
+around $20 in unit quantities. It has plenty of logic tiles for our
+needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
+we can make up the difference with LUT RAMs.
+
+Unfortunately no: the EBRs have configurable data width, so you can
+set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
+you select one of the power of two widths, you only get to access
+16Kbits! IOW, the layout options you have for an EBR are:
+
+ - 16384x1b (16Kbits)
+ - 8192x2b (16Kbits)
+ - 4096x4b (16Kbits)
+ - 2048x9b (**18Kbits**)
+ - 1024x18b (**18Kbits**)
+
+We want a byte-addressed memory, so if we configure things the obvious
+way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
+we only get 112KBytes total, because 14KBytes worth of bits have been
+stranded by the layout.
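The capacity arithmetic above can be double-checked with a few lines of Python. This is only a sketch: `total_kbytes` is a made-up helper name, and the depth/width pairs are the EBR layouts from the list above.

```python
# Capacity check for an array of ECP5 EBRs in a given layout.
# `total_kbytes` is a hypothetical helper, not part of any tool.

def total_kbytes(ebr_count: int, depth: int, width_bits: int) -> float:
    """Usable capacity in KBytes for `ebr_count` EBRs set up as depth x width."""
    return ebr_count * depth * width_bits / 8 / 1024

EBR_COUNT = 56  # LFE5U-25

full = total_kbytes(EBR_COUNT, 1024, 18)   # 18-bit layout, every bit reachable
nibble = total_kbytes(EBR_COUNT, 4096, 4)  # 4-bit layout, paired up to form bytes

print(f"18b layout: {full:g} KBytes")
print(f"4b layout:  {nibble:g} KBytes")
print(f"stranded:   {full - nibble:g} KBytes")
```

The same helper works for other SKUs by changing the EBR count.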
+
+## The terrible plan
+
+This led to a horrible line of thinking: what if we configure the
+memory blocks for 9 or 18 bits, to get access to all the bits, and
+then try to adapt that to an 8-bit external interface? How would that
+even work?
+
+Well, if you array a bunch of 18-bit words side by side, you could
+chop those up into 8-bit chunks, with some chunks straddling an 18-bit
+boundary:
+
+```
+[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
+ aaBBccDD                  DDeeFFgg                  ggHHii..
+         eeFFggHH                  HHii..aaBB
+                 ii..aaBBcc                  ccDDeeFF
+```
+
+So, given a byte address, that byte would end up stored in either 1 or
+2 EBR blocks, and a write to each affected EBR would have to read out
+the appropriate 18-bit word, replace a subset of its bits, and write
+back the changed word.
+
+Conceptually, the flow for a memory write from the system bus:
+ - Inputs: byte address `BA`, byte value to write `BV`
+ - Translate the byte address to a pair of word addresses and bit
+   ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit,
+   W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which
+   happen to fall within a single 18b word.
+ - Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
+ - Fan out the write to the appropriate 1 or 2 EBRs
+ - Each EBR does a read-modify-write cycle to update the appropriate
+   bit ranges.
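The write flow above can be modelled in software to see what the hardware would have to do. This is a behavioural sketch only, not RTL. The names (`spans_for`, `write_byte`, `read_byte`) are invented for illustration, and it assumes bytes are packed LSB-first into consecutive 18-bit words, which the notes above don't actually pin down.

```python
# Behavioural model of byte storage packed into 18-bit words.
# Assumption: bits are packed LSB-first; all names here are hypothetical.

WORD_BITS = 18
BYTE_BITS = 8

def spans_for(byte_addr: int) -> list[tuple[int, int, int]]:
    """Return one or two (word_addr, start_bit, bit_len) spans for a byte."""
    word_addr, start = divmod(byte_addr * BYTE_BITS, WORD_BITS)
    first_len = min(BYTE_BITS, WORD_BITS - start)
    spans = [(word_addr, start, first_len)]
    if first_len < BYTE_BITS:
        # The byte straddles a word boundary: the rest starts the next word.
        spans.append((word_addr + 1, 0, BYTE_BITS - first_len))
    return spans

def write_byte(words: list[int], byte_addr: int, value: int) -> None:
    """Read-modify-write each word that holds part of the byte."""
    consumed = 0
    for word_addr, start, length in spans_for(byte_addr):
        mask = (1 << length) - 1
        chunk = (value >> consumed) & mask
        old = words[word_addr]                      # read
        new = (old & ~(mask << start)) | (chunk << start)  # modify
        words[word_addr] = new                      # write back
        consumed += length

def read_byte(words: list[int], byte_addr: int) -> int:
    """Reassemble a byte from its one or two spans."""
    value, consumed = 0, 0
    for word_addr, start, length in spans_for(byte_addr):
        mask = (1 << length) - 1
        value |= ((words[word_addr] >> start) & mask) << consumed
        consumed += length
    return value

words = [0] * 9                 # 9 words x 18 bits holds 20 whole bytes
write_byte(words, 2, 0xA5)      # byte 2 straddles words 0 and 1
print(spans_for(2), hex(read_byte(words, 2)))
```

Note that the `divmod` in `spans_for` is exactly the division the hardware can't do cheaply.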
+
+## Oh no
+
+The difficulties are several here, but the big one is the address
+translation step: to turn a byte address into the pair of word
+addresses plus bit ranges, we need to divide the byte address by 2.25
+(`*8/18`). The ECP5 has no floating point hardware, and no hardware
+divider, so division would suck.
+
+There's well-known theory here though: multiply by 8 is easy, that's
+just a left-shift. Divide by 18 we can break down into a divide by 2
+(shift-right), followed by a division by 9. To implement division by
+9, there's a bunch of tricks that effectively turn the division into a
+multiplication by a magic number followed by a power of 2 division
+(shift-right again).
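To make the magic-number trick concrete, here is a rough Python sketch of the translation without any division. The constants `MAGIC` and `SHIFT` are my own illustrative picks (a rounded-up reciprocal of 9 scaled by 2^21), not values from any existing design, and the loop at the end checks them exhaustively against plain `divmod` for a 17-bit byte address.

```python
# Division-free address translation sketch.
# MAGIC and SHIFT are illustrative constants, not from any existing design.

ADDR_BITS = 17                    # 128 KBytes of byte addresses
SHIFT = 21
MAGIC = ((1 << SHIFT) + 8) // 9   # ceil(2**21 / 9), small enough for 18 bits

def div9(n: int) -> int:
    """floor(n / 9) using one multiply and one right shift."""
    return (n * MAGIC) >> SHIFT

def translate(byte_addr: int) -> tuple[int, int]:
    """Return (word address, starting bit within the 18-bit word)."""
    # byte_addr * 8 / 18 is the same as byte_addr * 4 / 9.
    word_addr = div9(byte_addr << 2)
    # Remainder via multiply-by-18, itself just (x << 4) + (x << 1).
    start_bit = (byte_addr << 3) - 18 * word_addr
    return word_addr, start_bit

# Exhaustively confirm the shortcut over the whole address space.
for ba in range(1 << ADDR_BITS):
    assert translate(ba) == divmod(ba * 8, 18)
```

Even in this form there's a wide multiply, a shift, a small multiply and a subtract in series, all before the memory is even addressed.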
+
+The ECP5 has DSP hardware that, among other things, provides an 18x18
+bit multiplier (36-bit output). So, we could do that. However, there
+are a couple of problems. For the multiplier to run fast, you need to
+pipeline it, which increases overall memory access latency. If we want
+to hit the timings for "fast" memory access from a 65C816, assuming the
+FPGA design can run at 100MHz, we have 4 cycles to turn a read around.
+The memories themselves take 1 or 2 cycles, so the entire address
+translation and reassembly has to somehow be jammed into 2 cycles,
+_and_ the combinatorial paths can't be very long, because otherwise the
+design won't be able to run at 100MHz.
+
+Another issue is that the ECP5's DSP block is currently a bit of an
+unknown in Project Trellis, so if we want to use OSS tools, we don't
+get access to the full cosmic power of the ECP5 DSP blocks, we only
+get a basic 18x18 multiplier with no frills.
+
+Overall, this just seems like too much computation to jam into the
+number of cycles available.
+
+Finally, after talking with an expert hardware designer, the plan to
+make up the EBR memory shortfall using LUT RAM may also not work: in
+their experience, it's very difficult for large chunks of LUT RAM to
+meet sensible timing constraints. Even the 2KBytes needed by the
+original plan would likely be a problem.
+
+## What now?
+
+For now, I'm going to build GARY without any memory trickery. That
+means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be
+able to claw some of that back by tweaking the data structure layout
+and memory mapping, effectively compressing the data compared to what
+VERA stores.
+
+If we really want 128KBytes of VRAM, we have two main options:
+ - Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
+   with the stranded bits from running 4-bit wide memory, that's
+   216KBytes available. We also get roughly double the LUTs and DSP
+   units, although the -25 already has more than we need of
+   both.
+   - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
+     instead of $20/ea. Some BGA form factors offer the -45 for "only"
+     $33/ea, which still hurts but a bit less... in exchange for
+     having to learn how to BGA.
+ - Use an external RAM. This would let GARY have several megabytes of
+   VRAM easily.
+   - Downside: more BOM cost for the extra chip, though we can maybe
+     compensate by dropping back down to the smallest LFE5U-12 FPGA.
+   - Downside: to meet timing requirements, this needs to be a
+     parallel RAM, which will consume a couple dozen IOs on the FPGA
+     and require some painful routing of >100MHz traces (length
+     matching, worrying about signal integrity, maybe being forced
+     into more board layers...).
+
+## References
+
+For the tricksy address translation stuff, the references are Granlund
+& Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249 )
+for implementing division by a constant in various slick ways. Chapter
+10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery
+for decomposing divisions into a pipeline of adds and shifts.