experiments/rmw_ram: document failed/paused memory trickery experiments

David Anderson 2024-08-13 20:53:47 -07:00
parent da6ea4cf42
commit 27da4958d2
1 changed file with 140 additions and 0 deletions


@ -0,0 +1,140 @@
No real code here, just a record of a brief detour to experiment with
RAM layouts (plus a few small illustrative sketches).
The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so it can do that
easily. The ECP5, on the other hand, has no large SPRAMs. What it does
have is a varying number of EBRs (embedded block RAMs). These are
18Kbits each of dual-ported memory, which can be arrayed to form
larger memories.
For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
around $20 in unit quantities. It has plenty of logic tiles for our
needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
we can make up the difference with LUT RAMs.
Unfortunately no: the EBRs have configurable data width, so you can
set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
you select one of the power-of-two widths, you only get to access
16Kbits! IOW, the layout options you have for an EBR are:
- 16384x1b (16Kbits)
- 8192x2b (16Kbits)
- 4096x4b (16Kbits)
- 2048x9b (**18Kbits**)
- 1024x18b (**18Kbits**)
We want a byte-addressed memory, so if we configure things the obvious
way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
we only get 112KBytes total, because 14KBytes worth of bits have been
stranded by the layout.
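A quick sanity check of that arithmetic (a throwaway Python snippet,
not project code; the numbers come straight from the figures above):
```
# Capacity math for the LFE5U-25's 56 EBRs.
EBRS = 56
full_kbytes      = EBRS * 18 * 1024 // 8 // 1024  # 9b/18b widths: 126 KBytes
byte_wide_kbytes = EBRS * 16 * 1024 // 8 // 1024  # paired 4b widths: 112 KBytes
print(full_kbytes, byte_wide_kbytes, full_kbytes - byte_wide_kbytes)  # 126 112 14
```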
## The terrible plan
This led to a horrible line of thinking: what if we configure the
memory blocks for 9 or 18 bits, to get access to all the bits, and
then try to adapt that to an 8-bit external interface? How would that
even work?
Well, if you array a bunch of 18-bit words side by side, you could
chop those up into 8-bit chunks, with some chunks straddling an 18-bit
boundary:
```
[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
 aaBBccDD                  DDeeFFgg                  ggHHii..
         eeFFggHH                  HHii..aaBB
                 ii..aaBBcc                  ccDDeeFF
```
So, given a byte address, that byte would end up stored in either 1 or
2 EBR blocks, and writes on individual EBRs would have to read out the
appropriate 18-bit word, replace a subset of the bytes, and write back
the changed word.
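To make that concrete, here's a small Python model of the
translation. The function name and tuple layout are made up for
illustration (this isn't code from the design), and bits are numbered
0..17 left to right to match the diagram above:
```
WORD_BITS = 18

def byte_to_spans(byte_addr):
    """Map a byte address to 1 or 2 (word_addr, start_bit, bit_len) spans."""
    start = byte_addr * 8                 # absolute bit offset of the byte
    wa, sb = divmod(start, WORD_BITS)
    l1 = min(8, WORD_BITS - sb)           # how much of the byte fits here
    if l1 == 8:
        return [(wa, sb, 8)]              # whole byte inside one 18b word
    return [(wa, sb, l1), (wa + 1, 0, 8 - l1)]  # straddles into the next word

# First few bytes, matching the diagram: bytes 0 and 1 sit in word 0,
# byte 2 straddles words 0 and 1, and the pattern repeats every 9 bytes.
for ba in range(9):
    print(ba, byte_to_spans(ba))
```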
Conceptually, the flow for a memory write from the system bus (a code
sketch of the full flow follows the list):
- Inputs: byte address `BA`, byte value to write `BV`
- Translate the byte address to a pair of word addresses and bit
ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit,
W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which
happen to fall within a single 18b word.
- Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
- Fan out the write to the appropriate 1 or 2 EBRs
- Each EBR does a read-modify-write cycle to update the appropriate
bit ranges.
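Here's that flow as a rough behavioural model in Python (plain
software, not HDL; `mem` stands in for the EBR array, the helper names
are invented, and the fan-out to physical EBR blocks is ignored):
```
WORD_BITS = 18
mem = [0] * 1024   # stand-in for the EBR array: one 18-bit word per entry

def byte_to_spans(byte_addr):
    # Same translation as the earlier sketch, repeated so this runs standalone.
    start = byte_addr * 8
    wa, sb = divmod(start, WORD_BITS)
    l1 = min(8, WORD_BITS - sb)
    spans = [(wa, sb, l1)]
    if l1 < 8:
        spans.append((wa + 1, 0, 8 - l1))
    return spans

def write_byte(byte_addr, value):
    consumed = 0
    for wa, sb, ln in byte_to_spans(byte_addr):
        chunk = (value >> consumed) & ((1 << ln) - 1)  # the SV1/SV2 split
        word = mem[wa]                                 # read...
        word &= ~(((1 << ln) - 1) << sb)               # ...modify...
        mem[wa] = word | (chunk << sb)                 # ...write back
        consumed += ln

def read_byte(byte_addr):
    value = consumed = 0
    for wa, sb, ln in byte_to_spans(byte_addr):
        value |= ((mem[wa] >> sb) & ((1 << ln) - 1)) << consumed
        consumed += ln
    return value

# Round-trip check over a 2KByte region.
for ba in range(2048):
    write_byte(ba, (ba * 7) & 0xFF)
assert all(read_byte(ba) == (ba * 7) & 0xFF for ba in range(2048))
```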
## Oh no
The difficulties are several here, but the big one is the address
translation step: to turn a byte address into the pair of word
addresses plus bit ranges, we need to divide the byte address by 2.25
(`*8/18`). The ECP5 has no floating point hardware, and no hardware
divider, so division would suck.
There's well-known theory here though: multiplying by 8 is easy,
that's just a left shift. Dividing by 18 can be broken down into a
divide by 2 (shift right), followed by a division by 9. To implement
division by 9, there's a bunch of tricks that effectively turn the
division into a multiplication by a magic number followed by a
power-of-2 division (shift right again).
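For concreteness, the arithmetic in Python (the magic constant and
shift amount here are just one workable pair picked for a 17-bit,
i.e. 128KByte, byte address space, not necessarily what the hardware
would use):
```
# word_index = (byte_addr * 8) // 18 = (byte_addr * 4) // 9,
# with the division by 9 done as a multiply by ceil(2**21 / 9) plus a shift.
MAGIC = (1 << 21) // 9 + 1     # 233017, fits in 18 bits

def word_index(byte_addr):
    half_bits = (byte_addr * 8) >> 1   # *8 then /2: shifts only
    return (half_bits * MAGIC) >> 21   # /9 via multiply + shift

# Exhaustive check over the full 17-bit byte address space.
for ba in range(1 << 17):
    assert word_index(ba) == (ba * 8) // 18
```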
The ECP5 has DSP hardware that, among other things, provides an 18x18
bit multiplier (36-bit output). So, we could do that. However, there
are a couple of problems: for the multiplier to run fast, you need to
pipeline it, which increases overall memory access latency. If we want
to hit the timings for "fast" memory access from a 65C816, assuming
the FPGA design can run at 100MHz, we have 4 cycles to turn a read
around. The memories themselves take 1 or 2 cycles, so the entire
address translation and reassembly has to somehow be jammed into 2
cycles, _and_ the combinatorial paths can't be very long, because
otherwise the design won't run at 100MHz.
Another issue is that the ECP5's DSP block is currently a bit of an
unknown in Project Trellis, so if we want to use OSS tools, we don't
get access to the full cosmic power of the ECP5 DSP blocks; we only
get a basic 18x18 multiplier with no frills.
Overall, this just seems like too much computation to jam into the
number of cycles available.
Finally, after talking with an expert hardware designer, it sounds
like the plan to make up the EBR memory shortfall using LUT RAM may
also not work: in their experience, it's very difficult for large
chunks of LUT RAM to meet sensible timing constraints. Even the
2KBytes in the original plan would likely be a problem.
## What now?
For now, I'm going to build GARY without any memory trickery. That
means that on the LFE5U-25, it'll only have 112KBytes of video
RAM. We may be able to claw some of that back by tweaking the data
structure layout and memory mapping, effectively compressing the data
compared to what VERA stores.
If we really want 128KBytes of VRAM, we have two main options:
- Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
with the stranded bits from running 4-bit wide memory, that's
216KBytes available. We also get roughly double the LUTs and DSP
units, although the -25 already has more than we need of
both.
- Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
instead of $20/ea. Some BGA form factors offer the -45 for "only"
$33/ea, which still hurts but a bit less... in exchange for
having to learn how to BGA.
- Use an external RAM. This would let GARY have several megabytes of
VRAM easily.
- Downside: more BOM cost for the extra chip, though we can maybe
compensate by dropping back down to the smallest LFE5U-12 FPGA.
- Downside: to meet timing requirements, this needs to be a
parallel RAM, which will consume a couple dozen IOs on the FPGA
and require some painful routing of >100MHz traces (length
matching, worrying about signal integrity, maybe being forced
into more board layers...).
## References
For the tricksy address translation stuff, the references are
Granlund & Montgomery (1994,
https://dl.acm.org/doi/pdf/10.1145/178243.178249) for implementing
division by a constant in various slick ways. Chapter 10 of Hacker's
Delight, 2nd ed. (ISBN 0321842685) has further trickery for
decomposing divisions into a pipeline of adds and shifts.