gary/experiments/rmw_ram/README.md

No code here, just a record of a brief detour to experiment with RAM
layouts.

The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so can do that
easily. The ECP5 on the other hand has no large SPRAMs. What it does
have is a varying amount of EBRs, embedded block RAMs. These are
18Kbits each of dual-ported memory, which can be arrayed to form
larger memories.

For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
around $20 in unit quantities. It has plenty of logic tiles for our
needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
we can make up the difference with LUT RAMs.

Unfortunately no: the EBRs have configurable data width, so you can
set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
you select one of the power of two widths, you only get to access
16Kbits! IOW, the layout options you have for an EBR are:

 - 16384x1b (16Kbits)
 - 8192x2b (16Kbits)
 - 4096x4b (16Kbits)
 - 2048x9b (**18Kbits**)
 - 1024x18b (**18Kbits**)

We want a byte-addressed memory, so if we configure things the obvious
way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
we only get 112KBytes total, because 14KBytes worth of bits have been
stranded by the layout.
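
The capacity figures above can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch: the names `EBR_COUNT`, `LAYOUTS` and `usable_kbytes` are made up for illustration, and the layouts are simply the ones listed above.

```python
EBR_COUNT = 56  # EBR tiles on the LFE5U-25, per the text above

# (depth, width in bits) for each layout option listed above
LAYOUTS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18)]


def usable_kbytes(ebr_count, depth, width):
    """Usable capacity in KBytes if every EBR is configured as depth x width."""
    total_bits = ebr_count * depth * width
    return total_bits / 8 / 1024


for depth, width in LAYOUTS:
    kbytes = usable_kbytes(EBR_COUNT, depth, width)
    print(f"{depth}x{width}b: {kbytes:.0f} KBytes")
```

The power-of-two widths all come out at 112 KBytes, and the 9b and 18b widths at 126 KBytes, matching the numbers quoted above.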

## The terrible plan

This led to a horrible line of thinking: what if we configure the
memory blocks for 9 or 18 bits, to get access to all the bits, and
then try to adapt that to an 8-bit external interface? How would that
even work?

Well, if you array a bunch of 18-bit words side by side, you could
chop those up into 8-bit chunks, with some chunks straddling an 18-bit
boundary:

```
[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
 aaBBccDD                  DDeeFFgg                  ggHHii..
         eeFFggHH                  HHii..aaBB
                 ii..aaBBcc                  ccDDeeFF
```

So, given a byte address, that byte would end up stored in either 1 or
2 EBR blocks, and writes on individual EBRs would have to read out the
appropriate 18-bit word, replace a subset of the bits, and write back
the changed word.

Conceptually, the flow for a memory write from the system bus:
 - Inputs: byte address `BA`, byte value to write `BV`
 - Translate the byte address to a pair of word addresses and bit
   ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit,
   W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which
   happen to fall within a single 18b word.
 - Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
 - Fan out the write to the appropriate 1 or 2 EBRs
 - Each EBR does a read-modify-write cycle to update the appropriate
   bit ranges.
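
The write flow above can be modelled in software as a rough sketch. Several things here are assumptions rather than anything decided in the experiment: the memory is a flat Python list of 18-bit words (the mapping of words onto individual EBRs is not modelled), bits are packed LSB-first within each word, and the helper names `translate`, `write_byte` and `read_byte` are hypothetical. The translation step uses plain integer division, which is exactly the part that is hard to do in hardware.

```python
WORD_BITS = 18  # width of one EBR word in the 1024x18b layout


def translate(ba):
    """Map a byte address to 1 or 2 (word_addr, start_bit, bit_len) spans."""
    wa1, start = divmod(ba * 8, WORD_BITS)
    len1 = min(8, WORD_BITS - start)
    spans = [(wa1, start, len1)]
    if len1 < 8:
        # The byte straddles a word boundary: the rest lands at bit 0
        # of the following word.
        spans.append((wa1 + 1, 0, 8 - len1))
    return spans


def write_byte(mem, ba, bv):
    """Write byte value bv at byte address ba via read-modify-write."""
    consumed = 0  # number of bits of bv already placed (assumed LSB-first)
    for wa, start, length in translate(ba):
        piece = (bv >> consumed) & ((1 << length) - 1)
        mask = ((1 << length) - 1) << start
        word = mem[wa]                            # read
        word = (word & ~mask) | (piece << start)  # modify
        mem[wa] = word                            # write back
        consumed += length


def read_byte(mem, ba):
    """Reassemble the byte at byte address ba from its 1 or 2 spans."""
    bv = 0
    consumed = 0
    for wa, start, length in translate(ba):
        piece = (mem[wa] >> start) & ((1 << length) - 1)
        bv |= piece << consumed
        consumed += length
    return bv
```

For example, `translate(2)` returns two spans, because the third byte takes the last 2 bits of word 0 and the first 6 bits of word 1, as in the diagram above.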

## Oh no

The difficulties are several here, but the big one is the address
translation step: to turn a byte address into the pair of word
addresses plus bit ranges, we need to divide the byte address by 2.25
(`*8/18`). The ECP5 has no floating point hardware, and no hardware
divider, so division would suck.

There's well-known theory here though: multiplying by 8 is easy,
that's just a left-shift. Dividing by 18 we can break down into a
divide by 2 (shift-right), followed by a division by 9. To implement
division by 9, there's a bunch of tricks that effectively turn the
division into a multiplication by a magic number followed by a power
of 2 division (shift-right again).
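
As a concrete illustration of the magic-number trick, here is a sketch for a 17-bit byte address (128KBytes). The constants are worked out here for illustration and were not part of the experiment: `MAGIC` is `ceil(2**21 / 9)`, and since `(BA*8)/18 == (BA*4)/9`, the multiply by 4 is folded into the shift by using 19 instead of 21. The sketch only checks the arithmetic; whether it maps well onto the ECP5 DSP block or meets timing is not something it can tell you.

```python
ADDR_BITS = 17   # 128 KBytes worth of byte addresses
MAGIC = 233017   # ceil(2**21 / 9); note 9 * MAGIC == 2**21 + 1
SHIFT = 19       # 21, minus 2 to fold in the *4 from (BA*8)/18 == (BA*4)/9


def word_address(ba):
    """18-bit word index holding the first bit of byte ba, without dividing."""
    return (ba * MAGIC) >> SHIFT


def start_bit(ba):
    """Bit offset of byte ba inside its first word (remainder of the division)."""
    return ba * 8 - 18 * word_address(ba)


def check_all():
    """Exhaustively compare against real division for every byte address."""
    return all(
        divmod(ba * 8, 18) == (word_address(ba), start_bit(ba))
        for ba in range(1 << ADDR_BITS)
    )
```

With these constants, both multiplier operands fit in 18 bits and the product fits in 36 bits, and `check_all()` confirms the result matches `(BA*8)//18` over the whole 17-bit address range.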

The ECP5 has DSP hardware that, among other things, provides an 18x18
bit multiplier (36-bit output). So, we could do that. However, there
are a couple of problems. For the multiplier to run fast, you need to
pipeline it, which increases overall memory access latency. If we want
to hit the timings for "fast" memory access from a 65C816, assuming
the FPGA design can run at 100MHz, we have 4 cycles to turn a read
around. The memories themselves take 1 or 2 cycles, so the entire
address translation and reassembly has to somehow be jammed into 2
cycles, _and_ the combinatorial paths can't be very long because
otherwise the design won't be able to run at 100MHz.

Another issue is that the ECP5's DSP block is currently a bit of an
unknown in Project Trellis, so if we want to use OSS tools, we don't
get access to the full cosmic power of the ECP5 DSP blocks; we only
get a basic 18x18 multiplier with no frills.

Overall, this just seems like too much computation to jam into the
number of cycles available.

Finally, after talking with an expert hardware designer, the plan to
make up the memory shortfall of EBR using LUT RAM may also not work:
in their experience it's very difficult for large chunks of LUT RAM to
meet sensible timing constraints. Even the 2KBytes in the original
plan would likely be a problem.

## What now?

For now, I'm going to build GARY without any memory trickery. That
means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be
able to claw some of that back by tweaking the data structure layout
and memory mapping, to effectively have compression compared to what
VERA stores.

If we really want 128KBytes of VRAM, we have two main options:
 - Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
   with the stranded bits from running 4-bit wide memory, that's
   216KBytes available. We also get roughly double the LUTs and DSP
   units, although the -25 already has more than we need of
   both.
   - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
     instead of $20/ea. Some BGA form factors offer the -45 for "only"
     $33/ea, which still hurts but a bit less... in exchange for
     having to learn how to BGA.
 - Use an external RAM. This would let GARY have several megabytes of
   VRAM easily.
   - Downside: more BOM cost for the extra chip, though we can maybe
     compensate by dropping back down to the smallest LFE5U-12 FPGA.
   - Downside: to meet timing requirements, this needs to be a
     parallel RAM, which will consume a couple dozen IOs on the FPGA
     and require some painful routing of >100MHz traces (length
     matching, worrying about signal integrity, maybe being forced
     into more board layers...).

## References

For the tricksy address translation stuff, the references are Granlund
& Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249 )
for implementing division by a constant in various slick ways. Chapter
10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery
for decomposing divisions into a pipeline of adds and shifts.

experiments/rmw_ram: document failed/paused memory trickery experiments 2024-08-14 05:53:47 +02:00