experiments/rmw_ram: document failed/paused memory trickery experiments
parent da6ea4cf42
commit 27da4958d2
@ -0,0 +1,140 @@
No code here, just a record of a brief detour to experiment with RAM
layouts.

The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so it can do that
easily. The ECP5, on the other hand, has no large SPRAMs. What it does
have is a varying number of EBRs, embedded block RAMs. These are
18Kbits each of dual-ported memory, which can be arrayed to form
larger memories.

For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
around $20 in unit quantities. It has plenty of logic tiles for our
needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
we can make up the difference with LUT RAMs.

Unfortunately no: the EBRs have configurable data width, so you can
set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
you select one of the power-of-two widths, you only get to access
16Kbits! IOW, the layout options you have for an EBR are:

- 16384x1b (16Kbits)
- 8192x2b (16Kbits)
- 4096x4b (16Kbits)
- 2048x9b (**18Kbits**)
- 1024x18b (**18Kbits**)

We want a byte-addressed memory, so if we configure things the obvious
way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
we only get 112KBytes total, because 14KBytes worth of bits have been
stranded by the layout.

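The capacity arithmetic above can be checked with a few lines of
Python. This is only a sketch: the names `EBR_LAYOUTS` and
`vram_kbytes` are made up for illustration, and the depth/width pairs
are copied from the list of layout options.

```python
# Usable bits per EBR for each data width, copied from the layout list above.
EBR_LAYOUTS = {
    1: 16384 * 1,
    2: 8192 * 2,
    4: 4096 * 4,
    9: 2048 * 9,
    18: 1024 * 18,
}


def vram_kbytes(n_ebr: int, width: int) -> float:
    """Total usable capacity, in KBytes, of n_ebr blocks at the given width."""
    return n_ebr * EBR_LAYOUTS[width] / 8 / 1024


if __name__ == "__main__":
    # 56 is the EBR count quoted above for the LFE5U-25.
    for width in EBR_LAYOUTS:
        print(f"{width:2d}b wide: {vram_kbytes(56, width):.0f} KBytes")
```

The gap between the 4b and 18b rows is the 14KBytes of stranded bits.
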
## The terrible plan

This led to a horrible line of thinking: what if we configure the
memory blocks for 9 or 18 bits, to get access to all the bits, and
then try to adapt that to an 8-bit external interface? How would that
even work?

Well, if you array a bunch of 18-bit words side by side, you could
chop those up into 8-bit chunks, with some chunks straddling an 18-bit
boundary:

```
[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
 aaBBccDD                  DDeeFFgg                  ggHHii..
         eeFFggHH                  HHii..aaBB
                 ii..aaBBcc                  ccDDeeFF
```

So, given a byte address, that byte would end up stored in either 1 or
2 EBR blocks, and writes on individual EBRs would have to read out the
appropriate 18-bit word, replace a subset of the bits, and write back
the changed word.

Conceptually, the flow for a memory write from the system bus:
- Inputs: byte address `BA`, byte value to write `BV`
- Translate the byte address to a pair of word addresses and bit
  ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit,
  W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which
  happen to fall within a single 18b word.
- Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
- Fan out the write to the appropriate 1 or 2 EBRs
- Each EBR does a read-modify-write cycle to update the appropriate
  bit ranges.

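As a software model of the flow above, here is a minimal Python
sketch. It makes assumptions the notes don't pin down: the EBR array
is treated as one flat list of 18-bit words, bytes are packed
LSB-first into a continuous bit stream, and the names `translate`,
`rmw`, `write_byte` and `read_byte` are invented for illustration.
Real hardware would also have to decide which physical EBR holds which
word.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1


def translate(ba):
    """Map byte address BA to 1 or 2 (word_addr, start_bit, bit_len) spans."""
    wa1, start = divmod(ba * 8, WORD_BITS)
    len1 = min(8, WORD_BITS - start)
    spans = [(wa1, start, len1)]
    if len1 < 8:
        # The byte straddles a word boundary: the rest lands at the
        # bottom of the next word.
        spans.append((wa1 + 1, 0, 8 - len1))
    return spans


def rmw(words, wa, start, length, value):
    """Read-modify-write: replace `length` bits at `start` in word `wa`."""
    mask = ((1 << length) - 1) << start
    old = words[wa]                                   # read
    new = (old & ~mask & WORD_MASK) | ((value << start) & mask)  # modify
    words[wa] = new                                   # write back


def write_byte(words, ba, bv):
    """Split BV into SV1/SV2 (low bits first) and fan out to each span."""
    consumed = 0
    for wa, start, length in translate(ba):
        sv = (bv >> consumed) & ((1 << length) - 1)
        rmw(words, wa, start, length, sv)
        consumed += length


def read_byte(words, ba):
    """Reassemble a byte from the 1 or 2 spans that hold it."""
    value, consumed = 0, 0
    for wa, start, length in translate(ba):
        value |= ((words[wa] >> start) & ((1 << length) - 1)) << consumed
        consumed += length
    return value
```

For example, `translate(2)` yields two spans, matching the
`ii..aaBBcc` chunk in the diagram: 2 bits at the top of word 0 and 6
bits at the bottom of word 1.
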
## Oh no

There are several difficulties here, but the big one is the address
translation step: to turn a byte address into the pair of word
addresses plus bit ranges, we need to divide the byte address by 2.25
(`*8/18`). The ECP5 has no floating point hardware, and no hardware
divider, so division would suck.

There's well-known theory here though: multiplying by 8 is easy,
that's just a left-shift. Dividing by 18 we can break down into a
division by 2 (shift-right), followed by a division by 9. To implement
division by 9 there's a bunch of tricks that effectively turn the
division into a multiplication by a magic number followed by a
power-of-2 division (shift-right again).

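To make the magic-number idea concrete, here is a hedged Python
sketch. It assumes a 17-bit byte address (128KBytes) and folds the
`*8/18` into a single `*4/9`. The constant `MAGIC = 233017`
(`ceil(2**21 / 9)`) and the shift of 19 are picked for this
illustration rather than taken from these notes, and the sketch checks
them by brute force instead of trusting the derivation.

```python
# Assumed for illustration: BA is a 17-bit byte address (0 .. 128K-1).
ADDR_BITS = 17

# ceil(2**21 / 9). It is below 2**18, so it fits one input of an
# 18x18 multiplier, and the product with a 17-bit BA fits in 36 bits.
MAGIC = 233017
SHIFT = 19


def word_addr(ba):
    """floor(BA * 8 / 18) using one multiply and one shift, no divider."""
    return (ba * MAGIC) >> SHIFT


def bit_offset(ba):
    """Start bit of byte BA inside its 18-bit word."""
    # Multiplying by 18 is a shift-and-add: (x << 4) + (x << 1).
    wa = word_addr(ba)
    return (ba << 3) - ((wa << 4) + (wa << 1))


def check():
    """Brute-force comparison against real division over every address."""
    return all(
        word_addr(ba) == (ba * 8) // 18 and bit_offset(ba) == (ba * 8) % 18
        for ba in range(1 << ADDR_BITS)
    )
```

Calling `check()` walks all 131072 byte addresses, which is cheap in
software. It says nothing about whether the multiply fits the cycle
budget in hardware.
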
The ECP5 has DSP hardware that, among other things, provides an 18x18
bit multiplier (36-bit output). So, we could do that. However, there
are a couple of problems. For the multiplier to run fast, you need to
pipeline it, which increases overall memory access latency. If we want
to hit the timings for "fast" memory access from a 65C816, assuming
the FPGA design can run at 100MHz, we have 4 cycles to turn a read
around. The memories themselves take 1 or 2 cycles, so the entire
address translation and reassembly has to somehow be jammed into 2
cycles, _and_ the combinatorial paths can't be very long because
otherwise the design won't be able to run at 100MHz.

Another issue is that the ECP5's DSP block is currently a bit of an
unknown in Project Trellis, so if we want to use OSS tools, we don't
get access to the full cosmic power of the ECP5 DSP blocks; we only
get a basic 18x18 multiplier with no frills.

Overall, this just seems like too much computation to jam into the
number of cycles available.

Finally, after talking with an expert hardware designer, the plan to
make up the EBR shortfall with LUT RAM may not work either: in their
experience, it's very difficult for large chunks of LUT RAM to meet
sensible timing constraints. Even the 2KBytes in the original plan
would likely be a problem.

## What now?

For now, I'm going to build GARY without any memory trickery. That
means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be
able to claw some of that back by tweaking the data structure layout
and memory mapping, to effectively have compression compared to what
VERA stores.

If we really want 128KBytes of VRAM, we have two main options:
- Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
  with the stranded bits from running 4-bit wide memory, that's
  216KBytes available. We also get roughly double the LUTs and DSP
  units, although the -25 already has more than we need of
  both.
  - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
    instead of $20/ea. Some BGA form factors offer the -45 for "only"
    $33/ea, which still hurts but a bit less... in exchange for
    having to learn how to BGA.
- Use an external RAM. This would let GARY have several megabytes of
  VRAM easily.
  - Downside: more BOM cost for the extra chip, though we can maybe
    compensate by dropping back down to the smallest LFE5U-12 FPGA.
  - Downside: to meet timing requirements, this needs to be a
    parallel RAM, which will consume a couple dozen IOs on the FPGA
    and require some painful routing of >100MHz traces (length
    matching, worrying about signal integrity, maybe being forced
    into more board layers...).

## References

For the tricksy address translation stuff, the references are Granlund
& Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249)
for implementing division by a constant in various slick ways. Chapter
10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery
for decomposing divisions into a pipeline of adds and shifts.