experiments/rmw_ram: document failed/paused memory trickery experiments
parent da6ea4cf42
commit 27da4958d2

No code here, just a record of a brief detour to experiment with RAM
layouts.

The issue: for parity with VERA, we'd like 128KBytes of VRAM. The
iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so it can do that
easily. The ECP5, on the other hand, has no large SPRAMs. What it does
have is a varying number of EBRs, embedded block RAMs. These are
18Kbits each of dual-ported memory, which can be arrayed to form
larger memories.

For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is
around $20 in unit quantities. It has plenty of logic tiles for our
needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe
we can make up the difference with LUT RAMs.

Unfortunately no: the EBRs have configurable data width, so you can
set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if
you select one of the power-of-two widths, you only get to access
16Kbits! IOW, the layout options you have for an EBR are:

- 16384x1b (16Kbits)
- 8192x2b (16Kbits)
- 4096x4b (16Kbits)
- 2048x9b (**18Kbits**)
- 1024x18b (**18Kbits**)

We want a byte-addressed memory, so if we configure things the obvious
way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs
we only get 112KBytes total, because 14KBytes worth of bits have been
stranded by the layout.
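
To double-check the arithmetic above, here's a quick Python sketch. The
depth table is just the layout list restated; `usable_kbytes` is a
made-up helper name, and 1 KByte is taken to be 1024 bytes.

```python
# Depth (number of entries) of one ECP5 EBR at each configurable data width,
# restated from the layout list above.
EBR_DEPTH = {1: 16384, 2: 8192, 4: 4096, 9: 2048, 18: 1024}


def usable_kbytes(num_ebrs, width):
    """Total addressable capacity in KBytes for num_ebrs EBRs at this width.

    Hypothetical helper, for illustration only.
    """
    bits = num_ebrs * EBR_DEPTH[width] * width
    return bits / 8 / 1024


full = usable_kbytes(56, 18)   # every bit of every EBR reachable
bytewise = usable_kbytes(56, 4)  # pairs of 4b EBRs forming an 8b memory
print(full, bytewise, full - bytewise)
```

The same helper works for other EBR counts, so it's easy to redo the
sums for a different SKU.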

## The terrible plan

This led to a horrible line of thinking: what if we configure the
memory blocks for 9 or 18 bits, to get access to all the bits, and
then try to adapt that to an 8-bit external interface? How would that
even work?

Well, if you array a bunch of 18-bit words side by side, you could
chop those up into 8-bit chunks, with some chunks straddling an 18-bit
boundary:

```
[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
 aaBBccDD                  DDeeFFgg                  ggHHii..
         eeFFggHH                  HHii..aaBB
                 ii..aaBBcc                  ccDDeeFF
```

So, given a byte address, that byte would end up stored in either 1 or
2 EBR blocks, and a write to an individual EBR would have to read out
the appropriate 18-bit word, replace a subset of its bits, and write
back the changed word.

Conceptually, the flow for a memory write from the system bus:
- Inputs: byte address `BA`, byte value to write `BV`
- Translate the byte address to a pair of word addresses and bit
  ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit, W2_bitlen)`.
  Note the `WA2` tuple may be null, for bytes which
  happen to fall within a single 18b word.
- Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
- Fan out the write to the appropriate 1 or 2 EBRs
- Each EBR does a read-modify-write cycle to update the appropriate
  bit ranges.
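
As a rough software model of the flow above (not RTL, and not something
that was actually built), here's a Python sketch. It assumes bytes are
packed back to back, least significant bit first, into a flat array of
18-bit words. The bit ordering, the `translate`/`rmw`/`write_byte`/`read_byte`
names and the list-of-ints memory are all assumptions made for
illustration.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1


def translate(ba):
    """Map byte address BA to one or two (word_addr, start_bit, bit_len) spans."""
    wa1, start = divmod(ba * 8, WORD_BITS)
    len1 = min(8, WORD_BITS - start)
    spans = [(wa1, start, len1)]
    if len1 < 8:
        # The byte straddles a word boundary; the rest lands at the
        # bottom of the next word.
        spans.append((wa1 + 1, 0, 8 - len1))
    return spans


def rmw(mem, wa, start, length, value):
    """Read-modify-write: replace `length` bits at `start` in word `wa`."""
    mask = ((1 << length) - 1) << start
    word = mem[wa]                                  # read
    word = (word & ~mask & WORD_MASK) | ((value << start) & mask)  # modify
    mem[wa] = word                                  # write back


def write_byte(mem, ba, bv):
    """Split BV into bit spans (low bits first, an assumption) and fan out."""
    for wa, start, length in translate(ba):
        rmw(mem, wa, start, length, bv & ((1 << length) - 1))
        bv >>= length


def read_byte(mem, ba):
    """Reassemble a byte from the 1 or 2 words it lives in."""
    value, shift = 0, 0
    for wa, start, length in translate(ba):
        value |= ((mem[wa] >> start) & ((1 << length) - 1)) << shift
        shift += length
    return value


mem = [0] * 4                 # 4 words of 18 bits hold exactly 9 bytes
write_byte(mem, 2, 0xA5)      # byte 2 straddles words 0 and 1
print(translate(2))           # [(0, 16, 2), (1, 0, 6)]
print(hex(read_byte(mem, 2)))  # 0xa5
```

Note that `translate` uses `divmod` by 18, which is exactly the step
that has no cheap hardware equivalent.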

## Oh no

The difficulties are several here, but the big one is the address
translation step: to turn a byte address into the pair of word
addresses plus bit ranges, we need to divide the byte address by 2.25
(`*8/18`). The ECP5 has no floating point hardware, and no hardware
divider, so division would suck.

There's well-known theory here though: multiplying by 8 is easy, that's
just a left-shift. Dividing by 18 breaks down into a divide by 2
(shift-right) followed by a division by 9. To implement division by 9,
there's a bunch of tricks that effectively turn the division into a
multiplication by a magic number followed by a power-of-2 division
(shift-right again).
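
For a feel of what the magic number trick looks like, here's a minimal
Python sketch. It leans on the fact that 9 x 233017 = 2^21 + 1, so
multiplying by 233017 and shifting right by 21 gives exact division by 9
for any input below 2^21, which covers `BA*4` for a 17-bit byte address.
The function names are made up, and this says nothing about how well it
would map onto the ECP5's multipliers or pipeline.

```python
MAGIC = 233017   # (2**21 + 1) // 9
SHIFT = 21


def div9(n):
    """Exact n // 9 for 0 <= n < 2**21, using a multiply and a shift."""
    return (n * MAGIC) >> SHIFT


def word_address(ba):
    """Return (word address, start bit) for byte address ba.

    Equivalent to divmod(ba * 8, 18): multiply by 8, divide by 2,
    then divide by 9 via the magic number.
    """
    half = ba << 2                 # ba * 8 / 2
    wa = div9(half)
    start = (half - 9 * wa) << 1   # (half mod 9) * 2 == (ba * 8) mod 18
    return wa, start


# Exhaustive check over every 17-bit byte address (128KBytes).
assert all(word_address(ba) == divmod(ba * 8, 18) for ba in range(1 << 17))
print(word_address(2), word_address(4))
```

The multiply, the shift and the remainder fix-up are each cheap on their
own; the trouble is fitting all of them plus the memory access into the
cycle budget.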

The ECP5 has DSP hardware that, among other things, provides an 18x18
bit multiplier (36-bit output). So, we could do that. However, there
are a couple of problems: for the multiplier to run fast, you need to
pipeline it, which increases overall memory access latency. If we want
to hit the timings for "fast" memory access from a 65C816, assuming the
FPGA design can run at 100MHz, we have 4 cycles to turn a read around.
The memories themselves take 1 or 2 cycles, so the entire address
translation and reassembly has to somehow be jammed into 2 cycles,
_and_ the combinatorial paths can't be very long because otherwise the
design won't be able to run at 100MHz.

Another issue is that the ECP5's DSP block is currently a bit of an
unknown in Project Trellis, so if we want to use OSS tools, we don't
get access to the full cosmic power of the ECP5 DSP blocks, we only
get a basic 18x18 multiplier with no frills.

Overall, this just seems like too much computation to jam into the
number of cycles available.

Finally, after talking with an expert hardware designer, the plan to
make up the EBR memory shortfall using LUT RAM may also not work: in
their experience it's very difficult for large chunks of LUT RAM to
meet sensible timing constraints. Even the 2K in the original plan
would likely be a problem.

## What now?

For now, I'm going to build GARY without any memory trickery. That
means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be
able to claw some of that back by tweaking the data structure layout
and memory mapping, effectively compressing the data compared to what
VERA stores.

If we really want 128KBytes of VRAM, we have two main options:
- Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even
  with the stranded bits from running 4-bit wide memory, that's
  216KBytes available. We also get roughly double the LUTs and DSP
  units, although the -25 already has more than we need of
  both.
  - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea
    instead of $20/ea. Some BGA form factors offer the -45 for "only"
    $33/ea, which still hurts but a bit less... in exchange for
    having to learn how to BGA.
- Use an external RAM. This would let GARY have several megabytes of
  VRAM easily.
  - Downside: more BOM cost for the extra chip, though we can maybe
    compensate by dropping back down to the smallest LFE5U-12 FPGA.
  - Downside: to meet timing requirements, this needs to be a
    parallel RAM, which will consume a couple dozen IOs on the FPGA
    and require some painful routing of >100MHz traces (length
    matching, worrying about signal integrity, maybe being forced
    into more board layers...).

## References

For the tricksy address translation stuff, the references are Granlund
& Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249)
for implementing division by a constant in various slick ways. Chapter
10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery
for decomposing divisions into a pipeline of adds and shifts.