No code here, just a record of a brief detour to experiment with RAM layouts.

The issue: for parity with VERA, we'd like 128KBytes of VRAM. The iCE40 UltraPlus has 4x256Kbits of SPRAM built in, so it can do that easily. The ECP5, on the other hand, has no large SPRAMs. What it does have is a varying number of EBRs, embedded block RAMs. These are 18Kbits each of dual-ported memory, which can be arrayed to form larger memories.

For BOM cost reasons, we'd like to target the LFE5U-25 SKU, which is around $20 in unit quantities. It has plenty of logic tiles for our needs, and 56 EBR tiles. This adds up to 126KBytes. Close, and maybe we can make up the difference with LUT RAMs.

Unfortunately no: the EBRs have configurable data width, so you can set them up as an array of 1, 2, 4, 9 or 18-bit values. However, if you select one of the power-of-two widths, you only get to access 16Kbits! IOW, the layout options you have for an EBR are:

- 16384x1b (16Kbits)
- 8192x2b (16Kbits)
- 4096x4b (16Kbits)
- 2048x9b (**18Kbits**)
- 1024x18b (**18Kbits**)

We want a byte-addressed memory, so if we configure things the obvious way and use a pair of 4b EBRs to form an 8b memory, then with 56 EBRs we only get 112KBytes total, because 14KBytes worth of bits have been stranded by the layout.

## The terrible plan

This led to a horrible line of thinking: what if we configure the memory blocks for 9 or 18 bits, to get access to all the bits, and then try to adapt that to an 8-bit external interface?

How would that even work? Well, if you array a bunch of 18-bit words side by side, you could chop those up into 8-bit chunks, with some chunks straddling an 18-bit boundary:

```
[aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii][aaBBccDDeeFFggHHii]
 aaBBccDD                  DDeeFFgg                  ggHHii..
         eeFFggHH                  HHii..aaBB
                 ii..aaBBcc                  ccDDeeFF
```

So, given a byte address, that byte would end up stored in either 1 or 2 EBR blocks, and writes on individual EBRs would have to read out the appropriate 18-bit word, replace a subset of its bits, and write back the changed word.

Conceptually, the flow for a memory write from the system bus:

- Inputs: byte address `BA`, byte value to write `BV`
- Translate the byte address to a pair of word addresses and bit ranges `(WA1, W1_startbit, W1_bitlen), (WA2, W2_startbit, W2_bitlen)`. Note the `WA2` tuple may be null, for bytes which happen to fall within a single 18b word.
- Split the byte into appropriate bit spans: `SV1 = BV[:W1_bitlen]`, `SV2 = BV[W1_bitlen:]`
- Fan out the write to the appropriate 1 or 2 EBRs
- Each EBR does a read-modify-write cycle to update the appropriate bit ranges.

## Oh no

The difficulties are several here, but the big one is the address translation step: to turn a byte address into the pair of word addresses plus bit ranges, we need to divide the byte address by 2.25 (`*8/18`). The ECP5 has no floating point hardware, and no hardware divider, so division would suck.

There's well-known theory here though: multiply by 8 is easy, that's just a left-shift. Divide by 18 we can break down into a divide by 2 (shift-right), followed by a division by 9. To implement division by 9 there's a bunch of tricks that effectively turn the division into a multiplication by a magic number followed by a power-of-2 division (shift-right again). The ECP5 has DSP hardware that, among other things, provides an 18x18 bit multiplier (36-bit output). So, we could do that.

However, a couple of problems: for the multiplier to run fast, you need to pipeline it, which increases overall memory access latency. If we want to hit the timings for "fast" memory access from a 65C816, assuming the FPGA design can run at 100MHz, we have 4 cycles to turn a read around.
The memories themselves take 1 or 2 cycles, so the entire address translation and reassembly has to somehow be jammed into 2 cycles, _and_ the combinatorial paths can't be very long because otherwise the design won't be able to run at 100MHz.

Another issue is that the ECP5's DSP block is currently a bit of an unknown in Project Trellis, so if we want to use OSS tools, we don't get access to the full cosmic power of the ECP5 DSP blocks: we only get a basic 18x18 multiplier with no frills. Overall, this just seems like too much computation to jam into the number of cycles available.

Finally, after talking with an expert hardware designer, the plan to make up the EBR memory shortfall using LUT RAM may also not work: in their experience it's very difficult for large chunks of LUT RAM to meet sensible timing constraints. Even the 2KBytes in the original plan would likely be a problem.

## What now?

For now, I'm going to build GARY without any memory trickery. That means on LFE5U-25, it'll only have 112KBytes of video RAM. We may be able to claw some of that back by tweaking the data structure layout and memory mapping, to effectively have compression compared to what VERA stores.

If we really want 128KBytes of VRAM, we have two main options:

- Bump up to the bigger LFE5U-45 SKU, which has 108 EBR blocks. Even with the stranded bits from running 4-bit wide memory, that's 216KBytes available. We also get roughly double the LUTs and DSP units, although the -25 already has more than we need of both.
  - Downside: in LQFP form, the -45 is 2x more expensive, $40/ea instead of $20/ea. Some BGA form factors offer the -45 for "only" $33/ea, which still hurts but a bit less... in exchange for having to learn how to BGA.
- Use an external RAM. This would let GARY have several megabytes of VRAM easily.
  - Downside: more BOM cost for the extra chip, though we can maybe compensate by dropping back down to the smallest LFE5U-12 FPGA.
  - Downside: to meet timing requirements, this needs to be a parallel RAM, which will consume a couple dozen IOs on the FPGA and require some painful routing of >100MHz traces (length matching, worrying about signal integrity, maybe being forced into more board layers...).

## References

For the tricksy address translation stuff, the references are Granlund & Montgomery (1994, https://dl.acm.org/doi/pdf/10.1145/178243.178249) for implementing division by a constant in various slick ways. Chapter 10 of Hacker's Delight 2nd ed. (ISBN 0321842685) has further trickery for decomposing divisions into a pipeline of adds and shifts.
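As a rough illustration of the multiply-and-shift trick those references describe, here is a Python sketch of the byte-address translation from the "Oh no" section. It is not GARY code, and everything in it is an assumption made for illustration: the function names, the LSB-first bit ordering, and the shortcut of dividing the byte address by 9 (since 9 bytes fill exactly four 18-bit words) instead of computing `*8/18` on a bit address. The magic constant 233017 was picked for this sketch because 233017 * 9 = 2^21 + 1 and it fits in 18 bits; other constants would work too.

```python
# Illustrative sketch only: names, bit ordering and constants are assumptions.
WORD_BITS = 18               # EBRs configured as 1024x18b
MAGIC, SHIFT = 233017, 21    # 233017 * 9 == 2**21 + 1, and 233017 < 2**18


def div9(n):
    """floor(n / 9) for 0 <= n < 2**17 using one multiply and one shift."""
    return (n * MAGIC) >> SHIFT


def translate(ba):
    """Map byte address -> list of (word_addr, start_bit, bit_len) spans.

    Returns one span when the byte sits inside a single 18-bit word,
    two spans when it straddles a word boundary.
    """
    group = div9(ba)                       # which 9-byte / 4-word group
    bit = (ba - 9 * group) * 8             # bit offset inside the 72-bit group
    word, start = divmod(bit, WORD_BITS)   # only 9 cases: a tiny table in hardware
    wa = 4 * group + word
    first_len = min(8, WORD_BITS - start)
    spans = [(wa, start, first_len)]
    if first_len < 8:
        spans.append((wa + 1, 0, 8 - first_len))
    return spans


def write_byte(mem, ba, bv):
    """Read-modify-write each affected word; low bits of bv go to the first span."""
    for wa, start, length in translate(ba):
        mask = ((1 << length) - 1) << start
        mem[wa] = (mem[wa] & ~mask) | ((bv << start) & mask)
        bv >>= length


def read_byte(mem, ba):
    """Reassemble a byte from its one or two spans."""
    value, shift = 0, 0
    for wa, start, length in translate(ba):
        value |= ((mem[wa] >> start) & ((1 << length) - 1)) << shift
        shift += length
    return value


if __name__ == "__main__":
    # Exhaustively check the magic constant over every 17-bit byte address.
    assert all(div9(n) == n // 9 for n in range(2**17))
    for ba in range(9):
        print(ba, translate(ba))
```

For example, `translate(2)` returns `[(0, 16, 2), (1, 0, 6)]`: the last two bits of word 0 plus the first six bits of word 1, matching the `ii..aaBBcc` chunk in the diagram. Even in this form the hardware still needs a multiply, a subtract, a small lookup, and up to two read-modify-write cycles per byte, which is the kind of work that is hard to fit into the cycle budget.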