Archive for February, 2009
I’m organizing an introductory workshop for people who wish to start designing with FPGAs at the /tmp/lab hackerspace near Paris. The event will take place on Saturday March 21st, from 14:30 to 23:30, and is free of charge.
- Presentation of the FPGA technology
- Project examples
- Basics of synchronous logic design
- Hands-on: implementation of a simple audio generator
- If time allows: Verilog introduction
- Implementation of the audio generator using Verilog
More info on this page: http://www.tmplab.org/wiki/index.php/Workshop_Introduction_aux_FPGA.
Contact me if you wish to participate by sending an email to sebastien dot bourdeauducq at gmail. Language will be French unless there is a demand for English.
I have now completed the AC97 controller that will be used to record audio in Milkymist. Although the design is relatively simple, the important features are there: DMA buffers, interrupt-driven mode, full duplex (simultaneous playback and recording) and codec register access. The only missing parts are suspend modes and variable sample rates (only the standard 48 kHz rate is supported).
While this is not one of the hardest parts, it’s good to have it working (and to use the ML401 board to listen to music).
By now, all known bugs are fixed in the image warping engine, and I can already make some visual effects using the FPGA board.
I had to resort to GPL Cver and good old PLI to find the source of the problems (mostly due to a stupid typo in the code of the DMA write engine) since working around Verilator bugs would have been too tedious. Cver is many times slower, but more reliable.
I also moved the CSRs to a dedicated bus, since Wishbone caused some problems:
- Wishbone is made to support variable latency through the “ack” signal, whereas CSRs are, as their name suggests, registers that can usually be accessed in one clock cycle on FPGA architectures. In this case, most of the time this signal causes useless complications and timing problems. The new bus requires all accesses to be made within one cycle and removes the “ack” signal.
- Wishbone has two signals for qualifying a cycle, “cyc” and “stb”, which serve no real purpose when accessing CSRs. There is usually a single master, and the address decoding can be done at the slave. So those signals can be, and have been, removed.
- Wishbone requires a multiplexer in the slave to master data path. By adding the requirement that a slave puts out a zero value when it is not addressed, that multiplexer can be replaced by a potentially distributed OR that can improve timing and make chip layout easier (since the CSR bus is made to connect many peripherals that can cover a large area of the chip, timing is important).
- All CSR bus signals are explicitly shared between the slaves except the slave to master data. This makes the interconnect code much more readable.
- This issue is not specific to Wishbone, but the Wishbone to CSR bus bridge implemented in the Milkymist SoC registers all signals passing between the two buses, in order to improve timing when more devices are added to Wishbone (and that’s what will happen in Milkymist, since the shader and audio DMA still need to be added to this bus).
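The OR-based read path described in the list above can be illustrated with a small software model. This is only a sketch in Python, not the Milkymist Verilog: the `CsrSlave` class, the `csr_read` function, the 4-bit register index and the example register values are all assumptions made up for the illustration.

```python
# Toy model of the CSR bus read path. All names and widths are hypothetical,
# chosen for illustration; this is not the actual Milkymist code.

class CsrSlave:
    """A peripheral owning a block of registers, selected by the upper address bits."""

    def __init__(self, select, registers):
        self.select = select              # upper address bits that select this slave
        self.registers = list(registers)  # register contents, indexed by the lower bits

    def read(self, address, low_bits=4):
        # Address decoding is done in the slave itself (no "cyc"/"stb" needed).
        if address >> low_bits != self.select:
            return 0  # not addressed: drive zero so the shared OR is unaffected
        return self.registers[address & ((1 << low_bits) - 1)]


def csr_read(slaves, address):
    """Combine all slave outputs with a bitwise OR instead of a multiplexer."""
    data = 0
    for slave in slaves:
        data |= slave.read(address)
    return data


uart = CsrSlave(select=0, registers=[0x11, 0x22] + [0] * 14)
timer = CsrSlave(select=1, registers=[0xAB, 0xCD] + [0] * 14)
bus = [uart, timer]

print(hex(csr_read(bus, 0x01)))  # register 1 of the slave selected by 0 -> 0x22
print(hex(csr_read(bus, 0x10)))  # register 0 of the slave selected by 1 -> 0xab
print(hex(csr_read(bus, 0x25)))  # no slave at this address -> 0x0
```

Because every non-addressed slave contributes zero, the OR can be split into a tree spread across the chip, which is the property that makes it easier to route than a central multiplexer.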
To sum up, I have made this diagram of the system architecture, with the current progress.
I eventually finished coding the FML DMA engines for the warper, ran some simulations, integrated the warper into the SoC and synthesized everything. And it even met timing at 100 MHz, after slight modifications.
Now comes the tedious bug-hunting phase. By now, everything is working, except that approximately one time out of 400, the source pixel seems not to be read correctly, resulting in visible black and white dots on the image. It sounds like some corner case with pipeline handshakes and cache status is not handled properly. To make things worse, Verilator is prone to a serious Heisenbug: adding a $display in a clocked “always” block to monitor a suspicious signal sometimes prints a value different from the one written to the registers. Also, registering data with blocking assignments in one always block and then recapturing it with non-blocking assignments in another block sometimes yields incorrect results.
After these issues are sorted out, the next steps are implementing bilinear filtering (optional, only improves image quality) and negative off-screen coordinates (to be able to implement effects like centered zooming of the screen). The last major steps to a fully fledged MilkDrop implementation would then be a fast FPU and the software.
As for performance, the fill rate in the fully running SoC with a VGA output at 640×480@60Hz has dropped to 30 MPixels/s. Analyzing the pipeline handshakes reveals that most of this performance hit is due to the read latency of the memory system. In the future, this could be improved by using a dual-port RAM for the source image cache, so that in the event of a cache miss, the DMA read engine can continue processing the stream of incoming requests while refilling the cache to honor the request that caused the miss. But 30 MPixels/s is still enough to get good visual effects.
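To put the 30 MPixels/s figure in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that one warp pass writes exactly one pixel per screen pixel at 640×480, which is an assumption made for this estimate and not something measured on the board.

```python
# Rough estimate only. The "one full 640x480 frame per warp pass" model
# is an assumption for illustration, not a measured property of the SoC.
WIDTH, HEIGHT, REFRESH_HZ = 640, 480, 60
FILL_RATE = 30e6  # pixels per second, the figure quoted above

pixels_per_frame = WIDTH * HEIGHT                # visible pixels in one frame
scanout_rate = pixels_per_frame * REFRESH_HZ     # visible pixels displayed per second
warps_per_second = FILL_RATE / pixels_per_frame  # full-frame warp passes per second

print(pixels_per_frame)            # 307200
print(scanout_rate / 1e6)          # 18.432
print(round(warps_per_second, 1))  # 97.7
```

Under that assumption, the warper could redraw the whole screen roughly 97 times per second, comfortably above the 60 Hz refresh rate.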