Gateware Design Overview

The digital signal processing logic starts at the ADCs, gets PFB'd (Filter + FFT), then splits into two signal paths:

  1. On-board auto- and cross-correlations, which the ARM core can read over SPI.
  2. Re-quantization of the signal to 1 or 4 bits, followed by selection and packaging of a subset of frequency channels into UDP packets, which are streamed out of the Sparrow's 0th SFP port.
                                     _--> Correlator -> Readable BRAM (SPI)
2x Analog Signal -> ADC -> PFB/FFT _/
                                    \_--> 1bit quant |mux\_--> UDP Packets (1GBE)
                                      --> 4bit quant |   /

The Simulink (Matlab) and CASPER framework abstracts away the nitty gritty plumbing inherent to modern (2025) FPGA programming in regular text-based HDLs. There are good, bad, and ugly things about this.

The good: the user can jump right in and design, simulate, compile, implement, and synthesise a basic design targeted for supported hardware. The simulation aspect is worth emphasising because it's very easy to write a bug into logic design and it's often very hard to catch it without simulation.

The bad: it is impossible to track small diffs between two commits with modern version control tools (git) because the .slx files, being diagrammatic rather than text-based, are big and stored in a compressed binary format.

The ugly: the toolflow is very brittle and annoying to set up. You need a precise Ubuntu distribution, which means you can't just set it up on any Linux machine without reinstalling the operating system and migrating all users to that new OS. Not to mention that you need to buy, install, set up and periodically renew licences for bulky (~300 GB), proprietary Xilinx (AMD) software (Vivado) and Matlab's Simulink software.

Once you have the toolflow running and are ready to assemble your first design, you'll notice that there are many different types of color-coded block.

  • The yellow (CASPER) blocks are IO blocks. They map hardware pins to logic and define user-programmable BRAMs and registers.
  • The green (CASPER) blocks are (vaguely) DSP blocks. They implement logic to manipulate data.
  • The blue Xilinx blocks mix memory access and logic. These are primitive blocks provided by Xilinx, like operators or primitive data types in C.
  • Some white blocks implement simulation-only logic and tools to help develop and simulate the gateware.
  • The glossy grey blocks are user-defined subsystems that abstract away complexity, like a function in C or Python. Some white blocks also abstract user-defined logic, but these ones have 'mask parameters'. The values of mask parameters are set by the user before compilation, just like C macro definitions. You can create a subsystem by highlighting some logic, right-clicking, and selecting 'create subsystem from selection'. You can then turn a subsystem into a block with mask parameters by right-clicking the subsystem then Mask > Create Mask.

ADC interface

The ADC interface block, labelled sparrow_adc, provides the plumbing to link the ADCs' output pins with the design. The sync_adc register provides an interface for the python framework to sync the ADCs. The ADCs are programmed to sample at 250 MSPS but the fabric clock in this design is configured to run at 125 MHz. To accomplish this clock domain crossing, ADC samples are loaded into the fabric two at a time from each pol (analog channel). The four output pins are labelled data_a/b_0/1, where a/b denotes pol0/1 and 0/1 denotes the first and second samples within one fabric-clock period. image
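The two-samples-per-clock scheme can be pictured with a few lines of numpy. This is only a sketch: `split_into_lanes` is a made-up helper name, and it assumes that the earlier of the two samples lands on lane 0.

```python
import numpy as np

ADC_RATE_HZ = 250e6      # ADC sample rate stated above
SAMPLES_PER_CLOCK = 2    # samples loaded per fabric clock, per pol


def split_into_lanes(samples):
    """Split a 1-D stream of ADC samples into two parallel lanes.

    Row i of the result holds the (first, second) samples that arrive
    during fabric clock i. Assumption: the earlier sample goes to lane 0.
    """
    samples = np.asarray(samples)
    assert samples.size % SAMPLES_PER_CLOCK == 0, "need an even number of samples"
    return samples.reshape(-1, SAMPLES_PER_CLOCK)


lanes = split_into_lanes(np.arange(8))
first, second = lanes[:, 0], lanes[:, 1]       # e.g. data_a_0 and data_a_1
fabric_rate_hz = ADC_RATE_HZ / SAMPLES_PER_CLOCK
```

Halving the clock rate while doubling the bus width keeps the throughput per pol unchanged.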

PFB FIR

The Polyphase Filter Bank Finite Impulse Response stage comprises four (number of inputs) sets of a series of four (number of taps) shift registers. Four frames' worth of data from each ADC are windowed (point-wise multiplied by a vector of 'window coefficients'), then stacked and summed vertically. The output of this FIR filter feeds into the FFT. Each frame is 4096 (2^12) samples wide. image
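The window-stack-sum operation described above can be sketched in floating point as follows. The Hamming-weighted sinc in `make_window` is an assumption (the actual window coefficients are not stated here), and both function names are hypothetical.

```python
import numpy as np

NTAPS = 4          # number of taps, i.e. frames stacked per output frame
FRAME = 1 << 12    # samples per frame


def make_window(ntaps=NTAPS, frame=FRAME):
    """Assumed window: a Hamming-weighted sinc spanning ntaps frames."""
    n = ntaps * frame
    t = (np.arange(n) - n / 2 + 0.5) / frame
    return np.sinc(t) * np.hamming(n)


def pfb_fir(samples, window, ntaps=NTAPS, frame=FRAME):
    """Window ntaps consecutive frames and sum them into one frame.

    Produces one output frame per input frame once the taps are full.
    """
    samples = np.asarray(samples, dtype=float)
    nframes = samples.size // frame
    frames = samples[: nframes * frame].reshape(nframes, frame)
    coeffs = window.reshape(ntaps, frame)
    out = np.empty((nframes - ntaps + 1, frame))
    for i in range(out.shape[0]):
        # point-wise multiply each of the ntaps frames, then sum vertically
        out[i] = (frames[i : i + ntaps] * coeffs).sum(axis=0)
    return out
```

Each row of the result is what would be handed to the FFT for one frame.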

The sync register is a user-writeable register which creates a 'sync pulse'. The sync pulse serves to align frames output by the PFB FIR filter with all of the logic downstream of it. This digital signal chain comprises an infinite loop of the same operations on consecutive data-frames, and the sync pulse tells all the logical operators where each frame begins and ends. image

FFT

The FFT block implements a highly optimized Radix-2 decomposition Decimation In Frequency Fast Fourier Transform on parallel inputs. It takes a user-defined shift schedule as input (pfb_fft_shift) and records overflow events in the register named fft_of_count. This FFT implementation takes advantage of the fact that a real-to-complex DFT has half as many outputs as inputs to balance the fact that it's being fed two samples from each ADC simultaneously. It also incorporates a de-scrambler so that the outputs are frequency-ordered complex samples (DIF FFTs naturally scramble their output (frequency domain samples) into bit-reversed order). image
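To make the shift schedule and the de-scrambler concrete, here is a small floating-point model of a radix-2 DIF FFT. It is a sketch, not the CASPER implementation: it works on complex inputs rather than exploiting the real-input trick, and it assumes that bit s of the schedule controls butterfly stage s (the real bit-to-stage mapping may differ).

```python
import numpy as np


def bit_reverse_indices(n):
    """idx[k] is k with its log2(n) bits reversed."""
    bits = n.bit_length() - 1
    return np.array([int(format(k, f"0{bits}b")[::-1], 2) for k in range(n)])


def dif_fft(x, shift_schedule):
    """Radix-2 decimation-in-frequency FFT with an optional per-stage shift.

    A set bit in shift_schedule divides that stage's output by two,
    which models the right-shift used to prevent overflow.
    """
    x = np.asarray(x, dtype=complex)
    n = x.size
    assert n > 1 and n & (n - 1) == 0, "length must be a power of two"
    half, stage = n // 2, 0
    while half >= 1:
        blocks = x.reshape(-1, 2 * half)
        a, b = blocks[:, :half], blocks[:, half:]
        twiddle = np.exp(-2j * np.pi * np.arange(half) / (2 * half))
        # butterfly: sums go to the top half, twiddled differences to the bottom
        x = np.concatenate([a + b, (a - b) * twiddle], axis=1).reshape(-1)
        if (shift_schedule >> stage) & 1:
            x = x / 2
        half //= 2
        stage += 1
    # the stages leave the output bit-reversed; de-scramble it
    return x[bit_reverse_indices(n)]
```

With a schedule of 0 the result matches `np.fft.fft`; with every stage shifting, the output is scaled down by the transform length.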

TODO: inside the FFT

TVG1

The first Test Vector Generator follows immediately downstream of the FFT. TVGs are gateware testing/debugging units that, when deployed in the field, are configured to literally not do anything to the signal. (The two green bus_create/bus_expand blocks don't modify the data whatsoever but simply package the 36/72 parallel bit-lines together/apart.)

image

A TVG enables the developer to mux in values from a user-writeable BRAM instead of the input. This module helps to test and debug gateware both at the simulation stage and in silicon. The user controls whether to let actual data pass by writing to the enable register, and what data to replace it with by writing to the data BRAM.

image

Requantization

After the PFB (and TVG), each FFT'd signal branches three ways:

  • The on-board correlator (skip ahead to the correlator section)
  • 4-bit requantization
  • 1-bit requantization

The latter two paths requantize the same complex digital signal to 1+1-bit and 4+4-bits (in parallel). Each of the real and imaginary components of each sample is quantized to one and four bits.

Data re-ordering and bussifying follow both the one-bit and four-bit requantization stages. The output order is defined by the user so that the frequency channels to be packetized come out first. Both 1-bit and 4-bit data are 'bussified' onto an 8-bit bus. 8 bits is not wide enough for the 4-bit data (4 bits ADC0 real + 4 bits ADC0 imaginary + 4 bits ADC1 real + 4 bits ADC1 imaginary = 16 bits per clock), which is why the re-order stage must come before the bussifying stage. This limits the number of selectable channels to 1024 of 2048.

image

For example, if you were interested in channels 420 and above, you would configure the re-order and transpose stages to output signals illustrated in the timing diagram below. These are the signals you would see if you were to put a scope (or a logic analyzer) right before the MUX that selects whether to transmit 1-bit or 4-bit requantized signals to the next stage. The data lines below represent a byte-wide bus carrying quantized, complex data from both ADC channels, so there are 16 bits per frequency channel in 4-bit mode, and 4 bits per frequency channel in 1-bit mode.

The one bit re-quantization logic is just a bunch of comparators (> or <).

image
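As a software analogue of the one-bit comparators, the sketch below keeps one bit per real and imaginary component. Which comparison maps to which bit value is an assumption here (a positive component gives 1), and `quantize_1bit` is a hypothetical name.

```python
import numpy as np


def quantize_1bit(spectrum):
    """Return (re_bits, im_bits): 1 where the component is > 0, else 0.

    Assumption: the comparator outputs 1 for strictly positive values.
    """
    spectrum = np.asarray(spectrum)
    re_bits = (spectrum.real > 0).astype(np.uint8)
    im_bits = (spectrum.imag > 0).astype(np.uint8)
    return re_bits, im_bits


re_bits, im_bits = quantize_1bit([1 + 1j, -1 - 1j, 0.5 - 2j])
```

Two pols times two components gives the 4 bits per frequency channel quoted for 1-bit mode.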

In parallel we quantize each component (real/imaginary) of each frequency channel to four bits. To exploit the full range of bits exercised by 4-bit quantization we apply digital gain to each frequency channel individually. The gain in each channel is set by the user through the coeffs_pol0/coeffs_pol1 registers. The result is saturated against a floor and ceiling of the 4-bit range (-/+0.875) to prevent wrapping (overflow). The image shows the logic for one of the four quantization signal paths.

image
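The gain-then-saturate step can be modelled roughly as below. This sketch assumes three fractional bits (a step of 0.125, which is what the -/+0.875 limits suggest) and numpy's round-half-to-even; the gateware's rounding mode is not stated here, and `quantize_4bit` is a hypothetical name.

```python
import numpy as np

LSB = 0.125                   # assumed step: 4-bit value with 3 fractional bits
FLOOR, CEIL = -0.875, 0.875   # saturation limits quoted above


def quantize_4bit(component, gain):
    """Apply gain, round to the 4-bit grid, and saturate.

    component: real or imaginary parts of one pol, one value per channel.
    gain: scalar or per-channel array.
    Returns the quantized values and the number of saturated samples.
    """
    scaled = np.asarray(component, dtype=float) * gain
    rounded = np.round(scaled / LSB) * LSB
    clipped = np.clip(rounded, FLOOR, CEIL)
    n_clipped = int(np.count_nonzero(rounded != clipped))
    return clipped, n_clipped


values, n_clipped = quantize_4bit([0.3, 2.0, -5.0, 0.0], gain=1.0)
```

The clip count plays the same role as the four_bit_quant_clip_count register: if it grows quickly, the per-channel gain is probably too high.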

Payload Packetiser

The payload packetiser is a logical subsystem that creates payloads for the UDP packets which are broadcast over ethernet on SFP0. It bundles re-quantized data, either 1-bit or 4-bit quantized, with the spectrum number, which counts the number of FFTs which have been performed so that we can tell 1) whether we have dropped any UDP packets and 2) which UDP packets we have dropped. The logic that generates UDP packet headers and bundles these with our payloads is taken care of by a Xilinx IP. Our payload packetiser subsystem is directly upstream of this UDP packetiser logic and interfaces with it over three busses: data (8 bits wide), valid line (1 bit), end-of-frame line (EOF, 1 bit). Pull the valid line high simultaneously with valid data (data that you want to transmit) on your eight-lane bus, and low when the data on the data bus is trash. A UDP payload buffer fills up with valid data until you pull the end-of-frame line high simultaneously with the last valid data point.

image

Caption: the payload packetizer subsystem is the white block labelled 'packetiser'. It has three input busses and three output busses, but it also takes input from readable and writable registers hidden beneath the subsystem mask.
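The valid/EOF handshake is easy to model in software. The sketch below plays the role of the downstream payload buffer; `collect_payloads` is a hypothetical name and it assumes EOF is only honoured on clocks where valid is also high.

```python
def collect_payloads(data, valid, eof):
    """Group bytes into payloads following the valid/EOF handshake.

    data, valid, eof: equal-length sequences, one entry per clock.
    """
    payloads, buffer = [], bytearray()
    for byte, v, e in zip(data, valid, eof):
        if not v:
            continue                  # bus carries trash this clock
        buffer.append(byte)
        if e:                         # last valid byte: ship the payload
            payloads.append(bytes(buffer))
            buffer = bytearray()
    return payloads


data  = [0xAA, 0x01, 0x02, 0xFF, 0x03, 0x04, 0xEE]
valid = [0,    1,    1,    0,    1,    1,    0]
eof   = [0,    0,    1,    0,    0,    1,    0]
payloads = collect_payloads(data, valid, eof)
```

Only the four bytes flagged as valid survive, split into two payloads by the two EOF pulses.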

For example, let's say we have one-bit data and the only channels we care about are 420 through 427 inclusive. The timing diagram at the input of the payload packetiser, as we saw above in the requantization section, looks something like the following.

The sync pulse is only generated once and the logic is synchronised only once, so every subsequent input will have only the data line with anything of interest on it. The downstream logic knows to expect the pattern to repeat every 2048 clocks. The timing diagram downstream of the payload packetiser looks like any one of the following three diagrams.

User-facing registers tell the packetiser 1) how many FFT frames go into each packet and 2) how many clocks the valid pulse should last on each frame, which is directly related to the number of channels we want to preserve. Let's take a closer look at the logic that accomplishes the generation and pulsing.

image

The sync and reset lines trigger a reset of the FFT-frame (/spectrum) counter, "spectra-counter". This is a 43-bit UFix counter; its most significant 32 bits are sliced and bussified onto a byte-wide bus. These four bytes populate the first four bytes of each UDP packet payload. This spectrum number is written to disk as specified in our data format [need link to data format spec]. The 11th LSB is also sliced out and used to trigger a "new spectra" pulse. This pulse, which announces a new FFT-frame or "spectrum", serves several purposes. It:

  • synchronises the 32-bit spectrum number bussifier (so that the bits come out in sync with the first valid data point)

image

  • increments another counter, "packet_spec_counter", by one which, in turn,
    • triggers the valid line on the first spectrum of the packet to include the spectrum number
    • triggers the end-of-frame line on the last spectrum of the packet to tell the UDP packetizer to bundle and send the buffered payload. The user sets the spectra per packet from python by writing an integer to the (above) yellow "spectra_per_packet" register.

image

  • triggers a pulse-extender that pulls the valid line high for the number of clocks required for each spectrum. The user sets the number of bytes in each spectrum by writing said value into the (above) yellow "bytes_per_spectrum" register. The number of bytes in each spectrum is a function of the number of channels saved as well as the re-quantization depth (4+4 bits vs 1+1 bit per sample).

image
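The counter slicing and the bytes-per-spectrum arithmetic can be written out as follows. This is a sketch with hypothetical helper names; the byte order of the four header bytes (most significant byte first) is an assumption, and the bits-per-channel figures come from the requantization section (16 bits in 4-bit mode, 4 bits in 1-bit mode).

```python
COUNTER_BITS = 43
SPECTRUM_SHIFT = 11   # 43 - 32: the low 11 bits count clocks within a spectrum


def spectrum_number(counter):
    """Top 32 bits of the 43-bit spectra-counter."""
    return (counter & ((1 << COUNTER_BITS) - 1)) >> SPECTRUM_SHIFT


def spectrum_header(counter):
    """Four header bytes. Assumption: most significant byte first."""
    return spectrum_number(counter).to_bytes(4, "big")


def bytes_per_spectrum(n_channels, four_bit):
    """Value for the bytes_per_spectrum register."""
    bits_per_channel = 16 if four_bit else 4   # both pols, re + im
    total_bits = n_channels * bits_per_channel
    assert total_bits % 8 == 0, "channel count must fill whole bytes"
    return total_bits // 8


# channels 420 through 427 inclusive, 1-bit mode
n_bytes = bytes_per_spectrum(8, four_bit=False)
```

The spectrum number ticks over once every 2048 clocks, which matches the repeat period mentioned earlier.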

Once the data is packetized, it's checked into the one_gbe block, and the CASPER framework takes care of the plumbing to pipe this into the correct Xilinx IP that implements UDP packetizing, and routes it to the correct physical SFP port (SFP0). In the image, a helpful user-readable buffer-overflow counter tx_of_cnt keeps track of overflowing packets, and the signal coming out of our user-defined packetizer is also routed, in simulation only, to a virtual oscilloscope.

image

On-board correlator

The signal path branches after the FFT. In the previous section we looked at re-quantization, data selection, and UDP packetizing, here we look at the second branch down-stream of the FFT: the correlator. The correlator computes auto- and cross-correlations of both channelized signals. The power in each pol is computed with a simple accumulator. Real and imaginary components of the cross correlation are similarly calculated. The result is dumped periodically into addressable BRAM registers pol00, pol11, pol01r, pol01i. Mathematically, the correlator computes autocorrelations,

\[P_{00}[k] = \sum_{l=0}^{N-1}|y_0[k,l]|^2,\]

and cross correlations,

\[P_{01}[k] = \sum_{l=0}^{N-1}y_0[k,l] \cdot y_1[k,l]^\ast,\]

where \(y_s[k,l]\) is the \(l\)'th spectrum's \(k\)'th frequency channel's data from pol-\(s\). Real and imaginary terms of the cross-correlations are accumulated in separate BRAMs as all the arithmetic is carried out on real integers.

image
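The same sums can be written in numpy using only real arithmetic, mirroring the four accumulators. This is a floating-point sketch with a hypothetical function name; the gateware works on fixed-point integers.

```python
import numpy as np


def correlate(y0, y1):
    """Accumulate auto- and cross-correlations over N spectra.

    y0, y1: complex arrays of shape (N spectra, K channels).
    Returns (pol00, pol11, pol01r, pol01i), each of length K.
    """
    y0, y1 = np.asarray(y0), np.asarray(y1)
    a, b = y0.real, y0.imag
    c, d = y1.real, y1.imag
    pol00 = (a * a + b * b).sum(axis=0)
    pol11 = (c * c + d * d).sum(axis=0)
    # y0 * conj(y1) = (ac + bd) + i(bc - ad)
    pol01r = (a * c + b * d).sum(axis=0)
    pol01i = (b * c - a * d).sum(axis=0)
    return pol00, pol11, pol01r, pol01i
```

Splitting the cross term into `pol01r` and `pol01i` is what lets every accumulator stay real-valued.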

Correlator accumulator book-keeping

It's important to do some book-keeping to make sure the correlator BRAMs don't overflow. If the signal is U37_36 and the accumulator BRAM is U64_35 then we can only accumulate 2^28 samples (per channel). In seconds, the accumulator BRAM fills up in 2^28*4096/250e6 = 4398 seconds, which is over an hour. We will never want to accumulate more than a few seconds.

However, the calculus changes if we implement FFT bit-growth to avoid doing a full shift schedule. If we grow the data by one bit on every FFT butterfly stage we're eating up 12 bits, which means that it becomes logically possible for the accumulator BRAMs to overflow after only one second, which is unacceptable. This means we either have to grow our BRAMs or do something to reduce the bit depth of these numbers.

The latest version of the firmware implements 8 bits of bit growth; the full twelve are not needed because our 12-bit data is already LSB-padded by four bits in the stack-and-sum stage of the PFB (upstream of the FFT). After eight bits of growth in the FFT the data is 24 bits wide; four LSBs are sliced off in the correlator branch so that it fits snugly in the correlator's accumulators.
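The headroom numbers in this section can be reproduced with a few lines of arithmetic. The sketch assumes that U37_36 means 37 bits with the binary point at bit 36 (so one integer bit), and it follows the accounting used above, in which each bit of growth costs one bit of headroom. The function names are hypothetical.

```python
FRAME = 4096          # samples per spectrum
ADC_RATE_HZ = 250e6   # ADC sample rate


def integer_bits(total_bits, binary_point):
    """Bits to the left of the binary point."""
    return total_bits - binary_point


def max_accumulation(signal_int_bits, acc_int_bits, growth_bits=0):
    """Return (spectra, seconds) that fit before the accumulator can overflow."""
    headroom = acc_int_bits - signal_int_bits - growth_bits
    n_spectra = 1 << headroom
    return n_spectra, n_spectra * FRAME / ADC_RATE_HZ


sig = integer_bits(37, 36)    # U37_36 -> 1 integer bit
acc = integer_bits(64, 35)    # U64_35 -> 29 integer bits
n_full_shift, t_full_shift = max_accumulation(sig, acc)
n_growth, t_growth = max_accumulation(sig, acc, growth_bits=12)
```

With no growth this gives 2^28 spectra (about 4398 s); with 12 bits of growth it drops to 2^16 spectra, a little over one second.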

User read/writeable registers

We make use of all named addressable registers in this design so it's worth knowing what each of them does. (The left-pointing pentagonal tags mean goto and are paired with right-pointing ones of the same name.)

  • gbe_en GigaBit Ethernet interface ENable. Single bit, 0 for disable, 1 for enable.
  • gbe_rst GigaBit Ethernet interface ReSeT. Single bit. Resets on transition from 0 to 1.
  • pack_rst PACKetizer ReSeT. Single bit to reset packetizer logic, including spectrum counter, on transition from 0 to 1.
  • cnt_rst accumulator CoNTroller ReSeT. Single bit to reset accumulator control logic.
  • acc_len ACCumulation LENgth. UFix32 sets the number of spectra to accumulate for each correlation. (How long to integrate data, if you prefer to think that way.) This number times 4096/250e6 gives the accumulation time in seconds.
  • dest_ip Sets the DESTination IP address.
  • dest_prt Sets the DESTination PoRT.
  • sync_cnt (read only) does exactly nothing because I removed the logic that periodically syncs all the logic. Instead it just syncs stuff once on initialization.
  • acc_cnt (read only) ACCumulation CouNTer. Counts the number of spectra that have been accumulated.

image
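Converting between acc_len and integration time is a one-liner each way. The helper names below are hypothetical, and the result still has to be written to the register by whatever FPGA interface you use.

```python
FRAME = 4096          # samples per spectrum
ADC_RATE_HZ = 250e6   # ADC sample rate


def accumulation_time(acc_len):
    """Integration time in seconds for a given acc_len."""
    return acc_len * FRAME / ADC_RATE_HZ


def acc_len_for(seconds):
    """Nearest acc_len for a requested integration time."""
    acc_len = round(seconds * ADC_RATE_HZ / FRAME)
    assert 0 < acc_len < (1 << 32), "acc_len must fit in a UFix32"
    return acc_len


one_second = acc_len_for(1.0)
```

Because acc_len is an integer number of spectra, the achieved time is quantised in steps of 4096/250e6 s, roughly 16.4 microseconds.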

Other user read/writeable registers are scattered throughout the design.

sync_adc SYNChronize ADC logic. Single bit pulse active on transition from 0 to 1.

image

sync Creates the SYNChronizing pulse that aligns each stage of the DSP chain. Single bit pulse, active on transition from 0 to 1.

image

pfb_fft_shift UFix12 determines the shift schedule. Each bit represents 0 for no shift, 1 for shift. Currently we have a full shift schedule. We may want to implement a bit-growth FFT so that we can have our cake (low noise floor from not shifting) and eat it too (no FFT overflows).

image

fft_of_count (read only) FFT OverFlow COUNTer. Every time there's an overflow event in a frame +1 is added to this UFix32 register.

image

sel SELects the requantization bit mode: 0 for 1bit, 1 for 4bits.

image

tvg1_enable Enables the TVG right after the FFT stage. Write 0 to pass actual data, 1 to pass values read sequentially from the BRAM labelled data.

image

tvg16bit_enable Enables the TVG right after the 4bit requantization stage. Write 0 to pass actual data, 1 to pass values read sequentially from the BRAM labelled data.

image

four_bit_quant_clip_count is a readable register that counts the number of four-bit clipping events. It increments by one when one of the four re/im pol0/pol1 values saturates the four-bit re-quantizer.

image

spectra_per_packet specifies the number of spectra to include in each UDP packet. We're not using jumbo frames so we limit the spectra per packet to a number small enough for regular ethernet UDP frames. This register is written to at the tuning stage. The register holds an unsigned 32-bit integer, but only the five least significant bits are used.

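MTU = 1500  # max number of bytes in an IP packet (no jumbo frames)
HEADERS = 20 + 8  # IPv4 header (no options) + UDP header, both count against the MTU
SPEC_NUM_BYTES = 4  # spectrum number at the start of each payload
assert spectra_per_packet < (1 << 5), "spec-per-pack too large for slice, aborting"
assert SPEC_NUM_BYTES + spectra_per_packet * bytes_per_spectrum <= MTU - HEADERS, "Packets too large, will cause fragmentation"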

image
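Going the other way, the largest usable spectra_per_packet for a given bytes_per_spectrum can be computed as below. The sketch assumes a 20-byte IPv4 header without options, the 8-byte UDP header, and the 4-byte spectrum number at the start of the payload; the function name is hypothetical.

```python
MTU = 1500                   # bytes per IP packet without jumbo frames
IPV4_HEADER = 20             # assumes no IP options
UDP_HEADER = 8
SPECTRUM_NUMBER_BYTES = 4    # leading bytes of each payload
MAX_FIELD = (1 << 5) - 1     # only five bits of the register are used


def max_spectra_per_packet(bytes_per_spectrum):
    """Largest spectra_per_packet that avoids IP fragmentation."""
    room = MTU - IPV4_HEADER - UDP_HEADER - SPECTRUM_NUMBER_BYTES
    return min(room // bytes_per_spectrum, MAX_FIELD)


small = max_spectra_per_packet(4)     # limited by the 5-bit field
large = max_spectra_per_packet(128)   # limited by the MTU
```

A result of 0 means a single spectrum at that size does not fit in a regular frame, so fewer channels would have to be selected.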

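bytes_per_spectrum specifies the number of bytes in each spectrum, which sets how many clocks the valid line stays high per spectrum. It is a function of the number of channels saved and the re-quantization depth (4+4 bits vs 1+1 bit per sample).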

image

tx_of_cnt (read only) Buffer-overflow counter attached to the one_gbe block; keeps track of overflowing packets.

image

User BRAM interfaces

tvg1_data TODO

image

tvg16bit_data TODO

Examples of this TVG's use can be found here.

View example use
"""Example of writing to TVG"""

import numpy as np  # needed by the packing helpers below
#
#fpga.read("tvg16bit_data", 16, offset=0)
#
## 32 bit registers but 16bit LSB are sliced (i.e. 16 MSB are sliced off)
#tvgbytes=np.ndarray.tobytes(np.array([0x0,0x0,0x05,0xaf]*(1<<11),dtype='>i1'))
#
#fpga.write("tvg16bit_data", tvgbytes, offset=0)

#def pack_into_4bit_tvg(pol0r:bytes, pol0i:bytes, pol1r:bytes, pol1i:bytes):
#    """Pack complex arrays into TVG right after 4bit requantizer"""
#    for p in (pol0r,pol0i,pol1r,pol1i):assert len(p)==(1<<10)
#    zeros=b'\x00'*(1<<10)

###################

TVG_FFT_SBRAM_SHAPE = (1<<11,)

def pack_into_64(pol0r, pol0i, pol1r, pol1i):
    """Pack complex arrays into the TVG right after FFT block.

    Numpy doesn't know about fix_16_15's, so we'll just use int16
    and pretend that they are the fractional part, i.e. that our int16s
    represent that number divided by (1<<15)=32_768.

    :param np.ndarray(int16) pol0r:
        Real component of FFT'd pol0
    :param np.ndarray(int16) pol0i:
        Imaginary component of FFT'd pol0
    :param np.ndarray(int16) pol1r:
        Real component of FFT'd pol1
    :param np.ndarray(int16) pol1i:
        Imaginary component of FFT'd pol1

    :returns: Bytearray, ready for writing to sbram.
    """
    # TODO: figure out endianness...
    for pol in (pol0r, pol0i, pol1r, pol1i):
        assert pol.shape==TVG_FFT_SBRAM_SHAPE, "Pol shape must be compatible with sbram shape"
    out = np.array([pol0r, pol0i, pol1r, pol1i], dtype='>i2').T.flatten()
    return out.tobytes()


def pack_tvg_fft_ramp_pol0r(fpga):
    """Write to the TVG data register"""
    pol0r = np.arange(1<<11, dtype='>i2')
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_ramp_pol0i(fpga):
    pol0r = np.zeros(1<<11, dtype='>i2')
    pol0i = np.arange(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_ramp_pol1r(fpga):
    pol0r = np.zeros(1<<11, dtype='>i2')
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.arange(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_ramp_pol1i(fpga):
    pol0r = np.zeros(1<<11, dtype='>i2')
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.arange(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_const_pol0r(fpga):
    pol0r = np.ones(1<<11, dtype='>i2')
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_const_pattern2(fpga):
    pol0r = np.ones(1<<11, dtype='>i2')
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.ones(1<<11, dtype='>i2') * 2
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_const_pattern3(fpga):
    pol0r = np.zeros(1<<11, dtype='>i2')
    pol0r[np.array((400,401,403,406,410,415,421,428))] = 1
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_pol0r(fpga,pol0r):
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1r = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def pack_tvg_fft_pol0r_pol1r(fpga,pol0r,pol1r):
    pol0i = np.zeros(1<<11, dtype='>i2')
    pol1i = np.zeros(1<<11, dtype='>i2')
    fpga.write("tvg1_data", pack_into_64(pol0r, pol0i, pol1r, pol1i), offset=0)
    return

def zeros_tag(chan_idx):
    arr=np.zeros(1<<11, dtype='>i2')
    arr[chan_idx] = 1<<14
    arr=np.array(arr,dtype='>i2')
    return arr

def negrailed_tag(chan_idx):
    arr=np.array(np.ones(1<<11) * (-1<<14),dtype='>i2')
    arr[chan_idx] = 1<<14
    arr=np.array(arr,dtype='>i2')
    return arr

if __name__=="__main__":
    # tests
    pass

#pols=[np.concatenate([[1]*44,[-1],[1]*((1<<11)-45)]),np.concatenate([[1]*45,[-1],[1]*((1<<11)-46)]),np.ones(1<<11),np.ones(1<<11)];fpga.write("tvg1_data",pack_into_64(pols[0],pols[1],pols[2],pols[3]),offset=0)


image

four_bit_quant_coeffs_pol0/1 TODO

image

image

one_bit_reorder_map1 TODO

image

four_bit_reorder_map1 TODO

image

pol00, pol11, pol01r, pol01i TODO

image