4. A. Write a code for Vivado HLS to be able to synthesize a n-tap FIR filter (2


Question

A. Write code for Vivado HLS that can be synthesized into an n-tap FIR filter. (20 points)

B. Apply any two optimizations for latency reduction using directives or adding pragmas to this code (such as pipelining, unrolling, array partitioning, array reshaping).

Clearly show these directives or pragmas. Your additional directives should reduce latency without affecting other constraints; in some cases several pragmas may be needed together to actually reduce latency.

Explain the impact of your optimization on both latency and resource utilization. (20 points)

Explanation / Answer

Vivado HLS:

Vivado Design Suite is a software suite produced by Xilinx for synthesis and analysis of HDL designs.

High-level synthesis (HLS), sometimes referred to as C synthesis, electronic system-level (ESL) synthesis, algorithmic synthesis, or behavioral synthesis, is an automated design process that interprets an algorithmic description of a desired behavior and creates digital hardware that implements that behavior.

FIR filter:

In signal processing, a finite impulse response (FIR) filter is a filter whose impulse response (or response to any finite length input) is of finite duration, because it settles to zero in finite time.

FIR filters are one of two primary types of digital filters used in Digital Signal Processing (DSP) applications, the other type being IIR.

“FIR” means “Finite Impulse Response.” If you put in an impulse, that is, a single “1” sample followed by many “0” samples, zeroes will come out after the “1” sample has made its way through the delay line of the filter.
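The impulse behaviour described here can be checked with a small direct-form filter. The following is only a sketch in plain C: the name `fir_direct` and the tap count `NUM_TAPS` are assumptions made for illustration and are not part of the original answer.

```c
#include <stddef.h>

#define NUM_TAPS 4  /* hypothetical tap count, chosen only for illustration */

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k],
 * treating samples before the start of x as zero. */
void fir_direct(const int h[NUM_TAPS], const int *x, int *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        int acc = 0;
        /* k <= n keeps the index n - k inside the input array */
        for (size_t k = 0; k < NUM_TAPS && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

Feeding this function a single 1 followed by zeros should reproduce the coefficients in order and then produce only zeros once the 1 has left the delay line.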

In the common case, the impulse response is finite because there is no feedback in the FIR. A lack of feedback guarantees that the impulse response will be finite. Therefore, the term “finite impulse response” is nearly synonymous with “no feedback”.

However, if feedback is employed yet the impulse response is finite, the filter still is a FIR. An example is the moving average filter, in which the Nth prior sample is subtracted (fed back) each time a new sample comes in. This filter has a finite impulse response even though it uses feedback: after N samples of an impulse, the output will always be zero.
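As a sketch of the moving average case, the recursive form below keeps a running sum and subtracts the Nth prior sample on each step. The struct `moving_sum`, the function `moving_sum_step`, and the window length `MA_LEN` are hypothetical names, and the division by N is left out so the arithmetic stays in integers.

```c
#define MA_LEN 4  /* hypothetical window length N */

typedef struct {
    int history[MA_LEN]; /* the last MA_LEN inputs, stored circularly */
    int pos;             /* index of the oldest stored sample */
    int sum;             /* running sum, i.e. the fed-back state */
} moving_sum;

/* Recursive form: sum[n] = sum[n-1] + x[n] - x[n-N]. */
int moving_sum_step(moving_sum *s, int x)
{
    s->sum += x - s->history[s->pos]; /* add newest, subtract the Nth prior */
    s->history[s->pos] = x;           /* newest sample replaces the oldest */
    s->pos = (s->pos + 1) % MA_LEN;
    return s->sum;
}
```

Starting from a zero-initialized state (`moving_sum s = {{0}, 0, 0};`), an impulse gives an output of 1 for N steps and 0 afterwards, so the response is finite even though the previous output is reused.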

Some people say the letters F-I-R; other people pronounce it as if it were a type of tree. We prefer the tree. (The difference is whether you talk about an F-I-R filter or a FIR filter.)

Describing FIR filters:

Impulse Response – The “impulse response” of a FIR filter is actually just the set of FIR coefficients. (If you put an “impulse” into a FIR filter which consists of a “1” sample followed by many “0” samples, the output of the filter will be the set of coefficients, as the 1 sample moves past each coefficient in turn to form the output.)

Tap – A FIR “tap” is simply a coefficient/delay pair. The number of FIR taps, (often designated as “N”) is an indication of 1) the amount of memory required to implement the filter, 2) the number of calculations required, and 3) the amount of “filtering” the filter can do; in effect, more taps means more stopband attenuation, less ripple, narrower filters, etc.

Multiply-Accumulate (MAC) – In a FIR context, a “MAC” is the operation of multiplying a coefficient by the corresponding delayed data sample and accumulating the result. FIRs usually require one MAC per tap. Most DSP microprocessors implement the MAC operation in a single instruction cycle.

Transition Band – The band of frequencies between passband and stopband edges. The narrower the transition band, the more taps are required to implement the filter. (A “small” transition band results in a “sharp” filter.)

Delay Line – The set of memory elements that implement the “Z^-1” delay elements of the FIR calculation.

Circular Buffer – A special buffer which is “circular” because incrementing at the end causes it to wrap around to the beginning, or because decrementing from the beginning causes it to wrap around to the end. Circular buffers are often provided by DSP microprocessors to implement the “movement” of the samples through the FIR delay-line without having to literally move the data in memory. When a new sample is added to the buffer, it automatically replaces the oldest one.
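The delay line, circular buffer, and MAC terms above can be tied together in one small software sketch. The names `fir_circ`, `fir_circ_step`, and `CB_TAPS` are assumptions for illustration; a DSP processor would normally do the index wrapping in hardware.

```c
#define CB_TAPS 4  /* hypothetical tap count */

typedef struct {
    int samples[CB_TAPS]; /* delay-line storage */
    int head;             /* slot that receives the next sample */
} fir_circ;

/* Process one input sample and return one output sample. */
int fir_circ_step(fir_circ *f, const int coeff[CB_TAPS], int x)
{
    int acc = 0;
    int idx;

    f->samples[f->head] = x; /* the new sample overwrites the oldest one */
    idx = f->head;           /* the newest sample pairs with coeff[0] */
    for (int k = 0; k < CB_TAPS; k++) {
        acc += coeff[k] * f->samples[idx];        /* one MAC per tap */
        idx = (idx == 0) ? CB_TAPS - 1 : idx - 1; /* wrap when decrementing */
    }
    f->head = (f->head + 1) % CB_TAPS;            /* wrap when incrementing */
    return acc;
}
```

No sample is ever copied from one slot to another; only the `head` index moves, which is the point of using a circular buffer for the delay line.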

Original, non-optimized version of the FIR filter:

#define SIZE 128
#define N 10

void fir(int input[SIZE], int output[SIZE]) {
    // FIR coefficients
    int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};

    // Direct translation of the FIR sum:
    // output[i] = sum over j of coeff[j] * input[i - j]
    for (int i = 0; i < SIZE; i++) {
        int acc = 0;
        for (int j = 0; j < N; j++) {
            // Skip taps that would read samples before the start of the input
            if (i - j >= 0)
                acc += coeff[j] * input[i - j];
        }
        output[i] = acc;
    }
}
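For part B, one possible way to apply two latency optimizations to the non-optimized filter is sketched below: pipelining the sample loop and completely partitioning the small arrays. This is an illustration, not the original answer's solution. The function name `fir_opt` and the `shift_reg` array are assumptions added here, and the actual latency and resource figures depend on the target device and clock, so they should be read from the synthesis report.

```c
#define SIZE 128
#define N 10

void fir_opt(int input[SIZE], int output[SIZE])
{
    /* Same coefficients as the non-optimized version. */
    const int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};
    /* Hypothetical shift register holding the last N input samples, so
     * that each iteration reads input[] only once. */
    int shift_reg[N];
/* Optimization 1: split both small arrays into individual registers so
 * that all N elements can be accessed in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
#pragma HLS ARRAY_PARTITION variable=coeff complete dim=1

    INIT: for (int j = 0; j < N; j++) {
#pragma HLS UNROLL
        shift_reg[j] = 0;
    }

    SAMPLE: for (int i = 0; i < SIZE; i++) {
/* Optimization 2: start a new sample every cycle; pipelining this loop
 * also unrolls the TAP loop nested inside it. */
#pragma HLS PIPELINE II=1
        int acc = 0;
        TAP: for (int j = N - 1; j > 0; j--) {
            shift_reg[j] = shift_reg[j - 1]; /* shift the delay line */
            acc += coeff[j] * shift_reg[j];  /* multiply-accumulate */
        }
        shift_reg[0] = input[i];
        acc += coeff[0] * shift_reg[0];
        output[i] = acc;
    }
}
```

Expected impact, stated qualitatively: without directives the tool typically runs the N multiply-accumulates of each sample one after another, so latency grows roughly with SIZE times N. With the pipelined loop at II=1 the latency approaches SIZE cycles plus the pipeline depth. The cost is area: the unrolled tap loop needs up to N multipliers working in parallel (usually DSP slices), and the partitioned arrays are built from flip-flops instead of block RAM. The partitioning is what makes II=1 reachable, because a single memory could not supply N reads per cycle.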

CODE:

#include <math.h>
#include <stdint.h>
#include "xilly_debug.h"

extern float sinf(float);

int mycalc(int a, float *x2) {
    *x2 = sinf(*x2);
    return a + 1;
}

void xillybus_wrapper(int *in, int *out) {
#pragma AP interface ap_fifo port=in
#pragma AP interface ap_fifo port=out
#pragma AP interface ap_ctrl_none port=return

    uint32_t x1, tmp, y1;
    float x2, y2;

    xilly_puts("Hello, world\n");

    // Handle input data
    x1 = *in++;
    tmp = *in++;
    x2 = *((float *) &tmp); // Convert uint32_t to float

    // Debug output
    xilly_puts("x1=");
    xilly_decprint(x1, 1);
    xilly_puts("\n");

    // Run the calculations
    y1 = mycalc(x1, &x2);
    y2 = x2; // This helps HLS in the conversion below

    // Handle output data
    tmp = *((uint32_t *) &y2); // Convert float to uint32_t
    *out++ = y1;
    *out++ = tmp;
}

Host program:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

int main(int argc, char *argv[]) {
    int fdr, fdw;

    struct {
        uint32_t v1;
        float v2;
    } tologic, fromlogic;

    fdr = open("/dev/xillybus_read_32", O_RDONLY);
    fdw = open("/dev/xillybus_write_32", O_WRONLY);

    if ((fdr < 0) || (fdw < 0)) {
        perror("Failed to open Xillybus device file(s)");
        exit(1);
    }

    tologic.v1 = 123;
    tologic.v2 = 0.78539816; // ~ pi/4

    // Not checking return values of write() and read(). This must be done
    // in a real-life program to ensure reliability.
    write(fdw, (void *) &tologic, sizeof(tologic));
    read(fdr, (void *) &fromlogic, sizeof(fromlogic));

    printf("FPGA said: %d + 1 = %d and also "
           "sin(%f) = %f\n",
           tologic.v1, fromlogic.v1,
           tologic.v2, fromlogic.v2);

    close(fdr);
    close(fdw);

    return 0;
}

program’s listing follows.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

#define N 1000

struct packet {
    uint32_t v1;
    float v2;
};

int main(int argc, char *argv[]) {
    int fdr, fdw, rc, donebytes;
    char *buf;
    pid_t pid;
    struct packet *tologic, *fromlogic;
    int i;
    float a, da;

    fdr = open("/dev/xillybus_read_32", O_RDONLY);
    fdw = open("/dev/xillybus_write_32", O_WRONLY);

    if ((fdr < 0) || (fdw < 0)) {
        perror("Failed to open Xillybus device file(s)");
        exit(1);
    }

    pid = fork();

    if (pid < 0) {
        perror("Failed to fork()");
        exit(1);
    }

    if (pid) {
        // Parent process: writes the packets to the FPGA
        close(fdr);

        tologic = malloc(sizeof(struct packet) * N);
        if (!tologic) {
            fprintf(stderr, "Failed to allocate memory\n");
            exit(1);
        }

        // Fill array of structures with just some numbers
        da = 6.283185 / ((float) N);

        for (i=0, a=0.0; i<N; i++, a+=da) {
            tologic[i].v1 = i;
            tologic[i].v2 = a;
        }

        buf = (char *) tologic;
        donebytes = 0;

        // write() may accept fewer bytes than requested, so loop until done
        while (donebytes < sizeof(struct packet) * N) {
            rc = write(fdw, buf + donebytes, sizeof(struct packet) * N - donebytes);

            if ((rc < 0) && (errno == EINTR))
                continue;

            if (rc <= 0) {
                perror("write() failed");
                exit(1);
            }

            donebytes += rc;
        }

        sleep(1); // Let debug output drain (if used)

        close(fdw);
        return 0;
    } else {
        // Child process: reads the results back from the FPGA
        close(fdw);

        fromlogic = malloc(sizeof(struct packet) * N);
        if (!fromlogic) {
            fprintf(stderr, "Failed to allocate memory\n");
            exit(1);
        }

        buf = (char *) fromlogic;
        donebytes = 0;

        // read() may return fewer bytes than requested, so loop until done
        while (donebytes < sizeof(struct packet) * N) {
            rc = read(fdr, buf + donebytes, sizeof(struct packet) * N - donebytes);

            if ((rc < 0) && (errno == EINTR))
                continue;

            if (rc < 0) {
                perror("read() failed");
                exit(1);
            }

            if (rc == 0) {
                fprintf(stderr, "Reached read EOF!? Should never happen.\n");
                exit(0);
            }

            donebytes += rc;
        }

        for (i=0; i<N; i++)
            printf("%d: %f\n", fromlogic[i].v1, fromlogic[i].v2);

        sleep(1); // Let debug output drain (if used)

        close(fdr);
        return 0;
    }
}

#include "ap_int.h"
#include "ap_fixed.h" // declares ap_fixed, used for fir_t and intsum

#define FIR_LENGTH 19

typedef ap_fixed<12,2> fir_t;

void fir(volatile ap_uint<10> *x, volatile ap_uint<10> *y)
{
#pragma HLS INTERFACE ap_fifo depth=19 port=x
#pragma HLS INTERFACE ap_fifo depth=19 port=y

    fir_t b_k[FIR_LENGTH] = {-0.000859, -0.000229, -0.001539,
        -0.015266, -0.012840, 0.060680, 0.150479, 0.087929, -0.138514,
        0.727273, -0.138514, 0.087929, 0.150479, 0.060680, -0.012840,
        -0.015266, -0.001539, -0.000229, -0.000859};
    fir_t sum;
    ap_fixed<17,7> intsum;
    ap_int<12> temp;
    int i;
    fir_t x_store[FIR_LENGTH];

    while (1) {
#pragma HLS PIPELINE II=1
        // Update input data
        LOOP_UPDATE: for (i = 0; i < FIR_LENGTH-1; i++) {
            x_store[i] = x_store[i+1];
        }

        // Zero-shift from ADC
        temp = (ap_int<12>)*(x++) - (ap_int<12>)512;

        // Map to ap_fixed variable
        x_store[FIR_LENGTH-1](11,0) = temp(11,0);

        intsum = 0;
        LOOP_FIR: for (i = 0; i < FIR_LENGTH; i++) {
            // Multiply & add
            intsum += b_k[i] * x_store[(FIR_LENGTH-1)-i];
        }

        // Map to ap_uint variable
        temp = fir_t(intsum).range(11,0);

        // Zero-shift for DAC
        *(y++) = (ap_uint<10>)(temp + (ap_int<12>)512);
    }
}
