Fundamentals of High-Level Synthesis 2024 Lecture Notes
Document Details
Tampere University
2024
COMP.CE.320
Summary
These are lecture notes from a 2024 course on high-level synthesis (HLS). The lecture provides foundational knowledge in the core topics of HLS.
Full Transcript
COMP.CE.320 High-level Synthesis 2024
Lectures 5 & 6: Fundamentals of High-level Synthesis
2/12/2024

Contents
- Introduction
- The top-level design module
- High-level C++ synthesis
- Loops
- Pipeline feedback
- Conditions

Introduction
- Style matters: when designing for HLS, remember that you are describing hardware
  - A poor description leads to a sub-optimal RTL implementation
  - Hardware is inherently parallel, not sequential
  - The user must consider the underlying memory architecture and adhere to the recommended coding style
- In this lecture, we cover the basics of HLS and examine how coding style affects the results
  - Special emphasis on handling loops

The top-level design module
- HLS requires that users specify where the "top" of their design is
  - Where the design interfaces with the outside world
  - Contains port definitions, directions, and bit widths

D-type flip-flop in Verilog:

    module top(clk, arst, din, dout);
      input clk;
      input arst;
      input [31:0] din;
      output [31:0] dout;
      reg [31:0] dout;

      // Register with asynchronous, active-high reset
      always @(posedge clk or posedge arst)
      begin
        if (arst == 1'b1)
          dout <= 32'b0;
        else
          dout <= din;
      end
    endmodule

- Defines inputs, outputs, and their widths explicitly

D-type flip-flop in C++:

    #pragma hls_design_top
    void top(int din, int& dout){
        dout = din;
    }

- Also defines inputs, outputs, and their widths explicitly
- The pragma tells the HLS tool that this is the top-level module
- The output is registered by default (in Catapult)
- No clock, reset, or enable in the source
  - These are added by the tool
  - Polarity, type of reset, etc.
    can be configured
- Port widths are implied by the data type (32-bit "int" in this case)
  - Usually, use bit-accurate data types

The top-level design module
D-type flip-flop in C++:

    #pragma hls_design_top
    void top(int din, int& dout){
        dout = din;
    }

- An input port is inferred when a variable is only read ("din")
  - Pass-by-value variables can only be inputs
- An output port is inferred when a variable is written ("dout") or if the top-level function returns a value (none here)
  - Must be a reference or pointer
- An inout port is inferred if an interface variable is both written and read in the same design (none here)
  - Must be a reference or pointer

High-level C++ Synthesis
- Let's examine how un-timed algorithmic C++ is turned into hardware (the crux of HLS)
- Example: four-value accumulation

    #include "accum.h"
    void accumulate(int a, int b, int c, int d, int &dout){
        int t1, t2;
        t1 = a + b;
        t2 = t1 + c;
        dout = t2 + d;
    }

High-level C++ Synthesis – DFG
- HLS starts by analyzing the data dependencies in the algorithm (the accumulate example above) and building a data flow graph (DFG)
  - Nodes represent operations in the C++ code
  - Arrows represent data dependencies between operations
  - E.g. t1 must be computed before t2
- The compiler and the HLS tool may optimize things first
  - E.g.
    this might be transformed to dout = (a+b)+(c+d) (adder tree)

High-level C++ Synthesis – Resource Allocation
- Resource allocation: after the DFG is built, each operation is mapped onto a hardware resource
  - A resource corresponds to a physical implementation of the operator in HW
  - Resources are selected from a technology-specific library
  - Each implementation is annotated with timing and area information
  - An operator may have multiple hardware resource implementations, each with different area/delay/latency trade-offs
- Resource allocation is the "Resources" step in the Catapult GUI
  - Resources are chosen automatically or selected by the designer

High-level C++ Synthesis – Scheduling
- Scheduling: "time" is added to the design during the scheduling step
  - Scheduling decides when (in which clock cycle) the operations in the DFG are performed
  - Registers may be added between operations based on the target clock frequency (pipelining)
- Example continued:
  - Assume an "add" operation takes 3 ns of a 5 ns clock cycle
  - Each add operation is scheduled in its own clock cycle, C1, C2, or C3 (two chained additions would need 6 ns, which does not fit in one 5 ns cycle)
- A data path state machine is created to control the scheduled design
  - The four states correspond to the four clock cycles in the schedule
  - An output is produced every four clock cycles
  - In Catapult, these states are also referred to as control steps or c-steps

High-level C++ Synthesis – Final RTL
- The resulting hardware depends on how the design was constrained in the HLS tool
  - E.g.
    the hardware diagram shown is implemented with maximum sharing, so only one adder is implemented
- Notice the 32-bit wide datapath, since the variables are of type "int"

High-level C++ Synthesis – Loop pipelining
- Loop pipelining
  - An essential concept in HLS
  - Allows a new iteration of a loop to be started before the current iteration has finished
  - Similar in concept to classic RISC processor pipelining
- RISC pipelining contains five stages: instruction fetch, instruction decode, execute, memory access, and register write back
  - A new instruction can be fetched each clock cycle while the other stages are active
  - The time it takes for all pipeline stages to become active is known as the pipeline "ramp up"
  - The pipeline "ramp down" is the time it takes for all pipeline stages to become inactive
- Loop pipelining in HLS
  - In HLS, the top-level function call has an implied loop, called the main loop
  - Each iteration of the main loop corresponds to a complete execution of the schedule
  - The main loop runs continuously as long as the clock is supplied
  - Loop pipelining allows the execution of iterations to be overlapped, increasing performance
- Loop pipelining terms
  - Initiation interval (II): the number of clock cycles taken before starting the next loop iteration
    - Set on the desired loop as a design constraint in the HLS tool
  - Latency: the time in clock cycles from the first input to the first output
  - Throughput: how often, in clock cycles, a function call can complete
    - Sometimes throughput means samples/second or similar, but we use a slightly different definition here
- Example continued: no loop pipelining
  - Latency = 3
  - Throughput = 4
  - Only a single adder is required
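As a rough illustration of the terms above, the following sketch models when successive main-loop iterations start and estimates a lower bound on the number of adders for the three-addition accumulate example. It is not taken from the lecture: `start_cycle` and `min_adders` are hypothetical helper names, and the estimate assumes that each adder can perform one addition per clock cycle. A real HLS tool decides sharing from its own scheduling and constraints, so treat the numbers as a back-of-the-envelope check only.

```cpp
// Back-of-the-envelope model, not an HLS tool API.
// "interval" is the number of clock cycles between the starts of two
// consecutive main-loop iterations (4 for the un-pipelined four-state schedule).

// Clock cycle in which iteration k (0-based) starts.
constexpr int start_cycle(int k, int interval) {
    return k * interval;
}

// Assumed lower bound on the number of adders: the additions of one iteration
// must be issued within `interval` cycles, one addition per adder per cycle.
constexpr int min_adders(int additions, int interval) {
    return (additions + interval - 1) / interval;  // ceil(additions / interval)
}

// The accumulate example performs three additions per iteration:
// t1 = a + b; t2 = t1 + c; dout = t2 + d;
constexpr int kAdditions = 3;

// Un-pipelined schedule: iterations start at cycles 0, 4, 8, ...
static_assert(start_cycle(1, 4) == 4, "second iteration starts at cycle 4");
static_assert(start_cycle(2, 4) == 8, "third iteration starts at cycle 8");

// With a new iteration only every four cycles, one shared adder is enough.
static_assert(min_adders(kAdditions, 4) == 1, "one adder can be shared");
```

The `static_assert` lines are checked at compile time, so compiling the file is enough to confirm that the model agrees with the un-pipelined figures above (a new iteration every four cycles, a single shared adder).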
High-level C++ Synthesis – Loop pipelining
- Example continued: loop pipelined with II = 3
  - Latency = 3
  - Throughput = 3
  - Only a single adder is required
- Example continued: loop pipelined with II = 2
  - Latency = 3
  - Throughput = 2
  - Two adders are required
- Example continued: loop pipelined with II = 1
  - Latency = 3
  - Throughput = 1
  - Three adders are required
  - Best performance, worst area

Loops
- Definitions about loops:
  - Loop iterations: the number of times the loop runs before it exits
  - Loop iterator: the variable used to compute the loop iteration
  - Loop body: the code between the start and the end of the loop
  - Loop unrolling: the number of times to copy the loop body
  - Loop pipelining: how often to start the next iteration of the loop (overlapping iterations)
- "for" loop:

    LABEL: for (initialization; test-condition; increment) {
        statement-list of the loop body;
    }

- Prefer for-loops
- Example:

    #include "simple_for_loop.h"
    void simple_for_loop(int din, int dout){
        FOR_LOOP:for(int i=0;i