The basic pipeline for DLX has five stages: IF, ID, EX, MEM, and WB. Assuming al
ID: 673795 • Letter: T
Question
The basic pipeline for DLX has five stages: IF, ID, EX, MEM, and WB. Assuming all memory accesses take 1 clock cycle.
a) What are the three types of pipeline hazards?
b) What is the control (branch) hazard of an instruction pipeline? Provide three branch prediction alternatives to reduce branch hazards.
c) What is the data forwarding scheme used to reduce the data hazard?
d) Give all the forwarding paths of the five-stage DLX pipeline, including sources, destinations, and information transferred.
Explanation / Answer
There are situations, called hazards, that prevent the next instruction in the instruction stream from being executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.
There are three classes of hazards:
Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards.They arise from the pipelining of branches and other instructions that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:
A cache miss. A cache miss stalls all the instructions on pipeline both before and after the instruction causing the miss.
A hazard in pipeline. Eliminating a hazard often requires that some instructions in the pipeline to be allowed to proceed while others are delayed. When the instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.
In case of structural hazards:
Clock cycle number
Instr
1
2
3
4
5
6
7
8
9
10
Instr i
IF
ID
EX
MEM
WB
Instr i+1
IF
ID
EX
MEM
WB
Instr i+2
IF
ID
EX
MEM
WB
Stall
bubble
bubble
bubble
bubble
bubble
Instr i+3
IF
ID
EX
MEM
WB
Instr i+4
IF
ID
EX
MEM
WB
A stall causes the pipeline performance to degrade the ideal performance.
Average instruction time unpipelined
Speedup from pipelining = ----------------------------------------
Average instruction time pipelined
CPI unpipelined * Clock Cycle Time unpipelined
= -------------------------------------
CPI pipelined * Clock Cycle Time pipelined
The ideal CPI on a pipelined machine is almost always 1. Hence, the pipelined CPI is
CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + Pipeline stall clock cycles per instruction
If we ignore the cycle time overhead of pipelining and assume the stages are all perfectly balanced, then the cycle time of the two machines are equal and
CPI unpipelined
Speedup = ----------------------------
1+ Pipeline stall cycles per instruction
If all instructions take the same number of cycles, which must also equal the number of pipeline stages ( the depth of the pipeline) then unpipelined CPI is equal to the depth of the pipeline, leading to
Pipeline depth
Speedup = --------------------------
1 + Pipeline stall cycles per instruction
If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of pipeline.
Structural Hazards
When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all posible combinations of instructions in the pipeline.
If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.
Common instances of structural hazards arise when
Some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle
Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
Example1:
a machine may have only one register-file write port, but in some cases the pipeline might want to perform two writes in a clock cycle.
Example2:
a machine has shared a single-memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference(load), it will conflict with the instruction reference for a later instruction (instr 3):
Clock cycle number
Instr
1
2
3
4
5
6
7
8
Load
IF
ID
EX
MEM
WB
Instr 1
IF
ID
EX
MEM
WB
Instr 2
IF
ID
EX
MEM
WB
Instr 3
IF
ID
EX
MEM
WB
To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.
Clock cycle number
Instr
1
2
3
4
5
6
7
8
9
Load
IF
ID
EX
MEM
WB
Instr 1
IF
ID
EX
MEM
WB
Instr 2
IF
ID
EX
MEM
WB
Stall
bubble
bubble
bubble
bubble
bubble
Instr 3
IF
ID
EX
MEM
WB
Instruction 1 assumed not to be data-memory reference (load or store), otherwise Instruction 3 cannot start execution for the same reason as above.
To simplify the picture it is also commonly shown like this:
Clock cycle number
Instr
1
2
3
4
5
6
7
8
9
Load
IF
ID
EX
MEM
WB
Instr 1
IF
ID
EX
MEM
WB
Instr 2
IF
ID
EX
MEM
WB
Instr 3
stall
IF
ID
EX
MEM
WB
Introducing stalls degrades performance as we saw before. Why, then, would the designer allow structural hazards? There are two reasons:
To reduce cost. For example, machines that support both an instruction and a cache access every cycle (to prevent the structural hazard of the above example) require at least twice as much total memory.
To reduce the latency of the unit. The shorter latency comes from the lack of pipeline registers that introduce overhead
Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.
Consider the pipelined execution of these instructions:
1
2
3
4
5
6
7
8
9
ADD
R1, R2, R3
IF
ID
EX
MEM
WB
SUB
R4, R5, R1
IF
IDsub
EX
MEM
WB
AND
R6, R1, R7
IF
IDand
EX
MEM
WB
OR
R8, R1, R9
IF
IDor
EX
MEM
WB
XOR
R10,R1,R11
IF
IDxor
EX
MEM
WB
All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during ID stage (IDsub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.
The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5 (shown black). Thus, the AND instruction that reads the registers during cycle 4 (IDand) will receive the wrong result.
The OR instruction can be made to operate without incurring a hazard by a simple implementation technique. The technique is to perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and IDor for OR are performed in one cycle 5, the write to register file by ADD will perform in the first half of the cycle, and the read of registers by OR will perform in the second half of the cycle.
The XOR instruction operates properly, because its register read occur in cycle 6 after the register write by ADD.
The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions.
We will also classify the data hazards and consider the cases when stalls can not be eliminated. We will see what compiler can do to schedule the pipeline to avoid stalls
Forwarding
The problem with data hazards, introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.
1
2
3
4
5
6
7
ADD
R1, R2, R3
IF
ID
EX
MEM
WB
SUB
R4, R5, R1
IF
IDsub
EX
MEM
WB
AND
R6, R1, R7
IF
IDand
EX
MEM
WB
The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available for SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to where the SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation , forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input latches.
If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
Forwarding of results to the ALU requires the additional of three extra inputs on each ALU multiplexer and the addtion of three paths to the new inputs.
The paths correspond to a forwarding of:
(a) the ALU output at the end of EX,
(b) the ALU output at the end of MEM, and
(c) the memory output at the end of MEM.
Without forwarding our example will execute correctly with stalls:
1
2
3
4
5
6
7
8
9
ADD
R1, R2, R3
IF
ID
EX
MEM
WB
SUB
R4, R5, R1
IF
stall
stall
IDsub
EX
MEM
WB
AND
R6, R1, R7
stall
stall
IF
IDand
EX
MEM
WB
As our example shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started three cycles earlier. Forwarding can be arranged from MEM/WB latch to ALU input also. Using those forwarding paths the code sequence can be executed without stalls:
1
2
3
4
5
6
7
ADD
R1, R2, R3
IF
ID
EXadd
MEMadd
WB
SUB
R4, R5, R1
IF
ID
EXsub
MEM
WB
AND
R6, R1, R7
IF
ID
EXand
MEM
WB
The first forwarding is for value of R1 from EXadd to EXsub .
The second forwarding is also for value of R1 from MEMadd to EXand.
This code now can be executed without stalls.
Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.
One more Example
To prevent a stall in this example, we would need to forward the values of R1 and R4 from the pipeline registers to the ALU and data memory inputs.
1
2
3
4
5
6
7
ADD
R1, R2, R3
IF
ID
EXadd
MEMadd
WB
LW
R4, d (R1)
IF
ID
EXlw
MEMlw
WB
SW
R4,12(R1)
IF
ID
EXsw
MEMsw
WB
Stores require an operand during MEM, and forwarding of that operand is shown here.
The first forwarding is for value of R1 from EXadd to EXlw .
The second forwarding is also for value of R1 from MEMadd to EXsw.
The third forwarding is for value of R4 from MEMlw to MEMsw.
Observe that the SW instruction is storing the value of R4 into a memory location computed by adding the displacement 12 to the value contained in register R1. This effective address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage as an input to Data Memory. Thus the value of R1 is forwarded to the EX stage for effective address computation and is needed earlier in time than the value of R4 which is forwarded to the input of Data Memory in the MEM stage.
So forwarding takes place from "left-to-right" in time, but operands are not ALWAYS forwarded to the EX stage - it depends on the instruction and the point in the Datapath where the operand is needed. Of course, hardware support is necessary to support data forwarding.
Clock cycle number
Instr
1
2
3
4
5
6
7
8
9
10
Instr i
IF
ID
EX
MEM
WB
Instr i+1
IF
ID
EX
MEM
WB
Instr i+2
IF
ID
EX
MEM
WB
Stall
bubble
bubble
bubble
bubble
bubble
Instr i+3
IF
ID
EX
MEM
WB
Instr i+4
IF
ID
EX
MEM
WB
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.