<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<TITLE>html/Pipelining</TITLE>
<META NAME="generator" CONTENT="HTML::TextToHTML v2.46">
<LINK REL="stylesheet" TYPE="text/css" HREF="lecture.css">
</HEAD>
<BODY>
<P><table class="ttop"><th class="tpre"><a href="02_Top_Level.html">Previous Lesson</a></th><th class="ttop"><a href="toc.html">Table of Content</a></th><th class="tnxt"><a href="04_Cpu_Core.html">Next Lesson</a></th></table>
<hr>

<H1><A NAME="section_1">3 DIGRESSION: PIPELINING</A></H1>

<P>In this short lesson we give a brief overview of a design technique
known as pipelining. Most readers will already be familiar with it; those
readers should take a day off or proceed to the next lesson.

<P>Assume we have a piece of combinational logic that happens to have a
long propagation delay even in its fastest implementation. The long delay
is then caused by the slowest path through the logic, which will run
through either many fast elements (like gates) or a number of slower
elements (like adders or multipliers), or both.

<P>That is the situation where you should use pipelining. We will explain
it with an example. Consider the circuit shown in the following figure.

<P><br>

<P><img src="pipelining_1.png">

<P><br>

<P>The circuit is a sequential circuit that consists of three combinational
functions f1, f2, and f3 and a flip-flop at the output of f3.

<P>Let t1, t2, and t3 be the respective propagation delays of f1, f2, and f3.
Assume that the slowest path of the combinational logic runs from the
upper input of f1 to the output of f3. Then the total delay of
the combinational logic is t = t1 + t2 + t3. The entire circuit cannot be
clocked at a frequency higher than 1/t.

<P>Pipelining is a technique that slightly increases the total delay of a
combinational circuit, but in return allows different parts of the logic
to work at the same time. The slight increase in total propagation delay is more
than compensated for by a much higher throughput.

<P>Pipelining divides a complex block of combinational logic with a
correspondingly long delay into a number of stages and places flip-flops
between the stages as shown in the next figure.

<P><br>

<P><img src="pipelining_2.png">

<P><br>

<P>The delay of the slowest path is now max(t1, t2, t3) and the new circuit can be clocked
with frequency 1/max(t1, t2, t3) instead of 1/(t1 + t2 + t3). If the
functions f1, f2, and f3 have equal propagation delays, then the max.
frequency of the new circuit is three times that of the old circuit.
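
<P>As a rough numerical illustration of these formulas, the following Python
sketch compares the two circuits. The delay values are hypothetical, the
function names are made up for this example, and the model ignores the setup
and clock-to-output times of the flip-flops, which add a small amount to
every stage in a real design.

```python
def unpipelined_fmax(delays_ns):
    """Max clock frequency in MHz when all functions form one combinational path."""
    return 1000.0 / sum(delays_ns)


def pipelined_fmax(delays_ns):
    """Max clock frequency in MHz with a flip-flop after every function."""
    return 1000.0 / max(delays_ns)


def pipelined_latency(delays_ns):
    """Time in ns for one input to pass through all stages at the max clock."""
    return len(delays_ns) * max(delays_ns)


equal = [10.0, 10.0, 10.0]   # hypothetical t1, t2, t3 in ns
uneven = [5.0, 20.0, 5.0]    # same total delay, badly balanced

for delays in (equal, uneven):
    speedup = pipelined_fmax(delays) / unpipelined_fmax(delays)
    print(delays, "speedup:", round(speedup, 2),
          "latency:", pipelined_latency(delays), "ns")
```

<P>With equal delays the sketch yields a speedup of 3. With the uneven delays
the slowest stage limits the speedup to 1.5, while the time for one input to
pass through all three stages grows from 30 ns to 60 ns.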
<P>When using pipelining, it is generally a good idea to divide the
combinational logic into pieces with similar delays.
It also helps to divide the combinational logic at places where the
number of connections between the pieces is small, since this reduces the
number of flip-flops that have to be inserted.
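
<P>The effect of balancing can be sketched with a small brute-force search.
The example below assumes that the logic is a simple chain of elements with
known delays (the values are hypothetical) and tries every way of cutting the
chain into a given number of stages. It only looks at delays; the number of
connections at each cut, which determines the flip-flop count, is not modeled.

```python
from itertools import combinations


def best_split(element_delays, stages):
    """Cut a chain of delays into contiguous stages.

    Returns (worst_stage_delay, list_of_stages) for the cut whose slowest
    stage is as fast as possible.
    """
    n = len(element_delays)
    best = None
    # Each choice of cut positions defines one way to form the stages.
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0,) + cuts + (n,)
        parts = [element_delays[a:b] for a, b in zip(bounds, bounds[1:])]
        worst = max(sum(part) for part in parts)
        if best is None or worst < best[0]:
            best = (worst, parts)
    return best


chain = [4, 3, 6, 2, 5, 7, 3]   # hypothetical element delays in ns
worst, parts = best_split(chain, 3)
print("slowest stage:", worst, "ns, stages:", parts)
```

<P>For this chain the total delay is 30 ns, so a perfect split into three stages
would give 10 ns per stage. The best cut that exists has a slowest stage of
13 ns, because the cuts can only be placed between elements.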
<P>The first design of the CPU described in this lecture had the opcode decoding
logic (which is combinational) and the data path logic combined. That design
had a worst path delay of over 50 ns (and hence a max. frequency of less
than 20 MHz). After splitting off the opcode decoder, the worst path delay
was below 30 ns, which allows for a frequency of 33 MHz. We could have
divided the pipeline into even more stages (thereby increasing the
max. frequency even further). This would, however, have obscured the design,
so we did not do it.
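
<P>The conversion between worst path delay and max. frequency used here is
simply f = 1/t. A minimal check in Python, with a helper name chosen for this
example only:

```python
def max_frequency_mhz(worst_path_delay_ns):
    """Convert a worst path delay in ns into a max. clock frequency in MHz."""
    return 1000.0 / worst_path_delay_ns


print(max_frequency_mhz(50.0))             # 20.0
print(round(max_frequency_mhz(30.0), 1))   # 33.3
```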
<P>The reason for the improved throughput is that the different stages of a
pipeline work in parallel, while without pipelining the entire logic would
be occupied by a single operation. The way operations move through a
pipeline is often displayed like this (one color = one operation).

<P><br>

<P><img src="pipelining_3.png">

<P><br>

<P>This kind of diagram shows how an operation is distributed over the
different stages over time.
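
<P>A text version of such a diagram can be produced with a few lines of
Python. This is only a sketch of an ideal pipeline in which a new operation
enters the first stage on every clock cycle; the function name and the number
of operations are assumptions made for this example.

```python
def pipeline_diagram(num_stages, num_ops):
    """Return one row per clock cycle.

    Each row lists, per stage, the number of the operation that the stage
    holds in that cycle, or None if the stage is idle.
    """
    rows = []
    for cycle in range(num_ops + num_stages - 1):
        row = []
        for stage in range(num_stages):
            op = cycle - stage   # operation that entered `stage` cycles ago
            row.append(op if 0 <= op < num_ops else None)
        rows.append(row)
    return rows


# Three stages (f1, f2, f3) and four operations.
for cycle, row in enumerate(pipeline_diagram(3, 4)):
    cells = ["op%d" % op if op is not None else " . " for op in row]
    print("cycle %d: %s" % (cycle, " | ".join(cells)))
```

<P>Each printed row is one clock cycle and each column is one stage. Once the
pipeline has filled, all three stages hold a different operation until the
input runs out.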
<P>To summarize, pipelining typically results in:

<UL>
<LI>a slightly more complex design,
<LI>a moderately longer total delay, and
<LI>a considerable improvement in throughput.
</UL>
<P><hr><BR>
<table class="ttop"><th class="tpre"><a href="02_Top_Level.html">Previous Lesson</a></th><th class="ttop"><a href="toc.html">Table of Content</a></th><th class="tnxt"><a href="04_Cpu_Core.html">Next Lesson</a></th></table>
</BODY>
</HTML>