\documentclass[a4paper,10pt,oneside,final]{article}
%\documentclass[12pt,a4paper,english]{tutthesis}
%\documentclass[11pt,final]{tutdrthesis}
%\documentclass[11pt,final]{IEEETran}

% Include the necessary packages
\usepackage[dvips]{graphicx}
\usepackage{enumerate}
\usepackage[UKenglish]{babel}
\usepackage{cite}
\usepackage{subfigure}

% 'pslatex' is otherwise equal to 'times'
% but courier font is narrower
\usepackage{pslatex}
%\usepackage{times}

% 2 pkgs for Scandinavian alphabets
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}

\usepackage{listings}
\usepackage{color}
\definecolor{gray95}{gray}{.95}

\lstdefinestyle{ccc}
{
  numbers=none,
  basicstyle=\small\ttfamily,
  keywordstyle=\bf\color[rgb]{0,0,0},
  %commentstyle=\color[rgb]{0.133,0.545,0.133},
  stringstyle=\color[rgb]{0.627,0.126,0.941},
  backgroundcolor=\color{white},
  frame=tb, %frame= lrtb,
  framerule=0.5pt,
  linewidth=\textwidth,
  %aboveskip=-4.0pt,
  %belowskip=-4.0pt,
  lineskip=-5.0pt,
}

% style for transgen xml listings
\lstdefinestyle{a1listing}
{
  numbers=none,
  language=bash,
  basicstyle=\small\bf\ttfamily,
  emphstyle=\color[rgb]{0.0, 0.7, 0.3},
  keywordstyle=\color[rgb]{0.0, 0.0, 1.0},
  commentstyle=\color[rgb]{0.8, 0.0, 0.0},
  stringstyle=\color[rgb]{0.737, 0.560, 0.560},
  backgroundcolor=\color{gray95},
  frame= lrtb,
  framerule=0.5pt,
  linewidth=\textwidth,
}

% console listings style
\lstdefinestyle{console}
{
  numbers=none,
  basicstyle=\small\bf\ttfamily,
  backgroundcolor=\color{gray95},
  frame=lrtb,
  framerule=0.5pt,
  linewidth=\textwidth,
}

% 2 Completely strange definitions
\newcommand{\longpage}{\enlargethispage*{100cm} \pagebreak}
\newcommand{\nohyphens}{\hyphenpenalty=10000\exhyphenpenalty=10000\relax}

% Exceptional hyphenation forms, separated by spaces
\hyphenation{Sal-mi-nen Kan-gas Rii-hi-mä-ki Kuu-si-lin-na Hä-mä-läi-nen Kuk-ka-la HIBI TUTMAC Koski}
% \hyphenation{de-vel-oped pro-vides multi-stage Rctrl}

% Try to prevent figures from being placed alone on a page without text
% http://dcwww.camd.dtu.dk/~schiotz/comp/LatexTips/LatexTips.html
% Be careful not to make \floatpagefraction larger than \topfraction
\renewcommand{\topfraction}{0.85}
\renewcommand{\textfraction}{0.1}
\renewcommand{\floatpagefraction}{0.75}

%
% Define author(s) and component's name
%
\def\defauthor{Salminen, Hämäläinen}
\def\deftitle{HIBI v.3 \\Reference Manual}
\author{\defauthor}
\title{\deftitle}

\usepackage{fancyhdr}
\pagestyle{fancy}
\lhead{\bfseries Department of Computer Systems\\ Faculty of Computing and Electrical Engineering}
\chead{}
\rhead{\bfseries \deftitle}
\lfoot{\thepage}
\cfoot{}
\rfoot{TUT}
%\rfoot{\includegraphics[height=1.0cm]{../Fig/Eps/tut_logo.eps}}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0.4pt}

\def\deftablecolora{blue!10!white}
\def\deftablecolorb{white}

\begin{document}

%\maketitle
%\thispagestyle{empty}

\begin{titlepage}
\begin{center}
\vspace{6.0cm}
\begin{center}
\includegraphics[height=1.0cm]{../Fig/Eps/tut_logo.eps}
\end{center}
\textsc{Faculty of Computing and Electrical Engineering}\\[1.0cm]
\textsc{Department of Computer Systems}\\[1.0cm]
%\textsc{\LARGE Tampere University of Technology}\\[1.0cm]
%\textsc{\Large Faculty of Computing and Electrical Engineering}\\[1.0cm]
%\textsc{\Large Department of Computer Systems}\\[1.0cm]
\vspace{6.0cm}
\hrule
\vspace{0.4cm}
{ \huge \bfseries Heterogeneous IP Block Interconnection (HIBI) \\ version 3 \\[0.5cm] Reference Manual}
\vspace{0.4cm}
\hrule
%\vspace{2.0cm}
\vfill
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Erno Salminen, \\Timo Hämäläinen
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Updated:} \\
\today
\end{flushright}
\end{minipage}
\end{center}
\end{titlepage}

%\title{HIBI data sheet - September 2011}
%\author{Erno Salminen}
%\begin{document}
% \onecolumn
%\include{cover}

\setcounter{secnumdepth}{-1}

% Add some space between lines
\linespread{1.25}\normalsize

\tableofcontents
\newpage
\thispagestyle{empty}
\listoffigures
\listoftables
% \twocolumn

\newpage
\thispagestyle{empty}
\setcounter{secnumdepth}{2}

\section{Introduction}
\label{ch:hibi}

This data sheet presents the third version of \textit{Heterogeneous IP Block Interconnection} (HIBI). HIBI is intended for integrating coarse-grain components, such as intellectual property (IP) blocks with a size of thousands of gates; see \cite{salminen04} for examples. Topology, arbitration, and data transfers are presented first. After that, data buffering and the structure of the wrapper component are discussed. Finally, the developed runtime configuration is presented, followed by a comparison to the previous version of HIBI.

HIBI is a communication network designed for Systems-on-Chip (SoC). It can be used in both FPGA (field-programmable gate array) and ASIC (application-specific integrated circuit) designs. Fig.~\ref{fig:soc_concept} shows an example SoC at a conceptual level. There are many different types of IP blocks, namely CPUs (central processing units) for executing software, memories, and IP blocks that are either fixed-function accelerators or interfaces to external components. All these are connected using an on-chip network.
\begin{figure}[b]
\begin{center}
{\includegraphics[width=0.60\textwidth]{../Fig/Eps/fig_soc_concept.eps}}
\caption{Conceptual structure of system-on-chip}
\label{fig:soc_concept}
\end{center}
\end{figure}

\subsection{Main points}

The major design choices for HIBI were
\begin{itemize}
\item IP-block granularity for functional units
\item Application-independent interface to allow reuse of processors and IP blocks
\item Communication and computation separated
\item Communication network used in all transfers, no ad-hoc wires between IPs
\item Support for local clock domains at IP granularity
\end{itemize}

A parameterizable HW component, called the HIBI wrapper, is used to construct modular, hierarchical bus structures with distributed arbitration and multiple clock domains, as shown in Fig.~\ref{fig:hierarchy} (explained later in detail). This simplifies design and allows reuse since the same wrapper can always be utilized. Configuration takes place both at synthesis time (e.g. data width and buffer sizes) and at runtime (arbitration parameters).

In addition, since FPGAs are also targeted, there are some additional constraints
\begin{itemize}
\item Keep the number of wires low - to avoid exhausting routing resources
\item Avoid global connections - to avoid long combinatorial routing delays
\item Avoid 3-state wires - to simplify testing and synthesis (most FPGAs allow three-state logic only in I/O pins)
\end{itemize}

\subsection{Versions}

The development of HIBI \cite{kuusilinna98, lahtinen02, lahtinen04, salminen10} started in 1997 at Tampere University of Technology. Currently, there are 3 versions of HIBI, denoted as v.1-v.3. However, certain basics have remained unchanged. Hence, in the remainder, the version number is omitted unless it is necessary. In version 2, the biggest changes were removing tri-state logic and increasing modularity and configurability. For version 3, address decoder logic was modified to simplify usage.
Furthermore, the tx and rx state machines were refactored, which also necessitated a minor change in bus timing. These FSM changes do not affect the IP, though.

\section{HIBI topology}

\begin{figure*}
\begin{center}
{\includegraphics[width=0.85\textwidth]{../Fig/Eps/fig_topo_hibi_hierarchy.eps}}
\caption{Example of a hierarchical HIBI network with multiple clock domains and bus segments}
\label{fig:hierarchy}
\end{center}
\end{figure*}

The topology in HIBI is not fixed but configurable by the designer. A HIBI network consists of wrappers, bus segments, and bridges. These are the basic building blocks from which the whole network is constructed and configured. All wrappers in the system are instantiated from the same parameterizable HDL (HW description language) entity, and bridges are constructed by connecting two wrappers together. If the connected segments use different data widths, the bridges are responsible for the data width adaptation. All wrappers can act both as a \textit{master} and a \textit{slave}. Masters can initiate transfers and slaves can only respond. In many buses, most units operate in one mode only and only a few in both modes.

In the simplest case, there is only one segment and the topology is hence a single shared bus. However, a HIBI network can have multiple segments which form a hierarchical bus structure. Segments are connected together using bridges. Bridges increase latency but, on the other hand, the hierarchical structure allows multiple parallel transactions. Bridges are simply constructed from 2 wrappers.

For the IP, the wrapper offers a FIFO-based (first in, first out) interface, as depicted in Fig. On the network side, all signals inside a segment are shared between wrappers and no dedicated point-to-point signals are used. Arbitration decides which wrapper (or bridge) controls the segment, and the utilized arbitration algorithms are distributed to the wrappers without any central controller.
\subsection{Example of hierarchical topology}

Bus performance can be scaled up by using bridges. Segments having only simple peripheral devices can have a slow and narrow bus while the main processing parts have higher capacity buses. Fig.~\ref{fig:hierarchy} depicts an irregular HIBI network. The example has a point-to-point link ($Seg A$), a hierarchical bus ($Seg B$ and $Seg C$), and a multibus topology ($Seg C$ and $Seg D$). Furthermore, $Seg B$ is wider than the other segments and thus offers greater bandwidth. In the multibus configuration, each IP must decide which bus to use while sending. Note that $Seg A$ could be implemented without wrappers since there is no need for arbitration.

The example shows four clock domains. Agents in $Seg A$ and $Seg B$ are inside one domain, and the HIBI wrappers on $Seg C$ are in one domain. However, two IPs in the top right corner use a different clock from the wrappers of $Seg C$. The IPs in the bottom right corner and all wrappers in $Seg D$ are in one domain. The number of clock domains is not otherwise restricted, but all wrappers in one bus segment must use the same clock. Handshaking between the clock domains is done in the IP-wrapper interface or inside the bridge \cite{kulmala06b, kulmala06e}. This allows the construction of GALS (globally asynchronous, locally synchronous) systems. The example shows only one bridge, but HIBI does not restrict either the number of bridges or the number of hierarchy levels, in contrast to many bus architectures.

\subsection{Switching}

Transfers inside a bus segment are circuit-switched and use a common clock due to the (current) implementation of the distributed arbitration. However, HIBI bridges utilize a switching principle that resembles packet switching, so that bus segments are not circuit-switched together. Instead, the data is stored inside the bridge until it gets access to the other segment. The data is forwarded to the next segment as soon as possible, as in wormhole routing. However, no guarantees are given for the minimum length of a continuous transfer.
If the bridge cannot buffer all the data, the transfer is interrupted and the source segment is free for other transfers. The interrupted wrapper will continue the transfer on its next turn. It is also possible that a bridge buffers parts from multiple transfers.

\section{Data transfer operations}

In HIBI, all transfers are bursts. In practice, there is always 1 address word followed by $n$ data words. The maximum $n$ is a wrapper-specific arbitration parameter. HIBI v.2 used multiplexed address and data lines, but HIBI v.3 allows transmitting them in parallel. With multiplexed address/data lines, it is beneficial to send many data words to a single address. This is quite different from ``traditional'' memory accesses, where address and data are given at the same time. Hence, the destination IP should keep track of the received data count. For example, TUT's SDRAM controller can do this to avoid transmitting excess address + data pairs.

The transfers are pipelined with arbitration, and hence the next transfer can start immediately when the previous one ends. The protocol on the bus side is optimized so that no wait cycles are allowed during a transfer. This means that if the sender runs out of data or the receiver does not accept it fast enough, the transfer is interrupted. On its next arbitration turn, the wrapper continues automatically. Note that the IP may transfer data at the pace it wishes. The IP only has to ensure that there is space in the TX FIFO while writing and that the RX FIFO is not empty while reading.

In order to increase bus utilization, HIBI uses so-called split transactions in read operations. This means that a single read operation is split into two phases: request and response. The bus segment is released while the addressed IP handles the read request and prepares its response. The other wrappers may use the bus during that period, and this increases the overall performance, although a single read becomes a little slower due to the additional arbitration round.
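
The cost of the address word on multiplexed address/data lines can be estimated with a simple model. The model is only a sketch and rests on assumptions that are not stated in this manual: every address or data word occupies one bus cycle, and arbitration overhead is ignored. A burst of $n$ data words that is sent in $k$ parts, each part starting with an address word, then occupies $n+k$ bus cycles, so the fraction of cycles that carry data is
\[
\eta = \frac{n}{n+k} .
\]
For an uninterrupted burst ($k=1$), a single data word gives $\eta = 1/2$, $n=8$ gives $\eta = 8/9 \approx 0.89$, and $n=32$ gives $\eta = 32/33 \approx 0.97$. If the 32-word burst is interrupted three times ($k=4$), the fraction drops to $\eta = 32/36 \approx 0.89$.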
\begin{figure}
\begin{center}
{\includegraphics[width=0.7\textwidth]{../Fig/Eps/fig_basic_tx.eps}}
\caption{Example of read and write operations.}
\label{fig:basic_tx}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
{\includegraphics[width=0.4\textwidth]{../Fig/Eps/fig_basic_tx2.eps}}
\caption{Basic transactions are write and read.}
\label{fig:basic_tx2}
\end{center}
\end{figure}

Write operation
\begin{itemize}
\item Includes destination address
\item Data is sent in words (=HIBI bus width)
\item Several words can follow: all will be sent to the same destination address
\end{itemize}

Read operation
\begin{itemize}
\item Includes exactly two words: destination address and return address (where to put the data)
\item Data is received in words
\item Several words can be received (all to the same return address)
\begin{itemize}
\item No handshaking: data is transmitted/received when bus, sender, or receiver are available
\item No acknowledgements or flow control
\end{itemize}
\end{itemize}

Figs.~\ref{fig:basic_tx} and~\ref{fig:basic_tx2} depict the basic transfers: a write, a read request, and the response to a read. An IP can send multiple read requests before the previous ones have completed. It is the responsibility of the requestor to keep track of which response belongs to which request. This can be implemented with appropriate use of return addresses. The reader does not get data any faster, but the advantage is that the shared medium is available for other agents in the middle of the transmission process, and consequently the achieved total throughput increases. Split transactions are commonly used in packet-switched networks and also in modern bus protocols, such as AMBA.

Since there is exactly one path between each source and destination, all data is guaranteed to arrive in order and hence no reordering buffers are needed at the receiver. Data can be sent with different relative priorities.
High priority data, such as control messages, bypass the normal data transfers inside the wrappers and bridges, resulting in smaller latency. This does not change the timing of bus reservations, but it selects what is transferred first.

\subsection{HIBI Basic Transaction Motivation}

HIBI was motivated by streaming applications where a continuous flow of data is transmitted between IPs. Destinations are ports rather than randomly accessed memory locations. Hence, HIBI is not natively a processor memory bus but can be used as one as well. HIBI does not implement end-to-end flow control, so the IPs must do it explicitly. The FIFO buffers at the rx and tx sides may get full if the receiver does not eject data fast enough, and this will throttle the transmitter as well. The wrappers take care of retransmission at the link level. (HIBI v.1 dropped data if the receiving buffer got full, but usage of v.1 is not recommended anymore.)

\begin{figure*}
\begin{center}
{\includegraphics[width=0.8\textwidth]{../Fig/Eps/fig_tx_steps.eps}}
\caption{Logical steps that IP does during transaction.}
\label{fig:tx_steps}
\end{center}
\end{figure*}

Fig.~\ref{fig:tx_steps} shows the steps that an IP needs to take when communicating using HIBI. On the left, the IP sends data when the TX FIFO is not full. It must assign the data, address valid (strobe), command, and write enable signals at the same time.

When receiving data, the IP first checks whether the incoming value is an address or a data word. This is done by examining the address valid signal. One word is removed from the FIFO on every clock cycle in which the receiver asserts the read enable signal. Next, the IP must check whether the operation is a write or a read. In case of a write, it stores the incoming data to the location defined by the address. In case of a read, the second word denotes the return address. It is the address to which the read data word must be transmitted.
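
The receive-side steps above can be sketched in VHDL as follows. This is a minimal sketch under stated assumptions, not code from the HIBI library: the entity name, the port names, the strobe \texttt{word\_valid\_in} (one word is taken from the RX FIFO in this cycle), and the command encodings \texttt{cmd\_write\_c} and \texttt{cmd\_read\_c} are hypothetical placeholders. The real command codes must be taken from the HIBI sources.

\begin{lstlisting}[language=vhdl, style=console, basicstyle=\footnotesize, title={Sketch of classifying received words (hypothetical names)}]
library ieee;
use ieee.std_logic_1164.all;

entity rx_classify_sketch is
  generic (
    data_width_g : integer := 32
    );
  port (
    clk           : in  std_logic;
    rst_n         : in  std_logic;
    -- Hypothetical strobe: one word is taken from RX FIFO now
    word_valid_in : in  std_logic;
    av_in         : in  std_logic;
    comm_in       : in  std_logic_vector(2 downto 0);
    data_in       : in  std_logic_vector(data_width_g-1 downto 0);
    -- Write: store wr_data_out to wr_addr_out
    wr_en_out     : out std_logic;
    wr_addr_out   : out std_logic_vector(data_width_g-1 downto 0);
    wr_data_out   : out std_logic_vector(data_width_g-1 downto 0);
    -- Read: fetch from rd_addr_out, reply to ret_addr_out
    rd_req_out    : out std_logic;
    rd_addr_out   : out std_logic_vector(data_width_g-1 downto 0);
    ret_addr_out  : out std_logic_vector(data_width_g-1 downto 0)
    );
end entity rx_classify_sketch;

architecture rtl of rx_classify_sketch is
  -- Placeholder encodings, NOT the real HIBI command codes
  constant cmd_write_c : std_logic_vector(2 downto 0) := "010";
  constant cmd_read_c  : std_logic_vector(2 downto 0) := "100";
  -- Latest address word of the current burst
  signal addr_r : std_logic_vector(data_width_g-1 downto 0);
begin

  classify : process (clk, rst_n)
  begin
    if rst_n = '0' then
      addr_r       <= (others => '0');
      wr_en_out    <= '0';
      wr_addr_out  <= (others => '0');
      wr_data_out  <= (others => '0');
      rd_req_out   <= '0';
      rd_addr_out  <= (others => '0');
      ret_addr_out <= (others => '0');
    elsif rising_edge(clk) then
      -- Strobes are active for one cycle per received word
      wr_en_out  <= '0';
      rd_req_out <= '0';
      if word_valid_in = '1' then
        if av_in = '1' then
          -- Step 1: address word, remember it
          addr_r <= data_in;
        elsif comm_in = cmd_write_c then
          -- Step 2a: write, data goes to the stored address
          wr_en_out   <= '1';
          wr_addr_out <= addr_r;
          wr_data_out <= data_in;
        elsif comm_in = cmd_read_c then
          -- Step 2b: read, this word is the return address
          rd_req_out   <= '1';
          rd_addr_out  <= addr_r;
          ret_addr_out <= data_in;
        end if;
      end if;
    end if;
  end process classify;

end architecture rtl;
\end{lstlisting}

The sketch keeps the FIFO handshake out of the picture on purpose, so that only the address/data and write/read decisions are visible. Since all data words of a burst share one address, an IP that exposes memory would additionally increment a local offset for each stored word.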
\section{Addressing}

All IP blocks have a unique address and register space defined at design time, and every transfer starts with a single destination address. Source identification is not included in the basic transfer, and hence one of the following must be used
\begin{enumerate}[a)]
\item Use the data payload to define the source, e.g. the first word in a data packet
\item Use a unique address inside the IP block for each source (the IP knows the sender from the destination address)
\end{enumerate}

Every wrapper has a set of addresses, and they are set with a VHDL generic (automatically by Kactus). Wrappers may have varying address space sizes, e.g. a simple UART has only 2 addresses whereas a memory has 16K addresses. Incoming addresses go through the receiving wrapper to the receiving IP, which can identify the incoming data by its address. For example, the uppermost bits define which IP is addressed and the lowermost bits define the register of that IP.

There are two ways to set addresses
\begin{enumerate}
\item Manually
\item A generator script in the Kactus tool does this automatically according to the system specification
\end{enumerate}

An IP may write arbitrarily long bursts to the wrapper, perhaps with only one address in the beginning followed by an arbitrary number of data words. Moreover, the IP writes data to the wrapper at an arbitrary pace. There can be any number of idle cycles between data words. Therefore, the bursts sent by the IP do not necessarily have the same length on the bus (between wrappers). For example, the wrapper may split a long IP transfer into multiple bus transfers if the arbitration algorithm gives ownership to another wrapper in the middle. Each part of the transfer starts with the same address as the previous one. On the other hand, a wrapper may send many short IP transfers consecutively on one turn. These properties have two consequences
\begin{enumerate}
\item Bursts from multiple source IPs will be interleaved
\item The destination may receive a different number of addresses than the sender sent
\end{enumerate}
Note that the destination IP does not know the sender unless it is separately encoded into the data or the address.

\subsection{HIBI destination addresses and channels}

In HIBI v.2, all transfers are bursts, i.e. the address is transmitted only in the beginning of the transfer and it is followed by one or more data words. The maximum burst length is wrapper-specific. HIBI mainly uses a two-level addressing scheme: the upper bits of the address identify the target terminal (e.g. $destination_0$) whereas the lower bits define an additional identifier. This identifier can be used as an address to local memory, to select the correct reception channel on a DMA, to identify the source of the data, or to select the requested service. Certain packet-switched networks (at least those implemented in this work) allow only one address per terminal. In that case, the second-level address must be carried in the header, which increases the header length.

HIBI destination addresses are
\begin{enumerate}
\item Internal registers
\item Ports (to/from the IP's internal logic)
\item The IP's memory locations that are transparent to the outside
\end{enumerate}

Burst transfers use channels (or ports), and the IP block must perform addressing (increment) internally since all data is sent to one address. If the IP's memory is transparent, the address seen outside also includes the IP-block address (e.g. in address 0xB100, 0xB000 defines the target IP and 0x100 the internal memory location).

\begin{figure}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_chan_addr.eps}}
\caption{Relation between addresses and channels.}
\label{fig:chan_addr}
\end{center}
\end{figure}

HIBI transfers can be abstracted as channels at the IP-block side (but it is not formally specified how). The easiest way to separate channels is to use unique HIBI addresses. It is an IP/system-level design issue to give meaning to the channels. For example, an accelerator receives data from CPU0 via channel 0, from CPU1 via channel 1, and so on. Basic HIBI transactions are used to handle possible flow control and handshaking in addition to transfers.
Fig.~\ref{fig:chan_addr} shows an example with 6 channels (addressing style of HIBI v.2). Note that all incoming channels 4-6 have the same 4 upper bits in their addresses. In other words, the example uses a convention that the base address of IP1 is 0xC00 and therefore its uppermost address is implicitly 0xCFF. The channels can be easily distinguished from the lowest address bits. In HIBI v.3, the addressing is defined using two parameters: start and end address. The designer can use the same addresses as in HIBI v.2 based systems, but this scheme allows more freedom in address definitions, which is especially beneficial in hierarchical systems.

\subsection{Implementing flow control}

Flow control and handshaking must be implemented in the IP blocks. In practice, this leads to IP-block specific methods which must be carefully specified at design time. The minimum set of issues to be agreed on is
\begin{enumerate}
\item Sender identification (e.g. a unique channel address ties the IP block and the purpose together)
\item Transfer size
\item Size unit in addressing (bytes/words)
\item Whether byte enables are utilized
\item Messages for non-posted transactions (acknowledgements to write/read)
\end{enumerate}

\subsection{Example: Overlapping and breaking transfers}

It was noted that transfers may be split due to arbitration. The example in Fig.~\ref{fig:addr_interleaving} clarifies the phenomenon. Let us assume that IP 1 and IP 2 send data to IP 3. IP 1 gets the first turn on the bus, and its first two data words arrive at IP 3. However, after that IP 3 gets two consecutive words from IP 2, then from IP 1, and so on. Note that in a realistic case the arbitration happens less frequently, but the example highlights the issue.
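
Because every part of a split transfer starts with an address word, the receiver can sort interleaved words by comparing that address against its own address range. The following VHDL package is a minimal sketch of such a decoder. The package name, the function names, and the 12-bit address width are hypothetical; only the range 0xC00 - 0xCFF comes from the example of IP1 above. Using the offset from the start address as the channel number is likewise an assumption of this sketch, not a rule of HIBI.

\begin{lstlisting}[language=vhdl, style=console, basicstyle=\footnotesize, title={Sketch of address range check and channel decoding (hypothetical names)}]
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package chan_decode_sketch_pkg is

  -- Range of IP1 in the example: 0xC00 - 0xCFF
  constant addr_width_c : integer := 12;
  constant start_addr_c : unsigned(addr_width_c-1 downto 0)
    := x"C00";
  constant end_addr_c   : unsigned(addr_width_c-1 downto 0)
    := x"CFF";

  -- True if addr lies between start and end address
  -- (start/end style of HIBI v.3)
  function is_own_addr (
    addr : unsigned(addr_width_c-1 downto 0))
    return boolean;

  -- Offset from the start address, used here as the
  -- channel number. Call only when is_own_addr is true.
  function channel_of (
    addr : unsigned(addr_width_c-1 downto 0))
    return natural;

end package chan_decode_sketch_pkg;

package body chan_decode_sketch_pkg is

  function is_own_addr (
    addr : unsigned(addr_width_c-1 downto 0))
    return boolean is
  begin
    return addr >= start_addr_c and addr <= end_addr_c;
  end function is_own_addr;

  function channel_of (
    addr : unsigned(addr_width_c-1 downto 0))
    return natural is
  begin
    return to_integer(addr - start_addr_c);
  end function channel_of;

end package body chan_decode_sketch_pkg;
\end{lstlisting}

With these constants, address 0xC04 is inside the range and decodes to offset 4. A receiver would latch the decoded offset on each address word and route the following data words to the corresponding channel buffer until the next address word arrives.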
\begin{figure}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_addr_interleaving.eps}}
\caption{The transfers may get interleaved due to arbitration.}
\label{fig:addr_interleaving}
\end{center}
\end{figure}

\begin{figure*}
\begin{center}
{\includegraphics[width=0.65\textwidth]{../Fig/Eps/fig_hibi_wrapper.eps}}
\caption{Structure of HIBI v.2 wrapper and configuration memory}
\label{fig:wrapper}
\end{center}
\end{figure*}

In conclusion
\begin{enumerate}
\item Data is transferred in order through the FIFO
\item If a tx is interrupted on the bus, the wrapper re-sends the address and continues the tx of the rest of the data to the destination
\item The sender's tx FIFO cannot be cleared once written
\item The receiver can identify the channel to which data is coming based on the address
\end{enumerate}

\section{Wrapper structure}

A HIBI network is constructed using parameterizable building blocks called wrappers. The wrappers take care of arbitration, link-level transmission, data buffering, and optional clock-domain crossing. All signals on both sides of the wrapper are unidirectional. For example, there are separate multibit signals data\_in and data\_out. Let us first consider the bus side, i.e. the signals between wrappers.

The structure of the HIBI v.2 wrapper is depicted in Fig.~\ref{fig:wrapper}. The modular wrapper structure can be tuned to better meet the application requirements by using different versions of the internal units or by leaving out properties that are not needed in a particular application. On the IP side, there can be separate interfaces for every data priority, or they can be multiplexed into one interface. Furthermore, the power control signals can be routed out of the wrapper if the IP block can utilize them.

The main parts are buffers for transmitting and receiving data and the corresponding controllers. The transfer controller takes care of distributed arbitration. The configuration memory stores the arbitration parameters.
Relative data priority is implemented by adding extra FIFOs. A (de)multiplexer is placed between the FIFOs and the corresponding controller so that the controller operates only on a single FIFO interface. The separate (de)multiplexer allows adding FIFOs to support more than two priorities without changing the control. Currently, the transmit multiplexer uses pre-emptive scheduling.

HIBI v.2 has multiplexed address and data lines whereas HIBI v.1 uses separate address and data lines. Multiplexing decreases implementation area because signal lines are removed and less buffering capacity is needed for the addresses. This causes overhead in control logic, but that is less than the saving in buffering. Having fewer wires allows wider spacing between wires and hence lower coupling capacitance. On the other hand, the saved wiring area can be used for wider data transfers to increase the available bandwidth.

The HIBI protocol does not require any specific control signals, but message-passing is utilized when needed. HIBI v.1 assumes strictly non-blocking transfers and omits handshake signals to minimize transfer latency, but one handshake signal \textit{Full} was added to HIBI v.2 to avoid FIFO overflow at the receiver. As a result, blocking models of computation can be used in system design and, in addition, the depths of FIFOs can be considerably smaller than in HIBI v.1.

\subsection{Bus-side signals}

All outputs from wrappers are ``ORed'' together, and the OR-gates' outputs are connected to all wrappers' inputs. This scheme avoids the tri-state logic that was used in HIBI v.1. Table~\ref{table:bus_signals} lists the bus side signals and Fig.~\ref{fig:3_wrappers} illustrates the connection between wrappers and OR-gates. The cycle-accurate bus timing is omitted from this user guide for brevity. All bus side outputs come directly from registers except the handshaking signal full.

\begin{table*}
\caption {The signals at bus side, i.e.
between the wrappers, in v.2 and v.3 }
\label{table:bus_signals}
\begin{center}
\begin{tabular}{l | l | l | l}
\hline
Signal & Width & Dir. & Meaning \\
\hline
\hline
data & generic & i+o & Data and address are multiplexed into a single set of wires \\
av & 1 & i+o & Address valid. Notifies when address is transmitted \\
cmd & 3 & i+o & Command: read or write, data or configuration etc. \\
full & 1 & i+o & Target wrapper is full and cannot accept the data. Current transfer will be repeated later \\
lock & 1 & i+o & Bus is reserved \\
\hline
\end{tabular}
\end{center}
\end{table*}

\begin{figure*}
\begin{center}
{\includegraphics[width=0.65\textwidth]{../Fig/Eps/fig_hibi_3_wrappers.eps}}
\caption{Connection between wrappers and OR-gates on the bus side}
\label{fig:3_wrappers}
\end{center}
\end{figure*}

The number of data bits can be freely chosen. This is beneficial, for example, when error correcting or detecting codes are added to the data and the resulting total data width is not a power of two. The active master asserts the $Lock$ signal when it reserves the bus. Handshaking is done with the $Full$ signal. When $Full$ is asserted, the data word on the bus must be retransmitted by the wrapper.

To improve modularity, all signals are shared by all wrappers within a segment and no point-to-point signaling is required. Consequently, the interface of a wrapper does not depend on the number of agents, and the wrapper can be reused more easily. An OR network was selected for bus signal resolution.

The HIBI implementation pays special attention to minimizing the transfer latency by removing empty cycles from the arbitration process by pipelining. Empty cycles are here defined as cycles when at least one wrapper has data to send but the bus segment is not reserved. An optimized protocol allows a lower frequency, and hence lower power, for a given performance level than an inefficient protocol.
Empty cycles also appear when bus utilization is low, as distributed round-robin arbitration takes one cycle per agent. If only one agent is transmitting, it has to wait a whole round-robin cycle between transfers. In such cases, priority-based arbitration is useful.

\subsection{IP-side signals}

The signals at the IP interface are mostly the same as on the bus side. Interface signals are connected to FIFO buffers inside the wrapper, and all output signals of the wrapper come from registers. Most signals are driven by both the IP and the wrapper
\begin{itemize}
\item Command
\item Address / Address valid
\item Data
\begin{itemize}
\item May have high (message) and low (data) priorities (depends on wrapper type)
\item Priority is defined by the transmitting IP block (source)
\end{itemize}
\end{itemize}

On the other hand, the FIFO access control signals depend on the direction. Both control signals, write enable and read enable, are driven by the IP. The status signals are driven by the wrapper. There are always at least two status signals: FIFO full and FIFO empty. In addition, the FIFO buffers developed for HIBI offer two others: one data left in FIFO and one place left in FIFO, which may simplify the logic in the IP. The address signals at the IP side offer a few choices that are described next.

\begin{figure*}
\begin{center}
{\includegraphics[width=0.6\textwidth]{../Fig/Eps/fig_ip_signals.eps}}
\caption{The signals between IP and wrapper}
\label{fig:ip_signals}
\end{center}
\end{figure*}

Fig.~\ref{fig:ip_signals} depicts the signals between the IP and the wrapper, and Table~\ref{table:ip_signals} lists their details.

\begin{table*}
\caption {The signals at wrapper's IP interface}
\label{table:ip_signals}
\begin{center}
\begin{tabular}{l | l | l | l}
\hline
Signal & Width & Dir. & Meaning \\
\hline
\hline
rst\_n & 1 & i & Active low reset \\
clk & 1 & i & Clock, active on rising edge.
Same for all wrappers inside one segment \\
data & generic & i+o & Data and address are multiplexed into a single set of wires \\
av & 1 & i+o & Address valid. Notifies when address is transmitted \\
cmd & 3 & i+o & Command: read or write, data or configuration etc. \\
re & 1 & i & Read enable. Wrapper can remove the first data from FIFO \\
we & 1 & i & Write enable. Adds the data from IP to TX FIFO \\
full & 1 & o & TX FIFO is full \\
empty & 1 & o & RX FIFO is empty \\
one\_p & 1 & o & TX FIFO has one place left, i.e. almost full \\
one\_d & 1 & o & RX FIFO has one data left, i.e. almost empty \\
\hline
\end{tabular}
\end{center}
\end{table*}

\subsection{Variants of IP interface}

There are 4 variants of the IP interface, depending on how the following are handled
\begin{enumerate}[a)]
\item High/low priority data: one or two interfaces
\item Address and data: separate interfaces or one multiplexed interface
\end{enumerate}

The different wrappers are denoted with the postfix $\_r<x>$
\begin{itemize}
\item r1: a) 2 interfaces hi+lo; b) muxed a/d
\item r2: a) 1 interface hi/lo; b) separate a+d
\item r3: a) 2 interfaces hi+lo; b) separate a+d
\item r4: a) 1 interface hi/lo; b) muxed a/d
\end{itemize}

Since these options affect only the IP side, different wrapper types can co-exist in the same system, and the wrappers' bus side interface is always the same. Furthermore, the addresses work directly between wrapper types. However, hi-priority data cannot bypass lo-priority data in wrapper types r2 and r4. Nevertheless, all data is always transmitted.

For example, Nios subsystems commonly utilize r4 but the SDRAM controller utilizes r3. This is because the SDRAM controller distinguishes DMA configuration and memory data traffic by the priority of the incoming data. It also prevents deadlock.

Fig.~\ref{fig:ip_interface_variants} depicts the variants of the wrapper's IP side signals. Interface type r1 is the ``native'' interface that is used inside all other variants.

\begin{figure*}
\begin{center}
{\includegraphics[width=0.8\textwidth]{../Fig/Eps/fig_ip_interface_variants.eps}}
\caption{There are 4 variants of IP interface.
There are two selectable features, namely separation of hi/lo-priority data and separate/multiplexed addressing.}
\label{fig:ip_interface_variants}
\end{center}
\end{figure*}

\subsection{Signal naming in VHDL}

The side and direction are marked in the signal names in the HIBI wrapper VHDL, for example
\begin{enumerate}
\item agent\_data\_in, agent\_data\_out,
\item bus\_data\_in, bus\_data\_out
\end{enumerate}
Fig.~\ref{fig:sgn_naming} clarifies the naming scheme.

\begin{figure*}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_sgn_naming.eps}}
\caption{The naming convention of ports}
\label{fig:sgn_naming}
\end{center}
\end{figure*}

\subsection{Cycle-accurate timing}

For brevity, only the IP side timing is explained. It is actually very simple. The timing when transmitting is depicted in Fig.~\ref{subfig:tx_timing}:
\begin{enumerate}
\item IP checks that tx FIFO is not full
\item IP sets data, command, addr/av, and write\_enable=1 for one clk cycle
\end{enumerate}

\begin{figure*}
\begin{center}
\subfigure[IP sends.]{\includegraphics[width=0.95\textwidth]{../Fig/Eps/fig_tx_timing.eps} \label{subfig:tx_timing}}
\subfigure[IP receives data]{\includegraphics[width=0.95\textwidth]{../Fig/Eps/fig_rx_timing.eps} \label{subfig:rx_timing}}
\caption{Examples of timing at IP interface.}
\label{fig:interface_timing}
\end{center}
\end{figure*}

The timing when receiving is depicted in Fig.~\ref{subfig:rx_timing}:
\begin{enumerate}
\item IP checks that rx FIFO is not empty
\item IP captures data, command, and addr/av
\item IP sets read\_enable=1 for one clk cycle
\end{enumerate}

Notes on signal timing:
\begin{enumerate}
\item It is very easy to write/read on every other cycle
\item It is almost as easy to write/read on every cycle. This needs a bit more care with checking empty and full
\item IP may keep we=1 and re=1 continuously and just change/store data according to full/empty
\item Signal FIFO full comes from a register. It goes high on the next cycle after the write, if at all.
In the Tx example, writing value 0xacdc filled the FIFO
\item Setting we=1 when FIFO is full has no effect
\item Setting re=1 when FIFO is empty has no effect
\item Received data, addr/av and command appear at the interface immediately if the FIFO was empty before. IP can use them directly. They are ``removed'' only when read enable is activated
  \begin{itemize}
  \item Checking empty==0 ensures validity
  \end{itemize}
\item Data and command values are undefined when FIFO is empty. Most likely the old values remain
\end{enumerate}

A simple example VHDL code can be found in SVN at /release\_1/lib/hw\_lib/ips/computation/image\_xor/tb/tb\_image\_xor\_linemaker.vhd. It shows how to send address and data. Fig.~\ref{fig:ip_fsm} shows the simple example FSM of the IP.

\begin{figure*}
\begin{center}
{\includegraphics[width=0.9\textwidth]{../Fig/Eps/fig_ip_fsm.eps}}
\caption{Example FSM of an IP}
\label{fig:ip_fsm}
\end{center}
\end{figure*}

Sometimes the output registers of the IP may cause unexpected behavior for novices. Even if the FIFO appears ``not full'', the IP cannot necessarily write new data. That happens if it was already writing and there was only one place left in the FIFO. Hence, remember to check if the IP is already writing! The following code snippet clarifies correct writing:

\begin{lstlisting}[language=vhdl, style=console, basicstyle=\footnotesize, title={Example code of IP's sending control}]
if (we_r = '1' and one_p_in = '1') or full_in = '1' then
  we_r <= '0';            -- FIFO is becoming or is already full
else
  we_r   <= '1';          -- There is room in FIFO
  data_r <= new_value;
end if;
\end{lstlisting}

The HIBI wrapper shows the data as soon as it comes from the bus. The same data might get used (counted) twice if the IP only checks the empty signal. Remember to check if the IP is already reading!
The following code snippet clarifies correct reading:

\begin{lstlisting}[language=vhdl, style=console, basicstyle=\footnotesize, title={Example code of IP's reception handling}]
if (re_r = '1' and one_d_in = '1') or empty_in = '1' then
  re_r <= '0';            -- Stop reading
else
  re_r <= '1';            -- Start or continue reading
end if;

if re_r = '1' then
  if hibi_av_in = '1' then
    null;                 -- handle the incoming address
  else
    null;                 -- handle the incoming data
  end if;
end if;
\end{lstlisting}

Common pitfalls:
\begin{itemize}
\item Not noticing that tx FIFO fills while writing. Consequence: some data are lost (not written to FIFO)
\item Write enable remains 1 for one cycle too long. Undefined data is written to FIFO, or the same data is written twice
  \begin{itemize}
  \item In both of the above, the likely cause is not accounting for the output register of the IP
  \end{itemize}
\item Not noticing that rx FIFO goes empty while reading. Data consumed by IP is undefined
\item Read enable remains 1 for one cycle too long. Next data is accidentally read away from the FIFO unless FIFO was empty
\item Not noticing that rx data changes only after the clock edge when re=1. IP uses the same data twice
\end{itemize}

\section{Arbitration}

A distinct feature in HIBI is that arbitration is distributed to wrappers, meaning that they can decide the correct time to access the bus by themselves. Therefore, no central arbiter is required. In practice, the bus is ``offered'' to one wrapper on each cycle. The wrapper reserves the bus using signal lock if it has data to send. Multiple policies are supported:
\begin{enumerate}
\item Fixed priority, Round-robin
\item Dynamically adaptive arbitration (DAA)
\item Time-division multiple access (TDMA)
\item Random
\item Combination of above
\end{enumerate}

A scheme called Dynamically Adaptive Arbitration (DAA) was presented in \cite{kulmala08b}. In most cases, designers should use round-robin or DAA. If there is a minor performance bottleneck, one can easily configure the arbitration parameters.
\begin{figure*}
\begin{center}
{\includegraphics[width=0.8\textwidth]{../Fig/Eps/fig_arb_example.eps}}
\caption{Example timing in 3 arbitration policies.}
\label{fig:arb_example}
\end{center}
\end{figure*}

Fig.~\ref{fig:arb_example} shows an example of different policies. A two-level arbitration scheme, a combination of time division multiple access (TDMA) and competition, is used in HIBI. In TDMA, time is divided into repeating time frames. Inside frames, agents are provided time slots when they are guaranteed access to the communication channel. This way the throughput of each wrapper can be guaranteed. The worst-case response time for a bus access through TDMA is the interval between adjacent time slots. TDMA in HIBI supports two flavors for handling the slots when there is no data to send: keeping them or releasing the bus for competition.

\begin{figure*}
\begin{center}
\subfigure[Low contention (send probability $\sim$4\% per agent).]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/fig_arb_recfg_lowcontention_v2.eps} \label{subfig:wave_arb_lowcont}}
\subfigure[High contention (send probability $\sim$30\% per agent).]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/fig_arb_recfg_highcontention_v2.eps} \label{subfig:wave_arb_highcont}}
\caption{Various arbitration schemes for 8-agent single bus and uniform random traffic. The differences become evident on a highly utilized bus.}
\label{fig:wave_arb}
\end{center}
\end{figure*}

Competition is based either on round-robin or non-pre-emptive priority arbitration. The second-level mechanism is used to arbitrate the unassigned or unused time slots. If the agent does not have anything to send at the beginning of its time slot, the time slot can be given away to allow maximal bus utilization. Priority arbitration as a second-level method attempts to guarantee a small latency for high-priority agents, whereas round-robin provides a fair arbitration scheme.
When the bus is freed and the priority scheme is utilized, the agent with the highest priority can reserve the bus on the first cycle. If the bus has been idle for two cycles, the agent with the second highest priority may reserve it, and so on.

The maximum transfer length is restricted with the runtime-configurable parameter $max\_send$. For round-robin, the maximum wait time for accessing the bus is obtained by summing all $max\_send$ values. For priority-based arbitration, the maximum wait time can be defined only for the two highest priorities. This means that the low-priority agents may suffer starvation and the system may end up in deadlock. Therefore, using only priority arbitration is not recommended.

\subsection{Detailed timing example}

Fig.~\ref{fig:wave_arb} shows the differences between various arbitration policies under two traffic loads (low and high contention). HIBI is configured as a single bus with 8 agents. Agent 0 performs dynamic reconfiguration (time instants i--v) and the other agents generate uniformly distributed random traffic. The reconfiguration changes the arbitration policy at runtime. The exact configuration procedure is explained in more detail later.
%in Section\ref{ch:hibi:reconf}.
The utilized arbitration policies are
\begin{enumerate}[i)]
\item round-robin
\item combination of priority and round-robin
\item priority
\item random
\item round-robin (again).
\end{enumerate}

Round-robin offers fair arbitration (each agent has its share), whereas priority favors the highest priority agents and leads to starvation of others. Their combination switches between them at user-defined intervals. Arbitration policy does not play a major role when the bus is lightly loaded, as illustrated in Fig.~\ref{subfig:wave_arb_lowcont}. The differences are clear with higher load, see Fig.~\ref{subfig:wave_arb_highcont}.
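
As a rough worked example of the round-robin bound stated earlier in this section, consider $N$ agents whose transfer lengths are limited to $max\_send_i$ words each. Assuming one word per bus cycle and ignoring address words and arbitration hand-over cycles (both are simplifying assumptions made here, not measured figures), the worst-case wait is approximately
\[
  T_{wait} \approx \sum_{i=1}^{N} max\_send_i \quad \mbox{[cycles]}.
\]
For instance, with the hypothetical values $N = 8$ and $max\_send_i = 20$ for every agent, $T_{wait} \approx 8 \cdot 20 = 160$ cycles, which corresponds to $3.2~\mu$s at a 50~MHz bus clock. Halving $max\_send$ would halve this bound, at the cost of more frequent arbitration.
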
\subsection{Performance implications}

\begin{figure*}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/gra_hibi_arb_rel_perf.eps}}
\caption{Relative performance of arbitration algorithms in MPEG-4 encoding \cite{kulmala08b}}
\label{fig:hibi_arb_rel_perf}
\end{center}
\end{figure*}

% !!! see also $http://ieeexplore.ieee.org/iel5/10626/33561/01594751.pdf$

Various arbitration methods of HIBI were compared in \cite{kulmala08b}. The test case was MPEG-4 encoding on an MPSoC. HIBI has 6 arbitrated components: 4 CPUs, SDRAM, and a performance monitor, all operating at 50~MHz frequency. The maximum transfer length was varied from 5 words (denoted as $tx=5$) to non-limited. Transfer length has a major impact, but all lengths of 50 words or over ($tx>49$) resulted in equal performance. The bus frequency was set to 1, 2, 5, or 50~MHz in order to achieve varying bus utilization ($75\%$, $56\%$, $26\%$, and $3\%$, respectively) with a single application. The best and worst algorithms vary case by case, but DAA performed well in general.

Fig.~\ref{fig:hibi_arb_rel_perf} plots the relative encoding performance between the worst and best algorithms. The curves denote different transfer lengths, and $1.0$ is the best algorithm for each case. Tx lengths over $49$ are joined for clarity because they yield practically the same results. With short transfers, the worst algorithm at 1~MHz HIBI ($75\%$ utilization) offers only $0.62\times$ the performance of the best, at 2~MHz $0.73\times$, at 5~MHz $0.98\times$, and at 50~MHz there are no differences.

\section{Commands}

The source IP sets the command, and most commands are forwarded to the receiving IP.
The most common commands are:
\begin{itemize}
\item Write data - regular send operation, so-called posted write
\item Read request - split-transaction, the requested data is returned later with a regular write command
\end{itemize}

The other, less common commands are:
\begin{itemize}
\item Idle - IPs never use this command, but it appears on the bus when no-one sends anything
\item High priority - bypasses normal data in the wrappers, otherwise just like regular operation, can be added to many commands
\item Write and read config - access the configuration memories inside the wrappers. Not forwarded to the IP at the receiving end
\item Multicast - send the same data to multiple targets (only in HIBI v.2)
\item Non-posted write - Receiver IP must provide some response (ACK or NACK) (v.3 only)
\item Linked read + conditional write - to perform read-modify-write (v.3 only)
\item Exclusive access - reserve the whole path to the destination, read, write, and remove the lock (v.3 only)
\end{itemize}

HIBI v.3 has 5 command bits and v.2 had only 3 bits, see Tables~\ref{table:hibi_v3_cmd} and~\ref{table:hibi_v2_cmd}.

\begin{table*}
\caption {The command codes in HIBI v.3}
\label{table:hibi_v3_cmd}
\begin{center}
\begin{tabular}{l | l | r |l}
\hline
Cmd & Code & Code & Meaning \\
 & [4:0] & [decimal]& \\
\hline \hline
idle & 0 0000 & 0 & Appears on the bus when it is free \\
<reserved> & 0 0001 & 1 & not used, most unused codes hidden from the table \\
wr data & 0 0010 & 2 & Regular write \\
wr data hi-prior & 0 0011 & 3 & - `` - w/ high priority \\
\hline
rd data & 0 0100 & 4 & Request of the split-transaction \\
rd data hi-prior & 0 0101 & 5 & - `` - w/ high priority \\
rd data linked & 0 0110 & 6 & \\
rd d. linked hi-p& 0 0111 & 7 & - `` - w/ high priority \\
\hline
wr data non-post & 0 1000 & 8 & Write that expects response\\
wr d. non-post hi-p& 0 1001 & 9 & - `` - w/ high priority \\
wr conditional & 0 1010 & 10 & Write that follows rd linked \\
wr cond.
hi-p & 0 1011 & 11 & - `` - w/ high priority \\
\hline
% <reserved> & 0 1100 & 12 & not used \\
excl. lock & 0 1101 & 13 & Locks the path to the destination \\
% <reserved> & 0 1110 & 14 & not used \\
excl. wr & 0 1111 & 15 & Exclusive write, must follow excl.lock \\
\hline
% <reserved> & 1 0000 & 16 & not used \\
excl. rd & 1 0001 & 17 & Exclusive read request, must follow excl.lock \\
% <reserved> & 1 0010 & 18 & not used \\
excl. release & 1 0011 & 19 & Removes the lock from the path\\
\hline
% <reserved> & 1 0100 & 20 & not used \\
wr config & 1 0101 & 21 & \\
% <reserved> & 1 0110 & 22 & not used \\
rd config & 1 0111 & 23 & \\
\hline
<reserved> & 1 1xxx & 24-31 & not used \\
\hline \hline
\end{tabular}
\end{center}
\end{table*}

\begin{table*}
\caption {The command codes in HIBI v.2}
\label{table:hibi_v2_cmd}
\begin{center}
\begin{tabular}{l | l | l}
\hline
Cmd & Code [2:0] & Meaning \\
\hline \hline
idle & 000 & Appears on the bus when it is free \\
wr config data & 001 & Updates config mem inside the wrapper \\
wr data & 010 & Regular write \\
wr data hi-prior & 011 & High-priority data bypasses the regular one \\
\hline
rd data & 100 & Request of the split-transaction \\
rd config data & 101 & Requests a value from wrapper's config mem \\
multicast data & 110 & Sends to all wrappers whose uppermost addr bits match \\
multicast config & 111 & Same as above for high-priority data\\
\hline
\end{tabular}
\end{center}
\end{table*}

\section {Buffering and signaling}

The model of computation used in the HIBI design approach assumes bounded first-in-first-out (FIFO) buffers between processes. A simple FIFO interface can be adapted to other interfaces such as the OCP (Open Core Protocol)~\cite{ocp03}.
% The basic principle of OCP is shown in
% Fig \ref{fig:hibi_ocp}.
% Transfers are initiated by $masters$ and $slaves$
% only respond to requests. The OCP transfers are translated to underlying
% network protocol, in this case HIBI, and back by OCP wrappers.
% \begin{figure} [t]
% \begin{center}
% \includegraphics[width=0.7\textwidth]{../Fig/Eps/fig_hibi_ocp.eps}
% \caption{Using OCP with HIBI. The OCP interface is located between IP and HIBI wrapper}
% \label{fig:hibi_ocp}
% \end{center}
% \end{figure}
Consequently, IP components use only the OCP protocol and are isolated from the actual network implementation. Ideally, the network can be chosen freely without affecting the IPs. However, not all features of HIBI, such as relative data priorities or dynamic reconfiguration, can be used with OCP directly, but only the basic transfers.

To avoid excess buffering or retransfers, the received data must be read from the FIFO as soon as possible, for example by using a direct memory access controller. As a result, the receiver buffer space is not dictated by the \emph{amount} of transferred data, but by the \emph{latency} of reading data from the wrapper. This scheme resembles wormhole routing, but the links are not reserved if the receiver is stalled.

\section {Configuration}

HIBI is both modular and configurable. At design time, structural and functional settings are made, whereas at run-time one can modify data transfer properties (arbitration types, wrapper-specific QoS settings). Fig.~\ref{fig:cfg_mem} shows the structure of the configuration memory.

\begin{figure}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_cfg_mem.eps}}
\caption{Structure of the wrapper's configuration memory}
\label{fig:cfg_mem}
\end{center}
\end{figure}

\subsection{Generic parameters in VHDL}

HIBI has a large set of generic parameters. They are categorized as follows:
\begin{enumerate}
\item Structural
  \begin{itemize}
  \item Widths of interface ports: data, command, debug port
  \item Widths of internal signals: address, wrapper identifier field, counters
  \item Sizes of tx and rx FIFOs, both lo and hi priorities
  \item Use 0, 2, 3 etc.
  \item Run-time configuration: number of cfg pages, number of app-specific extra registers
  \end{itemize}
\item Synchronization
  \begin{itemize}
  \item Type of the synchronizing FIFO buffers
  \item Relative frequencies of IP and bus
  \end{itemize}
\item Functional
  \begin{itemize}
  \item Identifier, own address
  \item For bridges: base identifier, inverted address space
  \item Arbitration: type, priority, how many words to send at one turn, number of agents in the same segment
  \item For TDMA: number of time slots, how to handle unused slots (keep/give away)
  \item Enable/disable multicast functionality
  \item Enable/disable runtime configuration functionality (affects structure=area as well)
  \end{itemize}
\end{enumerate}

Table~\ref{tab:generics} lists all the generics. Certain parameters are system-wide settings, for example the width of the command. Some are segment-wide, for example bus clock, data width, and number of wrappers in that segment. The rest are instance-specific, for example buffer sizes and priorities.

\begin{table*}
\caption{Properties of HIBI v.1 and v.2.}
\label{tab:generics}
\begin{center}
\includegraphics[width=0.95\textwidth]{../Fig/Eps/tab_generics.eps}
\end{center}
\end{table*}

\subsection{Clocking}

HIBI can support many clock domains. The border is either between IP and wrapper, or in the middle of a bridge. There are five options:
\begin{enumerate}
\item Fully synchronous
\item Synchronous multi-clk: Clock frequencies are integer multiples of each other. Clocks are in the same phase. Easy to use with FPGA's PLLs
\item GALS: No assumptions about relations (phase, speed) between clocks. Has longer synchronization latency than synchronous multi-clk.
\item Gray FIFO: FIFO depth limited to a power of two ($=2^n$)
\item Mixed clock pausible
\end{enumerate}
The method must be decided at synthesis time.

\subsection {Runtime reconfiguration}
\label{ch:hibi:reconf}

The wrapper has a config memory that stores all information for distributed arbitration.
It can be synthesized in many ways:
\begin{itemize}
\item Permanent: ROM, 1 page
\item Partial run-time configurable: ROM with several pages
\item Full run-time configurable: RAM, with pages
\item Kactus currently supports 1-page ROM
\end{itemize}

HIBI allows the runtime configuration of all arbitration parameters to maximize performance. To achieve this, one of the agents (e.g. a system controller CPU) writes the new configuration values to all wrappers. The configuration values are sent through the regular data lines. During normal operation, i.e. when the configuration is not changed, the controller CPU can perform its computation tasks. In the best case, other PEs can continue their transfers even if HIBI is being configured. However, some operations, such as swapping priorities of two wrappers, necessitate disabling other transfers momentarily.

The structure of the configuration memory is illustrated at the bottom of Fig.~\ref{fig:wrapper}. It includes multiple configuration pages for storing the parameter values, a register storing the number of the currently active page, a clock cycle counter, and logic that checks the start and end times of the time slots. The receive controller takes care of writing new configuration values, whereas the configuration values and time slot signals are fed to the transfer controller. Configuration values can be written to non-active pages before they are used, to minimize the risk of conflict when the configuration is performed.

For very regular traffic, the TDMA slots can be set to minimize the latency, i.e. a slot starts shortly after the data becomes available. For TDMA, each wrapper has an internal cycle counter to decide the correct times to access the bus. For this reason, wrappers in one bus segment must be synchronized. When data is produced with varying time intervals or quantities, the time slots cannot be optimally located.
By runtime reconfiguration, the cycle counters can be reset to an arbitrary clock cycle value within the time frame to keep time slots in the correct place with respect to data availability. Also the length and owner of the slots can be changed. The resynchronization can be triggered explicitly from software or automatically by a specific monitor unit, which monitors how effectively time slots are used and starts the reconfiguration if needed \cite{kangas02}.

A roughly 10\% improvement in HIBI v.1 throughput in video encoding due to dynamic reconfiguration was reported in \cite{lahtinen02}. Larger gains are expected when several applications are executed on a single platform. Reconfiguration was used in \cite{kulmala08b} to speed up the exploration on FPGA. It allowed notably fewer synthesis runs, each of which took several hours.

As a new feature in HIBI v.2, the second-level arbitration method can be changed at runtime between priority and round-robin, or both of them can be disabled. When the second-level arbitration is disabled, only the basic TDMA is used and the slot owner always reserves the bus for the whole allocated time slot. Similarly, only the second-level arbitration is utilized when no time slots are allocated.

\begin{figure*} [t]
\begin{center}
{\includegraphics[width=0.75\textwidth]{../Fig/Eps/fig_hibi_cfg_mem_wave.eps}}
\caption{Example of runtime configuration}
\label{fig:cfg_mem_wave}
\end{center}
\end{figure*}

In HIBI v.2, three methods are used to improve the configuration procedure. First, by making use of the bus nature, each common parameter can be broadcast to all wrappers. Second, enabling the reading of configuration values simplifies the procedure, as the whole configuration does not have to be stored in the configuring agent. Instead, the configuring agent can read the old parameter values to help determine the new ones.
Third, additional storage capacity for multiple parameter pages has been added to enable rapid change of all parameters. When a configuration page changes, all the parameters are updated immediately with one bus operation. It is possible to store a specific configuration for every application (phase) in its own configuration page to enable fast configuration switching.

% !!! See also the following; no reference can be found for who published it
% etc., so it probably cannot be used
%
% $http://www.eetasia.com/ARTICLES/2005JAN/B/2005JAN17_MPR_TA.pdf?SOURCES=DOWNLOAD$

Runtime reconfiguration is illustrated in Fig.~\ref{fig:cfg_mem_wave} for a 2-page configuration memory. Signals coming from the receive controller to the configuration memory (\textit{addr\_in, data\_in, we\_in}) are shown on top.
% with
% post-fix
% \emph{\_in}.
In the middle are the registers \textit{.prior, .n\_agents, .arb\_type, .max\_send} for both configuration pages (not all parameter registers are shown, for clarity). On the bottom are the signals from the memory to the transfer controller (\textit{prior\_out, n\_agents\_out, arb\_type\_out, max\_send\_out}). In the example, the first digit of the address defines the page and the two last digits define the parameter number.
\begin{enumerate}
\item The parameter registers for priority ($.prior$), arbitration type ($.arb\_type$), and maximum send amount ($.max\_send$) on the current page (page 1) are configured to values 5, 2, and 20, respectively.
\item Parameters on the inactive page are updated: priority is set to 4, arbitration type is changed from round-robin (0) to priority (1), and max\_send is increased to 30.
\item Page 2 is activated by writing value 2 to address 0x000. When the page is changed, all outputs to the transfer controller change immediately. Since the number of agents ($n\_agents$) changes to value 8, the wrapper with priority 9 cannot access the bus anymore. This way arbitration latency can be decreased if some agent is known to be idle.
\end{enumerate}

\section{Performance and resource usage}

\subsection{HIBI wrapper structure}

The resource usage of HIBI comes mainly from its wrappers. HIBI version 3 has three types of them: R1, R3 and R4. Figure~\ref{fig:r3_structure} shows how an R3 wrapper is constructed of multiplexers and an R1 wrapper, which itself has four separate FIFOs.

\begin{figure*}
\begin{center}
\includegraphics[width=0.9\textwidth]{../Fig/Eps/fig_r3_structure.eps}
\caption{HIBI R3 wrapper block diagram}
\label{fig:r3_structure}
\end{center}
\end{figure*}

\subsection{Resource usage}

The resource usage for individual HIBI wrappers was acquired from a SoC that was synthesized to an Arria II GX FPGA on an Arria II GX development board. The SoC had two HIBI components, each attached to an R3 HIBI wrapper. The size of the FIFOs in these wrappers was set to 4 words, which means $4 \cdot 32b = 128 b$ in each FIFO. Table~\ref{table:resource_usage} shows the combinatorial ALUT (adaptive LUT) counts and register counts of a wrapper. Both minimum and maximum values are reported since synthesis does not always produce exactly the same results. Area can be significantly reduced if the FIFOs are implemented as on-chip memories (M9K blocks in Arria II GX).

\begin{table*}
\caption {Resource usage of wrapper R3, with 32b data, multiplexed address and 5b command. \label{table:resource_usage} v.2 and v.3 }
\begin{center}
\begin{tabular}{l | l | r }
\hline \hline
Wrapper subblock & Unit & Value \\
\hline \hline
HIBI wrapper r3 & comb. ALUTs & 724-763 \\
 & registers & 1029-1168 \\
\hline
HIBI wrapper r1 & comb. ALUTs & 466-533 \\
 & registers & 825-935 \\
\hline
4-word FIFO & comb. ALUTs & 76-104 \\
 & registers & 155-167 \\
\hline
\end{tabular}
\end{center}
\end{table*}

Fig.~\ref{fig:chip_plannet} shows the resource usage layout on the FPGA as seen in the Chip Planner in Quartus II. The two wrappers are highlighted in blue.
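
The FIFO figures in Table~\ref{table:resource_usage} can be cross-checked with simple arithmetic. This is only an estimate under an assumption that the text does not state explicitly, namely that every FIFO word stores the 5-bit command and the address-valid bit next to the 32-bit data:
\[
  4 \cdot 32~\mathrm{b} = 128~\mathrm{b}, \qquad 4 \cdot (32 + 5 + 1)~\mathrm{b} = 152~\mathrm{b}.
\]
Under that assumption, the second value is close to the 155--167 registers reported for one 4-word FIFO, and the remainder would plausibly be pointers and status flags. For the four FIFOs of an R1 wrapper, the data storage alone amounts to $4 \cdot 128~\mathrm{b} = 512~\mathrm{b}$.
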
\begin{figure*}
\begin{center}
\includegraphics[width=0.4\textwidth]{../Fig/Eps/fig_chip_planner.eps}
\caption{HIBI R3 in Quartus' chip planner tool}
\label{fig:chip_plannet}
\end{center}
\end{figure*}

\subsection{Simulated performance}

The throughput was measured for a 32-bit, 200~MHz HIBI segment with two components, both of which were connected to the segment with an R3 wrapper. The sender transmitted a continuous stream of 1024 words to a single address. Maximum throughput is $200 MHz \cdot 32b =$ 800 MByte/s. Since the data and address buses are multiplexed together, the minimum time to send the stream would be 1025 cycles. Measured latency and throughput are shown in Fig.~\ref{fig:performance}. Both approach their theoretical limits as the FIFO depth increases.

\begin{figure*}
\begin{center}
\subfigure[Transfer latency in cycles. Theoretical minimum 1025 cycles (one cycle needed for address)]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/gra_latency_1024words.eps} \label{subfig:perf_latency}}
\subfigure[Throughput in MB/s. Theoretical max 800 MB/s]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/gra_throughput_1024words.eps} \label{subfig:perf_throughput}}
\caption{Performance with 1024-word transfers.}
\label{fig:performance}
\end{center}
\end{figure*}

\section{Usage examples}

An IP can connect directly to HIBI, but CPUs should use a DMA. It allows performing transfers in the background while the CPU is processing.

\subsection{Transmission with dual-port memory buffer and DMA controller}

Fig.~\ref{fig:dma_tx} shows the concept of how a CPU can send data using DMA.
\begin{enumerate}
\item CPU reserves buffer space from dual-port memory
\item CPU copies/writes data to dual-port memory
\item CPU configures DMA transfer: memory address, size of transfer, and destination IP-block's HIBI address (not local CPU address)
\item DMA reads data from dual-port memory and sends the data to the configured HIBI address
\end{enumerate}

\begin{figure}
\begin{center}
{\includegraphics[width=0.7\textwidth]{../Fig/Eps/fig_dma_tx.eps}}
\caption{Example of how a CPU sends using DMA.}
\label{fig:dma_tx}
\end{center}
\end{figure}

\subsection{Reception with dual-port memory buffer and DMA controller}

Fig.~\ref{fig:dma_rx} shows the concept of how a CPU can use DMA to copy received data into the local dual-port memory.
\begin{enumerate}
\item CPU reserves buffer space from dual-port memory
\item CPU configures DMA: memory address, size of transfer, and the HIBI address at which data is received
\item DMA copies the incoming data to DPRAM
\item DMA interrupts CPU when a configured number of words have been received
\item CPU knows that data is ready in memory and uses it/copies it to data memory
\end{enumerate}

\begin{figure}
\begin{center}
{\includegraphics[width=0.7\textwidth]{../Fig/Eps/fig_dma_rx.eps}}
\caption{Example of how a CPU receives data using DMA.}
\label{fig:dma_rx}
\end{center}
\end{figure}

Rx buffers are organized as channels. Fig.~\ref{fig:dma_rx_buffers} shows how DMA translates incoming HIBI addresses into addresses in the local memory. Only memory space limits how many buffers (channels) exist at the same time. Channels have implicit meanings that must be agreed upon:
\begin{enumerate}
\item Who (what IP-block or CPU) sends data to which channel, since otherwise the sender is not known (HIBI does not send sender ID in transfers).
\item Possible explicit meaning of a channel, like ``DCT transform Q-parameter''. Then, it is not that relevant who provides the data.
\end{enumerate}

\begin{figure}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_dma_rx_buffers.eps}}
\caption{Example mapping between incoming address and buffer in dual-port memory.}
\label{fig:dma_rx_buffers}
\end{center}
\end{figure}

\subsection{Example: use source specific addresses}

\begin{figure}
\begin{center}
{\includegraphics[width=0.5\textwidth]{../Fig/Eps/fig_src_specific_addr.eps}}
\caption{Example of how a CPU instructs the IP block where to put result data.}
\label{fig:src_specific_addr}
\end{center}
\end{figure}

Suppose the designer wishes to implement the following high-level sequence: ``HW IP-block A should send data to CPU after initialization''. The procedure to achieve this is
\begin{enumerate}
\item CPU sets the rx buffer address in its DMA block N2H2\_0
\item CPU sends that same address to A's IP-block specific configuration register
\item IP A now knows where to send data
\item CPU knows where the data arriving at that address comes from
\end{enumerate}
It is assumed that CPU and IP A know the data amount at design time. Otherwise, it must be agreed upon during initialization (that was omitted for clarity).

\subsection{SW interface to DMA}

There are low-level SW macros available that access the hardware registers of HIBI PE DMA (abbreviated as HPD). They implement a driver, but can also be used from user programs.

\begin{table*}
\caption {The SW macros for accessing the DMA controller's registers}
\label{table:dma_macros}
\begin{center}
\begin{tabular}{p{0.5\textwidth} | p{0.5\textwidth} }
\hline
Macro & Meaning \\
\hline \hline
void HPD\_CHAN\_CONF ( int channel, int mem\_addr, int rx\_addr, int amount, int* base ) & Configure HPD channels. After configuration, the specific channel is ready to receive amount of data at rx\_addr HIBI address. Received data is stored to mem\_addr in HPD address space. \\
\hline
void HPD\_SEND (int mem\_addr, int amount, int haddr, int* base) & Send amount of data from mem\_addr to haddr HIBI address.
mem\_addr is memory address in HPD address space. \\
\hline
void HPD\_READ (int mem\_addr, int amount, int haddr, int* base) & Send command to read amount of data from haddr HIBI address. \\
\hline
void HPD\_SEND\_MSG (int mem\_addr, int amount, int haddr, int* base) & Send amount of data from mem\_addr to haddr HIBI address as HIBI message. mem\_addr is memory address in HPD address space. \\
\hline
int HPD\_TX\_DONE(int* base) & Returns status of transmit operation. \\
\hline
void HPD\_CLEAR\_IRQ(int chan, int* base) & Clears IRQ of specific channel. \\
\hline
int HPD\_GET\_IRQ\_CHAN(int* base) & Returns the number of the channel that caused the interrupt. If no interrupt has occurred, returns -1. \\
\hline
\end{tabular}
\end{center}
\end{table*}

Notes: ``HPD'' is HIBI PE DMA (previously called Nios-to-HIBI 2, N2H2). ``Base'' is the base address of HIBI PE DMA in HIBI address space. ``Amount'' is the data amount in 32-bit words.

\begin{table*}
\caption {The SW functions for using the DMA}
\label{table:dma_functions}
\begin{center}
\begin{tabular}{p{0.5\textwidth} | p{0.5\textwidth} }
\hline
Function & Meaning \\
\hline \hline
void HIBI\_TX (uint8* pData, uint32 dataLen, uint32 destAddr, uint8 commType) & Send data over HIBI. pData is a pointer to the data, dataLen is the length of the data in bytes, destAddr is the destination HIBI address, commType is either HIBI\_TRANSFER\_TYPE\_DATA or HIBI\_TRANSFER\_TYPE\_MESSAGE. The differences to the lower-level macros are the automatic copying of memory to the HIBI PE DMA buffer and protection against simultaneous sending in different threads. \\
\hline
struct sN2H\_ChannelInfo* N2H\_ReserveChannel( int32 bufferSize, void* callbackFunc, bool handleInDsr, bool calledFromDsr, sint32 channelNum) & Reserve a channel for receiving data. bufferSize: size of the data to be received (bytes). callbackFunc: Function to call when the data arrives.
Prototype: function(uint8* pData, uint32 dataLen, uint32 receivedAddr). handleInDsr: Set to false. calledFromDsr: Set to false. channelNum: Channel that is waiting for incoming data. The complete address will be HIBI base address + channelNum. The difference to the lower-level macros is that the interrupt handler is provided by the HIBI driver, so a user function can be registered directly to handle the data. \\
\hline
\end{tabular}
\end{center}
\end{table*}

HIBI\_TX checks that the previous send operation is complete and calls the HPD\_SEND macro. Hence, it also runs macros HPD\_TX\_ADDR, TX\_AMOUNT, HIBI\_ADDR, TX\_COMM, and TX\_START. Finally, it releases the Tx channel.

The following example shows a data transfer between two CPUs, assuming the system in Fig.~\ref{subfig:dma_example}. Fig.~\ref{subfig:dma_seq_diag} shows the sequence diagram.

\begin{figure*}
\begin{center}
\subfigure[Example system.]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/fig_dma_example.eps} \label{subfig:dma_example}}
\subfigure[Sequence diagram.]{\includegraphics[width=0.85\textwidth]{../Fig/Eps/fig_dma_seq_diag.eps} \label{subfig:dma_seq_diag}}
\caption{Example of a data transfer between two CPUs using DMA.}
\label{fig:dma_example}
\end{center}
\end{figure*}

\section {Summary}

The most important properties of HIBI are summarized in Table~\ref{table:hibi_versions}. The HIBI network allows multiple topologies and utilizes distributed arbitration. The network is constructed by instantiating multiple wrapper components and connecting them together. The wrapper is modular, allowing good parameterization at design time and the possibility to reconfigure certain parameters of the network at runtime.

\begin{table*}
\caption{Properties of HIBI v.3}
\label{table:hibi_versions}
\begin{center}
\includegraphics[width=0.9\textwidth]{../Fig/Eps/tab_hibi_v3.eps}
\end{center}
\end{table*}

\setcounter{secnumdepth}{-1}
\bibliography{IEEEfull,hibi_datasheet_ref}
%\bibliography{hibi_datasheet_ref}
\bibliographystyle{IEEEtranS}
\end{document}