Code generating for the architecture with ILP

by zzeng on 2002-05-29
source: http://www.geocities.com/creta_ru/doc/VLIW-draft-eng.pdf

Code generating for the architecture with ILP Boris Muratshin (zzeng@mail.ru), Alexander Artiushin (alexnikart@mail.ru) May 5, The original source of this page is placed in http://www.geocities.com/creta_ru/doc/ Contents 1 Introduction 2 Goals 3 Theses 4 Problems 5 Global code optimization. 6 Trial realization. 7 Conclusions. 8 To do 9 Acknowledgment 1 Introduction In this paper, we are going to talk about VLIW 1 which is an alternative to the superscalar processing. This method implies that the instruction word contains a set of commands to be executed in parallel [5]. The main advantages of this scheme are: - a compiler is not strongly pressed for time in contrast to a superscalar nucleus and has a lot of resources. Hence, it is able to analyze dependencies between commands potentially more effectively and create a parallel code. - a VLIW processor is more simple. The main drawbacks are: of a compiler is not capable to predict control transfers if they are dependent on the data being calculated. - a task of generating a code for the VLIW processor is much more difi- cult than for the superscalar architecture and for the time being its good solution is not known. At present, a lot of processors which might be related to this architecture is manufactured. Probably, there is a couple of arguments for it. First, it is convenient (for different reasons, e.g. to make a design easier) to have a VLIW nucleus of the processor and a "wrapper" consisting of microcode. Second, ele- ments of VLIW can be used by high-eficient superscalar processors for granting to a user an opportunity to control parallelism. Why is the given architecture interesting for us? Because it is highly promis- ing. What does the current level of the technology development offer us? An- nounced by the Texas Instruments, the TMS320C641x VLIW processor works at the frequency of 600 Mhz and executes up to 6 instructions for a clock tick. Achievable frequency is already in the GHz range, and it should be noted that a VLIW processor due to the simplicity as compared to the superscalar one is tolerant to the potentially higher frequencies. If we had the compiler capable to load a processor on the average half, a processor with the real productivity 1800 MIPS, the cost of $10 and demanding no cooling due to the low energy consumption would come up. The question is not in manufacturing of such a processor, but just in the creation of the compiler. This paper describes an attempt to outline an approach to construction of an architecture which would make the most of internal parallelism and make it possible to inexpensively generate a code of good quality. 2 Goals The overall goal for us (in the given context) is the possibility of the construction of the effective final system. The subgoals are: - an effective processor - an effective compiler - suitable hardware support for the processor. 3 Theses 1. Resource scheduling in generating code for a VLIW processor is a NP - complete problem. 2. For super-scalar processors resources scheduling is also a NP -complete problem but for them there is a set of ready -to -use cheap heuristics, for example, existing programming languages. 3. It would be a good idea to explain why the old good heuristics do not work in case of the VLIW, e.g. because the changes of a code are non local or because of the dynamic cost of instructions. 4. The question is how (from what and for what architecture) the compilation should be performed to avoid a NP-complete problem. 5. Global is the least of all local minima of some function F(x1, x2, x(n)) in the certain region. In our case, global-optimal is that code the execution of which takes the shortest time for a certain algorithm and a certain set of the input data. 6. We reject searching a global minimum as well as the nearest local one either, instead, we try to rehash the task and simplify it the way searching a new global minimum would take polynomial time. The more so as a concept of algorithm is not formalizable and we always deal with concrete scripting of algorithm, what is essentially heuristics itself. Searching a necessary form of scripting of algorithm is also our task. 7. We try to express an algorithm as a network of data flows and solve a problem of the fastest way in a certain graph derived from this network, what can be done easily by greedy heuristics. 8. Providing a correct choice of arcs weights , it will not demand infinite re- sources for storage of intermediate results. I.e., although such an approach to solving the problem in the pure state also is not "a silver bullet" against the exponent and is in essence equivalent to the exhaustive enumeration of variants, it turns the problem from the category "no ideas what to do" into the category "it's clear where to search heuristics". 4 Problems The main problem is that a code optimization task has always been considered in the inappropriate for the VLIW context. The traditional microprocessor op- timizer believes that any changes in code are local. A simple example: having rearranged places of two independent instructions, we are most likely to notice nothing at all because in case of the superscalar architecture, the nucleus will retranslate our code to the internal representation, in other words, these in- structions will be executed just in reverse order. I.e. there is a possibility to generate a sketch of a code in hope to improve it later. It is necessary to note, that optimizing compilers have appeared at all relatively recently in connection with the increased power of processors. So, for example, the C language was designed not to be a high level lan- guage since a programmer had a set of possibilities of giving instructions to the compiler concerning register scheduling, fast arithmetics, etc. By the way, even if there is no need in optimization, we somehow have to explain to a compiler what exactly the code the compiler is now to produce should do. A concept of "a variable" became an appropriate solution here -simple, clear and well-tested by the generations of all who got used to calculating manually. In order to explain the task to the compiler, we pull the task on a skeleton of variables and actions with them and divide into a tree of subtasks, coding of leaves of which is not too complex. Growth of processor power made it physically possible to optimize a code. Platform-sensitive optimization is of little interest for us in view of the mentioned above non local changes of a code. Platform-independent optimization is a set of transformations improving structure of some pseudo-code, for example, from the three-address instructions resulted from introducing new temporary variables in addition to the variables already defined by a user. Sources of opti- mization can be: elimination of common subexpressions, duplication of copies, removal of an unreachable code, duplication of constants [4]. Using different techniques, after several passes, we receive a quite decent (without a shadow of sarcasm) code. With the development of technology, an opportunity to make processors complex appeared. Mechanical escalation of the number of functional devices (what now is referred to as VLIW) turned out unclaimed, and processors became not only complex, but also intelligent. Remaining externally the same, they got an internal life, began to optimize a code and arrange a code pipelining, the CISC -RISC wars happened -in a word there took place all that we've been observing recently with constantly growing surprise. While this semi-conductor revolution was blossoming, optimizing compilers kept on evolving slowly. Appearing were more and more new patterns, support of new architecture, the area of enumeration of possibilities was expanding: life went on slowly and easy. And in the meantime, the gap between rated power of systems and their real productivity has been increasing all the time (and increasing continually). Maintenance of the external interface for backward compatibility costs more and more expensively. Appearing technological opportunities are spent not directly for growth of productivity, but mostly for the disproportionate refinement of internal logic. The exit from this situation is not visible and it is possible to declare presence of a logic dead end. The clock frequency and density of elements on a crystal can be increased infinitely but the problem is not in the technological area. Computing system is essentially an indivisible organism. Dividing it into the subsystems interacting through interfaces is an action that is certainly necessary, one could even say compelled. It is a matter of fact, subsystems have been defined proceeding from the physical organization of system and this scheme is a little bit about to be exhausted. More promising seems dividing into logical subsystems, for example, "effective memory operation" means a full control and a mutual penetration of the compiler, the operating system, the processor, etc. In other words, in addition to the existing analytical techniques, it is time to think up synthetic ones. 5 Global code optimization. In the previous section, it may seem to be a contradiction: we reject that only thing that should excite us (machine-sensitive optimization) and try to get en- gaged in what works perfectly without us (machine-independent optimization). Alas, but the machine-independent pseudo-code generated by existing optimiz- ing procedures is unsuitable for us. Let's consider existing linear techniques of resource scheduling : 1. Assignment of registers by painting a graph. This method requires two passes. At the first pass, instructions of the target machine are selected the way as if we have an infinite set of symbolical registers, the names used in intermediate representation becoming in fact the names of registers, while three-address instructions becoming instructions of a machine language [4]. At the second pass, a register interaction graph is built, the nodes of which represent symbolical registers, and the arcs connect two nodes if at the moment of defining one of the registers, the second is alive. Then, an attempt is made to paint the graph using k colors, where k is equal to the actually accessible number of registers. Although this is a NP -complete problem, there exists at least one cheap heuristics. The vertexes of the graph having less than k arcs are removed iteratively until we get an empty graph (otherwise painting is considered as impossible and reloading of the rest vertexes with the number of arcs gt;&= k is performed). In case of the VLIW processor, we have to build and &paint graph using several scales at once -one scale per each set of &devices. For example, a scale of summators, a scale of comparators, a scale of memory loading paths. It is obvious, that these new scales are all of low capacity, and the number of them is large, so that it can turn out that we fail to paint almost any graph. And if there is the dependence between the register numbers and the devices, the given heuristics becomes absolutely inapplicable. 2. Code generating based on the labelled tree (DAG -Directed Acyclic Graph). DAG is a very convenient tool that is used to determine the optimal reordering of the calculation sequence. For some models of pro- cessors, simple algorithms for calculation of optimal reordering are known. E.g., for machines in which all calculations are made in registers and which instructions are either the register-register or register-memory operators, there is the linear with respect to the DAG size algorithm: all tree nodes are labelled with the value which is actually equal to the number of reg- isters needed in order to calculate this value. After that, recursively traversing the tree, we create a code taking into account the ratio of the labels of children of each node, the node's own label and their relation to the amount of actually available registers. This algorithm gives a code that is really optimal with respect to the storage of intermediate results in registers. In case of VLIW, a contribution of this component to the cost of a code is not so great. Of more importance is to track down conflicts between functional devices. Thus, application of this algorithm or its variants (generalizations) to a VLIW machine does not give us any guarantees that the code is optimal, or requires enumeration in some nodes what yields the exponential time. 3. Dynamic programming. The algorithm of dynamic programming divides a task of generating an optimal code for expression into subtasks of gener- ating an optimal code for sub-expressions. The algorithm consists of three phases. At the first stage, we assign each node of a tree to a cost vector of length r + 1, where r -number of registers on the target machine. The 0-th element of the vector is the cost of calculation of the given node in memory, the 1-st element is the cost at one accessible register, 2-nd -at two, etc. In the second phase of the algorithm, cost vectors are used to determine which nodes are to be calculated in memory. During the third phase, code generating actually takes place. It is not dificult to general- ize the given algorithm to the VLIW architecture, but, unfortunately, the initial assumption that the code for the whole expression made up from the optimal codes for the sub-expressions is also optimal is not true in this case. Proceeding from the aforesaid, it can be concluded, that the linear algorithm of resource scheduling for an even little practically interesting VLIW architec- ture is not known at present. So, the size of a pseudo-code tree should be as small as possible and we should allow the fragments of a locally optimal code in its nodes. Now a little bit about optimization as such. Global optimization of a discrete function in the absence of the a prior information is solved either by exhaustive searching or by optimization of the continuous model function pulled on the given discrete skeleton. The second way is unacceptable for us since the non- differentiable singularities (gaps, bends) strongly complicate the optimization process. As our function is very likely to consist of non-differentiable singular- ities plus strong noise, it leads to the presence of huge amount of useless local minima. For the same reason, techniques like "simulated annealing" will not help here as well. On the other hand, searching for an optimum using the most optimizable (O(1)) function can be very easily transformed into the task which can be solved only by exhaustive searching. It's enough just to shuofe input (or output) values with some unequivocal but externally unpredictable procedure. Let's note that in the presence of shuofing, the task remains to be optimized for O(1), it has only lost this quality for us. In connection with this, the first thing we'll try to do before starting to solve actually an exhaustive task, we'll try to clear it of everything not inherent to it. One more remark. If we for any reason (even for the most objective one) fail to simplify optimization significantly (i.e. to find polynomial algorithm), we can assume that for any degree N of a polynom there exists a task subclass to be optimized for O (n**N). Anyway, there always exists a trivial O(1) subclass of tables of answers. The question (as usual) is in the balance between the time/memory and the price we are ready to pay for the necessary degree and also in the quality of the results given by the generating heuristics. We'll be repeated, but we shall note, that no heuristics will help in a situation when there is a shuofing intermediary between us and a task. Now let's try to set the task. Here, for example, is a classical statement in N. Jushchenko's words[1]: There is some task of a user the Task. To solve this task with a computer, it is necessary to propose its solution algorithm. For the same task, generally speaking, any number of algorithms can exist. Although there have been made attempts to define the concept of "algorithm" formally, in given case it will result in loss of the generality. Therefore, instead of acceptance of the formalism "al- gorithm", we accept the formalism "a way of scripting of algo- rithm". A way of scripting of algorithm is such a presentation of algorithm so that the execution of the algorithm becomes de- fined with mathematical accuracy. For example, a programming language is a way of scripting of algorithm. Thus: - to the Task task, corresponds a set of algorithms: Alg1..., AlgN... - there is a set of ways of scripting of algorithms: Lang1... LangM... - i.e. a specific solution of the Task which a user can write down can be - labelled as Prog (AlgI, LangJ). Naturally, the concrete scripting way LangJ may turn out to be incapable to express all AlgI. For example, if the scripting way is based on the graph terminology, then those Alg which are ex- pressed differently, can not be written down. But it is necessary to tell a scripting way of algorithm apart from an algorithm itself in order to formalize the concept of translation. Translation Comp is a reflection of the algorithm that has been written down with the scripting way LangI to the concrete ma- chine program. It is the result of translation, for which the exe- cution time in clock ticks is counted. So, what are the degrees of freedom with respect to which one can try "to optimize" the transformation Task -> Code ? Code = Code (Alg, Lang, Comp ). I.e. it is possible, independently from each other: - to change the algorithm - to change the language in which the same algorithm is written down - to change the algorithm of translation from the concrete language. With what here it would be desirable not to agree? Really, the concept of algorithm is not formalizable. However, existence of any of (Alg1..., Algn) already means its some scripting way. If I can present algorithm, there neces- sarily exists some virtual machine I mentally apply the algorithm to. It can be something like a processor with infinite number of registers or a stack processor, somebody (let's suppose) would have even Tewring machine in any case it is the consecutive virtual machine. Here there's nothing to be done. Regardless of the parallelism of the orga- nization of our brain, we think mainly consecutively (what is not bad itself). Without making special efforts, a (one-dimensional) process of serialization on a (two-dimensional) paper of a generally three-dimensional algorithm image will inevitably result in the distortion of the initial plan, what is highly undesirable for us. Hence, there should exist a scripting way without loss of internal paral- lelism, and the obvious candidate for it (with some changes) is the concept of basic blocks. Traditionally, "the sequence of adjacent instructions into which the control flow enters at their beginning and leaves at the end without program stop (halt) and possible branching (with except of the end of the block) is called as a basic block"[4]. Basic blocks are used as construction elements of Control Flow Graph. The problem of the control flow analysis includes determining of properties of control transfer between operators of the program. Examination of many properties of this sort is necessary in order to solve problems of optimization, transformations of programs, etc. Control flow graph is a directional graph with two dedicated vertexes: input and output , such, that - no arc comes in to the input - no arc goes out the output - any given vertex belongs to at least one path from the input to the output. Data flow analysis is a set of problems targeted to elucidating some global properties of the program, i.e. extraction of the information on behavior of those or other constructions in some context. Such a statement of the problem is possible for the reason that the programming language and the computing environment determine some general, "safe" semantics of constructions which is suitable "for all occasions". Taking into account contextual conditions allows to make more concrete, particular conclusions about behavior of one or another construction; such conclusions, generally speaking, stop being correct in the other context. For example, general semantics of assignement implies calculation of the right hand expression followed by assigning its value to the left hand variable. However, in case when the right hand expression has no side effects and the left hand variable is not used anywhere else, the given operator becomes equivalent to the empty operator[3]. We will make an attempt to put together the above mentioned flows and "to synthesize" a whole vision of the problem. We'll try to extend the definition of the basic block this way: each basic block, except trivial input and output pseudo-blocks, can have no less than one entrance and any number of exits, some of the latter can branch. At present it doesn't matter for us what is inside of blocks, it is important only that: - we consider that code of blocks is optimal - we consider that code of blocks is valid and correct - we can get only to the beginning - the block of all resulting values are ready by the moment of any exit from the block. "Consider now the mapping DU which assigns each output to some subset of the set of inputs. Then for each operator" (for each basic block w -in this paper, but not in the original) "we define sets def(w) and use(w) which are its set of outputs and set of those outputs of other operators images of which at mapping DU contain at least one input of the operator w."[3]. /* Now consider a mapping DU, which maps each output to a subset of inputs set. For the operator ?, the set def(?) is defined as output(?), and use(?)-as a set all outputs of other operators, DU images of which contain at least one input of ? operator. */ Such a presentation allows to trace data flows in the program no matter how these data are transformed. Sets of inputs and outputs correspond to the intuitive assumption about arguments and results of some operators, the mapping DU just describing how the results produced by operators are used. The arcs connecting the block (elements of the DU) now pass the control simultaneously with the data. Branching of the output of the block means data and control transfer in one of possible directions. Obvious analogy here is a network of pipes between some devices, branching of an output being a switch. Besides, we let presence of dead-end blocks which are the blocks that can be reached from the input and have no outputs. Such a code is not considered dead, but, obviously, the end of such a block should be processed specially. As an example, let's consider the task of summation of eight numbers using a pyramidal pattern. The input pseudo-block contains 8 outputs which are the numbers to be summed up. The output pseudo-block has one input which is the sum. Seven blocks with different names, but with the same contents, each of which has two inputs (left and right) and one output -the sum, are connected among themselves in such a manner that they form the turned upside-down pyramid. Note, that obtained resulting code will realize exactly this pattern and this is on the one hand a minus since under some conditions this pattern can turn out to be not the most effective, on the other hand this is an obvious plus since the compiler makes exactly what's been told, i.e. the compiler is predictable. And, as we'll try to show below, the most suitable pattern can be selected easier with other methods. What does such a presentation of algorithm give us? - it is intuitively clear of it is complete, i.e. any algorithm can be defined this way - it is free of lexical and syntactic analysis problems - it has natural parallelism -any two independent (excluding such blocks, that there is no path from any output of any of them to the input of the second) blocks are considered as being executed potentially in parallel - it is visual -a programmer(developer) literally sees the graph of the - algorithm, and it involuntarily forces him to search for more "beautiful"solution, what was determined implicitly by means of the programming language earlier and building of what cost the compiler the certain work, now is given directly by the developer of the algorithm and, we hope, is at least of no less eficacy - algorithm scripting can be easily verified with respect to correctness - existing algorithms of machine-independent optimization, for example, simplification of cycles are applicable to it. How are we going to create a code for such a presentation of algorithms? As we have already seen, existing algorithms of register scheduling operate on highly limited models of target architecture. On other architecture, there is no guarantee that the code will be optimal. And if for superscalar architecture such an approach yields more or less acceptable results (especially for linear time), in case of the VLIW such a method does not work mainly because we get a cost of the majority of instructions only at the moment of scripting, we do not have an opportunity by means of one or another method to label a tree and then using this labelling to generate a code -all should be made for one pass. So, let's begin. On the basis of the graph G of the algorithm, we build a graph of variants GX derived from G. The graph GX as well as the initial graph G is a directed graph with two dedicated vertexes input and output , such, that - no arc comes in to the input - no arc goes out of the output We have - blocks[]-a set of algorithm description blocks - def[blocks[I], ]-a set of outputs of the block I - use[blocks[I], ]-a set of inputs of the block I - DU[def[blocks[I], II], def[blocks[J],JJ]] -a connectivity matrix of the algorithm description

1. procedure explore_vertex ( the current vertex, the context - the list of all ready inputs) 
2. begin
3. while def[the current vertex, I] != null 
4. find the block connected with the data through DU 
5. notify this block that the appropriate input is ready, mark the input in the context 
6. if this correlative block has all inputs ready
7. insert into the GX a new vertex -this block; 
8. insert into the GX a new arc -the current vertex -> the given block; 
9. call explore_vertex (correlative block, the context); 
10. while there is a block with all inputs ready 
11. insert into the GX a new vertex -this block; 
12. insert into the GX a new arc -the current vertex -> the given block; 
13. call explore_vertex (the found block, the context); 
14. end ; 
15. insert into
the GX the vertex __input__; 
16. call explore_vertex (__input__, an emptycontext);

In fact, each time adding vertex (block) to the graph GX, we add to it recursively all completed blocks. The number of the vertexes of the GX is an exponent of the size of the initial graph G. In such a graph, any path from the input to the output represents some path of code generating, but our task is to find the shortest path what can be done easily by greedy algorithms, for example, by Dijkstra algorithm. The most important problem here is the selection of weights of the graph GX arcs -in the simplest variant, when weight of an arc is equal to a change of the current length of the code in ticks, searching for the shortest path will come to a trivial enumerating of the most part of variants since intermediate lengths of paths are distributed very densely, even presence of some rough conflicts of instructions will not result in the deviation of variant beyond the limits of consideration. Having made a successful choice of system of weights, it is possible to reach the finish, having touched only small part of possible variants. The following variant is proposed for consideration: 1. weight of the arc (i.e. the cost of addition of a code of some block to already available) is determined dynamically -directly at that moment when we try to make it 2. path length is not additive, but multiplicative -we are not to add , but to multiply lengths of arcs, however if we still assume that we go along Dijkstra algorithm, the length of the arc is determined by multiplying the current length of the path by some penalty coeficient of a new block coding. It is necessary to mention another problem which, fortunately, has a simple solution. When searching for the shortest path in GX, we can experience a situation of a choice between absolutely identical continuations. For example, in the above mentioned summation, at the very beginning, we have an opportu- nity to code any of blocks add1, add2, add3, add4 -due to the symmetry of the graph these variants are absolutely equivalent. Similarly, equivalent are adds1 and adds2. Having removed trivial branchings, we could reduce the number of variants at least by a factor of 4*3*2*2. The solution seems simple enough -to each vertex of the graph of the algorithm, we assign some metrics obtained from the description of the paths achievable from the given vertex by any stable way (here we mean stability of the metrics of the vertex to isomorphic transforma- tions of graph G). Then, by analyzing neighbors of any vertexes, we begin to incarnate only those branches which lead to vertexes with different metrics. So far, we considered basic blocks as given. Their contents did not interest us, we only assumed that it was correct and somehow optimal. But if we have really mentioned that we are going to charge penalties for coding of blocks, it is time to engage in their contents closely. It's obviously that blocks consist of instructions. It is also not less obvious that these instructions can not be static because it's impossible to have blocks for all occasions, moreover, nobody can make a programmer operate with millions of blocks. So, contents of the block is a pattern of a code with a set of parameters. Parameters can be used, for example, to set the actual numbers of registers. Thus, compiler's duty would consist in attaching blocks through parameters. This variant is simple enough, but not too flexible, therefore, just in order to add a variety, we'll split the concept of the block into two. 1. a type of block -in the above-stated example of summation, all blocks were of the same type, a definition of type must include - a name, using these names, the programmer will build a graph of the algorithm - a list of inputs of the block -just names in order to define DU - a list of outputs of the block -similarly, just names in order to define DU, some of outputs can branch. 2. implementation of the block -the variant of coding, contains -a name of the type of the block -a list of parameters (variables) which the compiler must define before coding given implementation - a list of inputs with the names equal to the names of inputs of the type, but with the definition of the optionally parameterized place where it is expected to find the data for this input. For example, if we have defined parameter x and we have a pool of registers A, such place can be A[x] - a similar list of outputs with the names from the description of the type and with the definition of the places where there will be the results of the execution of the code of the implementation - a code itself -is a list of instructions with optionally parameterized arguments. All instructions are considered consecutive, the compiler making them parallel if it's possible. In other words, when defining a graph of the algorithm, the programmer operates with the types of blocks, whereas the compiler, when generating a code, tries to substitute all available implementations of the blocks of the given type. For example, a block of division of two variables can have following implementations (it's a joke): - hardware division (if it's available) - Newton -Raffson approximation - taking the logarithm/subtraction/taking the antilogarithm. It should be noted, that the graph GX now must be redefined. In both places (lines 7 and 11) instead of insert into GX a vertex -this block; insert into GX an arc -the current vertex -> the given block; call explore_vertex (the found block, the context); must be: while implementation of the block! = null insert into GX a vertex -this implementation; insert into GX an arc: the current vertex->given implementation; call explore_vertex (the found implementation, the context); This resulting graph will be called as GXX. 6 Trial realization. At first, we have to create a universal as far as possible description of the target machine architecture. First of all, we define places where there can be data. They are: - registers - memory block(s), for example, if memory is spliced - devices of the processor -every possible adders, multipliers, memory paths... - special memory, for example ports of periphery -cache-memory of the all levels Let's note that such places can be both scalar (ports, devices of the processor) and vector (memory).

pool { 
    comment="Unit_L1 First memory gateway"; 
    name="L1"; range={0-0}; 
}; pool { 
    comment="Unit_L2 Second memory gateway"; 
    name="L2"; range={0-0}; 
}; pool {
    comment="Unit_S1 First summator";
    name="S1"; range={0-0}; 
}; pool { 
    comment="Unit_S2 Second summator"; 
    name="S2"; range={0-0}; 
}; pool { 
    comment="Register_pool_A"; 
    name="A"; range={0-15}; 
}; pool { 
    comment="Register_pool_B"; 
    name="B"; range={0-15}; 
}; pool { 
    comment="Memory_pool"; 
    name="MEM"; 
    range={0-100000000}; 
};

Attributes of the pool tag are obvious: name -an identifier, comment a comment, range from ... range to -bounds of the allowable interval of the vector, if they are equal -then a scalar Now our task is to describe atomic operations of the processor.

cop { 
    name="ADD"; comment="ADD via S1"; 
    time=7; 
    inargs={ A = "1111111"}; 
    outargs={A = "1111111"}; 
    units={S1[0] = "1111111"}; 
}; cop { 
    name="ADD"; 
    comment="ADD via S2"; 
    time=7; 
    inargs={ B = "1111111"}; 
    outargs={B = "1111111"}; 
    units={S2[0] = "1111111"}; 
}; cop { 
    name="ADD"; 
    comment="cross ADD via S1"; 
    time=9; inargs={ B = "111111111" }; 
    outargs={A = "111111111"}; 
    units={S1[0] = "111111111"}; 
}; cop { 
    name="ADD"; 
    comment="cross ADD via S2"; 
    time=9; inargs={ A = "111111111" }; 
    outargs={B = "111111111"}; 
    units={S2[0] = "111111111"}; 
}; cop { 
    name="LOAD"; 
    comment="memory word load via L1"; 
    time=9; 
    inargs={ MEM = "111111111"}; 
    outargs={A = "111111111"}; 
    units={L1[0] = "111111111"}; 
}; cop { 
    name="LOAD"; 
    comment="memory word load via L2"; 
    time=9; 
    inargs={ MEM = "111111111"}; 
    outargs={B = "111111111"}; 
    units={L2[0] = "111111111"}; 
}; cop { 
    name="STORE"; 
    comment="memory word store via L1"; 
    time=9; 
    inargs={ A = "111111111" }; 
    outargs={MEM = "111111111"}; 
    units={L1[0] = "111111111"}; 
}; cop { 
    name="STORE"; 
    comment="memory word store via L2"; 
    time=9; 
    inargs={ B = "111111111"}; 
    outargs={MEM = "111111111"}; 
    units={L2[0] = "111111111"}; 
}; cop { 
    name="MOV"; 
    comment="register copying in A"; 
    time=1; 
    inargs={ A = "1"}; 
    outargs={A = "1" }; 
}; cop { 
    name="MOV"; 
    comment="register copying in B"; 
    time=1; 
    inargs={ B = "1"}; 
    outargs={B = "1" }; 
}; cop { 
    name="MOV"; 
    comment="register copying from A to B"; 
    time=2; 
    inargs={ A = "11"}; 
    outargs={B = "11"}; 
}; cop { 
    name="MOV"; 
    comment="register copying from B to A"; 
    time=2; 
    inargs={ B = "11" }; 
    outargs={A = "11"}; 
};

Attributes of the cop tag: name -an identifier, comment -a comment, time -a formal time in clock ticks required for execution. The inargs, outargs and units attributes describe an argument of the instruction, distinctions between them consisting in that the argument described as unit does not demand to be referenced in the code, while the outargs argument in contrast to the inargs one keeps a cell occupied and the cell can not be used by a compiler within a current block without being explicitly referenced. A description of an argument contains a name=value pair where the name corresponds to the pool tag name and the value is a 0/1 bit string of length specified by the field time, where 1 means that the given device at the given tick is busy (now for simplicity we assume that all devices are busy all the time during execution of the instruction). Now it's the most interesting thing -the basic blocks: Types

block { comment="plus"; 
    name="plus"; 
    input {left, right}; 
    output {out}; }; 
    block { comment="bind"; name="bind"; input {in}; output {out}; 
};

and the implementations:

implementation { 
    comment="plus";
     name="plus";
     var=y,x,z;
     input { left=A(x);
     right=A(y);
    };
     output { out=A(z);
    };
     code { ADD A(x) A(y);
     MOV A(y) A(z);
    };
}; implementation { comment="plus";
     name="plus";
     var=y,x,z;
     input { left=B(x);
     right=B(y);
    };
     output { out=B(z);
    };
     code { ADD B(x) B(y);
     MOV B(y) B(z);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=A(x);
    };
     output { out=A(y);
    };
     code { MOV A(x) A(y);
     };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=B(x);
    };
     output { out=B(y);
    };
     code { MOV B(x) B(y);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=A(x);
    };
     output { out=B(y);
    };
     code { MOV A(x) B(y);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=B(x);
     };
     output { out=A(y);
    };
     code { MOV B(x) A(y);
     };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=MEM(x);
    };
     output { out=A(y);
    };
     code { LOAD MEM(x) A(y);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=MEM(x);
    };
     output { out=B(y);
    };
     code { LOAD MEM(x) B(y);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=B(x);
    };
     output { out=MEM(y);
    };
     code { STORE B(x) MEM(y);
    };
}; implementation { comment="bind";
     name="bind";
     var=y,x;
     input { in=A(x);
    };
     output { out=MEM(y);
    };
     code { STORE A(x) MEM(y);
    };
};

The attributes name and comment keep their meanings they had earlier. Note that the attribute name is not unique any longer and this makes branching be possible. The block externally is a black box with a set of inputs and outputs. Both inputs (input tag) and outputs (output tag) are data cells, what means that there is a need to have internal parameters. A parameter (var tag) is an index of a certain cell of one or another pool and can be associated not only with the inputs/outputs, but also be used in a definition of the internal life of the block. To the moment when an implementation of the block takes place, all parameters for inputs should be defined. However, if extra parameters, not only for inputs, are defined, this is an additional place of branching. In our simple task the only block has been used -the adder, albeit there is a set of blocks with implicitly predefined names (for example "BIND"), which are meant for moving data from the output of one block to the input of the other block, and also "JUMP", arising when there is a need to merge branches of code that have been born by branching of the output of the block (in the given example, there are no such outputs): Our case is the simplest one, but even here there is a lot of variants of transfers. Actually, there should be much more variants, including double-thread ones, etc. The amount of the variants will not have a significant effect on speed of searching since short variants will be processed in the first place. Long variants will be processed if only resources are not in use and if for some reason such a path is cheaper. It is necessary to pay especial attention to a case of non- homogeneous memory. We've already told that sometimes processors have the spliced memory. Sometimes memory pools have different access times because of their physical organization -ROM, FLASH..., sometimes, it is allowed to take addresses of the cache-memory -we are able to describe all these situations easily in our terms. Cache-memory in case of the VLIW processor deserves separate words we'll tell later. Now it's time to describe the algorithm.

item {
name="add1";
     block=plus;
    };
     item { name="add2";
     block=plus;
    };
     item { name="add3";
     block=plus;
    };
     item { name="add4";
     block=plus;
    };
     item { name="adds1";
     block=plus;
    };
     item { name="adds2";
     block=plus;
    };
     item { name="addall";
     block=plus;
    };

There are seven blocks: the first four in pairs sum up our eight numbers, the others -intermediate results. We shall note that the name attribute is the unique identifier again, the block attribute referring to the group identifier of the block. Now it's the turn of the inter-block connections.

link { from=in_1;
     to=add1(left);
    };
     link { from=in_2;
     to=add1(right);
    };
     link { from=in_3;
     to=add2(left);
    };
     link { from=in_4;
     to=add2(right);
    };
     link { from=add1(out);
     to=adds1(left);
    };
     link { from=add2(out);
     to=adds1(right);
    };
     link { from=adds1(out);
     to=addall(left);
    };
     link { from=in_5;
     to=add3(left);
    };
     link { from=in_6;
     to=add3(right);
    };
     link { from=in_7;
     to=add4(left);
    };
     link { from=in_8;
     to=add4(right);
    };
     link { from=add3(out);
     to=adds2(left);
    };
     link { from=add4(out);
     to=adds2(right);
    };
     link { from=adds2(out);
     to=addall(right);
    };
     
link { from=addall(out);
     to=out_1;
    };

where the "from" tag refers to the certain output of the certain type of the block, and the "to" tag -to the certain input of the certain type of the block. Some of these connections refer to the inputs of the algorithm:

input { in_1=MEM(0);
     in_2=MEM(1);
     in_3=MEM(2);
     in_4=MEM(3);
     in_5=MEM(4);
     in_6=MEM(5);
     in_7=MEM(6);
     in_8=MEM(7);
};

and to the outputs:

output { 
    out_1=MEM(1);     
};

The result of application of Dijkstra algorithm to the graph GXX in our task gives the following result: 2 : 2 http://www.geocities.com/creta ru/samples/test.html 7 Conclusions. What conclusions can be made from this example? -We obtained the code, which calculates what we wanted. -This code is really the fastest in the context of the task under consideration. - With respect to the absolute (global) value, this code is not the fastest, in the given situation we lost on the second instruction of the block SUM - it might be possible to do without it, but as this block was defined this way -there's nothing to be done - There is a lot of trivial branchings, however, it's possible to fightagainst them, for example, by introducing signatures of states or by combining parallel devices into the pools: so, instead of L1 and L2 to have the pool L[0...1], thereby having reduced a number of declared types of blocks and, hence, branchings - Approximately equivalent branches could be a great problem for us, e.g., existence of a large amount of subexpressions with near resource consumption would yield to an intensive enumeration with small effectiveness of the target code acceleration. This problem can be solved in two ways -by heuristically calculating the weights of the subexpressions or by giving the recommendation to the end -programmer to avoid such code style e.g., by extracting such constructions as separate "super-blocks". - Our compiler is not tied to the certain architecture, it can be tuned up externally (we believe) to any of them. - a user can by himself supplement and expand the library of blocks and even carry out fine adjustment to his system (recollect here about the spliced memory). - it's said nowhere that it's possible to create a code only for VLIW -It's possible to generate a code for a cluster consisting of several processors if they are rigidly synchronized. The keyword here is the predictability, we must be confident that all events occur exactly when we expect. - A question what to do in case of the resource shortage still remains open in part. The natural way which is just to cut off such branches in hope that there will be found the path where available amount of lacking resources, for example, registers will be enough to realize this scheme, leaves many questions. Such an approach seems strategically correct although this is still an object to think about. - In this sense, cache-memory is at least useless for us, we will not get any acceleration since we'll always be waiting for the worse case. Multilevel cache-memory in its present kind seems to be an embodiment of pow- erlessness and inability to dispose of available resources. It's dificult to imagine such an algorithm which really needs hundreds of kilobytes of fast memory. For realization of algorithm that can be written and tested by a human being, several dozens of words seem to be suficient. If there is a constant need to process an array of significant size, to organize parallel transport of the data is much more eficient. Therefore, in our case, it is much better to have very fast addressed memory rather than large memory, that can be also multilevel, i.e. there are several pools of memory, each one with its access time. Which preconditions should be created for the given scheme to work? 1. A line of independent devices is necessary, which are capable to process data independently. There will be no sense in this scheme, if the L1 and L2 carry out all jobs dealing with moving. It's possible for the processor not to know in advance the future configuration of the fast memory. Neither does the compiler, however, we have an op- portunity to make the above mentioned "fine adjustment" to certain conditions. But as a whole, a triple alliance between the compiler, the processor and the controller of memory (periphery) is necessary. The processor should be able to distinguish transfers of the data and redirect them to different devices, a part of the laters can be out- side of the processor. Also it is necessary to allow memory-memory movings. 2. The compiler should be absolutely confident, that the data which the compiler has put to the fast memory, have no ways to get out of there. In other words, at context switching to another task, the data from the fast memory should be stored and be restored at switching back to the first thread. Variants with the page organization of fast mem- ory and storing only in case of lack of available pages are possible, storing/restoring can be carried out in parallel and this is necessary anyway since, let's repeat, the compiler should be confident in the data. - The amount of the calculations in order to generate a code is rated as big but not too excessive. - It is possible to limit the amount of the calculations if to build the big algorithms from the small sub-algorithms using them as original blocks. In other words, having written and having compiled some scheme, it is possible to save some of the most successful variants as parameterized high-level blocks to the block/implementation library. Thus, it is possible to build hierarchically complex algorithms, having reduced significantly the capacity of the enumeration. Note that such super-blocks at the absence of internal transfers still can overlap and be executed in parallel. - The technique of "register lifetime splitting" can be realized naturally when a given register (any device of the processor) can be shared simultaneously by several instructions, for example, if the number after multiplication should be assigned to the register, but this will happen only in some clocks after the beginning of the instruction, it's very tempting to use the idle register somewhere else. The mask attribute of the cop tag is designated for this purpose, but we have not shown its usage because such a code is heavy for perception. - In case of use of this technique, usual -cache-memory is not only useless, it is fatal for the compiler since nobody has rights to return a result faster before having been asked for it, what if this place is still busy? - an interesting point -jumps/calls. In order for the control transfer to be realized, it is necessary to introduce a synchronization of the processor a place where interruption without loss of the data is possible. At this place, there should be no instruction overlapping and the compiler begins with a "clean slate". 8 To do Currently realized trial compiler is able to encode static expressions as well as expressions with simple branching. The shortest path searching algorithm uses no heuristics and is based on the ordinal exhaustive procedure. So the primary(immediate/top-priority) tasks are: - cycle business debugging, because the effective short cycle arranging was one of the most motivating impulses to start this work - trivial branches cutting - debugging of the heuristical branche cutting technique at the shortest path search - different heuristics against different algorithms testing - creation of the visual tool for developing algorithm graph Perhaps, when you are reading this paper, something from this list is already realized and everyone who had the courage to reach this line may fill free to contact authors with any concerned questions 9 Acknowledgment Vasily Turan deserves a sincere thanks for help with realization of these ideas, for his uncompromising fight against our bugs as well as for his excellent common sense. It's impossible not to express gratitude to Nikita Youschenko for his long patience and firm scepticism which were so useful in our deliverance from excessive optimism and naive illusions. And if you are reading the English variant of this article, this is mostly the Andrey Tananakin's contribution, who saved us a lot of precious time and hair. References [1] Nikita V. Youshchenko (Nikita V. Youshchenko, yoush@cs.msu.su), private contacts [2] Jean Noel and Charles Trullemans "MOMO : Scheduling Method for Large Data Flow Graph". [www.dice.ucl.ac.be/anmarie/patmos/papers/S2/2 3.pdf] [3] Terehov A.A. etc. "Development of compilers for .NET" [se.math.spbu.ru/ courses/dotNET%20Compiler%20Engineering/Lectures/ 11%20Opti- mization.doc], [research.microsoft.com/programs/europe/Curriculum/ Compiler%20Engineering/Lectures/11.%20Optimization.doc] [4] A.V. Aho, R. Sethi, J.D. Ullman "Compilers: Principles, techniques

Browse

Tools

Code generating for the architecture with ILP