OpenCores

% This file is part of the MMIXware package (c) Donald E Knuth 1999

% This file is part of the MMIXware package (c) Donald E Knuth 1999

@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!

@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!

\def\title{MMIX-PIPE}

\def\title{MMIX-PIPE}

\def\MMIX{\.{MMIX}}

\def\MMIX{\.{MMIX}}

\def\NNIX{\hbox{\mc NNIX}}

\def\NNIX{\hbox{\mc NNIX}}

\def\Hex#1{\hbox{$^{\scriptscriptstyle\#}$\tt#1}} % experimental hex constant

\def\Hex#1{\hbox{$^{\scriptscriptstyle\#}$\tt#1}} % experimental hex constant

@s and normal @q unreserve a C++ keyword @>

@s and normal @q unreserve a C++ keyword @>

@s or normal @q unreserve a C++ keyword @>

@s or normal @q unreserve a C++ keyword @>

@s bool normal @q unreserve a C++ keyword @>

@s bool normal @q unreserve a C++ keyword @>

@s xor normal @q unreserve a C++ keyword @>

@s xor normal @q unreserve a C++ keyword @>

@* Introduction. This program is the heart of the meta-simulator for the

@* Introduction. This program is the heart of the meta-simulator for the

ultra-configurable \MMIX\ pipeline: It defines the |MMIX_run| routine, which

ultra-configurable \MMIX\ pipeline: It defines the |MMIX_run| routine, which

does most of the

does most of the

work. Another routine, |MMIX_init|, is also defined here, and so is a header

work. Another routine, |MMIX_init|, is also defined here, and so is a header

file called \.{mmix\_pipe.h}. The header file is used by the main routine and

file called \.{mmix\_pipe.h}. The header file is used by the main routine and

by other routines like |MMIX_config|, which are compiled separately.

by other routines like |MMIX_config|, which are compiled separately.

Readers of this program should be familiar with the explanation of \MMIX\

Readers of this program should be familiar with the explanation of \MMIX\

architecture as presented in the main program module for {\mc MMMIX}.

architecture as presented in the main program module for {\mc MMMIX}.

A lot of subtle things can happen when instructions are executed in parallel.

A lot of subtle things can happen when instructions are executed in parallel.

Therefore this simulator ranks among the most interesting and instructive

Therefore this simulator ranks among the most interesting and instructive

programs in the author's experience. The author has tried his best to make

programs in the author's experience. The author has tried his best to make

everything correct \dots\ but the chances for error are great. Anyone who

everything correct \dots\ but the chances for error are great. Anyone who

discovers a bug is therefore urged to report it as soon as possible to

discovers a bug is therefore urged to report it as soon as possible to

\.{knuth-bug@@cs.stanford.edu}; then the program will be as useful as

\.{knuth-bug@@cs.stanford.edu}; then the program will be as useful as

possible. Rewards will be paid to bug-finders! (Except for bugs in version~0.)

possible. Rewards will be paid to bug-finders! (Except for bugs in version~0.)

It sort of boggles the mind when one realizes that the present program might

It sort of boggles the mind when one realizes that the present program might

someday be translated by a \CEE/~compiler for \MMIX\ and used to simulate

someday be translated by a \CEE/~compiler for \MMIX\ and used to simulate

{\it itself}.

{\it itself}.

@ This high-performance prototype of \MMIX\ achieves its efficiency by

@ This high-performance prototype of \MMIX\ achieves its efficiency by

means of ``pipelining,'' a technique of overlapping that is explained

means of ``pipelining,'' a technique of overlapping that is explained

for the related \.{DLX} computer in Chapter~3 of Hennessy \char`\&\ Patterson's

for the related \.{DLX} computer in Chapter~3 of Hennessy \char`\&\ Patterson's

book {\sl Computer Architecture\/} (second edition). Other techniques

book {\sl Computer Architecture\/} (second edition). Other techniques

such as ``dynamic scheduling'' and ``multiple issue,'' explained in

such as ``dynamic scheduling'' and ``multiple issue,'' explained in

Chapter~4 of that book, are used too.

Chapter~4 of that book, are used too.

One good way to visualize the procedure is to imagine that somebody has

One good way to visualize the procedure is to imagine that somebody has

organized a high-tech car repair shop according to similar principles.

organized a high-tech car repair shop according to similar principles.

There are eight independent functional units, which we can think of as

There are eight independent functional units, which we can think of as

eight groups of auto mechanics, each specializing in a particular task;

eight groups of auto mechanics, each specializing in a particular task;

each group has its own workspace with room to deal with one car at a time.

each group has its own workspace with room to deal with one car at a time.

Group~F (the ``fetch'' group) is in charge of rounding up customers and

Group~F (the ``fetch'' group) is in charge of rounding up customers and

getting them to enter the assembly-line garage in an orderly fashion.

getting them to enter the assembly-line garage in an orderly fashion.

Group~D (the ``decode and dispatch'' group) does the initial vehicle

Group~D (the ``decode and dispatch'' group) does the initial vehicle

inspection and

inspection and

writes up an order that explains what kind of servicing is required.

writes up an order that explains what kind of servicing is required.

The vehicles go next to one of the four ``execution'' groups:

The vehicles go next to one of the four ``execution'' groups:

Group~X handles routine maintenance, while groups XF, XM, and XD are

Group~X handles routine maintenance, while groups XF, XM, and XD are

specialists in more complex tasks that tend to take longer. (The XF

specialists in more complex tasks that tend to take longer. (The XF

people are good at floating the points, while the XM and XD groups are

people are good at floating the points, while the XM and XD groups are

experts in multilink suspensions and differentials.) When the relevant X~group

experts in multilink suspensions and differentials.) When the relevant X~group

has finished its work, cars drive to M~station, where they send or receive

has finished its work, cars drive to M~station, where they send or receive

messages and possibly pay money to members of the ``memory'' group. Finally

messages and possibly pay money to members of the ``memory'' group. Finally

all necessary parts are installed by members of group~W, the ``write''

all necessary parts are installed by members of group~W, the ``write''

group, and the car leaves the shop. Everything is tightly organized so

group, and the car leaves the shop. Everything is tightly organized so

that in most cases the cars move in synchronized fashion from station

that in most cases the cars move in synchronized fashion from station

to station, at regular 100-nanocentury intervals. % about 5.3 minutes

to station, at regular 100-nanocentury intervals. % about 5.3 minutes

In a similar way, most \MMIX\ instructions can be handled in a five-stage

In a similar way, most \MMIX\ instructions can be handled in a five-stage

pipeline, F--D--X--M--W, with X replaced by XF for floating-point

pipeline, F--D--X--M--W, with X replaced by XF for floating-point

addition or conversion, or by XM for multiplication, or by XD for

addition or conversion, or by XM for multiplication, or by XD for

division or square root. Each stage ideally takes one clock cycle,

division or square root. Each stage ideally takes one clock cycle,

although XF, XM, and (especially) XD are slower. If the instructions enter

although XF, XM, and (especially) XD are slower. If the instructions enter

in a suitable pattern, we might see one instruction being fetched,

in a suitable pattern, we might see one instruction being fetched,

another being decoded, and up to four being executed, while another is accessing

another being decoded, and up to four being executed, while another is accessing

memory, and yet another is finishing up by writing new information into

memory, and yet another is finishing up by writing new information into

registers; all this is going on simultaneously during one clock cycle. Pipelining

registers; all this is going on simultaneously during one clock cycle. Pipelining

with eight separate stages might therefore make the machine run

with eight separate stages might therefore make the machine run

up to 8 times as fast as it could if each instruction were being dealt with

up to 8 times as fast as it could if each instruction were being dealt with

individually and without overlap. (Well, perfect speedup turns out to

individually and without overlap. (Well, perfect speedup turns out to

be impossible, because of the shared M and~W stages; the theory of

be impossible, because of the shared M and~W stages; the theory of

knapsack programming, to be discussed in Section~7.7 of {\sl The Art

knapsack programming, to be discussed in Section~7.7 of {\sl The Art

of Computer Programming}, tells us that the maximal achievable speedup is

of Computer Programming}, tells us that the maximal achievable speedup is

at most $8-1/p-1/q-1/r$ when XF, XM, and~XD have delays bounded by $p$,

at most $8-1/p-1/q-1/r$ when XF, XM, and~XD have delays bounded by $p$,

$q$, and~$r$ cycles. But we can achieve a factor of more than~7

$q$, and~$r$ cycles. But we can achieve a factor of more than~7

if we are very lucky.)

if we are very lucky.)

Consider, for example, the \.{ADD} instruction. This instruction enters

Consider, for example, the \.{ADD} instruction. This instruction enters

the computer's processing unit in F stage, taking only one clock cycle if

the computer's processing unit in F stage, taking only one clock cycle if

it is in the cache of instructions recently seen. Then the D~stage

it is in the cache of instructions recently seen. Then the D~stage

recognizes the command as an \.{ADD} and acquires the current values

recognizes the command as an \.{ADD} and acquires the current values

of \$Y and \$Z; meanwhile, of course, another instruction is being fetched

of \$Y and \$Z; meanwhile, of course, another instruction is being fetched

by~F.

by~F.

On the next clock cycle, the X stage adds the values together.

On the next clock cycle, the X stage adds the values together.

This prepares the way for the M stage to watch for overflow and to

This prepares the way for the M stage to watch for overflow and to

get ready for any exceptional action that might be needed with respect

get ready for any exceptional action that might be needed with respect

to the settings of special register~rA\null.

to the settings of special register~rA\null.

Finally, on the fifth clock cycle, the sum is either written into~\$X

Finally, on the fifth clock cycle, the sum is either written into~\$X

or the trip handler for integer overflow is invoked.

or the trip handler for integer overflow is invoked.

Although this process has taken five clock

Although this process has taken five clock

cycles (that is, $5\upsilon$),

cycles (that is, $5\upsilon$),

the net increase in running time has been only~$1\upsilon$.

the net increase in running time has been only~$1\upsilon$.

Of course congestion can occur, inside a computer as in a repair shop.

Of course congestion can occur, inside a computer as in a repair shop.

For example, auto parts might not be readily available; or a car might

For example, auto parts might not be readily available; or a car might

have to sit in D station while waiting to move to XM, thereby blocking

have to sit in D station while waiting to move to XM, thereby blocking

somebody else from moving from F to~D.  Sometimes there won't

somebody else from moving from F to~D.  Sometimes there won't

necessarily be a steady stream of customers.  In such cases the

necessarily be a steady stream of customers.  In such cases the

employees in some parts of the shop will occasionally be idle.  But we

employees in some parts of the shop will occasionally be idle.  But we

assume that they always do their jobs as fast as possible, given the

assume that they always do their jobs as fast as possible, given the

sequence of customers that they encounter. With a clever person

sequence of customers that they encounter. With a clever person

setting up appointments---translation: with a clever

setting up appointments---translation: with a clever

programmer and/or compiler arranging \MMIX\ instructions---the

programmer and/or compiler arranging \MMIX\ instructions---the

organization can often be expected to run at nearly peak capacity.

organization can often be expected to run at nearly peak capacity.

In fact, this program is designed for experiments with many kinds of

In fact, this program is designed for experiments with many kinds of

pipelines, potentially using additional functional units (such as

pipelines, potentially using additional functional units (such as

several independent X~groups), and potentially fetching, dispatching, and

several independent X~groups), and potentially fetching, dispatching, and

executing several nonconflicting instructions simultaneously.

executing several nonconflicting instructions simultaneously.

Such complications

Such complications

make this program more difficult than a simple pipeline simulator

make this program more difficult than a simple pipeline simulator

would be, but they also make it a lot more instructive because we

would be, but they also make it a lot more instructive because we

can get a better understanding of the issues involved if we are

can get a better understanding of the issues involved if we are

required to treat them in greater generality.

required to treat them in greater generality.

@ Here's the overall structure of the present program module.

@ Here's the overall structure of the present program module.

@c

@c

#include

#include

#include

#include

#include

#include

#include "abstime.h"

#include "abstime.h"

@h@#

@h@#

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@ The identifier \&{Extern} is used in {\mc MMIX-PIPE} to

@ The identifier \&{Extern} is used in {\mc MMIX-PIPE} to

declare variables that are accessed in other modules. Actually

declare variables that are accessed in other modules. Actually

all appearances of `\&{Extern}' are defined to be blank here, but

all appearances of `\&{Extern}' are defined to be blank here, but

`\&{Extern}' will become `\&{extern}' in the header file.

`\&{Extern}' will become `\&{extern}' in the header file.

@d Extern  /* blank for us, \&{extern} for them */

@d Extern  /* blank for us, \&{extern} for them */

@f Extern extern

@f Extern extern

@=

@=

Extern int verbose; /* controls the level of diagnostic output */

Extern int verbose; /* controls the level of diagnostic output */

@ The header file repeats the basic definitions and declarations.

@ The header file repeats the basic definitions and declarations.

@(mmix-pipe.h@>=

@(mmix-pipe.h@>=

#define Extern extern

#define Extern extern

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@@;

@ Subroutines of this program are declared first with a prototype,

@ Subroutines of this program are declared first with a prototype,

as in {\mc ANSI C}, then with an old-style \CEE/ function definition.

as in {\mc ANSI C}, then with an old-style \CEE/ function definition.

The following preprocessor commands make this work correctly with both

The following preprocessor commands make this work correctly with both

new-style and old-style compilers.

new-style and old-style compilers.

@^prototypes for functions@>

@^prototypes for functions@>

@=

@=

#ifdef __STDC__

#ifdef __STDC__

#define ARGS(list) list

#define ARGS(list) list

#else

#else

#define ARGS(list) ()

#define ARGS(list) ()

#endif

#endif

@ Some of the names that are natural for this program are in

@ Some of the names that are natural for this program are in

conflict with library names on at least

conflict with library names on at least

one of the host computers in the author's tests. So we

one of the host computers in the author's tests. So we

bypass the library names here.

bypass the library names here.

@=

@=

#define random my_random

#define random my_random

#define fsqrt my_fsqrt

#define fsqrt my_fsqrt

#define div my_div

#define div my_div

@ The amount of verbosity depends on the following bit codes.

@ The amount of verbosity depends on the following bit codes.

@=

@=

#define issue_bit (1<<0)

#define issue_bit (1<<0)

   /* show control blocks when issued, deissued, committed */

   /* show control blocks when issued, deissued, committed */

#define pipe_bit (1<<1)

#define pipe_bit (1<<1)

   /* show the pipeline and locks on every cycle */

   /* show the pipeline and locks on every cycle */

#define coroutine_bit (1<<2)

#define coroutine_bit (1<<2)

   /* show the coroutines when started on every cycle */

   /* show the coroutines when started on every cycle */

#define schedule_bit (1<<3)

#define schedule_bit (1<<3)

   /* show the coroutines when scheduled */

   /* show the coroutines when scheduled */

#define uninit_mem_bit (1<<4)

#define uninit_mem_bit (1<<4)

   /* complain when reading from an uninitialized chunk of memory */

   /* complain when reading from an uninitialized chunk of memory */

#define interactive_read_bit (1<<5)

#define interactive_read_bit (1<<5)

   /* prompt user when reading from I/O location */

   /* prompt user when reading from I/O location */

#define show_spec_bit (1<<6)

#define show_spec_bit (1<<6)

   /* display special read/write transactions as they happen */

   /* display special read/write transactions as they happen */

#define show_pred_bit (1<<7)

#define show_pred_bit (1<<7)

   /* display branch prediction details */

   /* display branch prediction details */

#define show_wholecache_bit (1<<8)

#define show_wholecache_bit (1<<8)

   /* display cache blocks even when their key tag is invalid */

   /* display cache blocks even when their key tag is invalid */

@ The |MMIX_init()| routine should be called exactly once, after

@ The |MMIX_init()| routine should be called exactly once, after

|MMIX_config()| has done its work but before the simulator starts to execute

|MMIX_config()| has done its work but before the simulator starts to execute

any programs. Then |MMIX_run| can be called as often as the user likes.

any programs. Then |MMIX_run| can be called as often as the user likes.

@s octa int

@s octa int

@=

@=

Extern void MMIX_init @,@,@[ARGS((void))@];

Extern void MMIX_init @,@,@[ARGS((void))@];

Extern void MMIX_run @,@,@[ARGS((int cycs, octa breakpoint))@];

Extern void MMIX_run @,@,@[ARGS((int cycs, octa breakpoint))@];

@ @=

@ @=

void MMIX_init()

void MMIX_init()

  register int i,j;

  register int i,j;

@;

@;

@#

@#

void MMIX_run(cycs,breakpoint)

void MMIX_run(cycs,breakpoint)

  int cycs;

  int cycs;

  octa breakpoint;

  octa breakpoint;

@;

@;

  while (cycs) {

  while (cycs) {

    if (verbose&(issue_bit|pipe_bit|coroutine_bit|schedule_bit))

    if (verbose&(issue_bit|pipe_bit|coroutine_bit|schedule_bit))

      printf("*** Cycle %d\n", ticks.l);

      printf("*** Cycle %d\n", ticks.l);

@;

@;

    if (verbose&pipe_bit) {

    if (verbose&pipe_bit) {

      print_pipe();@+ print_locks();

      print_pipe();@+ print_locks();

    if (breakpoint_hit||halted) {

    if (breakpoint_hit||halted) {

      if (breakpoint_hit)

      if (breakpoint_hit)

        printf("Breakpoint instruction fetched at time %d\n",ticks.l-1);

        printf("Breakpoint instruction fetched at time %d\n",ticks.l-1);

      if (halted) printf("Halted at time %d\n", ticks.l-1);

      if (halted) printf("Halted at time %d\n", ticks.l-1);

      break;

      break;

    cycs--;

    cycs--;

 cease:;

 cease:;

@ @=

@ @=

typedef enum {@!false, @!true, @!wow}@+bool; /* slightly extended booleans */

typedef enum {@!false, @!true, @!wow}@+bool; /* slightly extended booleans */

@ @=

@ @=

register int i,j,m;

register int i,j,m;

bool breakpoint_hit=false;

bool breakpoint_hit=false;

bool halted=false;

bool halted=false;

@ Error messages that abort this program are called panic messages.

@ Error messages that abort this program are called panic messages.

The macro called |confusion| will never be needed unless this program is

The macro called |confusion| will never be needed unless this program is

internally inconsistent.

internally inconsistent.

@d errprint0(f) fprintf(stderr,f)

@d errprint0(f) fprintf(stderr,f)

@d errprint1(f,a) fprintf(stderr,f,a)

@d errprint1(f,a) fprintf(stderr,f,a)

@d errprint2(f,a,b) fprintf(stderr,f,a,b)

@d errprint2(f,a,b) fprintf(stderr,f,a,b)

@d panic(x)@+ {@+errprint0("Panic: ");@+x;@+errprint0("!\n");@+expire();@+}

@d panic(x)@+ {@+errprint0("Panic: ");@+x;@+errprint0("!\n");@+expire();@+}

@d confusion(m) errprint1("This can't happen: %s",m)

@d confusion(m) errprint1("This can't happen: %s",m)

@.This can't happen@>

@.This can't happen@>

@=

@=

static void expire @,@,@[ARGS((void))@];

static void expire @,@,@[ARGS((void))@];

@ @=

@ @=

static void expire() /* the last gasp before dying */

static void expire() /* the last gasp before dying */

  if (ticks.h) errprint2("(Clock time is %dH+%d.)\n",ticks.h,ticks.l);

  if (ticks.h) errprint2("(Clock time is %dH+%d.)\n",ticks.h,ticks.l);

  else errprint1("(Clock time is %d.)\n",ticks.l);

  else errprint1("(Clock time is %d.)\n",ticks.l);

@.Clock time is...@>

@.Clock time is...@>

  exit(-2);

  exit(-2);

@ The data structures of this program are not precisely equivalent to

@ The data structures of this program are not precisely equivalent to

logical gates that could be implemented directly in silicon;

logical gates that could be implemented directly in silicon;

we will use data structures and

we will use data structures and

algorithms appropriate to the \CEE/ programming language. For example,

algorithms appropriate to the \CEE/ programming language. For example,

we'll use pointers and arrays, instead of buses and ports and latches. However,

we'll use pointers and arrays, instead of buses and ports and latches. However,

the net effect of our data structures and algorithms is intended to

the net effect of our data structures and algorithms is intended to

be equivalent to the net effect of a silicon implementation. The methods

be equivalent to the net effect of a silicon implementation. The methods

used below are essentially equivalent to those used in real machines today,

used below are essentially equivalent to those used in real machines today,

except that diagnostic facilities are added so that we can readily

except that diagnostic facilities are added so that we can readily

watch what is happening.

watch what is happening.

Each functional unit in the \MMIX\ pipeline is programmed here as a coroutine

Each functional unit in the \MMIX\ pipeline is programmed here as a coroutine

in~\CEE/. At every clock cycle, we will call on each active coroutine to do one

in~\CEE/. At every clock cycle, we will call on each active coroutine to do one

phase of its operation; in terms of the repair-station analogy

phase of its operation; in terms of the repair-station analogy

described in the main program,

described in the main program,

this corresponds to getting each group of

this corresponds to getting each group of

auto mechanics to do one unit of operation on a car.

auto mechanics to do one unit of operation on a car.

The coroutines are performed sequentially, although

The coroutines are performed sequentially, although

a real pipeline would have them act in parallel.

a real pipeline would have them act in parallel.

We will not ``cheat'' by letting one coroutine access a value early in its

We will not ``cheat'' by letting one coroutine access a value early in its

cycle that another one computes late in its cycle, unless computer hardware

cycle that another one computes late in its cycle, unless computer hardware

could ``cheat'' in an equivalent way.

could ``cheat'' in an equivalent way.

@* Low-level routines. Where should we begin? It is tempting to start with a

@* Low-level routines. Where should we begin? It is tempting to start with a

global view of the simulator and then to break it down into component parts.

global view of the simulator and then to break it down into component parts.

But that task is too daunting, because there are so many unknowns about what

But that task is too daunting, because there are so many unknowns about what

basic ingredients ought to be combined when we construct the larger

basic ingredients ought to be combined when we construct the larger

components. So let us look first at the primitive operations on which

components. So let us look first at the primitive operations on which

the superstructure will be built. Once we have created some infrastructure,

the superstructure will be built. Once we have created some infrastructure,

we'll be able to proceed with confidence to the larger tasks ahead.

we'll be able to proceed with confidence to the larger tasks ahead.

@ This program for the 64-bit \MMIX\ architecture is based on 32-bit integer

@ This program for the 64-bit \MMIX\ architecture is based on 32-bit integer

arithmetic, because nearly every computer available to the author at the time

arithmetic, because nearly every computer available to the author at the time

of writing (1998--1999) was limited in that way.

of writing (1998--1999) was limited in that way.

Details of the basic arithmetic appear in a separate program module

Details of the basic arithmetic appear in a separate program module

called {\mc MMIX-ARITH}, because the same routines are needed also

called {\mc MMIX-ARITH}, because the same routines are needed also

for the assembler and for the non-pipelined simulator. The

for the assembler and for the non-pipelined simulator. The

definition of type \&{tetra} should be changed, if necessary, to conform with

definition of type \&{tetra} should be changed, if necessary, to conform with

the definitions found there.

the definitions found there.

@^system dependencies@>

@^system dependencies@>

@=

@=

typedef unsigned int tetra;

typedef unsigned int tetra;

  /* for systems conforming to the LP-64 data model */

  /* for systems conforming to the LP-64 data model */

typedef struct { tetra h,l;} octa; /* two tetrabytes make one octabyte */

typedef struct { tetra h,l;} octa; /* two tetrabytes make one octabyte */

@ @=

@ @=

static void print_octa @,@,@[ARGS((octa))@];

static void print_octa @,@,@[ARGS((octa))@];

@ @=

@ @=

static void print_octa(o)

static void print_octa(o)

  octa o;

  octa o;

  if (o.h) printf("%x%08x",o.h,o.l);@+

  if (o.h) printf("%x%08x",o.h,o.l);@+

  else printf("%x",o.l);

  else printf("%x",o.l);

@ @=

@ @=

extern octa zero_octa; /* |zero_octa.h=zero_octa.l=0| */

extern octa zero_octa; /* |zero_octa.h=zero_octa.l=0| */

extern octa neg_one; /* |neg_one.h=neg_one.l=-1| */

extern octa neg_one; /* |neg_one.h=neg_one.l=-1| */

extern octa aux; /* auxiliary output of a subroutine */

extern octa aux; /* auxiliary output of a subroutine */

extern bool overflow; /* set by certain subroutines for signed arithmetic */

extern bool overflow; /* set by certain subroutines for signed arithmetic */

extern int exceptions; /* bits set by floating point operations */

extern int exceptions; /* bits set by floating point operations */

extern int cur_round; /* the current rounding mode */

extern int cur_round; /* the current rounding mode */

@ Most of the subroutines in {\mc MMIX-ARITH} return an octabyte as

@ Most of the subroutines in {\mc MMIX-ARITH} return an octabyte as

a function of two octabytes; for example, |oplus(y,z)| returns the

a function of two octabytes; for example, |oplus(y,z)| returns the

sum of octabytes |y| and~|z|. Multiplication returns the high

sum of octabytes |y| and~|z|. Multiplication returns the high

half of a product in the global variable~|aux|; division returns

half of a product in the global variable~|aux|; division returns

the remainder in~|aux|.

the remainder in~|aux|.

@=

@=

extern octa oplus @,@,@[ARGS((octa y,octa z))@];

extern octa oplus @,@,@[ARGS((octa y,octa z))@];

  /* unsigned $y+z$ */

  /* unsigned $y+z$ */

extern octa ominus @,@,@[ARGS((octa y,octa z))@];

extern octa ominus @,@,@[ARGS((octa y,octa z))@];

  /* unsigned $y-z$ */

  /* unsigned $y-z$ */

extern octa incr @,@,@[ARGS((octa y,int delta))@];

extern octa incr @,@,@[ARGS((octa y,int delta))@];

  /* unsigned $y+\delta$ ($\delta$ is signed) */

  /* unsigned $y+\delta$ ($\delta$ is signed) */

extern octa oand @,@,@[ARGS((octa y,octa z))@];

extern octa oand @,@,@[ARGS((octa y,octa z))@];

  /* $y\land z$ */

  /* $y\land z$ */

extern octa oandn @,@,@[ARGS((octa y,octa z))@];

extern octa oandn @,@,@[ARGS((octa y,octa z))@];

  /* $y\land \bar z$ */

  /* $y\land \bar z$ */

extern octa shift_left @,@,@[ARGS((octa y,int s))@];

extern octa shift_left @,@,@[ARGS((octa y,int s))@];

  /* $y\LL s$, $0\le s\le64$ */

  /* $y\LL s$, $0\le s\le64$ */

extern octa shift_right @,@,@[ARGS((octa y,int s,int uns))@];

extern octa shift_right @,@,@[ARGS((octa y,int s,int uns))@];

  /* $y\GG s$, signed if |!uns| */

  /* $y\GG s$, signed if |!uns| */

extern octa omult @,@,@[ARGS((octa y,octa z))@];

extern octa omult @,@,@[ARGS((octa y,octa z))@];

  /* unsigned $(|aux|,x)=y\times z$ */

  /* unsigned $(|aux|,x)=y\times z$ */

extern octa signed_omult @,@,@[ARGS((octa y,octa z))@];

extern octa signed_omult @,@,@[ARGS((octa y,octa z))@];

  /* signed $x=y\times z$, setting |overflow| */

  /* signed $x=y\times z$, setting |overflow| */

extern octa odiv @,@,@[ARGS((octa x,octa y,octa z))@];

extern octa odiv @,@,@[ARGS((octa x,octa y,octa z))@];

  /* unsigned $(x,y)/z$; $|aux|=(x,y)\bmod z$ */

  /* unsigned $(x,y)/z$; $|aux|=(x,y)\bmod z$ */

extern octa signed_odiv @,@,@[ARGS((octa y,octa z))@];

extern octa signed_odiv @,@,@[ARGS((octa y,octa z))@];

  /* signed $y/z$, when $z\ne0$; $|aux|=y\bmod z$ */

  /* signed $y/z$, when $z\ne0$; $|aux|=y\bmod z$ */

extern int count_bits @,@,@[ARGS((tetra z))@];

extern int count_bits @,@,@[ARGS((tetra z))@];

  /* $x=\nu(z)$ */

  /* $x=\nu(z)$ */

extern tetra byte_diff @,@,@[ARGS((tetra y,tetra z))@];

extern tetra byte_diff @,@,@[ARGS((tetra y,tetra z))@];

  /* half of \.{BDIF} */

  /* half of \.{BDIF} */

extern tetra wyde_diff @,@,@[ARGS((tetra y,tetra z))@];

extern tetra wyde_diff @,@,@[ARGS((tetra y,tetra z))@];

  /* half of \.{WDIF} */

  /* half of \.{WDIF} */

extern octa bool_mult @,@,@[ARGS((octa y,octa z,bool xor))@];

extern octa bool_mult @,@,@[ARGS((octa y,octa z,bool xor))@];

  /* \.{MOR} or \.{MXOR} */

  /* \.{MOR} or \.{MXOR} */

extern octa load_sf @,@,@[ARGS((tetra z))@];

extern octa load_sf @,@,@[ARGS((tetra z))@];

  /* load short float */

  /* load short float */

extern tetra store_sf @,@,@[ARGS((octa x))@];

extern tetra store_sf @,@,@[ARGS((octa x))@];

  /* store short float */

  /* store short float */

extern octa fplus @,@,@[ARGS((octa y,octa z))@];

extern octa fplus @,@,@[ARGS((octa y,octa z))@];

  /* floating point $x=y\oplus z$ */

  /* floating point $x=y\oplus z$ */

extern octa fmult @,@,@[ARGS((octa y ,octa z))@];

extern octa fmult @,@,@[ARGS((octa y ,octa z))@];

  /* floating point $x=y\otimes z$ */

  /* floating point $x=y\otimes z$ */

extern octa fdivide @,@,@[ARGS((octa y,octa z))@];

extern octa fdivide @,@,@[ARGS((octa y,octa z))@];

  /* floating point $x=y\oslash z$ */

  /* floating point $x=y\oslash z$ */

extern octa froot @,@,@[ARGS((octa,int))@];

extern octa froot @,@,@[ARGS((octa,int))@];

  /* floating point $x=\sqrt z$ */

  /* floating point $x=\sqrt z$ */

extern octa fremstep @,@,@[ARGS((octa y,octa z,int delta))@];

extern octa fremstep @,@,@[ARGS((octa y,octa z,int delta))@];

  /* floating point $x\,{\rm rem}\,z=y\,{\rm rem}\,z$ */

  /* floating point $x\,{\rm rem}\,z=y\,{\rm rem}\,z$ */

extern octa fintegerize @,@,@[ARGS((octa z,int mode))@];

extern octa fintegerize @,@,@[ARGS((octa z,int mode))@];

  /* floating point $x={\rm round}(z)$ */

  /* floating point $x={\rm round}(z)$ */

extern int fcomp @,@,@[ARGS((octa y,octa z))@];

extern int fcomp @,@,@[ARGS((octa y,octa z))@];

  /* $-1$, 0, 1, or 2 if $yz$, $y\parallel z$ */

  /* $-1$, 0, 1, or 2 if $yz$, $y\parallel z$ */

extern int fepscomp @,@,@[ARGS((octa y,octa z,octa eps,int sim))@];

extern int fepscomp @,@,@[ARGS((octa y,octa z,octa eps,int sim))@];

  /* $x=|sim|?\ [y\sim z\ (\epsilon)]:\ [y\approx z\ (\epsilon)]$ */

  /* $x=|sim|?\ [y\sim z\ (\epsilon)]:\ [y\approx z\ (\epsilon)]$ */

extern octa floatit @,@,@[ARGS((octa z,int mode,int unsgnd,int shrt))@];

extern octa floatit @,@,@[ARGS((octa z,int mode,int unsgnd,int shrt))@];

  /* fix to float */

  /* fix to float */

extern octa fixit @,@,@[ARGS((octa z,int mode))@];

extern octa fixit @,@,@[ARGS((octa z,int mode))@];

  /* float to fix */

  /* float to fix */

@ We had better check that our 32-bit assumption holds.

@ We had better check that our 32-bit assumption holds.

@=

@=

if (shift_left(neg_one,1).h!=0xffffffff)

if (shift_left(neg_one,1).h!=0xffffffff)

  panic(errprint0("Incorrect implementation of type tetra"));

  panic(errprint0("Incorrect implementation of type tetra"));

@.Incorrect implementation...@>

@.Incorrect implementation...@>

@* Coroutines. As stated earlier, this program can be regarded as a system of

@* Coroutines. As stated earlier, this program can be regarded as a system of

interacting coroutines. Coroutines---sometimes called threads---are more or

interacting coroutines. Coroutines---sometimes called threads---are more or

less independent processes that share and pass data and control back and

less independent processes that share and pass data and control back and

forth. They correspond to the individual workers in an organization.

forth. They correspond to the individual workers in an organization.

We don't need the full power of recursive coroutines, in which new threads are

We don't need the full power of recursive coroutines, in which new threads are

spawned dynamically and have independent stacks for computation; we are, after

spawned dynamically and have independent stacks for computation; we are, after

all, simulating a fixed piece of hardware. The total number of coroutines we

all, simulating a fixed piece of hardware. The total number of coroutines we

deal with is established once and for all by the |MMIX_config| routine, and

deal with is established once and for all by the |MMIX_config| routine, and

each coroutine has a fixed amount of local data.

each coroutine has a fixed amount of local data.

The simulation operates one clock tick at a time, by executing all

The simulation operates one clock tick at a time, by executing all

coroutines scheduled for time~$t$ before advancing to time~$t+1$. The

coroutines scheduled for time~$t$ before advancing to time~$t+1$. The

coroutines at time~$t$ may decide to become dormant or they may reschedule

coroutines at time~$t$ may decide to become dormant or they may reschedule

themselves and/or other coroutines for future times.

themselves and/or other coroutines for future times.

Each coroutine has a symbolic |name| for diagnostic purposes (e.g.,

Each coroutine has a symbolic |name| for diagnostic purposes (e.g.,

\.{ALU1}); a nonnegative |stage| number (e.g., 2~for the second stage

\.{ALU1}); a nonnegative |stage| number (e.g., 2~for the second stage

of a pipeline); a pointer to the next coroutine scheduled at the same time (or

of a pipeline); a pointer to the next coroutine scheduled at the same time (or

|NULL| if the coroutine is unscheduled); a pointer to a lock variable

|NULL| if the coroutine is unscheduled); a pointer to a lock variable

(or |NULL| if no lock is currently relevant);

(or |NULL| if no lock is currently relevant);

and a reference to a control block containing the data to be processed.

and a reference to a control block containing the data to be processed.

@s control_struct int

@s control_struct int

@=

@=

typedef struct coroutine_struct {

typedef struct coroutine_struct {

 char *name; /* symbolic identification of a coroutine */

 char *name; /* symbolic identification of a coroutine */

 int stage; /* its rank */

 int stage; /* its rank */

 struct coroutine_struct *next; /* its successor */

 struct coroutine_struct *next; /* its successor */

 struct coroutine_struct **lockloc; /* what it might be locking */

 struct coroutine_struct **lockloc; /* what it might be locking */

 struct control_struct *ctl; /* its data */

 struct control_struct *ctl; /* its data */

} coroutine;

} coroutine;

@ @=

@ @=

static void print_coroutine_id @,@,@[ARGS((coroutine*))@];

static void print_coroutine_id @,@,@[ARGS((coroutine*))@];

static void errprint_coroutine_id @,@,@[ARGS((coroutine*))@];

static void errprint_coroutine_id @,@,@[ARGS((coroutine*))@];

@ @=

@ @=

static void print_coroutine_id(c)

static void print_coroutine_id(c)

  coroutine *c;

  coroutine *c;

  if (c) printf("%s:%d",c->name,c->stage);

  if (c) printf("%s:%d",c->name,c->stage);

  else printf("??");

  else printf("??");

@#

@#

static void errprint_coroutine_id(c)

static void errprint_coroutine_id(c)

  coroutine *c;

  coroutine *c;

  if (c) errprint2("%s:%d",c->name,c->stage);

  if (c) errprint2("%s:%d",c->name,c->stage);

  else errprint0("??");

  else errprint0("??");

@.??@>

@.??@>

@ Coroutine control is masterminded by a ring of queues, one each for

@ Coroutine control is masterminded by a ring of queues, one each for

times $t$, $t+1$, \dots, $t+|ring_size|-1$, when $t$ is the current

times $t$, $t+1$, \dots, $t+|ring_size|-1$, when $t$ is the current

clock time.

clock time.

All scheduling is first-come-first-served, except that coroutines with higher

All scheduling is first-come-first-served, except that coroutines with higher

|stage| numbers have priority. We want to process the later stages of a

|stage| numbers have priority. We want to process the later stages of a

pipeline first, in this sequential implementation, for the same reason that a

pipeline first, in this sequential implementation, for the same reason that a

car must drive from M~station into W~station before another car can enter

car must drive from M~station into W~station before another car can enter

M~station.

M~station.

Each queue is a circular list of \&{coroutine} nodes, linked together by their

Each queue is a circular list of \&{coroutine} nodes, linked together by their

|next| fields. A list head~$h$ with |stage=max_stage| comes at the end and the

|next| fields. A list head~$h$ with |stage=max_stage| comes at the end and the

beginning of the queue. (All |stage| numbers of legitimate coroutines

beginning of the queue. (All |stage| numbers of legitimate coroutines

are less than~|max_stage|.) The queued items are |h->next|, |h->next->next|,

are less than~|max_stage|.) The queued items are |h->next|, |h->next->next|,

etc., from back to front, and we have |c->stage<=c->next->stage| unless |c=h|.

etc., from back to front, and we have |c->stage<=c->next->stage| unless |c=h|.

Initially all queues are empty.

Initially all queues are empty.

@=

@=

{@+register coroutine *p;

{@+register coroutine *p;

  for (p=ring;pnext=p;

  for (p=ring;pnext=p;

@ To schedule a coroutine |c| with positive delay |d

@ To schedule a coroutine |c| with positive delay |d

|schedule(c,d,s)|. (The |s| parameter is used only if scheduling is

|schedule(c,d,s)|. (The |s| parameter is used only if scheduling is

being logged; it does not affect the computation, but we will

being logged; it does not affect the computation, but we will

generally set |s| to the state at which the scheduled coroutine will begin.)

generally set |s| to the state at which the scheduled coroutine will begin.)

@=

@=

static void schedule @,@,@[ARGS((coroutine*,int,int))@];

static void schedule @,@,@[ARGS((coroutine*,int,int))@];

@ @=

@ @=

static void schedule(c,d,s)

static void schedule(c,d,s)

  coroutine *c;

  coroutine *c;

  int d,s;

  int d,s;

  register int tt=(cur_time+d)%ring_size;

  register int tt=(cur_time+d)%ring_size;

  register coroutine *p=&ring[tt]; /* start at the list head */

  register coroutine *p=&ring[tt]; /* start at the list head */

  if (d<=0 || d>=ring_size) /* do a sanity check */

  if (d<=0 || d>=ring_size) /* do a sanity check */

   panic(confusion("Scheduling ");errprint_coroutine_id(c);

   panic(confusion("Scheduling ");errprint_coroutine_id(c);

         errprint1(" with delay %d",d));

         errprint1(" with delay %d",d));

  while (p->next->stagestage) p=p->next;

  while (p->next->stagestage) p=p->next;

  c->next = p->next;

  c->next = p->next;

  p->next = c;

  p->next = c;

  if (verbose&schedule_bit) {

  if (verbose&schedule_bit) {

    printf(" scheduling ");@+print_coroutine_id(c);

    printf(" scheduling ");@+print_coroutine_id(c);

    printf(" at time %d, state %d\n",ticks.l+d,s);

    printf(" at time %d, state %d\n",ticks.l+d,s);

@ @=

@ @=

Extern int ring_size; /* set by |MMIX_config|, must be sufficiently large */

Extern int ring_size; /* set by |MMIX_config|, must be sufficiently large */

Extern coroutine *ring;

Extern coroutine *ring;

Extern int cur_time;

Extern int cur_time;

@ The all-important |ctl| field of a coroutine, which contains the

@ The all-important |ctl| field of a coroutine, which contains the

data being manipulated, will be explained below. One of its key

data being manipulated, will be explained below. One of its key

components is the |state| field, which helps to specify the next

components is the |state| field, which helps to specify the next

actions the coroutine will perform. When we schedule a coroutine for

actions the coroutine will perform. When we schedule a coroutine for

a new task, we often want it to begin in state~0.

a new task, we often want it to begin in state~0.

@=

@=

static void startup @,@,@[ARGS((coroutine*,int))@];

static void startup @,@,@[ARGS((coroutine*,int))@];

@ @=

@ @=

static void startup(c,d)

static void startup(c,d)

  coroutine *c;

  coroutine *c;

  int d;

  int d;

  c->ctl->state=0;

  c->ctl->state=0;

  schedule(c,d,0);

  schedule(c,d,0);

@ The following routine removes a coroutine from whatever queue it's in.

@ The following routine removes a coroutine from whatever queue it's in.

The case |c->next=c| is also permitted; such a self-loop can occur when a

The case |c->next=c| is also permitted; such a self-loop can occur when a

coroutine goes to sleep and expects to be awakened (that is, scheduled)

coroutine goes to sleep and expects to be awakened (that is, scheduled)

by another coroutine. Sleeping coroutines have important data in their

by another coroutine. Sleeping coroutines have important data in their

|ctl| field; they are therefore quite different from unscheduled

|ctl| field; they are therefore quite different from unscheduled

or ``unemployed'' coroutines, which have |c->next=NULL|. An unemployed

or ``unemployed'' coroutines, which have |c->next=NULL|. An unemployed

coroutine is not assumed to have any valid data in its |ctl| field.

coroutine is not assumed to have any valid data in its |ctl| field.

@=

@=

static void unschedule @,@,@[ARGS((coroutine*))@];

static void unschedule @,@,@[ARGS((coroutine*))@];

@ @=

@ @=

static void unschedule(c)

static void unschedule(c)

  coroutine *c;

  coroutine *c;

{@+register coroutine *p;

{@+register coroutine *p;

  if (c->next) {

  if (c->next) {

    for (p=c; p->next!=c; p=p->next) ;

    for (p=c; p->next!=c; p=p->next) ;

    p->next = c->next;

    p->next = c->next;

    c->next=NULL;

    c->next=NULL;

    if (verbose&schedule_bit) {

    if (verbose&schedule_bit) {

      printf(" unscheduling ");@+print_coroutine_id(c);@+printf("\n");

      printf(" unscheduling ");@+print_coroutine_id(c);@+printf("\n");

@ When it is time to process all coroutines that have queued up for a

@ When it is time to process all coroutines that have queued up for a

particular time~|t|, we empty the queue called |ring[t]| and link its items in

particular time~|t|, we empty the queue called |ring[t]| and link its items in

the opposite order (from front to back). The following subroutine uses the

the opposite order (from front to back). The following subroutine uses the

well known algorithm discussed in exercise 2.2.3--7 of {\sl The Art

well known algorithm discussed in exercise 2.2.3--7 of {\sl The Art

of Computer Programming}.

of Computer Programming}.

@=

@=

static coroutine *queuelist @,@,@[ARGS((int))@];

static coroutine *queuelist @,@,@[ARGS((int))@];

@ @=

@ @=

static coroutine* queuelist(t)

static coroutine* queuelist(t)

  int t;

  int t;

{@+register coroutine *p, *q=&sentinel, *r;

{@+register coroutine *p, *q=&sentinel, *r;

  for (p=ring[t].next;p!=&ring[t];p=r) {

  for (p=ring[t].next;p!=&ring[t];p=r) {

    r=p->next;

    r=p->next;

    p->next=q;

    p->next=q;

    q=p;

    q=p;

  ring[t].next=&ring[t];

  ring[t].next=&ring[t];

  sentinel.next=q;

  sentinel.next=q;

  return q;

  return q;

@ @=

@ @=

coroutine sentinel; /* dummy coroutine at origin of circular list */

coroutine sentinel; /* dummy coroutine at origin of circular list */

@ Coroutines often start working on tasks that are {\it speculative}, in the

@ Coroutines often start working on tasks that are {\it speculative}, in the

sense that we want certain results to be ready if they prove to be

sense that we want certain results to be ready if they prove to be

useful; we understand that speculative computations might not actually

useful; we understand that speculative computations might not actually

be needed. Therefore a coroutine might need to be aborted before it

be needed. Therefore a coroutine might need to be aborted before it

has finished its work.

has finished its work.

All coroutines must be written in such a way that important data structures

All coroutines must be written in such a way that important data structures

remain intact even when the coroutine is abruptly terminated. In particular,

remain intact even when the coroutine is abruptly terminated. In particular,

we need to be sure that ``locks'' on shared resources are restored to

we need to be sure that ``locks'' on shared resources are restored to

an unlocked state when a coroutine holding the lock is aborted.

an unlocked state when a coroutine holding the lock is aborted.

A \&{lockvar} variable is |NULL| when it is unlocked; otherwise it

A \&{lockvar} variable is |NULL| when it is unlocked; otherwise it

points to the coroutine responsible for unlocking~it.

points to the coroutine responsible for unlocking~it.

@d set_lock(c,l) {@+l=c;@+(c)->lockloc=&(l);@+}

@d set_lock(c,l) {@+l=c;@+(c)->lockloc=&(l);@+}

@d release_lock(c,l) {@+l=NULL;@+ (c)->lockloc=NULL;@+}

@d release_lock(c,l) {@+l=NULL;@+ (c)->lockloc=NULL;@+}

@=

@=

typedef coroutine *lockvar;

typedef coroutine *lockvar;

@ @=

@ @=

Extern void print_locks @,@,@[ARGS((void))@];

Extern void print_locks @,@,@[ARGS((void))@];

@ @=

@ @=

void print_locks()

void print_locks()

  print_cache_locks(ITcache);

  print_cache_locks(ITcache);

  print_cache_locks(DTcache);

  print_cache_locks(DTcache);

  print_cache_locks(Icache);

  print_cache_locks(Icache);

  print_cache_locks(Dcache);

  print_cache_locks(Dcache);

  print_cache_locks(Scache);

  print_cache_locks(Scache);

  if (mem_lock) printf("mem locked by %s:%d\n",mem_lock->name,mem_lock->stage);

  if (mem_lock) printf("mem locked by %s:%d\n",mem_lock->name,mem_lock->stage);

  if (dispatch_lock) printf("dispatch locked by %s:%d\n",

  if (dispatch_lock) printf("dispatch locked by %s:%d\n",

                    dispatch_lock->name,dispatch_lock->stage);

                    dispatch_lock->name,dispatch_lock->stage);

  if (wbuf_lock) printf("head of write buffer locked by %s:%d\n",

  if (wbuf_lock) printf("head of write buffer locked by %s:%d\n",

                    wbuf_lock->name,wbuf_lock->stage);

                    wbuf_lock->name,wbuf_lock->stage);

  if (clean_lock) printf("cleaner locked by %s:%d\n",

  if (clean_lock) printf("cleaner locked by %s:%d\n",

                    clean_lock->name,clean_lock->stage);

                    clean_lock->name,clean_lock->stage);

  if (speed_lock) printf("write buffer flush locked by %s:%d\n",

  if (speed_lock) printf("write buffer flush locked by %s:%d\n",

                    speed_lock->name,speed_lock->stage);

                    speed_lock->name,speed_lock->stage);

@ Many of the quantities we deal with are speculative values

@ Many of the quantities we deal with are speculative values

that might not yet have been certified as part of the ``real''

that might not yet have been certified as part of the ``real''

calculation; in fact, they might not yet have been calculated.

calculation; in fact, they might not yet have been calculated.

A \&{spec} consists of a 64-bit quantity |o| and a pointer~|p| to

A \&{spec} consists of a 64-bit quantity |o| and a pointer~|p| to

a \&{specnode}. The value~|o| is meaningful only if the

a \&{specnode}. The value~|o| is meaningful only if the

pointer~|p| is~|NULL|; otherwise |p| points to a source of further information.

pointer~|p| is~|NULL|; otherwise |p| points to a source of further information.

A \&{specnode} is a 64-bit quantity |o| together with links to other

A \&{specnode} is a 64-bit quantity |o| together with links to other

\&{specnode}s

\&{specnode}s

that are above it or below it in a doubly linked list. An additional

that are above it or below it in a doubly linked list. An additional

|known| bit tells whether the |o|~field has been calculated. There also is

|known| bit tells whether the |o|~field has been calculated. There also is

a 64-bit |addr| field, to identify the list and give further information.

a 64-bit |addr| field, to identify the list and give further information.

A \&{specnode} list keeps track of speculative values related to a specific

A \&{specnode} list keeps track of speculative values related to a specific

register or to all of main memory; we will discuss such lists in detail~later.

register or to all of main memory; we will discuss such lists in detail~later.

@s specnode_struct int

@s specnode_struct int

@=

@=

typedef struct {

typedef struct {

  octa o;

  octa o;

  struct specnode_struct *p;

  struct specnode_struct *p;

} spec;

} spec;

@#

@#

typedef struct specnode_struct {

typedef struct specnode_struct {

  octa o;

  octa o;

  bool known;

  bool known;

  octa addr;

  octa addr;

  struct specnode_struct *up,*down;

  struct specnode_struct *up,*down;

} specnode;

} specnode;

@ @=

@ @=

spec zero_spec; /* |zero_spec.o.h=zero_spec.o.l=0| and |zero_spec.p=NULL| */

spec zero_spec; /* |zero_spec.o.h=zero_spec.o.l=0| and |zero_spec.p=NULL| */

@ @=

@ @=

static void print_spec @,@,@[ARGS((spec))@];

static void print_spec @,@,@[ARGS((spec))@];

@ @=

@ @=

static void print_spec(s)

static void print_spec(s)

  spec s;

  spec s;

  if (!s.p) print_octa(s.o);

  if (!s.p) print_octa(s.o);

  else {

  else {

    printf(">");@+ print_specnode_id(s.p->addr);

    printf(">");@+ print_specnode_id(s.p->addr);

@#

@#

static void print_specnode(s)

static void print_specnode(s)

  specnode s;

  specnode s;

  if (s.known) {@+print_octa(s.o);@+printf("!");@+}

  if (s.known) {@+print_octa(s.o);@+printf("!");@+}

  else if (s.o.h || s.o.l) {@+print_octa(s.o);@+printf("?");@+}

  else if (s.o.h || s.o.l) {@+print_octa(s.o);@+printf("?");@+}

  else printf("?");

  else printf("?");

  print_specnode_id(s.addr);

  print_specnode_id(s.addr);

@ The analog of an automobile in our simulator is a block of data called

@ The analog of an automobile in our simulator is a block of data called

\&{control}, which represents all the relevant facts about an \MMIX\

\&{control}, which represents all the relevant facts about an \MMIX\

instruction.  We can think of it as the work order attached to a car's

instruction.  We can think of it as the work order attached to a car's

windshield. Each group of employees updates the work order as the car moves

windshield. Each group of employees updates the work order as the car moves

through the shop.

through the shop.

A \&{control} record contains the original location of an instruction,

A \&{control} record contains the original location of an instruction,

and its four bytes OP~X~Y~Z. An instruction has up to four inputs, which are

and its four bytes OP~X~Y~Z. An instruction has up to four inputs, which are

\&{spec} records called |y|, |z|, |b| and~|ra|; it also has up to three

\&{spec} records called |y|, |z|, |b| and~|ra|; it also has up to three

outputs, which are \&{specnode} records called |x|, |a|, and~|rl|.

outputs, which are \&{specnode} records called |x|, |a|, and~|rl|.

(We usually don't mention the special input~|ra| or the special output~|rl|,

(We usually don't mention the special input~|ra| or the special output~|rl|,

which refer to \.{MMIX}'s internal registers rA and~rL.) For example, the

which refer to \.{MMIX}'s internal registers rA and~rL.) For example, the

main inputs to a \.{DIVU} command are \$Y, \$Z, and~rD; the outputs are the

main inputs to a \.{DIVU} command are \$Y, \$Z, and~rD; the outputs are the

quotient~\$X and the remainder~rR. The inputs to a

quotient~\$X and the remainder~rR. The inputs to a

\.{STO} command are \$Y, \$Z, and~\$X; there is one ``output,'' and

\.{STO} command are \$Y, \$Z, and~\$X; there is one ``output,'' and

the field~|x.addr| will be set to the physical address of the memory location

the field~|x.addr| will be set to the physical address of the memory location

corresponding to virtual address $\rm \$Y+\$Z$.

corresponding to virtual address $\rm \$Y+\$Z$.

Each \&{control} block also points to the coroutine that owns it, if any.

Each \&{control} block also points to the coroutine that owns it, if any.

And it has various other fields that contain other tidbits of information;

And it has various other fields that contain other tidbits of information;

for example, we have already mentioned

for example, we have already mentioned

the |state|~field, which often governs a coroutine's actions. The |i|~field,

the |state|~field, which often governs a coroutine's actions. The |i|~field,

which contains an internal operation code number, is generally used together

which contains an internal operation code number, is generally used together

with |state| to switch between alternative computational steps. If, for

with |state| to switch between alternative computational steps. If, for

example, the |op|~field is \.{SUB} or \.{SUBI} or \.{NEG} or \.{NEGI},

example, the |op|~field is \.{SUB} or \.{SUBI} or \.{NEG} or \.{NEGI},

the internal opcode~|i| will be simply~|sub|.

the internal opcode~|i| will be simply~|sub|.

We shall define all the fields of \&{control} records

We shall define all the fields of \&{control} records

now and discuss them later.

now and discuss them later.

An actual hardware implementation of \MMIX\ wouldn't need all the information

An actual hardware implementation of \MMIX\ wouldn't need all the information

we are putting into a \&{control} block. Some of that information would

we are putting into a \&{control} block. Some of that information would

typically be latched between stages of a pipeline; other portions would

typically be latched between stages of a pipeline; other portions would

probably appear in so-called ``rename registers.''

probably appear in so-called ``rename registers.''

@^rename registers@>

@^rename registers@>

We simulate rename registers only indirectly,

We simulate rename registers only indirectly,

by counting how many registers of that

by counting how many registers of that

kind would be in use if we were mimicking low-level hardware details more

kind would be in use if we were mimicking low-level hardware details more

precisely. The |go| field is a \&{specnode} for convenience in programming,

precisely. The |go| field is a \&{specnode} for convenience in programming,

although we use only its |known| and |o| subfields. It generally contains

although we use only its |known| and |o| subfields. It generally contains

the address of the subsequent instruction.

the address of the subsequent instruction.

@s mmix_opcode int

@s mmix_opcode int

@s internal_opcode int

@s internal_opcode int

@=

@=

@@;

@@;

typedef struct control_struct {

typedef struct control_struct {

 octa loc; /* virtual address where an instruction originated */

 octa loc; /* virtual address where an instruction originated */

 mmix_opcode op;@+ unsigned char xx,yy,zz; /* the original instruction bytes */

 mmix_opcode op;@+ unsigned char xx,yy,zz; /* the original instruction bytes */

 spec y,z,b,ra; /* inputs */

 spec y,z,b,ra; /* inputs */

 specnode x,a,go,rl; /* outputs */

 specnode x,a,go,rl; /* outputs */

 coroutine *owner; /* a coroutine whose |ctl| this is */

 coroutine *owner; /* a coroutine whose |ctl| this is */

 internal_opcode i; /* internal opcode */

 internal_opcode i; /* internal opcode */

 int state; /* internal mindset */

 int state; /* internal mindset */

 bool usage; /* should rU be increased? */

 bool usage; /* should rU be increased? */

 bool need_b; /* should we stall until |b.p==NULL|? */

 bool need_b; /* should we stall until |b.p==NULL|? */

 bool need_ra; /* should we stall until |ra.p==NULL|? */

 bool need_ra; /* should we stall until |ra.p==NULL|? */

 bool ren_x; /* does |x| correspond to a rename register? */

 bool ren_x; /* does |x| correspond to a rename register? */

 bool mem_x; /* does |x| correspond to a memory write? */

 bool mem_x; /* does |x| correspond to a memory write? */

 bool ren_a; /* does |a| correspond to a rename register? */

 bool ren_a; /* does |a| correspond to a rename register? */

 bool set_l; /* does |rl| correspond to a new value of rL? */

 bool set_l; /* does |rl| correspond to a new value of rL? */

 bool interim; /* does this instruction need to be reissued on interrupt? */

 bool interim; /* does this instruction need to be reissued on interrupt? */

 unsigned int arith_exc; /* arithmetic exceptions for event bits of rA */

 unsigned int arith_exc; /* arithmetic exceptions for event bits of rA */

 unsigned int hist; /* history bits for use in branch prediction */

 unsigned int hist; /* history bits for use in branch prediction */

 int denin,denout; /* execution time penalties for subnormal handling */

 int denin,denout; /* execution time penalties for subnormal handling */

 octa cur_O,cur_S; /* speculative rO and rS before this instruction */

 octa cur_O,cur_S; /* speculative rO and rS before this instruction */

 unsigned int interrupt; /* does this instruction generate an interrupt? */

 unsigned int interrupt; /* does this instruction generate an interrupt? */

 void *ptr_a, *ptr_b, *ptr_c; /* generic pointers for miscellaneous use */

 void *ptr_a, *ptr_b, *ptr_c; /* generic pointers for miscellaneous use */

} control;

} control;

@ @=

@ @=

static void print_control_block @,@,@[ARGS((control*))@];

static void print_control_block @,@,@[ARGS((control*))@];

@ @=

@ @=

static void print_control_block(c)

static void print_control_block(c)

  control *c;

  control *c;

  octa default_go;

  octa default_go;

  if (c->loc.h || c->loc.l || c->op || c->xx || c->yy || c->zz || c->owner) {

  if (c->loc.h || c->loc.l || c->op || c->xx || c->yy || c->zz || c->owner) {

    print_octa(c->loc);

    print_octa(c->loc);

    printf(": %02x%02x%02x%02x(%s)",c->op,c->xx,c->yy,c->zz,

    printf(": %02x%02x%02x%02x(%s)",c->op,c->xx,c->yy,c->zz,

              internal_op_name[c->i]);

              internal_op_name[c->i]);

  if (c->usage) printf("*");

  if (c->usage) printf("*");

  if (c->interim) printf("+");

  if (c->interim) printf("+");

  if (c->y.o.h || c->y.o.l || c->y.p) {@+printf(" y=");@+print_spec(c->y);@+}

  if (c->y.o.h || c->y.o.l || c->y.p) {@+printf(" y=");@+print_spec(c->y);@+}

  if (c->z.o.h || c->z.o.l || c->z.p) {@+printf(" z=");@+print_spec(c->z);@+}

  if (c->z.o.h || c->z.o.l || c->z.p) {@+printf(" z=");@+print_spec(c->z);@+}

  if (c->b.o.h || c->b.o.l || c->b.p || c->need_b) {

  if (c->b.o.h || c->b.o.l || c->b.p || c->need_b) {

    printf(" b=");@+print_spec(c->b);

    printf(" b=");@+print_spec(c->b);

    if (c->need_b) printf("*");

    if (c->need_b) printf("*");

  if (c->need_ra) {@+printf(" rA=");@+print_spec(c->ra);@+}

  if (c->need_ra) {@+printf(" rA=");@+print_spec(c->ra);@+}

  if (c->ren_x || c->mem_x) {@+printf(" x=");@+print_specnode(c->x);@+}

  if (c->ren_x || c->mem_x) {@+printf(" x=");@+print_specnode(c->x);@+}

  else if (c->x.o.h || c->x.o.l) {

  else if (c->x.o.h || c->x.o.l) {

    printf(" x=");@+print_octa(c->x.o);@+printf("%c",c->x.known? '!': '?');

    printf(" x=");@+print_octa(c->x.o);@+printf("%c",c->x.known? '!': '?');

  if (c->ren_a) {@+printf(" a=");@+print_specnode(c->a);@+}

  if (c->ren_a) {@+printf(" a=");@+print_specnode(c->a);@+}

  if (c->set_l) {@+printf(" rL=");@+print_specnode(c->rl);@+}

  if (c->set_l) {@+printf(" rL=");@+print_specnode(c->rl);@+}

  if (c->interrupt) {@+printf(" int=");@+print_bits(c->interrupt);@+}

  if (c->interrupt) {@+printf(" int=");@+print_bits(c->interrupt);@+}

  if (c->arith_exc) {@+printf(" exc=");@+print_bits(c->arith_exc<<8);@+}

  if (c->arith_exc) {@+printf(" exc=");@+print_bits(c->arith_exc<<8);@+}

  default_go=incr(c->loc,4);

  default_go=incr(c->loc,4);

  if (c->go.o.l!=default_go.l || c->go.o.h!=default_go.h) {

  if (c->go.o.l!=default_go.l || c->go.o.h!=default_go.h) {

    printf(" ->");@+print_octa(c->go.o);

    printf(" ->");@+print_octa(c->go.o);

  if (verbose&show_pred_bit) printf(" hist=%x",c->hist);

  if (verbose&show_pred_bit) printf(" hist=%x",c->hist);

  if (c->i==pop) {

  if (c->i==pop) {

     printf(" rS="); print_octa(c->cur_S);

     printf(" rS="); print_octa(c->cur_S);

     printf(" rO="); print_octa(c->cur_O);

     printf(" rO="); print_octa(c->cur_O);

  printf(" state=%d",c->state);

  printf(" state=%d",c->state);

@* Lists. Here is a (boring) list of all the \MMIX\ opcodes, in order.

@* Lists. Here is a (boring) list of all the \MMIX\ opcodes, in order.

@=

@=

typedef enum{@/

typedef enum{@/

@!TRAP,@!FCMP,@!FUN,@!FEQL,@!FADD,@!FIX,@!FSUB,@!FIXU,@/

@!TRAP,@!FCMP,@!FUN,@!FEQL,@!FADD,@!FIX,@!FSUB,@!FIXU,@/

@!FLOT,@!FLOTI,@!FLOTU,@!FLOTUI,@!SFLOT,@!SFLOTI,@!SFLOTU,@!SFLOTUI,@/

@!FLOT,@!FLOTI,@!FLOTU,@!FLOTUI,@!SFLOT,@!SFLOTI,@!SFLOTU,@!SFLOTUI,@/

@!FMUL,@!FCMPE,@!FUNE,@!FEQLE,@!FDIV,@!FSQRT,@!FREM,@!FINT,@/

@!FMUL,@!FCMPE,@!FUNE,@!FEQLE,@!FDIV,@!FSQRT,@!FREM,@!FINT,@/

@!MUL,@!MULI,@!MULU,@!MULUI,@!DIV,@!DIVI,@!DIVU,@!DIVUI,@/

@!MUL,@!MULI,@!MULU,@!MULUI,@!DIV,@!DIVI,@!DIVU,@!DIVUI,@/

@!ADD,@!ADDI,@!ADDU,@!ADDUI,@!SUB,@!SUBI,@!SUBU,@!SUBUI,@/

@!ADD,@!ADDI,@!ADDU,@!ADDUI,@!SUB,@!SUBI,@!SUBU,@!SUBUI,@/

@!IIADDU,@!IIADDUI,@!IVADDU,@!IVADDUI,@!VIIIADDU,@!VIIIADDUI,@!XVIADDU,@!XVIADDUI,@/

@!IIADDU,@!IIADDUI,@!IVADDU,@!IVADDUI,@!VIIIADDU,@!VIIIADDUI,@!XVIADDU,@!XVIADDUI,@/

@!CMP,@!CMPI,@!CMPU,@!CMPUI,@!NEG,@!NEGI,@!NEGU,@!NEGUI,@/

@!CMP,@!CMPI,@!CMPU,@!CMPUI,@!NEG,@!NEGI,@!NEGU,@!NEGUI,@/

@!SL,@!SLI,@!SLU,@!SLUI,@!SR,@!SRI,@!SRU,@!SRUI,@/

@!SL,@!SLI,@!SLU,@!SLUI,@!SR,@!SRI,@!SRU,@!SRUI,@/

@!BN,@!BNB,@!BZ,@!BZB,@!BP,@!BPB,@!BOD,@!BODB,@/

@!BN,@!BNB,@!BZ,@!BZB,@!BP,@!BPB,@!BOD,@!BODB,@/

@!BNN,@!BNNB,@!BNZ,@!BNZB,@!BNP,@!BNPB,@!BEV,@!BEVB,@/

@!BNN,@!BNNB,@!BNZ,@!BNZB,@!BNP,@!BNPB,@!BEV,@!BEVB,@/

@!PBN,@!PBNB,@!PBZ,@!PBZB,@!PBP,@!PBPB,@!PBOD,@!PBODB,@/

@!PBN,@!PBNB,@!PBZ,@!PBZB,@!PBP,@!PBPB,@!PBOD,@!PBODB,@/

@!PBNN,@!PBNNB,@!PBNZ,@!PBNZB,@!PBNP,@!PBNPB,@!PBEV,@!PBEVB,@/

@!PBNN,@!PBNNB,@!PBNZ,@!PBNZB,@!PBNP,@!PBNPB,@!PBEV,@!PBEVB,@/

@!CSN,@!CSNI,@!CSZ,@!CSZI,@!CSP,@!CSPI,@!CSOD,@!CSODI,@/

@!CSN,@!CSNI,@!CSZ,@!CSZI,@!CSP,@!CSPI,@!CSOD,@!CSODI,@/

@!CSNN,@!CSNNI,@!CSNZ,@!CSNZI,@!CSNP,@!CSNPI,@!CSEV,@!CSEVI,@/

@!CSNN,@!CSNNI,@!CSNZ,@!CSNZI,@!CSNP,@!CSNPI,@!CSEV,@!CSEVI,@/

@!ZSN,@!ZSNI,@!ZSZ,@!ZSZI,@!ZSP,@!ZSPI,@!ZSOD,@!ZSODI,@/

@!ZSN,@!ZSNI,@!ZSZ,@!ZSZI,@!ZSP,@!ZSPI,@!ZSOD,@!ZSODI,@/

@!ZSNN,@!ZSNNI,@!ZSNZ,@!ZSNZI,@!ZSNP,@!ZSNPI,@!ZSEV,@!ZSEVI,@/

@!ZSNN,@!ZSNNI,@!ZSNZ,@!ZSNZI,@!ZSNP,@!ZSNPI,@!ZSEV,@!ZSEVI,@/

@!LDB,@!LDBI,@!LDBU,@!LDBUI,@!LDW,@!LDWI,@!LDWU,@!LDWUI,@/

@!LDB,@!LDBI,@!LDBU,@!LDBUI,@!LDW,@!LDWI,@!LDWU,@!LDWUI,@/

@!LDT,@!LDTI,@!LDTU,@!LDTUI,@!LDO,@!LDOI,@!LDOU,@!LDOUI,@/

@!LDT,@!LDTI,@!LDTU,@!LDTUI,@!LDO,@!LDOI,@!LDOU,@!LDOUI,@/

@!LDSF,@!LDSFI,@!LDHT,@!LDHTI,@!CSWAP,@!CSWAPI,@!LDUNC,@!LDUNCI,@/

@!LDSF,@!LDSFI,@!LDHT,@!LDHTI,@!CSWAP,@!CSWAPI,@!LDUNC,@!LDUNCI,@/

@!LDVTS,@!LDVTSI,@!PRELD,@!PRELDI,@!PREGO,@!PREGOI,@!GO,@!GOI,@/

@!LDVTS,@!LDVTSI,@!PRELD,@!PRELDI,@!PREGO,@!PREGOI,@!GO,@!GOI,@/

@!STB,@!STBI,@!STBU,@!STBUI,@!STW,@!STWI,@!STWU,@!STWUI,@/

@!STB,@!STBI,@!STBU,@!STBUI,@!STW,@!STWI,@!STWU,@!STWUI,@/

@!STT,@!STTI,@!STTU,@!STTUI,@!STO,@!STOI,@!STOU,@!STOUI,@/

@!STT,@!STTI,@!STTU,@!STTUI,@!STO,@!STOI,@!STOU,@!STOUI,@/

@!STSF,@!STSFI,@!STHT,@!STHTI,@!STCO,@!STCOI,@!STUNC,@!STUNCI,@/

@!STSF,@!STSFI,@!STHT,@!STHTI,@!STCO,@!STCOI,@!STUNC,@!STUNCI,@/

@!SYNCD,@!SYNCDI,@!PREST,@!PRESTI,@!SYNCID,@!SYNCIDI,@!PUSHGO,@!PUSHGOI,@/

@!SYNCD,@!SYNCDI,@!PREST,@!PRESTI,@!SYNCID,@!SYNCIDI,@!PUSHGO,@!PUSHGOI,@/

@!OR,@!ORI,@!ORN,@!ORNI,@!NOR,@!NORI,@!XOR,@!XORI,@/

@!OR,@!ORI,@!ORN,@!ORNI,@!NOR,@!NORI,@!XOR,@!XORI,@/

@!AND,@!ANDI,@!ANDN,@!ANDNI,@!NAND,@!NANDI,@!NXOR,@!NXORI,@/

@!AND,@!ANDI,@!ANDN,@!ANDNI,@!NAND,@!NANDI,@!NXOR,@!NXORI,@/

@!BDIF,@!BDIFI,@!WDIF,@!WDIFI,@!TDIF,@!TDIFI,@!ODIF,@!ODIFI,@/

@!BDIF,@!BDIFI,@!WDIF,@!WDIFI,@!TDIF,@!TDIFI,@!ODIF,@!ODIFI,@/

@!MUX,@!MUXI,@!SADD,@!SADDI,@!MOR,@!MORI,@!MXOR,@!MXORI,@/

@!MUX,@!MUXI,@!SADD,@!SADDI,@!MOR,@!MORI,@!MXOR,@!MXORI,@/

@!SETH,@!SETMH,@!SETML,@!SETL,@!INCH,@!INCMH,@!INCML,@!INCL,@/

@!SETH,@!SETMH,@!SETML,@!SETL,@!INCH,@!INCMH,@!INCML,@!INCL,@/

@!ORH,@!ORMH,@!ORML,@!ORL,@!ANDNH,@!ANDNMH,@!ANDNML,@!ANDNL,@/

@!ORH,@!ORMH,@!ORML,@!ORL,@!ANDNH,@!ANDNMH,@!ANDNML,@!ANDNL,@/

@!JMP,@!JMPB,@!PUSHJ,@!PUSHJB,@!GETA,@!GETAB,@!PUT,@!PUTI,@/

@!JMP,@!JMPB,@!PUSHJ,@!PUSHJB,@!GETA,@!GETAB,@!PUT,@!PUTI,@/

@!POP,@!RESUME,@!SAVE,@!UNSAVE,@!SYNC,@!SWYM,@!GET,@!TRIP}@+@!mmix_opcode;

@!POP,@!RESUME,@!SAVE,@!UNSAVE,@!SYNC,@!SWYM,@!GET,@!TRIP}@+@!mmix_opcode;

@ @=

@ @=

char *opcode_name[]={

char *opcode_name[]={

"TRAP","FCMP","FUN","FEQL","FADD","FIX","FSUB","FIXU",@/

"TRAP","FCMP","FUN","FEQL","FADD","FIX","FSUB","FIXU",@/

"FLOT","FLOTI","FLOTU","FLOTUI","SFLOT","SFLOTI","SFLOTU","SFLOTUI",@/

"FLOT","FLOTI","FLOTU","FLOTUI","SFLOT","SFLOTI","SFLOTU","SFLOTUI",@/

"FMUL","FCMPE","FUNE","FEQLE","FDIV","FSQRT","FREM","FINT",@/

"FMUL","FCMPE","FUNE","FEQLE","FDIV","FSQRT","FREM","FINT",@/

"MUL","MULI","MULU","MULUI","DIV","DIVI","DIVU","DIVUI",@/

"MUL","MULI","MULU","MULUI","DIV","DIVI","DIVU","DIVUI",@/

"ADD","ADDI","ADDU","ADDUI","SUB","SUBI","SUBU","SUBUI",@/

"ADD","ADDI","ADDU","ADDUI","SUB","SUBI","SUBU","SUBUI",@/

"2ADDU","2ADDUI","4ADDU","4ADDUI","8ADDU","8ADDUI","16ADDU","16ADDUI",@/

"2ADDU","2ADDUI","4ADDU","4ADDUI","8ADDU","8ADDUI","16ADDU","16ADDUI",@/

"CMP","CMPI","CMPU","CMPUI","NEG","NEGI","NEGU","NEGUI",@/

"CMP","CMPI","CMPU","CMPUI","NEG","NEGI","NEGU","NEGUI",@/

"SL","SLI","SLU","SLUI","SR","SRI","SRU","SRUI",@/

"SL","SLI","SLU","SLUI","SR","SRI","SRU","SRUI",@/

"BN","BNB","BZ","BZB","BP","BPB","BOD","BODB",@/

"BN","BNB","BZ","BZB","BP","BPB","BOD","BODB",@/

"BNN","BNNB","BNZ","BNZB","BNP","BNPB","BEV","BEVB",@/

"BNN","BNNB","BNZ","BNZB","BNP","BNPB","BEV","BEVB",@/

"PBN","PBNB","PBZ","PBZB","PBP","PBPB","PBOD","PBODB",@/

"PBN","PBNB","PBZ","PBZB","PBP","PBPB","PBOD","PBODB",@/

"PBNN","PBNNB","PBNZ","PBNZB","PBNP","PBNPB","PBEV","PBEVB",@/

"PBNN","PBNNB","PBNZ","PBNZB","PBNP","PBNPB","PBEV","PBEVB",@/

"CSN","CSNI","CSZ","CSZI","CSP","CSPI","CSOD","CSODI",@/

"CSN","CSNI","CSZ","CSZI","CSP","CSPI","CSOD","CSODI",@/

"CSNN","CSNNI","CSNZ","CSNZI","CSNP","CSNPI","CSEV","CSEVI",@/

"CSNN","CSNNI","CSNZ","CSNZI","CSNP","CSNPI","CSEV","CSEVI",@/

"ZSN","ZSNI","ZSZ","ZSZI","ZSP","ZSPI","ZSOD","ZSODI",@/

"ZSN","ZSNI","ZSZ","ZSZI","ZSP","ZSPI","ZSOD","ZSODI",@/

"ZSNN","ZSNNI","ZSNZ","ZSNZI","ZSNP","ZSNPI","ZSEV","ZSEVI",@/

"ZSNN","ZSNNI","ZSNZ","ZSNZI","ZSNP","ZSNPI","ZSEV","ZSEVI",@/

"LDB","LDBI","LDBU","LDBUI","LDW","LDWI","LDWU","LDWUI",@/

"LDB","LDBI","LDBU","LDBUI","LDW","LDWI","LDWU","LDWUI",@/

"LDT","LDTI","LDTU","LDTUI","LDO","LDOI","LDOU","LDOUI",@/

"LDT","LDTI","LDTU","LDTUI","LDO","LDOI","LDOU","LDOUI",@/

"LDSF","LDSFI","LDHT","LDHTI","CSWAP","CSWAPI","LDUNC","LDUNCI",@/

"LDSF","LDSFI","LDHT","LDHTI","CSWAP","CSWAPI","LDUNC","LDUNCI",@/

"LDVTS","LDVTSI","PRELD","PRELDI","PREGO","PREGOI","GO","GOI",@/

"LDVTS","LDVTSI","PRELD","PRELDI","PREGO","PREGOI","GO","GOI",@/

"STB","STBI","STBU","STBUI","STW","STWI","STWU","STWUI",@/

"STB","STBI","STBU","STBUI","STW","STWI","STWU","STWUI",@/

"STT","STTI","STTU","STTUI","STO","STOI","STOU","STOUI",@/

"STT","STTI","STTU","STTUI","STO","STOI","STOU","STOUI",@/

"STSF","STSFI","STHT","STHTI","STCO","STCOI","STUNC","STUNCI",@/

"STSF","STSFI","STHT","STHTI","STCO","STCOI","STUNC","STUNCI",@/

"SYNCD","SYNCDI","PREST","PRESTI","SYNCID","SYNCIDI","PUSHGO","PUSHGOI",@/

"SYNCD","SYNCDI","PREST","PRESTI","SYNCID","SYNCIDI","PUSHGO","PUSHGOI",@/

"OR","ORI","ORN","ORNI","NOR","NORI","XOR","XORI",@/

"OR","ORI","ORN","ORNI","NOR","NORI","XOR","XORI",@/

"AND","ANDI","ANDN","ANDNI","NAND","NANDI","NXOR","NXORI",@/

"AND","ANDI","ANDN","ANDNI","NAND","NANDI","NXOR","NXORI",@/

"BDIF","BDIFI","WDIF","WDIFI","TDIF","TDIFI","ODIF","ODIFI",@/

"BDIF","BDIFI","WDIF","WDIFI","TDIF","TDIFI","ODIF","ODIFI",@/

"MUX","MUXI","SADD","SADDI","MOR","MORI","MXOR","MXORI",@/

"MUX","MUXI","SADD","SADDI","MOR","MORI","MXOR","MXORI",@/

"SETH","SETMH","SETML","SETL","INCH","INCMH","INCML","INCL",@/

"SETH","SETMH","SETML","SETL","INCH","INCMH","INCML","INCL",@/

"ORH","ORMH","ORML","ORL","ANDNH","ANDNMH","ANDNML","ANDNL",@/

"ORH","ORMH","ORML","ORL","ANDNH","ANDNMH","ANDNML","ANDNL",@/

"JMP","JMPB","PUSHJ","PUSHJB","GETA","GETAB","PUT","PUTI",@/

"JMP","JMPB","PUSHJ","PUSHJB","GETA","GETAB","PUT","PUTI",@/

"POP","RESUME","SAVE","UNSAVE","SYNC","SWYM","GET","TRIP"};

"POP","RESUME","SAVE","UNSAVE","SYNC","SWYM","GET","TRIP"};

@ And here is a (likewise boring) list of all the internal opcodes.

@ And here is a (likewise boring) list of all the internal opcodes.

The smallest numbers, less than or equal to |max_pipe_op|, correspond

The smallest numbers, less than or equal to |max_pipe_op|, correspond

to operations for which arbitrary pipeline delays can be configured

to operations for which arbitrary pipeline delays can be configured

with |MMIX_config|. The largest numbers, greater than |max_real_command|,

with |MMIX_config|. The largest numbers, greater than |max_real_command|,

correspond to internally

correspond to internally

generated operations that have no official OP code; for example,

generated operations that have no official OP code; for example,

there are internal operations to shift the $\gamma$ pointer in the

there are internal operations to shift the $\gamma$ pointer in the

register stack, and to compute page table entries.

register stack, and to compute page table entries.

@=

@=

#define max_pipe_op feps

#define max_pipe_op feps

#define max_real_command trip

#define max_real_command trip

typedef enum{@/

typedef enum{@/

@!mul0, /* multiplication by zero */

@!mul0, /* multiplication by zero */

@!mul1, /* multiplication by 1--8 bits */

@!mul1, /* multiplication by 1--8 bits */

@!mul2, /* multiplication by 9--16 bits */

@!mul2, /* multiplication by 9--16 bits */

@!mul3, /* multiplication by 17--24 bits */

@!mul3, /* multiplication by 17--24 bits */

@!mul4, /* multiplication by 25--32 bits */

@!mul4, /* multiplication by 25--32 bits */

@!mul5, /* multiplication by 33--40 bits */

@!mul5, /* multiplication by 33--40 bits */

@!mul6, /* multiplication by 41--48 bits */

@!mul6, /* multiplication by 41--48 bits */

@!mul7, /* multiplication by 49--56 bits */

@!mul7, /* multiplication by 49--56 bits */

@!mul8, /* multiplication by 57--64 bits */

@!mul8, /* multiplication by 57--64 bits */

@!div, /* \.{DIV[U][I]} */

@!div, /* \.{DIV[U][I]} */

@!sh, /* \.{S[L,R][U][I]} */

@!sh, /* \.{S[L,R][U][I]} */

@!mux, /* \.{MUX[I]} */

@!mux, /* \.{MUX[I]} */

@!sadd, /* \.{SADD[I]} */

@!sadd, /* \.{SADD[I]} */

@!mor, /* \.{M[X]OR[I]} */

@!mor, /* \.{M[X]OR[I]} */

@!fadd, /* \.{FADD}, \.{FSUB} */

@!fadd, /* \.{FADD}, \.{FSUB} */

@!fmul, /* \.{FMUL} */

@!fmul, /* \.{FMUL} */

@!fdiv, /* \.{FDIV} */

@!fdiv, /* \.{FDIV} */

@!fsqrt, /* \.{FSQRT} */

@!fsqrt, /* \.{FSQRT} */

@!fint, /* \.{FINT} */

@!fint, /* \.{FINT} */

@!fix, /* \.{FIX[U]} */

@!fix, /* \.{FIX[U]} */

@!flot, /* \.{[S]FLOT[U][I]} */

@!flot, /* \.{[S]FLOT[U][I]} */

@!feps, /* \.{FCMPE}, \.{FUNE}, \.{FEQLE} */

@!feps, /* \.{FCMPE}, \.{FUNE}, \.{FEQLE} */

@!fcmp, /* \.{FCMP} */

@!fcmp, /* \.{FCMP} */

@!funeq, /* \.{FUN}, \.{FEQL} */

@!funeq, /* \.{FUN}, \.{FEQL} */

@!fsub, /* \.{FSUB} */

@!fsub, /* \.{FSUB} */

@!frem, /* \.{FREM} */

@!frem, /* \.{FREM} */

@!mul, /* \.{MUL[I]} */

@!mul, /* \.{MUL[I]} */

@!mulu, /* \.{MULU[I]} */

@!mulu, /* \.{MULU[I]} */

@!divu, /* \.{DIVU[I]} */

@!divu, /* \.{DIVU[I]} */

@!add, /* \.{ADD[I]} */

@!add, /* \.{ADD[I]} */

@!addu, /* \.{[2,4,8,16,]ADDU[I]}, \.{INC[M][H,L]} */

@!addu, /* \.{[2,4,8,16,]ADDU[I]}, \.{INC[M][H,L]} */

@!sub, /* \.{SUB[I]}, \.{NEG[I]} */

@!sub, /* \.{SUB[I]}, \.{NEG[I]} */

@!subu, /* \.{SUBU[I]}, \.{NEGU[I]} */

@!subu, /* \.{SUBU[I]}, \.{NEGU[I]} */

@!set, /* \.{SET[M][H,L]}, \.{GETA[B]} */

@!set, /* \.{SET[M][H,L]}, \.{GETA[B]} */

@!or, /* \.{OR[I]}, \.{OR[M][H,L]} */

@!or, /* \.{OR[I]}, \.{OR[M][H,L]} */

@!orn, /* \.{ORN[I]} */

@!orn, /* \.{ORN[I]} */

@!nor, /* \.{NOR[I]} */

@!nor, /* \.{NOR[I]} */

@!and, /* \.{AND[I]} */

@!and, /* \.{AND[I]} */

@!andn, /* \.{ANDN[I]}, \.{ANDN[M][H,L]} */

@!andn, /* \.{ANDN[I]}, \.{ANDN[M][H,L]} */

@!nand, /* \.{NAND[I]} */

@!nand, /* \.{NAND[I]} */

@!xor, /* \.{XOR[I]} */

@!xor, /* \.{XOR[I]} */

@!nxor, /* \.{NXOR[I]} */

@!nxor, /* \.{NXOR[I]} */

@!shlu, /* \.{SLU[I]} */

@!shlu, /* \.{SLU[I]} */

@!shru, /* \.{SRU[I]} */

@!shru, /* \.{SRU[I]} */

@!shl, /* \.{SL[I]} */

@!shl, /* \.{SL[I]} */

@!shr, /* \.{SR[I]} */

@!shr, /* \.{SR[I]} */

@!cmp, /* \.{CMP[I]} */

@!cmp, /* \.{CMP[I]} */

@!cmpu, /* \.{CMPU[I]} */

@!cmpu, /* \.{CMPU[I]} */

@!bdif, /* \.{BDIF[I]} */

@!bdif, /* \.{BDIF[I]} */

@!wdif, /* \.{WDIF[I]} */

@!wdif, /* \.{WDIF[I]} */

@!tdif, /* \.{TDIF[I]} */

@!tdif, /* \.{TDIF[I]} */

@!odif, /* \.{ODIF[I]} */

@!odif, /* \.{ODIF[I]} */

@!zset, /* \.{ZS[N][N,Z,P][I]}, \.{ZSEV[I]}, \.{ZSOD[I]} */

@!zset, /* \.{ZS[N][N,Z,P][I]}, \.{ZSEV[I]}, \.{ZSOD[I]} */

@!cset, /* \.{CS[N][N,Z,P][I]}, \.{CSEV[I]}, \.{CSOD[I]} */

@!cset, /* \.{CS[N][N,Z,P][I]}, \.{CSEV[I]}, \.{CSOD[I]} */

@!get, /* \.{GET} */

@!get, /* \.{GET} */

@!put, /* \.{PUT[I]} */

@!put, /* \.{PUT[I]} */

@!ld, /* \.{LD[B,W,T,O][U][I]}, \.{LDHT[I]}, \.{LDSF[I]} */

@!ld, /* \.{LD[B,W,T,O][U][I]}, \.{LDHT[I]}, \.{LDSF[I]} */

@!ldptp, /* load page table pointer */

@!ldptp, /* load page table pointer */

@!ldpte, /* load page table entry */

@!ldpte, /* load page table entry */

@!ldunc, /* \.{LDUNC[I]} */

@!ldunc, /* \.{LDUNC[I]} */

@!ldvts, /* \.{LDVTS[I]} */

@!ldvts, /* \.{LDVTS[I]} */

@!preld, /* \.{PRELD[I]} */

@!preld, /* \.{PRELD[I]} */

@!prest, /* \.{PREST[I]} */

@!prest, /* \.{PREST[I]} */

@!st, /* \.{STO[U][I]}, \.{STCO[I]}, \.{STUNC[I]} */

@!st, /* \.{STO[U][I]}, \.{STCO[I]}, \.{STUNC[I]} */

@!syncd, /* \.{SYNCD[I]} */

@!syncd, /* \.{SYNCD[I]} */

@!syncid, /* \.{SYNCID[I]} */

@!syncid, /* \.{SYNCID[I]} */

@!pst, /* \.{ST[B,W,T][U][I]}, \.{STHT[I]} */

@!pst, /* \.{ST[B,W,T][U][I]}, \.{STHT[I]} */

@!stunc, /* \.{STUNC[I]}, in write buffer */

@!stunc, /* \.{STUNC[I]}, in write buffer */

@!cswap, /* \.{CSWAP[I]} */

@!cswap, /* \.{CSWAP[I]} */

@!br, /* \.{B[N][N,Z,P][B]} */

@!br, /* \.{B[N][N,Z,P][B]} */

@!pbr, /* \.{PB[N][N,Z,P][B]} */

@!pbr, /* \.{PB[N][N,Z,P][B]} */

@!pushj, /* \.{PUSHJ[B]} */

@!pushj, /* \.{PUSHJ[B]} */

@!go, /* \.{GO[I]} */

@!go, /* \.{GO[I]} */

@!prego, /* \.{PREGO[I]} */

@!prego, /* \.{PREGO[I]} */

@!pushgo, /* \.{PUSHGO[I]} */

@!pushgo, /* \.{PUSHGO[I]} */

@!pop, /* \.{POP} */

@!pop, /* \.{POP} */

@!resume, /* \.{RESUME} */

@!resume, /* \.{RESUME} */

@!save, /* \.{SAVE} */

@!save, /* \.{SAVE} */

@!unsave, /* \.{UNSAVE} */

@!unsave, /* \.{UNSAVE} */

@!sync, /* \.{SYNC} */

@!sync, /* \.{SYNC} */

@!jmp, /* \.{JMP[B]} */

@!jmp, /* \.{JMP[B]} */

@!noop, /* \.{SWYM} */

@!noop, /* \.{SWYM} */

@!trap, /* \.{TRAP} */

@!trap, /* \.{TRAP} */

@!trip, /* \.{TRIP} */

@!trip, /* \.{TRIP} */

@!incgamma, /* increase $\gamma$ pointer */

@!incgamma, /* increase $\gamma$ pointer */

@!decgamma, /* decrease $\gamma$ pointer */

@!decgamma, /* decrease $\gamma$ pointer */

@!incrl, /* increase rL and $\beta$ */

@!incrl, /* increase rL and $\beta$ */

@!sav, /* intermediate stage of \.{SAVE} */

@!sav, /* intermediate stage of \.{SAVE} */

@!unsav, /* intermediate stage of \.{UNSAVE} */

@!unsav, /* intermediate stage of \.{UNSAVE} */

@!resum /* intermediate stage of \.{RESUME} */

@!resum /* intermediate stage of \.{RESUME} */

}@! internal_opcode;

}@! internal_opcode;

@ @=

@ @=

char *internal_op_name[]={

char *internal_op_name[]={

"mul0",

"mul0",

"mul1",

"mul1",

"mul2",

"mul2",

"mul3",

"mul3",

"mul4",

"mul4",

"mul5",

"mul5",

"mul6",

"mul6",

"mul7",

"mul7",

"mul8",

"mul8",

"div",

"div",

"sh",

"sh",

"mux",

"mux",

"sadd",

"sadd",

"mor",

"mor",

"fadd",

"fadd",

"fmul",

"fmul",

"fdiv",

"fdiv",

"fsqrt",

"fsqrt",

"fint",

"fint",

"fix",

"fix",

"flot",

"flot",

"feps",

"feps",

"fcmp",

"fcmp",

"funeq",

"funeq",

"fsub",

"fsub",

"frem",

"frem",

"mul",

"mul",

"mulu",

"mulu",

"divu",

"divu",

"add",

"add",

"addu",

"addu",

"sub",

"sub",

"subu",

"subu",

"set",

"set",

"or",

"or",

"orn",

"orn",

"nor",

"nor",

"and",

"and",

"andn",

"andn",

"nand",

"nand",

"xor",

"xor",

"nxor",

"nxor",

"shlu",

"shlu",

"shru",

"shru",

"shl",

"shl",

"shr",

"shr",

"cmp",

"cmp",

"cmpu",

"cmpu",

"bdif",

"bdif",

"wdif",

"wdif",

"tdif",

"tdif",

"odif",

"odif",

"zset",

"zset",

"cset",

"cset",

"get",

"get",

"put",

"put",

"ld",

"ld",

"ldptp",

"ldptp",

"ldpte",

"ldpte",

"ldunc",

"ldunc",

"ldvts",

"ldvts",

"preld",

"preld",

"prest",

"prest",

"st",

"st",

"syncd",

"syncd",

"syncid",

"syncid",

"pst",

"pst",

"stunc",

"stunc",

"cswap",

"cswap",

"br",

"br",

"pbr",

"pbr",

"pushj",

"pushj",

"go",

"go",

"prego",

"prego",

"pushgo",

"pushgo",

"pop",

"pop",

"resume",

"resume",

"save",

"save",

"unsave",

"unsave",

"sync",

"sync",

"jmp",

"jmp",

"noop",

"noop",

"trap",

"trap",

"trip",

"trip",

"incgamma",

"incgamma",

"decgamma",

"decgamma",

"incrl",

"incrl",

"sav",

"sav",

"unsav",

"unsav",

"resum"};

"resum"};

@ We need a table to convert the external opcodes to

@ We need a table to convert the external opcodes to

internal ones.

internal ones.

@=

@=

internal_opcode internal_op[256]={@/

internal_opcode internal_op[256]={@/

  trap,fcmp,funeq,funeq,fadd,fix,fsub,fix,@/

  trap,fcmp,funeq,funeq,fadd,fix,fsub,fix,@/

  flot,flot,flot,flot,flot,flot,flot,flot,@/

  flot,flot,flot,flot,flot,flot,flot,flot,@/

  fmul,feps,feps,feps,fdiv,fsqrt,frem,fint,@/

  fmul,feps,feps,feps,fdiv,fsqrt,frem,fint,@/

  mul,mul,mulu,mulu,div,div,divu,divu,@/

  mul,mul,mulu,mulu,div,div,divu,divu,@/

  add,add,addu,addu,sub,sub,subu,subu,@/

  add,add,addu,addu,sub,sub,subu,subu,@/

  addu,addu,addu,addu,addu,addu,addu,addu,@/

  addu,addu,addu,addu,addu,addu,addu,addu,@/

  cmp,cmp,cmpu,cmpu,sub,sub,subu,subu,@/

  cmp,cmp,cmpu,cmpu,sub,sub,subu,subu,@/

  shl,shl,shlu,shlu,shr,shr,shru,shru,@/

  shl,shl,shlu,shlu,shr,shr,shru,shru,@/

  br,br,br,br,br,br,br,br,@/

  br,br,br,br,br,br,br,br,@/

  br,br,br,br,br,br,br,br,@/

  br,br,br,br,br,br,br,br,@/

  pbr,pbr,pbr,pbr,pbr,pbr,pbr,pbr,@/

  pbr,pbr,pbr,pbr,pbr,pbr,pbr,pbr,@/

  pbr,pbr,pbr,pbr,pbr,pbr,pbr,pbr,@/

  pbr,pbr,pbr,pbr,pbr,pbr,pbr,pbr,@/

  cset,cset,cset,cset,cset,cset,cset,cset,@/

  cset,cset,cset,cset,cset,cset,cset,cset,@/

  cset,cset,cset,cset,cset,cset,cset,cset,@/

  cset,cset,cset,cset,cset,cset,cset,cset,@/

  zset,zset,zset,zset,zset,zset,zset,zset,@/

  zset,zset,zset,zset,zset,zset,zset,zset,@/

  zset,zset,zset,zset,zset,zset,zset,zset,@/

  zset,zset,zset,zset,zset,zset,zset,zset,@/

  ld,ld,ld,ld,ld,ld,ld,ld,@/

  ld,ld,ld,ld,ld,ld,ld,ld,@/

  ld,ld,ld,ld,ld,ld,ld,ld,@/

  ld,ld,ld,ld,ld,ld,ld,ld,@/

  ld,ld,ld,ld,cswap,cswap,ldunc,ldunc,@/

  ld,ld,ld,ld,cswap,cswap,ldunc,ldunc,@/

  ldvts,ldvts,preld,preld,prego,prego,go,go,@/

  ldvts,ldvts,preld,preld,prego,prego,go,go,@/

  pst,pst,pst,pst,pst,pst,pst,pst,@/

  pst,pst,pst,pst,pst,pst,pst,pst,@/

  pst,pst,pst,pst,st,st,st,st,@/

  pst,pst,pst,pst,st,st,st,st,@/

  pst,pst,pst,pst,st,st,st,st,@/

  pst,pst,pst,pst,st,st,st,st,@/

  syncd,syncd,prest,prest,syncid,syncid,pushgo,pushgo,@/

  syncd,syncd,prest,prest,syncid,syncid,pushgo,pushgo,@/

  or,or,orn,orn,nor,nor,xor,xor,@/

  or,or,orn,orn,nor,nor,xor,xor,@/

  and,and,andn,andn,nand,nand,nxor,nxor,@/

  and,and,andn,andn,nand,nand,nxor,nxor,@/

  bdif,bdif,wdif,wdif,tdif,tdif,odif,odif,@/

  bdif,bdif,wdif,wdif,tdif,tdif,odif,odif,@/

  mux,mux,sadd,sadd,mor,mor,mor,mor,@/

  mux,mux,sadd,sadd,mor,mor,mor,mor,@/

  set,set,set,set,addu,addu,addu,addu,@/

  set,set,set,set,addu,addu,addu,addu,@/

  or,or,or,or,andn,andn,andn,andn,@/

  or,or,or,or,andn,andn,andn,andn,@/

  jmp,jmp,pushj,pushj,set,set,put,put,@/

  jmp,jmp,pushj,pushj,set,set,put,put,@/

  pop,resume,save,unsave,sync,noop,get,trip};

  pop,resume,save,unsave,sync,noop,get,trip};

@ While we're into boring lists, we might as well define all the

@ While we're into boring lists, we might as well define all the

special register numbers, together with an inverse table for

special register numbers, together with an inverse table for

use in diagnostic outputs. These codes have been designed so that

use in diagnostic outputs. These codes have been designed so that

special registers 0--7 are unencumbered, 8--11 can't be \.{PUT} by anybody,

special registers 0--7 are unencumbered, 8--11 can't be \.{PUT} by anybody,

12--18 can't be \.{PUT} by the user. Pipeline delays might occur

12--18 can't be \.{PUT} by the user. Pipeline delays might occur

when \.{GET} is applied to special registers 21--31 or when

when \.{GET} is applied to special registers 21--31 or when

\.{PUT} is applied to special registers 15--20. The \.{SAVE} and

\.{PUT} is applied to special registers 15--20. The \.{SAVE} and

\.{UNSAVE} commands store and restore special registers 0--6 and 23--27.

\.{UNSAVE} commands store and restore special registers 0--6 and 23--27.

@=

@=

#define rA 21 /* arithmetic status register */

#define rA 21 /* arithmetic status register */

#define rB 0  /* bootstrap register (trip) */

#define rB 0  /* bootstrap register (trip) */

#define rC 8  /* cycle counter */

#define rC 8  /* cycle counter */

#define rD 1  /* dividend register */

#define rD 1  /* dividend register */

#define rE 2  /* epsilon register */

#define rE 2  /* epsilon register */

#define rF 22 /* failure location register */

#define rF 22 /* failure location register */

#define rG 19 /* global threshold register */

#define rG 19 /* global threshold register */

#define rH 3  /* himult register */

#define rH 3  /* himult register */

#define rI 12 /* interval counter */

#define rI 12 /* interval counter */

#define rJ 4  /* return-jump register */

#define rJ 4  /* return-jump register */

#define rK 15 /* interrupt mask register */

#define rK 15 /* interrupt mask register */

#define rL 20 /* local threshold register */

#define rL 20 /* local threshold register */

#define rM 5  /* multiplex mask register */

#define rM 5  /* multiplex mask register */

#define rN 9  /* serial number */

#define rN 9  /* serial number */

#define rO 10 /* register stack offset */

#define rO 10 /* register stack offset */

#define rP 23 /* prediction register */

#define rP 23 /* prediction register */

#define rQ 16 /* interrupt request register */

#define rQ 16 /* interrupt request register */

#define rR 6  /* remainder register */

#define rR 6  /* remainder register */

#define rS 11 /* register stack pointer */

#define rS 11 /* register stack pointer */

#define rT 13 /* trap address register */

#define rT 13 /* trap address register */

#define rU 17 /* usage counter */

#define rU 17 /* usage counter */

#define rV 18 /* virtual translation register */

#define rV 18 /* virtual translation register */

#define rW 24 /* where-interrupted register (trip) */

#define rW 24 /* where-interrupted register (trip) */

#define rX 25 /* execution register (trip) */

#define rX 25 /* execution register (trip) */

#define rY 26 /* Y operand (trip) */

#define rY 26 /* Y operand (trip) */

#define rZ 27 /* Z operand (trip) */

#define rZ 27 /* Z operand (trip) */

#define rBB 7  /* bootstrap register (trap) */

#define rBB 7  /* bootstrap register (trap) */

#define rTT 14 /* dynamic trap address register */

#define rTT 14 /* dynamic trap address register */

#define rWW 28 /* where-interrupted register (trap) */

#define rWW 28 /* where-interrupted register (trap) */

#define rXX 29 /* execution register (trap) */

#define rXX 29 /* execution register (trap) */

#define rYY 30 /* Y operand (trap) */

#define rYY 30 /* Y operand (trap) */

#define rZZ 31 /* Z operand (trap) */

#define rZZ 31 /* Z operand (trap) */

@ @=

@ @=

char *special_name[32]={"rB","rD","rE","rH","rJ","rM","rR","rBB",

char *special_name[32]={"rB","rD","rE","rH","rJ","rM","rR","rBB",

 "rC","rN","rO","rS","rI","rT","rTT","rK","rQ","rU","rV","rG","rL",

 "rC","rN","rO","rS","rI","rT","rTT","rK","rQ","rU","rV","rG","rL",

 "rA","rF","rP","rW","rX","rY","rZ","rWW","rXX","rYY","rZZ"};

 "rA","rF","rP","rW","rX","rY","rZ","rWW","rXX","rYY","rZZ"};

@ Here are the bit codes that affect trips and traps. The first eight

@ Here are the bit codes that affect trips and traps. The first eight

cases also apply to the upper half of~rQ; the next eight apply to~rA.

cases also apply to the upper half of~rQ; the next eight apply to~rA.

@d P_BIT (1<<0) /* instruction in privileged location */

@d P_BIT (1<<0) /* instruction in privileged location */

@d S_BIT (1<<1) /* security violation */

@d S_BIT (1<<1) /* security violation */

@d B_BIT (1<<2) /* instruction breaks the rules */

@d B_BIT (1<<2) /* instruction breaks the rules */

@d K_BIT (1<<3) /* instruction for kernel only */

@d K_BIT (1<<3) /* instruction for kernel only */

@d N_BIT (1<<4) /* virtual translation bypassed */

@d N_BIT (1<<4) /* virtual translation bypassed */

@d PX_BIT (1<<5) /* permission lacking to execute from page */

@d PX_BIT (1<<5) /* permission lacking to execute from page */

@d PW_BIT (1<<6) /* permission lacking to write on page */

@d PW_BIT (1<<6) /* permission lacking to write on page */

@d PR_BIT (1<<7) /* permission lacking to read from page */

@d PR_BIT (1<<7) /* permission lacking to read from page */

@d PROT_OFFSET 5 /* distance from |PR_BIT| to protection code position */

@d PROT_OFFSET 5 /* distance from |PR_BIT| to protection code position */

@d X_BIT (1<<8) /* floating inexact */

@d X_BIT (1<<8) /* floating inexact */

@d Z_BIT (1<<9) /* floating division by zero */

@d Z_BIT (1<<9) /* floating division by zero */

@d U_BIT (1<<10) /* floating underflow */

@d U_BIT (1<<10) /* floating underflow */

@d O_BIT (1<<11) /* floating overflow */

@d O_BIT (1<<11) /* floating overflow */

@d I_BIT (1<<12) /* floating invalid operation */

@d I_BIT (1<<12) /* floating invalid operation */

@d W_BIT (1<<13) /* float-to-fix overflow */

@d W_BIT (1<<13) /* float-to-fix overflow */

@d V_BIT (1<<14) /* integer overflow */

@d V_BIT (1<<14) /* integer overflow */

@d D_BIT (1<<15) /* integer divide check */

@d D_BIT (1<<15) /* integer divide check */

@d H_BIT (1<<16) /* trip handler bit */

@d H_BIT (1<<16) /* trip handler bit */

@d F_BIT (1<<17) /* forced trap bit */

@d F_BIT (1<<17) /* forced trap bit */

@d E_BIT (1<<18) /* external (dynamic) trap bit */

@d E_BIT (1<<18) /* external (dynamic) trap bit */

@=

@=

char bit_code_map[]="EFHDVWIOUZXrwxnkbsp";

char bit_code_map[]="EFHDVWIOUZXrwxnkbsp";

@ @=

@ @=

static void print_bits @,@,@[ARGS((int))@];

static void print_bits @,@,@[ARGS((int))@];

@ @=

@ @=

static void print_bits(x)

static void print_bits(x)

  int x;

  int x;

  register int b,j;

  register int b,j;

  for (j=0,b=E_BIT;(x&(b+b-1))&&b;j++,b>>=1)

  for (j=0,b=E_BIT;(x&(b+b-1))&&b;j++,b>>=1)

    if (x&b) printf("%c",bit_code_map[j]);

    if (x&b) printf("%c",bit_code_map[j]);

@ The lower half of rQ holds external interrupts of highest priority.

@ The lower half of rQ holds external interrupts of highest priority.

Most of them are implementation-dependent, but a few are defined in general.

Most of them are implementation-dependent, but a few are defined in general.

@=

@=

#define POWER_FAILURE (1<<0) /* try to shut down calmly and quickly */

#define POWER_FAILURE (1<<0) /* try to shut down calmly and quickly */

#define PARITY_ERROR (1<<1) /* try to save the file systems */

#define PARITY_ERROR (1<<1) /* try to save the file systems */

#define NONEXISTENT_MEMORY (1<<2) /* a memory address can't be used */

#define NONEXISTENT_MEMORY (1<<2) /* a memory address can't be used */

#define REBOOT_SIGNAL (1<<4) /* it's time to start over */

#define REBOOT_SIGNAL (1<<4) /* it's time to start over */

#define INTERVAL_TIMEOUT (1<<7) /* the timer register, rI, has reached zero */

#define INTERVAL_TIMEOUT (1<<7) /* the timer register, rI, has reached zero */

@* Dynamic speculation.

@* Dynamic speculation.

Now that we understand some basic low-level structures,

Now that we understand some basic low-level structures,

we're ready to look at the larger picture.

we're ready to look at the larger picture.

This simulator is based on the idea of ``dynamic scheduling with register

This simulator is based on the idea of ``dynamic scheduling with register

renaming,'' as introduced in the 1960s by R.~M. Tomasulo [{\sl IBM Journal

renaming,'' as introduced in the 1960s by R.~M. Tomasulo [{\sl IBM Journal

@^Tomasulo, Robert Marco@>

@^Tomasulo, Robert Marco@>

of Research and Development\/ \bf11} (1967), 25--33]. Moreover, the dynamic

of Research and Development\/ \bf11} (1967), 25--33]. Moreover, the dynamic

scheduling method is extended here to ``speculative execution,'' as

scheduling method is extended here to ``speculative execution,'' as

implemented in several processors of the 1990s and described in section~4.6 of

implemented in several processors of the 1990s and described in section~4.6 of

Hennessy and Patterson's {\sl Computer Architecture}, second edition (1995).

Hennessy and Patterson's {\sl Computer Architecture}, second edition (1995).

@^Hennessy, John LeRoy@>

@^Hennessy, John LeRoy@>

@^Patterson, David Andrew@>

@^Patterson, David Andrew@>

The essential idea is to keep track of the pipeline contents by recording all

The essential idea is to keep track of the pipeline contents by recording all

dependencies between unfinished computations in a queue called the {\it

dependencies between unfinished computations in a queue called the {\it

reorder buffer}. An entry in the reorder buffer might, for example, correspond

reorder buffer}. An entry in the reorder buffer might, for example, correspond

to an instruction that adds together two numbers whose values are still being

to an instruction that adds together two numbers whose values are still being

computed; those numbers have been allocated space in earlier positions of the

computed; those numbers have been allocated space in earlier positions of the

reorder buffer. The addition will take place as soon as both of its operands

reorder buffer. The addition will take place as soon as both of its operands

are known, but the sum won't be written immediately into the destination

are known, but the sum won't be written immediately into the destination

register. It will stay in the reorder buffer until reaching the {\it hot

register. It will stay in the reorder buffer until reaching the {\it hot

seat\/} at the front of the queue. Finally, the addition leaves the

seat\/} at the front of the queue. Finally, the addition leaves the

hot seat and is said to be {\it committed}.

hot seat and is said to be {\it committed}.

Some instructions in the reorder buffer may in fact be executed only

Some instructions in the reorder buffer may in fact be executed only

on speculation, meaning that they won't really be called for unless a prior

on speculation, meaning that they won't really be called for unless a prior

branch instruction has the predicted outcome. Indeed, we can say that

branch instruction has the predicted outcome. Indeed, we can say that

all instructions not yet in the hot seat are being executed speculatively,

all instructions not yet in the hot seat are being executed speculatively,

because an external interrupt might occur at any time and change the entire

because an external interrupt might occur at any time and change the entire

course of computation. Organizing the pipeline as a reorder buffer allows us

course of computation. Organizing the pipeline as a reorder buffer allows us

to look ahead and keep busy computing values that have a good chance of being

to look ahead and keep busy computing values that have a good chance of being

needed later, instead of waiting for slow instructions or slow memory

needed later, instead of waiting for slow instructions or slow memory

references to be completed.

references to be completed.

The reorder buffer is in fact a queue of \&{control} records, conceptually

The reorder buffer is in fact a queue of \&{control} records, conceptually

forming part of a circle of such records inside the simulator, corresponding

forming part of a circle of such records inside the simulator, corresponding

to all instructions that have been dispatched or {\it issued\/} but not yet

to all instructions that have been dispatched or {\it issued\/} but not yet

committed, in strict program order.

committed, in strict program order.

The best way to get an understanding of speculative execution is perhaps to

The best way to get an understanding of speculative execution is perhaps to

imagine that the reorder buffer is large enough to hold hundreds of

imagine that the reorder buffer is large enough to hold hundreds of

instructions in various stages of execution, and to think of an implementation

instructions in various stages of execution, and to think of an implementation

of \MMIX\ that has dozens of functional units---more than would ever actually

of \MMIX\ that has dozens of functional units---more than would ever actually

@^thinking big@>

@^thinking big@>

be built into a chip. Then one can readily visualize the kinds of control

be built into a chip. Then one can readily visualize the kinds of control

structures and checks that must be made to ensure correct execution. Without

structures and checks that must be made to ensure correct execution. Without

such a broad viewpoint, a programmer or hardware designer will be inclined to

such a broad viewpoint, a programmer or hardware designer will be inclined to

think only of the simple cases and to devise algorithms that lack the proper

think only of the simple cases and to devise algorithms that lack the proper

generality. Thus we have a somewhat paradoxical situation in which a difficult

generality. Thus we have a somewhat paradoxical situation in which a difficult

general problem turns out to be easier to solve than its simpler special cases,

general problem turns out to be easier to solve than its simpler special cases,

because it enforces clarity of thinking.

because it enforces clarity of thinking.

Instructions that have completed execution and have not yet been committed are

Instructions that have completed execution and have not yet been committed are

analogous to cars that have gone through our hypothetical repair shop and are

analogous to cars that have gone through our hypothetical repair shop and are

waiting for their owners to pick them up. However, all analogies break down,

waiting for their owners to pick them up. However, all analogies break down,

and the world of automobiles does not have a natural counterpart for the

and the world of automobiles does not have a natural counterpart for the

notion of speculative execution. That notion corresponds roughly to situations

notion of speculative execution. That notion corresponds roughly to situations

in which people are led to believe that their cars need a new piece of

in which people are led to believe that their cars need a new piece of

equipment, but they suddenly change their mind once they see the price tag,

equipment, but they suddenly change their mind once they see the price tag,

and they insist on having the equipment removed even after it has been

and they insist on having the equipment removed even after it has been

partially or completely installed.

partially or completely installed.

Speculatively executed instructions might make no sense: They might divide

Speculatively executed instructions might make no sense: They might divide

by zero or refer to protected memory areas, etc. Such anomalies are not

by zero or refer to protected memory areas, etc. Such anomalies are not

considered catastrophic or even exceptional until the instruction reaches the

considered catastrophic or even exceptional until the instruction reaches the

hot~seat.

hot~seat.

The person who designs a computer with speculative execution is an optimist,

The person who designs a computer with speculative execution is an optimist,

who has faith that the vast majority of the machine's predictions will come

who has faith that the vast majority of the machine's predictions will come

true. The person who designs a reliable implementation of such a computer

true. The person who designs a reliable implementation of such a computer

is a pessimist, who understands that all predictions might come to naught.

is a pessimist, who understands that all predictions might come to naught.

The pessimist does, however, take pains to optimize the cases that do turn out

The pessimist does, however, take pains to optimize the cases that do turn out

well.

well.

@ Let's consider what happens to a single instruction, say

@ Let's consider what happens to a single instruction, say

\.{ADD} \.{\$1,\$2,\$3}, as it travels through the pipeline in a normal

\.{ADD} \.{\$1,\$2,\$3}, as it travels through the pipeline in a normal

situation. The first time this instruction is encountered, it is placed into

situation. The first time this instruction is encountered, it is placed into

the I-cache (that is, the instruction cache), so that we won't have to access

the I-cache (that is, the instruction cache), so that we won't have to access

memory when we need to perform it again. We will assume for simplicity in this

memory when we need to perform it again. We will assume for simplicity in this

discussion that each I-cache access takes one clock cycle, although other

discussion that each I-cache access takes one clock cycle, although other

possibilities are allowed by |MMIX_config|.

possibilities are allowed by |MMIX_config|.

Suppose the simulated machine fetches the example \.{ADD} instruction

Suppose the simulated machine fetches the example \.{ADD} instruction

at time 1000. Fetching is done by a coroutine whose |stage| number is~0.

at time 1000. Fetching is done by a coroutine whose |stage| number is~0.

A cache block typically contains 8 or 16 instructions. The fetch unit

A cache block typically contains 8 or 16 instructions. The fetch unit

of our machine is able to fetch up to |fetch_max| instructions on each clock

of our machine is able to fetch up to |fetch_max| instructions on each clock

cycle and place them in the fetch buffer, provided that there is room in the

cycle and place them in the fetch buffer, provided that there is room in the

buffer and that all the instructions belong to the same cache block.

buffer and that all the instructions belong to the same cache block.

The dispatch unit of our simulator is able to issue up to |dispatch_max|

The dispatch unit of our simulator is able to issue up to |dispatch_max|

instructions on each clock cycle and move them from the fetch buffer to the

instructions on each clock cycle and move them from the fetch buffer to the

reorder buffer, provided that functional units are available for those

reorder buffer, provided that functional units are available for those

instructions and there is room in the reorder buffer. A functional unit that

instructions and there is room in the reorder buffer. A functional unit that

handles \.{ADD} is usually called an ALU (arithmetic logic unit), and our

handles \.{ADD} is usually called an ALU (arithmetic logic unit), and our

simulated machine might have several of them. If they aren't all stalled

simulated machine might have several of them. If they aren't all stalled

in stage~1 of their pipelines, and if the reorder buffer isn't full, and if

in stage~1 of their pipelines, and if the reorder buffer isn't full, and if

the machine isn't in the process of deissuing instructions that were

the machine isn't in the process of deissuing instructions that were

mispredicted, and if

mispredicted, and if

fewer than |dispatch_max| instructions are ahead of the \.{ADD} in the fetch

fewer than |dispatch_max| instructions are ahead of the \.{ADD} in the fetch

buffer, and if all such prior instructions can be issued without using up all

buffer, and if all such prior instructions can be issued without using up all

the free ALUs, our \.{ADD} instruction will be issued at time 1001.

the free ALUs, our \.{ADD} instruction will be issued at time 1001.

(In fact, all of these conditions are usually true.)

(In fact, all of these conditions are usually true.)

We assume that $\rm L>3$, so that \$1, \$2, and~\$3 are local registers.

We assume that $\rm L>3$, so that \$1, \$2, and~\$3 are local registers.

For simplicity we'll assume in fact that the register stack is empty, so that

For simplicity we'll assume in fact that the register stack is empty, so that

the \.{ADD} instruction is supposed to set $\rm l[1]\gets l[2]+l[3]$. The

the \.{ADD} instruction is supposed to set $\rm l[1]\gets l[2]+l[3]$. The

operands l[2] and~l[3] might not be known at time 1001; they are \&{spec}

operands l[2] and~l[3] might not be known at time 1001; they are \&{spec}

values, which might point to \&{specnode} entries in the reorder buffer for

values, which might point to \&{specnode} entries in the reorder buffer for

previous instructions whose destinations are l[2] and~l[3].

previous instructions whose destinations are l[2] and~l[3].

The dispatcher fills the next available control block of the reorder buffer

The dispatcher fills the next available control block of the reorder buffer

with information for the \.{ADD}, containing appropriate \&{spec} values

with information for the \.{ADD}, containing appropriate \&{spec} values

corresponding to l[2] and~l[3] in its |y| and~|z| fields. The |x|~field of

corresponding to l[2] and~l[3] in its |y| and~|z| fields. The |x|~field of

this control block will be inserted into a doubly linked list of \&{specnode}

this control block will be inserted into a doubly linked list of \&{specnode}

records, corresponding to l[1] and to all instructions in the reorder buffer

records, corresponding to l[1] and to all instructions in the reorder buffer

that have l[1] as a destination. The boolean value |x.known| will be set to

that have l[1] as a destination. The boolean value |x.known| will be set to

|false|, meaning that this speculative value still needs to be

|false|, meaning that this speculative value still needs to be

computed. Subsequent instructions that need l[1] as a source will point to

computed. Subsequent instructions that need l[1] as a source will point to

|x|, if they are issued before the sum |x.o| has been computed. Double

|x|, if they are issued before the sum |x.o| has been computed. Double

linking is used in the \&{specnode} list because the \.{ADD} instruction might

linking is used in the \&{specnode} list because the \.{ADD} instruction might

be cancelled before it is finally committed; thus deletions might occur

be cancelled before it is finally committed; thus deletions might occur

at either end of the list for~l[1].

at either end of the list for~l[1].

At time 1002, the ALU handling the \.{ADD} will stall if its inputs |y|

At time 1002, the ALU handling the \.{ADD} will stall if its inputs |y|

and~|z| are not both known (namely if |y.p!=NULL| or |z.p!=NULL|).

and~|z| are not both known (namely if |y.p!=NULL| or |z.p!=NULL|).

In fact, it will also stall if its third input rA is not known;

In fact, it will also stall if its third input rA is not known;

the current speculative value of rA, except for its event bits,

the current speculative value of rA, except for its event bits,

is represented in the |ra|~field of the control block, and we must

is represented in the |ra|~field of the control block, and we must

have |ra.p==NULL|. In such a case the ALU will look to see if the

have |ra.p==NULL|. In such a case the ALU will look to see if the

\&{spec} values pointed to by |y.p| and/or |z.p| and/or |ra.p| become

\&{spec} values pointed to by |y.p| and/or |z.p| and/or |ra.p| become

defined on this clock cycle, and it will update its own input values

defined on this clock cycle, and it will update its own input values

accordingly.

accordingly.

But let's assume that |y|, |z|, and |ra| are already known at time 1002.

But let's assume that |y|, |z|, and |ra| are already known at time 1002.

Then |x.o| will be set to |y.o+z.o| and |x.known| will become~|true|.

Then |x.o| will be set to |y.o+z.o| and |x.known| will become~|true|.

This will make the result destined for~l[1] available to be used in other

This will make the result destined for~l[1] available to be used in other

commands at time~1003.

commands at time~1003.

If no overflow occurs when adding |y.o| to |z.o|, the |interrupt| and

If no overflow occurs when adding |y.o| to |z.o|, the |interrupt| and

|arith_exc| fields of the control block for \.{ADD} are set to zero.  But when

|arith_exc| fields of the control block for \.{ADD} are set to zero.  But when

overflow does occur (shudder), there are two cases, based on the V-enable bit

overflow does occur (shudder), there are two cases, based on the V-enable bit

of rA, which is found in field |b.o| of the control block. If this bit is~0,

of rA, which is found in field |b.o| of the control block. If this bit is~0,

the V-bit of the |arith_exc| field in the control block is set to~1; the

the V-bit of the |arith_exc| field in the control block is set to~1; the

|arith_exc| field will be ored into~rA when the \.{ADD} instruction is

|arith_exc| field will be ored into~rA when the \.{ADD} instruction is

eventually committed.  But if the V-enable bit is~1, the trip handler should

eventually committed.  But if the V-enable bit is~1, the trip handler should

be called, interrupting the normal sequence. In such a case, the |interrupt|

be called, interrupting the normal sequence. In such a case, the |interrupt|

field of the control block is set to specify a trip, and the fetcher and

field of the control block is set to specify a trip, and the fetcher and

dispatcher are told to forget what they have been doing; all instructions

dispatcher are told to forget what they have been doing; all instructions

following the \.{ADD} in the reorder buffer must now be deissued. The virtual starting

following the \.{ADD} in the reorder buffer must now be deissued. The virtual starting

address of the overflow trip handler, namely location~32, is hastily passed to

address of the overflow trip handler, namely location~32, is hastily passed to

the fetch routine, and instructions will be fetched from that location

the fetch routine, and instructions will be fetched from that location

as soon as possible. (Of course the overflow and the trip handler are

as soon as possible. (Of course the overflow and the trip handler are

still speculative until the \.{ADD} instruction is committed. Other exceptional

still speculative until the \.{ADD} instruction is committed. Other exceptional

conditions might cause the \.{ADD} itself to be terminated before it

conditions might cause the \.{ADD} itself to be terminated before it

gets to the hot seat. But the pipeline keeps charging ahead, always trying to

gets to the hot seat. But the pipeline keeps charging ahead, always trying to

guess the most probable outcome.)

guess the most probable outcome.)

The commission unit of this simulator is able to commit and/or deissue up to

The commission unit of this simulator is able to commit and/or deissue up to

|commit_max| instructions on each clock cycle. With luck, fewer than

|commit_max| instructions on each clock cycle. With luck, fewer than

|commit_max| instructions will be ahead of our \.{ADD} instruction at

|commit_max| instructions will be ahead of our \.{ADD} instruction at

time~1003, and they will all be completed normally. Then l[1]~can be set

time~1003, and they will all be completed normally. Then l[1]~can be set

to |x.o|, and the event bits of~rA can be updated from |arith_exc|,

to |x.o|, and the event bits of~rA can be updated from |arith_exc|,

and the \.{ADD} command can pass through the hot seat and out of the

and the \.{ADD} command can pass through the hot seat and out of the

reorder buffer.

reorder buffer.

@=

@=

Extern int fetch_max, dispatch_max, peekahead, commit_max;

Extern int fetch_max, dispatch_max, peekahead, commit_max;

 /* limits on instructions that can be handled per clock cycle */

 /* limits on instructions that can be handled per clock cycle */

@ The instruction currently occupying the hot seat is the only

@ The instruction currently occupying the hot seat is the only

issued-but-not-yet-committed instruction that is guaranteed to be truly

issued-but-not-yet-committed instruction that is guaranteed to be truly

essential to the machine's computation. All other instructions in the reorder

essential to the machine's computation. All other instructions in the reorder

buffer are being executed on speculation; if they prove to be needed, well and

buffer are being executed on speculation; if they prove to be needed, well and

good, but we might want to jettison them all if, say, an external interrupt

good, but we might want to jettison them all if, say, an external interrupt

occurs.

occurs.

Thus all instructions that change the global state in complicated ways---like

Thus all instructions that change the global state in complicated ways---like

\.{LDVTS}, which changes the virtual address translation caches---are

\.{LDVTS}, which changes the virtual address translation caches---are

performed only when they reach the hot seat. Fortunately the vast majority

performed only when they reach the hot seat. Fortunately the vast majority

of instructions are sufficiently simple that we can deal with them more

of instructions are sufficiently simple that we can deal with them more

efficiently while other computations are taking place.

efficiently while other computations are taking place.

In this implementation the reorder buffer is simply housed in an array of

In this implementation the reorder buffer is simply housed in an array of

control records. The first array element is |reorder_bot|, and the last is

control records. The first array element is |reorder_bot|, and the last is

|reorder_top|. Variable |hot| points to the control block in the hot seat, and

|reorder_top|. Variable |hot| points to the control block in the hot seat, and

|hot-1| to its predecessor, etc. Variable |cool| points to the next control

|hot-1| to its predecessor, etc. Variable |cool| points to the next control

block that will be filled in the reorder buffer. If |hot==cool| the reorder

block that will be filled in the reorder buffer. If |hot==cool| the reorder

buffer is empty; otherwise it contains the control records |hot|, |hot-1|,

buffer is empty; otherwise it contains the control records |hot|, |hot-1|,

\dots,~|cool+1|, except of course that we wrap around from |reorder_bot| to

\dots,~|cool+1|, except of course that we wrap around from |reorder_bot| to

|reorder_top| when moving down in the buffer.

|reorder_top| when moving down in the buffer.

@=

@=

Extern control *reorder_bot, *reorder_top; /* least and greatest

Extern control *reorder_bot, *reorder_top; /* least and greatest

                   entries in the ring containing the reorder buffer */

                   entries in the ring containing the reorder buffer */

Extern control *hot, *cool; /* front and rear of the reorder buffer */

Extern control *hot, *cool; /* front and rear of the reorder buffer */

Extern control *old_hot; /* value of |hot| at beginning of cycle */

Extern control *old_hot; /* value of |hot| at beginning of cycle */

Extern int deissues; /* the number of instructions that need to be deissued */

Extern int deissues; /* the number of instructions that need to be deissued */

@ @=

@ @=

hot=cool=reorder_top;

hot=cool=reorder_top;

deissues=0;

deissues=0;

@ @=

@ @=

static void print_reorder_buffer @,@,@[ARGS((void))@];

static void print_reorder_buffer @,@,@[ARGS((void))@];

@ @=

@ @=

static void print_reorder_buffer()

static void print_reorder_buffer()

  printf("Reorder buffer");

  printf("Reorder buffer");

  if (hot==cool) printf(" (empty)\n");

  if (hot==cool) printf(" (empty)\n");

  else {@+register control *p;

  else {@+register control *p;

    if (deissues) printf(" (%d to be deissued)",deissues);

    if (deissues) printf(" (%d to be deissued)",deissues);

    if (doing_interrupt) printf(" (interrupt state %d)",doing_interrupt);

    if (doing_interrupt) printf(" (interrupt state %d)",doing_interrupt);

    printf(":\n");

    printf(":\n");

    for (p=hot;p!=cool; p=(p==reorder_bot? reorder_top: p-1)) {

    for (p=hot;p!=cool; p=(p==reorder_bot? reorder_top: p-1)) {

      print_control_block(p);

      print_control_block(p);

      if (p->owner) {

      if (p->owner) {

        printf(" ");@+ print_coroutine_id(p->owner);

        printf(" ");@+ print_coroutine_id(p->owner);

      printf("\n");

      printf("\n");

  printf(" %d available rename register%s, %d memory slot%s\n",

  printf(" %d available rename register%s, %d memory slot%s\n",

     rename_regs, rename_regs!=1? "s": "",

     rename_regs, rename_regs!=1? "s": "",

     mem_slots, mem_slots!=1? "s": "");

     mem_slots, mem_slots!=1? "s": "");

@ Here is an overview of what happens on each clock cycle.

@ Here is an overview of what happens on each clock cycle.

@=

@=

@;

@;

  dispatch_count=0;

  dispatch_count=0;

  old_hot=hot; /* remember the hot seat position at beginning of cycle */

  old_hot=hot; /* remember the hot seat position at beginning of cycle */

  old_tail=tail; /* remember the fetch buffer contents at beginning of cycle */

  old_tail=tail; /* remember the fetch buffer contents at beginning of cycle */

  suppress_dispatch=(deissues || dispatch_lock);

  suppress_dispatch=(deissues || dispatch_lock);

  if (doing_interrupt) @@;

  if (doing_interrupt) @@;

  else @;

  else @;

@;

@;

  if (!suppress_dispatch) @;

  if (!suppress_dispatch) @;

  ticks=incr(ticks,1); /* and the beat moves on */

  ticks=incr(ticks,1); /* and the beat moves on */

  dispatch_stat[dispatch_count]++;

  dispatch_stat[dispatch_count]++;

@ @=

@ @=

int dispatch_count; /* how many dispatched on this cycle */

int dispatch_count; /* how many dispatched on this cycle */

bool suppress_dispatch; /* should dispatching be bypassed? */

bool suppress_dispatch; /* should dispatching be bypassed? */

int doing_interrupt; /* how many cycles of interrupt preparations remain */

int doing_interrupt; /* how many cycles of interrupt preparations remain */

lockvar dispatch_lock; /* lock to prevent instruction issues */

lockvar dispatch_lock; /* lock to prevent instruction issues */

@ @=

@ @=

Extern int *dispatch_stat;

Extern int *dispatch_stat;

  /* how often did we dispatch 0, 1, ... instructions? */

  /* how often did we dispatch 0, 1, ... instructions? */

Extern bool security_disabled; /* omit security checks for testing purposes? */

Extern bool security_disabled; /* omit security checks for testing purposes? */

@ @=

@ @=

  for (m=commit_max;m>0 && deissues>0; m--)

  for (m=commit_max;m>0 && deissues>0; m--)

@;

@;

  for (;m>0;m--) {

  for (;m>0;m--) {

    if (hot==cool) break; /* reorder buffer is empty */

    if (hot==cool) break; /* reorder buffer is empty */

    if (!security_disabled) @;

    if (!security_disabled) @;

    if (hot->owner) break; /* hot seat instruction isn't finished */

    if (hot->owner) break; /* hot seat instruction isn't finished */

@;

@;

    i=hot->i;

    i=hot->i;

    if (hot==reorder_bot) hot=reorder_top;

    if (hot==reorder_bot) hot=reorder_top;

    else hot--;

    else hot--;

    if (i==resum) break; /* allow the resumed instruction to see the new rK */

    if (i==resum) break; /* allow the resumed instruction to see the new rK */

@* The dispatch stage. It would be nice to present the parts of this simulator

@* The dispatch stage. It would be nice to present the parts of this simulator

by dealing with the fetching, dispatching, executing, and committing

by dealing with the fetching, dispatching, executing, and committing

stages in that order. After all, instructions are first fetched,

stages in that order. After all, instructions are first fetched,

then dispatched, then executed, and finally committed.

then dispatched, then executed, and finally committed.

However, the fetch stage depends heavily on difficult questions of

However, the fetch stage depends heavily on difficult questions of

memory management that are best deferred until we have looked at

memory management that are best deferred until we have looked at

the simpler parts of simulation. Therefore we will take our initial

the simpler parts of simulation. Therefore we will take our initial

plunge into the details of this program by looking first at the dispatch phase,

plunge into the details of this program by looking first at the dispatch phase,

assuming that instructions have somehow appeared magically in the fetch buffer.

assuming that instructions have somehow appeared magically in the fetch buffer.

The fetch buffer, like the circular priority queue of all coroutines

The fetch buffer, like the circular priority queue of all coroutines

and the circular queue used for the reorder buffer, lives in an

and the circular queue used for the reorder buffer, lives in an

array that is best regarded as a ring of elements. The elements

array that is best regarded as a ring of elements. The elements

are structures of type \&{fetch}, which have five fields:

are structures of type \&{fetch}, which have five fields:

A 32-bit |inst|, which is an \MMIX\ instruction; a 64-bit |loc|,

A 32-bit |inst|, which is an \MMIX\ instruction; a 64-bit |loc|,

which is the virtual address of that instruction; an |interrupt| field,

which is the virtual address of that instruction; an |interrupt| field,

which is nonzero if, for example, the protection bits in the relevant page

which is nonzero if, for example, the protection bits in the relevant page

table entry for this address do not permit execution access; a boolean

table entry for this address do not permit execution access; a boolean

|noted| field, which becomes |true| after the dispatch unit has peeked

|noted| field, which becomes |true| after the dispatch unit has peeked

at the instruction to see whether it is a jump or probable branch;

at the instruction to see whether it is a jump or probable branch;

and a |hist| field, which records the recent branch history.

and a |hist| field, which records the recent branch history.

(The least significant bits of~|hist| correspond to the most recent branches.)

(The least significant bits of~|hist| correspond to the most recent branches.)

@=

@=

typedef struct {

typedef struct {

  octa loc; /* virtual address of instruction */

  octa loc; /* virtual address of instruction */

  tetra inst; /* the instruction itself */

  tetra inst; /* the instruction itself */

  unsigned int interrupt; /* bit codes that might cause interruption */

  unsigned int interrupt; /* bit codes that might cause interruption */

  bool noted; /* have we peeked at this instruction? */

  bool noted; /* have we peeked at this instruction? */

  unsigned int hist; /* if we peeked, this was the |peek_hist| */

  unsigned int hist; /* if we peeked, this was the |peek_hist| */

} fetch;

} fetch;

@ The oldest and youngest entries in the fetch buffer are pointed

@ The oldest and youngest entries in the fetch buffer are pointed

to by |head| and |tail|, just as the oldest and youngest entries in the

to by |head| and |tail|, just as the oldest and youngest entries in the

reorder buffer are called |hot| and |cool|. The fetch coroutine will

reorder buffer are called |hot| and |cool|. The fetch coroutine will

be adding entries at the |tail| position, which starts at |old_tail|

be adding entries at the |tail| position, which starts at |old_tail|

when a cycle begins, in parallel with the actions simulated by

when a cycle begins, in parallel with the actions simulated by

the dispatcher. Therefore the dispatcher is allowed to look only at

the dispatcher. Therefore the dispatcher is allowed to look only at

instructions in |head|, |head-1|, \dots,~|old_tail+1|, although a few

instructions in |head|, |head-1|, \dots,~|old_tail+1|, although a few

more recently fetched instructions will usually be present in the fetch

more recently fetched instructions will usually be present in the fetch

buffer by the time this part of the program is executed.

buffer by the time this part of the program is executed.

@=

@=

Extern fetch *fetch_bot, *fetch_top; /* least and greatest

Extern fetch *fetch_bot, *fetch_top; /* least and greatest

                   entries in the ring containing the fetch buffer */

                   entries in the ring containing the fetch buffer */

Extern fetch *head, *tail; /* front and rear of the fetch buffer */

Extern fetch *head, *tail; /* front and rear of the fetch buffer */

@ @=

@ @=

fetch *old_tail; /* rear of the fetch buffer available on the current cycle */

fetch *old_tail; /* rear of the fetch buffer available on the current cycle */

@ @d UNKNOWN_SPEC ((specnode*)1)

@ @d UNKNOWN_SPEC ((specnode*)1)

@=

@=

head=tail=fetch_top;

head=tail=fetch_top;

inst_ptr.p=UNKNOWN_SPEC;

inst_ptr.p=UNKNOWN_SPEC;

@ @=

@ @=

static void print_fetch_buffer @,@,@[ARGS((void))@];

static void print_fetch_buffer @,@,@[ARGS((void))@];

@ @=

@ @=

static void print_fetch_buffer()

static void print_fetch_buffer()

  printf("Fetch buffer");

  printf("Fetch buffer");

  if (head==tail) printf(" (empty)\n");

  if (head==tail) printf(" (empty)\n");

  else {@+register fetch *p;

  else {@+register fetch *p;

    if (resuming) printf(" (resumption state %d)",resuming);

    if (resuming) printf(" (resumption state %d)",resuming);

    printf(":\n");

    printf(":\n");

    for (p=head;p!=tail; p=(p==fetch_bot? fetch_top: p-1)) {

    for (p=head;p!=tail; p=(p==fetch_bot? fetch_top: p-1)) {

      print_octa(p->loc);

      print_octa(p->loc);

      printf(": %08x(%s)",p->inst,opcode_name[p->inst>>24]);

      printf(": %08x(%s)",p->inst,opcode_name[p->inst>>24]);

      if (p->interrupt) print_bits(p->interrupt);

      if (p->interrupt) print_bits(p->interrupt);

      if (p->noted) printf("*");

      if (p->noted) printf("*");

      printf("\n");

      printf("\n");

  printf("Instruction pointer is ");

  printf("Instruction pointer is ");

  if (inst_ptr.p==NULL) print_octa(inst_ptr.o);

  if (inst_ptr.p==NULL) print_octa(inst_ptr.o);

  else {

  else {

    printf("waiting for ");

    printf("waiting for ");

    if (inst_ptr.p==UNKNOWN_SPEC) printf("dispatch");

    if (inst_ptr.p==UNKNOWN_SPEC) printf("dispatch");

    else if (inst_ptr.p->addr.h==(tetra)-1)

    else if (inst_ptr.p->addr.h==(tetra)-1)

      print_coroutine_id(((control*)inst_ptr.p->up)->owner);

      print_coroutine_id(((control*)inst_ptr.p->up)->owner);

    else print_specnode_id(inst_ptr.p->addr);

    else print_specnode_id(inst_ptr.p->addr);

  printf("\n");

  printf("\n");

@ The best way to understand the dispatching process is once again

@ The best way to understand the dispatching process is once again

to ``think big,'' by imagining a huge fetch buffer and the

to ``think big,'' by imagining a huge fetch buffer and the

@^thinking big@>

@^thinking big@>

potential ability to issue dozens of instructions per cycle, although

potential ability to issue dozens of instructions per cycle, although

the actual numbers are typically quite small.

the actual numbers are typically quite small.

If the fetch buffer is not empty after |dispatch_max| instructions have

If the fetch buffer is not empty after |dispatch_max| instructions have

been dispatched, the dispatcher also looks at up to |peekahead| further

been dispatched, the dispatcher also looks at up to |peekahead| further

instructions to see if they are jumps or other commands that change the

instructions to see if they are jumps or other commands that change the

flow of control. Much of this action would happen in parallel on a

flow of control. Much of this action would happen in parallel on a

real machine, but our simulator works sequentially.

real machine, but our simulator works sequentially.

In the following program, |true_head| records the head of the fetch buffer as

In the following program, |true_head| records the head of the fetch buffer as

instructions are actually dispatched, while |head| refers to the position

instructions are actually dispatched, while |head| refers to the position

currently being examined (possibly peeking into the future).

currently being examined (possibly peeking into the future).

If the fetch buffer is empty at the beginning of the current clock

If the fetch buffer is empty at the beginning of the current clock

cycle, a ``dispatch bypass'' allows the dispatcher to issue the

cycle, a ``dispatch bypass'' allows the dispatcher to issue the

first instruction that enters the fetch buffer on this cycle. Otherwise

first instruction that enters the fetch buffer on this cycle. Otherwise

the dispatcher is restricted to previously fetched instructions.

the dispatcher is restricted to previously fetched instructions.

@s func int

@s func int

@=

@=

{@+register fetch *true_head, *new_head;

{@+register fetch *true_head, *new_head;

  true_head=head;

  true_head=head;

  if (head==old_tail && head!=tail)

  if (head==old_tail && head!=tail)

    old_tail=(head==fetch_bot? fetch_top: head-1);

    old_tail=(head==fetch_bot? fetch_top: head-1);

  peek_hist=cool_hist;

  peek_hist=cool_hist;

  for (j=0;j

  for (j=0;j

              to dispatch it if |j;

              to dispatch it if |j;

  head=true_head;

  head=true_head;

@ @=

@ @=

  register mmix_opcode op;

  register mmix_opcode op;

  register int yz,f;

  register int yz,f;

  register bool freeze_dispatch=false;

  register bool freeze_dispatch=false;

  register func *u=NULL;

  register func *u=NULL;

  if (head==old_tail) break; /* fetch buffer empty */

  if (head==old_tail) break; /* fetch buffer empty */

  if (head==fetch_bot) new_head=fetch_top;@+else new_head=head-1;

  if (head==fetch_bot) new_head=fetch_top;@+else new_head=head-1;

  op=head->inst>>24; @+yz=head->inst&0xffff;

  op=head->inst>>24; @+yz=head->inst&0xffff;

@;

@;

@;

@;

  if (f&rel_addr_bit) @;

  if (f&rel_addr_bit) @;

  if (head->noted) peek_hist=head->hist;

  if (head->noted) peek_hist=head->hist;

  else @;

  else @;

  if (j>=dispatch_max || dispatch_lock || nullifying) {

  if (j>=dispatch_max || dispatch_lock || nullifying) {

    head=new_head;@+ continue; /* can't dispatch, but can peek ahead */

    head=new_head;@+ continue; /* can't dispatch, but can peek ahead */

  if (cool==reorder_bot) new_cool=reorder_top;@+else new_cool=cool-1;

  if (cool==reorder_bot) new_cool=reorder_top;@+else new_cool=cool-1;

    otherwise |goto stall|@>;

    otherwise |goto stall|@>;

@;

@;

@;

@;

  if ((op&0xe0)==0x40) @;

  if ((op&0xe0)==0x40) @;

@;

@;

  cool=new_cool;@+ cool_O=new_O;@+ cool_S=new_S;

  cool=new_cool;@+ cool_O=new_O;@+ cool_S=new_S;

  cool_hist=peek_hist;@+ continue;

  cool_hist=peek_hist;@+ continue;

stall: @

stall: @

    and |break|@>;

    and |break|@>;

@ An instruction can be dispatched only if a functional unit

@ An instruction can be dispatched only if a functional unit

is available to handle it. A functional unit consists of a 256-bit

is available to handle it. A functional unit consists of a 256-bit

vector that specifies a subset of \MMIX's opcodes, and an array

vector that specifies a subset of \MMIX's opcodes, and an array

of coroutines for the pipeline stages. There are $k$ coroutines in the

of coroutines for the pipeline stages. There are $k$ coroutines in the

array, where $k$ is the maximum number of stages needed by any of the opcodes

array, where $k$ is the maximum number of stages needed by any of the opcodes

supported.

supported.

@=

@=

typedef struct func_struct{

typedef struct func_struct{

  char name[16]; /* symbolic designation */

  char name[16]; /* symbolic designation */

  tetra ops[8]; /* big-endian bitmap for the opcodes supported */

  tetra ops[8]; /* big-endian bitmap for the opcodes supported */

  int k; /* number of pipeline stages */

  int k; /* number of pipeline stages */

  coroutine *co; /* pointer to the first of $k$ consecutive coroutines */

  coroutine *co; /* pointer to the first of $k$ consecutive coroutines */

} @!func;

} @!func;

@ @=

@ @=

Extern func *funit; /* pointer to array of functional units */

Extern func *funit; /* pointer to array of functional units */

Extern int funit_count; /* the number of functional units */

Extern int funit_count; /* the number of functional units */

@ It is convenient to have

@ It is convenient to have

a 256-bit vector of all the supported opcodes, because we need to

a 256-bit vector of all the supported opcodes, because we need to

shut off a lot of special actions when an opcode is not supported.

shut off a lot of special actions when an opcode is not supported.

@=

@=

control *new_cool; /* the reorder position following |cool| */

control *new_cool; /* the reorder position following |cool| */

int resuming; /* set nonzero if resuming an interrupted instruction */

int resuming; /* set nonzero if resuming an interrupted instruction */

tetra support[8]; /* big-endian bitmap for all opcodes supported */

tetra support[8]; /* big-endian bitmap for all opcodes supported */

@ @=

@ @=

{@+register func *u;

{@+register func *u;

  for (u=funit;u<=funit+funit_count;u++)

  for (u=funit;u<=funit+funit_count;u++)

    for (i=0;i<8;i++) support[i] |= u->ops[i];

    for (i=0;i<8;i++) support[i] |= u->ops[i];

@ @d sign_bit ((unsigned)0x80000000)

@ @d sign_bit ((unsigned)0x80000000)

@=

@=

if (!(support[op>>5]&(sign_bit>>(op&31)))) {

if (!(support[op>>5]&(sign_bit>>(op&31)))) {

  /* oops, this opcode isn't supported by any function unit */

  /* oops, this opcode isn't supported by any function unit */

  f=flags[TRAP], i=trap;

  f=flags[TRAP], i=trap;

}@+else f=flags[op], i=internal_op[op];

}@+else f=flags[op], i=internal_op[op];

if (i==trip && (head->loc.h&sign_bit)) f=0,i=noop;

if (i==trip && (head->loc.h&sign_bit)) f=0,i=noop;

@ @=

@ @=

if (cool->interim) {

if (cool->interim) {

  cool->usage=false;

  cool->usage=false;

  if (cool->op==SAVE) @@;

  if (cool->op==SAVE) @@;

  else if (cool->op==UNSAVE) @@;

  else if (cool->op==UNSAVE) @@;

  else if (cool->i==preld || cool->i==prest)

  else if (cool->i==preld || cool->i==prest)

@@;

@@;

  else if (cool->i==prego) @@;

  else if (cool->i==prego) @@;

else if (cool->i<=max_real_command) {

else if (cool->i<=max_real_command) {

  if ((flags[cool->op]&ctl_change_bit)||cool->i==pbr)

  if ((flags[cool->op]&ctl_change_bit)||cool->i==pbr)

    if (inst_ptr.p==NULL && (inst_ptr.o.h&sign_bit) && !(cool->loc.h&sign_bit)

    if (inst_ptr.p==NULL && (inst_ptr.o.h&sign_bit) && !(cool->loc.h&sign_bit)

           && cool->i!=trap)

           && cool->i!=trap)

      cool->interrupt|=P_BIT; /* jumping from nonnegative to negative */

      cool->interrupt|=P_BIT; /* jumping from nonnegative to negative */

  true_head=head=new_head; /* delete instruction from fetch buffer */

  true_head=head=new_head; /* delete instruction from fetch buffer */

  resuming=0;

  resuming=0;

if (freeze_dispatch) set_lock(u->co,dispatch_lock);

if (freeze_dispatch) set_lock(u->co,dispatch_lock);

cool->owner=u->co;@+ u->co->ctl=cool;

cool->owner=u->co;@+ u->co->ctl=cool;

startup(u->co,1); /* schedule execution of the new inst */

startup(u->co,1); /* schedule execution of the new inst */

if (verbose&issue_bit) {

if (verbose&issue_bit) {

  printf("Issuing ");@+print_control_block(cool);

  printf("Issuing ");@+print_control_block(cool);

  printf(" ");@+print_coroutine_id(u->co);@+printf("\n");

  printf(" ");@+print_coroutine_id(u->co);@+printf("\n");

dispatch_count++;

dispatch_count++;

@ We assign the first functional unit that supports |op| and is

@ We assign the first functional unit that supports |op| and is

totally unoccupied, if possible; otherwise we assign the first

totally unoccupied, if possible; otherwise we assign the first

functional unit that supports |op| and has stage~1 unoccupied.

functional unit that supports |op| and has stage~1 unoccupied.

@=

@=

{@+register int t=op>>5, b=sign_bit>>(op&31);

{@+register int t=op>>5, b=sign_bit>>(op&31);

  if (cool->i==trap && op!=TRAP) { /* opcode needs to be emulated */

  if (cool->i==trap && op!=TRAP) { /* opcode needs to be emulated */

    u=funit+funit_count; /* this unit supports just \.{TRIP} and \.{TRAP} */

    u=funit+funit_count; /* this unit supports just \.{TRIP} and \.{TRAP} */

    goto unit_found;

    goto unit_found;

  for (u=funit;u<=funit+funit_count;u++) if (u->ops[t]&b) {

  for (u=funit;u<=funit+funit_count;u++) if (u->ops[t]&b) {

    for (i=0;ik;i++) if (u->co[i].next) goto unit_busy;

    for (i=0;ik;i++) if (u->co[i].next) goto unit_busy;

    goto unit_found;

    goto unit_found;

  unit_busy: ;

  unit_busy: ;

  for (u=funit;u

  for (u=funit;u

    if ((u->ops[t]&b) && (u->co->next==NULL)) goto unit_found;

    if ((u->ops[t]&b) && (u->co->next==NULL)) goto unit_found;

  goto stall; /* all units for this |op| are busy */

  goto stall; /* all units for this |op| are busy */

unit_found:

unit_found:

@ The |flags| table records special properties of each operation code

@ The |flags| table records special properties of each operation code

in binary notation: \Hex{1}~means Z~is an immediate value, \Hex{2}~means rZ is

in binary notation: \Hex{1}~means Z~is an immediate value, \Hex{2}~means rZ is

a source operand, \Hex{4}~means Y~is an immediate value, \Hex{8}~means rY is a

a source operand, \Hex{4}~means Y~is an immediate value, \Hex{8}~means rY is a

source operand, \Hex{10}~means rX is a source operand, \Hex{20}~means

source operand, \Hex{10}~means rX is a source operand, \Hex{20}~means

rX is a destination, \Hex{40}~means YZ is part of a relative address,

rX is a destination, \Hex{40}~means YZ is part of a relative address,

\Hex{80}~means the control changes at this point.

\Hex{80}~means the control changes at this point.

@d X_is_dest_bit 0x20

@d X_is_dest_bit 0x20

@d rel_addr_bit 0x40

@d rel_addr_bit 0x40

@d ctl_change_bit 0x80

@d ctl_change_bit 0x80

@=

@=

unsigned char flags[256]={

unsigned char flags[256]={

0x8a, 0x2a, 0x2a, 0x2a, 0x2a, 0x26, 0x2a, 0x26, /* \.{TRAP}, \dots\ */

0x8a, 0x2a, 0x2a, 0x2a, 0x2a, 0x26, 0x2a, 0x26, /* \.{TRAP}, \dots\ */

0x26, 0x25, 0x26, 0x25, 0x26, 0x25, 0x26, 0x25, /* \.{FLOT}, \dots\ */

0x26, 0x25, 0x26, 0x25, 0x26, 0x25, 0x26, 0x25, /* \.{FLOT}, \dots\ */

0x2a, 0x2a, 0x2a, 0x2a, 0x2a, 0x26, 0x2a, 0x26, /* \.{FMUL}, \dots\ */

0x2a, 0x2a, 0x2a, 0x2a, 0x2a, 0x26, 0x2a, 0x26, /* \.{FMUL}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{MUL}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{MUL}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ADD}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ADD}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{2ADDU}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{2ADDU}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x26, 0x25, 0x26, 0x25, /* \.{CMP}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x26, 0x25, 0x26, 0x25, /* \.{CMP}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{SL}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{SL}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{BN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{BN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{BNN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{BNN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{PBN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{PBN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{PBNN}, \dots\ */

0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, 0x50, /* \.{PBNN}, \dots\ */

0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, /* \.{CSN}, \dots\ */

0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, /* \.{CSN}, \dots\ */

0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, /* \.{CSNN}, \dots\ */

0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, 0x3a, 0x39, /* \.{CSNN}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ZSN}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ZSN}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ZSNN}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{ZSNN}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{LDB}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{LDB}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{LDT}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{LDT}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x1a, 0x19, 0x2a, 0x29, /* \.{LDSF}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x1a, 0x19, 0x2a, 0x29, /* \.{LDSF}, \dots\ */

0x2a, 0x29, 0x0a, 0x09, 0x0a, 0x09, 0xaa, 0xa9, /* \.{LDVTS}, \dots\ */

0x2a, 0x29, 0x0a, 0x09, 0x0a, 0x09, 0xaa, 0xa9, /* \.{LDVTS}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, /* \.{STB}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, /* \.{STB}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, /* \.{STT}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, 0x1a, 0x19, /* \.{STT}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x0a, 0x09, 0x1a, 0x19, /* \.{STSF}, \dots\ */

0x1a, 0x19, 0x1a, 0x19, 0x0a, 0x09, 0x1a, 0x19, /* \.{STSF}, \dots\ */

0x0a, 0x09, 0x0a, 0x09, 0x0a, 0x09, 0xaa, 0xa9, /* \.{SYNCD}, \dots\ */

0x0a, 0x09, 0x0a, 0x09, 0x0a, 0x09, 0xaa, 0xa9, /* \.{SYNCD}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{OR}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{OR}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{AND}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{AND}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{BDIF}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{BDIF}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{MUX}, \dots\ */

0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, 0x2a, 0x29, /* \.{MUX}, \dots\ */

0x20, 0x20, 0x20, 0x20, 0x30, 0x30, 0x30, 0x30, /* \.{SETH}, \dots\ */

0x20, 0x20, 0x20, 0x20, 0x30, 0x30, 0x30, 0x30, /* \.{SETH}, \dots\ */

0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, /* \.{ORH}, \dots\ */

0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, /* \.{ORH}, \dots\ */

0xc0, 0xc0, 0xe0, 0xe0, 0x60, 0x60, 0x02, 0x01, /* \.{JMP}, \dots\ */

0xc0, 0xc0, 0xe0, 0xe0, 0x60, 0x60, 0x02, 0x01, /* \.{JMP}, \dots\ */

0x80, 0x80, 0x00, 0x02, 0x01, 0x00, 0x20, 0x8a}; /* \.{POP}, \dots\ */

0x80, 0x80, 0x00, 0x02, 0x01, 0x00, 0x20, 0x8a}; /* \.{POP}, \dots\ */

@ @=

@ @=

  if (i==jmp) yz=head->inst&0xffffff;

  if (i==jmp) yz=head->inst&0xffffff;

  if (op&1) yz-=(i==jmp? 0x1000000: 0x10000);

  if (op&1) yz-=(i==jmp? 0x1000000: 0x10000);

  cool->y.o=incr(head->loc,4), cool->y.p=NULL;

  cool->y.o=incr(head->loc,4), cool->y.p=NULL;

  cool->z.o=incr(head->loc,yz<<2), cool->z.p=NULL;

  cool->z.o=incr(head->loc,yz<<2), cool->z.p=NULL;

@ The location of the next instruction to be fetched is in a \&{spec} variable

@ The location of the next instruction to be fetched is in a \&{spec} variable

called |inst_ptr|. A slightly tricky optimization of the \.{POP} instruction

called |inst_ptr|. A slightly tricky optimization of the \.{POP} instruction

is made in the common case that the speculative value of~rJ is known.

is made in the common case that the speculative value of~rJ is known.

@=

@=

{@+register int predicted=0;

{@+register int predicted=0;

  if ((op&0xe0)==0x40) @;

  if ((op&0xe0)==0x40) @;

  head->noted=true;

  head->noted=true;

  head->hist=peek_hist;

  head->hist=peek_hist;

  if (predicted||(f&ctl_change_bit) || (i==syncid&&!(cool->loc.h&sign_bit))) {

  if (predicted||(f&ctl_change_bit) || (i==syncid&&!(cool->loc.h&sign_bit))) {

    old_tail=tail=new_head; /* discard all remaining fetches */

    old_tail=tail=new_head; /* discard all remaining fetches */

@;

@;

    switch (i) {

    switch (i) {

 case jmp: case br: case pbr: case pushj: inst_ptr=cool->z;@+ break;

 case jmp: case br: case pbr: case pushj: inst_ptr=cool->z;@+ break;

 case pop:@+if (g[rJ].up->known &&

 case pop:@+if (g[rJ].up->known &&

      inst_ptr.o=incr(g[rJ].up->o,yz<<2), inst_ptr.p=NULL;@+break;

      inst_ptr.o=incr(g[rJ].up->o,yz<<2), inst_ptr.p=NULL;@+break;

      } /* otherwise fall through, will wait on |cool->go| */

      } /* otherwise fall through, will wait on |cool->go| */

 case go: case pushgo: case trap: case resume: case syncid:

 case go: case pushgo: case trap: case resume: case syncid:

    inst_ptr.p=UNKNOWN_SPEC;@+ break;

    inst_ptr.p=UNKNOWN_SPEC;@+ break;

 case trip: inst_ptr=zero_spec;@+ break;

 case trip: inst_ptr=zero_spec;@+ break;

@ At any given time the simulated machine is in two main states, the

@ At any given time the simulated machine is in two main states, the

``hot state'' corresponding to instructions that have been committed and the

``hot state'' corresponding to instructions that have been committed and the

``cool state'' corresponding to all the speculative changes currently

``cool state'' corresponding to all the speculative changes currently

being considered. The dispatcher works with cool instructions and puts them

being considered. The dispatcher works with cool instructions and puts them

into the reorder buffer, where they gradually get warmer and warmer.

into the reorder buffer, where they gradually get warmer and warmer.

Intermediate instructions, between |hot| and |cool|, have intermediate

Intermediate instructions, between |hot| and |cool|, have intermediate

temperatures.

temperatures.

A machine register like l[101] or g[250] is represented by a specnode whose

A machine register like l[101] or g[250] is represented by a specnode whose

|o|~field is the current hot value of the register. If the |up| and |down|

|o|~field is the current hot value of the register. If the |up| and |down|

fields of this specnode point to the node itself,

fields of this specnode point to the node itself,

the hot and cool values of the register are

the hot and cool values of the register are

identical. Otherwise |up| and |down| are pointers to the coolest and hottest

identical. Otherwise |up| and |down| are pointers to the coolest and hottest

ends of a doubly linked list of specnodes, representing intermediate

ends of a doubly linked list of specnodes, representing intermediate

speculative values (sometimes called ``rename registers'').

speculative values (sometimes called ``rename registers'').

@^rename registers@>

@^rename registers@>

The rename registers are implemented as the |x| or~|a| specnodes inside control

The rename registers are implemented as the |x| or~|a| specnodes inside control

blocks, for speculative instructions that use this register as a

blocks, for speculative instructions that use this register as a

destination. Speculative instructions that use the register as a

destination. Speculative instructions that use the register as a

source operand point to the next-hottest specnode on the list, until

source operand point to the next-hottest specnode on the list, until

the value becomes known. The doubly linked list of specnodes is an

the value becomes known. The doubly linked list of specnodes is an

input-restricted deque: A node is inserted at the cool end when the

input-restricted deque: A node is inserted at the cool end when the

dispatcher issues an instruction with this register as destination;

dispatcher issues an instruction with this register as destination;

a node is removed from the cool end if an instruction needs to be deissued;

a node is removed from the cool end if an instruction needs to be deissued;

a node is removed from the hot end when an instruction is committed.

a node is removed from the hot end when an instruction is committed.

The special registers rA, rB, \dots\ occupy the same array as the

The special registers rA, rB, \dots\ occupy the same array as the

global registers g[32], g[33], \dots~\thinspace. For example,

global registers g[32], g[33], \dots~\thinspace. For example,

rB is internally the same as g[0], because |rB=0|.

rB is internally the same as g[0], because |rB=0|.

@=

@=

Extern specnode g[256]; /* global registers and special registers */

Extern specnode g[256]; /* global registers and special registers */

Extern specnode *l; /* the ring of local registers */

Extern specnode *l; /* the ring of local registers */

Extern int lring_size; /* the number of on-chip local registers

Extern int lring_size; /* the number of on-chip local registers

         (must be a power of~2) */

         (must be a power of~2) */

Extern int max_rename_regs, max_mem_slots; /* capacity of reorder buffer */

Extern int max_rename_regs, max_mem_slots; /* capacity of reorder buffer */

Extern int rename_regs, mem_slots; /* currently unused capacity */

Extern int rename_regs, mem_slots; /* currently unused capacity */

@ @=

@ @=

#define ticks @[g[rC].o@] /* the internal clock */

#define ticks @[g[rC].o@] /* the internal clock */

@ @=

@ @=

int lring_mask; /* for calculations modulo |lring_size| */

int lring_mask; /* for calculations modulo |lring_size| */

@ The |addr| fields in the specnode lists for registers are used

@ The |addr| fields in the specnode lists for registers are used

to identify that register in diagnostic messages. Such addresses

to identify that register in diagnostic messages. Such addresses

are negative; memory addresses are positive.

are negative; memory addresses are positive.

All registers are initially zero except rG, which is initially 255,

All registers are initially zero except rG, which is initially 255,

and rN, which has a constant value identifying the time of compilation.

and rN, which has a constant value identifying the time of compilation.

(The macro \.{ABSTIME} is defined externally in the file \.{abstime.h},

(The macro \.{ABSTIME} is defined externally in the file \.{abstime.h},

which should have just been created by {\mc ABSTIME}\kern.05em;

which should have just been created by {\mc ABSTIME}\kern.05em;

{\mc ABSTIME} is

{\mc ABSTIME} is

a trivial program that computes the value of the standard library function

a trivial program that computes the value of the standard library function

|time(NULL)|. We assume that this number, which is the number of seconds in

|time(NULL)|. We assume that this number, which is the number of seconds in

the ``{\mc UNIX} epoch,'' is less than~$2^{32}$. Beware: Our assumption will

the ``{\mc UNIX} epoch,'' is less than~$2^{32}$. Beware: Our assumption will

fail in February of 2106.)

fail in February of 2106.)

@^system dependencies@>

@^system dependencies@>

@d VERSION 1 /* version of the \MMIX\ architecture that we support */

@d VERSION 1 /* version of the \MMIX\ architecture that we support */

@d SUBVERSION 0 /* secondary byte of version number */

@d SUBVERSION 0 /* secondary byte of version number */

@d SUBSUBVERSION 0 /* further qualification to version number */

@d SUBSUBVERSION 0 /* further qualification to version number */

@=

@=

rename_regs=max_rename_regs;

rename_regs=max_rename_regs;

mem_slots=max_mem_slots;

mem_slots=max_mem_slots;

lring_mask=lring_size-1;

lring_mask=lring_size-1;

for (j=0;j<256;j++) {

for (j=0;j<256;j++) {

  g[j].addr.h=sign_bit, g[j].addr.l=j, g[j].known=true;

  g[j].addr.h=sign_bit, g[j].addr.l=j, g[j].known=true;

  g[j].up=g[j].down=&g[j];

  g[j].up=g[j].down=&g[j];

g[rG].o.l=255;

g[rG].o.l=255;

g[rN].o.h=(VERSION<<24)+(SUBVERSION<<16)+(SUBSUBVERSION<<8);

g[rN].o.h=(VERSION<<24)+(SUBVERSION<<16)+(SUBSUBVERSION<<8);

g[rN].o.l=ABSTIME; /* see comment and warning above */

g[rN].o.l=ABSTIME; /* see comment and warning above */

for (j=0;j

for (j=0;j

  l[j].addr.h=sign_bit, l[j].addr.l=256+j, l[j].known=true;

  l[j].addr.h=sign_bit, l[j].addr.l=256+j, l[j].known=true;

  l[j].up=l[j].down=&l[j];

  l[j].up=l[j].down=&l[j];

@ @=

@ @=

static void print_specnode_id @,@,@[ARGS((octa))@];

static void print_specnode_id @,@,@[ARGS((octa))@];

@ @=

@ @=

static void print_specnode_id(a)

static void print_specnode_id(a)

  octa a;

  octa a;

  if (a.h==sign_bit) {

  if (a.h==sign_bit) {

    if (a.l<32) printf(special_name[a.l]);

    if (a.l<32) printf(special_name[a.l]);

    else if (a.l<256) printf("g[%d]",a.l);

    else if (a.l<256) printf("g[%d]",a.l);

    else printf("l[%d]",a.l-256);

    else printf("l[%d]",a.l-256);

  }@+else if (a.h!=(tetra)-1) {

  }@+else if (a.h!=(tetra)-1) {

    printf("m[");@+print_octa(a);@+printf("]");

    printf("m[");@+print_octa(a);@+printf("]");

@ The |specval| subroutine produces a \&{spec} corresponding to the

@ The |specval| subroutine produces a \&{spec} corresponding to the

currently coolest value of a given local or global register.

currently coolest value of a given local or global register.

@=

@=

static spec specval @,@,@[ARGS((specnode*))@];

static spec specval @,@,@[ARGS((specnode*))@];

@ @=

@ @=

static spec specval(r)

static spec specval(r)

  specnode *r;

  specnode *r;

{@+spec res;

{@+spec res;

  if (r->up->known) res.o=r->up->o,res.p=NULL;

  if (r->up->known) res.o=r->up->o,res.p=NULL;

  else res.p=r->up;

  else res.p=r->up;

  return res;

  return res;

@ The |spec_install| subroutine introduces a new speculative value at

@ The |spec_install| subroutine introduces a new speculative value at

the cool end of a given doubly linked~list.

the cool end of a given doubly linked~list.

@=

@=

static void spec_install @,@,@[ARGS((specnode*,specnode*))@];

static void spec_install @,@,@[ARGS((specnode*,specnode*))@];

@ @=

@ @=

static void spec_install(r,t) /* insert |t| into list |r| */

static void spec_install(r,t) /* insert |t| into list |r| */

  specnode *r,*t;

  specnode *r,*t;

  t->up=r->up;

  t->up=r->up;

  t->up->down=t;

  t->up->down=t;

  r->up=t;

  r->up=t;

  t->down=r;

  t->down=r;

  t->addr=r->addr;

  t->addr=r->addr;

@ Conversely, |spec_rem| takes such a value out.

@ Conversely, |spec_rem| takes such a value out.

@=

@=

static void spec_rem @,@,@[ARGS((specnode*))@];

static void spec_rem @,@,@[ARGS((specnode*))@];

@ @=

@ @=

static void spec_rem(t) /* remove |t| from its list */

static void spec_rem(t) /* remove |t| from its list */

  specnode *t;

  specnode *t;

{@+register specnode *u=t->up, *d=t->down;

{@+register specnode *u=t->up, *d=t->down;

  u->down=d;@+ d->up=u;

  u->down=d;@+ d->up=u;

@ Some special registers are so central to \MMIX's operation, they are

@ Some special registers are so central to \MMIX's operation, they are

carried along with each control block in the reorder buffer instead of being

carried along with each control block in the reorder buffer instead of being

treated as source and destination registers of each instruction. For example,

treated as source and destination registers of each instruction. For example,

the register stack pointers rO and~rS are treated in this way.

the register stack pointers rO and~rS are treated in this way.

The normal specnodes for rO and~rS, namely |g[rO]| and~|g[rS]|,

The normal specnodes for rO and~rS, namely |g[rO]| and~|g[rS]|,

are not actually used;

are not actually used;

the cool values are called |cool_O| and |cool_S|.

the cool values are called |cool_O| and |cool_S|.

(Actually |cool_O| and |cool_S| correspond to the register

(Actually |cool_O| and |cool_S| correspond to the register

values divided by~8, since rO and~rS are always multiples of~8.)

values divided by~8, since rO and~rS are always multiples of~8.)

The arithmetic status register, rA, is also treated specially. Its

The arithmetic status register, rA, is also treated specially. Its

event bits are kept up to date only at the ``hot'' end, by accumulating

event bits are kept up to date only at the ``hot'' end, by accumulating

values of |arith_exc|; an instruction

values of |arith_exc|; an instruction

to \.{GET} the value of~rA will be executed only in the hot seat.

to \.{GET} the value of~rA will be executed only in the hot seat.

The other bits of~rA, which are needed to control trip handlers and

The other bits of~rA, which are needed to control trip handlers and

floating point rounding, are treated in the normal way.

floating point rounding, are treated in the normal way.

@=

@=

Extern octa cool_O,cool_S; /* values of rO, rS before the |cool| instruction */

Extern octa cool_O,cool_S; /* values of rO, rS before the |cool| instruction */

@ @=

@ @=

int cool_L,cool_G; /* values of rL and rG before the |cool| instruction */

int cool_L,cool_G; /* values of rL and rG before the |cool| instruction */

unsigned int cool_hist,peek_hist; /* history bits for branch prediction */

unsigned int cool_hist,peek_hist; /* history bits for branch prediction */

octa new_O,new_S; /* values of rO, rS after |cool| */

octa new_O,new_S; /* values of rO, rS after |cool| */

@ @=

@ @=

cool->op=op; @+cool->i=i;

cool->op=op; @+cool->i=i;

cool->xx=(head->inst>>16)&0xff;@+

cool->xx=(head->inst>>16)&0xff;@+

cool->yy=(head->inst>>8)&0xff;@+

cool->yy=(head->inst>>8)&0xff;@+

cool->zz=(head->inst)&0xff;

cool->zz=(head->inst)&0xff;

cool->loc=head->loc;

cool->loc=head->loc;

cool->y=cool->z=cool->b=cool->ra=zero_spec;

cool->y=cool->z=cool->b=cool->ra=zero_spec;

cool->x.o=cool->a.o=cool->rl.o=zero_octa;

cool->x.o=cool->a.o=cool->rl.o=zero_octa;

cool->x.known=false; cool->x.up=NULL;

cool->x.known=false; cool->x.up=NULL;

cool->a.known=false; cool->a.up=NULL;

cool->a.known=false; cool->a.up=NULL;

cool->rl.known=true; cool->rl.up=NULL;

cool->rl.known=true; cool->rl.up=NULL;

cool->need_b=cool->need_ra=

cool->need_b=cool->need_ra=

  cool->ren_x=cool->mem_x=cool->ren_a=cool->set_l=false;

  cool->ren_x=cool->mem_x=cool->ren_a=cool->set_l=false;

cool->arith_exc=cool->denin=cool->denout=0;

cool->arith_exc=cool->denin=cool->denout=0;

if ((head->loc.h&sign_bit) && !(g[rU].o.h&0x8000)) cool->usage=false;

if ((head->loc.h&sign_bit) && !(g[rU].o.h&0x8000)) cool->usage=false;

else cool->usage=((op&(g[rU].o.h>>16))==g[rU].o.h>>24? true: false);

else cool->usage=((op&(g[rU].o.h>>16))==g[rU].o.h>>24? true: false);

new_O=cool->cur_O=cool_O;@+ new_S=cool->cur_S=cool_S;

new_O=cool->cur_O=cool_O;@+ new_S=cool->cur_S=cool_S;

cool->interrupt=head->interrupt;

cool->interrupt=head->interrupt;

cool->hist=peek_hist;

cool->hist=peek_hist;

cool->go.o=incr(cool->loc,4);

cool->go.o=incr(cool->loc,4);

cool->go.known=false, cool->go.addr.h=-1,cool->go.up=(specnode*)cool;

cool->go.known=false, cool->go.addr.h=-1,cool->go.up=(specnode*)cool;

cool->interim=false;

cool->interim=false;

@ @=

@ @=

if (new_cool==hot) goto stall; /* reorder buffer is full */

if (new_cool==hot) goto stall; /* reorder buffer is full */

@;

@;

@;

@;

if (f&X_is_dest_bit) @

if (f&X_is_dest_bit) @

  an internal command and |goto dispatch_done| if X is marginal@>;

  an internal command and |goto dispatch_done| if X is marginal@>;

switch (i) {

switch (i) {

@@;

@@;

default: break;

default: break;

dispatch_done:@;

dispatch_done:@;

@ The \.{UNSAVE} operation begins by loading register~rG from memory.

@ The \.{UNSAVE} operation begins by loading register~rG from memory.

We don't really need to know the value of~rG until twelve other registers

We don't really need to know the value of~rG until twelve other registers

have been unsaved, so we aren't fussy about it here.

have been unsaved, so we aren't fussy about it here.

@=

@=

if (!g[rL].up->known) goto stall;

if (!g[rL].up->known) goto stall;

cool_L=g[rL].up->o.l;

cool_L=g[rL].up->o.l;

if (!g[rG].up->known && !(op==UNSAVE && cool->xx==1)) goto stall;

if (!g[rG].up->known && !(op==UNSAVE && cool->xx==1)) goto stall;

cool_G=g[rG].up->o.l;

cool_G=g[rG].up->o.l;

@ @=

@ @=

if (resuming)

if (resuming)

@@;

@@;

else{

else{

  if (f&0x10) @b| from register X@>@;

  if (f&0x10) @b| from register X@>@;

  if (third_operand[op] && (cool->i!=trap))

  if (third_operand[op] && (cool->i!=trap))

    @b| and/or |cool->ra| from special register@>;

    @b| and/or |cool->ra| from special register@>;

  if (f&0x1) cool->z.o.l=cool->zz;

  if (f&0x1) cool->z.o.l=cool->zz;

  else if (f&0x2) @z| from register Z@>@;

  else if (f&0x2) @z| from register Z@>@;

  else if ((op&0xf0)==0xe0) @z| as an immediate wyde@>;

  else if ((op&0xf0)==0xe0) @z| as an immediate wyde@>;

  if (f&0x4) cool->y.o.l=cool->yy;

  if (f&0x4) cool->y.o.l=cool->yy;

  else if (f&0x8) @y| from register Y@>@;

  else if (f&0x8) @y| from register Y@>@;

@ @z| from register Z@>=

@ @z| from register Z@>=

  if (cool->zz>=cool_G) cool->z=specval(&g[cool->zz]);

  if (cool->zz>=cool_G) cool->z=specval(&g[cool->zz]);

  else if (cool->zzz=specval(&l[(cool_O.l+cool->zz)&lring_mask]);

  else if (cool->zzz=specval(&l[(cool_O.l+cool->zz)&lring_mask]);

@ @y| from register Y@>=

@ @y| from register Y@>=

  if (cool->yy>=cool_G) cool->y=specval(&g[cool->yy]);

  if (cool->yy>=cool_G) cool->y=specval(&g[cool->yy]);

  else if (cool->yyy=specval(&l[(cool_O.l+cool->yy)&lring_mask]);

  else if (cool->yyy=specval(&l[(cool_O.l+cool->yy)&lring_mask]);

@ @b| from register X@>=

@ @b| from register X@>=

  if (cool->xx>=cool_G) cool->b=specval(&g[cool->xx]);

  if (cool->xx>=cool_G) cool->b=specval(&g[cool->xx]);

  else if (cool->xx

  else if (cool->xx

    cool->b=specval(&l[(cool_O.l+cool->xx)&lring_mask]);

    cool->b=specval(&l[(cool_O.l+cool->xx)&lring_mask]);

  if (f&rel_addr_bit) cool->need_b=true; /* |br|, |pbr| */

  if (f&rel_addr_bit) cool->need_b=true; /* |br|, |pbr| */

@ If an operation requires a special register as third operand,

@ If an operation requires a special register as third operand,

that register is listed in the |third_operand| table.

that register is listed in the |third_operand| table.

@=

@=

unsigned char third_operand[256]={@/

unsigned char third_operand[256]={@/

  0,rA,0,0,rA,rA,rA,rA, /* \.{TRAP}, \dots\ */

  0,rA,0,0,rA,rA,rA,rA, /* \.{TRAP}, \dots\ */

  rA,rA,rA,rA,rA,rA,rA,rA, /* \.{FLOT}, \dots\ */

  rA,rA,rA,rA,rA,rA,rA,rA, /* \.{FLOT}, \dots\ */

  rA,rE,rE,rE,rA,rA,rA,rA, /* \.{FMUL}, \dots\ */

  rA,rE,rE,rE,rA,rA,rA,rA, /* \.{FMUL}, \dots\ */

  rA,rA,0,0,rA,rA,rD,rD, /* \.{MUL}, \dots\ */

  rA,rA,0,0,rA,rA,rD,rD, /* \.{MUL}, \dots\ */

  rA,rA,0,0,rA,rA,0,0, /* \.{ADD}, \dots\ */

  rA,rA,0,0,rA,rA,0,0, /* \.{ADD}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{2ADDU}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{2ADDU}, \dots\ */

  0,0,0,0,rA,rA,0,0, /* \.{CMP}, \dots\ */

  0,0,0,0,rA,rA,0,0, /* \.{CMP}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{SL}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{SL}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{PBN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{PBN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{PBNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{PBNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{CSN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{CSN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{CSNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{CSNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ZSN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ZSN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ZSNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ZSNN}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDB}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDB}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDT}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDT}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDSF}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDSF}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDVTS}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{LDVTS}, \dots\ */

  rA,rA,0,0,rA,rA,0,0, /* \.{STB}, \dots\ */

  rA,rA,0,0,rA,rA,0,0, /* \.{STB}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{STT}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{STT}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{STSF}, \dots\ */

  rA,rA,0,0,0,0,0,0, /* \.{STSF}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{SYNCD}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{SYNCD}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{OR}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{OR}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{AND}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{AND}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BDIF}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{BDIF}, \dots\ */

  rM,rM,0,0,0,0,0,0, /* \.{MUX}, \dots\ */

  rM,rM,0,0,0,0,0,0, /* \.{MUX}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{SETH}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{SETH}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ORH}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{ORH}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{JMP}, \dots\ */

  0,0,0,0,0,0,0,0, /* \.{JMP}, \dots\ */

  rJ,0,0,0,0,0,0,255}; /* \.{POP}, \dots\ */

  rJ,0,0,0,0,0,0,255}; /* \.{POP}, \dots\ */

@ The |cool->b| field is busy in operations like \.{STB} or \.{STSF},

@ The |cool->b| field is busy in operations like \.{STB} or \.{STSF},

which need~rA. So we use |cool->ra| instead, when rA is needed.

which need~rA. So we use |cool->ra| instead, when rA is needed.

@b| and/or |cool->ra| from special register@>=

@b| and/or |cool->ra| from special register@>=

  if (third_operand[op]==rA || third_operand[op]==rE)

  if (third_operand[op]==rA || third_operand[op]==rE)

    cool->need_ra=true, cool->ra=specval(&g[rA]);

    cool->need_ra=true, cool->ra=specval(&g[rA]);

  if (third_operand[op]!=rA)

  if (third_operand[op]!=rA)

    cool->need_b=true, cool->b=specval(&g[third_operand[op]]);

    cool->need_b=true, cool->b=specval(&g[third_operand[op]]);

@ @z| as an immediate wyde@>=

@ @z| as an immediate wyde@>=

{  switch (op&3) {

{  switch (op&3) {

case 0: cool->z.o.h=yz<<16;@+break;

case 0: cool->z.o.h=yz<<16;@+break;

case 1: cool->z.o.h=yz;@+break;

case 1: cool->z.o.h=yz;@+break;

case 2: cool->z.o.l=yz<<16;@+break;

case 2: cool->z.o.l=yz<<16;@+break;

case 3: cool->z.o.l=yz;@+break;

case 3: cool->z.o.l=yz;@+break;

  if (i!=set) { /* register X should also be the Y operand */

  if (i!=set) { /* register X should also be the Y operand */

    cool->y=cool->b; cool->b=zero_spec;

    cool->y=cool->b; cool->b=zero_spec;

@ @=

@ @=

  if (cool->xx>=cool_G) {

  if (cool->xx>=cool_G) {

    if (i!=pushgo && i!=pushj)

    if (i!=pushgo && i!=pushj)

      cool->ren_x=true,spec_install(&g[cool->xx],&cool->x);

      cool->ren_x=true,spec_install(&g[cool->xx],&cool->x);

  }@+else if (cool->xx

  }@+else if (cool->xx

    cool->ren_x=true,

    cool->ren_x=true,

      spec_install(&l[(cool_O.l+cool->xx)&lring_mask],&cool->x);

      spec_install(&l[(cool_O.l+cool->xx)&lring_mask],&cool->x);

  else { /* we need to increase L before issuing |head->inst| */

  else { /* we need to increase L before issuing |head->inst| */

 increase_L:@+ if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

 increase_L:@+ if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

@@;

@@;

    else @;

    else @;

@ @=

@ @=

if (rename_regsren_x+cool->ren_a) goto stall;

if (rename_regsren_x+cool->ren_a) goto stall;

if (cool->mem_x)

if (cool->mem_x)

  if (mem_slots) mem_slots--;@+else goto stall;

  if (mem_slots) mem_slots--;@+else goto stall;

rename_regs-=cool->ren_x+cool->ren_a;

rename_regs-=cool->ren_x+cool->ren_a;

@ The |incrl| instruction

@ The |incrl| instruction

advances $\beta$ and~rL by~1 at a time when we know that $\beta\ne\gamma$,

advances $\beta$ and~rL by~1 at a time when we know that $\beta\ne\gamma$,

in the ring of local registers.

in the ring of local registers.

@=

@=

  cool->i=incrl;

  cool->i=incrl;

  spec_install(&l[(cool_O.l+cool_L)&lring_mask],&cool->x);

  spec_install(&l[(cool_O.l+cool_L)&lring_mask],&cool->x);

  cool->need_b=cool->need_ra=false;

  cool->need_b=cool->need_ra=false;

  cool->y=cool->z=zero_spec;

  cool->y=cool->z=zero_spec;

  cool->x.known=true; /* |cool->x.o=zero_octa| */

  cool->x.known=true; /* |cool->x.o=zero_octa| */

  spec_install(&g[rL],&cool->rl);

  spec_install(&g[rL],&cool->rl);

  cool->rl.o.l=cool_L+1;

  cool->rl.o.l=cool_L+1;

  cool->ren_x=cool->set_l=true;

  cool->ren_x=cool->set_l=true;

  op=SETH; /* this instruction to be handled by the simplest units */

  op=SETH; /* this instruction to be handled by the simplest units */

  cool->interim=true;

  cool->interim=true;

  goto dispatch_done;

  goto dispatch_done;

@ The |incgamma| instruction advances $\gamma$ and rS by storing an octabyte

@ The |incgamma| instruction advances $\gamma$ and rS by storing an octabyte

from the local register ring to virtual memory location |cool_S<<3|.

from the local register ring to virtual memory location |cool_S<<3|.

@=

@=

  cool->need_b=cool->need_ra=false;

  cool->need_b=cool->need_ra=false;

  cool->i=incgamma;

  cool->i=incgamma;

  new_S=incr(cool_S,1);

  new_S=incr(cool_S,1);

  cool->b=specval(&l[cool_S.l&lring_mask]);

  cool->b=specval(&l[cool_S.l&lring_mask]);

  cool->y.p=NULL, cool->y.o=shift_left(cool_S,3);

  cool->y.p=NULL, cool->y.o=shift_left(cool_S,3);

  cool->z=zero_spec;

  cool->z=zero_spec;

  cool->mem_x=true, spec_install(&mem,&cool->x);

  cool->mem_x=true, spec_install(&mem,&cool->x);

  op=STOU; /* this instruction needs to be handled by load/store unit */

  op=STOU; /* this instruction needs to be handled by load/store unit */

  cool->interim=true;

  cool->interim=true;

  goto dispatch_done;

  goto dispatch_done;

@ The |decgamma| instruction decreases $\gamma$ and rS by loading an octabyte

@ The |decgamma| instruction decreases $\gamma$ and rS by loading an octabyte

from virtual memory location |(cool_S-1)<<3| into the local register ring.

from virtual memory location |(cool_S-1)<<3| into the local register ring.

@=

@=

  cool->i=decgamma;

  cool->i=decgamma;

  new_S=incr(cool_S,-1);

  new_S=incr(cool_S,-1);

  cool->z=cool->b=zero_spec; cool->need_b=false;

  cool->z=cool->b=zero_spec; cool->need_b=false;

  cool->y.p=NULL, cool->y.o=shift_left(new_S,3);

  cool->y.p=NULL, cool->y.o=shift_left(new_S,3);

  cool->ren_x=true, spec_install(&l[new_S.l&lring_mask],&cool->x);

  cool->ren_x=true, spec_install(&l[new_S.l&lring_mask],&cool->x);

  op=LDOU; /* this instruction needs to be handled by load/store unit */

  op=LDOU; /* this instruction needs to be handled by load/store unit */

  cool->interim=true;

  cool->interim=true;

  cool->ptr_a=(void*)mem.up;

  cool->ptr_a=(void*)mem.up;

  goto dispatch_done;

  goto dispatch_done;

@ Storing into memory requires a doubly linked data list of specnodes

@ Storing into memory requires a doubly linked data list of specnodes

like the lists we use for local and global registers. In this case

like the lists we use for local and global registers. In this case

the head of the list is called |mem|, and the |addr| fields are

the head of the list is called |mem|, and the |addr| fields are

physical addresses in memory.

physical addresses in memory.

@=

@=

Extern specnode mem;

Extern specnode mem;

@ The |addr| field of a memory specnode

@ The |addr| field of a memory specnode

is all 1s until the physical address has been computed.

is all 1s until the physical address has been computed.

@=

@=

mem.addr.h=mem.addr.l=-1;

mem.addr.h=mem.addr.l=-1;

mem.up=mem.down=&mem;

mem.up=mem.down=&mem;

@ The \.{CSWAP} operation is treated as a partial store, with \$X

@ The \.{CSWAP} operation is treated as a partial store, with \$X

as a secondary output. Partial store (|pst|) commands read an octabyte

as a secondary output. Partial store (|pst|) commands read an octabyte

from memory before they write it.

from memory before they write it.

@=

@=

case cswap: cool->ren_a=true;

case cswap: cool->ren_a=true;

  spec_install(cool->xx>=cool_G? &g[cool->xx]:

  spec_install(cool->xx>=cool_G? &g[cool->xx]:

      &l[(cool_O.l+cool->xx)&lring_mask],&cool->a);

      &l[(cool_O.l+cool->xx)&lring_mask],&cool->a);

  cool->i=pst;

  cool->i=pst;

case st:@+ if ((op&0xfe)==STCO) cool->b.o.l=cool->xx;

case st:@+ if ((op&0xfe)==STCO) cool->b.o.l=cool->xx;

case pst:

case pst:

 cool->mem_x=true, spec_install(&mem,&cool->x);@+ break;

 cool->mem_x=true, spec_install(&mem,&cool->x);@+ break;

case ld: case ldunc: cool->ptr_a=(void *)mem.up;@+ break;

case ld: case ldunc: cool->ptr_a=(void *)mem.up;@+ break;

@ When new data is \.{PUT} into special registers 15--20 (namely rK,

@ When new data is \.{PUT} into special registers 15--20 (namely rK,

rQ, rU, rV, rG, or~rL) it can affect many things. Therefore we stop

rQ, rU, rV, rG, or~rL) it can affect many things. Therefore we stop

issuing further instructions until such \.{PUT}s are committed.

issuing further instructions until such \.{PUT}s are committed.

Moreover, we will see later that such drastic \.{PUT}s defer execution until

Moreover, we will see later that such drastic \.{PUT}s defer execution until

they reach the hot seat.

they reach the hot seat.

@=

@=

case put:@+ if (cool->yy!=0 || cool->xx>=32) goto illegal_inst;

case put:@+ if (cool->yy!=0 || cool->xx>=32) goto illegal_inst;

 if (cool->xx>=8) {

 if (cool->xx>=8) {

   if (cool->xx<=11) goto illegal_inst;

   if (cool->xx<=11) goto illegal_inst;

   if (cool->xx<=18 && !(cool->loc.h&sign_bit)) goto privileged_inst;

   if (cool->xx<=18 && !(cool->loc.h&sign_bit)) goto privileged_inst;

 if (cool->xx>=15 && cool->xx<=20) freeze_dispatch=true;

 if (cool->xx>=15 && cool->xx<=20) freeze_dispatch=true;

 cool->ren_x=true, spec_install(&g[cool->xx],&cool->x);@+break;

 cool->ren_x=true, spec_install(&g[cool->xx],&cool->x);@+break;

@#

@#

case get:@+ if (cool->yy || cool->zz>=32) goto illegal_inst;

case get:@+ if (cool->yy || cool->zz>=32) goto illegal_inst;

 if (cool->zz==rO) cool->z.o=shift_left(cool_O,3);

 if (cool->zz==rO) cool->z.o=shift_left(cool_O,3);

 else if (cool->zz==rS) cool->z.o=shift_left(cool_S,3);

 else if (cool->zz==rS) cool->z.o=shift_left(cool_S,3);

 else cool->z=specval(&g[cool->zz]);@+break;

 else cool->z=specval(&g[cool->zz]);@+break;

illegal_inst: cool->interrupt |= B_BIT;@+goto noop_inst;

illegal_inst: cool->interrupt |= B_BIT;@+goto noop_inst;

case ldvts:@+ if (cool->loc.h&sign_bit) break;

case ldvts:@+ if (cool->loc.h&sign_bit) break;

privileged_inst:  cool->interrupt |= K_BIT;

privileged_inst:  cool->interrupt |= K_BIT;

noop_inst: cool->i=noop;@+break;

noop_inst: cool->i=noop;@+break;

@ A \.{PUSHGO} instruction with $\rm X\ge G$ causes L to increase

@ A \.{PUSHGO} instruction with $\rm X\ge G$ causes L to increase

momentarily by~1, even if $\rm L=G$.

momentarily by~1, even if $\rm L=G$.

But the value of~L will be decreased before the \.{PUSHGO}

But the value of~L will be decreased before the \.{PUSHGO}

is complete, so it will never actually exceed~G. Moreover, we needn't

is complete, so it will never actually exceed~G. Moreover, we needn't

insert an~|incrl| command.

insert an~|incrl| command.

@=

@=

case pushgo: inst_ptr.p=&cool->go;

case pushgo: inst_ptr.p=&cool->go;

case pushj: {@+register int x=cool->xx;

case pushj: {@+register int x=cool->xx;

  if (x>=cool_G) {

  if (x>=cool_G) {

    if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

    if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

@@;

@@;

    x=cool_L;@+ cool_L++;

    x=cool_L;@+ cool_L++;

    cool->ren_x=true, spec_install(&l[(cool_O.l+x)&lring_mask],&cool->x);

    cool->ren_x=true, spec_install(&l[(cool_O.l+x)&lring_mask],&cool->x);

  cool->x.known=true, cool->x.o.h=0, cool->x.o.l=x;

  cool->x.known=true, cool->x.o.h=0, cool->x.o.l=x;

  cool->ren_a=true, spec_install(&g[rJ],&cool->a);

  cool->ren_a=true, spec_install(&g[rJ],&cool->a);

  cool->a.known=true, cool->a.o=incr(cool->loc,4);

  cool->a.known=true, cool->a.o=incr(cool->loc,4);

  cool->set_l=true, spec_install(&g[rL],&cool->rl);

  cool->set_l=true, spec_install(&g[rL],&cool->rl);

  cool->rl.o.l=cool_L-x-1;

  cool->rl.o.l=cool_L-x-1;

  new_O=incr(cool_O,x+1);

  new_O=incr(cool_O,x+1);

}@+break;

}@+break;

case syncid: if (cool->loc.h&sign_bit) break;

case syncid: if (cool->loc.h&sign_bit) break;

case go: inst_ptr.p=&cool->go;@+break;

case go: inst_ptr.p=&cool->go;@+break;

@ We need to know the topmost ``hidden'' element of the register stack

@ We need to know the topmost ``hidden'' element of the register stack

when a \.{POP} instruction is dispatched. This element is usually

when a \.{POP} instruction is dispatched. This element is usually

present in the local register ring, unless $\gamma=\alpha$.

present in the local register ring, unless $\gamma=\alpha$.

Once it is known, let $x$ be its least significant byte. We will

Once it is known, let $x$ be its least significant byte. We will

be decreasing rO by $x+1$, so we may have to decrease $\gamma$ repeatedly

be decreasing rO by $x+1$, so we may have to decrease $\gamma$ repeatedly

in order to maintain the condition $\rm rS\le rO$.

in order to maintain the condition $\rm rS\le rO$.

@=

@=

case pop:@+if (cool->xx && cool_L>=cool->xx)

case pop:@+if (cool->xx && cool_L>=cool->xx)

      cool->y=specval(&l[(cool_O.l+cool->xx-1)&lring_mask]);

      cool->y=specval(&l[(cool_O.l+cool->xx-1)&lring_mask]);

pop_unsave:@+if (cool_S.l==cool_O.l)

pop_unsave:@+if (cool_S.l==cool_O.l)

@;

@;

  {@+register tetra x; register int new_L;

  {@+register tetra x; register int new_L;

    register specnode *p=l[(cool_O.l-1)&lring_mask].up;

    register specnode *p=l[(cool_O.l-1)&lring_mask].up;

    if (p->known) x=(p->o.l)&0xff;@+ else goto stall;

    if (p->known) x=(p->o.l)&0xff;@+ else goto stall;

    if ((tetra)(cool_O.l-cool_S.l)<=x)

    if ((tetra)(cool_O.l-cool_S.l)<=x)

@;

@;

    new_O=incr(cool_O,-x-1);

    new_O=incr(cool_O,-x-1);

    if (cool->i==pop) new_L=x+(cool->xx<=cool_L? cool->xx: cool_L+1);

    if (cool->i==pop) new_L=x+(cool->xx<=cool_L? cool->xx: cool_L+1);

    else new_L=x;

    else new_L=x;

    if (new_L>cool_G) new_L=cool_G;

    if (new_L>cool_G) new_L=cool_G;

    if (x

    if (x

      cool->ren_x=true, spec_install(&l[(cool_O.l-1)&lring_mask],&cool->x);

      cool->ren_x=true, spec_install(&l[(cool_O.l-1)&lring_mask],&cool->x);

    cool->set_l=true, spec_install(&g[rL],&cool->rl);

    cool->set_l=true, spec_install(&g[rL],&cool->rl);

    cool->rl.o.l=new_L;

    cool->rl.o.l=new_L;

    if (cool->i==pop) {

    if (cool->i==pop) {

      cool->z.o.l=yz<<2;

      cool->z.o.l=yz<<2;

      if (inst_ptr.p==UNKNOWN_SPEC && new_head==tail) inst_ptr.p=&cool->go;

      if (inst_ptr.p==UNKNOWN_SPEC && new_head==tail) inst_ptr.p=&cool->go;

    break;

    break;

@ @=

@ @=

case mulu: cool->ren_a=true, spec_install(&g[rH],&cool->a);@+break;

case mulu: cool->ren_a=true, spec_install(&g[rH],&cool->a);@+break;

case div: case divu: cool->ren_a=true, spec_install(&g[rR],&cool->a);@+break;

case div: case divu: cool->ren_a=true, spec_install(&g[rR],&cool->a);@+break;

@ It's tempting to say that we could avoid taking up space in the reorder

@ It's tempting to say that we could avoid taking up space in the reorder

buffer when no operation needs to be done.

buffer when no operation needs to be done.

A \.{JMP} instruction qualifies as a no-op in this sense,

A \.{JMP} instruction qualifies as a no-op in this sense,

because the change of control occurs before the execution stage.

because the change of control occurs before the execution stage.

However, even a no-op might have to be counted in the usage register~rU,

However, even a no-op might have to be counted in the usage register~rU,

so it might get into the execution stage for that reason.

so it might get into the execution stage for that reason.

A no-op can also cause a protection interrupt, if it appears in a negative

A no-op can also cause a protection interrupt, if it appears in a negative

location. Even more importantly, a program might get into a loop that consists

location. Even more importantly, a program might get into a loop that consists

entirely of jumps and no-ops; then we wouldn't be able to interrupt it,

entirely of jumps and no-ops; then we wouldn't be able to interrupt it,

because the interruption mechanism needs to find the current location

because the interruption mechanism needs to find the current location

in the reorder buffer! At least one functional unit therefore needs to provide

in the reorder buffer! At least one functional unit therefore needs to provide

explicit support for \.{JMP}, \.{JMPB}, and \.{SWYM}.

explicit support for \.{JMP}, \.{JMPB}, and \.{SWYM}.

The \.{SWYM} instruction with |F_BIT| set is a special case: This is

The \.{SWYM} instruction with |F_BIT| set is a special case: This is

a request from the fetch coroutine for an update to the IT-cache,

a request from the fetch coroutine for an update to the IT-cache,

when the page table method isn't implemented in hardware.

when the page table method isn't implemented in hardware.

@=

@=

case noop:@+if (cool->interrupt&F_BIT) {

case noop:@+if (cool->interrupt&F_BIT) {

   cool->go.o=cool->y.o=cool->loc;

   cool->go.o=cool->y.o=cool->loc;

   inst_ptr=specval(&g[rT]);

   inst_ptr=specval(&g[rT]);

 break;

 break;

@ @=

@ @=

if (cool->ren_x || cool->mem_x) spec_rem(&cool->x);

if (cool->ren_x || cool->mem_x) spec_rem(&cool->x);

if (cool->ren_a) spec_rem(&cool->a);

if (cool->ren_a) spec_rem(&cool->a);

if (cool->set_l) spec_rem(&cool->rl);

if (cool->set_l) spec_rem(&cool->rl);

if (inst_ptr.p==&cool->go) inst_ptr.p=UNKNOWN_SPEC;

if (inst_ptr.p==&cool->go) inst_ptr.p=UNKNOWN_SPEC;

break;

break;

@* The execution stages. \MMIX's {\it raison d'\^etre\/} is its ability

@* The execution stages. \MMIX's {\it raison d'\^etre\/} is its ability

to execute instructions. So now we want to simulate the behavior of its

to execute instructions. So now we want to simulate the behavior of its

functional units.

functional units.

Each coroutine scheduled for action at the current tick of the clock has a

Each coroutine scheduled for action at the current tick of the clock has a

|stage| number corresponding to a particular subset of the \MMIX\ hardware.

|stage| number corresponding to a particular subset of the \MMIX\ hardware.

For example, the coroutines with |stage=2| are the second stages in the

For example, the coroutines with |stage=2| are the second stages in the

pipelines of the functional units. A coroutine with |stage=0| works

pipelines of the functional units. A coroutine with |stage=0| works

in the fetch unit. Several artificially large stage numbers

in the fetch unit. Several artificially large stage numbers

are used to control special coroutines that do things like write data

are used to control special coroutines that do things like write data

from buffers into memory.

from buffers into memory.

In this program the current coroutine of interest is called |self|; hence

In this program the current coroutine of interest is called |self|; hence

|self->stage| is the current stage number of interest. Another key variable,

|self->stage| is the current stage number of interest. Another key variable,

|self->ctl|, is called~|data|; this is the control block being operated on by

|self->ctl|, is called~|data|; this is the control block being operated on by

the current coroutine. We typically are simulating an operation in which

the current coroutine. We typically are simulating an operation in which

|data->x| is being computed as a function of |data->y| and |data->z|.

|data->x| is being computed as a function of |data->y| and |data->z|.

The |data| record has many fields, as described earlier when we defined

The |data| record has many fields, as described earlier when we defined

\&{control} structures; for example, |data->owner| is the same as

\&{control} structures; for example, |data->owner| is the same as

|self|, during the execution stage, if it is nonnull.

|self|, during the execution stage, if it is nonnull.

This part of the simulator is written as if each functional unit is able to

This part of the simulator is written as if each functional unit is able to

handle all 256 operations. In practice, of course, a functional unit tends to

handle all 256 operations. In practice, of course, a functional unit tends to

be much more specialized; the actual specialization is governed by the

be much more specialized; the actual specialization is governed by the

dispatcher, which issues an instruction only to a functional unit that

dispatcher, which issues an instruction only to a functional unit that

supports it. Once an instruction has been dispatched, however, we can simulate

supports it. Once an instruction has been dispatched, however, we can simulate

it most easily if we imagine that its functional unit is universal.

it most easily if we imagine that its functional unit is universal.

Coroutines with higher |stage| numbers are processed first.

Coroutines with higher |stage| numbers are processed first.

The three most important variables that govern a coroutine's behavior, once

The three most important variables that govern a coroutine's behavior, once

|self->stage| is given, are the external operation code |data->op|, the

|self->stage| is given, are the external operation code |data->op|, the

internal operation code |data->i|, and the value of |data->state|. We

internal operation code |data->i|, and the value of |data->state|. We

typically have |data->state=0| when a coroutine is first fired~up.

typically have |data->state=0| when a coroutine is first fired~up.

@=

@=

register coroutine *self; /* the current coroutine being executed */

register coroutine *self; /* the current coroutine being executed */

register control *data; /* the |control| block of the current coroutine */

register control *data; /* the |control| block of the current coroutine */

@ When a coroutine has done all it wants to on a single cycle,

@ When a coroutine has done all it wants to on a single cycle,

it says |goto done|. It will not be scheduled to do any further work

it says |goto done|. It will not be scheduled to do any further work

unless the |schedule| routine has been called since it began execution.

unless the |schedule| routine has been called since it began execution.

The |wait| macro is a convenient way to say ``Please schedule me to resume

The |wait| macro is a convenient way to say ``Please schedule me to resume

again at the current |data->state|'' after a specified time; for example,

again at the current |data->state|'' after a specified time; for example,

|wait(1)| will restart a coroutine on the next clock tick.

|wait(1)| will restart a coroutine on the next clock tick.

@d wait(t)@+ {@+schedule(self,t,data->state);@+ goto done;@+}

@d wait(t)@+ {@+schedule(self,t,data->state);@+ goto done;@+}

@d pass_after(t)  schedule(self+1,t,data->state)

@d pass_after(t)  schedule(self+1,t,data->state)

@d sleep@+ {@+self->next=self;@+ goto done;@+} /* wait forever */

@d sleep@+ {@+self->next=self;@+ goto done;@+} /* wait forever */

@d awaken(c,t)  schedule(c,t,c->ctl->state)

@d awaken(c,t)  schedule(c,t,c->ctl->state)

@=

@=

cur_time++;@+ if (cur_time==ring_size) cur_time=0;

cur_time++;@+ if (cur_time==ring_size) cur_time=0;

for (self=queuelist(cur_time);self!=&sentinel;self=sentinel.next) {

for (self=queuelist(cur_time);self!=&sentinel;self=sentinel.next) {

  sentinel.next=self->next;@+self->next=NULL; /* unschedule this coroutine */

  sentinel.next=self->next;@+self->next=NULL; /* unschedule this coroutine */

  data=self->ctl;

  data=self->ctl;

  if (verbose&coroutine_bit) {

  if (verbose&coroutine_bit) {

    printf(" running ");@+print_coroutine_id(self);@+printf(" ");

    printf(" running ");@+print_coroutine_id(self);@+printf(" ");

    print_control_block(data);@+printf("\n");

    print_control_block(data);@+printf("\n");

  switch(self->stage) {

  switch(self->stage) {

 case 0:@;

 case 0:@;

 case 1:@;

 case 1:@;

 default:@;

 default:@;

 @t\4@>@;

 @t\4@>@;

 terminate:@+if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 terminate:@+if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 done:;

 done:;

@ A special coroutine whose |stage| number is |vanish| simply goes away

@ A special coroutine whose |stage| number is |vanish| simply goes away

at its scheduled time.

at its scheduled time.

@=

@=

case vanish: goto terminate;

case vanish: goto terminate;

@ @=

@ @=

coroutine mem_locker; /* trivial coroutine that vanishes */

coroutine mem_locker; /* trivial coroutine that vanishes */

coroutine Dlocker; /* another */

coroutine Dlocker; /* another */

control vanish_ctl; /* such coroutines share a common control block */

control vanish_ctl; /* such coroutines share a common control block */

@ @=

@ @=

mem_locker.name="Locker";

mem_locker.name="Locker";

mem_locker.ctl=&vanish_ctl;

mem_locker.ctl=&vanish_ctl;

mem_locker.stage=vanish;

mem_locker.stage=vanish;

Dlocker.name="Dlocker";

Dlocker.name="Dlocker";

Dlocker.ctl=&vanish_ctl;

Dlocker.ctl=&vanish_ctl;

Dlocker.stage=vanish;

Dlocker.stage=vanish;

vanish_ctl.go.o.l=4;

vanish_ctl.go.o.l=4;

for (j=0;jports;j++) DTcache->reader[j].ctl=&vanish_ctl;

for (j=0;jports;j++) DTcache->reader[j].ctl=&vanish_ctl;

if (Dcache) for (j=0;jports;j++) Dcache->reader[j].ctl=&vanish_ctl;

if (Dcache) for (j=0;jports;j++) Dcache->reader[j].ctl=&vanish_ctl;

for (j=0;jports;j++) ITcache->reader[j].ctl=&vanish_ctl;

for (j=0;jports;j++) ITcache->reader[j].ctl=&vanish_ctl;

if (Icache) for (j=0;jports;j++) Icache->reader[j].ctl=&vanish_ctl;

if (Icache) for (j=0;jports;j++) Icache->reader[j].ctl=&vanish_ctl;

@ Here is a list of the |stage| numbers for special coroutines to be

@ Here is a list of the |stage| numbers for special coroutines to be

defined below.

defined below.

@=

@=

#define max_stage 99 /* exceeds all |stage| numbers */

#define max_stage 99 /* exceeds all |stage| numbers */

#define vanish 98 /* special coroutine that just goes away */

#define vanish 98 /* special coroutine that just goes away */

#define flush_to_mem 97 /* coroutine for flushing from a cache to memory */

#define flush_to_mem 97 /* coroutine for flushing from a cache to memory */

#define flush_to_S 96 /* coroutine for flushing from a cache to the S-cache */

#define flush_to_S 96 /* coroutine for flushing from a cache to the S-cache */

#define fill_from_mem 95 /* coroutine for filling a cache from memory */

#define fill_from_mem 95 /* coroutine for filling a cache from memory */

#define fill_from_S 94 /* coroutine for filling a cache from the S-cache */

#define fill_from_S 94 /* coroutine for filling a cache from the S-cache */

#define fill_from_virt 93 /* coroutine for filling a translation cache */

#define fill_from_virt 93 /* coroutine for filling a translation cache */

#define write_from_wbuf 92 /* coroutine for emptying the write buffer */

#define write_from_wbuf 92 /* coroutine for emptying the write buffer */

#define cleanup 91 /* coroutine for cleaning the caches */

#define cleanup 91 /* coroutine for cleaning the caches */

@ At the very beginning of stage 1, a functional unit will stall if necessary

@ At the very beginning of stage 1, a functional unit will stall if necessary

until its operands are available. As soon as the operands are all present, the

until its operands are available. As soon as the operands are all present, the

|state| is set nonzero and execution proper begins.

|state| is set nonzero and execution proper begins.

@=

@=

switch1:@+ switch(data->state) {

switch1:@+ switch(data->state) {

 case 0: @;

 case 0: @;

 case 1: @;

 case 1: @;

 case 2: @;

 case 2: @;

 case 3: @;

 case 3: @;

@;

@;

@ If some of our input data has been computed by another coroutine on the

@ If some of our input data has been computed by another coroutine on the

current cycle, we grab it now but wait for the next cycle. (An actual machine

current cycle, we grab it now but wait for the next cycle. (An actual machine

wouldn't have latched the data until then.)

wouldn't have latched the data until then.)

@=

@=

j=0;

j=0;

if (data->y.p) {

if (data->y.p) {

  j++;

  j++;

  if (data->y.p->known) data->y.o=data->y.p->o, data->y.p=NULL;

  if (data->y.p->known) data->y.o=data->y.p->o, data->y.p=NULL;

  else j+=10;

  else j+=10;

if (data->z.p) {

if (data->z.p) {

  j++;

  j++;

  if (data->z.p->known) data->z.o=data->z.p->o, data->z.p=NULL;

  if (data->z.p->known) data->z.o=data->z.p->o, data->z.p=NULL;

  else j+=10;

  else j+=10;

if (data->b.p) {

if (data->b.p) {

  if (data->need_b) j++;

  if (data->need_b) j++;

  if (data->b.p->known) data->b.o=data->b.p->o, data->b.p=NULL;

  if (data->b.p->known) data->b.o=data->b.p->o, data->b.p=NULL;

  else if (data->need_b) j+=10;

  else if (data->need_b) j+=10;

if (data->ra.p) {

if (data->ra.p) {

  if (data->need_ra) j++;

  if (data->need_ra) j++;

  if (data->ra.p->known) data->ra.o=data->ra.p->o, data->ra.p=NULL;

  if (data->ra.p->known) data->ra.o=data->ra.p->o, data->ra.p=NULL;

  else if (data->need_ra) j+=10;

  else if (data->need_ra) j+=10;

if (j<10) data->state=1;

if (j<10) data->state=1;

if (j) wait(1); /* otherwise we fall through to case 1 */

if (j) wait(1); /* otherwise we fall through to case 1 */

@ Simple register-to-register instructions like \.{ADD} are assumed to take

@ Simple register-to-register instructions like \.{ADD} are assumed to take

just one cycle, but others like \.{FADD} almost certainly require more time.

just one cycle, but others like \.{FADD} almost certainly require more time.

This simulator can be configured so that \.{FADD} might take, say, four

This simulator can be configured so that \.{FADD} might take, say, four

pipeline stages of one cycle each ($1+1+1+1$), or two pipeline stages of two

pipeline stages of one cycle each ($1+1+1+1$), or two pipeline stages of two

cycles each ($2+2$), or a single unpipelined stage lasting four cycles (4),

cycles each ($2+2$), or a single unpipelined stage lasting four cycles (4),

etc. In any case the simulator computes the results now, for simplicity,

etc. In any case the simulator computes the results now, for simplicity,

placing them in |data->x| and possibly also in |data->a| and/or

placing them in |data->x| and possibly also in |data->a| and/or

|data->interrupt|. The results will not be officially made |known| until

|data->interrupt|. The results will not be officially made |known| until

the proper time.

the proper time.

@=

@=

switch (data->i) {

switch (data->i) {

@;

@;

@;

@;

@;

@;

@;

@;

@ If the internal opcode |data->i| is |max_pipe_op| or less, a special

@ If the internal opcode |data->i| is |max_pipe_op| or less, a special

pipeline sequence like $1+1+1+1$ or $2+2$ or $15+10$, etc., has been

pipeline sequence like $1+1+1+1$ or $2+2$ or $15+10$, etc., has been

configured. Otherwise we assume that the pipeline sequence is simply~1.

configured. Otherwise we assume that the pipeline sequence is simply~1.

Suppose the pipeline sequence is $t_1+t_2+\cdots+t_k$. Each $t_j$ is

Suppose the pipeline sequence is $t_1+t_2+\cdots+t_k$. Each $t_j$ is

positive and less than~256, so we represent the sequence as a

positive and less than~256, so we represent the sequence as a

string |pipe_seq[data->i]| of unsigned ``characters,'' terminated by~0.

string |pipe_seq[data->i]| of unsigned ``characters,'' terminated by~0.

Given such a string, we want to do the following: Wait $(t_1-1)$ cycles

Given such a string, we want to do the following: Wait $(t_1-1)$ cycles

and pass |data| to stage~2; wait $t_2$ cycles and pass |data| to stage~3;

and pass |data| to stage~2; wait $t_2$ cycles and pass |data| to stage~3;

\dots; wait $t_{k-1}$ cycles and pass |data| to stage~$k$; wait $t_k$ cycles

\dots; wait $t_{k-1}$ cycles and pass |data| to stage~$k$; wait $t_k$ cycles

and make the results |known|.

and make the results |known|.

The value of |denin| is added to $t_1$; the value of |denout| is

The value of |denin| is added to $t_1$; the value of |denout| is

added to~$t_k$.

added to~$t_k$.

@=

@=

data->state=3;

data->state=3;

if (data->i<=max_pipe_op) {@+register unsigned char *s=pipe_seq[data->i];

if (data->i<=max_pipe_op) {@+register unsigned char *s=pipe_seq[data->i];

  j=s[0]+data->denin;

  j=s[0]+data->denin;

  if (s[1]) data->state=2; /* more than one stage */

  if (s[1]) data->state=2; /* more than one stage */

  else j+=data->denout;

  else j+=data->denout;

  if (j>1) wait(j-1);

  if (j>1) wait(j-1);

goto switch1;

goto switch1;

@ When we're in stage $j$, the coroutine for stage $j+1$ of the same functional

@ When we're in stage $j$, the coroutine for stage $j+1$ of the same functional

unit is |self+1|.

unit is |self+1|.

@=

@=

pass_data:@+

pass_data:@+

if ((self+1)->next) wait(1); /* stall if the next stage is occupied */

if ((self+1)->next) wait(1); /* stall if the next stage is occupied */

{@+register unsigned char *s=pipe_seq[data->i];

{@+register unsigned char *s=pipe_seq[data->i];

  j=s[self->stage];

  j=s[self->stage];

  if (s[self->stage+1]==0) j+=data->denout,data->state=3;

  if (s[self->stage+1]==0) j+=data->denout,data->state=3;

          /* the next stage is the last */

          /* the next stage is the last */

  pass_after(j);

  pass_after(j);

passit: (self+1)->ctl=data;

passit: (self+1)->ctl=data;

data->owner=self+1;

data->owner=self+1;

goto done;

goto done;

@ @=

@ @=

switch2:@+if (data->b.p && data->b.p->known)

switch2:@+if (data->b.p && data->b.p->known)

    data->b.o=data->b.p->o, data->b.p=NULL;

    data->b.o=data->b.p->o, data->b.p=NULL;

 switch(data->state) {

 switch(data->state) {

 case 0: panic(confusion("switch2"));

 case 0: panic(confusion("switch2"));

 case 1: @;

 case 1: @;

 case 2: goto pass_data;

 case 2: goto pass_data;

 case 3: goto fin_ex;

 case 3: goto fin_ex;

@;

@;

@ The default pipeline times use only one stage; they

@ The default pipeline times use only one stage; they

can be overridden by |MMIX_config|. The total number of stages

can be overridden by |MMIX_config|. The total number of stages

supported by this simulator is limited to 90, since

supported by this simulator is limited to 90, since

it must never interfere with the |stage| numbers for special coroutines

it must never interfere with the |stage| numbers for special coroutines

defined below. (The author doesn't feel guilty about making this restriction.)

defined below. (The author doesn't feel guilty about making this restriction.)

@=

@=

#define pipe_limit 90

#define pipe_limit 90

Extern unsigned char pipe_seq[max_pipe_op+1][pipe_limit+1];

Extern unsigned char pipe_seq[max_pipe_op+1][pipe_limit+1];

@ The simplest of all register-to-register operations is |set|,

@ The simplest of all register-to-register operations is |set|,

which occurs for commands like \.{SETH} as well as for commands

which occurs for commands like \.{SETH} as well as for commands

like \.{GETA}. (We might as well start with the easy cases and work our

like \.{GETA}. (We might as well start with the easy cases and work our

way up.)

way up.)

@=

@=

case set: data->x.o=data->z.o;@+break;

case set: data->x.o=data->z.o;@+break;

@ Here are the basic boolean operations, which account for 24 of \MMIX's

@ Here are the basic boolean operations, which account for 24 of \MMIX's

256 opcodes.

256 opcodes.

@=

@=

case or: data->x.o.h=data->y.o.h | data->z.o.h;

case or: data->x.o.h=data->y.o.h | data->z.o.h;

   data->x.o.l=data->y.o.l | data->z.o.l; break;

   data->x.o.l=data->y.o.l | data->z.o.l; break;

case orn: data->x.o.h=data->y.o.h |~data->z.o.h;

case orn: data->x.o.h=data->y.o.h |~data->z.o.h;

   data->x.o.l=data->y.o.l |~data->z.o.l; break;

   data->x.o.l=data->y.o.l |~data->z.o.l; break;

case nor: data->x.o.h=~(data->y.o.h | data->z.o.h);

case nor: data->x.o.h=~(data->y.o.h | data->z.o.h);

   data->x.o.l=~(data->y.o.l | data->z.o.l); break;

   data->x.o.l=~(data->y.o.l | data->z.o.l); break;

case and: data->x.o.h=data->y.o.h & data->z.o.h;

case and: data->x.o.h=data->y.o.h & data->z.o.h;

   data->x.o.l=data->y.o.l & data->z.o.l; break;

   data->x.o.l=data->y.o.l & data->z.o.l; break;

case andn: data->x.o.h=data->y.o.h &~data->z.o.h;

case andn: data->x.o.h=data->y.o.h &~data->z.o.h;

   data->x.o.l=data->y.o.l &~data->z.o.l; break;

   data->x.o.l=data->y.o.l &~data->z.o.l; break;

case nand: data->x.o.h=~(data->y.o.h & data->z.o.h);

case nand: data->x.o.h=~(data->y.o.h & data->z.o.h);

   data->x.o.l=~(data->y.o.l & data->z.o.l); break;

   data->x.o.l=~(data->y.o.l & data->z.o.l); break;

case xor: data->x.o.h=data->y.o.h ^ data->z.o.h;

case xor: data->x.o.h=data->y.o.h ^ data->z.o.h;

   data->x.o.l=data->y.o.l ^ data->z.o.l; break;

   data->x.o.l=data->y.o.l ^ data->z.o.l; break;

case nxor: data->x.o.h=data->y.o.h ^~data->z.o.h;

case nxor: data->x.o.h=data->y.o.h ^~data->z.o.h;

   data->x.o.l=data->y.o.l ^~data->z.o.l; break;

   data->x.o.l=data->y.o.l ^~data->z.o.l; break;

@ The implementation of \.{ADDU} is only slightly more difficult.

@ The implementation of \.{ADDU} is only slightly more difficult.

It would be trivial except for the fact that internal opcode

It would be trivial except for the fact that internal opcode

|addu| is used not only for the \.{ADDU[I]} and \.{INC[M][H,L]} operations,

|addu| is used not only for the \.{ADDU[I]} and \.{INC[M][H,L]} operations,

in which we simply want to add |data->y.o| to |data->z.o|, but also for

in which we simply want to add |data->y.o| to |data->z.o|, but also for

operations like \.{4ADDU}.

operations like \.{4ADDU}.

@=

@=

case addu: data->x.o=oplus((data->op&0xf8)==0x28?@|

case addu: data->x.o=oplus((data->op&0xf8)==0x28?@|

          shift_left(data->y.o,1+((data->op>>1)&0x3)): data->y.o, data->z.o);

          shift_left(data->y.o,1+((data->op>>1)&0x3)): data->y.o, data->z.o);

 break;

 break;

case subu: data->x.o=ominus(data->y.o,data->z.o);@+ break;

case subu: data->x.o=ominus(data->y.o,data->z.o);@+ break;

@ Signed addition and subtraction produce the same results as their

@ Signed addition and subtraction produce the same results as their

unsigned counterparts, but overflow must also be detected. Overflow

unsigned counterparts, but overflow must also be detected. Overflow

occurs when adding |y| to~|z| if and only if |y| and~|z| have the

occurs when adding |y| to~|z| if and only if |y| and~|z| have the

same sign but their sum has a different sign. Overflow occurs in

same sign but their sum has a different sign. Overflow occurs in

the calculation |x=y-z| if and only if it occurs in the calculation~|y=x+z|.

the calculation |x=y-z| if and only if it occurs in the calculation~|y=x+z|.

@=

@=

case add: data->x.o=oplus(data->y.o,data->z.o);

case add: data->x.o=oplus(data->y.o,data->z.o);

  if (((data->y.o.h ^ data->z.o.h)&sign_bit)==0 &&

  if (((data->y.o.h ^ data->z.o.h)&sign_bit)==0 &&

      ((data->y.o.h ^ data->x.o.h)&sign_bit)!=0) data->interrupt|=V_BIT;

      ((data->y.o.h ^ data->x.o.h)&sign_bit)!=0) data->interrupt|=V_BIT;

  break;

  break;

case sub: data->x.o=ominus(data->y.o,data->z.o);

case sub: data->x.o=ominus(data->y.o,data->z.o);

  if (((data->x.o.h ^ data->z.o.h)&sign_bit)==0 &&

  if (((data->x.o.h ^ data->z.o.h)&sign_bit)==0 &&

      ((data->y.o.h ^ data->x.o.h)&sign_bit)!=0) data->interrupt|=V_BIT;

      ((data->y.o.h ^ data->x.o.h)&sign_bit)!=0) data->interrupt|=V_BIT;

  break;

  break;

@ The shift commands might take more than one cycle, or they might even be

@ The shift commands might take more than one cycle, or they might even be

pipelined, if the default value of |pipe_seq[sh]| is changed. But we compute

pipelined, if the default value of |pipe_seq[sh]| is changed. But we compute

shifts all at once here, because other parts of the simulator will take care

shifts all at once here, because other parts of the simulator will take care

of the pipeline timing. (Notice that |shlu| is changed to |sh|, for this

of the pipeline timing. (Notice that |shlu| is changed to |sh|, for this

reason. Similar changes to the internal op codes are made for other operators

reason. Similar changes to the internal op codes are made for other operators

below.)

below.)

@d shift_amt (data->z.o.h || data->z.o.l>=64? 64: data->z.o.l)

@d shift_amt (data->z.o.h || data->z.o.l>=64? 64: data->z.o.l)

@=

@=

case shlu: data->x.o=shift_left(data->y.o,shift_amt);@+data->i=sh;@+ break;

case shlu: data->x.o=shift_left(data->y.o,shift_amt);@+data->i=sh;@+ break;

case shl: data->x.o=shift_left(data->y.o,shift_amt);@+data->i=sh;

case shl: data->x.o=shift_left(data->y.o,shift_amt);@+data->i=sh;

 {@+octa tmpo;

 {@+octa tmpo;

    tmpo=shift_right(data->x.o,shift_amt,0);

    tmpo=shift_right(data->x.o,shift_amt,0);

   if (tmpo.h!=data->y.o.h || tmpo.l!=data->y.o.l) data->interrupt|=V_BIT;

   if (tmpo.h!=data->y.o.h || tmpo.l!=data->y.o.l) data->interrupt|=V_BIT;

 }@+break;

 }@+break;

case shru: data->x.o=shift_right(data->y.o,shift_amt,1);@+data->i=sh;@+ break;

case shru: data->x.o=shift_right(data->y.o,shift_amt,1);@+data->i=sh;@+ break;

case shr:  data->x.o=shift_right(data->y.o,shift_amt,0);@+data->i=sh;@+ break;

case shr:  data->x.o=shift_right(data->y.o,shift_amt,0);@+data->i=sh;@+ break;

@ The \.{MUX} operation has three operands, namely |data->y|, |data->z|,

@ The \.{MUX} operation has three operands, namely |data->y|, |data->z|,

and |data->b|; the third operand is the current (speculative) value of~rM, the

and |data->b|; the third operand is the current (speculative) value of~rM, the

special mask register. Otherwise \.{MUX} is unexceptional.

special mask register. Otherwise \.{MUX} is unexceptional.

@=

@=

case mux: data->x.o.h=(data->y.o.h&data->b.o.h)+(data->z.o.h&~data->b.o.h);

case mux: data->x.o.h=(data->y.o.h&data->b.o.h)+(data->z.o.h&~data->b.o.h);

          data->x.o.l=(data->y.o.l&data->b.o.l)+(data->z.o.l&~data->b.o.l);

          data->x.o.l=(data->y.o.l&data->b.o.l)+(data->z.o.l&~data->b.o.l);

  break;

  break;

@ Comparisons are a breeze.

@ Comparisons are a breeze.

@=

@=

case cmp:@+if ((data->y.o.h&sign_bit)>(data->z.o.h&sign_bit)) goto cmp_neg;

case cmp:@+if ((data->y.o.h&sign_bit)>(data->z.o.h&sign_bit)) goto cmp_neg;

  if ((data->y.o.h&sign_bit)<(data->z.o.h&sign_bit)) goto cmp_pos;

  if ((data->y.o.h&sign_bit)<(data->z.o.h&sign_bit)) goto cmp_pos;

case cmpu:@+if (data->y.o.hz.o.h) goto cmp_neg;

case cmpu:@+if (data->y.o.hz.o.h) goto cmp_neg;

  if (data->y.o.h>data->z.o.h) goto cmp_pos;

  if (data->y.o.h>data->z.o.h) goto cmp_pos;

  if (data->y.o.lz.o.l) goto cmp_neg;

  if (data->y.o.lz.o.l) goto cmp_neg;

  if (data->y.o.l>data->z.o.l) goto cmp_pos;

  if (data->y.o.l>data->z.o.l) goto cmp_pos;

 cmp_zero: break; /* |data->x| is zero */

 cmp_zero: break; /* |data->x| is zero */

 cmp_pos: data->x.o.l=1;@+ break; /* |data->x.o.h| is zero */

 cmp_pos: data->x.o.l=1;@+ break; /* |data->x.o.h| is zero */

 cmp_neg: data->x.o=neg_one;@+ break;

 cmp_neg: data->x.o=neg_one;@+ break;

@ The other operations will be deferred until later, now that we understand

@ The other operations will be deferred until later, now that we understand

the basic ideas. But one more piece of code ought to be

the basic ideas. But one more piece of code ought to be

written before we move on, because

written before we move on, because

it completes the execution stage for the simple cases already considered.

it completes the execution stage for the simple cases already considered.

The |ren_x| and |ren_a| fields tell us whether the |x| and/or |a|

The |ren_x| and |ren_a| fields tell us whether the |x| and/or |a|

fields contain valid information that should become officially known.

fields contain valid information that should become officially known.

@=

@=

fin_ex:@+if (data->ren_x) data->x.known=true;

fin_ex:@+if (data->ren_x) data->x.known=true;

else if (data->mem_x) data->x.known=true, data->x.addr.l&=-8;

else if (data->mem_x) data->x.known=true, data->x.addr.l&=-8;

if (data->ren_a) data->a.known=true;

if (data->ren_a) data->a.known=true;

if (data->loc.h&sign_bit)

if (data->loc.h&sign_bit)

  data->ra.o.l=0; /* no trips enabled for the operating system */

  data->ra.o.l=0; /* no trips enabled for the operating system */

if (data->interrupt&0xffff) @;

if (data->interrupt&0xffff) @;

die: data->owner=NULL;@+goto terminate; /* this coroutine now fades away */

die: data->owner=NULL;@+goto terminate; /* this coroutine now fades away */

@* The commission/deissue stage. Control blocks leave the reorder buffer

@* The commission/deissue stage. Control blocks leave the reorder buffer

either at the hot end (when they're committed) or at the cool end

either at the hot end (when they're committed) or at the cool end

(when they're deissued). We hope most of them are committed, but

(when they're deissued). We hope most of them are committed, but

from time to time our speculation is incorrect and we must deissue

from time to time our speculation is incorrect and we must deissue

a sequence of instructions that prove to be unwanted. Deissuing must

a sequence of instructions that prove to be unwanted. Deissuing must

take priority over committing, because the dispatcher cannot do anything

take priority over committing, because the dispatcher cannot do anything

until the machine's cool state has stabilized.

until the machine's cool state has stabilized.

Deissuing changes the cool state by undoing the most recently issued

Deissuing changes the cool state by undoing the most recently issued

instructions, in reverse order. Committing changes the hot state by

instructions, in reverse order. Committing changes the hot state by

doing the least recently issued instructions, in their original order.

doing the least recently issued instructions, in their original order.

Both operations are similar, so we assume that they take the same time;

Both operations are similar, so we assume that they take the same time;

at most |commit_max| instructions are deissued and/or committed on

at most |commit_max| instructions are deissued and/or committed on

each clock cycle.

each clock cycle.

@=

@=

  cool=(cool==reorder_top? reorder_bot: cool+1);

  cool=(cool==reorder_top? reorder_bot: cool+1);

  if (verbose&issue_bit) {

  if (verbose&issue_bit) {

    printf("Deissuing ");@+print_control_block(cool);

    printf("Deissuing ");@+print_control_block(cool);

    if (cool->owner) {@+printf(" ");@+print_coroutine_id(cool->owner);@+}

    if (cool->owner) {@+printf(" ");@+print_coroutine_id(cool->owner);@+}

    printf("\n");

    printf("\n");

  if (cool->ren_x) rename_regs++,spec_rem(&cool->x);

  if (cool->ren_x) rename_regs++,spec_rem(&cool->x);

  if (cool->ren_a) rename_regs++,spec_rem(&cool->a);

  if (cool->ren_a) rename_regs++,spec_rem(&cool->a);

  if (cool->mem_x) mem_slots++,spec_rem(&cool->x);

  if (cool->mem_x) mem_slots++,spec_rem(&cool->x);

  if (cool->set_l) spec_rem(&cool->rl);

  if (cool->set_l) spec_rem(&cool->rl);

  if (cool->owner) {

  if (cool->owner) {

    if (cool->owner->lockloc)

    if (cool->owner->lockloc)

      *(cool->owner->lockloc)=NULL, cool->owner->lockloc=NULL;

      *(cool->owner->lockloc)=NULL, cool->owner->lockloc=NULL;

    if (cool->owner->next) unschedule(cool->owner);

    if (cool->owner->next) unschedule(cool->owner);

  cool_O=cool->cur_O;@+ cool_S=cool->cur_S;

  cool_O=cool->cur_O;@+ cool_S=cool->cur_S;

  deissues--;

  deissues--;

@ @=

@ @=

  if (nullifying) @@;

  if (nullifying) @@;

  else {

  else {

    if (hot->i==get && hot->zz==rQ)

    if (hot->i==get && hot->zz==rQ)

      new_Q=oandn(g[rQ].o,hot->x.o);

      new_Q=oandn(g[rQ].o,hot->x.o);

    else if (hot->i==put && hot->xx==rQ)

    else if (hot->i==put && hot->xx==rQ)

      hot->x.o.h |= new_Q.h, hot->x.o.l |= new_Q.l;

      hot->x.o.h |= new_Q.h, hot->x.o.l |= new_Q.l;

    if (hot->mem_x) @;

    if (hot->mem_x) @;

    if (verbose&issue_bit) {

    if (verbose&issue_bit) {

      printf("Committing ");@+print_control_block(hot);@+printf("\n");

      printf("Committing ");@+print_control_block(hot);@+printf("\n");

    if (hot->ren_x) rename_regs++,hot->x.up->o=hot->x.o,spec_rem(&(hot->x));

    if (hot->ren_x) rename_regs++,hot->x.up->o=hot->x.o,spec_rem(&(hot->x));

    if (hot->ren_a) rename_regs++,hot->a.up->o=hot->a.o,spec_rem(&(hot->a));

    if (hot->ren_a) rename_regs++,hot->a.up->o=hot->a.o,spec_rem(&(hot->a));

    if (hot->set_l) hot->rl.up->o=hot->rl.o,spec_rem(&(hot->rl));

    if (hot->set_l) hot->rl.up->o=hot->rl.o,spec_rem(&(hot->rl));

    if (hot->arith_exc) g[rA].o.l |= hot->arith_exc;

    if (hot->arith_exc) g[rA].o.l |= hot->arith_exc;

    if (hot->usage) {

    if (hot->usage) {

      g[rU].o.l++;@+ if (g[rU].o.l==0) {

      g[rU].o.l++;@+ if (g[rU].o.l==0) {

        g[rU].o.h++;@+ if ((g[rU].o.h&0x7fff)==0) g[rU].o.h-=0x8000;

        g[rU].o.h++;@+ if ((g[rU].o.h&0x7fff)==0) g[rU].o.h-=0x8000;

  if (hot->interrupt>=H_BIT) @;

  if (hot->interrupt>=H_BIT) @;

@ A load or store instruction is ``nullified'' if it is about to be captured

@ A load or store instruction is ``nullified'' if it is about to be captured

by a trap interrupt. In such cases it will be the only item in the reorder

by a trap interrupt. In such cases it will be the only item in the reorder

buffer; thus nullifying is sort of a cross between deissuing and

buffer; thus nullifying is sort of a cross between deissuing and

committing. (It is important to have stopped dispatching when nullification

committing. (It is important to have stopped dispatching when nullification

is necessary, because instructions such as |incgamma| and

is necessary, because instructions such as |incgamma| and

|decgamma| change~rS, and we need to change it back when an unexpected

|decgamma| change~rS, and we need to change it back when an unexpected

interruption occurs.)

interruption occurs.)

@=

@=

  if (verbose&issue_bit) {

  if (verbose&issue_bit) {

    printf("Nullifying ");@+print_control_block(hot);@+printf("\n");

    printf("Nullifying ");@+print_control_block(hot);@+printf("\n");

  if (hot->ren_x) rename_regs++,spec_rem(&hot->x);

  if (hot->ren_x) rename_regs++,spec_rem(&hot->x);

  if (hot->ren_a) rename_regs++,spec_rem(&hot->a);

  if (hot->ren_a) rename_regs++,spec_rem(&hot->a);

  if (hot->mem_x) mem_slots++,spec_rem(&hot->x);

  if (hot->mem_x) mem_slots++,spec_rem(&hot->x);

  if (hot->set_l) spec_rem(&hot->rl);

  if (hot->set_l) spec_rem(&hot->rl);

  cool_O=hot->cur_O, cool_S=hot->cur_S;

  cool_O=hot->cur_O, cool_S=hot->cur_S;

  nullifying=false;

  nullifying=false;

@ Interrupt bits in rQ might be lost if they are set between a \.{GET}

@ Interrupt bits in rQ might be lost if they are set between a \.{GET}

and a~\.{PUT}. Therefore we don't allow \.{PUT} to zero out bits that

and a~\.{PUT}. Therefore we don't allow \.{PUT} to zero out bits that

have become~1 since the most recently committed \.{GET}.

have become~1 since the most recently committed \.{GET}.

@=

@=

octa new_Q; /* when rQ increases in any bit position, so should this */

octa new_Q; /* when rQ increases in any bit position, so should this */

@ An instruction will not be committed immediately if it violates the basic

@ An instruction will not be committed immediately if it violates the basic

security rule of \MMIX: An instruction in a nonnegative location

security rule of \MMIX: An instruction in a nonnegative location

should not be performed unless all eight of the internal interrupts

should not be performed unless all eight of the internal interrupts

have been enabled in the interrupt mask register~rK.

have been enabled in the interrupt mask register~rK.

Conversely, an instruction in a negative location should not be performed

Conversely, an instruction in a negative location should not be performed

if the |P_BIT| is enabled in~rK.

if the |P_BIT| is enabled in~rK.

Such instructions take one extra cycle before they are committed.

Such instructions take one extra cycle before they are committed.

The nonnegative-location case turns on the |S_BIT| of both rK and~rQ\null,

The nonnegative-location case turns on the |S_BIT| of both rK and~rQ\null,

leading to an immediate interrupt (unless the current instruction

leading to an immediate interrupt (unless the current instruction

is |trap|, |put|, or~|resume|).

is |trap|, |put|, or~|resume|).

@=

@=

  if (hot->loc.h&sign_bit) {

  if (hot->loc.h&sign_bit) {

    if ((g[rK].o.h&P_BIT) && !(hot->interrupt&P_BIT)) {

    if ((g[rK].o.h&P_BIT) && !(hot->interrupt&P_BIT)) {

      hot->interrupt |= P_BIT;

      hot->interrupt |= P_BIT;

      g[rQ].o.h |= P_BIT;

      g[rQ].o.h |= P_BIT;

      new_Q.h |= P_BIT;

      new_Q.h |= P_BIT;

      if (verbose&issue_bit) {

      if (verbose&issue_bit) {

        printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

        printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

      break;

      break;

  }@+else if ((g[rK].o.h&0xff)!=0xff && !(hot->interrupt&S_BIT)) {

  }@+else if ((g[rK].o.h&0xff)!=0xff && !(hot->interrupt&S_BIT)) {

    hot->interrupt |= S_BIT;

    hot->interrupt |= S_BIT;

    g[rQ].o.h |= S_BIT;

    g[rQ].o.h |= S_BIT;

    new_Q.h |= S_BIT;

    new_Q.h |= S_BIT;

    g[rK].o.h |= S_BIT;

    g[rK].o.h |= S_BIT;

    if (verbose&issue_bit) {

    if (verbose&issue_bit) {

      printf(" setting rQ=");@+print_octa(g[rQ].o);

      printf(" setting rQ=");@+print_octa(g[rQ].o);

      printf(", rK=");@+print_octa(g[rK].o);@+printf("\n");

      printf(", rK=");@+print_octa(g[rK].o);@+printf("\n");

    break;

    break;

@* Branch prediction. An \MMIX\ programmer distinguishes statically between

@* Branch prediction. An \MMIX\ programmer distinguishes statically between

``branches'' and ``probable branches,'' but many modern computers attempt to

``branches'' and ``probable branches,'' but many modern computers attempt to

do better by implementing dynamic branch prediction. (See, for example,

do better by implementing dynamic branch prediction. (See, for example,

section~4.3 of Hennessy and Patterson's {\sl Computer Architecture},

section~4.3 of Hennessy and Patterson's {\sl Computer Architecture},

second edition.) Experience has shown that dynamic branch prediction can

second edition.) Experience has shown that dynamic branch prediction can

@^Hennessy, John LeRoy@>

@^Hennessy, John LeRoy@>

@^Patterson, David Andrew@>

@^Patterson, David Andrew@>

significantly improve the performance of speculative execution, by

significantly improve the performance of speculative execution, by

reducing the number of instructions that need to be deissued.

reducing the number of instructions that need to be deissued.

This simulator has an optional |bp_table| containing $2^{\mkern1mua+b+c}$ entries of

This simulator has an optional |bp_table| containing $2^{\mkern1mua+b+c}$ entries of

$n$~bits each, where $n$ is between 1 and~8. Usually $n$ is 1 or~2 in

$n$~bits each, where $n$ is between 1 and~8. Usually $n$ is 1 or~2 in

practice, but 8 bits are allocated per entry for convenience in this program.

practice, but 8 bits are allocated per entry for convenience in this program.

The |bp_table| is consulted and updated on every branch instruction

The |bp_table| is consulted and updated on every branch instruction

(every \.{B}~or \.{PB} instruction, but not~\.{JMP}), for advice on

(every \.{B}~or \.{PB} instruction, but not~\.{JMP}), for advice on

past history of similar situations. It is indexed by the $a$ least

past history of similar situations. It is indexed by the $a$ least

significant bits of the address of the instruction, the $b$ most recent

significant bits of the address of the instruction, the $b$ most recent

bits of global branch history, and the next $c$ bits of both address

bits of global branch history, and the next $c$ bits of both address

and history (exclusive-ored).

and history (exclusive-ored).

A |bp_table| entry begins at zero and is regarded as a signed $n$-bit number.

A |bp_table| entry begins at zero and is regarded as a signed $n$-bit number.

If it is nonnegative, we will follow the prediction in the instruction,

If it is nonnegative, we will follow the prediction in the instruction,

namely to predict a branch taken only in the \.{PB} case. If it is

namely to predict a branch taken only in the \.{PB} case. If it is

negative, we will predict the opposite of the instruction's recommendation.

negative, we will predict the opposite of the instruction's recommendation.

The $n$-bit number is increased (if possible) if the instruction's

The $n$-bit number is increased (if possible) if the instruction's

prediction was correct, decreased (if possible) if the instruction's

prediction was correct, decreased (if possible) if the instruction's

prediction was incorrect.

prediction was incorrect.

(Incidentally, a large value of~$n$ is not necessarily a good idea.

(Incidentally, a large value of~$n$ is not necessarily a good idea.

For example, if $n=8$ the machine might need 128 steps to

For example, if $n=8$ the machine might need 128 steps to

recognize that a branch taken the first 150 times is not taken

recognize that a branch taken the first 150 times is not taken

the next 150 times. And if we modify the update criteria to avoid this

the next 150 times. And if we modify the update criteria to avoid this

problem, we obtain a scheme that is rarely better than a simple scheme

problem, we obtain a scheme that is rarely better than a simple scheme

with smaller~$n$.)

with smaller~$n$.)

The values $a$, $b$, $c$, and $n$ in this discussion are called

The values $a$, $b$, $c$, and $n$ in this discussion are called

|bp_a|, |bp_b|, |bp_c|, and |bp_n| in the program.

|bp_a|, |bp_b|, |bp_c|, and |bp_n| in the program.

@=

@=

Extern int bp_a,bp_b,bp_c,bp_n; /* parameters for branch prediction */

Extern int bp_a,bp_b,bp_c,bp_n; /* parameters for branch prediction */

Extern char *bp_table; /* either |NULL| or an array of $2^{\mkern1mua+b+c}$ items */

Extern char *bp_table; /* either |NULL| or an array of $2^{\mkern1mua+b+c}$ items */

@ Branch prediction is made when we are either about to issue an

@ Branch prediction is made when we are either about to issue an

instruction or peeking ahead. We look at the |bp_table|, but we

instruction or peeking ahead. We look at the |bp_table|, but we

don't want to update it yet.

don't want to update it yet.

@=

@=

  predicted=op&0x10; /* start with the instruction's recommendation */

  predicted=op&0x10; /* start with the instruction's recommendation */

  if (bp_table) {@+register int h;

  if (bp_table) {@+register int h;

    m=((head->loc.l&bp_cmask)<loc.l&bp_amask);

    m=((head->loc.l&bp_cmask)<loc.l&bp_amask);

    m=((cool_hist&bp_bcmask)<>2);

    m=((cool_hist&bp_bcmask)<>2);

    h=bp_table[m];

    h=bp_table[m];

    if (h&bp_npower) predicted^=0x10;

    if (h&bp_npower) predicted^=0x10;

  if (predicted) peek_hist=(peek_hist<<1)+1;

  if (predicted) peek_hist=(peek_hist<<1)+1;

  else peek_hist<<=1;

  else peek_hist<<=1;

@ We update the |bp_table| when an instruction is issued.

@ We update the |bp_table| when an instruction is issued.

And we store the opposite table

And we store the opposite table

value in |cool->x.o.l|, just in case our prediction turns out to be wrong.

value in |cool->x.o.l|, just in case our prediction turns out to be wrong.

@=

@=

if (bp_table) {@+register int reversed,h,h_up,h_down;

if (bp_table) {@+register int reversed,h,h_up,h_down;

  reversed=op&0x10;

  reversed=op&0x10;

  if (peek_hist&1) reversed^=0x10;

  if (peek_hist&1) reversed^=0x10;

  m=((head->loc.l&bp_cmask)<loc.l&bp_amask);

  m=((head->loc.l&bp_cmask)<loc.l&bp_amask);

  m=((cool_hist&bp_bcmask)<>2);

  m=((cool_hist&bp_bcmask)<>2);

  h=bp_table[m];

  h=bp_table[m];

  h_up=(h+1)&bp_nmask;@+ if (h_up==bp_npower) h_up=h;

  h_up=(h+1)&bp_nmask;@+ if (h_up==bp_npower) h_up=h;

  if (h==bp_npower) h_down=h;@+ else h_down=(h-1)&bp_nmask;

  if (h==bp_npower) h_down=h;@+ else h_down=(h-1)&bp_nmask;

  if (reversed) {

  if (reversed) {

    bp_table[m]=h_down, cool->x.o.l=h_up;

    bp_table[m]=h_down, cool->x.o.l=h_up;

    cool->i=pbr+br-cool->i; /* reverse the sense */

    cool->i=pbr+br-cool->i; /* reverse the sense */

    bp_rev_stat++;

    bp_rev_stat++;

  }@+else {

  }@+else {

    bp_table[m]=h_up, cool->x.o.l=h_down; /* go with the flow */

    bp_table[m]=h_up, cool->x.o.l=h_down; /* go with the flow */

    bp_ok_stat++;

    bp_ok_stat++;

  if (verbose&show_pred_bit) {

  if (verbose&show_pred_bit) {

    printf(" predicting ");@+print_octa(cool->loc);

    printf(" predicting ");@+print_octa(cool->loc);

    printf(" %s; bp[%x]=%d\n",reversed? "NG": "OK",m,

    printf(" %s; bp[%x]=%d\n",reversed? "NG": "OK",m,

          bp_table[m]-((bp_table[m]&bp_npower)<<1));

          bp_table[m]-((bp_table[m]&bp_npower)<<1));

  cool->x.o.h=m;

  cool->x.o.h=m;

@ The calculations in the previous sections need several precomputed constants,

@ The calculations in the previous sections need several precomputed constants,

depending on the parameters $a$, $b$, $c$, and~$n$.

depending on the parameters $a$, $b$, $c$, and~$n$.

@=

@=

bp_amask=((1<

bp_amask=((1<

bp_cmask=((1<

bp_cmask=((1<

bp_bcmask=(1<<(bp_b+bp_c))-1; /* least $b+c$ bits of history info */

bp_bcmask=(1<<(bp_b+bp_c))-1; /* least $b+c$ bits of history info */

bp_nmask=(1<

bp_nmask=(1<

bp_npower=1<<(bp_n-1); /* $2^{n-1}$, the sign bit of an $n$-bit number */

bp_npower=1<<(bp_n-1); /* $2^{n-1}$, the sign bit of an $n$-bit number */

@ @=

@ @=

int bp_amask,bp_cmask,bp_bcmask,bp_nmask,bp_npower;

int bp_amask,bp_cmask,bp_bcmask,bp_nmask,bp_npower;

int bp_rev_stat,bp_ok_stat; /* how often we overrode and agreed */

int bp_rev_stat,bp_ok_stat; /* how often we overrode and agreed */

int bp_bad_stat,bp_good_stat; /* how often we failed and succeeded */

int bp_bad_stat,bp_good_stat; /* how often we failed and succeeded */

@ After a branch or probable branch instruction has been issued and

@ After a branch or probable branch instruction has been issued and

the value of the relevant register has been computed in the

the value of the relevant register has been computed in the

reorder buffer as |data->b.o|, we're ready to determine if the

reorder buffer as |data->b.o|, we're ready to determine if the

prediction was correct or not.

prediction was correct or not.

@=

@=

case br: case pbr: j=register_truth(data->b.o,data->op);

case br: case pbr: j=register_truth(data->b.o,data->op);

  if (j) data->go.o=data->z.o;@+ else data->go.o=data->y.o;

  if (j) data->go.o=data->z.o;@+ else data->go.o=data->y.o;

  if (j==(data->i==pbr)) bp_good_stat++;

  if (j==(data->i==pbr)) bp_good_stat++;

  else { /* oops, misprediction */

  else { /* oops, misprediction */

    bp_bad_stat++;

    bp_bad_stat++;

@;

@;

  goto fin_ex;

  goto fin_ex;

@ The |register_truth| subroutine is used by \.B, \.{PB}, \.{CS}, and

@ The |register_truth| subroutine is used by \.B, \.{PB}, \.{CS}, and

\.{ZS} commands to decide whether an octabyte satisfies the

\.{ZS} commands to decide whether an octabyte satisfies the

conditions of the opcode, |data->op|.

conditions of the opcode, |data->op|.

@=

@=

static int register_truth @,@,@[ARGS((octa,mmix_opcode))@];

static int register_truth @,@,@[ARGS((octa,mmix_opcode))@];

@ @=

@ @=

static int register_truth(o,op)

static int register_truth(o,op)

  octa o;

  octa o;

  mmix_opcode op;

  mmix_opcode op;

{@+register int b;

{@+register int b;

  switch ((op>>1) & 0x3) {

  switch ((op>>1) & 0x3) {

 case 0: b=o.h>>31;@+break; /* negative? */

 case 0: b=o.h>>31;@+break; /* negative? */

 case 1: b=(o.h==0 && o.l==0);@+break; /* zero? */

 case 1: b=(o.h==0 && o.l==0);@+break; /* zero? */

 case 2: b=(o.h

 case 2: b=(o.h

 case 3: b=o.l&0x1;@+break; /* odd? */

 case 3: b=o.l&0x1;@+break; /* odd? */

  if (op&0x8) return b^1;

  if (op&0x8) return b^1;

  else return b;

  else return b;

@ The |issued_between| subroutine determines how many speculative instructions

@ The |issued_between| subroutine determines how many speculative instructions

were issued between a given control block in the reorder buffer and

were issued between a given control block in the reorder buffer and

the current |cool| pointer, when |cc=cool|.

the current |cool| pointer, when |cc=cool|.

@=

@=

static int issued_between @,@,@[ARGS((control*,control*))@];

static int issued_between @,@,@[ARGS((control*,control*))@];

@ @=

@ @=

static int issued_between(c,cc)

static int issued_between(c,cc)

  control *c,*cc;

  control *c,*cc;

  if (c>cc) return c-1-cc;

  if (c>cc) return c-1-cc;

  return (c-reorder_bot)+(reorder_top-cc);

  return (c-reorder_bot)+(reorder_top-cc);

@ If more than one functional unit is able to process branch instructions and

@ If more than one functional unit is able to process branch instructions and

if two of them simultaneously discover misprediction, or if misprediction is

if two of them simultaneously discover misprediction, or if misprediction is

detected by one unit just as another unit is generating an interrupt, we

detected by one unit just as another unit is generating an interrupt, we

assume that an arbitration takes place so that only the hottest one actually

assume that an arbitration takes place so that only the hottest one actually

deissues the cooler instructions.

deissues the cooler instructions.

Changes to the |bp_table| aren't undone when they were made on speculation in

Changes to the |bp_table| aren't undone when they were made on speculation in

an instruction being deissued; nor do we worry about cases where the same

an instruction being deissued; nor do we worry about cases where the same

|bp_table| entry is being updated by two or more active coroutines. After all,

|bp_table| entry is being updated by two or more active coroutines. After all,

the |bp_table| is just a heuristic, not part of the real computation.

the |bp_table| is just a heuristic, not part of the real computation.

We correct the |bp_table| only if we discover that a prediction was wrong, so

We correct the |bp_table| only if we discover that a prediction was wrong, so

that we will be less likely to make the same mistake later.

that we will be less likely to make the same mistake later.

@=

@=

i=issued_between(data,cool);

i=issued_between(data,cool);

if (i

if (i

deissues=i;

deissues=i;

old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

@;

@;

inst_ptr.o=data->go.o, inst_ptr.p=NULL;

inst_ptr.o=data->go.o, inst_ptr.p=NULL;

if (!(data->loc.h&sign_bit)) {

if (!(data->loc.h&sign_bit)) {

  if (inst_ptr.o.h&sign_bit) data->interrupt |= P_BIT;

  if (inst_ptr.o.h&sign_bit) data->interrupt |= P_BIT;

  else data->interrupt &=~P_BIT;

  else data->interrupt &=~P_BIT;

if (bp_table) {

if (bp_table) {

  bp_table[data->x.o.h]=data->x.o.l; /* this is what we should have stored */

  bp_table[data->x.o.h]=data->x.o.l; /* this is what we should have stored */

  if (verbose&show_pred_bit) {

  if (verbose&show_pred_bit) {

    printf(" mispredicted ");@+print_octa(data->loc);

    printf(" mispredicted ");@+print_octa(data->loc);

    printf("; bp[%x]=%d\n",data->x.o.h,

    printf("; bp[%x]=%d\n",data->x.o.h,

          data->x.o.l-((data->x.o.l&bp_npower)<<1));

          data->x.o.l-((data->x.o.l&bp_npower)<<1));

cool_hist=(j? (data->hist<<1)+1: data->hist<<1);

cool_hist=(j? (data->hist<<1)+1: data->hist<<1);

@ @=

@ @=

Extern void print_stats @,@,@[ARGS((void))@];

Extern void print_stats @,@,@[ARGS((void))@];

@ @=

@ @=

void print_stats()

void print_stats()

  register int j;

  register int j;

  if (bp_table)

  if (bp_table)

    printf("Predictions: %d in agreement, %d in opposition; %d good, %d bad\n",

    printf("Predictions: %d in agreement, %d in opposition; %d good, %d bad\n",

                 bp_ok_stat,bp_rev_stat,bp_good_stat,bp_bad_stat);

                 bp_ok_stat,bp_rev_stat,bp_good_stat,bp_bad_stat);

  else printf("Predictions: %d good, %d bad\n",bp_good_stat,bp_bad_stat);

  else printf("Predictions: %d good, %d bad\n",bp_good_stat,bp_bad_stat);

  printf("Instructions issued per cycle:\n");

  printf("Instructions issued per cycle:\n");

  for (j=0;j<=dispatch_max;j++)

  for (j=0;j<=dispatch_max;j++)

    printf("  %d   %d\n",j,dispatch_stat[j]);

    printf("  %d   %d\n",j,dispatch_stat[j]);

@* Cache memory. It's time now to consider \MMIX's MMU, the memory management

@* Cache memory. It's time now to consider \MMIX's MMU, the memory management

unit. This part of the machine deals with the critical problem of getting data

unit. This part of the machine deals with the critical problem of getting data

to and from the computational units. In a RISC architecture all interaction

to and from the computational units. In a RISC architecture all interaction

between main memory and the computer registers is specified by load and store

between main memory and the computer registers is specified by load and store

instructions; thus memory accesses are much easier to deal with than they

instructions; thus memory accesses are much easier to deal with than they

would be on a machine with more complex kinds of interaction. But memory

would be on a machine with more complex kinds of interaction. But memory

management is still difficult, if we want to do it well, because main memory

management is still difficult, if we want to do it well, because main memory

typically operates at a much slower speed than the registers do. High-speed

typically operates at a much slower speed than the registers do. High-speed

implementations of \MMIX\ introduce intermediate ``caches'' of storage in

implementations of \MMIX\ introduce intermediate ``caches'' of storage in

order to keep the most important data accessible, and cache maintenance can be

order to keep the most important data accessible, and cache maintenance can be

complicated when all the details are taken into account.

complicated when all the details are taken into account.

(See, for example, Chapter 5 of Hennessy and Patterson's

(See, for example, Chapter 5 of Hennessy and Patterson's

{\sl Computer Architecture}, second edition.)

{\sl Computer Architecture}, second edition.)

@^Hennessy, John LeRoy@>

@^Hennessy, John LeRoy@>

@^Patterson, David Andrew@>

@^Patterson, David Andrew@>

@^caches@>

@^caches@>

This simulator can be configured to have up to three auxiliary caches between

This simulator can be configured to have up to three auxiliary caches between

registers and memory: An I-cache for instructions, a D-cache for data, and an

registers and memory: An I-cache for instructions, a D-cache for data, and an

S-cache for both instructions and data. The S-cache, also called a {\it

S-cache for both instructions and data. The S-cache, also called a {\it

secondary cache}, is supported only if both I-cache and D-cache are present.

secondary cache}, is supported only if both I-cache and D-cache are present.

Arbitrary access times for each cache can be specified independently;

Arbitrary access times for each cache can be specified independently;

we might assume, for example, that data items in the I-cache or D-cache can

we might assume, for example, that data items in the I-cache or D-cache can

be sent to a register in one or two clock cycles, but the access time for the

be sent to a register in one or two clock cycles, but the access time for the

S-cache might be say 5 cycles, and main memory might require 20 cycles or more.

S-cache might be say 5 cycles, and main memory might require 20 cycles or more.

Our speculative pipeline can have many functional units handling load

Our speculative pipeline can have many functional units handling load

and store instructions, but only one load or store instruction can be

and store instructions, but only one load or store instruction can be

updating the D-cache or S-cache or main memory at a time. (However, the

updating the D-cache or S-cache or main memory at a time. (However, the

D-cache can have several read ports; furthermore, data might

D-cache can have several read ports; furthermore, data might

be passing between the S-cache and memory while other data is passing

be passing between the S-cache and memory while other data is passing

between the reorder buffer and the D-cache.)

between the reorder buffer and the D-cache.)

Besides the optional I-cache, D-cache, and S-cache, there are required caches

Besides the optional I-cache, D-cache, and S-cache, there are required caches

called the IT-cache and DT-cache, for translation of virtual addresses to

called the IT-cache and DT-cache, for translation of virtual addresses to

physical addresses. A translation cache is often called a ``translation

physical addresses. A translation cache is often called a ``translation

@^TLB@>

@^TLB@>

@^translation caches@>

@^translation caches@>

lookaside buffer'' or TLB; but we call it a cache since it is implemented in

lookaside buffer'' or TLB; but we call it a cache since it is implemented in

nearly the same way as an I-cache.

nearly the same way as an I-cache.

@ Consider a cache that has blocks of $2^b$~bytes each and

@ Consider a cache that has blocks of $2^b$~bytes each and

associativity~$2^a$; here $b\ge3$ and $a\ge0$. The I-cache, D-cache, and

associativity~$2^a$; here $b\ge3$ and $a\ge0$. The I-cache, D-cache, and

S-cache are addressed by 48-bit physical addresses, as if they were part of

S-cache are addressed by 48-bit physical addresses, as if they were part of

main memory; but the IT and DT caches are addressed by 64-bit keys, obtained

main memory; but the IT and DT caches are addressed by 64-bit keys, obtained

from a virtual address by blanking out the lower $s$ bits and inserting the

from a virtual address by blanking out the lower $s$ bits and inserting the

value of~$n$, where the page size~$s$ and the process number~$n$ are found

value of~$n$, where the page size~$s$ and the process number~$n$ are found

in~rV. We will consider all caches to be addressed by 64-bit keys, so that

in~rV. We will consider all caches to be addressed by 64-bit keys, so that

both cases are handled with the same basic methods.

both cases are handled with the same basic methods.

Given a 64-bit key,

Given a 64-bit key,

we ignore the low-order $b$~bits and use the next $c$~bits

we ignore the low-order $b$~bits and use the next $c$~bits

to address the {\it cache set\/}; then the remaining $64-b-c$ bits should

to address the {\it cache set\/}; then the remaining $64-b-c$ bits should

match one of $2^a$ {\it tags\/} in that set. The case $a=0$ corresponds to a

match one of $2^a$ {\it tags\/} in that set. The case $a=0$ corresponds to a

so-called {\it direct-mapped\/} cache; the case $c=0$ corresponds to a

so-called {\it direct-mapped\/} cache; the case $c=0$ corresponds to a

so-called {\it fully associative\/} cache. With $2^c$ sets of $2^a$ blocks

so-called {\it fully associative\/} cache. With $2^c$ sets of $2^a$ blocks

each, and $2^b$ bytes per block, the cache contains $2^{a+b+c}$ bytes of data,

each, and $2^b$ bytes per block, the cache contains $2^{a+b+c}$ bytes of data,

in addition to the space needed for tags. Translation caches have $b=3$ and

in addition to the space needed for tags. Translation caches have $b=3$ and

they also usually have $c=0$.

they also usually have $c=0$.

If a tag matches the specified bits, we ``hit'' in the cache and can

If a tag matches the specified bits, we ``hit'' in the cache and can

use and/or update the data found there. Otherwise we ``miss,'' and

use and/or update the data found there. Otherwise we ``miss,'' and

we probably want to replace one of the cache blocks by the block containing

we probably want to replace one of the cache blocks by the block containing

the item sought. The item chosen for replacement is called a {\it victim}.

the item sought. The item chosen for replacement is called a {\it victim}.

The choice of victim is forced when the cache is direct-mapped, but four

The choice of victim is forced when the cache is direct-mapped, but four

strategies for victim selection are available when we must choose from

strategies for victim selection are available when we must choose from

among $2^a$ entries for $a>0$:

among $2^a$ entries for $a>0$:

\smallskip\textindent{$\bullet$} ``Random'' selection chooses the victim

\smallskip\textindent{$\bullet$} ``Random'' selection chooses the victim

by extracting the least significant $a$~bits of the clock.

by extracting the least significant $a$~bits of the clock.

\smallskip\textindent{$\bullet$} ``Serial'' selection chooses 0, 1, \dots,

\smallskip\textindent{$\bullet$} ``Serial'' selection chooses 0, 1, \dots,

$2^a-1$, 0, 1, \dots, $2^a-1$, 0, \dots~on successive trials.

$2^a-1$, 0, 1, \dots, $2^a-1$, 0, \dots~on successive trials.

\smallskip\textindent{$\bullet$} ``LRU (Least Recently Used)'' selection

\smallskip\textindent{$\bullet$} ``LRU (Least Recently Used)'' selection

chooses the victim that ranks last if items are ranked inversely to the time

chooses the victim that ranks last if items are ranked inversely to the time

that has elapsed since their previous use.

that has elapsed since their previous use.

\smallskip\textindent{$\bullet$} ``Pseudo-LRU'' selection chooses the

\smallskip\textindent{$\bullet$} ``Pseudo-LRU'' selection chooses the

victim by a rough approximation to LRU that is simpler to implement

victim by a rough approximation to LRU that is simpler to implement

in hardware. It requires a bit table $r_1\ldots r_{2^a-1}$.

in hardware. It requires a bit table $r_1\ldots r_{2^a-1}$.

Whenever we use an item

Whenever we use an item

with binary address $(i_1\ldots i_a)_2$ in the set, we adjust the

with binary address $(i_1\ldots i_a)_2$ in the set, we adjust the

bit table as follows:

bit table as follows:

$$r_1\gets1-i_1,\quad r_{1i_1}\gets1-i_2,\quad\ldots,\quad

$$r_1\gets1-i_1,\quad r_{1i_1}\gets1-i_2,\quad\ldots,\quad

r_{1i_1\ldots i_{a-1}}\gets1-i_a;$$

r_{1i_1\ldots i_{a-1}}\gets1-i_a;$$

here the subscripts on~$r$ are binary numbers. (For example, when $a=3$,

here the subscripts on~$r$ are binary numbers. (For example, when $a=3$,

the use of element $(010)_2$ sets $r_1\gets1$, $r_{10}\gets0$, $r_{101}\gets1$,

the use of element $(010)_2$ sets $r_1\gets1$, $r_{10}\gets0$, $r_{101}\gets1$,

where $r_{101}$ means the same as $r_5$.) To select a victim, we start with

where $r_{101}$ means the same as $r_5$.) To select a victim, we start with

$l\gets1$ and then repeatedly set $l\gets2l+r_l$, $a$~times; then we

$l\gets1$ and then repeatedly set $l\gets2l+r_l$, $a$~times; then we

choose element $l-2^a$. When $a=1$, this scheme is equivalent to LRU.

choose element $l-2^a$. When $a=1$, this scheme is equivalent to LRU.

When $a=2$, this scheme was implemented in the Intel 80486 chip.

When $a=2$, this scheme was implemented in the Intel 80486 chip.

@=

@=

typedef enum {@!random,@!serial,@!pseudo_lru,@!lru} replace_policy;

typedef enum {@!random,@!serial,@!pseudo_lru,@!lru} replace_policy;

@ A cache might also include a ``victim'' area, which contains the

@ A cache might also include a ``victim'' area, which contains the

last $2^v$ victim blocks removed from the main cache area. The victim

last $2^v$ victim blocks removed from the main cache area. The victim

area can be searched in parallel with the specified cache set, thereby

area can be searched in parallel with the specified cache set, thereby

increasing the chance of a hit without making the search go slower.

increasing the chance of a hit without making the search go slower.

Each of the three replacement policies can be used also in the victim cache.

Each of the three replacement policies can be used also in the victim cache.

@ A cache also has a {\it granularity\/} $2^g$, where $b\ge g\ge3$.  This

@ A cache also has a {\it granularity\/} $2^g$, where $b\ge g\ge3$.  This

means that we maintain, for each cache block, a set of $2^{b-g}$ ``dirty

means that we maintain, for each cache block, a set of $2^{b-g}$ ``dirty

bits,'' which identify the $2^g$-byte groups that have possibly changed since

bits,'' which identify the $2^g$-byte groups that have possibly changed since

they were last read from memory. Thus if $g=b$, an entire cache block is

they were last read from memory. Thus if $g=b$, an entire cache block is

either dirty or clean; if $g=3$, the dirtiness of each octabyte is maintained

either dirty or clean; if $g=3$, the dirtiness of each octabyte is maintained

separately.

separately.

Two policies are available when new data is written into all or part

Two policies are available when new data is written into all or part

of a cache block. We can {\it write-through}, meaning that we send all new data

of a cache block. We can {\it write-through}, meaning that we send all new data

to memory immediately and never mark anything dirty; or we can {\it

to memory immediately and never mark anything dirty; or we can {\it

write-back}, meaning that we update the memory from the cache only when

write-back}, meaning that we update the memory from the cache only when

absolutely necessary. Furthermore we can {\it write-allocate},

absolutely necessary. Furthermore we can {\it write-allocate},

meaning that we keep the new data in the cache, even if the cache block being

meaning that we keep the new data in the cache, even if the cache block being

written has to be fetched first because of a miss; or we can {\it

written has to be fetched first because of a miss; or we can {\it

write-around}, meaning that we keep the new data only if it was part of an

write-around}, meaning that we keep the new data only if it was part of an

existing cache block.

existing cache block.

(In this discussion, ``memory'' is shorthand for ``the next level

(In this discussion, ``memory'' is shorthand for ``the next level

of the memory hierarchy''; if there is an S-cache, the I-cache and

of the memory hierarchy''; if there is an S-cache, the I-cache and

D-cache write new data to the S-cache, not directly to memory. The I-cache,

D-cache write new data to the S-cache, not directly to memory. The I-cache,

IT-cache, and DT-cache are read-only, so they do not need the facilities

IT-cache, and DT-cache are read-only, so they do not need the facilities

discussed in this section. Moreover, the D-cache and S-cache can be assumed to

discussed in this section. Moreover, the D-cache and S-cache can be assumed to

have the same granularity.)

have the same granularity.)

@=

@=

#define WRITE_BACK 1 /* use this if not write-through */

#define WRITE_BACK 1 /* use this if not write-through */

#define WRITE_ALLOC 2 /* use this if not write-around */

#define WRITE_ALLOC 2 /* use this if not write-around */

@ We have seen that many flavors of cache can be simulated. They are

@ We have seen that many flavors of cache can be simulated. They are

represented by \&{cache} structures, containing arrays of \&{cacheset}

represented by \&{cache} structures, containing arrays of \&{cacheset}

structures that contain arrays of \&{cacheblock} structures

structures that contain arrays of \&{cacheblock} structures

for the individual blocks. We use a full byte to store each |dirty| bit,

for the individual blocks. We use a full byte to store each |dirty| bit,

and we use full integer words to store |rank| fields for LRU processing, etc.;

and we use full integer words to store |rank| fields for LRU processing, etc.;

memory economy is less important than simplicity in this simulator.

memory economy is less important than simplicity in this simulator.

@=

@=

typedef struct{

typedef struct{

  octa tag; /* bits of key not included in the cache block address */

  octa tag; /* bits of key not included in the cache block address */

  char *dirty; /* array of $2^{g-b}$ dirty bits, one per granule */

  char *dirty; /* array of $2^{g-b}$ dirty bits, one per granule */

  octa *data; /* array of $2^{b-3}$ octabytes, the data in a cache block */

  octa *data; /* array of $2^{b-3}$ octabytes, the data in a cache block */

  int rank; /* auxiliary information for non-|random| policies */

  int rank; /* auxiliary information for non-|random| policies */

} cacheblock;

} cacheblock;

@#

@#

typedef cacheblock *cacheset; /* array of $2^a$ or $2^v$ blocks */

typedef cacheblock *cacheset; /* array of $2^a$ or $2^v$ blocks */

@#

@#

typedef struct{

typedef struct{

  int a,b,c,g,v; /* lg of associativity, blocksize, setsize, granularity,

  int a,b,c,g,v; /* lg of associativity, blocksize, setsize, granularity,

         and victimsize */

         and victimsize */

  int aa,bb,cc,gg,vv; /* associativity, blocksize, setsize, granularity,

  int aa,bb,cc,gg,vv; /* associativity, blocksize, setsize, granularity,

         and victimsize (all powers of~2) */

         and victimsize (all powers of~2) */

  int tagmask; /* $-2^{b+c}$ */

  int tagmask; /* $-2^{b+c}$ */

  replace_policy repl,vrepl; /* how to choose victims and victim-victims */

  replace_policy repl,vrepl; /* how to choose victims and victim-victims */

  int mode; /* optional |WRITE_BACK| and/or |WRITE_ALLOC| */

  int mode; /* optional |WRITE_BACK| and/or |WRITE_ALLOC| */

  int access_time; /* cycles to know if there's a hit */

  int access_time; /* cycles to know if there's a hit */

  int copy_in_time; /* cycles to copy a new block into the cache */

  int copy_in_time; /* cycles to copy a new block into the cache */

  int copy_out_time; /* cycles to copy an old block from the cache */

  int copy_out_time; /* cycles to copy an old block from the cache */

  cacheset *set; /* array of $2^c$ sets of arrays of cache blocks */

  cacheset *set; /* array of $2^c$ sets of arrays of cache blocks */

  cacheset victim; /* the victim cache, if present */

  cacheset victim; /* the victim cache, if present */

  coroutine filler; /* a coroutine for copying new blocks into the cache */

  coroutine filler; /* a coroutine for copying new blocks into the cache */

  control filler_ctl; /* its control block */

  control filler_ctl; /* its control block */

  coroutine flusher; /* a coroutine for writing dirty old data

  coroutine flusher; /* a coroutine for writing dirty old data

                           from the cache */

                           from the cache */

  control flusher_ctl; /* its control block */

  control flusher_ctl; /* its control block */

  cacheblock inbuf; /* filling comes from here */

  cacheblock inbuf; /* filling comes from here */

  cacheblock outbuf; /* flushing goes to here */

  cacheblock outbuf; /* flushing goes to here */

  lockvar lock; /* nonzero when the cache is being changed significantly */

  lockvar lock; /* nonzero when the cache is being changed significantly */

  lockvar fill_lock; /* nonzero when filler should pass data back */

  lockvar fill_lock; /* nonzero when filler should pass data back */

  int ports; /* how many coroutines can be reading the cache? */

  int ports; /* how many coroutines can be reading the cache? */

  coroutine *reader; /* array of coroutines that might be reading

  coroutine *reader; /* array of coroutines that might be reading

                                    simultaneously */

                                    simultaneously */

  char *name; /* |"Icache"|, for example */

  char *name; /* |"Icache"|, for example */

} cache;

} cache;

@ @=

@ @=

Extern cache *Icache, *Dcache, *Scache, *ITcache, *DTcache;

Extern cache *Icache, *Dcache, *Scache, *ITcache, *DTcache;

@ Now we are ready to define some basic subroutines for cache maintenance.

@ Now we are ready to define some basic subroutines for cache maintenance.

Let's begin with a trivial routine that tests if a given cache block is dirty.

Let's begin with a trivial routine that tests if a given cache block is dirty.

@=

@=

static bool is_dirty @,@,@[ARGS((cache*,cacheblock*))@];

static bool is_dirty @,@,@[ARGS((cache*,cacheblock*))@];

@ @=

@ @=

static bool is_dirty(c,p)

static bool is_dirty(c,p)

  cache *c; /* the cache containing it */

  cache *c; /* the cache containing it */

  cacheblock *p; /* a cache block */

  cacheblock *p; /* a cache block */

  register int j;

  register int j;

  register char *d=p->dirty;

  register char *d=p->dirty;

  for (j=0;jbb;d++,j+=c->gg) if (*d) return true;

  for (j=0;jbb;d++,j+=c->gg) if (*d) return true;

  return false;

  return false;

@ For diagnostic purposes we might want to display an entire cache block.

@ For diagnostic purposes we might want to display an entire cache block.

@=

@=

static void print_cache_block @,@,@[ARGS((cacheblock,cache*))@];

static void print_cache_block @,@,@[ARGS((cacheblock,cache*))@];

@ @=

@ @=

static void print_cache_block(p,c)

static void print_cache_block(p,c)

  cacheblock p;

  cacheblock p;

  cache *c;

  cache *c;

{@+register int i,j,b=c->bb>>3,g=c->gg>>3;

{@+register int i,j,b=c->bb>>3,g=c->gg>>3;

  printf("%08x%08x: ",p.tag.h,p.tag.l);

  printf("%08x%08x: ",p.tag.h,p.tag.l);

  for (i=j=0; j

  for (i=j=0; j

    printf("%08x%08x%c",p.data[j].h,p.data[j].l,p.dirty[i]?'*':' ');

    printf("%08x%08x%c",p.data[j].h,p.data[j].l,p.dirty[i]?'*':' ');

  printf(" (%d)\n",p.rank);

  printf(" (%d)\n",p.rank);

@ @=

@ @=

static void print_cache_locks @,@,@[ARGS((cache*))@];

static void print_cache_locks @,@,@[ARGS((cache*))@];

@ @=

@ @=

static void print_cache_locks(c)

static void print_cache_locks(c)

  cache *c;

  cache *c;

  if (c) {

  if (c) {

    if (c->lock) printf("%s locked by %s:%d\n",

    if (c->lock) printf("%s locked by %s:%d\n",

                    c->name,c->lock->name,c->lock->stage);

                    c->name,c->lock->name,c->lock->stage);

    if (c->fill_lock) printf("%sfill locked by %s:%d\n",

    if (c->fill_lock) printf("%sfill locked by %s:%d\n",

                    c->name,c->fill_lock->name,c->fill_lock->stage);

                    c->name,c->fill_lock->name,c->fill_lock->stage);

@ The |print_cache| routine prints the entire contents of a cache. This can be

@ The |print_cache| routine prints the entire contents of a cache. This can be

a huge amount of data, but it can be very useful when debugging. Fortunately,

a huge amount of data, but it can be very useful when debugging. Fortunately,

the task of debugging favors the use of small caches, since interesting cases

the task of debugging favors the use of small caches, since interesting cases

arise more often when a cache is fairly small.

arise more often when a cache is fairly small.

@=

@=

Extern void print_cache @,@,@[ARGS((cache*,bool))@];

Extern void print_cache @,@,@[ARGS((cache*,bool))@];

@ @=

@ @=

void print_cache(c,dirty_only)

void print_cache(c,dirty_only)

  cache *c;

  cache *c;

  bool dirty_only;

  bool dirty_only;

  if (c) {@+register int i,j;

  if (c) {@+register int i,j;

    printf("%s of %s:",dirty_only?"Dirty blocks":"Contents",c->name);

    printf("%s of %s:",dirty_only?"Dirty blocks":"Contents",c->name);

    if (c->filler.next) {

    if (c->filler.next) {

      printf(" (filling ");

      printf(" (filling ");

      print_octa(c->name[1]=='T'? c->filler_ctl.y.o: c->filler_ctl.z.o);

      print_octa(c->name[1]=='T'? c->filler_ctl.y.o: c->filler_ctl.z.o);

      printf(")");

      printf(")");

    if (c->flusher.next) {

    if (c->flusher.next) {

      printf(" (flushing ");

      printf(" (flushing ");

      print_octa(c->outbuf.tag);

      print_octa(c->outbuf.tag);

      printf(")");

      printf(")");

    printf("\n");

    printf("\n");

@;

@;

@ We don't print the cache blocks that have an invalid tag, unless

@ We don't print the cache blocks that have an invalid tag, unless

requested to be verbose.

requested to be verbose.

@=

@=

for (i=0;icc;i++) for (j=0;jaa;j++)

for (i=0;icc;i++) for (j=0;jaa;j++)

  if ((!(c->set[i][j].tag.h&sign_bit)||(verbose&show_wholecache_bit))&&@|

  if ((!(c->set[i][j].tag.h&sign_bit)||(verbose&show_wholecache_bit))&&@|

       (!dirty_only || is_dirty(c,&c->set[i][j]))) {

       (!dirty_only || is_dirty(c,&c->set[i][j]))) {

    printf("[%d][%d] ",i,j);

    printf("[%d][%d] ",i,j);

    print_cache_block(c->set[i][j],c);

    print_cache_block(c->set[i][j],c);

for (j=0;jvv;j++)

for (j=0;jvv;j++)

  if ((!(c->victim[j].tag.h&sign_bit)||(verbose&show_wholecache_bit))&&@|

  if ((!(c->victim[j].tag.h&sign_bit)||(verbose&show_wholecache_bit))&&@|

       (!dirty_only || is_dirty(c,&c->victim[j]))) {

       (!dirty_only || is_dirty(c,&c->victim[j]))) {

    printf("V[%d] ",j);

    printf("V[%d] ",j);

    print_cache_block(c->victim[j],c);

    print_cache_block(c->victim[j],c);

@ The |clean_block| routine simply initializes a given cache block.

@ The |clean_block| routine simply initializes a given cache block.

@=

@=

Extern void clean_block @,@,@[ARGS((cache*,cacheblock*))@];

Extern void clean_block @,@,@[ARGS((cache*,cacheblock*))@];

@ @=

@ @=

void clean_block(c,p)

void clean_block(c,p)

  cache *c;

  cache *c;

  cacheblock *p;

  cacheblock *p;

  register int j;

  register int j;

  p->tag.h=sign_bit, p->tag.l=0;

  p->tag.h=sign_bit, p->tag.l=0;

  for (j=0;jbb>>3;j++) p->data[j]=zero_octa;

  for (j=0;jbb>>3;j++) p->data[j]=zero_octa;

  for (j=0;jbb>>c->g;j++) p->dirty[j]=false;

  for (j=0;jbb>>c->g;j++) p->dirty[j]=false;

@ The |zap_cache| routine invalidates all tags of a given cache,

@ The |zap_cache| routine invalidates all tags of a given cache,

effectively restoring it to its initial condition.

effectively restoring it to its initial condition.

@=

@=

Extern void zap_cache @,@,@[ARGS((cache*))@];

Extern void zap_cache @,@,@[ARGS((cache*))@];

@ We clear the |dirty| entries here, just to be tidy, although

@ We clear the |dirty| entries here, just to be tidy, although

they could actually be left in arbitrary condition when the tags are invalid.

they could actually be left in arbitrary condition when the tags are invalid.

@=

@=

void zap_cache(c)

void zap_cache(c)

  cache *c;

  cache *c;

  register int i,j;

  register int i,j;

  for (i=0;icc;i++) for (j=0;jaa;j++) {

  for (i=0;icc;i++) for (j=0;jaa;j++) {

    clean_block(c,&(c->set[i][j]));

    clean_block(c,&(c->set[i][j]));

  for (j=0;jvv;j++) {

  for (j=0;jvv;j++) {

    clean_block(c,&(c->victim[j]));

    clean_block(c,&(c->victim[j]));

@ The |get_reader| subroutine finds the index of

@ The |get_reader| subroutine finds the index of

an available reader coroutine for a given cache, or returns a negative value

an available reader coroutine for a given cache, or returns a negative value

if no readers are available.

if no readers are available.

@=

@=

static int get_reader @,@,@[ARGS((cache*))@];

static int get_reader @,@,@[ARGS((cache*))@];

@ @=

@ @=

static int get_reader(c)

static int get_reader(c)

  cache *c;

  cache *c;

{@+ register int j;

{@+ register int j;

  for (j=0;jports;j++)

  for (j=0;jports;j++)

    if (c->reader[j].next==NULL) return j;

    if (c->reader[j].next==NULL) return j;

  return -1;

  return -1;

@ The subroutine |copy_block(c,p,cc,pp)| copies the dirty

@ The subroutine |copy_block(c,p,cc,pp)| copies the dirty

items from block~|p| of cache~|c| into block~|pp| of cache~|cc|, assuming

items from block~|p| of cache~|c| into block~|pp| of cache~|cc|, assuming

that the destination cache has a sufficiently large block size.

that the destination cache has a sufficiently large block size.

(In other words, we assume that |cc->b>=c->b|.) We also assume that both

(In other words, we assume that |cc->b>=c->b|.) We also assume that both

blocks have compatible tags, and that both caches have the same granularity.

blocks have compatible tags, and that both caches have the same granularity.

@=

@=

static void copy_block @,@,@[ARGS((cache*,cacheblock*,cache*,cacheblock*))@];

static void copy_block @,@,@[ARGS((cache*,cacheblock*,cache*,cacheblock*))@];

@ @=

@ @=

static void copy_block(c,p,cc,pp)

static void copy_block(c,p,cc,pp)

  cache *c,*cc;

  cache *c,*cc;

  cacheblock *p,*pp;

  cacheblock *p,*pp;

  register int j,jj,i,ii,lim; register int off=p->tag.l&(cc->bb-1);

  register int j,jj,i,ii,lim; register int off=p->tag.l&(cc->bb-1);

  if (c->g!=cc->g || p->tag.h!=pp->tag.h || p->tag.l-off!=pp->tag.l)

  if (c->g!=cc->g || p->tag.h!=pp->tag.h || p->tag.l-off!=pp->tag.l)

    panic(confusion("copy block"));

    panic(confusion("copy block"));

  for (j=0,jj=off>>c->g;jbb>>c->g;j++,jj++) if (p->dirty[j]) {

  for (j=0,jj=off>>c->g;jbb>>c->g;j++,jj++) if (p->dirty[j]) {

    pp->dirty[jj]=true;

    pp->dirty[jj]=true;

    for (i=j<<(c->g-3),ii=jj<<(c->g-3),lim=(j+1)<<(c->g-3);

    for (i=j<<(c->g-3),ii=jj<<(c->g-3),lim=(j+1)<<(c->g-3);

              idata[ii]=p->data[i];

              idata[ii]=p->data[i];

@ The |choose_victim| subroutine selects the victim to be replaced when we

@ The |choose_victim| subroutine selects the victim to be replaced when we

need to change a cache~set. We need only one bit of the |rank| fields to

need to change a cache~set. We need only one bit of the |rank| fields to

implement the $r$~table when |policy=pseudo_lru|,

implement the $r$~table when |policy=pseudo_lru|,

and we don't need |rank| at all when |policy=random|. Of course we use an

and we don't need |rank| at all when |policy=random|. Of course we use an

$a$-bit counter to implement |policy=serial|. In the other case,

$a$-bit counter to implement |policy=serial|. In the other case,

|policy=lru|, we need an $a$-bit |rank| field; the least recently used entry

|policy=lru|, we need an $a$-bit |rank| field; the least recently used entry

has rank~0, and the most recently used entry has rank~$2^a-1=|aa|-1$.

has rank~0, and the most recently used entry has rank~$2^a-1=|aa|-1$.

@=

@=

static cacheblock* choose_victim @,@,@[ARGS((cacheset,int,replace_policy))@];

static cacheblock* choose_victim @,@,@[ARGS((cacheset,int,replace_policy))@];

@ @=

@ @=

static cacheblock* choose_victim(s,aa,policy)

static cacheblock* choose_victim(s,aa,policy)

  cacheset s;

  cacheset s;

  int aa; /* setsize */

  int aa; /* setsize */

  replace_policy policy;

  replace_policy policy;

  register cacheblock *p;

  register cacheblock *p;

  register int l,m;

  register int l,m;

  switch (policy) {

  switch (policy) {

 case random: return &s[ticks.l&(aa-1)];

 case random: return &s[ticks.l&(aa-1)];

 case serial: l=s[0].rank;@+ s[0].rank=(l+1)&(aa-1);@+ return &s[l];

 case serial: l=s[0].rank;@+ s[0].rank=(l+1)&(aa-1);@+ return &s[l];

 case lru: for (p=s;p

 case lru: for (p=s;p

    if (p->rank==0) return p;

    if (p->rank==0) return p;

  panic(confusion("lru victim")); /* what happened? nobody has rank zero */

  panic(confusion("lru victim")); /* what happened? nobody has rank zero */

 case pseudo_lru: for (l=1,m=aa>>1; m; m>>=1) l=l+l+s[l].rank;

 case pseudo_lru: for (l=1,m=aa>>1; m; m>>=1) l=l+l+s[l].rank;

   return &s[l-aa];

   return &s[l-aa];

@ The |note_usage| subroutine updates the |rank| entries to record the

@ The |note_usage| subroutine updates the |rank| entries to record the

fact that a particular block in a cache set is now being used.

fact that a particular block in a cache set is now being used.

@=

@=

static void note_usage @,@,@[ARGS((cacheblock*,cacheset,int,replace_policy))@];

static void note_usage @,@,@[ARGS((cacheblock*,cacheset,int,replace_policy))@];

@ @=

@ @=

static void note_usage(l,s,aa,policy)

static void note_usage(l,s,aa,policy)

  cacheblock *l; /* a cache block that's probably worth preserving */

  cacheblock *l; /* a cache block that's probably worth preserving */

  cacheset s; /* the set that contains $l$ */

  cacheset s; /* the set that contains $l$ */

  int aa; /* setsize */

  int aa; /* setsize */

  replace_policy policy;

  replace_policy policy;

  register cacheblock *p;

  register cacheblock *p;

  register int j,m,r;

  register int j,m,r;

  if (aa==1 || policy<=serial) return;

  if (aa==1 || policy<=serial) return;

  if (policy==lru) {

  if (policy==lru) {

    r=l->rank;

    r=l->rank;

    for (p=s;prank>r) p->rank--;

    for (p=s;prank>r) p->rank--;

    l->rank=aa-1;

    l->rank=aa-1;

  } else { /* |policy==pseudo_lru| */

  } else { /* |policy==pseudo_lru| */

    r=l-s;

    r=l-s;

    for (j=1,m=aa>>1;m;m>>=1)

    for (j=1,m=aa>>1;m;m>>=1)

      if (r&m) s[j].rank=0,j=j+j+1;

      if (r&m) s[j].rank=0,j=j+j+1;

      else s[j].rank=1, j=j+j;

      else s[j].rank=1, j=j+j;

  return;

  return;

@ The |demote_usage| subroutine is sort of the opposite of |note_usage|;

@ The |demote_usage| subroutine is sort of the opposite of |note_usage|;

it changes the rank of a given block to {\it least\/} recently used.

it changes the rank of a given block to {\it least\/} recently used.

@=

@=

static void demote_usage @,@,@[ARGS((cacheblock*,cacheset,int,replace_policy))@];

static void demote_usage @,@,@[ARGS((cacheblock*,cacheset,int,replace_policy))@];

@ @=

@ @=

static void demote_usage(l,s,aa,policy)

static void demote_usage(l,s,aa,policy)

  cacheblock *l; /* a cache block we probably don't need */

  cacheblock *l; /* a cache block we probably don't need */

  cacheset s; /* the set that contains $l$ */

  cacheset s; /* the set that contains $l$ */

  int aa; /* setsize */

  int aa; /* setsize */

  replace_policy policy;

  replace_policy policy;

  register cacheblock *p;

  register cacheblock *p;

  register int j,m,r;

  register int j,m,r;

  if (aa==1 || policy<=serial) return;

  if (aa==1 || policy<=serial) return;

  if (policy==lru) {

  if (policy==lru) {

    r=l->rank;

    r=l->rank;

    for (p=s;prankrank++;

    for (p=s;prankrank++;

    l->rank=0;

    l->rank=0;

  } else { /* |policy==pseudo_lru| */

  } else { /* |policy==pseudo_lru| */

    r=l-s;

    r=l-s;

    for (j=1,m=aa>>1;m;m>>=1)

    for (j=1,m=aa>>1;m;m>>=1)

      if (r&m) s[j].rank=1,j=j+j+1;

      if (r&m) s[j].rank=1,j=j+j+1;

      else s[j].rank=0, j=j+j;

      else s[j].rank=0, j=j+j;

  return;

  return;

@ The |cache_search| routine looks for a given key $\alpha$

@ The |cache_search| routine looks for a given key $\alpha$

in a given cache, and returns a cache block if there's a hit; otherwise

in a given cache, and returns a cache block if there's a hit; otherwise

it returns~|NULL|. If the search hits, the set in which the block was

it returns~|NULL|. If the search hits, the set in which the block was

found is stored in global variable |hit_set|. Notice that we need to check

found is stored in global variable |hit_set|. Notice that we need to check

more bits of the tag when we search in the victim area.

more bits of the tag when we search in the victim area.

@d cache_addr(c,alf) c->set[(alf.l&~(c->tagmask))>>c->b]

@d cache_addr(c,alf) c->set[(alf.l&~(c->tagmask))>>c->b]

@=

@=

static cacheblock* cache_search @,@,@[ARGS((cache*,octa))@];

static cacheblock* cache_search @,@,@[ARGS((cache*,octa))@];

@ @=

@ @=

static cacheblock* cache_search(c,alf)

static cacheblock* cache_search(c,alf)

  cache *c; /* the cache to be searched */

  cache *c; /* the cache to be searched */

  octa alf; /* the key */

  octa alf; /* the key */

  register cacheset s;

  register cacheset s;

  register cacheblock* p;

  register cacheblock* p;

  s=cache_addr(c,alf); /* the set corresponding to |alf| */

  s=cache_addr(c,alf); /* the set corresponding to |alf| */

  for (p=s;paa;p++)

  for (p=s;paa;p++)

    if (((p->tag.l ^ alf.l)&c->tagmask)==0 && p->tag.h==alf.h) goto hit;

    if (((p->tag.l ^ alf.l)&c->tagmask)==0 && p->tag.h==alf.h) goto hit;

  s=c->victim;

  s=c->victim;

  if (!s) return NULL; /* cache miss, and no victim area */

  if (!s) return NULL; /* cache miss, and no victim area */

  for (p=s;pvv;p++)

  for (p=s;pvv;p++)

    if (((p->tag.l^alf.l)&(-c->bb))==0 && p->tag.h==alf.h) goto hit;

    if (((p->tag.l^alf.l)&(-c->bb))==0 && p->tag.h==alf.h) goto hit;

  return NULL; /* double miss */

  return NULL; /* double miss */

 hit: hit_set=s;@+ return p;

 hit: hit_set=s;@+ return p;

@ @=

@ @=

cacheset hit_set;

cacheset hit_set;

@ If |p=cache_search(c,alf)| hits and if we call |use_and_fix(c,p)|

@ If |p=cache_search(c,alf)| hits and if we call |use_and_fix(c,p)|

immediately afterwards, cache~|c| is updated to record the usage of

immediately afterwards, cache~|c| is updated to record the usage of

key~|alf|. A hit in the victim area moves the cache block to the main area,

key~|alf|. A hit in the victim area moves the cache block to the main area,

unless the |filler| routine of cache~|c| is active.

unless the |filler| routine of cache~|c| is active.

A pointer to the (possibly moved) cache block is returned.

A pointer to the (possibly moved) cache block is returned.

@=

@=

static cacheblock* use_and_fix @,@,@[ARGS((cache*,cacheblock*))@];

static cacheblock* use_and_fix @,@,@[ARGS((cache*,cacheblock*))@];

@ @=

@ @=

static cacheblock *use_and_fix(c,p)

static cacheblock *use_and_fix(c,p)

  cache *c;

  cache *c;

  cacheblock *p;

  cacheblock *p;

  if (hit_set!=c->victim) note_usage(p,hit_set,c->aa,c->repl);

  if (hit_set!=c->victim) note_usage(p,hit_set,c->aa,c->repl);

  else { note_usage(p,hit_set,c->vv,c->vrepl); /* found in victim cache */

  else { note_usage(p,hit_set,c->vv,c->vrepl); /* found in victim cache */

    if (!c->filler.next) {

    if (!c->filler.next) {

      register cacheset s=cache_addr(c,p->tag);

      register cacheset s=cache_addr(c,p->tag);

      register cacheblock *q=choose_victim(s,c->aa,c->repl);

      register cacheblock *q=choose_victim(s,c->aa,c->repl);

      note_usage(q,s,c->aa,c->repl);

      note_usage(q,s,c->aa,c->repl);

@;

@;

      return q;

      return q;

  return p;

  return p;

@ We can simply permute the pointers inside the cacheblock structures of a

@ We can simply permute the pointers inside the cacheblock structures of a

cache, instead of copying the data, if we are careful not to let any of those

cache, instead of copying the data, if we are careful not to let any of those

pointers escape into other data structures.

pointers escape into other data structures.

@=

@=

  octa t;

  octa t;

  register char *d=p->dirty;

  register char *d=p->dirty;

  register octa *dd=p->data;

  register octa *dd=p->data;

  t=p->tag;@+p->tag=q->tag;@+q->tag=t;

  t=p->tag;@+p->tag=q->tag;@+q->tag=t;

  p->dirty=q->dirty;@+q->dirty=d;

  p->dirty=q->dirty;@+q->dirty=d;

  p->data=q->data;@+q->data=dd;

  p->data=q->data;@+q->data=dd;

@ The |demote_and_fix| routine is analogous to |use_and_fix|,

@ The |demote_and_fix| routine is analogous to |use_and_fix|,

except that we don't want to promote the data we found.

except that we don't want to promote the data we found.

@=

@=

static cacheblock* demote_and_fix @,@,@[ARGS((cache*,cacheblock*))@];

static cacheblock* demote_and_fix @,@,@[ARGS((cache*,cacheblock*))@];

@ @=

@ @=

static cacheblock *demote_and_fix(c,p)

static cacheblock *demote_and_fix(c,p)

  cache *c;

  cache *c;

  cacheblock *p;

  cacheblock *p;

  if (hit_set!=c->victim) demote_usage(p,hit_set,c->aa,c->repl);

  if (hit_set!=c->victim) demote_usage(p,hit_set,c->aa,c->repl);

  else demote_usage(p,hit_set,c->vv,c->vrepl);

  else demote_usage(p,hit_set,c->vv,c->vrepl);

  return p;

  return p;

@ The subroutine |load_cache(c,p)| is called at a moment when

@ The subroutine |load_cache(c,p)| is called at a moment when

|c->lock| has been set and |c->inbuf| has been filled with clean data

|c->lock| has been set and |c->inbuf| has been filled with clean data

to be placed in the cache block~|p|.

to be placed in the cache block~|p|.

@=

@=

static void load_cache @,@,@[ARGS((cache*,cacheblock*))@];

static void load_cache @,@,@[ARGS((cache*,cacheblock*))@];

@ @=

@ @=

static void load_cache(c,p)

static void load_cache(c,p)

  cache *c;

  cache *c;

  cacheblock *p;

  cacheblock *p;

  register int i;

  register int i;

  register octa *d;

  register octa *d;

  for (i=0;ibb>>c->g;i++) p->dirty[i]=false;

  for (i=0;ibb>>c->g;i++) p->dirty[i]=false;

  d=p->data;@+ p->data=c->inbuf.data;@+ c->inbuf.data=d;

  d=p->data;@+ p->data=c->inbuf.data;@+ c->inbuf.data=d;

  p->tag=c->inbuf.tag;

  p->tag=c->inbuf.tag;

  hit_set=cache_addr(c,p->tag);@+

  hit_set=cache_addr(c,p->tag);@+

  use_and_fix(c,p); /* |p| not moved */

  use_and_fix(c,p); /* |p| not moved */

@ The subroutine |flush_cache(c,p,keep)| is called at a ``quiet''

@ The subroutine |flush_cache(c,p,keep)| is called at a ``quiet''

moment when |c->flusher.next=NULL|.

moment when |c->flusher.next=NULL|.

It puts cache block~|p| into |c->outbuf| and

It puts cache block~|p| into |c->outbuf| and

fires up the |c->flusher| coroutine, which will take care of

fires up the |c->flusher| coroutine, which will take care of

sending the data to lower levels of the memory hierarchy.

sending the data to lower levels of the memory hierarchy.

Cache block~|p| is also marked clean.

Cache block~|p| is also marked clean.

@=

@=

static void flush_cache @,@,@[ARGS((cache*,cacheblock*,bool))@];

static void flush_cache @,@,@[ARGS((cache*,cacheblock*,bool))@];

@ @=

@ @=

static void flush_cache(c,p,keep)

static void flush_cache(c,p,keep)

  cache *c;

  cache *c;

  cacheblock *p; /* a block inside cache |c| */

  cacheblock *p; /* a block inside cache |c| */

  bool keep; /* should we preserve the data in |p|? */

  bool keep; /* should we preserve the data in |p|? */

    register octa *d;

    register octa *d;

    register char *dd;

    register char *dd;

    register int j;

    register int j;

    c->outbuf.tag=p->tag;

    c->outbuf.tag=p->tag;

    if (keep)@+ for (j=0;jbb>>3;j++) c->outbuf.data[j]=p->data[j];

    if (keep)@+ for (j=0;jbb>>3;j++) c->outbuf.data[j]=p->data[j];

    else d=c->outbuf.data, c->outbuf.data=p->data, p->data=d;

    else d=c->outbuf.data, c->outbuf.data=p->data, p->data=d;

    dd=c->outbuf.dirty, c->outbuf.dirty=p->dirty, p->dirty=dd;

    dd=c->outbuf.dirty, c->outbuf.dirty=p->dirty, p->dirty=dd;

    for (j=0;jbb>>c->g;j++) p->dirty[j]=false;

    for (j=0;jbb>>c->g;j++) p->dirty[j]=false;

    startup(&c->flusher,c->copy_out_time); /* will not be aborted */

    startup(&c->flusher,c->copy_out_time); /* will not be aborted */

@ The |alloc_slot| routine is called when we wish to put new information

@ The |alloc_slot| routine is called when we wish to put new information

into a cache after a cache miss. It returns a pointer to a cache block

into a cache after a cache miss. It returns a pointer to a cache block

in the main area where the new information should be put. The tag of

in the main area where the new information should be put. The tag of

that cache block is invalidated; the calling routine should take care

that cache block is invalidated; the calling routine should take care

of filling it and giving it a valid tag in due time. The cache's |filler|

of filling it and giving it a valid tag in due time. The cache's |filler|

routine should not be active when |alloc_slot| is called.

routine should not be active when |alloc_slot| is called.

Inserting new information might also require writing old information

Inserting new information might also require writing old information

into the next level of the memory hierarchy, if the block being replaced

into the next level of the memory hierarchy, if the block being replaced

is dirty. This routine returns |NULL| in such cases if the cache is

is dirty. This routine returns |NULL| in such cases if the cache is

flushing a previously discarded block.

flushing a previously discarded block.

Otherwise it schedules the |flusher| coroutine.

Otherwise it schedules the |flusher| coroutine.

This routine returns |NULL| also if the given key happens to be in the

This routine returns |NULL| also if the given key happens to be in the

cache. Such cases are rare, but the following scenario shows that

cache. Such cases are rare, but the following scenario shows that

they aren't impossible: Suppose the DT-cache access time is 5, the D-cache

they aren't impossible: Suppose the DT-cache access time is 5, the D-cache

access time is~1, and two processes simultaneously look for the

access time is~1, and two processes simultaneously look for the

same physical address. One process hits in DT-cache but misses in D-cache,

same physical address. One process hits in DT-cache but misses in D-cache,

waiting 5 cycles before trying |alloc_slot| in the D-cache; meanwhile

waiting 5 cycles before trying |alloc_slot| in the D-cache; meanwhile

the other process missed in D-cache but didn't need to use the DT-cache,

the other process missed in D-cache but didn't need to use the DT-cache,

so it might have updated the D-cache.

so it might have updated the D-cache.

A key value is never negative. Therefore we can invalidate the tag in

A key value is never negative. Therefore we can invalidate the tag in

the chosen slot by forcing it to be negative.

the chosen slot by forcing it to be negative.

@=

@=

static cacheblock* alloc_slot @,@,@[ARGS((cache*,octa))@];

static cacheblock* alloc_slot @,@,@[ARGS((cache*,octa))@];

@ @=

@ @=

static cacheblock* alloc_slot(c,alf)

static cacheblock* alloc_slot(c,alf)

  cache *c;

  cache *c;

  octa alf; /* key that probably isn't in the cache */

  octa alf; /* key that probably isn't in the cache */

  register cacheset s;

  register cacheset s;

  register cacheblock *p,*q;

  register cacheblock *p,*q;

  if (cache_search(c,alf)) return NULL;

  if (cache_search(c,alf)) return NULL;

  s=cache_addr(c,alf); /* the set corresponding to |alf| */

  s=cache_addr(c,alf); /* the set corresponding to |alf| */

  if (c->victim) p=choose_victim(c->victim,c->vv,c->vrepl);

  if (c->victim) p=choose_victim(c->victim,c->vv,c->vrepl);

  else p=choose_victim(s,c->aa,c->repl);

  else p=choose_victim(s,c->aa,c->repl);

  if (is_dirty(c,p)) {

  if (is_dirty(c,p)) {

    if (c->flusher.next) return NULL;

    if (c->flusher.next) return NULL;

    flush_cache(c,p,false);

    flush_cache(c,p,false);

  if (c->victim) {

  if (c->victim) {

    q=choose_victim(s,c->aa,c->repl);

    q=choose_victim(s,c->aa,c->repl);

@;

@;

    q->tag.h |= sign_bit; /* invalidate the tag */

    q->tag.h |= sign_bit; /* invalidate the tag */

    return q;

    return q;

  p->tag.h |= sign_bit;@+ return p;

  p->tag.h |= sign_bit;@+ return p;

@* Simulated memory. How should we deal with the potentially gigantic

@* Simulated memory. How should we deal with the potentially gigantic

memory of~\MMIX? We can't simply declare an array~$m$ that has

memory of~\MMIX? We can't simply declare an array~$m$ that has

$2^{48}$ bytes. (Indeed, up to $2^{63}$ bytes are needed, if we

$2^{48}$ bytes. (Indeed, up to $2^{63}$ bytes are needed, if we

consider also the physical addresses $\ge2^{48}$ that are reserved for

consider also the physical addresses $\ge2^{48}$ that are reserved for

memory-mapped input/output.)

memory-mapped input/output.)

We could regard memory as a special kind of cache,

We could regard memory as a special kind of cache,

in which every access is required to hit. For example, such an ``M-cache''

in which every access is required to hit. For example, such an ``M-cache''

could be fully associative, with $2^a$ blocks each

could be fully associative, with $2^a$ blocks each

having a different tag; simulation could proceed until more than~$2^a-1$ tags

having a different tag; simulation could proceed until more than~$2^a-1$ tags

are required. But then the predefined value of~$a$ might well be so large that

are required. But then the predefined value of~$a$ might well be so large that

the sequential search of our |cache_search| routine would be too slow.

the sequential search of our |cache_search| routine would be too slow.

Instead, we will allocate memory in chunks of $2^{16}$ bytes at a time,

Instead, we will allocate memory in chunks of $2^{16}$ bytes at a time,

as needed, and we will use hashing to search for the relevant chunk

as needed, and we will use hashing to search for the relevant chunk

whenever a physical address is given. If the address is $2^{48}$ or greater,

whenever a physical address is given. If the address is $2^{48}$ or greater,

special routines called |spec_read| and |spec_write|, supplied by the

special routines called |spec_read| and |spec_write|, supplied by the

user, will be called upon to do the reading or writing. Otherwise

user, will be called upon to do the reading or writing. Otherwise

the 48-bit address consists of a 32-bit {\it chunk address\/} and a

the 48-bit address consists of a 32-bit {\it chunk address\/} and a

16-bit {\it chunk offset}.

16-bit {\it chunk offset}.

Chunk addresses that are not used take no space in this simulator. But if,

Chunk addresses that are not used take no space in this simulator. But if,

say, 1000 such patterns occur, the simulator will dynamically allocate

say, 1000 such patterns occur, the simulator will dynamically allocate

approximately 65MB for the portions of main memory that are used.

approximately 65MB for the portions of main memory that are used.

Parameter |mem_chunks_max| specifies the largest number of different chunk

Parameter |mem_chunks_max| specifies the largest number of different chunk

addresses that are supported. This parameter does not constrain the range of

addresses that are supported. This parameter does not constrain the range of

simulated physical addresses, which cover the entire 256 large-terabyte range

simulated physical addresses, which cover the entire 256 large-terabyte range

permitted by~\MMIX.

permitted by~\MMIX.

@=

@=

typedef struct {

typedef struct {

  tetra tag; /* 32-bit chunk address */

  tetra tag; /* 32-bit chunk address */

  octa *chunk; /* either |NULL| or an array of $2^{13}$ octabytes */

  octa *chunk; /* either |NULL| or an array of $2^{13}$ octabytes */

} chunknode;

} chunknode;

@ The parameter |hash_prime| should be a prime number larger than the

@ The parameter |hash_prime| should be a prime number larger than the

parameter

parameter

|mem_chunks_max|, preferably more than twice as large but not much bigger

|mem_chunks_max|, preferably more than twice as large but not much bigger

than~that. The default values |mem_chunks_max=1000| and |hash_prime=2003| are

than~that. The default values |mem_chunks_max=1000| and |hash_prime=2003| are

set by |MMIX_config| unless the user specifies otherwise.

set by |MMIX_config| unless the user specifies otherwise.

@=

@=

Extern int mem_chunks; /* this many chunks are allocated so far */

Extern int mem_chunks; /* this many chunks are allocated so far */

Extern int mem_chunks_max; /* up to this many different chunks per run */

Extern int mem_chunks_max; /* up to this many different chunks per run */

Extern int hash_prime; /* larger than |mem_chunks_max|, but not enormous */

Extern int hash_prime; /* larger than |mem_chunks_max|, but not enormous */

Extern chunknode *mem_hash; /* the simulated main memory */

Extern chunknode *mem_hash; /* the simulated main memory */

@ The separately compiled procedures |spec_read()| and |spec_write()| have the

@ The separately compiled procedures |spec_read()| and |spec_write()| have the

same calling conventions as the general procedures

same calling conventions as the general procedures

|mem_read()| and |mem_write()|.

|mem_read()| and |mem_write()|.

@=

@=

extern octa spec_read @,@,@[ARGS((octa addr))@]; /* for memory mapped I/O */

extern octa spec_read @,@,@[ARGS((octa addr))@]; /* for memory mapped I/O */

extern void spec_write @,@,@[ARGS((octa addr,octa val))@]; /* likewise */

extern void spec_write @,@,@[ARGS((octa addr,octa val))@]; /* likewise */

@ If the program tries to read from a chunk that hasn't been allocated,

@ If the program tries to read from a chunk that hasn't been allocated,

the value zero is returned, optionally with a comment to the user.

the value zero is returned, optionally with a comment to the user.

Chunk address 0 is always allocated first. Then we can assume that

Chunk address 0 is always allocated first. Then we can assume that

a matching chunk tag implies a nonnull |chunk| pointer.

a matching chunk tag implies a nonnull |chunk| pointer.

This routine sets |last_h| to the chunk found, so that we can rapidly read

This routine sets |last_h| to the chunk found, so that we can rapidly read

other words that we know must belong to the same chunk. For this purpose

other words that we know must belong to the same chunk. For this purpose

it is convenient to let |mem_hash[hash_prime]| be a chunk full of zeros,

it is convenient to let |mem_hash[hash_prime]| be a chunk full of zeros,

representing uninitialized memory.

representing uninitialized memory.

@=

@=

Extern octa mem_read @,@,@[ARGS((octa addr))@];

Extern octa mem_read @,@,@[ARGS((octa addr))@];

@ @=

@ @=

octa mem_read(addr)

octa mem_read(addr)

  octa addr;

  octa addr;

  register tetra off,key;

  register tetra off,key;

  register int h;

  register int h;

  if (addr.h>=(1<<16)) return spec_read(addr);

  if (addr.h>=(1<<16)) return spec_read(addr);

  off=(addr.l&0xffff)>>3;

  off=(addr.l&0xffff)>>3;

  key=(addr.l&0xffff0000)+addr.h;

  key=(addr.l&0xffff0000)+addr.h;

  for (h=key%hash_prime;mem_hash[h].tag!=key;h--) {

  for (h=key%hash_prime;mem_hash[h].tag!=key;h--) {

    if (mem_hash[h].chunk==NULL) {

    if (mem_hash[h].chunk==NULL) {

      if (verbose&uninit_mem_bit)

      if (verbose&uninit_mem_bit)

        errprint2("uninitialized memory read at %08x%08x",addr.h,addr.l);

        errprint2("uninitialized memory read at %08x%08x",addr.h,addr.l);

@.uninitialized memory...@>

@.uninitialized memory...@>

      h=hash_prime;@+ break; /* zero will be returned */

      h=hash_prime;@+ break; /* zero will be returned */

    if (h==0) h=hash_prime;

    if (h==0) h=hash_prime;

  last_h=h;

  last_h=h;

  return mem_hash[h].chunk[off];

  return mem_hash[h].chunk[off];

@ @=

@ @=

Extern int last_h; /* the hash index that was most recently correct */

Extern int last_h; /* the hash index that was most recently correct */

@ @=

@ @=

Extern void mem_write @,@,@[ARGS((octa addr,octa val))@];

Extern void mem_write @,@,@[ARGS((octa addr,octa val))@];

@ @=

@ @=

void mem_write(addr,val)

void mem_write(addr,val)

  octa addr,val;

  octa addr,val;

  register tetra off,key;

  register tetra off,key;

  register int h;

  register int h;

  if (addr.h>=(1<<16)) {@+spec_write(addr,val);@+return;@+}

  if (addr.h>=(1<<16)) {@+spec_write(addr,val);@+return;@+}

  off=(addr.l&0xffff)>>3;

  off=(addr.l&0xffff)>>3;

  key=(addr.l&0xffff0000)+addr.h;

  key=(addr.l&0xffff0000)+addr.h;

  for (h=key%hash_prime;mem_hash[h].tag!=key;h--) {

  for (h=key%hash_prime;mem_hash[h].tag!=key;h--) {

    if (mem_hash[h].chunk==NULL) {

    if (mem_hash[h].chunk==NULL) {

      if (++mem_chunks>mem_chunks_max)

      if (++mem_chunks>mem_chunks_max)

        panic(errprint1("More than %d memory chunks are needed",

        panic(errprint1("More than %d memory chunks are needed",

@.More...chunks are needed@>

@.More...chunks are needed@>

                 mem_chunks_max));

                 mem_chunks_max));

      mem_hash[h].chunk=(octa *)calloc(1<<13,sizeof(octa));

      mem_hash[h].chunk=(octa *)calloc(1<<13,sizeof(octa));

      if (mem_hash[h].chunk==NULL)

      if (mem_hash[h].chunk==NULL)

        panic(errprint1("I can't allocate memory chunk number %d",

        panic(errprint1("I can't allocate memory chunk number %d",

@.I can't allocate...@>

@.I can't allocate...@>

                 mem_chunks));

                 mem_chunks));

      mem_hash[h].tag=key;

      mem_hash[h].tag=key;

      break;

      break;

    if (h==0) h=hash_prime;

    if (h==0) h=hash_prime;

  last_h=h;

  last_h=h;

  mem_hash[h].chunk[off]=val;

  mem_hash[h].chunk[off]=val;

@ The memory is characterized by several parameters, depending on the

@ The memory is characterized by several parameters, depending on the

characteristics of the memory bus being simulated. Let |bus_words|

characteristics of the memory bus being simulated. Let |bus_words|

be the number of octabytes read or written simultaneously (usually

be the number of octabytes read or written simultaneously (usually

|bus_words| is 1 or~2; it must be a power of~2). The number of clock

|bus_words| is 1 or~2; it must be a power of~2). The number of clock

cycles needed to read or write |c*bus_words| octabytes that all belong to the

cycles needed to read or write |c*bus_words| octabytes that all belong to the

same cache block is assumed to be |mem_addr_time+c*mem_read_time| or

same cache block is assumed to be |mem_addr_time+c*mem_read_time| or

|mem_addr_time+c*mem_write_time|, respectively.

|mem_addr_time+c*mem_write_time|, respectively.

@=

@=

Extern int mem_addr_time; /* cycles to transmit an address on memory bus */

Extern int mem_addr_time; /* cycles to transmit an address on memory bus */

Extern int bus_words; /* width of memory bus, in octabytes */

Extern int bus_words; /* width of memory bus, in octabytes */

Extern int mem_read_time; /* cycles to read from main memory */

Extern int mem_read_time; /* cycles to read from main memory */

Extern int mem_write_time; /* cycles to write to main memory */

Extern int mem_write_time; /* cycles to write to main memory */

Extern lockvar mem_lock; /* is nonnull when the bus is busy */

Extern lockvar mem_lock; /* is nonnull when the bus is busy */

@ One of the principal ways to write memory is to invoke

@ One of the principal ways to write memory is to invoke

a |flush_to_mem| coroutine,

a |flush_to_mem| coroutine,

which is the |Scache->flusher| if there is an S-cache, or the

which is the |Scache->flusher| if there is an S-cache, or the

|Dcache->flusher| if there is a D-cache but no S-cache.

|Dcache->flusher| if there is a D-cache but no S-cache.

When such a coroutine is started, its |data->ptr_a| will be |Scache|

When such a coroutine is started, its |data->ptr_a| will be |Scache|

or~|Dcache|. The data to be written will just have been copied to the cache's

or~|Dcache|. The data to be written will just have been copied to the cache's

|outbuf|.

|outbuf|.

@=

@=

case flush_to_mem: {@+register cache *c=(cache *)data->ptr_a;

case flush_to_mem: {@+register cache *c=(cache *)data->ptr_a;

 switch (data->state) {

 switch (data->state) {

  case 0:@+ if (mem_lock) wait(1);

  case 0:@+ if (mem_lock) wait(1);

    data->state=1;

    data->state=1;

  case 1: set_lock(self,mem_lock);

  case 1: set_lock(self,mem_lock);

    data->state=2;

    data->state=2;

    @outbuf| and wait for the bus@>;

    @outbuf| and wait for the bus@>;

  case 2: goto terminate; /* this frees |mem_lock| and |c->outbuf| */

  case 2: goto terminate; /* this frees |mem_lock| and |c->outbuf| */

@ @outbuf| and wait for the bus@>=

@ @outbuf| and wait for the bus@>=

  register int off,last_off,count,first,ii;

  register int off,last_off,count,first,ii;

  register int del=c->gg>>3; /* octabytes per granule */

  register int del=c->gg>>3; /* octabytes per granule */

  octa addr;

  octa addr;

  addr=c->outbuf.tag;@+ off=(addr.l&0xffff)>>3;

  addr=c->outbuf.tag;@+ off=(addr.l&0xffff)>>3;

  for (i=j=0,first=1,count=0;jbb>>c->g;j++) {

  for (i=j=0,first=1,count=0;jbb>>c->g;j++) {

    ii=i+del;

    ii=i+del;

    if (!c->outbuf.dirty[j]) i=ii,off+=del,addr.l+=del<<3;

    if (!c->outbuf.dirty[j]) i=ii,off+=del,addr.l+=del<<3;

    else@+ while (i

    else@+ while (i

      if (first) {

      if (first) {

        count++;@+ last_off=off;@+ first=0;

        count++;@+ last_off=off;@+ first=0;

        mem_write(addr,c->outbuf.data[i]);

        mem_write(addr,c->outbuf.data[i]);

      }@+else {

      }@+else {

        if ((off^last_off)&(-bus_words)) count++;

        if ((off^last_off)&(-bus_words)) count++;

        last_off=off;

        last_off=off;

        mem_hash[last_h].chunk[off]=c->outbuf.data[i];

        mem_hash[last_h].chunk[off]=c->outbuf.data[i];

      i++;@+ off++;@+ addr.l+=8;

      i++;@+ off++;@+ addr.l+=8;

  wait(mem_addr_time+count*mem_write_time);

  wait(mem_addr_time+count*mem_write_time);

@* Cache transfers. We have seen that the |Dcache->flusher| sends

@* Cache transfers. We have seen that the |Dcache->flusher| sends

data directly to the main memory if there is no S-cache.

data directly to the main memory if there is no S-cache.

But if both D-cache and S-cache exist, the |Dcache->flusher| is a

But if both D-cache and S-cache exist, the |Dcache->flusher| is a

more complicated coroutine of type |flush_to_S|. In this case we need

more complicated coroutine of type |flush_to_S|. In this case we need

to deal with the fact that the S-cache blocks might be larger than

to deal with the fact that the S-cache blocks might be larger than

the D-cache blocks; furthermore, the S-cache might have a

the D-cache blocks; furthermore, the S-cache might have a

write-around and/or write-through policy, etc. But one simplifying

write-around and/or write-through policy, etc. But one simplifying

fact does help us: We know that the flusher coroutine will not be

fact does help us: We know that the flusher coroutine will not be

aborted until it has run to completion.

aborted until it has run to completion.

Some machines, such as the Alpha 21164, have an additional cache between

Some machines, such as the Alpha 21164, have an additional cache between

the S-cache and memory, called the B-cache (the ``backup cache''). A B-cache

the S-cache and memory, called the B-cache (the ``backup cache''). A B-cache

could be simulated by extending the logic used here; but such extensions

could be simulated by extending the logic used here; but such extensions

of the present program are left to the interested reader.

of the present program are left to the interested reader.

@=

@=

case flush_to_S: {@+register cache *c=(cache *)data->ptr_a;

case flush_to_S: {@+register cache *c=(cache *)data->ptr_a;

  register int block_diff=Scache->bb-c->bb;

  register int block_diff=Scache->bb-c->bb;

  p=(cacheblock*)data->ptr_b;

  p=(cacheblock*)data->ptr_b;

 switch (data->state) {

 switch (data->state) {

  case 0:@+ if (Scache->lock) wait(1);

  case 0:@+ if (Scache->lock) wait(1);

    data->state=1;

    data->state=1;

  case 1: set_lock(self,Scache->lock);

  case 1: set_lock(self,Scache->lock);

    data->ptr_b=(void*)cache_search(Scache,c->outbuf.tag);

    data->ptr_b=(void*)cache_search(Scache,c->outbuf.tag);

    if (data->ptr_b) data->state=4;

    if (data->ptr_b) data->state=4;

    else if (Scache->mode & WRITE_ALLOC) data->state=(block_diff? 2: 3);

    else if (Scache->mode & WRITE_ALLOC) data->state=(block_diff? 2: 3);

    else data->state=6;

    else data->state=6;

    wait(Scache->access_time);

    wait(Scache->access_time);

  case 2: @inbuf| with clean memory data@>;

  case 2: @inbuf| with clean memory data@>;

  case 3: @;

  case 3: @;

    if (block_diff) @inbuf| to slot |p|@>;

    if (block_diff) @inbuf| to slot |p|@>;

  case 4: copy_block(c,&(c->outbuf),Scache,p);

  case 4: copy_block(c,&(c->outbuf),Scache,p);

    hit_set=cache_addr(Scache,c->outbuf.tag);@+ use_and_fix(Scache,p);

    hit_set=cache_addr(Scache,c->outbuf.tag);@+ use_and_fix(Scache,p);

                   /* |p| not moved */

                   /* |p| not moved */

    data->state=5;@+ wait(Scache->copy_in_time);

    data->state=5;@+ wait(Scache->copy_in_time);

  case 5:@+ if ((Scache->mode&WRITE_BACK)==0) { /* write-through */

  case 5:@+ if ((Scache->mode&WRITE_BACK)==0) { /* write-through */

      if (Scache->flusher.next) wait(1);

      if (Scache->flusher.next) wait(1);

      flush_cache(Scache,p,true);

      flush_cache(Scache,p,true);

    goto terminate;

    goto terminate;

  case 6:@;

  case 6:@;

@ @=

@ @=

if (Scache->filler.next) wait(1); /* perhaps an unnecessary precaution? */

if (Scache->filler.next) wait(1); /* perhaps an unnecessary precaution? */

p=alloc_slot(Scache,c->outbuf.tag);

p=alloc_slot(Scache,c->outbuf.tag);

if (!p) wait(1);

if (!p) wait(1);

data->ptr_b=(void*)p;

data->ptr_b=(void*)p;

p->tag=c->outbuf.tag;@+ p->tag.l=c->outbuf.tag.l&(-Scache->bb);

p->tag=c->outbuf.tag;@+ p->tag.l=c->outbuf.tag.l&(-Scache->bb);

@ We only need to read |block_diff| bytes, but it's easier to

@ We only need to read |block_diff| bytes, but it's easier to

read them all and to charge only for reading the ones we needed.

read them all and to charge only for reading the ones we needed.

@inbuf| with clean memory data@>=

@inbuf| with clean memory data@>=

{@+register int count=block_diff>>3;

{@+register int count=block_diff>>3;

  register int off,delay;

  register int off,delay;

  octa addr;

  octa addr;

  if (mem_lock) wait(1);

  if (mem_lock) wait(1);

  addr.h=c->outbuf.tag.h;@+ addr.l=c->outbuf.tag.l&-Scache->bb;

  addr.h=c->outbuf.tag.h;@+ addr.l=c->outbuf.tag.l&-Scache->bb;

  off=(addr.l&0xffff)>>3;

  off=(addr.l&0xffff)>>3;

  for (j=0;jbb>>3;j++)

  for (j=0;jbb>>3;j++)

    if (j==0) Scache->inbuf.data[j]=mem_read(addr);

    if (j==0) Scache->inbuf.data[j]=mem_read(addr);

    else Scache->inbuf.data[j]=mem_hash[last_h].chunk[j+off];

    else Scache->inbuf.data[j]=mem_hash[last_h].chunk[j+off];

  set_lock(&mem_locker,mem_lock);

  set_lock(&mem_locker,mem_lock);

  delay=mem_addr_time+(int)((count+bus_words-1)/(bus_words))*mem_read_time;

  delay=mem_addr_time+(int)((count+bus_words-1)/(bus_words))*mem_read_time;

  startup(&mem_locker,delay);

  startup(&mem_locker,delay);

  data->state=3;@+ wait(delay);

  data->state=3;@+ wait(delay);

@ @inbuf| to slot |p|@>=

@ @inbuf| to slot |p|@>=

  register octa *d=p->data;

  register octa *d=p->data;

  p->data=Scache->inbuf.data;@+Scache->inbuf.data=d;

  p->data=Scache->inbuf.data;@+Scache->inbuf.data=d;

@ Here we assume that the granularity is~8.

@ Here we assume that the granularity is~8.

@=

@=

if (Scache->flusher.next) wait(1);

if (Scache->flusher.next) wait(1);

Scache->outbuf.tag.h=c->outbuf.tag.h;

Scache->outbuf.tag.h=c->outbuf.tag.h;

Scache->outbuf.tag.l=c->outbuf.tag.l&(-Scache->bb);

Scache->outbuf.tag.l=c->outbuf.tag.l&(-Scache->bb);

for (j=0;jbb>>Scache->g;j++) Scache->outbuf.dirty[j]=false;

for (j=0;jbb>>Scache->g;j++) Scache->outbuf.dirty[j]=false;

copy_block(c,&(c->outbuf),Scache,&(Scache->outbuf));

copy_block(c,&(c->outbuf),Scache,&(Scache->outbuf));

startup(&Scache->flusher,Scache->copy_out_time);

startup(&Scache->flusher,Scache->copy_out_time);

goto terminate;

goto terminate;

@ The S-cache gets new data from memory by invoking a |fill_from_mem|

@ The S-cache gets new data from memory by invoking a |fill_from_mem|

coroutine; the I-cache or D-cache may also invoke a |fill_from_mem| coroutine,

coroutine; the I-cache or D-cache may also invoke a |fill_from_mem| coroutine,

if there is no S-cache. When such a coroutine is invoked, it holds

if there is no S-cache. When such a coroutine is invoked, it holds

|mem_lock|, and its caller has gone to sleep.

|mem_lock|, and its caller has gone to sleep.

A physical memory address is given in |data->z.o|,

A physical memory address is given in |data->z.o|,

and |data->ptr_a| specifies either |Icache| or |Dcache|.

and |data->ptr_a| specifies either |Icache| or |Dcache|.

Furthermore, |data->ptr_b| specifies a block within that

Furthermore, |data->ptr_b| specifies a block within that

cache, determined by the |alloc_slot| routine. The coroutine

cache, determined by the |alloc_slot| routine. The coroutine

simulates reading the contents of the specified memory location,

simulates reading the contents of the specified memory location,

places the result in the |x.o| field of its caller's control block,

places the result in the |x.o| field of its caller's control block,

and wakes up the caller. It proceeds to fill the cache's |inbuf| and,

and wakes up the caller. It proceeds to fill the cache's |inbuf| and,

ultimately, the specified cache block, before waking the caller again.

ultimately, the specified cache block, before waking the caller again.

Let |c=data->ptr_b|. The caller is then |c->fill_lock|, if this variable is

Let |c=data->ptr_b|. The caller is then |c->fill_lock|, if this variable is

nonnull. However, the caller might not wish to be awoken or to receive

nonnull. However, the caller might not wish to be awoken or to receive

the data (for example, if it has been aborted). In such cases |c->fill_lock|

the data (for example, if it has been aborted). In such cases |c->fill_lock|

will be~|NULL|; the filling action continues without the wakeup calls.

will be~|NULL|; the filling action continues without the wakeup calls.

If |c=Scache|, the S-cache will be locked and the caller will not

If |c=Scache|, the S-cache will be locked and the caller will not

have been aborted.

have been aborted.

@=

@=

case fill_from_mem: {@+register cache *c=(cache *)data->ptr_a;

case fill_from_mem: {@+register cache *c=(cache *)data->ptr_a;

  register coroutine *cc=c->fill_lock;

  register coroutine *cc=c->fill_lock;

 switch (data->state) {

 switch (data->state) {

  case 0: data->x.o=mem_read(data->z.o);

  case 0: data->x.o=mem_read(data->z.o);

    if (cc) {

    if (cc) {

      cc->ctl->x.o=data->x.o;

      cc->ctl->x.o=data->x.o;

      awaken(cc,mem_read_time);

      awaken(cc,mem_read_time);

    data->state=1;

    data->state=1;

    @inbuf| and wait for the bus@>;

    @inbuf| and wait for the bus@>;

  case 1: release_lock(self,mem_lock);

  case 1: release_lock(self,mem_lock);

    data->state=2;

    data->state=2;

  case 2:@+if (c!=Scache) {

  case 2:@+if (c!=Scache) {

      if (c->lock) wait(1);

      if (c->lock) wait(1);

      set_lock(self,c->lock);

      set_lock(self,c->lock);

    if (cc) awaken(cc,c->copy_in_time); /* the second wakeup call */

    if (cc) awaken(cc,c->copy_in_time); /* the second wakeup call */

    load_cache(c,(cacheblock*)data->ptr_b);

    load_cache(c,(cacheblock*)data->ptr_b);

    data->state=3;@+ wait(c->copy_in_time);

    data->state=3;@+ wait(c->copy_in_time);

  case 3: goto terminate;

  case 3: goto terminate;

@ If |c|'s cache size is no larger than the memory bus, we wait an extra

@ If |c|'s cache size is no larger than the memory bus, we wait an extra

cycle, so that there will be two wakeup calls.

cycle, so that there will be two wakeup calls.

@inbuf|...@>=

@inbuf|...@>=

  register int count, off;

  register int count, off;

  c->inbuf.tag=data->z.o;@+ c->inbuf.tag.l &= -c->bb;

  c->inbuf.tag=data->z.o;@+ c->inbuf.tag.l &= -c->bb;

  count=c->bb>>3, off=(c->inbuf.tag.l&0xffff)>>3;

  count=c->bb>>3, off=(c->inbuf.tag.l&0xffff)>>3;

  for (i=0;iinbuf.data[i]=mem_hash[last_h].chunk[off];

  for (i=0;iinbuf.data[i]=mem_hash[last_h].chunk[off];

  if (count<=bus_words) wait(1+mem_read_time)@;

  if (count<=bus_words) wait(1+mem_read_time)@;

  else wait((int)(count/bus_words)*mem_read_time);

  else wait((int)(count/bus_words)*mem_read_time);

@ The |fill_from_S| coroutine has the same conventions as |fill_from_mem|,

@ The |fill_from_S| coroutine has the same conventions as |fill_from_mem|,

except that the data comes directly from the S-cache if it is present there.

except that the data comes directly from the S-cache if it is present there.

This is the |filler| coroutine for the I-cache and D-cache if an S-cache

This is the |filler| coroutine for the I-cache and D-cache if an S-cache

is present.

is present.

@=

@=

case fill_from_S: {@+register cache *c=(cache *)data->ptr_a;

case fill_from_S: {@+register cache *c=(cache *)data->ptr_a;

  register coroutine *cc=c->fill_lock;

  register coroutine *cc=c->fill_lock;

  p=(cacheblock*)data->ptr_c;

  p=(cacheblock*)data->ptr_c;

  switch (data->state) {

  switch (data->state) {

  case 0: p=cache_search(Scache,data->z.o);

  case 0: p=cache_search(Scache,data->z.o);

    if (p) goto S_non_miss;

    if (p) goto S_non_miss;

    data->state=1;

    data->state=1;

  case 1: @;

  case 1: @;

    data->state=2;@+sleep;

    data->state=2;@+sleep;

  case 2:@+if (cc) {

  case 2:@+if (cc) {

      cc->ctl->x.o=data->x.o;

      cc->ctl->x.o=data->x.o;

            /* this data has been supplied by |Scache->filler| */

            /* this data has been supplied by |Scache->filler| */

      awaken(cc,Scache->access_time); /* we propagate it back */

      awaken(cc,Scache->access_time); /* we propagate it back */

    data->state=3;@+sleep; /* when we awake, the S-cache will have our data */

    data->state=3;@+sleep; /* when we awake, the S-cache will have our data */

  S_non_miss:@+if (cc) {

  S_non_miss:@+if (cc) {

      cc->ctl->x.o=p->data[(data->z.o.l&(Scache->bb-1))>>3];

      cc->ctl->x.o=p->data[(data->z.o.l&(Scache->bb-1))>>3];

      awaken(cc,Scache->access_time);

      awaken(cc,Scache->access_time);

  case 3: @inbuf|@>;

  case 3: @inbuf|@>;

    data->state=4;@+wait(Scache->access_time);

    data->state=4;@+wait(Scache->access_time);

  case 4:@+ if (c->lock) wait(1);

  case 4:@+ if (c->lock) wait(1);

    set_lock(self,c->lock);

    set_lock(self,c->lock);

    Scache->lock=NULL; /* we had been holding that lock */

    Scache->lock=NULL; /* we had been holding that lock */

    load_cache(c,(cacheblock*)data->ptr_b);

    load_cache(c,(cacheblock*)data->ptr_b);

    data->state=5;@+ wait(c->copy_in_time);

    data->state=5;@+ wait(c->copy_in_time);

  case 5:@+if (cc) awaken(cc,1); /* second wakeup call */

  case 5:@+if (cc) awaken(cc,1); /* second wakeup call */

    goto terminate;

    goto terminate;

@ We are already holding the |Scache->lock|, but we're about to take on the

@ We are already holding the |Scache->lock|, but we're about to take on the

|Scache->fill_lock| too (with the understanding that one is ``stronger''

|Scache->fill_lock| too (with the understanding that one is ``stronger''

than the other). For a short time the |Scache->lock| will point to us

than the other). For a short time the |Scache->lock| will point to us

but we will point to |Scache->fill_lock|; this will not cause difficulty,

but we will point to |Scache->fill_lock|; this will not cause difficulty,

because the present coroutine is not abortable.

because the present coroutine is not abortable.

@=

@=

if (Scache->filler.next || mem_lock) wait(1);

if (Scache->filler.next || mem_lock) wait(1);

p=alloc_slot(Scache,data->z.o);

p=alloc_slot(Scache,data->z.o);

if (!p) wait(1);

if (!p) wait(1);

set_lock(&Scache->filler,mem_lock);

set_lock(&Scache->filler,mem_lock);

set_lock(self,Scache->fill_lock);

set_lock(self,Scache->fill_lock);

data->ptr_c=Scache->filler_ctl.ptr_b=(void *)p;

data->ptr_c=Scache->filler_ctl.ptr_b=(void *)p;

Scache->filler_ctl.z.o=data->z.o;

Scache->filler_ctl.z.o=data->z.o;

startup(&Scache->filler,mem_addr_time);

startup(&Scache->filler,mem_addr_time);

@ The S-cache blocks might be wider than the blocks of the I-cache or

@ The S-cache blocks might be wider than the blocks of the I-cache or

D-cache, so the copying in this step isn't quite trivial.

D-cache, so the copying in this step isn't quite trivial.

@inbuf|@>=

@inbuf|@>=

{@+register int off;

{@+register int off;

  c->inbuf.tag=data->z.o;@+c->inbuf.tag.l &=-c->bb;

  c->inbuf.tag=data->z.o;@+c->inbuf.tag.l &=-c->bb;

  for (j=0,off=(c->inbuf.tag.l&(Scache->bb-1))>>3;jbb>>3;j++,off++)

  for (j=0,off=(c->inbuf.tag.l&(Scache->bb-1))>>3;jbb>>3;j++,off++)

    c->inbuf.data[j]=p->data[off];

    c->inbuf.data[j]=p->data[off];

  release_lock(self,Scache->fill_lock);

  release_lock(self,Scache->fill_lock);

  set_lock(self,Scache->lock);

  set_lock(self,Scache->lock);

@ The instruction \.{PRELD} \.{X,\$Y,\$Z} generates $\lfloor{\rm X}/2^b\rfloor$

@ The instruction \.{PRELD} \.{X,\$Y,\$Z} generates $\lfloor{\rm X}/2^b\rfloor$

commands if there are $2^b$ bytes per block in the D-cache. These

commands if there are $2^b$ bytes per block in the D-cache. These

commands will try to preload blocks $\rm\$Y+\$Z$, ${\rm\$Y}+{\rm\$Z}+2^b$,

commands will try to preload blocks $\rm\$Y+\$Z$, ${\rm\$Y}+{\rm\$Z}+2^b$,

\dots, into the cache if it is not too busy.

\dots, into the cache if it is not too busy.

Similar considerations apply to the instructions \.{PREGO} \.{X,\$Y,\$Z}

Similar considerations apply to the instructions \.{PREGO} \.{X,\$Y,\$Z}

and \.{PREST} \.{X,\$Y,\$Z}.

and \.{PREST} \.{X,\$Y,\$Z}.

@=

@=

case preld: case prest:@+ if (!Dcache) goto noop_inst;

case preld: case prest:@+ if (!Dcache) goto noop_inst;

  if (cool->xx>=Dcache->bb) cool->interim=true;

  if (cool->xx>=Dcache->bb) cool->interim=true;

  cool->ptr_a=(void *)mem.up;@+ break;

  cool->ptr_a=(void *)mem.up;@+ break;

case prego:@+ if (!Icache) goto noop_inst;

case prego:@+ if (!Icache) goto noop_inst;

  if (cool->xx>=Icache->bb) cool->interim=true;

  if (cool->xx>=Icache->bb) cool->interim=true;

  cool->ptr_a=(void *)mem.up;@+ break;

  cool->ptr_a=(void *)mem.up;@+ break;

@ If the block size is 64, a command like \.{PREST}~\.{200,\$Y,\$Z}

@ If the block size is 64, a command like \.{PREST}~\.{200,\$Y,\$Z}

is actually issued as four commands \.{PREST}~\.{200,\$Y,\$Z;}

is actually issued as four commands \.{PREST}~\.{200,\$Y,\$Z;}

\.{PREST}~\.{191,\$Y,\$Z;}  \.{PREST}~\.{127,\$Y,\$Z;}

\.{PREST}~\.{191,\$Y,\$Z;}  \.{PREST}~\.{127,\$Y,\$Z;}

\.{PREST}~\.{63,\$Y,\$Z}. An interruption will then be able to resume

\.{PREST}~\.{63,\$Y,\$Z}. An interruption will then be able to resume

properly. In the pipeline, the instruction \.{PREST}~\.{200,\$Y,\$Z}

properly. In the pipeline, the instruction \.{PREST}~\.{200,\$Y,\$Z}

is considered to affect bytes $\rm\$Y+\$Z+192$ through $\rm\$Y+\$Z+200$,

is considered to affect bytes $\rm\$Y+\$Z+192$ through $\rm\$Y+\$Z+200$,

or fewer bytes if $\rm\$Y+\$Z$ is not a multiple of~64. (Remember that

or fewer bytes if $\rm\$Y+\$Z$ is not a multiple of~64. (Remember that

these instructions are only hints; we act on them only if it is

these instructions are only hints; we act on them only if it is

reasonably convenient to do so.)

reasonably convenient to do so.)

@=

@=

head->inst = (head->inst&~((Dcache->bb-1)<<16))-0x10000;

head->inst = (head->inst&~((Dcache->bb-1)<<16))-0x10000;

@ @=

@ @=

head->inst = (head->inst&~((Icache->bb-1)<<16))-0x10000;

head->inst = (head->inst&~((Icache->bb-1)<<16))-0x10000;

@ Another coroutine, called |cleanup|, is occasionally called into

@ Another coroutine, called |cleanup|, is occasionally called into

action to remove dirty data from the D-cache and S-cache. If it is

action to remove dirty data from the D-cache and S-cache. If it is

invoked by starting in state 0, with its |i| field set to |sync|, it

invoked by starting in state 0, with its |i| field set to |sync|, it

will clean everything. It can also be

will clean everything. It can also be

invoked in state~4, with its |i| field set to |syncd| and with a physical

invoked in state~4, with its |i| field set to |syncd| and with a physical

address in its |z.o| field; then it simply makes sure that no D-cache

address in its |z.o| field; then it simply makes sure that no D-cache

or S-cache blocks associated with that address are dirty.

or S-cache blocks associated with that address are dirty.

Field |x.o.h| should be set to zero if items are expected to remain

Field |x.o.h| should be set to zero if items are expected to remain

in the cache after being cleaned; otherwise field |x.o.h| should be

in the cache after being cleaned; otherwise field |x.o.h| should be

set to |sign_bit|.

set to |sign_bit|.

The coroutine that invokes |cleanup| should hold |clean_lock|. If that

The coroutine that invokes |cleanup| should hold |clean_lock|. If that

coroutine dies, because of an interruption, the |cleanup| coroutine

coroutine dies, because of an interruption, the |cleanup| coroutine

will terminate prematurely.

will terminate prematurely.

We assume that the D-cache and S-cache have some sort of way to

We assume that the D-cache and S-cache have some sort of way to

identify their first dirty block, if any, in |access_time| cycles.

identify their first dirty block, if any, in |access_time| cycles.

@=

@=

coroutine clean_co;

coroutine clean_co;

control clean_ctl;

control clean_ctl;

lockvar clean_lock;

lockvar clean_lock;

@ @=

@ @=

clean_co.ctl=&clean_ctl;

clean_co.ctl=&clean_ctl;

clean_co.name="Clean";

clean_co.name="Clean";

clean_co.stage=cleanup;

clean_co.stage=cleanup;

clean_ctl.go.o.l=4;

clean_ctl.go.o.l=4;

@ @=

@ @=

case cleanup: p=(cacheblock*)data->ptr_b;

case cleanup: p=(cacheblock*)data->ptr_b;

  switch(data->state) {

  switch(data->state) {

@;

@;

@;

@;

case 10: goto terminate;

case 10: goto terminate;

@ @=

@ @=

case 0:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

case 0:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

  startup(&Dcache->reader[j],Dcache->access_time);

  startup(&Dcache->reader[j],Dcache->access_time);

  set_lock(self,Dcache->lock);

  set_lock(self,Dcache->lock);

  i=j=0;

  i=j=0;

Dclean_loop: p=(icc? &(Dcache->set[i][j]): &(Dcache->victim[j]));

Dclean_loop: p=(icc? &(Dcache->set[i][j]): &(Dcache->victim[j]));

  if (p->tag.h&sign_bit) goto Dclean_inc;

  if (p->tag.h&sign_bit) goto Dclean_inc;

  if (!is_dirty(Dcache,p)) {

  if (!is_dirty(Dcache,p)) {

    p->tag.h|=data->x.o.h;@+goto Dclean_inc;

    p->tag.h|=data->x.o.h;@+goto Dclean_inc;

  data->y.o.h=i, data->y.o.l=j;

  data->y.o.h=i, data->y.o.l=j;

Dclean: data->state=1;@+

Dclean: data->state=1;@+

  data->ptr_b=(void*)p;@+

  data->ptr_b=(void*)p;@+

  wait(Dcache->access_time);

  wait(Dcache->access_time);

case 1:@+if (Dcache->flusher.next) wait(1);

case 1:@+if (Dcache->flusher.next) wait(1);

  flush_cache(Dcache,p,data->x.o.h==0);

  flush_cache(Dcache,p,data->x.o.h==0);

  p->tag.h|=data->x.o.h;

  p->tag.h|=data->x.o.h;

  release_lock(self,Dcache->lock);

  release_lock(self,Dcache->lock);

  data->state=2;@+

  data->state=2;@+

  wait(Dcache->copy_out_time);

  wait(Dcache->copy_out_time);

case 2:@+ if (!clean_lock) goto done; /* premature termination */

case 2:@+ if (!clean_lock) goto done; /* premature termination */

  if (Dcache->flusher.next) wait(1);

  if (Dcache->flusher.next) wait(1);

  if (data->i!=sync) goto Sprep;

  if (data->i!=sync) goto Sprep;

  data->state=3;

  data->state=3;

case 3:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

case 3:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

  startup(&Dcache->reader[j],Dcache->access_time);

  startup(&Dcache->reader[j],Dcache->access_time);

  set_lock(self,Dcache->lock);

  set_lock(self,Dcache->lock);

  i=data->y.o.h, j=data->y.o.l;

  i=data->y.o.h, j=data->y.o.l;

Dclean_inc: j++;

Dclean_inc: j++;

  if (icc && j==Dcache->aa) j=0, i++;

  if (icc && j==Dcache->aa) j=0, i++;

  if (i==Dcache->cc && j==Dcache->vv) {

  if (i==Dcache->cc && j==Dcache->vv) {

    data->state=5;@+

    data->state=5;@+

    wait(Dcache->access_time);

    wait(Dcache->access_time);

  goto Dclean_loop;

  goto Dclean_loop;

case 4:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

case 4:@+ if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1);

  startup(&Dcache->reader[j],Dcache->access_time);

  startup(&Dcache->reader[j],Dcache->access_time);

  set_lock(self,Dcache->lock);

  set_lock(self,Dcache->lock);

  p=cache_search(Dcache,data->z.o);

  p=cache_search(Dcache,data->z.o);

  if (p) {

  if (p) {

    demote_and_fix(Dcache,p);

    demote_and_fix(Dcache,p);

    if (is_dirty(Dcache,p)) goto Dclean;

    if (is_dirty(Dcache,p)) goto Dclean;

  data->state=9;@+

  data->state=9;@+

  wait(Dcache->access_time);

  wait(Dcache->access_time);

@ @=

@ @=

case 5:@+ if (self->lockloc) *(self->lockloc)=NULL, self->lockloc=NULL;

case 5:@+ if (self->lockloc) *(self->lockloc)=NULL, self->lockloc=NULL;

  if (!Scache) goto done;

  if (!Scache) goto done;

  if (Scache->lock) wait(1);

  if (Scache->lock) wait(1);

  set_lock(self,Scache->lock);

  set_lock(self,Scache->lock);

  i=j=0;

  i=j=0;

Sclean_loop: p=(icc? &(Scache->set[i][j]): &(Scache->victim[j]));

Sclean_loop: p=(icc? &(Scache->set[i][j]): &(Scache->victim[j]));

  if (p->tag.h&sign_bit) goto Sclean_inc;

  if (p->tag.h&sign_bit) goto Sclean_inc;

  if (!is_dirty(Scache,p)) {

  if (!is_dirty(Scache,p)) {

    p->tag.h|=data->x.o.h;@+goto Sclean_inc;

    p->tag.h|=data->x.o.h;@+goto Sclean_inc;

  data->y.o.h=i, data->y.o.l=j;

  data->y.o.h=i, data->y.o.l=j;

Sclean: data->state=6;@+

Sclean: data->state=6;@+

  data->ptr_b=(void*)p;@+

  data->ptr_b=(void*)p;@+

  wait(Scache->access_time);

  wait(Scache->access_time);

case 6:@+if (Scache->flusher.next) wait(1);

case 6:@+if (Scache->flusher.next) wait(1);

  flush_cache(Scache,p,data->x.o.h==0);

  flush_cache(Scache,p,data->x.o.h==0);

  p->tag.h|=data->x.o.h;

  p->tag.h|=data->x.o.h;

  release_lock(self,Scache->lock);

  release_lock(self,Scache->lock);

  data->state=7;@+

  data->state=7;@+

  wait(Scache->copy_out_time);

  wait(Scache->copy_out_time);

case 7:@+ if (!clean_lock) goto done; /* premature termination */

case 7:@+ if (!clean_lock) goto done; /* premature termination */

  if (Scache->flusher.next) wait(1);

  if (Scache->flusher.next) wait(1);

  if (data->i!=sync) goto done;

  if (data->i!=sync) goto done;

  data->state=8;

  data->state=8;

case 8:@+ if (Scache->lock) wait(1);

case 8:@+ if (Scache->lock) wait(1);

  set_lock(self,Scache->lock);

  set_lock(self,Scache->lock);

  i=data->y.o.h, j=data->y.o.l;

  i=data->y.o.h, j=data->y.o.l;

Sclean_inc: j++;

Sclean_inc: j++;

  if (icc && j==Scache->aa) j=0, i++;

  if (icc && j==Scache->aa) j=0, i++;

  if (i==Scache->cc && j==Scache->vv) {

  if (i==Scache->cc && j==Scache->vv) {

    data->state=10;@+

    data->state=10;@+

    wait(Scache->access_time);

    wait(Scache->access_time);

  goto Sclean_loop;

  goto Sclean_loop;

Sprep: data->state=9;

Sprep: data->state=9;

case 9:@+if (self->lockloc) release_lock(self,Dcache->lock);

case 9:@+if (self->lockloc) release_lock(self,Dcache->lock);

  if (!Scache) goto done;

  if (!Scache) goto done;

  if (Scache->lock) wait(1);

  if (Scache->lock) wait(1);

  set_lock(self,Scache->lock);

  set_lock(self,Scache->lock);

  p=cache_search(Scache,data->z.o);

  p=cache_search(Scache,data->z.o);

  if (p) {

  if (p) {

    demote_and_fix(Scache,p);

    demote_and_fix(Scache,p);

    if (is_dirty(Scache,p)) goto Sclean;

    if (is_dirty(Scache,p)) goto Sclean;

  data->state=10;@+

  data->state=10;@+

  wait(Scache->access_time);

  wait(Scache->access_time);

@* Virtual address translation. Special arrays of coroutines and control

@* Virtual address translation. Special arrays of coroutines and control

blocks come into play when we need to implement \MMIX's rather complicated

blocks come into play when we need to implement \MMIX's rather complicated

page table mechanism for virtual address translation. In effect, we have up to

page table mechanism for virtual address translation. In effect, we have up to

ten control blocks {\it outside\/} of the reorder buffer that are capable of

ten control blocks {\it outside\/} of the reorder buffer that are capable of

executing instructions just as if they were part of that buffer. The

executing instructions just as if they were part of that buffer. The

``opcodes'' of these non-abortable instructions are special internal

``opcodes'' of these non-abortable instructions are special internal

operations called |ldptp| and |ldpte|, for loading page table pointers and

operations called |ldptp| and |ldpte|, for loading page table pointers and

page table entries.

page table entries.

Suppose, for example, that we need to translate a virtual address for the

Suppose, for example, that we need to translate a virtual address for the

DT-cache in which the virtual page address $(a_4a_3a_2a_1a_0)_{1024}$ of

DT-cache in which the virtual page address $(a_4a_3a_2a_1a_0)_{1024}$ of

segment~$i$ has $a_4=a_3=0$ and $a_2\ne0$. Then the rules say that we should

segment~$i$ has $a_4=a_3=0$ and $a_2\ne0$. Then the rules say that we should

first find a page table pointer $p_2$ in physical location

first find a page table pointer $p_2$ in physical location

$2^{13}(r+b_i+2)+8a_2$, then another page table pointer~$p_1$ in location

$2^{13}(r+b_i+2)+8a_2$, then another page table pointer~$p_1$ in location

$p_2+8a_1$, and finally the page table entry~$p_0$ in location $p_1+8a_0$. The

$p_2+8a_1$, and finally the page table entry~$p_0$ in location $p_1+8a_0$. The

simulator achieves this by setting up three coroutines $c_0$, $c_1$, $c_2$

simulator achieves this by setting up three coroutines $c_0$, $c_1$, $c_2$

whose control blocks correspond to the pseudo-instructions

whose control blocks correspond to the pseudo-instructions

$$\vbox{\halign{\tt#\hfil\cr

$$\vbox{\halign{\tt#\hfil\cr

LDPTP $x$,[$2^{63}+2^{13}(r+b_i+2)$],$8a_2$\cr

LDPTP $x$,[$2^{63}+2^{13}(r+b_i+2)$],$8a_2$\cr

LDPTP $x$,$x$,$8a_1$\cr

LDPTP $x$,$x$,$8a_1$\cr

LDPTE $x$,$x$,$8a_0$\cr}}$$

LDPTE $x$,$x$,$8a_0$\cr}}$$

where $x$ is a hidden internal register and the other quantities are immediate

where $x$ is a hidden internal register and the other quantities are immediate

values. Slight changes to the normal functionality of \.{LDO} give us the

values. Slight changes to the normal functionality of \.{LDO} give us the

actions needed to implement \.{LDPTP} and \.{LDPTE}. Coroutine~$c_j$

actions needed to implement \.{LDPTP} and \.{LDPTE}. Coroutine~$c_j$

corresponds to the instruction that involves $a_j$ and computes~$p_j$; when

corresponds to the instruction that involves $a_j$ and computes~$p_j$; when

$c_0$ has computed its value~$p_0$, we know how to translate the original

$c_0$ has computed its value~$p_0$, we know how to translate the original

virtual address.

virtual address.

The \.{LDPTP} and \.{LDPTE} commands return zero

The \.{LDPTP} and \.{LDPTE} commands return zero

if their $y$~operand is zero or if the page table does not properly match~rV.

if their $y$~operand is zero or if the page table does not properly match~rV.

@d LDPTP PREGO /* internally this won't cause confusion */

@d LDPTP PREGO /* internally this won't cause confusion */

@d LDPTE GO

@d LDPTE GO

@=

@=

control IPTctl[5], DPTctl[5]; /* control blocks for I and D page translation */

control IPTctl[5], DPTctl[5]; /* control blocks for I and D page translation */

coroutine IPTco[10], DPTco[10]; /* each coroutine is a two-stage pipeline */

coroutine IPTco[10], DPTco[10]; /* each coroutine is a two-stage pipeline */

char *IPTname[5]={"IPT0","IPT1","IPT2","IPT3","IPT4"};

char *IPTname[5]={"IPT0","IPT1","IPT2","IPT3","IPT4"};

char *DPTname[5]={"DPT0","DPT1","DPT2","DPT3","DPT4"};

char *DPTname[5]={"DPT0","DPT1","DPT2","DPT3","DPT4"};

@ @=

@ @=

for (j=0;j<5;j++) {

for (j=0;j<5;j++) {

  DPTco[2*j].ctl=&DPTctl[j];@+  IPTco[2*j].ctl=&IPTctl[j];

  DPTco[2*j].ctl=&DPTctl[j];@+  IPTco[2*j].ctl=&IPTctl[j];

  if (j>0) DPTctl[j].op=IPTctl[j].op=LDPTP,DPTctl[j].i=IPTctl[j].i=ldptp;

  if (j>0) DPTctl[j].op=IPTctl[j].op=LDPTP,DPTctl[j].i=IPTctl[j].i=ldptp;

  else DPTctl[0].op=IPTctl[0].op=LDPTE,DPTctl[0].i=IPTctl[0].i=ldpte;

  else DPTctl[0].op=IPTctl[0].op=LDPTE,DPTctl[0].i=IPTctl[0].i=ldpte;

  IPTctl[j].loc=DPTctl[j].loc=neg_one;

  IPTctl[j].loc=DPTctl[j].loc=neg_one;

  IPTctl[j].go.o=DPTctl[j].go.o=incr(neg_one,4);

  IPTctl[j].go.o=DPTctl[j].go.o=incr(neg_one,4);

  IPTctl[j].ptr_a=DPTctl[j].ptr_a=(void*)&mem;

  IPTctl[j].ptr_a=DPTctl[j].ptr_a=(void*)&mem;

  IPTctl[j].ren_x=DPTctl[j].ren_x=true;

  IPTctl[j].ren_x=DPTctl[j].ren_x=true;

  IPTctl[j].x.addr.h=DPTctl[j].x.addr.h=-1;

  IPTctl[j].x.addr.h=DPTctl[j].x.addr.h=-1;

  IPTco[2*j].stage=DPTco[2*j].stage=1;

  IPTco[2*j].stage=DPTco[2*j].stage=1;

  IPTco[2*j+1].stage=DPTco[2*j+1].stage=2;

  IPTco[2*j+1].stage=DPTco[2*j+1].stage=2;

  IPTco[2*j].name=IPTco[2*j+1].name=IPTname[j];

  IPTco[2*j].name=IPTco[2*j+1].name=IPTname[j];

  DPTco[2*j].name=DPTco[2*j+1].name=DPTname[j];

  DPTco[2*j].name=DPTco[2*j+1].name=DPTname[j];

ITcache->filler_ctl.ptr_c=(void*)&IPTco[0];@+

ITcache->filler_ctl.ptr_c=(void*)&IPTco[0];@+

DTcache->filler_ctl.ptr_c=(void*)&DPTco[0];

DTcache->filler_ctl.ptr_c=(void*)&DPTco[0];

@ Page table calculations are invoked by a coroutine of type |fill_from_virt|,

@ Page table calculations are invoked by a coroutine of type |fill_from_virt|,

which is used to fill the IT-cache or DT-cache. The calling conventions of

which is used to fill the IT-cache or DT-cache. The calling conventions of

|fill_from_virt| are analogous to those of |fill_from_mem| or |fill_from_S|:

|fill_from_virt| are analogous to those of |fill_from_mem| or |fill_from_S|:

A virtual address is supplied in |data->y.o|, and |data->ptr_a| points

A virtual address is supplied in |data->y.o|, and |data->ptr_a| points

to a cache (|ITcache| or |DTcache|), while |data->ptr_b| is a block in that

to a cache (|ITcache| or |DTcache|), while |data->ptr_b| is a block in that

cache. We wake up the caller, who holds the cache's |fill_lock|, as soon as

cache. We wake up the caller, who holds the cache's |fill_lock|, as soon as

the translation of the given address has been calculated, unless the caller

the translation of the given address has been calculated, unless the caller

has been aborted. (No second wakeup call is necessary.)

has been aborted. (No second wakeup call is necessary.)

@=

@=

case fill_from_virt: {@+register cache *c=(cache *)data->ptr_a;

case fill_from_virt: {@+register cache *c=(cache *)data->ptr_a;

  register coroutine *cc=c->fill_lock;

  register coroutine *cc=c->fill_lock;

  register coroutine *co=(coroutine*)data->ptr_c;

  register coroutine *co=(coroutine*)data->ptr_c;

                          /* |&IPTco[0]| or |&DPTco[0]| */

                          /* |&IPTco[0]| or |&DPTco[0]| */

  octa aaaaa;

  octa aaaaa;

 switch (data->state) {

 switch (data->state) {

  case 0: @;

  case 0: @;

    data->state=1;

    data->state=1;

  case 1:@+if (data->b.p) {

  case 1:@+if (data->b.p) {

      if (data->b.p->known) data->b.o=data->b.p->o, data->b.p=NULL;

      if (data->b.p->known) data->b.o=data->b.p->o, data->b.p=NULL;

      else wait(1);

      else wait(1);

    @inbuf| and give the caller a sneak

    @inbuf| and give the caller a sneak

              preview@>;

              preview@>;

    data->state=2;

    data->state=2;

  case 2:@+if (c->lock) wait(1);

  case 2:@+if (c->lock) wait(1);

    set_lock(self,c->lock);

    set_lock(self,c->lock);

    load_cache(c,(cacheblock*)data->ptr_b);

    load_cache(c,(cacheblock*)data->ptr_b);

    data->state=3;@+ wait(c->copy_in_time);

    data->state=3;@+ wait(c->copy_in_time);

  case 3: data->b.o=zero_octa;@+goto terminate;

  case 3: data->b.o=zero_octa;@+goto terminate;

@ The current contents of rV, the special virtual translation register, are

@ The current contents of rV, the special virtual translation register, are

kept unpacked in several global variables |page_r|, |page_s|, etc., for

kept unpacked in several global variables |page_r|, |page_s|, etc., for

convenience. Whenever rV changes, we recompute all these variables.

convenience. Whenever rV changes, we recompute all these variables.

@=

@=

int page_n; /* the 10-bit |n| field of rV, times 8 */

int page_n; /* the 10-bit |n| field of rV, times 8 */

int page_r; /* the 27-bit |r| field of rV */

int page_r; /* the 27-bit |r| field of rV */

int page_s; /* the 8-bit |s| field of rV */

int page_s; /* the 8-bit |s| field of rV */

int page_b[5]; /* the 4-bit |b| fields of rV; |page_b[0]=0| */

int page_b[5]; /* the 4-bit |b| fields of rV; |page_b[0]=0| */

octa page_mask; /* the least significant |s| bits */

octa page_mask; /* the least significant |s| bits */

bool page_bad=true; /* does rV violate the rules? */

bool page_bad=true; /* does rV violate the rules? */

@ @=

@ @=

{@+octa rv;

{@+octa rv;

  rv=data->z.o;

  rv=data->z.o;

  page_bad=(rv.l&7? true: false);

  page_bad=(rv.l&7? true: false);

  page_n=rv.l&0x1ff8;

  page_n=rv.l&0x1ff8;

  rv=shift_right(rv,13,1);

  rv=shift_right(rv,13,1);

  page_r=rv.l&0x7ffffff;

  page_r=rv.l&0x7ffffff;

  rv=shift_right(rv,27,1);

  rv=shift_right(rv,27,1);

  page_s=rv.l&0xff;

  page_s=rv.l&0xff;

  if (page_s<13 || page_s>48) page_bad=true;

  if (page_s<13 || page_s>48) page_bad=true;

  else if (page_s<32) page_mask.h=0,page_mask.l=(1<

  else if (page_s<32) page_mask.h=0,page_mask.l=(1<

  else page_mask.h=(1<<(page_s-32))-1,page_mask.l=0xffffffff;

  else page_mask.h=(1<<(page_s-32))-1,page_mask.l=0xffffffff;

  page_b[4]=(rv.l>>8)&0xf;

  page_b[4]=(rv.l>>8)&0xf;

  page_b[3]=(rv.l>>12)&0xf;

  page_b[3]=(rv.l>>12)&0xf;

  page_b[2]=(rv.l>>16)&0xf;

  page_b[2]=(rv.l>>16)&0xf;

  page_b[1]=(rv.l>>20)&0xf;

  page_b[1]=(rv.l>>20)&0xf;

@ Here's how we compute a tag of the IT-cache or DT-cache

@ Here's how we compute a tag of the IT-cache or DT-cache

from a virtual address, and how we compute a physical address

from a virtual address, and how we compute a physical address

from a translation found in the cache.

from a translation found in the cache.

@d trans_key(addr) incr(oandn(addr,page_mask),page_n)

@d trans_key(addr) incr(oandn(addr,page_mask),page_n)

@=

@=

static octa phys_addr @,@,@[ARGS((octa,octa))@];

static octa phys_addr @,@,@[ARGS((octa,octa))@];

@ @=

@ @=

static octa phys_addr(virt,trans)

static octa phys_addr(virt,trans)

  octa virt,trans;

  octa virt,trans;

{@+octa t;

{@+octa t;

  t=trans;@+ t.l &= -8; /* zero out the protection bits */

  t=trans;@+ t.l &= -8; /* zero out the protection bits */

  return oplus(t,oand(virt,page_mask));

  return oplus(t,oand(virt,page_mask));

@ Cheap (and slow) versions of \MMIX\ leave the page table calculations

@ Cheap (and slow) versions of \MMIX\ leave the page table calculations

to software. If the global variable |no_hardware_PT| is set true,

to software. If the global variable |no_hardware_PT| is set true,

|fill_from_virt| begins its actions in state~1, not state~0. (See the

|fill_from_virt| begins its actions in state~1, not state~0. (See the

|RESUME_TRANS| operation.)

|RESUME_TRANS| operation.)

@=

@=

Extern bool no_hardware_PT;

Extern bool no_hardware_PT;

@ Note: The operating system is supposed to ensure that changes to the page

@ Note: The operating system is supposed to ensure that changes to the page

table entries do not appear in the pipeline when a translation cache is being

table entries do not appear in the pipeline when a translation cache is being

updated. The internal \.{LDPTP} and \.{LDPTE} instructions use only the

updated. The internal \.{LDPTP} and \.{LDPTE} instructions use only the

``hot state'' of the memory system.

``hot state'' of the memory system.

@^operating system@>

@^operating system@>

@=

@=

aaaaa=data->y.o;

aaaaa=data->y.o;

i=aaaaa.h>>29; /* the segment number */

i=aaaaa.h>>29; /* the segment number */

aaaaa.h&=0x1fffffff; /* the address within segment $i$ */

aaaaa.h&=0x1fffffff; /* the address within segment $i$ */

aaaaa=shift_right(aaaaa,page_s,1); /* the page address */

aaaaa=shift_right(aaaaa,page_s,1); /* the page address */

for (j=0;aaaaa.l!=0 || aaaaa.h!=0; j++) {

for (j=0;aaaaa.l!=0 || aaaaa.h!=0; j++) {

  co[2*j].ctl->z.o.h=0, co[2*j].ctl->z.o.l=(aaaaa.l&0x3ff)<<3;

  co[2*j].ctl->z.o.h=0, co[2*j].ctl->z.o.l=(aaaaa.l&0x3ff)<<3;

  aaaaa=shift_right(aaaaa,10,1);

  aaaaa=shift_right(aaaaa,10,1);

if (page_b[i+1]

if (page_b[i+1]

  ; /* nothing needs to be done, since |data->b.o| is zero */

  ; /* nothing needs to be done, since |data->b.o| is zero */

else {

else {

  if (j==0) j=1,co[0].ctl->z.o=zero_octa;

  if (j==0) j=1,co[0].ctl->z.o=zero_octa;

@;

@;

@ The first stage of coroutine $c_j$ is |co[2*j]|. It will pass the $j$th

@ The first stage of coroutine $c_j$ is |co[2*j]|. It will pass the $j$th

control block to the second stage, |co[2*j+1]|, which will load page table

control block to the second stage, |co[2*j+1]|, which will load page table

information from memory (or hopefully from the D-cache).

information from memory (or hopefully from the D-cache).

@=

@=

j--;

j--;

aaaaa.l=page_r+page_b[i]+j;

aaaaa.l=page_r+page_b[i]+j;

co[2*j].ctl->y.p=NULL;

co[2*j].ctl->y.p=NULL;

co[2*j].ctl->y.o=shift_left(aaaaa,13);

co[2*j].ctl->y.o=shift_left(aaaaa,13);

co[2*j].ctl->y.o.h+=sign_bit;

co[2*j].ctl->y.o.h+=sign_bit;

for (;;j--) {

for (;;j--) {

  co[2*j].ctl->x.o=zero_octa;@+ co[2*j].ctl->x.known=false;

  co[2*j].ctl->x.o=zero_octa;@+ co[2*j].ctl->x.known=false;

  co[2*j].ctl->owner=&co[2*j];

  co[2*j].ctl->owner=&co[2*j];

  startup(&co[2*j],1);

  startup(&co[2*j],1);

  if (j==0) break;

  if (j==0) break;

  co[2*(j-1)].ctl->y.p=&co[2*j].ctl->x;

  co[2*(j-1)].ctl->y.p=&co[2*j].ctl->x;

data->b.p=&co[0].ctl->x;

data->b.p=&co[0].ctl->x;

@ At this point the translation of the given virtual address |data->y.o| is

@ At this point the translation of the given virtual address |data->y.o| is

the octabyte |data->b.o|. Its least significant three bits are the

the octabyte |data->b.o|. Its least significant three bits are the

protection code~$p=p_rp_wp_x$; its page address field is scaled by~$2^s$. It

protection code~$p=p_rp_wp_x$; its page address field is scaled by~$2^s$. It

is entirely zero, including the protection bits, if there was a

is entirely zero, including the protection bits, if there was a

page table failure.

page table failure.

@inbuf| and give the caller a sneak preview@>=

@inbuf| and give the caller a sneak preview@>=

c->inbuf.tag=trans_key(data->y.o);

c->inbuf.tag=trans_key(data->y.o);

c->inbuf.data[0]=data->b.o;

c->inbuf.data[0]=data->b.o;

if (cc) {

if (cc) {

  cc->ctl->z.o=data->b.o;

  cc->ctl->z.o=data->b.o;

  awaken(cc,1);

  awaken(cc,1);

@* The write buffer. The dispatcher has arranged things so that speculative

@* The write buffer. The dispatcher has arranged things so that speculative

stores into memory are recorded in a doubly linked list leading upward from

stores into memory are recorded in a doubly linked list leading upward from

|mem|. When such instructions finally are committed, they enter the ``write

|mem|. When such instructions finally are committed, they enter the ``write

buffer,'' which holds octabytes that are ready to be written into designated

buffer,'' which holds octabytes that are ready to be written into designated

physical memory addresses (or into the D-cache and/or S-cache). The ``hot

physical memory addresses (or into the D-cache and/or S-cache). The ``hot

state'' of the computation is reflected not only by the registers and caches

state'' of the computation is reflected not only by the registers and caches

but also by the instructions that are pending in the write buffer.

but also by the instructions that are pending in the write buffer.

@=

@=

typedef struct{

typedef struct{

  octa o; /* data to be stored */

  octa o; /* data to be stored */

  octa addr; /* its physical address */

  octa addr; /* its physical address */

  tetra stamp; /* when last committed (mod $2^{32}$) */

  tetra stamp; /* when last committed (mod $2^{32}$) */

  internal_opcode i; /* is this write special? */

  internal_opcode i; /* is this write special? */

} write_node;

} write_node;

@ We represent the buffer in the usual way as a circular list, with elements

@ We represent the buffer in the usual way as a circular list, with elements

|write_tail+1|, |write_tail+2|, \dots,~|write_head|.

|write_tail+1|, |write_tail+2|, \dots,~|write_head|.

The data will sit at least |holding_time| cycles before it leaves

The data will sit at least |holding_time| cycles before it leaves

the write buffer. This speeds things up when different fields of the same

the write buffer. This speeds things up when different fields of the same

octabyte are being stored by different instructions.

octabyte are being stored by different instructions.

@=

@=

Extern write_node *wbuf_bot, *wbuf_top;

Extern write_node *wbuf_bot, *wbuf_top;

 /* least and greatest write buffer nodes */

 /* least and greatest write buffer nodes */

Extern write_node *write_head, *write_tail;

Extern write_node *write_head, *write_tail;

 /* front and rear of the write buffer */

 /* front and rear of the write buffer */

Extern lockvar wbuf_lock; /* is the data in |write_head| being written? */

Extern lockvar wbuf_lock; /* is the data in |write_head| being written? */

Extern int holding_time; /* minimum holding time */

Extern int holding_time; /* minimum holding time */

Extern lockvar speed_lock; /* should we ignore |holding_time|? */

Extern lockvar speed_lock; /* should we ignore |holding_time|? */

@ @=

@ @=

coroutine write_co; /* coroutine that empties the write buffer */

coroutine write_co; /* coroutine that empties the write buffer */

control write_ctl; /* its control block */

control write_ctl; /* its control block */

@ @=

@ @=

write_co.ctl=&write_ctl;

write_co.ctl=&write_ctl;

write_co.name="Write";

write_co.name="Write";

write_co.stage=write_from_wbuf;

write_co.stage=write_from_wbuf;

write_ctl.ptr_a=(void*)&mem;

write_ctl.ptr_a=(void*)&mem;

write_ctl.go.o.l=4;

write_ctl.go.o.l=4;

startup(&write_co,1);

startup(&write_co,1);

write_head=write_tail=wbuf_top;

write_head=write_tail=wbuf_top;

@ @=

@ @=

static void print_write_buffer @,@,@[ARGS((void))@];

static void print_write_buffer @,@,@[ARGS((void))@];

@ @=

@ @=

static void print_write_buffer()

static void print_write_buffer()

  printf("Write buffer");

  printf("Write buffer");

  if (write_head==write_tail) printf(" (empty)\n");

  if (write_head==write_tail) printf(" (empty)\n");

  else {@+register write_node *p;

  else {@+register write_node *p;

    printf(":\n");

    printf(":\n");

    for (p=write_head;p!=write_tail; p=(p==wbuf_bot? wbuf_top: p-1)) {

    for (p=write_head;p!=write_tail; p=(p==wbuf_bot? wbuf_top: p-1)) {

      printf("m[");@+print_octa(p->addr);@+printf("]=");@+print_octa(p->o);

      printf("m[");@+print_octa(p->addr);@+printf("]=");@+print_octa(p->o);

      if (p->i==stunc) printf(" unc");

      if (p->i==stunc) printf(" unc");

      else if (p->i==sync) printf(" sync");

      else if (p->i==sync) printf(" sync");

      printf(" (age %d)\n",ticks.l-p->stamp);

      printf(" (age %d)\n",ticks.l-p->stamp);

@ The entire present state of the pipeline computation can be visualized

@ The entire present state of the pipeline computation can be visualized

by printing first the write buffer, then the reorder buffer, then the

by printing first the write buffer, then the reorder buffer, then the

fetch buffer. This shows the progression of results from oldest to youngest,

fetch buffer. This shows the progression of results from oldest to youngest,

from sizzling hot to ice cold.

from sizzling hot to ice cold.

@=

@=

Extern void print_pipe @,@,@[ARGS((void))@];

Extern void print_pipe @,@,@[ARGS((void))@];

@ @=

@ @=

void print_pipe()

void print_pipe()

  print_write_buffer();

  print_write_buffer();

  print_reorder_buffer();

  print_reorder_buffer();

  print_fetch_buffer();

  print_fetch_buffer();

@ The |write_search| routine looks to see if any instructions ahead of a given

@ The |write_search| routine looks to see if any instructions ahead of a given

place in the |mem| list of the reorder buffer are storing into a given

place in the |mem| list of the reorder buffer are storing into a given

physical address, or if there's a pending instruction in the write buffer for

physical address, or if there's a pending instruction in the write buffer for

that address. If so, it returns a pointer to the value to be written. If not,

that address. If so, it returns a pointer to the value to be written. If not,

it returns~|NULL|. If the answer is currently unknown, because at least one

it returns~|NULL|. If the answer is currently unknown, because at least one

possibly relevant physical address has not yet been computed, the subroutine

possibly relevant physical address has not yet been computed, the subroutine

returns the special code value~|DUNNO|.

returns the special code value~|DUNNO|.

The search starts at the |x.up| field of a control block for a store

The search starts at the |x.up| field of a control block for a store

instruction, otherwise at the |ptr_a| field of the control block,

instruction, otherwise at the |ptr_a| field of the control block,

unless |ptr_a| points to a committed instruction.

unless |ptr_a| points to a committed instruction.

The |i| field in the write buffer is usually |st| or |pst|, inherited from

The |i| field in the write buffer is usually |st| or |pst|, inherited from

a store or partial store command. It may also be |sync| (from \.{SYNC}~\.1

a store or partial store command. It may also be |sync| (from \.{SYNC}~\.1

or \.{SYNC}~\.3) or |stunc| (from \.{STUNC}).

or \.{SYNC}~\.3) or |stunc| (from \.{STUNC}).

@d DUNNO ((octa *)1) /* an impossible non-|NULL| pointer */

@d DUNNO ((octa *)1) /* an impossible non-|NULL| pointer */

@=

@=

static octa* write_search @,@,@[ARGS((control*,octa))@];

static octa* write_search @,@,@[ARGS((control*,octa))@];

@ @=

@ @=

static octa *write_search(ctl,addr)

static octa *write_search(ctl,addr)

  control *ctl;

  control *ctl;

  octa addr;

  octa addr;

{@+register specnode *p=(ctl->mem_x? ctl->x.up: (specnode*)ctl->ptr_a);

{@+register specnode *p=(ctl->mem_x? ctl->x.up: (specnode*)ctl->ptr_a);

  register write_node *q=write_tail;

  register write_node *q=write_tail;

  addr.l &=-8;

  addr.l &=-8;

  if (p==&mem) goto qloop;

  if (p==&mem) goto qloop;

  if (p > &hot->x && ctl <= hot) goto qloop; /* already committed */

  if (p > &hot->x && ctl <= hot) goto qloop; /* already committed */

  if (p < &ctl->x && (ctl <= hot || p > &hot->x)) goto qloop;

  if (p < &ctl->x && (ctl <= hot || p > &hot->x)) goto qloop;

  for (; p!=&mem; p=p->up) {

  for (; p!=&mem; p=p->up) {

    if (p->addr.h==(tetra)-1) return DUNNO;

    if (p->addr.h==(tetra)-1) return DUNNO;

    if ((p->addr.l&-8)==addr.l && p->addr.h==addr.h)

    if ((p->addr.l&-8)==addr.l && p->addr.h==addr.h)

      return (p->known? &(p->o): DUNNO);

      return (p->known? &(p->o): DUNNO);

qloop:@+ for (;;) {

qloop:@+ for (;;) {

    if (q==write_head) return NULL;

    if (q==write_head) return NULL;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q->addr.l==addr.l && q->addr.h==addr.h) return &(q->o);

    if (q->addr.l==addr.l && q->addr.h==addr.h) return &(q->o);

@ When we're committing new data to memory, we can update an existing item in

@ When we're committing new data to memory, we can update an existing item in

the write buffer if it has the same physical address, unless that item is

the write buffer if it has the same physical address, unless that item is

already in the process of being written out. Increasing the value of

already in the process of being written out. Increasing the value of

|holding_time| will increase the chance that this economy is possible, but

|holding_time| will increase the chance that this economy is possible, but

it will also increase the number of buffered items when writes are to

it will also increase the number of buffered items when writes are to

different locations.

different locations.

A store instruction that sets any of the eight interrupt bits

A store instruction that sets any of the eight interrupt bits

\.{rwxnkbsp} will not affect memory, even if it doesn't cause an interrupt.

\.{rwxnkbsp} will not affect memory, even if it doesn't cause an interrupt.

When ``store'' is followed by ``store uncached'' at the same address,

When ``store'' is followed by ``store uncached'' at the same address,

or vice versa, we believe the most recent hint.

or vice versa, we believe the most recent hint.

@=

@=

{@+register write_node *q=write_tail;

{@+register write_node *q=write_tail;

  if (hot->interrupt&(F_BIT+0xff)) goto done_with_write;

  if (hot->interrupt&(F_BIT+0xff)) goto done_with_write;

  if (hot->i!=sync) for (;;) {

  if (hot->i!=sync) for (;;) {

    if (q==write_head) break;

    if (q==write_head) break;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q->i==sync) break;

    if (q->i==sync) break;

    if (q->addr.l==hot->x.addr.l && q->addr.h==hot->x.addr.h

    if (q->addr.l==hot->x.addr.l && q->addr.h==hot->x.addr.h

             && (q!=write_head || !wbuf_lock)) goto addr_found;

             && (q!=write_head || !wbuf_lock)) goto addr_found;

  {@+ register write_node *p=(write_tail==wbuf_bot? wbuf_top: write_tail-1);

  {@+ register write_node *p=(write_tail==wbuf_bot? wbuf_top: write_tail-1);

    if (p==write_head) break; /* the write buffer is full */

    if (p==write_head) break; /* the write buffer is full */

    q=write_tail;@+ write_tail=p;

    q=write_tail;@+ write_tail=p;

    q->addr=hot->x.addr;

    q->addr=hot->x.addr;

addr_found: q->o=hot->x.o;

addr_found: q->o=hot->x.o;

  q->stamp=ticks.l;

  q->stamp=ticks.l;

  q->i=hot->i;

  q->i=hot->i;

done_with_write: spec_rem(&(hot->x));

done_with_write: spec_rem(&(hot->x));

  mem_slots++;

  mem_slots++;

@ A special coroutine whose duty is to empty the write buffer is always

@ A special coroutine whose duty is to empty the write buffer is always

active. It holds the |wbuf_lock| while it is writing the contents of

active. It holds the |wbuf_lock| while it is writing the contents of

|write_head|. It holds |Dcache->fill_lock| while waiting for the D-cache

|write_head|. It holds |Dcache->fill_lock| while waiting for the D-cache

to fill a block.

to fill a block.

@=

@=

case write_from_wbuf:

case write_from_wbuf:

  p=(cacheblock*)data->ptr_b;

  p=(cacheblock*)data->ptr_b;

  switch(data->state) {

  switch(data->state) {

  case 4: @;

  case 4: @;

    data->state=5;

    data->state=5;

  case 5:@+if (write_head==wbuf_bot) write_head=wbuf_top;@+ else write_head--;

  case 5:@+if (write_head==wbuf_bot) write_head=wbuf_top;@+ else write_head--;

 write_restart: data->state=0;

 write_restart: data->state=0;

  case 0:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

  case 0:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

    if (write_head==write_tail) wait(1); /* write buffer is empty */

    if (write_head==write_tail) wait(1); /* write buffer is empty */

    if (write_head->i==sync) @;

    if (write_head->i==sync) @;

    if (ticks.l-write_head->stamp

    if (ticks.l-write_head->stamp

      wait(1); /* data too raw */

      wait(1); /* data too raw */

    if (!Dcache || (write_head->addr.h&0xffff0000)) goto mem_direct;

    if (!Dcache || (write_head->addr.h&0xffff0000)) goto mem_direct;

          /* not cached */

          /* not cached */

    if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1); /* D-cache busy */

    if (Dcache->lock || (j=get_reader(Dcache)<0)) wait(1); /* D-cache busy */

    startup(&Dcache->reader[j],Dcache->access_time);

    startup(&Dcache->reader[j],Dcache->access_time);

                if there's a cache hit@>;

                if there's a cache hit@>;

    data->state=((Dcache->mode&WRITE_ALLOC) && write_head->i!=stunc? 1: 3);

    data->state=((Dcache->mode&WRITE_ALLOC) && write_head->i!=stunc? 1: 3);

    wait(Dcache->access_time);

    wait(Dcache->access_time);

  case 1: @addr|

  case 1: @addr|

           into the D-cache@>;

           into the D-cache@>;

    data->state=2;@+sleep;

    data->state=2;@+sleep;

  case 2: data->state=0;@+sleep; /* wake up when the D-cache has the block */

  case 2: data->state=0;@+sleep; /* wake up when the D-cache has the block */

  case 3: @;

  case 3: @;

  mem_direct: @;

  mem_direct: @;

@ @=

@ @=

register cacheblock *p,*q;

register cacheblock *p,*q;

@ The granularity is guaranteed to be 8 in write-around mode

@ The granularity is guaranteed to be 8 in write-around mode

(see |MMIX_config|). Although an uncached store will not be stored in the

(see |MMIX_config|). Although an uncached store will not be stored in the

D-cache (unless it hits in the D-cache), it will go into a secondary cache.

D-cache (unless it hits in the D-cache), it will go into a secondary cache.

@=

@=

if (Dcache->flusher.next) wait(1);

if (Dcache->flusher.next) wait(1);

Dcache->outbuf.tag.h=write_head->addr.h;

Dcache->outbuf.tag.h=write_head->addr.h;

Dcache->outbuf.tag.l=write_head->addr.l&(-Dcache->bb);

Dcache->outbuf.tag.l=write_head->addr.l&(-Dcache->bb);

for (j=0;jbb>>Dcache->g;j++) Dcache->outbuf.dirty[j]=false;

for (j=0;jbb>>Dcache->g;j++) Dcache->outbuf.dirty[j]=false;

Dcache->outbuf.data[(write_head->addr.l&(Dcache->bb-1))>>3]=write_head->o;

Dcache->outbuf.data[(write_head->addr.l&(Dcache->bb-1))>>3]=write_head->o;

Dcache->outbuf.dirty[(write_head->addr.l&(Dcache->bb-1))>>Dcache->g]=true;

Dcache->outbuf.dirty[(write_head->addr.l&(Dcache->bb-1))>>Dcache->g]=true;

set_lock(self,wbuf_lock);

set_lock(self,wbuf_lock);

startup(&Dcache->flusher,Dcache->copy_out_time);

startup(&Dcache->flusher,Dcache->copy_out_time);

data->state=5;@+ wait(Dcache->copy_out_time);

data->state=5;@+ wait(Dcache->copy_out_time);

@ @=

@ @=

if (mem_lock) wait(1);

if (mem_lock) wait(1);

set_lock(self,wbuf_lock);

set_lock(self,wbuf_lock);

set_lock(&mem_locker,mem_lock); /* a coroutine of type |vanish| */

set_lock(&mem_locker,mem_lock); /* a coroutine of type |vanish| */

startup(&mem_locker,mem_addr_time+mem_write_time);

startup(&mem_locker,mem_addr_time+mem_write_time);

mem_write(write_head->addr,write_head->o);

mem_write(write_head->addr,write_head->o);

data->state=5;@+ wait(mem_addr_time+mem_write_time);

data->state=5;@+ wait(mem_addr_time+mem_write_time);

@ A subtlety needs to be mentioned here: While we're trying to

@ A subtlety needs to be mentioned here: While we're trying to

update the D-cache, another instruction might be filling the

update the D-cache, another instruction might be filling the

same cache block (although not because of the same physical address).

same cache block (although not because of the same physical address).

Therefore we |goto write_restart| here instead of saying |wait(1)|.

Therefore we |goto write_restart| here instead of saying |wait(1)|.

@addr| into the D-cache@>=

@addr| into the D-cache@>=

if (Dcache->filler.next) goto write_restart;

if (Dcache->filler.next) goto write_restart;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto write_restart;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto write_restart;

p=alloc_slot(Dcache,write_head->addr);

p=alloc_slot(Dcache,write_head->addr);

if (!p) goto write_restart;

if (!p) goto write_restart;

if (Scache) set_lock(&Dcache->filler,Scache->lock)@;

if (Scache) set_lock(&Dcache->filler,Scache->lock)@;

else set_lock(&Dcache->filler,mem_lock);

else set_lock(&Dcache->filler,mem_lock);

set_lock(self,Dcache->fill_lock);

set_lock(self,Dcache->fill_lock);

data->ptr_b=Dcache->filler_ctl.ptr_b=(void *)p;

data->ptr_b=Dcache->filler_ctl.ptr_b=(void *)p;

Dcache->filler_ctl.z.o=write_head->addr;

Dcache->filler_ctl.z.o=write_head->addr;

startup(&Dcache->filler,Scache? Scache->access_time: mem_addr_time);

startup(&Dcache->filler,Scache? Scache->access_time: mem_addr_time);

@ Here it is assumed that |Dcache->access_time| is enough to search

@ Here it is assumed that |Dcache->access_time| is enough to search

the D-cache and update one octabyte in case of a hit. The D-cache is

the D-cache and update one octabyte in case of a hit. The D-cache is

not locked, since other coroutines that might be simultaneously reading

not locked, since other coroutines that might be simultaneously reading

the D-cache are not going to use the octabyte that changes.

the D-cache are not going to use the octabyte that changes.

Perhaps the simulator is being too lenient here.

Perhaps the simulator is being too lenient here.

@=

@=

p=cache_search(Dcache,write_head->addr);

p=cache_search(Dcache,write_head->addr);

if (p) {

if (p) {

  p=use_and_fix(Dcache,p);

  p=use_and_fix(Dcache,p);

  set_lock(self,wbuf_lock);

  set_lock(self,wbuf_lock);

  data->ptr_b=(void *)p;

  data->ptr_b=(void *)p;

  p->data[(write_head->addr.l&(Dcache->bb-1))>>3]=write_head->o;

  p->data[(write_head->addr.l&(Dcache->bb-1))>>3]=write_head->o;

  p->dirty[(write_head->addr.l&(Dcache->bb-1))>>Dcache->g]=true;

  p->dirty[(write_head->addr.l&(Dcache->bb-1))>>Dcache->g]=true;

  data->state=4;@+ wait(Dcache->access_time);

  data->state=4;@+ wait(Dcache->access_time);

@ @=

@ @=

if ((Dcache->mode&WRITE_BACK)==0) { /* write-through */

if ((Dcache->mode&WRITE_BACK)==0) { /* write-through */

  if (Dcache->flusher.next) wait(1);

  if (Dcache->flusher.next) wait(1);

  flush_cache(Dcache,p,true);

  flush_cache(Dcache,p,true);

@ @=

@ @=

  set_lock(self,wbuf_lock);

  set_lock(self,wbuf_lock);

  data->state=5;

  data->state=5;

  wait(1);

  wait(1);

@* Loading and storing. A RISC machine is often said to have a ``load/store

@* Loading and storing. A RISC machine is often said to have a ``load/store

architecture,'' perhaps because loading and storing are among the most

architecture,'' perhaps because loading and storing are among the most

difficult things a RISC machine is called upon to do.

difficult things a RISC machine is called upon to do.

We want memory accesses

We want memory accesses

to be efficient, so we try to access the D-cache at the same time as we are

to be efficient, so we try to access the D-cache at the same time as we are

translating a virtual address via the DT-cache. Usually we hit in both

translating a virtual address via the DT-cache. Usually we hit in both

caches, but numerous cases must be dealt with when we miss. Is there

caches, but numerous cases must be dealt with when we miss. Is there

an elegant way to handle all the contingencies? Alas, the author of this

an elegant way to handle all the contingencies? Alas, the author of this

program was unable to think of anything better than to throw lots

program was unable to think of anything better than to throw lots

of code at the problem --- knowing full well that such a spaghetti-like

of code at the problem --- knowing full well that such a spaghetti-like

approach is fraught with possibilities for error.

approach is fraught with possibilities for error.

Instructions like \.{LDO} $x,y,z$ operate in two pipeline stages. The first

Instructions like \.{LDO} $x,y,z$ operate in two pipeline stages. The first

stage computes the virtual address $y+z$, waiting if necessary until $y$

stage computes the virtual address $y+z$, waiting if necessary until $y$

and~$z$ are both known; then it starts to access the necessary caches.

and~$z$ are both known; then it starts to access the necessary caches.

In the second stage we ascertain the corresponding physical address and

In the second stage we ascertain the corresponding physical address and

hopefully find the data in the cache (or in the speculative |mem| list or the

hopefully find the data in the cache (or in the speculative |mem| list or the

write buffer).

write buffer).

An instruction like \.{STB} $x,y,z$ shares some of the computation of

An instruction like \.{STB} $x,y,z$ shares some of the computation of

\.{LDO}~$x,y,z$, because only one byte is being stored but the other seven

\.{LDO}~$x,y,z$, because only one byte is being stored but the other seven

bytes must be found in the cache. In this case, however, $x$~is treated as an

bytes must be found in the cache. In this case, however, $x$~is treated as an

input, and |mem| is the output. The second stage of a store command can begin

input, and |mem| is the output. The second stage of a store command can begin

even though $x$ is not known during the first stage.

even though $x$ is not known during the first stage.

Here's what we do at the beginning of stage~1.

Here's what we do at the beginning of stage~1.

@d ld_st_launch 7 /* state when load/store command has its memory address */

@d ld_st_launch 7 /* state when load/store command has its memory address */

@=

@=

case preld: case prest: case prego:

case preld: case prest: case prego:

  data->z.o=incr(data->z.o,data->xx&-(data->i==prego? Icache: Dcache)->bb);

  data->z.o=incr(data->z.o,data->xx&-(data->i==prego? Icache: Dcache)->bb);

  /* (I hope the adder is fast enough) */

  /* (I hope the adder is fast enough) */

case ld: case ldunc: case ldvts:

case ld: case ldunc: case ldvts:

case st: case pst: case syncd: case syncid:

case st: case pst: case syncd: case syncid:

start_ld_st: data->y.o=oplus(data->y.o,data->z.o);

start_ld_st: data->y.o=oplus(data->y.o,data->z.o);

  data->state=ld_st_launch;@+ goto switch1;

  data->state=ld_st_launch;@+ goto switch1;

case ldptp: case ldpte:@+if (data->y.o.h) goto start_ld_st;

case ldptp: case ldpte:@+if (data->y.o.h) goto start_ld_st;

  data->x.o=zero_octa;@+ data->x.known=true;@+ goto die; /* page table fault */

  data->x.o=zero_octa;@+ data->x.known=true;@+ goto die; /* page table fault */

@ @d PRW_BITS (data->ii==pst? PR_BIT+PW_BIT:

@ @d PRW_BITS (data->ii==pst? PR_BIT+PW_BIT:

                  (data->i==syncid && (data->loc.h&sign_bit))? 0: PW_BIT)

                  (data->i==syncid && (data->loc.h&sign_bit))? 0: PW_BIT)

@=

@=

case ld_st_launch:@+if ((self+1)->next)

case ld_st_launch:@+if ((self+1)->next)

    wait(1); /* second stage must be clear */

    wait(1); /* second stage must be clear */

@;

@;

  if (data->y.o.h&sign_bit)

  if (data->y.o.h&sign_bit)

@;

@;

  if (page_bad) {

  if (page_bad) {

    if (data->i==st || (data->ii>syncid))

    if (data->i==st || (data->ii>syncid))

       data->interrupt|=PRW_BITS;

       data->interrupt|=PRW_BITS;

    goto fin_ex;

    goto fin_ex;

  if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

  if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

  startup(&DTcache->reader[j],DTcache->access_time);

  startup(&DTcache->reader[j],DTcache->access_time);

@;

@;

  pass_after(DTcache->access_time);@+ goto passit;

  pass_after(DTcache->access_time);@+ goto passit;

@ When stage 2 of a load/store command begins, the state will depend

@ When stage 2 of a load/store command begins, the state will depend

on what transpired in stage~1.

on what transpired in stage~1.

For example, |data->state| will be |DT_miss| if the virtual address key

For example, |data->state| will be |DT_miss| if the virtual address key

can't be found in the DT-cache; then stage~2 will have to compute the

can't be found in the DT-cache; then stage~2 will have to compute the

physical address the hard way.

physical address the hard way.

The |data->state| will be |DT_hit| if

The |data->state| will be |DT_hit| if

the physical address is known via the DT-cache, but the data may or may not

the physical address is known via the DT-cache, but the data may or may not

be in the D-cache. The |data->state| will be |hit_and_miss| if the DT-cache

be in the D-cache. The |data->state| will be |hit_and_miss| if the DT-cache

hits and the D-cache doesn't. And |data->state| will be |ld_ready| if

hits and the D-cache doesn't. And |data->state| will be |ld_ready| if

|data->x.o| is the desired octabyte (for example, if both caches hit).

|data->x.o| is the desired octabyte (for example, if both caches hit).

@d DT_miss 10 /* second stage |state| when DT-cache doesn't hold the key */

@d DT_miss 10 /* second stage |state| when DT-cache doesn't hold the key */

@d DT_hit 11 /* second stage |state| when physical address is known */

@d DT_hit 11 /* second stage |state| when physical address is known */

@d hit_and_miss 12 /* second stage |state| when D-cache misses */

@d hit_and_miss 12 /* second stage |state| when D-cache misses */

@d ld_ready 13 /* second stage |state| when data has been read */

@d ld_ready 13 /* second stage |state| when data has been read */

@d st_ready 14 /* second stage |state| when data needn't be read */

@d st_ready 14 /* second stage |state| when data needn't be read */

@d prest_win 15 /* second stage |state| when we can fill a block with zeroes */

@d prest_win 15 /* second stage |state| when we can fill a block with zeroes */

@=

@=

p=cache_search(DTcache,trans_key(data->y.o));

p=cache_search(DTcache,trans_key(data->y.o));

if (!Dcache || Dcache->lock || (j=get_reader(Dcache))<0 ||

if (!Dcache || Dcache->lock || (j=get_reader(Dcache))<0 ||

     (data->i>=st && data->i<=syncid))

     (data->i>=st && data->i<=syncid))

@;

@;

startup(&Dcache->reader[j],Dcache->access_time);

startup(&Dcache->reader[j],Dcache->access_time);

if (p) @@;

if (p) @@;

else data->state=DT_miss;

else data->state=DT_miss;

@ We assume that it is possible to look up a virtual address in the DT-cache

@ We assume that it is possible to look up a virtual address in the DT-cache

at the same time as we look for a corresponding physical address in the

at the same time as we look for a corresponding physical address in the

D-cache, provided that the lower $b+c$ bits of the two addresses are the same.

D-cache, provided that the lower $b+c$ bits of the two addresses are the same.

(They will always be the same if |b+c<=page_s|; otherwise the operating system

(They will always be the same if |b+c<=page_s|; otherwise the operating system

can try to make them the same by ``page coloring'' whenever possible.) If both

can try to make them the same by ``page coloring'' whenever possible.) If both

caches hit, the physical address is known in

caches hit, the physical address is known in

@^page coloring@>

@^page coloring@>

max(|DTcache->access_time,Dcache->access_time|) cycles.

max(|DTcache->access_time,Dcache->access_time|) cycles.

If the lower $b+c$ bits of the virtual and physical addresses differ,

If the lower $b+c$ bits of the virtual and physical addresses differ,

the machine will not know this until the DT-cache has hit.

the machine will not know this until the DT-cache has hit.

Therefore we simulate the operation of accessing the D-cache, but we go to

Therefore we simulate the operation of accessing the D-cache, but we go to

|DT_hit| instead of to |hit_and_miss| because the D-cache will

|DT_hit| instead of to |hit_and_miss| because the D-cache will

experience a spurious miss.

experience a spurious miss.

@d max(x,y) ((x)<(y)? (y):(x))

@d max(x,y) ((x)<(y)? (y):(x))

@=

@=

{@+octa *m;

{@+octa *m;

@;

@;

  data->z.o=phys_addr(data->y.o,p->data[0]);

  data->z.o=phys_addr(data->y.o,p->data[0]);

  m=write_search(data,data->z.o);

  m=write_search(data,data->z.o);

  if (m==DUNNO) data->state=DT_hit;

  if (m==DUNNO) data->state=DT_hit;

  else if (m) data->x.o=*m, data->state=ld_ready;

  else if (m) data->x.o=*m, data->state=ld_ready;

  else if (Dcache->b+Dcache->c>page_s &&@|

  else if (Dcache->b+Dcache->c>page_s &&@|

      ((data->y.o.l^data->z.o.l)&((Dcache->bb<c)-(1<

      ((data->y.o.l^data->z.o.l)&((Dcache->bb<c)-(1<

    data->state=DT_hit; /* spurious D-cache lookup */

    data->state=DT_hit; /* spurious D-cache lookup */

  else {

  else {

    q=cache_search(Dcache,data->z.o);

    q=cache_search(Dcache,data->z.o);

    if (q) {

    if (q) {

      if (data->i==ldunc) q=demote_and_fix(Dcache,q);

      if (data->i==ldunc) q=demote_and_fix(Dcache,q);

      else q=use_and_fix(Dcache,q);

      else q=use_and_fix(Dcache,q);

      data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

      data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

      data->state=ld_ready;

      data->state=ld_ready;

    }@+else data->state=hit_and_miss;

    }@+else data->state=hit_and_miss;

  pass_after(max(DTcache->access_time,Dcache->access_time));

  pass_after(max(DTcache->access_time,Dcache->access_time));

  goto passit;

  goto passit;

@ The protection bits $p_rp_wp_x$ in a translation cache are shifted

@ The protection bits $p_rp_wp_x$ in a translation cache are shifted

four positions right from the interrupt codes |PR_BIT|, |PW_BIT|, |PX_BIT|.

four positions right from the interrupt codes |PR_BIT|, |PW_BIT|, |PX_BIT|.

If the data is protected, we abort the load/store operation immediately;

If the data is protected, we abort the load/store operation immediately;

this protects the privacy of other users.

this protects the privacy of other users.

@=

@=

p=use_and_fix(DTcache,p);

p=use_and_fix(DTcache,p);

j=PRW_BITS;

j=PRW_BITS;

if (((p->data[0].l<

if (((p->data[0].l<

  if (data->i==syncd || data->i==syncid) goto sync_check;

  if (data->i==syncd || data->i==syncid) goto sync_check;

  if (data->i!=preld && data->i!=prest)

  if (data->i!=preld && data->i!=prest)

    data->interrupt|=j&~(p->data[0].l<

    data->interrupt|=j&~(p->data[0].l<

  goto fin_ex;

  goto fin_ex;

@ @=

@ @=

{@+octa *m;

{@+octa *m;

  if (p) {

  if (p) {

@;

@;

    data->z.o=phys_addr(data->y.o,p->data[0]);

    data->z.o=phys_addr(data->y.o,p->data[0]);

    if (data->i>=st && data->i<=syncid) data->state=st_ready;

    if (data->i>=st && data->i<=syncid) data->state=st_ready;

    else {

    else {

      m=write_search(data,data->z.o);

      m=write_search(data,data->z.o);

      if (m && m!= DUNNO) data->x.o=*m, data->state=ld_ready;

      if (m && m!= DUNNO) data->x.o=*m, data->state=ld_ready;

      else data->state=DT_hit;

      else data->state=DT_hit;

  }@+ else data->state=DT_miss;

  }@+ else data->state=DT_miss;

  pass_after(DTcache->access_time);@+ goto passit;

  pass_after(DTcache->access_time);@+ goto passit;

@ @=

@ @=

{@+octa *m;

{@+octa *m;

  if (!(data->loc.h&sign_bit)) {

  if (!(data->loc.h&sign_bit)) {

    if (data->i==syncd || data->i==syncid) goto sync_check;

    if (data->i==syncd || data->i==syncid) goto sync_check;

    if (data->i!=preld && data->i!=prest) data->interrupt |= N_BIT;

    if (data->i!=preld && data->i!=prest) data->interrupt |= N_BIT;

    goto fin_ex;

    goto fin_ex;

  data->z.o=data->y.o;@+ data->z.o.h -= sign_bit;

  data->z.o=data->y.o;@+ data->z.o.h -= sign_bit;

  if (data->i>=st && data->i<=syncid) {

  if (data->i>=st && data->i<=syncid) {

    data->state=st_ready;@+pass_after(1);@+goto passit;

    data->state=st_ready;@+pass_after(1);@+goto passit;

  m=write_search(data,data->z.o);

  m=write_search(data,data->z.o);

  if (m) {

  if (m) {

    if (m==DUNNO) data->state=DT_hit;

    if (m==DUNNO) data->state=DT_hit;

    else data->x.o=*m, data->state=ld_ready;

    else data->x.o=*m, data->state=ld_ready;

  }@+ else if ((data->z.o.h&0xffff0000) || !Dcache) {

  }@+ else if ((data->z.o.h&0xffff0000) || !Dcache) {

    if (mem_lock) wait(1);

    if (mem_lock) wait(1);

    set_lock(&mem_locker,mem_lock);

    set_lock(&mem_locker,mem_lock);

    data->x.o=mem_read(data->z.o);

    data->x.o=mem_read(data->z.o);

    data->state=ld_ready;

    data->state=ld_ready;

    startup(&mem_locker,mem_addr_time+mem_read_time);

    startup(&mem_locker,mem_addr_time+mem_read_time);

    pass_after(mem_addr_time+mem_read_time);@+ goto passit;

    pass_after(mem_addr_time+mem_read_time);@+ goto passit;

  if (Dcache->lock || (j=get_reader(Dcache))<0) {

  if (Dcache->lock || (j=get_reader(Dcache))<0) {

    data->state=DT_hit;@+pass_after(1);@+ goto passit;

    data->state=DT_hit;@+pass_after(1);@+ goto passit;

  startup(&Dcache->reader[j],Dcache->access_time);

  startup(&Dcache->reader[j],Dcache->access_time);

  q=cache_search(Dcache,data->z.o);

  q=cache_search(Dcache,data->z.o);

  if (q) {

  if (q) {

    if (data->i==ldunc) q=demote_and_fix(Dcache,q);

    if (data->i==ldunc) q=demote_and_fix(Dcache,q);

    else q=use_and_fix(Dcache,q);

    else q=use_and_fix(Dcache,q);

    data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

    data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

    data->state=ld_ready;

    data->state=ld_ready;

  }@+else data->state=hit_and_miss;

  }@+else data->state=hit_and_miss;

  pass_after(Dcache->access_time);@+ goto passit;

  pass_after(Dcache->access_time);@+ goto passit;

@ The program for the second stage is, likewise, rather long-winded, yet quite

@ The program for the second stage is, likewise, rather long-winded, yet quite

similar to the cache manipulations we have already seen several times.

similar to the cache manipulations we have already seen several times.

Several instructions might be trying to fill the DT-cache for the same page.

Several instructions might be trying to fill the DT-cache for the same page.

(A similar situation faced us in the |write_from_wbuf| coroutine.)

(A similar situation faced us in the |write_from_wbuf| coroutine.)

The second stage therefore needs to do some

The second stage therefore needs to do some

translation cache searching just as the first stage did. In this

translation cache searching just as the first stage did. In this

stage, however, we don't go all out for speed, because DT-cache misses

stage, however, we don't go all out for speed, because DT-cache misses

are rare.

are rare.

@d DT_retry 8 /* second stage |state| when DT-cache should be searched again */

@d DT_retry 8 /* second stage |state| when DT-cache should be searched again */

@d got_DT 9   /* second stage |state| when DT-cache entry has been computed */

@d got_DT 9   /* second stage |state| when DT-cache entry has been computed */

@=

@=

square_one: data->state=DT_retry;

square_one: data->state=DT_retry;

 case DT_retry:@+if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

 case DT_retry:@+if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

   startup(&DTcache->reader[j],DTcache->access_time);

   startup(&DTcache->reader[j],DTcache->access_time);

   p=cache_search(DTcache,trans_key(data->y.o));

   p=cache_search(DTcache,trans_key(data->y.o));

   if (p) {

   if (p) {

@;

@;

     data->z.o=phys_addr(data->y.o,p->data[0]);

     data->z.o=phys_addr(data->y.o,p->data[0]);

     if (data->i>=st && data->i<=syncid) data->state=st_ready;

     if (data->i>=st && data->i<=syncid) data->state=st_ready;

     else data->state=DT_hit;

     else data->state=DT_hit;

   }@+ else data->state=DT_miss;

   }@+ else data->state=DT_miss;

   wait(DTcache->access_time);

   wait(DTcache->access_time);

 case DT_miss:@+if (DTcache->filler.next)

 case DT_miss:@+if (DTcache->filler.next)

     if (data->i==preld || data->i==prest) goto fin_ex;@+ else goto square_one;

     if (data->i==preld || data->i==prest) goto fin_ex;@+ else goto square_one;

   if (no_hardware_PT)

   if (no_hardware_PT)

     if (data->i==preld || data->i==prest) goto fin_ex;@+else goto emulate_virt;

     if (data->i==preld || data->i==prest) goto fin_ex;@+else goto emulate_virt;

   p=alloc_slot(DTcache,trans_key(data->y.o));

   p=alloc_slot(DTcache,trans_key(data->y.o));

   if (!p) goto square_one;

   if (!p) goto square_one;

   data->ptr_b=DTcache->filler_ctl.ptr_b=(void *)p;

   data->ptr_b=DTcache->filler_ctl.ptr_b=(void *)p;

   DTcache->filler_ctl.y.o=data->y.o;

   DTcache->filler_ctl.y.o=data->y.o;

   set_lock(self,DTcache->fill_lock);

   set_lock(self,DTcache->fill_lock);

   startup(&DTcache->filler,1);

   startup(&DTcache->filler,1);

   data->state=got_DT;

   data->state=got_DT;

   if (data->i==preld || data->i==prest) goto fin_ex;@+else sleep;

   if (data->i==preld || data->i==prest) goto fin_ex;@+else sleep;

 case got_DT: release_lock(self,DTcache->fill_lock);

 case got_DT: release_lock(self,DTcache->fill_lock);

   j=PRW_BITS;

   j=PRW_BITS;

   if (((data->z.o.l<

   if (((data->z.o.l<

     if (data->i==syncd || data->i==syncid) goto sync_check;

     if (data->i==syncd || data->i==syncid) goto sync_check;

     data->interrupt |= j&~(data->z.o.l<

     data->interrupt |= j&~(data->z.o.l<

     goto fin_ex;

     goto fin_ex;

   data->z.o=phys_addr(data->y.o,data->z.o);

   data->z.o=phys_addr(data->y.o,data->z.o);

   if (data->i>=st && data->i<=syncid) goto finish_store;

   if (data->i>=st && data->i<=syncid) goto finish_store;

    /* otherwise we fall through to |ld_retry| below */

    /* otherwise we fall through to |ld_retry| below */

@ The second stage might also want to fill the D-cache (and perhaps

@ The second stage might also want to fill the D-cache (and perhaps

the S-cache) as we get the data.

the S-cache) as we get the data.

Several load instructions might be trying to fill the same cache block.

Several load instructions might be trying to fill the same cache block.

So we should go back and look in the D-cache again if we miss and

So we should go back and look in the D-cache again if we miss and

cannot allocate a slot immediately.

cannot allocate a slot immediately.

A \.{PRELD} or \.{PREST} instruction, which is just a ``hint,'' doesn't do

A \.{PRELD} or \.{PREST} instruction, which is just a ``hint,'' doesn't do

anything more if the caches are already busy.

anything more if the caches are already busy.

@=

@=

ld_retry: data->state=DT_hit;

ld_retry: data->state=DT_hit;

 case DT_hit:@+ if (data->i==preld || data->i==prest) goto fin_ex;

 case DT_hit:@+ if (data->i==preld || data->i==prest) goto fin_ex;

@;

@;

  if ((data->z.o.h&0xffff0000) || !Dcache)

  if ((data->z.o.h&0xffff0000) || !Dcache)

@;

@;

  if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

  if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

  startup(&Dcache->reader[j],Dcache->access_time);

  startup(&Dcache->reader[j],Dcache->access_time);

  q=cache_search(Dcache,data->z.o);

  q=cache_search(Dcache,data->z.o);

  if (q) {

  if (q) {

    if (data->i==ldunc) q=demote_and_fix(Dcache,q);

    if (data->i==ldunc) q=demote_and_fix(Dcache,q);

    else q=use_and_fix(Dcache,q);

    else q=use_and_fix(Dcache,q);

    data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

    data->x.o=q->data[(data->z.o.l&(Dcache->bb-1))>>3];

    data->state=ld_ready;

    data->state=ld_ready;

  }@+else data->state=hit_and_miss;

  }@+else data->state=hit_and_miss;

  wait(Dcache->access_time);

  wait(Dcache->access_time);

 case hit_and_miss:@+if (data->i==ldunc) goto avoid_D;

 case hit_and_miss:@+if (data->i==ldunc) goto avoid_D;

    @z.o| in the D-cache@>;

    @z.o| in the D-cache@>;

@ @z.o| in the D-cache@>=

@ @z.o| in the D-cache@>=

@;

@;

if (Dcache->filler.next) goto ld_retry;

if (Dcache->filler.next) goto ld_retry;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto ld_retry;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto ld_retry;

q=alloc_slot(Dcache,data->z.o);

q=alloc_slot(Dcache,data->z.o);

if (!q) goto ld_retry;

if (!q) goto ld_retry;

if (Scache) set_lock(&Dcache->filler,Scache->lock)@;

if (Scache) set_lock(&Dcache->filler,Scache->lock)@;

else set_lock(&Dcache->filler,mem_lock);

else set_lock(&Dcache->filler,mem_lock);

set_lock(self,Dcache->fill_lock);

set_lock(self,Dcache->fill_lock);

data->ptr_b=Dcache->filler_ctl.ptr_b=(void *)q;

data->ptr_b=Dcache->filler_ctl.ptr_b=(void *)q;

Dcache->filler_ctl.z.o=data->z.o;

Dcache->filler_ctl.z.o=data->z.o;

startup(&Dcache->filler,Scache? Scache->access_time: mem_addr_time);

startup(&Dcache->filler,Scache? Scache->access_time: mem_addr_time);

data->state=ld_ready;

data->state=ld_ready;

if (data->i==preld || data->i==prest) goto fin_ex;@+else sleep;

if (data->i==preld || data->i==prest) goto fin_ex;@+else sleep;

@ If a |prest| instruction makes it to the hot seat,

@ If a |prest| instruction makes it to the hot seat,

we have been assured by the user of |PREST| that the current

we have been assured by the user of |PREST| that the current

values of bytes in virtual addresses |data->y.o-(data->xx&-Dcache->bb)| through

values of bytes in virtual addresses |data->y.o-(data->xx&-Dcache->bb)| through

|data->y.o+(data->xx&(Dcache->bb-1))|

|data->y.o+(data->xx&(Dcache->bb-1))|

are irrelevant. Hence we can pretend that we know they are zero. This

are irrelevant. Hence we can pretend that we know they are zero. This

is advantageous if it saves us from filling a cache block from

is advantageous if it saves us from filling a cache block from

the S-cache or from memory.

the S-cache or from memory.

@=

@=

if (data->i==prest &&@|

if (data->i==prest &&@|

   (data->xx>=Dcache->bb || ((data->y.o.l&(Dcache->bb-1))==0)) &&@|

   (data->xx>=Dcache->bb || ((data->y.o.l&(Dcache->bb-1))==0)) &&@|

   ((data->y.o.l+(data->xx&(Dcache->bb-1))+1)^data->y.o.l)>=Dcache->bb)

   ((data->y.o.l+(data->xx&(Dcache->bb-1))+1)^data->y.o.l)>=Dcache->bb)

  goto prest_span;

  goto prest_span;

@ @=

@ @=

prest_span: data->state=prest_win;

prest_span: data->state=prest_win;

case prest_win:@+ if (data!=old_hot || Dlocker.next) wait(1);

case prest_win:@+ if (data!=old_hot || Dlocker.next) wait(1);

  if (Dcache->lock) goto fin_ex;

  if (Dcache->lock) goto fin_ex;

  q=alloc_slot(Dcache,data->z.o); /* OK if |Dcache->filler| is busy */

  q=alloc_slot(Dcache,data->z.o); /* OK if |Dcache->filler| is busy */

  if (q) {

  if (q) {

    clean_block(Dcache,q);

    clean_block(Dcache,q);

    q->tag=data->z.o;@+q->tag.l &=-Dcache->bb;

    q->tag=data->z.o;@+q->tag.l &=-Dcache->bb;

    set_lock(&Dlocker,Dcache->lock);

    set_lock(&Dlocker,Dcache->lock);

    startup(&Dlocker,Dcache->copy_in_time);

    startup(&Dlocker,Dcache->copy_in_time);

  goto fin_ex;

  goto fin_ex;

@ @=

@ @=

avoid_D:@+ if (mem_lock) wait(1);

avoid_D:@+ if (mem_lock) wait(1);

  set_lock(&mem_locker,mem_lock);

  set_lock(&mem_locker,mem_lock);

  startup(&mem_locker, mem_addr_time+mem_read_time);

  startup(&mem_locker, mem_addr_time+mem_read_time);

  data->x.o=mem_read(data->z.o);

  data->x.o=mem_read(data->z.o);

  data->state=ld_ready;@+ wait(mem_addr_time+mem_read_time);

  data->state=ld_ready;@+ wait(mem_addr_time+mem_read_time);

@ @=

@ @=

  octa *m=write_search(data,data->z.o);

  octa *m=write_search(data,data->z.o);

  if (m==DUNNO) wait(1);

  if (m==DUNNO) wait(1);

  if (m) {

  if (m) {

    data->x.o=*m;

    data->x.o=*m;

    data->state=ld_ready;

    data->state=ld_ready;

    wait(1);

    wait(1);

@ The requested octabyte will arrive sooner or later in |data->x.o|.

@ The requested octabyte will arrive sooner or later in |data->x.o|.

Then a load instruction is almost done, except that we might need

Then a load instruction is almost done, except that we might need

to massage the input a little bit.

to massage the input a little bit.

@=

@=

case ld_ready:@+if (self->lockloc)

case ld_ready:@+if (self->lockloc)

    *(self->lockloc)=NULL, self->lockloc=NULL;

    *(self->lockloc)=NULL, self->lockloc=NULL;

  if (data->i>=st) goto finish_store;

  if (data->i>=st) goto finish_store;

  switch(data->op>>1) {

  switch(data->op>>1) {

    case LDB>>1: case LDBU>>1: j=(data->z.o.l&0x7)<<3;@+i=56;@+goto fin_ld;

    case LDB>>1: case LDBU>>1: j=(data->z.o.l&0x7)<<3;@+i=56;@+goto fin_ld;

    case LDW>>1: case LDWU>>1: j=(data->z.o.l&0x6)<<3;@+i=48;@+goto fin_ld;

    case LDW>>1: case LDWU>>1: j=(data->z.o.l&0x6)<<3;@+i=48;@+goto fin_ld;

    case LDT>>1: case LDTU>>1: j=(data->z.o.l&0x4)<<3;@+i=32;

    case LDT>>1: case LDTU>>1: j=(data->z.o.l&0x4)<<3;@+i=32;

 fin_ld: data->x.o=shift_right(shift_left(data->x.o,j),i,data->op&0x2);

 fin_ld: data->x.o=shift_right(shift_left(data->x.o,j),i,data->op&0x2);

    default: goto fin_ex;

    default: goto fin_ex;

    case LDHT>>1:@+if (data->z.o.l&4) data->x.o.h=data->x.o.l;

    case LDHT>>1:@+if (data->z.o.l&4) data->x.o.h=data->x.o.l;

      data->x.o.l=0;@+ goto fin_ex;

      data->x.o.l=0;@+ goto fin_ex;

    case LDSF>>1:@+if (data->z.o.l&4) data->x.o.h=data->x.o.l;

    case LDSF>>1:@+if (data->z.o.l&4) data->x.o.h=data->x.o.l;

      if ((data->x.o.h&0x7f800000)==0 && (data->x.o.h&0x7fffff)) {

      if ((data->x.o.h&0x7f800000)==0 && (data->x.o.h&0x7fffff)) {

        data->x.o=load_sf(data->x.o.h);

        data->x.o=load_sf(data->x.o.h);

        data->state=3;@+wait(denin_penalty);

        data->state=3;@+wait(denin_penalty);

      else data->x.o=load_sf(data->x.o.h);@+goto fin_ex;

      else data->x.o=load_sf(data->x.o.h);@+goto fin_ex;

    case LDPTP>>1:@+

    case LDPTP>>1:@+

      if ((data->x.o.h&sign_bit)==0 || (data->x.o.l&0x1ff8)!=page_n)

      if ((data->x.o.h&sign_bit)==0 || (data->x.o.l&0x1ff8)!=page_n)

        data->x.o=zero_octa;

        data->x.o=zero_octa;

      else data->x.o.l &= -(1<<13);

      else data->x.o.l &= -(1<<13);

      goto fin_ex;

      goto fin_ex;

    case LDPTE>>1:@+if ((data->x.o.l&0x1ff8)!=page_n) data->x.o=zero_octa;

    case LDPTE>>1:@+if ((data->x.o.l&0x1ff8)!=page_n) data->x.o=zero_octa;

      else data->x.o=incr(oandn(data->x.o,page_mask),data->x.o.l&0x7);

      else data->x.o=incr(oandn(data->x.o,page_mask),data->x.o.l&0x7);

      data->x.o.h &= 0xffff;@+ goto fin_ex;

      data->x.o.h &= 0xffff;@+ goto fin_ex;

    case UNSAVE>>1: @;

    case UNSAVE>>1: @;

@ @=

@ @=

 finish_store: data->state=st_ready;

 finish_store: data->state=st_ready;

case st_ready:@+ switch (data->i) {

case st_ready:@+ switch (data->i) {

 case st: case pst: @;

 case st: case pst: @;

 case syncd: data->b.o.l=(Dcache? Dcache->bb: 8192);@+goto do_syncd;

 case syncd: data->b.o.l=(Dcache? Dcache->bb: 8192);@+goto do_syncd;

 case syncid: data->b.o.l=(Icache? Icache->bb: 8192);

 case syncid: data->b.o.l=(Icache? Icache->bb: 8192);

   if (Dcache && Dcache->bbb.o.l) data->b.o.l=Dcache->bb;

   if (Dcache && Dcache->bbb.o.l) data->b.o.l=Dcache->bb;

   goto do_syncid;

   goto do_syncid;

@ Store instructions have an extra complication, because some of them need

@ Store instructions have an extra complication, because some of them need

to check for overflow.

to check for overflow.

@=

@=

data->x.addr=data->z.o;

data->x.addr=data->z.o;

if (data->b.p) wait(1);

if (data->b.p) wait(1);

switch(data->op>>1) {

switch(data->op>>1) {

 case STUNC>>1: data->i=stunc;

 case STUNC>>1: data->i=stunc;

 default: data->x.o=data->b.o;@+goto fin_ex;

 default: data->x.o=data->b.o;@+goto fin_ex;

 case STSF>>1: set_round;@+ data->b.o.h=store_sf(data->b.o);

 case STSF>>1: set_round;@+ data->b.o.h=store_sf(data->b.o);

    data->interrupt |= exceptions;

    data->interrupt |= exceptions;

    if ((data->b.o.h&0x7f800000)==0 && (data->b.o.h&0x7fffff)) {

    if ((data->b.o.h&0x7f800000)==0 && (data->b.o.h&0x7fffff)) {

      if (data->z.o.l&4) data->x.o.l=data->b.o.h;

      if (data->z.o.l&4) data->x.o.l=data->b.o.h;

      else data->x.o.h=data->b.o.h;

      else data->x.o.h=data->b.o.h;

      data->state=3;@+wait(denout_penalty);

      data->state=3;@+wait(denout_penalty);

 case STHT>>1:@+if (data->z.o.l&4) data->x.o.l=data->b.o.h;

 case STHT>>1:@+if (data->z.o.l&4) data->x.o.l=data->b.o.h;

  else data->x.o.h=data->b.o.h;

  else data->x.o.h=data->b.o.h;

  goto fin_ex;

  goto fin_ex;

 case STB>>1: case STBU>>1: j=(data->z.o.l&0x7)<<3;@+i=56;@+goto fin_st;

 case STB>>1: case STBU>>1: j=(data->z.o.l&0x7)<<3;@+i=56;@+goto fin_st;

 case STW>>1: case STWU>>1: j=(data->z.o.l&0x6)<<3;@+i=48;@+goto fin_st;

 case STW>>1: case STWU>>1: j=(data->z.o.l&0x6)<<3;@+i=48;@+goto fin_st;

 case STT>>1: case STTU>>1: j=(data->z.o.l&0x4)<<3;@+i=32;

 case STT>>1: case STTU>>1: j=(data->z.o.l&0x4)<<3;@+i=32;

  fin_st: @b.o| into the proper field of |data->x.o|,

  fin_st: @b.o| into the proper field of |data->x.o|,

                 checking for arithmetic exceptions if signed@>;

                 checking for arithmetic exceptions if signed@>;

  goto fin_ex;

  goto fin_ex;

 case CSWAP>>1: @;

 case CSWAP>>1: @;

 case SAVE>>1: @;

 case SAVE>>1: @;

@ @b.o| into the proper field...@>=

@ @b.o| into the proper field...@>=

  octa mask;

  octa mask;

  if (!(data->op&2)) {@+octa before,after;

  if (!(data->op&2)) {@+octa before,after;

    before=data->b.o;@+after=shift_right(shift_left(data->b.o,i),i,0);

    before=data->b.o;@+after=shift_right(shift_left(data->b.o,i),i,0);

    if (before.l!=after.l || before.h!=after.h) data->interrupt|=V_BIT;

    if (before.l!=after.l || before.h!=after.h) data->interrupt|=V_BIT;

  mask=shift_right(shift_left(neg_one,i),j,1);

  mask=shift_right(shift_left(neg_one,i),j,1);

  data->b.o=shift_right(shift_left(data->b.o,i),j,1);

  data->b.o=shift_right(shift_left(data->b.o,i),j,1);

  data->x.o.h^=mask.h&(data->x.o.h^data->b.o.h);

  data->x.o.h^=mask.h&(data->x.o.h^data->b.o.h);

  data->x.o.l^=mask.l&(data->x.o.l^data->b.o.l);

  data->x.o.l^=mask.l&(data->x.o.l^data->b.o.l);

@ The \.{CSWAP} operation has four inputs $\rm(\$X, \$Y, \$Z, rP)$ as well as

@ The \.{CSWAP} operation has four inputs $\rm(\$X, \$Y, \$Z, rP)$ as well as

three outputs $\rm(\$X,M_8[A],rP)$. To keep from exceeding the capacity

three outputs $\rm(\$X,M_8[A],rP)$. To keep from exceeding the capacity

of the control blocks in our pipeline, we wait until this instruction reaches

of the control blocks in our pipeline, we wait until this instruction reaches

the hot seat, thereby allowing us non-speculative access to~rP.

the hot seat, thereby allowing us non-speculative access to~rP.

@=

@=

if (data!=old_hot) wait(1);

if (data!=old_hot) wait(1);

if (data->x.o.h==g[rP].o.h && data->x.o.l==g[rP].o.l) {

if (data->x.o.h==g[rP].o.h && data->x.o.l==g[rP].o.l) {

  data->a.o.l=1; /* |data->a.o.h| is zero */

  data->a.o.l=1; /* |data->a.o.h| is zero */

  data->x.o=data->b.o;

  data->x.o=data->b.o;

}@+else {

}@+else {

  g[rP].o=data->x.o; /* |data->a.o| is zero */

  g[rP].o=data->x.o; /* |data->a.o| is zero */

  if (verbose&issue_bit) {

  if (verbose&issue_bit) {

    printf(" setting rP=");@+print_octa(g[rP].o);@+printf("\n");

    printf(" setting rP=");@+print_octa(g[rP].o);@+printf("\n");

data->i=cswap; /* cosmetic change, affects the trace output only */

data->i=cswap; /* cosmetic change, affects the trace output only */

goto fin_ex;

goto fin_ex;

@* The fetch stage. Now that we've mastered the most difficult memory

@* The fetch stage. Now that we've mastered the most difficult memory

operations, we can relax and apply our knowledge to the slightly simpler task

operations, we can relax and apply our knowledge to the slightly simpler task

of filling the fetch buffer. Fetching is like loading/storing, except that we

of filling the fetch buffer. Fetching is like loading/storing, except that we

use the I-cache instead of the D-cache. It's slightly simpler because the

use the I-cache instead of the D-cache. It's slightly simpler because the

I-cache is read-only. Further simplifications would be possible if there

I-cache is read-only. Further simplifications would be possible if there

were no \.{PREGO} instruction, because there is only one fetch unit.

were no \.{PREGO} instruction, because there is only one fetch unit.

However, we want to implement \.{PREGO} with reasonable efficiency, in order

However, we want to implement \.{PREGO} with reasonable efficiency, in order

to see if that instruction is worthwhile; so we include the complications of

to see if that instruction is worthwhile; so we include the complications of

simultaneous I-cache and IT-cache readers, which we

simultaneous I-cache and IT-cache readers, which we

have already implemented for the D-cache and DT-cache.

have already implemented for the D-cache and DT-cache.

The fetch coroutine is always present, as the one and only coroutine with

The fetch coroutine is always present, as the one and only coroutine with

|stage| number~zero.

|stage| number~zero.

In normal circumstances, the fetch coroutine accesses a cache block containing

In normal circumstances, the fetch coroutine accesses a cache block containing

the instruction whose virtual address is given by |inst_ptr| (the instruction

the instruction whose virtual address is given by |inst_ptr| (the instruction

pointer), and transfers up to |fetch_max| instructions from that block to the

pointer), and transfers up to |fetch_max| instructions from that block to the

fetch buffer. Complications arise if the instruction isn't in the cache, or if

fetch buffer. Complications arise if the instruction isn't in the cache, or if

we can't translate the virtual address because of a miss in the IT-cache.

we can't translate the virtual address because of a miss in the IT-cache.

Moreover, |inst_ptr| is a \&{spec} variable whose value might not even be

Moreover, |inst_ptr| is a \&{spec} variable whose value might not even be

known; if |inst_ptr.p| is nonnull, we don't know what to fetch.

known; if |inst_ptr.p| is nonnull, we don't know what to fetch.

@^program counter@>

@^program counter@>

@=

@=

Extern spec inst_ptr; /* the instruction pointer (aka program counter) */

Extern spec inst_ptr; /* the instruction pointer (aka program counter) */

Extern octa *fetched; /* buffer for incoming instructions */

Extern octa *fetched; /* buffer for incoming instructions */

@ The fetch coroutine usually begins a cycle in state |fetch_ready|, with

@ The fetch coroutine usually begins a cycle in state |fetch_ready|, with

the most recently fetched octabytes in positions |fetch_lo|, |fetch_lo+1|,

the most recently fetched octabytes in positions |fetch_lo|, |fetch_lo+1|,

\dots, |fetch_hi-1| of a buffer called |fetched|. Once that buffer has been

\dots, |fetch_hi-1| of a buffer called |fetched|. Once that buffer has been

exhausted, the coroutine reverts to state~0; with luck, the buffer might have

exhausted, the coroutine reverts to state~0; with luck, the buffer might have

more data by the time the next cycle rolls around.

more data by the time the next cycle rolls around.

@=

@=

int fetch_lo, fetch_hi; /* the active region of that buffer */

int fetch_lo, fetch_hi; /* the active region of that buffer */

coroutine fetch_co;

coroutine fetch_co;

control fetch_ctl;

control fetch_ctl;

@ @=

@ @=

fetch_co.ctl=&fetch_ctl;

fetch_co.ctl=&fetch_ctl;

fetch_co.name="Fetch";

fetch_co.name="Fetch";

fetch_ctl.go.o.l=4;

fetch_ctl.go.o.l=4;

startup(&fetch_co,1);

startup(&fetch_co,1);

@ @=

@ @=

if (fetch_co.lockloc) *(fetch_co.lockloc)=NULL,fetch_co.lockloc=NULL;

if (fetch_co.lockloc) *(fetch_co.lockloc)=NULL,fetch_co.lockloc=NULL;

unschedule(&fetch_co);

unschedule(&fetch_co);

startup(&fetch_co,1);

startup(&fetch_co,1);

@ Some of the actions here are done not only by the fetcher but also by the

@ Some of the actions here are done not only by the fetcher but also by the

first and second stages of a |prego| operation.

first and second stages of a |prego| operation.

@d wait_or_pass(t) if (data->i==prego) {@+pass_after(t);@+goto passit;@+}

@d wait_or_pass(t) if (data->i==prego) {@+pass_after(t);@+goto passit;@+}

                   else wait(t)

                   else wait(t)

@=

@=

switch0:@+ switch(data->state) {

switch0:@+ switch(data->state) {

 new_fetch: data->state=0;

 new_fetch: data->state=0;

 case 0: @;

 case 0: @;

   data->y.o=inst_ptr.o;

   data->y.o=inst_ptr.o;

   data->state=1;@+ data->interrupt=0;@+ data->x.o=data->z.o=zero_octa;

   data->state=1;@+ data->interrupt=0;@+ data->x.o=data->z.o=zero_octa;

 case 1: start_fetch:@+ if (data->y.o.h&sign_bit)

 case 1: start_fetch:@+ if (data->y.o.h&sign_bit)

@;

@;

  if (page_bad) goto bad_fetch;

  if (page_bad) goto bad_fetch;

  if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

  if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

  startup(&ITcache->reader[j],ITcache->access_time);

  startup(&ITcache->reader[j],ITcache->access_time);

@;

@;

  wait_or_pass(ITcache->access_time);

  wait_or_pass(ITcache->access_time);

@@;

@@;

@ @=

@ @=

if (data->i==prego) goto start_fetch;

if (data->i==prego) goto start_fetch;

@ @=

@ @=

if (inst_ptr.p) {

if (inst_ptr.p) {

  if (inst_ptr.p!=UNKNOWN_SPEC && inst_ptr.p->known)

  if (inst_ptr.p!=UNKNOWN_SPEC && inst_ptr.p->known)

    inst_ptr.o=inst_ptr.p->o, inst_ptr.p=NULL;

    inst_ptr.o=inst_ptr.p->o, inst_ptr.p=NULL;

  wait(1);

  wait(1);

@ @d got_IT 19   /* |state| when IT-cache entry has been computed */

@ @d got_IT 19   /* |state| when IT-cache entry has been computed */

@d IT_miss 20 /* |state| when IT-cache doesn't hold the key */

@d IT_miss 20 /* |state| when IT-cache doesn't hold the key */

@d IT_hit 21 /* |state| when physical instruction address is known */

@d IT_hit 21 /* |state| when physical instruction address is known */

@d Ihit_and_miss 22 /* |state| when I-cache misses */

@d Ihit_and_miss 22 /* |state| when I-cache misses */

@d fetch_ready 23 /* |state| when instructions have been read */

@d fetch_ready 23 /* |state| when instructions have been read */

@d got_one 24 /* |state| when a ``preview'' octabyte is ready */

@d got_one 24 /* |state| when a ``preview'' octabyte is ready */

@=

@=

p=cache_search(ITcache,trans_key(data->y.o));

p=cache_search(ITcache,trans_key(data->y.o));

if (!Icache || Icache->lock || (j=get_reader(Icache))<0)

if (!Icache || Icache->lock || (j=get_reader(Icache))<0)

@;

@;

startup(&Icache->reader[j],Icache->access_time);

startup(&Icache->reader[j],Icache->access_time);

if (p) @@;

if (p) @@;

else data->state=IT_miss;

else data->state=IT_miss;

@ We assume that it is possible to look up a virtual address in the IT-cache

@ We assume that it is possible to look up a virtual address in the IT-cache

at the same time as we look for a corresponding physical address in the

at the same time as we look for a corresponding physical address in the

I-cache, provided that the lower $b+c$ bits of the two addresses are the same.

I-cache, provided that the lower $b+c$ bits of the two addresses are the same.

(See the remarks about ``page coloring,'' when we made similar assumptions

(See the remarks about ``page coloring,'' when we made similar assumptions

about the DT-cache and D-cache.)

about the DT-cache and D-cache.)

@^page coloring@>

@^page coloring@>

@=

@=

@;

@;

  data->z.o=phys_addr(data->y.o,p->data[0]);

  data->z.o=phys_addr(data->y.o,p->data[0]);

  if (Icache->b+Icache->c>page_s &&@|

  if (Icache->b+Icache->c>page_s &&@|

      ((data->y.o.l^data->z.o.l)&((Icache->bb<c)-(1<

      ((data->y.o.l^data->z.o.l)&((Icache->bb<c)-(1<

    data->state=IT_hit; /* spurious I-cache lookup */

    data->state=IT_hit; /* spurious I-cache lookup */

  else {

  else {

    q=cache_search(Icache,data->z.o);

    q=cache_search(Icache,data->z.o);

    if (q) {

    if (q) {

      q=use_and_fix(Icache,q);

      q=use_and_fix(Icache,q);

@;

@;

      data->state=fetch_ready;

      data->state=fetch_ready;

    }@+else data->state=Ihit_and_miss;

    }@+else data->state=Ihit_and_miss;

  wait_or_pass(max(ITcache->access_time,Icache->access_time));

  wait_or_pass(max(ITcache->access_time,Icache->access_time));

@ @=

@ @=

p=use_and_fix(ITcache,p);

p=use_and_fix(ITcache,p);

if (!(p->data[0].l&(PX_BIT>>PROT_OFFSET))) goto bad_fetch;

if (!(p->data[0].l&(PX_BIT>>PROT_OFFSET))) goto bad_fetch;

@ At this point |inst_ptr.o| equals |data->y.o|.

@ At this point |inst_ptr.o| equals |data->y.o|.

@=

@=

if (data->i!=prego) {

if (data->i!=prego) {

  for (j=0;jbb;j++) fetched[j]=q->data[j];

  for (j=0;jbb;j++) fetched[j]=q->data[j];

  fetch_lo=(inst_ptr.o.l&(Icache->bb-1))>>3;

  fetch_lo=(inst_ptr.o.l&(Icache->bb-1))>>3;

  fetch_hi=Icache->bb>>3;

  fetch_hi=Icache->bb>>3;

@ @=

@ @=

  if (p) {

  if (p) {

@;

@;

    data->z.o=phys_addr(data->y.o,p->data[0]);

    data->z.o=phys_addr(data->y.o,p->data[0]);

    data->state=IT_hit;

    data->state=IT_hit;

  }@+ else data->state=IT_miss;

  }@+ else data->state=IT_miss;

  wait_or_pass(ITcache->access_time);

  wait_or_pass(ITcache->access_time);

@ @=

@ @=

  if (data->i==prego && !(data->loc.h&sign_bit)) goto fin_ex;

  if (data->i==prego && !(data->loc.h&sign_bit)) goto fin_ex;

  data->z.o=data->y.o;@+ data->z.o.h -= sign_bit;

  data->z.o=data->y.o;@+ data->z.o.h -= sign_bit;

 known_phys:@+  if (data->z.o.h&0xffff0000) goto bad_fetch;

 known_phys:@+  if (data->z.o.h&0xffff0000) goto bad_fetch;

  if (!Icache) @;

  if (!Icache) @;

  if (Icache->lock || (j=get_reader(Icache))<0) {

  if (Icache->lock || (j=get_reader(Icache))<0) {

    data->state=IT_hit;@+ wait_or_pass(1);

    data->state=IT_hit;@+ wait_or_pass(1);

  startup(&Icache->reader[j],Icache->access_time);

  startup(&Icache->reader[j],Icache->access_time);

  q=cache_search(Icache,data->z.o);

  q=cache_search(Icache,data->z.o);

  if (q) {

  if (q) {

    q=use_and_fix(Icache,q);

    q=use_and_fix(Icache,q);

@;

@;

    data->state=fetch_ready;

    data->state=fetch_ready;

  }@+else data->state=Ihit_and_miss;

  }@+else data->state=Ihit_and_miss;

  wait_or_pass(Icache->access_time);

  wait_or_pass(Icache->access_time);

@ @=

@ @=

{@+octa addr;

{@+octa addr;

  addr=data->z.o;

  addr=data->z.o;

  if (mem_lock) wait(1);

  if (mem_lock) wait(1);

  set_lock(&mem_locker,mem_lock);

  set_lock(&mem_locker,mem_lock);

  startup(&mem_locker,mem_addr_time+mem_read_time);

  startup(&mem_locker,mem_addr_time+mem_read_time);

  addr.l&=-(bus_words<<3);

  addr.l&=-(bus_words<<3);

  fetched[0]=mem_read(addr);

  fetched[0]=mem_read(addr);

  for (j=1;j

  for (j=1;j

    fetched[j]=mem_hash[last_h].chunk[((addr.l&0xffff)>>3)+j];

    fetched[j]=mem_hash[last_h].chunk[((addr.l&0xffff)>>3)+j];

  fetch_lo=(data->z.o.l>>3)&(bus_words-1);@+ fetch_hi=bus_words;

  fetch_lo=(data->z.o.l>>3)&(bus_words-1);@+ fetch_hi=bus_words;

  data->state=fetch_ready;

  data->state=fetch_ready;

  wait(mem_addr_time+mem_read_time);

  wait(mem_addr_time+mem_read_time);

@ @=

@ @=

 case IT_miss:@+if (ITcache->filler.next)

 case IT_miss:@+if (ITcache->filler.next)

     if (data->i==prego) goto fin_ex;@+else wait(1);

     if (data->i==prego) goto fin_ex;@+else wait(1);

   if (no_hardware_PT) @;

   if (no_hardware_PT) @;

   p=alloc_slot(ITcache,trans_key(data->y.o));

   p=alloc_slot(ITcache,trans_key(data->y.o));

   if (!p) /* hey, it was present after all */

   if (!p) /* hey, it was present after all */

     if (data->i==prego) goto fin_ex;@+else goto new_fetch;

     if (data->i==prego) goto fin_ex;@+else goto new_fetch;

   data->ptr_b=ITcache->filler_ctl.ptr_b=(void *)p;

   data->ptr_b=ITcache->filler_ctl.ptr_b=(void *)p;

   ITcache->filler_ctl.y.o=data->y.o;

   ITcache->filler_ctl.y.o=data->y.o;

   set_lock(self,ITcache->fill_lock);

   set_lock(self,ITcache->fill_lock);

   startup(&ITcache->filler,1);

   startup(&ITcache->filler,1);

   data->state=got_IT;

   data->state=got_IT;

   if (data->i==prego) goto fin_ex;@+else sleep;

   if (data->i==prego) goto fin_ex;@+else sleep;

 case got_IT: release_lock(self,ITcache->fill_lock);

 case got_IT: release_lock(self,ITcache->fill_lock);

   if (!(data->z.o.l&(PX_BIT>>PROT_OFFSET))) goto bad_fetch;

   if (!(data->z.o.l&(PX_BIT>>PROT_OFFSET))) goto bad_fetch;

   data->z.o=phys_addr(data->y.o,data->z.o);

   data->z.o=phys_addr(data->y.o,data->z.o);

 fetch_retry: data->state=IT_hit;

 fetch_retry: data->state=IT_hit;

 case IT_hit:@+if (data->i==prego) goto fin_ex;@+else goto known_phys;

 case IT_hit:@+if (data->i==prego) goto fin_ex;@+else goto known_phys;

 case Ihit_and_miss:

 case Ihit_and_miss:

    @z.o| in the I-cache@>;

    @z.o| in the I-cache@>;

@ @=

@ @=

case IT_miss: case Ihit_and_miss: case IT_hit: case fetch_ready: goto switch0;

case IT_miss: case Ihit_and_miss: case IT_hit: case fetch_ready: goto switch0;

@ @z.o| in the I-cache@>=

@ @z.o| in the I-cache@>=

if (Icache->filler.next) goto fetch_retry;

if (Icache->filler.next) goto fetch_retry;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto fetch_retry;

if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto fetch_retry;

q=alloc_slot(Icache,data->z.o);

q=alloc_slot(Icache,data->z.o);

if (!q) goto fetch_retry;

if (!q) goto fetch_retry;

if (Scache) set_lock(&Icache->filler,Scache->lock)@;

if (Scache) set_lock(&Icache->filler,Scache->lock)@;

else set_lock(&Icache->filler,mem_lock);

else set_lock(&Icache->filler,mem_lock);

set_lock(self,Icache->fill_lock);

set_lock(self,Icache->fill_lock);

data->ptr_b=Icache->filler_ctl.ptr_b=(void *)q;

data->ptr_b=Icache->filler_ctl.ptr_b=(void *)q;

Icache->filler_ctl.z.o=data->z.o;

Icache->filler_ctl.z.o=data->z.o;

startup(&Icache->filler,Scache? Scache->access_time: mem_addr_time);

startup(&Icache->filler,Scache? Scache->access_time: mem_addr_time);

data->state=got_one;

data->state=got_one;

if (data->i==prego) goto fin_ex;@+else sleep;

if (data->i==prego) goto fin_ex;@+else sleep;

@ The I-cache filler will wake us up with the octabyte we want, before

@ The I-cache filler will wake us up with the octabyte we want, before

it has filled the entire cache block. In that case we can fetch one

it has filled the entire cache block. In that case we can fetch one

or two instructions before the rest of the block has been loaded.

or two instructions before the rest of the block has been loaded.

@=

@=

bad_fetch:@+ if (data->i==prego) goto fin_ex;

bad_fetch:@+ if (data->i==prego) goto fin_ex;

  data->interrupt |= PX_BIT;

  data->interrupt |= PX_BIT;

swym_one: fetched[0].h=fetched[0].l=SWYM<<24;

swym_one: fetched[0].h=fetched[0].l=SWYM<<24;

  goto fetch_one;

  goto fetch_one;

case got_one: fetched[0]=data->x.o; /* a ``preview'' of the new cache data */

case got_one: fetched[0]=data->x.o; /* a ``preview'' of the new cache data */

fetch_one:  fetch_lo=0;@+fetch_hi=1;

fetch_one:  fetch_lo=0;@+fetch_hi=1;

  data->state=fetch_ready;

  data->state=fetch_ready;

case fetch_ready:@+if (self->lockloc)

case fetch_ready:@+if (self->lockloc)

    *(self->lockloc)=NULL, self->lockloc=NULL;

    *(self->lockloc)=NULL, self->lockloc=NULL;

  if (data->i==prego) goto fin_ex;

  if (data->i==prego) goto fin_ex;

  for (j=0;j

  for (j=0;j

    register fetch *new_tail;

    register fetch *new_tail;

    if (tail==fetch_bot) new_tail=fetch_top;

    if (tail==fetch_bot) new_tail=fetch_top;

    else new_tail=tail-1;

    else new_tail=tail-1;

    if (new_tail==head) break; /* fetch buffer is full */

    if (new_tail==head) break; /* fetch buffer is full */

@;

@;

    tail=new_tail;

    tail=new_tail;

    if (sleepy) {

    if (sleepy) {

      sleepy=false;@+ sleep;

      sleepy=false;@+ sleep;

    inst_ptr.o=incr(inst_ptr.o,4);

    inst_ptr.o=incr(inst_ptr.o,4);

    if (fetch_lo==fetch_hi) goto new_fetch;

    if (fetch_lo==fetch_hi) goto new_fetch;

  wait(1);

  wait(1);

@ @=

@ @=

  if (cache_search(ITcache,trans_key(inst_ptr.o))) goto new_fetch;

  if (cache_search(ITcache,trans_key(inst_ptr.o))) goto new_fetch;

  data->interrupt|=F_BIT;

  data->interrupt|=F_BIT;

  sleepy=true;

  sleepy=true;

  goto swym_one;

  goto swym_one;

@ @=

@ @=

bool sleepy; /* have we just emitted the page table emulation call? */

bool sleepy; /* have we just emitted the page table emulation call? */

@ At this point we check for egregiously invalid instructions. (Sometimes

@ At this point we check for egregiously invalid instructions. (Sometimes

the dispatcher will actually allow such instructions to occupy

the dispatcher will actually allow such instructions to occupy

the fetch buffer, for internally generated commands.)

the fetch buffer, for internally generated commands.)

@=

@=

tail->loc=inst_ptr.o;

tail->loc=inst_ptr.o;

if (inst_ptr.o.l&4) tail->inst=fetched[fetch_lo++].l;

if (inst_ptr.o.l&4) tail->inst=fetched[fetch_lo++].l;

else tail->inst=fetched[fetch_lo].h;

else tail->inst=fetched[fetch_lo].h;

@^big-endian versus little-endian@>

@^big-endian versus little-endian@>

@^little-endian versus big-endian@>

@^little-endian versus big-endian@>

tail->interrupt=data->interrupt;

tail->interrupt=data->interrupt;

i=tail->inst>>24;

i=tail->inst>>24;

if (i>=RESUME && i<=SYNC && (tail->inst&bad_inst_mask[i-RESUME]))

if (i>=RESUME && i<=SYNC && (tail->inst&bad_inst_mask[i-RESUME]))

  tail->interrupt |= B_BIT;

  tail->interrupt |= B_BIT;

tail->noted=false;

tail->noted=false;

if (inst_ptr.o.l==breakpoint.l && inst_ptr.o.h==breakpoint.h)

if (inst_ptr.o.l==breakpoint.l && inst_ptr.o.h==breakpoint.h)

  breakpoint_hit=true;

  breakpoint_hit=true;

@ The commands |RESUME|, |SAVE|, |UNSAVE|, and |SYNC| should not have

@ The commands |RESUME|, |SAVE|, |UNSAVE|, and |SYNC| should not have

nonzero bits in the positions defined here.

nonzero bits in the positions defined here.

@=

@=

int bad_inst_mask[4]={0xfffffe,0xffff,0xffff00,0xfffff8};

int bad_inst_mask[4]={0xfffffe,0xffff,0xffff00,0xfffff8};

@* Interrupts. The scariest thing about the design of a pipelined machine is

@* Interrupts. The scariest thing about the design of a pipelined machine is

the existence of interrupts, which disrupt the smooth flow of a computation in

the existence of interrupts, which disrupt the smooth flow of a computation in

ways that are difficult to anticipate. Fortunately, however, the discipline of

ways that are difficult to anticipate. Fortunately, however, the discipline of

a reorder buffer, which forces instructions to be committed in order,

a reorder buffer, which forces instructions to be committed in order,

allows us to deal with interrupts in a fairly natural way. Our solution to the

allows us to deal with interrupts in a fairly natural way. Our solution to the

problems of dynamic scheduling and speculative execution therefore solves the

problems of dynamic scheduling and speculative execution therefore solves the

interrupt problem as well.

interrupt problem as well.

@^interrupts@>

@^interrupts@>

\MMIX\ has three kinds of interrupts, which show up as bit codes in the

\MMIX\ has three kinds of interrupts, which show up as bit codes in the

|interrupt| field when an instruction is ready to be committed:

|interrupt| field when an instruction is ready to be committed:

|H_BIT| invokes a trip handler, for \.{TRIP} instructions and

|H_BIT| invokes a trip handler, for \.{TRIP} instructions and

arithmetic exceptions; |F_BIT| invokes a forced-trap handler, for \.{TRAP}

arithmetic exceptions; |F_BIT| invokes a forced-trap handler, for \.{TRAP}

instructions and unimplemented instructions that need to be emulated

instructions and unimplemented instructions that need to be emulated

in software; |E_BIT| invokes a dynamic-trap handler, for external

in software; |E_BIT| invokes a dynamic-trap handler, for external

interrupts like I/O signals or for internal interrupts caused by

interrupts like I/O signals or for internal interrupts caused by

improper instructions.

improper instructions.

In all three cases, the pipeline control has already been redirected to fetch

In all three cases, the pipeline control has already been redirected to fetch

new instructions starting at the correct handler address by the time an

new instructions starting at the correct handler address by the time an

interrupted instruction is ready to be committed.

interrupted instruction is ready to be committed.

@ Most instructions come to the following part of the program, if they

@ Most instructions come to the following part of the program, if they

have finished execution with any~1s among the eight trip bits or the

have finished execution with any~1s among the eight trip bits or the

eight trap bits.

eight trap bits.

If the trip bits aren't all zero, we want to update the event bits

If the trip bits aren't all zero, we want to update the event bits

of~rA, or perform an enabled trip handler, or both. If the trap bits

of~rA, or perform an enabled trip handler, or both. If the trap bits

are nonzero, we need to hold onto them until we get to the hot seat,

are nonzero, we need to hold onto them until we get to the hot seat,

when they will be joined with the bits of~rQ and probably cause an interrupt.

when they will be joined with the bits of~rQ and probably cause an interrupt.

A load or store instruction with nonzero trap bits will be nullified,

A load or store instruction with nonzero trap bits will be nullified,

not committed.

not committed.

Underflow that is exact and not enabled is ignored, in accordance with

Underflow that is exact and not enabled is ignored, in accordance with

the IEEE standard conventions. (This applies also to underflow

the IEEE standard conventions. (This applies also to underflow

triggered by |RESUME_SET|.)

triggered by |RESUME_SET|.)

@d is_load_store(i) (i>=ld && i<=cswap)

@d is_load_store(i) (i>=ld && i<=cswap)

@=

@=

  if ((data->interrupt&0xff) && is_load_store(data->i)) goto state_5;

  if ((data->interrupt&0xff) && is_load_store(data->i)) goto state_5;

  j=data->interrupt&0xff00;

  j=data->interrupt&0xff00;

  data->interrupt -= j;

  data->interrupt -= j;

  if ((j&(U_BIT+X_BIT))==U_BIT && !(data->ra.o.l & U_BIT)) j&=~U_BIT;

  if ((j&(U_BIT+X_BIT))==U_BIT && !(data->ra.o.l & U_BIT)) j&=~U_BIT;

  data->arith_exc=(j&~data->ra.o.l)>>8;

  data->arith_exc=(j&~data->ra.o.l)>>8;

  if (j&data->ra.o.l) @;

  if (j&data->ra.o.l) @;

  if (data->interrupt&0xff) goto state_5;

  if (data->interrupt&0xff) goto state_5;

@ Since execution is speculative, an exceptional condition might not

@ Since execution is speculative, an exceptional condition might not

be part of the ``real'' computation. Indeed, the present coroutine

be part of the ``real'' computation. Indeed, the present coroutine

might have already been deissued.

might have already been deissued.

@=

@=

  i=issued_between(data,cool);

  i=issued_between(data,cool);

  if (i

  if (i

  deissues=i;

  deissues=i;

  old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

  old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

@;

@;

  cool_hist=data->hist;

  cool_hist=data->hist;

  for (i=j&data->ra.o.l,m=16;!(i&D_BIT);i<<=1,m+=16);

  for (i=j&data->ra.o.l,m=16;!(i&D_BIT);i<<=1,m+=16);

  data->go.o.h=0, data->go.o.l=m;

  data->go.o.h=0, data->go.o.l=m;

  inst_ptr.o=data->go.o, inst_ptr.p=NULL;

  inst_ptr.o=data->go.o, inst_ptr.p=NULL;

  data->interrupt |= H_BIT;

  data->interrupt |= H_BIT;

  goto state_4;

  goto state_4;

@ @=

@ @=

i=issued_between(data,cool);

i=issued_between(data,cool);

if (i

if (i

deissues=i;

deissues=i;

old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

old_tail=tail=head;@+resuming=0; /* clear the fetch buffer */

@;

@;

cool_hist=data->hist;

cool_hist=data->hist;

inst_ptr.p=UNKNOWN_SPEC;

inst_ptr.p=UNKNOWN_SPEC;

data->interrupt |= F_BIT;

data->interrupt |= F_BIT;

@ We need to stop dispatching when calling a trip handler from within

@ We need to stop dispatching when calling a trip handler from within

the reorder buffer,

the reorder buffer,

lest we issue an instruction that uses

lest we issue an instruction that uses

|g[255]| or |rB| as an operand.

|g[255]| or |rB| as an operand.

@=

@=

emulate_virt: @;

emulate_virt: @;

state_4: data->state=4;

state_4: data->state=4;

case 4:@+if (dispatch_lock) wait(1);

case 4:@+if (dispatch_lock) wait(1);

  set_lock(self,dispatch_lock);

  set_lock(self,dispatch_lock);

state_5: data->state=5;

state_5: data->state=5;

case 5:@+if (data!=old_hot) wait(1);

case 5:@+if (data!=old_hot) wait(1);

  if ((data->interrupt&F_BIT) && data->i!=trap) {

  if ((data->interrupt&F_BIT) && data->i!=trap) {

    inst_ptr.o=g[rT].o, inst_ptr.p=NULL;

    inst_ptr.o=g[rT].o, inst_ptr.p=NULL;

    if (is_load_store(data->i)) nullifying=true;

    if (is_load_store(data->i)) nullifying=true;

  if (data->interrupt&0xff) {

  if (data->interrupt&0xff) {

    g[rQ].o.h |= data->interrupt&0xff;

    g[rQ].o.h |= data->interrupt&0xff;

    new_Q.h |= data->interrupt&0xff;

    new_Q.h |= data->interrupt&0xff;

    if (verbose&issue_bit) {

    if (verbose&issue_bit) {

      printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

      printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

  goto die;

  goto die;

@ The instructions of the previous section appear in the switch for

@ The instructions of the previous section appear in the switch for

coroutine stage~1 only. We need to use them also in later stages.

coroutine stage~1 only. We need to use them also in later stages.

@=

@=

case 4: goto state_4;

case 4: goto state_4;

case 5: goto state_5;

case 5: goto state_5;

@ @=

@ @=

case trap:@+ if ((flags[op]&X_is_dest_bit) &&

case trap:@+ if ((flags[op]&X_is_dest_bit) &&

                cool->xxxx>=cool_L)

                cool->xxxx>=cool_L)

    goto increase_L;

    goto increase_L;

  if (!g[rT].up->known || !g[rJ].up->known) goto stall;

  if (!g[rT].up->known || !g[rJ].up->known) goto stall;

  inst_ptr=specval(&g[rT]); /* traps and emulated ops */

  inst_ptr=specval(&g[rT]); /* traps and emulated ops */

  cool->need_b=true, cool->b=specval(&g[255]);

  cool->need_b=true, cool->b=specval(&g[255]);

case trip: if (!g[rJ].up->known) goto stall;

case trip: if (!g[rJ].up->known) goto stall;

  cool->ren_x=true, spec_install(&g[255],&cool->x);

  cool->ren_x=true, spec_install(&g[255],&cool->x);

  cool->x.known=true, cool->x.o=g[rJ].up->o;

  cool->x.known=true, cool->x.o=g[rJ].up->o;

  if (i==trip) cool->go.o=zero_octa;

  if (i==trip) cool->go.o=zero_octa;

  cool->ren_a=true, spec_install(&g[i==trap? rBB: rB],&cool->a);@+break;

  cool->ren_a=true, spec_install(&g[i==trap? rBB: rB],&cool->a);@+break;

@ @=

@ @=

case trap: data->interrupt |= F_BIT;@+ data->a.o=data->b.o;@+ goto fin_ex;

case trap: data->interrupt |= F_BIT;@+ data->a.o=data->b.o;@+ goto fin_ex;

case trip: data->interrupt |= H_BIT;@+ data->a.o=data->b.o;@+ goto fin_ex;

case trip: data->interrupt |= H_BIT;@+ data->a.o=data->b.o;@+ goto fin_ex;

@ The following check is performed at the beginning of every cycle.

@ The following check is performed at the beginning of every cycle.

An instruction in the hot seat can be externally interrupted only if

An instruction in the hot seat can be externally interrupted only if

it is ready to be committed and not already marked for tripping

it is ready to be committed and not already marked for tripping

or trapping.

or trapping.

@=

@=

g[rI].o=incr(g[rI].o,-1);

g[rI].o=incr(g[rI].o,-1);

if (g[rI].o.l==0 && g[rI].o.h==0) {

if (g[rI].o.l==0 && g[rI].o.h==0) {

  g[rQ].o.l |= INTERVAL_TIMEOUT, new_Q.l |= INTERVAL_TIMEOUT;

  g[rQ].o.l |= INTERVAL_TIMEOUT, new_Q.l |= INTERVAL_TIMEOUT;

    if (verbose&issue_bit) {

    if (verbose&issue_bit) {

      printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

      printf(" setting rQ=");@+print_octa(g[rQ].o);@+printf("\n");

trying_to_interrupt=false;

trying_to_interrupt=false;

if (((g[rQ].o.h&g[rK].o.h)||(g[rQ].o.l&g[rK].o.l)) && cool!=hot &&@|

if (((g[rQ].o.h&g[rK].o.h)||(g[rQ].o.l&g[rK].o.l)) && cool!=hot &&@|

     !(hot->interrupt&(E_BIT+F_BIT+H_BIT)) && !doing_interrupt &&@|

     !(hot->interrupt&(E_BIT+F_BIT+H_BIT)) && !doing_interrupt &&@|

     !(hot->i==resum)) {

     !(hot->i==resum)) {

  if (hot->owner) trying_to_interrupt=true;

  if (hot->owner) trying_to_interrupt=true;

  else {

  else {

    hot->interrupt |= E_BIT;

    hot->interrupt |= E_BIT;

@;

@;

    inst_ptr.o=g[rTT].o;@+inst_ptr.p=NULL;

    inst_ptr.o=g[rTT].o;@+inst_ptr.p=NULL;

@ @=

@ @=

bool trying_to_interrupt; /* encouraging interruptible operations to pause */

bool trying_to_interrupt; /* encouraging interruptible operations to pause */

bool nullifying; /* stopping dispatch to nullify a load/store command */

bool nullifying; /* stopping dispatch to nullify a load/store command */

@ It's possible that the command in the hot seat has been deissued,

@ It's possible that the command in the hot seat has been deissued,

but only if the simulator has done so at the user's request. Otherwise

but only if the simulator has done so at the user's request. Otherwise

the test `|i>=deissues|' here will always succeed.

the test `|i>=deissues|' here will always succeed.

The value of |cool_hist| becomes flaky here. We could try to keep it

The value of |cool_hist| becomes flaky here. We could try to keep it

strictly up to date, but the unpredictable nature of external interrupts

strictly up to date, but the unpredictable nature of external interrupts

suggests that we are better off leaving it alone. (It's only a heuristic

suggests that we are better off leaving it alone. (It's only a heuristic

for branch prediction, and a sufficiently strong prediction will survive

for branch prediction, and a sufficiently strong prediction will survive

one-time glitches due to interrupts.)

one-time glitches due to interrupts.)

@=

@=

i=issued_between(hot,cool);

i=issued_between(hot,cool);

if (i>=deissues) {

if (i>=deissues) {

  deissues=i;

  deissues=i;

  tail=head;@+resuming=0; /* clear the fetch buffer */

  tail=head;@+resuming=0; /* clear the fetch buffer */

@;

@;

  if (is_load_store(hot->i)) nullifying=true;

  if (is_load_store(hot->i)) nullifying=true;

@ Even though an interrupted instruction has officially been either

@ Even though an interrupted instruction has officially been either

``committed'' or ``nullified,'' it stays in the hot seat for

``committed'' or ``nullified,'' it stays in the hot seat for

two or three extra cycles,

two or three extra cycles,

while we save enough of the machine state to resume the computation later.

while we save enough of the machine state to resume the computation later.

%Notice, incidentally, that |H_BIT| and |E_BIT| might both be present

%Notice, incidentally, that |H_BIT| and |E_BIT| might both be present

%simultaneously. In such cases we first prepare for a trip handler, but

%simultaneously. In such cases we first prepare for a trip handler, but

%interrupt that for a dynamic trap handler. (Ah, the joys of computer

%interrupt that for a dynamic trap handler. (Ah, the joys of computer

%architecture.)

%architecture.)

@=

@=

  if (!(hot->interrupt&H_BIT)) g[rK].o=zero_octa; /* trap */

  if (!(hot->interrupt&H_BIT)) g[rK].o=zero_octa; /* trap */

  if (((hot->interrupt&H_BIT)&&hot->i!=trip) ||@|

  if (((hot->interrupt&H_BIT)&&hot->i!=trip) ||@|

      ((hot->interrupt&F_BIT)&&hot->i!=trap) ||@|

      ((hot->interrupt&F_BIT)&&hot->i!=trap) ||@|

      (hot->interrupt&E_BIT)) doing_interrupt=3, suppress_dispatch=true;

      (hot->interrupt&E_BIT)) doing_interrupt=3, suppress_dispatch=true;

  else doing_interrupt=2; /* trip or trap started by dispatcher */

  else doing_interrupt=2; /* trip or trap started by dispatcher */

  break;

  break;

@ If a memory failure occurs, we should set rF here, either in

@ If a memory failure occurs, we should set rF here, either in

case~2 or case~1. The simulator doesn't do anything with~rF at present.

case~2 or case~1. The simulator doesn't do anything with~rF at present.

@=

@=

switch (doing_interrupt--) {

switch (doing_interrupt--) {

 case 3: @;

 case 3: @;

  @+break;

  @+break;

 case 2: @;@+break;

 case 2: @;@+break;

 case 1: @;

 case 1: @;

  if (hot==reorder_bot) hot=reorder_top;@+ else hot--;

  if (hot==reorder_bot) hot=reorder_top;@+ else hot--;

  break;

  break;

@ @=

@ @=

j=hot->interrupt&H_BIT;

j=hot->interrupt&H_BIT;

g[j?rB:rBB].o=g[255].o;

g[j?rB:rBB].o=g[255].o;

g[255].o=g[rJ].o;

g[255].o=g[rJ].o;

if (verbose&issue_bit) {

if (verbose&issue_bit) {

  if (j) {

  if (j) {

    printf(" setting rB=");@+print_octa(g[rB].o);

    printf(" setting rB=");@+print_octa(g[rB].o);

  }@+else {

  }@+else {

    printf(" setting rBB=");@+print_octa(g[rBB].o);

    printf(" setting rBB=");@+print_octa(g[rBB].o);

  printf(", $255=");@+print_octa(g[255].o);@+printf("\n");

  printf(", $255=");@+print_octa(g[255].o);@+printf("\n");

@ Here's where we manufacture the ``ropcodes'' for resumption.

@ Here's where we manufacture the ``ropcodes'' for resumption.

@d RESUME_AGAIN 0 /* repeat the command in rX as if in location $\rm rW-4$ */

@d RESUME_AGAIN 0 /* repeat the command in rX as if in location $\rm rW-4$ */

@d RESUME_CONT 1 /* same, but substitute rY and rZ for operands */

@d RESUME_CONT 1 /* same, but substitute rY and rZ for operands */

@d RESUME_SET 2 /* set r[X] to rZ */

@d RESUME_SET 2 /* set r[X] to rZ */

@d RESUME_TRANS 3 /* install $\rm(rY,rZ)$ into IT-cache or DT-cache,

@d RESUME_TRANS 3 /* install $\rm(rY,rZ)$ into IT-cache or DT-cache,

        then |RESUME_AGAIN| */

        then |RESUME_AGAIN| */

@d pack_bytes(a,b,c,d) ((((((unsigned)(a)<<8)+(b))<<8)+(c))<<8)+(d)

@d pack_bytes(a,b,c,d) ((((((unsigned)(a)<<8)+(b))<<8)+(c))<<8)+(d)

@=

@=

j=pack_bytes(hot->op,hot->xx,hot->yy,hot->zz);

j=pack_bytes(hot->op,hot->xx,hot->yy,hot->zz);

if (hot->interrupt&H_BIT) { /* trip */

if (hot->interrupt&H_BIT) { /* trip */

  g[rW].o=incr(hot->loc,4);

  g[rW].o=incr(hot->loc,4);

  g[rX].o.h=sign_bit, g[rX].o.l=j;

  g[rX].o.h=sign_bit, g[rX].o.l=j;

  if (verbose&issue_bit) {

  if (verbose&issue_bit) {

    printf(" setting rW=");@+print_octa(g[rW].o);

    printf(" setting rW=");@+print_octa(g[rW].o);

    printf(", rX=");@+print_octa(g[rX].o);@+printf("\n");

    printf(", rX=");@+print_octa(g[rX].o);@+printf("\n");

}@+else { /* trap */

}@+else { /* trap */

  g[rWW].o=hot->go.o;

  g[rWW].o=hot->go.o;

  g[rXX].o.l=j;

  g[rXX].o.l=j;

  if (hot->interrupt&F_BIT) { /* forced */

  if (hot->interrupt&F_BIT) { /* forced */

    if (hot->i!=trap) j=RESUME_TRANS; /* emulate page translation */

    if (hot->i!=trap) j=RESUME_TRANS; /* emulate page translation */

    else if (hot->op==TRAP) j=0x80; /* |TRAP| */

    else if (hot->op==TRAP) j=0x80; /* |TRAP| */

    else if (flags[internal_op[hot->op]]&X_is_dest_bit)

    else if (flags[internal_op[hot->op]]&X_is_dest_bit)

      j=RESUME_SET; /* emulation */

      j=RESUME_SET; /* emulation */

    else j=0x80; /* emulation when r[X] is not a destination */

    else j=0x80; /* emulation when r[X] is not a destination */

  }@+else { /* dynamic */

  }@+else { /* dynamic */

    if (hot->interim)

    if (hot->interim)

      j=(hot->i==frem || hot->i==syncd || hot->i==syncid? RESUME_CONT:

      j=(hot->i==frem || hot->i==syncd || hot->i==syncid? RESUME_CONT:

             RESUME_AGAIN);

             RESUME_AGAIN);

    else if (is_load_store(hot->i)) j=RESUME_AGAIN;

    else if (is_load_store(hot->i)) j=RESUME_AGAIN;

    else j=0x80; /* normal external interruption */

    else j=0x80; /* normal external interruption */

  g[rXX].o.h=(j<<24)+(hot->interrupt&0xff);

  g[rXX].o.h=(j<<24)+(hot->interrupt&0xff);

  if (verbose&issue_bit) {

  if (verbose&issue_bit) {

    printf(" setting rWW=");@+print_octa(g[rWW].o);

    printf(" setting rWW=");@+print_octa(g[rWW].o);

    printf(", rXX=");@+print_octa(g[rXX].o);@+printf("\n");

    printf(", rXX=");@+print_octa(g[rXX].o);@+printf("\n");

@ @=

@ @=

j=hot->interrupt&H_BIT;

j=hot->interrupt&H_BIT;

if ((hot->interrupt&F_BIT) && hot->op==SWYM) g[rYY].o=hot->go.o;

if ((hot->interrupt&F_BIT) && hot->op==SWYM) g[rYY].o=hot->go.o;

else g[j?rY:rYY].o=hot->y.o;

else g[j?rY:rYY].o=hot->y.o;

if (hot->i==st || hot->i==pst) g[j?rZ:rZZ].o=hot->x.o;

if (hot->i==st || hot->i==pst) g[j?rZ:rZZ].o=hot->x.o;

else g[j?rZ:rZZ].o=hot->z.o;

else g[j?rZ:rZZ].o=hot->z.o;

if (verbose&issue_bit) {

if (verbose&issue_bit) {

  if (j) {

  if (j) {

    printf(" setting rY=");@+print_octa(g[rY].o);

    printf(" setting rY=");@+print_octa(g[rY].o);

    printf(", rZ=");@+print_octa(g[rZ].o);@+printf("\n");

    printf(", rZ=");@+print_octa(g[rZ].o);@+printf("\n");

  }@+else {

  }@+else {

    printf(" setting rYY=");@+print_octa(g[rYY].o);

    printf(" setting rYY=");@+print_octa(g[rYY].o);

    printf(", rZZ=");@+print_octa(g[rZZ].o);@+printf("\n");

    printf(", rZZ=");@+print_octa(g[rZZ].o);@+printf("\n");

@ Whew; we've successfully interrupted the computation. The remaining

@ Whew; we've successfully interrupted the computation. The remaining

task is to restart it again, as transparently as possible.

task is to restart it again, as transparently as possible.

The \.{RESUME} instruction waits for the pipeline to drain, because

The \.{RESUME} instruction waits for the pipeline to drain, because

it has to do such drastic things. For example, an interrupt may be

it has to do such drastic things. For example, an interrupt may be

occurring at this very moment, changing the registers needed for resumption.

occurring at this very moment, changing the registers needed for resumption.

@=

@=

case resume:@+ if (cool!=old_hot) goto stall;

case resume:@+ if (cool!=old_hot) goto stall;

  inst_ptr=specval(&g[cool->zz? rWW:rW]);

  inst_ptr=specval(&g[cool->zz? rWW:rW]);

  if (!(cool->loc.h&sign_bit)) {

  if (!(cool->loc.h&sign_bit)) {

    if (cool->zz) cool->interrupt |= K_BIT;

    if (cool->zz) cool->interrupt |= K_BIT;

    else if (inst_ptr.o.h&sign_bit) cool->interrupt |= P_BIT;

    else if (inst_ptr.o.h&sign_bit) cool->interrupt |= P_BIT;

  if (cool->interrupt) {

  if (cool->interrupt) {

    inst_ptr.o=incr(cool->loc,4);@+cool->i=noop;

    inst_ptr.o=incr(cool->loc,4);@+cool->i=noop;

  }@+ else {

  }@+ else {

    cool->go.o=inst_ptr.o;

    cool->go.o=inst_ptr.o;

    if (cool->zz) {

    if (cool->zz) {

      @loc| is rT@>;

      @loc| is rT@>;

      cool->ren_a=true, spec_install(&g[rK],&cool->a);

      cool->ren_a=true, spec_install(&g[rK],&cool->a);

      cool->a.known=true, cool->a.o=g[255].o;

      cool->a.known=true, cool->a.o=g[255].o;

      cool->ren_x=true, spec_install(&g[255],&cool->x);

      cool->ren_x=true, spec_install(&g[255],&cool->x);

      cool->x.known=true, cool->x.o=g[rBB].o;

      cool->x.known=true, cool->x.o=g[rBB].o;

    cool->b= specval(&g[cool->zz? rXX:rX]);

    cool->b= specval(&g[cool->zz? rXX:rX]);

    if (!(cool->b.o.h&sign_bit)) @;

    if (!(cool->b.o.h&sign_bit)) @;

  }@+break;

  }@+break;

@ Here we set |cool->i=resum|, since we want to issue another instruction

@ Here we set |cool->i=resum|, since we want to issue another instruction

after the \.{RESUME} itself.

after the \.{RESUME} itself.

The restrictions on inserted instructions are designed to ensure that

The restrictions on inserted instructions are designed to ensure that

those instructions will be the very next ones issued. (If, for example,

those instructions will be the very next ones issued. (If, for example,

an |incgamma| instruction were necessary, it might cause a page fault

an |incgamma| instruction were necessary, it might cause a page fault

and we'd lose the operand values for |RESUME_SET| or |RESUME_CONT|.)

and we'd lose the operand values for |RESUME_SET| or |RESUME_CONT|.)

A subtle point arises here: If |RESUME_TRANS| is being used to compute

A subtle point arises here: If |RESUME_TRANS| is being used to compute

the page translation of virtual address zero, we don't want to execute

the page translation of virtual address zero, we don't want to execute

the dummy \.{SWYM} instruction from virtual address $-4$! So we avoid

the dummy \.{SWYM} instruction from virtual address $-4$! So we avoid

the \.{SWYM} altogether.

the \.{SWYM} altogether.

@=

@=

  cool->xx=cool->b.o.h>>24, cool->i=resum;

  cool->xx=cool->b.o.h>>24, cool->i=resum;

  head->loc=incr(inst_ptr.o,-4);

  head->loc=incr(inst_ptr.o,-4);

  switch(cool->xx) {

  switch(cool->xx) {

 case RESUME_SET: cool->b.o.l=(SETH<<24)+(cool->b.o.l&0xff0000);

 case RESUME_SET: cool->b.o.l=(SETH<<24)+(cool->b.o.l&0xff0000);

  head->interrupt|=cool->b.o.h&0xff00;

  head->interrupt|=cool->b.o.h&0xff00;

  resuming=2;

  resuming=2;

 case RESUME_CONT: resuming+=1+cool->zz;

 case RESUME_CONT: resuming+=1+cool->zz;

  if (((cool->b.o.l>>24)&0xfa)!=0xb8) { /* not |syncd| or |syncid| */

  if (((cool->b.o.l>>24)&0xfa)!=0xb8) { /* not |syncd| or |syncid| */

    m=cool->b.o.l>>28;

    m=cool->b.o.l>>28;

    if ((1<

    if ((1<

    m=(cool->b.o.l>>16)&0xff;

    m=(cool->b.o.l>>16)&0xff;

    if (m>=cool_L && m

    if (m>=cool_L && m

 case RESUME_AGAIN: resume_again: head->inst=cool->b.o.l;

 case RESUME_AGAIN: resume_again: head->inst=cool->b.o.l;

  m=head->inst>>24;

  m=head->inst>>24;

  if (m==RESUME) goto bad_resume; /* avoid uninterruptible loop */

  if (m==RESUME) goto bad_resume; /* avoid uninterruptible loop */

  if (!cool->zz &&

  if (!cool->zz &&

    m>RESUME && m<=SYNC && (head->inst&bad_inst_mask[m-RESUME]))

    m>RESUME && m<=SYNC && (head->inst&bad_inst_mask[m-RESUME]))

      head->interrupt|=B_BIT;

      head->interrupt|=B_BIT;

  head->noted=false;@+break;

  head->noted=false;@+break;

 case RESUME_TRANS:@+if (cool->zz) {

 case RESUME_TRANS:@+if (cool->zz) {

    cool->y=specval(&g[rYY]), cool->z=specval(&g[rZZ]);

    cool->y=specval(&g[rYY]), cool->z=specval(&g[rZZ]);

    if ((cool->b.o.l>>24)!=SWYM) goto resume_again;

    if ((cool->b.o.l>>24)!=SWYM) goto resume_again;

    cool->i=resume;@+break; /* see ``subtle point'' above */

    cool->i=resume;@+break; /* see ``subtle point'' above */

 default: bad_resume: cool->interrupt |= B_BIT, cool->i=noop;

 default: bad_resume: cool->interrupt |= B_BIT, cool->i=noop;

  resuming=0;@+break;

  resuming=0;@+break;

@ @=

@ @=

  if (resuming&1) {

  if (resuming&1) {

    cool->y=specval(&g[rY]);

    cool->y=specval(&g[rY]);

    cool->z=specval(&g[rZ]);

    cool->z=specval(&g[rZ]);

  }@+else {

  }@+else {

    cool->y=specval(&g[rYY]);

    cool->y=specval(&g[rYY]);

    cool->z=specval(&g[rZZ]);

    cool->z=specval(&g[rZZ]);

  if (resuming>=3) { /* |RESUME_SET| */

  if (resuming>=3) { /* |RESUME_SET| */

    cool->need_ra=true, cool->ra=specval(&g[rA]);

    cool->need_ra=true, cool->ra=specval(&g[rA]);

  cool->usage=false;

  cool->usage=false;

@ @d do_resume_trans 17 /* |state| for performing |RESUME_TRANS| actions */

@ @d do_resume_trans 17 /* |state| for performing |RESUME_TRANS| actions */

@=

@=

case resume: case resum:@+if (data->xx!=RESUME_TRANS) goto fin_ex;

case resume: case resum:@+if (data->xx!=RESUME_TRANS) goto fin_ex;

 data->ptr_a=(void*)((data->b.o.l>>24)==SWYM? ITcache: DTcache);

 data->ptr_a=(void*)((data->b.o.l>>24)==SWYM? ITcache: DTcache);

 data->state=do_resume_trans;

 data->state=do_resume_trans;

 data->z.o=incr(oandn(data->z.o,page_mask),data->z.o.l&7);

 data->z.o=incr(oandn(data->z.o,page_mask),data->z.o.l&7);

 data->z.o.h &= 0xffff;

 data->z.o.h &= 0xffff;

 goto resume_trans;

 goto resume_trans;

@ @=

@ @=

case do_resume_trans: resume_trans: {@+register cache*c=(cache*)data->ptr_a;

case do_resume_trans: resume_trans: {@+register cache*c=(cache*)data->ptr_a;

   if (c->lock) wait(1);

   if (c->lock) wait(1);

   if (c->filler.next) wait(1);

   if (c->filler.next) wait(1);

   p=alloc_slot(c,trans_key(data->y.o));

   p=alloc_slot(c,trans_key(data->y.o));

   if (p) {

   if (p) {

     c->filler_ctl.ptr_b=(void*)p;

     c->filler_ctl.ptr_b=(void*)p;

     c->filler_ctl.y.o=data->y.o;

     c->filler_ctl.y.o=data->y.o;

     c->filler_ctl.b.o=data->z.o;

     c->filler_ctl.b.o=data->z.o;

     c->filler_ctl.state=1;

     c->filler_ctl.state=1;

     schedule(&c->filler,c->access_time,1);

     schedule(&c->filler,c->access_time,1);

   goto fin_ex;

   goto fin_ex;

@* Administrative operations.

@* Administrative operations.

The internal instructions that handle the register stack simply reduce

The internal instructions that handle the register stack simply reduce

to things we already know how to do. (Well, the internal instructions

to things we already know how to do. (Well, the internal instructions

for saving and unsaving do sometimes lead to special cases, based on

for saving and unsaving do sometimes lead to special cases, based on

|data->op|; for the most part, though, the necessary mechanisms are

|data->op|; for the most part, though, the necessary mechanisms are

already present.)

already present.)

@=

@=

case noop:@+if (data->interrupt&F_BIT) goto emulate_virt;

case noop:@+if (data->interrupt&F_BIT) goto emulate_virt;

case jmp: case pushj: case incrl: case unsave: goto fin_ex;

case jmp: case pushj: case incrl: case unsave: goto fin_ex;

case sav:@+if (!(data->mem_x)) goto fin_ex;

case sav:@+if (!(data->mem_x)) goto fin_ex;

case incgamma: case save: data->i=st; goto switch1;

case incgamma: case save: data->i=st; goto switch1;

case decgamma: case unsav: data->i=ld; goto switch1;

case decgamma: case unsav: data->i=ld; goto switch1;

@ We can \.{GET} special registers $\ge21$ (that is, rA, rF, rP, rW--rZ,

@ We can \.{GET} special registers $\ge21$ (that is, rA, rF, rP, rW--rZ,

or rWW--rZZ) only in the hot seat, because those registers are

or rWW--rZZ) only in the hot seat, because those registers are

implicit outputs of many instructions.

implicit outputs of many instructions.

The same applies to rK, since it is changed by \.{TRAP} and

The same applies to rK, since it is changed by \.{TRAP} and

by emulated instructions.

by emulated instructions.

@=

@=

case get:@+ if (data->zz>=21 || data->zz==rK) {

case get:@+ if (data->zz>=21 || data->zz==rK) {

   if (data!=old_hot) wait(1);

   if (data!=old_hot) wait(1);

   data->z.o=g[data->zz].o;

   data->z.o=g[data->zz].o;

 data->x.o=data->z.o;@+goto fin_ex;

 data->x.o=data->z.o;@+goto fin_ex;

@ A \.{PUT} is, similarly, delayed in the cases that hold |dispatch_lock|.

@ A \.{PUT} is, similarly, delayed in the cases that hold |dispatch_lock|.

This program does not restrict the 1~bits that might be

This program does not restrict the 1~bits that might be

\.{PUT} into~rQ, although the contents of that register can have

\.{PUT} into~rQ, although the contents of that register can have

drastic implications.

drastic implications.

@=

@=

case put:@+if (data->xx>=15 && data->xx<=20) {

case put:@+if (data->xx>=15 && data->xx<=20) {

   if (data!=old_hot) wait(1);

   if (data!=old_hot) wait(1);

   switch (data->xx) {

   switch (data->xx) {

  case rV: @;@+break;

  case rV: @;@+break;

  case rQ: new_Q.h |= data->z.o.h &~ g[rQ].o.h;@+

  case rQ: new_Q.h |= data->z.o.h &~ g[rQ].o.h;@+

           new_Q.l |= data->z.o.l &~ g[rQ].o.l;

           new_Q.l |= data->z.o.l &~ g[rQ].o.l;

           data->z.o.l |= new_Q.l;@+

           data->z.o.l |= new_Q.l;@+

           data->z.o.h |= new_Q.h;@+break;

           data->z.o.h |= new_Q.h;@+break;

  case rL:@+ if (data->z.o.h!=0) data->z.o.h=0, data->z.o.l=g[rL].o.l;

  case rL:@+ if (data->z.o.h!=0) data->z.o.h=0, data->z.o.l=g[rL].o.l;

     else if (data->z.o.l>g[rL].o.l) data->z.o.l=g[rL].o.l;

     else if (data->z.o.l>g[rL].o.l) data->z.o.l=g[rL].o.l;

  default: break;

  default: break;

  case rG: @;@+break;

  case rG: @;@+break;

 }@+else if (data->xx==rA && (data->z.o.h!=0 || data->z.o.l>=0x40000))

 }@+else if (data->xx==rA && (data->z.o.h!=0 || data->z.o.l>=0x40000))

   data->interrupt |= B_BIT;

   data->interrupt |= B_BIT;

 data->x.o=data->z.o;@+goto fin_ex;

 data->x.o=data->z.o;@+goto fin_ex;

@ When rG decreases, we assume that up to |commit_max| marginal registers can

@ When rG decreases, we assume that up to |commit_max| marginal registers can

be zeroed during each clock cycle. (Remember that we're currently in the hot

be zeroed during each clock cycle. (Remember that we're currently in the hot

seat, and holding |dispatch_lock|.)

seat, and holding |dispatch_lock|.)

@=

@=

if (data->z.o.h!=0 || data->z.o.l>=256 ||

if (data->z.o.h!=0 || data->z.o.l>=256 ||

      data->z.o.lz.o.l<32)

      data->z.o.lz.o.l<32)

  data->interrupt |= B_BIT;

  data->interrupt |= B_BIT;

else if (data->z.o.l

else if (data->z.o.l

    data->interim=true; /* potentially interruptible */

    data->interim=true; /* potentially interruptible */

    for (j=0;j

    for (j=0;j

      g[rG].o.l--;

      g[rG].o.l--;

      g[g[rG].o.l].o=zero_octa;

      g[g[rG].o.l].o=zero_octa;

      if (data->z.o.l==g[rG].o.l) break;

      if (data->z.o.l==g[rG].o.l) break;

    if (j==commit_max) {

    if (j==commit_max) {

      if (!trying_to_interrupt) wait(1);

      if (!trying_to_interrupt) wait(1);

    }@+else data->interim=false;

    }@+else data->interim=false;

@ Computed jumps put the desired destination address into the |go| field.

@ Computed jumps put the desired destination address into the |go| field.

@=

@=

case go: data->x.o=data->go.o;@+ goto add_go;

case go: data->x.o=data->go.o;@+ goto add_go;

case pop: data->x.o=data->y.o; data->y.o=data->b.o; /* move rJ to |y| field */

case pop: data->x.o=data->y.o; data->y.o=data->b.o; /* move rJ to |y| field */

case pushgo: add_go: data->go.o=oplus(data->y.o,data->z.o);

case pushgo: add_go: data->go.o=oplus(data->y.o,data->z.o);

  if ((data->go.o.h&sign_bit) && !(data->loc.h&sign_bit))

  if ((data->go.o.h&sign_bit) && !(data->loc.h&sign_bit))

    data->interrupt |= P_BIT;

    data->interrupt |= P_BIT;

  data->go.known=true;@+goto fin_ex;

  data->go.known=true;@+goto fin_ex;

@ The instruction \.{UNSAVE} $z$ generates a sequence of internal instructions

@ The instruction \.{UNSAVE} $z$ generates a sequence of internal instructions

that accomplish the actual unsaving. This sequence is controlled by the

that accomplish the actual unsaving. This sequence is controlled by the

instruction currently in the fetch buffer, which changes its X and~Y fields

instruction currently in the fetch buffer, which changes its X and~Y fields

until all global registers have been loaded. The first instructions of the

until all global registers have been loaded. The first instructions of the

sequence are \.{UNSAVE}~$0,0,z$; \.{UNSAVE}~$1,rZ,z-8$;

sequence are \.{UNSAVE}~$0,0,z$; \.{UNSAVE}~$1,rZ,z-8$;

\.{UNSAVE}~$1,rY,z-16$; \dots;

\.{UNSAVE}~$1,rY,z-16$; \dots;

\.{UNSAVE}~$1,rB,z-96$; \.{UNSAVE}~$2,255,z-104$; \.{UNSAVE}~$2,254,z-112$;

\.{UNSAVE}~$1,rB,z-96$; \.{UNSAVE}~$2,255,z-104$; \.{UNSAVE}~$2,254,z-112$;

etc. If an interrupt occurs before these instructions have all been committed,

etc. If an interrupt occurs before these instructions have all been committed,

the execution register will contain enough information to restart the process.

the execution register will contain enough information to restart the process.

After the global registers have all been loaded, \.{UNSAVE} continues by

After the global registers have all been loaded, \.{UNSAVE} continues by

acting rather like~\.{POP}. An interrupt occurring during this last stage

acting rather like~\.{POP}. An interrupt occurring during this last stage

will find $\rm rS

will find $\rm rS

restoring the local registers again. But no information will be lost,

restoring the local registers again. But no information will be lost,

even though the register from which we began unsaving has long since

even though the register from which we began unsaving has long since

been replaced.

been replaced.

@=

@=

case unsave:@+if (cool->interrupt&B_BIT) cool->i=noop;

case unsave:@+if (cool->interrupt&B_BIT) cool->i=noop;

 else {

 else {

   cool->interim=true;

   cool->interim=true;

   op=LDOU; /* this instruction needs to be handled by load/store unit */

   op=LDOU; /* this instruction needs to be handled by load/store unit */

   cool->i=unsav;

   cool->i=unsav;

   switch(cool->xx) {

   switch(cool->xx) {

 case 0:@+ if (cool->z.p) goto stall;

 case 0:@+ if (cool->z.p) goto stall;

  @;@+break;

  @;@+break;

 case 1: case 2: @;@+break;

 case 1: case 2: @;@+break;

 case 3: cool->i=unsave, cool->interim=false, op=UNSAVE;

 case 3: cool->i=unsave, cool->interim=false, op=UNSAVE;

   goto pop_unsave;

   goto pop_unsave;

 default: cool->interim=false,cool->i=noop,cool->interrupt|=B_BIT;@+break;

 default: cool->interim=false,cool->i=noop,cool->interrupt|=B_BIT;@+break;

break; /* this takes us to |dispatch_done| */

break; /* this takes us to |dispatch_done| */

@ @=

@ @=

cool->ren_x=true, spec_install(&g[cool->yy],&cool->x);

cool->ren_x=true, spec_install(&g[cool->yy],&cool->x);

new_O=new_S=incr(cool_O,-1);

new_O=new_S=incr(cool_O,-1);

cool->z.o=shift_left(new_O,3);

cool->z.o=shift_left(new_O,3);

cool->ptr_a=(void*)mem.up;

cool->ptr_a=(void*)mem.up;

@ @=

@ @=

cool->ren_x=true, spec_install(&g[rG],&cool->x);

cool->ren_x=true, spec_install(&g[rG],&cool->x);

cool->ren_a=true, spec_install(&g[rA],&cool->a);

cool->ren_a=true, spec_install(&g[rA],&cool->a);

new_O=new_S=shift_right(cool->z.o,3,1);

new_O=new_S=shift_right(cool->z.o,3,1);

cool->set_l=true, spec_install(&g[rL],&cool->rl);

cool->set_l=true, spec_install(&g[rL],&cool->rl);

cool->ptr_a=(void*)mem.up;

cool->ptr_a=(void*)mem.up;

@ @=

@ @=

switch (cool->xx) {

switch (cool->xx) {

 case 0: head->inst=pack_bytes(UNSAVE,1,rZ,0);@+ break;

 case 0: head->inst=pack_bytes(UNSAVE,1,rZ,0);@+ break;

 case 1:@+ if (cool->yy==rP) head->inst=pack_bytes(UNSAVE,1,rR,0);

 case 1:@+ if (cool->yy==rP) head->inst=pack_bytes(UNSAVE,1,rR,0);

  else if (cool->yy==0) head->inst=pack_bytes(UNSAVE,2,255,0);

  else if (cool->yy==0) head->inst=pack_bytes(UNSAVE,2,255,0);

  else head->inst=pack_bytes(UNSAVE,1,cool->yy-1,0);@+ break;

  else head->inst=pack_bytes(UNSAVE,1,cool->yy-1,0);@+ break;

 case 2:@+ if (cool->yy==cool_G) head->inst=pack_bytes(UNSAVE,3,0,0);

 case 2:@+ if (cool->yy==cool_G) head->inst=pack_bytes(UNSAVE,3,0,0);

  else head->inst=pack_bytes(UNSAVE,2,cool->yy-1,0);@+ break;

  else head->inst=pack_bytes(UNSAVE,2,cool->yy-1,0);@+ break;

@ @=

@ @=

if (data->xx==0) {

if (data->xx==0) {

  data->a.o=data->x.o;@+data->a.o.h &=0xffffff; /* unsaved rA */

  data->a.o=data->x.o;@+data->a.o.h &=0xffffff; /* unsaved rA */

  data->x.o.l=data->x.o.h>>24;@+data->x.o.h=0; /* unsaved rG */

  data->x.o.l=data->x.o.h>>24;@+data->x.o.h=0; /* unsaved rG */

  if (data->a.o.h || (data->a.o.l&0xfffc0000)) {

  if (data->a.o.h || (data->a.o.l&0xfffc0000)) {

    data->a.o.h=0, data->a.o.l&=0x3ffff;@+ data->interrupt |= B_BIT;

    data->a.o.h=0, data->a.o.l&=0x3ffff;@+ data->interrupt |= B_BIT;

  if (data->x.o.l<32) {

  if (data->x.o.l<32) {

    data->x.o.l=32;@+ data->interrupt |= B_BIT;

    data->x.o.l=32;@+ data->interrupt |= B_BIT;

goto fin_ex;

goto fin_ex;

@ Of course \.{SAVE} is handled essentially like \.{UNSAVE}, but backwards.

@ Of course \.{SAVE} is handled essentially like \.{UNSAVE}, but backwards.

@=

@=

case save:@+if (cool->xxinterrupt|=B_BIT;

case save:@+if (cool->xxinterrupt|=B_BIT;

 if (cool->interrupt&B_BIT) cool->i=noop;

 if (cool->interrupt&B_BIT) cool->i=noop;

 else if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

 else if (((cool_S.l-cool_O.l-cool_L-1)&lring_mask)==0)

@@;

@@;

 else {

 else {

   cool->interim=true;

   cool->interim=true;

   cool->i=sav;

   cool->i=sav;

   switch(cool->zz) {

   switch(cool->zz) {

 case 0: @;@+break;

 case 0: @;@+break;

 case 1:@+if (cool_O.l!=cool_S.l) @@;

 case 1:@+if (cool_O.l!=cool_S.l) @@;

   cool->zz=2;@+ cool->yy=cool_G;

   cool->zz=2;@+ cool->yy=cool_G;

 case 2: case 3: @;@+break;

 case 2: case 3: @;@+break;

 default: cool->interim=false,cool->i=noop,cool->interrupt|=B_BIT;@+break;

 default: cool->interim=false,cool->i=noop,cool->interrupt|=B_BIT;@+break;

break;

break;

@ If an interrupt occurs during the first phase, say between two |incgamma|

@ If an interrupt occurs during the first phase, say between two |incgamma|

instructions, the value |cool->zz=1| will get things restarted properly.

instructions, the value |cool->zz=1| will get things restarted properly.

(Indeed, if context is saved and unsaved during the interrupt, many

(Indeed, if context is saved and unsaved during the interrupt, many

|incgamma| instructions may no longer be necessary.)

|incgamma| instructions may no longer be necessary.)

@=

@=

cool->zz=1;

cool->zz=1;

cool->ren_x=true, spec_install(&l[(cool_O.l+cool_L)&lring_mask],&cool->x);

cool->ren_x=true, spec_install(&l[(cool_O.l+cool_L)&lring_mask],&cool->x);

cool->x.known=true, cool->x.o.h=0, cool->x.o.l=cool_L;

cool->x.known=true, cool->x.o.h=0, cool->x.o.l=cool_L;

cool->set_l=true, spec_install(&g[rL],&cool->rl);

cool->set_l=true, spec_install(&g[rL],&cool->rl);

new_O=incr(cool_O,cool_L+1);

new_O=incr(cool_O,cool_L+1);

@ @=

@ @=

op=STOU; /* this instruction needs to be handled by load/store unit */

op=STOU; /* this instruction needs to be handled by load/store unit */

cool->mem_x=true, spec_install(&mem,&cool->x);

cool->mem_x=true, spec_install(&mem,&cool->x);

cool->z.o=shift_left(cool_O,3);

cool->z.o=shift_left(cool_O,3);

new_O=new_S=incr(cool_O,1);

new_O=new_S=incr(cool_O,1);

if (cool->zz==3 && cool->yy>rZ) @@;

if (cool->zz==3 && cool->yy>rZ) @@;

else cool->b=specval(&g[cool->yy]);

else cool->b=specval(&g[cool->yy]);

@ The final \.{SAVE} instruction not only stores rG and rA, it also

@ The final \.{SAVE} instruction not only stores rG and rA, it also

places the final address in global register~X.

places the final address in global register~X.

@=

@=

  cool->i=save;

  cool->i=save;

  cool->interim=false;

  cool->interim=false;

  cool->ren_a=true, spec_install(&g[cool->xx],&cool->a);

  cool->ren_a=true, spec_install(&g[cool->xx],&cool->a);

@ @=

@ @=

switch (cool->zz) {

switch (cool->zz) {

 case 1: head->inst=pack_bytes(SAVE,cool->xx,0,1);@+ break;

 case 1: head->inst=pack_bytes(SAVE,cool->xx,0,1);@+ break;

 case 2:@+ if (cool->yy==255) head->inst=pack_bytes(SAVE,cool->xx,0,3);

 case 2:@+ if (cool->yy==255) head->inst=pack_bytes(SAVE,cool->xx,0,3);

  else head->inst=pack_bytes(SAVE,cool->xx,cool->yy+1,2);@+break;

  else head->inst=pack_bytes(SAVE,cool->xx,cool->yy+1,2);@+break;

 case 3:@+ if (cool->yy==rR) head->inst=pack_bytes(SAVE,cool->xx,rP,3);

 case 3:@+ if (cool->yy==rR) head->inst=pack_bytes(SAVE,cool->xx,rP,3);

  else head->inst=pack_bytes(SAVE,cool->xx,cool->yy+1,3);@+break;

  else head->inst=pack_bytes(SAVE,cool->xx,cool->yy+1,3);@+break;

@ @=

@ @=

  if (data->interim) data->x.o=data->b.o;

  if (data->interim) data->x.o=data->b.o;

  else {

  else {

    if (data!=old_hot) wait(1); /* we need the hottest value of rA */

    if (data!=old_hot) wait(1); /* we need the hottest value of rA */

    data->x.o.h=g[rG].o.l<<24;

    data->x.o.h=g[rG].o.l<<24;

    data->x.o.l=g[rA].o.l;

    data->x.o.l=g[rA].o.l;

    data->a.o=data->y.o;

    data->a.o=data->y.o;

  goto fin_ex;

  goto fin_ex;

@* More register-to-register ops.

@* More register-to-register ops.

Now that we've finished most of the hard stuff,

Now that we've finished most of the hard stuff,

we can relax and fill in the holes that we left in the

we can relax and fill in the holes that we left in the

all-register parts of the execution stages.

all-register parts of the execution stages.

First let's complete the fixed point arithmetic operations,

First let's complete the fixed point arithmetic operations,

by dispensing with multiplication and division.

by dispensing with multiplication and division.

@=

@=

case mulu: data->x.o=omult(data->y.o,data->z.o);

case mulu: data->x.o=omult(data->y.o,data->z.o);

  data->a.o=aux;

  data->a.o=aux;

  goto quantify_mul;

  goto quantify_mul;

case mul: data->x.o=signed_omult(data->y.o,data->z.o);

case mul: data->x.o=signed_omult(data->y.o,data->z.o);

  if (overflow) data->interrupt |= V_BIT;

  if (overflow) data->interrupt |= V_BIT;

quantify_mul: aux=data->z.o;

quantify_mul: aux=data->z.o;

  for (j=mul0;aux.l||aux.h;j++) aux=shift_right(aux,8,1);

  for (j=mul0;aux.l||aux.h;j++) aux=shift_right(aux,8,1);

  data->i=j;@+break; /* |j| is |mul0| or |mul1| or \dots~or |mul8| */

  data->i=j;@+break; /* |j| is |mul0| or |mul1| or \dots~or |mul8| */

case divu: data->x.o=odiv(data->b.o,data->y.o,data->z.o);

case divu: data->x.o=odiv(data->b.o,data->y.o,data->z.o);

  data->a.o=aux;@+data->i=div;@+break;

  data->a.o=aux;@+data->i=div;@+break;

case div:@+ if (data->z.o.l==0 && data->z.o.h==0) {

case div:@+ if (data->z.o.l==0 && data->z.o.h==0) {

    data->interrupt |= D_BIT;@+ data->a.o=data->y.o;

    data->interrupt |= D_BIT;@+ data->a.o=data->y.o;

    data->i=set; /* divide by zero needn't wait in the pipeline */

    data->i=set; /* divide by zero needn't wait in the pipeline */

  }@+else {

  }@+else {

    data->x.o=signed_odiv(data->y.o,data->z.o);

    data->x.o=signed_odiv(data->y.o,data->z.o);

    if (overflow) data->interrupt |= V_BIT;

    if (overflow) data->interrupt |= V_BIT;

    data->a.o=aux;

    data->a.o=aux;

  }@+break;

  }@+break;

@ Next let's polish off the bitwise and bytewise operations.

@ Next let's polish off the bitwise and bytewise operations.

@=

@=

case sadd: data->x.o.l=count_bits(data->y.o.h&~data->z.o.h)

case sadd: data->x.o.l=count_bits(data->y.o.h&~data->z.o.h)

                      +count_bits(data->y.o.l&~data->z.o.l);@+ break;

                      +count_bits(data->y.o.l&~data->z.o.l);@+ break;

case mor: data->x.o=bool_mult(data->y.o,data->z.o,data->op&0x2);@+ break;

case mor: data->x.o=bool_mult(data->y.o,data->z.o,data->op&0x2);@+ break;

case bdif: data->x.o.h=byte_diff(data->y.o.h,data->z.o.h);

case bdif: data->x.o.h=byte_diff(data->y.o.h,data->z.o.h);

           data->x.o.l=byte_diff(data->y.o.l,data->z.o.l);@+ break;

           data->x.o.l=byte_diff(data->y.o.l,data->z.o.l);@+ break;

case wdif: data->x.o.h=wyde_diff(data->y.o.h,data->z.o.h);

case wdif: data->x.o.h=wyde_diff(data->y.o.h,data->z.o.h);

           data->x.o.l=wyde_diff(data->y.o.l,data->z.o.l);@+ break;

           data->x.o.l=wyde_diff(data->y.o.l,data->z.o.l);@+ break;

case tdif:@+ if (data->y.o.h>data->z.o.h)

case tdif:@+ if (data->y.o.h>data->z.o.h)

             data->x.o.h=data->y.o.h-data->z.o.h;

             data->x.o.h=data->y.o.h-data->z.o.h;

 tdif_l:@+ if (data->y.o.l>data->z.o.l)

 tdif_l:@+ if (data->y.o.l>data->z.o.l)

             data->x.o.l=data->y.o.l-data->z.o.l;@+ break;

             data->x.o.l=data->y.o.l-data->z.o.l;@+ break;

case odif:@+ if (data->y.o.h>data->z.o.h)

case odif:@+ if (data->y.o.h>data->z.o.h)

    data->x.o=ominus(data->y.o,data->z.o);

    data->x.o=ominus(data->y.o,data->z.o);

  else if (data->y.o.h==data->z.o.h) goto tdif_l;

  else if (data->y.o.h==data->z.o.h) goto tdif_l;

  break;

  break;

@ The conditional set (\.{CS}) instructions are, rather surprisingly,

@ The conditional set (\.{CS}) instructions are, rather surprisingly,

more difficult to implement than the zero~set (\.{ZS}) instructions,

more difficult to implement than the zero~set (\.{ZS}) instructions,

although the \.{ZS} instructions do more. The reason is that dynamic

although the \.{ZS} instructions do more. The reason is that dynamic

instruction dependencies are more complicated with \.{CS}. Consider, for

instruction dependencies are more complicated with \.{CS}. Consider, for

example, the instructions

example, the instructions

$$\advance\abovedisplayskip-.5\baselineskip

$$\advance\abovedisplayskip-.5\baselineskip

  \advance\belowdisplayskip-.5\baselineskip

  \advance\belowdisplayskip-.5\baselineskip

\hbox{\tt LDO x,a,b; \ FDIV y,c,d; \ CSZ y,x,0; \ INCL y,1.}$$

\hbox{\tt LDO x,a,b; \ FDIV y,c,d; \ CSZ y,x,0; \ INCL y,1.}$$

If the value of \.x is zero, the \.{INCL} instruction need not wait for the

If the value of \.x is zero, the \.{INCL} instruction need not wait for the

division to be completed. (We do not, however, abort the division in such a

division to be completed. (We do not, however, abort the division in such a

case; it might invoke a trip handler, or change the inexact bit, etc. Our

case; it might invoke a trip handler, or change the inexact bit, etc. Our

policy is to treat common cases efficiently and to treat all cases correctly,

policy is to treat common cases efficiently and to treat all cases correctly,

but not to treat all cases with maximum efficiency.)

but not to treat all cases with maximum efficiency.)

@=

@=

case zset:@+if (register_truth(data->y.o,data->op)) data->x.o=data->z.o;

case zset:@+if (register_truth(data->y.o,data->op)) data->x.o=data->z.o;

  /* otherwise |data->x.o| is already zero */

  /* otherwise |data->x.o| is already zero */

  goto fin_ex;

  goto fin_ex;

case cset:@+if (register_truth(data->y.o,data->op))

case cset:@+if (register_truth(data->y.o,data->op))

    data->x.o=data->z.o, data->b.p=NULL;

    data->x.o=data->z.o, data->b.p=NULL;

  else if (data->b.p==NULL) data->x.o=data->b.o;

  else if (data->b.p==NULL) data->x.o=data->b.o;

  else {

  else {

    data->state=0;@+data->need_b=true;@+goto switch1;

    data->state=0;@+data->need_b=true;@+goto switch1;

  }@+break;

  }@+break;

@ Floating point computations are mostly handled by the routines in

@ Floating point computations are mostly handled by the routines in

{\mc MMIX-ARITH}, which record anomalous events in the global

{\mc MMIX-ARITH}, which record anomalous events in the global

variable |exceptions|. But we consider the operation trivial if an

variable |exceptions|. But we consider the operation trivial if an

input is infinite or NaN; and we may need to increase the execution

input is infinite or NaN; and we may need to increase the execution

time when subnormals are present.

time when subnormals are present.

@d ROUND_OFF 1

@d ROUND_OFF 1

@d ROUND_UP 2

@d ROUND_UP 2

@d ROUND_DOWN 3

@d ROUND_DOWN 3

@d ROUND_NEAR 4

@d ROUND_NEAR 4

@d is_subnormal(x) ((x.h&0x7ff00000)==0 && ((x.h&0xfffff) || x.l))

@d is_subnormal(x) ((x.h&0x7ff00000)==0 && ((x.h&0xfffff) || x.l))

@d is_trivial(x) ((x.h&0x7ff00000)==0x7ff00000)

@d is_trivial(x) ((x.h&0x7ff00000)==0x7ff00000)

@d set_round cur_round=(data->ra.o.l<0x10000? ROUND_NEAR: data->ra.o.l>>16)

@d set_round cur_round=(data->ra.o.l<0x10000? ROUND_NEAR: data->ra.o.l>>16)

@=

@=

case fadd: set_round;@+data->x.o=fplus(data->y.o,data->z.o);

case fadd: set_round;@+data->x.o=fplus(data->y.o,data->z.o);

 fin_bflot:@+ if (is_subnormal(data->y.o)) data->denin=denin_penalty;

 fin_bflot:@+ if (is_subnormal(data->y.o)) data->denin=denin_penalty;

 fin_uflot:@+ if (is_subnormal(data->x.o)) data->denout=denout_penalty;

 fin_uflot:@+ if (is_subnormal(data->x.o)) data->denout=denout_penalty;

 fin_flot:@+ if (is_subnormal(data->z.o)) data->denin=denin_penalty;

 fin_flot:@+ if (is_subnormal(data->z.o)) data->denin=denin_penalty;

   data->interrupt|=exceptions;

   data->interrupt|=exceptions;

   if (is_trivial(data->y.o) || is_trivial(data->z.o)) goto fin_ex;

   if (is_trivial(data->y.o) || is_trivial(data->z.o)) goto fin_ex;

   if (data->i==fsqrt && (data->z.o.h&sign_bit)) goto fin_ex;

   if (data->i==fsqrt && (data->z.o.h&sign_bit)) goto fin_ex;

   break;

   break;

case fsub: data->a.o=data->z.o;

case fsub: data->a.o=data->z.o;

  if (fcomp(data->z.o,zero_octa)!=2) data->a.o.h ^= sign_bit;

  if (fcomp(data->z.o,zero_octa)!=2) data->a.o.h ^= sign_bit;

  set_round;@+data->x.o=fplus(data->y.o,data->a.o);

  set_round;@+data->x.o=fplus(data->y.o,data->a.o);

  data->i=fadd; /* use pipeline times for addition */

  data->i=fadd; /* use pipeline times for addition */

  goto fin_bflot;

  goto fin_bflot;

case fmul: set_round;@+ data->x.o=fmult(data->y.o,data->z.o);@+ goto fin_bflot;

case fmul: set_round;@+ data->x.o=fmult(data->y.o,data->z.o);@+ goto fin_bflot;

case fdiv: set_round;@+ data->x.o=fdivide(data->y.o,data->z.o);@+

case fdiv: set_round;@+ data->x.o=fdivide(data->y.o,data->z.o);@+

  goto fin_bflot;

  goto fin_bflot;

case fsqrt: set_round;@+ data->x.o=froot(data->z.o,data->y.o.l);@+

case fsqrt: set_round;@+ data->x.o=froot(data->z.o,data->y.o.l);@+

  goto fin_uflot;

  goto fin_uflot;

case fint: set_round;@+ data->x.o=fintegerize(data->z.o,data->y.o.l);@+

case fint: set_round;@+ data->x.o=fintegerize(data->z.o,data->y.o.l);@+

  goto fin_uflot;

  goto fin_uflot;

case fix: set_round;@+ data->x.o=fixit(data->z.o,data->y.o.l);

case fix: set_round;@+ data->x.o=fixit(data->z.o,data->y.o.l);

  if (data->op&0x2) exceptions&=~W_BIT; /* unsigned case doesn't overflow */

  if (data->op&0x2) exceptions&=~W_BIT; /* unsigned case doesn't overflow */

  goto fin_flot;

  goto fin_flot;

case flot: set_round;@+

case flot: set_round;@+

  data->x.o=floatit(data->z.o,data->y.o.l,data->op&0x2, data->op&0x4);

  data->x.o=floatit(data->z.o,data->y.o.l,data->op&0x2, data->op&0x4);

  data->interrupt|=exceptions;@+break;

  data->interrupt|=exceptions;@+break;

@ @=

@ @=

case fsqrt: case fint: case fix: case flot:@+ if (cool->y.o.l>4)

case fsqrt: case fint: case fix: case flot:@+ if (cool->y.o.l>4)

    goto illegal_inst;

    goto illegal_inst;

  break;

  break;

@ @=

@ @=

case feps: j=fepscomp(data->y.o,data->z.o,data->b.o,data->op!=FEQLE);

case feps: j=fepscomp(data->y.o,data->z.o,data->b.o,data->op!=FEQLE);

  if (j==2) data->i=fcmp;

  if (j==2) data->i=fcmp;

  else if (is_subnormal(data->y.o) || is_subnormal(data->z.o))

  else if (is_subnormal(data->y.o) || is_subnormal(data->z.o))

    data->denin=denin_penalty;

    data->denin=denin_penalty;

  switch (data->op) {

  switch (data->op) {

 case FUNE:@+ if (j==2) goto cmp_pos;@+ else goto cmp_zero;

 case FUNE:@+ if (j==2) goto cmp_pos;@+ else goto cmp_zero;

 case FEQLE: goto cmp_fin;

 case FEQLE: goto cmp_fin;

 case FCMPE:@+ if (j) goto cmp_zero_or_invalid;

 case FCMPE:@+ if (j) goto cmp_zero_or_invalid;

case fcmp: j=fcomp(data->y.o,data->z.o);

case fcmp: j=fcomp(data->y.o,data->z.o);

  if (j<0) goto cmp_neg;

  if (j<0) goto cmp_neg;

 cmp_fin:@+ if (j==1) goto cmp_pos;

 cmp_fin:@+ if (j==1) goto cmp_pos;

 cmp_zero_or_invalid:@+ if (j==2) data->interrupt |= I_BIT;

 cmp_zero_or_invalid:@+ if (j==2) data->interrupt |= I_BIT;

  goto cmp_zero;

  goto cmp_zero;

case funeq:@+ if (fcomp(data->y.o,data->z.o)==(data->op==FUN? 2:0))

case funeq:@+ if (fcomp(data->y.o,data->z.o)==(data->op==FUN? 2:0))

    goto cmp_pos;

    goto cmp_pos;

  else goto cmp_zero;

  else goto cmp_zero;

@ @=

@ @=

Extern int frem_max;

Extern int frem_max;

Extern int denin_penalty, denout_penalty;

Extern int denin_penalty, denout_penalty;

@ The floating point remainder operation is especially interesting

@ The floating point remainder operation is especially interesting

because it can be interrupted when it's in the hot seat.

because it can be interrupted when it's in the hot seat.

@=

@=

case frem:@+if(is_trivial(data->y.o) || is_trivial(data->z.o))

case frem:@+if(is_trivial(data->y.o) || is_trivial(data->z.o))

      data->x.o=fremstep(data->y.o,data->z.o,2500);@+ goto fin_ex;

      data->x.o=fremstep(data->y.o,data->z.o,2500);@+ goto fin_ex;

  if ((self+1)->next) wait(1);

  if ((self+1)->next) wait(1);

  data->interim=true;

  data->interim=true;

  j=1;

  j=1;

  if (is_subnormal(data->y.o)||is_subnormal(data->z.o)) j+=denin_penalty;

  if (is_subnormal(data->y.o)||is_subnormal(data->z.o)) j+=denin_penalty;

  pass_after(j);

  pass_after(j);

  goto passit;

  goto passit;

@ @=

@ @=

j=1;

j=1;

if (data->i==frem) {

if (data->i==frem) {

  data->x.o=fremstep(data->y.o,data->z.o,frem_max);

  data->x.o=fremstep(data->y.o,data->z.o,frem_max);

  if (exceptions&E_BIT) {

  if (exceptions&E_BIT) {

    data->y.o=data->x.o;

    data->y.o=data->x.o;

    if (trying_to_interrupt && data==old_hot) goto fin_ex;

    if (trying_to_interrupt && data==old_hot) goto fin_ex;

  }@+else {

  }@+else {

    data->state=3;

    data->state=3;

    data->interim=false;

    data->interim=false;

    data->interrupt |= exceptions;

    data->interrupt |= exceptions;

    if (is_subnormal(data->x.o)) j+=denout_penalty;

    if (is_subnormal(data->x.o)) j+=denout_penalty;

  wait(j);

  wait(j);

@* System operations. Finally we need to implement some operations for the

@* System operations. Finally we need to implement some operations for the

operating system; then the hardware simulation will be done!

operating system; then the hardware simulation will be done!

A \.{LDVTS} instruction is delayed until it reaches the hot seat, because

A \.{LDVTS} instruction is delayed until it reaches the hot seat, because

it changes the IT and DT caches. The operating system should use \.{SYNC}

it changes the IT and DT caches. The operating system should use \.{SYNC}

after \.{LDVTS} if the effects are needed immediately; the system is also

after \.{LDVTS} if the effects are needed immediately; the system is also

responsible for ensuring that the page table permission bits agree with

responsible for ensuring that the page table permission bits agree with

the \.{LDVTS} permission bits when the latter are nonzero. (Also, if

the \.{LDVTS} permission bits when the latter are nonzero. (Also, if

write permission is taken away from a page, the operating system must

write permission is taken away from a page, the operating system must

have previously used \.{SYNCD} to write out any dirty bytes that might

have previously used \.{SYNCD} to write out any dirty bytes that might

have been cached from that page; \.{SYNCD} will be inoperative after write

have been cached from that page; \.{SYNCD} will be inoperative after write

permission goes away.)

permission goes away.)

@=

@=

if (data->i==ldvts) @;

if (data->i==ldvts) @;

@ @=

@ @=

  if (data!=old_hot) wait(1);

  if (data!=old_hot) wait(1);

  if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

  if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

  startup(&DTcache->reader[j],DTcache->access_time);

  startup(&DTcache->reader[j],DTcache->access_time);

  data->z.o.h=0, data->z.o.l=data->y.o.l&0x7;

  data->z.o.h=0, data->z.o.l=data->y.o.l&0x7;

  p=cache_search(DTcache,data->y.o); /* N.B.: Not |trans_key(data->y.o)| */

  p=cache_search(DTcache,data->y.o); /* N.B.: Not |trans_key(data->y.o)| */

  if (p) {

  if (p) {

    data->x.o.l=2;

    data->x.o.l=2;

    if (data->z.o.l) {

    if (data->z.o.l) {

      p=use_and_fix(DTcache,p);

      p=use_and_fix(DTcache,p);

      p->data[0].l=(p->data[0].l&-8)+data->z.o.l;

      p->data[0].l=(p->data[0].l&-8)+data->z.o.l;

    }@+else {

    }@+else {

      p=demote_and_fix(DTcache,p);

      p=demote_and_fix(DTcache,p);

      p->tag.h|=sign_bit; /* invalidate the tag */

      p->tag.h|=sign_bit; /* invalidate the tag */

  pass_after(DTcache->access_time);@+goto passit;

  pass_after(DTcache->access_time);@+goto passit;

@ @=

@ @=

case ld_st_launch:@+ if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

case ld_st_launch:@+ if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

  startup(&ITcache->reader[j],ITcache->access_time);

  startup(&ITcache->reader[j],ITcache->access_time);

  p=cache_search(ITcache,data->y.o); /* N.B.: Not |trans_key(data->y.o)| */

  p=cache_search(ITcache,data->y.o); /* N.B.: Not |trans_key(data->y.o)| */

  if (p) {

  if (p) {

    data->x.o.l|=1;

    data->x.o.l|=1;

    if (data->z.o.l) {

    if (data->z.o.l) {

      p=use_and_fix(ITcache,p);

      p=use_and_fix(ITcache,p);

      p->data[0].l=(p->data[0].l&-8)+data->z.o.l;

      p->data[0].l=(p->data[0].l&-8)+data->z.o.l;

    }@+else {

    }@+else {

      p=demote_and_fix(ITcache,p);

      p=demote_and_fix(ITcache,p);

      p->tag.h|=sign_bit; /* invalidate the tag */

      p->tag.h|=sign_bit; /* invalidate the tag */

  data->state=3;@+wait(ITcache->access_time);

  data->state=3;@+wait(ITcache->access_time);

@ The \.{SYNC} operation interacts with the pipeline in interesting ways.

@ The \.{SYNC} operation interacts with the pipeline in interesting ways.

\.{SYNC}~\.0 and \.{SYNC}~\.4 are the simplest; they just lock the

\.{SYNC}~\.0 and \.{SYNC}~\.4 are the simplest; they just lock the

dispatch and wait until they get to the hot seat, after which the

dispatch and wait until they get to the hot seat, after which the

pipeline has drained. \.{SYNC}~\.1 and \.{SYNC}~\.3 put a ``barrier''

pipeline has drained. \.{SYNC}~\.1 and \.{SYNC}~\.3 put a ``barrier''

into the write buffer so that subsequent store instructions will not merge with

into the write buffer so that subsequent store instructions will not merge with

previous stores. \.{SYNC}~\.2 and \.{SYNC}~\.3 lock the dispatch until

previous stores. \.{SYNC}~\.2 and \.{SYNC}~\.3 lock the dispatch until

all previous load instructions have left the pipeline. \.{SYNC}~\.5,

all previous load instructions have left the pipeline. \.{SYNC}~\.5,

\.{SYNC}~\.6, and \.{SYNC}~\.7 remove things from caches once they

\.{SYNC}~\.6, and \.{SYNC}~\.7 remove things from caches once they

get to the hot seat.

get to the hot seat.

@=

@=

case sync:@+ if (cool->zz>3) {

case sync:@+ if (cool->zz>3) {

  if (!(cool->loc.h&sign_bit)) goto privileged_inst;

  if (!(cool->loc.h&sign_bit)) goto privileged_inst;

  if (cool->zz==4) freeze_dispatch=true;

  if (cool->zz==4) freeze_dispatch=true;

}@+else {

}@+else {

  if (cool->zz!=1) freeze_dispatch=true;

  if (cool->zz!=1) freeze_dispatch=true;

  if (cool->zz&1) cool->mem_x=true, spec_install(&mem,&cool->x);

  if (cool->zz&1) cool->mem_x=true, spec_install(&mem,&cool->x);

}@+break;

}@+break;

@ @=

@ @=

case sync:@+ switch (data->zz) {

case sync:@+ switch (data->zz) {

 case 0: case 4:@+ if (data!=old_hot) wait(1);

 case 0: case 4:@+ if (data!=old_hot) wait(1);

  halted=(data->zz!=0);@+goto fin_ex;

  halted=(data->zz!=0);@+goto fin_ex;

 case 2: case 3: @;

 case 2: case 3: @;

  release_lock(self,dispatch_lock);

  release_lock(self,dispatch_lock);

 case 1: data->x.addr=zero_octa;@+goto fin_ex;

 case 1: data->x.addr=zero_octa;@+goto fin_ex;

 case 5:@+ if (data!=old_hot) wait(1);

 case 5:@+ if (data!=old_hot) wait(1);

@;

@;

 case 6:@+ if (data!=old_hot) wait(1);

 case 6:@+ if (data!=old_hot) wait(1);

@;

@;

 case 7:@+ if (data!=old_hot) wait(1);

 case 7:@+ if (data!=old_hot) wait(1);

@;

@;

@ @=

@ @=

  register control *cc;

  register control *cc;

  for (cc=data;cc!=hot;) {

  for (cc=data;cc!=hot;) {

    cc=(cc==reorder_top? reorder_bot: cc+1);

    cc=(cc==reorder_top? reorder_bot: cc+1);

    if (cc->owner && (cc->i==ld || cc->i==ldunc || cc->i==pst)) wait(1);

    if (cc->owner && (cc->i==ld || cc->i==ldunc || cc->i==pst)) wait(1);

@ Perhaps the delay should be longer here.

@ Perhaps the delay should be longer here.

@=

@=

if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

if (DTcache->lock || (j=get_reader(DTcache))<0) wait(1);

startup(&DTcache->reader[j],DTcache->access_time);

startup(&DTcache->reader[j],DTcache->access_time);

set_lock(self,DTcache->lock);

set_lock(self,DTcache->lock);

zap_cache(DTcache);

zap_cache(DTcache);

data->state=10;@+wait(DTcache->access_time);

data->state=10;@+wait(DTcache->access_time);

@ @=

@ @=

if (!Icache) {

if (!Icache) {

  data->state=11;@+goto switch1;

  data->state=11;@+goto switch1;

if (Icache->lock || (j=get_reader(Icache))<0) wait(1);

if (Icache->lock || (j=get_reader(Icache))<0) wait(1);

startup(&Icache->reader[j],Icache->access_time);

startup(&Icache->reader[j],Icache->access_time);

set_lock(self,Icache->lock);

set_lock(self,Icache->lock);

zap_cache(Icache);

zap_cache(Icache);

data->state=11;@+wait(Icache->access_time);

data->state=11;@+wait(Icache->access_time);

@ @=

@ @=

case 10:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 10:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

 if (ITcache->lock || (j=get_reader(ITcache))<0) wait(1);

 startup(&ITcache->reader[j],ITcache->access_time);

 startup(&ITcache->reader[j],ITcache->access_time);

 set_lock(self,ITcache->lock);

 set_lock(self,ITcache->lock);

 zap_cache(ITcache);

 zap_cache(ITcache);

 data->state=3;@+wait(ITcache->access_time);

 data->state=3;@+wait(ITcache->access_time);

case 11:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 11:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (wbuf_lock) wait(1);

 if (wbuf_lock) wait(1);

 write_head=write_tail, write_ctl.state=0; /* zap the write buffer */

 write_head=write_tail, write_ctl.state=0; /* zap the write buffer */

 if (!Dcache) {

 if (!Dcache) {

   data->state=12;@+ goto switch1;

   data->state=12;@+ goto switch1;

 if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

 if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

 startup(&Dcache->reader[j],Dcache->access_time);

 startup(&Dcache->reader[j],Dcache->access_time);

 set_lock(self,Dcache->lock);

 set_lock(self,Dcache->lock);

 zap_cache(Dcache);

 zap_cache(Dcache);

 data->state=12;@+wait(Dcache->access_time);

 data->state=12;@+wait(Dcache->access_time);

case 12:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 12:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (!Scache) goto fin_ex;

 if (!Scache) goto fin_ex;

 if (Scache->lock) wait(1);

 if (Scache->lock) wait(1);

 set_lock(self,Scache->lock);

 set_lock(self,Scache->lock);

 zap_cache(Scache);

 zap_cache(Scache);

 data->state=3;@+wait(Scache->access_time);

 data->state=3;@+wait(Scache->access_time);

@ @=

@ @=

if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

@;

@;

if (clean_co.next || clean_lock) wait(1);

if (clean_co.next || clean_lock) wait(1);

set_lock(self,clean_lock);

set_lock(self,clean_lock);

clean_ctl.i=sync;@+

clean_ctl.i=sync;@+

clean_ctl.state=0;@+

clean_ctl.state=0;@+

clean_ctl.x.o.h=0;

clean_ctl.x.o.h=0;

startup(&clean_co,1);

startup(&clean_co,1);

data->state=13;

data->state=13;

data->interim=true;

data->interim=true;

wait(1);

wait(1);

@ @=

@ @=

if (write_head!=write_tail) {

if (write_head!=write_tail) {

  if (!speed_lock) set_lock(self,speed_lock);

  if (!speed_lock) set_lock(self,speed_lock);

  wait(1);

  wait(1);

@ The cleanup process might take a huge amount of time, so we must allow

@ The cleanup process might take a huge amount of time, so we must allow

it to be interrupted. (Servicing the interruption might, of course,

it to be interrupted. (Servicing the interruption might, of course,

put more stuff into the cache.)

put more stuff into the cache.)

@=

@=

case 13:@+ if (!clean_co.next) {

case 13:@+ if (!clean_co.next) {

   data->interim=false;@+ goto fin_ex; /* it's done! */

   data->interim=false;@+ goto fin_ex; /* it's done! */

 if (trying_to_interrupt) goto fin_ex; /* accept an interruption */

 if (trying_to_interrupt) goto fin_ex; /* accept an interruption */

 wait(1);

 wait(1);

@ Now we consider \.{SYNCD} and \.{SYNCID}. When control comes to this

@ Now we consider \.{SYNCD} and \.{SYNCID}. When control comes to this

part of the program, |data->y.o| is a virtual address and |data->z.o|

part of the program, |data->y.o| is a virtual address and |data->z.o|

is the corresponding physical address; |data->xx+1| is the number of

is the corresponding physical address; |data->xx+1| is the number of

bytes we are supposed to be syncing; |data->b.o.l| is the number of

bytes we are supposed to be syncing; |data->b.o.l| is the number of

bytes we can handle at once (either |Icache->bb| or |Dcache->bb| or 8192).

bytes we can handle at once (either |Icache->bb| or |Dcache->bb| or 8192).

We need a more elaborate scheme to implement \.{SYNCD} and \.{SYNCID}

We need a more elaborate scheme to implement \.{SYNCD} and \.{SYNCID}

than we have used for the ``hint'' instructions \.{PRELD}, \.{PREGO},

than we have used for the ``hint'' instructions \.{PRELD}, \.{PREGO},

and \.{PREST}, because \.{SYNCD} and \.{SYNCID} are not merely hints.

and \.{PREST}, because \.{SYNCD} and \.{SYNCID} are not merely hints.

They cannot be converted into a sequence of cache-block-size commands at

They cannot be converted into a sequence of cache-block-size commands at

dispatch time, because we cannot be sure that the starting virtual address

dispatch time, because we cannot be sure that the starting virtual address

will be aligned with the beginning of a cache block. We need to realize

will be aligned with the beginning of a cache block. We need to realize

that the bytes specified by \.{SYNCD} or \.{SYNCID} might cross a

that the bytes specified by \.{SYNCD} or \.{SYNCID} might cross a

virtual page boundary---possibly with different protection bits

virtual page boundary---possibly with different protection bits

on each page. We need to allow for interrupts. And we also need to

on each page. We need to allow for interrupts. And we also need to

keep the fetch buffer empty until a user's \.{SYNCID} has completely

keep the fetch buffer empty until a user's \.{SYNCID} has completely

brought the memory up to date.

brought the memory up to date.

@=

@=

do_syncid: data->state=30;

do_syncid: data->state=30;

case 30:@+ if (data!=old_hot) wait(1);

case 30:@+ if (data!=old_hot) wait(1);

 if (!Icache) {

 if (!Icache) {

   data->state=(data->loc.h&sign_bit? 31:33);@+goto switch2;

   data->state=(data->loc.h&sign_bit? 31:33);@+goto switch2;

 @z.o|, if any@>;

 @z.o|, if any@>;

 data->state=(data->loc.h&sign_bit? 31: 33);@+wait(Icache->access_time);

 data->state=(data->loc.h&sign_bit? 31: 33);@+wait(Icache->access_time);

case 31:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 31:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

@;

@;

 if (((data->b.o.l-1)&~data->y.o.l)xx) data->interim=true;

 if (((data->b.o.l-1)&~data->y.o.l)xx) data->interim=true;

 if (!Dcache) goto next_sync;

 if (!Dcache) goto next_sync;

 @z.o|, if any@>;

 @z.o|, if any@>;

 data->state=32;@+wait(Dcache->access_time);

 data->state=32;@+wait(Dcache->access_time);

case 32:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 32:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (!Scache) goto next_sync;

 if (!Scache) goto next_sync;

 @z.o|, if any@>;

 @z.o|, if any@>;

 data->state=35;@+wait(Scache->access_time);

 data->state=35;@+wait(Scache->access_time);

do_syncd: data->state=33;

do_syncd: data->state=33;

case 33:@+ if (data!=old_hot) wait(1);

case 33:@+ if (data!=old_hot) wait(1);

 if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

@;

@;

 if (((data->b.o.l-1)&~data->y.o.l)xx) data->interim=true;

 if (((data->b.o.l-1)&~data->y.o.l)xx) data->interim=true;

 if (!Dcache)

 if (!Dcache)

   if (data->i==syncd) goto fin_ex;@+ else goto next_sync;

   if (data->i==syncd) goto fin_ex;@+ else goto next_sync;

 @z.o|, if any@>;

 @z.o|, if any@>;

 data->state=34;

 data->state=34;

case 34:@+if (!clean_co.next) goto next_sync;

case 34:@+if (!clean_co.next) goto next_sync;

 if (trying_to_interrupt && data->interim && data==old_hot) {

 if (trying_to_interrupt && data->interim && data==old_hot) {

   data->z.o=zero_octa; /* anticipate |RESUME_CONT| */

   data->z.o=zero_octa; /* anticipate |RESUME_CONT| */

   goto fin_ex; /* accept an interruption */

   goto fin_ex; /* accept an interruption */

 wait(1);

 wait(1);

next_sync: data->state=35;

next_sync: data->state=35;

case 35:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

case 35:@+ if (self->lockloc) *(self->lockloc)=NULL,self->lockloc=NULL;

 if (data->interim) @;

 if (data->interim) @;

 data->go.known=true;

 data->go.known=true;

 goto fin_ex;

 goto fin_ex;

@ @z.o|, if any@>=

@ @z.o|, if any@>=

if (Icache->lock || (j=get_reader(Icache))<0) wait(1);

if (Icache->lock || (j=get_reader(Icache))<0) wait(1);

startup(&Icache->reader[j],Icache->access_time);

startup(&Icache->reader[j],Icache->access_time);

set_lock(self,Icache->lock);

set_lock(self,Icache->lock);

p=cache_search(Icache,data->z.o);

p=cache_search(Icache,data->z.o);

if (p) {

if (p) {

  demote_and_fix(Icache,p);

  demote_and_fix(Icache,p);

  clean_block(Icache,p);

  clean_block(Icache,p);

@ @z.o|, if any@>=

@ @z.o|, if any@>=

if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

if (Dcache->lock || (j=get_reader(Dcache))<0) wait(1);

startup(&Dcache->reader[j],Dcache->access_time);

startup(&Dcache->reader[j],Dcache->access_time);

set_lock(self,Dcache->lock);

set_lock(self,Dcache->lock);

p=cache_search(Dcache,data->z.o);

p=cache_search(Dcache,data->z.o);

if (p) {

if (p) {

  demote_and_fix(Dcache,p);

  demote_and_fix(Dcache,p);

  clean_block(Dcache,p);

  clean_block(Dcache,p);

@ @z.o|, if any@>=

@ @z.o|, if any@>=

if (Scache->lock) wait(1);

if (Scache->lock) wait(1);

set_lock(self,Scache->lock);

set_lock(self,Scache->lock);

p=cache_search(Scache,data->z.o);

p=cache_search(Scache,data->z.o);

if (p) {

if (p) {

  demote_and_fix(Scache,p);

  demote_and_fix(Scache,p);

  clean_block(Scache,p);

  clean_block(Scache,p);

@ @z.o|, if any@>=

@ @z.o|, if any@>=

if (clean_co.next || clean_lock) wait(1);

if (clean_co.next || clean_lock) wait(1);

set_lock(self,clean_lock);

set_lock(self,clean_lock);

clean_ctl.i=syncd;

clean_ctl.i=syncd;

clean_ctl.state=4;

clean_ctl.state=4;

clean_ctl.x.o.h=data->loc.h&sign_bit;

clean_ctl.x.o.h=data->loc.h&sign_bit;

clean_ctl.z.o=data->z.o;

clean_ctl.z.o=data->z.o;

schedule(&clean_co,1,4);

schedule(&clean_co,1,4);

@ We use the fact that cache block sizes are divisors of 8192.

@ We use the fact that cache block sizes are divisors of 8192.

@=

@=

  data->interim=false;

  data->interim=false;

  data->xx -= ((data->b.o.l-1)&~data->y.o.l)+1;

  data->xx -= ((data->b.o.l-1)&~data->y.o.l)+1;

  data->y.o=incr(data->y.o,data->b.o.l);

  data->y.o=incr(data->y.o,data->b.o.l);

  data->y.o.l &= -data->b.o.l;

  data->y.o.l &= -data->b.o.l;

  data->z.o.l = (data->z.o.l&-8192)+(data->y.o.l&8191);

  data->z.o.l = (data->z.o.l&-8192)+(data->y.o.l&8191);

  if ((data->y.o.l&8191)==0) goto square_one;

  if ((data->y.o.l&8191)==0) goto square_one;

      /* maybe crossed a page boundary */

      /* maybe crossed a page boundary */

  if (data->i==syncd) goto do_syncd;@+else goto do_syncid;

  if (data->i==syncd) goto do_syncd;@+else goto do_syncid;

@ If the first page lacks proper protection, we still must try the

@ If the first page lacks proper protection, we still must try the

second, in the rare case that a page boundary is spanned.

second, in the rare case that a page boundary is spanned.

@=

@=

sync_check:@+ if ((data->y.o.l ^ (data->y.o.l+data->xx))>=8192) {

sync_check:@+ if ((data->y.o.l ^ (data->y.o.l+data->xx))>=8192) {

   data->xx -= (8191&~data->y.o.l)+1;

   data->xx -= (8191&~data->y.o.l)+1;

   data->y.o=incr(data->y.o,8192);

   data->y.o=incr(data->y.o,8192);

   data->y.o.l &= -8192;

   data->y.o.l &= -8192;

   goto square_one;

   goto square_one;

 goto fin_ex;

 goto fin_ex;

@* Input and output. We're done implementing the hardware, but there's

@* Input and output. We're done implementing the hardware, but there's

still a small matter of software remaining, because we sometimes

still a small matter of software remaining, because we sometimes

want to pretend that a real operating

want to pretend that a real operating

system is present without actually having one loaded. This simulator

system is present without actually having one loaded. This simulator

therefore implements a special feature: If \.{RESUME}~\.1 is issued in

therefore implements a special feature: If \.{RESUME}~\.1 is issued in

location~rT, the ten special I/O traps of {\mc MMIX-SIM} are performed

location~rT, the ten special I/O traps of {\mc MMIX-SIM} are performed

instantaneously behind the scenes.

instantaneously behind the scenes.

Of course all claims of accurate simulation go out the door when this

Of course all claims of accurate simulation go out the door when this

feature is used.

feature is used.

@d max_sys_call Ftell

@d max_sys_call Ftell

@=

@=

typedef enum{

typedef enum{

@!Halt,@!Fopen,@!Fclose,@!Fread,@!Fgets,@!Fgetws,

@!Halt,@!Fopen,@!Fclose,@!Fread,@!Fgets,@!Fgetws,

@!Fwrite,@!Fputs,@!Fputws,@!Fseek,@!Ftell} @!sys_call;

@!Fwrite,@!Fputs,@!Fputws,@!Fseek,@!Ftell} @!sys_call;

@ @loc| is rT@>=

@ @loc| is rT@>=

if (cool->loc.l==g[rT].o.l && cool->loc.h==g[rT].o.h) {

if (cool->loc.l==g[rT].o.l && cool->loc.h==g[rT].o.h) {

  register unsigned char yy,zz; octa ma,mb;

  register unsigned char yy,zz; octa ma,mb;

  if (g[rXX].o.l&0xffff0000) goto magic_done;

  if (g[rXX].o.l&0xffff0000) goto magic_done;

  yy=g[rXX].o.l>>8, zz=g[rXX].o.l&0xff;

  yy=g[rXX].o.l>>8, zz=g[rXX].o.l&0xff;

  if (yy>max_sys_call) goto magic_done;

  if (yy>max_sys_call) goto magic_done;

           if needed@>;

           if needed@>;

  switch (yy) {

  switch (yy) {

case Halt: @;@+break;

case Halt: @;@+break;

case Fopen: g[rBB].o=mmix_fopen(zz,mb,ma);@+break;

case Fopen: g[rBB].o=mmix_fopen(zz,mb,ma);@+break;

case Fclose: g[rBB].o=mmix_fclose(zz);@+break;

case Fclose: g[rBB].o=mmix_fclose(zz);@+break;

case Fread: g[rBB].o=mmix_fread(zz,mb,ma);@+break;

case Fread: g[rBB].o=mmix_fread(zz,mb,ma);@+break;

case Fgets: g[rBB].o=mmix_fgets(zz,mb,ma);@+break;

case Fgets: g[rBB].o=mmix_fgets(zz,mb,ma);@+break;

case Fgetws: g[rBB].o=mmix_fgetws(zz,mb,ma);@+break;

case Fgetws: g[rBB].o=mmix_fgetws(zz,mb,ma);@+break;

case Fwrite: g[rBB].o=mmix_fwrite(zz,mb,ma);@+break;

case Fwrite: g[rBB].o=mmix_fwrite(zz,mb,ma);@+break;

case Fputs: g[rBB].o=mmix_fputs(zz,g[rBB].o);@+break;

case Fputs: g[rBB].o=mmix_fputs(zz,g[rBB].o);@+break;

case Fputws: g[rBB].o=mmix_fputws(zz,g[rBB].o);@+break;

case Fputws: g[rBB].o=mmix_fputws(zz,g[rBB].o);@+break;

case Fseek: g[rBB].o=mmix_fseek(zz,g[rBB].o);@+break;

case Fseek: g[rBB].o=mmix_fseek(zz,g[rBB].o);@+break;

case Ftell: g[rBB].o=mmix_ftell(zz);@+break;

case Ftell: g[rBB].o=mmix_ftell(zz);@+break;

magic_done: g[255].o=neg_one; /* this will enable interrupts */

magic_done: g[255].o=neg_one; /* this will enable interrupts */

@ @=

@ @=

if (!zz) halted=true;

if (!zz) halted=true;

else if (zz==1) {

else if (zz==1) {

  octa trap_loc;

  octa trap_loc;

  trap_loc=incr(g[rWW].o,-4);

  trap_loc=incr(g[rWW].o,-4);

  if (!(trap_loc.h || trap_loc.l>=0x90))

  if (!(trap_loc.h || trap_loc.l>=0x90))

    print_trip_warning(trap_loc.l>>4,incr(g[rW].o,-4));

    print_trip_warning(trap_loc.l>>4,incr(g[rW].o,-4));

@ @=

@ @=

char arg_count[]={1,3,1,3,3,3,3,2,2,2,1};

char arg_count[]={1,3,1,3,3,3,3,2,2,2,1};

@ The input/output operations invoked by \.{TRAP}s are

@ The input/output operations invoked by \.{TRAP}s are

done by subroutines in an auxiliary program module called {\mc MMIX-IO}.

done by subroutines in an auxiliary program module called {\mc MMIX-IO}.

Here we need only declare those subroutines, and write three primitive

Here we need only declare those subroutines, and write three primitive

interfaces on which they depend.

interfaces on which they depend.

@ @=

@ @=

extern octa mmix_fopen @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fopen @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fclose @,@,@[ARGS((unsigned char))@];

extern octa mmix_fclose @,@,@[ARGS((unsigned char))@];

extern octa mmix_fread @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fread @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fgets @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fgets @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fgetws @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fgetws @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fwrite @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fwrite @,@,@[ARGS((unsigned char,octa,octa))@];

extern octa mmix_fputs @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_fputs @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_fputws @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_fputws @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_fseek @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_fseek @,@,@[ARGS((unsigned char,octa))@];

extern octa mmix_ftell @,@,@[ARGS((unsigned char))@];

extern octa mmix_ftell @,@,@[ARGS((unsigned char))@];

extern void print_trip_warning @,@,@[ARGS((int,octa))@];

extern void print_trip_warning @,@,@[ARGS((int,octa))@];

@ @=

@ @=

int mmgetchars @,@,@[ARGS((char*,int,octa,int))@];

int mmgetchars @,@,@[ARGS((char*,int,octa,int))@];

void mmputchars @,@,@[ARGS((unsigned char*,int,octa))@];

void mmputchars @,@,@[ARGS((unsigned char*,int,octa))@];

char stdin_chr @,@,@[ARGS((void))@];

char stdin_chr @,@,@[ARGS((void))@];

octa magic_read @,@,@[ARGS((octa))@];

octa magic_read @,@,@[ARGS((octa))@];

void magic_write @,@,@[ARGS((octa,octa))@];

void magic_write @,@,@[ARGS((octa,octa))@];

@ We need to cut through all the complications of buffers and

@ We need to cut through all the complications of buffers and

caches in order to do magical I/O. The |magic_read| routine finds

caches in order to do magical I/O. The |magic_read| routine finds

the current octabyte in a given physical address by looking at the

the current octabyte in a given physical address by looking at the

write buffer, D-cache, S-cache, and memory until finding it.

write buffer, D-cache, S-cache, and memory until finding it.

@=

@=

octa magic_read(addr)

octa magic_read(addr)

  octa addr;

  octa addr;

  register write_node *q;

  register write_node *q;

  register cacheblock *p;

  register cacheblock *p;

  for (q=write_tail;;) {

  for (q=write_tail;;) {

    if (q==write_head) break;

    if (q==write_head) break;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if ((q->addr.l&-8)==(addr.l&-8) && q->addr.h==addr.h) return q->o;

    if ((q->addr.l&-8)==(addr.l&-8) && q->addr.h==addr.h) return q->o;

  if (Dcache) {

  if (Dcache) {

    p=cache_search(Dcache,addr);

    p=cache_search(Dcache,addr);

    if (p) return p->data[(addr.l&(Dcache->bb-1))>>3];

    if (p) return p->data[(addr.l&(Dcache->bb-1))>>3];

    if (((Dcache->outbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

    if (((Dcache->outbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

          Dcache->outbuf.tag.h==addr.h)

          Dcache->outbuf.tag.h==addr.h)

      return Dcache->outbuf.data[(addr.l&(Dcache->bb-1))>>3];

      return Dcache->outbuf.data[(addr.l&(Dcache->bb-1))>>3];

    if (Scache) {

    if (Scache) {

      p=cache_search(Scache,addr);

      p=cache_search(Scache,addr);

      if (p) return p->data[(addr.l&(Scache->bb-1))>>3];

      if (p) return p->data[(addr.l&(Scache->bb-1))>>3];

      if (((Scache->outbuf.tag.l^addr.l)&-Scache->bb)==0 &&

      if (((Scache->outbuf.tag.l^addr.l)&-Scache->bb)==0 &&

            Scache->outbuf.tag.h==addr.h)

            Scache->outbuf.tag.h==addr.h)

        return Scache->outbuf.data[(addr.l&(Scache->bb-1))>>3];

        return Scache->outbuf.data[(addr.l&(Scache->bb-1))>>3];

  return mem_read(addr);

  return mem_read(addr);

@ The |magic_write| routine changes the octabyte in a given physical

@ The |magic_write| routine changes the octabyte in a given physical

address by changing it wherever it appears in a buffer or cache.

address by changing it wherever it appears in a buffer or cache.

Any ``dirty'' or ``least recently used'' status remains unchanged.

Any ``dirty'' or ``least recently used'' status remains unchanged.

(Yes, this {\it is\/} magic.)

(Yes, this {\it is\/} magic.)

@=

@=

void magic_write(addr,val)

void magic_write(addr,val)

  octa addr,val;

  octa addr,val;

  register write_node *q;

  register write_node *q;

  register cacheblock *p;

  register cacheblock *p;

  for (q=write_tail;;) {

  for (q=write_tail;;) {

    if (q==write_head) break;

    if (q==write_head) break;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if (q==wbuf_top) q=wbuf_bot;@+ else q++;

    if ((q->addr.l&-8)==(addr.l&-8) && q->addr.h==addr.h) q->o=val;

    if ((q->addr.l&-8)==(addr.l&-8) && q->addr.h==addr.h) q->o=val;

  if (Dcache) {

  if (Dcache) {

    p=cache_search(Dcache,addr);

    p=cache_search(Dcache,addr);

    if (p) p->data[(addr.l&(Dcache->bb-1))>>3]=val;

    if (p) p->data[(addr.l&(Dcache->bb-1))>>3]=val;

    if (((Dcache->inbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

    if (((Dcache->inbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

          Dcache->inbuf.tag.h==addr.h)

          Dcache->inbuf.tag.h==addr.h)

      Dcache->inbuf.data[(addr.l&(Dcache->bb-1))>>3]=val;

      Dcache->inbuf.data[(addr.l&(Dcache->bb-1))>>3]=val;

    if (((Dcache->outbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

    if (((Dcache->outbuf.tag.l^addr.l)&-Dcache->bb)==0 &&

          Dcache->outbuf.tag.h==addr.h)

          Dcache->outbuf.tag.h==addr.h)

      Dcache->outbuf.data[(addr.l&(Dcache->bb-1))>>3]=val;

      Dcache->outbuf.data[(addr.l&(Dcache->bb-1))>>3]=val;

    if (Scache) {

    if (Scache) {

      p=cache_search(Scache,addr);

      p=cache_search(Scache,addr);

      if (p) p->data[(addr.l&(Scache->bb-1))>>3]=val;

      if (p) p->data[(addr.l&(Scache->bb-1))>>3]=val;

      if (((Scache->inbuf.tag.l^addr.l)&-Scache->bb)==0 &&

      if (((Scache->inbuf.tag.l^addr.l)&-Scache->bb)==0 &&

            Scache->inbuf.tag.h==addr.h)

            Scache->inbuf.tag.h==addr.h)

        Scache->inbuf.data[(addr.l&(Scache->bb-1))>>3]=val;

        Scache->inbuf.data[(addr.l&(Scache->bb-1))>>3]=val;

      if (((Scache->outbuf.tag.l^addr.l)&-Scache->bb)==0 &&

      if (((Scache->outbuf.tag.l^addr.l)&-Scache->bb)==0 &&

            Scache->outbuf.tag.h==addr.h)

            Scache->outbuf.tag.h==addr.h)

        Scache->outbuf.data[(addr.l&(Scache->bb-1))>>3]=val;

        Scache->outbuf.data[(addr.l&(Scache->bb-1))>>3]=val;

  mem_write(addr,val);

  mem_write(addr,val);

@ The conventions of our imaginary operating system require us to

@ The conventions of our imaginary operating system require us to

apply the trivial memory mapping in which segment~$i$ appears in

apply the trivial memory mapping in which segment~$i$ appears in

a $2^{32}$-byte page of physical addresses starting at $2^{32}i$.

a $2^{32}$-byte page of physical addresses starting at $2^{32}i$.

@=

@=

if (arg_count[yy]==3) {

if (arg_count[yy]==3) {

  octa arg_loc;

  octa arg_loc;

  arg_loc=g[rBB].o;

  arg_loc=g[rBB].o;

  if (arg_loc.h&0x9fffffff) mb=zero_octa;

  if (arg_loc.h&0x9fffffff) mb=zero_octa;

  else arg_loc.h>>=29, mb=magic_read(arg_loc);

  else arg_loc.h>>=29, mb=magic_read(arg_loc);

  arg_loc=incr(g[rBB].o,8);

  arg_loc=incr(g[rBB].o,8);

  if (arg_loc.h&0x9fffffff) ma=zero_octa;

  if (arg_loc.h&0x9fffffff) ma=zero_octa;

  else arg_loc.h>>=29, ma=magic_read(arg_loc);

  else arg_loc.h>>=29, ma=magic_read(arg_loc);

@ The subroutine |mmgetchars(buf,size,addr,stop)| reads characters

@ The subroutine |mmgetchars(buf,size,addr,stop)| reads characters

starting at address |addr| in the simulated memory and stores them

starting at address |addr| in the simulated memory and stores them

in |buf|, continuing until |size| characters have been read or

in |buf|, continuing until |size| characters have been read or

some other stopping criterion has been met. If |stop<0| there is

some other stopping criterion has been met. If |stop<0| there is

no other criterion; if |stop=0| a null character will also terminate

no other criterion; if |stop=0| a null character will also terminate

the process; otherwise |addr| is even, and two consecutive null bytes

the process; otherwise |addr| is even, and two consecutive null bytes

starting at an even address will terminate the process. The number

starting at an even address will terminate the process. The number

of bytes read and stored, exclusive of terminating nulls, is returned.

of bytes read and stored, exclusive of terminating nulls, is returned.

@=

@=

int mmgetchars(buf,size,addr,stop)

int mmgetchars(buf,size,addr,stop)

  char *buf;

  char *buf;

  int size;

  int size;

  octa addr;

  octa addr;

  int stop;

  int stop;

  register char *p;

  register char *p;

  register int m;

  register int m;

  octa a,x;

  octa a,x;

  if (((addr.h&0x9fffffff)||(incr(addr,size-1).h&0x9fffffff))&&size) {

  if (((addr.h&0x9fffffff)||(incr(addr,size-1).h&0x9fffffff))&&size) {

    fprintf(stderr,"Attempt to get characters from off the page!\n");

    fprintf(stderr,"Attempt to get characters from off the page!\n");

@.Attempt to get characters...@>

@.Attempt to get characters...@>

    return 0;

    return 0;

  for (p=buf,m=0,a=addr,a.h>>=29; m

  for (p=buf,m=0,a=addr,a.h>>=29; m

    x=magic_read(a);

    x=magic_read(a);

    if ((a.l&0x7) || m>size-8) @@;

    if ((a.l&0x7) || m>size-8) @@;

    else @@;

    else @@;

  return size;

  return size;

@ @=

@ @=

  if (a.l&0x4) *p=(x.l>>(8*((~a.l)&0x3)))&0xff;

  if (a.l&0x4) *p=(x.l>>(8*((~a.l)&0x3)))&0xff;

  else *p=(x.h>>(8*((~a.l)&0x3)))&0xff;

  else *p=(x.h>>(8*((~a.l)&0x3)))&0xff;

  if (!*p && stop>=0) {

  if (!*p && stop>=0) {

    if (stop==0) return m;

    if (stop==0) return m;

    if ((a.l&0x1) && *(p-1)=='\0') return m-1;

    if ((a.l&0x1) && *(p-1)=='\0') return m-1;

  p++,m++,a=incr(a,1);

  p++,m++,a=incr(a,1);

@ @=

@ @=

  *p=x.h>>24;

  *p=x.h>>24;

  if (!*p && (stop==0 || (stop>0 && x.h<0x10000))) return m;

  if (!*p && (stop==0 || (stop>0 && x.h<0x10000))) return m;

  *(p+1)=(x.h>>16)&0xff;

  *(p+1)=(x.h>>16)&0xff;

  if (!*(p+1) && stop==0) return m+1;

  if (!*(p+1) && stop==0) return m+1;

  *(p+2)=(x.h>>8)&0xff;

  *(p+2)=(x.h>>8)&0xff;

  if (!*(p+2) && (stop==0 || (stop>0 && (x.h&0xffff)==0))) return m+2;

  if (!*(p+2) && (stop==0 || (stop>0 && (x.h&0xffff)==0))) return m+2;

  *(p+3)=x.h&0xff;

  *(p+3)=x.h&0xff;

  if (!*(p+3) && stop==0) return m+3;

  if (!*(p+3) && stop==0) return m+3;

  *(p+4)=x.l>>24;

  *(p+4)=x.l>>24;

  if (!*(p+4) && (stop==0 || (stop>0 && x.l<0x10000))) return m+4;

  if (!*(p+4) && (stop==0 || (stop>0 && x.l<0x10000))) return m+4;

  *(p+5)=(x.l>>16)&0xff;

  *(p+5)=(x.l>>16)&0xff;

  if (!*(p+5) && stop==0) return m+5;

  if (!*(p+5) && stop==0) return m+5;

  *(p+6)=(x.l>>8)&0xff;

  *(p+6)=(x.l>>8)&0xff;

  if (!*(p+6) && (stop==0 || (stop>0 && (x.l&0xffff)==0))) return m+6;

  if (!*(p+6) && (stop==0 || (stop>0 && (x.l&0xffff)==0))) return m+6;

  *(p+7)=x.l&0xff;

  *(p+7)=x.l&0xff;

  if (!*(p+7) && stop==0) return m+7;

  if (!*(p+7) && stop==0) return m+7;

  p+=8,m+=8,a=incr(a,8);

  p+=8,m+=8,a=incr(a,8);

@ The subroutine |mmputchars(buf,size,addr)| puts |size| characters

@ The subroutine |mmputchars(buf,size,addr)| puts |size| characters

into the simulated memory starting at address |addr|.

into the simulated memory starting at address |addr|.

@=

@=

void mmputchars(buf,size,addr)

void mmputchars(buf,size,addr)

  unsigned char *buf;

  unsigned char *buf;

  int size;

  int size;

  octa addr;

  octa addr;

  register unsigned char *p;

  register unsigned char *p;

  register int m;

  register int m;

  octa a,x;

  octa a,x;

  if (((addr.h&0x9fffffff)||(incr(addr,size-1).h&0x9fffffff))&&size) {

  if (((addr.h&0x9fffffff)||(incr(addr,size-1).h&0x9fffffff))&&size) {

    fprintf(stderr,"Attempt to put characters off the page!\n");

    fprintf(stderr,"Attempt to put characters off the page!\n");

@.Attempt to put characters...@>

@.Attempt to put characters...@>

    return;

    return;

  for (p=buf,m=0,a=addr,a.h>>=29; m

  for (p=buf,m=0,a=addr,a.h>>=29; m

    if ((a.l&0x7) || m>size-8) @@;

    if ((a.l&0x7) || m>size-8) @@;

    else @;

    else @;

@ @=

@ @=

  register int s=8*((~a.l)&0x3);

  register int s=8*((~a.l)&0x3);

  x=magic_read(a);

  x=magic_read(a);

  if (a.l&0x4) x.l^=(((x.l>>s)^*p)&0xff)<

  if (a.l&0x4) x.l^=(((x.l>>s)^*p)&0xff)<

  else x.h^=(((x.h>>s)^*p)&0xff)<

  else x.h^=(((x.h>>s)^*p)&0xff)<

  magic_write(a,x);

  magic_write(a,x);

  p++,m++,a=incr(a,1);

  p++,m++,a=incr(a,1);

@ @=

@ @=

  x.h=(*p<<24)+(*(p+1)<<16)+(*(p+2)<<8)+*(p+3);

  x.h=(*p<<24)+(*(p+1)<<16)+(*(p+2)<<8)+*(p+3);

  x.l=(*(p+4)<<24)+(*(p+5)<<16)+(*(p+6)<<8)+*(p+7);

  x.l=(*(p+4)<<24)+(*(p+5)<<16)+(*(p+6)<<8)+*(p+7);

  magic_write(a,x);

  magic_write(a,x);

  p+=8,m+=8,a=incr(a,8);

  p+=8,m+=8,a=incr(a,8);

@ When standard input is being read by the simulated program at the same time

@ When standard input is being read by the simulated program at the same time

as it is being used for interaction, we try to keep the two uses separate

as it is being used for interaction, we try to keep the two uses separate

by maintaining a private buffer for the simulated program's \.{StdIn}.

by maintaining a private buffer for the simulated program's \.{StdIn}.

Online input is usually transmitted from the keyboard to a \CEE/ program

Online input is usually transmitted from the keyboard to a \CEE/ program

a line at a time; therefore an

a line at a time; therefore an

|fgets| operation works much better than |fread| when we prompt

|fgets| operation works much better than |fread| when we prompt

for new input. But there is a slight complication, because |fgets|

for new input. But there is a slight complication, because |fgets|

might read a null character before coming to a newline character.

might read a null character before coming to a newline character.

We cannot deduce the number of characters read by |fgets| simply

We cannot deduce the number of characters read by |fgets| simply

by looking at |strlen(stdin_buf)|.

by looking at |strlen(stdin_buf)|.

@=

@=

char stdin_chr()

char stdin_chr()

  register char* p;

  register char* p;

  while (stdin_buf_start==stdin_buf_end) {

  while (stdin_buf_start==stdin_buf_end) {

    printf("StdIn> ");@+fflush(stdout);

    printf("StdIn> ");@+fflush(stdout);

@.StdIn>@>

@.StdIn>@>

    fgets(stdin_buf,256,stdin);

    fgets(stdin_buf,256,stdin);

    stdin_buf_start=stdin_buf;

    stdin_buf_start=stdin_buf;

    for (p=stdin_buf;p

    for (p=stdin_buf;p

    stdin_buf_end=p+1;

    stdin_buf_end=p+1;

  return *stdin_buf_start++;

  return *stdin_buf_start++;

@ @=

@ @=

char stdin_buf[256]; /* standard input to the simulated program */

char stdin_buf[256]; /* standard input to the simulated program */

char *stdin_buf_start; /* current position in that buffer */

char *stdin_buf_start; /* current position in that buffer */

char *stdin_buf_end; /* current end of that buffer */

char *stdin_buf_end; /* current end of that buffer */

@* Index.

@* Index.

Browse

Tools

Subversion Repositories eco32

[/] [eco32/] [tags/] [eco32-0.22/] [fp/] [implementation/] [mmix/] [mmix-pipe.w] - Diff between revs 15 and 21