Introduction to the Hardware

PERQ is implemented with a high-speed microprogrammed processor which has a 170 nanosecond microcycle time. The microinstruction is 48 bits wide. Most of the data paths in the micro engine are 20 bits wide. The data coming in and out of the processor (IO and Memory data for instance) are 16 bits wide. The extra 4 bits allow the microprogrammed processor to calculate real addresses in a 1 megaword addressing space. The assumption is that virtual addresses are kept in a doubleword in memory but calculations on addresses can be single precision within the processor. The programmer of the virtual machine never sees the 20 bit paths.

The XY registers (256 registers x 20 bits) form a double ported file of general-purpose registers. The X port outputs are multiplexed with several other sources (the AMUX) to form the A input to the ALU. The Y port outputs, multiplexed with an 8- or 16-bit constant via the BMUX, form the B input to the ALU. The ALU outputs (R) are fed back to the XY registers as well as the memory data output and memory address registers. Memory data coming from the memory is sent to the ALU via the AMUX.

The IO bus (IOB) connects the CPU to IO devices. It consists of an 8-bit address (IOA) which is driven from a microword field and a 16-bit bidirectional data bus (IOD) which is read via AMUX and written from R.

Opcodes and operands that are part of the instruction byte stream are buffered in a special 8 x 8 RAM (the OP file). The OP file is loaded 16 bits at a time from the memory data inputs. The output of the OP file is 8 bits wide and can be read via AMUX or can be sent to the micro-addressing section for opcode dispatch. The read port of the OP file is addressed by the 3-bit BPC (Byte Program Counter).

A shift matrix (SHIFT), which is part of the special hardware provided for the RasterOp operator, can be accessed by loading an item to be shifted via the R bus, and reading the shifted result on AMUX.
A 16-level push down stack (ESTK) is written from R and read on AMUX. The stack is used by the Q-code interpreter to evaluate expressions.

BPC and the microstate condition codes can be read as the Micro State Register (USTATE) via AMUX.

The Z field is used for many things: as part of a jump address, the upper 8 bits of a constant, Shift Control, and as an IOB address. The F field decodes do not necessarily enforce restrictions on the use of the Z field; they merely enable some of them. In particular, B = 1, SF = 0, and F = 0 or 3 selects a long constant using the Z field. When F <> 2, the Z field is used for a jump address. When SF = 17 and F = 0 or 3, the Z field is used for an IOB address. When F = 2, the Z field is loaded into the Shift Control register. These are the only specific actions taken by the hardware that affect the usage of the Z field. The hardware does nothing to prevent the Z field from being used for several things at once. For example, it could be used for a long constant and a jump address at the same time, or it could be used as an IO address and a jump address at the same time. The assembler, however, will flag an error if the programmer tries to load two different values into the same microinstruction field.

Constants

Constants can be 8 or 16 bits wide. If B = 1, the B input to the ALU is a constant. If F = 0 or 3 and SF = 0, the Y and Z fields form a 16 bit constant. If SF <> 0 and F <> 0 or 3 then Y is an 8 bit constant.

OldCarry

OldCarry (in ALU functions 15 and 17) is the carry from the immediately preceding microinstruction; it is used for multiple precision arithmetic.

Condition Codes

All ALU related condition codes test the result of the ALU operation from the previous micro cycle. Thus the normal sequence is to perform an ALU operation and test its result in the next microinstruction. For example, comparison of two registers A and B could be done this way:

    A - B;
    if Gtr Goto(Label);    ! Jumps if A > B

All ALU tests with the exception of C19 test the lower 16 bits of the ALU. These are intended for data comparisons. After a subtraction, these condition codes compare the two operands. After other operations, these condition codes compare the 16-bit ALU result against zero. C19 is designed for unsigned address comparisons.

Memory Control

The memory system cycles in 680 ns (exactly 4 microcycles). Microcycles are numbered starting at 0 (t0, t1, t2 and t3). Requests must be made in a particular cycle (which cycle depends on the type of request). If a memory request is made in the wrong cycle, the processor will be suspended until the correct cycle. In some contexts, however, a request made in an improper cycle will be ignored--these contexts are explained below.

There are 8 types of memory references, coded into the SF field when F = 1. In the following discussion of the memory controller, the terms "Fetch" and "Store" are used for the memory functions which fetch or store exactly one word. The generic terms "fetch type" and "store type" are used for any type of fetching or storing memory reference.

The address for all memory references comes from R. For all fetch type references, the address (and the request itself) are latched at t3 and data is available from MDI or MDX (A = 3 or 4) at t2. If MDI or MDX is used during a t0 or t1 following a fetch type memory reference, the processor is suspended until t2.

Any address may be used with a Fetch, and the memory word may be read during any cycle from t2 until the following t1.

The address for a Fetch2 must be even (double-word aligned). If it is odd, the low-order bit of the address is ignored. After a Fetch2, the first word must be read at t2, and the second word must be read at t3.

The address for a Fetch4 must be quad-word aligned. If it is not quad-word aligned, the two low-order bits are ignored. After the Fetch4, the first word must be read at t2, and the succeeding words must be read at t3, t0, and t1.
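The fetch-type timing and alignment rules above can be checked with a small model. The following Python sketch is illustrative only and is not PERQ microcode: the names `align`, `stall_cycles`, `FETCH_WORDS` and `FETCH_READ_SLOTS` are hypothetical, and the 20-bit address mask is an assumption taken from the 1 megaword address space mentioned in the introduction.

```python
MICROCYCLE_NS = 170
CYCLES_PER_MEMORY_CYCLE = 4
MEMORY_CYCLE_NS = MICROCYCLE_NS * CYCLES_PER_MEMORY_CYCLE  # 4 x 170 ns

# Number of words moved by each fetch type named in the text.
FETCH_WORDS = {"Fetch": 1, "Fetch2": 2, "Fetch4": 4}

# Microcycles (0..3 for t0..t3) in which each word must be read.
# A plain Fetch is omitted: its word may be read from t2 until the next t1.
FETCH_READ_SLOTS = {
    "Fetch2": [2, 3],
    "Fetch4": [2, 3, 0, 1],
}


def align(kind, address):
    """Return the word address the memory system would actually use.

    The low-order bit (Fetch2) or two low-order bits (Fetch4) are ignored.
    The 20-bit mask is an assumption based on the 1 megaword address space.
    """
    words = FETCH_WORDS[kind]
    return address & ~(words - 1) & 0xFFFFF


def stall_cycles(current_t, required_t):
    """Microcycles the processor waits when it must hold off until required_t.

    Applies to cases such as using MDI in t0 or t1 after a fetch type
    reference, where the processor is suspended until t2.
    """
    return (required_t - current_t) % CYCLES_PER_MEMORY_CYCLE
```

For instance, `stall_cycles(0, 2)` gives the two-microcycle suspension incurred by reading MDI in t0, and `align("Fetch4", a)` shows which quad word an unaligned address `a` falls into.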
Any address may be used with a Store. The address and Store command are given in a t2 cycle and the data to be written is supplied on R in the following t3.

The address for a Store2 must be even (double-word aligned). If it is odd, the low-order bit of the address is ignored. The address and Store2 are given in a t3 cycle, and the data is supplied on R in the following t0 and t1.

The address for a Store4 must be quad-word aligned. If it is not quad-word aligned, the two low-order bits are ignored. The address and Store4 are given in a t3 cycle, and the data is supplied in the next four cycles (t0, t1, t2 and t3).

The Fetch4R and Store4R types are identical to the Fetch4 and Store4 references except that word 3 of the quad word is received or sent first and word 0 last. (This is generally only useful for RasterOp so that it can do left to right as well as right to left transfers.)

The IO system can request memory cycles at any time. The memory system gives priority to the IO system so that if both the processor and the IO system make memory requests, the IO is served first while the processor is delayed. The Hold bit locks out IO requests while it is set. To be effective, Hold must be asserted in a t2. This is necessary only when doing overlapped memory references. See the high performance rules below.

Combinations of memory references are tricky. There are a few rules which, if followed, will never let you go wrong, but they may preclude some clever twist of microcoding to save some cycles. The simple rules are:

1) Never start a memory reference after a fetch type reference until you have taken all the data.

2) Never start a memory reference during the four instructions which follow a store type request.

The full rules are complicated, but to achieve high performance you need to consider them. The following rules define the way the memory controller treats memory requests.

1) After a Fetch or Fetch2 (in t3), any memory reference in t0 or t1 is ignored. A Store specified in the t3 will start immediately, but all others will abort until the correct time.

2) Fetch4 and Fetch4R follow the rules for Fetch and Fetch2 with the exception that a Store4 (in the same direction--forward or reverse) can be specified in t1, but this is only used for RasterOp.

3) After a Store (in t2), any memory reference in t3 or t0 is ignored. References started in t1 are aborted until the correct cycle.

4) After a Store2, Store4 or Store4R (in t3), any memory reference in t0 through t3 is ignored. Memory references started in t0 are aborted until the correct cycle.

5) To be effective, Hold must be asserted in a t2. You must be careful about aborts caused by using MDI in the wrong cycle--you may be aborted past the t2, causing the Hold to be ignored. You may not specify Hold too often--you must allow an IO reference at least once in every 3 memory cycles.

6) After a Fetch, MDI is valid from t2 through the following t1 (four full cycles). For Fetch2, Fetch4, and Fetch4R, each MDI is valid for a single microcycle.

Following these rules, we can construct many interesting overlapped memory requests. Note that in the following examples, Hold is always asserted in a t2. A Fetch ... Store sequence is an exception--you need not use Hold, but it doesn't hurt performance, so we assert it for consistency.

Opcodes and operands

The OP file contains a 4-word sequence of instruction bytes. A quad-word address is contained in an XY register (UPC), and the 8 bytes pointed to by UPC are loaded into the OP file. The lower 3 bits of the byte address (byte within the quad word) are kept in BPC, a hardware register. BPC addresses the OP file to choose a byte.

BPC is actually a 4-bit counter. It is incremented after each byte is taken out of the OP file by NextInst (JMP=6, H=0) or NextOp (A=1). The 4th bit of BPC (BPC[3]), which is the "overflow" of the counter, is testable via a jump condition and indicates that all bytes in OP have been used.
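The relationship between the OP file and BPC described above can be illustrated with a toy model. This Python sketch is a simplification under stated assumptions, not a description of the actual hardware: the class `OpFile` and its method names are hypothetical, the eight bytes are supplied already in byte-stream order (the packing of bytes within each 16-bit word is not modelled), and raising an exception stands in for the refill that microcode would perform.

```python
class OpFile:
    """Toy model of the 8-byte OP file addressed by a 4-bit BPC."""

    def __init__(self):
        self.bytes = [0] * 8
        self.bpc = 0  # 4-bit counter; bit 3 (BPC[3]) is the overflow flag

    def load(self, quad_bytes, start=0):
        """Load one quad word (8 bytes) and set BPC to a starting byte."""
        if len(quad_bytes) != 8:
            raise ValueError("a quad word holds exactly 8 bytes")
        self.bytes = [b & 0xFF for b in quad_bytes]
        self.bpc = start & 0x7  # clears BPC[3]

    def overflowed(self):
        """True when BPC[3] is set, i.e. all bytes have been used."""
        return bool(self.bpc & 0x8)

    def next_byte(self):
        """Take the byte selected by the low 3 bits of BPC, then count up."""
        if self.overflowed():
            # Assumption: the model signals the condition instead of refilling.
            raise LookupError("OP file exhausted; a refill is needed")
        value = self.bytes[self.bpc & 0x7]
        self.bpc = (self.bpc + 1) & 0xF
        return value
```

After eight calls to `next_byte()` on a freshly loaded file, `overflowed()` becomes true, which corresponds to the testable BPC[3] condition.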
The NextOp function (A=1) gets the next byte out of the instruction byte stream for use as an operand. The assembler automatically adds an "If BPC[3] GoTo(Refill)" jump clause. If BPC overflows, then control will go to Refill, which increments UPC by 4, sets BPC to 0, and starts a Fetch4 to the OP file. The special function LoadOp must be executed in the t1 after the Fetch4 to cause the OP file to be loaded with the data coming on MDI. Refill must then jump back to the instruction which needed the byte so that instruction may be re-executed.

The instruction which executes NextOp must be capable of being executed twice (once when BPC overflowed and again when it is re-executed after Refill). This precludes instructions such as "R := NextOp + R".

In order for Refill to get back to the instruction which needs to be re-executed, the address of the failed NextOp is saved in a hardware register (Victim) if NextOp is executed when BPC[3] is set. The last instruction in Refill is coded with ReviveVictim (JMP=2, H=1), which sends control back to the "failed" NextOp.

BPC can be set without re-loading the OP file, and so the current quad word can be re-read without fetching it from memory a second time.

The NextInst jump enables a byte of the OP file (which is inverted for NextInst) into the Addr input of the micro-sequencer. The inverted byte is shifted left by 2 bits and OR-ed with ZAddr, sending control to address ZAddr + (OP' * 4). If BPC[3] is true, OP is forced to 377, sending control to location ZAddr, which is another version of Refill. This version of Refill also does the Fetch4 to the OP file, zeroes BPC, increments UPC by 4, and does the LoadOp, but then repeats the NextInst instead of returning via ReviveVictim.

To speed up the execution of Refill, the LoadOp Special Function loads all 4 words via hardware. The LoadOp should be given in the t1 following the Fetch4. The instruction which follows the LoadOp can go back to the NextInst/NextOp since the first byte is guaranteed to be in. The three remaining words arrive and are placed in OP by hardware without further microcode assistance. If BPC is set to a non-zero value (to start reading in the middle of the quad word), the Refill code must wait until the correct byte is in the OP file.

Shift Control

The PERQ shifter can rotate a 16 bit item 0 to 15 places and apply a mask to the shifter outputs. To use the shift hardware, the Z field of the instruction can be coded with the type of shift to be done with the F field set to F = 2. Coding of the shift control uses two 4 bit nibbles (shift control is inverted in the Z field). The item to be shifted is placed on R, and the shifted and masked result can be read via SHIFT (A = 0) on the next instruction. The shift control logic keeps the last value loaded so that the shifter can shift a succession of words without respecifying the shift control function. The shift outputs always have the shifted value of what was last on R.

ShiftOnR

The ShiftOnR special function allows a shift function to be a variable. The shift control is obtained from the R bus and thus can be a data item. The usage sequence is:

1) Put the shift control (uninverted) on R and execute ShiftOnR,

2) Put the item to be shifted on R, and

3) Read the shifted result on SHIFT.

Expression Stack

The expression stack is used to evaluate expressions. Items are pushed on the stack by placing them on R and using the Push special function: "TOS := Data, Push". Items can be popped off the stack with the Pop special function. The top of the stack can be written without pushing or popping with the "TOS := Data" special function. The value on the top of the stack can be read at any time from TOS (A = 7). The stack is 16 levels deep. The stack can be reset (no items on the stack) by the StackReset special function. Stack empty can be read as a bit in USTATE.
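The expression stack operations just described map onto a very small data structure. The Python sketch below is a behavioural illustration only: the class `ExpressionStack` and its method names are hypothetical, and raising an error on overflow or on access to an empty stack is an assumption, since the text does not say what the hardware does in those cases.

```python
class ExpressionStack:
    """Toy model of the 16-level expression stack (ESTK)."""

    DEPTH = 16

    def __init__(self):
        self._items = []

    def push(self, data):
        """Model of "TOS := Data, Push"."""
        if len(self._items) >= self.DEPTH:
            # Assumption: the hardware behaviour on overflow is not stated.
            raise OverflowError("expression stack is 16 levels deep")
        self._items.append(data)

    def pop(self):
        """Model of the Pop special function: discard the top item."""
        if not self._items:
            raise IndexError("pop from empty expression stack")
        self._items.pop()

    def write_tos(self, data):
        """Model of "TOS := Data" without pushing or popping."""
        if not self._items:
            raise IndexError("no top of stack to write")
        self._items[-1] = data

    def tos(self):
        """Read the top of the stack, as microcode would via TOS."""
        if not self._items:
            raise IndexError("no top of stack to read")
        return self._items[-1]

    def reset(self):
        """Model of StackReset: no items on the stack."""
        self._items.clear()

    def empty(self):
        """Model of the stack-empty bit readable in USTATE."""
        return not self._items
```

An interpreter evaluating (2 + 3) * 4 with this model would push 2 and 3, read and pop the top, overwrite the new top with the sum, and repeat the pattern for the multiplication, leaving the result in `tos()`.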
Input/Output Bus

IOB is the input/output bus for PERQ. The IOB is a 16-bit bi-directional data bus plus a 7-bit address bus. The addresses are supplied on the Z field. The eighth bit of the Z field indicates the direction of transfer (1=write, 0=read). To read an IO register, set SF = 1 and F = 0 or 3. The IO register is latched in the processor such that a succeeding microinstruction can read it from IOD (A = 2). IO registers can be written by putting a data value on R, putting the appropriate address in Z, and coding the IOB special function (SF = 1, F = 0 or 3).

Jumps

A jump needing an address normally gets it from the Z field. Since Z is only 8 bits wide and the control store is 4K, another 4 bits of address are needed. Short jumps branch to a location on the same 256-word page as the current microinstruction (CIA). To go to an arbitrary location, the F field can specify long jump (F = 3) which uses the SF field for the upper 4 bits of address.

The address for jumps might not come from the Z (and SF). Other sources for jump addresses are:

1) The S register (which is internal to the micro-sequencer),

2) A five level call stack (also internal to the micro-sequencer) which is pushed for a Call and popped for a Return,

3) The current instruction address plus 1, and

4) The Victim register.

There are three jumps which are multi-way branches. The three are:

1) NextInst, which is a 256-way branch based on a byte from the OP file;

2) Dispatch, which is a 16 way (or fewer) branch on the lower 4 bits of the SHIFT outputs; and

3) Vector, which branches to 1 of 8 micro-interrupt service routines.

For all of these branches, the Z field of the micro-instruction supplies the other bits of the address. For NextInst, the resulting address is:

        O O O O O O O O
    Z Z P P P P P P P P Z Z
    7 6 7 6 5 4 3 2 1 0 1 0

which results in a 256 way branch with a spacing of 4 instructions. For Dispatch, the address is:

    Z Z Z Z Z Z S S S S Z Z
    7 6 5 4 3 2 3 2 1 0 1 0

which also results in a spacing of 4 instructions. The Vector jump uses the outputs of the micro-interrupt priority encoder (V), which determines the highest priority micro-interrupt condition. The address is:

    Z Z Z Z Z Z - V V V Z Z
    7 6 5 4 3 2 0 2 1 0 1 0

which also has a spacing of four instructions.

Interrupts

The hardware implements a microlevel interrupt which is used to allow the microprocessor to serve IO devices. There are (a maximum of) 8 interrupt requests which are assigned priorities by the hardware. When any of the interrupt requests is asserted, the branch condition IntrPend will succeed. The intended usage of this feature is that at convenient places in the microcode an instruction which has "If IntrPend Call(VecSrv)" is used. If any interrupts are pending, control will pass to VecSrv which should contain a Vector jump to send control to ZAddr + Vector*4 in the control store. The address of the interrupted instruction is on the call stack, and the interrupt microcode can serve the device and return like a subroutine would.
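The three multi-way branch address layouts given in the Jumps section can be expressed as bit manipulations. The Python sketch below is one reading of those layouts, not a verified description of the sequencer: the function names are hypothetical, the Z field bits are assumed to be used as written (uninverted), and the forcing of OP to 377 when BPC[3] is set is not modelled.

```python
def nextinst_addr(z, op_byte):
    """12-bit target for NextInst: Z7 Z6, inverted OP byte, Z1 Z0."""
    inverted = ~op_byte & 0xFF          # the OP byte is inverted for NextInst
    return ((z & 0xC0) << 4) | (inverted << 2) | (z & 0x03)


def dispatch_addr(z, shift_out):
    """12-bit target for Dispatch: Z7..Z2, low 4 SHIFT bits, Z1 Z0."""
    return ((z & 0xFC) << 4) | ((shift_out & 0x0F) << 2) | (z & 0x03)


def vector_addr(z, v):
    """12-bit target for Vector: Z7..Z2, a zero bit, V2..V0, Z1 Z0."""
    return ((z & 0xFC) << 4) | ((v & 0x07) << 2) | (z & 0x03)
```

In each case consecutive selector values land four control-store locations apart, which matches the stated spacing of 4 instructions; for example `dispatch_addr(0, 1) - dispatch_addr(0, 0)` is 4.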