Xerox Paperless Desktop: The Personal Supercomputer Nobody Wants or Needs
You have somehow made it to the incomplete third installment in a series detailing the world of my upcoming novel No Frills. As usual, I have a bad habit of drifting between the real and fake worlds, but to a first approximation, everything here is in-universe (circa 1976), unless otherwise noted. It’s a sketch of a sketch, of course. To the extent that this “processor” “architecture” has any redeeming value, it is as a non-technical concept map and literature review, as well as a hazy and obscure vision of a computer architecture that might result if you get real weird with it. I still have to detail many other things such as two-phase immersion cooling of the system-on-module with automotive antifreeze, the electron-beam-addressed phase-change random-access memory (EBAPCRAM) storage tube, the flat-screen (but not flat-panel) beam-index tube monitor, and the ergonomic input device.
The System-on-Module
The SoM is a face-to-back Cu-Cu thermocompression-bonded, back-thinned chip stack MCM with one Burroughs Starling 1000 processor and eight DRAM chips, connected using TSVs, area I/O, and a circuit-switched network-on-memory.

To keep it from going completely off the rails, Starling is subject to certain ground rules.
Silicon.
Static CMOS.
DRAM.
There probably should have been more ground rules.
The Processing Element
Triggered instructions blow my mind.
“A TIA [processing element] does not have a program counter or any notion of a static sequence of instructions. Instead, there is a limited pool of triggered instructions that are constantly bidding for execution on the datapath. This fits very naturally into a spatial programming model where each PE is statically configured with a small pool of instructions instead of streaming in a sequence of instructions from an instruction cache… there are no branch or jump instructions in the triggered ISA — every instruction in the pool is eligible for execution if its trigger conditions are met. Thus, every triggered instruction can be viewed as a multi-way branch into a number of possible states in [a finite-state machine]. With clever use of predicate registers, a TIA can be made to emulate the behavior of other control paradigms. For example, a sequential architecture can be emulated by setting up a vector of predicate registers to represent the current state in a sequence — essentially, a program counter. Predicate registers can also be used to emulate classic predication modes, branch delay slots and speculative execution. Triggered instructions is a superset of many traditional control paradigms. The cost of this generality is scheduler area and timing complexity, which imposes a restriction on the number of triggers (and thus, the number of instructions) that the hardware can monitor at all times. While this restriction would be crippling for a temporally programmed architecture, it is reasonable in a spatially-programmed framework because of the low number of instructions typically mapped to a pipeline stage in a spatial workload.”
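To make the model concrete, here is a minimal sketch of a single PE as I understand it, in Python, with field names (when, updates, and so on) of my own invention rather than the published TIA encoding. The little two-instruction program at the bottom shows the “vector of predicate registers as program counter” trick from the quote.

```python
# Minimal sketch of one triggered-instruction PE. This is my paraphrase of
# the model, not the published TIA encoding; the field names are invented.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TrigInst:
    name: str
    when: dict          # trigger: required predicate values, e.g. {"s0": True}
    action: Callable    # the state update performed on the datapath
    updates: dict       # predicate updates applied when the instruction fires

@dataclass
class PE:
    preds: dict         # architecturally visible predicate registers
    regs: dict          # small general-purpose register file
    pool: list = field(default_factory=list)   # static pool, no program counter

    def ready(self, inst):
        # A trigger is a conjunction of queries on the PE's architectural state.
        return all(self.preds[p] == v for p, v in inst.when.items())

    def step(self):
        # Every instruction in the pool bids; a priority encoder picks one.
        for inst in self.pool:
            if self.ready(inst):
                inst.action(self.regs)
                self.preds.update(inst.updates)
                return inst.name
        return None     # nothing triggered this cycle

# Emulating a two-step sequence with a one-hot 'program counter' in predicates:
pe = PE(preds={"s0": True, "s1": False}, regs={"a": 3, "b": 4, "c": 0})
pe.pool = [
    TrigInst("add", {"s0": True}, lambda r: r.update(c=r["a"] + r["b"]),
             {"s0": False, "s1": True}),
    TrigInst("dbl", {"s1": True}, lambda r: r.update(c=2 * r["c"]),
             {"s1": False}),
]
while pe.step():
    pass
assert pe.regs["c"] == 14
```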
What’s the catch?
“The ultimate success of this paradigm will be premised on overcoming a number of challenges, including providing a tractable memory model, dealing with the finite size of the spatial array, and providing a high-level programming and debugging environment.”
The TIA is best described as having intermediate rather than low complexity and efficiency, falling between a classical CGRA and an OoO superscalar. On the one hand, “triggered instructions… is more flexible in terms of control flow and total instruction capacity, but is likely higher power and area” than a CGRA, and “triggered instructions and Intel’s CSA… require much more complex non-systolic PEs (>3× higher area).” It should come as no surprise that greatly increasing the autonomy of the processing elements — and greatly expanding and diversifying the repertoire of workloads on which they have high performance — comes at an unavoidable area and power cost. On the other hand, the TIA “can achieve 8× greater area-normalized performance than a traditional general-purpose processor.” It would be wrong to think that the TIA can achieve CGRA-like area- and power-normalized performance on regular and data-parallel DSP and multimedia workloads, let alone irregular GPP workloads. What it does is achieve significantly higher area- and power-normalized performance than an OoO superscalar on GPP workloads by exploiting abundant spatial parallelism and eliminating most memory operations, like a classical CGRA. I would go so far as to say that the TIA is a ‘GPCGRA,’ in a nod to the GPGPU. The way in which reconfigurable architectures exploit spatial parallelism and eliminate memory operations comes quite close to explaining the Reconfigurable Computing Paradox, wherein “software-to-configware (software-to-FPGA) migration results in reported speed-up factors of up to more than four orders of magnitude, as well as a reduction in electricity consumption by up to almost four orders of magnitude — although the technological parameters of FPGAs are behind the Gordon Moore curve by about four orders of magnitude, and the clock frequency is substantially lower than that of microprocessors.” In principle, CGRAs can achieve even higher performance and performance-per-watt than FPGAs, because they lie in a sort of Goldilocks zone combining most of the performance of ASICs and most of the programmability of FPGAs, with processing elements that are variations on “simple, ALU-like structures,” and right-sized, word-width datapaths.
This duality and tension between performance and efficiency is perhaps best illustrated by the TIA scheduler. In some ways, it is lightweight. For instance, “the core of the TIA scheduler is essentially a few 1-bit wide trees of AND gates feeding into a priority encoder,” and the “scheduler logic does not consume dynamic power unless there is a change in predicate states.” At the same time, the inherently “parallel nature of trigger resolution” and “the need for all trigger fields to be exposed combinationally to the scheduler in order to execute at a rate of one instruction per cycle” implies high communication complexity, and quite possibly “quadratic area complexity” and “quadratic connectivity” (i.e. register-based instruction memory). My feeling is that, in pushing right up to the limit on the “generality in the useful expressions” imposed by reasonable “scheduler efficiency,” the TIA scheduler obtains a new and unusual sort of agency far beyond that of a ‘dumb’ program counter. It is not an overstatement to say that it can make “an intelligent decision based on the local state of the PE.” The way in which it can “monitor both the one-bit channel status registers and the local predicate registers” is a sort of situational awareness, the prioritization of some triggered instructions over others resembles a fuzzy weighing of options, and the evaluation of “a programmer-specified boolean expression formed from the logical conjunction of a set of queries on the PE’s architectural state” is much like reasoning or cognition. It is remarkable that such a powerful model admits such a reasonable hardware implementation. Perhaps no other processor architecture is more in the spirit of the actor model than the TIA.
It is not surprising that, despite the eye-popping area-normalized performance achieved by the TIA, careless choices of “parameterization options” could very well wipe out any complexity-effectiveness advantage over the baseline OoO superscalar and dynamic dataflow processors. Indeed, in the detailed design, the “two-stage pipelined PE has a comparable number of gate levels in the critical path to a high-performance commercial microprocessor.”
“As with any architecture, a triggered-instruction architecture is subject to a number of parameterization options and their associated cost-versus-benefit trade-offs.”
With that said, the detailed design is remarkably area- and power-efficient.
“The hardware scheduler is built from combinatorial logic — it simply is a tree of AND gates. This means that only the state equations that require re-evaluation will cause the corresponding wires in the scheduler logic to swing and consume dynamic power. In the absence of channel activity or internal state changes, the scheduler does not consume any dynamic power whatsoever. The same control equations would have been evaluated using a chain of branches in a PC-based architecture. Altogether, triggered instructions allow individual PEs to efficiently react to incoming messages, making an intelligent decision based on the local state of the PE. A TI scheduler uses a single unified scheme to monitor both the one-bit channel status registers and the local predicate registers in order to quickly and cheaply make a trigger decision… Our implementation analysis shows that the area cost of the TIA hardware scheduler is less than 2% of a PE’s overall area, much of which is occupied by its architectural state (registers, input/output channel buffers, predicates and instruction storage), datapath logic (operand multiplexers, functional units, etc.) and microarchitectural control overheads — none of which are unique to triggered control. This is not surprising — the core of the TIA scheduler is essentially a few 1-bit wide trees of AND gates feeding into a priority encoder. For our chosen parameterization, this logic is dwarfed by everything else in the PE.”
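For what it’s worth, the “few 1-bit wide trees of AND gates feeding into a priority encoder” can be caricatured in a dozen lines. The bit-packing below is purely my own illustration, not the paper’s RTL; the point is that trigger resolution is a pure function of a handful of state bits.

```python
# Trigger resolution as pure combinational logic (my illustration, not the
# paper's RTL): each trigger is an AND over selected state bits, and a
# priority encoder returns the index of the first satisfied trigger.

def trigger_ready(state_bits, care_mask, match_bits):
    """One 1-bit AND tree: every bit selected by care_mask must equal match_bits."""
    return (state_bits ^ match_bits) & care_mask == 0

def schedule(state_bits, triggers):
    """Priority encoder over the per-trigger ready bits."""
    for index, (care_mask, match_bits) in enumerate(triggers):
        if trigger_ready(state_bits, care_mask, match_bits):
            return index
    return None

# state_bits packs predicate registers and channel-status bits, LSB first.
# Trigger 0: predicate p0 set and input channel 0 non-empty (bits 0 and 2).
# Trigger 1: predicate p1 clear (bit 1).
triggers = [(0b101, 0b101), (0b010, 0b000)]
assert schedule(0b101, triggers) == 0
assert schedule(0b010, triggers) is None   # nothing fires this cycle
assert schedule(0b000, triggers) == 1
```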
The first two words I would use to describe the TIA are spooky and eerie, for two reasons. First, it seems to convincingly address many of the long-standing issues in the design of OoO superscalar architectures (sometimes called “limited dataflow”), including those with SMT.
“Breaking over-serialization can be accomplished by finding independent operations. These can be found from two sources. The first source is local parallelism in the PE’s dataflow graph, in which case computation can start as the data arrives, i.e., classical dynamic instruction reordering. The second source arises when the spatial compiler chooses to place unrelated sections of the overall algorithm dataflow graph onto a single PE, statically partitioning the registers between them and statically interleaving operations, i.e. compiler-directed multithreading… On a TI architecture we expect compiler-directed multithreading of non-rate-limiting PEs to be a common and important optimization… since a TI architecture does not impose any ordering between instructions unless explicitly specified, it can gain the ILP benefits of an out-of-order issue processor without the expensive instruction window and reorder buffer. Simultaneously, a TI machine can take advantage of multithreading without duplicating data and control state, but by the compiler partitioning resources as it sees fit. Of course there is a hardware cost associated with this benefit — the TI PE must have a scheduler that can efficiently evaluate the program’s triggers.”
The TIA, being “already parallel in its native execution model” and without “any notion of a static sequence of instructions,” must neither “repeatedly re-extract parallelism from a sequential program representation,” nor maintain “the faithful representation of a monolithic uniprocessor” despite programmers (or better yet, compilers) being able to reason about the behavior of their programs well enough without such an abstract and stylized representation. It is therefore free from the suffocating overheads and bookkeeping of an OoO superscalar. Second, the TIA seems to convincingly address many of the well-known issues in the design of dynamic dataflow architectures.
“In a dataflow-triggered PE, the token-ready bits associated with input sources are managed by the micro-architecture. The TI approach, in contrast, replaces these bits with a vector of architecturally-visible predicate registers. By specifying triggers that span multiple predicates, the programmer can choose to use these bits to indicate data readiness, but can also use them for other purposes, such as control flow decisions. In classic dataflow multiple pipeline stages are devoted to marshaling tokens, distributing tokens, and scoreboarding which instructions are ready. A ‘Wait-Match’ pipeline stage must dynamically pair incoming tokens of dual-input instructions. In contrast, the set of predicates to be updated by an instruction in the TI is encoded in the instruction itself. This both reduces scheduler implementation cost and removes the token-related pipeline stages.”
I believe that this strikes directly at the “bane of practical tagged-token dataflow processors” (the “token-related pipeline stages”), and has the potential to make the dataflow dream a reality. Unfortunately, the sweeping significance of this feat is difficult to appreciate without a lot of context.
“In the dataflow model, the operands trigger the execution of the operation to be performed on them. In other words, dataflow architectures use the availability of data to fetch instructions rather than the availability of instructions to fetch data. Unlike the von Neumann model, the dataflow model is neither based on memory structures that require inherent state transitions, nor does it depend on history sensitivity and program counter to execute a program sequentially. These properties allow the use of the model to represent maximum concurrency to the finest granularity and to facilitate dependency analysis among computations… This model offers many opportunities for parallel processing, because it has neither a program counter nor a global updatable memory, i.e., the two characteristics of the von Neumann model that inhibit parallelism… The dataflow model is self-scheduled since instruction sequencing is constrained only by data dependencies. Moreover, the model is asynchronous because program execution is driven only by the availability of the operands at the inputs to the functional units. Specifically, the firing rule states that an instruction is enabled as soon as its corresponding operands are present, and executed when hardware resources are available. If several instructions become fireable at the same time, they can be executed in parallel. This simple principle provides the potential for massive parallel execution at the instruction level. Thus, dataflow architectures implicitly manage complex tasks such as processor load balancing, synchronization, and accesses to common resources… Unlike control flow programs, binaries compiled for a dataflow machine explicitly contain the data dependency information… The dataflow model has the potential to be an elegant execution paradigm with the ability to exploit inherent parallelism available in applications. However, implementations of the model have failed to deliver the promised performance due to inherent inefficiencies and limitations. One reason for this is that the static dataflow is unable to effectively uncover large amount of parallelism in typical programs. On the other hand, dynamic dataflow architectures are limited by prohibitive costs linked to associative tag lookups, in terms of latency, silicon area, and power consumption… Another significant problem is that dataflow architectures are notoriously difficult to program because they rely on specialized dataflow and functional languages. Dataflow languages are required in order to produce large dataflow graphs that expose as much parallelism as possible to the underlying architecture. However, these languages have no notion of explicit computation state, which limits the ability to manage data structures (e.g. arrays). To overcome these limitations, some dataflow systems include specialized storage mechanisms, such as the I-structure, which preserve the single assignment property. Nevertheless, these storage structures are far from generic and their dynamic management complicates the design… Hybrid models combine single assignment semantics, inherent to dataflow, with consistent memory models that support external side effects in the form of load/store operations. This relieves one of the biggest (if not the biggest) restriction of pure dataflow programming: the inability to support a shared state, and specifically shared data structures.”
“A typical dataflow instruction cycle involves: detecting enabled nodes; determining the operation to be performed; computing the results; and generating and communicating result tokens to appropriate target nodes. Even though the characteristics of this instruction cycle may be consistent with the pure-dataflow model, it incurs more overhead than its control-flow counterpart. For example, matching tokens is more complex than simply incrementing a program counter. The generation and communication of result tokens imposes inordinate overhead compared with simply writing the result to a memory location or a register. These inefficiencies in the pure-dataflow model tend to degrade performance under a low degree of parallelism. Another formidable problem is the management of data structures. The dataflow functionality principle implies that all operations are side-effect-free; that is, when a scalar operation is performed, new tokens are generated after the input tokens have been consumed. However, absence of side effects implies that if tokens are allowed to carry vectors, arrays, or other complex structures, an operation on a structure element must result in an entirely new structure. Although this solution is theoretically acceptable, in practice it creates excessive overhead at the level of system performance.”
“Programs are expressed in a declarative language called Id, which specifies only the data dependencies between sub-computations, allowing for maximally parallel execution… Moreover, programs expressed in this form exhibit a great deal of parallelism, even using conventional algorithms. A parallel program invariably performs more work than a sequential one for the same problem due to generation and synchronization of parallel activities… Traditionally a program is specified by sequence, i.e. a total order, of operations against the machine state. In many cases this over-specifies the program, because it may be permissible for a pair of operations to be performed in any order, yet there is no way to express this… Since the program is a graph, rather than a sequence, the locus of control is a collection of instructions, rather than a single point in the sequence… The matching operation is the basic means of synchronizing independent activities… Normally, instruction pipelining is used to overlap the execution of consecutive, yet independent, operations in a single instruction sequence… In a few machines, instruction pipelining is used to overlap independent sequences, which may represent distinct tasks. The dataflow pipeline serves both roles, as no distinction is made between these forms of parallelism, and a very high pipeline utilization can be achieved… Thus, a uniform synchronization mechanism is used for all levels of parallel activity… The wait-match stage of the instruction pipeline is fairly complex, but it is complex on conventional pipelined machines as well when hazard resolution and pipeline scheduling are accounted for… There is no branch instruction, since there is no single locus of control… The overhead in dataflow execution is directly attributable to asynchronous parallel execution and is typically of the same magnitude as the basic work.”
“Dataflow… offers a uniform synchronization paradigm, representing one end of a spectrum wherein the unit of scheduling is a single instruction. At the opposite extreme are the von Neumann architectures which schedule on a task, or process, basis… Compiler technology coupled with simple pipeline design is now used effectively to cover bounded memory latency in uniprocessors. Unfortunately, the situation is qualitatively different for multiprocessors, where large and often unpredictable latencies in memory and communications systems cannot be tolerated by using similar techniques. This is attributable at the architectural level to poor support for inexpensive dynamic synchronization. Specifically, latency cost is incurred on a per-instruction basis, but synchronization on a per-instruction basis is impractical. A scalable, general-purpose multiprocessor architecture must address these issues… Dataflow architectures offer such synchronization at a per-instruction level as the normal modus operandi. Each instance of an instruction can be thought of as an independent task with specific dependence requirements which must be satisfied prior to initiation. A dataflow machine provides the mechanism to detect efficiently the satisfaction of these requirements and to process all such enabled tasks. Given the means to express a computation as a graph of such interdependent tasks, one has a natural means for executing, at any instant of time, all and only those instructions for which the synchronization constraints have been satisfied. To the extent that the number of such candidate instructions exceeds by some suitable amount the instantaneous capacity of the machine to process them, latencies inherent in the physical machine can be hidden, or masked… Synchronization is the time-coordination of the activities within a computation. In a von Neumann machine, instruction execution is implicitly synchronized by the dispatching circuitry, using a program counter for sequential instruction initiation. In a parallel computer, a qualitatively different kind of synchronization is required between tasks due to their assumed mutual asynchrony… Rather than using a program counter to determine the next instruction to be executed and then fetching operands accordingly, a dataflow machine provides a low-level synchronization mechanism in the form of a Waiting-Matching store which dispatches only those instructions for which data are already available… The notion of a non-blocking, context-interleaving pipeline is a two-edged sword. It provides the ultimate flexibility in context switching, but implies that the notion of locality must be reconsidered. Again by intuition, given a full dataflow pipeline of depth n, the working set size for instruction and operand caches must be on the order of the sum of the n working set sizes for the threads of computation which coexist in the pipe. This also implies that the time to execute the instructions in a graph’s critical path is n times the critical path length… Starting with the observation that the costs associated with dataflow instruction sequencing in many instances are excessive, others have suggested that dataflow ideas should be used only at the inter-procedural level thereby avoiding dataflow inefficiencies while seemingly retaining certain advantages. This view is almost correct, but ignores the importance of the fundamental issues. 
Restricting architectures to this ‘macro dataflow’ concept would amount to giving up what is possibly a dataflow machine’s biggest feature — the ability to context switch efficiently at a low level to cover memory latency… In general, on a machine capable of supporting multiple simultaneous threads of computation, executing programs expressed as a total ordering of instructions will incur more latency cost than will executing a logically equivalent partial ordering of the same instructions… Dataflow program graphs are a powerful and elegant intermediate representation for programs. They are largely architecture-independent, and have a precise semantics… the partial order represented by a dataflow graph captures all and only the necessary inter-instruction dependencies.”
“Early advocates of data-driven parallel computers had grand visions of plentiful computing power provided by machines that were based on simple architectural principles and that were easy to program, maintain, and extend… If, during a significant part of a computation, a major part of the processing power is not engaged in useful computation, we speak of underutilization. A useful measure of the quality of a parallel computer is its effective utilization, that is, utilization corrected for overhead. The best one can hope for is that the effective utilization of a parallel computer approaches that of a well-designed sequential computer… In many areas that have great needs for processing power, the behavior of algorithms is irregular and highly dependent on the input data, making it necessary to perform scheduling at run time. This calls for asynchronous machines in which computations are free to follow their own instruction stream with little interference from other computations… Computations are seldom completely independent, however, and at the points where interaction occurs they need to be synchronized by some special mechanism. This synchronization overhead is the price to be paid for the higher utilization allowed by asynchronous operation.”
“Across the diverse range of multiprocessor architectures, from small collections of supercomputers to thousands of synchronous single-bit processors, all seem to share one undesirable property: they are hard to use. Programming has taken a giant step backwards. The application writer must consider detailed and hard-to-measure interactions of the program with the machine; the resulting codes are difficult to debug, are of questionable reliability, and are far from portable. Even after arduous work in converting an application for parallel execution, the actual improvement in performance is frequently disappointing. Perhaps we do not yet understand how to express parallelism in a way that is machine-independent. Perhaps we need more sophisticated compilers and associated debuggers to better exploit a machine’s parallelism. Perhaps we are building the wrong machines… There needs to be a fundamental change in processor architecture before we can expect significant progress to be made in the use of multiprocessors. The nature of this change is the deep integration of synchronization into the processor instruction set. The instruction set must constitute a parallel machine language, in which parallel activities are coordinated as efficiently as instructions are scheduled… The instruction set of a dataflow machine forms such a parallel machine language. An instruction is executed only when all of its required operands are available. Thus, low-level synchronization is performed for every instruction and at the same rate as instructions are issued. It is easy for a compiler to use this fine-grain synchronization to produce code which is highly parallel but deterministic for any runtime mapping of instructions onto processors. While there is consensus that dataflow does expose the maximum amount of parallelism, there is considerable debate surrounding efficiency of the execution mechanism… In the von Neumann model, the operands for an instruction are assumed to be available when the program counter points to the instruction. In a dataflow machine an instruction is asynchronously scheduled only when its operands have been produced by some other parallel activities, so a dataflow machine must have an operand matching mechanism for detecting when an instruction has its required operands… The more finely a program is divided into tasks, the greater the opportunity for parallel execution. However, there is a commensurate increase in the frequency of inter-task communication and associated synchronization. So exposing more parallelism, by way of identifying more tasks, does not obviously make the program run faster. In fact, we claim that it is largely unprofitable to expose most of the latent parallelism in programs unless synchronization and task management are as efficient as the scheduling of primitive operations like add and multiply… We think there is something fundamentally wrong — why should the task overhead be so extraordinarily large? Is it an inherent aspect of parallel computation or an artifact of the processor architecture used to construct the multiprocessor? We believe the latter… If the task overhead were essentially zero, more parallelism would be exposed and exploited and compiling would be far easier… By ‘low overhead’ we mean on the order of one to ten instructions, basically orders of magnitude better than contemporary multiprocessors… In this ideal world, task overhead is not a first-order issue.
Instead, the objective is to expose the maximum amount of parallelism by subdividing the computation as finely as possible… For dataflow machines to be ultimately successful, single-processor performance will need to be on a par with well-engineered von Neumann machines… A dataflow graph imposes only a partial order of execution; all execution orders which obey the data dependencies in the graph yield the same result. This is the source of parallelism in dataflow graphs — a valid execution is to process in parallel all instructions which have their required operands. In contrast, a sequential processor imposes a total order on all operations within a task; two operations can occur in parallel only when it can be proved that doing so does not change the sequential semantics… The implementation of the token matching function is the bane of practical tagged-token dataflow processors. Tagged-token dataflow execution requires a token to be checked against all other waiting tokens for a match. If a match is not found then the token must be remembered by the waiting token storage. If a match is found then the matching token must be extracted from the waiting token storage. The matching function and associated storage is most often realized with a large hashtable and sophisticated insertion and extraction algorithms. The resulting hardware is highly complex, slow and is difficult, if not impossible, to fully pipeline… The programming cost for the effective utilization of a multiprocessor can be enormous. It may often be the case that the program cannot be made to run faster no matter how much programming effort is invested… The promise of dataflow machines is that they will be far easier to use. The hardware-enforced fine grain synchronization simultaneously makes compiling easier while exposing the maximum amount of parallelism.”
The parallels between the TIA and dynamic dataflow architectures go far beyond the lack of branch instructions and program counters. The way in which dynamic dataflow “captures… only the necessary inter-instruction dependencies,” “provides the mechanism to detect efficiently the satisfaction” of the “specific dependence requirements which must be satisfied prior to initiation” of an instruction, and provides “a natural means for executing, at any instant of time, all and only those instructions for which the synchronization constraints have been satisfied” immediately calls to mind the way in which the TIA “does not impose any ordering between instructions unless explicitly specified,” provides “a scheduler that can efficiently evaluate the program’s triggers,” and has a “limited pool of triggered instructions that are constantly bidding for execution on the datapath” such that “every instruction in the pool is eligible for execution if its trigger conditions are met.” The essential difference is that the TIA does an end-run around the complexity-ineffective “implementation of the token matching function” that causes the “overhead in dataflow execution” to be “of the same magnitude as the basic work,” and makes it “largely unprofitable to expose most of the latent parallelism in programs.” The “core of the TIA scheduler is essentially a few 1-bit wide trees of AND gates feeding into a priority encoder,” and yet “clever use of predicate registers” allow it to not only emulate dynamic dataflow, but actually implement it at a lower “scheduler implementation cost,” and without any “token-related pipeline stages.” This raises the possibility of truly inexpensive dynamic “synchronization on a per-instruction basis,” “synchronization and task management… as efficient as the scheduling of primitive operations like add and multiply,” the coordination of “parallel activities… as efficiently as instructions are scheduled,” and a truly efficient means to detect the satisfaction of “dependence requirements,” such that “exposing more parallelism, by way of identifying more tasks,” will almost always “make the program run faster.” I would go so far as to say that the beginning of truly practical general-purpose dataflow architectures is to be dated from the TIA.
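To put that contrast in the plainest terms I can manage, here is a toy, side-by-side rendering (names and data structures entirely mine) of the two synchronization mechanisms: the associative wait-match store of a tagged-token machine, versus a triggered PE whose producing instructions statically name the predicate bits they set.

```python
# Toy contrast (illustrative only) between dynamic-dataflow token matching
# and triggered-instruction predicate updates.

# Tagged-token dataflow: every arriving token is checked associatively
# against all waiting tokens; the wait-match store is the expensive part.
waiting = {}                                  # tag -> first operand

def arrive(tag, value, fire):
    if tag in waiting:                        # match found: extract and fire
        fire(waiting.pop(tag), value)
    else:                                     # no match: remember the token
        waiting[tag] = value

arrive(("add", 7), 3, fire=lambda a, b: print(a + b))   # no partner yet: waits
arrive(("add", 7), 4, fire=lambda a, b: print(a + b))   # matches, prints 7

# Triggered instructions: the producing instruction statically names the
# predicate(s) it sets, so "operands ready" is just a bit test; there are
# no tags, no associative lookup, and no wait-match pipeline stage.
preds = {"lhs_ready": False, "rhs_ready": False}

def produce_lhs():  preds["lhs_ready"] = True     # encoded in the instruction
def produce_rhs():  preds["rhs_ready"] = True

def add_trigger():  return preds["lhs_ready"] and preds["rhs_ready"]

produce_lhs(); produce_rhs()
assert add_trigger()          # the add instruction is now eligible to fire
```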
In addition, I suspect that, because “triggered instructions is a superset of many traditional control paradigms,” and “both control and data dependencies in TIA are converted into predicate dependencies” and “handled in the same way,” the TIA is a true “unified-hybrid architecture” that can execute “both dataflow and control flow scheduling since both models are intimately bound in the architecture,” and efficiently handle “the complete functionality of the system, containing both control and data flow.” As the alternate history has it, Burroughs released Chimera, a “fluent programming language” with an “effect-coeffect system,” together with the Starling processor in 1973.
“Functional languages have severe difficulties dealing with arrays efficiently. In a fluent language, on the other hand, arrays created by an imperative subprogram can be passed to and manipulated by functional subprograms… In a fluent… language programmers may freely mix functional and imperative programming, subject to the sublanguage constraints. These constraints, which restrict the permissible side effects of expressions, are enforced by a process similar to type checking.” “Much like a type system tracks the values that may be stored in particular variables, an effect system tracks the side effects that may be produced by particular functions, allowing the programmer to reason about mutation and side effects more formally than in a traditional imperative language… languages with effect systems simply let the programmer denote certain procedures as having imperative semantics. More powerful effect systems allow tighter control over what effects are allowed. Such a language can be seen as having separate declarative and imperative sublanguages…”
“Effects and coeffects… roughly correspond to computations which change the execution context versus computations which make demands on the context. Effectful features include partiality, non-determinism, I/O, state, and exceptions. Coeffectful features include resource demands, variable access, notions of linearity, and data input requirements. The effectful or coeffectful behavior of a program can be captured and described via type-based analyses… A pure program can be described as a mathematical object that is isolated from the real world. However, even in the most abstract setting, a program is hardly isolated. Instead it interacts with its evaluation context… Indeed, basic operations like I/O are inconceivable in a program that runs in isolation… programs make both demands on, and produce changes to, the context; computations are often both effectful and coeffectful… Moreover, changes to the evaluation context may depend on program requirements, and vice versa. That is, coeffects and effects may interact.”
The checking of effect-coeffect safety provides a solid foundation for object-relational databases and application frameworks that resolve the object-relational impedance mismatch. It would almost seem that the modeling of computations as “rules composed of actions (state transitions) and guards (Boolean expressions that describe when a certain action is legal to apply)” is more abstract and general than synchronous control flow and asynchronous dataflow, and in fact, subsumes them into a single model of computation with the power and richness to faithfully capture effect-coeffect interactions. It’s a stretch, but I would liken this to how Routhian mechanics is a “hybrid formulation of Lagrangian mechanics and Hamiltonian mechanics” in which a sort of partial Legendre transform “has the consequence that the Routhian equations are exactly the Hamiltonian equations for some coordinates and corresponding momenta, and the Lagrangian equations for the rest of the coordinates and their velocities.”
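For the curious, the partial Legendre transform in question takes the standard textbook form below (in my notation): only the ζ coordinates and their conjugate momenta are transformed, and the remaining q coordinates keep equations of Lagrangian form.

```latex
% Routhian mechanics as a partial Legendre transform (textbook form, my
% notation): transform the \zeta coordinates, leave the q coordinates alone.
\begin{align*}
  p_i &= \frac{\partial L}{\partial \dot{\zeta}^i},
  &
  R(q, \dot{q}, \zeta, p) &= \sum_i p_i\,\dot{\zeta}^i - L \\
  % Lagrange's equations survive, unchanged in form, for the q^j ...
  \frac{\mathrm{d}}{\mathrm{d}t}\,\frac{\partial R}{\partial \dot{q}^j}
    - \frac{\partial R}{\partial q^j} &= 0,
  % ... while Hamilton's equations appear for the (\zeta^i, p_i) pairs.
  &
  \dot{\zeta}^i = \frac{\partial R}{\partial p_i},
  \quad
  \dot{p}_i &= -\frac{\partial R}{\partial \zeta^i}
\end{align*}
```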
It’s important to emphasize that the TIA is an obligate “spatially-programmed architecture,” with “a fabric of hundreds of small processing elements connected directly via an [on-chip network] which is exposed directly to the ISA.”
“The triggered control model is feasible within a spatially-programmed environment because the amount of static instruction state that must be maintained in each PE is small, allowing for inexpensive implementation of a triggered-instruction hardware scheduler.”
The extremely aggressive reuse of data and “reduction in memory operations” should not be forgotten.
“Spatial programming is a paradigm whereby an algorithm’s dataflow graph is broken into regions, which are connected by producer-consumer relationships. Input data is then streamed through this pipelined graph… Intermediate data is passed to the next datapath rather than being kept locally. Therefore no single PE will execute an entire loop body — essentially the program has been transposed between time and space… The loads and stores that a traditional CPU must use to overcome its relatively small register file are replaced by direct PE-to-PE communication. This reduction in memory operations is critical in understanding the benefits of spatial programming… Direct communication uses roughly [95%] lower power than communication through an L1 cache, as the overheads of tag matching, load-store queue search, address translation, and large data array read are removed. Cache coherence overheads, including network traffic and latency are likewise removed. Reduced memory traffic lowers cache pressure, which in turn increases effective memory bandwidth for remaining traffic.”
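Here is a cartoon of that transposition between time and space, using a made-up three-region dataflow graph: each generator stands in for a region of PEs, and the values flowing between them stand in for direct PE-to-PE channels where a conventional CPU would have gone through the register file or the L1 cache.

```python
# Cartoon of spatial programming (made-up example): a loop body split into
# three producer-consumer regions, each standing in for a PE (or group of
# PEs), with intermediate values passed PE-to-PE instead of through memory.

def scale(xs, k):                 # region 1: multiply
    for x in xs:
        yield k * x               # handed straight to the next region

def offset(xs, b):                # region 2: add
    for x in xs:
        yield x + b

def clamp(xs, lo, hi):            # region 3: saturate
    for x in xs:
        yield min(max(x, lo), hi)

# Input data is streamed through the pipelined graph; no region ever
# writes an intermediate result to a register file or cache.
stream = clamp(offset(scale(range(8), k=3), b=-5), lo=0, hi=10)
assert list(stream) == [0, 0, 1, 4, 7, 10, 10, 10]
```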
I would argue that the TIA has a certain naturalness, and that one might naturally arrive at it by taking four things very seriously. First, the total ordering of instructions is anathema to masking or burying the unpredictable latencies of any realistic, multitasking runtime environment — even those for which the machine code has been statically specialized. Second, the wire-delay-limited regime demands that the embedding of processing elements (or other hardware resources) into physical space be architecturally visible, so that the optimizing compiler can model “the communication costs of the target network accurately,” identify “the spatial and temporal coordinates of every node and arc in the control-dataflow graph,” and extend placement and routing techniques from the design of hardware to the design of complete “hybrid hardware/software” systems. Third, the topology of the on-chip network must be lattice-like, with all of the symmetries that this implies, so that the “functional correctness” of a computation is invariant under many remappings (translations, rotations, reflections, and so on) of the CDFG to the compute fabric, the machine code can run natively on processors with any number of processing elements without first being recompiled, and the runtime system has a great deal of latitude to maximize resource utilization by opportunistically placing modules on the fabric. Fourth, the distance spectrum of the lattice-like OCN topology should be quasi-continuous, so that the “distances between processors to which parallel tasks are assigned” can be roughly proportional to the widely varying grain sizes of the “inherent parallelism available in applications.”
The Datapath
As far as I can tell, the innovations of the TIA are largely confined to the front ends of the processing elements. As revolutionary as the TIA is, the examples in the literature invariably have generic RISC-like instruction set architectures — as well they should, to abstract away irrelevant details of the back ends, allow the most direct comparisons with program counter architectures possible, and not fix what isn’t broken. The TIA would, however, seem to be compatible with just about any conceivable ISA, so long as the “set of operations and state updates that can be performed by the datapath (instructions)” are suitably “bounded in complexity,” and admit an efficient hardware implementation. In my opinion, this orthogonality between the front and back end of the TIA processing element, characterized by a neat and clean separation of concerns (the evaluation of “the triggers of instructions and the instructions that actualize a logic conjunction of predicates are implicitly accomplished on the scheduler, rather than on the datapath”), opens the door to a wide variety of unconventional ISAs.
The possibility that I find most interesting is the Transport-Triggered Architecture. In a TTA, transports of operands trigger operations thereon, and not the other way around.
“Central to the idea of transport triggering is to have more control about what is going on within a central processing unit… In order to get more control, the data transport has to be separated from the operations… An interesting feature of a TTA is the fact that bypassing (or forwarding) of data is programmed, instead of relying on expensive associative bypassing hardware… Use of the transport triggering concept enables many compiler optimizations at the data transport level.”
The TTA seems to have originated as a means to keep the growth of multi-ported register file area and power consumption in check as the issue width of VLIW DSPs was increased, in order to exploit the abundant ILP characteristic of DSP workloads.
“Data transport programming provides additional programmer freedom on top of the software controlled bypasses: the possibility to control the timing of the function unit operand and result data transfers. This helps reducing the RF pressure further as there is often slack in the programs: it is not necessary to move the results to the next consumer or the register file immediately when they are ready. Thus, while the software bypassing capability alone reduces the number of accesses to the RF and the register pressure, the additional freedom of transport programming can simplify the number of RF ports required to serve multiple functions units even further.”
In order for this to work, it is necessary to pull off a sort of lateral move in design space, in that more instruction bits are necessary for software to drive the bypass network, but fewer instruction bits are necessary to address the downsized register file, and there is little to no net effect on code density and instruction bandwidth.
One of the few commercial TTA processors (or the only such processor) is the Maxim MAXQ, so named because it is designed to inject low levels of noise into “the sensitive analog section” of systems “integrating complex analog circuitry with high-performance digital blocks.” The MAXQ underlines how a TTA is, almost by definition, a one-instruction set computer with only a ᴍᴏᴠᴇ instruction, resulting in a dramatic simplification of instruction decoding with respect to the RISC baseline.
“The instruction word contains modularly grouped source and destination operands, which allow simple and fast instruction-decoding hardware and limit signal switching for those modules not involved in the transfer, thus reducing dynamic power consumption and noise. The instruction word uses its full 16 bits to specify source and destination operands, producing an abundant address space for physical registers, indirect memory access, and hardware-triggered operations.” “The MAXQ instruction decode and execution hardware is so simple (and timing so fast) that these operations are moved into the same clock cycle as the program fetch itself, with minimal impact to the maximum operating frequency… The MAXQ architecture distinguishes itself from other 8-bit and 16-bit RISC microcontrollers by offering single-cycle execution without an instruction pipeline (and the wasted clock cycles that accompany it).” “The 16-bit processor implements the instruction fetch, decode and execution done in a single pipeline stage. That is, the core is optimized for branch-heavy code, and the transport programming helps in simplifying the processor structure in order to provide the single-cycle instruction pipeline to reduce branch penalties.”
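The transport-triggering idea itself fits in a few lines. The toy below is my own, and is not the MAXQ register map or any real TTA port model: the only “instruction” is a move, writing an operand port merely latches a value, and writing the trigger port is what fires the operation.

```python
# Toy transport-triggered 'machine' (my invention, not the MAXQ port map):
# the only instruction is a move between ports; writing a function unit's
# trigger port is what fires the operation.

class AddUnit:
    def __init__(self):
        self.operand = 0          # plain operand port: writing it just latches
        self.result = 0           # result port: read back by later moves
    def trigger(self, value):     # trigger port: writing it fires the add
        self.result = self.operand + value

regs = {"r1": 3, "r2": 4, "r3": 0}
adder = AddUnit()

# r3 = r1 + r2, expressed purely as data transports:
program = [
    ("r1", "add.operand"),        # move r1 -> operand port
    ("r2", "add.trigger"),        # move r2 -> trigger port (fires the add)
    ("add.result", "r3"),         # move result port -> r3
]

def read(src):
    return adder.result if src == "add.result" else regs[src]

def write(dst, value):
    if dst == "add.operand":
        adder.operand = value
    elif dst == "add.trigger":
        adder.trigger(value)
    else:
        regs[dst] = value

for src, dst in program:          # the whole ISA is this one move loop
    write(dst, read(src))
assert regs["r3"] == 7
```

Note that the bypassing and the operation timing are entirely under the control of whoever ordered the moves, which is the whole point of the paradigm.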
The logical endpoint of the TTA may very well be “operand registers,” in which functional unit input and output ports are upgraded to full-blown registers.
“Operand registers extend a conventional register organization with small, distributed sets of inexpensive general-purpose registers, each of which is integrated with a single function unit in the execute stage. The shallow, inexpensive register files that implement the operand registers effectively extend the pipeline registers that precede the function units into small register files, preserving the low access energy of the pipeline registers while satisfying a greater fraction of operand references. This lets software capture short-term reuse and instruction-level producer-consumer locality near the function units. Explicit operand forwarding lets software control the routing of operands through the forwarding network so that ephemeral values need not be written to registers. Both mechanisms increase software control over the movement of operands between function units and register files.”
The Starling processing elements have these operand registers, and exposed datapaths more generally. In the baseline TIA, the processing elements “are the ‘endpoints’ of producer/consumer relationships,” the ISA exposes “direct communication channels between PEs as data registers and status bits,” and each “channel has three components — data, a tag and a status predicate that reflects whether an input channel is empty or an output channel is full.” In contrast, the exposed-datapath Starling TIA has not only inter-PE communication channels, but also intra-PE communication channels such that functional units (rather than coarser-grained PEs) are “the ‘endpoints’ of producer/consumer relationships,” because some or all of the operand registers have “data, a tag and a status predicate” that are architecturally visible, and can be queried or inspected by the TIA scheduler. The Starling PE is therefore ‘reflective’ and capable of ‘introspection,’ can “efficiently react” to endogenous as well as exogenous events, and has what might be described (confusingly) as a Triggered Transport-Triggered Architecture, or TTTA. Having more “architectural state elements” to query or inspect each cycle will obviously increase the area and power cost of the TIA scheduler. The idea is that the operand registers have a lower area and power cost than the corresponding “centralized register file,” and furthermore, new and unusual optimizations of pipelining and buffering become possible.
The 1-bit and 64-bit on-chip networks are “in essence 2D bypass networks serving as bridges between the bypass networks of separate tiles” along the lines of the MIT RAW processor. The combination of architecturally-visible operand registers and the “chip-edge to chip-edge” “common bypass network” is the key to the implementation of the OCNs with neither dedicated “input/output channel buffers,” nor classical “bufferless routing.” My observation (for what it’s worth) is that, by exploiting “slack” in a computer program — that is, when operands are off the critical path, it is not necessary to transport them as soon as they are produced, consume them as soon as they are transported, or both — operand registers are “used to temporarily buffer results,” and if the distinction between inter- and intra-PE data movement is largely erased, then operand registers can perform much the same role as “input/output channel buffers” in the baseline TIA. The leveling of demand for “operand register bandwidth” (and steadying of the flow of data through the OCNs) is therefore the responsibility of the compiler. Furthermore, as it is generally true that “not all the functional units are used in every cycle of the program’s execution,” idle functional units can still perform some useful work by buffering operands — including even those on which they do not operate — on behalf of busy functional units. “Buffers… consume significant energy/power: dynamic energy when read/written and static energy even when they are not occupied… having buffers increases the complexity of the network design because logic needs to be implemented to place packets into and out of buffers… buffers can consume significant chip area.” This downsizing of the processing element improves both serial and parallel performance by “keeping each tile small to maximize the number of tiles that can fit on a chip, increasing the chip’s achievable clock speed and the amount of parallelism it can exploit.”
Note: In Starling v1.1, I need to get this part of the architecture under control. Although a transport-triggered architecture may simplify instruction decoding as in the MAXQ (and free up cycle time for the X₁ stage), it’s important to point out that the baseline TIA PE is scalar, with just four general-purpose registers, two RF read ports, and one RF write port — there is absolutely no pressing need to relieve register pressure here (with software bypassing, for example). Indeed, the RF has a lower area and power cost than the instruction memory and ALU. I do still think that the TIA is somewhat insensitive to low code density, and that a TTA-like ISA could allow better packing of instruction effects and utilization of datapath resources, but it’s important to make that point clear. This approach to bufferless routing also strikes me as especially vague, in retrospect. I’m making a very strong assumption that there are almost never “datapath resource constraints and conflicts” between instructions that do useful work, and instructions that juggle data between operand registers and shared registers (and thereby push and pull data through the OCN). I cannot, however, point to any reason why this would be the case — for all I know, this would cause an explosion in the number of tag comparisons in each cycle, or run into some other showstopping problem. I am married to bufferless routing (the area and power savings are irresistible), but I now think that it will end up looking like Clumsy Flow Control, which is unusual in that it is both “credit-based” and bufferless. I suspect that, if I can do another pass on the TIA literature with an eye to latency-insensitive channels and “reverse credit flows,” and try to find some common ground with CFC, then I might have a shot at imagining a halfway reasonable approach to a bufferless OCN, without straying too far from the baseline TIA.
It’s important to note that the 1-bit registers to which the 1-bit OCN is mapped — that is, 1-bit registers which are write-only with respect to the sending PE, and read-only with respect to the receiving PE — are much like shared predicate registers. These registers are “explicitly managed by the program,” and because they “do not have any special semantic meaning,” “the programmer can use them in a variety of ways.” As I imagine them, these shared predicate registers are a platform for flexible “register-based communication and synchronization” generalizing that in the MIT MAP processor. In addition to “offloading control and predicate signals” from the 64-bit OCN, the 1-bit OCN could cheaply “use thick wires to send signals to greater distances within a single cycle” than the 64-bit OCN and reduce the RC “delays involved in synchronization,” allowing control edges to be mapped to 1-bit “inter-PE datapaths” between non-adjacent PEs.
In light of the “spatial programming model where each PE is statically configured with a small pool of instructions instead of streaming in a sequence of instructions from an instruction cache,” the performance of the TIA may be uniquely insensitive to the low code density (i.e. highly expressive and relatively unencoded instructions) typical of most TTAs. The cost of each instruction fetch is amortized over hundreds or thousands or millions of executions, dramatically reducing the instruction bandwidth. The length of instruction words will be limited not by contention for memory bandwidth, but rather by the area and power consumption of the register-based instruction memory. More fancifully, the configuration of the TIA CGRA with instructions, predicates, and triggers might be thought of as a sort of quasi-static (rather than dynamic) instruction decoding, with the same effect of compressing the instruction bandwidth.
The Pipeline
Although the non-pipelined TIA has surprisingly good performance, it will almost certainly be necessary to pipeline the processing elements.
“Updating the predicate state rapidly is important to keep the PE making forward progress… Area is dominated by ALU followed by instruction memory. Power is mostly proportionate to area, excepting the instruction memory which has relatively higher power and the ALU relatively lower power… due to the capacitance of the clock tree of the large sequential instruction memory… The benefits to pipelining in terms of [ILP], timing closure and increased power efficiency through voltage scaling are well understood and thus desirable in most microarchitectures. It is also well-known that pipelines present a wide variety of hazards necessitating control logic that can help mitigate them… A pipelined, triggered microarchitecture is susceptible to the same data hazards as a conventional pipeline, be they register operand dependences or dependences across stages of a pipelined functional unit. On top of this, triggered control introduces unconventional control hazards. With a PC, a processor must compute the next PC in order to fetch and begin executing the next instruction. The next PC is either the result of a predictable update function… or a more complicated branch condition and target resolution calculation. By contrast, in a triggered architecture, program control advances via updates to the predicate and queue states. To trigger the next instruction, one has two options: wait for the preceding instruction to complete its updates to the predicate and queue state, or speculate on the upcoming predicate and queue state and trigger a speculative instruction. The deeper the pipeline, the more critical it becomes to resolve or predict the upcoming states to keep the pipeline full… Queue hazards arise when an in-flight instruction might alter the state of an input or output queue, thus preventing the scheduler from identifying and launching the next instruction. Because queue state has a direct impact on both control flow and data movement, queue hazards straddle the traditional control/data hazard dichotomy.
They resemble control hazards in that they affect instruction execution eligibility and data hazards in that the processor must prevent dequeues from empty inputs or enqueues to full outputs… such hazards may be dealt with more effectively and efficiently by pipeline register inspection and augmenting the instruction dispatch with some simple accounting… An efficient interface to determine the state of input and output queues is especially important here because of how spatial architectures encourage producer-consumer relationships between PEs… To describe the various pipelines, we have divided the work of a processing element into three conceptual stages: trigger selects and fetches an instruction from memory; decode retrieves any operands from the register files and input queues and performs the necessary operand forwarding; and execute, optionally split into X₁ and X₂, performs any arithmetic or logical operations and writes results to the register file or output queues… De-queueing from the inputs in the same cycle as the trigger resolution proved to be a long critical path… A unique aspect of this scheme is that when a program assigns certain semantic significance to particular predicates, writing to them only to represent certain binary decisions, this bank of predictors becomes a per-branch predictor without the traditional overhead of indexing a bank of predictors via the instruction pointer… the single-cycle TDX baseline remains surprisingly competitive due at least partially to its superior [cycles per instruction]… Another surprising observation is that the exact same microarchitecture provides the best global performance and best absolute energy efficiency. At the high-performance extreme, a two-stage pipeline with a split-stage ALU, TDX₁|X₂… achieves the highest throughput… At the low-power extreme this exact same microarchitecture [achieves the lowest energy per instruction]… high-performance designs can counterintuitively share characteristics with ultra-low-power microarchitectures… for simple integer functional units… moderate pipelines in the range of two to three stages are optimal.”
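To see the unconventional control hazard in the crudest possible way, here is a two-stage toy of my own (trigger and decode folded into one stage, execute in the other) that only knows how to stall; a predicate predictor of the kind the quote describes would speculate where the comment says “stall.”

```python
# Crude two-stage sketch (mine, not the paper's microarchitecture) of the
# triggered control hazard: trigger+decode in stage 1, execute in stage 2.
# The simplest possible policy is shown: stall whenever the in-flight
# instruction might change a predicate that a candidate trigger reads.

def run(pool, preds, cycles):
    inflight = None                           # instruction in the X stage
    trace = []
    for _ in range(cycles):
        # TD stage: resolve triggers against the *current* predicate state.
        issue = None
        for inst in pool:
            if all(preds[p] == v for p, v in inst["when"].items()):
                if inflight and set(inflight["sets"]) & set(inst["when"]):
                    issue = "stall"           # predicates still being written
                    # (a predicate predictor would speculate here instead)
                else:
                    issue = inst
                break
        # X stage: the in-flight instruction completes and writes predicates.
        if inflight:
            preds.update(inflight["sets"])
        inflight = issue if isinstance(issue, dict) else None
        trace.append(issue["name"] if isinstance(issue, dict) else (issue or "idle"))
    return trace

# Two instructions forming a strict sequence via a one-bit 'program counter':
pool = [
    {"name": "i0", "when": {"s": True},  "sets": {"s": False}},
    {"name": "i1", "when": {"s": False}, "sets": {"s": True}},
]
print(run(pool, {"s": True}, cycles=6))
# -> ['i0', 'stall', 'i1', 'stall', 'i0', 'stall']  (serialized by the hazard)
```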
This surprising and robust result about the high performance-per-watt and outright performance of the TDX₁|X₂ “two-stage pipeline with a split-stage ALU” leads me to wonder if this X₁ substage, in which at least some “useful, non-speculative work” can be done, could actually perform “single-cycle execution” of certain low-latency instructions, latch “predicate results” into the predicate registers by the end of the cycle, and in doing so, forward those predicate results in order to update “the predicate state rapidly” and “keep the PE making forward progress” at least some of the time. In addition, certain high-latency instructions may be able to produce some predicate results early, in the TDX₁ stage. The TTA-like ISA and short decode stage of the Starling compute tiles would allow quite a lot of the execute stage to be “moved into the same clock cycle” as the trigger and decode stages with “minimal impact to the maximum operating frequency,” allowing “single-cycle execution” to make considerable forward progress.
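To make this concrete, here is a minimal out-of-universe sketch in Python (my own toy model, not the actual TIA or Starling encoding) of a single PE's scheduling loop: a pool of triggered instructions guarded by predicate values and queue occupancy, with the firing instruction latching its predicate results at the end of the same cycle so that the next trigger resolution sees fresh state. The field names and the toy queue depth are assumptions for illustration.

```python
# Minimal sketch (assumed encoding, not the real TIA ISA): one scheduler step
# of a triggered PE. An instruction fires when its predicate guard matches the
# current predicate vector, its input queues are non-empty, and its output
# queues are non-full; its predicate updates are visible to the next trigger.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class TriggeredInstruction:
    name: str
    guard: dict          # e.g. {"p0": True} -- required predicate values
    reads: list          # input queue names to dequeue from
    writes: list         # output queue names to enqueue to
    op: callable         # datapath operation on the dequeued operands
    pred_updates: dict   # predicate results latched at the end of the same cycle

@dataclass
class PE:
    preds: dict
    in_q: dict
    out_q: dict
    pool: list = field(default_factory=list)

    def ready(self, inst):
        guard_ok = all(self.preds.get(p) == v for p, v in inst.guard.items())
        data_ok = all(self.in_q[q] for q in inst.reads)
        space_ok = all(len(self.out_q[q]) < 4 for q in inst.writes)  # toy queue depth
        return guard_ok and data_ok and space_ok

    def step(self):
        # Trigger resolution: pick the first ready instruction (real hardware
        # resolves all triggers in parallel with a fixed priority).
        for inst in self.pool:
            if self.ready(inst):
                args = [self.in_q[q].popleft() for q in inst.reads]
                result = inst.op(*args)
                for q in inst.writes:
                    self.out_q[q].append(result)
                # Forwarded predicate update: the next trigger resolution sees
                # it, keeping the PE making forward progress.
                self.preds.update(inst.pred_updates)
                return inst.name
        return None  # nothing triggered this cycle

# Tiny usage example: two instructions alternate by toggling a predicate.
pe = PE(preds={"p0": False},
        in_q={"in": deque([1, 2, 3, 4])},
        out_q={"out": deque()})
pe.pool = [
    TriggeredInstruction("dbl", {"p0": False}, ["in"], ["out"],
                         op=lambda x: x * 2, pred_updates={"p0": True}),
    TriggeredInstruction("inc", {"p0": True}, ["in"], ["out"],
                         op=lambda x: x + 1, pred_updates={"p0": False}),
]
while pe.step():
    pass
print(list(pe.out_q["out"]))  # [2, 3, 6, 5]
```

The usage example at the bottom shows the predicate vector acting as a one-bit finite-state machine, which is the degenerate case of using the predicate registers as a tiny program counter.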
The TIA scheduler is already solving a sort of optimization problem, in that it must “dynamically adapt to specific events encountered at execution-time” and “perform dynamic scheduling to achieve better utilization of execution resources” by dispatching triggered instructions “subject to datapath resource constraints and conflicts.” This could be understood as packing of instruction effects in space, in order to minimize horizontal waste. What I’m thinking is that, at a reasonable complexity increment, the TIA scheduler could also pack instruction effects in time, chaining instruction effects to minimize dynamic timing slack (sub-cycle vertical waste), and maximize utilization of the clock period. This would be very much in the spirit of CGRA Express.
“The basic idea of operation fusion stems from the observation that the clock period of a CGRA is determined by the worst case delay (critical path delay) in the architecture. When the clock period is not fully utilized, the slack can be used to execute the successive operation if the delay fits into the slack.”
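As a toy illustration of packing effects in time, the following out-of-universe sketch greedily chains dependent operations into one clock period whenever their combined delay still fits, in the spirit of the operation fusion just quoted; the clock period and per-operation delays are invented.

```python
# Greedy operation fusion sketch: chain dependent ops into one cycle while the
# accumulated delay still fits the clock period, consuming dynamic timing slack
# instead of leaving it as sub-cycle "vertical waste". Delay numbers are made up.
CLOCK_PERIOD_NS = 2.0
OP_DELAY_NS = {"add": 0.7, "sub": 0.7, "and": 0.4, "or": 0.4, "shl": 0.5, "mul": 1.8}

def fuse_chain(op_chain, clock=CLOCK_PERIOD_NS, delays=OP_DELAY_NS):
    """Split a linear dependence chain of ops into per-cycle fused groups."""
    cycles, current, used = [], [], 0.0
    for op in op_chain:
        d = delays[op]
        if current and used + d > clock:   # next op no longer fits: start a new cycle
            cycles.append(current)
            current, used = [], 0.0
        current.append(op)
        used += d
    if current:
        cycles.append(current)
    return cycles

# A six-op dependence chain packs into 3 cycles instead of 6 unfused cycles.
print(fuse_chain(["add", "and", "shl", "add", "or", "mul"]))
# [['add', 'and', 'shl'], ['add', 'or'], ['mul']]
```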
With triggered instructions and “compiler-directed multithreading,” “no dynamic overhead is incurred to switch among threads when programs are executing in the pipeline” (as though the threads share a single context, reminiscent of “interleaving multiple threads of control into a single implicitly multithreaded one” using Software Thread Integration), allowing “zero-cycle multithreading” and “fine-grain interleaving of instructions based on data availability,” such that processing elements can “tolerate both long memory latencies and short instruction latencies, while still executing at full capacity even if only one thread is present.” If there are no “critical rate-limiting loops in the… pipeline,” zero-cycle multithreading can mask pipeline hazards in the manner of a barrel processor.
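A small sketch of that interleaving, under the assumption that a 'thread' is just a statically mapped partition of one PE's instruction pool: each cycle, whichever thread has an instruction with its operands available fires, so a long-latency stall in one thread is filled by the other with no switch cycles. The ready times are invented.

```python
# Zero-cycle multithreading sketch: two statically mapped "threads" share one
# PE; each cycle the scheduler fires whichever thread has an instruction whose
# data is available, so a long-latency stall in thread A is filled by thread B
# without any context-switch cycles. Ready times are invented (A stalls on a
# slow DRAM operand that arrives at cycle 10).
from collections import deque

threads = {
    "A": deque([(1, "A.0"), (2, "A.1"), (10, "A.2"), (11, "A.3"), (12, "A.4")]),
    "B": deque([(1, "B.0"), (2, "B.1"), (3, "B.2"), (4, "B.3"), (5, "B.4")]),
}

cycle, trace = 0, []
while any(threads.values()):
    cycle += 1
    for t in threads.values():
        if t and t[0][0] <= cycle:      # operands available: trigger this instruction
            trace.append((cycle, t.popleft()[1]))
            break                        # one instruction per cycle on this PE
print(trace)
# [(1,'A.0'), (2,'A.1'), (3,'B.0'), (4,'B.1'), (5,'B.2'), (6,'B.3'), (7,'B.4'),
#  (10,'A.2'), (11,'A.3'), (12,'A.4')] -- B fills the cycles where A is stalled
```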
The Memory System
As a tile processor architecture that is “already parallel in its native execution model” and executes spatially distributed computer programs, the TIA naturally lends itself to a “multi-ported and distributed” memory system that “has no central bottlenecks and can sustain a high aggregate memory system bandwidth” using a “large number of independent memory banks.” In theory, the processor designers need only design a single exquisite tile using “hand-crafted, one-transistor-at-a-time methods,” and then “simply stamp out as many tiles… as the silicon die and package allow.” If downward pressure is to be placed on the engineering costs of the memory system, and the performance of the memory is to be scaled up as straightforwardly as the performance of the processor, then some sort of ‘tile memory’ would seem to be appropriate.
The most radical design aspect of the entire Starling system-on-module is the total lack of caches or even scratchpad memories, somewhat like a hard real-time system. The idea is that decoupled systolic arrays, area I/O, Cu-Cu thermocompression bonding of chip stacks, and latency-aware “data placement and migration” actually treat long DRAM latencies instead of just treating their symptoms, freeing up considerable area and power for more “computing elements… that are responsible for performing actual computation.” The coupling of the processor and the nearest-neighbor DRAM chip is so tight that it takes more cycles for processing elements on opposite sides of the array to communicate with each other than it does for them to communicate with some off-chip DRAM — the hallmark of having truly breached the memory wall. The only real solution to the problem of excessive memory latency is to make the wires shorter.
“Users can write highly efficient programs for [a systolic array] so long as the number of available processors matches the problem size. For example, an N × N matrix multiplication requires an N × N array of processing elements. Thus, systolic arrays are highly efficient, but also highly specialized for a single problem size… A commonly accepted technique for overcoming the specialization of systolic arrays is the simulation of multiple processing elements on a single systolic processor. Such a processor uses local memory to store intermediate values of computation. If the architecture is designed to support arbitrarily large problems, the amount of local memory required is unbounded, and thus the local memory available on such a processor must be large. Large memories require a load/store interface to handle intrinsically long access times. Thus, simulation of a systolic array reintroduces the original source of inefficiency: memory operations. Therefore, simulation is inadequate for highly efficient computation and other methods must be found to achieve the benefits of systolic arrays but maintain the flexibility to compute arbitrarily large problems on a general purpose architecture. Out-of-core algorithms provide insight into how to divide up a large problem efficiently among limited resources. The goal of out-of-core algorithms is to efficiently schedule computation on problems that are too large to fit into the main memory of a particular processor. These techniques provide the inspiration to partition a large problem into systolic phases, each of which can be solved without simulation given the resources at hand. For example, suppose one is presented with an R × R array of compute resources and an N × N matrix multiplication where N > R. One could continuously apply the block-partitioning… until each of the block matrices in the matrix multiplication can be computed systolically on an R × R array of processing elements. This technique allows one to apply systolic methods within a general purpose framework without the inefficiency of simulation. However, using this partitioning requires that one store those parts of the matrices that are not needed during a given systolic phase of computation. Furthermore, the data must be accessed in a manner that does not reintroduce memory operations to the critical path of computation. Decoupled access execute architectures provide inspiration to completely separate computation from memory accesses. Data values only need to be accessed on the edges of those processors that implement the systolic array. In this manner, processors on the edges of the array are responsible for continuously accessing data and sending it to those processors performing the actual computation. Therefore, the systolic array of compute processors is augmented with memory processors. Memory processors are given access to a large, but slow, memory while compute processors have no access to local memory apart from a small, fast register set. Therefore, all load/store operations are performed on memory processors, which send data into the systolic array implemented by the compute processors. If one makes use of asymptotically fewer memory processors than compute processors, the operations performed on the memory processors become insignificant as the total number of processors grows. An architecture consisting of an array of compute processors augmented with a set of memory processors on the periphery is referred to as a decoupled systolic architecture.”
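A sketch of the block-partitioning described above, with the outer loops standing in for the peripheral memory processors (all loads and stores happen there) and the innermost block product standing in for one R × R systolic phase on the compute array:

```python
# Out-of-core / decoupled-systolic sketch: partition an N x N matmul into R x R
# systolic phases. The outer loops play the role of the "memory processors"
# (all loads/stores happen here); the inner block product stands in for the
# R x R compute array, which only ever touches streamed operand blocks.
import numpy as np

def blocked_matmul(A, B, R):
    N = A.shape[0]
    assert A.shape == B.shape == (N, N) and N % R == 0
    C = np.zeros((N, N))
    for i in range(0, N, R):               # memory processors: fetch operand blocks...
        for j in range(0, N, R):
            for k in range(0, N, R):
                a = A[i:i+R, k:k+R]        # ...and stream them to the array edges
                b = B[k:k+R, j:j+R]
                C[i:i+R, j:j+R] += a @ b   # compute array: one systolic phase
    return C

N, R = 8, 2
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(blocked_matmul(A, B, R), A @ B)
```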
The Starling processor is an unconventional decoupled systolic array in which area I/O is used to locate memory tiles “in the interior of the chip,” rather than “on the periphery,” in order to slow the growth of DRAM latency as processing elements are added, relative to the edge I/O baseline. The logical decoupling of CDFGs into computation and communication slices directly corresponds to the physical/spatial decoupling of the systolic array, and is preserved by the mapping of CDFGs to the heterogeneous computation/communication fabric.

“For tiled architectures, the number of cores increase quadratically with feature size. As a result of this, the pressure on memory system exerted by the cores also goes up quadratically. However the perimeter only increases linearly. As a result the number of pins on the boundary also go up only linearly. If the memory system relies on edge I/O for communication with off chip storage like main memory DRAMs, the gap between demand and supply widens with every generation. Flip-chip packaging based area I/O provides a better scaling and also avoids the delay involved in going to and coming back from the chip edge.”
“Solder bumps need to get smaller as the pitch shrinks, but eventually there simply isn’t enough material to ensure a reliable electrical connection. Depending on the approach, the minimum interconnect pitch for solder-based technologies is [30-55μm]. The resulting maximum interconnect density of about 1,110 connections per mm² sharply limits the memory bandwidth that the package can support… Moreover, interconnect schemes based on copper pillars and microbumps create a 20-25μm separation between die in the stack. After deposition of the last layer of device metallization, additional process steps are required for either copper pillars or pads for solder bumps. Solder is a multicomponent material that is known to have thermo-mechanical stress during operation, which can cause cracking and lead to electrical failures. Organic underfill layers are routinely used to accommodate these stresses, but the high thermal resistance of these layers impedes heat dissipation… Each additional die adds to the thermal non-uniformity through the stack… The solution [is] a die bond layer free of any organic material with all-metal interconnect. Each such die stack will perform thermally like a single die. This hybrid bonding approach eliminates the chip-to-chip connectivity bottleneck by replacing solder bumps with high-density metal-metal 3D interconnects. Metal pads (usually copper) surrounded by dielectric contact each other directly. Before bonding, the metal is slightly recessed relative to the surrounding dielectric. Plasma treatment activates the dielectric surface, allowing a dielectric-to-dielectric bond to form on contact. Though the devices being connected have not yet made electrical contact, this bond is strong enough to allow handling, assembly of multi-chip stacks, and so on. Once the complete stack is ready, a final elevated temperature step causes the pads to expand, contacting each other and making a strong metal-to-metal bond. In this structure, there is no solder material, no separation between the devices, and no need for an underfill layer. The inorganic dielectric provides mechanical support, protects the copper surfaces from contamination, and gives superior thermal conductivity. The direct copper-to-copper connection minimizes electrical resistance and maximizes heat dissipation. Copper pad fabrication for hybrid bonding uses the same inlaid metal damascene process as the on-chip interconnect metallization used throughout the semiconductor industry. This… process routinely achieves a 1μm pitch with potential densities approaching a million connections per mm². The net result is up to a [three orders of magnitude] interconnect density increase, while still improving thermal performance and reducing cost of ownership… While hybrid bonding does not require [TSVs], it can exploit them to support complex multilayer stacks. The inlaid copper pads on the frontside of one die can contact TSVs directly on the backside of another… Although HBM memory chips… are approximately 50μm thick, the industry has demonstrated feasibility down to 30μm… High-density interconnects eliminate bottlenecks that might otherwise occur between chips. Instead, designers can treat interfaces between die as comparable to interfaces within die, allowing die to share interconnect layers and a single power supply mesh. With careful attention to vertical placement, such designs cannot just match the monolithic die performance, but improve it by reducing critical path lengths.”
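The quoted densities follow directly from the pad pitch; as a quick check, assuming a square grid of pads,

$$\text{density} \approx \frac{1}{\text{pitch}^2}, \qquad \frac{1}{(30\,\mu\text{m})^2} \approx 1.1\times10^{3}\ \text{mm}^{-2}, \qquad \frac{1}{(1\,\mu\text{m})^2} = 10^{6}\ \text{mm}^{-2},$$

which is the roughly three-orders-of-magnitude improvement the quote cites.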
The Network-on-Memory is the key to the implementation of the memory system as an “array of replicated tiles” with the high internal bandwidth required for aggressive dynamic data migration.
“The memory subsystem is a major key performance bottleneck and energy consumer in modern computer systems… the off-chip memory bandwidth does not grow as fast as the processor’s computation throughput. The limited memory bandwidth is the result of relatively slow DRAM technologies (necessary to guarantee low leakage), the non-scalable pinout, and on-board wires that connect the processor and memory chips… a considerable portion of the memory bandwidth demand in many programs and operating system routines is due to bulk data copy and initialization operations. Even though moving data blocks that already exist in memory from one location to another location does not involve computation on the processor side, the processor has to issue a series of read and write operations that move data back and forth between the processor and the main memory… [Network-on-Memory] carries out copy operations entirely within the memory across DRAM banks, without any intervention from the processor… With NoM, banks in 3D stacked memory are equipped with very simple circuit-switched routers… To perform data transfers in parallel, NoM replaces the global links of the shared bus with a set of shorter inter-bank links and, hence, yields higher throughput and scalability… Vaults can be accessed simultaneously… In addition to the conventional address, data, and control buses, NoM connects each bank to its neighboring banks at the X, Y and Z dimensions to form a 3D mesh topology… due to its simple structure and short non-intersecting links, which do not significantly change the structure of the DRAM layer… all vault controllers can serve both regular memory accesses and DRAM refresh requests in parallel with NoM’s data copy operations. The only exception is when the corresponding vault controllers are busy reading and writing NoM data at the source and destination banks, respectively. In between these read and write steps, the data block passes on the NoM links and leaves the read/write bandwidth idle for regular accesses.”
It may be possible to relieve congestion in the 2D network on the “logic die” by routing data through the 3D Network-on-Memory without writing it to and then reading it from the DRAM (the normal sense in which processing elements communicate through shared memory).
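Here is a toy model (mine, not the NoM paper's router) of a bank-to-bank copy circuit-switched through a 3D mesh of banks with dimension-order routing; only the source and destination vault controllers touch the data, and the intermediate banks merely forward it over their inter-bank links.

```python
# Toy Network-on-Memory routing sketch: banks form a 3D mesh; a copy is read at
# the source bank, forwarded hop-by-hop over inter-bank links via dimension-
# order (X, then Y, then Z) routing, and written at the destination bank.
# Regular accesses at intermediate banks are untouched. This is my own toy model.
def xyz_route(src, dst):
    """Return the list of (x, y, z) bank coordinates a copy traverses."""
    path = [src]
    for axis, target in enumerate(dst):
        while path[-1][axis] != target:
            step = 1 if target > path[-1][axis] else -1
            nxt = list(path[-1])
            nxt[axis] += step
            path.append(tuple(nxt))
    return path

def copy_block(dram, src, dst):
    """Copy a block between banks without the data ever leaving the stack."""
    hops = xyz_route(src, dst)
    dram[dst] = dram[src]        # read at the source vault, write at the destination
    return len(hops) - 1         # link traversals; intermediate banks only forward

dram = {(0, 0, 0): b"page A", (3, 2, 1): b""}
print(copy_block(dram, (0, 0, 0), (3, 2, 1)), dram[(3, 2, 1)])  # 6 hops, b'page A'
```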
The coherence ordering of loads and stores by the memory tiles is supposed to “provide sequential memory semantics without interfering with the parallelism that dataflow exposes,” such that the memory tiles “are generally decoupled from one another except as necessary to preserve program correctness.” By far and away, the most common side effects handled by the unified-hybrid control-dataflow Starling architecture and the effect-coeffect system of Chimera are the shared mutable states in DRAM. As I imagine it, there is a core of functional subprograms, wrapped in a shell of imperative subprograms that catch and handle all of the side effects — most importantly, the “subprograms of memory operations” containing loads and stores.
The scheme takes inspiration from hardware like WaveCache, as well as software like ÆMINIUM, Jade, Athapascan-1, and the Relaxed Memory Calculus, and involves “threading” in the sense of threading “the state of the heap through the program computation,” rather than the better-known notion of threading from control flow architectures. The idea is that, because certain parallel programming language constructs “model an effect system” (like the “data groups” in ÆMINIUM), control edges can be placed in the communication slices of the decoupled CDFGs in such a way as to thread state “through the program following the permissions” to access shared data structures in DRAM granted by those constructs. These control edges therefore “transform implicit dependencies into explicit dependencies,” and constitute “a soup of partial order relations” that ensures determinacy “in the presence of asynchronous updates to a shared structure,” “at the expense of serializing access to the shared data structure.” Importantly, “by supporting fine-grained specifications” and “describing the exact state and permitted operations on it,” control edges in the communication slices provide the “very good memory aliasing” necessary to “express parallelism directly by judiciously placing dependencies only between those memory operations that must actually execute sequentially,” so that loads and stores are not over-serialized. This is how I imagine that the direct and native execution of CDFGs allows “execution to proceed out of program order, but not out of dependency order,” with “subprograms of memory operations” that correctly and automatically sequence “reads with respect to writes in the parallel execution.”
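A sketch of the idea, assuming each memory-touching task declares read and write effects on named data groups (the task names and groups are invented): control edges are inserted only between conflicting accesses to the same group, so independent loads and stores remain unordered.

```python
# Sketch of threading state through the program following the permissions: each
# task declares read/write effects on named data groups, and control edges are
# added only between conflicting accesses to the same group. Independent memory
# operations stay unordered, so MLP is not over-serialized. Names are invented.
def build_control_edges(tasks):
    """tasks: list of (name, reads, writes) in program (lexical) order."""
    edges, last_writer, readers = [], {}, {}
    for name, reads, writes in tasks:
        for g in reads:                                   # read-after-write
            if g in last_writer:
                edges.append((last_writer[g], name))
            readers.setdefault(g, []).append(name)
        for g in writes:                                  # write-after-read/write
            for r in readers.pop(g, []):
                if r != name:
                    edges.append((r, name))
            if g in last_writer:
                edges.append((last_writer[g], name))
            last_writer[g] = name
    return edges

program = [
    ("load_a",  {"A"}, set()),
    ("load_b",  {"B"}, set()),      # independent of load_a: no edge
    ("store_a", set(), {"A"}),      # must follow load_a (same group A)
    ("store_b", set(), {"B"}),      # must follow load_b, but not store_a
]
print(build_control_edges(program))
# [('load_a', 'store_a'), ('load_b', 'store_b')]
```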
“Functional programs are widely appreciated for being ‘declarative’ and implicitly parallel. However, when modeling state, both properties are compromised because the state variables must be ‘threaded’ through most functions. Further, state ‘updates’ may require much copying… By ‘declarative’ we mean that one does not have to over-specify the details of an algorithm. This is sometimes paraphrased as [say what you want done, not how you want it done]… A common approach to addressing these weaknesses is to extend a functional language with state… In pure functional languages, parallelism is implicit. Annotations for parallelism may be used, but these can be viewed as hints and do not add to the intellectual burden of the programmer. However, in sequential functional languages with imperative extensions, parallelism is usually re-introduced through another layer of constructs such as threads, forks, semaphores, futures, [and so on]. These features usually make the language more complex. Further, when two such parallel programs are composed, their parallelism does not compose, because of the underlying strict semantics. The difference in resulting parallelism can be exponential… Threading ensures determinacy at the expense of serializing access to the shared data structure. For a large class of algorithms, determinacy can be ensured in the presence of asynchronous updates to a shared structure… a local model of whether or not side-effects occur is essential in order to reason about programs with side-effects… In most approaches to parallel computing, one starts with a sequential language and adds parallel annotations… In contrast, we go in the opposite direction: we start with an implicitly parallel language and add sequentialization only where necessary. The degree to which we expose and exploit parallelism is thus much more aggressive.”
“Much of the complexity of parallel execution comes from the fact that parallel tasks can generate many different interleavings of the basic operations, with each interleaving generating a potentially different behavior. To understand the program, one must take all of the different behaviors into account. Expressing the computation in a serial language, however, imposes a total order on the operations and the program only generates one behavior. From the programmer’s perspective this dramatic simplification makes it much easier to reason about what the program may do. From the system’s perspective it imposes the often considerable challenge of automatically generating parallel execution from a serial program… Systems with serial semantics provide mutable data but simplify the programming model by eliminating the concept of parallel execution. Monotonic systems, on the other hand, provide parallel execution but simplify the programming model by eliminating the concept of mutable data… In monotonic systems a variable’s value does not change over the course of the computation — it only becomes more defined. Computation in monotonic languages is therefore inherently parallelizable. All operations can execute concurrently without conflict… The computation proceeds in an information-driven fashion. When one operation generates information that satisfies a second operation’s information request, the second operation executes, generating further information or computation… Programmers using functional languages structure their computation as a set of recursively defined functions. Variables hold values returned from function invocations and are used to transfer values between invocations. Because each variable holds only one value during the course of a computation, the execution of a functional program can be seen as the progressive generation of information about the values of variables… The strengths of monotonic systems are their clean semantics and inherently parallel execution model. They allow programmers to expose the concurrency in their computations without having to deal with the complexity of an explicitly parallel programming system. A major barrier to the widespread acceptance of monotonic languages, however, is the difficulty of implementing these languages efficiently… There are also expressiveness problems associated with monotonic languages… the lack of mutable data can force programmers to tediously thread state through multiple layers of function calls… The expressiveness and efficiency problems have led to hybrid designs which integrate mutable data into monotonic programming languages… Programs written in languages that support mutable data reuse memory efficiently by overwriting old values with new values. Because monotonic languages eliminate the notion of side effects, the implementation is responsible for generating the memory reuse. Garbage collection is one common technique for reusing memory. The lack of side effects in monotonic languages means that each piece of memory goes through the collector between writes… The lack of mutable data means that programs written in monotonic languages typically generate more copying overhead than programs written in more conventional languages. To update a component of a data structure, the program cannot simply generate the new value and store it back into the original structure. It must instead allocate a new structure to hold the value and copy the unchanged values from the old structure to the new structure. 
If the structure is part of a large linked data structure the program must recursively regenerate all of the data that pointed to the original (and now obsolete) structure… One interesting aspect of update-in-place optimizations is that their use reintroduces the key problem of parallelizing programs that use mutable data: correctly sequencing reads with respect to writes in the parallel execution. Systems that use this optimization can no longer blindly evaluate expressions in parallel. They must analyze the generated access patterns and insert synchronization operations that correctly sequence the evaluation of expressions that access the updated data structure… The multilingual approach works best when it is natural to structure each task as a functional module that accepts a set of inputs and generates a set of outputs… Because it is impossible to implement irregular concurrency patterns with barriers, the tasks in programs with irregular concurrency patterns cannot interact using mutable memory… Shared memory systems… synchronize using constructs such as locks, barriers and condition variables.”
“In functional programming there are no side effects, so programs can execute concurrently to the extent permitted by data dependencies. Although functional programming can solve most problems, having explicit state, as provided by imperative languages, allows the developer to express certain problems in a more intuitive and efficient way… Sharing state between concurrent tasks immediately raises questions [about the order in which those accesses should occur, and how to coordinate those accesses to maintain program invariants]. The reason why those questions are hard to answer is because there are implicit dependencies between code and state. Methods can arbitrarily change any accessible state without revealing this information to the caller. This means that two methods could be dependent on the same state, without the caller knowing about it. Because of this lack of information, [imperative] programming languages use the order in which code is written as proxy to express those implicit dependencies. Therefore the compiler has to follow the given order and cannot exploit potential concurrency automatically… To overcome this situation, we propose to transform implicit dependencies into explicit dependencies and then infer the ordering constraints automatically… by default, everything is concurrent, unless explicit dependencies imply a specific ordering… We propose to use access permissions to specify explicit dependencies between stateful operations. Access permissions are abstract capabilities that grant or prohibit certain kinds of accesses to specific state… The way we use access permissions to specify state dependencies resembles the way Haskell uses its I/O monad to specify access to global state. But unlike the I/O monad, which provides just one permission to all the state in the system, access permissions allow greater flexibility by supporting fine-grained specifications, describing the exact state and permitted operations on it. In ÆMINIUM every method must explicitly mention all of its possible side effects. This allows the system to compute the data dependencies within the code, and within those constraints, execute the program with the maximum possible concurrency. By following this approach our system resembles a dataflow architecture. But instead of producing and consuming data, our system supports shared objects and in-place updates… Because access permissions play such an important role in our system, we promote them to first-class citizens and integrate them into our type system… Shared memory and objects are in general used as communication channels between several concurrent entities, which may modify the shared state. Therefore, we need a mechanism to allow concurrent execution and modifying access to a shared resource. A shared permission provides exactly these semantics… In some situations, application-level dependencies exist that cannot directly inferred via data dependencies… To allow the user to specify such additional dependencies without sacrificing concurrency, we add data groups to our system… Data groups provide a higher-level abstraction and provide information hiding with respect to what state is touched by a method… In our system a data group can be seen as a container which contains all shared permissions to an object… Data groups allow the programmer to explicitly model [their] design intent in the source code. 
Not only does this allow the ÆMINIUM system to infer the dependencies and correct execution, it also improves the quality of the code itself by explicitly documenting those dependencies… data groups model an effect system…” “ÆMINIUM’s type system prevents data races by either enforcing synchronization when accessing shared data or by correctly computing dependencies to ensure a happens-before relationship (meaning conflicting accesses will be ordered according to their lexical order)… One can thus think of state being threaded through the program following the permissions. In this sense, permissions allow us to treat an imperative program as if it were purely functional, with corresponding benefits for reasoning and parallelization. An analogy can be made to monads such as the state monad in Haskell, which conceptually threads the state of the heap through the program computation.”
“In order to respect sequential consistency… Athapascan-1 has to identify the related data version for each read performed on a shared data. However, parallelism detection is easily possible in this context if all the tasks define the shared data objects that will be accessed during their execution… and which type of access will be performed on them… tasks cannot perform side effects… and shared parameters are typed according to the future manipulation… In the declaration of formal parameters of tasks, the references to shared data version are typed by their access right, i.e. the kind of manipulation the task (and all its sub-tasks, due to lexicographic order semantic) is allowed to perform on the shared data… the strong typing of accesses in shared memory enables the verification of the correctness of a program at compile time.”
“Our memory model is based on explicit, programmer-specified requirements for order of execution… The compiler then realizes those requirements in the most efficient manner it can. This is in contrast to existing memory models, which — if they allow programmer control over synchronization at all — are based on inferring the… consequences of synchronization operations or annotations in the code… RMC is intended to admit all the strange (to a sequentially consistent eye) executions permitted by relaxed memory architectures. Thus, an RMC program can be implemented on such architectures with no synchronization overhead beyond what is explicitly specified by the programmer… Out-of-order execution makes it possible for execution to proceed out of program order, but not out of dependency order.” “In RMC, the programmer explicitly specifies constraints on the order of execution of operations… These constraints are then enforced by the compiler, which has a great deal of latitude in how to achieve them. Because of the very fine-grained information about permissible behaviors, this can allow the generation of more efficient code… While RMC does not have a useful global total ordering of actions, there does exist a useful per-location ordering of writes that is respected by reads. This order is called coherence order. It is a strict partial order that only relates writes to the same location, and it is the primary technical device that we use to model what writes can be read. A read, then, can read from any write executed before it such that the constraints on coherence order are obeyed… The general approach of weak memory models is to abandon any notion of ‘memory’ and to replace it with a history that is given order by a soup of partial order relations… While a monadic lambda calculus seems quite different than a traditional block-structured imperative programming, it is fairly simple to translate from a traditional imperative language to this sort of calculus… existing data and control dependencies in the program may already enforce the execution order we desire, making it unnecessary to emit any additional code at all… given that relative ordering constraints are a key reasoning principle, exposing them directly in the program aids program construction and understanding.”
The mapping of these particular control edges to the 1-bit OCN may achieve large speedups on programs with shared data structures in DRAM by allowing single-cycle communication between non-adjacent PEs, and memory coherence across large clusters of PEs. The ordered and unordered memory interfaces of WaveCache illuminate the relative merits of the implicit and explicit exposure of MLP.
“Wave-ordered memory is not the only way to sequentialize memory operations in a dataflow model. Researchers have proposed an alternative scheme that makes implicit memory dependences explicit by adding a dataflow edge between each memory operation and the next. While this token passing scheme is simple, our results show that wave-ordered memory expresses twice as much memory parallelism. Despite this, token-passing is very useful in some situations, because it gives the programmer or compiler complete control over memory ordering. If very good memory aliasing information is available, the programmer or compiler can express parallelism directly by judiciously placing dependences only between those memory operations that must actually execute sequentially. WaveScalar provides a simple token-passing facility for just this purpose… Instead of steering a single PC through the executable so that the processor executes one path instead of the other, WaveScalar steers values into one part of the dataflow graph and prevents them from flowing into another. It can also use predication to perform both computations and later discard the results on the wrong path. In both cases, the dataflow graph must contain a control instruction for each live value, which is a source of some overhead in the form of extra static instructions… wave-ordered memory exposes twice as much MIPC as token-passing on average… The difference is primarily due to the steering dataflow machines use for control flow. The memory token is a normal data value, so the compiler must insert Steer instructions to guide it along the path of execution. The Steer instructions introduce additional latency that can delay the execution of memory operations… Wave-ordered memory avoids this overhead by encoding the necessary control flow information in the wave-ordering annotations. When the requests arrive at the memory interface they carry that information with them. The wavefront… serves the same conceptual purpose as the token: It moves from operation to operation, allowing them to fire in the correct order, but it does not traverse any non-memory instructions.”
The upside of token-passing schemes is that memory operations are parallel by default, so in practice more MLP is exposed and exploited at lower programming cost than would otherwise be the case. The downside is that token-passing schemes require picks, switches, steers, ϕ-functions or other “control operators” that convert “control dependences into data dependences by sending data values to the instructions that need them,” and these “introduce additional latency that can delay the execution of memory operations.” The upside of wave-ordered memory, in contrast, is that it encodes “the necessary control flow information in wave-ordering annotations,” and exposes MLP without increasing instruction counts. The downside is that this MLP must be exposed explicitly, by way of additional “ripple number” annotations that mark certain memory operations as independent. In other words, the “wave-ordered memory interface… is necessary for executing conventional programs, but it can only express limited parallelism (i.e., by using ripple numbers). WaveScalar’s unordered interface makes a different trade-off: it cannot efficiently provide the sequential ordering that conventional programs require, but it excels at expressing parallelism, because it eliminates unneeded ordering constraints and avoids contention for the store buffer.”
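To make the two interfaces concrete, here is a toy model (my own simplification, not WaveScalar's actual encoding) of a wave-ordered store buffer: each memory operation carries a sequence number plus a ripple naming the latest earlier operation it must wait for, and the interface fires operations as soon as those dependences have fired, with no steer instructions in the dataflow graph.

```python
# Toy model of a wave-ordered memory interface (my own simplification): each
# memory op arrives tagged with its sequence number in the wave and a "ripple",
# here the sequence number of the latest earlier operation it must wait for
# (e.g. the most recent preceding store for a load). Loads with the same ripple
# may fire as soon as that store has fired, in any arrival order.
def wave_ordered_issue(arrivals):
    """arrivals: list of (arrival_time, seq, ripple, kind); returns issue order."""
    pending, fired, issued = [], set(), []
    for _, seq, ripple, kind in sorted(arrivals):
        pending.append((seq, ripple, kind))
        progress = True
        while progress:
            progress = False
            for op in sorted(pending):
                s, r, k = op
                if all(x in fired for x in range(r + 1)):   # every earlier dep has fired
                    pending.remove(op)
                    fired.add(s)
                    issued.append((s, k))
                    progress = True
                    break
    return issued

# Store 0 arrives first; loads 1-3 depend only on it (ripple 0) and arrive out
# of order, so they issue as soon as they arrive, without any steer overhead.
ops = [(0, 0, -1, "st"), (3, 1, 0, "ld"), (1, 3, 0, "ld"), (2, 2, 0, "ld")]
print(wave_ordered_issue(ops))
# [(0, 'st'), (3, 'ld'), (2, 'ld'), (1, 'ld')]
```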
I have a sneaking suspicion that the TIA can resolve this dilemma, for three reasons. First, the way in which the TIA “does not impose any ordering between instructions unless explicitly specified” is much like how “the programmer or compiler can express parallelism directly by judiciously placing dependencies only between those memory operations that must actually execute sequentially” using a token-passing scheme, and consistent with an implicit (subtractive) approach to MLP with no dependencies between memory operations other than predicate dependencies. Second, I imagine that the TI memory tiles can implement something like wave ordering by “setting up a vector of predicate registers to represent the current state in a sequence” of memory operations, or perhaps by performing “tag comparisons” in such a way that load and store “instructions will only fire when the wave numbers in the operand tags match.” Third, the way in which “the triggers of instructions and the instructions that actualize a logic conjunction of predicates are implicitly accomplished on the scheduler, rather than on the datapath” and reduce “both dynamic and static instruction counts relative to program counter-based control” is vaguely reminiscent of how WaveCache eliminates control operators “by encoding the necessary control flow information in the wave-ordering annotations” and “exposes twice as much MIPC as token-passing on average.”
I also suspect that, if two or more instructions in the pool send the same datum to different destination instructions, then triggering exactly one of them in a given cycle would, in some sense, constitute an implicit steer operation. To the extent that these steer operations are statically or dynamically fused with the execution of “useful (non-NOP) instructions,” coherence ordering would be enforced at the cost of additional instructions in the static instruction memory, rather than in the dynamic instruction sequence.
I cannot stress enough that Memory Access Scheduling and Refresh Triggered Computation are implemented mostly in software, and preferably over 80% in software. There should not be large amounts of area and power tied up on the logic die for the sole purpose of working around the idiosyncrasies of DRAM. Instead, the responsibility for controlling DRAM is shifted from an architecturally-invisible hardware memory controller to the software.
Of course, DRAM is not strictly random-access memory. This is both a challenge and an opportunity, and the programmer and compiler have two ‘handles’ on the quirks and features. First, they control address generation by the “subprogram of memory operations,” “the memory locations of variables and… the mapping of virtual addresses to ranks, banks, and rows in DRAM.” Second, they control the predicate dependencies (and coherence ordering) of loads and stores. This, then, is how a compiler and RTS “aware of the internal DRAM device timing” might arrange for loads and stores to be executed “in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access,” and such that the “read/write requests of the application naturally refresh DRAM.” The programmer and compiler have considerable high-level freedom to compose “sophisticated memory ‘personalities’ such as scratchpads, caches, and FIFOs” and configure memory tiles as “lean and high-throughput memory request generation pipelines,” but MAS and RTC place severe limitations on the low-level “memory access patterns” (“linear, strided, repeating, indirect, scatter-gather,” and so on) with which they are implemented. In other words, MAS and RTC straitjacket communication slices such that they touch locations in DRAM in the way that DRAM wants them to be touched: non-randomly, periodically, and even serially.
MAS might be described as an exploitation of favorable “inter-instruction effects,” in the special case that those instructions are loads and stores. In generating a schedule of loads and stores that “conforms to the timing and resource constraints” of DRAM, MAS would certainly seem to be minimizing negative inter-instruction “effects related to resource constraints and delays” such as “circuit state overhead.” MAS and RTC may have much in common with “software decoupled access-execute, in which the compiler transforms the code into coarse-grain phases that are well-matched” to the capabilities of the hardware in some way. For example, the compiler could “cluster memory accesses into ᴀᴄᴄᴇss regions” such that periodic alternation between “a heavily memory-bound phase” and “a heavily compute-bound phase” naturally refreshes DRAM. There may even be opportunities to synchronize and parallelize memory access and reconfiguration, in cases where the latencies are comparable.
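A minimal sketch of what a software memory personality doing MAS might look like, with an invented request format and command stream: pending requests are grouped by bank and row so that each activation is amortized over many column accesses, and the per-bank streams are interleaved so that activations in different banks overlap.

```python
# Minimal software memory-access-scheduling sketch: group pending requests by
# (bank, row) so each ACTIVATE is amortized over many column accesses, then
# round-robin across banks so activations overlap. The request format and the
# explicit command stream are assumptions for illustration.
from collections import defaultdict

def schedule(requests, n_banks=4):
    """requests: list of (bank, row, col). Returns an ordered DRAM command stream."""
    per_bank = defaultdict(lambda: defaultdict(list))
    for bank, row, col in requests:
        per_bank[bank][row].append(col)                       # open-page grouping

    streams = []
    for bank in range(n_banks):
        cmds = []
        for row, cols in per_bank[bank].items():
            cmds.append(("ACT", bank, row))                   # one row activation...
            cmds += [("RD", bank, row, c) for c in sorted(cols)]  # ...many column reads
            cmds.append(("PRE", bank, row))
        streams.append(cmds)

    out = []                                 # interleave per-bank streams so that
    while any(streams):                      # activations in different banks overlap
        for s in streams:
            if s:
                out.append(s.pop(0))
    return out

# Scattered requests to two rows in each of two banks become two ACT/PRE pairs
# per bank instead of one per request.
reqs = [(0, 7, 3), (1, 2, 0), (0, 7, 4), (1, 2, 1), (0, 9, 0), (1, 5, 9)]
for cmd in schedule(reqs):
    print(cmd)
```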
The Compiler and Runtime System
I will be the first to admit that Starling is among those architectures that shift much of the design burden from processor designers to compiler designers. The compiler designers, then, must make up for this deficit of design effort by working out how to map the ugliest and messiest workloads in computing onto this regular and orderly hardware. There seem to be a number of cases in the history of computer architecture in which a strategy of aggressively dematerializing the processor, and trading recurring costs at runtime for one-time costs at compile time, ultimately backfired when the complexity of compiler design turned out to be much greater than expected, and any performance-per-watt gains became moot because there was simply no software to run. Who can forget the unsinkable Itanic? One of the things that’s so interesting about the Mill CPU is that it seems to be radically co-designed from the ground up with its accompanying software toolchain, and in fact, the specializer could be thought of as a final link in the toolchain that runs on the targeted processor itself — that of the end user, rather than the software developer. This would seem to avoid the pitfall of inadvertently designing untargetable hardware. Ivan Godard, of course, has a background in compiler design, and perhaps compiler designers who think in terms of intermediate representations are in the best position to understand and appreciate the true, root design problem — the co-design of hardware and software.
The Automaton compiler that emits code for the Starling processor family is indeed a high-risk, high-reward moonshot. The idea is that the Herculean programming effort necessary for Automaton to reliably emit fast code constitutes a good investment, one which, in the long run, saves a vast amount of programming effort, because “larger designs contain more tiles, but the tile structure… does not change,” the processor architecture can “scale to any number of tiles, both in terms of functional correctness and in terms of performance,” and “changing the memory or instruction execution latencies has no effect on the correctness of the program.” The Starling architecture also provides “a cost-effective alternative to custom hardware in many situations” and a “universal solution for both general- and special-purpose applications,” allowing mass-market commodity processors to penetrate segments traditionally dominated by ASICs. Starling processors therefore enjoy considerable economies of scale, and are frequently used in lieu of “traditional hardware structures like video RAMDACs.” In this way, the Starling architecture bends the development cost curve, and bucks the long-term secular trend of disaggregating GPPs into heterogeneous clusters of dissimilar, domain-specific processors.
Automaton grapples with the overwhelming complexity of the problem in two ways, one experimental and one theoretical. First, Automaton is a bona fide supercomputer or HPC workload that provides Compilation-as-a-Service to remote programmers. The idea is that a supercomputer based on pseudo-complementary self-aligned-gate InAlAs/InGaAs/InP E/D-HEMT or even Nb/Al₂O₃ QPSJ logic can not only emulate the targeted hardware, but also provide “zero-overhead… completely non-invasive profiling” of the machine code, and the most realistic possible environment for dynamic program analysis. This is conducive to aggressive profile-guided optimization, and allows compiler optimization-space exploration to “be implemented in, and benefit from, dynamic optimization platforms and managed-runtime environments.” Dynamic compiler OSE can then discover “unpredictable optimization interactions,” compare “many variants of each optimization’s heuristic,” “select the best-performing version of each code segment by measuring actual runtimes,” and make decisions “about which optimization configuration performs best… a posteriori, that is after the resulting optimized versions of the code have been examined.” Second, because the “triggered instruction model” is a form of “guarded action model,” it naturally admits a “guarded-command-like” process calculus, and allows the compiler to not only emit machine code that is correct by construction — that is, “develop a program and its proof hand-in-hand, with the proof ideas leading the way” such that the program is manifestly correct, free from deadlock/livelock, free from starvation, and so on — but also perform only those optimizations that respect or preserve these desirable properties, having been restricted to a certain ‘optimization subspace.’ This is not unlike how Inmos Transputers “were intended to be programmed using the programming language occam, based on the communicating sequential processes process calculus” and with “concurrency and channel-based inter-process or inter-processor communication as a fundamental part of the language.”
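A toy sketch of the a posteriori OSE loop: emit a variant of a code segment for each optimization configuration, measure it on the target (here, a stand-in timing model playing the role of the emulated Starling), and keep the fastest. The configuration space and cost model are invented.

```python
# Toy optimization-space-exploration loop: compile a code segment under several
# optimization configurations, measure each variant's "actual" runtime on the
# target (here a stand-in timing model), and keep the best a posteriori.
import itertools, random

random.seed(1)

CONFIG_SPACE = {
    "unroll":     [1, 2, 4, 8],
    "fuse_ops":   [False, True],
    "if_convert": [False, True],
}

def measure(segment, config):
    """Stand-in for profiling the emitted code on the target (or its emulator)."""
    base = segment["trip_count"] * segment["ops_per_iter"]
    t = base / config["unroll"] ** 0.5
    t *= 0.8 if config["fuse_ops"] else 1.0
    t *= 0.9 if config["if_convert"] and segment["branchy"] else 1.0
    return t * random.uniform(0.98, 1.02)      # measurement noise

def explore(segment):
    keys = list(CONFIG_SPACE)
    best = None
    for values in itertools.product(*CONFIG_SPACE.values()):
        config = dict(zip(keys, values))
        runtime = measure(segment, config)
        if best is None or runtime < best[1]:
            best = (config, runtime)
    return best

segment = {"trip_count": 1000, "ops_per_iter": 12, "branchy": True}
print(explore(segment))   # best-performing configuration and its measured runtime
```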
The spatial nature of the TIA raises the possibility that, just as it eliminates “over-serialization,” it could also eliminate what might be called ‘under-systolization.’ In a systolic architecture, dataflow graphs are (for all practical purposes) lattice graphs with natural, obvious and trivial mappings onto tile processors (sending nodes to processing elements and arcs to “inter-processing-element datapaths”). This is the densest possible “space-time packing” of computations, with maximum reuse of data, and the shortest possible wires. “The speed advantage of short wires has not gone unnoticed. In fact, systolic arrays were proposed by Kung and Leiserson in the late 1970s, and aimed, in part, at exploiting the speed of short wires.” As such, parallelism is exposed with the lowest possible communication and synchronization overheads, and the highest possible area- and power-normalized performance. In other words, the “data and control flow of the array is simple and regular, so that cells can be connected by a network with local and regular interconnections… several data streams move at constant velocities over fixed paths in the network, and interact where they meet. In this fashion, a large proportion of the processors in the array are kept active, so that the array can sustain a high rate of computation flow. The array makes multiple use of each input data item. As a result high computation throughput can be achieved without requiring high bandwidth between the array and memory.” The best example of ‘systolization’ that I have come across is from Figure 3 in Baring It All to Software: Raw Machines, with a loop in a snippet of imperative code being transformed into a six-tile structure that “maintains all the true value dependencies” but “allows independent memory references to proceed in parallel.”
The TIA probably meets the definition of a “wavefront array,” a close relative of the systolic array with at least some “autonomous control at each PE,” and “more complicated, asynchronous intercellular communications.” The most challenging and rewarding development effort of all would be that of a runtime system responsible for dynamic reconfiguration, with “spatial packing of modules into a reconfigurable region” and “packing of modules in the time domain.” This would be a far greater development risk than the one-time specialization of bytecode for a particular “processor model.” As I imagine it, the configware is some bytecode or intermediate representation, interpreted by the runtime system according to the availability of hardware resources, certain rules that “guarantee correct operation,” and possibly hints from the compiler. This would allow efficient “space-time packing” of many highly “irregular dataflow graphs” (characterized by “asynchronous actors performing different tasks, while frequently communicating with neighboring actors”) mapped onto the CGRA, and sustain a considerable fraction of the effective utilization of a single lattice dataflow graph mapped onto those same tiles. This packing would involve not only the usual placement and routing, but also something like runtime system-directed multithreading — the compiler annotates each section of the “overall algorithm dataflow graph” with restrictions on how it can be placed on the same PE as other “unrelated sections” of the same dataflow graph or sections of another overall algorithm dataflow graph entirely, how registers can be partitioned between them, and how their operations can be interleaved, but otherwise gives the runtime system the flexibility necessary to efficiently map dataflow graphs onto the CGRA as the tactical situation dictates. The flowware would then perform address generation and other tasks necessary to stream data through the modules, by way of the memory tiles. This may very well be too elaborate, if the “space-time packing” of modules improves the utilization of hardware resources, but consumes so many of these resources itself that there is no improvement in effective utilization.
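A crude sketch of the space-time packing itself, entirely my own toy model of the idea: modules request rectangular regions of the tile array, the runtime system places each at the first free spot, and whatever does not fit is deferred to the next configuration epoch.

```python
# Toy runtime space-time packing: each module asks for a w x h rectangle of
# tiles; the runtime places it at the first free spot (space packing), and
# modules that do not fit wait for the next configuration epoch (time packing).
def find_spot(grid, w, h):
    H, W = len(grid), len(grid[0])
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            if all(grid[y + dy][x + dx] is None for dy in range(h) for dx in range(w)):
                return (y, x)
    return None

def pack(modules, width, height):
    epochs, queue = [], list(modules)
    while queue:
        grid = [[None] * width for _ in range(height)]
        placed, deferred = [], []
        for name, w, h in queue:
            spot = find_spot(grid, w, h)
            if spot:
                y, x = spot
                for dy in range(h):
                    for dx in range(w):
                        grid[y + dy][x + dx] = name
                placed.append((name, x, y))
            else:
                deferred.append((name, w, h))   # runs in a later epoch
        epochs.append(placed)
        queue = deferred
    return epochs

mods = [("fft", 3, 2), ("fir", 2, 2), ("agu", 1, 4), ("huff", 4, 3)]
print(pack(mods, width=4, height=4))
# [[('fft', 0, 0), ('fir', 0, 2), ('agu', 3, 0)], [('huff', 0, 0)]]
```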
The Alternate-Historical Background
The Starling architecture emerged in the late 1960s, in a time of crisis in computer architecture. VLSI design rules were well in hand, and the semiconductor industry had the MOSFET (Dennard) scaling tiger by the tail. With each passing year, photolithographic steppers had more optical elements, and larger numerical apertures. Increasingly, He-Cd short-arc lamps were being used as high-power 214nm light sources to rapidly expose SU-8 photoresists. At this time, the Goodyear Wingfoot instruction set architecture had, by far and away, the highest market share, and the largest and best-developed software ecosystem (Akron is not only the Rubber Capital of the World, but also a “global center for high technology and innovation”). The Wingfoot family of tokenless, dynamic “associative dataflow” (technically control-dataflow) processors is built up from clusters of four “simple, ALU-like” scalar PEs each, connected by Associative Communication Networks that combine features of crossbar switches and content-addressable memories. The ACNs are Scalar Operand Networks that perform “operation-operand matching” by allowing parent nodes in a CDFG to associatively search for their children, not unlike how a “result tag” in the real-world Mill CPU is “compared simultaneously with all of the result tags carried on the bypass data paths for a match,” so that the “tag match mux control circuits” can route corresponding results to the same destination. “The automatic routing mechanism relieves the compiler of the burden of explicitly controlling the dataflow by setting the network switches,” but the compiler can, when appropriate, select an alternative exposed-datapath “programmable crossbar switch” mode with circuit-switched connections “programmed during an initialization phase,” much like how switches “form a circuit-switched network and can create hardware data paths” in the real-world DySER. The instructions are loaded “into the private memories of the processors,” and not only contain “information about data dependencies, in a format so as to permit the implementation of the search phase of the associative dataflow concept,” but also permit “the establishment of localities within dataflow program” and execution of “program fragments that are related to one another to exploit locality and prevent ‘thrashing’ at the top of the memory hierarchy,” up to and including closed-loop operation of the PEs. The Wingfoot ISA fuses RISC-like instructions with special “tag management” instructions, much like how “ᴡᴀᴠᴇ-ᴀᴅᴠᴀɴᴄᴇ is such a simple operation” in the real-world WaveCache architecture that “it can be combined with other instructions” (e.g. fused with ᴀᴅᴅ to form ᴀᴅᴅ-ᴡɪᴛʜ-ᴡᴀᴠᴇ-ᴀᴅᴠᴀɴᴄᴇ) and executed in the same cycle, and this “removes the token-related pipeline stages” much like how “the set of predicates to be updated by an instruction in the triggered-instruction approach is encoded in the instruction itself.” The on-chip network is an ACN-based “hierarchy of crossbars” or “multilevel crossbar system” known as the Hierarchical Crossbar Interface, across which higher-level tag matching provides the inter-cluster communication and synchronization necessary to exploit parallelism at the coarser grain sizes of threads or tasks (the PEs and HCI vaguely resemble the SPEs and EIB of early versions of the real-world Cell BE with a crossbar rather than a ring bus).
The business-friendly reliability with which Wingfoot achieved high performance on codes in which no programming effort whatsoever had been invested to expose parallelism (in the manner of an OoO superscalar processor) made it the de facto standard. It was also quite formidable on a number of numerically-intensive workloads with regular control flow, in which data could be streamed through ACNs in the PCS mode, obviating coprocessors in many applications. The problem was that, with the 180nm process node on the horizon, the “vast infrastructure of slow broadcast networks, associative searches, complex control logic, and inherently centralized structures” of the Wingfoot architecture was starting to drag on performance scaling. The market for mainframes — and, increasingly, workstations — would soon arrive at a critical juncture, and it was anyone’s guess as to which clean-sheet ISA and “chip-level distributed” microarchitecture would be locked in by network effects in the coming wire-delay-limited era.
The issue was forced by the application of dynamic holography to photolithography, as high-average-power excimer lasers and new nonlinear optical materials were spun off from weapons-relevant laser fusion research at the national laboratories. “When using a laser source, phase-conjugate-based and real-time holographic mirrors which do not rely on the use of lenses or conventional mirrors… can be used to project an image.”
“The… images projected by conjugate wavefront generation are of sufficient quality to be used for fine-line lithography. Because these images are produced by nonlinear optical processes that automatically correct for many aberrations in the optical train, they are not degraded by the effects that limit the performance of conventional lens and mirror systems… It is conventional to discuss the resolution and depth of focus of projection photolithography systems in terms of the numerical aperture — the sine of the cone angle of the rays imaged by the optical system onto the substrate. Thus the 40° cone angle of these direct rays implies a numerical aperture for the wave-front conjugator of 0.65… If, however, the beam splitter is a perfect cube and if the partially reflecting surface introduces no angle-dependent phase shifts, light can propagate into the conjugator crystal along highly eccentric ray paths… Degenerate four-wave mixing in the medium can produce perfect phase-conjugate replicas of these waves, which then propagate back through the beam splitter along paths symmetrical to those of the object rays. Thus even these waves are focused onto the image point, increasing the effective numerical aperture to ~0.95… When the radii of curvature of the phase fronts of the [counterpropagating pump] beams differ, the change in curvature alters the effective focal length of the image projection system, breaking the phase-conjugation symmetry between the object and image waves. This asymmetry has as one effect the translation of the focal plane. Whenever the focal plane is translated, the magnification deviates from unity… If large enough laser sources are available, exposure tools can be engineered with flat fields several inches in diameter, capable of exposing an entire silicon wafer at once.”
This “achieves the same goal as contact photolithography” and inverse lithography, without “placing the mask in direct contact with the sample” or numerically solving an “ill-posed inverse imaging problem” (the holographic stepper can be thought of as an analog optical computer that solves this and only this problem). The ArF exciplex lasers are pumped by “arbitrarily large… plasma-cathode electron guns,” and their coherence length seems to strike the right balance between speckle and the “phase-conjugate reflectivity” of the nonlinear CdF₂:(Ga,Y)³⁺ bulk single crystals. At any rate, there are a number of ways to increase the coherence length, short of recourse to RF-pumped waveguide excimer lasers. “The high reflectivity (10%) of this device means that the resist exposure time is minimized,” achieving the required equipment throughput (wafers per day) without a laser system the size of an ICF “laser driver module.” The pump beam wavefronts may be shaped such that the reduction ratio exceeds unity, but aberration does not exceed “tolerance limits,” unless this interferes with the enhancement of the NA by light propagating along “highly eccentric ray paths,” or blows up the complexity of the optical system. With a wide 7.8eV bandgap, these phase-conjugate mirrors can reflect DUV light with a short 193nm wavelength at which “stable molecules… undergo significant photochemistry” and “diffraction-limited spot sizes” are small, driving high-resolution “photochemical resistless pattern etching” and “area-selective CVD” using a “wealth of… photochemical reactions” “in a variety of systems, in wet and dry ambients.” “An integrated circuit fabrication process, based on the laser radical etching technique, would not employ the numerous photoresist patterning and acid or plasma etching steps used in existing lithographic methods.” “Since the etching occurs only where the… light strikes the substrate, it is not necessary to mask portions of the substrate not to be etched,” making it “possible to sharply curtail the manufacturing steps of a large-scale integrated circuit,” with “great cost saving implications.” The reticle is “engineered with topography mimicking that of the substrate in order to maintain focus over the area of an exposure,” and even “grazing-incidence UV radiation is sufficient to give good film coverage even on the nearly vertical walls” of high-aspect ratio via holes.
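For scale, a back-of-the-envelope Rayleigh-criterion check (the k₁ value is my assumption):

$$\mathrm{CD} = k_1\,\frac{\lambda}{\mathrm{NA}} \approx 0.8 \times \frac{193\,\text{nm}}{0.95} \approx 163\,\text{nm},$$

comfortably within reach of the 180nm node mentioned elsewhere, without immersion or aggressive resolution enhancement.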
The resulting step changes in NA and k₁ precipitated what came to be known as the ‘hardware crisis.’ Almost overnight, processor architects had more transistors at their disposal than in their wildest dreams, and only faint ideas about how to use them. At this time, Burroughs was the leader in basic research and “innovative hardware architecture,” and in the best position to resolve the design polylemma. The new processor had to have high serial performance, to be the equal of Wingfoot on programs written in imperative languages, and “where little attention had been paid to special parallel algorithms.” This way, existing software could be recompiled but not rewritten, and run in the manner to which its users were accustomed. At the same time, it had to have high parallel performance, and more importantly, fast and balanced asymptotic scaling of compute and memory performance with respect to the scale of integration, with “a parallel software interface to the gate, wire, and pin resources of a chip” that allowed programmers to “extract the maximum amount of performance and energy efficiency in the face of wire delay… by exposing the underlying physical resources as architectural entities.” This would allow Burroughs to make a credible claim that the migration to the Starling ecosystem would not only be completely painless, but also the last such migration that businesses would have to muddle through for the foreseeable future. It would be not only a cornucopian paradise of effortlessly portable software living on an effortlessly replicated hardware substrate, but also the bridge to get there from the austere world of serial computing.
Meanwhile, Edsger W. Dijkstra’s predicate transformer semantics and Guarded Command Language were ramifying through academic and theoretical computer science. While guarded commands were primarily used to statically prove the correctness of computer programs, researchers at Burroughs took the unusual step of using them as a starting point for a processor architecture. The initial attraction was how, in one fell swoop, the obviation of branch instructions not only improved serial performance by reducing dynamic instruction counts, but also lifted a number of obstructions to the exploitation of parallelism that plagued program-counter architectures. The designers of the Starling architecture stood very much on the shoulders of giants, in that many of their ideas already existed, in one embryonic form or another, in the Wingfoot architecture.
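To make the lineage concrete, here is a minimal, purely illustrative sketch (mine, in anachronistic Python, and not any Burroughs notation) of Dijkstra’s do-od construct read the way the Starling designers read it: a pool of guarded operations with no program counter and no branch instructions, where the guards play the same role that trigger conditions over predicate and data state play in a processing element.

```python
# An illustrative sketch only: Dijkstra-style guarded commands evaluated as a
# pool of operations that repeatedly "bid" for execution against a small state.
# The names (GuardedCommand, run, the gcd example) are hypothetical.

from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]

@dataclass
class GuardedCommand:
    guard: Callable[[State], bool]     # trigger condition over the current state
    command: Callable[[State], None]   # state update performed when selected

def run(pool: List[GuardedCommand], state: State, max_steps: int = 10_000) -> State:
    """Repeatedly fire any one enabled command; stop when no guard holds (do-od)."""
    for _ in range(max_steps):
        enabled = [gc for gc in pool if gc.guard(state)]
        if not enabled:
            break                      # all guards false: the construct terminates
        enabled[0].command(state)      # a hardware scheduler may pick any enabled one
    return state

# Euclid's gcd as two guarded commands: no branches, no instruction sequence.
gcd_pool = [
    GuardedCommand(lambda s: s["x"] > s["y"], lambda s: s.update(x=s["x"] - s["y"])),
    GuardedCommand(lambda s: s["y"] > s["x"], lambda s: s.update(y=s["y"] - s["x"])),
]

print(run(gcd_pool, {"x": 36, "y": 60}))   # -> {'x': 12, 'y': 12}
```

The point of the exercise is that termination and control flow fall out of the guards alone; what the hardware adds is a scheduler that can evaluate all of the guards at once, every cycle.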
As the alternate history has it, Xerox was sitting on a pile of cash and sold the first-generation Paperless Desktop well under cost in order to establish a dominant market position, while at the same time using workstation-like hardware to paper over how the performance of the Starling architecture precipitously declines as the scale of integration is decreased, fewer algorithms can be “taken to the extreme of mapping the entire dataflow graph into the spatial fabric,” and the “computation-to-configuration ratio” falls off a cliff. This is why the Starling 1000 has a monstrous 576mm² die size (and relatively low yields) using a 180nm process technology based on dynamic holography, as well as extensive use of “photochemical resistless pattern etching” and “area-selective CVD” to “sharply curtail the manufacturing steps” and help contain high manufacturing costs. But it worked, and together with the innovative PARC HMI of the XPD, Starling did indeed deliver a user experience from which one simply could not go back, and without which one could not live. Burroughs Murmuration mainframes — the big iron at the back end of government, business, education, and PSS — have two-dimensional arrays of Starling 1000 processors connected by microbridges and edge I/O.
Secondary Storage
The XPD does not have a hard disk, but rather a General Electric Agility 275 electron-beam-addressed phase-change RAM. This advanced, modern storage tube has no moving parts, truly random access, a “very long-lived” SrNbO₃:Ba²⁺ hot cathode, an In₃SbTe₂ storage target with “high thermal stability as well as large resistance contrast,” an “active field neutralizing system,” and a high-bandwidth traveling-wave “eight-fold double deflector system” that not only allows for extremely aggressive memory paging (crucial for the in-memory processing of extremely large datasets by those Burroughs Murmuration mainframe computers acting as “shared-nothing database machines”), but also satisfies the “normal landing condition,” and dynamically corrects aberrations by “applying appropriate focusing and stigmation fields corrections at each point in the deflection field.” “By applying suitable voltages or currents” to the eight “traveling-wave deflection plates,” “the quadrupole field can always be made to have the correct strength and orientation for exactly correcting the… deflection astigmatism,” and because this “electrostatic octupole” is “both a stigmator and a deflector,” it “makes for a compact column structure.” The controller periodically “employs a self-calibration mode for linearity correction” to update the “dynamic correction waveforms.” “The amorphous semiconductor thin film storage target is structureless in that no physical entities, such as metallic spots or holes, are required to define the memory sites. Thus, the bit position is defined by the electron beam, and its size is limited only by the beam diameter and the possible interaction between the thin film and the electron beam.”
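For flavor, here is a toy model (entirely mine; the plate geometry, gain conventions, and calibration interface are invented) of how a single eight-plate “electrostatic octupole” can act as deflector and stigmator at once: each plate voltage is the superposition of a dipole term that steers the beam and a quadrupole term whose strength and orientation are looked up, per deflection address, from tables refreshed by the self-calibration pass.

```python
# Hypothetical sketch of combined deflection + dynamic stigmation on one octupole.
# Plates sit every 45 degrees; a cos(theta) weighting gives a dipole (deflection)
# field and a cos(2*theta) weighting gives a quadrupole (stigmation) field.

import math

def plate_voltages(vx, vy, quad_strength, quad_angle):
    """Eight plate voltages for a commanded deflection (vx, vy) plus a
    quadrupole correction of the given strength and orientation (radians)."""
    voltages = []
    for k in range(8):
        theta = k * math.pi / 4
        dipole = vx * math.cos(theta) + vy * math.sin(theta)
        quadrupole = quad_strength * math.cos(2 * (theta - quad_angle))
        voltages.append(dipole + quadrupole)
    return voltages

def corrected_drive(x, y, calibration):
    """calibration(x, y) -> (strength, angle): the 'dynamic correction waveforms',
    here abstracted as a lookup the controller re-measures during self-calibration."""
    strength, angle = calibration(x, y)
    return plate_voltages(x, y, strength, angle)
```

A focus correction would ride along the same address-indexed tables, but is omitted here to keep the sketch small.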
Tertiary Storage
The Kodak Panorama 828 rewritable “optical tape cartridge drive” can randomly access the phase-change memory sites inside the frame addressed by a green “electron-beam scanlaser.” The four-level operation of the Nd:YAG “active element” at 1064nm permits side pumping by an alkali-metal-vapor arc lamp with good “matching of the alkali spectra with the laser absorption bands.” The nonlinear optical medium KTP is used for both mode selection and frequency doubling of the laser light to 532nm (to reduce the diffraction-limited spot size), having both “high nonlinear-optical d coefficients” and “large linear electro-optic r coefficients and low dielectric constants” that “make it attractive for various electro-optic applications, such as modulators and Q switches.” The controller repetitively Q-switches the scanlaser to generate pulses that can “induce amorphization” of crystalline memory sites, and can extend the life of the tape (which is 35mm wide like ICI 1012) by performing a “recondition operation that overwrites data point transitions with random data.”
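The reason to bother with the doubling stage, in numbers of my own: the diffraction-limited spot diameter scales linearly with wavelength,

\[ d \approx 1.22\,\frac{\lambda}{\mathrm{NA}}, \]

so frequency-doubling from 1064nm to 532nm halves the written mark at a given numerical aperture and roughly quadruples the achievable areal density, before media and servo losses are counted.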
The Modem
The XPD has an internal modem that “supports both twisted copper and coaxial cable” connections to Western Union’s Packet-Switched Service, with its “Switched Internet Architecture” and “PSTN-like addressing.” The real-name system and per-packet attribution of PSS allow the Network Operations Center to credibly deter spamming and netcrime, and give it the look and feel of a walled garden (more like a public library and less like a casino run by human traffickers). The websites that have become household names include the web directories White Pages and Yellow Pages (which “combine searching and browsing”), Sears and other electronic commerce businesses, the Post Office Department, and the Post Office Savings Bank. The quasi-public nature of Western Union — a natural telecommunications monopoly enmeshed with the Interstate Commerce Commission — leads it to administer PSS in the spirit of the British model of public broadcasting. The 1976 election season marks the emergence of PSS as the “digital town square,” with the exploding popularity of newsgroups and punk webzines.
The Monitor
The portrait-oriented monitor is a serpentine-scanning beam-index tube with 1536×2048 resolution and 24-bit color at a 60Hz refresh rate, and ZnS powder phosphors that are synthesized inside the faceplate using area-selective “ion implantation followed by thermal annealing” (and may even have some “penetron characteristics” in order to improve the accuracy with which they reproduce the color of solid ink prints). “The perfectly flat faceplate… facilitates surface treatments such as antireflective coatings.” The 18" display is remarkably light: it has just one electron gun, no “shadow mask or aperture grille,” no photomultiplier tube or avalanche photodiode, and a relatively small power supply (as there is no shadow mask to dissipate “80% of the total electron beam current” as heat). Moreover, “no implosion-proof glass is required on the front face of the panel,” and it is not “necessary to increase the thickness of the flat faceplate and its skirt by about 20% over a conventional CRT so as to maintain the same tensile stress level,” because an “implosion band is fastened around the skirt of the faceplate before the CRT is evacuated” and the “strapping force caused by the band around the panel” offsets “the deformation caused by the evacuation.” The uniformly high brightness and contrast, wide color gamut, high color accuracy, and freedom from geometric distortion of the display help make the XPD a powerful desktop publishing workstation.
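To put the video path in perspective (my arithmetic, not a spec sheet):

\[ 1536 \times 2048 \times 60\,\mathrm{Hz} \approx 1.9 \times 10^{8}\ \mathrm{pixels/s}, \qquad 1.9 \times 10^{8} \times 24\ \mathrm{bits} \approx 4.5\ \mathrm{Gbit/s}, \]

and with typical blanking overhead the dot clock lands somewhere north of 230MHz, which is the sustained rate the video path has to deliver every frame.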
The Keyboard
The standard input device is a Maltron-style “dual-hand fully ergonomic 3D” Dvorak keyboard with pointing sticks in the “finger groups” of keys, so that not only are “the most common letters and bigrams… typed on the home row, where the fingers rest, and under the strongest fingers,” but also “the hands do not need to be removed from the home row to manipulate the mouse pointer,” reducing “the time involved in the back and forth motion of an operator’s hand between the typing keyboard and the analog pointing device.” The default software setting is to point with the right stick, and scroll with the left stick (vertically and horizontally, obviating scrollbars).
The Operating System
The Xerox Metaphor distributed, single-address-space operating system (derived from Burroughs Twilight) is bundled with Bravo for professional-grade WYSIWYG document preparation, SuperPaint for the editing of “full-color 2D images and… animations,” the Laurel electronic mail client, and the Hardy web browser. This alien, funhouse-mirror OS blends lightweight kernel, multikernel, unikernel and exokernel features.
The Cooling System
The MCM isn’t cooled by air, mineral oil, or even Fluorinert — it’s hermetically encapsulated in a thin film of dielectric borosilicate glass by chemical vapor deposition, and then cooled by immersion in a coolant with a high boiling point and critical heat flux, such as automotive antifreeze or even tap water. Two-phase immersion cooling is mandatory, not only because of the high thermal power density and low ratio of surface area to volume of the MCM, but also because the high temperature uniformity and low thermal stress are necessary for the MCM to achieve a reasonable MTBF — it is absolutely not user-serviceable. It is doubly mandatory in Burroughs Murmuration mainframe computers that must remove heat from arrays of chip stacks joined along all of their edges by “microbridges.” These modules and arrays, mounted together with certain I/O and power supply components on glass motherboards, are packaged inside removable cartridges. The construction is similar to that of vacuum tube electronics with high thermal power densities and large temperature swings, especially in the use of robust, inorganic materials with closely-matched thermal expansion curves (as with borosilicate glass and Kovar). The high operating temperatures are the key to effective yet fanless and silent heat dissipation, and indeed, the entire surface of the finned 6061 aluminum tower case becomes hot to the touch when high loads are sustained. The case acts as a condenser, heat spreader, and heat sink, with a metallic wick that acts as a robust inorganic dropwise promoter by wicking the condensate film off the wall (without being flooded by the condensate as in a heat pipe), and even an anodic film with a high infrared emissivity.
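To put a number on ‘high thermal power density’ (the dissipation figure is an assumption of mine, not something stated here): a 576mm² chip stack dissipating, say, 100W presents a surface heat flux of

\[ q'' = \frac{100\ \mathrm{W}}{5.76\ \mathrm{cm^2}} \approx 17\ \mathrm{W/cm^2}, \]

far more than natural convection to air could carry off a bare module, yet comfortably below the roughly 100W/cm² pool-boiling critical heat flux of water at atmospheric pressure; that margin is what makes nucleate boiling on the glass-encapsulated stack workable at all.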
The Future
It is thought that by the year 2000, “it may be possible to do wavefront conjugation by using [VUV] lasers… and an atomic vapor as the conjugator medium.” This would probably involve a free-electron laser, gas-dynamic phase-conjugate mirror, gas puffer, and “differential pumping setup” so as “to induce an etching chemical reaction of the adsorbed etchant precursor gas at the region of interest” (or else that of a “solid etchant… induced under the action of laser light”). As the alternate history has it, by 2000, more processing elements are integrated on-chip than could ever be put to good use, creating a glut of personal computing power (if there wasn’t one already).
Burroughs Starling 1000 (system-on-module) [triggered-instruction tile processor] [“operand registers”] [“dependency network”] [area I/O] [wave pipelining] [DRAM] [“network-on-memory”] [“memory access scheduling”] [“refresh-triggered computation”]
It is a face-to-back “chip stack MCM” with back-thinned and “Cu-Cu thermocompression bonded” flip-chips. The DRAM chips have through-silicon vias. The processor has 1.) a “decentralized architecture” with no “global control… global data,” “global communication or broadcast… global register files… associative structures,” “unified look-up tables,” or “logic which incorporates state from all of the processing elements,” and “enhances locality and diminishes the effects of slow wires” by “keeping all control and most communication local to each” pipeline stage, and 2.) a “decoupled access/execute architecture” with compute tiles that “have no access to local memory apart from a small, fast register set,” and memory tiles that have “access to a large but slow memory,” such that “all load/store operations are performed on memory processors, which send data into the… array implemented by the compute processors,” and 3.) an “exposed datapath architecture” with data transports that are “visible at the architecture level, and are completely programmer-controlled,” “programmer access to function unit ports or small register files coupled with function unit ports to avoid overheads of a large centralized general-purpose register file,” and “only one instruction in the classical sense” such that instruction words contain “modularly grouped source and destination operands, which allow simple and fast instruction-decoding hardware and limit signal switching for those modules not involved in the transfer, thus reducing dynamic power consumption.” The tiles communicate by accessing “outgoing ports… mapped to write-only registers” and “incoming ports” mapped to “read-only registers” such that no “additional synchronization” is required “to prevent values in the destination” processor “from being illegally overwritten.” The compute tiles each have “a two-stage pipeline with a split-stage ALU.” The functional units have “non-uniform register access,” and “can be split into independent parts,” which are “used for different functionality” and “may be used concurrently.” The memory tiles 1.) are placed “in the interior of the chip” like the compute tiles, rather than “on the periphery of the chip,” 2.) have non-uniform memory access, 3.) “maintain local state,” “hold memory requests,” and “are responsible for implementing the… memory interface that guarantees correct memory ordering,” like the “store buffers” of “wave-ordered memory,” 4.) are “responsible for all address calculations for both loads and stores,” and have “reconfigurable scalar datapaths to generate DRAM requests” and “specialized hardware components for generating addresses and packing and unpacking data items,” and 5.) “operate in a dataflow style by checking for the availability of input arguments and output buffering, if required, before selecting a memory operation to execute” like the “processing elements of the fabric,” and “are generally decoupled from one another except as necessary to preserve program correctness,” such that if two “memory operations are neither directly nor indirectly data-dependent, they can execute in any order.” The “aggressive,” bufferless, full-duplex OCNs 1.) 
are “in essence 2D bypass networks serving as bridges between the bypass networks of separate tiles,” and can provide “ALU-to-ALU operand delivery” with zero-cycle “inter-PE latency… meaning that the downstream processing element may operate on data in the cycle after it is produced” because they are not only “register-mapped and do not require a special send or receive command,” “but also integrated directly into the bypass paths of the processor pipeline” such that “bypass wires that commonly feed signals back to the operand registers also connect the individual pipeline stages to the outgoing network” registers, and 2.) are “programmed routing networks permitting any globally-orchestrated communication pattern on the network topology,” because the network ports are “exposed to the programmer by mapping them into register names in the instruction set,” “instructions can target the networks just as easily as registers,” and the integration of “fine-grained communication” into the “microprocessor model allows a broad range of fine-grain computational algorithms (such as dataflow and systolic paradigms, including wavefront) to be run,” 3.) “are disjoint single- and multi-bit interconnects, where each port on a device is connected to the appropriate interconnect,” with “auxiliary control wires… in addition to the inter-PE network” that “form a basic FPGA within the architecture for control and bitwise logic,” provide “an effective method for offloading control and predicate signals,” can “move predicates from one processing element to another” and may “include functional units to merge predicates,” and provide a “catch-all safety net for any dynamic events” such as “interrupts… I/O… synchronization, and context switches.” The compiler 1.) computes “the data dependencies within the code” and transforms “implicit dependencies between code and state” into explicit dependencies by adding control edges to the decoupled CDFG such that “by default, everything is concurrent, unless explicit dependencies imply a specific ordering” of instructions and the processor can, “within those constraints, execute the program with the maximum possible concurrency,” 2.) “models the communication costs of the target network accurately,” considers “non-uniform register access” and locality “when assigning instructions to functional units,” balances “instruction contention for processing elements and operand latency between producer and consumer instructions,” controls “the routing of operands through the forwarding network so that ephemeral values need not be written to registers,” keeps “a significant fraction of the operand bandwidth local to the function units,” places “operands close to the function unit that will consume them,” keeps “data produced and consumed on the same function unit local to the function unit,” ensures “that producer-consumer operations on the critical path are placed to enable low-latency communication through the bypass paths,” and “performs operation assignment in a way that gravitates sections of those critical paths to a single node so that that node’s zero-cycle local bypass network can be used, minimizing latency,” 3.) performs optimizations across routines “at a spectrum of grain sizes,” and uses “fine-grained control of the instruction effects” and “smaller granularity function unit operations visible in microprograms” to “achieve better parallelism and resource utilization,” and 4.) 
“enforces a separation of concerns between computation and on-chip memory management,” and “slices the original control-dataflow graph” into a computation slice that includes “the core computation underlying applications” and “cannot have load or store instructions” because it “has no direct memory access,” and a communication slice that includes “operations involved in memory addresses calculation” and “supplies required data to the computation slice.” The ideas are that 1.) the TIA scheduler has reasonable “area and energy” consumption, despite the inherently “parallel nature of trigger resolution” and “the need for all trigger fields to be exposed combinationally to the scheduler in order to execute at a rate of one instruction per cycle,” because it “utilizes predicates as tokens for reducing the tag matching area,” the “hardware scheduler is built from combinatorial logic” and “is essentially a few 1-bit wide trees of AND gates feeding into a priority encoder,” and the “register-based instruction memory” is clock-gated, 2.) the TIA “provides direct hardware support to coordinate parallel activities” as efficiently as “it executes instructions” and “synchronization and task management… as efficient as the scheduling of primitive operations like add and multiply” such that “low-level synchronization is performed… at the same rate as instructions are issued” like a dynamic dataflow architecture, but “expresses dependencies via single-bit predicate registers that are explicitly managed by the program, improving scheduler scalability and removing the token-related pipeline stages” such that “the cost of synchronization is low enough to allow its free and unencumbered use in decomposing programs for parallel execution” and “the objective is to expose the maximum amount of parallelism by subdividing the computation as finely as possible,” 3.) the TIA is a true “unified-hybrid architecture” that can execute “both dataflow and control flow scheduling since both models are intimately bound in the architecture,” freely “mix functional and imperative computation in a single program,” and efficiently handle “dependencies between code and state” and “non-trivial effect-coeffect interaction,” because triggered instructions “is a superset of many traditional control paradigms,” “a TIA can be made to emulate the behavior of other control paradigms” with “clever use of predicate registers,” “both control and data dependencies in TIA are converted into predicate dependencies” and “handled in the same way,” “the programmer can choose to use” predicate bits “to indicate data readiness, but can also use them for other purposes, such as control flow decisions” by “specifying triggers that span multiple predicates,” and the “execution of a particular instruction is caused by the availability of a specific set of tokens, which may be data, control, or a mixture,” 4.) each “two-stage pipeline with a split-stage ALU” can rapidly update “the predicate state,” execute some predicate-dependent instructions “on subsequent clock cycles,” and “keep the PE making forward progress” using “single-cycle execution” of low-latency “fast operations” that latch predicate results “into one or more predicate registers” by the end of the TDX1 stage, because the “instruction decode and execution hardware is so simple (and timing so fast) that these operations are moved into the same clock cycle as the” trigger stage “with minimal impact to the maximum operating frequency,” 5.) 
the TIA scheduler reduces both horizontal and vertical waste, because it uses “tick information… for dynamic operation fusion scheduling” and exploits “the slack cycle time to accomplish more work in a single clock cycle when the critical timing paths are not exercised” as part of the dispatching of triggered instructions to the datapath “subject to datapath resource constraints and conflicts” and packing together of instruction effects “that can be executed in parallel, obeying data and structural dependencies,” 6.) “multiple algorithms may be mapped to a single processing element and executed… in an interleaved manner,” such that “no dynamic overhead is incurred to switch among threads when programs are executing in the pipeline,” and the “fine-grain interleaving of instructions based on data availability… can tolerate both long memory latencies and short instruction latencies, while still executing at full capacity even if only one thread is present,” as well as prevent pipeline stalls when there are no “critical rate-limiting loops in the… pipeline,” 7.) the “average injected traffic” on the bufferless OCNs is almost always “below the network saturation point” despite saturation being “reached at lower injection rates compared to buffered routing,” because “closed-loop operation” of functional units and operand sharing “reduce the overall network communication,” and the compiler uses “high-level structural information about the program,” a “global-level understanding of program execution,” and “slack in the program” to “control the timing of the function unit operand and result data transfers,” move operands “to function units earlier than the clock cycle where the operation begins execution,” “move the results to the next consumers” later than the clock cycles “when they are ready,” and “opportunistically orchestrate the routing of operands through the forwarding network” in such a way that operand registers, rather than queues, “temporarily store” operands in order to “smooth the flow of data between cells,” 8.) the “dependency network” eliminates underutilization of the 64-bit OCN by 1-bit signals, “allows propagation of… control and data signals to take place concurrently,” is “very narrow” and “can use thick wires to send signals to greater distances within a single cycle” than the 64-bit OCN and reduce the “delays involved in synchronization,” and provides a “fast, distributed” synchronization primitive to the memory tiles that “sidesteps main memory” and is “more scalable than load-store queues” when “memory accesses arising from sequential programming languages must be serialized… due to the potential for aliased addresses, and… some ordering is applied to the memory system in order to implement entity-to-entity communications protocols,” 9.) 
it keeps “each tile small to maximize the number of tiles that can fit on a chip, increasing the chip’s achievable clock speed and the amount of parallelism it can exploit,” relaxes the trade-off between “instruction contention for processing elements and operand latency between producer and consumer instructions,” and achieves high “performance in not only parallel applications, but also serial applications as well,” because the “elimination of… buffers and virtual channels” enables “significant area savings,” the placement of “I/O ports directly into the IC area, rather than… the periphery of the chip” results in a “substantial reduction in routing, and a corresponding savings in chip… area,” the processing elements have few of the “bulky… structures found in cores,” and each processing element is “already parallel in its native execution model” and can not only “gain the ILP benefits of an out-of-order issue processor without the expensive instruction window and reorder buffer” but also “take advantage of multithreading without duplicating data and control state, but by the compiler partitioning resources as it sees fit,” 10.) it has high performance-per-watt, and the “area allocated to power vias” is reasonable, because it has no long “intra-board wiring required by conventional levels of circuit integration,” the “I/O pads are directly placed where they are connected to the core, reducing… the capacitive loads,” the “mesh topology” of the two-dimensional OCN and three-dimensional network-on-memory allows the “logic and DRAM slices” to be “packed densely and results in uniformly short wires” with low “parasitic impedances,” “closed-loop operation” of functional units avoids communication over long and wide bypasses that “drive large multiplexers at the functional unit inputs, and require significant drive power or repeater power,” the OCN “can directly transfer operands between functional units” such that “intermediate values avoid the register file completely” and reduce “register file energy consumption,” the operand registers are “accessible at a lower energy cost than the centralized register file,” the inter-PE communication is “direct and mostly short-haul… as opposed to occurring over a wide, full-chip network as in typical multicore processors,” the “many processing elements, each of which holds live program values” gives “the aggregate effect of a huge register file in a traditional architecture, which dramatically reduces memory accesses” by aggressively converting “expensive memory operations into inexpensive local communication,” “data transfer and placement are tightly controlled to minimize power consumption per memory access,” “very little energy is spent at runtime on the configuration… as opposed to sequential cores which spend energy fetching configuration information (an instruction) nearly every cycle,” “instructions may be stored directly” in the instruction stores “rather than in a virtualized instruction cache,” the processor has few of the “energy-hungry control structures associated with von Neumann architectures, including most of the instruction-side microarchitecture,” the processor “is already parallel in its native execution model” and “neither requires speculation to increase performance nor does it need to repeatedly re-extract parallelism from a sequential program representation, thereby avoiding two of the main energy taxes in von Neumann architectures,” the “processing elements may be relatively simple and their behavior in a particular dataflow graph 
be statically known” such that “clock gating and power gating techniques may be applied more effectively than in coarser architectures,” the ISA “provides a parallel software interface to the gate, wire, and pin resources of a chip” and “lets programmers extract the maximum amount of performance and energy efficiency in the face of wire delay… by exposing the underlying physical resources as architectural entities,” 11.) it has high serial performance, because the elimination of “branch or jump instructions” and “load/store operations on compute tiles” reduces “the instruction count on the critical path,” and the “latency through critical path code” is reduced by “single-cycle execution” of “fast operations,” dynamic operation fusion, “closed-loop operation” of the functional units, operand forwarding, and the “elimination of… buffers and virtual channels,” 12.) it has high parallel performance, because the “dataflow firing rule is inherently decentralized” and “well-suited to running on a decentralized, scalable processor” without the “global coordination upon which the von Neumann model relies, in the form of a program counter,” the compiler and RTS place tasks on the “spatial fabric” such that the “distances between processors to which parallel tasks are assigned” correspond to the grain sizes of the “inherent parallelism available in applications,” the “graph-execution style, small size, and malleability” of the processing elements “and the network together enable the expression” of “parallelism of arbitrary granularity” including “instruction, data, pipeline, vector, memory, thread, and task parallelism” “as no distinction is made between these forms of parallelism,” it “does not mandate a particular form of parallelism… be present in an application in order to benefit from the CSA,” the “spatial programming model” exploits irregular parallelism characterized by “asynchronous actors performing different tasks, while frequently communicating with neighboring actors” rather than “uniform data parallelism” and “simple and straightforward” synchronization, coherence ordering alleviates the “serialization bottleneck that a global memory ordering requires,” “exposing communication or memory accesses explicitly in the program or ISA provides more opportunities… to explore MLP,” the decoupled CDFG is annotated with “memory aliasing information from the compiler to expose large amounts of memory parallelism” that “helps to support parallel processing element computation,” the “orthogonal and complementary” forms of exploited “parallelism at a spectrum of grain sizes” “work synergistically to provide multiplicative speedup,” and the aggressive “fine-grain parallelization” allows even small “personal or business applications” to “easily fill and utilize the entire” processor, 13.) 
the control edges in the communication slice of a decoupled CDFG not only “encode information about dependencies between memory operations” but “also express independence among memory operations,” “give the programmer or compiler complete control over memory ordering,” provide the “very good memory aliasing” necessary to “express parallelism directly by judiciously placing dependences only between those memory operations that must actually execute sequentially” such that “by default, everything is concurrent, unless explicit dependencies imply a specific ordering” and the processor can execute “subprograms of memory operations” with the “maximum possible concurrency” within those constraints, and the “coherence order” of memory operations can be automatically inferred without any dependency tokens or the associated picks, switches, steers, ϕ-functions and other “control operators” that convert “control dependences into data dependences by sending data values to the instructions that need them” and “introduce additional latency that can delay the execution of memory operations,” because they encode “the necessary control flow information” like “wave-ordering annotations,” and imperative “subprograms of memory operations” automatically rationalize “functional subprograms” with “mutable data structures” in memory by ensuring “determinacy at the expense of serializing access to the shared data structures” and “correctly sequencing reads with respect to writes in the parallel execution” during the direct and native execution of the decoupled CDFG by the “unified-hybrid architecture” processor, 14.) the compiler and RTS are “aware of the internal DRAM device timing” and NUMA, use “high-level structural information about the program” and a “global-level understanding of program execution” to place “load and store instructions” on the memory tiles, place variables in memory, and map “virtual addresses to ranks, banks, and rows in DRAM” in such a way that the distribution of data across “separate banks or channels” of memory reduces “bank/channel conflicts,” “refresh operations and the read/write requests” are synchronized “such that read/write requests of the application naturally refresh DRAM,” and the memory tiles “generate a schedule that conforms to the timing and resource constraints,” “three-dimensional nature” and “non-uniform access times” of the DRAM, and improve memory bandwidth utilization “by buffering memory references and choosing to complete them in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access,” 15.) 
it has “the proper balance between memory bandwidth and computational bandwidth,” because the “ability to operate directly on data coming from the networks allows” a “dramatic reduction of memory reads and writes,” the “extremely dense interconnects” allow “massive communication bandwidth to be achieved between the processor cores and memory,” the “multi-ported and distributed” memory system “has no central bottlenecks and can sustain a high aggregate memory system bandwidth” using a “large number of independent memory banks, high-bandwidth through-silicon vias and a high-speed I/O interface,” area I/O and the performance of “memory access on tiles… in the interior of the chip” provide better scaling of memory bandwidth “asymptotically with respect to the number of transistors” than edge I/O, refresh-triggered computation “eliminates conflicts between refresh and read/write,” memory access sequencing “increases the bandwidth utilization,” and the “vault controllers can serve both regular memory accesses and DRAM refresh requests in parallel with” not only “bulk data copy and initialization operations” but also “data migration to local processor nodes” “entirely within the memory across DRAM banks,” 16.) the processor can almost always “use excess parallelism to cover” the “unpredictable latencies in memory,” because it exposes massive “multi-granular parallelism,” the wafer thinning and three-dimensional integration result in “decreased total wiring length, and thus reduced interconnect delay times” and eliminate “long-wire, high-latency communication” over “on-board wires that connect the processor and memory chips,” area I/O and the performance of “memory access on tiles… in the interior of the chip” avoid “the delay involved in going to and coming back from the chip edge,” “memory access latency is reduced” by “NUMA-aware data placement and migration,” and refresh-triggered computation “eliminates conflicts between refresh and read/write,” 17.) the performance on “a wide variety of everyday workloads” scales “very well asymptotically with respect to the number of transistors” with little to no additional software “programming costs” and hardware “design and verification costs,” because “changing the memory or instruction execution latencies has no effect on the correctness of the program,” the “computation power… can be scaled in two dimensions” by “increasing the number of tiles, rather than increasing the size of the tiles,” “larger designs contain more tiles, but the tile structure… does not change,” the decentralized architecture can “scale to any number of tiles, both in terms of functional correctness and in terms of performance” and “does not require a redesign in order to accommodate configurations with more or fewer processing resources,” and when designers “gain access to more transistors, they simply put down more computing elements,” 18.) it is a true CPU and GPP responsible for “providing the OS environment, executing system calls, and initializing the memory contents of an application,” and “a robustly performing product instead of an experimental system” that can gracefully handle “the numerous sources of variable dynamic latency,” “exceptions and interrupts, I/O, DMA, and self-modifying code” of a “wide range of PC workloads… in real-life operation,” and 19.) 
it is “a cost-effective alternative to custom hardware in many situations,” offers “a universal solution for both general- and special-purpose applications,” and can simulate “traditional hardware structures like video RAMDACs.”
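Since so much above hangs on how trigger resolution actually behaves, here is a behavioral sketch (software-level only; the names, the channel interface, and the example program are mine, and no claim is made about the real microarchitecture) of the scheduling loop described in items 1.) through 3.): a small static pool of instructions, each guarded by required predicate values and input-channel readiness, from which the highest-priority enabled instruction is selected each cycle; its effect updates the data registers, the channels, and the predicate bits that arm the next trigger.

```python
# Behavioral sketch of triggered-instruction scheduling on one processing element.
# Every "cycle", all triggers are evaluated against the predicate and channel state
# (as a priority encoder over AND-trees would), and one enabled instruction fires.
# All names and the accumulate/emit example are hypothetical.

from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TriggeredInstruction:
    name: str
    when: Dict[str, bool]                             # required predicate values (the trigger)
    reads: List[str] = field(default_factory=list)    # input channels that must be non-empty
    effect: Callable = None                           # effect(pe): datapath + predicate updates

@dataclass
class PE:
    predicates: Dict[str, bool]
    regs: Dict[str, int]
    chans: Dict[str, deque]
    program: List[TriggeredInstruction]

    def enabled(self, ti):
        preds_ok = all(self.predicates[p] == v for p, v in ti.when.items())
        data_ok = all(self.chans[c] for c in ti.reads)
        return preds_ok and data_ok

    def step(self):
        for ti in self.program:       # static priority, like a priority encoder
            if self.enabled(ti):
                ti.effect(self)
                return ti.name
        return None                   # nothing triggered this cycle

# Example: accumulate values arriving on "in" until a zero arrives, then emit the sum.
def acc(pe):
    v = pe.chans["in"].popleft()
    if v == 0:
        pe.predicates["done"] = True
    else:
        pe.regs["sum"] += v

def emit(pe):
    pe.chans["out"].append(pe.regs["sum"])
    pe.predicates["done"] = False
    pe.regs["sum"] = 0

pe = PE(
    predicates={"done": False},
    regs={"sum": 0},
    chans={"in": deque([3, 4, 5, 0]), "out": deque()},
    program=[
        TriggeredInstruction("emit", when={"done": True}, effect=emit),
        TriggeredInstruction("acc",  when={"done": False}, reads=["in"], effect=acc),
    ],
)

while pe.step():
    pass
print(list(pe.chans["out"]))          # [12]
```

The example is deliberately tiny, but it shows the characteristic shape: there is no program counter and no branch, only predicates flipping which instructions are eligible, and the same mechanism that sequences the loop also synchronizes it with the channels feeding it.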
“Early advocates of data-driven parallel computers had grand visions of plentiful computing power provided by machines that were based on simple architectural principles and that were easy to program, maintain, and extend… A major problem is contention for a shared resource… Contention can often be reduced by careful coordination, allocation, and scheduling, but if this is done at runtime, it increases the overhead due to parallelism, that is, processing that would be unnecessary without parallelism. If, during a significant part of a computation, a major part of the processing power is not engaged in useful computation, we speak of underutilization. A useful measure of the quality of a parallel computer is its effective utilization, that is, utilization corrected for overhead. The best one can hope for is that the effective utilization of a parallel computer approaches that of a well-designed sequential computer. Another desirable quality is scalability, that is, the property that the performance of the machine can always be improved by adding more processing elements. We speak of linear speed-up if the effective utilization does not drop when the machine is extended… In many areas that have great needs for processing power, the behavior of algorithms is irregular and highly dependent on the input data, making it necessary to perform scheduling at runtime. This calls for asynchronous machines in which computations are free to follow their own instruction stream with little interference from other computations… Computations are seldom completely independent, however, and at the points where interaction occurs they need to be synchronized by some special mechanism. This synchronization overhead is the price to be paid for the higher utilization allowed by asynchronous operation… There are different strategies to keep this price to an acceptable level. One is to keep the communication between computations to a minimum by dividing the task into large processes that operate mainly on their own private data… A different strategy to minimize synchronization overhead is to make communication quick and simple by providing special hardware and coding the program in a special format… Because communication is quick, the processes can be made very small, about the size of a single instruction in a conventional computer. This makes segmentation trivial and improves scalability since the program is effectively divided into many processes and special hardware determines which of them can execute concurrently… In dataflow machines each instruction is considered to be a separate process. To facilitate data-driven execution each instruction that produces a value contains pointers to all its consumers. Since an instruction in such a dataflow program contains only references to other instructions, it can be viewed as a node in a graph… Because the control flow arcs have been eliminated, the problem of synchronizing data and control flow has disappeared. This is the main reason why dataflow programs are well-suited for parallel processing. 
In a dataflow graph, the arcs between the instructions directly reflect the partial ordering imposed by their data dependencies; instructions between which there is no path can safely be executed concurrently… In a dataflow graph values flow from one node to another and are, at least at that level of abstraction, not stored in memory… The dataflow equivalent of a selective update operation (changing one element of a structure) is an operation that consumes the old structure, the index, and the new value and produces a completely new structure. This involves copying of structures even when they are stored… Possibly the largest source of overhead computation is in the handling of large data structures… It is clear that such copying is not feasible when large data structures are involved and that data structures have to be stored. Since this storage is visible in the dataflow graph, it constitutes a deviation from the pure functional approach.” “At one extreme is the traditional von Neumann scheduling model, where the instructions are fully ordered… This model does not cleanly extend to parallel processing, but it does allow locality to be exploited to great advantage. The other extreme is… dataflow, where the instructions are partially ordered by the dependencies in the computation. This model can exploit any form of parallelism, although it does not benefit from locality… The ability to exploit all forms of parallelism brings with it the possibility that a program will unleash so much parallelism that the machine’s resources will be overwhelmed… In fact, the intrinsically parallel nature of Epsilon-2 allows us to turn the traditional problem of parallelizing sequential programs on its head — we now start with inherently parallel programs and search for opportunities to sequentialize portions for greater efficiency.” “The dataflow approach exposes ample parallelism, even in programs where little attention has been paid to special parallel algorithms, but the approach may be too successful in this respect. As more parallelism is exposed, more resources are required. Unless precautions are taken, programs with tremendous parallelism will saturate, and even deadlock, a machine of reasonable size… functional languages do not permit updating of data structures… Since a dataflow program specifies only a partial order on the set of operations it comprises, there are many different permissible schedules of operations. Different schedules may exhibit different amounts of parallelism and different resource requirements, although all schedules compute the same result. It is this flexibility that makes the approach tolerant of unpredictable memory latency… It appears that there is a significant cost associated with exposed, but unexploited, parallelism.” “It… is essential to abstract the underlying parallel hardware and provide a configuration-independent programming model… Programs are expressed in a declarative language… which specifies only the data dependencies between sub-computations, allowing for maximally parallel execution. Execution is performed in a data-driven manner in order to exploit all the potential parallelism in a program… Moreover, programs expressed in this form exhibit a great deal of parallelism, even using conventional algorithms. 
A parallel program invariably performs more work than a sequential one for the same problem due to generation and synchronization of parallel activities… in dataflow architectures, all memory references and synchronization operations are ‘split-phased,’ and thus do not block the processor pipeline. In addition, independent ‘threads’ of computation can be interwoven on a cycle-by-cycle basis… Traditionally a program is specified by sequence, i.e. a total order, of operations against the machine state. In many cases this over-specifies the program, because it may be permissible for a pair of operations to be performed in any order, yet there is no way to express this… For dataflow machines, programs are represented as a graph, rather than a sequence, of instructions so that the partial order is not obscured. Any two nodes not connected by a path in the graph represent an opportunity for parallel execution. Where an operation has multiple operands, the presence of all of the [operands] must be detected before the operation can proceed. Thus, synchronization of parallel activity and initiation of these activities are intrinsic to the instruction execution cycle of a dataflow machine… Since the program is a graph, rather than a sequence, the locus of control is a collection of instructions, rather than a single point in the sequence… The matching operation is the basic means of synchronizing independent activities… Normally, instruction pipelining is used to overlap the execution of consecutive, yet independent, operations in a single instruction sequence… In a few machines, instruction pipelining is used to overlap independent sequences, which may represent distinct tasks. The dataflow pipeline serves both roles, as no distinction is made between these forms of parallelism, and a very high pipeline utilization can be achieved… Thus, a uniform synchronization mechanism is used for all levels of parallel activity… There is no branch instruction, since there is no single locus of control… The overhead in dataflow execution is directly attributable to asynchronous parallel execution and is typically of the same magnitude as the basic work.” “The dataflow model of… execution is asynchronous, i.e. the execution of an instruction is based on the availability of its operands. Therefore, the synchronization of parallel activities is implicit in the dataflow model… instructions in the dataflow model do not impose any constraints on sequencing except the data dependencies in the program. Hence, the dataflow graph representation of a program exposes all forms of parallelism eliminating the need to manage explicitly parallel execution of a program… However… the dataflow model incurs more overhead in the execution of an instruction cycle compared with its control flow counterpart due to its fine-grained approach to parallelism. The overhead involved in the detection of enabled instructions and the construction of result tokens will generally result in poor performance in applications with a low degree of parallelism… Any two enabled instructions can be executed [either in] order or concurrently. This implicit parallelism is achieved by allowing side-effect-free expressions and functions to be evaluated in parallel. In a dataflow environment, conventional language concepts such as ‘variables’ and ‘memory updating’ are non-existent. Instead, objects (data structures or scalar values) are consumed by an actor (instruction) yielding a result object that is passed to the next actors. 
The same object can be supplied to different functions at the same time without any possibility of side effects… A typical dataflow instruction cycle involves detection of enabled nodes by matching tokens, computing the results, and generating and communicating the result tokens to appropriate target nodes… The main implication of the functionality principle of dataflow is that all operations are side-effect-free, i.e., when a scalar operation is performed, new tokens are generated after the input tokens have been consumed. However, if tokens are allowed to carry vectors, arrays, or other complex structures in general, the absence of side effects implies that an operation on a structure element must result in an entirely new structure. Although this solution is acceptable from a theoretical point of view, it creates excessive overhead at the level of system performance… Due to its fully distributed nature, implementation of data structure in a dataflow environment requires a memory system capable of handling concurrent accesses. This requires careful considerations of issues such as memory organization, dynamic memory allocation, and garbage collection.” “Even though the characteristics of [the] instruction cycle may be consistent with the pure dataflow model, it incurs more overhead than its control flow counterpart. For example, matching tokens is more complex than simply incrementing a program counter. The generation and communication of result tokens imposes inordinate overhead compared with simply writing the result to a memory location or a register. These inefficiencies in the pure dataflow model tend to degrade performance under a low degree of parallelism… Dataflow architectures based on the original model provide well-integrated synchronization at a very basic level — the instruction level… the self-scheduling paradigm fails to provide an adequate programming medium to encode imperative operations essential for execution of operating system functions… The execution model requires combining result values with target addresses to form result tokens for communication to successor nodes. Sequential scheduling avoids this overhead; whenever locality can be exploited, registers can be used to temporarily buffer results… Architectures based on the dataflow model offer the advantage of instruction-level context switching. Since each datum carries context-identifying information in a continuation or tag, context switching can occur on a per-instruction basis… The generality of the dataflow scheduling makes it difficult to execute a logically related set of threads through the processor pipeline, thereby removing any opportunity to utilize registers across thread boundaries… Therefore, appropriate means of directing scheduling based on some global-level understanding of program execution will be crucial to the success of future dataflow architectures.” “Dataflow architectures offer… a uniform synchronization paradigm, representing one end of a spectrum wherein the unit of scheduling is a single instruction. At the opposite extreme are the von Neumann architectures which schedule on a task, or process, basis… the combination of dataflow-style explicit synchronization and von Neumann-style implicit synchronization in the same instruction set results in an architectural synergism… Compiler technology coupled with simple pipeline design is now used effectively to cover bounded memory latency in uniprocessors. 
Unfortunately, the situation is qualitatively different for multiprocessors, where large and often unpredictable latencies in memory and communications systems cannot be tolerated by using similar techniques. This is attributable at the architectural level [to] poor support for inexpensive dynamic synchronization. Specifically, latency cost is incurred on a per-instruction basis, but synchronization on a per-instruction basis is impractical… Traditional compile-time sequencing is too weak a paradigm for general-purpose machines… and traditional runtime sequencing mechanisms are not sufficiently flexible… Dataflow architectures offer such synchronization at a per-instruction level as the normal modus operandi. Each instance of an instruction can be thought of as an independent task with specific dependence requirements which must be satisfied prior to initiation. A dataflow machine provides the mechanism to detect efficiently the satisfaction of these requirements and to process all such enabled tasks. Given the means to express a computation as a graph of such interdependent tasks, one has a natural means for executing, at any instant of time, all and only those instructions for which the synchronization constraints have been satisfied. To the extent that the number of such candidate instructions exceeds by some suitable amount the instantaneous capacity of the machine to process them, latencies inherent in the physical machine can be hidden, or masked… Because the processors and memories in the collection occupy physical space, there will necessarily be limitations on the time to communicate between them. The organization of a parallel computer, therefore, gives rise to communication latency… by definition, general-purpose parallel computers imply the cooperation of processors on the solution of a single problem. Therefore, problems must be logically decomposed into communicating fragments, implying the need for some sort of time-coordination, or synchronization… Given a physical parallel computer, any computation running on it is constrained to occupy non-zero volume, and any communication within that volume is subject to speed-of-light limits… In a von Neumann machine, instruction execution is implicitly synchronized by the dispatching circuitry, using a program counter for sequential instruction initiation. In a parallel computer, a qualitatively different kind of synchronization is required between tasks due to their assumed mutual asynchrony… Some load-store architectures [make] the compiler responsible for scheduling instructions, such that the result is guaranteed to be available. The compiler can perform hazard resolution only if the time for each operation [e.g. add, load, etc.] is known… Because the instruction execution times are an intimate part of the object code, any change to the machine’s structure (scaling, redesign) will at the very least require changes to the compiler and regeneration of the code… the cost of any synchronization mechanism will almost certainly dictate the granularity of the tasks lest the machine spend all of its time synchronizing. 
Once the granularity is determined, so is the exploitable parallelism… caches in a multiprocessor setting introduce a serious synchronization problem called cache coherence… The machine’s performance will ultimately be limited by the rate at which directories can process [the] coherence-maintaining traffic from their peers… The von Neumann model, by virtue of its simplicity, offers some tremendous advantages… given the static structure of a compiled program and nothing more than the value of the program counter during execution, a tremendous amount of information can be deduced, e.g. the satisfaction or lack thereof of data dependences, the termination of predecessor instructions, the non-initiation of successor instructions, and so on… a sequential thread of computation occupying a pipeline has, by definition, exclusive access to the entire state of the underlying machinery. This implies that the cost of communicating data values between instructions can be made extraordinarily low… Advocates of non-von Neumann architectures… have argued that the notion of sequential instruction execution is the antithesis of parallel processing. This criticism is actually slightly off the mark. Rather, a von Neumann machine in a multiprocessor configuration does poorly because it fails to provide efficient synchronization support at a level low enough to permit its liberal and free use… [In most] von Neumann machines, synchronization mechanisms are inherently larger grain… Therefore, the cost of each event is quite high. Such mechanisms are unsuitable for controlling latency cost. Moreover, since task suspension and resumption typically involve expensive context switching, exploitation of parallelism by decomposing a program into many small, communicating tasks may not actually realize a speed-up… Dataflow graphs, as opposed to conventional machine languages, specify only a partial order on the execution of instructions and thus provide opportunities for parallel and pipelined execution at the level of individual instructions… Rather than using a program counter to determine the next instruction to be executed and then fetching operands accordingly, a dataflow machine provides a low-level synchronization mechanism in the form of a Waiting-Matching store which dispatches only those instructions for which data are already available. This mechanism relies on tagging each datum with the address of the instruction to which it belongs and the context in which the instruction is to be executed. One can think of the instruction address as replacing the program counter, and the context identifier as replacing the traditional frame base register. It is the machine’s job to match up data with identical tags and then to execute the denoted instruction. In so doing, a new datum will be produced, with a new tag indicating the successor instruction. Thus, each instruction represents a synchronization operation… the processor pipeline is non-blocking: given that the operands for an instruction are available, the corresponding instruction can be executed without further synchronization… By using an encoded dataflow graph for program representation, machine instructions become self-sequencing. One strength of the TTDA is that each datum carries its own context identifying information. 
By this mechanism, program parallelism can be easily traded for latency because there is no additional cost above and beyond this basic mechanism for switching contexts on a per-instruction basis… The notion of a non-blocking, context-interleaving pipeline is a two-edged sword. It provides the ultimate flexibility in context switching, but implies that the notion of locality must be reconsidered… given a full dataflow pipeline of depth n, the working set size for instruction and operand caches must be on the order of the sum of the n working set sizes for the threads of computation which coexist in the pipe. This also implies that the time to execute the instructions in a graph’s critical path is n times the critical path length… Starting with the observation that the costs associated with dataflow instruction sequencing in many instances are excessive, others have suggested that dataflow ideas should be used only at the inter-procedural level thereby avoiding dataflow inefficiencies while seemingly retaining certain advantages. This view is almost correct, but ignores the importance of the fundamental issues. Restricting architectures to this ‘macro-dataflow’ concept would amount to giving up what is possibly a dataflow machine’s biggest feature — the ability to context switch efficiently at a low level to cover memory latency… In general, on a machine capable of supporting multiple simultaneous threads of computation, executing programs expressed as a total ordering of instructions will incur more latency cost than will executing a logically equivalent partial ordering of the same instructions… the machine language of any scalable, general-purpose parallel computer must be able to express partial ordering… The [dataflow/von Neumann hybrid] architecture… is characterized by its machine language which can be viewed as a superset of both von Neumann and dataflow machine languages… The difficulty in pipelining a processor arises when there are inter-dependencies between instructions in the pipe… Instructions are to be dispatched so as to keep the pipeline full of useful work. Non-useful work includes execution of NOPs (i.e. pipeline bubbles) and instructions which suspend… The pipeline is synchronous, with registers serving as inter-stage interfaces… At every pipe beat, each stage stores its current outputs into the appropriate interface registers… A special case of source addressing occurs when the destination address of the immediately preceding instruction is the same as one of the source addresses in the current instruction (the case of a sequential dependence). In this case, synchronization testing and operand fetching may be ignored, and the previous instruction’s result may be used directly as an input operand. This commonly-used technique is called pipeline bypassing, and obviates pipe bubbles which would otherwise occur while fetching a recently-computed result which has not yet been stored away… Dataflow program graphs are a powerful and elegant intermediate representation for programs. They are largely architecture-independent, and have a precise semantics… In execution, data values are represented as logical entities called tokens which appear to flow across the arcs from one instruction’s output to another’s input. Instructions may be executed when and only when their firing rule is satisfied. 
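The bypassing special case is simple enough to sketch. This is my schematic of the decision, not the hybrid machine’s actual operand encoding: if the previous instruction’s destination is one of my sources, skip the synchronization test and the operand fetch and forward the just-computed result directly.

```python
# Schematic of pipeline bypassing: when a sequential dependence exists, the
# previous result is forwarded instead of being fetched back out of storage.
# Register names and the dict-based register file are invented for illustration.

def read_operand(src, prev_dest, prev_result, register_file):
    if src == prev_dest:          # sequential dependence: bypass, no bubble
        return prev_result
    return register_file[src]     # otherwise a normal (slower) operand fetch

regs = {"r1": 10, "r2": 20}
# Suppose the previous instruction was "r3 <- r1 + r2" and produced 30:
prev_dest, prev_result = "r3", 30
print(read_operand("r3", prev_dest, prev_result, regs))  # 30, forwarded
print(read_operand("r1", prev_dest, prev_result, regs))  # 10, fetched from the register file
```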
The firing rule is a predicate on the presence or absence of tokens at the instruction’s inputs… Given a set of costs for instruction execution, context switching, synchronization, and operand access, partitions can be compared on the basis of how well they [keep the pipeline full]… Depth-first… partitionings tend to be the best at minimizing critical path time and rely heavily on pipeline bypassing… The… partial order represented by a dataflow graph captures all and only the necessary inter-instruction dependences… From the standpoint of graph semantics, it is necessary that all information to be communicated from one instruction to the next be encoded on the token. That is, instruction execution must be independent of any pervasive ‘state.’ Operationally, because the stages in dataflow instruction execution (synchronization, instruction fetch, computation, token formation) are essentially sequential, the time to execute [a sequential] thread is proportional to the product of the pipeline depth and the thread length… the architecture is capable of iteration unfolding and inter-procedural parallelism wherein the synchronization is not at all simple and straightforward, but rather requires a fine-grained approach… Means for expressing both implicit and explicit synchronization must be provided. Implicit, i.e. program counter-based, synchronization provides the means for passing state between instructions within an unbroken thread. Explicit synchronization is necessary at the programming level in the exploitation of parallelism and at the machine level in the masking of latency… the cost of synchronization is low enough to allow its free and unencumbered use in decomposing programs for parallel execution.” “Context switching can and does occur at each pipe beat; any instruction n′ which is enabled as a result of the completion of instruction n may not enter the pipeline for a number of cycles equal to the pipeline depth. The intervening cycles must be filled by instructions from another thread of execution, possibly but not necessarily from a related context. Not surprisingly, this model is highly parallel, but the parallelism comes at the expense of some lost locality.” “In the dataflow model, the operands trigger the execution of the operation to be performed on them. In other words, dataflow architectures use the availability of data to fetch instructions rather than the availability of instructions to fetch data. Unlike the von Neumann model, the dataflow model is neither based on memory structures that require inherent state transitions, nor does it depend on history sensitivity and program counter to execute a program sequentially. These properties allow the use of the model to represent maximum concurrency to the finest granularity and to facilitate dependency analysis among computations. In this sense, the dataflow model has the potential to be an elegant execution paradigm with the ability to exploit the inherent parallelism available in applications… One significant drawback of the dataflow model is its inability to effectively manage data structures, memory organizations and traditional programming languages… The dataflow computing model… offers many opportunities for parallel processing, because it has neither a program counter nor a global updatable memory, i.e., the two characteristics of the von-Neumann model that inhibit parallelism… The dataflow model is self-scheduled since instruction sequencing is constrained only by data dependencies.
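To make the ‘two-edged sword’ concrete, here is a back-of-the-envelope model with invented numbers: a dependent chain in a non-blocking pipeline of depth n can only issue once every n cycles, so a lone thread crawls along at 1/n utilization, and it takes roughly n interleaved contexts (with n times the working set) to fill the bubbles.

```python
# Back-of-the-envelope model of the non-blocking, context-interleaving pipeline:
# a dependent chain issues one instruction every `depth` cycles, so a lone thread
# runs at 1/depth utilization, and roughly `depth` interleaved contexts are needed
# to fill the bubbles. Purely illustrative, not a model of any particular machine.

def pipeline_stats(num_threads, thread_len, depth):
    work = num_threads * thread_len
    wall_clock = max(thread_len * depth, work)   # each thread issues once per `depth` cycles
    return wall_clock, work / wall_clock         # (cycles, pipeline utilization)

print(pipeline_stats(num_threads=1, thread_len=100, depth=8))  # (800, 0.125): mostly bubbles
print(pipeline_stats(num_threads=8, thread_len=100, depth=8))  # (800, 1.0): bubbles filled, locality lost
```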
Moreover, the model is asynchronous because program execution is driven only by the availability of the operands at the inputs to the functional units. Specifically, the firing rule states that an instruction is enabled as soon as its corresponding operands are present, and executed when hardware resources are available. If several instructions become fireable at the same time, they can be executed in parallel. This simple principle provides the potential for massive parallel execution at the instruction level. Thus, dataflow architectures implicitly manage complex tasks such as processor load balancing, synchronization, and accesses to common resources… Unlike control flow programs, binaries compiled for a dataflow machine explicitly contain the data dependency information… The dataflow model has the potential to be an elegant execution paradigm with the ability to exploit inherent parallelism available in applications. However, implementations of the model have failed to deliver the promised performance due to inherent inefficiencies and limitations. One reason for this is that static dataflow is unable to effectively uncover large amounts of parallelism in typical programs. On the other hand, dynamic dataflow architectures are limited by prohibitive costs linked to associative tag lookups, in terms of latency, silicon area, and power consumption… Dataflow languages are required in order to produce large dataflow graphs that expose as much parallelism as possible to the underlying architecture. However, these languages have no notion of explicit computation state, which limits the ability to manage data structures (e.g. arrays). To overcome these limitations, some dataflow systems include specialized storage mechanisms, such as the I-structure, which preserve the single assignment property. Nevertheless, these storage structures are far from generic and their dynamic management complicates the design… Hybrid models combine single assignment semantics, inherent to dataflow, with consistent memory models that support external side-effects in the form of load/store operations. This relieves one of the biggest (if not the biggest) restrictions of pure dataflow programming: the inability to support a shared state, and specifically shared data structures… In a unified-hybrid architecture, a program must be executed using both dataflow and control flow scheduling since both models are intimately bound in the architecture… In WaveScalar, pairs of PEs are coupled into pods sharing ALU results via a common bypass network.” “Across the diverse range of multiprocessor architectures, from small collections of supercomputers to thousands of synchronous single-bit processors, all seem to share one undesirable property: they are hard to use… The application writer must consider detailed and hard-to-measure interactions of the program with the machine; the resulting codes are difficult to debug, are of questionable reliability, and are far from portable. Even after arduous work in converting an application for parallel execution, the actual improvement in performance is frequently disappointing. Perhaps we do not yet understand how to express parallelism in a way that is machine-independent… Perhaps we are building the wrong machines… Machine architects adapt sequential processors to the parallel setting by providing an interprocessor communication medium and an ad hoc set of synchronization mechanisms, like locks or fetch-and-add.
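Since the I-structure keeps coming up, a minimal sketch of one slot, under my own simplified interface (a read carries a continuation to call when the value shows up): reads of an empty slot are deferred rather than returning garbage, the producer’s write fills the slot and wakes every deferred reader, and a second write is a single-assignment violation.

```python
# Minimal sketch of an I-structure slot. The callback-based read interface is
# my simplification, not any particular machine's memory controller.

class IStructureSlot:
    def __init__(self):
        self.full = False
        self.value = None
        self.deferred = []            # continuations waiting for the value

    def read(self, continuation):
        if self.full:
            continuation(self.value)            # value present: satisfy immediately
        else:
            self.deferred.append(continuation)  # defer until the producer writes

    def write(self, value):
        if self.full:
            raise RuntimeError("I-structure violation: slot written twice")
        self.full, self.value = True, value
        for k in self.deferred:       # wake up every deferred reader
            k(value)
        self.deferred.clear()

slot = IStructureSlot()
slot.read(lambda v: print("consumer got", v))  # arrives before the producer; deferred
slot.write(42)                                 # producer fills the slot; reader wakes
```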
Then the compiler, or worse yet the programmer, is expected to partition the application into tasks that can run in parallel using the supplied synchronization primitives to ensure deterministic behavior… There needs to be a fundamental change in processor architecture before we can expect significant progress to be made in the use of multiprocessors. The nature of this change is the deep integration of synchronization into the processor instruction set. The instruction set must constitute a parallel machine language, in which parallel activities are coordinated as efficiently as instructions are scheduled… The instruction set of a dataflow machine forms such a parallel machine language. An instruction is executed only when all of its required operands are available. Thus, low-level synchronization is performed for every instruction and at the same rate as instructions are issued. It is easy for a compiler to use this fine-grain synchronization to produce code which is highly parallel but deterministic for any runtime mapping of instructions onto processors. While there is consensus that dataflow does expose the maximum amount of parallelism, there is considerable debate surrounding the efficiency of the execution mechanism… the dataflow program executes essentially the same number of instructions independent of the number of processors, whereas the parallelization of a program for a conventional multiprocessor invariably incurs non-trivial execution overhead (e.g. synchronizing through barriers, task creation) and typically yields far less parallel activity… In the von Neumann model, the operands for an instruction are assumed to be available when the program counter points to the instruction. In a dataflow machine an instruction is asynchronously scheduled only when its operands have been produced by some other parallel activities, so a dataflow machine must have an operand matching mechanism for detecting when an instruction has its required operands… The more finely a program is divided into tasks, the greater the opportunity for parallel execution. However, there is a commensurate increase in the frequency of inter-task communication and associated synchronization. So exposing more parallelism, by way of identifying more tasks, does not obviously make the program run faster. In fact, we claim that it is largely unprofitable to expose most of the latent parallelism in programs unless synchronization and task management are as efficient as the scheduling of primitive operations like add and multiply… For a given machine, there is a fundamental tradeoff between the amount of parallelism that is profitable to expose and the overhead of synchronization… We think there is something fundamentally wrong — why should the task overhead be so extraordinarily large? Is it an inherent aspect of parallel computation or an artifact of the processor architecture used to construct the multiprocessor? We believe the latter… If the task overhead were essentially zero, more parallelism would be exposed and exploited and compiling would be far easier… By ‘low overhead’ we mean on the order of one to ten instructions, basically orders of magnitude better than [existing] multiprocessors… In this ideal world, task overhead is not a first-order issue.
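The granularity argument is easy to make quantitative with a deliberately crude model (all numbers invented by me): if every task of g instructions pays a fixed overhead of o instructions for creation and synchronization, the work inflates by a factor of 1 + o/g no matter how many processors you throw at it.

```python
# Crude model of the granularity argument: fixed per-task overhead o on tasks of
# size g inflates the total work by (1 + o/g). Numbers are made up to show the shape.

def overhead_factor(task_size, task_overhead):
    return 1 + task_overhead / task_size

for o in (1000, 100, 10, 1):                 # per-task overhead in instructions
    f = overhead_factor(task_size=50, task_overhead=o)
    print(f"overhead {o:4d} instructions -> {f:5.2f}x work per 50-instruction task")
# With ~1000-instruction overheads, 50-instruction tasks are hopeless (21x);
# with the one-to-ten-instruction overheads argued for above, they are nearly free.
```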
Instead, the objective is to expose the maximum amount of parallelism by subdividing the computation as finely as possible… A machine executes a parallel machine language when it no longer interprets the task state transitions, but provides direct hardware support to coordinate parallel activities with [the] same efficiency with which it executes instructions… For dataflow machines to be ultimately successful, single processor performance will need to be on a par with well-engineered von Neumann machines… A dataflow graph imposes only a partial order of execution; all execution orders which obey the data dependences in the graph yield the same result. This is the source of parallelism in dataflow graphs — a valid execution is to process in parallel all instructions which have their required operands. In contrast, a sequential processor imposes a total order on all operations within a task; two operations can occur in parallel only when it can be proved that doing so does not change the sequential semantics… The implementation of the token matching function is the bane of practical tagged-token dataflow processors. Tagged-token dataflow execution requires a token to be checked against all other waiting tokens for a match. If a match is not found then the token must be remembered by the waiting token storage. If a match is found then the matching token must be extracted from the waiting token storage. The matching function and associated storage is most often realized with a large hashtable and sophisticated insertion and extraction algorithms. The resulting hardware is highly complex, slow and is difficult, if not impossible, to fully pipeline… The programming cost for the effective utilization of a multiprocessor can be enormous. It may often be the case that the program cannot be made to run faster no matter how much programming effort is invested… The promise of dataflow machines is that they will be far easier to use. The hardware-enforced fine grain synchronization simultaneously makes compiling easier while exposing the maximum amount of parallelism… The ultimate goal seems clear: even a single-processor dataflow machine should outperform an equivalent-cost conventional uniprocessor.” “Computing science seems to require a new ‘general-purpose’ parallel program organization for computers which allows them to support efficiently both functional and conventional programming languages… The [von Neumann organization] is based on control tokens which, like data tokens, are passed to successor instructions signaling them to execute. A control token may be viewed as a data token carrying no data… When a set of control tokens activates an instruction, the assumption is that all the required input data have been computed and that the contents of memory cells used for results may be overwritten.
In addition it is assumed that the instruction is safe to execute in parallel with any other instructions that are already executing… the single control token, passed from instruction to instruction, is represented by the contents of the computer’s program counter, which is automatically updated during each instruction execution… The control of execution of a particular instruction is caused by the availability of a specific set of tokens, which may be data, control, or a mixture.” “[CGRAs] possess near-ASIC energy efficiency and performance with post-fabrication software-like programmability… Their hardware can be defined by software at runtime, but their processing elements are not as powerful as those of GPPs, and their interconnections are not as complex as those of FPGAs. This architecture is just flexible enough for specific domains… the mapping of a CGRA is actually equivalent to identifying the spatial and temporal coordinates of every node and arc in the control-dataflow graph… FPGAs are a fine-grained general-purpose flexible architecture and rarely support temporal computation because their reconfiguration process is much slower than that of CGRAs [nanoseconds versus milliseconds to seconds] and thus the reconfiguration is challenging to pipeline with kernel-level computation… [CGRA] applications are strictly limited to purely computation- or data-intensive kernels, and their performance degrades severely in cases involving any irregular codes or complex control flow… inter-thread communication is typically implemented on [the on-chip network] instead of in shared memory… The KPN model has been applied to many dataflow CGRAs, such as [TIA] and WaveScalar, which use FIFOs and CAMs as communicating channels among asynchronous PEs… To maximize performance, configurations can be scheduled according to runtime system states (e.g. predicate and data token) and viable resources… CGRAs are powerful computing substrates completely different from conventional sequential processors. Their architectures are far more complicated than CPUs because a two-dimensional array of hardware resources must be scheduled and cooperate. They are even more difficult to program than VLIWs… because explicit dataflow via interconnections is more difficult to manage than implicit dataflow via shared memory… the generated configuration binaries are sent to system schedulers, which determine runtime task placement and eviction according to the resource utilization and states… many CGRAs have arrays of spatially distributed PEs, which could launch many scattered and redundant memory address generation and memory access events. However, regular memory access with continuous addresses or in vector style is more efficient… most CGRAs can specialize the computing facilities for various applications, but few of them can customize the memory access pattern for different applications.
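‘Identifying the spatial and temporal coordinates of every node and arc’ is concrete enough to sketch as a checker. The grid, the one-hop-per-cycle routing assumption, and the toy kernel are all mine; real CGRA mappers also worry about modulo scheduling, routing congestion, and register allocation.

```python
# Toy checker for a CGRA mapping: every dataflow node gets a spatial coordinate
# (which PE) and a temporal coordinate (which cycle). Constraints below (one op
# per PE per cycle, one network hop per cycle of slack) are simplifying assumptions.

def valid_mapping(edges, placement, hop_latency=1):
    """edges: list of (producer, consumer); placement: node -> ((x, y), cycle)."""
    # a PE can execute at most one node in a given cycle
    slots = {}
    for node, (pe, cycle) in placement.items():
        if slots.setdefault((pe, cycle), node) != node:
            return False
    # a value must have time to be computed and routed before it is consumed
    for prod, cons in edges:
        (px, py), pc = placement[prod]
        (cx, cy), cc = placement[cons]
        hops = abs(px - cx) + abs(py - cy)        # Manhattan distance on the PE grid
        if cc < pc + 1 + hops * hop_latency:
            return False
    return True

edges = [("load", "mul"), ("coef", "mul"), ("mul", "add")]
placement = {"load": ((0, 0), 0), "coef": ((1, 0), 0),
             "mul": ((1, 0), 2), "add": ((1, 1), 4)}
print(valid_mapping(edges, placement))   # True for this hand-placed toy kernel
```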
These CGRAs typically adopt a conventional memory system, which cannot explore [MLP] efficiently at runtime or upon compilation… a mixed-grained CGRA targeted at cryptography could support bit-level parallelism… If [data locality] can be explicitly expressed in programming and then implemented with spatially distributed efficient CGRA interconnections, then the burden on the external memory bandwidth will be lowered… CGRAs provide distributed interconnect, which is much more energy-efficient than the multi-port register files in CPUs, GPUs, DSPs, and so on, resulting in a much smaller power overhead… CGRAs can… provide programmable interfaces to consolidate memory accesses and avoid redundant address generation… CGRAs rely on the dataflow technique to enable fine-grained TLP, avoiding complicated multi-issue and multithreading logic overhead. They also provide much more energy-efficient inter-task communication than the usual shared memory… Many communication channels exist among PEs, such as shared register files, wire switches and FIFOs… The mechanism of internal synchronization is typically based on the handshaking and dataflow manner… exposing communication or memory accesses explicitly in the program or ISA provides more opportunities for compilers to explore MLP.” “Processors with slow configuration are only useful for very regular dataflow applications with a high computation-to-configuration ratio. On the other hand, processors which handle configurations fast and flexibly are also beneficial for less regular applications… concurrency of configuration and computation hides configuration latency… it is even possible to synchronize the dataflow with the reconfiguration.” “Whereas FPGAs feature bitwise logic in the form of LUTs and switches, CGRAs feature more energy-efficient and area-conscious word-wide PEs, [register files], and their interconnections… Compared to FPGAs, CGRAs have lower delay characteristics and less power consumption. They have much shorter reconfiguration time (cycle level), and thus they are much more flexible, like a programmable processor. On the other hand, gate-level reconfigurability is sacrificed, and thus they are less efficient for bit-level operations.” “The vast majority of computations are multi-bit operations.
[CGRAs] take advantage of this by sharing configuration data across word-wide operations… Configurations are generated using FPGA-like placement and routing algorithms… It is often useful to include some LUTs and single bit communication channels to form a basic FPGA within the architecture for control and bitwise logic.” “…one major inefficiency in FPGAs is that the majority of logic and routing resources are configured at the bit granularity, even though many applications naturally represent data in 8-, 16-, or 32-bit words… Routing 1-bit signals in a 32-bit interconnect wastes 31 wires in each channel… the optimal interconnect will be different for control and datapath signals… The obvious alternative is an interconnect that includes both 32-bit and 1-bit resources, optimized to the demands of each signal type… there are disjoint single- and multi-bit interconnects, where each port on a device is connected to the appropriate interconnect… separate control and data crossbars will be more efficient than one large integrated crossbar… while CGRAs are intended for word-wide applications, the interconnect can be further optimized by providing dedicated resources for single-bit control signals… the addition of a single-bit, fully static, control-path is an effective method for offloading control and predicate signals, thus further increasing the efficiency of the interconnect.” “[CGRAs] represent an architectural midpoint between the flexibility of an FPGA and the efficiency of an ASIC… A flexible-shape execution region increases the overall resource utilization by mapping multiple tasks with different resource requirements. Dynamic partial reconfiguration enables a CGRA to update the hardware configuration as the scheduler makes decisions rapidly… The flexibility to form different sizes and shapes of execution regions gives the scheduler a simplified and quantized view of hardware resources while providing enough information to allocate resources to each task to maximize resource utilization in multi-tasked workloads.” “There remain many important workloads whose best-known implementation involves asynchronous actors performing different tasks, while frequently communicating with neighboring actors. The computation and communication characteristics of these workloads cause them to map efficiently onto spatially-programmed architectures such as [FPGAs]… While FPGAs are very general in their ability to map the compute, control and communication structure of a workload, their [LUT-based] datapaths are deficient in compute density compared to a traditional microprocessor — much less a SIMD engine… Tiled arrays of coarse-grained ALU-style datapaths are known to achieve higher compute density than FPGAs. A number of… spatial architectures [include] a network of ALU-based processing elements onto which operations are scheduled in systolic or dataflow order, with limited or no autonomous control at the PE level. Other approaches incorporate autonomous control at each PE using a program counter. Unfortunately… PC sequencing of ALU operations introduces several inefficiencies when attempting to capture intra- and inter-ALU control patterns of a frequently-communicating spatially-programmed fabric… Classic dataflow architectures… trigger instructions when tokens associated with input data are ready. Multiple pipeline stages are devoted to marshaling tokens, distributing tokens, and scoreboarding which instructions are ready.
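The wasted-wires arithmetic behind the split interconnect argument, spelled out with made-up signal counts (only the ratio matters; the 32-bit word width comes from the quote):

```python
# Compare routing a mix of 1-bit control signals and 32-bit data signals over a
# unified word-wide interconnect versus split 1-bit and 32-bit interconnects.
# Signal counts are invented; the point is the waste ratio.

def wires_used(data_signals, control_signals, unified=True, word=32):
    if unified:
        return (data_signals + control_signals) * word   # every signal occupies a word-wide channel
    return data_signals * word + control_signals * 1     # control rides a dedicated 1-bit network

data, ctrl = 12, 20
print(wires_used(data, ctrl, unified=True))    # 1024 wires of channel capacity
print(wires_used(data, ctrl, unified=False))   # 404 wires for the same traffic
```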
A ‘Wait-Match’ pipeline stage must dynamically pair incoming tokens of dual-input instructions. In contrast, TI expresses dependencies via single-bit predicate registers that are explicitly managed by the program, improving scheduler scalability and removing the token-related pipeline stages.” “Triggered instructions remove the program counter completely, instead allowing the processing element to concisely transition between states of one or more finite-state machines without executing instructions in the datapath to determine the next state… Spatial programming is a paradigm whereby an algorithm’s dataflow graph is broken into regions, which are connected by producer-consumer relationships. Input data is then streamed through this pipelined graph… In MIMD, [a core] is responsible for executing a full loop iteration containing each instruction A, B, C, [and so on] of the sequential body, keeping the intermediate data… locally in its register file. Cross-core communication is rare and is ideally protected with synchronization primitives such as barriers. In contrast, in a spatial approach a single small core is responsible for executing instruction A for all iterations, with another core executing instruction B. Intermediate data is passed to the next datapath rather than being kept locally. Therefore no single PE will execute an entire loop body — essentially the program has been transposed between time and space… in practice there can be benefits to keeping a small control sequence or [FSM] local. Ideally, the number of operations in each pipeline stage is kept small, as performance is usually determined by the rate-limiting step… A traditional processor would execute such programs serially over time… A MIMD multicore can improve this by mapping different stages onto different cores, but the small number of cores available relative to the large number of stages in the dataflow graph means that each core must multiplex between several stages, so the rate-limiting step generally remains large. Additionally, cross-thread communication can generally only occur via shared virtual memory which uses a coherence protocol to actually interact with the on-chip network and perform the data transfers. In contrast, a typical spatial-programming architecture is a fabric of hundreds of small processing elements connected directly via an OCN which is exposed directly to the ISA. Given enough PEs, an algorithm may be taken to the extreme of mapping the entire dataflow graph into the spatial fabric, resulting in a very fine-grained pipeline… Within the domain of array-of-ALU approaches is a class of architectures that do not feature any autonomous control mechanism inside each ALU. These architectures are either purely systolic, statically map only one operation per ALU, or schedule operations onto the ALUs in strict dataflow order. These architectures rely on being able to transform control-flow graphs into predicated dataflow graphs. Such approaches are effective at mapping the control structures of a subset of problems, but do not approach the flexibility or generality of architectures with internal autonomous control at each PE. Another class of proposals calls for general autonomously-controlled PEs using variants of the existing PC-based control model. 
The PC-based control model paired with [shared virtual memory] has historically been the best choice for MIMD CPUs that run arbitrary and irregular programs… these existing paradigms introduce unacceptable inefficiencies in the context of spatial programming… The communication flows are completely statically determined by the configuration — no dynamic packet routing is required… The loads and stores that a traditional CPU must use to overcome its relatively small register file are replaced by direct PE-to-PE communication. This reduction in memory operations is critical in understanding the benefits of spatial programming… Direct communication uses roughly [95% less] power than communication through an L1 cache, as the overheads of tag matching, load-store queue search, address translation, and large data array read are removed. Cache coherence overheads, including network traffic and latency, are likewise removed… Furthermore, in a spatially-programmed fabric having hundreds of PEs, communicating using shared memory would create unacceptable bandwidth bottlenecks. It would also be wasteful — communicating with your neighbor should not have to go through a centralized location… A large degree of the inefficiency [of PC-based schemes] stems from the issue of efficiently composing boolean control flow decisions. In order to overcome this, we draw inspiration from the historical computing paradigm of guarded actions, a field that has a rich technical heritage including [Edsger] Dijkstra’s language of guarded commands… Computation in a traditional guarded action system is described using rules composed of actions (state transitions) and guards (boolean expressions that describe when a certain action is legal to apply). A scheduler is responsible for evaluating the guards of the actions in the system and posting ready actions for execution, taking into account both inter-action parallelism and available execution resources… this paradigm naturally separates the representation of data transformation (via actions) from the representation of control flow (via guards). Additionally, the inherent side-effect-free nature of the guards means that they are a good candidate for parallel evaluation by a hardware scheduler… A triggered instruction architecture applies this concept directly to controlling the scheduling of operations on a PE’s datapath at an instruction-level granularity. In the historical guarded action programming paradigm, arbitrary boolean expressions are allowed in the guard, and arbitrary data transformations can be described in the action. To adapt this concept into an implementable ISA, both must be bounded in complexity. Furthermore, the scheduler must have the potential for efficient implementation in hardware. To this end, we define a limited set of operations and state updates that can be performed by the datapath (instructions) and a limited language of boolean expressions (triggers) built from a variety of possible queries on a PE’s architectural state. The architectural state of our proposed TIA PE is composed of [sets of data registers, predicate registers, read-only input-channel head elements, and write-only output-channel tail elements]. Each channel has three components — data, a tag and a status predicate that reflects whether an input channel is empty or an output channel is full. Tags do not have any special semantic meaning — the programmer can use them in a variety of ways.
A trigger is a programmer-specified boolean expression formed from the logical conjunction of a set of queries on the PE’s architectural state. Triggers are evaluated by a hardware scheduler… The set of allowable trigger query functions are carefully chosen to maintain scheduler efficiency while allowing for a large degree of generality in the useful expressions… A trigger can specify a requirement for one or more predicate registers to be either true or false… The scheduler implicitly adds the empty status bits for each operand input channel to the trigger for an instruction. Similarly, a not-full check is implicitly added to each output channel an instruction attempts to write. The programmer does not have to worry about these conditions, but must understand while writing code that the hardware will check them. This facilitates convenient, fine-grained, producer/consumer interaction… A trigger may specify a value that an input channel’s tag must match… An instruction represents a set of data and predicate computations on operands drawn from the architectural state. Instructions selected by the scheduler are executed on the PE’s datapath… An instruction may read a number of operands, each of which can be data at the head of an input channel, a data register, or the vector of predicate registers. An instruction may perform a data computation using one of the standard functions provided by the datapath’s ALU. It may also generate one or more predicate values that are either constants (true/false) or derived from the ALU result via a limited set of datapath-supported functions, e.g., reduction AND, OR and XOR operations, bit extractions, ALU flags such as overflow, [and so on]. An instruction may write the data result and/or the derived predicate result into a set of destinations within the architectural state of the PE. Data results can be written into the tail of an output channel, a data register, or the vector of predicate registers. Predicate results can be written into one or more predicate registers… The PE is pre-configured with a static set of instructions. The triggers for these instructions are then continuously evaluated by a dedicated hardware scheduler that dispatches legal instructions to the datapath for execution. At any given scheduling step, the trigger for zero, one, or more instructions can evaluate to true. The guarded action model — and by extension our triggered instruction model — allows all such instructions to fire in parallel subject to datapath resource constraints and conflicts… The scheduler uses standard combinatorial logic to evaluate the programmer-specified query functions for each trigger based on values in the architectural state elements. This yields a set of instructions that are eligible for execution; the scheduler selects one out of the available set of ready-to-fire instructions using a priority encoder. As with any architecture, a triggered-instruction architecture is subject to a number of parameterization options and their associated [cost-benefit] tradeoffs. Architectural parameters include the number of instances of each class of architectural state element (data registers, predicate registers, etc.), the set of data and predicate functions supported by the datapath, the scope and flexibility of the trigger functions, and the number of input operands and output destinations.
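Because the trigger, predicate, and channel machinery is the heart of the thing, here is a deliberately small software model of a single triggered PE. The field names, channel depths, and the two-instruction tag-steering program are my own inventions; this sketches the control paradigm, not Intel’s actual ISA or scheduler circuit.

```python
# Toy triggered-instruction PE: predicate registers, tagged channels, and a
# scheduler that fires the first instruction whose trigger holds. Everything
# here is a sketch of the paradigm, not a model of the real TIA hardware.
from collections import deque

class Channel:
    def __init__(self, depth=2):
        self.q, self.depth = deque(), depth
    def empty(self): return len(self.q) == 0
    def full(self):  return len(self.q) >= self.depth
    def head(self):  return self.q[0]               # (tag, data)
    def push(self, tag, data): self.q.append((tag, data))
    def pop(self):   return self.q.popleft()

class Instruction:
    def __init__(self, when_preds, reads, tag_match, op, writes, set_preds):
        self.when_preds = when_preds  # {pred_index: required_value}
        self.reads      = reads       # input channels to dequeue
        self.tag_match  = tag_match   # {channel_name: required_tag}
        self.op         = op          # datapath function over operand values
        self.writes     = writes      # output channels to enqueue
        self.set_preds  = set_preds   # {pred_index: new_value} applied on fire

class TriggeredPE:
    def __init__(self, program, inputs, outputs, num_preds=4):
        self.program, self.inputs, self.outputs = program, inputs, outputs
        self.preds = [False] * num_preds

    def triggered(self, inst):
        if any(self.preds[i] != v for i, v in inst.when_preds.items()):
            return False
        if any(self.inputs[c].empty() for c in inst.reads):      # implicit not-empty check
            return False
        if any(self.outputs[c].full() for c in inst.writes):     # implicit not-full check
            return False
        return all(self.inputs[c].head()[0] == t for c, t in inst.tag_match.items())

    def step(self):
        for inst in self.program:      # priority encoder: first ready instruction wins
            if self.triggered(inst):
                operands = [self.inputs[c].pop()[1] for c in inst.reads]
                result = inst.op(*operands) if inst.op else None
                for c in inst.writes:
                    self.outputs[c].push(0, result)
                for i, v in inst.set_preds.items():
                    self.preds[i] = v
                return inst
        return None                    # nothing triggered; no datapath activity

# Tiny program: route tag-0 tokens to 'even', tag-1 tokens to 'odd'.
ins, even, odd = Channel(4), Channel(4), Channel(4)
pe = TriggeredPE(
    program=[
        Instruction({}, ["in"], {"in": 0}, lambda x: x, ["even"], {}),
        Instruction({}, ["in"], {"in": 1}, lambda x: x, ["odd"],  {}),
    ],
    inputs={"in": ins}, outputs={"even": even, "odd": odd})

for tag, value in [(0, 10), (1, 11), (0, 12)]:
    ins.push(tag, value)
while pe.step():
    pass
print(list(even.q), list(odd.q))   # [(0, 10), (0, 12)] [(0, 11)]
```

The point of the priority loop is that ‘which instruction fires next’ is a pure function of the predicate bits and the channel status bits, which is exactly the function the quoted AND-tree-plus-priority-encoder scheduler evaluates combinationally in hardware.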
The design space of microarchitectural alternatives includes scheduler implementation choices, scalar versus superscalar datapaths, pipelining strategies, [and so on]… A TIA PE does not have a program counter or any notion of a static sequence of instructions. Instead, there is a limited pool of triggered instructions that are constantly bidding for execution on the datapath. This fits very naturally into a spatial programming model where each PE is statically configured with a small pool of instructions instead of streaming in a sequence of instructions from an instruction cache… there are no branch or jump instructions in the triggered ISA — every instruction in the pool is eligible for execution if its trigger conditions are met. Thus, every triggered instruction can be viewed as a multi-way branch into a number of possible states in an FSM. With clever use of predicate registers, a TIA can be made to emulate the behavior of other control paradigms. For example, a sequential architecture can be emulated by setting up a vector of predicate registers to represent the current state in a sequence — essentially, a program counter. Predicate registers can also be used to emulate classic predication modes, branch delay slots and speculative execution. Triggered instructions is a superset of many traditional control paradigms. The cost of this generality is scheduler area and timing complexity, which imposes a restriction on the number of triggers (and thus, the number of instructions) that the hardware can monitor at all times. While this restriction would be crippling for a temporally programmed architecture, it is reasonable in a spatially-programmed framework because of the low number of instructions typically mapped to a pipeline stage in a spatial workload… The hardware scheduler is built from combinatorial logic — it simply is a tree of AND gates. This means that only the state equations that require re-evaluation will cause the corresponding wires in the scheduler logic to swing and consume dynamic power. In the absence of channel activity or internal state changes, the scheduler does not consume any dynamic power whatsoever. The same control equations would have been evaluated using a chain of branches in a PC-based architecture. Altogether, triggered instructions allow individual PEs to efficiently react to incoming messages, making an intelligent decision based on the local state of the PE. A TI scheduler uses a single unified scheme to monitor both the one-bit channel status registers and the local predicate registers in order to quickly and cheaply make a trigger decision… Triggered instructions capture the local control logic programmed onto a single PE, but the spatial fabric must also support some form of inter-PE coordination and data passing. One possibility is to design a global control architecture with a centralized ‘master’ directing non-autonomous PEs. However, access to shared centralized control state is likely to be expensive and limit scalability of the fabric. Distributed control, though scalable, often makes it challenging to map an algorithm into a distributed program without either introducing bugs and deadlocks, or using knowledge about properties of the underlying network fabric. 
Existing spatial fabrics have solved this using systolic or synchronous networking schemes that cause neighboring PEs to move exactly in lockstep, which can limit performance… In [the latency-insensitive] design paradigm, local modules (whether hardware or software) are unable to make assumptions about the timing characteristics involved with communicating with other modules in the system. Instead, coordination is piggybacked onto data communications. The arrival (or absence) of a message on an input communication channel is itself an implicit form of control synchronization. This works well in a spatial architecture because of the significant amount of inter-PE dataflow required to achieve fine-grained producer-consumer pipelines. These dual-purpose communication channels are called latency-insensitive channels… in the spatial scenario, the producer/consumer flows are statically determined when the program is loaded during the configuration step. Most flows are nearest-neighbor, and often can be mapped so that there is no contention… We can leverage these facts to create statically allocated virtual circuits through the on-chip network. We can then map the programming construct of latency-insensitive channels onto these circuits, which multiplex limited physical link wires between the flows mapped onto them on a hop-by-hop basis, while maintaining non-blocking and deadlock-freedom via reverse credit flows, which are similarly statically mapped and managed… the PEs can also be programmed to operate as network routers in a multihop traversal by adding a triggered instruction that dequeues the data from an input channel and enqueues it into an output channel… TI is never less efficient than a PC-based approach, i.e. it never requires more instructions… TI removes all control operations such as branches. In a classic PC-based setting, the accepted rule of thumb is that about [one in five] instructions is a branch. In this setting TI’s expected benefit would be around 20%… the fine-grained producer-consumer nature of spatially-programmed codes means that control makes up 44% of all operations, which increases the benefit of TI significantly… there are certainly scenarios where over-serialization results in no penalty because the data arrives in the order that matches the static sequence chosen by the compiler. If the compiler can precisely schedule cross-PE data delivery rates then it is possible that this deficiency will never be exposed. In practice, the numerous sources of variable dynamic latency (memory hierarchy, network contention, data-dependent divergence, etc.) mean that there is plenty of opportunity to take advantage of the ability to break over-serialization… Breaking over-serialization can be accomplished by finding independent operations. These can be found from two sources. The first source is local parallelism in the PE’s dataflow graph, in which case computation can start as the data arrives, i.e., classical dynamic instruction reordering. The second source arises when the spatial compiler chooses to place unrelated sections of the overall algorithm dataflow graph onto a single PE, statically partitioning the registers between them and statically interleaving operations, i.e. compiler-directed multithreading. On a PC-based architecture, the serialization restriction is a significant barrier to a compiler’s ability to statically partition one thread of control between unrelated sections of a single algorithm. 
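The latency-insensitive channel with reverse credit flow described a moment ago reduces to a few lines if you allow one big simplification, namely that the credit returns instantly, which a real multi-hop network would not grant:

```python
# Sketch of a latency-insensitive channel with reverse credit flow: the sender
# may only inject when it holds a credit (a free buffer slot at the receiver),
# and each dequeue returns a credit. Depth and API are arbitrary choices, and
# the credit return is modeled as instantaneous.
from collections import deque

class CreditChannel:
    def __init__(self, depth=2):
        self.buffer = deque()
        self.credits = depth          # one credit per receiver buffer slot

    def try_send(self, msg):
        if self.credits == 0:         # back-pressure: receiver buffer is full
            return False
        self.credits -= 1
        self.buffer.append(msg)       # models the (possibly multi-hop) data flit
        return True

    def try_recv(self):
        if not self.buffer:
            return None               # absence of a message is itself synchronization
        msg = self.buffer.popleft()
        self.credits += 1             # reverse credit flit back to the sender
        return msg

ch = CreditChannel(depth=2)
print(ch.try_send("a"), ch.try_send("b"), ch.try_send("c"))  # True True False
print(ch.try_recv(), ch.try_send("c"))                       # a True
```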
The dynamic data production/consumption rates must be known to schedule the code — both for efficiency, and to avoid deadlock. On a TI architecture we expect compiler-directed multithreading of non-rate-limiting PEs to be a common and important optimization… since a TI architecture does not impose any ordering between instructions unless explicitly specified, it can gain the ILP benefits of an out-of-order issue processor without the expensive instruction window and reorder buffer. Simultaneously, a TI machine can take advantage of multithreading without duplicating data and control state, but by the compiler partitioning resources as it sees fit. Of course there is a hardware cost associated with this benefit — the TI PE must have a scheduler that can efficiently evaluate the program’s triggers… [with] critical rate-limiting loops in the spatial pipeline, there are fewer opportunities for multiplexing unrelated work onto shared PEs… Triggered instructions use a substantially smaller static instruction footprint… The static code footprint of… rate-limiting inner loops is, in general, fairly small… This… provides strong evidence of the viability and effectiveness of the spatial programming model with small, tight loops arranged in a pipelined graph… all workloads contain a large number of inter-PE channels, and… direct communication without going through shared memory is a feasible communication paradigm for these pipelined graphs. Furthermore, the low hop count indicates that much of this communication is neighbor-to-neighbor, and thus makes the most effective use possible of distributed network bandwidth… The [two-stage] pipelined PE has a comparable number of gate levels in the critical path to a high-performance commercial microprocessor… The hardware scheduler is the centerpiece of a TIA PE. Scheduler implementation cost is one of the primary factors that bounds the scalability of PE size in a triggered control model. Fortunately, the nature of spatial programming is such that small, efficient PEs are effective… the area cost of the TIA hardware scheduler is less than 2% of a PE’s overall area, much of which is occupied by its architectural state (registers, input/output channel buffers, predicates and instruction storage), datapath logic (operand multiplexers, functional units, etc.) and microarchitectural control overheads — none of which are unique to triggered control. This is not surprising — the core of the TIA scheduler is essentially a few 1-bit wide trees of AND gates feeding into a priority encoder. For our chosen parameterization, this logic is dwarfed by everything else in the PE. Similarly, scheduler power consumption is small compared to the rest of the PE. The scheduler logic does not consume dynamic power unless there is a change in predicate states. When this happens, the only wires that swing are the ones that are recomputing the changed control signals. This manner of computing control is more power-efficient than executing datapath instructions to compute the same results… TIA enables efficient control flow and spatial program mapping across PEs, enabling high utilization of ALUs within PEs without the need for an explicit control core… In a dataflow-triggered PE, the token-ready bits associated with input sources are managed by the micro-architecture. The TI approach, in contrast, replaces these bits with a vector of architecturally-visible predicate registers.
By specifying triggers that span multiple predicates, the programmer can choose to use these bits to indicate data readiness, but can also use them for other purposes, such as control flow decisions. In classic dataflow, multiple pipeline stages are devoted to marshaling tokens, distributing tokens, and scoreboarding which instructions are ready. A ‘Wait-Match’ pipeline stage must dynamically pair incoming tokens of dual-input instructions. In contrast, the set of predicates to be updated by an instruction in the TI is encoded in the instruction itself. This both reduces scheduler implementation cost and removes the token-related pipeline stages… compound conjunctions are explicitly supported in triggered instructions, allowing for efficient mapping of state transitions that would require multiple instructions in dataflow predication… While applications written for RAW are spatially mapped, program counter management and serial execution of instructions reduces efficiency, and makes the cores on RAW sensitive to variable latencies, which TIA overcomes using instruction triggers… Triggered instructions enable control flow at low cost and dynamic instruction issue dependent on data arrival, resulting in less instruction overhead… spatial parallelism is a promising computing paradigm with the potential to achieve significant performance improvement over traditional high-performance architectures for a number of important workloads, many of which do not exhibit uniform data parallelism… The triggered control model is feasible within a spatially-programmed environment because the amount of static instruction state that must be maintained in each PE is small, allowing for inexpensive implementation of a triggered-instruction hardware scheduler… The ultimate success of this paradigm will be premised on overcoming a number of challenges, including providing a tractable memory model, dealing with the finite size of the spatial array, and providing a high-level programming and debugging environment.” “The trigger for each instruction may be a function of one or more predicate registers of the control processing engine, FIFO status signals or information regarding tags… In a standard processor, both the control and data processing streams are handled as instructions dispatched to and executed in the execution logic of the processor. However, this can lead to several inefficiencies. For example, in a conventional processor a large number of instructions are devoted solely to computing what the next set of instructions should be (i.e. which instructions are ‘ready’), from where data should be retrieved and to where data may be stored. If instead a programmer describes a pool of operations that execute based on the arrival of certain patterns of inputs, then it is possible to separate out the computation of which instructions are ‘ready’ into a parallel circuit that may improve performance dramatically by avoiding instruction-level polling of data sources… The trigger for each instruction may be implemented by a programmer as a function of at least one predicate register of the control processing engine, FIFO status signals from one or more FIFOs… and tags (metadata) that either arrive over FIFOs, or are already present in registers inside the processing engine. This may provide several advantages for a processor, especially in the context of an accelerator.
For example: control decisions that may have taken multiple instruction cycles on a standard PC-based architecture may now be computed in a single cycle, control processing for multiple instructions may be computed in parallel if multiple instructions are ready to be executed and processing elements are available, and multiple algorithms may be mapped to a single processing element and executed by the processing element in an interleaved manner.” “…a common observation is that many algorithms primarily utilize byte- or word-level primitive operations, which are not efficiently mapped to bit-level logic and logic-grained spatial architectures such as FPGAs… In contrast to FPGAs, coarse-grained spatial architectures are designed from the ground up to suit the properties of these algorithms. Coarse-grained spatial architectures optimize byte- and word-level primitive operations into hardened logic and through the utilization of ALUs within PE datapaths. The hardened logic results in much higher compute density, which leads to faster clock speeds and reduces compilation times substantially compared to FPGAs. Generally speaking, these coarse-grained spatial architectures have a higher-level programming abstraction, which typically includes some notion of an instruction set architecture. In other words, PEs can be programmed by writing a sequence of software instructions, rather than requiring the hardware-level programming of an FPGA… To further support high compute density and provide efficiency, the instruction memory and register file within the PE are kept quite small, and the complexity of the ALU is kept low… Overall, TIA and spatial architectures in general are able to express and exploit significant amounts of fine-grained parallelism unavailable to non-spatial architectures. The energy efficiency advantage… corresponds both to the low complexity of a PE relative to an x86 core, which results in much lower energy cost per operation, and the aggressive conversion from expensive memory operations into inexpensive local communication… This huge acceleration essentially comes from the exploitation of very fine-grained parallelism and the dramatic reduction of memory reads and writes.” “Programmable spatial architectures composed of ensembles of autonomous fixed-ISA processing elements offer a compelling design point between the flexibility of an FPGA and the compute density of a GPU or shared-memory many-core… To launch an instruction, the front end must compare the predicate and communication queue state to a programmed set of trigger conditions, as opposed to just calculating the next address for the program counter… Tags encode programmable semantic information that accompanies the data communicated over these queues. For example, a tag might be used to indicate the datatype of the accompanying data word or a message to effect control flow like a termination condition. Tag values at the head of the input queues determine, in part, whether an instruction can be fired. 
The PE also contains a set of single-bit predicate registers, which can be updated immediately upon triggering an instruction, or as the result of a datapath operation… Comparison or logic instructions whose destination is a predicate register provide control flow equivalent to branching in [PC]-based ISAs… In our ISA, we… offer a wide range of comparison operations and logical operators intended primarily for predicate writes to support expressive control flow… A consequence of the parallel nature of trigger resolution is the need for all trigger fields to be exposed combinationally to the scheduler in order to execute at a rate of one instruction per cycle, which has area and energy ramifications for instruction memory… Updating the predicate state rapidly is important to keep the PE making forward progress… a peculiar aspect of triggered instruction-based control is the need for the portion of the instruction that includes trigger information to be accessed in parallel. This precludes all [SRAM]-based storage media for storing trigger information, making it possible for the instruction storage to become large and power-hungry relative to other components. Since some instruction fields are not required by the scheduler… it is possible to store those in a traditional indexed store such as SRAM, so long as the design is pipelined such that the stage in which instructions are triggered is separate from the stage in which those fields are decoded… mixed register/latch-SRAM… can reduce instruction memory area and power usage by 16% and 24%, respectively, over register-only instruction memory and by 9% and 19%, respectively, over latch-only instruction memory… Using clock-gated registers, the instruction storage accounts for 25% of PE area and 41% of PE power. The high area and power use are at least partially attributable to sized-up instruction register transistors, which are both timing critical and relatively high-fan-out. Latches reduce the area by just over 30% and power by 75% thanks to the removal of clock tree capacitance and smaller cells… however… latches increased the critical path of the trigger resolver… Thus, we settled on the clock-gated register-based instruction memory… Area is dominated by ALU followed by instruction memory. Power is mostly proportionate to area, excepting the instruction memory which has relatively higher power and the ALU relatively lower power… due to the capacitance of the clock tree of the large sequential instruction memory… The benefits to pipelining in terms of [ILP], timing closure and increased power efficiency through voltage scaling are well understood and thus desirable in most microarchitectures. It is also well-known that pipelines present a wide variety of hazards necessitating control logic that can help mitigate them… A pipelined, triggered microarchitecture is susceptible to the same data hazards as a conventional pipeline, be they register operand dependences or dependences across stages of a pipelined functional unit. On top of this, triggered control introduces unconventional control hazards. With a PC, a processor must compute the next PC in order to fetch and begin executing the next instruction. The next PC is either the result of a predictable update function… or a more complicated branch condition and target resolution calculation. By contrast, in a triggered architecture, program control advances via updates to the predicate and queue states. 
To trigger the next instruction, one has two options: wait for the preceding instruction to complete its updates to the predicate and queue state, or speculate on the upcoming predicate and queue state and trigger a speculative instruction. The deeper the pipeline, the more critical it becomes to resolve or predict the upcoming states to keep the pipeline full… Queue hazards arise when an in-flight instruction might alter the state of an input or output queue, thus preventing the scheduler from identifying and launching the next instruction. Because queue state has a direct impact on both control flow and data movement, queue hazards straddle the traditional control/data hazard dichotomy. They resemble control hazards in that they affect instruction execution eligibility and data hazards in that the processor must prevent dequeues from empty inputs or enqueues to full outputs… An efficient interface to determine the state of input and output queues is especially important here because of how spatial architectures encourage producer-consumer relationships between PEs… De-queueing from the inputs in the same cycle as the trigger resolution proved to be a long critical path… A unique aspect of this scheme is that when a program assigns certain semantic significance to particular predicates, writing to them only to represent certain binary decisions, this bank of predictors becomes a per-branch predictor without the traditional overhead of indexing a bank of predictors via the instruction pointer… To describe the various pipelines, we have divided the work of a processing element into three conceptual stages: trigger selects and fetches an instruction from memory; decode retrieves any operands from the register files and input queues and performs the necessary operand forwarding; and execute, optionally split into X1 and X2, performs any arithmetic or logical operations and writes results to the register file or output queues… the single-cycle TDX baseline remains surprisingly competitive due at least partially to its superior CPI… Another surprising observation is that the exact same microarchitecture provides the best global performance and best absolute energy efficiency. At the high-performance extreme, a two-stage pipeline with a split-stage ALU, TDX1|X2… achieves the highest throughput… At the low-power extreme this exact same microarchitecture [achieves the lowest EPI]… high-performance designs can counterintuitively share characteristics with ultra-low-power microarchitectures… The density of power dissipation and thus the [TDP] of GPUs is one of the primary factors that keeps their clock frequencies often twice as slow as CPUs in comparable technologies. When selecting a processing element design, power density will likely be one factor that limits the scalability of the spatial architecture… for simple integer functional units… moderate pipelines in the range of two to three stages are optimal.” “In TIA, the runtime function can be changed by dynamically configuring the PEs (e.g. instruction buffers, predicates and registers), switches and routers, and hence it is reconfigurable… A major problem provoked by dynamic mapping is that it increases the number of reconfigurations. It creates a trade-off between the potential performance improvement and the reconfiguration overhead… [A tile can] hold the configurations for two simultaneous tasks. 
Therefore, context switching between the two tasks only involves sending tokens with different [task identifiers] to activate corresponding tasks, which takes much fewer cycles than substitution. Therefore, the elastic scheduling scheme achieves better performance with fewer reconfigurations… When unrelated sections of an algorithm are placed on a single tile, instructions from different tasks are triggered by input tokens with distinct [task identifiers]. The register and predicate resources within each processing element are partitioned to avoid resource conflict.” “Unlike TRIPS and WaveScalar, TIA utilizes predicates as tokens for reducing the tag matching area… However, pipeline stalls remain because of the dependency between the issuance of a predicated instruction and the execution of an instruction that generates the predicate… the timing relationship between the transmitted data and their tags is relaxed, while their order is maintained. Therefore, the tags can be forwarded to trigger a subsequent instruction to issue before the arrival of the corresponding data. The instruction that uses the data as an operand is issued and remains pending until the data is available… Generally, data and control flows… are tightly coupled… [TIA] has the flexibility to switch contexts at a PE grain rather than an array grain… Unlike in conventional processors, the issuance of instructions in an autonomous PE is dependent on the intra-PE state and the messages or data from the adjacent PEs. As a result, a data dependency in the pipeline of inter-PEs becomes a control dependency… The triggered instruction paradigm is efficient for most control flows and outperforms all the other techniques for CGRAs… Since both control and data dependencies in TIA are converted into predicate dependencies, the pipeline stalls caused by both types of hazards can be handled in the same way… The instruction pool is small (e.g. 16 instructions)… There is no decoding stage because the operation code in each instruction is decoded in the execution stage. The intra-PE registers can be read or written without decoding since there are only a few and each one can be individually accessed. The execution of most instructions can be completed in one clock cycle, except the multiply and load instructions… Predicate registers in the PE are updated by the output of ALU or the [set predicate] field specified by the programmer and act as a program counter for triggering the issuance of instructions. They can also be used as operands… The operand of an instruction may come from an adjacent PE through the network… The shared data and control states are transferred through the channels. When an operand is in transmission, a programmer-specified tag is piggybacked onto it to identify the data… The arrival or absence of a message (data or tag) is an implicit form of control synchronization.
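The elastic-scheduling trick above, where switching tasks is just a matter of which task identifier rides in on the next token, is easy to caricature. This is my caricature, not the paper's mechanism: two instruction pools stay resident on one PE, the register and predicate state is partitioned per task, and whichever task the incoming tag names is the one that runs that cycle.

```python
from collections import deque

def task0_sum(ctx, v):    # task 0: running sum of its own tokens
    ctx["regs"][0] += v

def task1_count(ctx, v):  # task 1: count of its own tokens
    ctx["regs"][0] += 1

class MultiTaskPE:
    def __init__(self, configs):
        # configs: {task_id: instruction pool}; both tasks stay resident.
        self.configs = configs
        # Register and predicate state is partitioned per task identifier.
        self.state = {tid: {"preds": [False] * 4, "regs": [0] * 4}
                      for tid in configs}
        self.inq = deque()                  # tokens: (task_id, value)

    def step(self):
        if not self.inq:
            return None
        tid, value = self.inq.popleft()     # the tag selects the resident task
        ctx = self.state[tid]
        for inst in self.configs[tid]:
            if all(ctx["preds"][i] == v for i, v in inst["when"].items()):
                inst["op"](ctx, value)
                return (tid, inst["name"])
        return None

configs = {
    0: [{"name": "sum",   "when": {}, "op": task0_sum}],
    1: [{"name": "count", "when": {}, "op": task1_count}],
}
pe = MultiTaskPE(configs)
pe.inq.extend([(0, 5), (1, 99), (0, 7), (1, 99)])
while pe.step():
    pass
print(pe.state[0]["regs"][0], pe.state[1]["regs"][0])   # 12 2
```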
The tags… are sent to the scheduler as predicates for triggering a subsequent instruction… Each PE utilizes a hardware scheduler to evaluate the triggers of instructions and the instructions that actualize a logic conjunction of predicates are implicitly accomplished on the scheduler, rather than on the datapath, resulting in reduced CDFGs and small initiation intervals for loops… [TIA] can also reduce the hardware overhead of the scoreboard or reorder buffer in [out-of-order] cores.” “CGRAs are mostly employed to process compute-intensive kernels… control dependence is sequential in nature and control divergence makes spatial parallelism among processing elements rather difficult. In this way… control-intensive kernels are usually bottlenecks of CGRAs… TIA eliminates the problem of control divergence completely, which results in fewer instructions.” “…simultaneously improving the performance and energy efficiency of program execution with classical von Neumann architectures has become difficult: out-of-order scheduling, simultaneous multithreading, complex register files, and other structures provide performance, but at high energy cost… Exascale computing power-performance targets may demand both high throughput and low energy consumption per operation. [This is provided by spatial arrays of large numbers of low-complexity, energy-efficient processing elements connected by a lightweight, back-pressured communication network] which largely eliminate the control overheads of previous processor designs… [Configurable Spatial Accelerators] allow dataflow graphs to be directly configured onto the CSA array… rather than being transformed into sequential instruction streams… The key architectural interface of [the CSA] is the dataflow operator, [i.e.] a node in a dataflow graph… CSA dataflow execution may depend [only] on highly localized status… resulting in a highly scalable architecture with a distributed, asynchronous execution model… the CSA may also include a rich set of control operators which assist in the management of dataflow tokens in the program graph. Examples of these include a ‘pick’ operator… which multiplexes two or more logical input channels into a single output channel, and a ‘switch’ operator… which operates as a channel demultiplexer… These operators may enable a compiler to implement control paradigms such as conditional expressions… Dataflow architectures generally focus on communication and data manipulation with less attention paid to state. However, enabling real software, especially programs written in legacy sequential languages, requires significant attention to interfacing with memory. [CSAs] use architectural memory operations as their primary interface to… stateful storage. From the perspective of the dataflow graph, memory operations are similar to other dataflow operations, except that they have the side effect of updating a shared store. In particular, memory operations… have the same semantics as every other dataflow operator, for example, they ‘execute’ when their operands [e.g. an address] are available and, after some latency, a response is produced. [CSAs] explicitly decouple the operand input and result output such that memory operators are naturally pipelined and have the potential to produce many simultaneous outstanding requests… making them exceptionally well-suited to the latency and bandwidth characteristics of a memory subsystem. 
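Before going further into the memory side, the 'pick' and 'switch' operators quoted above are worth making concrete, since they are how a dataflow fabric gets by without branches. A minimal sketch with invented channel plumbing follows; in the real fabric these are PE configurations, not Python functions.

```python
from collections import deque

def pick(select_ch, in0, in1, out):
    """Multiplex: route a value from in0 or in1 to out, chosen by a control token."""
    while select_ch and (in0 if select_ch[0] == 0 else in1):
        sel = select_ch.popleft()
        out.append((in0 if sel == 0 else in1).popleft())

def switch(select_ch, inp, out0, out1):
    """Demultiplex: steer each input value to out0 or out1, chosen by a control token."""
    while select_ch and inp:
        sel = select_ch.popleft()
        (out0 if sel == 0 else out1).append(inp.popleft())

# A conditional expression 'z = a if c else b' becomes a pick driven by c:
a, b, c, z = deque([10]), deque([20]), deque([1]), deque()
pick(c, a, b, z)
print(list(z))   # [20]
```

A conditional becomes a pick fed by the comparison; a loop becomes a switch that steers values either back around the body or out to the exit. Memory operations, as the quote goes on to describe, are just more operators of this kind.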
[CSAs] provide basic memory operations such as load, which takes an address channel and populates a response channel with the values corresponding to the addresses, and a store. [CSAs] may also provide more advanced operations such as in-memory atomics and consistency operators. These operations may have similar semantics to their von Neumann counterparts. [CSAs] may accelerate existing programs described using sequential languages… A consequence of supporting these language models is addressing program memory order, [i.e.] the serial ordering of memory operations typically prescribed by these languages… Since compilers are to generate statically correct code, they are usually forced to serialize memory accesses. Typically, compilers targeting sequential von Neumann architectures use instruction ordering as a natural means of enforcing program order. However, [CSAs] have no notion of instruction or instruction-based program ordering as defined by a program counter… incoming dependency tokens… contain no architecturally visible information… memory operations may not execute until they have received a dependency token… memory operations produce an outgoing dependency token once their operation is visible to all logically subsequent, dependent memory operations… dependency tokens are similar to other dataflow tokens in a dataflow graph… since memory operations occur in conditional contexts, dependency tokens may also be manipulated using control operators… like any other tokens. Dependency tokens may have the effect of serializing memory accesses… providing the compiler a means of architecturally defining the order of memory accesses… The latency-insensitive semantics of a CSA may permit a distributed, asynchronous initialization of the fabric, [i.e.] as soon as PEs are configured, they may begin execution immediately. Unconfigured PEs may backpressure their channels until they are configured… preventing communications between configured and unconfigured elements… During… extraction, a logical view of the dataflow graph is captured and committed into memory… including all live control and dataflow tokens and state in the graph… Control, scheduling, and data storage [are] distributed amongst the PEs… removing the overhead of the centralized structures that dominate classical processors… [CSAs] may achieve remarkable performance efficiency relative to traditional multicore processors: compute (e.g. in the form of PEs) may be simpler, more energy-efficient, and more plentiful than in larger cores, and communications may be direct and mostly short-haul, e.g., as opposed to occurring over a wide, full-chip network as in typical multicore processors… Parallelism is explicit in dataflow graphs and [CSAs] spend no or minimal energy to extract it… unlike out-of-order processors which must re-discover parallelism each time an instruction is executed. Since each PE is responsible for a single operation in one embodiment, the register files and ports counts may be small… and therefore use less energy than their counterparts in core. Certain CSAs include many PEs, each of which holds live program values, giving the aggregate effect of a huge register file in a traditional architecture, which dramatically reduces memory accesses. [When] the memory is multi-ported and distributed, a CSA may sustain many more outstanding memory requests and utilize more bandwidth than a core. 
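Here is roughly how I picture the dependency tokens, as a sketch under my own assumptions (single-bit tokens in software queues, one store and one logically later, possibly aliasing load; channel names invented): the compiler draws the ordering edge as an extra channel, and the load simply cannot fire until the store has produced its token.

```python
from collections import deque

memory = {0x10: 0}

def store_op(addr_ch, data_ch, dep_in, dep_out):
    # Fires only when address, data, and the incoming dependency token are ready.
    if addr_ch and data_ch and dep_in:
        dep_in.popleft()
        memory[addr_ch.popleft()] = data_ch.popleft()
        dep_out.append(1)          # visible to logically subsequent memory ops

def load_op(addr_ch, dep_in, resp_ch):
    if addr_ch and dep_in:
        dep_in.popleft()
        resp_ch.append(memory[addr_ch.popleft()])

st_addr, st_data = deque([0x10]), deque([42])
ld_addr, ld_resp = deque([0x10]), deque()
tok_start, tok_mid = deque([1]), deque()   # compiler-inserted ordering edge

load_op(ld_addr, tok_mid, ld_resp)         # cannot fire yet: no token
store_op(st_addr, st_data, tok_start, tok_mid)
load_op(ld_addr, tok_mid, ld_resp)         # now fires and sees the store
print(list(ld_resp))                       # [42]
```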
These advantages may combine to yield an energy level per [operation] that is only a small percentage over the cost of the bare arithmetic circuitry. For example, in the case of an integer multiply, a CSA may consume no more than 25% more energy than the underlying multiplication circuit. Relative to… a core, an integer operation in that CSA fabric consumes less than [3%] of the energy per integer operation… [CSAs] combine lightweight dataflow-oriented processing elements with a lightweight, low-latency network to form an energy efficient computational fabric… inter-PE latency may be as low as zero cycles, meaning that the downstream PE may operate on data in the cycle after it is produced… data may be streamed in from memory, through the fabric, and then back out to memory… this architecture may achieve remarkable performance efficiency relative to traditional multicore processors… [using direct communication] as opposed to an extension of the memory system… A PE may obtain operands from input channels and write results to output channels, although internal register state may also be used… This PE consists of several I/O buffers, an ALU, a storage register, some instruction registers, and a scheduler. Each cycle, the scheduler may select an instruction for execution based on the availability of the input and output buffers and the status of the PE. The result of the operation may then be written to either an output buffer or to a [local] register. Data written to an output buffer may be transported to a downstream PE for further processing. This style of PE may be extremely energy efficient, for example, rather than reading data from a complex, multi-ported register file, a PE reads the data from a register. Similarly, instructions may be stored directly in a register, rather than in a virtualized instruction cache… Instruction registers may be set during a special configuration step. During this step, auxiliary control wires and state, in addition to the inter-PE network, may be used to stream in configuration across the several PEs comprising the fabric. As [a] result of parallelism… such a network may provide for rapid reconfiguration… a tile-sized fabric may be configured in less than about [10µs]… Certain [OCNs] are bufferless, and data is to move between producer and consumer in a single cycle… certain [CSAs include] an array of heterogeneous PEs, in which the fabric is composed of several types of PEs each of which implement only a subset of the dataflow operators… Other PEs, including those supporting [floating-point operations] and certain control operations may have a similar implementation style… with the appropriate (dataflow operator) circuitry substituted for the ALU… A configuration may include one or two control words which specify an opcode controlling the ALU, steer the various multiplexers within the PE, and actuate dataflow into and out of the PE channels. Dataflow operators may be implemented by microcoding these configuration bits… Each PE may support a number of wide, data-oriented and narrow, control-oriented channels… The [Request Address File] circuit… may be responsible for executing memory operations and serves as an intermediary between the CSA fabric and the memory hierarchy. As such, the main microarchitectural task of the RAF may be to rationalize the out-of-order memory subsystem with the in-order semantics of CSA fabric. In this capacity, the RAF circuit may be provisioned with completion buffers, [i.e.]
queue-like structures that re-order memory responses and return them to the fabric in the request order… Like the various PEs of the fabric, the RAF circuits may operate in a dataflow style by checking for the availability of input arguments and output buffering, if required, before selecting a memory operation to execute… the memory load and store operations that were in a dataflow graph are specified… A distributed accelerator… may adopt a consistency model in which memory transactions arising from a single memory… input channel are ordered with respect to one another. This ordering may be from the perspective of the channel producing the transactions… External observers may see different orderings across addresses, and accesses arising from other input channels, memory operations, and a memory interface circuit (e.g. RAF) may not be guaranteed to be ordered. One advantage of this model is that requests are generally unordered with respect to one another… This may simplify hardware, improve parallelism, and [largely] remove micro-architectural serial bottlenecks. For certain programs that are data parallel, such a model ensures high performance at low energy cost. However, certain largely data-parallel programs may [occasionally] require a degree of memory ordering. [CSAs allow programmers to exert] control over memory ordering in cases where it is desired… rather than punish [average-case memory accesses, which do] not require global visibility and are greatly slowed by waiting to obtain it, special annotations are introduced to apply memory ordering… Memory operations are [optionally] tagged with these modifiers where enhanced ordering is required… [CSAs] allow for finer-grained ordering operations… [CSAs] allow for different memory ordering [or consistency] modes… within a spatial accelerator memory subsystem… memory operations are not implemented in the fabric… to avoid making the memory operations slow, area-hungry, and limited in features… [CSAs] provide for first-class coherent memory subsystems… instead of merely providing basic memory interfaces capable of high performance, but without the refined interface semantics necessary to [all] parallel programming models. Spatial architectures may utilize dependency flows, [i.e.] memory dependency tokens… used to order memory operations. [CSAs] extend the dependency flows to understand more precisely the behavior of system memory… The consistency requirements may be encoded in the operation code of the particular memory operation… A memory dependency token may be a data value (e.g. single bit) that stalls the output of a memory access… from an RAF circuit until the memory dependency token is received… Consistency may refer to the conditions under which a memory dependency token is produced (e.g. generated, transmitted, and/or stored) for a particular operation execution. Subsequent executions (e.g. which are ordered in a pipelined fashion to ensure local consistency) of the same memory operation may be simultaneously outstanding. This enables pipelining of requests even in the presence of dependencies… Memory operations are located at the RAF circuits… memory order is relaxed and the RAF circuits… are generally decoupled from one another except as necessary to preserve program correctness. 
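The completion buffer is the one piece of the RAF I feel comfortable sketching, because its job is plain from the quote: hand out slots in request order, accept responses in whatever order the memory system returns them, and release values strictly in request order. A toy version of mine follows; the real structure is presumably a circular buffer with credit counts.

```python
from collections import deque

class CompletionBuffer:
    def __init__(self):
        self.slots = deque()          # slots in request order
        self.values = {}              # slot -> value, filled as responses arrive
        self.next_slot = 0

    def allocate(self):
        slot = self.next_slot
        self.next_slot += 1
        self.slots.append(slot)
        return slot

    def complete(self, slot, value):  # out-of-order response from the memory system
        self.values[slot] = value

    def drain(self):                  # in-order return to the requesting channel
        out = []
        while self.slots and self.slots[0] in self.values:
            out.append(self.values.pop(self.slots.popleft()))
        return out

cb = CompletionBuffer()
a, b, c = cb.allocate(), cb.allocate(), cb.allocate()
cb.complete(c, "z"); cb.complete(a, "x")
print(cb.drain())                     # ['x']  -- 'z' is held until 'y' returns
cb.complete(b, "y")
print(cb.drain())                     # ['y', 'z']
```

Everything about RAF-to-RAF coordination, which the quote turns to next, is about the cases where that decoupling has to be given up.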
Situations requiring extra RAF-to-RAF communication may include enforcement of program order… in which memory accesses arising from sequential programming languages must be serialized… due to the potential for aliased addresses, and communication in which some ordering is applied to the memory system in order to implement entity-to-entity communications protocols… Although runtime services in a CSA may be critical, they may be infrequent relative to user-level computation… The micro-protocols may permit each PE (e.g. PE class by type) to interact with the runtime service according to its own needs… This parallelism may enable the configuration of a CSA tile in between hundreds of nanoseconds to a few microseconds… [CSAs] thus leverage properties of dataflow graphs to improve implementation of each runtime service. One key observation is that runtime services may need only to preserve a legal logical view of the dataflow graph, [i.e.] a state that can be produced through some ordering of dataflow operator executions. Services may generally not need to guarantee a temporal view of the dataflow graph, [i.e.] the state of a dataflow graph in a CSA at a specific point in time. This may permit the CSA to conduct most runtime services in a distributed, pipelined, and parallel fashion… provided that the service is orchestrated to preserve the logical view of the dataflow graph… To start configuration, a controller may drive an out-of-band signal which places all fabric targets in its neighborhood into an unconfigured, paused state and swings multiplexors in the local network to a predefined conformation… From an unconfigured state, configuration may occur completely in parallel… in perhaps as few as [200ns]. However, due to the distributed initialization of embodiments of a CSA, PEs may become active, for example sending requests to memory, well before the entire fabric is configured. Extraction may proceed in much the same way as configuration… extraction may be orchestrated in a pipelined fashion. This arrangement… permits most of the fabric to continue executing, while a narrow region is disabled for extraction. Configuration and extraction may be coordinated and composed to achieve a pipelined context switch… [CSAs] execute dataflow graphs… a program manifestation that closely resembles the compiler’s own internal representation of compiled programs… The CSA microarchitecture… removes [most] energy-hungry control structures associated with von Neumann architectures, including most of the instruction-side microarchitecture. By removing these overheads… a CSA obtains a dense, efficient spatial array… The result is performance in not only parallel applications, but also serial applications as well. Unlike cores, which may pay dearly for performance in terms [of] area and energy, a CSA is already parallel in its native execution model… a CSA neither requires speculation to increase performance nor does it need to repeatedly re-extract parallelism from a sequential program representation, thereby avoiding two of the main energy taxes in von Neumann architectures. Most structures in [CSAs] are distributed, small, and energy-efficient, as opposed to the centralized, bulky, energy-hungry structures found in cores… each PE may have a few (e.g., 10 or less) storage registers. Taken individually, these registers may be more efficient than traditional register files. In aggregate, these registers may provide the effect of a large, in-fabric register file.
As a result, [CSAs] avoid most of [the] stack spills and fills incurred by classical architectures, while using much less energy per state access. Of course, applications may still access memory. In [CSAs], memory access request and response are architecturally decoupled, enabling workloads to sustain many more outstanding memory accesses per unit of area and energy. This property… reduces the area and energy needed to saturate main memory in memory-bound workloads. [CSAs] expose new forms of energy efficiency which are unique to non-von Neumann architectures. One consequence of executing a single [instruction at most PEs] is reduced operand entropy. In the case of an increment operation, each execution may result in a handful of circuit-level toggles and little energy consumption… In contrast, von Neumann architectures are multiplexed, resulting in large numbers of bit transitions… Because PEs may be relatively simple and their behavior in a particular dataflow graph be statically known, clock gating and power gating techniques may be applied more effectively than in coarser architectures. The graph-execution style, small size, and malleability of [the] PEs and the network together enable the expression [of] many kinds of parallelism: instruction, data, pipeline, vector, memory, thread, and task parallelism may all be implemented… von Neumann-style cores typically optimize for one style of parallelism, carefully chosen by the architects, resulting in a failure to capture all important application kernels. Just as [CSAs] expose and facilitate many forms of parallelism, [they do] not mandate a particular form of parallelism, or, worse, a particular subroutine be present in an application in order to benefit from the CSA. Many applications, including single-stream applications, may obtain both performance and energy benefits from [CSAs] even when compiled without modification. This reverses the long trend of requiring significant programmer effort to obtain a substantial performance gain in single-stream applications. Indeed, in some applications, [CSAs] obtain more performance from functionally equivalent, but less ‘modern’ codes than from their convoluted, contemporary cousins which have been tortured to target vector instructions… Dataflow operators are fundamentally asynchronous. This enables [CSAs] not only to have great freedom of implementation in the microarchitecture, but it also enables them to simply and succinctly accommodate abstract architectural concepts. [CSAs] naturally accommodate many memory microarchitectures, which are essentially asynchronous, with a simple load-store interface. [CSAs] also leverage asynchrony to provide faster and more-fully-featured runtime services like configuration and extraction, which are believed to be four to six orders of magnitude faster than an FPGA… Configuring the fabric quickly may be of preeminent importance in accelerating small portions of a larger algorithm… An advantage of static configuration may be that very little energy is spent at runtime on the configuration… as opposed to sequential cores which spend energy fetching configuration information (an instruction) nearly every cycle. The previous disadvantage of configuration is that it was a coarse-grained step with a potentially large latency, which places an under-bound on the size of program that can be accelerated in the fabric due to the cost of context switching… [CSAs] include out-of-band signaling in the network (e.g.
of only 2–3 bits, depending on the feature set supported)… Initially, configuration state may reside in memory. Multiple [distributed] local configuration controllers… may stream portions of the overall program into their local region of the spatial fabric… adding a few control wires or state elements to a 64- or [32-bit] CSA fabric has a lower cost relative to adding those same control mechanisms to a 4- or 6-bit fabric… To reduce the overhead of configuration, [CSAs] make use of existing network infrastructure to communicate configuration data… the configuration infrastructure adds no more than 2% to the overall fabric area and power… In some [CSAs], a PE issues a reconfiguration request based on some decision in the program dataflow graph. [CSAs] thus provide for a program (e.g., dataflow graph) directed reconfiguration capability… Simple operators, like those handling the unconditional evaluation of arithmetic expressions often consume all incoming data. It is sometimes useful, however, for operators to maintain state, for example, in accumulation… [MLP] helps to support parallel PE computation. If memory accesses are serialized, high parallelism is likely unachievable… The acceleration hardware may… decouple execution of a program from implementation of the memory subsystem… The store operation may also receive an incoming dependency token, and in response to the availability of these three items, the store operation may generate an outgoing dependency token… The memory ordering circuit may be coupled to an out-of-order memory subsystem… The completion queue may be used to hold a load operation that has been scheduled but for which data has not yet been received (and thus has not yet completed). The completion queue may therefore be used to reorder data and operation flow. The operations manager circuit… may provide logic for scheduling and executing queued memory operations when taking into account dependency tokens used to provide correct ordering of the memory operations… To ensure program correctness, the compiler expresses the dependency between the store operation and the load operation… via dependency tokens… To improve performance, the microarchitecture finds and issues as many load commands of an array in parallel as is legal with respect to program order… The priority encoder may… execute certain rules (or follow particular logic) to select the memory operation from those coming in that has priority to be executed first… with minimal hardware, the microarchitecture supports dependency flows by executing memory operations when operands (e.g. address and data for a store, or address for a load) and dependency tokens are available… As a heterogeneous, parametric architecture, [CSAs] may be readily adapted to all computing uses… The processor may include a flow control path network to carry the backpressure signal according to the dataflow graph.” “A challenge faced by [CSAs] is the mapping of a broad set of existing sequential or partially parallel applications to the spatial array, which is distributed by nature and highly parallel. The main difficulty in performing this mapping is the automatic parallelization of memory accesses in legacy programs due to a centralized, largely sequential memory model present in legacy programming environments… This is known as pointer or memory aliasing. 
The ordering of memory operations produced by a compiler, or the program memory graph, ensures that program memory order is honored within parallel-executing acceleration hardware such as spatial fabric.” “Each… processing element having its own memory resource can require a lot of space and consume resources… The fast ALUs… can execute fast operations with an effective latency of half a clock cycle… most complex integer operations go to the slow ALU [with] integer execution hardware for long-latency… operations, such as a multiplier, shifts, flag logic, and branch processing… Memory load/store operations are executed by the AGUs.” “The dataflow execution model is well-suited to running on a decentralized, scalable processor because it is inherently decentralized. In this model, instructions execute when their inputs are available, and detecting this condition can be done locally for each instruction. The global coordination upon which the von Neumann model relies, in the form of a program counter, is not required. In addition, the dataflow model allows programmers and compilers to express parallelism explicitly, instead of relying on the underlying hardware (e.g. an out-of-order superscalar) to extract it… Previous dataflow systems could not efficiently enforce the sequential memory semantics that imperative languages… require. Instead, they used special dataflow languages that limited their usefulness… if dataflow systems are to become a viable alternative to the von Neumann status quo, they must enforce sequentiality on memory operations without severely reducing parallelism among other instructions… Although the dataflow model… is fundamentally different than the von Neumann model that dominates conventional designs, both models accomplish many of the same tasks in order to execute single-threaded programs written in conventional programming languages. For example, both must determine which instructions to execute and provide a facility for conditional execution; they must pass operands from one instruction to another and they must access memory… Von Neumann processors represent programs as a list of instructions that reside in memory. A program counter selects instructions for execution by stepping from one memory address to the next, causing each instruction to execute in turn. Special instructions can modify the PC to implement conditional execution, function calls, and other types of control transfer. In modern von Neumann processors, instructions communicate with one another by writing and reading values in the register file… To access memory, programs issue load and store instructions. A key tenet of the von Neumann model is the set of memory semantics it provides in which loads and stores occur (or appear to occur) in the order in which the PC fetched them. Enforcing this order is required to preserve read-after-write, write-after-write, and write-after-read dependences between instructions. Modern imperative languages… provide essentially identical memory semantics and rely on the von Neumann architecture’s ability to implement these semantics efficiently. At its heart, the von Neumann model describes execution as a linear centralized process. A single program counter guides execution and there is always exactly one instruction that, according to the model, should execute next. This is both a strength and a weakness. 
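Both halves of that trade are easier for me to see in code, so here are two tiny interpreters for the same three-instruction program, one walked by a program counter and one fired by token arrival. Both are toys of my own and model no real machine.

```python
# Von Neumann: a program counter walks a list, operands flow through registers.
def von_neumann():
    regs = {"r0": 2, "r1": 3, "r2": 0, "r3": 0}
    prog = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r0"), ("halt",)]
    pc = 0
    while prog[pc][0] != "halt":
        op, d, a, b = prog[pc]
        regs[d] = regs[a] + regs[b] if op == "add" else regs[a] * regs[b]
        pc += 1                       # the single, centralized next-instruction decision
    return regs["r3"]

# Dataflow: nodes fire whenever their inputs are present; no PC, no register file.
def dataflow():
    tokens = {("add", 0): 2, ("add", 1): 3}          # initial operand tokens
    graph = {"add": ("mul", 0), "mul": None}         # producer -> consumer port
    tokens[("mul", 1)] = 2                           # mul's other operand
    ready = ["add"]
    while ready:
        node = ready.pop()
        a, b = tokens.pop((node, 0)), tokens.pop((node, 1))
        result = a + b if node == "add" else a * b
        dest = graph[node]
        if dest is None:
            return result
        tokens[dest] = result
        if (dest[0], 0) in tokens and (dest[0], 1) in tokens:
            ready.append(dest[0])

print(von_neumann(), dataflow())      # 10 10
```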
On one hand, it makes control transfer easy, tightly bounds the amount of state that the processor must maintain, and provides a simple set of memory semantics… On the other hand, the model expresses no parallelism… The dataflow execution model has no PC to guide instruction fetch and memory ordering and no register file to serve as a conduit of data values between dependent instructions. Instead, it views instructions as nodes in a dataflow graph, which only execute after they have received their input values. Memory operations execute in the same data-driven fashion, which may result in their being executed out of the program’s linear order. However, although the model provides no total ordering of a program’s instructions, it does enforce the partial orders that a program’s dataflow graph defines. Since individual partial orders are data-independent, they can be executed in parallel, providing the dataflow model with an inherent means of expressing parallelism of arbitrary granularity. In particular, the granularity of parallelism is determined by the length of a data-dependent path. For all operations, data values are passed directly from producer to consumer instructions without intervening accesses to a register file. Dataflow’s advantages are its explicit expression of parallelism among dataflow paths and its decentralized execution model that obviates the need for a program counter or any other centralized structure to control instruction execution. However, these advantages do not come for free. Control transfer is more expensive in the dataflow model, and the lack of a total order on instruction execution makes it difficult to enforce the memory ordering that imperative languages require… The dataflow firing rule is inherently decentralized because it allows each instruction to act autonomously, waiting for inputs to arrive and generating outputs. Portions of the dataflow graph that are not explicitly data-dependent do not communicate at all. Dataflow’s decentralized execution algorithm makes control transfers more difficult to implement. Instead of steering a single PC through the executable so that the processor executes one path instead of the other, WaveScalar steers values into one part of the dataflow graph and prevents them from flowing into another. It can also use predication to perform both computations and later discard the results on the wrong path. In both cases, the dataflow graph must contain a control instruction for each live value, which is a source of some overhead in the form of extra static instructions… in addition to branching, dataflow machines must also distinguish dynamic instances of values from different iterations of a loop… The problem arises because the dataflow execution model makes no guarantee about how long it takes for a data value to flow along a given dataflow arc… Previous dataflow machines provided one of two solutions. In the first, static dataflow, only one value is allowed on each arc at any time… While this restriction resolves the ambiguity between different value instances, it also reduces parallelism by preventing multiple iterations of a loop from executing simultaneously, and makes recursion difficult to support… A second model, dynamic dataflow, tags each data value with an identifier and allows multiple values to wait at the input to an instruction. The dataflow firing rule is modified so that an instruction fires only when tokens with matching tags are available on all of its inputs. 
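The dynamic-dataflow firing rule, tags included, comes down to something like the following sketch; the encoding of tags and ports is mine, not WaveScalar's. An instruction fires only when tokens carrying the same tag have arrived on all of its inputs, so two loop iterations can be in flight without their operands getting crossed.

```python
from collections import defaultdict

class MatchingTable:
    def __init__(self, n_inputs, op):
        self.n_inputs = n_inputs
        self.op = op
        self.waiting = defaultdict(dict)       # tag -> {port: value}

    def receive(self, tag, port, value):
        slot = self.waiting[tag]
        slot[port] = value
        if len(slot) == self.n_inputs:         # dataflow firing rule, per tag
            del self.waiting[tag]
            return tag, self.op(*(slot[p] for p in range(self.n_inputs)))
        return None                            # still waiting for a partner token

add = MatchingTable(2, lambda a, b: a + b)
print(add.receive(tag=0, port=0, value=1))     # None -- iteration 0 incomplete
print(add.receive(tag=1, port=0, value=10))    # None -- iteration 1 incomplete
print(add.receive(tag=1, port=1, value=20))    # (1, 30) -- iteration 1 fires first
print(add.receive(tag=0, port=1, value=2))     # (0, 3)
```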
The combination of a data value and its tag is called a token. WaveScalar is a dynamic dataflow architecture. Dynamic dataflow architectures differ in how they manage and assign tags to values. In WaveScalar the tags are called wave numbers… Instead of assigning different wave numbers to different instances of specific instructions (as did most dynamic dataflow machines), WaveScalar assigns them to compiler-delineated portions of the dataflow graph, called waves… wave numbers allow instructions from different wave instances… to execute simultaneously. In addition to allowing WaveScalar to extract parallelism, wave numbers also play a key role in enforcing memory ordering… Since dataflow ISAs only enforce the static data dependences in a program’s dataflow graph, they have no mechanism ensuring that memory operations occur in program order… WaveScalar must provide an efficient mechanism to encode this implicit dependence in order to support imperative languages. Wave-ordered memory solves the dataflow memory ordering problem… Within each wave, the compiler annotates memory access instructions to encode the ordering constraints between them. Since wave numbers increase as the program executes, they provide an ordering of the executing waves… To ensure that all memory operations on the correct path are executed, each memory operation also carries the sequence number of its previous and subsequent operations in the program order… Intuitively, the annotations allow the memory system to ‘chain’ memory operations together. When the compiler generates and annotates a wave, there are many potential chains of operations through the wave, but only one chain (i.e. one control path) executes each time the wave executes (i.e. during one dynamic instance of the wave)… In order for the chaining to be successful, the compiler must ensure that there is a complete chain of memory operations along every path through a wave… Combining global inter-wave ordering with local intra-wave ordering provides a total ordering on all operations in the program. The basic version of wave-ordered memory… can be easily extended to express parallelism between memory operations, allowing consecutive loads to execute in parallel or out-of-order. The annotations and rules define a linear ordering of memory operations, ignoring potential parallelism between loads. Wave-ordered memory can express this parallelism by providing a fourth annotation, called a ripple number… Wave-ordered memory is not the only way to provide the required memory ordering… an alternative scheme… makes implicit memory dependences explicit by adding a dataflow edge between each memory operation and the next. While this ‘token-passing’ scheme is simple, it does not perform as well as wave-ordered memory… Despite this, token passing is very useful in some situations because it gives the programmer or compiler complete control over memory ordering. If very good memory aliasing is available, the programmer or compiler can express parallelism directly by judiciously placing dependences only between those memory operations that must actually execute sequentially. WaveScalar provides a simple token-passing facility for just this purpose… WaveScalar’s execution model is distributed. Only instructions that must pass each other data communicate. 
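My mental model of wave-ordered memory is a store buffer that walks the chain, so here is a deliberately crippled sketch of mine: straight-line code only, no branch wildcards, no ripple numbers, annotations invented by me. Requests arrive in any order and issue only once every earlier operation in the chain has shown up.

```python
class WaveOrderedBuffer:
    def __init__(self):
        self.pending = {}          # this-sequence-number -> (prev, next, request)
        self.issued = []
        self.expected = 0          # next sequence number allowed to issue

    def arrive(self, prev, this, nxt, request):
        self.pending[this] = (prev, nxt, request)
        self._walk()

    def _walk(self):
        # 'Chain' forward as long as the next operation in program order is here.
        while self.expected in self.pending:
            prev, nxt, request = self.pending.pop(self.expected)
            self.issued.append(request)
            self.expected = nxt    # stall here if that operation hasn't arrived

buf = WaveOrderedBuffer()
buf.arrive(prev=1, this=2, nxt=3, request="store B")   # arrives early: held
buf.arrive(prev=None, this=0, nxt=1, request="load A") # issues immediately
buf.arrive(prev=0, this=1, nxt=2, request="load B")    # unblocks the store
print(buf.issued)   # ['load A', 'load B', 'store B']
```

The point of all the annotation machinery is that this buffer is the only place where program order exists at all.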
There is no centralized control point… The WaveCache… contains a scalable grid of simple, identical dataflow processing elements… The larger designs contain more tiles, but the tile structure, and therefore the overall design complexity, does not change… Instead of designing a monolithic core that comprises the entire die, tiled processors cover the die with hundreds or thousands of identical tiles, each of which is a complete, though simple, processing unit. Since they are less complex than the monolithic core and replicated across the die, tiles more quickly amortize design and verification costs. Tiled architectures also generally compute under decentralized control, contributing to shorter wire lengths… A PE has… bypass networks that allow back-to-back execution of dependent instructions… it avoids the large, centralized associative tag-matching store found on some previous dataflow machines… An instruction store holds the decoded instructions that reside at a PE. To keep it single-ported, the RTL design divides it into several small SRAMs, each holding the decoded information needed at a particular stage of the pipeline. The instruction store comprises about 33% of the PE’s area. The matching table handles instruction input matching. Implementing this operation cost-effectively is essential to an efficient dataflow machine… To achieve good performance, PEs must be able to execute dependent instructions on consecutive cycles… The decision to integrate pairs of PEs together is a response to two competing concerns. We wanted the clock cycle to be short and instruction-to-instruction communication to take as few cycles as possible. To reach our cycle-time goal, the PE… had to be pipelined. This increased average communication latency and reduced performance significantly. Allowing pairs of PEs to communicate quickly brought the average latency back down without significantly impacting cycle time… Placing instructions carefully into the WaveCache is critical to good performance… Instructions’ proximity determines the communication latency between them, arguing for tightly packing instructions together. On the other hand, instructions that can execute simultaneously should not end up at the same processing element because competition for the single functional unit will serialize them… In modern technology, wire delay dominates the cost of computation. Wires in most tiled architectures span no more than a single tile, ensuring that wire length does not increase with the number of tiles… An ideal tiled architecture would scale to any number of tiles, both in terms of functional correctness and in terms of performance… WaveScalar is the first architecture to allow programs to manage memory ordering directly by creating and destroying memory orderings and dynamically binding them to a particular thread. WaveScalar’s thread-spawning facility is efficient enough to parallelize small loops. Its synchronization mechanism is also lightweight and tightly integrated into the dataflow framework… The required changes to WaveCache to support the ISA extensions are surprisingly small, and do not impact on the overall structure of the WaveCache because the executing threads dynamically share most WaveCache processing resources… Together, all these instructions place WaveScalar’s tagging mechanism completely under programmer control, and allow programmers to write software such as threading libraries… Scaling multithreaded WaveScalar systems beyond a single die is… feasible. 
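The placement tension, pack dependent instructions together but spread independent ones apart, can be caricatured with a two-term cost model. That is all this is, a caricature of mine and not the WaveCache placer.

```python
def placement_cost(placement, edges):
    # placement: instr -> (x, y) PE coordinate; edges: producer -> consumer pairs
    comm = sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)
    # crude contention term: extra cycles when instructions share a PE's one ALU
    occupancy = {}
    for pe in placement.values():
        occupancy[pe] = occupancy.get(pe, 0) + 1
    contention = sum(n - 1 for n in occupancy.values())
    return comm + contention

edges = [("a", "b"), ("b", "c")]                 # a dependent chain
packed = {"a": (0, 0), "b": (0, 0), "c": (0, 0)}
spread = {"a": (0, 0), "b": (3, 0), "c": (6, 0)}
print(placement_cost(packed, edges), placement_cost(spread, edges))   # 2 6
```

The packed placement wins here only because the example is a pure dependence chain that was never going to execute in parallel anyway.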
WaveScalar’s execution model makes and requires no guarantees about communication latency, so using several WaveCache processors to construct a larger computing substrate is a possibility… The WaveScalar [ISA] replicates the functionality of a von Neumann processor or a CMP composed of von Neumann processors. Providing these capabilities is essential if WaveScalar is to be a viable alternative to von Neumann architectures, but this is not the limit of what WaveScalar can do… Firstly, it provides a second unordered memory interface that is similar in spirit to the token-passing interface… The unordered interface is built to express memory parallelism. It bypasses the wave-ordered store buffer and accesses the L1 cache directly, avoiding the overhead of wave-ordering hardware. Because unordered operations do not go through the store buffer, they can arrive at the L1 cache in any order or in parallel… a programmer can restrict this ordering by adding edges to a program’s dataflow graph… Secondly, the WaveCache can support very fine-grain threads. On von Neumann machines the amount of hardware devoted to a thread is fixed (e.g. one core on a CMP or one thread context on an SMT machine), and the number of threads that can execute at once is relatively small. On WaveCache, the number of physical store buffers limits the number of threads that use wave-ordered memory, but any number of threads can use the unordered interface at one time. In addition, spawning these threads is very inexpensive. As a result, it is feasible to break a program up into hundreds of parallel fine-grain threads… The [wave-ordered memory interface] is necessary for executing conventional programs, but can only express limited parallelism (i.e. by using ripple numbers). WaveScalar’s unordered interface makes a different trade-off: it cannot efficiently provide the sequential ordering that conventional programs require, but excels at expressing parallelism because it eliminates unnecessary ordering constraints and avoids contention for the store buffer. Accordingly, it allows programmers or compilers to express and exploit memory parallelism when they know it exists. Like all other dataflow instructions, unordered operations are only constrained by their static data dependences. This means that if two unordered memory operations are neither directly nor indirectly data-dependent, they can execute in any order. Programmers and compilers can exploit this fact to express parallelism between memory operations that can safely execute out-of-order; however, they need a mechanism to enforce ordering among those that cannot… since unordered memory operations bypass the wave-ordered interface, unordered accesses must use a different mechanism. To ensure that the load executes after the store, there must be a data dependence between them. This means that memory operations must produce an output token that can be passed to the operations that follow. Loads already do this because they return a value from memory. We modify stores to produce a value when they complete. The value that the token carries is unimportant, since its only purpose is to signal that the store has completed… the unordered implementations achieve superior performance because they exploit more parallelism. The benefits stem from two sources. First, unordered implementations can use more threads. It would be easy to [spawn] hundreds or thousands of threads, but the WaveCache cannot execute this many ordered threads at once, since there are not enough store buffers. 
Secondly, within each thread the unordered threads’ memory operations can execute in parallel. As a result, fine-grain unordered implementations exploit more inter- and intra-thread parallelism, allowing them to exploit many PEs at once. In fact, the numerous threads that each kernel spawns can easily fill and utilize the entire WaveCache… A key strength of our ordered and unordered memory mechanisms is their ability to coexist in the same application. Sections of an application that have independent and easily analyzable memory access patterns (e.g. matrix manipulations and stream processing) can use the unordered interface, while difficult-to-analyze portions (e.g. pointer-chasing codes) can use wave-ordered memory.” “The wave-ordered memory hardware is distributed throughout the WaveCache as part of the store buffers… The store buffer itself is a small, non-associative memory that holds memory requests. A simple state machine implements the wave-ordered memory logic by ‘walking’ the sequence of requests and stalling when it detects a missing operation. This ensures that memory operations are issued to the L1 data caches in the correct order… The ability to efficiently create and terminate [threads] provides only part of the functionality required to make multithreading useful. Independent threads must also synchronize and communicate with one another. WaveScalar recognizes two types of synchronization: intra- and inter-thread. Intra-thread synchronization is comparable to a memory fence or barrier. It is used to build a relaxed consistency model by synchronizing the execution of a thread with its outstanding memory operations. The second primitive models a hardware queue lock and provides inter-thread synchronization… it is natural to think of a wave-ordered memory sequence as the essence of a thread, because in most systems the notion of a thread and its memory ordering are inseparable. But in WaveScalar, this perspective is misleading: nothing in the WaveScalar architecture requires a thread to have a memory ordering. If a thread could access memory without interacting with the wave-ordered memory system, it could avoid the serialization bottleneck that a global memory ordering requires… an interface to memory that avoids the wave-ordered memory system… can provide substantial benefits for applications where a global ordering of memory operations is not necessary for correct execution… Without an ordered memory interface, WaveScalar threads can execute their memory operations in any order, potentially exposing massive amounts of parallelism… This interface does not require a thread to give up all control over the order in which memory instructions execute. Instead, it allows the thread to directly control which memory operations can fire in any order and which must be sequentialized… [WaveScalar] can profitably execute threads of sizes ranging from as large as a process to as small as a handful of instructions. Such a wide range of thread granularity is possible or, more to the point, such small threads are possible, because WaveScalar drastically reduces the overhead normally required to manipulate threads. 
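The unordered-interface idiom from a few sentences back, stores modified to produce a token whose value is irrelevant, is easy to sketch under my own assumptions (names and channels invented): a load that must follow a particular store simply consumes that store's completion token as an extra operand.

```python
from collections import deque

memory = {}

def unordered_store(addr, value, done_ch):
    memory[addr] = value
    done_ch.append(1)                 # token: "this store is complete"

def unordered_load(addr, after_ch, resp_ch):
    if after_ch:                      # a data dependence stands in for program order
        after_ch.popleft()
        resp_ch.append(memory[addr])

done, resp = deque(), deque()
unordered_load(0x20, done, resp)      # no token yet: does not fire
unordered_store(0x20, 7, done)
unordered_load(0x20, done, resp)      # fires, guaranteed to see the store
print(list(resp))                     # [7]
```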
Although different aspects of the design contribute to the low overhead, the underlying reason that these different thread functions are so cheap is that they operate within the context of WaveScalar’s dataflow architecture that is built atop a distributed dataflow microarchitecture, the WaveCache.” “At last, a dataflow machine can run ‘real-world’ programs, written in any language, without sacrificing parallelism… By co-locating computation and data in physical space, the WaveCache minimizes long wire, high-latency communication… modern superscalar processor designs will not scale, because they are built atop a vast infrastructure of slow broadcast networks, associative searches, complex control logic, and inherently centralized structures… WaveScalar’s goal is to minimize communication costs by ridding the processor of long wires and broadcast networks. To this end, it includes a completely decentralized implementation of the ‘token-store’ of traditional dataflow architectures and a distributed execution model… Previously, dataflow architectures provided their own style of memory semantics and their own dataflow languages that disallowed side effects, mutable data structures, and many other useful programming constructs… WaveScalar completely abandons the program counter and linear von Neumann execution… As features and cycle times shrink, the hardware structures that form the core of superscalar processors (register files, issue windows, and scheduling logic) become extremely expensive to access. Consequently, clock speed decreases and/or pipeline depth increases… To squeeze maximum performance from each core, architects constantly add new algorithms and structures to designs. Each new mechanism, optimization, or predictor adds additional complexity… Dataflow locality is the predictability of instruction dependencies through the dynamic trace of an application. A processor could take advantage of this predictability to reduce the complexity of its communication system (i.e. register files and bypass networks) and reduce communication costs… the vast majority of operand communication is highly predictable… Such high rates of predictability suggest that current processor communication systems are over-general, because they provide instructions with fast access to many more register values than needed. If the processor could exploit dataflow locality to ensure that necessary inputs were usually close at hand (at the expense of other potential inputs being farther away), they could reduce the average cost of communication. Instead of simply ignoring dataflow locality, however, superscalars destroy it in their search for parallelism. Register renaming removes false dependencies, enables dynamic loop unrolling, and exposes a large amount of dynamic ILP for the superscalar core to exploit. However, it destroys dataflow locality. By changing the physical registers an instruction uses, renaming forces the architecture to provide each instruction with fast access to the entire physical register file. This results in a huge, slow register file and complicated forwarding networks… Destroying dataflow locality leads to a shocking inefficiency in modern processor designs: The processor fetches a stream of instructions with a highly predictable communication pattern, destroys that predictability by renaming, and then compensates by using broadcast communication in the register file and the bypass network combined with complex scheduling in the instruction queue. 
The consequence is that modern processor designs devote few resources to actual execution… and the vast majority to communication infrastructure. This infrastructure is necessary precisely because superscalars do not exploit dataflow locality… Two serialization points constrain the von Neumann model and, therefore, superscalar processors: The first arises as the processor, guided by the program counter and control instructions, assembles a linear sequence of operations for execution. The second… is at the memory interface where memory operations must complete (or appear to complete) in order to guarantee load-store ordering. The elegance and simplicity of the model are striking, but the price is steep. Instruction fetch introduces a control dependence between each instruction and the next and serves little purpose besides providing the ordering that the memory interface must adhere to. As a result, von Neumann processors are fundamentally sequential; there is no parallelism in the model. In practice, of course, von Neumann processors do achieve limited parallelism… The original motivation for WaveScalar was to build a decentralized superscalar processor core. Initially, we examined each piece of a superscalar and tried to design a new, decentralized hardware algorithm for it. By decentralizing everything, we hoped we could design a truly scalable superscalar. It soon became apparent that instruction fetch is difficult to decentralize, because, by its very nature, a single program counter controls it… Dataflow machines are perhaps the best studied alternative to von Neumann processors… A key feature of WaveScalar is that the [wave advance] instructions allow wave-number management to be entirely distributed and under software control. This is in contrast to traditional dataflow machines in which tag creation is either partially distributed or completely centralized… we intend to exploit this fact to optimize WaveScalar binaries by creating application-specific tagging schemes… The processing elements are freed from managing implicit dependencies through memory and can treat memory operations just like other instructions. The sequencing information included with memory requests provides a concise summary of the path taken through the program… The most exciting aspect of wave-ordered memory is the incorporation of instruction order into the instruction set as a first-class entity. However, [it] only encodes information about dependencies between memory operations. It is possible to devise a more general scheme that also expresses independence among memory operations (i.e. that two memory operations can proceed in any order). An ISA that incorporates such a scheme could use memory aliasing information from the compiler to expose large amounts of memory parallelism to the hardware… By letting the current needs of the program guide instruction placement, the WaveCache can aggressively exploit dataflow locality and achieve higher performance… [other challenges include] handling interrupts, I/O, and other operating system issues.” “Conventional high-performance von Neumann processors go to great lengths to extract parallelism from the sequence of instructions that the program counter generates.
Making parallelism explicit removes this complexity from the hardware… Programmers can combine both coarse- and fine-grain threads and unordered and wave-ordered memory in the same program or even the same function… applying diverse styles of threading and memory to a single program can expose significant parallelism in code that is otherwise difficult to parallelize… The instructions remain at their PEs for many, potentially millions, of invocations… Imperative languages require memory operations to occur in a strictly defined order to enforce data dependences through memory. Since these data dependences are implicit (i.e. they do not appear in the program’s dataflow graph) the dataflow execution model does not enforce them… In addition to providing the semantics that imperative programming languages require, wave-ordered memory also provides the WaveScalar instruction set [a] facility for describing the control flow graph of an application. The predecessor, successor, and sequence numbers effectively summarize the control flow graph for the memory system… WaveScalar is the first [ISA] to explicitly provide high-level structural information about the program… Ripples also allow wave-ordered memory to exploit aliasing information that allows some stores to issue out-of-order… By applying ripples to easily-analyzed stores (e.g. the stack or different fields in the same structure), the compiler can allow them to issue in parallel. Wave-ordered memory’s goal is to provide sequential memory semantics without interfering with the parallelism that dataflow exposes… wave-ordered memory exposes twice as much MIPC as token-passing on average… The difference is primarily due to the steering dataflow machines use for control flow. The memory token is a normal data value, so the compiler must insert steer instructions to guide it along the path of execution. The steer instructions introduce additional latency that can delay the execution of memory operations… Wave-ordered memory avoids this overhead by encoding the necessary control flow information in the wave-ordering annotations. When the requests arrive at the memory interface they carry that information with them. The wavefront… serves the same conceptual purpose as the token: It moves from operation to operation, allowing them to fire in the correct order, but it does not traverse any non-memory instructions… Operand messages arrive at the PE either from another PE or from itself (via the ALU bypass network)… Cost-effective matching is essential to an efficient dataflow design and has historically been an impediment to more effective dataflow execution… WaveScalar can run any program that a conventional chip multiprocessor can. Unlike a CMP, however, WaveScalar’s multithreading support caused very little change to the underlying hardware… Because CMPs use the von Neumann execution model, some resources (e.g. the program counter and register file) must be replicated for each thread, and, as a result, the difference between a single- and multi-threaded processor is great. In WaveScalar’s dataflow model, a thread requires no centralized control or resource, so two threads can run almost as easily as one… Tag management increases the burden on the programmer slightly but allows more flexible use of the available space of tags.
For instance, the fine-grain threading mechanism… relies upon creative use of wave numbers and thread IDs… Designers of the earliest dataflow machines intended them to run purely functional languages that had no notion of mutable state. However, support for mutable state can make programming some applications much easier… Although I- and M-structures provide mutable state, they do not provide the ordering semantics of imperative languages… However, WaveScalar can emulate M-structures… the dataflow firing rule is naïve, since it ignores locality in execution. Ideally, a dataflow machine would execute program fragments that are related to one another to exploit locality and prevent ‘thrashing’ at the top of the memory hierarchy… Better aliasing knowledge would allow for more aggressive use of the ripple annotation… and structural information would allow the compiler to generate larger, more efficient waves… Extending conventional instruction sets to encode more information about program behavior, control structure, or memory usage is one approach to improving performance in the face of stringent technological constraints. The challenge is finding concise, flexible ways to express useful information to the processor.” “WaveScalar is a tagged-token dataflow architecture. Like all dataflow architectures, its application binary is a program’s dataflow graph… WaveScalar supports traditional dataflow memory semantics (I-structures and M-structures) by a lightweight locking primitive… To achieve back-to-back execution of dependent instructions, we provide bypassing paths that send results from the end of EX directly back to the beginning of EX… [The intra-pod interconnect] allows two PEs to share their bypass networks and scheduling information.” “Dataflow architectures are inherently multi-threaded (some models even consider each instruction to be its own thread), so augmenting the WaveCache to support multiple von Neumann-style threads is much simpler and requires less hardware than adding multithreading support to an out-of-order superscalar processor… Using [lightweight, memoryless inter-thread communication], we construct a variety of efficient synchronization and communication primitives including locks, barriers, condition variables, and simple message passing… Previous dataflow machines have excelled at exposing parallelism but required programs to be written in special languages that eliminate side effects… In addition to language restrictions, the centralized designs of early dataflow machines’ token stores hampered their performance and scalability. The WaveCache eliminates this problem by distributing the token store and matching logic across the PEs… Like previous dataflow instruction sets, WaveScalar converts control dependences into data dependences by sending data values to the instructions that need them. Rather than changing the value of a program counter, which causes particular instructions to be fetched and executed on a von Neumann machine, WaveScalar includes two dataflow instructions that explicitly steer values to their intended consumers… Like traditional dataflow machines, WaveScalar uses tags to identify different dynamic instances of data.
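Here is how I picture the distributed token store and the dataflow firing rule at a single PE. This is a hedged sketch rather than anything resembling the actual WaveCache matching hardware: a tag names a dynamic instance (thread, wave), and an instruction fires once operands carrying the same tag have filled all of its input ports.

```python
# Sketch of per-PE tagged-token matching (hypothetical structure, not the
# real WaveCache matching table). A tag is (thread_id, wave_number); an
# instruction fires when operands with the same tag fill all of its inputs.
from collections import defaultdict

class ProcessingElement:
    def __init__(self):
        self.instrs = {}                  # name -> (n_inputs, fn, consumers)
        self.match = defaultdict(dict)    # (instr, tag) -> {port: value}
        self.outbox = []                  # tokens produced toward other PEs

    def configure(self, name, n_inputs, fn, consumers):
        self.instrs[name] = (n_inputs, fn, consumers)

    def receive(self, instr, port, tag, value):
        n_inputs, fn, consumers = self.instrs[instr]
        slot = self.match[(instr, tag)]
        slot[port] = value
        if len(slot) == n_inputs:            # the dataflow firing rule
            result = fn(*(slot[p] for p in range(n_inputs)))
            del self.match[(instr, tag)]     # free the matching-table entry
            for dest_instr, dest_port in consumers:
                self.outbox.append((dest_instr, dest_port, tag, result))

pe = ProcessingElement()
pe.configure("add0", 2, lambda a, b: a + b, consumers=[("mul1", 0)])
pe.receive("add0", 0, tag=("t0", 7), value=3)   # one dynamic instance (wave 7)
pe.receive("add0", 1, tag=("t0", 7), value=4)   # second operand arrives: fires
print(pe.outbox)   # [('mul1', 0, ('t0', 7), 7)]
```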
Unlike traditional dataflow machines, where tag creation is either partially distributed or completely centralized, WaveScalar’s tag control mechanism is entirely under software control, and distributed throughout the WaveCache… Traditional threading systems provide a set of function calls and data types that support thread management and synchronization. On most coarse-grained architectures, these mechanisms are heavyweight, because the hardware only provides synchronization primitives that act through memory. WaveScalar sidesteps main memory for all synchronization…” “At first sight, dataflow machines appear ill-suited to executing imperative language programs. Traditionally, the dataflow model is tied closely to functional languages; for example, dataflow operators… are functions from inputs to outputs. Moreover, dataflow machines have no program counter to sequence operations; rather, instructions are scheduled dynamically for execution whenever they receive input data. This is in stark contrast to the traditional operational semantics of imperative language programs. Each statement in an imperative language program is a command whose execution causes a change in a global, updatable store. Sequencing of command execution is achieved through a program counter which specifies the unique next instruction to be executed. The existence of a program counter in the underlying operational model is reflected in the programming language through commands… that modify the program counter. Thus, there is a wide gulf between the dataflow model of execution and the standard operational semantics of imperative languages… Since operations on different memory locations need not be sequentialized, we can enhance parallelism in the dataflow graph by circulating a set of access tokens, one for each variable… we assume initially that there is no aliasing — therefore, every variable denotes a unique memory location. [The token’s] arrival at a statement means that all previous memory operations on location x have completed. Therefore, a memory operation on x begins when it receives the access token for x at its input, and it propagates the token to its output when the operation is complete… By allowing independent memory operations to proceed in parallel, we are exploiting fine-grain parallelism across statements… The possibility of aliasing is significant because it means that a single memory location may be accessed by more than one name. The set of variables that may be aliased to a given variable x is called the alias class… Before a memory operation is performed on variable x, it is necessary to ensure that all memory operations on that location have completed… We implement this requirement by specifying that access tokens for all variable names in the alias class of x must be collected before a memory operation on x is initiated… Access tokens that flow across statements are dummy tokens which carry no useful value, but are used only to sequence memory operations. An important optimization is to eliminate memory operations wherever possible by passing values directly on tokens, rather than through memory locations. For variables that are not aliased, this is very easy. Load and store operations are deleted from the graph, and values are passed on tokens from definitions to uses… Eliminating storage operations for potentially aliased variables is more difficult.
However, some parallelism can be exploited even for these variables… Our access tokens enforce sequential access to memory, which is necessary only when writes are involved. Parallel access to memory can be allowed among any set of reads, even to potentially aliased variables.” “The compiler adds dependence edges, called token edges, to explicitly synchronize operations whose side-effects may not commute. Operations with memory side-effects (load, store, call, and return) all have a token input. Token edges explicitly encode data flow through memory. An operation with memory side-effects must collect tokens from all its potentially conflicting predecessors… The impact of the complexity of the memory network can be somewhat reduced by fragmenting memory in independent banks connected by separate networks… Perhaps the most important feature of our implementation is the complete absence of any global control structures. Control is completely embodied in the handshaking signals — naturally distributed within the computation. This gives our circuits a very strong datapath orientation, making them amenable to efficient layout… A memory operation can produce a token as soon as its effect is guaranteed to occur in the right order with respect to the potentially interfering operations… The LSQ holds the description of the memory operations under execution until their memory effects are completed; it may also perform dynamic disambiguation and act as a small fully-associative cache… In ASH a software program is translated directly into an executable hardware form, which has no interpretation layer. The synthesized hardware closely reflects the program structure — the definer-user relationships between program operations are translated directly into point-to-point communication links connecting arithmetic units. The resulting circuit is completely distributed, featuring no global communication or broadcast, no global register files, no associative structures, and using resource arbitration only for accessing global memory.” “Increase in the core array size has no effect on designs of a single core and the [OCN] switch. The memory system, however, is a global subsystem, and its complexity and performance are directly influenced by the number of cores… Fast communication allows exploitation of parallelism at arbitrary granularities and thus considerably reduces the complexity of extracting threads and achieving more performance… One of the mechanisms in tiled architectures to help automatic parallelization is a fast point-to-point [OCN] to communicate the scalar values between the ALUs… The instruction streams can quickly communicate over these networks, and thus the parallel streams can tolerate frequent and critical inter-thread communication making thread extraction easier… For tiled architectures, implementing global store atomicity is very expensive. As a consequence, relaxed consistency models are used to keep the memory system costs under check while providing some programming intuition… For tiled architectures, the delays involved in synchronization can be very long due to long distances traversed over memory networks thus deteriorating the critical path. Mechanisms are required to provide fast and easily programmable synchronization… For tiled architectures, the number of cores increase quadratically with feature size. As a result of this, the pressure on memory system exerted by the cores also goes up quadratically. However the perimeter only increases linearly. 
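The access-token discipline is easy to mechanize. The sketch below is not the ASH compiler, just the quoted rule applied naively: a store collects token edges from every potentially conflicting predecessor in its alias class, while loads, even to potentially aliased locations, only wait on prior stores and so remain free to run in parallel with each other.

```python
# Not the ASH compiler, just the quoted rule mechanized: a side-effecting
# operation collects token edges from all potentially conflicting predecessors
# in its alias class, while loads never wait on other loads.
from collections import defaultdict

def token_edges(ops, alias_class):
    """ops: list of (op_id, var, 'load'|'store') in program order.
    alias_class: var -> set of names that may refer to the same location
    (including var itself). Returns (producer, consumer) token edges."""
    last_store = {}                         # name -> most recent store op_id
    loads_since_store = defaultdict(list)   # name -> loads since that store
    edges = []
    for op_id, var, kind in ops:
        names = sorted(alias_class[var])
        if kind == "load":
            for v in names:                 # a load only waits on prior stores
                if v in last_store:
                    edges.append((last_store[v], op_id))
            loads_since_store[var].append(op_id)
        else:                               # a store waits on everything prior
            for v in names:
                for ld in loads_since_store[v]:
                    edges.append((ld, op_id))
                if v in last_store:
                    edges.append((last_store[v], op_id))
                loads_since_store[v] = []
            last_store[var] = op_id
    return edges

ops = [(0, "a", "load"), (1, "b", "load"), (2, "a", "store"), (3, "b", "load")]
aliases = {"a": {"a", "b"}, "b": {"a", "b"}}
print(token_edges(ops, aliases))   # [(0, 2), (1, 2), (2, 3)]: loads 0 and 1 stay parallel
```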
As a result, the number of pins on the boundary also goes up only linearly. If the memory system relies on edge I/O for communication with off-chip storage like main memory DRAMs, the gap between demand and supply widens with every generation. Flip-chip packaging based area I/O provides a better scaling and also avoids the delay involved in going to and coming back from the chip edge… WaveScalar distributes the LSQs based on the dynamic wave number… ordering of out-of-order memory requests using some kind of tag handling is expensive… Partial ordering allows for more aggressive issue of memory operations like out-of-order load execution in WaveScalar using ripples. However, ripples are more restrictive than tokens as they do not use static analysis to differentiate between operations on different addresses. Thus stores still have a total order for ripples while this is not the case for ASH tokens… Different address space segments can feature different coherent systems… Queue-based synchronization uses the blocking read semantics for ordering. The queues are distributed as they are point-to-point. Shared memory on the other hand uses a central ordering point… The synchronization can be either one-on-one, or it could be multicast or broadcast like a barrier which is becoming common in the tiled architectures to limit compilation complexity. Implementing broadcast or multicast synchronization using one-to-one synchronization could be expensive and should be avoided, unless they could be done in parallel and are capable of reusing the synchronization information. The [memoryless] synchronization is faster and more distributed and can provide finer-grained synchronization. Furthermore, they can give better performance on multicast and broadcast signals and are easier to program as well. However, dynamic synchronization is essential and thus shared memory primitives for synchronization are indispensable… [OCNs] are more scalable than load-store queues because they are inherently distributed… existing OCN mechanisms use full buses for sending signals. Instead [the dependence network] would use a single-bit bus to communicate the signal. Since the bus is very narrow, DPN can use thick wires to send signals to greater distances within a single cycle thus making it faster than using OCNs… different applications benefit from different coherent system organizations… We can support multiple such systems on a single architecture by having them in different independent address spaces. An application can [allocate] a particular data structure in a specific address space based on the data structure access properties. There is also a requirement for a fast synchronization primitive which is also distributed. We can use the DPN for this purpose as it is fast, distributed and can be explicitly accessed by the instructions.” “…good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions… We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application… For monolithic processors, instruction schedulers focus on deciding when an instruction should be fetched. For example, scheduling a load instruction early helps hide its latency. On a tiled architecture, the scheduler also decides where an instruction will execute.
For example, placing dependent instructions on the same or adjacent tiles reduces producer-to-consumer operand latency… WaveScalar’s token store is distributed across the processor, with the processing elements… Instruction schedules for tiled architectures can have both a temporal (when) and spatial (where) component… If there are multiple locations where an instruction can execute, be they ALUs, PEs, more complex tiles or register clusters, then the instruction schedule must have a spatial component. Similarly, any architecture with a program counter will require a temporal schedule.” “…some of the dependencies in the dataflow are implicitly carried by memory accesses. If two memory operations access the same location and one of them is a store, their order in the original program execution must be preserved — a dependency edge needs to be added between them… The dependencies between the generated subgraphs necessitate the insertion of communication primitives… In the case where ordering of memory accesses needs to be enforced, special tokens are sent between modules… Non-regular application kernels often contain a variety of memory access patterns, [e.g.] streaming, strided or random. General-purpose processors use caches as a best effort solution to serve all the different interminglings of these patterns in various applications. The flexibility of the FPGAs, on the other hand, allows for customization of the data access mechanism… Each independent data access interface, corresponding to one memory partition, can be supported differently according to the nature of the address stream it generates… The burst accesses make more efficient use of the memory bandwidth, as each request/response pair carries more useful data.” “The [Decoupled Systolic Architecture] encompasses many of the key features of single-chip tiled architectures… This framework is unique in that it explicitly penalizes algorithmic implementations that make use of long wires and/or large local memories while rewarding those algorithms that can efficiently execute using only a small, bounded amount of local memory and a next-neighbor interconnect network. Thus, this framework makes an excellent match for evaluating architectures faced with the growing physical concerns of wire delay and the energy dissipation of on-chip memory… Stream algorithms decouple memory access from computation, performing memory access on tiles on the periphery of the chip and performing computation in a systolic fashion on the tiles in the interior of the chip.” “The ability to operate directly on data coming from the networks allows programmers to eliminate memory operations… Decoupled access/execute architectures provide inspiration to completely separate computation from memory accesses. Data values only need to be accessed on the edges of those processors that implement the systolic array. In this manner, processors on the edges of the array are responsible for continuously accessing data and sending it to those processors performing the actual computation… Memory processors are given access to a large but slow memory, while compute processors have no access to local memory apart from a small, fast register set. 
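The contention-versus-latency balancing act can be caricatured in a few lines. This is a toy greedy placer of my own devising, not the scheduler from the paper: routing latency is Manhattan distance between PEs on a grid, contention is how many instructions a PE already holds, and the hypothetical ALPHA knob stands in for the "sweet spot" being tuned.

```python
# A toy greedy placer of my own devising (not the scheduler from the paper).
# PEs form a grid, routing latency is Manhattan distance, and ALPHA is the
# hypothetical knob standing in for the contention-latency "sweet spot".
import itertools

ALPHA = 2.0   # weight of contention relative to operand latency (assumed)

def place(dfg, grid_w, grid_h):
    """dfg: dict mapping each node to its list of producers, in topological order."""
    placement, load = {}, {}
    pes = list(itertools.product(range(grid_w), range(grid_h)))
    for node, producers in dfg.items():
        def cost(pe):
            latency = sum(abs(pe[0] - placement[p][0]) + abs(pe[1] - placement[p][1])
                          for p in producers)
            return latency + ALPHA * load.get(pe, 0)
        best = min(pes, key=cost)
        placement[node] = best
        load[best] = load.get(best, 0) + 1
    return placement

# A tiny diamond-shaped dataflow graph: the placer keeps consumers near their
# producers but spills onto neighboring PEs once a tile gets crowded.
dfg = {"ld": [], "x2": ["ld"], "x3": ["ld"], "sum": ["x2", "x3"]}
print(place(dfg, 2, 2))
```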
Therefore, all load/store operations are performed on memory processors, which send data into the systolic array implemented by the compute processors… An architecture consisting of an array of compute processors augmented with a set of memory processors on the periphery is referred to as a decoupled systolic architecture… To enable high clock frequencies on large chip areas, tiled architectures have short wires that span a fraction of the side length of a chip, and use registers to pipeline the signal propagation. Short wires, in turn, introduce a scheduling problem in space and time to cope with the propagation of signals across distances longer than those reachable via a single wire. Moving data across wires and distributing operations across processors are equally important scheduling goals… systolic arrays were proposed by [Hsiang-Tsung] Kung and [Charles] Leiserson in the late 1970s, and aimed, in part, at exploiting the speed of short wires… the significant area and energy efficiency of systolic arrays merit their reexamination in face of the architectural similarities to [modern] tiled microarchitectures… The compute processor does not include a large local memory because of the intrinsic physical constraint that large memories have larger latencies than small ones. Instead, it has only a small but fast memory in the form of a register set. This lack of local memory also serves to reduce the footprint of a compute processor and thus allows more compute processors to be placed on a single chip… The network is register-mapped, [i.e.] instructions access the network via register names, and it is a programmed routing network permitting any globally-orchestrated communication pattern on the network topology. The latter is important for stream algorithms that change patterns between the phases of the computation… each compute processor uses blocking FIFOs to connect and synchronize with neighboring processors. These FIFOs are exposed to the programmer by mapping them into register names in the instruction set. The outgoing ports are mapped to write-only registers with the semantics of a FIFO-push operation, and the incoming ports as read-only registers with the semantics of a FIFO-pop operation… it is preferable for the network to be tightly integrated with the pipelined functional units. Accordingly, bypass wires that commonly feed signals back to the operand registers also connect the individual pipeline stages to the outgoing network FIFOs. The tight network integration ensures low-latency communication between neighboring compute processors… Each of the subproblems as well as the composition of their results must be suitable for parallelization by means of a systolic algorithm such that the compute processors access data in registers and on the network only. For simple data-parallel applications, the partitioning can be obvious immediately. For applications with more complicated data dependencies, recursive formulations and partitioning methods like those developed for out-of-core algorithms can be helpful… The goal is to move the memory accesses off the critical path. To this end, one must decouple the computation such that the memory accesses occur on the memory processors and compute operations on the compute processors. For a systolic problem, the memory processors feed the input streams into the compute processors, and the decoupling procedure is almost trivial. 
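Register-mapped network FIFOs are the part I find most charming, so here is a behavioral sketch with illustrative port names (nothing like a real Raw or DSA ISA): reading the "network register" pops and blocks when empty, writing it pushes and blocks when full, and a peripheral memory processor streams operands into an interior compute processor that owns no memory at all.

```python
# Behavioral sketch of register-mapped network FIFOs: reading the "network
# register" pops (and blocks when empty), writing it pushes (and blocks when
# full, i.e. backpressure). Port names and the tiny dot-product kernel are
# illustrative, not an actual ISA.
import threading, queue

north_in  = queue.Queue(maxsize=4)   # read-only "register": FIFO-pop semantics
south_out = queue.Queue(maxsize=4)   # write-only "register": FIFO-push semantics

def memory_processor(a, b):
    # periphery tile: performs all loads and streams operands into the array
    for x, y in zip(a, b):
        north_in.put((x, y))         # blocks if the FIFO is full
    north_in.put(None)               # end-of-stream marker (illustrative)

def compute_processor():
    # interior tile: no local memory, operates directly on network operands
    acc = 0
    while (item := north_in.get()) is not None:   # blocks until data arrives
        x, y = item
        acc += x * y
    south_out.put(acc)

threading.Thread(target=memory_processor, args=([1, 2, 3], [4, 5, 6])).start()
threading.Thread(target=compute_processor).start()
print(south_out.get())               # 32
```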
However, the composition of several subproblems requires careful planning of the flow of intermediate data streams, such that the output streams of one systolic phase can become input streams of a subsequent phase without copying streams across memory processors.” “Stream commands specify long and concurrent patterns of memory access. Expressible patterns include contiguous, strided, and indirect… A stream dispatcher enforces architectural dependences between streams… DFG inputs and outputs are named ports with explicit vector widths, which serve as the inputs and outputs of data streams, facilitating communication… Access patterns for DFG ports are [FIFO] only, while access patterns for… memory can be more complex (linear, strided, repeating, indirect, scatter-gather, etc.)… Streams generally execute concurrently, but streams with the same DFG port must logically execute in program order, and streams that write from output DFG ports wait until that data is available. Also, streams from DFG outputs to DFG inputs can be used to represent inter-iteration dependences… Barrier instructions serialize the execution of certain types of commands. They include a location, either scratchpad or memory, and a direction (read or write). The semantics is that any following stream command must logically enforce the happens-before relationship between itself and any prior commands described by the barrier… Note that in the absence of barriers, all streams are allowed to execute concurrently. Therefore, if two stream-dataflow commands read and write the same scratchpad or memory address, with no barrier in between them, the semantics of that operation are undefined. The same is true between the host core memory instructions and the stream-dataflow commands. The programmer or compiler is responsible for enforcing memory dependences… The vector ports are the interface between the computations performed by the CGRA and the streams of incoming/outgoing data… The spatial nature of the CGRA avoids the overheads of accessing register files or memories for live values… The role of the stream dispatcher is to enforce resource dependences on streams (and barrier commands), and coordinate the execution of the stream-engines by sending them commands… Barrier commands block the core from issuing further stream commands until the barrier condition is resolved… Subsequent streams that have the same source or destination port must be issued in program order… The stream dispatch unit is responsible for maintaining this order, and does so by tracking vector port and stream engine status in a scoreboard. Before issuing a stream, it checks the state of these scoreboards… Stream engines… coordinate the address generation and data transfer over the lifetime of the stream, and finally notify the dispatcher when the corresponding vector ports are freed (when the stream completes)… Ready logic observes the validity of streams, and will connect to external backpressure signals (e.g. remaining space in input vector port) for determining which streams are ready to issue. A stream is ready if its destination is not full and its source has data. A selector prioritizes a ready stream. 
The state of the selected stream will be sent to an appropriate [AGU]… The affine AGU generates the minimal number of memory requests by examining the access size, stride size and number-of-strides parameters… vector ports attach to a heterogeneous set of CGRA ports, which are selected to spread incoming/outgoing values around the CGRA to minimize contention.” “Improvements can be achieved by creating specialized hardware components for generating addresses and packing and unpacking data items… Using the information about data streams the compiler uses library functions to define a data stream and programs an entry in the [AGU] that will generate the sequence of addresses that correspond to the consecutive array accesses. The compiler also determines… the placement of the data in the FPGA’s external memories, so that it can generate the base addresses for each of the data streams. In addition the compiler [maps] the data to and from the host processor on the target FPGA’s external memories.” “Multiple on-chip address generators and scatter-gather engines make efficient use of DRAM bandwidth by supporting a large number of outstanding memory requests, memory coalescing, and burst mode for dense accesses… Parallel patterns are an extension to traditional functional programming which capture parallelizable computation on… data collections along with corresponding memory access patterns… Plasticine makes a distinction between the operations involved in memory addresses calculation and the core computation underlying applications… Each [AGU] contains a reconfigurable scalar datapath to generate DRAM requests… To enforce data dependencies, we use tokens, which are feed-forward pulse signals routed through the control network.” “The CoRAM architecture assumes… that memory interfaces for loading from and storing to a linear address space will exist at the boundaries of the reconfigurable logic… the CoRAM architecture defines a portable application environment that enforces a separation of concerns between computation and on-chip memory management… Control threads form a distributed collection of logical, asynchronous [FSMs] for mediating data transfers between CoRAMs and the edge memory interface… A control thread maintains local state to facilitate its activities and will typically issue address requests to the edge memory interface on behalf of the application; upon completion, the control thread informs the user logic through channels when the data within specific CoRAMs are ready to be accessed through their locally-addressed SRAM interfaces. 
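An affine address generator is small enough to sketch outright. The parameter names below are mine; the point is only that a <base, access size, stride, number of strides> descriptor expands into element addresses, and contiguous runs coalesce into a minimal number of wide requests, which is where the bandwidth win comes from.

```python
# Sketch of an affine address generator (parameter names are mine). A
# <base, access_size, stride, num_strides> descriptor expands into element
# addresses; contiguous runs are then coalesced into wide requests, which is
# roughly what generating the minimal number of memory requests amounts to
# when the stride equals the access size.
def affine_addresses(base, access_size, stride, num_strides):
    return [base + i * stride for i in range(num_strides)]

def coalesce(addresses, access_size, max_request=64):
    requests, start, length = [], addresses[0], access_size
    for addr in addresses[1:]:
        contiguous = (addr == start + length)
        if contiguous and length + access_size <= max_request:
            length += access_size
        else:
            requests.append((start, length))
            start, length = addr, access_size
    requests.append((start, length))
    return requests

dense  = affine_addresses(base=0x1000, access_size=4, stride=4,  num_strides=16)
sparse = affine_addresses(base=0x2000, access_size=4, stride=64, num_strides=4)
print(coalesce(dense, 4))    # one 64-byte burst: [(4096, 64)]
print(coalesce(sparse, 4))   # four separate 4-byte requests
```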
Conversely, the user logic can also write its computational results into CoRAMs and issue ‘commands’ to the control threads via channels to write the results to edge memory… To express the memory access requirements of an FPGA application, control threads can only invoke a predefined set of memory and communication primitives called control actions… In general, a control thread issues control actions along a dynamic sequence that can include cycles and conditional paths… The control actions… have the appearance of a memory management API, and abstract away the details of the underlying hardware support — similar to the role served by the [ISA] between software and evolving hardware implementations… the basic set of control actions defined can be used to compose more sophisticated memory ‘personalities’ such as scratchpads, caches, and FIFOs — each of which are tailored to the memory patterns and desired interfaces of specific applications… Control threads can employ more advanced control actions to either increase [MLP] or to customize data partitioning. The non-blocking control actions… explicitly allow for concurrent memory operations and return a tag that must be monitored using ᴄʜᴇᴄᴋ ᴄᴏʀᴀᴍ ᴅᴏɴᴇ. A control thread can also invoke ᴄᴏʀᴀᴍ ᴍᴇᴍʙᴀʀ, which serializes on all previously executed non-blocking control actions by that control thread… Control programs can be realized by direct synthesis into reconfigurable logic… or can be compiled and executed on dedicated multithreaded microcontrollers (e.g. a multithreaded RISC pipeline). Control units must also maintain internal queues and scoreboarding logic to track multiple outstanding control actions, and to allow querying of internal state… A memory personality ‘wraps’ CoRAMs and control threads within reconfigurable logic to provide an even higher level of abstraction (i.e. interface and memory semantics) best suited to the application at hand… control threads gave us a general method for dynamically orchestrating the memory management of applications, but did not force us to over-specify the sequencing details at the RTL level.” “For efficiency, accelerators use idiomatic ISAs, which encode coarse-grain, stateful patterns of program behavior for memory, control, and parallelism (e.g. a 2D memory access pattern)… Our target accelerator paradigm is ‘decoupled-spatial.’ Decoupled indicates how computation and memory operations are executed on different specialized hardware components. Spatial indicates how the low-level details of the execution (e.g. instruction placement, operand routing) are exposed in the ISA. In this model, programs are represented as a decoupled dataflow graph… Common memory access patterns are represented as first-class primitives in the ISA for lean and high-throughput memory request generation pipelines (called stream engines)… Fence instructions synchronize with the accelerator at the boundary of program phases. Streams after the fence are stalled until all prior streams are completed… The [Architecture Description Graph] combines topology, capabilities, and parameters into a unified, graph representation that defines the accelerator ISA. It includes nodes for PEs, switches, memories/address generators, and ports for synchronization. 
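A control thread is easiest to see as a few lines of code. The sketch below keeps the control-action names quoted above (ᴄʜᴇᴄᴋ ᴄᴏʀᴀᴍ ᴅᴏɴᴇ and ᴄᴏʀᴀᴍ ᴍᴇᴍʙᴀʀ, rendered as identifiers), but everything around them (the non-blocking read action, the toy edge memory, the channel) is invented for illustration and is not the CoRAM API.

```python
# Toy rendering of a CoRAM-style control thread. check_coram_done and
# coram_membar follow the control-action names quoted above; coram_read_nb,
# the toy edge memory, and the channel are invented for this sketch.
import queue, threading

edge_memory = {addr: addr * 2 for addr in range(0, 1024, 4)}   # stand-in DRAM
corams = {0: [None] * 64}                                      # embedded SRAM block 0
channel = queue.Queue()                                        # control thread -> user logic

_pending = {}
def coram_read_nb(coram_id, coram_addr, mem_addr):
    """Non-blocking control action: start a transfer, return a tag."""
    tag = len(_pending)
    _pending[tag] = (coram_id, coram_addr, mem_addr)
    return tag

def check_coram_done(tag):
    """Complete (instantly, in this toy) and retire one outstanding transfer."""
    coram_id, coram_addr, mem_addr = _pending.pop(tag)
    corams[coram_id][coram_addr] = edge_memory[mem_addr]
    return True

def coram_membar():
    """Serialize on all outstanding non-blocking control actions."""
    for tag in list(_pending):
        check_coram_done(tag)

def control_thread(base, n):
    # issue a batch of concurrent transfers for memory-level parallelism,
    # then tell the user logic over a channel that the CoRAM is ready
    tags = [coram_read_nb(0, i, base + 4 * i) for i in range(n)]
    coram_membar()
    channel.put(("ready", 0, n))

def user_logic():
    _, coram_id, n = channel.get()          # block until data is staged
    return sum(corams[coram_id][:n])

threading.Thread(target=control_thread, args=(0x100, 8)).start()
print(user_logic())    # sum of the 8 words staged in from edge memory
```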
The ADG is not only a hardware specification for RTL generation, but also a capability abstraction to the compiler… The annotated code is analyzed and transformed into a decoupled dataflow intermediate representation… Using slicing, operands that are transitively dependent on address operands of memory instructions are considered part of address generation, and remaining instructions are computation.” “DeSC… enables store-to-load forwarding within a decoupled scenario… The DeSC compiler… splits the given source code into a communication slice and a computation slice… The communication slice is responsible for all loads; it supplies required data to the computation slice… The communication slice is also responsible for all address calculations for both loads and stores.” “Mapping applications to decoupled-spatial hardware involves… decoupling data access from computation operations, and mapping memory access to corresponding units… mapping computation onto processing elements, and communication to the on-chip network… the program is represented as a decoupled dataflow graph, where memory accesses are represented as coarse-grain streams, and computations are dataflow graphs… This decoupled dataflow form eliminates the implicit memory ordering constraints of the original program — disambiguated memory streams are simpler to map to decoupled memories… Each memory engine generates memory requests… and processing elements operate on data as it arrives… Wide memory access can feed many PEs. Instruction dispatch overhead is amortized by spatial fabric configuration… The basic execution model of a memory is that it arbitrates access from concurrent coarse-grained memory patterns, which we refer to as streams. The relative ordering of streams can be synchronized with barriers… DSAGEN does not provide support for maintaining strict sequential memory access semantics (only barrier-wise synchronization of memory access). Therefore, the compiler and/or programmer is responsible for maintaining correctness of memory ordering… Improving the configuration time can aid performance of short program regions. Configuration time is dominated by the longest configuration path.” “Mapping dataflow graphs onto reconfigurable architecture is known as spatial scheduling… an SPU’s program consists of a dataflow graph language describing the computation… along with a control program which contains the commands for streams… the stream code executes on the control core… In turn, the controller will deliver/receive data to/from the [CGRA] compute unit, which executes the dataflow.” “Maps attempts to schedule the memory accesses for maximum parallelism and speed while enforcing proper dependence… shrinking technology makes the speed of basic logic faster and faster, while improvement in wire speed lags behind… These forces impose increasing relative penalties on complex logic and global wires. Consequently, locality and simplicity become the key principles of good hardware design. Locality suggests that processing resources be fully distributed with [nearest-neighbor] connectivity and the elimination of global buses. 
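Decoupling by hand makes the split concrete. Below, a tiny gather kernel is pulled apart into a communication slice that owns every address calculation, load, and store, and a computation slice that only ever sees a stream of operand values. This is purely illustrative and makes no claim about how the DeSC compiler actually slices code.

```python
# Hand-decoupled gather kernel: the communication slice does all address
# calculation, loads, and stores; the computation slice sees only values.
# Illustrative only; not the DeSC compiler's output.
import threading, queue

idx = [3, 0, 2]                  # indirection vector (gather)
a   = [10, 20, 30, 40]
out = [0, 0, 0]

supply = queue.Queue(maxsize=8)  # communication slice -> computation slice
store  = queue.Queue(maxsize=8)  # computed values back to the store side

def communication_slice():
    for i in range(len(idx)):
        supply.put(a[idx[i]])    # address calculation + load live here
    for i in range(len(idx)):
        out[i] = store.get()     # stores also issue from this side

def computation_slice():
    for _ in range(len(idx)):
        store.put(supply.get() * 2 + 1)   # no loads, no address math

t = threading.Thread(target=communication_slice); t.start()
computation_slice(); t.join()
print(out)    # [81, 21, 61]
```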
Simplicity suggests that complex hardware be eliminated from a processor… To support sequential programs, traditional processors use complicated hardware to provide features like memory reorder buffers and mechanisms to extract instruction-level parallelism, which violates the simplicity principle of hardware design… For locality, a Raw architecture distributes not only the functional units and registers but memory as well, and it couples them with a simple point-to-point network… potentially dependent memory accesses which may be spread across the processing tiles must be performed in correct sequential program order. The traditional approach to enforcing dependence on parallelized programs is to synchronize through the memory system using hardware primitives such as locking… In terms of execution cost, the mechanism is heavyweight, which limits the type of programs it can profitably execute to those with access patterns that require few synchronizations… The Maps solution attempts to minimize the synchronization and sequentialization necessary for correct program behavior, and it strives to allow as many parallel accesses as possible.” “Many of the design decisions of Raw were motivated by the increasing importance of wire length in VLSI design. Thus, the tile is made as small as possible, so that the distance between two tiles is minimized. This allows the clock speed to be increased… it is clear that there must be some cooperation between two processes writing to the same address. If this writing is non-deterministic, then so is the final value written… Even in the most ideal shared memory model, there is a need for synchronization. This is done via having control objects in shared memory that are accessed in proper ways, thus acting as locks, semaphores, or other primitives.” “A number of [architectures] exchange the monolithic VLIW or superscalar processor for a distributed, decentralized array of independent, replicated computing elements. These elements are connected by a network with short, local wires that connect only near neighbors. We refer to these architectures as decentralized architectures… A key feature of these architectures is that they have properties that scale very well asymptotically with respect to the number of transistors… Decentralized architectures have a number of asymptotic advantages: First, the micro-architectural design and verification cost of these decentralized architectures is essentially constant with respect to the number of transistors. When the designers gain access to more transistors, they simply put down more computing elements. Second, the wire length of these designs is also constant in size with respect to the number of transistors. As a result, the wire delay and routing congestion of the design, a major problem for monolithic architectures, does not change with the number of transistors. Third, decentralized architectures… offer very precise, geometric clock boundaries rather than a large three-dimensional rat’s nest of wires and transistors… a compiler system that is capable of obtaining scalable ILP performance by performing both spatial instruction partitioning as well as traditional temporal instruction scheduling… demonstrates that for a large class of applications, scalable ILP performance is obtainable… The computing elements are the part of the architecture that are responsible for performing actual computation. 
These elements can look like in-order processors, small-scale out-of-order processors, FPGA logic, clusters of ALUs, or even single functional units. The ‘network’ is the entire system that is responsible for routing values in a correct fashion (i.e. respecting program dependences) between the computing elements. In a traditional five-stage pipeline, the ‘network’ would consist of the bypassing logic and the bypass paths… To execute a computation on a network of processing elements, one needs to find an assignment from the nodes of the dataflow graph to the nodes of the network of processing elements, and route the intermediate values between these PEs. This assignment of operations can be performed at runtime or compile-time… The ILP network must be implemented as a distributed process… there can be no global wires, global repositories for information, serial operations, unified look-up tables, logic which incorporates state from all of the processing elements, [and so on]. The view that needs to be taken is that of a number of processing elements, running independent state-machines that are isolated from their non-neighbors. A time cost needs to be assigned for any transference of information between non-neighbors. The ILP network must provide operation/operand matching… in order to compute, we need the operands and the operation to meet at some point in space to perform the computation… Run-time assignment is especially attractive because it effectively performs run-time load balancing of processing elements, which is very useful for adjusting to variable latencies in the computation… In run-time operation ordering, the ordering of instructions on a given processing element is determined by the arrival of the operands… Run-time ordering helps tolerate memory access latency and other events with unpredictable timing… The ILP network must be able to tolerate timing variations… computations often have timing characteristics that can be unpredictable at compile time. Their timing or cache behavior may vary with the input dataset, or synchronization with an external device (like an interrupt) may change the timing behavior of a given processing element. If the ILP network cannot tolerate timing variation, it will either fail or severely reduce the sorts of computations that can be performed… Cross-tile instruction scheduling provides a total ordering of computations on each processing node while taking into account the effects of inter-node communication latencies.” “The bypass paths and multi-ported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs… We label operand transport interconnects and the algorithms that manage them, whether they are centralized or distributed, scalar operand networks. Specifically, a [SON] is the set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph. These mechanisms include the physical interconnection network… as well as the operation-operand matching system (hardware or software) that coordinates these values into a coherent computation… SONs can be designed to have short wire lengths and to scale with increasing transistor counts. 
Furthermore, because they can be designed around generalized transport networks, scalar operand networks can potentially provide transport for other forms of data including I/O streams, cache misses, and synchronization signals… A non-pipelined processor with a register file and ALU contains a simple specialized form of an SON… [In] a pipelined, bypassed processor with multiple ALUs… the SON includes many more multiplexers, pipeline registers, and bypass paths, and it begins to look much like our traditional notion of a network… increasing the number of [FUs] necessitates more live values in the SON, distributed at increasingly greater distances. This requirement in turn increases the number of physical registers, the number of register file ports, the number of bypass paths, and the diameter of the ALU-register file execution core. These increases make it progressively more difficult to build larger, high-frequency SONs that employ centralized register files as operand interconnect. One solution is to partition and distribute the interconnect and the resources it connects… Mapping ILP to architectures with distributed [SONs] is not as straightforward as with early, centralized architectures… To execute an ILP computation on a distributed-resource microprocessor containing an SON, we must first find an assignment from the nodes of the dataflow graph to the nodes in the network of ALUs. Then, we need to route the intermediate values between these ALUs. Finally, we must make sure that operands and their corresponding operations are correctly matched at the destinations… for those with a multiprocessor or network background, the challenges of frequency scalability, bandwidth scalability, and deadlock may seem particularly familiar. On the other hand, efficient operation-operand matching and the challenge of exceptional conditions may seem more familiar to those with a microprocessor background… If we want to build larger structures and still maintain high frequencies, there is no option except to pipeline the circuits and turn the propagation delay into pipeline latency… Tiling simplifies the task of reasoning about and implementing frequency scalable systems. A system is scaled up by increasing the number of tiles, rather than increasing the size of the tiles… Once interconnect delay becomes significant, high-frequency systems must be designed out of components that operate with only partial knowledge of what the rest of the system is doing. In other words, the architecture needs to be implemented as a distributed process. If a component depends on information that is not generated by a neighboring component, the architecture needs to assign a time cost for the transfer of this information. Non-local information includes the outputs of physically remote ALUs, stall signals, branch mispredicts, exceptions, and the existence of memory dependencies. [The] trade-off between parallelism and locality is becoming increasingly important. On one hand, we want to spread virtual objects — such as cached values, operands, and instructions — as far out as possible in order to maximize the quantities of parallel resources that can be leveraged. On the other hand, we want to minimize communication latency by placing communicating objects close together, especially if they are on the critical path… Because SONs require low-occupancy, low-latency sends to implement operation-operand matching, Raw employs a register-mapped interface to the transport networks. 
Thus, the RAW ISA dedicates instruction encoding space in each instruction… In fact, a single 32-bit Raw ISA encoded instruction allows up to two destinations for each output, permitting operands to be simultaneously injected into the transport network and retained locally. Furthermore, to reduce send latency, the Raw processor uses a special inverse-bypass system [that] pulls operands from the local bypass network of the processor and into the output FIFOs as soon as they are ready, rather than just at the writeback stage or through the register file… Register-mapped input FIFOs are used to provide… unit-latency receives. One cycle of receive latency is incurred because the receive FIFOs are scheduled and accessed in the dispatch stage of the processor. This cycle of receive latency could be eliminated as with a superscalar if the valid bits are routed one cycle ahead of the data bits in the network… Raw’s use of a zero-cycle bypass for routing operands that a compute processor both produces and uses locally greatly reduces the average cost of operation-operand matching… RawStatic transmits data merely by having instructions target a register-mapped output FIFO… The distance of a component from the bypass networks is an inverse measure of the timing criticality of the component. Components that have ample timing slack can afford to be placed further away and suffer greater wire delay… the static network paths were among the least critical, while the single-cycle ALU, control, and dynamic network components were most critical… The mapping of code to RAW tiles includes the following tasks: assigning instructions to tiles, scheduling the instructions on each tile, and managing the delivery of operands. To make intelligent instruction assignment and scheduling decisions, RAWCC models the communication costs of the target network accurately. RAWCC’s compilation strategy seeks to minimize the latency of operand transport on critical paths of the computation. Accordingly, it performs operation assignment in a way that gravitates sections of those critical paths to a single node so that that node’s zero-cycle local bypass network can be used, minimizing latency… traditional communication mechanisms are inefficient for scalar operand delivery. Operand delivery via traditional message passing has high occupancy costs, while operand delivery via shared memory has high latency costs… Partitioned architectures distribute ALUs and register files over [SONs] that must somehow account for communication latency. Even though the microarchitectures are distributed, ILP can be exploited on these SONs because their latencies and occupancies are extremely low.” “Modern computer systems extract parallelism from problems at two extremes of granularity: [ILP] and coarse-thread parallelism. VLIW and superscalar processors exploit ILP with a grain size of a single instruction, while multiprocessors extract parallelism from coarse threads with a granularity of many thousands of instructions. The parallelism available at these two extremes is limited. The ILP in applications is restricted by control flow and data dependencies, and the hardware in superscalar designs is not scalable. Both the instruction scheduling logic and the register file of a superscalar grow quadratically as the number of execution units is increased. 
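It helps to put the occupancy/latency trade-off into arithmetic. The numbers below are illustrative assumptions, not measurements from any of these machines; what matters is the shape of the sum (send occupancy, send latency, per-hop network latency, receive latency, receive occupancy) and how differently the mechanisms load its terms.

```python
# Back-of-the-envelope operand-delivery model. All numbers are illustrative
# assumptions, not measurements: the cost of moving one scalar from producer
# ALU to consumer ALU is send occupancy + send latency + per-hop network
# latency + receive latency + receive occupancy.
def operand_delivery(send_occ, send_lat, hops, per_hop, recv_lat, recv_occ):
    return send_occ + send_lat + hops * per_hop + recv_lat + recv_occ

mechanisms = {
    # same tile: zero-cycle local bypass
    "same tile (bypass)":      operand_delivery(0, 0, 0, 1, 0, 0),
    # register-mapped network FIFO, a few hops, one cycle of receive latency
    "register-mapped, 3 hops": operand_delivery(0, 1, 3, 1, 1, 0),
    # software message passing: cycles burned composing and demultiplexing
    "message passing, 3 hops": operand_delivery(15, 1, 3, 1, 1, 15),
    # shared memory: a store, then a load that misses to a shared cache
    "shared memory":           operand_delivery(1, 30, 0, 0, 30, 1),
}
for name, cycles in mechanisms.items():
    print(f"{name:>26}: {cycles} cycles")
```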
For multicomputers, there is limited coarse-thread parallelism at small problem sizes and in many applications… Fine-threads close the parallelism gap between the single instruction granularity of ILP and the thousand instruction granularity of coarse threads by extracting parallelism with a granularity of 50–1000 instructions. This parallelism is orthogonal and complementary to coarse-thread parallelism and ILP. Programs can be accelerated using coarse threads to extract parallelism from outer loops and large coroutines, fine-threads to extract parallelism from inner loops and small sub-computations, and ILP to extract parallelism from subexpressions. As they extract parallelism from different portions of a program, coarse-threads, fine-threads, and ILP work synergistically to provide multiplicative speedup… Fine-threads are appropriate for execution across multiple processors at a single node of a parallel computer where the interaction latencies are on the order of a few cycles… Low-overhead mechanisms for communication and synchronization are required to exploit fine-grain [TLP]. The cost to initiate a task, pass it arguments, synchronize with its completion, and return results must be small compared to the work accomplished by the task… The [MAP] chip provides three on-chip processors and methods for quickly communicating and synchronizing among them. A thread executing on one processor can directly write to a register on another processor. Threads synchronize by blocking on a register that is the target of a remote write or by executing a fast barrier instruction… The MAP’s integrated mechanisms are orders of magnitude faster than those that operate via global memory… The [MAP] chip… is intended to exploit parallelism at a spectrum of grain sizes, from [ILP] to coarser-grained multi-node parallelism… A synchronization pipeline stage holds instructions from each thread until they are ready to issue and decides on a cycle-by-cycle basis which thread will use the execution unit… this enables fast, zero-overhead interleaving on each cluster… In coarse-grained multiprocessors, communication between threads is exposed to the application as memory references or messages, both of which require many cycles to transmit data from one chip to another. In the MAP chip, threads on separate clusters may communicate either through the shared memory, or through registers. Since the data need not leave the chip to be transferred from one thread to another, communication is fast and well-suited to fine-grain threads… Register-register transfers are extremely fast… The result of any arithmetic operation may be sent directly to a remote register… Since the size of the register file limits the storage for communicated values, register communication is particularly suited to passing small amounts of data quickly, such as transferring signals, arguments, and return values between threads. One drawback is that register communication requires an additional synchronization between the consumer and the producer to prevent values in the destination cluster from being illegally overwritten… The MAP chip allows synchronization through memory, registers, and a hardware barrier instruction, each of which has different costs and benefits… Using register-register communication fuses synchronization with data transfer in a single operation… The simplest synchronization mechanism implemented by the MAP is the cluster barrier instruction ᴄʙᴀʀ. 
The cluster barrier instruction stalls a thread’s execution until the threads on the other two clusters have reached a ᴄʙᴀʀ instruction. Threads waiting for cluster barriers do not spin or consume any execution resources. The ᴄʙᴀʀ instruction is implemented using six global wires per thread to indicate whether a ᴄʙᴀʀ has been reached, and whether it has been issued. Six wires per thread are necessary in order to guarantee that successive barriers stay in synchrony across all three clusters… Not all synchronization can be easily expressed using a producer-consumer model. A barrier can be used to conglomerate several synchronizations into a single action. Fast barriers reduce the overhead of using parallelism, which is vital if the parallelism to be extracted has short task execution times between synchronizations… there is considerable fine-grain thread parallelism in typical applications… register-based communication and synchronization provides sufficiently low overhead to exploit this parallelism efficiently… Conventional multiprocessor mechanisms with long interaction latencies are unable to exploit fine threads at all. The coarse-thread parallelism that can be exploited in multiprocessors has a granularity of 10³ to 10⁵ instructions and is strongly dependent on problem size… Exploiting fine-grain thread parallelism using register-based mechanisms is also well-matched to the wire-limited nature of [modern] semiconductor processes… Structuring a future microprocessor as a number of superscalar processors that communicate and synchronize via registers keeps most communication local to individual processors. Global communication is made explicit in the processor microarchitecture… partitioning will be even more useful as communication and synchronization can be pipelined to permit scaling to large numbers of on-chip processors… Reducing the synchronization and communication costs between parallel tasks will enable fine-grain parallelization of programs, and allow existing problems, such as personal or business applications, to be solved faster without scaling their size.” “Threads running on different on-chip processors communicate by writing into each other’s registers. They synchronize by blocking on a register that is the target of a remote write or by executing a fast barrier instruction… Due to reduced transistor channel lengths, transistor switching delays are decreasing… However, scaling is less kind to interconnect as the RC wire delay is doubling every generation. As wires get smaller, their cross-sectional area decreases, resulting in higher wire resistance. In addition, the wires are closer together, contributing to higher coupling capacitance between wires in the same layer. Lower resistance wires such as copper, in conjunction with lower dielectric insulators would reduce the RC product at each generation, but would merely delay the inevitable impact on long wires… On-chip communication latency is no longer uniform and independent of distance… Today’s microprocessor architectures, however, are on an evolutionary path that requires global communication… Superscalar microprocessors can be characterized as having both global control and global data. Centralized instruction issue logic examines an instruction window to determine which instructions can be executed simultaneously. The concurrent instructions are then delivered to the execution units which may be distributed throughout the chip. 
In an attempt to reduce the complexity of the issue logic, a global register file is typically used to hold the data for all of the execution units. As the number of execution units increases, the wire lengths between centralized components (the issue logic and the register file) and the distributed execution units will increase… Other factors will also limit the scalability of superscalar processors. The complexity of the issue logic is proportional to the product of the instruction window size and the number of execution units. In addition, the size of the register file grows quadratically with the number of register file ports, which can become large with increasing numbers of execution units… In a VLIW, a programmer or compiler discovers [ILP] and statically schedules the code across a series of execution units. The register files can be partitioned and placed close to an execution unit with the compiler scheduling the communication among them. The static scheduling and distributed data solves some of the constraints of superscalar architectures. However, a VLIW still has only a single program counter, and branch targets must be broadcast to all execution units… Incorporating multiple processors on a chip that can communicate efficiently with one another via processor registers is well-suited to these emerging technological constraints. By doing this, the MIT MAP chip keeps all control and most communication local to each processor, enhancing locality and diminishing the effects of slow wires. Global communication is abstracted out of the local processor design and can be optimized or pipelined independently… Since future chips will have more silicon area and include more execution units, a much larger demand for parallelism is emerging… The architecture of the MAP chip mitigates the effect of the long wire delays present in current and future silicon process technologies by partitioning the execution units into independent processor clusters… so that all local wires can easily be driven in a small fraction of the clock cycle… The computation power in the MAP chip can be scaled in two dimensions… In a shared memory machine, data synchronization is enforced using spin locks in memory. For control synchronization, the MAP chip implements a barrier instruction (ᴄʙᴀʀ) which enables threads on all three processors to synchronize with only one cycle of overhead… In many ways, the MAP chip’s mechanisms are similar to those found in a dataflow machine. The Monsoon dataflow computer was built to process dataflow tokens, which typically consist of a single instruction. When a token fires, it computes a result and transmits it to other tokens that are waiting to use that result as an operand. When the operand arrives, it can immediately enable a token for execution. Data is delivered directly from one instruction to another on different processing elements, and individual instructions stall until all of their operands arrive. The MAP chip’s register communication and synchronization are analogous to the communication between computation tokens in Monsoon. 
Using these mechanisms, the MAP chip can provide extremely fast interprocessor communication to conventional programs and allow them to exploit parallelism at a finer granularity than ever before… the nature of parallelism that can be exploited is dictated largely by the overhead to communicate and synchronize among the parallel components… flexible interleaving allows the MAP to exploit [TLP] and to mask variable pipeline, memory, and communication delays… Because memory operations issued by threads on separate clusters may complete out-of-order, synchronization through registers, a cluster barrier, or on a memory location is necessary to coordinate memory communication between clusters… The compiler must insert explicit register-based synchronization operations or use the cluster barrier instruction to enforce instruction ordering between [horizontal threads]… The cluster barrier instruction (ᴄʙᴀʀ)… requires global wiring to connect all of the processor clusters that are synchronizing… The latency of the barrier depends on how many processors are involved and how far away they are… With the high degrees of concurrency that will be available, a better memory system organization may eliminate cache hierarchies and instead use high-density DRAM for the primary storage at each processor. Data communication would then be explicit between processors using register-register or register-memory transfers… Large monolithic cache structures will be unacceptably slow due to long propagation delays in the word and bit lines. Furthermore… large wire latencies and memory access delays between the execution units and storage at the other end of the cache hierarchy will render memory communication useless for fine-grain parallelism… The pipeline uses a scoreboard in the register read stage to track data dependencies and enable data synchronization on register writes between clusters… Control synchronization across the clusters is implemented with a cluster barrier instruction… To enforce memory ordering and detect completion of memory operations, the MAP implements a memory barrier (mbar) instruction which stalls until all outstanding memory references for a thread have finished… The MAP chip implements fine-grain interleaving of instructions based on data availability… Thus, the MAP chip can tolerate both long memory latencies and short instruction latencies, while still executing at full capacity even if only one thread is present. The term zero-cycle multithreading indicates that no dynamic overhead is incurred to switch among threads when programs are executing in the pipeline… Arithmetic units on one cluster can write directly into a register file of another cluster. The data is also delivered to the remote [synchronize] and EX stages using bypass paths so that a remote arithmetic unit can use the data immediately. A cluster barrier ᴄʙᴀʀ instruction can synchronize the pipelines in all of the clusters. Because these mechanisms are integrated tightly into the arithmetic pipelines, they can be used with very little overhead… Synchronization between threads on different clusters is supported in the MAP chip using a cluster barrier (ᴄʙᴀʀ) instruction. The ᴄʙᴀʀ acts as a gate such that no thread can proceed past the barrier until all threads have reached it… Instructions from each thread wait in the reservation stations of the SZ stage until all operands are present. 
The SZ stage examines the instructions from each thread and selects one to issue based on data availability, thread priority, and arbitration. The MAP also uses valid bits in a register scoreboard and in pipeline registers to unify nearly all instruction synchronization through data dependence, eliminating additional pipeline interlocks. The scoreboard tracks the data from local arithmetic operations, memory operations, and remote register writes. Inter-cluster synchronization takes place through data transfer or by using a cluster barrier instruction which is implemented in the SZ pipeline stage… The MAP chip is designed to reduce the overhead of on-chip thread management, providing hardware support for thread communication, synchronization, and invocation. Communication between clusters takes place via direct register-register transfers which require only one cycle to deliver the data. A global cluster barrier (ᴄʙᴀʀ) instruction enables a barrier synchronization across clusters, with only one cycle of overhead… Register-register transfers are extremely fast, requiring only one more cycle to write to a remote register than to a local register. Since the size of the register file limits the storage for communicated values, register communication is particularly suited to passing small amounts of data quickly, such as transferring signals, arguments, and return values between threads… The MAP chip allows on-chip threads to synchronize through memory, registers, and a hardware barrier instruction, while threads on separate MAP chips synchronize using messages… The MAP chip uses full/empty bits in a register scoreboard to determine when values in registers are valid. When an operation issues, it marks the scoreboard for its destination register invalid. When the operation completes, it writes its result to the destination register and marks the scoreboard valid. Any operation that attempts to use the register while it is empty will stall until the register is valid. To reduce the amount of interaction between physically distant clusters, an operation that writes to a remote cluster does not mark its destination register invalid. Instead, the consumer must execute an explicit empty instruction to invalidate the destination register prior to receiving any data. When the data arrives from a remote register write, the scoreboard is marked valid and any operation waiting on the register is allowed to issue. Using register-register communication fuses synchronization with data transfer in a single operation and allows the consumer to stall rather than spin. The simplest synchronization mechanism implemented by the MAP is the cluster barrier instruction ᴄʙᴀʀ. The ᴄʙᴀʀ instruction stalls a thread’s execution until the threads on the other two clusters have also reached a barrier. Threads waiting for cluster barriers do not spin or consume any execution resources, so other threads can use the execution units instead… Not all synchronization can be easily expressed using a producer-consumer model. A barrier can be used to conglomerate several synchronizations into a single action. Fast barriers reduce the overhead of using parallelism, which is vital if the parallelism to be extracted has short task execution times between synchronizations. Four implementations of barriers across three clusters are examined: memory, register, condition-code, and ᴄʙᴀʀ… The ᴄʙᴀʀ instruction is the fastest and can complete a barrier every cycle. 
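Before the barrier comparison continues, the full/empty discipline deserves a few lines of illustration. This is a toy model of mine, not the MAP hardware, and the cycle numbers are invented: the consumer explicitly empties the register, and the instruction that needs it simply waits in the scoreboard until the remote write marks it full again — synchronization and data transfer fused into one event.

```c
#include <stdio.h>
#include <stdbool.h>

/* One register with a full/empty bit in the consumer's scoreboard. */
typedef struct { int value; bool full; } sync_reg_t;

int main(void)
{
    sync_reg_t r5 = { 0, true };          /* holds some stale local value */
    const int producer_write_cycle = 5;   /* when the remote write arrives */
    const int producer_value = 42;

    /* Consumer executes "empty r5" before expecting remote data, because
     * (as in the quote) the remote writer does not invalidate it. */
    r5.full = false;

    for (int cycle = 0; cycle < 8; cycle++) {
        /* The arriving remote write marks the register full again. */
        if (cycle == producer_write_cycle) {
            r5.value = producer_value;
            r5.full = true;
        }
        /* A consumer instruction that reads r5 waits in the SZ stage
         * until the operand is marked valid -- it stalls, it never spins. */
        if (!r5.full) {
            printf("cycle %d: consumer stalled on empty r5\n", cycle);
        } else {
            printf("cycle %d: consumer issues with r5 = %d\n", cycle, r5.value);
            break;
        }
    }
    return 0;
}
```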
The register and condition code barriers are similar, with the condition-code barrier being one cycle faster since only one write is necessary to communicate with both neighbors. The memory barrier requires 61 cycles, even with all accesses hitting in the cache. For each thread, approximately 20 cycles are needed for the control overhead of testing the barrier counter, while the remaining cycles are consumed contending for the on-chip cache and waiting for the other threads to arrive at the barrier. In order to exploit fine-grain parallelism with task lengths in the 10s of cycles, long-latency memory-based barriers cannot be used… The primary factor in this increase is a larger delay to the memory shared by the processors in the group. A global barrier across all of the on-chip processors requires global communication. This simple model shows that the cost to execute a barrier using memory for communication between processors is already quite expensive and will become even more costly in future technologies. At least 850 cycles will be required to synchronize 80 processors in a 0.1 μm process. The increase is due both to longer global wire delays and to more processors participating in the barrier. More direct mechanisms, such as a cluster barrier instruction that can synchronize directly between processors, require far less time to complete a barrier… A VLIW processor has a single program counter and a single instruction stream which issues in lock-step across the execution units. The data dependencies are enforced by static knowledge of all of the instruction latencies. If the latencies change, a given program must be rescheduled in order to run correctly. In the MAP chip, the register scoreboard enforces the data dependencies between operations. Therefore, changing the memory or instruction execution latencies has no effect on the correctness of the program… In strictly statically scheduled VLIW machines, no hardware interlocks are used to enforce data or control dependencies. The compiler schedules the instructions, taking into account all of the hardware latencies at compile time. Unpredictable latencies, such as those associated with a cache memory system, are difficult to statically schedule… The MAP chip automatically overlaps latencies of all types by allowing its independent processing clusters to slip relative to one another. Synchronization is not implicit and lock-step, but instead is performed only when needed… Because the thread communication and synchronization mechanisms are implemented primarily by augmenting the existing cluster to memory communication paths, their cost is small… Scaling existing superscalar and VLIW microarchitectures or increasing monolithic on-chip cache or memory arrays in future technologies is not feasible.
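For contrast, the memory barrier that costs 61 cycles on three clusters is essentially the following, written out-of-universe with C11 atomics and pthreads (my sketch, with invented thread and episode counts, not the thesis's code): arrive by decrementing a shared counter, then spin on a shared flag. Every iteration of that spin is a load, a compare, and a branch fighting over a shared cache line, which is where the cycles go.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 3   /* one per cluster, to mirror the MAP comparison */

/* Centralized sense-reversing barrier kept in ordinary shared memory. */
typedef struct {
    atomic_int  count;   /* threads that have not yet arrived */
    atomic_bool sense;   /* flipped once per barrier episode  */
} mem_barrier_t;

static mem_barrier_t bar = { NTHREADS, false };

static void mem_barrier_wait(bool *local_sense)
{
    bool my_sense = *local_sense;
    if (atomic_fetch_sub(&bar.count, 1) == 1) {
        /* Last arrival: reset the counter, then release the spinners. */
        atomic_store(&bar.count, NTHREADS);
        atomic_store(&bar.sense, my_sense);
    } else {
        /* Each pass through this loop is a load, a compare, and a branch
         * contending for the shared line -- the control overhead the
         * thesis charges roughly 20 cycles per thread for, before even
         * counting cache contention and waiting for late arrivals. */
        while (atomic_load(&bar.sense) != my_sense)
            ;
    }
    *local_sense = !my_sense;   /* flip for the next episode */
}

static void *worker(void *arg)
{
    long id = (long)arg;
    bool local_sense = true;
    for (int episode = 0; episode < 3; episode++) {
        mem_barrier_wait(&local_sense);
        printf("thread %ld passed barrier %d\n", id, episode);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

A ᴄʙᴀʀ, by comparison, is one instruction that stalls in the issue stage and consumes nothing while it waits.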
Even extending a traditional shared multi-level cache memory hierarchy for multiple on-chip processors will be inappropriate as the latencies between processors and remote reactive caches will be too great… As the number of transistors that can be fabricated on a single chip increases, so will the number of arithmetic execution units. However, the changing balance between gate delay and wire delay will require decentralization of control so that the execution units will be partitioned into independent processors. The processors must be small enough so that all local wires are short, to allow a processor to run at the full clock rate… Processors will communicate by writing into remote registers or FIFOs that can be read locally by a destination processor… Without partitioning a chip into independent processors and providing fast integrated communication and synchronization mechanisms, the large numbers of transistors in future process generations will not be used effectively to increase overall system performance… the communication latency between on-chip processors increases with the physical distance between them. The granularity of parallelism that can be exploited will depend on the distance between processors to which parallel tasks are assigned. Tasks mapped to nearby processors will be able to execute very fine-grain parallelism, while tasks mapped to distant processors will execute at a coarser grain… Reducing the synchronization and communication costs between parallel tasks will enable fine-grain parallelization of programs, and allow existing problems, such as personal or business applications, to be solved faster without scaling their size.” “A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler… Because superscalars require significant hardware resources to support parallel instruction execution, architects for these machines face an uncomfortable dilemma. On the one hand, faster machines require additional hardware resources for both computation and discovery of ILP. On the other hand, these resources often have quadratic area complexity, quadratic connectivity, and global wiring requirements which can be satisfied only at the cost of cycle time degradation. VLIW machines address some of these problems by moving the cycle-time elongating task of discovering ILP from hardware to software, but they still suffer scalability problems due to issue bandwidth, multi-ported register files, caches, and wire delays. Up to now, commercial microprocessors have faithfully preserved their monolithic images. As pressure from all sources demands computers to be bigger and more powerful, this image will be difficult to maintain… The result of the evolution toward scalability is a machine with a distributed register file interconnected via a scalable network. In the spirit of NUMA machines… we call such machines [Non-Uniform Register Access] machines… Because a NURA machine exploits ILP of a single instruction stream, its interconnect must provide latencies that are much lower than that on a multiprocessor… The distributed nature of the computational and storage elements on a NURA machine means that locality should be considered when assigning instructions to functional units. 
Instruction scheduling becomes a spatial problem as well as a temporal problem… The tile is kept simple and devoid of complex hardware resources in order to maximize the clock rate and the number of tiles that can fit on a chip. Raw’s network is tightly integrated with the processor to provide fast, register-level communication… A word of data travels across one tile in one clock cycle… Accesses to communication ports have blocking semantics that provide near-neighbor flow control; a processor or switch stalls if it is executing an instruction that attempts to access an empty input port or a full output port. This specification ensures correctness in the presence of timing variations introduced by dynamic events such as dynamic memory references and I/O operations, and it obviates the lock-step synchronization of program counters required by many statically scheduled machines… Because the register set is distributed along with the functional units and memory ports, the number of registers and register ports scales linearly with total machine size. Each tile’s individual register set, however, has only a relatively small number of registers and register ports, so the complexity of the register file will not become an impediment to increasing the clock rate. Additionally, because all physical registers are architecturally visible, the compiler can use all of them to minimize the number of register spills… Data assignment and instruction assignment are implemented to allow flow of information in both directions, thus reflecting the interdependent nature of the two assignment problems.” “The conventional ILP architectures have complicated, non-scalable structures (multi-ported or rotating register files, speculation buffers, deferred exception mechanisms, pools of ALUs) that are designed to wrest small degrees of parallelism out of the most twisty code… The Raw architecture provides a… scalable, parallel interface which allows the application to make direct use of every square millimeter of silicon and every I/O pin… The Raw design philosophy argues for an array of replicated tiles, connected by a low-latency, high-throughput, pipelined network. This design has three key implementation benefits, relative to existing superscalar and VLIW processors. First, the wires are short. Wire length has become a growing concern in the VLSI community, now that it takes several cycles for a signal to cross the chip. This is not only because the transistors are shrinking, and die sizes are getting bigger, but because the wires are not scaling with the successive die shrinks, due to capacitive and resistive effects. The luxurious abstraction that the delay through a combinational circuit is merely the sum of its functional components no longer holds. As a result, the chip designer must now worry about both congestion and timing when placing and routing a circuit. Raw’s short wires make for an easy design. Second, Raw is physically scalable. This means that all of the underlying hardware structures are scalable. All components in the chip are of constant size, and do not grow as the architecture is adapted to utilize larger and larger transistor budgets. Future generations of a Raw architecture merely use more tiles without negatively impacting the cycle time… Finally, Raw has low design and verification complexity… Raw offers constant complexity, which does not grow with transistor budget.
Unlike… superscalars and VLIWs, Raw does not require a redesign in order to accommodate configurations with more or fewer processing resources. A Raw designer need only design the smaller region of a single tile, and replicate it across the entire die. The benefit is that the designer can concentrate all of one’s resources on tweaking and testing a single tile, resulting in clock speeds higher than that of monolithic processors… We left in some FPGA-like logic to support fine-grained applications… Each Raw tile is connected to its nearest neighbors through a series of separate, pipelined channels — one or more channels in each direction for each neighbor… These channels are pipelined so that no wire requires more than one cycle to traverse. This means that the Raw network can be physically scaled to larger numbers of tiles without reducing the clock rate, because the wire lengths and capacitances do not change with the number of tiles. The alternative, large common buses, will encounter scalability problems as the number of tiles connected to those buses increases… The topology of the pipelined network which connects the Raw tiles is a 2D mesh. This makes for an efficient compilation target because the two-dimensional logical topology matches that of the physical topology of the tiles. The delay between tiles is then strictly a linear function of the Manhattan distances of the tiles. This topology also allows us to build a Raw chip by merely replicating a series of identical tiles… Because of our desire to minimize the latency from tile to tile, we decided that the… network interface should be directly attached to the processor pipeline. An alternative would have been to have explicit ᴍᴏᴠᴇ instructions which accessed the network ports. Instead, we wanted a single instruction to be able to read a value from the network, operate on it, and write it out in the same cycle… It is interesting to note that the logic for this procedure is exactly the same as for the standard bypass logic, except that the priorities are reversed… It was important that these interfaces… provide the minimal possible latency to the network so as to support as fine-grained parallelism as possible… Just as we provide a simple interface for the compiler to exploit the gobs of silicon resources, we also have a simple interface for the compiler to exploit and program the gobs of pins available… The interface… scales with the number of pins, and works even though pin counts are not growing as fast as logic density… I/O is a first-class software-exposed architectural entity on Raw… The compiler, given the pin image of the chip, schedules the… communication on the chip such that it maximizes the utilization of the ports that exist on the particular Raw chip… in order to attain the pincounts that we desire… we have to use area I/O… The Raw compiler essentially partitions the memory objects of a program across the Raw tiles. Each tile owns a fraction of the total memory space… Raw could support both spatial and temporal multitasking… Ideally, we would be able to expose all of the networks of the peripheral tiles to the pins, so that they could connect to an identical neighbor chip, creating the image of a larger Raw chip. Just as we tiled Raw tiles, we will tile Raw chips… However, some of the network hops may have an extra cycle of latency.” “…the abstraction of a wire as an instantaneous connection between transistors has shaped assumptions and architectural designs. 
In an interesting twist, just as the clock frequency of processors has risen exponentially over the years, the fraction of the chip that is reachable by a signal in a single clock cycle has decreased exponentially. Thus, the idealized wire abstraction is becoming less and less representative of reality. Architects now need to explicitly account for wire delay in their designs… We designed Raw to use a scalable ISA that provides a parallel software interface to the gate, wire, and pin resources of a chip. An architecture with direct, first-class analogs to all of these physical resources lets programmers extract the maximum amount of performance and energy efficiency in the face of wire delay. In effect, we try to minimize the ISA gap by exposing the underlying physical resources as architectural entities… Future Raw processors will have hundreds or perhaps thousands of tiles… the length of the longest wire in the system is no greater than the length or width of a tile. This property ensures high clock speeds, and the continued scalability of the architecture… The Raw ISA exposes [full-duplex] on-chip networks to the software, enabling the programmer or compiler to directly program the wiring resources of the processor and to carefully orchestrate the transfer of data values between the computational portions of the tiles — much like the routing in a full-custom [ASIC]. Effectively, the wire delay manifests itself to the user as network hops… Memory-intensive domains can have up to 14 dedicated interfaces to DRAM… This exposure makes Raw scalable. Creating subsequent, more powerful, generations of the processor is straightforward; we simply stamp out as many tiles and I/O ports as the silicon die and package allow. The design has no centralized resources, global buses, or structures that get larger as the tile or pin count increases. Finally, wire length, design complexity, and verification complexity are all independent of transistor count… The Raw operating system allows both space and time multiplexing of processes… When the operating system context-switches in a given process, it finds a contiguous region of tiles that corresponds to the dimension of the process, and resumes the execution of the physical threads… Continuous or real-time applications can be locked down and will not be context-switched out… We assign operations to tiles in a manner that minimizes congestion and then configure the network routes between these operations. This is very much like the process of designing a customized hardware circuit… The most aggressive processor networks are register-mapped and do not require a special send or receive command; instructions can target the networks just as easily as registers. Our design takes network integration one step further: the networks are not only register mapped but also integrated directly into the bypass paths of the processor pipeline. This makes the network ports truly first-class citizens in the architecture… In early processor designs, the register file was the central communication mechanism between functional units. Starting with the first pipelined processors, the bypass network became largely responsible for the communication of active values, and the register file evolved into a dumping ground or checkpointing facility for inactive values. The Raw networks… extend this trend and are in essence 2D bypass networks serving as bridges between the bypass networks of separate tiles. 
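The blocking-port semantics are small enough to model directly. Here is a cycle-stepped toy of mine — not Raw's actual switch, and the values are invented — of a one-word channel register between two neighboring tiles: the producer stalls while it is full, the consumer stalls while it is empty, and a hop costs one cycle.

```c
#include <stdio.h>
#include <stdbool.h>

/* One-word, one-cycle pipelined channel between neighboring tiles.
   full == true means a word is sitting in the channel register. */
typedef struct { int data; bool full; } chan_t;

int main(void)
{
    chan_t east = { 0, false };
    int produced = 0, consumed = 0, sum = 0;

    for (int cycle = 0; cycle < 16; cycle++) {
        /* Consumer tile: pops the word the producer wrote in an earlier
         * cycle (one tile hop = one cycle of latency); otherwise stalls. */
        bool consumer_stalled = !east.full;
        if (!consumer_stalled) {
            sum += east.data;
            east.full = false;
            consumed++;
        }
        /* Producer tile: pushes i*i; blocked while the channel register
         * is still occupied (near-neighbor flow control). */
        bool producer_stalled = east.full;
        if (!producer_stalled && produced < 8) {
            east.data = produced * produced;
            east.full = true;
            produced++;
        }
        printf("cycle %2d: producer %s, consumer %s, sum=%d\n",
               cycle,
               producer_stalled ? "stall" : "send ",
               consumer_stalled ? "stall" : "recv ",
               sum);
    }
    return 0;
}
```

No program counters are kept in lock step anywhere in this picture; the flow control alone keeps the two sides honest, which is the point of the quote.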
The low latency of in-order, inter-tile ALU-to-ALU operand delivery distinguishes Raw from previous systolic or message-passing systems… We can imagine computation becoming inexpensive enough to dedicate entire tiles to prefetching, gathering profile data from neighbor tiles, translating… and dynamically optimizing instructions, or even to simulating traditional hardware structures like video RAMDACs.” “Our general philosophy is to build an architecture based on replicating a simple tile… The distributed architecture also allows multiple high-bandwidth paths to external… DRAM — as many as packaging technology will permit… The amount of memory is chosen to roughly balance the areas devoted to processing and memory… Unlike [out-of-order] superscalars, a Raw processor does not bind specialized logic structures such as register renaming logic or dynamic instruction issue logic into hardware. Instead, the focus is on keeping each tile small to maximize the number of tiles that can fit on a chip, increasing the chip’s achievable clock speed and the amount of parallelism it can exploit… No signal in a Raw processor travels more than a single tile width within a clock cycle… Because the interconnect allows inter-tile communication to occur at nearly the same speed as a register read, compilers can schedule single-word data transfers and exploit ILP, in addition to coarser forms of parallelism… The placement algorithm… is a variant of a VLSI cell-placement algorithm… a systolic structure that maintains all the true value dependencies of [certain loops] allows independent memory references to proceed in parallel. Because the compiler knows that accesses to the A, B, and C arrays are independent, it can create a six-tile systolic structure… Providing hardware for dynamic dependency checking is less important because a Raw compiler can provide memory dependence information to the hardware… [ULSI], faster switching speeds, and growing compiler sophistication will allow a Raw machine’s performance-to-cost ratio to surpass that of traditional architectures for… general-purpose workloads. Because Raw architectures are field-programmable, we further speculate that they will become a cost-effective alternative to custom hardware in many situations. They thus offer a universal solution for both general- and special-purpose applications.” “Integrating fine-grained communication into a conventional microprocessor model allows a broad range of fine-grain computational algorithms (such as dataflow and systolic paradigms, including wavefront) to be run on a general-purpose parallel computer.” “Buffers… consume significant energy/power: dynamic energy when read/written and static energy even when they are not occupied… having buffers increases the complexity of the network design because logic needs to be implemented to place packets into and out of buffers… buffers can consume significant chip area… The basic idea of ‘bufferless routing’ is to always route a packet (or a flit) to an output port regardless of whether or not that output port results in the lowest distance to the destination of the packet. In other words, packets are deflected or ‘misrouted’ by the router to a different output port if an output port that reduces the distance to the destination node is not available. Bufferless routing has also been called ‘hot potato’ routing in network theory, alluding to the scenario that the router immediately needs to pass the potato (i.e. 
the packet) on to some other router as the potato is too hot to keep (i.e. buffer)… The reduced router latency can enable bufferless routing algorithms to outperform baseline buffered algorithms. At first thought, eliminating buffers in [OCNs] might appear audacious, since it will result in an increase in average packet latencies and a decrease in achievable network throughput compared to buffered routing schemes. Moreover, there could be other issues such as livelocks. Nonetheless, the approach could be suitable for [OCNs]. Intuitively, bufferless deflection routing works well when network utilization is low. A packet is deflected only if a collision occurs in a router, i.e. if multiple packets arrive at a router at the same time, and if not all of these packets can be sent in a productive direction. If only few packets are in the network simultaneously, the number of collisions is low. Hence, most packets progress quickly and can be routed to their destination without being frequently deflected. For larger traffic volumes, the fundamental effect of removing buffers is a reduction of the total available bandwidth in the network. In a buffered network, a packet waits idle in some buffer until it can be routed in a productive direction and therefore does not unnecessarily consume link bandwidth while it is buffered. In contrast, in a bufferless network all packets always consume link bandwidth because, in effect, links act as ‘buffers’ for the packets. Therefore, beyond a certain packet injection rate into the network, bufferless routing algorithms will fail, while good buffered routing algorithms can still perform well… Since in bufferless routing routers cannot store packets in transit, all packets that arrive at a router must immediately be forwarded to an adjacent router… Buffered routing algorithms employ virtual channels to improve buffer performance and flow control mechanisms (e.g. credit-based flow control) to control buffer management in virtual channels. Since bufferless routing eliminates buffers, there are no virtual channels and there is no need to manage buffer allocation/deallocation. This reduces complexity in the router and can enable router latency reductions… removing buffers from routers can result in significant area savings.” “Exposing details of the processor datapath to the programmer is motivated by improvements in the energy efficiency and the simplification of the microarchitecture. However, an instruction format that can control the data path in a more explicit manner requires more expressiveness when compared to an instruction format that implements more of the control logic in the processor hardware and presents conventional general-purpose register based instructions to the programmer. That is, programs for exposed datapath processors might require additional instruction memory bits to be fetched, which consumes additional energy… Their assembly programming is tedious and error-prone due to the additional details exposed to the programmer — their feasible use requires a compiler that supports a higher-level programming language than the assembly… It is usual that the register file is bypassed (or ‘forwarded’) in the processor pipeline to alleviate stalls resulting from data hazards. In case an instruction is reading a result produced by a previous instruction in the pipeline, without bypassing the instruction would need to wait until the result was written to the RF only after which it could read it to execute its operation. 
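Backing up to the hot-potato quote for a moment: the entire routing decision of a bufferless router fits in a dozen lines. This is my own toy rendering of the idea, not the algorithm from the paper; flit ages, ejection at the destination, and livelock avoidance are all waved away.

```c
#include <stdbool.h>
#include <stdio.h>

enum { NORTH, SOUTH, EAST, WEST, NPORTS };
static const char *port_name[NPORTS] = { "N", "S", "E", "W" };

typedef struct { int dst_x, dst_y; } flit_t;

/* A "productive" port: one that reduces the Manhattan distance to the
 * destination.  (Ejection at the destination router is omitted.) */
static int productive_port(flit_t f, int x, int y)
{
    if (f.dst_x > x) return EAST;
    if (f.dst_x < x) return WEST;
    if (f.dst_y > y) return NORTH;
    return SOUTH;
}

/* One bufferless routing decision: every flit that arrived this cycle
 * must leave this cycle.  Flits are assumed oldest-first, so older
 * flits get their productive port and younger colliders are deflected
 * to whatever port is still free (at most NPORTS flits per cycle). */
static void route_cycle(const flit_t *in, int n, int x, int y, int out[])
{
    bool taken[NPORTS] = { false };
    for (int i = 0; i < n; i++) {
        int want = productive_port(in[i], x, y);
        if (taken[want])                      /* collision: deflect */
            for (want = 0; taken[want]; want++)
                ;
        out[i] = want;
        taken[want] = true;                   /* one flit per link per cycle */
    }
}

int main(void)
{
    /* Two flits collide at router (2,2): both want to go EAST. */
    flit_t in[2] = { { 5, 2 }, { 4, 2 } };
    int out[2];
    route_cycle(in, 2, 2, 2, out);
    printf("flit 0 -> %s, flit 1 -> %s\n", port_name[out[0]], port_name[out[1]]);
    return 0;
}
```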
With bypassing, the result from a previous instruction can be forwarded to the input of the consuming function unit before it is written to the RF. However, the bypass paths add complexity to the routing and the decoding logic of the processor datapath… Software bypassing reduces the pressure on the number of required general-purpose registers and the number of RF accesses… The operand registers enable explicit operand forwarding between function units, bypassing the larger general purpose register file… the vast majority of temporary values are short lived, which can be interpreted that they are used once or maybe twice in the program by instructions close to the producer, which logically would speak against the additional complexity of multiple operand or result registers… Data transport programming provides additional programmer freedom on top of the software controlled bypasses: the possibility to control the timing of the function unit operand and result data transfers. This helps reducing the RF pressure further as there is often slack in the programs: it is not necessary to move the results to the next consumer or the register file immediately when they are ready… another interesting optimization… was operand sharing. The idea is that as the function unit input ports are visible to the programmer and can have storage of their own, sometimes the previously used operand value can be reused in the successive computation. That is, the operand data that is the same for the next operation needs not to be moved again to the input port of the function unit… the optimization saves energy thanks to the reduced register file reads and avoiding routing the operand again to the function unit input… The [data transport programmed MAXQ] processor implements the instruction fetch, decode and execution… in a single pipeline stage. That is, the core is optimized for branch-heavy code, and the transport programming helps in simplifying the processor structure in order to provide the single-cycle instruction pipeline to reduce branch penalties… Some architectures provide a programmer interface that looks like the microprograms in microcode-based processor implementations. They allow controlling the datapath at the level of multiplexer select signals and expose the finest granularity function unit opcodes to the programmer. While data transport programmed architectures expose the movements of data between registers in the processor, the microprogrammed architectures expose even more details such as the pipeline register enable signals and, in general, have very little decoding, if at all, done to the instruction bits when they are converted to control signals… Explicit data transfers mean the ability to temporally control the data movements between datapath registers… due to the ability to shrink the register file architecture without performance loss, the instruction encodings of the exposed datapath variations (especially transport programmed) also get smaller.
This means that sometimes the additional programmer detail can, in fact, lead to improved code density, not always worsening like generally assumed earlier.” “Central to the idea of transport triggering is to have more control about what is going on within a central processing unit… In order to get more control, the data transport has to be separated from the operations… An interesting feature of a TTA is the fact that bypassing (or forwarding) of data is programmed, instead of relying on expensive associative bypassing hardware… FUs and the data transport network are completely separated. Pipelining decisions for transport network and FUs can be taken independently… Use of the transport triggering concept enables many compiler optimizations at the data transport level.” “As TTA processors typically have function unit input registers, it allows operands to be written to function units earlier than the clock cycle where the operation begins execution… Reducing the number of register accesses directly saves energy as the register file access can consume considerable amount of energy, but also indirectly, as reducing both register file reads and writes allow creating processors with data paths that have less register file read and write ports, which are more power-efficient… with ordinary processors without software bypassing, the updated result values are written to the register file on every loop iteration. TTAs allow omitting the write to the register file in the loop body and storing the final value to the result register only in the epilogue of the loop, if the intermediate values are bypassed across loop iterations… This saves power and also frees up processor resources such as buses, register file read ports and instruction word bits to be used for other operations within the loop.” “Since the compiler is in control of every data transport, many transports can be optimized away… Because many values produced during the course of a program do not need to be allocated in general purpose registers, the register file traffic rate reduces… FU logic can be split into independent parts, used for different functionality, without a large impact on the interconnection network and register file… different functionality (like arithmetic and logical operations) does not have to be combined into one FU (e.g. an ALU); it is easier to have separate FUs for non-related operations… The performance is enhanced because multiple operations can be started in a single cycle, resulting in a better FU utilization… the costs of FU splitting are marginal, it requires only extra sockets…” “A functional unit is designed as an integral unit and can serve a single instruction at a time… there may be some part of a functional unit which is not at all required during execution of an instruction, even though such hardware cannot be assigned for another instruction during that execution. Such unutilized hardware could be utilized if the functional units are split in smaller independent units… the split… arithmetic unit AU and logic unit LU can execute two instructions simultaneously.” “By inspecting the programmer’s model, one could conclude that there is a conventional instruction fetch-decode unit that loads an instruction, decodes it, and then activates certain elements of the CPU. That, however, would be a misconception. 
What sets the MAXQ architecture apart from other, more conventional CPUs is the transfer-triggered nature of [its] core… Transfer-triggering is a technique that allows a simple move instruction to perform every function available in the CPU.” “The first requisite in generating a high-efficiency machine is to maximize utilization of the clock cycles… All MAXQ instructions execute in a single clock cycle except long jump/long call and certain extended register accesses. While many RISC microcontrollers claim to support single-cycle execution, this often applies to a small subset of instructions or addressing modes. With the MAXQ, single-cycle execution is the norm… the MAXQ architecture achieves increased clock-cycle utilization because it does not require an instruction pipeline (common to many RISC microcontrollers) to achieve single-cycle operation. The MAXQ instruction decode and execution hardware is so simple (and timing so fast) that these operations are moved into the same clock cycle as the program fetch itself, with minimal impact to the maximum operating frequency… When a program branch occurs, [a generic RISC CPU that executes from a pipeline] uses one or more clock cycles (depending upon pipeline depth) to divert program fetching to the target branch address and discards the instructions already fetched. Clearly, using clock cycles to discard instructions, versus executing them, is wasteful and undesirable as it reduces performance and increases power consumption… the clocks stolen by the CPU to reload the pipeline are an artifact of the architecture and are unavoidable. The MAXQ architecture distinguishes itself… by offering single-cycle execution without an instruction pipeline (and the wasted clock cycles that accompany it)… The MAXQ instruction word is unique because there is only one instruction in the classical sense, the move instruction. The source and destination operands for the move instruction are the basis for creating instructions and memory accesses, and triggering hardware operations… The instruction word contains modularly grouped source and destination operands, which allow simple and fast instruction-decoding hardware and limit signal switching for those modules not involved in the transfer, thus reducing dynamic power consumption… The instruction word uses its full [width] to specify source and destination operands, producing an abundant address space for physical registers, indirect memory access, and hardware-triggered operations. Ultimately, coupling the abundant source/destination address space with minimal restrictions on source-destination combinations gives rise to a highly orthogonal machine.” “The… the scheduler must manage the computation, flow, and storage of operands across the array to effectively map applications onto CGRAs… Traditional CGRAs do not handle [non-loop or outer loop code] in an effective manner as they have no mechanisms to accelerate dataflow graphs that are narrow and sequential. In fact, the majority of the resources sit idle in such situations… The [basic idea of operation fusion] is to dynamically configure the existing processing elements of a CGRA into small acyclic subgraph accelerators. Each cycle, any FU can be fused with [nearest] neighbors to create an accelerator capable of executing a small computation subgraph in a single cycle. In essence, small configurable compute accelerators are realized on the array to accelerate sequential code. 
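Since both the transport-triggering and MAXQ quotes insist that everything is a move, it is worth seeing what such a "program" actually looks like. This is a toy interpreter of mine — the address map and the single adder are invented, not anyone's real ISA — in which writing the adder's trigger port is what starts the operation.

```c
#include <stdio.h>

/* A toy transport-triggered machine: the only instruction is a move.
 * Writing the adder's trigger port fires the addition. */
enum {
    R0, R1, R2, R3,        /* general-purpose registers       */
    ADD_OPERAND,           /* adder operand port (staging)    */
    ADD_TRIGGER,           /* writing here fires the adder    */
    ADD_RESULT,            /* adder result port               */
    NADDR
};

typedef struct { int src, dst; } move_t;

static int space[NADDR];

static void do_move(move_t m)
{
    int value = space[m.src];
    space[m.dst] = value;
    if (m.dst == ADD_TRIGGER)             /* the transport triggers the FU */
        space[ADD_RESULT] = space[ADD_OPERAND] + value;
}

int main(void)
{
    space[R0] = 40;
    space[R1] = 2;

    /* r2 = r0 + r1, written as pure data transports. */
    move_t program[] = {
        { R0, ADD_OPERAND },   /* stage the first operand        */
        { R1, ADD_TRIGGER },   /* second operand; fires the add  */
        { ADD_RESULT, R2 },    /* move the result where needed   */
    };
    for (unsigned i = 0; i < sizeof program / sizeof program[0]; i++)
        do_move(program[i]);

    printf("r2 = %d\n", space[R2]);       /* prints 42 */
    return 0;
}
```

The last transport could just as well have targeted another function unit's operand port instead of a register, which is exactly the software bypassing and operand sharing described a few paragraphs up.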
The necessary hardware extensions for a conventional CGRA are quite simple — an inter-FU bypass network is added between neighboring FUs in the array using a few multiplexers… our approach is to exploit the slack cycle time to accomplish more work in a single clock cycle when the critical timing paths are not exercised through the CGRA. In this manner, multiple arithmetic operations can be ‘chained’ together when the critical timing paths are not exercised to accomplish more work in a single cycle… The critical path of a CGRA usually consists of register file read, longest execution in a FU, and write back to register file… While register file access is required for every operation in conventional architectures, CGRAs have distributed interconnect across the array that can directly transfer operands between FUs. When an operation is executed without a register file access through the interconnect, it does not fully utilize the clock period and there is significant slack left… Since multiple operations can be mapped into a single cycle, we need a smaller time unit than the traditional clock cycle used by compiler schedulers. We propose [the] tick, a small time unit based on the actual hardware delay information. The unit delay of one tick is set by the actual latency of the smallest logic component, normally a small multiplexer… Every logic component on CGRAs has their own tick information and the information is used for dynamic operation fusion scheduling.” “Dynamic timing slack… refers to the portion of the clock period that remains after all the signals have propagated through logic paths… When data waits on the latch/flip-flop inputs well in advance of the clock edge, this interval can be viewed as wasted time… DTS [is] mostly limited by the critical paths in the execution units — the most prominent being the carry chain within ALUs and effective address generators. From the carry chain, these paths typically extend to condition generation logic and register bypass networks… most carry chain stress is triggered by code patterns that a compiler has full control over… by elevating timing models into the compiler, we can create additional DTS by squeezing operations at a sub-cycle resolution… there are deep connections between the DTS and the workload… much of the function of the compiler is structuring code and reorganizing the program to maximize the capabilities of the underlying hardware. Given knowledge of what are critical path stressing operations, the compiler is well-equipped to generate code that creates more dynamic timing slack by substituting code that places less stress on critical paths… As the DTS-aware code generation… avoids the cases that are more likely to exercise critical paths, the instruction sub-cycle latencies become more homogeneous… In a deep, well-balanced pipeline, there are critical paths in every stage.” “In some cases, it may be possible to replace a multi-instruction sequence with an equivalent set of instructions (i.e. that produce an equivalent result), but place… much less stress on critical path circuits in the ALU and can tolerate significantly higher clock frequencies.” “The… explicit issue of an instruction can be avoided whether its data-driven execution does not stall the pipeline, i.e. resource conflicts do not arise. 
In this case, we say that the instruction is implicitly issued, its execution being driven by the decoding of a preceding instruction it depends on, through a [definition-use] relation on a specific register… Starting from the observation that the decode stage of the pipeline is not the most critical, we exploit the difference between clock period and critical path of the decode stage to early compute a simple instruction (addition or subtraction), and reserve the execution stage to implicitly execute a depending one… Our schema… removes the triggered instruction from the program code by issuing it implicitly. This leads to reduced pressure on the instruction cache and earlier execution of the implicitly issued instruction, so that merged instruction pairs gain one latency cycle… Implicit issue is simply a way to remove the need for the instruction fetch and decode stages of the triggered instruction.” “Exploiting fine-grain parallelism is a critical part of exploiting all of the parallelism available in a given program, particularly since highly irregular forms of parallelism are often not visible at coarser levels and since the use of low-level parallelism has a multiplicative effect on the overall performance… Spatial parallelism may be achieved by executing several operations in parallel on different functional units. Temporal parallelism means performing several operations concurrently on the same functional unit, taking advantage of its pipelined structure… In order to exploit these advanced horizontal architectures and to extract the maximum amount of parallelism from ordinary programs by hand, the user has to understand and take into account all of the target machine’s peculiar characteristics. These characteristics may include features such as several pipelined functional units each working on different streams of data, having different latencies and sharing information. Ignoring any of this type of issues may severely restrict the amount of possible speed-up. However, taking them into account, and using them to advantage, is an intricate and difficult task.” “A fully bypassed processor allows an FU to read its inputs from the outputs of any FU from any of the pipeline stages subsequent to execution. The cost of implementing a full bypass network is significant, both in terms of area and wire delay… As the number of FUs increases or as pipeline depth grows, bypass cost increases substantially… many of the bypass paths are underutilized or not used at all for any specific application… The difficulty with a partial versus full bypass network is that operation latencies are no longer fixed. Rather, they are a function of the specific placement of the producing and consuming operations… An intelligent scheduler not only must decide on when to execute each operation, it must decide which FU to execute it on and consider the effects of using that bypass path. Scheduling two dependent operations on FUs without a bypass path between them can dramatically change the schedule length. 
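A toy of mine to make the partial-bypass point concrete (the two function units and their one missing forwarding path are invented): the same four-operation dependence chain takes four cycles or six depending only on where the scheduler puts it.

```c
#include <stdio.h>

/* Two FUs with a partial bypass network: fu0 -> fu1 has a forwarding
 * path, fu1 -> fu0 does not (that value must go through the RF). */
#define NFU 2
static const int fwd_latency[NFU][NFU] = {
    /* from fu0 */ { 1, 1 },   /* to fu0: 1,          to fu1: 1 (bypass) */
    /* from fu1 */ { 2, 1 },   /* to fu0: 2 (via RF), to fu1: 1          */
};

/* A four-operation dependence chain; placement[i] is the FU it runs on. */
static int chain_length(const int placement[4])
{
    int cycles = 0;
    for (int i = 1; i < 4; i++)
        cycles += fwd_latency[placement[i - 1]][placement[i]];
    return cycles + 1;                  /* plus the last op's own cycle */
}

int main(void)
{
    int good[4] = { 0, 1, 1, 1 };       /* stays on the bypass paths        */
    int bad[4]  = { 1, 0, 1, 0 };       /* ping-pongs across the missing one */
    printf("bypass-aware placement: %d cycles\n", chain_length(good));
    printf("oblivious placement:    %d cycles\n", chain_length(bad));
    return 0;
}
```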
By the same token, placing operations that have significant slack or no consumers on the FUs with the most important bypass paths can unnecessarily constrain the schedule… The more compact schedule was achieved by making effective use of the available bypass paths to reduce critical inter-operation communication latency… The key is that producer-consumer operations on the critical path are placed to enable low-latency communication through the bypass paths… Conversely, non-critical operations often do not require the bypass paths… so the scheduler needs to ensure they do not unnecessarily utilize the bypass paths, thereby precluding critical operations… Critical dependence chains need to be carefully placed onto FUs so that communication latency is minimized… FLASH performs well on configurations that have irregular latencies at a fine grain level.” “One-bit state network was integrated into XPP-III and DySER to control the [ALU] execution and synchronization… the predicate computation and the predicated operation can be executed in parallel… In other words, parallel condition requires executing two instructions that have control dependence in one cycle, which they actually have potential to achieve with no penalty because the predicate computation is usually faster than the predicated operation.” “One type of hazard that is particularly detrimental to [ILP] machines consists of execution interlocks… An attempt to execute the interlock in a machine cycle could result in an increase in the cycle time of the machine and unnecessarily penalize all instruction executions… We consider primarily instructions that require ALU operations, as the hardware required for the implementation of such units is the most complex in fixed-point units by virtue of the sheer amount of function they must provide. Such units operate with both two’s complement numbers and unsigned numbers, perform arithmetic, logical, and register transfer operations with some of the instructions (but not all) setting condition codes and/or causing overflows, and so on… In order for an implementation to eliminate interlocks between instructions and to execute such instructions in parallel… it is required to ‘collapse’ the interlocks with the incorporation of multiple, and multi-operand execution units… In the device, there is a synergy between the logic blocks that allows the entire function to be performed.” “This parallel execution is achieved by collapsing the execution interlocks between integer ALU operations as well as between address generation operations, but without increasing the cycle time of the base implementation… the most frequent interlocking instructions have simple operations that can be collapsed, e.g., arithmetic operations, logical operations (except shifts), register transfer operations, address generation and branch outcome determinations… True data dependencies can be collapsed into a single-cycle operation in many cases resulting in an increase in parallelism without increasing the cycle time of the base machine.” “Not all instructions impact the execution time of the program. 
Those that do constitute the critical path of execution, and are typically only a fraction of all instructions… Performance is optimized when there is not one critical timing path, but rather a large number of nearly equal critical timing paths.” “Collapsing dataflow subgraphs will compress the latency along critical paths and reduces the number of intermediate results stored in the register file… the flexibility of an FPGA is generally more than is necessary for dataflow graph acceleration.” “Minimum latency through critical path code often requires dependent instructions execute on subsequent clock cycles. Forwarding path delays, however, do not scale with technology and modern CPUs already spend as much time bypassing the ALU result as computing it… In the end, the scalability of modern architectures is hampered by the communication between dependent instructions, not the actual computation… Transient operands, produced values with only one consumer… need not be committed to the architectural state of the machine. These values often connect critical dependent instructions; in other words, this producer-consumer communication is on the critical path of the application… A strand is a string of integer ALU instructions that are joined by transient operands (thus have no fan-out)… The execution target for strands are closed-loop ALUs, which are normal single-cycle integer ALUs with the addition of a self-forwarding mode. In this mode, output values are sent directly back to the inputs of that ALU and not written to the result bus… As modern processors spend half of the execution cycle on ALU execution and half on full bypass, an ALU spinning on its own results can compute two internal values per cycle… linear chains of dependent instructions are common in integer application code, requiring unnecessary communication traffic within issue and bypass.” “…no additional inputs to the complex input multiplexers are required. Only one set of pass-gates and a few input buffers permit this behavior. Self-bypass is one of the reasons why transience must be guaranteed by the static detection: any intermediate values are lost upon usage and are thus unavailable for any later consumers. When a strand is issued to an ALU in closed-loop mode, it is provided with all necessary inputs and op-codes. It then spins on its internal results and produces a single output. During this time, the ALU is busy and not available for issue… the resultant reduction in writebacks from closed-loop operation carries a significant energy benefit. For one, intermediate values no longer use the bypass network. Bypass wires are long, wide, drive large multiplexers at the functional unit inputs, and require significant drive power or repeater power. Additionally, closed-loop operation means that intermediate values avoid the register file completely. As register accesses also incur a significant power cost, it is clearly advantageous to avoid unnecessary accesses… The one-to-one producer-consumer relationship provides numerous opportunities for using direct communication rather than broadcast during execution.” “Operand registers extend a conventional register organization with small, distributed sets of inexpensive general-purpose registers, each of which is integrated with a single function unit in the execute stage. 
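Strands, operand registers, and software bypassing are all chasing the same statistic: most produced values have exactly one consumer. A toy pass of mine over an invented code fragment counts how many results never need to touch the register file at all if they stay inside a closed-loop ALU or an operand register.

```c
#include <stdio.h>

/* A made-up straight-line fragment in three-address form.
 * Registers are small integers; -1 means "no second source". */
typedef struct { int dest, src1, src2; } insn_t;

static insn_t code[] = {
    { 10,  1,  2 },   /* t10 = r1 op r2                      */
    { 11, 10, -1 },   /* t11 = t10 op ...                    */
    { 12, 11,  3 },   /* t12 = t11 op r3                     */
    { 13, 12,  4 },   /* r13 = t12 op r4   (live-out result) */
    { 14,  1,  3 },   /* t14 = r1 op r3                      */
    { 15, 14, 14 },   /* t15 = t14 op t14  (two operand uses) */
};

#define NREGS 32
#define NINSN (int)(sizeof code / sizeof code[0])

int main(void)
{
    int uses[NREGS] = { 0 };
    for (int i = 0; i < NINSN; i++) {
        if (code[i].src1 >= 0) uses[code[i].src1]++;
        if (code[i].src2 >= 0) uses[code[i].src2]++;
    }
    /* A produced value with exactly one consumer is transient: it can be
     * forwarded directly (closed-loop ALU, operand register, software
     * bypass) and never needs a register-file writeback.  Live-out values
     * (zero uses here) and fanned-out values are conservatively kept.    */
    int saved = 0;
    for (int i = 0; i < NINSN; i++) {
        if (uses[code[i].dest] == 1) {
            printf("r%d is transient: skip its writeback\n", code[i].dest);
            saved++;
        }
    }
    printf("%d of %d results never touch the register file\n", saved, NINSN);
    return 0;
}
```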
The shallow, inexpensive register files that implement the operand registers effectively extend the pipeline registers that precede the function units into small register files, preserving the low access energy of the pipeline registers while satisfying a greater fraction of operand references. This lets software capture short-term reuse and instruction-level producer-consumer locality near the function units. Explicit operand forwarding lets software control the routing of operands through the forwarding network so that ephemeral values need not be written to registers. Both mechanisms increase software control over the movement of operands between function units and register files… Like explicit operand forwarding, operand registers reduce the fraction of the operand bandwidth that reaches the general-purpose register file. This improves energy efficiency: the small operand register files are inexpensive to access, and the… cost of traversing the forwarding network and pipeline register preceding the execute stage is avoided… register organizations using operand registers require fast, and consequently small, operand register files to avoid creating critical paths in the execute stage… Each operand register is assigned to a single function unit, and only its associated function unit can read it. The function units can write any register, which allows the function units to communicate through operand registers or the more expensive backing registers. This organization allows software to keep a significant fraction of the operand bandwidth local to the function units, to place operands close to the function unit that will consume them, and to keep data produced and consumed on the same function unit local to the function unit… This makes the general-purpose register file smaller and shortens its bit-lines, reducing its access costs by decreasing the bit-line capacitance switched during read and write operations. It also reduces the cost of updating a register because fewer access devices contribute to the capacitance internal to each register… automatic forwarding performed by the forwarding control logic avoids the increase in code size and dynamic instruction counts needed to explicitly route all data.” “The [static pipelining ISA] results in more control given to the compiler to optimize instruction flow through the processor… There is no need for pipeline registers because SP processors do not need to break instructions into multiple stages. In their place are a number of architecturally visible internal registers. Unlike pipeline registers, these internal registers are explicitly read and written by the instructions, and can hold their values across multiple cycles… Since these internal registers are small, can be placed near the portion of the processor that accesses them, and are explicitly accessed, each internal register is accessible at a lower energy cost than the centralized register file… while the pipeline registers of the baseline processor are read and written every cycle, the SP internal registers are only updated when needed. Because these internal registers are exposed at the architectural level, a new level of compiler optimizations can be exploited… SP instructions are most similar to horizontal micro-instructions… the effects in SP instructions specify how to pipeline traditional operations across multiple cycles. 
While horizontal micro-instructions also specify computation at a low level, they do not expose pipelining at the architectural level… Furthermore, compiler optimizations cannot be performed across… micro-routines since this level is not generally exposed to the compiler… most VLIW instructions represent multiple RISC operations that can be performed in parallel. In contrast, the SP approach encodes individual instruction effects that can be issued in parallel, where most of these effects correspond to an action taken by a single pipeline stage of a traditional RISC instruction… The [TTAs] are similar to VLIWs in that there are a large number of parallel computations specified in each instruction. TTAs, however, can move values directly to and from functional unit ports, to avoid the need for large, multi-ported register files. Likewise, the TTA compiler was able to perform copy propagation and dead assignment elimination on register references. Thus, both the TTA and the SP avoid many unnecessary register file accesses.” “Because register file energy consumption is a significant portion of processor energy, these unnecessary accesses are quite wasteful… In the rare cases where too many effects are scheduled together, the compiler will attempt to move effects into surrounding instructions while obeying structural hazards and dependencies. Only when the compiler cannot do so will an additional instruction be generated for these additional instruction effects… Because an internal register access is cheaper than a register file access, the compiler will prefer the former… We are able to create a more efficient loop due to this fine-grained control of the instruction effects… We also pack together effects that can be executed in parallel, obeying data and structural dependencies.” “Data still passes through the processor in multiple cycles. But how each portion of the processor is controlled during each cycle is explicitly represented in each instruction. Thus instructions are encoded to simultaneously perform actions normally associated with separate pipeline stages… Each instruction consists of a set of effects, each of which updates some portion of the processor… All of the effects specified in a single instruction are independent and are performed in parallel. The values in the internal registers are read at the beginning of the cycle and written at the end of the cycle… except for the effects that solely read or write a register file value, all of the effects operate solely on the internal registers. This is analogous to how RISC architectures only allow load or store instructions to reference memory locations… The static pipeline… accessed the register file significantly less, because it is able to retain values in internal registers with the help of the compiler. Instead of accessing the register file, the statically pipelined code accesses the internal registers often.” “A [Counterflow Pipeline Processor Architecture] uses a bidirectional pipeline — in which instructions flow and results counterflow — to move partially-executed instructions and results in a regular way, subject to a few pipeline rules that guarantee correct operation… The CFPP structure is very regular, which provides hope that it can be laid out on silicon in a simple, regular way. 
The simplicity and regularity should also help in devising correctness proofs… Pipeline stages communicate primarily with their nearest neighbors, allowing short communication paths that can be very fast… The local nature of most communications and control should permit very fast operation… each destination binding is inserted into the downward-flowing results pipeline so that it may be observed by subsequent instructions… The results pipeline provides the function that in other designs is called ‘bypassing’ or ‘forwarding,’ and does so in a way that is uniform for all stages… The architecture permits several separate register files to co-exist along the pipeline… A major objective of the CFPP structure is to allow local control of the pipeline: the decision as to whether a packet may advance to the next stage of the pipeline requires knowing only the state of the two stages involved in the transfer… Local control is easily achieved by a combinational function on the state of neighboring stages… The CFPP design is an extreme attempt to replace physically long communication paths on a chip with pipelined transport of data and instructions. As chip densities and transistor performance increase, the relative delay of long wires increases. Our hope is that a small number of very fast local pipeline transfers can replace a single, longer delay required to send bits an equivalent distance… At the opposite extreme from the CFPP is a dataflow architecture, in which a new result is broadcast to an associative memory that holds instructions waiting to be executed. The new result may complete the source values required to execute one or more of these instructions; they are identified, read out of the associative memory, and routed to a suitable functional unit. The comparators in the associative memory perform the same function as the comparators that enforce the pipeline rules in the CFPP. However, the dataflow mechanism depends on broadcasting data to an associative memory that presents a large load and long pathways to the broadcast wires, and that slows the operation.” “No centralized scheduling or scoreboarding of the pipeline is performed… Interlock checking is performed at a local level.” “As a read or write access implicitly refreshes the accessed DRAM cells… it is possible to synchronize the refresh operations and the read/write requests, such that read/write requests of the application naturally refresh DRAM. [Refresh-Triggered Computation] not only eliminates conflicts between refresh and read/write, but also reduces the number of refresh commands that the memory controller needs to issue by eliminating redundant refresh operations… both refresh and access requests perform similar operations (i.e. activating and precharging a row) in the DRAM circuitry… a profiling-based or compiler-assisted approach can… automatically determine access patterns of applications and take advantage of RTC without involving the programmer… partitioning the applications to separate banks or channels improves overall system performance by reducing the bank/channel conflicts.” “Every access to the memory naturally opens accessed rows and consequently restores the stored charge in the capacitors of DRAM cells, thus incurring an implicit refresh operation… an access to a single bit in memory refreshes the whole row of the DRAM bank where that bit is located. 
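The bookkeeping implied here is tiny: remember when each row was last opened, and spend explicit refresh commands only on rows whose retention window is about to lapse. A sketch with a made-up bank of eight rows and a typical-looking 64 ms retention window, rather than numbers from any real datasheet:

```python
# Refresh bookkeeping for one toy DRAM bank: every ordinary activate
# implicitly refreshes its row, so the controller only spends explicit
# refreshes on rows whose retention window is about to lapse. The row
# count, timestamps, and 64 ms window are illustrative.

RETENTION_MS = 64.0
last_opened = {row: 0.0 for row in range(8)}   # row -> last activate/refresh time

def activate(row, now_ms):
    """A read or write opens the row, which implicitly refreshes it."""
    last_opened[row] = now_ms

def rows_needing_refresh(now_ms, slack_ms=4.0):
    """Rows the controller must still refresh with an explicit command."""
    return sorted(r for r, t in last_opened.items()
                  if now_ms - t >= RETENTION_MS - slack_ms)

# A workload that keeps touching rows 0-3 never needs them refreshed;
# the idle rows 4-7 still require explicit refresh commands.
t = 0.0
while t < 70.0:
    for row in range(4):
        activate(row, t)
    t += 10.0
print(rows_needing_refresh(70.0))   # -> [4, 5, 6, 7]
```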
As such, accesses to one bit, word, or variable may cause others to be refreshed as well… All interactions between variables can be captured given precise knowledge of the memory locations of variables and… the mapping of virtual addresses to ranks, banks, and rows in DRAM.” “Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or different words within a single row have low latency and can be pipelined. The three-dimensional nature… makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM. This optimization is similar to how a superscalar processor schedules arithmetic operations out of order. As with a superscalar processor, the semantics of sequential execution are preserved by reordering the results… Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation) the entire row of the memory array is transferred into the bank’s row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge) which prepares the bank for a subsequent row activation… A memory access scheduler must generate a schedule that conforms to the timing and resource constraints of these modern DRAMs… Each DRAM operation makes different demands on the three DRAM resources: the internal banks, a single set of address lines, and a single set of data lines. The scheduler must ensure that the required resources are available for each DRAM operation it schedules… For many DRAMs, the bank, row, and column addresses share a single set of lines. Hence, the scheduler must arbitrate between precharge, row, and column operations that all need to use this single resource. [Rambus DRAMs] provide separate row and column address lines (each with their own associated bank address) so that column and row accesses can be initiated simultaneously. To approach the peak data rate with serialized resources, there must be enough column accesses to each row to hide the precharge/activate latencies of other banks. Whether or not this can be achieved is dependent on the data reference patterns and the order in which the DRAM is accessed to satisfy those references… Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references… Memory access scheduling greatly increases the bandwidth utilization of these DRAMs by buffering memory references and choosing to complete them in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access, resulting in improved system performance… Contemporary cache organizations waste memory bandwidth in order to reduce the memory latency seen by the processor. 
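A first-order sketch of such a scheduler, using the simplest open-row policy I could think of: serve the oldest pending reference that hits an already-open row, and only fall back to the oldest reference (precharging and activating as needed) when no row hit exists. The reference stream, the policy, and the complete absence of timing are all simplifications.

```python
# First-order memory access scheduler for a toy DRAM with independent
# banks: prefer the oldest pending reference that hits an already-open row
# (a cheap column access); otherwise take the oldest reference and
# precharge/activate as needed.

from collections import deque

open_row = {}                         # bank -> row currently in the row buffer
pending = deque([                     # (bank, row, column) in arrival order
    (0, 5, 1), (1, 9, 0), (0, 5, 2), (0, 7, 0), (1, 9, 3), (0, 5, 3),
])

commands = []
while pending:
    hit = next((ref for ref in pending if open_row.get(ref[0]) == ref[1]), None)
    ref = hit if hit is not None else pending[0]
    pending.remove(ref)
    bank, row, col = ref
    if open_row.get(bank) != row:
        if bank in open_row:
            commands.append(("precharge", bank))
        commands.append(("activate", bank, row))
        open_row[bank] = row
    commands.append(("read", bank, row, col))

for cmd in commands:
    print(cmd)
# All three accesses to (bank 0, row 5) are issued back to back before the
# bank is precharged for row 7, instead of ping-ponging between rows.
```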
As memory bandwidth becomes more precious, this will no longer be a practical solution to reducing memory latency.” “The aggregate bandwidth provided by a single HMC is substantial due to the large number of independent memory banks, high-bandwidth through-silicon vias, and a high-speed I/O interface… [Hybrid Memory Cube] consists of several DRAM layers stacked with a logic layer on the bottom. The different layers are connected with [TSV] interconnects. Each layer of memory is partitioned into different segments and the vertical segments are grouped together to form a vertical slice or a vault. The logic layer on the bottom consists of the memory controllers (one for each vault), the I/Os, and a switch that interconnects the vaults and the I/Os.” “The memory subsystem is a major key performance bottleneck and energy consumer in modern computer systems… the off-chip memory bandwidth does not grow as fast as the processor’s computation throughput. The limited memory bandwidth is the result of relatively slow DRAM technologies (necessary to guarantee low leakage), the non-scalable pinout, and on-board wires that connect the processor and memory chips… a considerable portion of the memory bandwidth demand in many programs and operating system routines is due to bulk data copy and initialization operations. Even though moving data blocks that already exist in memory from one location to another location does not involve computation on the processor side, the processor has to issue a series of read and write operations that move data back and forth between the processor and the main memory… [Network-on-Memory] carries out copy operations entirely within the memory across DRAM banks, without any intervention from the processor… With NoM, banks in 3D stacked memory are equipped with very simple circuit-switched routers… To perform data transfers in parallel, NoM replaces the global links of the shared bus with a set of shorter inter-bank links and, hence, yields higher throughput and scalability… Vaults can be accessed simultaneously… In addition to the conventional address, data, and control buses, NoM connects each bank to its neighboring banks at the X, Y and Z dimensions to form a 3D mesh topology… due to its simple structure and short non-intersecting links, which do not significantly change the structure of the DRAM layer… all vault controllers can serve both regular memory accesses and DRAM refresh requests in parallel with NoM’s data copy operations. The only exception is when the corresponding vault controllers are busy reading and writing NoM data at the source and destination banks, respectively. In between these read and write steps, the data block passes on the NoM links and leaves the read/write bandwidth idle for regular accesses.” “The distributed processing capabilities of HBM… surpass all other off-chip memories. In particular, HBM offers substantially better power consumption… Copper pillar/microbump technology [uses] flip-chip interconnects with solder microbumps to connect memory stacks to logic elements… Solder bumps need to get smaller as the pitch shrinks, but eventually there simply isn’t enough material to ensure a reliable electrical connection. Depending on the approach, the minimum interconnect pitch for solder-based technologies is between 30μm and 55μm. 
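Those pitch numbers are worth a back-of-the-envelope check: for a square grid of connections, density is just one over the pitch squared, which reproduces the figures the quote goes on to give.

```python
# Back-of-the-envelope interconnect density, assuming a square grid of
# connections: density is 1/pitch^2. A 30 µm microbump pitch versus a 1 µm
# hybrid-bonding pad pitch is the source of the roughly
# three-orders-of-magnitude figure.

def connections_per_mm2(pitch_um):
    per_mm = 1000.0 / pitch_um        # connections along one millimetre
    return per_mm ** 2

print(round(connections_per_mm2(30)))                            # ~1111
print(round(connections_per_mm2(1)))                             # 1,000,000
print(round(connections_per_mm2(1) / connections_per_mm2(30)))   # ~900x
```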
The resulting maximum interconnect density of about 1110 connections per mm² sharply limits the memory bandwidth that the package can support… Moreover, interconnect schemes based on copper pillars and microbumps create a 20–25μm separation between die in the stack. After deposition of the last layer of device metallization, additional process steps are required for either copper pillars or pads for solder bumps. Solder is a multicomponent material that is known to have thermo-mechanical stress during operation, which can cause cracking and lead to electrical failures. Organic underfill layers are routinely used to accommodate these stresses, but the high thermal resistance of these layers impedes heat dissipation… Each additional die adds to the thermal non-uniformity through the stack… The solution [is] a die bond layer free of any organic material with all-metal interconnect. Each such die stack will perform thermally like a single die. This hybrid bonding approach eliminates the chip-to-chip connectivity bottleneck by replacing solder bumps with high-density metal-metal 3D interconnects. Metal pads (usually copper) surrounded by dielectric contact each other directly. Before bonding, the metal is slightly recessed relative to the surrounding dielectric. Plasma treatment activates the dielectric surface, allowing a dielectric-to-dielectric bond to form on contact. Though the devices being connected have not yet made electrical contact, this bond is strong enough to allow handling, assembly of multi-chip stacks, and so on. Once the complete stack is ready, a final elevated temperature step causes the pads to expand, contacting each other and making a strong metal-to-metal bond. In this structure, there is no solder material, no separation between the devices, and no need for an underfill layer. The inorganic dielectric provides mechanical support, protects the copper surfaces from contamination, and gives superior thermal conductivity. The direct copper-to-copper connection minimizes electrical resistance and maximizes heat dissipation. Copper pad fabrication for hybrid bonding uses the same inlaid metal damascene process as the on-chip interconnect metallization used throughout the semiconductor industry. This… process routinely achieves a 1μm pitch with potential densities approaching a million connections per mm². The net result is up to a [three orders of magnitude] interconnect density increase, while still improving thermal performance and reducing cost of ownership… While hybrid bonding does not require [TSVs], it can exploit them to support complex multilayer stacks. The inlaid copper pads on the frontside of one die can contact TSVs directly on the backside of another… Although HBM memory chips… are approximately 50μm thick, the industry has demonstrated feasibility down to 30μm… High-density interconnects eliminate bottlenecks that might otherwise occur between chips. Instead, designers can treat interfaces between die as comparable to interfaces within die, allowing die to share interconnect layers and a single power supply mesh. With careful attention to vertical placement, such designs cannot just match the monolithic die performance, but improve it by reducing critical path lengths.” “Three-dimensional integration technology offers the promise of being a new way of increasing system performance, even in the absence of scaling. 
This promise is due to a number of characteristic features of 3D integration, including decreased total wiring length (and thus reduced interconnect delay times), a dramatically increased number of interconnects between chips, and [so on]… Perhaps the most compelling advantage of this capability is that it allows a massive communication bandwidth to be achieved between the processor cores and memory… The face-to-back method is based on bonding the front side of the bottom wafer to the thinned backside of the top wafer… The height of the structure and, therefore, the height of the interconnecting via depend on the thickness of the thinned top wafer… For face-to-back schemes, typically, the minimum thickness for the top wafer is on the order of 25–50µm using conventional grind and polish thinning. For copper vias, aspect ratios on the order of 5:1 are typically needed for reliable fill, thus limiting the interconnect pitch to values on the order of 10–20µm… A vertical interconnect is necessary for 3D integration to truly take advantage of 3D for system-level performance, because without it, interconnects would be limited to the periphery of the chip, and in this case, the interconnect density would be no greater than in conventional planar technology… Attachment of two wafers is possible using a thermocompression bond created by applying pressure to two wafers with copper-metallized surfaces at elevated temperatures. For 3D integration, the Cu-Cu join can serve the additional function of providing electrical connection between the two layers… Wafer thinning is a necessary component of 3D integration, as it allows the interlayer distance to be reduced, therefore allowing a high density of vertical interconnects… a key limitation to the extendibility of 3D integration technology will be the ability to deliver power through the chip stack. The difficulty lies in the fact that power vias need to be wide enough to supply power through the entire chip stack while maintaining low enough resistance to avoid excessive voltage drop on Vᴅᴅ and ground. This problem is exacerbated by the fact that the chip dissipating the most power is likely to be located near the heat sink, farthest from the package, necessitating that power vias run throughout the entire stack. Ultimately, it is conceivable that power delivery considerations could limit the number of chips that it is beneficial to integrate, because a greater percentage of the area allocated to power vias will be needed as the total thickness of the stack increases… This [suggests] thinning the individual layers in the 3D stack so that more layers can ultimately be integrated. The development of energy-optimized processor designs will also be critical to maintain or decrease the total chip power as additional layers are added. Along with the issue of power delivery for 3D integration comes the issue of cooling. The main cooling issue arises from the fact that only one chip (usually the microprocessor) can be located directly adjacent to a conventional heat sink. The remaining chips in the stack need to be located between the processor chip and the package and, therefore, necessarily have to operate at a temperature that is higher than the chip located next to the heat sink. This issue… highlights the need to maintain the best thermal conductivity possible between the chips in the stack. 
The ability to achieve good thermal contact between the layers depends greatly on the integration scheme… metallic connections between the wafers provide the best means for heat transfer between the wafers. Therefore, one would expect that Cu-Cu bonding would have excellent thermal conductivity between the layers, particularly if a large number of dummy metallization layers were present. If an intermediate adhesive layer were used in lieu of dummy bonding pads, then the thermal conductivity between the layers would be decreased.”
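One number buried in that passage is worth checking: a face-to-back via has to span the thinned top wafer, and a roughly 5:1 aspect ratio bounds how narrow it can be, which is how 25 to 50 µm of remaining silicon turns into a 10 to 20 µm interconnect pitch. The sketch assumes the pitch is about twice the via diameter, which is my guess at a rule of thumb, not a figure from the source.

```python
# Rough arithmetic behind the via pitch limit quoted above: a face-to-back
# via must span the thinned top wafer, and a ~5:1 aspect ratio bounds how
# narrow it can be. The pitch-is-twice-the-diameter assumption is my own.

ASPECT_RATIO = 5.0                    # via height : via diameter

def min_pitch_um(wafer_thickness_um, pitch_to_diameter=2.0):
    via_diameter_um = wafer_thickness_um / ASPECT_RATIO
    return pitch_to_diameter * via_diameter_um

print(min_pitch_um(25))   # -> 10.0 µm for a 25 µm thick top wafer
print(min_pitch_um(50))   # -> 20.0 µm for a 50 µm thick top wafer
```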
Burroughs Chimera (fluent programming language) [effect-coeffect system]
“Functional languages have severe difficulties dealing with arrays efficiently. In a fluent language, on the other hand, arrays created by an imperative subprogram can be passed to and manipulated by functional subprograms… In a fluent… language programmers may freely mix functional and imperative programming, subject to the sublanguage constraints. These constraints, which restrict the permissible side effects of expressions, are enforced by a process similar to type checking.” “Much like a type system tracks the values that may be stored in particular variables, an effect system tracks the side effects that may be produced by particular functions, allowing the programmer to reason about mutation and side effects more formally than in a traditional imperative language… languages with effect systems simply let the programmer denote certain procedures as having imperative semantics. More powerful effect systems allow tighter control over what effects are allowed. Such a language can be seen as having separate declarative and imperative sublanguages…” “Effects and coeffects… roughly correspond to computations which change the execution context versus computations which make demands on the context. Effectful features include partiality, non-determinism, I/O, state, and exceptions. Coeffectful features include resource demands, variable access, notions of linearity, and data input requirements. The effectful or coeffectful behavior of a program can be captured and described via type-based analyses… A pure program can be described as a mathematical object that is isolated from the real world. However, even in the most abstract setting, a program is hardly isolated. Instead it interacts with its evaluation context… Indeed, basic operations like I/O are inconceivable in a program that runs in isolation… programs make both demands on, and produce changes to, the context; computations are often both effectful and coeffectful… Moreover, changes to the evaluation context may depend on program requirements, and vice versa. That is, coeffects and effects may interact.” “Functional programs are widely appreciated for being ‘declarative’ and implicitly parallel. However, when modeling state, both properties are compromised because the state variables must be ‘threaded’ through most functions. Further, state ‘updates’ may require much copying… By ‘declarative’ we mean that one does not have to over-specify the details of an algorithm. This is sometimes paraphrased as [say what you want done, not how you want it done]… A common approach to addressing these weaknesses is to extend a functional language with state… In pure functional languages, parallelism is implicit. Annotations for parallelism may be used, but these can be viewed as hints and do not add to the intellectual burden of the programmer. However, in sequential functional languages with imperative extensions, parallelism is usually re-introduced through another layer of constructs such as threads, forks, semaphores, futures, [and so on]. These features usually make the language more complex. Further, when two such parallel programs are composed, their parallelism does not compose, because of the underlying strict semantics. The difference in resulting parallelism can be exponential… Threading ensures determinacy at the expense of serializing access to the shared data structure. 
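The sublanguage constraint is easy to sketch, if only dynamically: every function declares itself functional or imperative, and functional code may not call imperative code, while the reverse is allowed. A fluent language would enforce this statically, by something like type checking; the call-time check below is just the cheapest way to fake it in Python.

```python
# Dynamic toy of the sublanguage constraint: functions declare themselves
# functional or imperative, and imperative code may not be invoked from
# functional code. A real fluent language would reject the bad case at
# compile time rather than at call time.

import functools

_active = []   # effect annotations of the callers currently on the stack

def effect(kind):                     # kind: "functional" or "imperative"
    def wrap(fn):
        @functools.wraps(fn)
        def checked(*args, **kwargs):
            if kind == "imperative" and "functional" in _active:
                raise TypeError(fn.__name__ +
                                " has imperative effects but was called "
                                "from functional code")
            _active.append(kind)
            try:
                return fn(*args, **kwargs)
            finally:
                _active.pop()
        return checked
    return wrap

log = []

@effect("imperative")
def record(x):
    log.append(x)                     # a side effect on shared state
    return x

@effect("functional")
def square(x):
    return x * x

@effect("imperative")
def main():
    return record(square(3))          # imperative calling functional: allowed

@effect("functional")
def sneaky(x):
    return record(x)                  # functional calling imperative: rejected

print(main())                         # -> 9
try:
    sneaky(2)
except TypeError as err:
    print(err)
```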
For a large class of algorithms, determinacy can be ensured in the presence of asynchronous updates to a shared structure… a local model of whether or not side-effects occur is essential in order to reason about programs with side-effects… In most approaches to parallel computing, one starts with a sequential language and adds parallel annotations… In contrast, we go in the opposite direction: we start with an implicitly parallel language and add sequentialization only where necessary. The degree to which we expose and exploit parallelism is thus much more aggressive.” “Much of the complexity of parallel execution comes from the fact that parallel tasks can generate many different interleavings of the basic operations, with each interleaving generating a potentially different behavior. To understand the program, one must take all of the different behaviors into account. Expressing the computation in a serial language, however, imposes a total order on the operations and the program only generates one behavior. From the programmer’s perspective this dramatic simplification makes it much easier to reason about what the program may do. From the system’s perspective it imposes the often considerable challenge of automatically generating parallel execution from a serial program… Systems with serial semantics provide mutable data but simplify the programming model by eliminating the concept of parallel execution. Monotonic systems, on the other hand, provide parallel execution but simplify the programming model by eliminating the concept of mutable data… In monotonic systems a variable’s value does not change over the course of the computation — it only becomes more defined. Computation in monotonic languages is therefore inherently parallelizable. All operations can execute concurrently without conflict… The computation proceeds in an information-driven fashion. When one operation generates information that satisfies a second operation’s information request, the second operation executes, generating further information or computation… Programmers using functional languages structure their computation as a set of recursively defined functions. Variables hold values returned from function invocations and are used to transfer values between invocations. Because each variable holds only one value during the course of a computation, the execution of a functional program can be seen as the progressive generation of information about the values of variables… The strengths of monotonic systems are their clean semantics and inherently parallel execution model. They allow programmers to expose the concurrency in their computations without having to deal with the complexity of an explicitly parallel programming system. A major barrier to the widespread acceptance of monotonic languages, however, is the difficulty of implementing these languages efficiently… There are also expressiveness problems associated with monotonic languages… the lack of mutable data can force programmers to tediously thread state through multiple layers of function calls… The expressiveness and efficiency problems have led to hybrid designs which integrate mutable data into monotonic programming languages… Programs written in languages that support mutable data reuse memory efficiently by overwriting old values with new values. Because monotonic languages eliminate the notion of side effects, the implementation is responsible for generating the memory reuse. Garbage collection is one common technique for reusing memory. 
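The mechanics of a variable that only becomes more defined fit in a few lines: a write-once cell that remembers who is waiting on it and runs their continuations the moment the value arrives. This is a single-threaded toy in the spirit of the description above, not an implementation of any particular monotonic language; a real one needs synchronization and a scheduler.

```python
# Write-once, "monotonic" variable: it starts undefined, may be defined
# exactly once, and readers attach continuations that run as soon as the
# value becomes available (information-driven execution).

_EMPTY = object()

class WriteOnce:
    def __init__(self):
        self._value = _EMPTY
        self._waiters = []            # continuations awaiting the value

    def put(self, value):
        if self._value is not _EMPTY:
            raise ValueError("monotonic variable is already defined")
        self._value = value
        for resume in self._waiters:  # satisfy pending information requests
            resume(value)
        self._waiters.clear()

    def when_ready(self, resume):
        if self._value is _EMPTY:
            self._waiters.append(resume)
        else:
            resume(self._value)

x, y = WriteOnce(), WriteOnce()
x.when_ready(lambda v: y.put(v + 1))             # runs as soon as x is defined
x.when_ready(lambda v: print("consumer saw", v))
x.put(41)                                        # -> consumer saw 41
y.when_ready(print)                              # -> 42
```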
The lack of side effects in monotonic languages means that each piece of memory goes through the collector between writes… The lack of mutable data means that programs written in monotonic languages typically generate more copying overhead than programs written in more conventional languages. To update a component of a data structure, the program cannot simply generate the new value and store it back into the original structure. It must instead allocate a new structure to hold the value and copy the unchanged values from the old structure to the new structure. If the structure is part of a large linked data structure the program must recursively regenerate all of the data that pointed to the original (and now obsolete) structure… One interesting aspect of update-in-place optimizations is that their use reintroduces the key problem of parallelizing programs that use mutable data: correctly sequencing reads with respect to writes in the parallel execution. Systems that use this optimization can no longer blindly evaluate expressions in parallel. They must analyze the generated access patterns and insert synchronization operations that correctly sequence the evaluation of expressions that access the updated data structure… The multilingual approach works best when it is natural to structure each task as a functional module that accepts a set of inputs and generates a set of outputs… Because it is impossible to implement irregular concurrency patterns with barriers, the tasks in programs with irregular concurrency patterns cannot interact using mutable memory… Shared memory systems… synchronize using constructs such as locks, barriers and condition variables.” “In functional programming there are no side effects, so programs can execute concurrently to the extent permitted by data dependencies. Although functional programming can solve most problems, having explicit state, as provided by imperative languages, allows the developer to express certain problems in a more intuitive and efficient way… Sharing state between concurrent tasks immediately raises questions [about the order in which those accesses should occur, and how to coordinate those accesses to maintain program invariants]. The reason why those questions are hard to answer is because there are implicit dependencies between code and state. Methods can arbitrarily change any accessible state without revealing this information to the caller. This means that two methods could be dependent on the same state, without the caller knowing about it. Because of this lack of information, [imperative] programming languages use the order in which code is written as proxy to express those implicit dependencies. Therefore the compiler has to follow the given order and cannot exploit potential concurrency automatically… To overcome this situation, we propose to transform implicit dependencies into explicit dependencies and then infer the ordering constraints automatically… by default, everything is concurrent, unless explicit dependencies imply a specific ordering… We propose to use access permissions to specify explicit dependencies between stateful operations. Access permissions are abstract capabilities that grant or prohibit certain kinds of accesses to specific state… The way we use access permissions to specify state dependencies resembles the way Haskell uses its I/O monad to specify access to global state. 
But unlike the I/O monad, which provides just one permission to all the state in the system, access permissions allow greater flexibility by supporting fine-grained specifications, describing the exact state and permitted operations on it. In ÆMINIUM every method must explicitly mention all of its possible side effects. This allows the system to compute the data dependencies within the code, and within those constraints, execute the program with the maximum possible concurrency. By following this approach our system resembles a dataflow architecture. But instead of producing and consuming data, our system supports shared objects and in-place updates… Because access permissions play such an important role in our system, we promote them to first-class citizens and integrate them into our type system… Shared memory and objects are in general used as communication channels between several concurrent entities, which may modify the shared state. Therefore, we need a mechanism to allow concurrent execution and modifying access to a shared resource. A shared permission provides exactly these semantics… In some situations, application-level dependencies exist that cannot directly be inferred via data dependencies… To allow the user to specify such additional dependencies without sacrificing concurrency, we add data groups to our system… Data groups provide a higher-level abstraction and provide information hiding with respect to what state is touched by a method… In our system a data group can be seen as a container which contains all shared permissions to an object… Data groups allow the programmer to explicitly model [their] design intent in the source code. Not only does this allow the ÆMINIUM system to infer the dependencies and correct execution, it also improves the quality of the code itself by explicitly documenting those dependencies… data groups model an effect system…” “ÆMINIUM’s type system prevents data races by either enforcing synchronization when accessing shared data or by correctly computing dependencies to ensure a happens-before relationship (meaning conflicting accesses will be ordered according to their lexical order)… One can thus think of state being threaded through the program following the permissions. In this sense, permissions allow us to treat an imperative program as if it were purely functional, with corresponding benefits for reasoning and parallelization. An analogy can be made to monads such as the state monad in Haskell, which conceptually threads the state of the heap through the program computation.” “In order to respect sequential consistency… Athapascan-1 has to identify the related data version for each read performed on a shared data. However, parallelism detection is easily possible in this context if all the tasks define the shared data objects that will be accessed during their execution… and which type of access will be performed on them… tasks cannot perform side effects… and shared parameters are typed according to the future manipulation… In the declaration of formal parameters of tasks, the references to shared data version are typed by their access right, i.e.
the kind of manipulation the task (and all its sub-tasks, due to lexicographic order semantic) is allowed to perform on the shared data… the strong typing of accesses in shared memory enables the verification of the correctness of a program at compile time.” “Our memory model is based on explicit, programmer-specified requirements for order of execution… The compiler then realizes those requirements in the most efficient manner it can. This is in contrast to existing memory models, which — if they allow programmer control over synchronization at all — are based on inferring the… consequences of synchronization operations or annotations in the code… RMC is intended to admit all the strange (to a sequentially consistent eye) executions permitted by relaxed memory architectures. Thus, an RMC program can be implemented on such architectures with no synchronization overhead beyond what is explicitly specified by the programmer… Out-of-order execution makes it possible for execution to proceed out of program order, but not out of dependency order.” “In RMC, the programmer explicitly specifies constraints on the order of execution of operations… These constraints are then enforced by the compiler, which has a great deal of latitude in how to achieve them. Because of the very fine-grained information about permissible behaviors, this can allow the generation of more efficient code… While RMC does not have a useful global total ordering of actions, there does exist a useful per-location ordering of writes that is respected by reads. This order is called coherence order. It is a strict partial order that only relates writes to the same location, and it is the primary technical device that we use to model what writes can be read. A read, then, can read from any write executed before it such that the constraints on coherence order are obeyed… The general approach of weak memory models is to abandon any notion of ‘memory’ and to replace it with a history that is given order by a soup of partial order relations… While a monadic lambda calculus seems quite different than a traditional block-structured imperative programming, it is fairly simple to translate from a traditional imperative language to this sort of calculus… existing data and control dependencies in the program may already enforce the execution order we desire, making it unnecessary to emit any additional code at all… given that relative ordering constraints are a key reasoning principle, exposing them directly in the program aids program construction and understanding.”
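Pulling the ÆMINIUM and Athapascan-1 passages together: if every task declares which pieces of state it touches and whether it needs write or read-only access, the ordering constraints can be inferred rather than spelled out, with conflicting accesses falling back to lexical order. The sketch below performs that inference over a made-up task list; the permission names and the wave-at-a-time schedule are my simplifications, not the actual type systems.

```python
# Inferring ordering from declared permissions, loosely in the spirit of
# the quoted systems: each task names the state it touches and whether it
# needs write or read-only access, conflicting accesses are ordered by
# lexical order, and everything else may run concurrently.

tasks = [   # (name, {state: "write" | "read"}), in lexical order
    ("parse",   {"input": "read", "ast": "write"}),
    ("lint",    {"ast": "read"}),
    ("compile", {"ast": "read", "obj": "write"}),
    ("log",     {"console": "write"}),
]

def conflicts(a, b):
    shared = set(a) & set(b)
    return any(a[s] == "write" or b[s] == "write" for s in shared)

# a task depends on every lexically earlier task it conflicts with
deps = {name: [earlier for earlier, ep in tasks[:i] if conflicts(ep, perms)]
        for i, (name, perms) in enumerate(tasks)}

# group tasks into waves whose members could execute concurrently
done, waves = set(), []
while len(done) < len(tasks):
    wave = [n for n, _ in tasks if n not in done and all(d in done for d in deps[n])]
    waves.append(wave)
    done.update(wave)

print(deps)    # lint and compile each depend on parse; log depends on nothing
print(waves)   # -> [['parse', 'log'], ['lint', 'compile']]
```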