Operating Systems for ManyCore (Tessellation OS)

Project Website: http://tessellation.cs.berkeley.edu

Our approach to both operating systems and hardware architecture is to deconstruct conventional functionality into primitive mechanisms that software can compose to meet application needs.

A traditional OS is designed to multiplex a large number of sequential jobs onto a small number of processors, with virtual machine monitors (VMMs) adding another layer of virtualization [Rosenblum and Garfinkel 2005]. We instead propose a very thin hypervisor layer that exports spatial hardware partitions to application-level software. These “un-virtual” machines allow each parallel application to use custom processor schedulers without fighting fixed policies in OS/VMM stacks. The hypervisor supports hierarchical partitioning, with mechanisms to allow parent partitions to swap child partitions to memory, and partitions can communicate either through protected shared memory or messaging. Traditional OS functions are provided as unprivileged libraries or in separate partitions. For example, device drivers run in separate partitions for robustness, and to isolate parallel program performance from I/O service events. This approach is similar to the MIT Exokernel project [Engler, Kaashoek et al. 1995], but instead of having a single isolation barrier (user/supervisor), we support arbitrarily nested partitions. An important difference from earlier systems is that we also export physical processor scheduling and virtualization out of the fixed hypervisor layer to better support manycore machines. This deconstructed and distributed OS style improves both scalability and resiliency to hardware and software faults. Our hardware architecture enforces partitioning of not only of cores and of on-chip/off-chip memory, but it also partitions the communication bandwidths between these components and with quality-of-service guarantees for system interconnect. The resulting performance predictability improves parallel program performance, simplifies code (auto)tuning and dynamic load balancing, and supports real-time applications.

Our partitioned architecture allows software to directly use new low-level hardware primitives to build customized parallel run-time systems, while protecting the larger system from bugs in any partition. On-chip memory can be configured as either cache or scratchpad RAM, and user-programmable DMA engines move data efficiently through the memory hierarchy (Williams, Oliker et al. 2007). Atomic fetch-and-op instructions provide efficient memory-side synchronization, including across partitions. Barrier hardware provides rapid synchronization within a partition. As an example, the combination of scratchpad RAM, DMA, and fast barriers, provides a simple and highly efficient run-time for many dwarfs: structured grids, unstructured grids, dense matrices, sparse matrices, and spectral methods.

More complex parallel run-time systems are built on user-level active messages [Eicken, Culler et al. 1992], which provide point-to-point synchronization between cores with direct register-to-register communication, and user-level exception handlers. For example, software thread schedulers use active messages to communicate dataflow synchronization and work stealing requests [Blumofe, Joerg et al. 1995]. We break cache coherence and transactional memory “packaged solutions” into smaller hardware primitives [Hill, Hower et al. 2007]. Per-partition coherence hardware tracks read-write sharing and handles simpler private data caching in hardware, but invokes software handlers when conflicts are detected. Handlers may in turn send active messages to remote cores to resolve the conflict. This mechanism can be used by software for conventional cache coherence, race detection, or transactional memory. Software is responsible for creating state checkpoints, but hardware can optionally generate logs to support memory rollback or deterministic replay [Xu, Bodik et al. 2003]. Hardware cache tags are accessible by software to support context swapping of transaction read/write sets and logs.

A single application can be split across multiple partitions, with inter-partition communication via protected shared memory [Witchel, Cates et al. 2006]. This allows legacy parallel libraries that require their own concurrent run-time model to be quarantined.

The primitives are designed synergistically to allow arbitrary combinations. This enables new forms of hybrid run-time system, but perhaps more importantly, supports composition of run-time models in a single application. A code can switch thread scheduling policy (static versus dynamic) and memory hierarchy (software-managed versus cached versus transactional) to match the current phase of execution, and even run different policies in different subsets of a partition. Such complex run-time strategies would be difficult to create manually, but can be automatically synthesized from the coordination and composition language of the productivity-layer.

We will use the RAMP FPGA-emulation infrastructure [Wawrzynek, Patterson et al. 2007] to allow rapid co-design of the hardware and software features. RAMP makes it possible to run whole applications at reasonable speed, allowing full-system evaluation of our proposed system stack.