#### APPLE-CORE: SVP AND MICROGRIDS

# CAN WE STILL RETHINK THE HARDWARE/SOFTWARE INTERFACE IN GENERAL-PURPOSE PROCESSORS?

RAPHAEL 'KENA' POSS
UNIVERSITY OF AMSTERDAM, THE NETHERLANDS

DSD 2012, CESME, IZMIR, TURKEY, SEPTEMBER 6TH, 2012



### CURRENT GENERAL-PURPOSE MULTI-CORES ARE BASED ON LEGACY

- Historical focus on single-thread performance (developments in general-purpose processors: registers, branch prediction, prefetching, out-of-order execution, superscalar issue, trace caches, etc.)
- Legacy heavily biased towards single threads:
  - Symptom: **interrupts** are the **only way** to signal asynchronous external events
  - Retro-fitting hardware multithreading is difficult because of the sequential core's complexity
- What if...
  we redesigned general-purpose processors,
  assuming concurrency is the norm in software?

## MICROGRIDS OF D-RISC CORES



D-RISC cores: hardware multithreading + dynamic dataflow scheduling

- **fine-grained threads**: 0-cycle thread switching, <2 cycles creation overhead
- ISA instructions and NoC protocol for thread management
- dedicated hardware processes for bulk creation and synchronization
- No preemption/interrupts; events "create" new threads

In-order, single-issue RISC: smaller, cheaper, faster/watt
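The scheduling idea above can be illustrated with a minimal software sketch. It is not the D-RISC implementation: the names `Register`, `Core`, `consumer` and `io_event` are hypothetical, and Python generators stand in for hardware threads. A thread that reads an empty register is suspended until another thread writes it, and an external event is delivered by creating a new thread rather than by interrupting a running one.

```python
from collections import deque


class Register:
    """Dataflow register (assumed model): empty until written once."""

    def __init__(self):
        self.full = False
        self.value = None
        self.waiters = []  # threads suspended on this register


class Core:
    """Toy single core: runs ready threads, never preempts them."""

    def __init__(self):
        self.ready = deque()
        self.log = []

    def create(self, body, *args):
        # Thread creation; also the only way an event enters the core.
        self.ready.append(body(self, *args))

    def write(self, reg, value):
        reg.full, reg.value = True, value
        self.ready.extend(reg.waiters)  # wake every suspended reader
        reg.waiters.clear()

    def run(self):
        while self.ready:
            thread = self.ready.popleft()
            try:
                reg = next(thread)  # run until the thread needs a register
            except StopIteration:
                continue  # thread terminated
            if reg.full:
                self.ready.append(thread)  # operand present: stay ready
            else:
                reg.waiters.append(thread)  # operand missing: suspend


def consumer(core, reg):
    yield reg  # "waiting instruction": suspends while reg is empty
    core.log.append(("consumer got", reg.value))


def io_event(core, reg, payload):
    core.log.append(("event handled", payload))
    core.write(reg, payload * 2)
    yield from ()  # no waiting instruction; keeps this a generator


def demo():
    core = Core()
    reg = Register()
    core.create(consumer, reg)       # suspends on the empty register
    core.create(io_event, reg, 21)   # event arrives as a new thread
    core.run()
    return core.log
```

In this sketch the consumer never polls and is never interrupted: it simply becomes ready again once the event thread has filled the register.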

### EXAMPLE 128-CORE MICROGRID



32000+ hw threads

5MB distributed cache

shared MMU = single virtual address space, protection using capabilities

Weak cache coherency

no support for global memory atomics – instead synchronization using point-to-point messaging

Area estimates with CACTI: 100 mm² @ 35 nm

#### A PERSPECTIVE SHIFT

|                                   | Function call vs. bulk thread creation                                                          | Predictable loop vs. thread family                                                                                                          |
|-----------------------------------|-------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Core i7                           | Function call with 4 registers spilled:<br>30-100 cycles                                        | Predictable loop:<br>requires branch predictor + cache prefetching to maximize utilization<br>1+ cycles / iteration overhead                |
| D-RISC<br>WITH TMU<br>IN HARDWARE | Bulk thread creation of 1 thread, 31 "fresh" registers:<br>~15 cycles (7c sync, ~8c async)      | Thread family:<br>1 thread / "iteration" reuses common TMU and pipeline<br>no BP nor prefetch needed<br>0+ cycles / iteration overhead      |

# THE APPLE-CORE SOFTWARE STRATEGY



## THE "MAIN" ISSUES UNCOVERED IN APPLE-CORE

- Validation: how to detect errors, then compare with existing systems
  - need reference / base lines
- Resource management: cores, but also memory and NoC channels
  - how to reduce management overheads

- NB: these issues are general to all many-core processors, but exacerbated in Apple-CORE

#### VALIDATION

- Solution:
  1. Choose a **subset of the ISA** that can be emulated in legacy platforms
  2. Design the intermediate language SL to use only this subset to **constrain programs**
  3. Implement **compilation to both** the new platform and legacy systems and perform **comparative testing**
- This subset resembles fork/join with families and forward-only dataflow synchronization
- It is **deadlock-free**, mostly **deterministic** and **can be serialized** (cf. Cilk, Chapel)
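The serialization property can be illustrated with a small sketch, which is not SL itself. The function names and the `body(i, read_prev)` convention are assumptions made for this example. A family is a set of indexed threads where each thread may only read a value produced by its predecessor (forward-only dataflow), so the same body can run as a plain loop or as concurrent threads, and the two results can be compared.

```python
import threading
from concurrent.futures import Future


def run_family_sequential(body, start, limit, step, initial):
    """Serialized schedule: the family degenerates into a loop."""
    shared = initial
    for i in range(start, limit, step):
        # The predecessor's value is already known, so reading never blocks.
        shared = body(i, lambda s=shared: s)
    return shared


def run_family_concurrent(body, start, limit, step, initial):
    """Concurrent schedule: one OS thread per family index."""
    indices = list(range(start, limit, step))
    # channels[k] carries the value from thread k-1 to thread k.
    channels = [Future() for _ in range(len(indices) + 1)]
    channels[0].set_result(initial)

    def thread_main(k, i):
        # channels[k].result blocks until the predecessor has written.
        channels[k + 1].set_result(body(i, channels[k].result))

    threads = [
        threading.Thread(target=thread_main, args=(k, i))
        for k, i in enumerate(indices)
    ]
    for t in threads:  # fork
        t.start()
    for t in threads:  # join
        t.join()
    return channels[-1].result()


def sum_of_squares(i, read_prev):
    local = i * i               # independent work, can overlap
    return read_prev() + local  # forward-only dependency


sequential = run_family_sequential(sum_of_squares, 0, 10, 1, 0)
concurrent = run_family_concurrent(sum_of_squares, 0, 10, 1, 0)
```

Because thread k only ever waits on thread k-1, no cycle of waiting threads can form, and the sequential schedule is always a valid execution to test against.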




#### RESOURCE MANAGEMENT

- At the finest grain: provide thread-local storage (TLS) to threads created by TMU
  - Solution: pre-allocate and partition statically
- Concurrency resources: let programs define more concurrency than available, serialize on demand
- Algorithms: distributed memory allocator, garbage collection using reference counting
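"Serialize on demand" can be sketched as follows. This is an illustration under assumptions, not the Apple-CORE run-time: `ContextPool` and `tree_sum` are hypothetical names, and OS threads stand in for hardware thread contexts. The program exposes concurrency at every level of a recursion; each creation request runs concurrently only if a context is free, and otherwise runs inline in the caller.

```python
import threading


class ContextPool:
    """Hypothetical pool of thread contexts with fallback to serialization."""

    def __init__(self, size):
        self._free = threading.Semaphore(size)
        self._lock = threading.Lock()
        self.spawned = 0
        self.serialized = 0

    def run(self, fn, *args):
        """Start fn(*args); return a callable that joins and yields the result."""
        if self._free.acquire(blocking=False):  # never wait for a context
            with self._lock:
                self.spawned += 1
            box = []

            def work():
                try:
                    box.append(fn(*args))
                finally:
                    self._free.release()

            thread = threading.Thread(target=work)
            thread.start()

            def join():
                thread.join()
                return box[0]

            return join
        # No context available: serialize, i.e. run in the caller.
        with self._lock:
            self.serialized += 1
        result = fn(*args)
        return lambda: result


def tree_sum(pool, lo, hi):
    """Sum the integers in [lo, hi), exposing concurrency at every split."""
    if hi - lo <= 1:
        return lo
    mid = (lo + hi) // 2
    left = pool.run(tree_sum, pool, lo, mid)  # concurrent or inline
    right = tree_sum(pool, mid, hi)
    return left() + right


small_pool = ContextPool(size=4)
total = tree_sum(small_pool, 0, 64)

empty_pool = ContextPool(size=0)
total_serial = tree_sum(empty_pool, 0, 64)
```

Since a request never blocks waiting for a context, running out of resources cannot deadlock the program; it only reduces the concurrency actually used.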

#### RESOURCE MANAGEMENT

- Application components:
  - OS allocates and deallocates cores, memory and network links for top-level family entry points
  - this is called SEP and is distributed
- Either **explicit allocation** in programs, or annotated static requirements, aggregated at run-time by RTS/OS

### RESULTS: MEMORY-BOUND KERNELS



Legacy platform = MacBook Pro, Core 2 Duo @ 2.4GHz

area(1 Core 2 Duo core) ~ area(32 Microgrid cores)

### RESULTS: THROUGHPUT WORKLOADS



Intel IXP = embedded processor specialized for cryptographic workloads

Main result: **Microgrids are general-purpose**, i.e. not specialized, **yet compete** on throughput with state-of-the-art specialized hardware

#### RESULTS, WHAT'S NEXT?

- ✓ built enough infrastructure to fit the F/OSS landscape
   yet can't reuse most existing OS code: *no interrupts, no traps*
- ✓ as planned, **higher performance per area and per watt** via hand-coded benchmarks: *granularity in SPEC is too coarse*
- Follow-up research areas:
  - *Internal* issues: memory consistency, scalable cache protocols, ISA semantics, etc.
  - *External* issues (from outside the architecture): how to virtualize? how to place tasks over so many "workers"? how to port existing OS code?
  - *Fundamental* issues: concurrent complexity theory?


#### THANK YOU!

- More information:
  - http://www.apple-core.info/
  - http://www.svp-home.org/

### SVP CONCURRENCY MANAGEMENT PROTOCOL

| Operation (operands → result)                                      | Description                           |
|--------------------------------------------------------------------|---------------------------------------|
| allocate<br>\$Place → \$F                                          | Allocate a family context             |
| setstart/setlimit/setstep/setblock<br>\$F, V → ∅                   | Prepare family creation               |
| <b>create</b><br>\$F, \$PC → \$ack                                 | Start bulk creation of threads        |
| rput \$F, R, \$V → ∅<br>rget \$F, R → \$V                          | Read/write dataflow channels remotely |
| <b>sync</b><br>\$F → \$ack                                         | Bulk synchronize on termination       |
| release<br>\$F → ∅                                                 | De-allocate a family context          |
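The order in which these operations are typically combined can be sketched with a small software emulation. Only the operation names come from the table; everything else is an assumption made for illustration: the `Place` and `Family` classes, the exclusive upper bound for `setlimit`, the dictionary standing in for dataflow channels, and the single background thread that runs the family in index order.

```python
import threading


class Place:
    """Hypothetical set of cores offering a number of family contexts."""

    def __init__(self, contexts):
        self.free = contexts


class Family:
    """Hypothetical family context, as returned by allocate."""

    def __init__(self, place):
        self.place = place
        self.start, self.limit, self.step, self.block = 0, 1, 1, 1
        self.regs = {}      # stands in for the family's dataflow channels
        self.worker = None


def allocate(place):
    if place.free == 0:
        raise RuntimeError("no family context available")
    place.free -= 1
    return Family(place)


def setstart(f, v):
    f.start = v


def setlimit(f, v):
    f.limit = v  # assumed exclusive, like Python's range()


def setstep(f, v):
    f.step = v


def setblock(f, v):
    f.block = v  # recorded only; this emulation runs one thread at a time


def create(f, pc):
    """Start the family asynchronously and acknowledge immediately."""
    def run():
        for i in range(f.start, f.limit, f.step):
            pc(i, f.regs)
    f.worker = threading.Thread(target=run)
    f.worker.start()
    return True


def rput(f, r, v):
    f.regs[r] = v


def rget(f, r):
    return f.regs[r]


def sync(f):
    """Wait until every thread of the family has terminated."""
    f.worker.join()
    return True


def release(f):
    f.place.free += 1


def add_index(i, regs):
    regs["acc"] += i  # thread body: accumulate the thread index


def demo():
    place = Place(contexts=1)
    f = allocate(place)
    setstart(f, 0)
    setlimit(f, 10)
    setstep(f, 1)
    setblock(f, 4)
    rput(f, "acc", 0)       # pass an input before creation
    create(f, add_index)
    sync(f)
    total = rget(f, "acc")  # read the output after termination
    release(f)
    return total, place.free
```

In this sketch `create` returns before the family has finished, so the creating thread can continue with other work until it reaches `sync`.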

# EXTRA - A PERSPECTIVE SHIFT

|                                   | Thread creation                                                                                                              | Context switch                                                         | Thread cleanup                                                                  |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| Core i7<br>Linux                  | (pre-allocated stack)<br>>10000 cycles in pipeline                                                                           | syscalls, thread switch, trap, interrupt<br>>10000 cycles in pipeline  | >10000 cycles in pipeline                                                       |
| D-RISC<br>WITH TMU<br>IN HARDWARE | Bulk creation (metadata allocation for N threads): ~15 cycles (7c sync, ~8c async)<br>Thread creation: 1 cycle, async        | at every waiting instruction, also I/O events                          | Thread cleanup: 1 cycle, async<br>Bulk synchronizer cleanup: 2 cycles, async    |