Design Space Exploration Tool Flow


HEINZ NIXDORF INSTITUTE University of Paderborn Schaltungstechnik Dr.-Ing. Mario Porrmann

Design Space Exploration for Memory Subsystems of VLIW Architectures

Thorsten Jungeblut¹, Gregor Sievers, Mario Porrmann¹, Ulrich Rückert²

¹ System and Circuit Technology, University of Paderborn
² Cognitive Interaction Technology – Center of Excellence, Bielefeld University

Motivation(1)

Universität Paderborn Schaltungstechnik Prof. Dr.-Ing. Ulrich Rückert

- Increasing complexity of mobile applications
  - More functionality
  - New algorithms (LTE, LTE Advanced)
  - Multimedia applications (video, 3-D, …)

- Inflexible hardware → flexible software implementation (Software-Defined Radio, SDR) → powerful CPU necessary

- High demands on resource efficiency!


Motivation(2)

• In embedded processors, the size of on-chip memories is limited

• External (SDRAM) memory
  – Low cost per bit
  – Slow, high latency

• Caches buffer accesses to the external memory
  – Entire cache lines are loaded from the external memory
  – Exploits temporal and spatial locality

• The size of the caches is limited by the operating frequency of the processor core
  – Cache hierarchy
  – Level-1 cache is matched to the core frequency → additional levels with higher latency

[Figure: memory hierarchy, top to bottom: register, level-1 cache, level-2 cache, main memory, hard disk]


Outline

• Concurrent design flow for DSE
• VLIW architecture/Cache architecture
• Prototyping environment
• Performance results and resource requirements
• Conclusion/Outlook

[Figure: overview combining the tool flow (specification, compiler tool chain, RTL simulation, synthesis, emulation, ASIC realization) with the VLIW pipeline (FE, DC, RD, EX, ME, WR)]

Design Space Exploration Tool Flow

Goal: Highly automated design flow

[Figure: tool flow diagram. Labels recovered from the diagram, in the order they appear:
  Specification: Vice-UPSLA, UPSLA
  Software path: Benchmarks (Source Code) → Compiler → Assembler Code → Assembler → Object-Files → Linker → Executables → Software Simulator → Profiling-Data
  Hardware path: RTL-Description → RTL-Code → RTL-Simulator; RTL-Code → Synthesis-Tool → Netlist → Emulator (Prototype) / ASIC-Realization
  Results: Functional Verification, Visualization, Resource Efficiency]
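The automated flow above lends itself to a scripted sweep over processor and cache configurations. The following is only a minimal sketch of such a sweep, not the tool flow used by the authors: the `Config` and `Measurement` types, the parameter grid, and the `evaluate` callback (which would wrap a simulator or emulator run) are all hypothetical names introduced for illustration. Ranking by energy-delay product is one possible metric, chosen here because the results slides report energy and energy-delay.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Config:
    issue_slots: int   # number of VLIW issue slots
    ldst_units: int    # number of LD/ST units (bounded by the D-cache ports)
    alloc_mode: str    # write-miss allocation mode of the D-cache


@dataclass(frozen=True)
class Measurement:
    cycles: int        # execution time in clock cycles
    energy_j: float    # consumed energy in joules


ALLOC_MODES = ("fetch-on-write-miss", "allocate-on-write-miss")


def enumerate_configs(slots: Iterable[int] = (1, 2, 4),
                      max_ldst: int = 2) -> Iterator[Config]:
    """Yield every slot / LD-ST / allocation-mode combination (hypothetical grid)."""
    for n, mode in product(slots, ALLOC_MODES):
        for ldst in range(1, min(n, max_ldst) + 1):
            yield Config(n, ldst, mode)


def energy_delay(m: Measurement, f_hz: float) -> float:
    """Energy-delay product in joule-seconds."""
    return m.energy_j * (m.cycles / f_hz)


def explore(configs: Iterable[Config],
            evaluate: Callable[[Config], Measurement],
            f_hz: float) -> List[Tuple[float, Config]]:
    """Rank configurations by ascending energy-delay product."""
    scored = [(energy_delay(evaluate(c), f_hz), c) for c in configs]
    return sorted(scored, key=lambda item: item[0])
```

In use, `evaluate` would run one benchmark on one configuration and return the measured cycle count and energy; `explore(enumerate_configs(), evaluate, 400e6)` then returns the configurations ordered from best to worst.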

The CoreVA architecture: Modular Design

[Figure: CoreVA pipeline. Stages: FE (instruction fetch / L1 instruction cache, fed from the instruction memory), DC (instruction decode), RD (register read), EX (execute), ME (L1 data cache, backed by the data memory), WR (register write). The execute stage is drawn with four ALUs, four multiply/divide units (*, /) and four LD/ST units, together with the register file, the condition register and a bypass path.]

Dynamically Reconfigurable Platform

RAPTOR-X64

Prototypic implementation of microelectronic circuits on FPGAs
• Up to 200 million transistors emulated
• Flexible, modular concept: PCI bus motherboard with up to six modules
• Partial dynamic reconfiguration at high reconfiguration bandwidth

[Figure: RAPTOR-X64 block diagram. Recoverable blocks: USB controller (USB 2.0 High-Speed, USB-OTG); USB logic (local bus master, local bus slave, OTG control); system monitor (voltage, temperature, analog inputs); clock synthesis and distribution; TST-JTAG and CFG-JTAG; PCI bus bridge (master, slave, DMA); control and configuration logic (arbiter, MMU, diagnostics, CLK, configuration, etc.); Xilinx SystemACE CF (CF access, JTAG control); dual-port SRAM; module slots (SelectMAP, CFG-JTAG, CTRL, SMB); local bus (32-bit data / 32-bit address); PCI-X bus (64-bit data / 32-bit address); broadcast bus.]

System Environment

• Multi-master system bus
• Generic I/D cache interfaces to external memory
• 4 GB SDRAM
• Penalty cycles on cache misses:
  – Instruction cache: >73 clock cycles
  – Data cache: >61 clock cycles
• Internal memories can be accessed from the host system
• Generic interface for dedicated hardware extensions
• 9.1 Gbit/s external bandwidth

[Figure: system architecture block diagram. Recoverable labels: host PC, Xilinx FPGA, local bus interface, system bus, arbiter, system bus controller, SDRAM controller, SDRAM, instruction cache, data cache, MMIO, FIFO, CRC, UART, CoreVA CPU, ASIC, RAPTOR2000 system]
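To get a feel for what the miss penalties above mean for execution time, the following first-order estimate may help. It is a sketch under simplifying assumptions that are not taken from the slides: every miss stalls the core for the full penalty with no overlap (the non-blocking data cache can hide part of this), and the penalties are fixed at the quoted lower bounds of 73 and 61 cycles. The workload numbers in the usage note are hypothetical.

```python
# Lower bounds quoted on the slide (">73" and ">61" clock cycles).
I_MISS_PENALTY = 73
D_MISS_PENALTY = 61


def stall_cycles(accesses: int, hit_rate: float, penalty: int) -> float:
    """Stall cycles caused by misses, assuming no overlap between misses."""
    return accesses * (1.0 - hit_rate) * penalty


def stall_fraction(base_cycles: int,
                   i_accesses: int, i_hit_rate: float,
                   d_accesses: int, d_hit_rate: float) -> float:
    """Share of total execution time spent waiting on cache misses."""
    stalls = (stall_cycles(i_accesses, i_hit_rate, I_MISS_PENALTY)
              + stall_cycles(d_accesses, d_hit_rate, D_MISS_PENALTY))
    return stalls / (base_cycles + stalls)
```

For a hypothetical run with 1,000,000 stall-free cycles, 1,000,000 instruction fetches at a 99.9% hit rate and 300,000 data accesses at a 99% hit rate, `stall_fraction(1_000_000, 1_000_000, 0.999, 300_000, 0.99)` comes out at roughly 0.20, so even very high hit rates can leave a noticeable share of stall cycles at these penalties.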

Cache Architecture Overview

• I-Cache:
  – 32 bit per issue slot → 4-slot configuration: 128-bit interface
  – Direct mapped (low latency/power/area)
  – 16 kB cache size, 64 bytes line width (configurable)

• D-Cache:
  – 1-/2-port configuration possible
  – Direct mapped
  – 16 kB cache size, 32 bytes line width (configurable)
  – Write-back policy, non-blocking
  – Two programmable allocation modes: fetch-on-write-miss/allocate-on-write-miss

• I-/D-caches can be dynamically configured as scratch-pad memories
  – Higher performance for timing-critical parts of an application (cache misses are avoided)
  – Energy savings because external memory accesses are eliminated
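For a direct-mapped cache, the size and line width listed above fix how an address splits into tag, line index and byte offset. The sketch below illustrates that split for the default configurations (16 kB with 64-byte lines for the I-cache, 16 kB with 32-byte lines for the D-cache). It assumes byte addressing, 16 kB = 16 × 1024 bytes and power-of-two sizes; the `CacheGeometry` type and `split_address` helper are hypothetical names, not part of the CoreVA design.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int   # total cache capacity
    line_bytes: int   # cache line width

    @property
    def num_lines(self) -> int:
        return self.size_bytes // self.line_bytes

    @property
    def offset_bits(self) -> int:
        # log2 of the line width (power of two assumed)
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self) -> int:
        # log2 of the number of lines (power of two assumed)
        return self.num_lines.bit_length() - 1


def split_address(addr: int, geo: CacheGeometry) -> Tuple[int, int, int]:
    """Split a byte address into (tag, line index, byte offset)."""
    offset = addr & (geo.line_bytes - 1)
    index = (addr >> geo.offset_bits) & (geo.num_lines - 1)
    tag = addr >> (geo.offset_bits + geo.index_bits)
    return tag, index, offset


ICACHE = CacheGeometry(16 * 1024, 64)   # 256 lines, 6 offset bits, 8 index bits
DCACHE = CacheGeometry(16 * 1024, 32)   # 512 lines, 5 offset bits, 9 index bits
```

With these geometries, the address 0x12345678 maps to line 0x59 in the I-cache and line 0xB3 in the D-cache, with tag 0x48D1 in both cases. Two addresses that differ only in the tag compete for the same line, which is the conflict behaviour that adding associativity would relax.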

Cache Architecture Synthesis Results


Application Evaluation Different Cache Configurations

• Applications: synthetic benchmarks, baseband, cryptography, multimedia, LTE protocol stack
• A ratio of 50% LD/ST units to functional units (FUs) is the best trade-off
• Concurrent LD/ST ≠ speedup! The speedup depends on scheduling.

Results(1) Hit Rates

• High hit rates for all applications
• Allocate-on-write-miss ↔ Fetch-on-write-miss

Results(2) Portion of Stall Cycles to Execution Time

• Latencies of SDRAM accesses may vary depending on the order, distribution and frequency of the accesses.

Results(3) Performance


Results(4) Energy


Results(5) Energy-Delay


The CoreVA VLIW architecture ASIC realization

• 4-issue VLIW processor, 2x MLA, DIV
• 1-port I-cache (16 kByte, 128 bit), 2-port D-cache (16 kByte, 32 bit)
• 65 nm STMicroelectronics, low power (thick oxide), 1.2 V mixed-VT, 1.8 V I/Os (configurable pull-ups)
• Hardware extensions (incl. ECC)

Frequency: 400 MHz
Area (32 kB SRAM): 2.7 mm²
Power consumption: 0.1 W
Peak throughput: 1.6 GOP/s in scalar mode, 3.2 GOP/s in SIMD mode

[Figure: die layout, 1.66 mm × 1.66 mm, with labeled regions: register file, instruction cache, data cache, execute, ECC, comp. cell]
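The peak throughput figures on this slide follow from the clock frequency and the issue width. The short calculation below reproduces them as a sanity check. It assumes one operation per issue slot per cycle in scalar mode and two operations per slot per cycle in SIMD mode; the factor of two is inferred from the 1.6 to 3.2 GOP/s ratio and is not stated on the slide.

```python
F_CLK_HZ = 400e6      # clock frequency from the slide
ISSUE_SLOTS = 4       # 4-issue VLIW
POWER_W = 0.1         # power consumption from the slide
SIMD_LANES = 2        # assumption: two operations per slot in SIMD mode


def peak_gops(f_hz: float, slots: int, lanes: int = 1) -> float:
    """Peak throughput in giga-operations per second."""
    return f_hz * slots * lanes / 1e9


scalar_gops = peak_gops(F_CLK_HZ, ISSUE_SLOTS)             # 4 x 400 MHz
simd_gops = peak_gops(F_CLK_HZ, ISSUE_SLOTS, SIMD_LANES)   # doubled in SIMD mode
scalar_gops_per_watt = scalar_gops / POWER_W
```

Under these assumptions the scalar and SIMD values match the 1.6 GOP/s and 3.2 GOP/s quoted above, and the scalar efficiency works out to about 16 GOP/s per watt at 0.1 W.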

Conclusion/Outlook

• Framework for the design-space exploration of processor architectures and memory subsystems
• Rapid prototyping environment RAPTOR
• Dynamically configurable cache architecture
• 2-slot configuration with allocate-on-write-miss shows the best energy trade-off
• Performance/energy gains of up to 25%
• Future work:
  – Include associativity
  – Combine caches and scratch-pad memories to enhance memory bandwidth

Questions?

[Figure: recap of four diagrams from the talk, captioned "Design space exploration" (tool flow), "VLIW architecture" (pipeline), "Rapid prototyping" (RAPTOR2000 system) and "System Architecture"]

Thank you for your attention!

Heinz Nixdorf Institute
University of Paderborn
System and Circuit Technology
Dipl.-Ing. Thorsten Jungeblut
Fürstenallee 11
33102 Paderborn

Tel.: 0 52 51/60 63 39
Fax: 0 52 51/60 63 51
Email: [email protected]
http://wwwhni.upb.de/sct
