DeepLearning - Paul Mackerras

Microwatt update by Paul Mackerras (2024-May-29)

Introduction

Microwatt started as a demo/proof-of-concept for the announcement that the Power ISA was being made open (August 2019).

Code hosted on GitHub, updated through pull requests

Use automated testing to catch bugs early

Targets simulation and synthesis for FPGAs

Artix-7, ECP5

Digilent Arty A7-100 board, Lambda Concepts ECPIX-5, OrangeCrab

Implements SFS / SFFS subsets of Power ISA v3.1C

Wishbone memory interface, some peripherals from the LiteX project

Has optional non-pipelined floating-point unit (FPU)

Has radix MMU (accepts any radix-tree layout allowed by ISA)

Runs Linux (userspace compiled with -mno-altivec -mno-vsx)

System overview

Pipeline overview

Recent developments

Two-stage execute pipeline (up from one)

Fits better with 2-stage load/store pipeline

Allows 32-bit multiplies to execute without stalling the pipeline

Bypass paths implemented from both stages back to decode2

Integer division and 64-bit multiplications done by FPU (if present)

Division uses reciprocal estimation, Newton-Raphson refinement, multiply, adjust
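
The four steps named above (reciprocal estimate, Newton-Raphson refinement, multiply, adjust) can be illustrated with a small integer model. This is only a sketch under stated assumptions, not Microwatt's actual data path: the function name `nr_divide`, the linear initial estimate, the 64-bit fixed-point format, and the iteration count are all chosen here for illustration.

```python
def nr_divide(n, d, bits=64, iterations=5):
    """Unsigned n // d via a fixed-point reciprocal (illustrative model only)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if not (0 <= n < (1 << bits) and 0 < d < (1 << bits)):
        raise ValueError("operands must fit in `bits` bits")
    one = 1 << bits

    # Normalise the divisor so that m = d_norm / 2**bits lies in [0.5, 1).
    shift = bits - d.bit_length()
    d_norm = d << shift

    # Reciprocal estimate: linear approximation 48/17 - (32/17)*m of 1/m,
    # accurate to about 4 bits (an assumption; hardware may use a table).
    x = (48 * one) // 17 - (32 * d_norm) // 17

    # Newton-Raphson refinement: x <- x * (2 - m*x).
    # Each step roughly doubles the number of correct bits.
    for _ in range(iterations):
        mx = (d_norm * x) >> bits
        x = (x * (2 * one - mx)) >> bits

    # Multiply: n/d = n * 2**shift * x / 2**(2*bits).
    q = ((n * x) << shift) >> (2 * bits)

    # Adjust: the truncated estimate can be off by a few units,
    # so correct the quotient using the exact remainder.
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r
```

The final adjust step is what makes the result exact: the reciprocal only needs to be close enough that the correction is a small, bounded number of steps.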

Simple Branch Target Cache in fetch1

Stores instruction address and target address for direct branch instructions

Enables zero-overhead loops provided loop body is >= 3 instructions long
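
As a rough illustration of the idea, a branch target cache can be modelled as a small table keyed by the branch's instruction address. The class below is a hypothetical sketch: the entry count, the direct-mapped organisation, and the method names are assumptions, and it ignores timing and 8-byte prefixed instructions.

```python
class BranchTargetCache:
    """Toy direct-mapped branch target cache (sizes are assumptions)."""

    def __init__(self, entries=64):
        assert entries > 0 and entries & (entries - 1) == 0, "power of two"
        self.entries = entries
        self.tags = [None] * entries    # full address of the branch instruction
        self.targets = [0] * entries    # address the branch jumps to

    def _index(self, addr):
        # Instructions are 4 bytes, so drop the two low address bits.
        return (addr >> 2) % self.entries

    def update(self, branch_addr, target_addr):
        """Record a taken direct branch once its target is known."""
        i = self._index(branch_addr)
        self.tags[i] = branch_addr
        self.targets[i] = target_addr

    def predict_next(self, fetch_addr):
        """Return the next fetch address: cached target on a hit, else fall through."""
        i = self._index(fetch_addr)
        if self.tags[i] == fetch_addr:
            return self.targets[i]
        return fetch_addr + 4


# Usage: a loop whose closing branch at 0x1008 jumps back to 0x1000.
btc = BranchTargetCache()
btc.update(0x1008, 0x1000)
next_addr = btc.predict_next(0x1008)   # redirected to the loop start
```

Because the prediction is available in the fetch stage, the redirect does not have to wait for the branch to be decoded or executed.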

Moved instruction address translation from icache to fetch1

2-entry “ERAT” plus iTLB in fetch1

Icache now accessed using real addresses => set size no longer constrained to 4kB

Instruction pre-decoding before icache

Compute a 10-bit instruction index from instruction word

Index replaces major opcode in icache => 36 bits per instruction in icache

(index for illegal instructions is 0x200 + major opcode)

Instruction indexes assigned in groups so range tests can be used

e.g., does it have RB operand? does it access FPRs vs GPRs?
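
The bit arithmetic works out as 32 - 6 + 10 = 36: the 6-bit major opcode is dropped and the 10-bit index is stored in its place. A minimal sketch of that packing follows. The placement of the index in the top bits, the helper names, and the group boundary used for the range test are assumptions made for illustration; only the 10-bit width and the 0x200 + major opcode rule come from the points above.

```python
INDEX_BITS = 10
LOW_BITS = 26            # instruction word minus the 6-bit major opcode
ILLEGAL_BASE = 0x200     # illegal instructions map to 0x200 + major opcode


def major_opcode(insn):
    """Top 6 bits of a 32-bit instruction word."""
    return (insn >> 26) & 0x3F


def illegal_index(insn):
    """Index assigned to an instruction that does not decode."""
    return ILLEGAL_BASE + major_opcode(insn)


def pack_icache_word(index, insn):
    """Build the 36-bit icache entry (index in the top bits is an assumption)."""
    assert 0 <= index < (1 << INDEX_BITS)
    return (index << LOW_BITS) | (insn & ((1 << LOW_BITS) - 1))


def unpack_icache_word(word):
    """Split a 36-bit entry back into (index, low 26 instruction bits)."""
    return word >> LOW_BITS, word & ((1 << LOW_BITS) - 1)


# Hypothetical group boundaries: if all instructions with an RB operand are
# given indexes in one contiguous range, the test is a pair of comparisons.
RB_GROUP_FIRST, RB_GROUP_LAST = 0x040, 0x0FF


def has_rb_operand(index):
    return RB_GROUP_FIRST <= index <= RB_GROUP_LAST
```

Storing the major opcode inside the illegal-instruction index keeps enough information to report which opcode failed to decode.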

Implemented (most) SFFS instructions added in v3.1

br[hwd], cnt[lt]zdm, set[n]bc[r]

Prefixed instructions

Prefixed load/store instructions, paddi, pnop, etc. (SFFS subset)

Added support for vestigial 1-entry “partition table” to MMU

Process table base address read from (PTCR + 8)

Added basic PMU (Performance Monitor Unit)

Implements architected events and a few others

Doesn’t implement random sampling

Added snooping logic to dcache and icache

Added micro-SD card interface using LiteX “litesdcard” logic

Added GPIO (general-purpose I/O) interface

Access RAM arrays synchronously wherever possible

Enables use of block RAM in FPGA implementations

GPR/FPR register file requires addresses to be supplied from decode1 rather than decode2; addresses generated from instruction index and instruction word in parallel with decode table lookup

Many performance/timing improvements and logic simplifications

Architecture compliance

Aims to be a complete SFFS v3.1C implementation

Instructions implemented as no-ops:

sync, tlbsync, eieio, dcbf, dcbst, dcbt, dcbtst, icbt, nop, pnop, reserved no-ops

Not implemented:

hashst[p], hashchk[p] and related SPRs

VMX, VSX, MMA, DFP, Hashed page table, SLB, LPAR, SMT, EBB, BHRB, stream prefetch, load/store multiple and string, power-saving mode, etc.

SPRs: AMR/IAMR etc., PPR[32], LPCR, LPIDR, PCR, HRMOR, TIR, PSSCR

Currently in development (cost ~800–900 LUTs on Artix-7):

MSR[HV] always 1, hrfid, HSRR0/1, HEIR, HEAI interrupts, etc.

FSCR, HFSCR and related interrupts

CTRL, DSCR (two SPR numbers), VRSAVE

scv, rfscv, wait, cfuged, pdepd, pextd instructions

Possibly also required:

Quadword loads/stores (lq, stq, lqarx, stqcx.)

Some degree of AIL/HAIL support (may need vestigial LPCR)

Some degree of EVIRT support

Debug facilities: CIABR, DAWR[X]n

Instruction fetch from cache-inhibited memory

Writes to timebase register

DEXCR? (with no effect)

Anything else?

Testing

1000 randomly generated code sequences, checked against results from POWER9

Results compared via a checksum of integer register contents

Doesn’t include floating-point or privileged instructions

simple_random program to generate pseudo-random instruction sequences

Can run on two machines and compare results
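
The comparison scheme can be sketched as follows. The actual checksum used by the test suite is not stated here, so CRC-32 over the big-endian register image is a stand-in, and the function names are hypothetical.

```python
import struct
import zlib


def gpr_checksum(gprs):
    """Checksum of 32 64-bit integer registers (CRC-32 is an assumption)."""
    if len(gprs) != 32:
        raise ValueError("expected 32 GPR values")
    return zlib.crc32(struct.pack(">32Q", *gprs))


def mismatches(checksums_a, checksums_b):
    """Indexes of test sequences whose checksums differ between two machines."""
    return [i for i, (a, b) in enumerate(zip(checksums_a, checksums_b)) if a != b]


# Usage: the same sequence run on a reference machine and on the design
# under test should leave identical register contents.
reference = [i * 0x0101010101010101 for i in range(32)]
under_test = list(reference)
under_test[7] ^= 1 << 40          # simulate a single wrong result bit
ok = gpr_checksum(reference) == gpr_checksum(list(reference))
bad = mismatches([gpr_checksum(reference)], [gpr_checksum(under_test)])
```

Comparing a single checksum per sequence keeps the stored reference data small, at the cost of only reporting that a sequence failed rather than which register differed.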

Unit test programs

Generally these try executing specific instructions on specific operands, or exercise specific system functionality

branch_alias, decrementer, fpu, illegal, misc, mmu, modes, pmu, prefix, privileged, reservation, sc, spr_read, trace, xics

Boot Linux and observe behavior

Floating-point test suites, benchmarks, etc. can be run under Linux

Experiments

Implemented lq, stq, lqarx, stqcx.

Required “instruction doubling” to handle 128 bits

Atomicity requirements hard to implement (but also hard to observe)

Partial VMX/VSX implementation

Also used instruction doubling

Implemented instructions used in glibc compiled for POWER9

Cache coherency for multi-core implementation

Including correct behavior for larx/stcx, sync, dcbf, etc.

Hardware random number generator

Micro-programmed control for FPU (in development)

Memory management unit

ISA defines two address translation schemes

Hashed page table (HPT), used by AIX and IBM i and older Linux kernels

Radix page table, used by recent Linux kernels

Microwatt implements radix and not HPT

ISA defines a very general tree structure for radix trees

Memory management unit

Radix tree page directory entry format:

NLS = Next level size, number of bits used to index next level of tree

Two standard layouts defined in ISA

52-bit EA, 4 levels, 64k page size (index fields 13, 9, 9, 5 bits)

52-bit EA, 4 levels, 4k page size (index fields 13, 9, 9, 9 bits)

Microwatt implements a general radix tree walker state machine

Any NLS value from 5 to 16 is permitted

Address space size from 31 bits (2 GiB) to 62 bits (4 EiB)

Any power-of-2 page size >= 4 KiB

Any number of levels between 1 and 10
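
For the standard 64k layout the index fields plus the page offset account for the whole address: 13 + 9 + 9 + 5 + 16 = 52 bits. A general walker only needs the root size and the NLS field of each directory entry. The sketch below is a software model, not the Microwatt state machine: memory is a Python dict, the entry field positions (valid bit, leaf bit, next-level base, NLS in the low 5 bits) are a simplified reading of the ISA format and should be checked against the specification, and all names are hypothetical.

```python
VALID = 1 << 63                  # entry is valid (assumed position)
LEAF = 1 << 62                   # entry is a PTE rather than a PDE (assumed)
NLB_MASK = 0x00FFFFFFFFFFFF00    # next-level base address (assumed)
RPN_MASK = 0x00FFFFFFFFFFF000    # real page address in a leaf entry (assumed)
NLS_MASK = 0x1F                  # next-level size, in index bits


class TranslationFault(Exception):
    pass


def pde(next_base, nls):
    """Build a directory entry pointing at the next level."""
    return VALID | (next_base & NLB_MASK) | (nls & NLS_MASK)


def pte(real_addr):
    """Build a leaf entry mapping to real_addr."""
    return VALID | LEAF | (real_addr & RPN_MASK)


def walk(mem, root_base, root_bits, addr_bits, ea):
    """Translate ea; returns (real address, log2 of the page size)."""
    base, nls, shift = root_base, root_bits, addr_bits
    while True:
        # Each level must index 5..16 bits and leave at least a 4 KiB page.
        if not 5 <= nls <= 16 or shift - nls < 12:
            raise TranslationFault("unsupported tree layout")
        shift -= nls
        index = (ea >> shift) & ((1 << nls) - 1)
        entry = mem.get(base + 8 * index, 0)
        if not entry & VALID:
            raise TranslationFault("invalid entry")
        if entry & LEAF:
            offset_mask = (1 << shift) - 1
            return (entry & RPN_MASK & ~offset_mask) | (ea & offset_mask), shift
        base, nls = entry & NLB_MASK, entry & NLS_MASK


# Usage: a 52-bit, 4-level tree with 64k pages (index fields 13, 9, 9, 5).
ea = 0x123456789ABC
mem = {
    0x10000 + 8 * ((ea >> 39) & 0x1FFF): pde(0x20000, 9),
    0x20000 + 8 * ((ea >> 30) & 0x1FF): pde(0x30000, 9),
    0x30000 + 8 * ((ea >> 21) & 0x1FF): pde(0x40000, 5),
    0x40000 + 8 * ((ea >> 16) & 0x1F): pte(0xABCD0000),
}
real_addr, page_bits = walk(mem, 0x10000, 13, 52, ea)
```

Since every level consumes at least 5 bits and 12 bits must remain for the page offset, a 62-bit address space allows at most (62 - 12) / 5 = 10 levels, which matches the limit listed above.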

Memory management unit – implementation

Add level-1 TLBs to instruction and data caches

4kB page size, looked up in parallel with cache tags

Instruction TLB is direct-mapped, 64 entries

Data TLB is 2-way set associative, 128 entries total

MMU is a state machine

Sends a series of requests to the dcache to read the process table and PTEs

Eventually sends the translation to the dcache or icache

EA bits inserted if page size > 4kB

Currently no caching of PTEs or PDEs in the MMU

Nor any caching of partial translation results (no “page walk cache”)

Just the L1 iTLB and dTLB inside the instruction and data caches

Microwatt doesn’t implement the partition table

Instead has a PRTBL register (SPR) to point to the process table
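
The TLB organisation and the "EA bits inserted" step can be modelled together. This is a hedged sketch: the class and method names are hypothetical, the round-robin replacement policy is an assumption (the replacement policy is not stated here), and only the sizes (64-entry direct-mapped, 128-entry 2-way, 4kB granularity) come from the points above.

```python
PAGE_SHIFT = 12   # TLB entries always describe 4kB pages


class SetAssocTLB:
    """Toy TLB with 4kB entries; ways=1 gives a direct-mapped TLB."""

    def __init__(self, entries=128, ways=2):
        self.ways = ways
        self.sets = entries // ways
        self.table = [[None] * ways for _ in range(self.sets)]
        self.victim = [0] * self.sets   # round-robin replacement (assumption)

    def _set_index(self, ea):
        return (ea >> PAGE_SHIFT) % self.sets

    def lookup(self, ea):
        """Return the real address on a hit, or None on a miss."""
        vpn = ea >> PAGE_SHIFT
        for entry in self.table[self._set_index(ea)]:
            if entry is not None and entry[0] == vpn:
                return (entry[1] << PAGE_SHIFT) | (ea & ((1 << PAGE_SHIFT) - 1))
        return None

    def insert(self, ea, page_base, page_bits):
        """Install a 4kB entry from a translation with page size 2**page_bits."""
        # For pages larger than 4kB, the EA bits between bit 12 and the page
        # size are inserted into the real page number, so the entry maps the
        # specific 4kB piece of the large page that was accessed.
        offset_mask = (1 << page_bits) - 1
        real = (page_base & ~offset_mask) | (ea & offset_mask)
        s = self._set_index(ea)
        way = self.victim[s]
        self.table[s][way] = (ea >> PAGE_SHIFT, real >> PAGE_SHIFT)
        self.victim[s] = (way + 1) % self.ways


# Usage: a 64k page installed after a miss at EA 0x12345678.
dtlb = SetAssocTLB(entries=128, ways=2)
dtlb.insert(0x12345678, page_base=0xABCD0000, page_bits=16)
hit = dtlb.lookup(0x12345FFF)       # same 4kB piece: hit
miss = dtlb.lookup(0x12346000)      # same 64k page, different 4kB piece: miss
```

One consequence of fixed 4kB entries is that a large page can occupy several TLB entries, each filled by its own table walk.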

Floating-point unit

Why do an FPU?

It’s in the architecture

Don’t want to define a soft-float ABI – want to avoid fragmentation

How small can we make it?

What are the difficulties?

Divide and square-root instructions

Fused multiply-add instructions

Handling denormalized numbers and exception conditions correctly

Getting correct results down to the last bit in all rounding modes

Results:

Control uses a state machine: handle one instruction at a time, not pipelined

Uses about 4500 LUTs on the Xilinx Artix-7 (~20% of the total SoC)

Add/subtract generally take 5–12 cycles, multiply takes 8–15 cycles

FPU – Data paths (simplified)
