Microwatt update by Paul Mackerras (2024-May-29)
Introduction
Microwatt started as a demo/proof-of-concept for the announcement of the Power ISA being made open (August 2019).
Code hosted on GitHub, updated through pull requests
Use automated testing to catch bugs early
Targets simulation and synthesis for FPGAs
Artix-7, ECP5
Digilent Arty A7-100 board, Lambda Concepts ECPIX-5, OrangeCrab
Implements SFS / SFFS subsets of Power ISA v3.1C
Wishbone memory interface, some peripherals from LiteX project
Has optional non-pipelined floating-point unit (FPU)
Has radix MMU (accepts any radix-tree layout allowed by ISA)
Runs Linux (userspace compiled with -mno-altivec -mno-vsx)
System overview
Pipeline Overview
Recent Developments
Two-stage execute pipeline (up from one)
Fits better with 2-stage load/store pipeline
Allows 32-bit multiplies to execute without stalling the pipeline
Bypass paths implemented from both stages back to decode2
Integer division and 64-bit multiplications done by FPU (if present)
Division uses reciprocal estimation, Newton-Raphson refinement, multiply, adjust
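The four division steps named above (reciprocal estimate, Newton-Raphson refinement, multiply, adjust) can be sketched in software. This is a minimal illustration of the general technique using Python integers as fixed-point values; the function name `udiv64`, the linear first guess, the iteration count and the fixed-point format are all assumptions made for the sketch, not a description of Microwatt's FPU datapath.

```python
MASK64 = (1 << 64) - 1
F = 64  # fractional bits of the fixed-point reciprocal (assumption)

# Constants for a linear first guess x0 = 48/17 - (32/17) * f, valid for
# f in [0.5, 1). Precomputed, so no division happens at run time.
C_48_17 = (48 << F) // 17
C_32_17 = (32 << F) // 17


def udiv64(n, d):
    """Unsigned 64-bit divide: returns (quotient, remainder)."""
    if not (0 <= n <= MASK64 and 0 < d <= MASK64):
        raise ValueError("operands must be 64-bit, divisor non-zero")

    # Normalise the divisor so that f = dn / 2**64 lies in [0.5, 1).
    shift = 64 - d.bit_length()
    dn = d << shift

    # 1. Reciprocal estimate (relative error at most 1/17).
    x = C_48_17 - ((C_32_17 * dn) >> 64)

    # 2. Newton-Raphson refinement: x <- x * (2 - f * x).
    #    Each pass roughly squares the relative error.
    for _ in range(5):
        t = (dn * x) >> 64
        x = (x * ((2 << F) - t)) >> F

    # 3. Multiply the dividend by the reciprocal.
    q = (n * x) >> (128 - shift)

    # 4. Adjust: correct the small error left by truncation.
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r
```

The adjust step is what makes the result exact: the refined reciprocal only has to be close, because the final remainder check corrects the quotient by whatever small amount is left.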
Simple Branch Target Cache in fetch1
Stores instruction address and target address for direct branch instructions
Enables zero-overhead loops provided loop body is >= 3 instructions long
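As a rough behavioural model of the branch target cache described above, the following sketch stores (instruction address, target address) pairs and uses them to pick the next fetch address. The class name, the 64-entry size and the direct-mapped indexing are hypothetical choices for illustration; the sketch does not model pipeline timing, so it says nothing about the 3-instruction minimum loop length.

```python
class BranchTargetCache:
    """Toy model: direct-mapped table of (branch address, target address)."""

    def __init__(self, entries=64):  # size is an assumption
        self.entries = entries
        self.table = [None] * entries

    def _index(self, addr):
        # Instructions are 4 bytes, so drop the two low address bits.
        return (addr >> 2) % self.entries

    def update(self, branch_addr, target_addr):
        """Record a direct branch once its target is known."""
        self.table[self._index(branch_addr)] = (branch_addr, target_addr)

    def next_fetch(self, fetch_addr):
        """Return the predicted next fetch address."""
        entry = self.table[self._index(fetch_addr)]
        if entry is not None and entry[0] == fetch_addr:
            return entry[1]          # hit: redirect fetch to the target
        return fetch_addr + 4        # miss: fall through sequentially
```

For a loop whose closing branch sits at 0x1008 and jumps back to 0x1000, the first pass misses and falls through; once `update(0x1008, 0x1000)` has been called, `next_fetch(0x1008)` returns 0x1000 directly.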
Moved instruction address translation from icache to fetch1
2-entry “ERAT” plus iTLB in fetch1
Icache now accessed using real addresses => set size no longer constrained to 4kB
Instruction pre-decoding before icache
Compute a 10-bit instruction index from instruction word
Index replaces major opcode in icache => 36 bits per instruction in icache
(index for illegal instructions is 0x200 + major opcode)
Instruction indexes assigned in groups so range tests can be used
e.g., does it have RB operand? does it access FPRs vs GPRs?
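The bit arithmetic behind the pre-decoding scheme above can be written out as a short sketch: dropping the 6-bit major opcode from a 32-bit instruction leaves 26 bits, and adding the 10-bit index gives 36. The placement of the index in the top bits, the function names, and the `HAS_RB_GROUP` range are hypothetical; only the field widths and the `0x200 + major opcode` rule come from the text.

```python
INDEX_BITS = 10
OPCODE_BITS = 6
REST_BITS = 32 - OPCODE_BITS          # 26 bits left after the major opcode
ILLEGAL_BASE = 0x200

# Hypothetical group: indexes in this range are taken to have an RB operand.
HAS_RB_GROUP = range(0x040, 0x080)


def major_opcode(insn):
    """Major opcode is the top 6 bits of a 32-bit Power instruction word."""
    return (insn >> REST_BITS) & 0x3F


def illegal_index(insn):
    """Index assigned to an illegal instruction."""
    return ILLEGAL_BASE + major_opcode(insn)


def pack_icache_word(index, insn):
    """Replace the major opcode with the 10-bit index: 36 bits in total."""
    assert 0 <= index < (1 << INDEX_BITS)
    return (index << REST_BITS) | (insn & ((1 << REST_BITS) - 1))


def has_rb(index):
    """A range test stands in for a per-instruction table lookup."""
    return index in HAS_RB_GROUP
```

Because related instructions share a contiguous index range, a property such as "has an RB operand" reduces to two comparisons on the index.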
Implemented (most) SFFS instructions added in v3.1
br[hwd], cnt[lt]zdm, set[n]bc[r]
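For readers unfamiliar with these v3.1 instructions, here is a small reference model of the byte-reverse (`brh`, `brw`, `brd`) and set-boolean (`setbc`, `setbcr`, `setnbc`, `setnbcr`) results, as I understand them from the ISA. The Python function names and the use of a plain 0/1 argument for the selected CR bit are conventions of this sketch only.

```python
MASK64 = (1 << 64) - 1


def _reverse_chunks(value, chunk_bytes):
    """Reverse the byte order inside each chunk of a 64-bit value."""
    raw = (value & MASK64).to_bytes(8, "big")
    out = b"".join(raw[i:i + chunk_bytes][::-1]
                   for i in range(0, 8, chunk_bytes))
    return int.from_bytes(out, "big")


def brh(rs):
    return _reverse_chunks(rs, 2)   # within each halfword


def brw(rs):
    return _reverse_chunks(rs, 4)   # within each word


def brd(rs):
    return _reverse_chunks(rs, 8)   # whole doubleword


def setbc(cr_bit):
    return 1 if cr_bit else 0


def setbcr(cr_bit):
    return 0 if cr_bit else 1


def setnbc(cr_bit):
    return MASK64 if cr_bit else 0  # -1 as a 64-bit value


def setnbcr(cr_bit):
    return 0 if cr_bit else MASK64
```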
Prefixed instructions
Prefixed load/store instructions, paddi, pnop, etc. (SFFS subset)
Added support for vestigial 1-entry “partition table” to MMU
Process table base address read from (PTCR + 8)
Added basic PMU (Performance Monitor Unit)
Implements architected events and a few others
Doesn’t implement random sampling
Added snooping logic to dcache and icache
Added micro-SD card interface using LiteX “litesdcard” logic
Added GPIO (general-purpose I/O) interface
Access RAM arrays synchronously wherever possible
Enables use of block RAM in FPGA implementations
GPR/FPR register file requires addresses to be supplied from decode1 rather than decode2; addresses generated from instruction index and instruction word in parallel with decode table lookup
Many performance/timing improvements and logic simplifications
Architecture compliance
Aims to be a complete SFFS v3.1C implementation
Instructions implemented as no-ops:
sync, tlbsync, eieio, dcbf, dcbst, dcbt, dcbtst, icbt, nop, pnop, reserved no-ops
Not implemented:
hashst[p], hashchk[p] and related SPRs
VMX, VSX, MMA, DFP, Hashed page table, SLB, LPAR, SMT, EBB, BHRB, stream prefetch, load/store multiple and string, power-saving mode, etc.
SPRs: AMR/IAMR etc., PPR[32], LPCR, LPIDR, PCR, HRMOR, TIR, PSSCR
Currently in development (cost ~800–900 LUTs on Artix-7):
MSR[HV] always 1, hrfid, HSRR0/1, HEIR, HEAI interrupts, etc.
FSCR, HFSCR and related interrupts
CTRL, DSCR (two SPR numbers), VRSAVE
scv, rfscv, wait, cfuged, pdepd, pextd instructions
Possibly also required:
Quadword loads/stores (lq, stq, lqarx, stqcx.)
Some degree of AIL/HAIL support (may need vestigial LPCR)
Some degree of EVIRT support
Debug facilities: CIABR, DAWR[X]n
Instruction fetch from cache-inhibited memory
Writes to timebase register
DEXCR? (with no effect)
Anything else?
Testing
1000 randomly-generated code sequences, with reference results from a POWER9
Results compared via a checksum of integer register contents
Doesn’t include floating-point or privileged instructions
simple_random program to generate pseudo-random instruction sequences
Can run on two machines and compare results
Unit test programs
Generally these try executing specific instructions on specific operands, or exercise specific system functionality
branch_alias, decrementer, fpu, illegal, misc, mmu, modes, pmu, prefix, privileged, reservation, sc, spr_read, trace, xics
Boot Linux and observe behavior
Floating-point test suites, benchmarks, etc. can be run under Linux
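The idea of comparing random-test results via a checksum of the integer registers can be illustrated as follows. The checksum function here (rotate-and-XOR over 32 GPR values) is a hypothetical stand-in chosen for the sketch; the tests' actual checksum algorithm is not stated above.

```python
MASK64 = (1 << 64) - 1


def gpr_checksum(gprs):
    """Fold 32 64-bit register values into one 64-bit checksum.

    Hypothetical algorithm: rotate the accumulator left by one bit, then
    XOR in the next register, so the result depends on register order.
    """
    assert len(gprs) == 32
    acc = 0
    for value in gprs:
        acc = ((acc << 1) | (acc >> 63)) & MASK64
        acc ^= value & MASK64
    return acc


def results_match(regs_under_test, regs_reference):
    """Compare two runs of the same sequence by checksum only."""
    return gpr_checksum(regs_under_test) == gpr_checksum(regs_reference)
```

Comparing a single checksum keeps the stored reference data small: one value per test sequence rather than a full register dump.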
Experiments
Implemented lq, stq, lqarx, stqcx.
Required “instruction doubling” to handle 128 bits
Atomicity requirements hard to implement (but also hard to observe)
Partial VMX/VSX implementation
Also used instruction doubling
Implemented instructions used in glibc compiled for POWER9
Cache coherency for multi-core implementation
Including correct behavior for larx/stcx, sync, dcbf, etc.
Hardware random number generator
Micro-programmed control for FPU (in development)
Memory management unit
ISA defines two address translation schemes
Hashed page table (HPT), used by AIX, IBM i and older Linux kernels
Radix page table, used by recent Linux kernels
Microwatt implements radix and not HPT
ISA defines a very general tree structure for radix trees
Memory management unit
Radix tree page directory entry format:
NLS = Next level size, number of bits used to index next level of tree
Two standard layouts defined in ISA
52-bit EA, 4 levels, 64k page size (index fields 13, 9, 9, 5 bits)
52-bit EA, 4 levels, 4k page size (index fields 13, 9, 9, 9 bits)
Microwatt implements a general radix tree walker state machine
Any NLS value from 5 to 16 is permitted
Address space size from 31 bits (2 GiB) to 62 bits (4 EiB)
Any power-of-2 page size >= 4 KiB
Any number of levels between 1 and 10
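A general walker of the kind described above can be sketched with an in-memory tree instead of bit-packed page directory entries. The `Directory`, `Leaf`, `walk` and `map_page` names are hypothetical, and each `Directory` carries its own index width rather than reading NLS from the parent entry; the 5 to 16 bit range and the 4 KiB minimum page size are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    real_base: int  # real address of the page, aligned to the page size


@dataclass
class Directory:
    nls: int  # number of EA bits used to index this directory
    entries: Dict[int, Union["Directory", Leaf]] = field(default_factory=dict)


def walk(root: Directory, ea: int, ea_bits: int) -> Optional[int]:
    """Translate ea, or return None on a missing or malformed entry."""
    remaining = ea_bits
    node = root
    while True:
        if not 5 <= node.nls <= 16 or node.nls > remaining:
            return None
        remaining -= node.nls
        index = (ea >> remaining) & ((1 << node.nls) - 1)
        entry = node.entries.get(index)
        if entry is None:
            return None
        if isinstance(entry, Leaf):
            if remaining < 12:          # pages smaller than 4 KiB not allowed
                return None
            # Whatever EA bits are left form the offset within the page.
            return entry.real_base | (ea & ((1 << remaining) - 1))
        node = entry


def map_page(root, level_bits, ea, ea_bits, real_base):
    """Install one leaf; level_bits lists index widths, e.g. [13, 9, 9, 5]."""
    assert root.nls == level_bits[0]
    node, remaining = root, ea_bits
    for depth, bits in enumerate(level_bits):
        remaining -= bits
        index = (ea >> remaining) & ((1 << bits) - 1)
        if depth == len(level_bits) - 1:
            node.entries[index] = Leaf(real_base)
        else:
            node = node.entries.setdefault(
                index, Directory(level_bits[depth + 1]))
```

The page size is never stored: it falls out as the number of EA bits left when a leaf is reached. With a 52-bit EA and index widths 13, 9, 9, 5, that leaves 16 bits, a 64 KiB page; with 13, 9, 9, 9 it leaves 12 bits, a 4 KiB page.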
Memory management unit – implementation
Add level-1 TLBs to instruction and data caches
4kB page size, looked up in parallel with cache tags
Instruction TLB is direct-mapped, 64 entries
Data TLB is 2-way set associative, 128 entries total
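The two TLB organisations above differ in how many translations can share one index. The sketch below models both with the sizes given in the text (64 direct-mapped entries; 128 entries as 64 sets of 2 ways). Indexing by the low bits of the virtual page number and the alternating replacement policy are assumptions of the sketch, not statements about Microwatt's caches.

```python
PAGE_SHIFT = 12                     # 4 KiB TLB entries
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class DirectMappedTLB:
    """One slot per index, as for the 64-entry instruction TLB."""

    def __init__(self, entries=64):
        self.entries = entries
        self.slots = [None] * entries

    def lookup(self, ea):
        vpn = ea >> PAGE_SHIFT
        entry = self.slots[vpn % self.entries]
        if entry is not None and entry[0] == vpn:
            return (entry[1] << PAGE_SHIFT) | (ea & PAGE_MASK)
        return None                 # miss: the MMU must walk the tree

    def insert(self, ea, real_page):
        vpn = ea >> PAGE_SHIFT
        self.slots[vpn % self.entries] = (vpn, real_page)


class TwoWayTLB:
    """Two slots per index, as for the 128-entry data TLB (64 sets)."""

    def __init__(self, total_entries=128):
        self.sets = total_entries // 2
        self.ways = [[None, None] for _ in range(self.sets)]
        self.victim = [0] * self.sets   # hypothetical: alternate the ways

    def lookup(self, ea):
        vpn = ea >> PAGE_SHIFT
        for entry in self.ways[vpn % self.sets]:
            if entry is not None and entry[0] == vpn:
                return (entry[1] << PAGE_SHIFT) | (ea & PAGE_MASK)
        return None

    def insert(self, ea, real_page):
        vpn = ea >> PAGE_SHIFT
        s = vpn % self.sets
        way = self.victim[s]
        self.ways[s][way] = (vpn, real_page)
        self.victim[s] = way ^ 1
```

Two pages whose page numbers differ by a multiple of 64 evict each other in the direct-mapped model but can coexist in the 2-way one.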
MMU is a state machine
Sends a series of requests to the dcache to read the process table and PTEs
Eventually sends the translation to the dcache or icache
EA bits inserted if page size > 4kB
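Since the L1 TLBs hold 4 kB entries, a translation for a larger page has to be narrowed to the 4 kB piece that was actually accessed. A minimal sketch of that step, assuming the leaf's real base address is aligned to its page size (the function and parameter names are hypothetical):

```python
def tlb_real_page_4k(leaf_real_base, page_shift, ea):
    """Real address of the 4 kB piece of a larger page that contains ea.

    page_shift is log2 of the page size found by the tree walk.
    """
    assert page_shift >= 12
    assert leaf_real_base & ((1 << page_shift) - 1) == 0  # aligned base
    # EA bits below the page size select a byte within the page ...
    in_page = ea & ((1 << page_shift) - 1)
    # ... and those at or above bit 12 are inserted into the real address.
    return (leaf_real_base | in_page) & ~0xFFF
```

For a 64 kB page at real address 0x40000000 and an EA whose low 16 bits are 0x2345, the 4 kB entry points at 0x40002000. For a 4 kB page nothing is inserted and the result is just the page base.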
Currently no caching of PTEs or PDEs in the MMU
Nor any caching of partial translation results (no “page walk cache”)
Just the L1 iTLB and dTLB inside the instruction and data caches
Microwatt doesn’t implement the partition table
Instead has a PRTBL register (SPR) to point to the process table
Floating-point unit
Why do an FPU?
It’s in the architecture
Don’t want to define a soft-float ABI – want to avoid fragmentation
How small can we make it?
What are the difficulties?
Divide and square-root instructions
Fused multiply-add instructions
Handling denormalized numbers and exception conditions correctly
Getting correct results down to the last bit in all rounding modes
Results:
Control uses a state machine: handle one instruction at a time, not pipelined
Uses about 4500 LUTs on the Xilinx Artix-7 (~20% of the total SoC)
Add/subtract generally take 5–12 cycles, multiply takes 8–15 cycles
FPU – Data paths (simplified)