#### RL4ReAl: Reinforcement Learning for Register Allocation

S. VenkataKeerthy<sup>1</sup>, Siddharth Jain<sup>1</sup>, Anilava Kundu<sup>1</sup>, Rohit Aggarwal<sup>1</sup>, **Albert Cohen**<sup>2</sup>, Ramakrishna Upadrasta<sup>1</sup>

IIT Hyderabad<sup>1</sup>, Google<sup>2</sup>



LLVM Performance Workshop 25<sup>th</sup> February 2023

## **Register allocation**

- Registers are scarce!
  - Unbounded set of variables $\rightarrow$ Finite set of registers
- One of the classic NP-hard problems
  - Reducible to graph coloring
- Solutions
  - Constraint-based: ILP and PBQP formulations
  - Heuristic approaches
- LLVM: 4 register allocators
  - Constraint-based: PBQP
  - Heuristic: Greedy, Basic, Fast

#### LLVM's Register Allocation Strategies and Heuristics



- No single best allocator; Greedy performs better in general
- Greedy allocator heuristics: Splitting, Coalescing, Eviction and Spilling
- PBQP allocator heuristics: Coalescing and Spilling

## What makes ML based Register allocation difficult?

- Complex problem with multiple sub-tasks
  - Splitting, Spilling, Coalescing, etc.
- ML schemes should ensure correctness
  - Register type constraints
  - Live range constraints
- Integration of ML solutions with compiler frameworks
  - Python $\leftrightarrow$ C++

Proposal: RL4ReAl, Reinforcement Learning for Register Allocation

## **RL4ReAl:** Objectives

**Objectives:** Machine Learning Framework for Register Allocation

- End-to-end application of Reinforcement Learning for register allocation
- Semantically correct code generation
  - Without resorting to a correction phase
  - Correctness constraints imposed on action space
- Multi architecture support
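
One common way to impose correctness constraints on an action space is action masking: illegal choices are filtered out before the agent picks one, so no correction phase is needed afterwards. The sketch below only illustrates that idea. The register names, the `REGISTER_CLASS` table and both helper functions are hypothetical and are not taken from RL4ReAl.

```python
import random

# Hypothetical register file: register name -> register class.
REGISTER_CLASS = {"R1": "gpr", "R2": "gpr", "F1": "fpr", "F2": "fpr"}


def legal_registers(var_class, neighbour_colors):
    """Registers of the right class that no interfering neighbour already holds."""
    return [
        reg
        for reg, cls in REGISTER_CLASS.items()
        if cls == var_class and reg not in neighbour_colors
    ]


def pick_register(var_class, neighbour_colors, rng=random):
    """Sample only from the masked (legal) actions; None means no legal register."""
    legal = legal_registers(var_class, neighbour_colors)
    return rng.choice(legal) if legal else None
```

Because the policy can only sample from `legal`, every register it returns satisfies the type and interference checks by construction.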

Can an ML model match or outperform half-century-old heuristics?

# Constraints in Register Allocation

### **Register Allocation: Correctness constraints**

Registers are complicated!

1. Register constraints
2. Type constraints
3. Congruence constraints
4. Interference constraints

#### **Register Constraints**

- Architectural constraints
  - E.g., IDIV32 $\rightarrow$ divides the contents of \$eax; stores the result in \$eax and \$edx
- Register allocation $\Rightarrow$ allocating the remaining virtual registers

| Source | MIR |
|--------|-----|
| i = 0<br>x = 10<br>y = 20<br>print x<br>z = y / x<br>i++<br>z = z + 10<br>i++<br>print y<br>print z<br>print i | <pre>MOV32ri 0, %i:gr32<br/>MOV32ri 10, %x:gr32<br/>MOV32ri 20, %y:gr32<br/>&lt;call print on %x&gt;<br/>\$eax = COPY %y:gr32<br/>&lt;clear \$edx&gt;<br/>IDIV32r %x:gr32, implicit-def \$eax, implicit-def \$edx<br/>%z:gr32 = COPY \$eax<br/>%i:gr32 = ADD32ri %i:gr32, 1<br/><br/>&lt;call print on %y, %z, %i&gt;</pre> |

## Type constraints

- Different types of registers in a register file
  - General purpose registers
  - Floating point registers
  - Vector registers, ...
- Variable type compatibility with the register type

## **Congruence constraints**

- Real-world ISAs have a hierarchy of register classes
  - Congruent classes

[Figure: x86 register file. Registers are nested by width, e.g. AL/AH, AX, EAX, RAX and XMM0, YMM0, ZMM0. The figure also shows the x87/MMX registers ST(0)-ST(7) and MM0-MM7, R8-R15 with their sub-registers, and the segment, control (CR) and debug (DR) registers. Legend: 8-, 16-, 32-, 64-, 80-, 128-, 256- and 512-bit registers.]

#### Interference constraints

Register allocation  $\Rightarrow$  Graph coloring problem



Available Registers: R1(Green), R2(Blue)
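
To make the graph-coloring view concrete, here is a minimal greedy coloring sketch with two registers. The example graph, the degree-based visiting order and the `color_graph` helper are assumptions for illustration; this is a textbook heuristic, not the method used by RL4ReAl or by any LLVM allocator.

```python
def color_graph(interference, registers):
    """Greedy coloring: give each node the first register no neighbour uses.

    Nodes that cannot be colored are returned as spill candidates.
    """
    assignment, spilled = {}, []
    # Visit high-degree nodes first (a common heuristic, assumed here).
    for node in sorted(interference, key=lambda n: -len(interference[n])):
        taken = {assignment[n] for n in interference[node] if n in assignment}
        free = [r for r in registers if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)
    return assignment, spilled


# Hypothetical interference graph: a, b, c form a triangle; d only meets c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
assignment, spilled = color_graph(graph, ["R1", "R2"])
print(assignment, spilled)
```

A triangle needs three colors, so with only R1 and R2 one of its nodes must be spilled (or its live range split).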

### RL4ReAl: Reinforcement Learning for Register Allocation



## Interference graphs

Edges: {physical register - virtual register, virtual register - virtual register}

Vertices

- MIR instruction representations in the live range of a variable
- Instruction  $\rightarrow \mathbb{R}^n$  MIR2Vec embeddings
- Final representation:  $\mathbb{R}^{m \times n}$

**MIR2Vec** representations

- *n* dimensional vector representation
- Opcode and operand information form the entities in MIR
  - $W_o \cdot \llbracket \mathbf{O} \rrbracket + W_a \cdot \left(\llbracket \mathbf{A_1} \rrbracket + \llbracket \mathbf{A_2} \rrbracket + \dots + \llbracket \mathbf{A_n} \rrbracket\right)$, with $W_o > W_a$
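
The weighted sum above can be sketched in a few lines of plain Python. The weights `w_o=1.0` and `w_a=0.5` and the toy 3-dimensional vectors are made-up values. The slides only state that $W_o > W_a$.

```python
def instruction_embedding(opcode_vec, operand_vecs, w_o=1.0, w_a=0.5):
    """Return W_o * [[O]] + W_a * ([[A_1]] + ... + [[A_n]]) for one instruction."""
    if not w_o > w_a:
        raise ValueError("the formula assumes W_o > W_a")
    result = [w_o * x for x in opcode_vec]
    for vec in operand_vecs:
        for i, x in enumerate(vec):
            result[i] += w_a * x
    return result


# Toy 3-dimensional entity embeddings (made-up numbers).
opcode = [1.0, 0.0, 2.0]
operands = [[0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
print(instruction_embedding(opcode, operands))  # [2.0, 2.0, 2.0]
```

Stacking the vectors of the $m$ instructions in a live range would then give the $\mathbb{R}^{m \times n}$ representation of a vertex.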



## Grouping opcodes

- MIR has specialized opcodes
- Based on width, source and destination types
  - 200 different MOV instructions
  - MOV32rm, MOVZX64rr16, MOVAPDrr, etc.
- 15.3K opcodes in x86; 5.4K opcodes in AArch64
  - {build dir}/lib/Target/X86/X86GenInstrInfo.inc
  - {build dir}/lib/Target/AArch64/AArch64GenInstrInfo.inc
- Generic opcodes
  - Specialized opcodes are grouped together
  - $\{\text{MOV32rm, MOVZX64rr16, MOVAPDrr, ...}\} \rightarrow \text{MOV}$
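
A minimal sketch of such a grouping is shown below. The slides do not say how the groups are built. Matching on a name prefix and the short `GENERIC_FAMILIES` table are assumptions made purely for illustration.

```python
# Hypothetical table of generic opcode families; a real one would be far larger.
GENERIC_FAMILIES = ("MOV", "ADD", "IDIV")


def generic_opcode(specialized):
    """Map a specialized opcode name to its generic family by name prefix."""
    for family in GENERIC_FAMILIES:
        if specialized.startswith(family):
            return family
    return specialized  # unknown opcodes are left ungrouped


for name in ["MOV32rm", "MOVZX64rr16", "MOVAPDrr", "ADD32ri"]:
    print(name, "->", generic_opcode(name))
```

Grouping shrinks the vocabulary the embedding has to learn, at the cost of hiding width and operand-kind details inside one generic entity.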

## **Representing Interference graphs**

- GGNNs: Gated Graph Neural Networks
  - Processing graph structured inputs
- Message passing
  - Information propagated multiple times across nodes
- Annotations on nodes $\rightarrow$ Current state
  - Visited
  - Colored
  - Spilled
- $\mathbb{R}^{m \times n} \to \mathbb{R}^k$
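
The message-passing idea can be illustrated with a heavily simplified round that has no learned weights and no GRU gates: each node vector is blended with the mean of its neighbours, and a sum readout produces one fixed-size graph vector. This is not a GGNN implementation. The blend factor `keep` and both function names are assumptions.

```python
def message_passing(states, edges, rounds=2, keep=0.5):
    """Blend each node vector with the mean of its neighbours, `rounds` times."""
    neighbours = {node: [] for node in states}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    for _ in range(rounds):
        updated = {}
        for node, vec in states.items():
            nbrs = neighbours[node]
            if not nbrs:
                updated[node] = list(vec)  # isolated nodes keep their state
                continue
            mean = [sum(states[n][i] for n in nbrs) / len(nbrs) for i in range(len(vec))]
            updated[node] = [keep * x + (1 - keep) * m for x, m in zip(vec, mean)]
        states = updated
    return states


def readout(states):
    """Sum node vectors into one fixed-size graph vector."""
    return [sum(column) for column in zip(*states.values())]
```

In a real GGNN the fixed blend is replaced by a trained gated update, and node annotations such as visited, colored or spilled would be part of each node's state.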



## Hierarchical Reinforcement Learning

- Environment: MLRegAlloc pass in LLVM
  - Generates interference graphs + representations
  - Register allocation, splitting and spilling as per the prediction
- Multi-agent hierarchical reinforcement learning
  - Sub-tasks of register allocation $\rightarrow$ Low-level agents
- Agents
  - Node selection
  - Task selection
  - Splitting
  - Coloring
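
The control flow between the agents might look roughly like the sketch below, where random choices stand in for the learned policies. Everything here is a hypothetical illustration: the `run_episode` function, the task names and the fallback to `"spill"` when no register is free are assumptions, and splitting is only recorded, not materialized.

```python
import random


def run_episode(nodes, registers, interference, rng):
    """Walk the agent hierarchy once, using random stand-in policies."""
    assignment, log = {}, []
    pending = list(nodes)
    while pending:
        # Node selection agent: pick the next live range to handle.
        node = pending.pop(rng.randrange(len(pending)))
        taken = {assignment[n] for n in interference.get(node, ()) if n in assignment}
        free = [r for r in registers if r not in taken]
        # Task selection agent: color or split (spill if nothing is free).
        task = rng.choice(["color", "split"]) if free else "spill"
        if task == "color":
            # Coloring agent: choose among legal registers only.
            assignment[node] = rng.choice(free)
        log.append((node, task))
    return assignment, log


interference = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
assignment, log = run_episode(["a", "b", "c"], ["R1", "R2"], interference, random.Random(0))
print(assignment, log)
```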





## Materialization of splitting

- Involves inserting move instructions
- Dataflow problem
  - Similar to phi or copy placement
- Use dominance frontier

**Algorithm 1:** Move placement in live range splitting

- Parameters: virtual register $v$, split point $k$
- Rename $v \rightarrow v'$
- At use point $k$ do: $v'' \leftarrow move(v')$
- Basic block $B \leftarrow block(v_k)$
- for $i \in DominanceFrontier(B)$ do
  - $v' \leftarrow move(v'')$, after last $use(v')$ in $i$
  - Rename $v' \rightarrow v''$, $\forall\, use(v')$ between $B$ and $i$
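
Algorithm 1 iterates over the dominance frontier of the block containing the split point. As a sketch of how such a frontier can be computed, the code below uses the standard "runner" formulation by Cooper, Harvey and Kennedy, given predecessor lists and immediate dominators. The diamond-shaped CFG and the function name are assumptions. This is not LLVM's implementation.

```python
def dominance_frontiers(preds, idom):
    """Compute dominance frontiers from predecessors and immediate dominators."""
    frontier = {block: set() for block in idom}
    for block, block_preds in preds.items():
        if len(block_preds) < 2:
            continue  # only join points contribute to a frontier
        for pred in block_preds:
            runner = pred
            # Walk up the dominator tree until reaching the join's idom.
            while runner != idom[block]:
                frontier[runner].add(block)
                runner = idom[runner]
    return frontier


# Hypothetical diamond CFG: entry -> {left, right} -> join.
preds = {"entry": [], "left": ["entry"], "right": ["entry"], "join": ["left", "right"]}
idom = {"entry": "entry", "left": "entry", "right": "entry", "join": "entry"}
print(dominance_frontiers(preds, idom))
```

For a split inside `left`, the frontier is `{join}`, which is where Algorithm 1 would place the move that merges the split value back.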

## **Global Rewards**

- Based on the throughput (*Th*) of the generated function
- Use LLVM MCA
  - Machine Code Analyzer of LLVM
  - Static model to estimate throughput

$$R_G = \begin{cases} +10, & Th_{RL4ReAl} \ge Th_{Greedy} \\ -10, & Otherwise \end{cases}$$
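
The global reward is a direct transcription of the case distinction above. In this sketch the throughput values are plain numbers; in the actual setup they would come from LLVM MCA, which is not invoked here.

```python
def global_reward(th_rl4real, th_greedy):
    """Return +10 if the RL4ReAl throughput is at least Greedy's, else -10."""
    return 10 if th_rl4real >= th_greedy else -10


print(global_reward(1.25, 1.10))  # 10
print(global_reward(0.90, 1.10))  # -10
```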

## Integration with LLVM

- RL4ReAl to-and-fro communication
  - Decisions/Actions by Python model
  - Materialization of decisions in C++ compiler
- LLVM-gRPC: gRPC-based framework
  - Seamless connection between LLVM and Python ML workloads
    - Works as an LLVM library
    - Easy integration
      - As simple as implementing a few API calls
  - Support for any ML workload
    - Not just limited to RL
    - With both training and inference flow

## Training



#### Training phase

- The RL model (Python) sends requests to the C++ side (LLVM)
- The model makes decisions on splitting and coloring
- C++ (LLVM) generates code for each decision and returns the corresponding reward

## Inference



#### **Inference phase**

- For any input code, C++ (LLVM) sends a request to the trained model for a splitting decision
- In reply, the trained model returns its decision, and code is generated accordingly

## Experiments

- MIR2Vec representations
  - 2000 source files from SPEC CPU 2017 and C++ Boost libraries
  - 100 dimensional embeddings; trained over 1000 epochs
- Evaluation
  - x86: Intel Xeon W2133, 6 cores, 32GB RAM
  - AArch64: ARM Cortex A72, 2 cores, 4GB RAM
- RL models: PPO policy with standard set of hyperparameters
- Register allocations
  - General purpose, floating point and vector registers

| Arch.   | Registers |
|---------|-----------|
| x86     | [A-D]L, [A-D]X, [E,R][A-D]X, [SI,DI]L, [E,R][SI,DI], SI, DI, R[8-15][B,W,D], FP[0-7], [X,Y,Z]MM[0-15] |
| AArch64 | [X,W][0-30], [B,H,S,D,Q][0-31] |

## Runtime improvements on x86

Runtime of BASIC (s) and difference from BASIC (BASIC − x, in s) for the other allocators:

| Benchmark       | BASIC | PBQP  | Greedy | RL4ReAl-L | RL4ReAl-G |
|-----------------|-------|-------|--------|-----------|-----------|
| 401.bzip2       | 360.6 | -7.3  | 7.5    | -1.1      | 10.8      |
| 429.mcf         | 233.8 | 1.4   | -2.9   | 2.7       | -3.6      |
| 445.gobmk       | 322.3 | -3.3  | 6.4    | 2.4       | 1.7       |
| 456.hmmer       | 284.3 | 1.8   | 6.1    | 5.0       | -37.6     |
| 462.libquantum  | 256.4 | -10.1 | -1.1   | -2.2      | -6.7      |
| 471.omnetpp     | 305.7 | 0.7   | 0.4    | 1.2       | 1.2       |
| 433.milc        | 349.1 | -16.6 | 0.1    | -13.8     | -7.0      |
| 470.lbm         | 184.0 | -7.9  | 3.0    | 2.3       | 1.4       |
| 482.sphinx3     | 366.0 | -37.5 | 1.6    | -3.1      | -2.7      |
| 505.mcf_r       | 344.9 | 4.5   | -1.6   | 8.6       | -4.7      |
| 520.omnetpp_r   | 475.7 | 6.4   | 6.4    | 2.4       | 2.8       |
| 531.deepsjeng_r | 299.9 | 4.6   | 16.0   | 9.9       | 12.8      |
| 541.leela_r     | 439.5 | 1.6   | 7.1    | 0.4       | 1.9       |
| 557.xz_r        | 371.5 | -0.6  | 11.9   | 12.1      | -8.5      |
| 508.namd_r      | 236.5 | 2.5   | 23.5   | 9.1       | 23.8      |
| 519.lbm_r       | 261.8 | 1.4   | 57.7   | 50.9      | 58.1      |
| 538.imagick_r   | 479.3 | -16.9 | 115.5  | 118.8     | 118.4     |
| 544.nab_r       | 417.5 | 5.8   | 132.1  | 131.3     | 134.4     |

- RL4ReAl shows speedups over Basic in 14/18 benchmarks
- Runtimes are very close to Greedy
- Only one benchmark shows a slowdown of more than 4%
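
As a worked check of how such counts can be derived, the sketch below takes the BASIC runtime and the RL4ReAl L column from the x86 table and counts positive differences (BASIC − x > 0 means faster than BASIC). It assumes that a percentage slowdown is measured relative to the BASIC runtime, which the slides do not state explicitly.

```python
# (benchmark, BASIC runtime in s, RL4ReAl L difference in s), copied from the table.
ROWS = [
    ("401.bzip2", 360.6, -1.1), ("429.mcf", 233.8, 2.7), ("445.gobmk", 322.3, 2.4),
    ("456.hmmer", 284.3, 5.0), ("462.libquantum", 256.4, -2.2), ("471.omnetpp", 305.7, 1.2),
    ("433.milc", 349.1, -13.8), ("470.lbm", 184.0, 2.3), ("482.sphinx3", 366.0, -3.1),
    ("505.mcf_r", 344.9, 8.6), ("520.omnetpp_r", 475.7, 2.4), ("531.deepsjeng_r", 299.9, 9.9),
    ("541.leela_r", 439.5, 0.4), ("557.xz_r", 371.5, 12.1), ("508.namd_r", 236.5, 9.1),
    ("519.lbm_r", 261.8, 50.9), ("538.imagick_r", 479.3, 118.8), ("544.nab_r", 417.5, 131.3),
]

# A positive difference means the allocator beat BASIC on that benchmark.
speedups = [name for name, _, diff in ROWS if diff > 0]
# Most negative difference relative to the BASIC runtime.
worst = min(ROWS, key=lambda row: row[2] / row[1])

print(len(speedups), "of", len(ROWS), "benchmarks are faster than BASIC")
print("largest relative slowdown:", worst[0], f"{100 * worst[2] / worst[1]:.1f}%")
```

With the L column this gives 14 of 18, matching the count quoted above.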

#### Analysis of Hot functions

% difference in runtime on hot functions, with Basic as baseline

|           | SPEC CPU 2006: Greedy | SPEC CPU 2006: RL4ReAl-L | SPEC CPU 2006: RL4ReAl-G | SPEC CPU 2017: Greedy | SPEC CPU 2017: RL4ReAl-L | SPEC CPU 2017: RL4ReAl-G |
|-----------|-------|-------|-------|------|------|-------|
| Average   | -1.5  | -2.1  | -1.6  | 6.2  | 7.3  | 4.8   |
| # (val>0) | 16    | 17    | 13    | 23   | 23   | 17    |
| # (val<0) | 19    | 18    | 22    | 8    | 8    | 14    |
| Max       | 12.7  | 10.4  | 6.2   | 44.0 | 44.4 | 41.3  |
| Min       | -51.4 | -52.5 | -13.1 | -7.7 | -4.4 | -10.8 |

#### Analysis of Hot functions

%speedups obtained by Greedy and RL4ReAl over Basic

Top 5 functions with highest % speedup (over Greedy)

| Benchmark | Function                  | Greedy | RL4ReAl | Diff. |
|-----------|---------------------------|--------|---------|-------|
| 401       | BZ2_compressBlock         | -51.3  | -5.2    | 46.1  |
| 445       | do_get_read_result        | -12.0  | -0.5    | 11.5  |
| 482       | mgau_eval                 | -6.0   | 0.3     | 6.3   |
| 429       | price_out_impl            | -0.8   | 2.3     | 3.2   |
| 445       | subvq_mgau_shortlist      | -9.8   | -6.9    | 2.9   |
| 538       | GetVirtualPixelsFromNexus | 8.3    | 28.8    | 20.4  |
| 538       | SetPixelCacheNexusPixels  | 4.7    | 21.9    | 17.2  |
| 505       | cost_compare              | -7.7   | 8.1     | 15.8  |
| 557       | lzma_mf_bt4_skip          | -1.8   | 3.63    | 5.5   |
| 525       | biari_decode_symbol       | -2.7   | 2.7     | 5.4   |

#### Analysis of Hot functions

Top 5 functions with highest % slow-down (over Greedy)

| Benchmark | Function                 | Greedy | RL4ReAl | Diff. |
|-----------|--------------------------|--------|---------|-------|
| 456       | P7Viterbi                | 2.2    | -13.1   | -15.3 |
| 482       | vector_gautbl_eval_logs3 | 11.9   | -2.5    | -14.4 |
| 401       | mainGtU                  | 0.3    | -9.6    | -10.0 |
| 401       | fallbackSort             | 12.6   | 6.2     | -6.4  |
| 445       | fastlib                  | 4.8    | -1.1    | -5.9  |
| 557       | lzma_mf_bt4_find         | 1.5    | -10.7   | -12.3 |
| 531       | feval                    | 26.4   | 17.7    | -8.6  |
| 505       | primal_bea_mpp           | 0.9    | -7.6    | -8.5  |
| 541       | FastBoard::self_atari    | 3.7    | -0.1    | -5.8  |
| 541       | qsearch                  | 6.6    | 1.5     | -5.0  |

%speedups obtained by Greedy and RL4ReAl over Basic

## Runtimes on AArch64

Runtime of BASIC and difference from BASIC (BASIC − x) for the other allocators:

| Benchmark       | BASIC   | PBQP  | Greedy | RL4ReAl |
|-----------------|---------|-------|--------|---------|
| 401.bzip2       | 1366.9  | -41.1 | 15.6   | 12.8    |
| 429.mcf         | 1320.5  | -12.7 | -7.5   | 1.6     |
| 445.gobmk       | 992.8   | 15.6  | 26.1   | 14.5    |
| 462.libquantum  | 1627.6  | -8.7  | 4.5    | 9.6     |
| 433.milc        | 1251.1  | 59.2  | 70.9   | 45.4    |
| 444.namd        | 855.3   | 2.7   | 21.8   | 18.8    |
| 470.lbm         | 1604.3  | -6.4  | -16.6  | 16      |
| 505.mcf_r       | 1535.1  | 25.9  | 1.9    | -12.8   |
| 508.namd_r      | 845     | 0.4   | 34.5   | 40.1    |
| 523.xalancbmk_r | 979.1   | 8.1   | -3.4   | 4.4     |
| 531.deepsjeng_r | 777.2   | 10.0  | 30.5   | 4.5     |
| 541.leela_r     | 1067.9  | -11.3 | -0.1   | -19.5   |
| 557.xz_r        | 1163.2  | 3.7   | 22.2   | 21.3    |
| 519.lbm_r       | 1657    | 50.9  | -1.6   | 39.8    |
| 538.imagick_r   | 1244.5  | -3.9  | 75.8   | 65.6    |
| 544.nab_r       | 1170.7  | -7.7  | 31.5   | 32.4    |
| Average         |         | 5.3   | 19.1   | 18.4    |

## Policy Improvement on Regression cases

- Regression in performance
  - Identify $\rightarrow$ Refine heuristics $\rightarrow$ Evaluate
- MLGO's policy improvement cycle
  - Fine-tuning of learned RL policy on regression cases
- Identify and Refine
  - Poorly performing benchmarks from each configuration
  - RL4ReAl-L
    - milc (-13.8s $\rightarrow$ -0.8s)
  - RL4ReAl-G
    - hmmer (-37.6s $\rightarrow$ -26s), xz (-8.5s $\rightarrow$ -2.5s)
- Strong case for online learning and domain specialization

## Summary

- RL4ReAl: Architecture-independent Reinforcement Learning for Register Allocation
- Multi agent hierarchical approach
- Generates semantically correct code: constraints imposed on the action space
- Allocations on par or better than the best allocators of LLVM
- New opportunities for compiler/ML research
- Framework will be open-sourced
- https://compilers.cse.iith.ac.in/publications/rl4real




# Thank You!

https://compilers.cse.iith.ac.in/publications/rl4real/



Rohit Aggarwal, IIT Hyderabad, India

Albert Cohen, Google, France

Ramakrishna Upadrasta, IIT Hyderabad, India

#### Abstract

We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with state-of-the-art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture-independent [...]

[...] problem is reducible to graph coloring, which is one of the classical NP-Complete problems [8, 22]. Register allocation as an optimization involves additional sub-tasks, more than graph coloring itself [8]. Several formulations have been proposed that return exact or heuristic-based solutions.

Broadly, solutions are often formulated as constraint-based optimizations [34, 38], ILP [3, 5, 12, 42], PBQP [31], or game-theoretic approaches [45], and are fed to a variety of solvers. In general, these approaches are known to have scalability issues. On the other hand, heuristic-based approaches have been widely used owing to their scalability: reasonable solutions [...]