

4/10/2024 @ European LLVM Developers' Meeting

#### Enabling HW-based PGO for both Windows and Linux

Wei Xiao (<u>wei3.xiao@intel.com</u>)

Contributors: Timothy Creech, Haohai Wen, Rakesh Krishnaiyer, Mike Chynoweth, Ahmad Yasin, Tianqing Wang



# Agenda

1. Motivation
2. New Feedback Capabilities
3. Windows Support
4. Demo
5. Challenges & Solutions
6. Upstreaming
7. Summary
8. Q&A

### Motivation

- Sampled profiling periodically interrupts program execution to grab a HW event count or machine state. Most CPUs can do this purely in HW or can emulate it in SW (by using a timer).
- Modern CPUs support more advanced forms of HW profiling:

| Intel x86_64 | AMD x86_64 | ARM       | RISC-V | HW Profiling Capabilities           |
|--------------|------------|-----------|--------|-------------------------------------|
| PEBS         | IBS        | SPE       |        | Event-based Sampling                |
| LBR          | LbrExtV2   | BRBE      | CTR    | Short trace of branches             |
| PT           |            | CoreSight |        | Full trace of executed instructions |

These features gather samples in hardware, possibly several at a time, at lower overhead. They also provide other benefits, such as reduced skid, precise distribution, and data-address capture.

## Intel PEBS Overview

### Processor Event-Based Sampling (PEBS)

- Low-overhead sampling (an order of magnitude reduction)
- Reduced-skid or Precise-Distribution



# Intel LBR Overview

### Last Branch Record (LBR)

- CPU collects data for taken branches
  - Source Address –
- Low overhead

#### Recent CPUs offer Architectural LBRs

- Consistent across processor generations and in virtualized environments

| Register<br>Address<br>(Hex) | Architectural MSR<br>Name and bit<br>fields | MSR/Bit Description                                                                                                                                                                                                                             |
|------------------------------|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1500-<br>151FH               | <b>IA32_LBR_x_FROM_IP</b><br>[63:0]         | FROM_IP: The source IP of the recorded branch or event, in canonical form.                                                                                                                                                                      |
| 1600-<br>161FH               | <b>IA32_LBR_x_TO_IP</b><br>[63:0]           | <b>TO_IP:</b> The destination IP of the recorded branch or event, in canonical form.                                                                                                                                                            |
| 1200-<br>121FH               | IA32_LBR_x_INFO                             | Last Branch Record entry X info register (R/W). An attempt to read or write IA32_LBR_x_INFO such that x >= IA32_LBR_DEPTH.DEPTH will #GP. |
|                              | 15:0                                        | CYC_CNT: The elapsed CPU cycles (saturating) since the last LBR was recorded. |
|                              | 55:16                                       | Undefined, may be zero or non-zero. Writes of non-zero values do not fault, but reads may return a different value. |
|                              | 59:56                                       | BR_TYPE: The branch type recorded by this LBR. Encodings:<br>0000B: JCC<br>0001B: JMP Indirect<br>0010B: JMP Direct<br>0011B: CALL Indirect<br>0100B: CALL Direct<br>0101B: RET<br>011xB: Reserved<br>1xxxB: Other Branch |
|                              | 60                                          | CYC_CNT_VALID: CYC_CNT value is valid. |
|                              | 61                                          | <b>TSX_ABORT:</b> This LBR record is a TSX abort. On processors that do not support Intel<sup>®</sup> TSX (CPUID.07H.EBX.HLE[bit 4]=0 and CPUID.07H.EBX.RTM[bit 11]=0), this bit is undefined. |
|                              | 62                                          | IN_TSX: This LBR record records a branch that retired during a TSX transaction. On processors that do not support Intel<sup>®</sup> TSX (CPUID.07H.EBX.HLE[bit 4]=0 and CPUID.07H.EBX.RTM[bit 11]=0), this bit is undefined. |
|                              | 63                                          | MISPRED: The recorded branch direction (Jcc) or target<br>(indirect branch) was mispredicted.                                                                                                                                                   |
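
The bit layout of IA32_LBR_x_INFO in the table above can be decoded with a few shifts and masks. The following is a minimal sketch, not code from the talk: the names `LbrInfo`, `branch_type_name` and `decode_lbr_info` are hypothetical, and only the fields listed in the table are handled.

```python
from dataclasses import dataclass

# BR_TYPE encodings copied from the table (bits 59:56).
BR_TYPES = {
    0b0000: "JCC",
    0b0001: "JMP Indirect",
    0b0010: "JMP Direct",
    0b0011: "CALL Indirect",
    0b0100: "CALL Direct",
    0b0101: "RET",
}


def branch_type_name(code: int) -> str:
    """Map a 4-bit BR_TYPE code to its name."""
    if code in BR_TYPES:
        return BR_TYPES[code]
    if code & 0b1000:
        return "Other Branch"  # 1xxxB
    return "Reserved"          # 011xB


@dataclass(frozen=True)
class LbrInfo:  # hypothetical container, for illustration only
    cyc_cnt: int
    br_type: str
    cyc_cnt_valid: bool
    tsx_abort: bool
    in_tsx: bool
    mispred: bool


def decode_lbr_info(value: int) -> LbrInfo:
    """Split a raw 64-bit INFO value into the fields named in the table."""
    return LbrInfo(
        cyc_cnt=value & 0xFFFF,                      # bits 15:0
        br_type=branch_type_name((value >> 56) & 0xF),  # bits 59:56
        cyc_cnt_valid=bool((value >> 60) & 1),       # bit 60
        tsx_abort=bool((value >> 61) & 1),           # bit 61
        in_tsx=bool((value >> 62) & 1),              # bit 62
        mispred=bool((value >> 63) & 1),             # bit 63
    )


# A made-up raw value: mispredicted indirect call, 18 cycles, count valid.
example = (1 << 63) | (1 << 60) | (0b0011 << 56) | 18
info = decode_lbr_info(example)
```

Bits 55:16 are ignored on purpose, since the table marks them as undefined.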

# HWPGO Overview

#### HW-based PGO is an extension of existing Sampling-based PGO

- HWPGO is a kind of Sampling-based PGO for efficient profiling on optimized binaries in production environments.
- HWPGO enables new types of feedback capabilities provided by HW for new compiler optimizations. HW counters can track a wide range of events, including:
  - Instructions retired
  - Branch mispredictions
  - Cache misses
  - Memory accesses and Data Address
  - Floating-point operations
  - Architectural LBR Inserts (in next-gen CPUs)

Hardware can provide accurate frequency and profiles of other events:



BR\_INST\_RETIRED.NEAR\_TAKEN:uppp: 4016b0 0x4016b0/0x40116d/P/-/-/8 0x401168/0x4016b0/P/-/-/9 1193/0x401160/P/-/-/6 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/5 b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/6 0x4016b0/0x40116d/P/-/-/1 /0x4016b0/P/-/-/11 0x401193/0x401160/M/-/-/3 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1

12 BR\_MISP\_RETIRED.ALL\_BRANCHES:upp: 401193 0x4016b0/0x40116d/P/-/-/7 0x401168/0x4016b0/P/-/-/ 01193/0x401160/M/-/-/3 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/5 1b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/5 0x4011b5/0x401170/P/-/-/1 8/0x4016b0/P/-/-/14 0x401193/0x401160/M/-/-/3 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1

(Some entries are cut off at the edges of the slide screenshot.)
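
Each LBR entry in a sample like the ones above has the form FROM/TO/M-or-P/X/A/cycles, so a per-branch misprediction ratio can be computed directly from the text. The following is a minimal sketch under that assumption, not the parser llvm-profgen uses: `BranchEntry`, `parse_brstack` and `mispredict_ratio` are hypothetical names, and newer perf versions may append extra fields that this sketch does not handle.

```python
from collections import Counter
from typing import NamedTuple


class BranchEntry(NamedTuple):  # hypothetical record type
    src: int
    dst: int
    mispredicted: bool
    cycles: int


def parse_brstack(text: str) -> list:
    """Parse whitespace-separated FROM/TO/M|P/X/A/cycles entries.

    Tokens that are not complete entries (for example ones cut off in a
    screenshot) are skipped rather than guessed at.
    """
    entries = []
    for token in text.split():
        parts = token.split("/")
        if len(parts) != 6:
            continue
        src, dst, pred, _in_tx, _abort, cycles = parts
        if not (src.startswith("0x") and dst.startswith("0x")):
            continue
        try:
            entries.append(
                BranchEntry(int(src, 16), int(dst, 16), pred == "M", int(cycles))
            )
        except ValueError:
            continue
    return entries


def mispredict_ratio(entries) -> dict:
    """Fraction of recorded executions of each branch source that mispredicted."""
    taken = Counter(e.src for e in entries)
    missed = Counter(e.src for e in entries if e.mispredicted)
    return {src: missed[src] / taken[src] for src in taken}


SAMPLE = (
    "0x4016b0/0x40116d/P/-/-/8 0x401168/0x4016b0/P/-/-/9 "
    "1193/0x401160/P/-/-/6 0x401193/0x401160/M/-/-/3 "
    "0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/"
)
entries = parse_brstack(SAMPLE)
ratios = mispredict_ratio(entries)
```

In this sample only the branch at 0x401193 is marked `M`, which is the kind of signal branch-mispredict feedback relies on.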


#### **Branch Mispredict Feedback Example**



| Before HWPGO: | After HWPGO: |
|---------------|--------------|
| jmp .LBB0_1<br>.p2align 4, 0x90<br>.LBB0_3:<br>movq %rsi, %rcx<br>movl \$3, %edx<br>callq nop<br>movq %rsi, %r13<br>.LBB0_4:<br>movl (%r13), %eax<br>movl %eax, (%r14,%r12,4)<br>incq %r12<br>addq \$4, %rsi<br>addq \$4, %r15<br>cmpq \$20000, %r12<br>je .LBB0_5<br>.LBB0_1:<br>cmpl \$8001, (%rbx,%r12,4)<br>jl .LBB0_3<br># %bb.2:<br>leaq (%rdi,%r12,4), %r13<br>movl %r12d, %eax<br>imull %r12d, %eax<br>imull %r12d, %eax<br>imull %r12d, %edx<br>imull %r12d, %edx<br>imull %r12d, %edx<br>movq %r15, %rcx<br>callq nop<br>jmp .LBB0_4 | .LBB0_1:<br>movl %r12d, %eax<br>imull %r12d, %eax<br>imull %r12d, %eax<br>movl %eax, %edx<br>imull %r12d, %edx<br>imull %eax, %edx<br>cmpl \$8001, (%rbx,%r15)<br>movq %rsi, %r13<br>cmovgeq %rdi, %r13<br>cmovll %ebp, %edx<br>leaq (%r15,%r13), %rcx<br>callq nop<br>movl (%r13,%r15), %eax<br>movl %eax, (%r14,%r15)<br>incq %r12<br>addq \$4, %r15<br>cmpq \$20000, %r12<br>jne .LBB0_1 |

HWPGO optimization: the mispredicted conditional branch is converted to conditional moves.

| Before HWPGO: | After HWPGO: |
|---------------|--------------|
| \$ perf stat -e cycles:u,instructions,br_inst_retired.all_branches:u,br_misp_retired.all_branches ./unpredictable | \$ perf stat -e cycles:u,instructions,br_inst_retired.all_branches:u,br_misp_retired.all_branches ./unpredictable.hwpgo |
| Performance counter stats for './unpredictable': | Performance counter stats for './unpredictable.hwpgo': |
| 3,243,043,047 cycles:u<br>3,619,535,187 instructions:u # 1.12 insn per cycle<br>917,083,309 br_inst_retired.all_branches:u<br>85,966,707 br_misp_retired.all_branches:u<br>1.021617622 seconds time elapsed | 1,715,030,113 cycles:u<br>4,000,954,710 instructions:u # 2.33 insn per cycle<br>600,132,829 br_inst_retired.all_branches:u |

Slide annotations: +10% retired instructions, 2X IPC, 1.8X improvement in overall performance.
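
The headline figures follow from the counter values in the two perf stat runs. The short calculation below reproduces them; the variable names are illustrative, and the speedup is computed from user-mode cycles because the elapsed time of the HWPGO run is not visible on the slide.

```python
# Counter values copied from the two perf stat outputs.
before = {"cycles": 3_243_043_047, "instructions": 3_619_535_187}
after = {"cycles": 1_715_030_113, "instructions": 4_000_954_710}


def ipc(counters: dict) -> float:
    """Instructions retired per cycle."""
    return counters["instructions"] / counters["cycles"]


ipc_before = ipc(before)
ipc_after = ipc(after)

# Fewer cycles for the same work is the speedup, even though the
# cmov version retires more instructions.
cycle_speedup = before["cycles"] / after["cycles"]
instruction_growth = after["instructions"] / before["instructions"] - 1.0

print(f"IPC: {ipc_before:.2f} -> {ipc_after:.2f}")
print(f"cycle speedup: {cycle_speedup:.2f}x")
print(f"extra instructions: {instruction_growth:.1%}")
```

The result illustrates the trade: roughly 10% more retired instructions in exchange for removing the mispredicted branch, which about doubles IPC.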

# SPGO/HWPGO Compiler Support on Windows

- Windows (and Linux) HWPGO feature supported since Intel<sup>®</sup> oneAPI DPC++/C++ Compiler 2024.0 release
  - LLVM-based Intel proprietary compiler released in Nov 2023
  - https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/current/hardware-profile-guided-optimization.html
  - https://www.intel.com/content/www/us/en/developer/articles/technical/hwpgo.html
- Basic Windows (and Linux) SPGO/HWPGO features now available in LLVM trunk as of Mar 2024
  - https://clang.llvm.org/docs/UsersManual.html#id50
  - Features mostly contributed by Intel ported from the Intel proprietary codebase above
  - Requires use of Intel VTune SEP from oneAPI 2024.0
- These are the first Windows compilers to support SPGO/HWPGO

## Windows Support: Profiling Tool

Intel® VTune<sup>™</sup> SEP supports Linux perf script output format since oneAPI 20

\$ sep -perf-script event, ip, brstack -ec BR\_INST\_RETIRED.NEAR\_TAKEN ...

Slide callouts on the output below: PID (20068), base addr (0x7ff693d60000), mmapped size (0x2f000), page offset (0x1000), bin path, event name.

1 PERF\_RECORD\_MMAP2 20068/0: [0x7ff693d60000(0x2f000) @ 0x1000 00:00 0 0]: r-xp c:\Users\wxiao3\opt\hwpgo-mispredict-example\unpredictable.exe

2 PERF\_RECORD\_MMAP 20068/0: [0x7ffdede10000(0x216000) @ 0x1000]: x c:\Windows\System32\ntdll.dll

3 PERF\_RECORD\_MMAP 20068/0: [0x7ffdecd30000(0xc4000) @ 0x1000]: x c:\Windows\System32\kernel32.dll

4 PERF\_RECORD\_MMAP 20068/0: [0x7ffdeb550000(0x3a6000) @ 0x1000]: x c:\Windows\System32\KernelBase.dll

5 PERF\_RECORD\_MMAP 20068/0: [0x210c8c40000(0x14000) @ 0x1000]: x c:\Windows\System32\umppc17807.dll

6 PERF\_RECORD\_MMAP 20068/0: [0x210c8c80000(0x14000) @ 0x1000]: x c:\Windows\System32\umppc17807.dll

7 PERF\_RECORD\_MMAP 20068/0: [0x210c8c80000(0x14000) @ 0x1000]: x c:\Windows\System32\umppc17807.dll

8 PERF\_RECORD\_MMAP 20068/0: [0x7ffdea190000(0x18000) @ 0x1000]: x c:\Windows\System32\kernel.appcore.dll

9 PERF\_RECORD\_MMAP 20068/0: [0x7ffdebb60000(0xa7000) @ 0x1000]: x c:\Windows\System32\msvcrt.dll

10 BR\_INST\_RETIRED.NEAR\_TAKEN:pdir: 7ff693d613e0 0x7ff693d613e0/0x7ff693d61086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/-/X/A/0 0x7ff693d61086/0x7ff693d61040/M/-/-/0 0x7ff693d613e0/0x7ff693d51086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/P/X/A/0 0x7ff693d613e0/0x7ff693d6103d/P/-/-/0 0x7ff693d61038/0x7ff693d613e0/P/X/-/0 0x7ff693d61064/0x7ff693d61030/P/-/-/0 0x7ff693d61086/0x7ff693d61040/M/-/-/0 0x7ff693d613e0/0x7ff693d61086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/-/X/A/0 0x7ff693d61[...]f693d61040/M/-/-/0 0x7ff693d613e0/0x7ff693d61086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/P/X/A/0 0x7ff693d613e0/0x7ff693d6103d/P/-/-/0 0x7ff693d61040/M/-/A/0 0x7ff693d61064/0x7ff693d61030/P/-/-/0 0x7ff693d61086/0x7ff693d61040/M/-/-/0 0x7ff693d613e0/0x7ff693d61086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/P/-/-/1 0x7ff693d613e0/0x7ff693d6103d/P/-/-/0 0x7ff693d61038/0x7ff693d613e0/M/-/A/0 0x7ff693d61064/0x7ff693d61030/P/-/-/0 0x7ff693d61086/0x7ff693d61040/M/-/-/0 0x7ff693d613e0/0x7ff693d61086/M/-/-/0 0x7ff693d61081/0x7ff693d613e0/-/X/A/0 0x7ff693d613e0/0x7ff693d6103d/-/-/A/0 0x7ff693d61038/0x7ff693d613e0/P/X/A/0 0x7ff693d61064/0x7ff693d61030/M/X/A/0 0x7ff693d613e0/0x7ff693d6103d/P/-0x7ff693d61038/0x7ff693d613e0/-/X/A/0 0x7ff693d61064/0x7ff693d61030/P/-/-/0

(Parts of this sample are obscured by slide callouts.)

The brstack entries use the /v/v/v/cycles syntax, with fields in the following order:

- FROM: branch source instruction
- TO: branch target instruction
- M/P/-: M = branch target mispredicted or branch direction mispredicted, P = predicted, - = not supported
- X/-: X = branch inside a transactional region, - = not in a transactional region or not supported
- A/-: A = TSX abort entry, - = not aborted region or not supported
- cycles
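
The MMAP records carry everything needed to map a sampled address back to a binary: PID, base address, mapped size, page offset and path. The following is a minimal sketch of extracting those fields with a regular expression. `MMAP_RE` and `parse_mmap` are hypothetical names, not llvm-profgen's implementation, and the pattern assumes the record layout shown above.

```python
import re

# Matches both PERF_RECORD_MMAP and PERF_RECORD_MMAP2 lines in the layout above.
MMAP_RE = re.compile(
    r"PERF_RECORD_MMAP2?\s+(?P<pid>\d+)/(?P<tid>\d+):\s+"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\((?P<size>0x[0-9a-fA-F]+)\)\s+@\s+"
    r"(?P<pgoff>(?:0x)?[0-9a-fA-F]+)[^\]]*\]:\s+"
    r"(?P<prot>\S+)\s+(?P<path>.+)$"
)


def parse_mmap(line: str):
    """Return the fields of one MMAP/MMAP2 record, or None if it does not match."""
    m = MMAP_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "tid": int(m["tid"]),
        "addr": int(m["addr"], 16),    # base address of the mapping
        "size": int(m["size"], 16),    # mmapped size
        "pgoff": int(m["pgoff"], 16),  # page offset
        "prot": m["prot"],
        "path": m["path"].strip(),     # binary path
    }


LINE = (
    r"PERF_RECORD_MMAP2 20068/0: [0x7ff693d60000(0x2f000) @ 0x1000 00:00 0 0]: "
    r"r-xp c:\Users\wxiao3\opt\hwpgo-mispredict-example\unpredictable.exe"
)
record = parse_mmap(LINE)
```

A sampled address belongs to this mapping when it lies in the half-open range from `addr` to `addr + size`.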

### Windows Support: How Symbolization is Handled

#### Use DWARF Instead of PDB

- PDB encodes de-mangled (display) names
- DWARF encodes mangled (linkage) names



### Windows Support: Changes made to llvm-profgen

Understand COFF/PE with DWARF by enhancing:

#### ProfileGenerator

#### PerfReader

- void PerfScriptReader::updateBinaryAddress(const MMapEvent &Event)
- bool PerfScriptReader::extractMMap2EventForBinary(ProfiledBinary \*Binary, StringRef Line, MMapEvent &MMap)

#### ProfiledBinary

- void ProfiledBinary::load()
- void ProfiledBinary::setPreferredTextSegmentAddresses(const ELFObjectFileBase \*Obj)
- void ProfiledBinary::setUpDisassembler(const ELFObjectFileBase \*Obj)
- void ProfiledBinary::disassemble(const ELFObjectFileBase \*Obj)

# HW-based PGO Demo

Example source: https://github.com/tcreech-intel/hwpgo-mispredict-example

Demo recorded 2024-04-08 03:42 UTC by Xiao, Wei3 (Microsoft Teams recording).

# Challenges & Solutions

### Usability

- Many flags (-fdebug-info-for-profiling, -funique-internal-linkage-names, -gdwarf, /debug:dwarf, ...) are needed to produce good debug info. These need to be consolidated and simplified.
  - oneAPI compilers have "-fprofile-sample-generate".
- Multiple profile types (frequency, branch mispredicts, etc.) will become difficult to produce, manage, and pass to the compiler.
  - Considering profile "bundles" and higher-level tools to drive creation of PMU profiles.

# Challenges & Solutions (2)

### Debug Info Accuracy

- HWPGO uses debug information (such as DWARF) to associate profile data from the optimized binary with source code and compiler IR.
  - Pro: it neither prevents any optimizations nor adds run-time overhead to the profiled binary.
  - Con: correlation can be inaccurate under aggressive optimizations.
- Solutions:
  - Enhance debug info.
  - Turn off aggressive optimizations.
  - Pseudo-instrumentation.

Initial selection DAG: %bb.14 'foo:entry'
SelectionDAG has 5 nodes:
 t0: ch,glue = EntryToken
 t2: i64,ch = CopyFromReg t0, Register:i64 %1, jump\_table.c:4:3
 t4: ch = br\_jt t2:1, JumpTable:i64<0>, t2, jump\_table.c:4:3

https://github.com/llvm/llvm-project/pull/71021

| Line | Sample profile text (indentation lost in extraction) |
|------|-----------------------------|
| 1  | unpredictable:12711006836:0 |
| 2  | 0:0                         |
| 3  | 3.1: 200031858              |
| 4  | 3.2: 200031858              |
| 5  | 5: 200031858                |
| 6  | 6: 116870734                |
| 7  | 7: 116870734                |
| 8  | 8: 121580402 nop:121580402  |
| 9  | 11: 84838540 nop:84838540   |
| 10 | 13: 200031858               |
| 11 | 15: 0                       |
| 12 | 65517: 116870734            |
| 13 | nop:192386712:206418942     |
| 14 | 1: 192386712                |
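
The listing above is in LLVM's text sample-profile format: a `function:total_samples:head_samples` header, followed by indented `offset[.discriminator]: count [callee:count ...]` body lines. The following is a minimal sketch of a reader for that subset. It is an illustration rather than LLVM's parser, `parse_text_profile` is a hypothetical name, nested inlined call sites are deliberately not handled, and the sample string restores the one-space indentation that was lost from the slide.

```python
import re

# Body line: offset, optional discriminator, sample count, optional call targets.
BODY_RE = re.compile(r"^(\d+)(?:\.(\d+))?:\s*(\d+)(.*)$")


def parse_text_profile(text: str) -> dict:
    """Parse top-level functions and their body lines.

    Inlined call-site blocks are not supported by this sketch; lines that
    do not look like plain body lines are skipped.
    """
    profiles = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # Function header: name:total_samples:head_samples
            name, total, head = raw.strip().rsplit(":", 2)
            current = {"total": int(total), "head": int(head), "body": {}}
            profiles[name] = current
            continue
        m = BODY_RE.match(raw.strip())
        if m is None or current is None:
            continue
        offset, disc, count, rest = m.groups()
        targets = {}
        for token in rest.split():
            callee, n = token.rsplit(":", 1)
            targets[callee] = int(n)
        current["body"][(int(offset), int(disc or 0))] = (int(count), targets)
    return profiles


SAMPLE = """unpredictable:12711006836:0
 0: 0
 3.1: 200031858
 8: 121580402 nop:121580402
 11: 84838540 nop:84838540
nop:192386712:206418942
 1: 192386712
"""
profiles = parse_text_profile(SAMPLE)
```

In this profile the two call sites at offsets 8 and 11 split the loop's 200031858 iterations between them, which is what the compiler uses as block frequency information.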

# Challenges & Solutions (3)

### **Profile Maintenance**

- After each optimization pass, the profile (branch probabilities) must be updated to reflect any changes to the control flow graph.
- One recent bug fix of this kind:



https://github.com/llvm/llvm-project/pull/86470

# Challenges & Solutions (4)

### Value Profiling

- If both PEBS and LBR records are captured, we can sample both function call counts and a limited set of function arguments.

\$ perf record --user-regs -b -e xxx

BR\_INST\_RETIRED.NEAR\_TAKEN:uppp: 401168 ABI:2 AX:0x1 BX:0x1d0a03c CX:0x1d023c0 DX:0x1ceeb30 SI:0x3 DI:0x1d0a03c BP:0x1f1f SP:0x7ffde51a1868 IP:0x401168 FLAGS:0x8 R8:0x1d15c50 R9:0x1 R11:0x1d23000 SS:0x2b R10:0xfff R12:0x1d15c50 R13:0x1cf67 R14:0x1d0a038 R15:0x1cdb2a0 0x401168/0x4016b0/P/-/-/18 0x401193/0x401160/M/-/-/3 0x4016b0/0x40116d/P/-/-/8 0x401168/0x4016b0/P/-/-/18 0x401193/0x401160/M/-/-/3 0x4016b0/0x40116d/P/-/-/8 1168/0x4016b0/P/-/-/15 0x401193/0x401160/M/-/-/3 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/ 0x4011b0/0x4016b0/P/-/-/10 0x4016b0/0x40116d/P/-/-/7 0x401168/0x4016b0/P/-/-/14 0x401193/0x401160/M/-/-/3 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/29 0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/29 0x4011b5/0x401170/P/-/-/1 4016b0/0x4011b5/P/-/-/1 0x4011b0/0x4016b0/P/-/-/28 0x4011b5/0x401170/P/-/-/1 0x4016b0/0x4011b5/P/-/ 0x4011b0/0x4016b0/P/-/-/37 0x4016b0/0x40116d/P/-/-/7 0x401168/0x4016b0/P/-/-/18 0x401193/0x401 /M/-/-/6 0x4016b0/0x40116d/P/-/-/7 0x401168/0x4016b0/P/-/-/4 0x401193/0x401160/M/-/-/3

(Some entries are cut off at the edges of the slide screenshot.)
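
A sample like the one above pairs the register file with the branch stack, so argument values can be read from the argument registers when the sampled instruction is a call. The following is a minimal sketch of that idea. `parse_user_regs` and `sysv_int_args` are hypothetical names, and the register order assumes the System V x86-64 calling convention; Windows x64 passes the first integer arguments in CX, DX, R8 and R9 instead.

```python
import re

# NAME:0xVALUE pairs as printed in the sample; "ABI:2" has no 0x and is skipped.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]*):0x([0-9a-fA-F]+)\b")

# Integer argument registers in System V x86-64 order (an assumption here).
SYSV_ARG_REGS = ("DI", "SI", "DX", "CX", "R8", "R9")


def parse_user_regs(line: str) -> dict:
    """Collect the sampled user registers into a name -> value mapping."""
    return {name: int(value, 16) for name, value in REG_RE.findall(line)}


def sysv_int_args(regs: dict, count: int) -> list:
    """First `count` integer arguments, or None where a register is missing."""
    return [regs.get(name) for name in SYSV_ARG_REGS[:count]]


SAMPLE = (
    "BR_INST_RETIRED.NEAR_TAKEN:uppp: 401168 ABI:2 AX:0x1 BX:0x1d0a03c "
    "CX:0x1d023c0 DX:0x1ceeb30 SI:0x3 DI:0x1d0a03c BP:0x1f1f "
    "SP:0x7ffde51a1868 IP:0x401168 FLAGS:0x8 R8:0x1d15c50 R9:0x1 "
    "0x401168/0x4016b0/P/-/-/18 0x401193/0x401160/M/-/-/3"
)
regs = parse_user_regs(SAMPLE)
args = sysv_int_args(regs, 2)
```

Only register-passed arguments are visible this way, which is why the slide speaks of a limited set of arguments.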

## List of PRs checked into LLVM Trunk so far

- Refer to: https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers
- [llvm-profgen] Support COFF binary: 83972
- [LLD] [COFF] Port -lto-sample-profile to COFF version of LLD: 85701
- Update documentation and release notes for llvm-profgen COFF support: 84864
- Profile Maintenance:
  - LoopRotate: 86496
- DebugInfo Fix:
  - JumpTable: 71018, 72075, 72082, 72118, 71021
  - CodeGen: 72192
- Support -gsplit-dwarf for COFF (RFC: 71276):
  - MC: D151793, D152119, D152229, D152340
  - Clang & MC: D152785, 82dff24bde112984314568e7d581379fd0ea48e6
  - [LLD][COFF]: D154070 (to support /dwodir for LTO)
  - Clang: D154176, D154295
- Emit symbol-table for COFF:
  - [LLD][COFF]: D149235
- Fix HW-based PGO/Sampling-based PGO gap with Instrumentation-based PGO:
  - InlineCost: 66457
  - InstCombine: 68474, 68502

# Summary

- HW-based PGO is an extension of existing Sampling-based PGO for:
  - Lower overhead
  - Higher accuracy
  - New feedback capabilities for higher performance gains
- Call for community collaboration on HW-based PGO to:
  - Add infrastructure to support more feedback/profile types besides frequency
  - Add optimizations for new feedback/profile types
  - Enhance Debug Info Accuracy
  - Enhance Profile Maintenance
  - Support Value Profiling




# Legal Disclaimer & Optimization Notice

- INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
- Software and workloads used in performance tests may have been optimized for performance only on Intel
  microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
  components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
  should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
  including the performance of that product when combined with other products.
- Copyright © 2024, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



### Collection Overhead Reduced by Extended PEBS



#### Branch Mispredict Feedback CoreMark<sup>®</sup>-PRO Example:

coremark-pro/benchmarks/consumer\_v2/cjpeg/jcdctmgr.c



### New Feedback Capabilities: Branch Mispredict Feedback CoreMark®-PRO Example

| Counter | Before HWPGO: './cjpeg-rose7-preset.exe.Vanilla -v0 -i1500' | After HWPGO: './cjpeg-rose7-preset.exe.New -v0 -i1500' |
|---------|------------------------------|------------------------------|
| task-clock | 4,255.40 msec # 0.998 CPUs utilized | 3,118.57 msec # 0.997 CPUs utilized |
| context-switches | 18 # 4.230 /sec | 10 # 3.207 /sec |
| cpu-migrations | 1 # 0.235 /sec | 1 # 0.321 /sec |
| page-faults | 178 # 41.829 /sec | 177 # 56.757 /sec |
| cpu_core/cycles/ | 22,051,463,005 # 5.182 G/sec | 16,174,859,445 # 5.187 G/sec |
| cpu_core/instructions/ | 48,356,977,972 # 11.364 G/sec | 56,997,553,072 # 18.277 G/sec |
| cpu_core/branches/ | 4,727,869,197 # 1.111 G/sec | 4,751,693,034 # 1.524 G/sec |
| cpu_core/branch-misses/ | 436,828,793 # 102.653 M/sec | 186,405,909 # 59.773 M/sec |
| cpu_core/slots/ | 132,308,528,448 # 31.092 G/sec | 97,048,835,412 # 31.120 G/sec |
| cpu_core/topdown-retiring/ | 44,620,028,215 # 33.6% Retiring | 53,277,581,186 # 54.9% Retiring |
| cpu_core/topdown-bad-spec/ | 36,836,285,578 # 27.7% Bad Speculation | 23,213,761,622 # 23.9% Bad Speculation |
| cpu_core/topdown-fe-bound/ | 36,839,837,672 # 27.7% Frontend Bound | 9,898,710,296 # 10.2% Frontend Bound |
| cpu_core/topdown-be-bound/ | 14,531,129,481 # 10.9% Backend Bound | 10,658,832,092 # 11.0% Backend Bound |
| cpu_core/topdown-heavy-ops/ | 519,327,103 # 0.4% Heavy Operations # 33.2% Light Operations | 381,031,756 # 0.4% Heavy Operations # 54.5% Light Operations |
| cpu_core/topdown-br-mispredict/ | 36,836,233,341 # 27.7% Branch Mispredict # 0.0% Machine Clears | 23,213,662,047 # 23.9% Branch Mispredict # 0.0% Machine Clears |
| cpu_core/topdown-fetch-lat/ | 22,311,685,681 # 16.8% Fetch Latency # 10.9% Fetch Bandwidth | 4,950,077,068 # 5.1% Fetch Latency # 5.1% Fetch Bandwidth |
| cpu_core/topdown-mem-bound/ | 2,077,047,233 # 1.6% Memory Bound # 9.4% Core Bound | 381,678,995 # 0.4% Memory Bound # 10.6% Core Bound |
| seconds time elapsed | 4.265031168 | 3.128461260 |
| seconds user / sys | 4.252556000 / 0.004000000 | (not shown) |

The cpu_atom counters (cycles, instructions, branches, branch-misses) were all `<not counted>` (0.00%) in both runs.

Slide annotations: -33% and +17%.

- Call for community collaboration on HWPGO to:
  - Add infrastructure support for more feedback/profile types besides frequency (i.e., "-fprofile-sample-use=code.freq.prof")
  - Add optimizations for new profile types

# Windows Support: llvm-profgen

#### Canonicalize WINDOWS Virtual Address for COFF/PE

// Canonicalize to use preferred load address as base address.
uint64\_t canonicalizeVirtualAddress(uint64\_t Address) {
 return Address - BaseAddress + getPreferredBaseAddress();
}
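
The canonicalization is a plain rebase: subtract the run-time load base and add the preferred base recorded in the PE header. The worked example below is a sketch under stated assumptions. The Python function mirrors the C++ one, the run-time base and sampled IP are taken from the SEP output shown earlier in the deck, the preferred base is the Image Base value from the PE header listing, and it is assumed that the address reported in the MMAP record is the image's run-time load base.

```python
def canonicalize_virtual_address(address: int, base_address: int,
                                 preferred_base_address: int) -> int:
    """Rebase a run-time address onto the binary's preferred load address."""
    return address - base_address + preferred_base_address


# Values taken from the slides; treating the mmap address as the image
# load base is an assumption of this example.
RUNTIME_BASE = 0x7FF693D60000   # where Windows (ASLR) loaded the image
PREFERRED_BASE = 0x140000000    # Image Base from the PE optional header
sample_ip = 0x7FF693D613E0      # sampled instruction pointer

canonical = canonicalize_virtual_address(sample_ip, RUNTIME_BASE, PREFERRED_BASE)
print(hex(canonical))  # 0x1400013e0
```

After this step every sample refers to the same addresses the disassembler and DWARF line tables use, regardless of where ASLR placed the image in a given run.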

PE header viewer, Optional Hdr tab (values in hex):

| Offset | Name | Value | Meaning |
|--------|------|-------|---------|
| 90 | Magic | 20B | NT64 |
| 92 | Linker Ver. (Major) | E | |
| 93 | Linker Ver. (Minor) | 0 | |
| 94 | Size of Code | 3A00 | |
| 98 | Size of Initialized Data | 3000 | |
| 9C | Size of Uninitialized Data | 0 | |
| A0 | Entry Point | 4068 | |
| A4 | Base of Code | 1000 | |
| A8 | Image Base | 140000000 | |
| B0 | Section Alignment | 1000 | |
| B4 | File Alignment | 200 | |
| B8 | OS Ver. (Major) | 6 | Windows Vista / Server 2008 |
| BA | OS Ver. (Minor) | 0 | |
| BC | Image Ver. (Major) | 0 | |
PERF_RECORD_MMAP2 events from the perf script output (some records are cut off in the screenshot):

- PERF_RECORD_MMAP2 1764/0: [0x00007ff9f3271000(0x294000) @ 0x1000 ]: x C:\Windows\System32\KernelBase.
- PERF_RECORD_MMAP2 1764/0: [0x00007ff9f2661000(0xfa000) @ 0x1000 ]: x C:\Windows\System32\ucrtbase.dll
- PERF_RECORD_MMAP2 1764/0: [0x00007ff9f2531000(0x11000) @ 0x1000 ]: x C:\Windows\System32\kernel.appcore.dl
- PERF_RECORD_MMAP2 1764/0: [0x00007ff9ef461000(0x1a000) @ 0x1000 ]: x C:\Windows\System32\vcruntime140.dll
- PERF_RECORD_MMAP2 1764/0: [0x00007ff9ef451000(0xc000) @ 0x1000 ]: x C:\Windows\System32\vcruntime140_1.dll
- [...] System32\msvcp140.dll
- [...] t\spgo\sort\sort.exe

Slide labels: getTextSegmentOffset(), Event.Offset

PE section headers (PE viewer, "Section Hdrs" tab; values in hex):

| Name    | Raw Addr. | Raw size | Virtual Addr. | Virtual Size | Characteristics | Ptr to Reloc. | Num. of Reloc. | Num. of Linenum. |
|---------|-----------|----------|---------------|--------------|-----------------|---------------|----------------|------------------|
| .text   | 400       | 3A00     | 1000          | 3926         | 60000020        | 0             | 0              | 0                |
| .rdata  | 3E00      | 2200     | 5000          | 20C4         | 40000040        | 0             | 0              | 0                |
| .data   | 6000      | 400      | 8000          | 4EB50        | C0000040        | 0             | 0              | 0                |
| .pdata  | 6400      | 600      | 57000         | 444          | 40000040        | 0             | 0              | 0                |
| .00cfg  | 6A00      | 200      | 58000         | 28           | 40000040        | 0             | 0              | 0                |
| .voltbl | 6C00      | 200      | 59000         | 18           | 0               | 0             | 0              | 0                |
| .reloc  | 6E00      | 200      | 5A000         | A0           | 42000040        | 0             | 0              | 0                |
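
The PERF_RECORD_MMAP2 records and the `.text` section header on this slide can be tied together with a small parser. The sketch below is illustrative only: `MMapEvent`, `parseMMap2`, and `mapsTextSegment` are hypothetical names (the field names mirror the slide's "Event.Offset" label), and the regular expression is written for the record layout on the slide, not taken from llvm-profgen.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical record type for one memory-mapping event.
struct MMapEvent {
  uint64_t Address = 0; // runtime start address of the mapping
  uint64_t Size = 0;    // length of the mapping in bytes
  uint64_t Offset = 0;  // value after '@' in the record
  std::string BinaryPath;
};

// Parses one line such as
//   PERF_RECORD_MMAP2 1764/0: [0x7ff9f2661000(0xfa000) @ 0x1000 ]: x C:\...\ucrtbase.dll
// Returns false (leaving Event untouched) if the line does not match.
inline bool parseMMap2(const std::string &Line, MMapEvent &Event) {
  static const std::regex Pattern(
      R"(PERF_RECORD_MMAP2 -?[0-9]+/[0-9]+: )"
      R"(\[(0x[0-9a-fA-F]+)\((0x[0-9a-fA-F]+)\) @ (0x[0-9a-fA-F]+).*?\]: )"
      R"([-a-z]+ (.+))");
  std::smatch Match;
  if (!std::regex_search(Line, Match, Pattern))
    return false;
  assert(Match.size() == 5 && "whole match plus four capture groups");
  // Base 16 with std::stoull accepts the leading "0x" prefix.
  Event.Address = std::stoull(Match[1].str(), nullptr, 16);
  Event.Size = std::stoull(Match[2].str(), nullptr, 16);
  Event.Offset = std::stoull(Match[3].str(), nullptr, 16);
  Event.BinaryPath = Match[4].str();
  return true;
}

// True when the mapping's offset equals the binary's text segment offset,
// i.e. the event describes where the code section was loaded.
inline bool mapsTextSegment(const MMapEvent &Event, uint64_t TextSegmentOffset) {
  return Event.Offset == TextSegmentOffset;
}
```

For the records shown here, every mapping has offset 0x1000, which equals the `.text` Virtual Addr. in the section headers. For such an event, `Event.Address` is a natural candidate for the `BaseAddress` used when canonicalizing sample addresses.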

# HWPGO Documentation/Links

Intel<sup>®</sup> oneAPI DPC++/C++ Compiler:

- <u>https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/current/hardware-profile-guided-optimization.html</u>
- https://www.intel.com/content/www/us/en/developer/articles/technical/hwpgo.html
- <u>https://github.com/tcreech-intel/hwpgo-mispredict-example</u>

LLVM:

- https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers
- Unmerged branch mispredict feedback features:
  - https://github.com/tcreech-intel/llvm-project/tree/ip\_profiles
  - https://github.com/tcreech-intel/llvm-project/tree/unpredictable\_loader
  - https://github.com/tcreech-intel/llvm-project/tree/aggressive\_speculation

#### SPEC CPU2017 Performance on IceLake Windows Server

| llvm-trunk 20240408 | Default (Normalized Performance) | HW-based PGO | Instrumentation-based PGO |
|---------------------|----------------------------------|--------------|---------------------------|
| 500.perlbench_r     | 100%                             | 110.19%      | 113.63%                   |
| 502.gcc_r           | 100%                             | 103.13%      | 105.28%                   |
| 511.povray_r        | 100%                             | 106.37%      | 110.41%                   |

#### HW-based PGO

- 1<sup>st</sup> build: /clang:-fdebug-info-for-profiling /clang:-funique-internal-linkage-names -gdwarf -gline-tables-only -fuse-ld=lld
- 2<sup>nd</sup> build: /clang:-fprofile-sample-use=default.profdata -gline-tables-only -fuse-ld=lld