LLVM 13.0.0git
Typedefs | Functions | Variables
lib/Target/X86/README-SSE.txt File Reference
#include <xmmintrin.h>
#include <math.h>
#include <emmintrin.h>
Include dependency graph for README-SSE.txt:

Typedefs

using t = bitcast float %x to i32 (the i32 scratch value in the integer fabs pattern; the full IR, including the PR6194 test case, is reproduced under Typedef Documentation below)
 

Functions

__m128i shift_right (__m128i value, unsigned long offset)
 
float f32 (v4f32 A)
 
select (load CPI1, ...)
 
load (select CPI1, CPI2)
 
return _mm_set_ps (0.0, 0.0, 0.0, b)
 
c2 (%esp)
 
movaps (%eax)
 
void x (unsigned short n)
 
void y (unsigned n)
 
compile to (-O3 -static -fomit-frame-pointer)
 
movsd (%esp)
 
movl (%esp)
 
store (fneg(load p), q)
 
vSInt16 madd (vSInt16 b)
 
Generated code (x86-32, linux)
 
__m128 foo1 (float x1, float x4)
 
x2 (%rip)
 
x3 (%rip)
 
movzbl (%esp)
 
LC0 (%rip)
 
compiles to (x86-32)
 
fildl (%esp)
 
cvtsi2sd (%esp)
 
fldl (%esp)
 
void foo (double, double, double)
 
void norm (double x, double y, double z)
 

Variables

__m128i_shift_right
    Byte table used by shift_right(): lets a variable shift be lowered with a small table + unaligned load + shuffle instead of going through memory.
 
_Complex float B
    Example where some simple SLP would improve codegen a bit (return A+B should become a packed add).
 
static const vSInt16 a
    Constant operand of the madd() example.
 
_f, _t
    Labels from the vector-multiply-by-constant and VecISelBug.ll codegen examples.
 
tmp5, tmp10, tmp11, tmp12, tmp19, tmp20, i32, i64
    Values from the PR6194 IR shown under the t typedef.
 
Remaining identifiers picked out of the README's prose and assembly examples (see the Variable Documentation below): this, byte, numbers, example, add, xmm0, xmm1, xmm2, xmm3, pshufd, Also, __pad1__, __pad2__, inline, it, pool, double, Currently, lowered, into, elements, to, code, However, instructions, uses, slot, can, used, operations, something, P2, shufps, generate, eax, pinsrw, Guide, andnot, like, shrl, addl, movmskp, node, P, model, consider, readonly, right, early, know, spill, here, Testcase, mode, better, insertps, x3, SSE4, LCPI1_0, LCPI1_1, movl, gcc, LC0, pinsrb, edi, horrible, rdar, shrq, bad, instruction.
 

Typedef Documentation

◆ t

using t = bitcast float %x to i32

The rest of the fabs-folding pattern and the PR6194 IR this fragment comes from:

  %s = and i32 %t, 2147483647
  %d = bitcast i32 %s to float
  ret float %d
}

declare float @fabsf(float %n)

define float @bar(float %x) nounwind {
  %d = call float @fabsf(float %x)
  ret float %d
}

This IR (from PR6194):

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin10.0.0"

%0 = type { double, double }
%struct.float3 = type { float, float, float }

define void @test(%0, %struct.float3* nocapture %res) nounwind noinline ssp {
entry:
  %tmp18 = extractvalue %0 %0, 0

Definition at line 788 of file README-SSE.txt.
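
A hedged C sketch of the same bit-clearing idea (the helper name and the use of memcpy are illustrative, not taken from the README):

  #include <stdint.h>
  #include <string.h>

  /* Clear the sign bit with an integer AND instead of calling fabsf(). */
  static float fabs_via_int(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* bitcast float -> i32 */
    bits &= 0x7fffffffu;             /* and i32 %t, 2147483647 */
    memcpy(&x, &bits, sizeof bits);  /* bitcast i32 -> float */
    return x;
  }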

Function Documentation

◆ _mm_set_ps()

return _mm_set_ps (0.0, 0.0, 0.0, b)

Referenced by foo1().
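
A minimal compilable context for this call, assuming it sits in a small wrapper (the wrapper name is illustrative):

  #include <xmmintrin.h>

  /* Puts b in the low lane and zeros in the other three; ideally this is
     just xorps + movss rather than a longer shuffle/spill sequence. */
  __m128 set_low(float b) {
    return _mm_set_ps(0.0f, 0.0f, 0.0f, b);
  }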

◆ c2()

c2(%esp) is the spill slot in the spill example: if xmm1 gets spilled, a movaps to and from c2(%esp) is inserted even though the reload feeds a single movss that could absorb it.

◆ code()

Generated code (x86-32, linux)

Definition at line 508 of file README-SSE.txt.

◆ cvtsi2sd()

In fpstack mode this compiles to a short sequence; in SSE mode it compiles into significantly slower code that bounces the integer through the stack to feed cvtsi2sd (%esp), %xmm0.

◆ f32()

float f32 (v4f32 A)

Definition at line 26 of file README-SSE.txt.

References A.
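
A hedged sketch of the horizontal-add test, assuming v4f32 is the usual GCC-style vector typedef (the typedef line is an assumption, not quoted from the README):

  typedef float v4f32 __attribute__((vector_size(16)));

  /* Sums the four lanes; this should become a packed reduction (addps or
     haddps) instead of a chain of addss + unpcklps/movhlps. */
  float f32(v4f32 A) {
    return A[0] + A[1] + A[2] + A[3];
  }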

◆ fildl()

In fpstack mode this compiles to a store of the integer onto the stack, followed by fildl (%esp) and an fmuls against LCPI1_0.

◆ fldl()

In the slower SSE sequence the product is computed in xmm0, stored with movsd, and reloaded with fldl (%esp) so it can be returned on the x87 stack.

◆ foo()

void foo (double, double, double)

The store of the extracted float compiles to shifts and moves through %rax/%rdi; this would be better kept in the SSE unit by treating XMM0 as a vector, shuffling v[1] into v[0], and doing a float store. [UNSAFE FP]

Referenced by norm().

◆ foo1()

__m128 foo1 (float x1, float x4)

Definition at line 548 of file README-SSE.txt.

References _mm_set_ps(), x2(), and x3.
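
Given the cross-references to _mm_set_ps(), x2 and x3, the test presumably builds a vector from two arguments and two globals; a sketch under that assumption:

  #include <xmmintrin.h>

  extern float x2, x3;   /* stand-ins for the globals referenced above */

  __m128 foo1(float x1, float x4) {
    /* gcc mainline assembles this from x2(%rip)/x3(%rip) with unpcklps and
       movlhps; insertps (SSE4) could do better. */
    return _mm_set_ps(x1, x2, x3, x4);
  }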

◆ LC0()

Instead of building the value on the stack, it should be a single movdqa LC0(%rip), %xmm0 of the aligned constant.

◆ load()

load (select CPI1, CPI2)

Currently the select is lowered too early, which prevents the dag combiner from turning the pair of constant-pool loads into a single load (select CPI1, CPI2).

Definition at line 90 of file README-SSE.txt.

◆ madd()

vSInt16 madd (vSInt16 b)

Definition at line 503 of file README-SSE.txt.
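
A sketch of the madd test, assuming vSInt16 is a 16-bit-element vector typedef and that the body is a pmaddwd against the constant a shown below (both are assumptions):

  #include <emmintrin.h>

  typedef short vSInt16 __attribute__((vector_size(16)));

  static const vSInt16 a = {-22725, -12873, -22725, -12873,
                            -22725, -12873, -22725, -12873};

  /* pmaddwd of the argument against the constant table. */
  vSInt16 madd(vSInt16 b) {
    return (vSInt16)_mm_madd_epi16((__m128i)a, (__m128i)b);
  }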

◆ movaps()

In the _test example the vector operand is loaded with movaps (%eax), %xmm0.

◆ movl()

This currently compiles to a movsd store of the value onto the stack followed by scalar movl (%esp) reloads.

◆ movsd()

This currently compiles to a movsd (%esp) store of the double onto the stack.

◆ movzbl()

Compiles into xorps %xmm0, %xmm0 followed by a movzbl (%esp) load of the byte.

◆ norm()

void norm (double x, double y, double z)

Definition at line 815 of file README-SSE.txt.

References foo(), scale(), x(), y(), and z.
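
A sketch of what norm() presumably looks like, given the cross-references to foo() and scale and the sqrtsd/divsd note below (the exact body is an assumption):

  #include <math.h>

  void foo(double, double, double);

  void norm(double x, double y, double z) {
    double scale = sqrt(x * x + y * y + z * z);
    /* Today this emits sqrtsd plus a divsd per component; with -ffast-math
       we could compute 1.0/scale once and use mulsd instead of the divs. */
    foo(x / scale, y / scale, z / scale);
  }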

◆ select()

select (load CPI1, load CPI2)

This testcase should have no SSE instructions in it and only one load from the constant pool; currently the select is lowered too early, which blocks the dag combine described under load() above. The pattern isel got this one right.

Referenced by zero().

◆ shift_right()

__m128i shift_right (__m128i value, unsigned long offset)

SSE variable shift can be custom lowered to something like this, which uses a small table + unaligned load + shuffle instead of going through memory.

Definition at line 15 of file README-SSE.txt.
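
A hedged reconstruction of the table-based lowering (the table contents and the use of _mm_shuffle_epi8 follow the usual pattern for this trick; treat the details as assumptions):

  #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

  /* 31-byte table: identity indices followed by 0x80 lane-clearing indices. */
  static const unsigned char __m128i_shift_right[31] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };

  /* Shift right by `offset` bytes: one unaligned load of a table window and
     one pshufb, instead of spilling the vector through memory. */
  __m128i shift_right(__m128i value, unsigned long offset) {
    return _mm_shuffle_epi8(value,
        _mm_loadu_si128((const __m128i *)(__m128i_shift_right + offset)));
  }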

◆ store()

store (fneg(load p), q)

We should lower store(fneg(load p), q) into an integer load+xor+store, which eliminates a constant pool load.
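
A scalar C illustration of the suggested lowering (names are illustrative): the sign flip becomes an integer load, xor and store, so no floating-point sign-mask constant has to be loaded from the constant pool.

  #include <stdint.h>
  #include <string.h>

  void store_fneg(const float *p, float *q) {
    uint32_t bits;
    memcpy(&bits, p, sizeof bits);  /* integer load   */
    bits ^= 0x80000000u;            /* flip the sign bit */
    memcpy(q, &bits, sizeof bits);  /* integer store  */
  }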

◆ to() [1/2]

compile to (-O3 -static -fomit-frame-pointer)

Definition at line 362 of file README-SSE.txt.

References d.

◆ to() [2/2]

compiles to (x86-32)

Definition at line 692 of file README-SSE.txt.

References i8, ret(), tmp12, to, uses, and x().

◆ x()

void x (unsigned short n)

Definition at line 355 of file README-SSE.txt.

References n.

Referenced by norm(), and to().
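
The bodies of x() and y() are not recoverable from this page; a stand-in under the assumption that they store integer-to-floating-point conversions into the global d that to() references:

  static double d;   /* hypothetical global, named after the d referenced by to() */

  void x(unsigned short n) { d = n; }  /* zero-extend, then convert */
  void y(unsigned n)       { d = n; }  /* the unsigned case is the expensive one */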

◆ x2()

gcc mainline compiles it with a RIP-relative load of x2(%rip).

◆ x3()

gcc mainline compiles it by loading x3(%rip) into %xmm0.

◆ y()

void y (unsigned n)

Definition at line 358 of file README-SSE.txt.

References n.

Referenced by norm().

Variable Documentation

◆ __m128i_shift_right

Byte table used by shift_right() so that a variable shift can be lowered with a small table + unaligned load + shuffle instead of going through memory.

Definition at line 11 of file README-SSE.txt.

◆ __pad1__

__pad1__

Definition at line 53 of file README-SSE.txt.

◆ __pad2__

__pad2__

Definition at line 619 of file README-SSE.txt.

◆ _f

_f is the vector-multiply-by-constant example: multiplying a <4 x i32> by <i32 10, i32 10, i32 10, i32 10> compiles into poor scalar code (movd, imull against LCPI1_0, ...) on targets without SSE4.1.

Definition at line 582 of file README-SSE.txt.

◆ _t

_t is the VecISelBug.ll example: a shuffle of two constant vectors should become a single constant vector, but we compiled it to something horrible that assembles the value with movhps/movss from separate constants.

Definition at line 675 of file README-SSE.txt.

◆ a

static const vSInt16 a

Initial value:
 = {-22725, -12873, -22725, -12873, -22725, -12873,
    -22725, -12873}

Definition at line 500 of file README-SSE.txt.

◆ add

Current vs. ideal sequences for re-implementing the atomic builtins: x86 does not have to use the current expansion; it can use a locked add directly.

Definition at line 25 of file README-SSE.txt.

◆ addl

In the stack-based expansion the value is assembled with movdqa and the frame is released with an addl to %esp.

Definition at line 394 of file README-SSE.txt.

◆ Also

We should transform a shuffle of two vectors of constants into a single vector of constants. Also, insertelement of a constant into a vector of constants should result in a vector of constants (e.g. VecISelBug.ll).

Definition at line 43 of file README-SSE.txt.

◆ andnot

andnot is among the SSE compare/logical translations worth supporting; see the Apple AltiVec/SSE Migration Guide notes above.

Definition at line 318 of file README-SSE.txt.

◆ B

_Complex float B

There are cases where some simple SLP would improve codegen a bit, for example compiling a complex add:

Initial value:
{
return A+B

Definition at line 46 of file README-SSE.txt.
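
A compilable version of the SLP example, assuming the usual shape of a complex add (the wrapper is illustrative):

  #include <complex.h>

  /* The real/imaginary pairs sit in adjacent lanes, so this could be one
     packed addps instead of two scalar addss instructions. */
  _Complex float add_cf(_Complex float A, _Complex float B) {
    return A + B;
  }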

◆ bad

We currently generate sqrtsd and divsd instructions; this is bad.

Definition at line 820 of file README-SSE.txt.

◆ better

In one x86 mode we generate spiffy two-instruction code; in the other we generate code which could be better (insertps in SSE4 would improve both).

Definition at line 537 of file README-SSE.txt.

◆ byte

Element type of the __m128i_shift_right table used by the variable-shift lowering.

Definition at line 11 of file README-SSE.txt.

Referenced by readPrefixes().

◆ can

The basic idea is that a reload from a spill slot, if only one element chunk is used, can bring in zeros for the other elements instead of reloading all of them.

Definition at line 269 of file README-SSE.txt.

◆ code

In fpstack mode this compiles to a short sequence; in SSE mode it compiles into significantly slower code.

Definition at line 240 of file README-SSE.txt.

◆ consider

We should lower store(fneg(load p), q) into an integer load+xor+store, which eliminates a constant pool load. For example, consider:

Definition at line 421 of file README-SSE.txt.

◆ Currently

This testcase should have no SSE instructions in it and only one load from a constant double. Currently the select is being lowered too early.

Definition at line 89 of file README-SSE.txt.

◆ double

This testcase should have no SSE instructions in it, and only one load from a constant double.

◆ early

LLVM currently generates stack realignment code when it is not needed. The problem is that we need to know about stack alignment too early, before RA runs.

Definition at line 486 of file README-SSE.txt.

◆ eax

The store compiles to a shrq/movl dance through %rax and %eax instead of staying in the SSE unit.

Definition at line 303 of file README-SSE.txt.

◆ edi

In the better sequence the incoming value in %edi is inserted directly into the movdqa'd constant instead of going through the stack.

Definition at line 650 of file README-SSE.txt.

◆ elements

This compiles to mulss/xorps/movss/ret. Because mulss doesn't modify the top elements, the top elements of xmm1 are already zero'd, so we could compile this to fewer instructions.

◆ example

We should lower store(fneg(load p), q) into an integer load+xor+store, which eliminates a constant pool load. For example:

Definition at line 23 of file README-SSE.txt.

◆ gcc

Definition at line 630 of file README-SSE.txt.

◆ generate

This code generates ugly shuffles (pxor/movaps/shufps), probably due to costs being off or something; would it be better to generate the pinsrw-based sequence instead?

Definition at line 301 of file README-SSE.txt.

◆ Guide

Some useful information is in the Apple AltiVec/SSE Migration Guide.

Definition at line 318 of file README-SSE.txt.

◆ here

Before RA runs we don't know whether there will be vector spills or not; the stack realignment logic is overly conservative here.

Definition at line 490 of file README-SSE.txt.

◆ horrible

A shuffle of two constant vectors should become a single constant vector, and insertelement of a constant into a constant vector should too (e.g. VecISelBug.ll); we compiled it to something horrible.

Definition at line 672 of file README-SSE.txt.

◆ However

Now consider if the code caused xmm1 to get spilled; this might produce the longer spill/reload sequence. However, since the reload is only used by these instructions, we could fold it into them.

Definition at line 257 of file README-SSE.txt.

◆ i32

<float> i32

Definition at line 794 of file README-SSE.txt.

◆ i64

<float> i64

Definition at line 794 of file README-SSE.txt.

◆ inline

The addss/unpcklps sequence seems silly when it could just be one addps. Also: expand the libm rounding functions inline.

Definition at line 72 of file README-SSE.txt.

◆ insertps

gcc mainline compiles it with an insertps into %xmm0.

Definition at line 547 of file README-SSE.txt.

◆ instruction

fp div is slow and not pipelined; in -ffast-math mode we could compute 1/scale first and emit mulsd in place of the divs. This can be done as a target-independent transform. If we were dealing with floats instead of doubles we could even replace the sqrtss and inversion with an rsqrtss instruction.

Definition at line 826 of file README-SSE.txt.
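
For the float case, a hedged sketch of the rsqrtss idea (the single Newton-Raphson refinement step is a common accuracy fix-up, not something the README prescribes):

  #include <xmmintrin.h>

  /* Approximate 1/sqrt(x) with rsqrtss plus one refinement step, instead of
     sqrtss followed by a division. */
  static float fast_rsqrt(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return r * (1.5f - 0.5f * x * r * r);
  }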

◆ instructions

Since the reload is only used by these instructions, we could fold it into them, saving two instructions.

Definition at line 257 of file README-SSE.txt.

◆ into

In fpstack mode this compiles into:

Definition at line 215 of file README-SSE.txt.

◆ it

This testcase should have no SSE instructions in it.

Definition at line 81 of file README-SSE.txt.

◆ know

We need to know about stack alignment before RA runs, and at that point we don't know whether there will be vector spills.

Definition at line 489 of file README-SSE.txt.

◆ LC0

◆ LCPI1_0

In the slower SSE sequence the value is multiplied by the constant-pool constant LCPI1_0 with mulsd.

Definition at line 584 of file README-SSE.txt.

◆ LCPI1_1

In the horrible VecISelBug.ll expansion the constant is assembled piecemeal with movhps and a movss from LCPI1_1 instead of one vector load.

Definition at line 677 of file README-SSE.txt.

◆ like

Various SSE compare translations are still needed; add hooks to commute some CMPP operations; apply the transformation that merged four float loads into a single 128-bit load to loads from the constant pool. Floating point max/min are commutable when the unsafe-FP path is enabled, so int_x86_sse_max_ss and X86ISD::FMIN etc. should be turned into nodes that are selected to max/min instructions marked commutable. We should materialize vector constants like "all ones" and "signbit" with code like the following.

Definition at line 340 of file README-SSE.txt.
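
The README's own snippet is not reproduced on this page; a hedged intrinsic-level sketch of the pcmpeqd-plus-shift idiom for materializing these constants in registers:

  #include <emmintrin.h>

  static __m128i all_ones(void) {
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);   /* pcmpeqd: every lane compares equal */
  }

  static __m128 signbit_mask(void) {
    /* Shift each all-ones lane left by 31, leaving only the sign bits set. */
    return _mm_castsi128_ps(_mm_slli_epi32(all_ones(), 31));
  }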

◆ lowered

Currently the select is being lowered too early, which blocks the dag combine.

Definition at line 89 of file README-SSE.txt.

◆ mode

In fpstack mode this compiles to a short sequence; in SSE mode it is significantly slower.

Definition at line 527 of file README-SSE.txt.

◆ model

The alignment inference code cannot handle loads from globals in non-static mode because it doesn't look through the extra dyld stub load; if you try vec_align.ll without -relocation-model=static you'll see what this means.

◆ movl

We currently generate a full divide sequence (xorl %edx, %edx; divl %ecx; movl ...) but we really shouldn't need to.

Definition at line 624 of file README-SSE.txt.

◆ movmskp

This currently compiles to code that stores the value and reloads the words with scalar instructions; we should use movmskp
Initial value:
{s|d} instead.
CodeGen/X86/vec_align.ll tests whether we can turn 4 scalar loads into a single
(aligned) vector load. This functionality has a couple of problems.
1. The code to infer alignment from loads of globals is in the X86 backend

Definition at line 397 of file README-SSE.txt.
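
An intrinsic-level illustration of the movmskp{s|d} suggestion (hypothetical helper; the README's original test is not recoverable from this page):

  #include <xmmintrin.h>

  /* Collect the four sign bits with one movmskps instead of storing the
     vector and reloading the words with scalar code. */
  static int sign_bits(__m128 v) {
    return _mm_movemask_ps(v);
  }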

◆ node

The code to infer alignment from loads of globals lives in the X86 backend, not the dag combiner, because it needs to be able to see through the X86ISD::Wrapper node, which DAGCombine can't really do.

◆ numbers

SSE has instructions for doing operations on complex numbers

Definition at line 22 of file README-SSE.txt.

◆ operations

The zeroing-reload trick can be used to simplify a variety of shuffle operations where some elements are known to be zero.

Definition at line 271 of file README-SSE.txt.

◆ P

The code for turning 4 x load into a vector load can only handle a direct load from a global or a direct load from the stack; it should be generalized to handle any load from P, where P can be anything.

Definition at line 411 of file README-SSE.txt.

llvm::yaml::MappingTraits< ArchYAML::Archive::Child >::mapping(), mapToSinitPriority(), llvm::rdf::DataFlowGraph::markBlock(), llvm::PatternMatch::match(), llvm::MIPatternMatch::And< Pred, Preds... >::match(), llvm::MIPatternMatch::Or< Pred, Preds... >::match(), llvm::MCInstPrinter::matchAliasPatterns(), matchDoublePermute(), matchPermute(), llvm::JumpThreadingPass::maybethreadThroughTwoBasicBlocks(), llvm::rdf::CodeNode::members_if(), mergeConditionalStores(), llvm::xray::mergeProfilesByStack(), llvm::xray::mergeProfilesByThread(), llvm::MIPatternMatch::mi_match(), moveLCSSAPhis(), llvm::object::DiceRef::moveNext(), llvm::object::MachOBindEntry::moveNext(), llvm::PeelingModuloScheduleExpander::moveStageBetweenBlocks(), multipleIterations(), needToReserveScavengingSpillSlots(), node_eq(), llvm::none_of(), llvm::orc::ObjectLinkingLayerJITLinkContext::notifyMaterializing(), llvm::json::Object::Object(), llvm::TargetInstrInfo::RegSubRegPair::operator!=(), llvm::pair_hash< First, Second >::operator()(), llvm::object::symbol_iterator::operator*(), AllocaSlices::partition_iterator::operator*(), llvm::bfi_detail::BlockMass::operator*=(), llvm::MachineRegisterInfo::defusechain_iterator< ReturnUses, ReturnDefs, SkipDebug, ByOperand, ByInstr, ByBundle >::operator++(), llvm::MachineRegisterInfo::defusechain_instr_iterator< ReturnUses, ReturnDefs, SkipDebug, ByOperand, ByInstr, ByBundle >::operator++(), llvm::object::symbol_iterator::operator->(), llvm::rdf::operator<<(), llvm::DiagnosticPrinterRawOStream::operator<<(), llvm::operator<<(), llvm::raw_ostream::operator<<(), llvm::xray::Profile::operator=(), llvm::TargetInstrInfo::RegSubRegPair::operator==(), AllocaSlices::partition_iterator::operator==(), llvm::xray::Graph< VertexAttribute, EdgeAttribute, VI >::operator[](), llvm::opt::OptTable::OptTable(), or32le(), llvm::or32le(), llvm::SMSchedule::orderDependence(), llvm::yaml::CustomMappingTraits< std::map< std::vector< uint64_t >, WholeProgramDevirtResolution::ByArg > >::output(), llvm::yaml::CustomMappingTraits< std::map< uint64_t, WholeProgramDevirtResolution > >::output(), llvm::yaml::CustomMappingTraits< GlobalValueSummaryMapTy >::output(), ParameterPack::ParameterPack(), llvm::json::parse(), llvm::X86::parseArchX86(), llvm::parseCachePruningPolicy(), llvm::AMDGPUMangledLibFunc::parseFuncName(), parseNamePrefix(), parsePredicateConstraint(), parseSegmentOrSectionName(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseTemplateParamDecl(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseType(), AbstractManglingParser< ManglingParser< Alloc >, Alloc >::parseUnnamedTypeName(), llvm::partition(), llvm::partition_point(), llvm::PassNameParser::passEnumerate(), llvm::PassNameParser::passRegistered(), llvm::ProfOStream::patch(), llvm::HexagonTargetLowering::PerformDAGCombine(), llvm::rdf::PhysicalRegisterInfo::PhysicalRegisterInfo(), llvm::ScheduleDAGMI::placeDebugValues(), llvm::AAResults::pointsToConstantMemory(), llvm::rdf::DataFlowGraph::DefStack::pop(), llvm::PMDataManager::preserveHigherLevelAnalysis(), llvm::ConvergingVLIWScheduler::pressureChange(), llvm::cl::OptionDiffPrinter< ParserDT, ValDT >::print(), llvm::cl::OptionDiffPrinter< DT, DT >::print(), llvm::BitTracker::print_cells(), printAsmMRegister(), llvm::SIScheduleBlock::printDebug(), PrintLoadStoreResults(), PrintModRefResults(), llvm::cl::printOptionDiff(), PrintResults(), processPHI(), processRemarkVersion(), processStrTab(), llvm::FoldingSetTrait< std::pair< T1, T2 > >::Profile(), 
llvm::xray::profileFromTrace(), profitImm(), llvm::ModuleSummaryIndex::propagateAttributes(), llvm::PTOGV(), llvm::support::endian::read(), llvm::support::endian::read16(), llvm::support::endian::read16be(), llvm::support::endian::read16le(), llvm::support::endian::read32(), llvm::support::endian::read32be(), llvm::support::endian::read32le(), llvm::support::endian::read64(), llvm::support::endian::read64be(), llvm::support::endian::read64le(), llvm::readPGOFuncNameStrings(), llvm::WebAssemblyExceptionInfo::recalculate(), llvm::RegPressureTracker::recedeSkipDebugValues(), llvm::PMDataManager::recordAvailableAnalysis(), llvm::PrintIRInstrumentation::registerCallbacks(), llvm::PseudoProbeVerifier::registerCallbacks(), llvm::OptNoneInstrumentation::registerCallbacks(), llvm::TimePassesHandler::registerCallbacks(), llvm::PreservedCFGCheckerInstrumentation::registerCallbacks(), llvm::DebugifyEachInstrumentation::registerCallbacks(), llvm::VerifyInstrumentation::registerCallbacks(), llvm::registerCodeGenCallback(), llvm::RuntimeDyldMachOCRTPBase< RuntimeDyldMachOX86_64 >::registerEHFrames(), llvm::orc::registerFrameWrapper(), registerPartialPipelineCallback(), llvm::ChangeReporter< std::string >::registerRequiredCallbacks(), llvm::rdf::DataFlowGraph::releaseBlock(), llvm::orc::OrcV2CAPIHelper::releasePoolEntry(), llvm::detail::IEEEFloat::remainder(), llvm::MCContext::RemapDebugPaths(), llvm::SetVector< llvm::ElementCount, SmallVector< llvm::ElementCount, N >, SmallDenseSet< llvm::ElementCount, N > >::remove_if(), llvm::NodeSet::remove_if(), llvm::remove_if(), llvm::PMDataManager::removeDeadPasses(), llvm::PMDataManager::removeNotPreservedAnalysis(), llvm::SUnit::removePred(), replaceConstantExprOp(), llvm::HexagonTargetLowering::ReplaceNodeResults(), llvm::json::Path::report(), reportMismatch(), llvm::MachineFunctionProperties::reset(), llvm::orc::OrcV2CAPIHelper::retainPoolEntry(), llvm::orc::OrcV2CAPIHelper::retainSymbolStringPtr(), rewriteNonInstructionUses(), llvm::PassManager< LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult & >::run(), llvm::DevirtSCCRepeatedPass::run(), llvm::orc::LocalCXXRuntimeOverridesBase::runDestructors(), llvm::LPPassManager::runOnFunction(), llvm::RGPassManager::runOnFunction(), llvm::StringSaver::save(), llvm::RegScavenger::scavengeRegisterBackwards(), llvm::PMTopLevelManager::schedulePass(), separateNestedLoop(), llvm::MachineFunctionProperties::set(), llvm::FunctionLoweringInfo::set(), llvm::MipsABIFlagsSection::setAllFromPredicates(), llvm::MipsABIFlagsSection::setASESetFromPredicates(), llvm::MipsABIFlagsSection::setCPR1SizeFromPredicates(), llvm::vfs::InMemoryFileSystem::setCurrentWorkingDirectory(), llvm::MCAssembler::setDWARFLinetableParams(), llvm::MipsABIFlagsSection::setFpAbiFromPredicates(), llvm::MipsABIFlagsSection::setGPRSizeFromPredicates(), llvm::MipsABIFlagsSection::setISAExtensionFromPredicates(), llvm::MipsABIFlagsSection::setISALevelAndRevisionFromPredicates(), llvm::PMTopLevelManager::setLastUser(), llvm::VPBlockBase::setParent(), llvm::orc::ExecutionSession::setPlatform(), llvm::CmpInst::setPredicate(), llvm::ScopedPrinter::setPrefix(), llvm::LineEditor::setPrompt(), llvm::msf::MSFBuilder::setStreamSize(), llvm::MCAsmParser::setTargetParser(), llvm::MIRParserImpl::setupRegisterInfo(), llvm::TrackingVH< Value >::setValPtr(), llvm::CallbackVH::setValPtr(), llvm::OptBisect::shouldRunPass(), shouldSplitOnPredicatedArgument(), simplifyCommonValuePhi(), SimplifyCondBranchToCondBranch(), SimplifyGEPInst(), 
simplifyICmpWithMinMax(), simplifyOneLoop(), llvm::JumpThreadingPass::simplifyPartiallyRedundantLoad(), skipIfAtLineEnd(), llvm::MachineBasicBlock::SplitCriticalEdge(), llvm::SplitKnownCriticalEdge(), llvm::stable_hash_combine_array(), llvm::StringMap< std::unique_ptr< llvm::vfs::detail::InMemoryNode > >::StringMap(), llvm::BitTracker::subst(), swapAntiDependences(), llvm::SwingSchedulerDAG::SwingSchedulerDAG(), targets(), test(), llvm::OpenMPIRBuilder::tileLoops(), llvm::TimerGroup::TimerGroup(), llvm::to_address(), llvm::SymbolTableListTraits< ValueSubClass >::toPtr(), llvm::orc::LocalCXXRuntimeOverridesBase::toTargetAddress(), llvm::ConvergingVLIWScheduler::traceCandidate(), llvm::GenericSchedulerBase::traceCandidate(), llvm::TrackingVH< Value >::TrackingVH(), tryAdjustICmpImmAndPred(), tryToVectorizeHorReductionOrInstOperands(), unwrap(), llvm::unwrap(), llvm::MipsTargetStreamer::updateABIInfo(), llvm::VFShape::updateParam(), llvm::ScheduleDAGMILive::updatePressureDiffs(), llvm::updateVCallVisibilityInIndex(), llvm::yaml::MappingTraits< ArchYAML::Archive::Child >::validate(), valueDominatesPHI(), llvm::GraphTraits< ValueInfo >::valueInfoFromEdge(), llvm::PMDataManager::verifyPreservedAnalysis(), llvm::InstCombinerImpl::visitIntToPtr(), llvm::InstCombinerImpl::visitPtrToInt(), llvm::InnerLoopVectorizer::widenPHIInstruction(), wrap(), llvm::wrap(), write(), llvm::StringTableBuilder::write(), llvm::support::endian::write(), llvm::support::endian::write16(), llvm::support::endian::write16be(), llvm::support::endian::write16le(), llvm::support::endian::write32(), llvm::support::endian::write32be(), llvm::support::endian::write32le(), llvm::support::endian::write64(), llvm::support::endian::write64be(), llvm::support::endian::write64le(), writeTypeIdCompatibleVtableSummaryRecord(), llvm::xxHash64(), llvm::yaml::yaml2archive(), llvm::objcarc::BundledRetainClaimRVs::~BundledRetainClaimRVs(), llvm::PMDataManager::~PMDataManager(), and llvm::PMTopLevelManager::~PMTopLevelManager().

◆ P2

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill if only one byte chunk is bring in zeros the one element instead of elements This can be used to simplify a variety of shuffle where the elements are fixed zeros This code generates ugly probably due to costs being off or<4 x float> * P2

◆ pinsrb

into eax xorps xmm0 xmm0 eax xmm0 eax xmm0 ret esp eax movdqa xmm0 xmm0 esp const ret align it should be movdqa xmm0 pinsrb

Definition at line 650 of file README-SSE.txt.
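
The fragment above is about replacing a store/reload dance through a stack slot with a single register-to-vector insert. A minimal intrinsics sketch of the two instructions involved (the function names and lane numbers here are mine, purely for illustration, not the file's testcase):

#include <emmintrin.h>   /* SSE2: pinsrw */
#include <smmintrin.h>   /* SSE4.1: pinsrb */

/* Insert a 16-bit value into lane 2 of a vector: a single pinsrw. */
__m128i insert_word(__m128i v, int w) {
  return _mm_insert_epi16(v, w, 2);
}

/* Insert an 8-bit value into lane 5 of a vector: a single pinsrb,
   available only on targets with SSE4.1. */
__m128i insert_byte(__m128i v, int b) {
  return _mm_insert_epi8(v, b, 5);
}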

◆ pinsrw

into eax xorps xmm0 xmm0 eax xmm0 eax xmm0 ret esp eax movdqa xmm0 pinsrw

Definition at line 304 of file README-SSE.txt.

◆ pool

into xmm2 addss xmm2 xmm1 xmm3 addss xmm3 movaps xmm0 unpcklps xmm0 ret seems silly when it could just be one addps Expand libm rounding functions main should enable SSE DAZ mode and other fast SSE modes Think about doing i64 math in SSE regs on x86 This testcase should have no SSE instructions in it, and only one load from a constant pool

Definition at line 85 of file README-SSE.txt.
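
The testcase this note refers to is a select between two floating-point constants. A hedged C reconstruction of that shape (the constants here are placeholders, not the ones in the original testcase):

/* Ideally this needs no SSE instructions at all and only one constant-pool
   load: the dag combiner should turn
   'select (load CPI1), (load CPI2)' into 'load (select CPI1, CPI2)'. */
double select_constant(int b) {
  return b ? 1.25 : 2.75;
}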

◆ pshufd

gcc mainline compiles it xmm0 xmm1 movaps xmm2 movlhps xmm2 movaps xmm0 ret We compile vector multiply by constant into poor< i32 10, i32 10, i32 10, i32 10 > ret< 4 x i32 > A On targets without this compiles globl _f xmm1 movd eax imull eax movd xmm1 pshufd

Definition at line 35 of file README-SSE.txt.
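
The multiply being discussed is a <4 x i32> vector times the constant splat <10, 10, 10, 10>. A hedged intrinsics sketch of the same operation, assuming SSE4.1 is available (pre-SSE4.1 targets have no full 32-bit vector multiply, which is why the scalar imull + pshufd splat above shows up):

#include <smmintrin.h>   /* SSE4.1: pmulld */

__m128i mul_by_ten(__m128i a) {
  /* With SSE4.1 this is a single pmulld against a splatted constant. */
  return _mm_mullo_epi32(a, _mm_set1_epi32(10));
}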

◆ rdar

into eax xorps xmm0 xmm0 eax xmm0 eax xmm0 ret esp eax movdqa xmm0 xmm0 esp const ret align it should be movdqa xmm0 xmm0 We should transform a shuffle of two vectors of constants into a single vector of constants. insertelement of a constant into a vector of constants should also result in a vector of constants, e.g. VecISelBug.ll. We compiled it to something globl _t xmm0 movhps xmm0 movss xmm1 movaps xmm2 xmm2 xmm0 movaps rdar

Definition at line 689 of file README-SSE.txt.
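
The folding opportunity named here is that a shuffle whose operands are both constant vectors is itself a constant. A minimal sketch of that shape (the lane selection is arbitrary, just for illustration):

#include <xmmintrin.h>

__m128 shuffled_constants(void) {
  const __m128 a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
  const __m128 b = _mm_set_ps(7.0f, 6.0f, 5.0f, 4.0f);
  /* Both inputs are constants, so ideally this folds to one 16-byte
     constant-pool load instead of two loads plus a shufps. */
  return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 2, 0));
}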

◆ readonly

This currently compiles esp xmm0 movsd esp eax eax esp ret We should use not the dag combiner. This is because dagcombine2 needs to be able to see through the X86ISD::Wrapper, which DAGCombine can't really do. The code for turning x load into a single vector load is target independent and should be moved to the dag combiner. The code for turning x load into a vector load can only handle a direct load from a global or a direct load from the stack. It should be generalized to handle any load from where P can be anything. The alignment inference code cannot handle loads from globals in static non mode because it doesn't look through the extra dyld stub load. If you try vec_align.ll without relocation you'll see what I mean. We should lower which eliminates a constant pool load For float z nounwind readonly
Initial value:
{
%tmp6 = fsub float -0.000000e+00, %z.1

Definition at line 421 of file README-SSE.txt.

Referenced by llvm::xray::loadProfile(), llvm::xray::loadTraceFile(), and loadYAML().
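
The 'store (fneg (load p)), q' lowering mentioned at the end of this fragment can be done entirely in the integer domain, which avoids both the XMM round trip and the sign-mask constant-pool load. A hedged sketch of the transformed form (bit-for-bit what fneg does):

#include <stdint.h>
#include <string.h>

void store_negated(const float *p, float *q) {
  uint32_t bits;
  memcpy(&bits, p, sizeof bits);   /* integer load                 */
  bits ^= 0x80000000u;             /* flip the IEEE-754 sign bit   */
  memcpy(q, &bits, sizeof bits);   /* integer store                */
}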

◆ right

the custom lowered code happens to be right

◆ shrl

Current eax eax eax ret Ideal eax shrl

Definition at line 393 of file README-SSE.txt.

◆ shrq

<float*> store float float* tmp5 ret void Compiles rax shrq

Definition at line 803 of file README-SSE.txt.

◆ shufps

into eax xorps xmm0 xmm0 eax xmm0 eax xmm0 ret esp eax movdqa xmm0 xmm0 esp const ret align it should be movdqa xmm0 xmm0 We should transform a shuffle of two vectors of constants into a single vector of constants. insertelement of a constant into a vector of constants should also result in a vector of constants, e.g. VecISelBug.ll. We compiled it to something globl _t xmm0 movhps xmm0 movss xmm1 movaps xmm2 xmm2 shufps

Definition at line 293 of file README-SSE.txt.

◆ slot

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill slot

Definition at line 269 of file README-SSE.txt.

◆ something

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill if only one byte chunk is bring in zeros the one element instead of elements This can be used to simplify a variety of shuffle where the elements are fixed zeros This code generates ugly probably due to costs being off or something

Definition at line 278 of file README-SSE.txt.
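
The idea running through the ◆ slot / ◆ used / ◆ something fragments is that when only a single element of a spilled vector is actually used, the reload can bring in that one element with zeros in the other lanes instead of reloading the whole slot. A minimal sketch of the cheaper reload shape (my example, not the file's):

#include <xmmintrin.h>

/* movss from memory loads lane 0 and zeroes lanes 1-3, so a reload like
   this can replace a full movaps reload whose other lanes were dead, and
   can also fold away shuffles that only existed to zero those lanes. */
__m128 reload_one_lane(const float *spill_slot) {
  return _mm_load_ss(spill_slot);
}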

◆ spill

the custom lowered code happens to be right, but we shouldn't have to custom lower anything. This is probably related to <2 x i64> ops being so bad. LLVM currently generates stack realignment when it is not necessarily needed. The problem is that we need to know about stack alignment too before RA runs. At that point we don't know whether there will be vector spill

◆ SSE4

gcc mainline compiles it xmm0 xmm1 movaps xmm2 movlhps xmm2 movaps xmm0 ret We compile vector multiply by constant into poor< i32 10, i32 10, i32 10, i32 10 > ret<4 x i32> A On targets without SSE4

Definition at line 571 of file README-SSE.txt.

◆ Testcase

the custom lowered code happens to be right, but we shouldn't have to custom lower anything. This is probably related to <2 x i64> ops being so bad. LLVM currently generates stack realignment when it is not necessarily needed. The problem is that we need to know about stack alignment too before RA runs. At that point we don't know whether there will be vector spill or not. Stack realignment logic is overly conservative, but otherwise we can produce unaligned loads/stores. Fixing this will require some huge RA changes. Testcase

Definition at line 498 of file README-SSE.txt.

◆ this

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like this

Definition at line 7 of file README-SSE.txt.

◆ tmp10

<i128> tmp10 = lshr i128 %tmp20

Definition at line 791 of file README-SSE.txt.

◆ tmp11

<i128><i128> tmp11 = trunc i128 %tmp10 to i32

Definition at line 792 of file README-SSE.txt.

◆ tmp12

< float * > store float tmp12 = bitcast i32 %tmp11 to float

Definition at line 793 of file README-SSE.txt.

Referenced by to().

◆ tmp19

<double> tmp19 = bitcast double %tmp18 to i64

Definition at line 789 of file README-SSE.txt.

◆ tmp20

< i64 > tmp20

Definition at line 424 of file README-SSE.txt.

◆ tmp5

<float> tmp5 = getelementptr inbounds %struct.float3* %res

Definition at line 794 of file README-SSE.txt.

Referenced by foo().

◆ to

<float*> store float float* tmp5 ret void Compiles to

Definition at line 224 of file README-SSE.txt.

Referenced by to().
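
The ◆ tmp10 / ◆ tmp11 / ◆ tmp12 / ◆ tmp19 fragments above all come from the PR6194 IR quoted at the top of this page: a double's bits are moved into an integer register, shifted, truncated, and reinterpreted as a float before being stored through the %struct.float3 pointer. A heavily simplified scalar C rendering of that bit manipulation (the real testcase goes through an i128; the names, signature, and exact shift width here are mine):

#include <stdint.h>
#include <string.h>

float high_bits_as_float(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);        /* bitcast double -> i64  */
  uint32_t hi = (uint32_t)(bits >> 32);  /* lshr + trunc to i32    */
  float f;
  memcpy(&f, &hi, sizeof f);             /* bitcast i32 -> float   */
  return f;
}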

◆ used

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the producing something like xmm1 movaps xmm0 ret saving two instructions The basic idea is that a reload from a spill if only one byte chunk is used

◆ uses

This might compile to this xmm1 xorps xmm0 movss xmm0 ret Now consider if the code caused xmm1 to get spilled This might produce this xmm1 movaps xmm0 movaps xmm1 movss xmm0 ret since the reload is only used by these we could fold it into the uses

Definition at line 258 of file README-SSE.txt.

Referenced by abort_gzip(), bar(), foo(), llvm::PPCFunctionInfo::setUsesPICBase(), and to().

◆ x3

In x86 we generate this spiffy xmm0 xmm0 ret in x86 we generate this which could be xmm1 movss xmm1 xmm0 ret In SSE4 we could use insertps to make both better. Here's another testcase that could use x3
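
insertps (SSE4.1) can move a chosen lane of one vector into a chosen lane of another in a single instruction, which is what "use insertps to make both better" is pointing at. A minimal sketch (the lane choice in the immediate is arbitrary, purely for illustration):

#include <smmintrin.h>   /* SSE4.1 */

/* Immediate 0x20: take lane 0 of s and write it into lane 2 of v,
   zeroing nothing. */
__m128 put_in_lane2(__m128 v, __m128 s) {
  return _mm_insert_ps(v, s, 0x20);
}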

◆ xmm0

In fpstack this compiles esp eax movl esp esp ret in SSE it compiles into significantly slower esp xmm0 mulsd xmm0 movsd xmm0

Definition at line 33 of file README-SSE.txt.

◆ xmm1

gets compiled into this on rsp movaps rsp movaps rsp movaps rsp movaps rsp movaps rsp movaps rsp movaps xmm1
Initial value:
= all-ones
cmpeqps xmm1

Definition at line 38 of file README-SSE.txt.
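
The "all-ones / cmpeqps" fragment above is the usual trick for materializing an all-ones mask without a constant-pool load: compare equal operands and every lane comes back true, i.e. all bits set. A minimal sketch of the idiom (my example, not the file's testcase; the zeroed input just keeps the compare well-defined):

#include <xmmintrin.h>

__m128 all_ones_mask(void) {
  __m128 z = _mm_setzero_ps();
  /* 0.0 == 0.0 in every lane, so cmpeqps produces all ones. */
  return _mm_cmpeq_ps(z, z);
}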

◆ xmm2

into eax xorps xmm0 xmm0 eax xmm0 eax xmm0 ret esp eax movdqa xmm0 xmm0 esp const ret align it should be movdqa xmm0 xmm0 We should transform a shuffle of two vectors of constants into a single vector of constants. insertelement of a constant into a vector of constants should also result in a vector of constants, e.g. VecISelBug.ll. We compiled it to something globl _t xmm0 movhps xmm0 movss xmm1 movaps xmm2 xmm2 xmm2

Definition at line 39 of file README-SSE.txt.

◆ xmm3

into xmm2 addss xmm2 xmm1 xmm3 addss xmm3 movaps xmm0 unpcklps xmm3

Definition at line 40 of file README-SSE.txt.
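
The addss/unpcklps sequence in the ◆ xmm2 / ◆ xmm3 fragments is the lane-by-lane code emitted for adding the components of a complex float, which is the "could just be one addps" complaint. A hedged C version of that kind of source (my reconstruction of the shape, not necessarily the file's exact testcase):

#include <complex.h>

_Complex float add_complex(_Complex float a, _Complex float b) {
  /* Two independent scalar addss today; packing the real and imaginary
     parts would let this be a single addps. */
  return a + b;
}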
