```cpp
bool isNanOrInf(float x) { return (reinterpret_cast<unsigned&>(x) & 0x7F800000) == 0x7F800000; }
float test(float x) { return isNanOrInf(x) ? 0 : x; }
```

```asm
test(float):
        vmovd   eax, xmm0
        not     eax
        vpxor   xmm1, xmm1, xmm1
        and     eax, 2139095040
        je      .LBB1_2
        vmovdqa xmm1, xmm0
.LBB1_2:
        vmovdqa xmm0, xmm1
        ret
```

In contrast gcc does not use xmm1 at all:

```asm
test(float):
        vmovd   eax, xmm0
        mov     edx, 2139095040
        and     edx, eax
        cmp     edx, 2139095040
        je      .L5
        ret
.L5:
        vxorps  xmm0, xmm0, xmm0
        ret
```

https://gcc.godbolt.org/z/biNBFq
Matthias, do you have any suggestions for how to improve this?
Given these two equivalent IR files where I've flipped the condition and select operands, we produce better results with one than the other.

```llvm
define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
entry:
  %0 = bitcast float %x to i32
  %and.i = and i32 %0, 2139095040
  %cmp.i = icmp eq i32 %and.i, 2139095040
  %cond = select i1 %cmp.i, float 0.000000e+00, float %x
  ret float %cond
}

define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
entry:
  %0 = bitcast float %x to i32
  %and.i = and i32 %0, 2139095040
  %cmp.i = icmp ne i32 %and.i, 2139095040
  %cond = select i1 %cmp.i, float %x, float 0.000000e+00
  ret float %cond
}
```
(In reply to Craig Topper from comment #2)
> Given these two equivalent IR files where I've flipped the condition and
> select operands, we produce better results with one than the other.
>
> define dso_local float @_Z4testf(float %x) local_unnamed_addr #0 {
> entry:
>   %0 = bitcast float %x to i32
>   %and.i = and i32 %0, 2139095040
>   %cmp.i = icmp eq i32 %and.i, 2139095040
>   %cond = select i1 %cmp.i, float 0.000000e+00, float %x
>   ret float %cond
> }
>
> define dso_local float @_Z4testf2(float %x) local_unnamed_addr #0 {
> entry:
>   %0 = bitcast float %x to i32
>   %and.i = and i32 %0, 2139095040
>   %cmp.i = icmp ne i32 %and.i, 2139095040
>   %cond = select i1 %cmp.i, float %x, float 0.000000e+00
>   ret float %cond
> }

For reference, instcombine is missing several potential select canonicalizations, but it does canonicalize this code already. Unfortunately, it's to the 'eq' version, and that's the one that has the poor codegen.
I assume with "bad codegen" you specifically mean the "vmovdqa xmm1, xmm0" / "vmovdqa xmm0, xmm1" combination. The problem at the MI level is that the code looks something like this after the CMOV got replaced with a jump:

```
bb.0:
  ...
  %4:fr32 = FsFLD0SS
  JE_1 %bb.2

bb.1:
  %0:fr32 = COPY $xmm0

bb.2:
  %5:fr32 = PHI %0:fr32, %bb.1, %4:fr32, %bb.0
```

Unfortunately bb.0->bb.2 is a critical edge, and we are forced to keep the FsFLD0SS instruction in bb.0. If I force CodeGen to split all critical edges we end up with better code:

```
$ llc -o - -phi-elim-split-all-critical-edges=1 -no-phi-elim-live-out-early-exit=1 test.ll
...
## %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movd    %xmm0, %eax
        andl    $2139095040, %eax       ## imm = 0x7F800000
        cmpl    $2139095040, %eax       ## imm = 0x7F800000
        jne     LBB0_2
## %bb.1:
        pxor    %xmm0, %xmm0
LBB0_2:
        popq    %rbp
        retq
```

I'm not up to date with the discussion of when we do or don't split critical edges, and it definitely isn't universally good to split them. This example, however, would clearly benefit from splitting, so maybe we can come up with a heuristic to catch this situation...
Or maybe we should change that CMOV expansion code to not produce critical edges. Given that the choice of edge is arbitrary, I'd rather not introduce any critical edge here to have more consistent codegen in that case...
I was just about to ask if we should do that.
Patch to avoid creating critical edges: https://reviews.llvm.org/D54119 Unfortunately it doesn't seem to help this case by itself.
Looks like the tail duplication code is making a bad decision because the instructions haven't been sunk yet... If I move the tail duplicator to run after machine sinking in TargetPassConfig then the example works as expected... I'm not sure though what the interactions of taildup with the other passes are and whether moving it is a good idea in general. Short of that, maybe we can do an extra round of machine sinking after we lower any cmovs? (Or implement some simple version of sinking inside the cmov lowering.)
And BTW: For me the example works with your patch regardless. Tail duplication makes the other edge critical so the `FsFLD0SS` can be sunk for me.
I was checking the output from the two test cases from comment 2. With my patch they switch behavior. So I guess it fixes the original case, but regresses some opposite of that case.
Yeah, as I explained in my other comment: short of the instruction actually being sunk into the blocks, it's probably random which edge the tail duplicator picks. That probably also explains some of the regressions in the tests you saw. I CCed Kyle; maybe he can make a statement as to the effects of moving early tail duplication later in the pipeline (at least after the pre-RA machine sinking)...
Kyle, have you had a chance to look at this?
I haven't been working on llvm in a while, but I can't think of any reason why moving early tail duplication after sinking would cause problems. If it improves benchmarks and it doesn't cause regressions, go for it.