The attached .ll file contains unoptimized code produced by the Ada front-end for the operation "loop over all elements of an array of ints, setting them to zero", for several different array types. The arrays are all of fixed size and differ only in the type of the indexing variable:

@sbytezero:  zero an array indexed by a signed i8 (index range -128..127)
@ubytezero:  zero an array indexed by an unsigned i8 (index range 0..255)
@sintzero:   zero an array indexed by a signed i32
@uintzero:   zero an array indexed by an unsigned i32
@srangezero: zero an array with index range -10..10
@urangezero: zero an array with index range 10..30

Thus all of these routines could in theory be optimized to a single memset. That doesn't happen, but let's not ask for the moon! They do get fairly well optimized, but some optimizations are missed. I think this matters because these kinds of array traversals (only more complicated) come up all the time in Ada code.

The main problems seem to come from code like this (extracted from @sbytezero):

	%tmp = load i8* %i		; <i8> [#uses=1]
	%tmp1 = sext i8 %tmp to i32		; <i32> [#uses=5]
	%tmp3 = sub i32 %tmp1, -128		; <i32> [#uses=1]
	%tmp4 = getelementptr [256 x i32]* %tmp2, i32 0, i32 %tmp3		; <i32*> [#uses=1]

Here %i is the array index variable. It runs from -128 to 127. The sext to i32 is generated by the Ada f-e, presumably to avoid overflow when subtracting off the array lower bound of -128, which occurs on the next line. This gets optimized to:

	%tmp8 = add i8 %indvar, -127		; <i8> [#uses=1]
	%tmp115 = sext i8 %tmp8 to i32		; <i32> [#uses=1]
	%tmp316 = add i32 %tmp115, 128		; <i32> [#uses=1]
	%tmp417 = getelementptr [256 x i32]* %a, i32 0, i32 %tmp316		; <i32*> [#uses=1]

Here %indvar runs from 0 to 254 (nice optimization!), which really corresponds to indices 1..255 in the [256 x i32] array (the first iteration was peeled off). Thus %tmp8 is between -127 and 127, so %tmp115 is also between -127 and 127, and %tmp316 is between 1 and 255.
It is just %tmp8 + 1 zero-extended to i32, so the additions of -127 and 128 could have been turned into a single addition of 1. In fact it would have been better not to peel off the first loop iteration at all: have %indvar run from 0..255, then just zero-extend it for use in the getelementptr.

PS: there are a bunch of pointless compares in the unoptimized code, like

	icmp sle i32 -128, %tmp1		; <i1>:1 [#uses=0]
	icmp sle i32 %tmp1, 127		; <i1>:2 [#uses=0]

This is me playing with assertion generation in llvm-convert, i.e. one day these may become

	iassert sle i32 -128, %tmp1
	iassert sle i32 %tmp1, 127

if we introduce an assert instruction.
Created attachment 734 [details] The unoptimized code
Created attachment 735 [details]
optimized code

Just to be clear, all the adding to and subtracting from the induction variable is useless, except for incrementing it to the next value. The optimizers should be able to reduce all the examples to a memset, or at worst a simple loop over the array setting every element to zero.
Interesting. I was just looking at some similar examples that can happen in C code. For the record, please attach a .s file produced by LLC for your favorite target. I will see if I can improve the generated code over the next week or so.
Created attachment 743 [details] X86 Codegen
Created attachment 744 [details] Old PPC Codegen
Created attachment 745 [details] Old ARM Codegen
Perhaps the most amusing is sintzero. On X86 we get:

.LBB3_1:	#cond_next
	movl $0, 4(%eax,%ecx,4)
	incl %ecx
	cmpl $4294967295, %ecx
	jne .LBB3_1	#cond_next

(yay)

On ARM we get:

.LBB3_1:	@cond_next
	add r2, r0, r3, lsl #2    <--- boo hiss
	mov r1, #0
	str r1, [r2, #+4]
	add r3, r3, #1
	cmn r3, #1
	bne .LBB3_1	@cond_next

On PPC, we get:

BB3_1:	#cond_next
	add 6, 5, 4
	addi 4, 4, 1
	li 7, 0
	slwi 6, 6, 2
	stwx 7, 3, 6
	xoris 6, 4, 65535
	cmplwi 0, 6, 65535
	bne 0, BB3_1	#cond_next

say what? :)
Okay, final request. Please attach the machine code generated by GCC with -O3 -fomit-frame-pointer. -Chris
Created attachment 746 [details]
x86 codegen (i686-pc-linux-gnu)

The functions kinds__TsbytearrayBIP etc can be ignored (I stripped them out of the original example).
Created attachment 747 [details] gcc 4.1.2, i486-linux-gnu, -O3 -fomit-frame-pointer
Created attachment 748 [details] gcc 3.4.6, i486-linux-gnu, -O3 -fomit-frame-pointer
Created attachment 749 [details] gcc 4.3.0, i686-linux-gnu, -O3 -fomit-frame-pointer
If the type of the induction variable had been chosen to be i32, rather than the original type, some of these problems would go away. As for sintzero, you've got to admit that this

	%tmp7 = add i32 %indvar, -2147483647		; <i32> [#uses=1]
	%tmp214 = xor i32 %tmp7, -2147483648		; <i32> [#uses=1]

is a funny way to add one!
That is a funny way to add one. We can definitely instcombine this:

define i32 @test(i32 %indvar) {
	%tmp7 = add i32 %indvar, -2147483647		; <i32> [#uses=1]
	%tmp214 = xor i32 %tmp7, -2147483648		; <i32> [#uses=1]
	ret i32 %tmp214
}

into an add of one, but I'd like to fix whatever is creating that in the first place :)
Progress: with several patches, we now compile sintzero to:

LBB1_1:	#cond_next
	movl $0, (%eax,%ecx,4)
	incl %ecx
	cmpl $4294967295, %ecx
	jne LBB1_1	#cond_next

on X86,

LBB1_1:	@cond_next
	mov r2, #0
	str r2, [r0, +r3, lsl #2]
	add r3, r3, #1
	cmn r3, #1
	bne LBB1_1	@cond_next

on ARM (the mov to R2 is already covered by Bug 893), and to:

LBB1_1:	;cond_next
	li r4, 0
	addi r2, r2, 1
	stw r4, 0(r3)
	addi r3, r3, 4
	cmpwi cr0, r2, -1
	bne cr0, LBB1_1	;cond_next

on PPC (which needs to use postinc, but that's another issue). These are all good. The only issue now is that this code isn't correct.

Duncan, can you file the other cases as separate bugzilla bugs?

-Chris
The sintzero function is now optimized. Duncan is going to open other bugs for the other cases. This one testcase exposed several issues on multiple targets, yay :) -Chris
The cases where the index variables are not i32s have been split off into PR1301.