LLVM Bugzilla is read-only and is the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 1288 - Not so hot code produced for Ada array traversal: sintzero
Summary: Not so hot code produced for Ada array traversal: sintzero
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: 1.0
Hardware: All
OS: All
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords: code-quality
Depends on:
Blocks:
 
Reported: 2007-03-29 04:36 PDT by Duncan Sands
Modified: 2010-02-22 12:46 PST
CC List: 1 user

See Also:
Fixed By Commit(s):


Attachments
The unoptimized code (8.01 KB, text/plain) - 2007-03-29 04:37 PDT, Duncan Sands
optimized code (3.77 KB, text/plain) - 2007-03-29 12:57 PDT, Duncan Sands
X86 Codegen (1.74 KB, text/plain) - 2007-04-01 01:16 PDT, Chris Lattner
Old PPC Codegen (1.79 KB, text/plain) - 2007-04-01 01:16 PDT, Chris Lattner
Old ARM Codegen (1.65 KB, text/plain) - 2007-04-01 01:17 PDT, Chris Lattner
x86 codegen (i686-pc-linux-gnu) (3.38 KB, text/plain) - 2007-04-01 02:43 PDT, Duncan Sands
gcc 4.1.2, i486-linux-gnu, -O3 -fomit-frame-pointer (4.25 KB, text/plain) - 2007-04-01 02:46 PDT, Duncan Sands
gcc 3.4.6, i486-linux-gnu, -O3 -fomit-frame-pointer (4.15 KB, text/plain) - 2007-04-01 02:47 PDT, Duncan Sands
gcc 4.3.0, i686-linux-gnu, -O3 -fomit-frame-pointer (4.35 KB, text/plain) - 2007-04-01 02:49 PDT, Duncan Sands

Description Duncan Sands 2007-03-29 04:36:11 PDT
The attached .ll file contains unoptimized code produced by the Ada front-end
for the operation "loop over all elements of an array of ints, setting them to
zero", for several different array types.  The arrays are all of fixed size,
and differ only in the type of the indexing variable:
@sbytezero: zero an array indexed by a signed i8 (index range -128..127)
@ubytezero: zero an array indexed by an unsigned i8 (index range 0..255)
@sintzero: zero an array indexed by a signed i32
@uintzero: zero an array indexed by an unsigned i32
@srangezero: zero an array with index range -10..10
@urangezero: zero an array with index range 10..30
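
Roughly, each routine amounts to the following (an illustrative C analog of the
Ada source, which is not attached to this report; names and the exact biasing
are guesses from the description above, and the other five routines differ only
in the index type and bounds):

/* Hedged C analog of @sbytezero: walk the full range of the index type
   and store zero, biasing the index by the array lower bound (-128). */
void sbytezero(int a[256]) {
        signed char i = -128;            /* index runs -128..127 */
        for (;;) {
                a[(int)i + 128] = 0;     /* subtract the lower bound */
                if (i == 127) break;     /* stop before the index overflows */
                i++;
        }
}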

Thus all of these routines could in theory be optimized to a single memset.
That doesn't happen, but let's not ask for the moon!  They do get fairly
well optimized, but some optimizations are missed.  I think this matters
because these kinds of array traversals (only more complicated) come up all
the time in Ada code.  The main problems seem to come from code like this
(extracted from @sbytezero):
        %tmp = load i8* %i              ; <i8> [#uses=1]
        %tmp1 = sext i8 %tmp to i32             ; <i32> [#uses=5]
        %tmp3 = sub i32 %tmp1, -128             ; <i32> [#uses=1]
        %tmp4 = getelementptr [256 x i32]* %tmp2, i32 0, i32 %tmp3              ; <i32*> [#uses=1]
Here %i is the array index variable.  It runs from -128 to 127.  The sext to
i32 is generated by the Ada f-e, presumably to avoid overflow when subtracting
off the array lower bound of -128, which occurs in the next line.  This gets
optimized to:
        %tmp8 = add i8 %indvar, -127            ; <i8> [#uses=1]
        %tmp115 = sext i8 %tmp8 to i32          ; <i32> [#uses=1]
        %tmp316 = add i32 %tmp115, 128          ; <i32> [#uses=1]
        %tmp417 = getelementptr [256 x i32]* %a, i32 0, i32 %tmp316             ; <i32*> [#uses=1]
Here %indvar runs from 0 to 254 (nice optimization!), which really corresponds
to indices 1..255 in the [256 x i32] array (the first value was peeled off).
Thus %tmp8 is between -127 and 127, and %tmp115 is also between -127 and 127.
Thus %tmp316 is between 1 and 255: it is just %indvar+1, zero-extended to i32.
So the adding of -127 and 128 could have been turned into a single addition
of 1.  In fact it would have been better not to peel off the first loop value
and have %indvar go from 0..255, then just zero-extend it for use in the
getelementptr.
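
To spell the claim out, here is a small self-contained check (illustrative C,
not part of the original report) that the add -127 / sext / add 128 sequence is
the same as a single add of 1 on the zero-extended induction variable:

#include <assert.h>

int main(void) {
        /* %indvar takes the i8 bit patterns 0..254.  Model the IR:
             %tmp8   = add i8  %indvar, -127   (wraps mod 256)
             %tmp115 = sext i8 %tmp8 to i32
             %tmp316 = add i32 %tmp115, 128
           and check that %tmp316 == zext(%indvar) + 1 for every value. */
        for (int indvar = 0; indvar <= 254; indvar++) {
                int tmp8_bits = (indvar + 129) & 0xFF;  /* adding -127 == adding 129 mod 256 */
                int tmp8 = tmp8_bits < 128 ? tmp8_bits : tmp8_bits - 256; /* as signed i8 */
                int tmp115 = tmp8;                      /* sext i8 -> i32 */
                int tmp316 = tmp115 + 128;              /* add i32 128 */
                assert(tmp316 == indvar + 1);
        }
        return 0;
}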

PS: there are a bunch of pointless compares in the unoptimized code, like
        icmp sle i32 -128, %tmp1                ; <i1>:1 [#uses=0]
        icmp sle i32 %tmp1, 127         ; <i1>:2 [#uses=0]
This is me playing with assertion generation in llvm-convert, i.e. one day
these may become
        iassert sle i32 -128, %tmp1
        iassert sle i32 %tmp1, 127
if we introduce an assert instruction.
Comment 1 Duncan Sands 2007-03-29 04:37:53 PDT
Created attachment 734 [details]
The unoptimized code
Comment 2 Duncan Sands 2007-03-29 12:57:08 PDT
Created attachment 735 [details]
optimized code

Just to be clear, all the adding to and subtracting from the induction
variable is useless, except for incrementing it to the next value.  The
optimizers should be able to reduce all the examples to a memset, or at
worst a simple loop over the array setting every element to zero.
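
That is, each routine should ideally end up as nothing more than the following
(illustrative C, not taken from the report; the size matches the 256-element
array of @sbytezero):

#include <string.h>

/* Illustrative target for the optimizers: the whole loop collapses to a
   single memset over the fixed-size array. */
void sbytezero_ideal(int a[256]) {
        memset(a, 0, 256 * sizeof a[0]);
}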
Comment 3 Chris Lattner 2007-04-01 01:14:08 PDT
Interesting.  I was just looking at some similar examples that can happen in C code.  For the record, please 
attach a .s file produced by LLC for your favorite target.  I will see if I can improve the generated code over 
the next week or so.
Comment 4 Chris Lattner 2007-04-01 01:16:37 PDT
Created attachment 743 [details]
X86 Codegen
Comment 5 Chris Lattner 2007-04-01 01:16:56 PDT
Created attachment 744 [details]
Old PPC Codegen
Comment 6 Chris Lattner 2007-04-01 01:17:12 PDT
Created attachment 745 [details]
Old ARM Codegen
Comment 7 Chris Lattner 2007-04-01 01:23:45 PDT
Perhaps the most amusing is sintzero.  On X86 we get:

.LBB3_1:	#cond_next
	movl $0, 4(%eax,%ecx,4)
	incl %ecx
	cmpl $4294967295, %ecx
	jne .LBB3_1	#cond_next

(yay)

on ARM we get:

.LBB3_1:	@cond_next
	add r2, r0, r3, lsl #2    <--- boo hiss
	mov r1, #0
	str r1, [r2, #+4]
	add r3, r3, #1
	cmn r3, #1
	bne .LBB3_1	@cond_next

On PPC, we get:

BB3_1:	#cond_next
	add 6, 5, 4
	addi 4, 4, 1
	li 7, 0
	slwi 6, 6, 2
	stwx 7, 3, 6
	xoris 6, 4, 65535
	cmplwi 0, 6, 65535
	bne 0, BB3_1	#cond_next

say what? :)
Comment 8 Chris Lattner 2007-04-01 01:24:07 PDT
Okay, final request.  Please attach the machine code generated by GCC with -O3 -fomit-frame-pointer.

-Chris
Comment 9 Duncan Sands 2007-04-01 02:43:17 PDT
Created attachment 746 [details]
x86 codegen (i686-pc-linux-gnu)

The functions kinds__TsbytearrayBIP etc. can be ignored (I stripped
them out of the original example).
Comment 10 Duncan Sands 2007-04-01 02:46:13 PDT
Created attachment 747 [details]
gcc 4.1.2, i486-linux-gnu, -O3 -fomit-frame-pointer
Comment 11 Duncan Sands 2007-04-01 02:47:57 PDT
Created attachment 748 [details]
gcc 3.4.6, i486-linux-gnu, -O3 -fomit-frame-pointer
Comment 12 Duncan Sands 2007-04-01 02:49:34 PDT
Created attachment 749 [details]
gcc 4.3.0, i686-linux-gnu, -O3 -fomit-frame-pointer
Comment 13 Duncan Sands 2007-04-01 05:43:36 PDT
If the type of the induction variable had been chosen to be i32, rather
than the original type, some of these problems would go away.  As for
sintzero, you've got to admit that this
        %tmp7 = add i32 %indvar, -2147483647            ; <i32> [#uses=1]
        %tmp214 = xor i32 %tmp7, -2147483648            ; <i32> [#uses=1]
is a funny way to add one!
Comment 14 Chris Lattner 2007-04-01 18:04:07 PDT
That is a funny way to add one.  We can definitely instcombine this:

define i32 @test(i32 %indvar) {
        %tmp7 = add i32 %indvar, -2147483647            ; <i32> [#uses=1]
        %tmp214 = xor i32 %tmp7, -2147483648            ; <i32> [#uses=1]
        ret i32 %tmp214
}

into an add of one, but I'd like to fix whatever is creating that in the first place :)
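
For reference, the identity is easy to check: adding -2147483647 is the same as
adding 1 and then flipping the sign bit, and the xor with -2147483648 flips the
sign bit back.  A small illustrative C spot-check (not part of the original
discussion):

#include <assert.h>
#include <stdint.h>

int main(void) {
        /* Check (x + 0x80000001) ^ 0x80000000 == x + 1 in 32-bit wraparound
           arithmetic: adding 0x80000001 is "add 1, then toggle the sign bit",
           and xor with 0x80000000 toggles the sign bit back. */
        uint32_t samples[] = { 0u, 1u, 0x7FFFFFFEu, 0x7FFFFFFFu,
                               0x80000000u, 0xFFFFFFFEu, 0xFFFFFFFFu };
        for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
                uint32_t x = samples[i];
                uint32_t tmp7   = x + 0x80000001u;      /* add i32 -2147483647 */
                uint32_t tmp214 = tmp7 ^ 0x80000000u;   /* xor i32 -2147483648 */
                assert(tmp214 == x + 1u);
        }
        return 0;
}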
Comment 15 Chris Lattner 2007-04-02 01:38:29 PDT
Progress: with several patches, we now compile sintzero to:
LBB1_1: #cond_next
        movl $0, (%eax,%ecx,4)
        incl %ecx
        cmpl $4294967295, %ecx
        jne LBB1_1      #cond_next
on X86


LBB1_1: @cond_next
        mov r2, #0
        str r2, [r0, +r3, lsl #2]
        add r3, r3, #1
        cmn r3, #1
        bne LBB1_1      @cond_next
on ARM (the mov to R2 is already covered by Bug 893)

and to:

LBB1_1: ;cond_next
        li r4, 0
        addi r2, r2, 1
        stw r4, 0(r3)
        addi r3, r3, 4
        cmpwi cr0, r2, -1
        bne cr0, LBB1_1 ;cond_next
on PPC (which needs to use postinc, but that's another issue).

These are all good.  The only issue now is that this code isn't correct.

Duncan, can you file the other cases as separate Bugzilla bugs?

-Chris
Comment 16 Chris Lattner 2007-04-02 02:00:35 PDT
The sintzero function is now optimized.  Duncan is going to open other bugs for the other cases. 

This one testcase exposed several issues on multiple targets, yay :)

-Chris
Comment 17 Duncan Sands 2007-04-02 08:21:09 PDT
The cases where the index variables are not i32s have been split off into
PR1301.