The attached .ll file contains unoptimized code produced by the Ada front-end for the operation "loop over all elements of an array of ints, setting them to zero", for several different array types. The arrays are all of fixed size and differ only in the type of the indexing variable:

@sbytezero:  zero an array indexed by a signed i8 (index range -128..127)
@ubytezero:  zero an array indexed by an unsigned i8 (index range 0..255)
@sintzero:   zero an array indexed by a signed i32
@uintzero:   zero an array indexed by an unsigned i32
@srangezero: zero an array with index range -10..10
@urangezero: zero an array with index range 10..30

Thus all of these routines could in theory be optimized to a single memset. That doesn't happen, but let's not ask for the moon! They do get fairly well optimized, but some optimizations are missed. I think this matters because these kinds of array traversals (only more complicated) come up all the time in Ada code.

The main problems seem to come from code like this (extracted from @sbytezero):

	%tmp = load i8* %i		; <i8> [#uses=1]
	%tmp1 = sext i8 %tmp to i32		; <i32> [#uses=5]
	%tmp3 = sub i32 %tmp1, -128		; <i32> [#uses=1]
	%tmp4 = getelementptr [256 x i32]* %tmp2, i32 0, i32 %tmp3		; <i32*> [#uses=1]

Here %i is the array index variable. It runs from -128 to 127. The sext to i32 is generated by the Ada f-e, presumably to avoid overflow when subtracting off the array lower bound of -128, which occurs on the next line. This gets optimized to:

	%tmp8 = add i8 %indvar, -127		; <i8> [#uses=1]
	%tmp115 = sext i8 %tmp8 to i32		; <i32> [#uses=1]
	%tmp316 = add i32 %tmp115, 128		; <i32> [#uses=1]
	%tmp417 = getelementptr [256 x i32]* %a, i32 0, i32 %tmp316		; <i32*> [#uses=1]

Here %indvar runs from 0 to 254 (nice optimization!), which really corresponds to indices 1..255 in the [256 x i32] array (the first iteration was peeled off). Thus %tmp8 is between -127 and 127, so %tmp115 is also between -127 and 127, and %tmp316 is between 1 and 255.
It is just %tmp8 + 1 zero-extended to i32, so the additions of -127 and 128 could have been turned into a single addition of 1. In fact it would have been better not to peel off the first loop iteration at all: have %indvar run from 0..255, then just zero-extend it for use in the getelementptr.

PS: there are a bunch of pointless compares in the unoptimized code, like

	icmp sle i32 -128, %tmp1		; <i1>:1 [#uses=0]
	icmp sle i32 %tmp1, 127		; <i1>:2 [#uses=0]

This is me playing with assertion generation in llvm-convert, i.e. one day these may become

	iassert sle i32 -128, %tmp1
	iassert sle i32 %tmp1, 127

if we introduce an assert instruction.
Created attachment 734 [details] The unoptimized code
Created attachment 735 [details]
optimized code

Just to be clear, all the adding to and subtracting from the induction variable is useless, except for incrementing it to the next value. The optimizers should be able to reduce all the examples to a memset, or at worst a simple loop over the array setting every element to zero.
Interesting. I was just looking at some similar examples that can happen in C code. For the record, please attach a .s file produced by LLC for your favorite target. I will see if I can improve the generated code over the next week or so.
Created attachment 743 [details] X86 Codegen
Created attachment 744 [details] Old PPC Codegen
Created attachment 745 [details] Old ARM Codegen
Perhaps the most amusing is sintzero. On X86 we get:

.LBB3_1:	#cond_next
	movl $0, 4(%eax,%ecx,4)
	incl %ecx
	cmpl $4294967295, %ecx
	jne .LBB3_1	#cond_next

(yay)

On ARM we get:

.LBB3_1:	@cond_next
	add r2, r0, r3, lsl #2    <--- boo hiss
	mov r1, #0
	str r1, [r2, #+4]
	add r3, r3, #1
	cmn r3, #1
	bne .LBB3_1	@cond_next

On PPC, we get:

BB3_1:	#cond_next
	add 6, 5, 4
	addi 4, 4, 1
	li 7, 0
	slwi 6, 6, 2
	stwx 7, 3, 6
	xoris 6, 4, 65535
	cmplwi 0, 6, 65535
	bne 0, BB3_1	#cond_next

say what? :)
Okay, final request. Please attach the machine code generated by GCC with -O3 -fomit-frame-pointer. -Chris
Created attachment 746 [details]
x86 codegen (i686-pc-linux-gnu)

The functions kinds__TsbytearrayBIP etc can be ignored (I stripped them out of the original example).
Created attachment 747 [details] gcc 4.1.2, i486-linux-gnu, -O3 -fomit-frame-pointer
Created attachment 748 [details] gcc 3.4.6, i486-linux-gnu, -O3 -fomit-frame-pointer
Created attachment 749 [details] gcc 4.3.0, i686-linux-gnu, -O3 -fomit-frame-pointer
If the type of the induction variable had been chosen to be i32, rather than the original type, some of these problems would go away. As for sintzero, you've got to admit that this

	%tmp7 = add i32 %indvar, -2147483647		; <i32> [#uses=1]
	%tmp214 = xor i32 %tmp7, -2147483648		; <i32> [#uses=1]

is a funny way to add one!
That is a funny way to add one. We can definitely instcombine this:

define i32 @test(i32 %indvar) {
	%tmp7 = add i32 %indvar, -2147483647		; <i32> [#uses=1]
	%tmp214 = xor i32 %tmp7, -2147483648		; <i32> [#uses=1]
	ret i32 %tmp214
}

into an add of one, but I'd like to fix whatever is creating that in the first place :)
Progress: with several patches, we now compile sintzero to:

LBB1_1:	#cond_next
	movl $0, (%eax,%ecx,4)
	incl %ecx
	cmpl $4294967295, %ecx
	jne LBB1_1	#cond_next

on X86,

LBB1_1:	@cond_next
	mov r2, #0
	str r2, [r0, +r3, lsl #2]
	add r3, r3, #1
	cmn r3, #1
	bne LBB1_1	@cond_next

on ARM (the mov to R2 is already covered by Bug 893), and to:

LBB1_1:	;cond_next
	li r4, 0
	addi r2, r2, 1
	stw r4, 0(r3)
	addi r3, r3, 4
	cmpwi cr0, r2, -1
	bne cr0, LBB1_1	;cond_next

on PPC (which needs to use postinc, but that's another issue). These are all good. The only issue now is that this code isn't correct.

Duncan, can you file the other cases as separate bugzilla bugs?

-Chris
The sintzero function is now optimized. Duncan is going to open other bugs for the other cases. This one testcase exposed several issues on multiple targets, yay :) -Chris
The cases where the index variables are not i32s have been split off into PR1301.