Not so hot code produced for Ada array traversal: sintzero #1660

Closed
llvmbot opened this issue Mar 29, 2007 · 15 comments
Labels: bugzilla (issues migrated from Bugzilla), code-quality

Comments

llvmbot (Collaborator) commented Mar 29, 2007

Bugzilla Link: 1288
Resolution: FIXED
Resolved on: Feb 22, 2010 12:46
Version: 1.0
OS: All
Attachments: the unoptimized code, the optimized code
Reporter: LLVM Bugzilla Contributor

Extended Description

The attached .ll file contains unoptimized code produced by the Ada front-end
for the operation "loop over all elements of an array of ints, setting them to
zero", for several different array types. The arrays are all of fixed size,
and differ only in the type of the indexing variable:
@sbytezero: zero an array indexed by a signed i8 (index range -128..127)
@ubytezero: zero an array indexed by an unsigned i8 (index range 0..255)
@sintzero: zero an array indexed by a signed i32
@uintzero: zero an array indexed by an unsigned i32
@srangezero: zero an array with index range -10..10
@urangezero: zero an array with index range 10..30

Thus all of these routines could in theory be optimized to a single memset.
That doesn't happen, but let's not ask for the moon! They do get fairly
well optimized, but some optimizations are missed. I think this matters
because these kinds of array traversals (only more complicated) come up all
the time in Ada code. The main problems seem to come from code like this
(extracted from @sbytezero):
%tmp = load i8* %i ; [#uses=1]
%tmp1 = sext i8 %tmp to i32 ; [#uses=5]
%tmp3 = sub i32 %tmp1, -128 ; [#uses=1]
%tmp4 = getelementptr [256 x i32]* %tmp2, i32 0, i32 %tmp3 ; <i32*> [#uses=1]
Here %i is the array index variable. It runs from -128 to 127. The sext to
i32 is generated by the Ada front-end, presumably to avoid overflow when
subtracting off the array lower bound of -128, which occurs in the next line.
This gets optimized to:
%tmp8 = add i8 %indvar, -127 ; [#uses=1]
%tmp115 = sext i8 %tmp8 to i32 ; [#uses=1]
%tmp316 = add i32 %tmp115, 128 ; [#uses=1]
%tmp417 = getelementptr [256 x i32]* %a, i32 0, i32 %tmp316 ; <i32*> [#uses=1]
Here %indvar runs from 0 to 254 (nice optimization!), which really corresponds
to indices 1..255 in the [256 x i32] array (the first value was peeled off).
Thus %tmp8 is between -127 and 127, so %tmp115 is also between -127 and 127,
and %tmp316 is between 1 and 255. It is just %tmp8+1 zero-extended to i32, so
the adds of -127 and 128 could have been turned into a single addition of 1.
In fact it would have been better not to peel off the first loop value at all:
have %indvar go from 0..255, then just zero-extend it for use in the getelementptr.
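
A minimal sketch of that preferred form (hypothetical IR, not actual optimizer
output; the function name is made up):
define void @sbytezero_ideal([256 x i32]* %a) {
entry:
br label %loop
loop:
%indvar = phi i8 [ 0, %entry ], [ %indvar.next, %loop ] ; runs 0..255
%idx = zext i8 %indvar to i32 ; single zext, no sext/add dance
%p = getelementptr [256 x i32]* %a, i32 0, i32 %idx
store i32 0, i32* %p
%indvar.next = add i8 %indvar, 1
%done = icmp eq i8 %indvar.next, 0 ; the i8 wraps to 0 after 255
br i1 %done, label %exit, label %loop
exit:
ret void
}
With no peeled iteration, the index arithmetic disappears entirely and the
loop is much easier to recognize as a memset.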

PS: there are a bunch of pointless compares in the unoptimized code, like
icmp sle i32 -128, %tmp1 ; :1 [#uses=0]
icmp sle i32 %tmp1, 127 ; :2 [#uses=0]
This is me playing with assertion generation in llvm-convert, i.e. one day
these may become
iassert sle i32 -128, %tmp1
iassert sle i32 %tmp1, 127
if we introduce an assert instruction.
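
PPS: for concreteness, here is a hedged sketch of the single-memset form
mentioned above (using @sbytezero's [256 x i32] type; the function name is
made up):
declare void @llvm.memset.i32(i8*, i8, i32, i32)
define void @sbytezero_memset([256 x i32]* %a) {
%p = bitcast [256 x i32]* %a to i8*
call void @llvm.memset.i32(i8* %p, i8 0, i32 1024, i32 4) ; 256 i32s = 1024 bytes, align 4
ret void
}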

lattner (Collaborator) commented Apr 1, 2007

Interesting. I was just looking at some similar examples that can happen in C code. For the record, please
attach a .s file produced by LLC for your favorite target. I will see if I can improve the generated code over
the next week or so.

lattner (Collaborator) commented Apr 1, 2007

X86 Codegen

lattner (Collaborator) commented Apr 1, 2007

Old PPC Codegen

lattner (Collaborator) commented Apr 1, 2007

Old ARM Codegen

lattner (Collaborator) commented Apr 1, 2007

Perhaps the most amusing is sintzero. On X86 we get:

.LBB3_1: #cond_next
movl $0, 4(%eax,%ecx,4)
incl %ecx
cmpl $4294967295, %ecx
jne .LBB3_1 #cond_next

(yay)

on ARM we get:

.LBB3_1: @cond_next
add r2, r0, r3, lsl #2 <--- boo hiss
mov r1, #0
str r1, [r2, #+4]
add r3, r3, #1
cmn r3, #1
bne .LBB3_1 @cond_next

On PPC, we get:

BB3_1: #cond_next
add 6, 5, 4
addi 4, 4, 1
li 7, 0
slwi 6, 6, 2
stwx 7, 3, 6
xoris 6, 4, 65535 # r6 = r4 ^ 0xffff0000
cmplwi 0, 6, 65535 # r6 == 0xffff iff r4 == -1 (cmplwi only takes a 16-bit immediate)
bne 0, BB3_1 #cond_next

say what? :)

lattner (Collaborator) commented Apr 1, 2007

Okay, final request. Please attach the machine code generated by GCC with -O3 -fomit-frame-pointer.

-Chris

llvmbot (Collaborator, Author) commented Apr 1, 2007

x86 codegen (i686-pc-linux-gnu). The functions kinds__TsbytearrayBIP etc. can
be ignored (I stripped them out of the original example).

llvmbot (Collaborator, Author) commented Apr 1, 2007

llvmbot (Collaborator, Author) commented Apr 1, 2007

llvmbot (Collaborator, Author) commented Apr 1, 2007

llvmbot (Collaborator, Author) commented Apr 1, 2007

If the type of the induction variable had been chosen to be i32, rather
than the original type, some of these problems would go away. As for
sintzero, you've got to admit that this
%tmp7 = add i32 %indvar, -2147483647 ; [#uses=1]
%tmp214 = xor i32 %tmp7, -2147483648 ; [#uses=1]
is a funny way to add one!

lattner (Collaborator) commented Apr 2, 2007

That is a funny way to add one. We can definitely instcombine this:

define i32 @test(i32 %indvar) {
%tmp7 = add i32 %indvar, -2147483647 ; [#uses=1]
%tmp214 = xor i32 %tmp7, -2147483648 ; [#uses=1]
ret i32 %tmp214
}

into add of one, but I'd like to fix whatever is creating that in the first place :)
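
For reference, the fold is sound because xor with -2147483648 flips only the
sign bit, and flipping the sign bit is the same as adding 0x80000000 modulo
2^32; so the pair amounts to adding 0x80000001 + 0x80000000 = 1. The expected
instcombine result is just:
define i32 @test(i32 %indvar) {
%tmp = add i32 %indvar, 1
ret i32 %tmp
}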

lattner (Collaborator) commented Apr 2, 2007

Progress: with several patches, we now compile sintzero to:
LBB1_1: #cond_next
movl $0, (%eax,%ecx,4)
incl %ecx
cmpl $4294967295, %ecx
jne LBB1_1 #cond_next
on X86

LBB1_1: @cond_next
mov r2, #0
str r2, [r0, +r3, lsl #2]
add r3, r3, #1
cmn r3, #1
bne LBB1_1 @cond_next
on ARM (the mov to R2 is already covered by Bug 893)

and to:

LBB1_1: ;cond_next
li r4, 0
addi r2, r2, 1
stw r4, 0(r3)
addi r3, r3, 4
cmpwi cr0, r2, -1
bne cr0, LBB1_1 ;cond_next
on PPC (which needs to use postinc, but that's another issue).

These are all good. The only issue now is that this code isn't correct.

Duncan, can you file the other cases as separate bugzilla bugs?

-Chris

lattner (Collaborator) commented Apr 2, 2007

The sintzero function is now optimized. Duncan is going to open other bugs for the other cases.

This one testcase exposed several issues on multiple targets, yay :)

-Chris

llvmbot (Collaborator, Author) commented Apr 2, 2007

The cases where the index variables are not i32s have been split off into
llvm/llvm-bugzilla-archive#1301.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021
This issue was closed.