Created attachment 1113 [details] Function extracted Consider attached testcase. In bb79: bb79: ; preds = %bb57 %tmp81 = load double* %tmp68, align 16 ; <double> [#uses=1] %tmp82 = sub double -0.000000e+00, %tmp81 ; <double> [#uses=1] store double %tmp82, double* %tmp68, align 16 %tmp84 = getelementptr [3 x double]* %Raw_Normal, i32 0, i32 1 ; <double*> [#uses=2] %tmp85 = load double* %tmp84, align 8 ; <double> [#uses=1] %tmp86 = sub double -0.000000e+00, %tmp85 ; <double> [#uses=1] store double %tmp86, double* %tmp84, align 8 %tmp88 = getelementptr [3 x double]* %Raw_Normal, i32 0, i32 2 ; <double*> [#uses=2] %tmp89 = load double* %tmp88, align 8 ; <double> [#uses=1] %tmp90 = sub double -0.000000e+00, %tmp89 ; <double> [#uses=1] store double %tmp90, double* %tmp88, align 8 br label %bb92 This is codegen'ed (-disable-fp-elim) into: movsd .LCPI1_0, %xmm0 movapd %xmm0, %xmm1 xorpd -104(%ebp), %xmm1 movsd %xmm1, -104(%ebp) movsd -96(%ebp), %xmm1 xorpd %xmm0, %xmm1 movsd %xmm1, -96(%ebp) movsd -88(%ebp), %xmm1 xorpd %xmm0, %xmm1 movsd %xmm1, -88(%ebp) We definitely see a problem: even if %ebp is 16-bytes aligned, the access of temporaries is not aligned at all. This causes povray miscompilation at -O2.
xorpd -104(%ebp), %xmm1 movsd %xmm1, -104(%ebp) movsd -96(%ebp), %xmm1 movsd %xmm1, -96(%ebp) movsd -88(%ebp), %xmm1 movsd %xmm1, -88(%ebp) Only xorpd -104(%ebp) has to be aligned on 16-byte. movsd only require 8-byte alignment. Furthermore, the prologue looks like this: pushl %ebp movl %esp, %ebp pushl %ebx pushl %edi pushl %esi subl $172, %esp Assuming %esp was 16-byte aligned before the call. The call push 4-byte and the first pushl %ebp decrement another 4 bytes from %esp. So %ebp = old %esp - 8 = new %esp + 184. -104(%ebp) is 16-byte aligned. This is correct assuming %esp is 16-byte aligned. But is that true? Linux stack alignment requirement is 8-byte only. Anton, can you tell me the value of %esp upon entry?
Hrm, it seems you're right. %ebp is 0xbfe24bf8 for me :( Any ideas how this can be fixed? Can result of PR1636 be used to solve this issue?
We need to dynamically align the stack ptr in the prologue. This means using fp to reference objects in the caller's frame and using sp to reference local objects. If alloca() are used, that means keeping another pointer to reference local objects? I am not sure. I probably won't be getting to this any time soon. Anton, would you like to take a crack at it? :-)
This seems to be exactly the task of PR1636. So, if we'll have sane stack realignment facility both PRs will be fixed. I'll try to dig into this (hope really soon).
*** Bug 1671 has been marked as a duplicate of this bug. ***
The GCC folks are discussing stack realignment, some interesting links here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660
Another link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28074
I think this is fixed on mainline, by generating movups instead of movaps when referencing the stack. Is this true, and if so, is this a code quality bug?
I don't think this is fixed. I believe we can still generate load from stack slots which are *believed* to be 16-byte aligned but are not. The problem is it's isel lowering is assuming the stack ptr is properly aligned upon entry into the function. But this is not true on some x86 targets. For those, e.g. cygwin, the stack ptr must be dynamically aligned if 16-byte aligned stack objects are created.
This should be fixed by series of patches r50152-r50173. Stack realignment is currently enabled in 'easy' cases, when no dynamic-sized objects are present.