This test is definitely broken. Please find attached the LLVM bytecode and the generated assembly code. The generated code segfaults at the first "movaps" instruction in the newton function. llc -march=pentium3 generates working code.
Created attachment 456: LLVM bytecode
Created attachment 457: Generated assembly code
Anton, could you give us more information on this? If it crashes on the first movaps in newton ("movaps %xmm0, 48(%esp)"), that must mean %esp is not properly aligned, since movaps requires a 16-byte-aligned memory operand. Can you verify whether this is the case? Does it segfault the first time it reaches this function? Do you have a backtrace? The bug is likely to be higher up in the caller chain.
What is the stack alignment on Linux? 8 bytes? 4 bytes? I'm pretty sure it's not 16.
Might this be related to Bug 995? -Chris
Yes, it's that instruction. It crashes the first time the function is entered (e.g. "break newton", "run", and a few "stepi"s lead to the crash). The stack appears to be unaligned at the point of the crash:

(gdb) p $esp
$1 = (void *) 0xbf8e0c08

and that seems to be the problem: 0xbf8e0c08 is only 8-byte aligned, so 48(%esp) is not a multiple of 16 either.
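The arithmetic can be checked directly: an address is 16-byte aligned when its low four bits are zero. The following is a small C sketch (the helper names is_aligned and report are illustrative, not taken from LLVM or gdb) that applies this test to the %esp value from the gdb session and to the effective address of the movaps operand 48(%esp).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns nonzero if addr is a multiple of align.
   align must be a nonzero power of two. */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Prints where a given %esp value, and the operand address 48(%esp),
   fall relative to a 16-byte boundary. */
static void report(uintptr_t esp)
{
    uintptr_t operand = esp + 48; /* effective address of 48(%esp) */

    printf("esp     = 0x%lx, esp %% 16     = %lu\n",
           (unsigned long)esp, (unsigned long)(esp % 16));
    printf("operand = 0x%lx, operand %% 16 = %lu\n",
           (unsigned long)operand, (unsigned long)(operand % 16));
    printf("movaps operand 16-byte aligned: %s\n",
           is_aligned(operand, 16) ? "yes" : "no");
}
```

Calling report(0xbf8e0c08) shows both %esp and the movaps operand (0xbf8e0c38) sitting 8 bytes past a 16-byte boundary, which is exactly the situation in which movaps faults.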
Chris, the default stack alignment on Linux is 4 bytes, but many SSE instructions, movaps included, require their memory operands to be 16-byte aligned.
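To illustrate the difference, here is a minimal sketch using the standard SSE intrinsics from <xmmintrin.h> (x86 only): _mm_store_ps is the aligned store, normally compiled to MOVAPS, and faults on a misaligned address, while _mm_storeu_ps is the unaligned store, normally compiled to MOVUPS, and accepts any address. The helper name store4 is hypothetical; it simply picks the aligned form only when the pointer is known to be 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics; x86 only */

/* Stores the four floats in v to dst. The aligned store (MOVAPS) is used
   only when dst is 16-byte aligned; otherwise the unaligned store (MOVUPS)
   is used, which has no alignment requirement. */
static void store4(float *dst, __m128 v)
{
    assert(dst != NULL);
    if (((uintptr_t)dst & 15) == 0)
        _mm_store_ps(dst, v);  /* aligned store: would fault if misaligned */
    else
        _mm_storeu_ps(dst, v); /* unaligned store: safe for any address */
}
```

Calling _mm_store_ps directly on a pointer like the 48(%esp) slot above would fault in the same way the generated code did; the unaligned variant trades a possible speed penalty for safety.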
Evan, are we generating movaps for f64 loads/stores or something? -Chris
Yep. We use movaps for f32/f64 loads and stores. That only works because Darwin guarantees a 16-byte-aligned stack, so it obviously doesn't work for non-Darwin targets.
Can we just override the stack alignment from the command line? That would allow using movaps on non-Darwin targets when the stack alignment is suitable for it.
Nope, stack alignment is part of the ABI. You can't compile just some functions with aligned stacks: the stack coming into the function has to be aligned already.
Well, assume we have 4-byte stack alignment on Linux by default. A command-line switch would override the stack alignment to, e.g., 32. This would enable "movaps" without breaking the ABI.
I misspoke. We are actually not using movaps for f32/f64 loads/stores. The movaps was probably added by the spiller. I'll poke around.
This requires either:
1. dynamic readjustment of the stack pointer in the prolog, or
2. the entire program to be compiled with this option.
Since 'entire program' implies libraries (e.g. libc), option 2 isn't feasible.
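Option 1 comes down to rounding the incoming stack pointer down to the required boundary in the prolog (on x86 this is typically done with an "and" of %esp with -16, after the original value has been saved in the frame pointer so the epilog can restore it). Because the stack grows downward, rounding down only ever reserves extra space and never overlaps the caller's frame. A minimal C sketch of just that arithmetic, with the hypothetical helper name realign_sp:

```c
#include <assert.h>
#include <stdint.h>

/* Rounds a downward-growing stack pointer down to the next multiple of
   align, i.e. the same computation as "and $-16, %esp" for align == 16.
   The caller's original sp must be saved separately to undo this later. */
static uintptr_t realign_sp(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sp & ~(align - 1);
}
```

Applied to the %esp value from the crash, realign_sp(0xbf8e0c08, 16) yields 0xbf8e0c00, after which 48(%esp) is 16-byte aligned. The cost is a few instructions per function plus up to align - 1 bytes of padding.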
We should not have been generating spills of vector values in this test case. This was a bug in the X86 max/min DAG combine. Fixed.
Works for me now, thanks, Evan.