21435 – LLVM no longer honours -stack-alignment=4

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 21435 - LLVM no longer honours -stack-alignment=4

Summary: LLVM no longer honours -stack-alignment=4

Status:	RESOLVED FIXED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86 (show other bugs)
Version:	trunk
Hardware:	PC All

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-10-31 14:16 PDT by Jose Fonseca
Modified:	2015-11-17 16:49 PST (History)
CC List:	8 users (show)

See Also:
Fixed By Commit(s):

Attachments
bug1347703.ll (73.36 KB, application/octet-stream) 2014-10-31 14:16 PDT, Jose Fonseca	Details
bug1347703-r253245.ll (78.74 KB, text/plain) 2015-11-17 12:46 PST, Jose Fonseca	Details
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jose Fonseca 2014-10-31 14:16:17 PDT

Created attachment 13268 [details]
bug1347703.ll

LLVM stoped honouring 

This can easily be reproduced with the attached bug1347703.ll doing

  $ llc -o - -mtriple=i686-pc-mingw32 -mattr=+avx -stack-alignment=4 bug1347703.ll | grep '\<and.*[er]sp'
  $

which produces nothing showing that the stack pointer is never aligned.

However if one disables frame-pointer elimination then it works again:

  $ /home/jfonseca/work/vmware/llvm/llvm/build/linux-x86_64/Debug+Asserts/bin/llc -disable-fp-elim -o - -mtriple=i686-pc-mingw32 -mattr=+avx -stack-alignment=4 bug1347703.ll | grep '\<and.*[er]sp'
	andl	$-32, %esp
	andl	$-32, %esp

  $

Which doesn't seem the right behavior -- using the frame-pointer to align the stack seems an implementation detail -- there are probably other ways, and if frame-pointer is indeed necessary it should be automatically employed as necessary.


This regression happened between LLVM release 3.2 and release 3.3.  I strongly suspect r168627 -- https://github.com/llvm-mirror/llvm/commit/1243922fc1a1e3d2681ed9e78503eeabd875ba93

Comment 1 Reid Kleckner 2014-10-31 15:23:53 PDT

Based on the input IR and what I've seen of the backend, this looks like some kind of phase ordering problem.

The input test case has no allocas, so we don't think we need to realign the stack for any highly aligned allocations. Then register allocation (or vector legalization) runs, and we are forced to spill some large vector values. This introduces a highly aligned stack allocation, but it's already too late, we can't go back and add the realignment prologue.

See this awesome comment in X86RegisterInfo about it maybe being too late to realign the stack:

bool X86RegisterInfo::canRealignStack(const MachineFunction &MF) const {
  if (MF.getFunction()->hasFnAttribute("no-realign-stack"))
    return false;

  const MachineFrameInfo *MFI = MF.getFrameInfo();
  const MachineRegisterInfo *MRI = &MF.getRegInfo();

  // Stack realignment requires a frame pointer.  If we already started
  // register allocation with frame pointer elimination, it is too late now.
  if (!MRI->canReserveReg(FramePtr))
    return false;

  // If a base pointer is necessary.  Check that it isn't too late to reserve
  // it.
  if (MFI->hasVarSizedObjects())
    return MRI->canReserveReg(BasePtr);
  return true;
}

Even if that codepath isn't executing, it's possible that there is a similar issue.

Comment 2 Chad Rosier 2014-10-31 15:33:08 PDT

CC'ing Jim and Bob on this one... (as they can see the radar) and should be able to chime in on this one.

IIRC, this was some refactoring I did with Jakob, but I don't recall the full details.  It is/was a kind of phase ordering issue, but we never came up with a good solution.

Comment 3 Reid Kleckner 2014-10-31 15:38:57 PDT

I think we could conservatively assume that any SSA value will need to be spilled to the stack and align the current frame to handle at least that.

Comment 4 Chad Rosier 2014-10-31 16:35:42 PDT

Reed, your proposal is to revert r168627, correct?  It's conservatively correct, but you waste a register is *many* cases.  I would prefer a more targeted approach.

Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?

Comment 5 Jose Fonseca 2014-10-31 16:56:14 PDT

(In reply to comment #4)
> Reed, your proposal is to revert r168627, correct?  It's conservatively
> correct, but you waste a register is *many* cases.  I would prefer a more
> targeted approach.
> 
> Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer
> to enable stack-realignment, if the user specifies an alignment larger than
> the default stack alignment?

To be clear -- my suggestion was enable frame pointer when needed.

I'm not sure that enabling when the user specifies a stack alignment different from default stack alignment covers all cases.  Imagine that the function needs to spill AVX 32-byte registers.  Wouldn't that require a 32-byte aligned stack pointer?

Or what if the function needed to call another function passing a 32-byte vector by reference.

In other words, it seems to me that the alternatives are to revert r168627, or disable frame pointer omission blindly...

Comment 6 Reid Kleckner 2014-10-31 16:59:28 PDT

(In reply to comment #4)
> Reed, your proposal is to revert r168627, correct?  It's conservatively
> correct, but you waste a register is *many* cases.  I would prefer a more
> targeted approach.
> 
> Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer
> to enable stack-realignment, if the user specifies an alignment larger than
> the default stack alignment?

I feel like if the default stack alignment is 4 that should do the same thing as the user specifying a stack alignment of 4, right?

Comment 7 Chad Rosier 2014-11-03 08:44:51 PST

(In reply to comment #5)
> (In reply to comment #4)
> > Reed, your proposal is to revert r168627, correct?  It's conservatively
> > correct, but you waste a register is *many* cases.  I would prefer a more
> > targeted approach.
> > 
> > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer
> > to enable stack-realignment, if the user specifies an alignment larger than
> > the default stack alignment?
> 
> To be clear -- my suggestion was enable frame pointer when needed.

If the FP isn't available/reserved, then register allocator uses unaligned loads/stores to fill/spill.  Thus, the FP isn't strictly needed for correctness in the cases you've described.

Comment 8 Chad Rosier 2014-11-03 08:54:29 PST

(In reply to comment #6)
> (In reply to comment #4)
> > Reed, your proposal is to revert r168627, correct?  It's conservatively
> > correct, but you waste a register is *many* cases.  I would prefer a more
> > targeted approach.
> > 
> > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer
> > to enable stack-realignment, if the user specifies an alignment larger than
> > the default stack alignment?
> 
> I feel like if the default stack alignment is 4 that should do the same
> thing as the user specifying a stack alignment of 4, right?

IIRC, the default stack alignment on x86 is 16 bytes (i.e., -stack-alignment=2).  If the user specifies a larger alignment than the default then we might consider reserving the FP and realigning the stack to the larger non-default alignment.

Comment 9 Jose Fonseca 2014-11-03 09:13:03 PST

> IIRC, the default stack alignment on x86 is 16 bytes

It depends on the platforms.  The default stack alignment on x86 on Windows is 4-bytes.

In fact it seems this bus is mis-titled: it seems the problem is not that LLVM doesn't honor -stack-alignment=, but rather that x86 backend is broken for stack alignments (default or overriden) less than 16 bytes.

Comment 10 Jose Fonseca 2014-11-03 09:23:06 PST

(In reply to comment #7)
> If the FP isn't available/reserved, then register allocator uses unaligned
> loads/stores to fill/spill.  Thus, the FP isn't strictly needed for
> correctness in the cases you've described.

If I try the attached input, targetting x86_64 linux with the default stack alignment (which is 16bytes), and grep for rsp I get

$ llc-3.4 -o - -mtriple=x86_64-unknown-linux  -mattr=+avx bug1347703.ll | grep rsp
	subq	$40, %rsp
	movq	%r8, 32(%rsp)           # 8-byte Spill
	movq	%rdi, 24(%rsp)          # 8-byte Spill
	movq	96(%rsp), %rdi
	vmovdqu	%ymm1, -32(%rsp)        # 32-byte Folded Spill
	vmovdqu	%ymm1, -64(%rsp)        # 32-byte Folded Spill
	vmovdqu	%ymm1, -96(%rsp)        # 32-byte Folded Spill
	movq	24(%rsp), %rdi          # 8-byte Reload
	vandps	-32(%rsp), %ymm3, %ymm3 # 32-byte Folded Reload
	vandps	-64(%rsp), %ymm10, %ymm10 # 32-byte Folded Reload
	vandps	-96(%rsp), %ymm10, %ymm10 # 32-byte Folded Reload
[...]

The spills are indeed with VMOVDQU, but doesn't VANDPS require the m256 argument to be 32-byte aligned?

Comment 11 Bob Wilson 2014-11-03 10:49:49 PST

It has been a while and I don't remember the numbers, but I'm pretty sure that reverting r168627 would cause a significant performance regression. We should figure out how to fix the problems with -stack-alignment=4 some other way.

Comment 12 Reid Kleckner 2014-11-03 11:05:34 PST

(In reply to comment #9)
> > IIRC, the default stack alignment on x86 is 16 bytes
> 
> It depends on the platforms.  The default stack alignment on x86 on Windows
> is 4-bytes.

This is why I care. :)

---

I believe there are two workarounds for Jose:

1. Use the alignstack(16) (or 32?) attribute on the functions in question. If you control the frontend, this is the easiest and least fragile thing to do.

2. Use the -force-align-stack hidden x86 backend command line option. This is probably fragile and will break if and when we re-examine flags embedded in libraries like this.

If you can do 1, we can probably let this issue stand.

Comment 13 Jose Fonseca 2014-11-03 13:13:54 PST

(In reply to comment #12)-
> 
> I believe there are two workarounds for Jose:
> 
> 1. Use the alignstack(16) (or 32?) attribute on the functions in question.
> If you control the frontend, this is the easiest and least fragile thing to
> do.

The workaround we took was to set TargetOptions::NoFramePointerElim and TargetOptions::NoFramePointerElimNonLeaf on all x86 and x86_64 builds of Mesa.  We were already doing this for non-release builds (to facilitate stack back traces when debugging/profiling), but after this issue it is clear that frame-pointer-omission is not just a minor optimization in LLVM, but has deep-reaching implications on X86 code generation, so I'm no longer comfortable in having (non-)release builds differing on this matter, as most of our automated tests actually run with non-release builds.

> 2. Use the -force-align-stack hidden x86 backend command line option. This
> is probably fragile and will break if and when we re-examine flags embedded
> in libraries like this.
> 
> If you can do 1, we can probably let this issue stand.

We generate LLVM IR directly (Mesa llvmpipe driver), and I'm used to tweak things in order to get LLVM, and can do so.

But, for whatever my opinion is worth, I have to say this seems a serious flaw in LLVM x86 backend; and taking an approach that works 99% of cases for the sake of performance, instead of one that works 100% seems the wrong trade-off for a compiler; and if there are indeed possible workarounds then I fail to understand why pushing the workarounds to the user instead of automatically detecting the problematic situations and deploy the workarounds without intervention; at very least you should add an assert/error message when this happens -- anything less is just sweeping the problem under the rug...

Comment 14 sroland 2014-11-04 18:41:55 PST

(In reply to comment #7)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > Reed, your proposal is to revert r168627, correct?  It's conservatively
> > > correct, but you waste a register is *many* cases.  I would prefer a more
> > > targeted approach.
> > > 
> > > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer
> > > to enable stack-realignment, if the user specifies an alignment larger than
> > > the default stack alignment?
> > 
> > To be clear -- my suggestion was enable frame pointer when needed.
> 
> If the FP isn't available/reserved, then register allocator uses unaligned
> loads/stores to fill/spill.  Thus, the FP isn't strictly needed for
> correctness in the cases you've described.

But, in this case, it would be forbidden for those fill/spill loads/stores (or maybe can only happen for loads) to get eliminated and folded into the arithmetic ops, though only for sse, not when using avx (as vex encoding can handle unaligned loads in ordinary arithmetic ops whereas sse can't).
However it seems a bit silly to use unaligned stores/loads for spills imho in any case. I'm aware newer cpus don't necessarily have much of a performance penalty for that (intel core ones in particular) though on some the performance impact can be rather horrendous (anyone still using a p4?).

Comment 15 Reid Kleckner 2015-11-17 11:38:18 PST

This is over a year old and the attached LLVM IR no longer parses with TOT llc.

llc's command line interface isn't exactly stable or user facing either, so I don't think there's anything to do here.

Comment 16 Jose Fonseca 2015-11-17 12:33:30 PST

(In reply to comment #15)
> This is over a year old and the attached LLVM IR no longer parses with TOT
> llc.
> 
> llc's command line interface isn't exactly stable or user facing either, so
> I don't think there's anything to do here.

This bug is not specific llc...


I provided the llc command line / IR merely to _facilitate_ reproing and fixing the bug.


We saw the bug on Mesa + llvmpipe with JIT.




The problem here is that 
LLVM is broken when incoming stack aligment is 4.  You can probably repro the isse with `clang -mstack-alignment=4` and the right input.



That said, it's clear that nothing happened. So probably nothing will ever happen.   On Mac and Linux the ABI (both 32 and 64bits) states that stack alignment is always aligned 16bytes, so this is never a issue.  (Maybe it matters for Microsoft, since on Windows 32 bits, the stack alignment is 4bytes.)



But, if you don't want to fix, then at very least mark it as WONTFIX.  Closing as INVALID on the grounds you can't no longer repro is IMO insult to the time I spent on obtaining a small repro, as per the bug guidelines.


LLVM has very little stable interfaces, so if LLVM devs sleep it long enough you can close any bug on the basis you can repro...

Comment 17 Jose Fonseca 2015-11-17 12:46:00 PST

Created attachment 15304 [details]
bug1347703-r253245.ll

Updated IR, correct as of r253245.

Comment 18 Jose Fonseca 2015-11-17 12:50:10 PST

Here. Updated LLVM IR.

Basically took the original IR and massaged with vim subst commands.

  :%s/getelementptr \({.*}\)\*/getelementptr \1, \1*
  :%s/load \(.*\)\* /load \1, \1*
  :%s/getelementptr \(\w\+\)\*/getelementptr \1, \1*

Comment 19 Reid Kleckner 2015-11-17 13:19:47 PST

(In reply to comment #16)
> But, if you don't want to fix, then at very least mark it as WONTFIX. 
> Closing as INVALID on the grounds you can't no longer repro is IMO insult to
> the time I spent on obtaining a small repro, as per the bug guidelines.

Sorry, our bugzilla resolution statuses are really old and sometimes rude. I end up using "invalid" to mean "working as intended". I'm only trying to convey that we aren't taking action because:
- There isn't a correctness issue, since we'll used unaligned spills and fills
- Nobody has updated the issue in a year, indicating that very few people rely on the stack alignment TargetOption

It sounds like it's still important to you and you don't like your workaround, so we can reopen it.

To resummarize what's wrong, LLVM is making a bad performance decision: it is choosing to use unaligned vector spills in order to omit the frame pointer. In Chad's case, this helped, but in your case, it hurts. Is that accurate?

Comment 20 sroland 2015-11-17 14:06:38 PST

(In reply to comment #19)
> To resummarize what's wrong, LLVM is making a bad performance decision: it
> is choosing to use unaligned vector spills in order to omit the frame
> pointer. In Chad's case, this helped, but in your case, it hurts. Is that
> accurate?
The bigger issue was that llvm used to produce incorrect code, because the vector spills/reloads were unaligned but using instructions which required alignment to do it. Now with AVX (which not just has unaligned movs like SSE, but where the arithmetic instructions don't need alignment) this is less of an issue (can still fold the reloads into the arithmetic ops, I just hope it doesn't do it any longer if only SSE is available).
But, our code will produce quite massive amounts of reloads/spills (of vector regs) sometimes and certainly using unaligned instructions for that sounds like an odd decision. It's true newer cpus "usually" don't really have huge penalties for that, but that still goes against every optimization manual I've ever seen. You really typically only should use unaligned data if you can't control data alignment easily. But certainly for spills/reloads, doing it unaligned makes no sense whatsoever.

Comment 21 Reid Kleckner 2015-11-17 14:51:46 PST

I agree, in general, doing unaligned vector loads seems pretty lame. I think in Chad's test case, the vector spills were not in the hot path, and it was on 32-bit x86, where going from 6 to 7 GPRs makes a big difference.

Comment 22 sroland 2015-11-17 15:21:15 PST

(In reply to comment #21)
> I agree, in general, doing unaligned vector loads seems pretty lame. I think
> in Chad's test case, the vector spills were not in the hot path, and it was
> on 32-bit x86, where going from 6 to 7 GPRs makes a big difference.
In our case, we don't have much use for scalar regs :-). But in any case, there just has to be a better solution than just using unaligned vector loads/stores for vector spills (on 32bit, you don't have too many vector regs neither).

Comment 23 Jose Fonseca 2015-11-17 16:45:14 PST

(In reply to comment #19)

> It sounds like it's still important to you and you don't like your
> workaround, so we can reopen it.

Thanks.

> To resummarize what's wrong, LLVM is making a bad performance decision: it
> is choosing to use unaligned vector spills in order to omit the frame
> pointer. In Chad's case, this helped, but in your case, it hurts. Is that
> accurate?


When I filled this bug this was a _crash_, not performance issue.

The 1347703 is the number of a VMware internal bug report, and I've just doubled checked: this was causing crashes on Windows, in 32bits processes, both on AVX and non-AVX systems.


But I've just retried, and the crash/correctness issue might have indeed be fixed in LLVM since this was filed.

For example, if I do

  llc -o - -mtriple=i686-pc-mingw32 -mattr=-avx,+sse2 -stack-alignment=4 bug1347703.ll

then all accesses to the stack are with unaligned MOVs.  This was not the case before.

I have all llc versions from 3.3 to 3.6, and indeed it seems the issue was fixed going from LLVM 3.3 to 3.4.


Performance was never my worry here.  Even with the crash fixed, it makes sense for us to continue using frame pointer.  My main concern was the crash.



So it looks you can resolve this bug after all!

Comment 24 Reid Kleckner 2015-11-17 16:49:00 PST

(In reply to comment #23)
> Performance was never my worry here.  Even with the crash fixed, it makes
> sense for us to continue using frame pointer.  My main concern was the crash.
> 
> So it looks you can resolve this bug after all!

OK, thanks for looking into it.

I think we'll mark this fixed then. If someone runs into the performance issue, there are tons of workarounds: pass -fno-omit-frame-pointer, -mstackrealign, -mstack-alignment=32, or whatever. Those all use a different codepath from TargetOptions::StackAlignmentOverride, and should avoid the unaligned loads.