Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

rnk · 2018-09-04T18:25:51Z


Bugzilla Link	38829
Resolution	FIXED
Resolved on	Mar 29, 2019 11:04
Version	trunk
OS	Windows NT
CC	@chandlerc,@DMG862,@efriedma-quic,@fhahn,@gburgessiv,@nico,@VReichelt
Fixed by commit(s)	357257

Extended Description

My local build of clang was slow, so I profiled it, and this is what I found. I compiled on Windows with optimizations and debug info. I think enabling debug info makes basic blocks longer, and that's the main reason it matters.

Compiling clang's SemaChecking.cpp file took ~121s total, and of it, 21s is spent in DSE and 21s in memcpyopt. The majority of the time in both of these passes was in callCapturesBefore. The next most expensive pass is induction variable simplification at 14s, but its time is spent in SCEV, not AA.

Most of the time in DSE and memcpyopt is spent in callCapturesBefore, which is linear in the size of the basic block in the worst case. Most of the samples occur in OrderedBlock::comesBefore, which is the inner loop that iterates the linked list of instructions in a BB and numbers them to compute intra-BB dominance.

If we could, for example, maintain accurate position numbers for instructions in the basic block, this would not be necessary, but it would require a fancy algorithm to maintain amortized O(1) insertion and removal of instructions mid-block.

Fixing this without that fancy algorithm seems like a bit of a pain, since we can't blindly create an OrderedBasicBlock in our transforms that use callCapturesBefore. These passes currently add and remove instructions as they go, mutating the block. We'd probably end up seeing ABA bugs where freed instruction addresses appear in the OrderedBasicBlock DenseMap that maps from Instruction* to position. Every place where we update the domtree today, we'd have to update the appropriate basic block.

These passes already both use MemoryDependenceResults and call MD->removeInstruction. In theory we could move the instruction numbering into MemoryDependenceResults, and have it be responsible for maintaining the numbers, but its starting to feel like adding fancy algorithms to ilist / iplist would be more maintainable.

efriedma-quic · 2018-09-04T18:47:21Z

MemoryDependenceAnalysis is terrible for other reasons... maybe our time is better spent trying to eliminate it in favor of MemorySSA.

gburgessiv · 2018-09-04T18:54:14Z

(+1 to eli, but I'm biased ;) )

FWIW, MemorySSA already provides amortized O(1) BB-local dominance queries unless your usage looks like insert() -> query() -> insert() -> query() -> ... I planned to migrate MemDep to MSSA at some point; don't remember what blocked that (I remember there were concerns about pass ordering, but none of them sounded impossible to overcome). We also wanted to port DSE to MSSA.

The approach MSSA uses is really simple, but it appears to work well for the users we have. We keep a valid bit for each BB. When local dominance is requested and the valid bit isn't set, we assign a number to each memory instruction in the relevant BB and set the BB's valid bit. The valid bit is cleared (IIRC) by memory instruction insertion, but not by removal. Moving a memory instruction from one BB to another is modeled as a removal + insertion.

I'll note that this doesn't provide dominance information for non-memory instructions. I'm unsure if this is important here, since I'm not deeply familiar with the exact API that MemDep exposes. However, this makes

I think enabling debug info makes basic blocks longer, and that's the main reason it matters

a non-issue. :)

We could be way more fancy about invalidation/partial rebuilding (e.g. consider an algorithm that numbers instructions with N-sized gaps in between. This trivially allows for at least log(N) insertions before a full renumbering is required, but we can get even fancier than that pretty easily), but again, there's been no need to do so yet.

rnk · 2018-09-04T19:09:10Z

Separate from memdep vs. memssa, it sounds like we have an awfully large number of data structures maintaining local dominance information in DenseMaps.

Would you all support moving that to BasicBlock, i.e. maintain the "valid" bit there, add BasicBlock::comesBefore, lazily number instructions when it is called, invalidate it on insertion/removal? That's a very small and tractable change that I can work on without changing functionality.

efriedma-quic · 2018-09-04T19:37:30Z

That's 4 bytes per instruction extra memory usage, which isn't ideal... but I guess maybe okay? As long as the actual numbers aren't exposed to transforms, so we can mess with the representation later.

I'm kind of worried we'll start relying on it, and end up with algorithms that are actually quadratic anyway due to the invalidation, but I guess that's not really worse than the current situation.

rnk · 2018-09-04T23:47:18Z

Here's a patch to cache instruction ordering, if you want to play with it:
https://reviews.llvm.org/D51664

It'll speed up my clang builds, so, you know, I'm for it. :)

rnk · 2018-09-05T00:24:33Z

Re: reimplementing DSE on MemorySSA, it looks like there's already a patch for it:
https://reviews.llvm.org/D40480

The metrics the author collected look good to me. Should we just do it?

DMG862 · 2018-09-05T08:38:45Z

Hello

There are places where it's is still quadriatic, at least in that algorithm. The reason I wanted such a thing was to support cross-basic block DSE. That's only a small gain though, so not something I had time to continue.

It would at least need rebasing and the DSE pass may have picked up a few tricks since then, which would need to be incorporated. There's also https://reviews.llvm.org/D29624, which is an older version of the same thing, plus https://reviews.llvm.org/D29866, which is a better algorithm if we can get it to work. My idea was to use D40480 until someone made PDSE work correctly, which should then just be better, providing we can get it to do all the the things the current pass does.

IIRC, with some extra caching for the alias calls in isOverwrite, D40480 still had trouble with code like:
store
store
memcpy
store
store
memcpy
store
...
Because of all the modref calls for memcpy.

nico · 2019-03-25T12:45:11Z

*** Bug #36879 has been marked as a duplicate of this bug. ***

fhahn · 2019-03-25T13:33:58Z

*** Bug #22603 has been marked as a duplicate of this bug. ***

rnk · 2019-03-27T18:15:58Z

test case with slow DSE behavior

fhahn · 2019-03-29T17:44:29Z

The quadratic behavior caused by local ordering queries has been fixed by https://reviews.llvm.org/rL357257

The recently attached test case exhibits a different problem. I'll create a follow-up issue for it.

fhahn · 2019-03-29T18:04:55Z

Created attachment 21689 [details]
test case with slow DSE behavior

The hottest stack trace is the same as for #36936 . I think that issue already tracks the underlying problem with the attachment.

rnk · 2021-11-27T02:09:58Z

mentioned in issue llvm/llvm-bugzilla-archive#42515

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

rnk commented Sep 4, 2018

efriedma-quic commented Sep 4, 2018

gburgessiv commented Sep 4, 2018

rnk commented Sep 4, 2018

efriedma-quic commented Sep 4, 2018

rnk commented Sep 4, 2018

rnk commented Sep 5, 2018

DMG862 mannequin commented Sep 5, 2018

nico commented Mar 25, 2019

fhahn commented Mar 25, 2019

rnk commented Mar 27, 2019

fhahn commented Mar 29, 2019

fhahn commented Mar 29, 2019

rnk commented Nov 27, 2021

Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

Comments

rnk commented Sep 4, 2018

Extended Description

efriedma-quic commented Sep 4, 2018

gburgessiv commented Sep 4, 2018

rnk commented Sep 4, 2018

efriedma-quic commented Sep 4, 2018

rnk commented Sep 4, 2018

rnk commented Sep 5, 2018

DMG862 mannequin commented Sep 5, 2018

nico commented Mar 25, 2019

fhahn commented Mar 25, 2019

rnk commented Mar 27, 2019

fhahn commented Mar 29, 2019

fhahn commented Mar 29, 2019

rnk commented Nov 27, 2021