Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177

Closed
rnk opened this issue Sep 4, 2018 · 13 comments
Labels
bugzilla Issues migrated from bugzilla slow-compile

Comments

@rnk
Copy link
Collaborator

rnk commented Sep 4, 2018

Bugzilla Link 38829
Resolution FIXED
Resolved on Mar 29, 2019 11:04
Version trunk
OS Windows NT
CC @chandlerc,@DMG862,@efriedma-quic,@fhahn,@gburgessiv,@nico,@VReichelt
Fixed by commit(s) 357257

Extended Description

My local build of clang was slow, so I profiled it, and this is what I found. I compiled on Windows with optimizations and debug info. I think enabling debug info makes basic blocks longer, and that's the main reason it matters.

Compiling clang's SemaChecking.cpp file took ~121s total, and of it, 21s is spent in DSE and 21s in memcpyopt. The majority of the time in both of these passes was in callCapturesBefore. The next most expensive pass is induction variable simplification at 14s, but its time is spent in SCEV, not AA.

Most of the time in DSE and memcpyopt is spent in callCapturesBefore, which is linear in the size of the basic block in the worst case. Most of the samples occur in OrderedBlock::comesBefore, which is the inner loop that iterates the linked list of instructions in a BB and numbers them to compute intra-BB dominance.

If we could, for example, maintain accurate position numbers for instructions in the basic block, this would not be necessary, but it would require a fancy algorithm to maintain amortized O(1) insertion and removal of instructions mid-block.

Fixing this without that fancy algorithm seems like a bit of a pain, since we can't blindly create an OrderedBasicBlock in our transforms that use callCapturesBefore. These passes currently add and remove instructions as they go, mutating the block. We'd probably end up seeing ABA bugs where freed instruction addresses appear in the OrderedBasicBlock DenseMap that maps from Instruction* to position. Every place where we update the domtree today, we'd have to update the appropriate basic block.

These passes already both use MemoryDependenceResults and call MD->removeInstruction. In theory we could move the instruction numbering into MemoryDependenceResults, and have it be responsible for maintaining the numbers, but its starting to feel like adding fancy algorithms to ilist / iplist would be more maintainable.

@efriedma-quic
Copy link
Collaborator

MemoryDependenceAnalysis is terrible for other reasons... maybe our time is better spent trying to eliminate it in favor of MemorySSA.

@gburgessiv
Copy link
Member

(+1 to eli, but I'm biased ;) )

FWIW, MemorySSA already provides amortized O(1) BB-local dominance queries unless your usage looks like insert() -> query() -> insert() -> query() -> ... I planned to migrate MemDep to MSSA at some point; don't remember what blocked that (I remember there were concerns about pass ordering, but none of them sounded impossible to overcome). We also wanted to port DSE to MSSA.

The approach MSSA uses is really simple, but it appears to work well for the users we have. We keep a valid bit for each BB. When local dominance is requested and the valid bit isn't set, we assign a number to each memory instruction in the relevant BB and set the BB's valid bit. The valid bit is cleared (IIRC) by memory instruction insertion, but not by removal. Moving a memory instruction from one BB to another is modeled as a removal + insertion.

I'll note that this doesn't provide dominance information for non-memory instructions. I'm unsure if this is important here, since I'm not deeply familiar with the exact API that MemDep exposes. However, this makes

I think enabling debug info makes basic blocks longer, and that's the main reason it matters

a non-issue. :)

We could be way more fancy about invalidation/partial rebuilding (e.g. consider an algorithm that numbers instructions with N-sized gaps in between. This trivially allows for at least log(N) insertions before a full renumbering is required, but we can get even fancier than that pretty easily), but again, there's been no need to do so yet.

@rnk
Copy link
Collaborator Author

rnk commented Sep 4, 2018

Separate from memdep vs. memssa, it sounds like we have an awfully large number of data structures maintaining local dominance information in DenseMaps.

Would you all support moving that to BasicBlock, i.e. maintain the "valid" bit there, add BasicBlock::comesBefore, lazily number instructions when it is called, invalidate it on insertion/removal? That's a very small and tractable change that I can work on without changing functionality.

@efriedma-quic
Copy link
Collaborator

That's 4 bytes per instruction extra memory usage, which isn't ideal... but I guess maybe okay? As long as the actual numbers aren't exposed to transforms, so we can mess with the representation later.

I'm kind of worried we'll start relying on it, and end up with algorithms that are actually quadratic anyway due to the invalidation, but I guess that's not really worse than the current situation.

@rnk
Copy link
Collaborator Author

rnk commented Sep 4, 2018

Here's a patch to cache instruction ordering, if you want to play with it:
https://reviews.llvm.org/D51664

It'll speed up my clang builds, so, you know, I'm for it. :)

@rnk
Copy link
Collaborator Author

rnk commented Sep 5, 2018

Re: reimplementing DSE on MemorySSA, it looks like there's already a patch for it:
https://reviews.llvm.org/D40480

The metrics the author collected look good to me. Should we just do it?

@DMG862
Copy link
Mannequin

DMG862 mannequin commented Sep 5, 2018

Hello

There are places where it's is still quadriatic, at least in that algorithm. The reason I wanted such a thing was to support cross-basic block DSE. That's only a small gain though, so not something I had time to continue.

It would at least need rebasing and the DSE pass may have picked up a few tricks since then, which would need to be incorporated. There's also https://reviews.llvm.org/D29624, which is an older version of the same thing, plus https://reviews.llvm.org/D29866, which is a better algorithm if we can get it to work. My idea was to use D40480 until someone made PDSE work correctly, which should then just be better, providing we can get it to do all the the things the current pass does.

IIRC, with some extra caching for the alias calls in isOverwrite, D40480 still had trouble with code like:
store
store
memcpy
store
store
memcpy
store
...
Because of all the modref calls for memcpy.

@nico
Copy link
Contributor

nico commented Mar 25, 2019

*** Bug #36879 has been marked as a duplicate of this bug. ***

@fhahn
Copy link
Contributor

fhahn commented Mar 25, 2019

*** Bug #22603 has been marked as a duplicate of this bug. ***

@rnk
Copy link
Collaborator Author

rnk commented Mar 27, 2019

@fhahn
Copy link
Contributor

fhahn commented Mar 29, 2019

The quadratic behavior caused by local ordering queries has been fixed by https://reviews.llvm.org/rL357257

The recently attached test case exhibits a different problem. I'll create a follow-up issue for it.

@fhahn
Copy link
Contributor

fhahn commented Mar 29, 2019

Created attachment 21689 [details]
test case with slow DSE behavior

The hottest stack trace is the same as for #36936 . I think that issue already tracks the underlying problem with the attachment.

@rnk
Copy link
Collaborator Author

rnk commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#42515

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla slow-compile
Projects
None yet
Development

No branches or pull requests

5 participants