-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock #38177
Comments
MemoryDependenceAnalysis is terrible for other reasons... maybe our time is better spent trying to eliminate it in favor of MemorySSA. |
(+1 to eli, but I'm biased ;) ) FWIW, MemorySSA already provides amortized O(1) BB-local dominance queries unless your usage looks like insert() -> query() -> insert() -> query() -> ... I planned to migrate MemDep to MSSA at some point; don't remember what blocked that (I remember there were concerns about pass ordering, but none of them sounded impossible to overcome). We also wanted to port DSE to MSSA. The approach MSSA uses is really simple, but it appears to work well for the users we have. We keep a valid bit for each BB. When local dominance is requested and the valid bit isn't set, we assign a number to each memory instruction in the relevant BB and set the BB's valid bit. The valid bit is cleared (IIRC) by memory instruction insertion, but not by removal. Moving a memory instruction from one BB to another is modeled as a removal + insertion. I'll note that this doesn't provide dominance information for non-memory instructions. I'm unsure if this is important here, since I'm not deeply familiar with the exact API that MemDep exposes. However, this makes
a non-issue. :) We could be way more fancy about invalidation/partial rebuilding (e.g. consider an algorithm that numbers instructions with N-sized gaps in between. This trivially allows for at least log(N) insertions before a full renumbering is required, but we can get even fancier than that pretty easily), but again, there's been no need to do so yet. |
Separate from memdep vs. memssa, it sounds like we have an awfully large number of data structures maintaining local dominance information in DenseMaps. Would you all support moving that to BasicBlock, i.e. maintain the "valid" bit there, add BasicBlock::comesBefore, lazily number instructions when it is called, invalidate it on insertion/removal? That's a very small and tractable change that I can work on without changing functionality. |
That's 4 bytes per instruction extra memory usage, which isn't ideal... but I guess maybe okay? As long as the actual numbers aren't exposed to transforms, so we can mess with the representation later. I'm kind of worried we'll start relying on it, and end up with algorithms that are actually quadratic anyway due to the invalidation, but I guess that's not really worse than the current situation. |
Here's a patch to cache instruction ordering, if you want to play with it: It'll speed up my clang builds, so, you know, I'm for it. :) |
Re: reimplementing DSE on MemorySSA, it looks like there's already a patch for it: The metrics the author collected look good to me. Should we just do it? |
Hello There are places where it's is still quadriatic, at least in that algorithm. The reason I wanted such a thing was to support cross-basic block DSE. That's only a small gain though, so not something I had time to continue. It would at least need rebasing and the DSE pass may have picked up a few tricks since then, which would need to be incorporated. There's also https://reviews.llvm.org/D29624, which is an older version of the same thing, plus https://reviews.llvm.org/D29866, which is a better algorithm if we can get it to work. My idea was to use D40480 until someone made PDSE work correctly, which should then just be better, providing we can get it to do all the the things the current pass does. IIRC, with some extra caching for the alias calls in isOverwrite, D40480 still had trouble with code like: |
*** Bug #36879 has been marked as a duplicate of this bug. *** |
*** Bug #22603 has been marked as a duplicate of this bug. *** |
The quadratic behavior caused by local ordering queries has been fixed by https://reviews.llvm.org/rL357257 The recently attached test case exhibits a different problem. I'll create a follow-up issue for it. |
The hottest stack trace is the same as for #36936 . I think that issue already tracks the underlying problem with the attachment. |
mentioned in issue llvm/llvm-bugzilla-archive#42515 |
Extended Description
My local build of clang was slow, so I profiled it, and this is what I found. I compiled on Windows with optimizations and debug info. I think enabling debug info makes basic blocks longer, and that's the main reason it matters.
Compiling clang's SemaChecking.cpp file took ~121s total, and of it, 21s is spent in DSE and 21s in memcpyopt. The majority of the time in both of these passes was in callCapturesBefore. The next most expensive pass is induction variable simplification at 14s, but its time is spent in SCEV, not AA.
Most of the time in DSE and memcpyopt is spent in callCapturesBefore, which is linear in the size of the basic block in the worst case. Most of the samples occur in OrderedBlock::comesBefore, which is the inner loop that iterates the linked list of instructions in a BB and numbers them to compute intra-BB dominance.
If we could, for example, maintain accurate position numbers for instructions in the basic block, this would not be necessary, but it would require a fancy algorithm to maintain amortized O(1) insertion and removal of instructions mid-block.
Fixing this without that fancy algorithm seems like a bit of a pain, since we can't blindly create an OrderedBasicBlock in our transforms that use callCapturesBefore. These passes currently add and remove instructions as they go, mutating the block. We'd probably end up seeing ABA bugs where freed instruction addresses appear in the OrderedBasicBlock DenseMap that maps from Instruction* to position. Every place where we update the domtree today, we'd have to update the appropriate basic block.
These passes already both use MemoryDependenceResults and call MD->removeInstruction. In theory we could move the instruction numbering into MemoryDependenceResults, and have it be responsible for maintaining the numbers, but its starting to feel like adding fancy algorithms to ilist / iplist would be more maintainable.
The text was updated successfully, but these errors were encountered: