The following code doesn't optimize down: // Source: pag.csail.mit.edu/paste2005/slides/yang-paste2005-slides.ppt extern void foo(); void process(int *p, int *q) { if (p != q) return; if (*p != *q) foo(); // unreachable } In a branch whose condition depends on a EQ or NE, in the side where the variables are equal (and if the types are equal (?)), replaceAllUsesOfWith(p, q). This allows the compiler to optimize away the following test and should kill the liveness of one of the variables. I suppose you could also do the same optimization with a switch instruction. Where would such an optimization belong? The instruction combiner?
I've implemented my suggested change and it doesn't work. My change updates the BB for the "then" block after setne to seqeq transformation but that doesn't get through to the unified return block. Modifying the source to read: if (p != q) return else q = p; causes the example to optimize away as it should. Since that's correct in general (subject to type of p and q and isVolatile) I'd like to do the equivalent transformation on the SSA form. I'm not sure exactly how.
It isn't safe to P->replaceAllUsesWith(Q) or vice versa because replaceAllUsesWith will replace ALL uses with no notion of region. This will turn the initial condition into Q != Q, for example. Implementing this requires a value range propagation pass. LLVM has two things related to this: lib/Transforms/Scalar/CondPropagate.cpp, is a very simple conditional value propagor, and is stable. Scalar/CorrelatedExprs.cpp is a more aggressive propagator but has SSA update bugs and generally isn't stable. It should be replaced. Unfortunately there isn't currently an easy place to add this to LLVM. -Chris
Created attachment 344 [details] Adds new Propagate Equality pass Ok, if it doesn't fit anywhere, I'll create a new place for it. This finds all branch instructions (only, no switch yet) and finds the set of basic blocks affected by: * using a depth-first iterator to find the children from the true and false sides * then set-subtract the false from the true. This is written to handle multiple false roots for when switch support is added. Unfortunately, this is probably horribly slow. Comments appreciated. Anyways, it handles the test in the bug report correctly, and still passes the regression tests, so I'd call it a win. This patch needs some minor cleanup work (include the regtest, fix the name of the opt in the various places it appears), but I'll do that after I get a review. I'll probably need to change more than the cleanup I know about.
Duh, that algorithm was incorrect and could replace too many uses, in an ugly case involving a branch inside of a loop. Please wait for an update.
Created attachment 345 [details] New propagate equality pass. There here does what we wanted before. As far as I know, there are no outstanding issues. Comes complete with testcase.
Nifty. Have you evaluated how many times this triggers on various programs? I'd suggest setting up a test to see how often it applies, details here: http://llvm.org/docs/TestingGuide.html#customtest As a side note, this loop: for (Instruction::op_iterator j = i->op_begin(); j != i->op_end(); ++j) ... can be replaced with: i->replaceUsesOfWith(Old, New); Also, please s/LLVM Research Group/you/ in the file header. :) -Chris
Created attachment 346 [details] New new propagate equality pass. This patch has a different implementation; instead of checking each BB for users of the value, check the value's use list for uses within the affected area. Stats: 2 propequal - Number of equal variables substituted in MultiSource/Benchmarks/Olden/bh 7 propequal - Number of equal variables substituted in MultiSource/Benchmarks/Prolangs-C/bison
I realize that this probably isn't ready for review (recent bugs etc), but here is some feedback in the meantime. Overall, the pass looks good and it does catch a useful property. However, I don't think we can justify an entire pass over the IL for this simple optimization, at least not turned on by default. The pass as you have it now can be checked into CVS whenever you're happy with it and it is relatively bug free. Through inspection, I believe you have some issues updating PHI nodes, but I'll let you play with the pass to decide if so and if there is anything else you'd like to change before the first incarnation goes into CVS. Before this gets turned on by default, I'd like to see the pass get a little bit more general. In particular, this is a simple dominator-based optimization. I believe it would be more general and more efficient to formulate this as a table-based dominator tree walk. Basically the pass would like like this: 1. Keep a table of known (in)equalities, e.g. A == B, A != 0, C < D. 2. Walk the code in dominator tree order. When you see a conditional branch where a successor is dominated by the branch block, see if there is a condition that can be inferred in that successor due to the branch condition. If so, enter it in the table before recursing into the successor part of the domtree. When unrecursing back to the branch block, pop the inequality off of the table. 3. When walking instructions, keep track of equalities that can be inferred, and use the table to simplify instructions. Some examples are below. Things that this approach can simplify: X = load P <- from this point on, we know P is non-null, enter P != 0 into the table. if (P == 0) goto ... <- obviously false if (A != 4) { <- record A != 5 in the table B = A+4; <- record B != 9 in the table if (B == 9) { <- false float A, B; if (A > B) <- neither are nan's if (isnan(A)) <- false etc. There are a *large* number of simple transformations like this that we can build up with this capability (look at the ill-fated cee pass for examples). Starting the framework and slowly expanding it is probably the way to go, and the best way to do this is probably in CVS. In addition to new "tricks", you can also make the table more powerful, e.g. to keep track of value ranges in addition to inequalities (e.g. that A is [4 .. 9]), etc. Are you interested in working on this? If so, the best approach is for you to get the initial version set up with the rough framework, then iterate on it in CVS (you'll get write access). Seem reasonable? -Chris
Other more specific comments about your current pass: 1. The LHS/RHS of a setcc instruction are guaranteed to be the same type. 2. Your code should work with calls and casts, why do you special case and ignore them? 3. Your code currently thinks about instructions and constants. I think it would be better to think about instructions and "everything else". In particular, arguments should be handled like constants. 4. There is value in changing one argument to another: int foo(int A, int B) { if (A == B) return A-B; // turn into A-A or B-B, which instcombine will then turn into 0. return 42; }
Created attachment 347 [details] Now passes llvm-test! I should've warned you not to review, as my local copy diverged pretty far from the original. Casts and calls aren't treated specially. I took your tip that the two sides of a setcc are the same type. There's not that much magical about handling constants. All it does it prevent the pass from ripping out "uint 0" and replacing it with "tmp.22" because it knows they're equal. Instead, we should replace an instruction with a constant. Removing that will also cause it to break; it will rip the "uint 0"'s right out of your getelementptr instructions. This also means that it works fine on arguments. In fact, the included testcase just uses function arguments. Unfortunately, it doesn't seem to be much of a win. Zero replacements performed across all of llvm-test.
Looks reasonable, some more comments then :) First, you *do* want to handle arguments somewhat like constants. For example, if you have Arg == 1, you obviously want to replace Arg with 1. However, if you have inst == arg, you want to make sure you replace the inst with the arg, not the arg with the inst. This check: if (!Forest->properlyDominates(BranchBB, BB)) continue; can be done once for the entire expression. In particular, in conjunction with your other check, you're really checking to see if the EqualDest is dominated by the branch block. Please change the code to test to see if the EqualDest is *properly* dominated by the branch block up outside of the use list. If not, don't process any uses. Next, you should check phi nodes more aggressively. Consider this: if (A == 7) { ... } else { ... } C = phi(A, x) here you can change the A in the PHI node with 7, because the use occurs in a dominating block, not actually in the merge block. This is actually required for correctness in obscure cases too. Finally, you should check to make sure you're not replacing FP values incorrectly. Specifically, if you know that (A == B), then you can replace A with B in the true block. If you know "A != B" (FP-only) then A is not neccesarily B in the false block (due to nans). -Chris
As for your more general system idea -- I actually came across this testcase when looking for papers on how to write such a path-sensitive optimizer. Thus far, I haven't found one I'm really happy with. I'm very interested in working on such a pass, though I doubt I could write it single-handedly. Getting the set of properties right (non-overlapping) is important, or at least we'll have to have some sort of canonicalization step (a == 4 implies a != 5). It'd probably work best if we fill entries in the record lazily; so when the optimizer comes across a statement that could be optimized if only property X, it works backwards over the program flow and fills in the records to see if it holds. I don't really want to go the way of value ranges. That's not expressive enough. You can't express that a value is even, for example. Actually, I think that a small expressive language would be the current LLVM instruction set. And this is there point where I admit that I'm just not entirely sure how that would work. It seems that I might want to canonicalize thigs down even farther. Consider that there are (at least) three ways to make a value even: multiply by 2 (or a multiple), shift left by 1 (or more), and "and" away the last bit. All three could be expressed with and/or operations and appropriate control flow, then analyzed by a path-sensitive system with knowledge only of a bit at a time. "mod" would also have to be rewritten this way. Perhaps I should just start with something simple like EQ and NE, and save the inventing for another time and place? I would happily implement the idea you described, the way you described it.
> I'm very interested in working on such a pass, though I doubt I could write it > single-handedly. I'd be happy to help and to guide. > I don't really want to go the way of value ranges. That's not expressive enough. > You can't express that a value is even, for example. Optimization really isn't about catching every possibility for canonicalization and simplification, it's about catching a lot in a reasonable amount of compile time. Having a single forward pass over the dominator tree with a table represents a reasonable use of compile time: using a full theorum prover (f.e.) probably isn't. :) In any case, I think that starting simple, with simple inequalities, is the way to go. Once the table and basic structure is in place, it can be extended in many possible ways. :) > Perhaps I should just start with something simple like EQ and NE, and save the > inventing for another time and place? I would happily implement the idea you > described, the way you described it. Sounds great, -Chris
Oh yeah, btw, instcombine knows that things like X*2&1 are always 0, it can track individual bits, inferring when they are known zero or one. -Chris
Created attachment 349 [details] First attempt at record-based PropagateEquality pass. This is my first pass at trying to implement the system described by Chris. I'm not too pleased with it, I don't see how it can handle things like "x == a; x == b; a == b?", or even just "x == 0; x != 1?" without a design change of some sort. But then, that's what review is for! So far, it can handle this example: extern void f(); extern void g(); void test1(int x, int y) { if (x == 4) { f(); if (y) if (x == 4) { g(); } } } and not much else.
Created attachment 350 [details] first functional record-based propequal pass This pass manages 1366 variable replacements on llvm-test.
I've encountered some problems that warrant design changes. Firstly, it needs to become a two-pass system. The first pass tags the BBs, the second pass performs the replacements. At the moment, we have to iterate through every instruction in the program, not just the use lists of setcc instructions. The reason is that at the time we see the setcc instruction, the properties haven't yet been assigned to the BBs. That happens with the branch instruction. Secondly, the EQProperties and NEProperties aren't canonicalized. Suppose you create a property for "x == y", and a property for "x != z". Then, you have a statement "if (x == z)": the first thing that happens is that x -> y, then we check for "y != z?" which doesn't exist. (The opposite problem exists even if you do variable substitution after.) The EQProperty, needs to store the whole equivalence list, not just pairs. Otherwise, it works rather well. It optimizes the original case that started this bug.
I haven't looked in depth at your code (about to do that now) but I don't think that a 2-pass system is needed. > At the moment, we have to iterate through > every instruction in the program, not just the use lists of setcc instructions. Right. This is actually a good thing, because it will be required for some future extensions. For example, if you have this: X = load P C = seteq P, null We can know that C == false. This case really does occur, for example when compiling the llvm dyn_cast template. This has nothing to do with basic blocks, we we should keep iterating over instructions. -Chris
Some feedback on the "first functional record-based propequal pass" pass: You might want to name the pass "CorrelationPropagation.cpp" or something, as eventually we'll be propagating more than just equality (e.g. x < 5 implies x < 7, and we could catch that in the future). Please add a comment block at the top of the file explaining what the pass does, for example you can include the example that started this PR. Please remove any #includes you can. You don't need Instruction.h (implied by Instructions.h), don't need Casting.h, Interval.h, Use.h, and perhaps others. Instead of making Property an inheritance-based class, you could make property be a simple tuple (V1/ V2/property where property is an enum). Making it a simple value class would allow you to store it by- value in the maps, allowing you to drop the new's. Please don't make the multimaps be global. They should be instance vars in the class. Why are you hashing the properties? With the above changes, PropertiesByHash could be implemented as: std::set<Property>KnownProperties; Just implement operator< for properties to sort properties (in an arbitrary order) and the set will unique them for you. Right now your code appears to depend on the order of basic blocks in a function (you use visit (Function), and attempts to keep track of properties per-basic-block. I'd suggest the code to be modified to work like this pseudo-code: runOnFunction() { std::set KnownProperties; return WalkDominatorTree(DT.getRoot(), KnownProperties); } bool WalkDominatorTree(DTNode *N, InputKnownProperties) { // Copy the properties known so recorded properties only are inherited by successors in the domtree. std::set KnownProperties = InputKnownProperties; foreach (instruction i in N->getBasicBlock()) switch (i->getOpcode()) { case XYZ: visitXYZ(KnownProperties); } if (BBTerminator is branchinst && condition is X == Y && dominates successors) { std::set TrueKnownProperties = KnownProperties; TrueKnownProperties.insert(Property(X, Y, ==)); WalkDominatorTree(DT[TrueSucc], TrueKnownProperties); std::set FalseKnownProperties = KnownProperties; FalseKnownProperties.insert(Property(X, Y, !=)); WalkDominatorTree(DT[FalseSucc], FalseKnownProperties); } else { foreach (dtsucc in N) WalkDominatorTree(DT[FalseSucc], KnownProperties); } } This way, the recursion on the stack naturally handles "pushing and popping" new properties, the visit methods can add properties to KnownProperties if desired (i.e. visitLoad can put (P, null, !=) onto the stack, and the visit methods can simplify conditions if desired (e.g. if you run across a setcc X != Y, and you know about X,Y, you can turn it into true/false. -Chris
Oh yeah, PropagateEquality::newConstant is also already available in LLVM. You want: ConstantExpr::get(Opcode, LHS, RHS); Which does constant folding for you. -Chris
What if there is a BB with two predecessors, neither of whom dominate it, but both happen to carry the same property? It would mean doing a breadth-first traversal of the dominator tree, and each BB takes the union of the properties of its predecessors. Is this too complicated?
That boils down to being a question of compile time vs code quality trade off. In practice, I think these things are unlikely to be significant, but if needed a full dataflow analysis *could* be done. I think sticking to a dom-tree walk is likely to provide the biggest bang for the cycle though. -Chris
Created attachment 351 [details] second attempt; includes load* ptr, ptr != null. Thanks for the review. I was hoping to post a full patch with the new name, but that will probably have to wait until next weekend. Otherwise, I've incorporated almost all of your suggestions. walkDominatorTree got refactored into oblivion. You'll find parts of it named visitBasicBlock, proceedToSuccessor and the 5 visit functions. There's quite a bit of redundancy between the "addProperty" and "resolve" mechanisms. What "addProperty" does is given: %A = seteq %x, 4 %B = setne %y, 0 %C = and bool %A, %B br bool %C label %here, label %there When the visitBranchInst calls addProperty, gives it Property(%C, true, ==). addProperty then look inside of %C and see that for an 'and' instruction, that Property(%A, true, ==) and Property(%B, true, ==). Then, those are lowered to Property(%x, 4, ==) and Property(%y, 0, !=). However, there are some cases this misses. "resolve" runs over every operand of every instruction in every basic block, making it possibly slow, but can catch cases like: %A = seteq %x, 4 %B = setlt %x, 4 br bool %A label %here1, label %there here1: br bool %B label %here2, label %there Because %B is defined outside of the block with the property, it can't be simplified at definition time (through the visit* functions), nor is it detected by addProperty above. This accounts for about 5% of all of the simplifications done. As for performance, my tests run it with -mem2reg -instcombine -propequal, and I see 2121 total substitutions and 26 switch cases removed across llvm-test (in llvm-gcc3, so remove tests using inline asm). Note that it double-counts variable substitutions; %x -> %y -> %z counts as two. It runs as low as 1% and as much as 20% of the optimization time, out of just those three passes.
I'm considering a different design. I realized that the == property is different from other properties in that it's used for the variable substitution. I'm trying a new design which stores sets of synonymous variables together, and properties are relationships are between two synonym-sets instead of two Values. This means that if a != b and a == x, then b != x without needing to store a second property. Further, because checking for substitutes for variables is so common (once per instruction's operand), I'll make an std::map to store the mapping from Value to synonym-set's "most canonicalized form" of said Value. As I'm still working on the implementation, it remains to be seen whether this generates enough benefit on the CPU cycle.
Created attachment 377 [details] collapse equivalent variables; use std::map for fast lookup As is, this patch triggers an assertion, bug 871, for five of the tests in the llvm-test suite. The included feature tests aren't quite exhaustive, and some of them can be optimized by simplifycfg without propsimplify's help. I'd appreciate any review, comments, testing and bug reports. :)
Created attachment 378 [details] fix KPcopy usage Obvious stupid bug in the previous patch; reusing KPcopy means that properties from one successor would be applied to a different one. Funny thing is, this doesn't seem to change anything in llvm-test!
This optimization pass has since been enabled.