This is a very problematic function. - The opt takes more than 20 min. - The llc takes about 10 min. - The code generated for ARM is very poor.
Created attachment 845 [details] iterative_me.bc original function
Created attachment 846 [details] iterative_me.opt.bc optimized function
Created attachment 847 [details] iterative_me.opt.s.bz2 generated code
laurov@laurov-desktop:/tmp$ opt -time-passes -std-compile-opts iterative_me.bc -o iterative_me.opt.bc -f ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 1555.4652 seconds (2394.6937 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 1025.0560 ( 65.9%) 0.4360 ( 56.4%) 1025.4920 ( 65.9%) 1746.6673 ( 72.9%) Dead Store Elimination 527.5409 ( 33.9%) 0.3280 ( 42.4%) 527.8689 ( 33.9%) 645.5995 ( 26.9%) Global Common Subexpression Elimination 0.2600 ( 0.0%) 0.0000 ( 0.0%) 0.2600 ( 0.0%) 0.2602 ( 0.0%) Combine redundant instructions 0.2600 ( 0.0%) 0.0000 ( 0.0%) 0.2600 ( 0.0%) 0.2598 ( 0.0%) Combine redundant instructions 0.2560 ( 0.0%) 0.0000 ( 0.0%) 0.2560 ( 0.0%) 0.2569 ( 0.0%) Combine redundant instructions 0.2440 ( 0.0%) 0.0040 ( 0.5%) 0.2480 ( 0.0%) 0.2502 ( 0.0%) Combine redundant instructions 0.1760 ( 0.0%) 0.0040 ( 0.5%) 0.1800 ( 0.0%) 0.1810 ( 0.0%) Combine redundant instructions 0.0800 ( 0.0%) 0.0000 ( 0.0%) 0.0800 ( 0.0%) 0.1796 ( 0.0%) Bitcode Writer 0.0600 ( 0.0%) 0.0000 ( 0.0%) 0.0600 ( 0.0%) 0.1587 ( 0.0%) Aggressive Dead Code Elimination 0.1160 ( 0.0%) 0.0000 ( 0.0%) 0.1160 ( 0.0%) 0.1294 ( 0.0%) Combine redundant instructions 0.0240 ( 0.0%) 0.0000 ( 0.0%) 0.0240 ( 0.0%) 0.1239 ( 0.0%) Find Used Types 0.0440 ( 0.0%) 0.0000 ( 0.0%) 0.0440 ( 0.0%) 0.0529 ( 0.0%) Sparse Conditional Constant Propagation 0.0520 ( 0.0%) 0.0000 ( 0.0%) 0.0520 ( 0.0%) 0.0516 ( 0.0%) Loop-Closed SSA Form Pass 0.0440 ( 0.0%) 0.0000 ( 0.0%) 0.0440 ( 0.0%) 0.0474 ( 0.0%) Module Verifier 0.0480 ( 0.0%) 0.0000 ( 0.0%) 0.0480 ( 0.0%) 0.0470 ( 0.0%) Module Verifier 0.0440 ( 0.0%) 0.0000 ( 0.0%) 0.0440 ( 0.0%) 0.0421 ( 0.0%) Loop Invariant Code Motion 0.0360 ( 0.0%) 0.0000 ( 0.0%) 0.0360 ( 0.0%) 0.0348 ( 0.0%) Scalar Replacement of Aggregates 0.0320 ( 0.0%) 0.0000 ( 0.0%) 0.0320 ( 0.0%) 0.0295 ( 0.0%) Canonicalize Induction Variables 0.0280 ( 0.0%) 0.0000 ( 0.0%) 0.0280 ( 0.0%) 0.0258 ( 0.0%) Simplify the CFG 0.0240 ( 0.0%) 0.0000 ( 0.0%) 0.0240 ( 0.0%) 0.0225 ( 0.0%) Reassociate expressions 0.0240 ( 0.0%) 0.0000 ( 0.0%) 0.0240 ( 0.0%) 0.0216 ( 0.0%) Unswitch loops 0.0200 ( 0.0%) 0.0000 ( 0.0%) 0.0200 ( 0.0%) 0.0206 ( 0.0%) Canonicalize natural loops 0.0160 ( 0.0%) 0.0000 ( 0.0%) 0.0160 ( 0.0%) 0.0159 ( 0.0%) Simplify the CFG 0.0160 ( 0.0%) 0.0000 ( 0.0%) 0.0160 ( 0.0%) 0.0131 ( 0.0%) Dead Global Elimination 0.0120 ( 0.0%) 0.0000 ( 0.0%) 0.0120 ( 0.0%) 0.0127 ( 0.0%) Natural Loop Construction 0.0119 ( 0.0%) 0.0000 ( 0.0%) 0.0119 ( 0.0%) 0.0119 ( 0.0%) Post-Dominator Tree Construction 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0101 ( 0.0%) Promote Memory to Register 0.0120 ( 0.0%) 0.0000 ( 0.0%) 0.0120 ( 0.0%) 0.0100 ( 0.0%) Simplify the CFG 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0099 ( 0.0%) Loop-Closed SSA Form Pass 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0095 ( 0.0%) Simplify the CFG 0.0120 ( 0.0%) 0.0000 ( 0.0%) 0.0120 ( 0.0%) 0.0089 ( 0.0%) Break critical edges in CFG 0.0079 ( 0.0%) 0.0000 ( 0.0%) 0.0079 ( 0.0%) 0.0087 ( 0.0%) Natural Loop Construction 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0080 ( 0.0%) Dominator Tree Construction 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0080 ( 0.0%) Dominator Tree Construction 0.0079 ( 0.0%) 0.0000 ( 0.0%) 0.0079 ( 0.0%) 0.0073 ( 0.0%) Break critical edges in CFG 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0065 ( 0.0%) Dominator Tree Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0065 ( 0.0%) Unroll loops 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0065 ( 0.0%) Dominator Tree Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0064 ( 0.0%) Dominator Tree Construction 0.0039 ( 0.0%) 0.0000 ( 0.0%) 0.0039 ( 0.0%) 0.0064 ( 0.0%) Simplify the CFG 0.0080 ( 0.0%) 0.0000 ( 0.0%) 0.0080 ( 0.0%) 0.0063 ( 0.0%) Dominator Tree Construction 0.0079 ( 0.0%) 0.0000 ( 0.0%) 0.0079 ( 0.0%) 0.0062 ( 0.0%) Dominator Tree Construction 0.0039 ( 0.0%) 0.0000 ( 0.0%) 0.0039 ( 0.0%) 0.0044 ( 0.0%) Remove unused exception handling info 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0033 ( 0.0%) Function Integration/Inlining 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0032 ( 0.0%) ET Forest Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0030 ( 0.0%) ET Forest Construction 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0028 ( 0.0%) Post-Dominance Frontier Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0026 ( 0.0%) ET Forest Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0026 ( 0.0%) Canonicalize natural loops 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0025 ( 0.0%) ET Forest Construction 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0025 ( 0.0%) Dominance Frontier Construction 0.0039 ( 0.0%) 0.0000 ( 0.0%) 0.0039 ( 0.0%) 0.0025 ( 0.0%) ET Forest Construction 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0025 ( 0.0%) ET Forest Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0025 ( 0.0%) ET Forest Construction 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0021 ( 0.0%) Basic CallGraph Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0020 ( 0.0%) Tail Duplication 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0020 ( 0.0%) Dominance Frontier Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0020 ( 0.0%) Dominance Frontier Construction 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0015 ( 0.0%) Conditional Propagation 0.0040 ( 0.0%) 0.0000 ( 0.0%) 0.0040 ( 0.0%) 0.0012 ( 0.0%) Scalar Evolution Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0011 ( 0.0%) Tail Call Elimination 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0010 ( 0.0%) Rotate Loops 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0007 ( 0.0%) Conditional Propagation 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0005 ( 0.0%) Unify function exit nodes 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.0%) Dead Type Elimination 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.0%) Simplify well-known library calls 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Dead Argument Elimination 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Merge Duplicate Global Constants 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Promote 'by reference' arguments to scalars 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Global Variable Optimizer 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) No Alias Analysis (always returns 'may' alias) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Basic Value Numbering (default GVN impl) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Lower Set Jump 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Raise allocations from calls to instructions 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Scalar Evolution Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Interprocedural constant propagation 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Load Value Numbering 1554.6931 (100.0%) 0.7720 (100.0%) 1555.4652 (100.0%) 2394.6937 (100.0%) TOTAL
*** This bug has been marked as a duplicate of 930 ***
DSE is a known issue, but this actually triggers a horrible thing in sdisel as well. Reopening for that issue.
Further, lauro, the code looks very strange. Assuming the .c file doesn't contain hundreds of loads, can you file another bug with the .i file that contains iterative_me?
Fixed, patch here: This speeds up isel on this function from: ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 60.0536 ( 96.8%) 1.4269 ( 97.9%) 61.4805 ( 96.8%) 61.4842 ( 96.8%) ARM Instruction Selection 1.6086 ( 2.5%) 0.0115 ( 0.7%) 1.6202 ( 2.5%) 1.6203 ( 2.5%) Linear Scan Register Allocator to: ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 3.9081 ( 66.9%) 1.6220 ( 98.1%) 5.5301 ( 73.8%) 5.5876 ( 74.0%) ARM Instruction Selection 1.5635 ( 26.7%) 0.0119 ( 0.7%) 1.5754 ( 21.0%) 1.5757 ( 20.8%) Linear Scan Register Allocator a speedup of over 11x, in a release build. -Chris
no really, patch here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20070514/049659.html
It takes a lot of time to compile iterative_me.bc once again: ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 10 dse - Number of other instrs removed 2 dse - Number of stores deleted ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 139.6793 seconds (142.3439 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 138.7904 ( 99.9%) 0.8715 ( 99.9%) 139.6620 ( 99.9%) 142.3265 ( 99.9%) Dead Store Elimination 0.0157 ( 0.0%) 0.0003 ( 0.0%) 0.0161 ( 0.0%) 0.0161 ( 0.0%) Module Verifier 0.0009 ( 0.0%) 0.0001 ( 0.0%) 0.0010 ( 0.0%) 0.0011 ( 0.0%) Dominator Tree Construction 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.0%) 0.0001 ( 0.0%) Memory Dependence Analysis 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Preliminary module verification 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Basic Alias Analysis (default AA impl) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Target Data Layout 138.8072 (100.0%) 0.8721 (100.0%) 139.6793 (100.0%) 142.3439 (100.0%) TOTAL
Shark says the culprit is MemoryDependenceAnalysis.cpp:208.
Still slow, but now GVN has joined the act: Total Execution Time: 96.7546 seconds (96.9454 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 47.9156 ( 49.6%) 0.0372 ( 33.0%) 47.9528 ( 49.6%) 48.1304 ( 49.6%) Global Value Numbering 47.7280 ( 49.4%) 0.0537 ( 47.7%) 47.7818 ( 49.4%) 47.7866 ( 49.3%) Dead Store Elimination 0.1606 ( 0.2%) 0.0026 ( 2.3%) 0.1632 ( 0.2%) 0.1669 ( 0.2%) Combine redundant instructions 0.1548 ( 0.2%) 0.0029 ( 2.6%) 0.1577 ( 0.2%) 0.1600 ( 0.2%) Combine redundant instructions
Where does this testcase come from? IT is just a huge series of: %tmp4110778.4030 = getelementptr %struct.SnowContext* %s, i32 0, i32 9, i32 4030 ; <i8*> [#uses=2] %tmp14809 = load i8* %tmp4110778.4030 ; <i8> [#uses=1] %tmp4110778.4031 = getelementptr %struct.SnowContext* %s, i32 0, i32 9, i32 4031 ; <i8*> [#uses=2] %tmp14810 = load i8* %tmp4110778.4031 ; <i8> [#uses=1] with different offsets? Please attach a .i file. Otherwise I'll just close this as "insane".
From the name, it looks like it's the motion estimation function for the Snow video codec, which is part of FFmpeg.
The bitcode file doesn't parse anymore.
Marking this as "fixed" again. I'll also upload a .ll file so that it can be reproduced if necessary. ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 0.6507 seconds (0.6710 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 0.1882 ( 29.7%) 0.0005 ( 2.8%) 0.1887 ( 29.0%) 0.1887 ( 28.1%) Dead Store Elimination 0.1508 ( 23.8%) 0.0019 ( 10.8%) 0.1527 ( 23.5%) 0.1536 ( 22.9%) Global Value Numbering 0.0231 ( 3.6%) 0.0015 ( 8.8%) 0.0246 ( 3.8%) 0.0307 ( 4.6%) Combine redundant instructions 0.0283 ( 4.5%) 0.0020 ( 11.4%) 0.0303 ( 4.7%) 0.0303 ( 4.5%) Value Propagation 0.0292 ( 4.6%) 0.0006 ( 3.3%) 0.0298 ( 4.6%) 0.0298 ( 4.4%) Value Propagation
Created attachment 11932 [details] Non-bitcode version of the problematic file. The others no longer parse with newer LLVMs.