Created attachment 18740
Compile for AMD64 with -O2

The attached (unreduced) test case needs 1.6GB peak RAM and ~3min on my laptop after r296416; before, it took 100MB and 1s.
I think critical edge splitting is doing the right thing here, and splitting the critical edges enables (early) tail duplication of the computed goto. Now, in theory, that also seems like the right thing to do, since this allows the computed gotos to be better predicted, which is pretty much the point of the whole exercise. But it looks like actually performing the tail duplication is slow, and regalloc can't really handle the result well:

  73.0840 ( 43.7%)   0.1280 ( 12.3%)  73.2120 ( 43.5%)  73.2136 ( 43.5%)  Simple Register Coalescing
  37.5720 ( 22.5%)   0.6480 ( 62.3%)  38.2200 ( 22.7%)  38.2208 ( 22.7%)  Tail Duplication
  32.3400 ( 19.4%)   0.1080 ( 10.4%)  32.4480 ( 19.3%)  32.4459 ( 19.3%)  Eliminate PHI nodes for register allocation

Passing -disable-early-taildup makes this go away, so regalloc is fine with the edges just being split, without duplication:

   3.5280 ( 66.2%)   0.0000 (  0.0%)   3.5280 ( 65.3%)   3.5313 ( 65.4%)  Branch Probability Basic Block Placement
   0.4120 (  7.7%)   0.0040 (  5.3%)   0.4160 (  7.7%)   0.4154 (  7.7%)  Simple Register Coalescing
   0.3080 (  5.8%)   0.0120 ( 15.8%)   0.3200 (  5.9%)   0.3196 (  5.9%)  Machine Block Frequency Analysis

I'm not really sure what we want to do here, but I'd say the thing that should be bailing early here is taildup, not edge splitting. Adding Kyle as the resident taildup expert.
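For anyone following along without the attachment, here is a minimal sketch of the computed-goto dispatch pattern under discussion. It is not reduced from the attached test case; the opcode names and the `run` function are made up for illustration, and it relies on the GNU C "labels as values" extension. As far as I understand it, the compiler tends to funnel handlers like these through one shared indirect branch, and tail duplication is what copies that branch back into each handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration; not from the attached test case. */
enum { OP_INC, OP_DEC, OP_HALT };

/* Tiny bytecode interpreter using computed goto
 * (GNU C "labels as values" extension, accepted by GCC and Clang). */
int run(const unsigned char *code)
{
    /* One label address per opcode, indexed by opcode value. */
    static void *const dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;

    assert(code != NULL);

    /* Jump to the handler for the first opcode. */
    goto *dispatch[*code++];

do_inc:
    acc++;
    goto *dispatch[*code++];   /* each handler ends in its own indirect jump */
do_dec:
    acc--;
    goto *dispatch[*code++];
do_halt:
    return acc;
}
```

With three handlers this is obviously harmless. The blowup reported here presumably needs a function with many handlers and many values live across the dispatch, since every duplicated indirect branch gets an edge to every possible target.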
Kyle, did you have a chance to look into this?
Unblocking 5.0.0 as there seems to be no interest in fixing :-/