LLVM not using cmpb for unsigned char comparison #22847
Comments
I think this is intentional fallout from r195496.
Yep. Here is what IACA says specifically for Haswell:

Throughput Analysis Report
Block Throughput: 0.50 Cycles
Throughput Bottleneck: FrontEnd, PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
| Port   | 0 - DV  | 1   | 2 - D   | 3 - D   | 4   | 5   | 6   | 7   |
| Cycles | 0.2 0.0 | 0.2 | 0.5 0.5 | 0.5 0.5 | 0.0 | 0.2 | 0.2 | 0.0 |

N - port number or number of cycles resource conflict caused delay
DV - Divider pipe (on port 0)
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected

| Num Of | Ports pressure in cycles | |
Works for me. As long as we can do so in a way that doesn't regress bzip2 (which, AFAIK, no one is proposing), I have no issues with solving this differently.
I see that there was good discussion about the trade-offs in the commit thread of r195496. Summary: I wish I had access to SPEC, but I don't. I wish the LLVM test-suite were actually usable for perf testing, but it's not (despite some recent efforts toward that goal). So ad hoc perf testing ensued... for reproducibility's sake:
This is the relative perf for byte-size comparisons with today's (ext'ing) codegen as the baseline, using 8 runs for each case: Intel SNB: -4.09% (0.44% relative std dev)
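A "relative std dev" summary like the one above can be computed with a few lines of scripting. This is a hypothetical sketch, not the original test harness; the run times below are made-up placeholders chosen to land near the reported -4.09%, not the actual measurements:

```python
import statistics

def summarize(label, runs):
    """Print the mean and relative standard deviation (stdev / mean) of a set of run times."""
    mean = statistics.mean(runs)
    rel_std = statistics.stdev(runs) / mean * 100.0
    print(f"{label}: mean {mean:.3f}s, rel std dev {rel_std:.2f}%")
    return mean

# Placeholder wall-clock times (seconds) from 8 runs of each binary.
baseline  = [10.01, 10.03, 9.98, 10.02, 10.00, 9.99, 10.04, 10.01]
candidate = [9.60, 9.62, 9.58, 9.61, 9.59, 9.63, 9.57, 9.60]

b = summarize("ext'ing codegen (baseline)", baseline)
c = summarize("byte-compare codegen", candidate)
print(f"relative perf: {(c - b) / b * 100.0:+.2f}%")
```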
Based on profiles, it looks like bzip2 is just about the perfect macro test case for this codegen change. It has byte comparisons all over the hotspots in BZ2_compressBlock, mainGTU, and BZ2_blocksort. FWIW, gcc 4.9 generates the byte comparisons by default for this code.
Forgot one testing tidbit for the record: I didn't alter bzip2 to avoid any I/O (as I assume the SPEC test harness would) or modify it in any other way, and I didn't add any fancy high-res timers. Just used this:

$ time ./bzip2 < gcc-4.9-bin.tar > /dev/null
That's very much in line with the results I've seen. On my Ivy Bridge iMac, it was about 5% or so. I ran a test very similar to yours (I used the x86_64 shared cache file as test input). The SPEC harness results, both 2k and 2k6, were always in line with what I observed from the vanilla 1.0.6 package, so I don't think you're missing any interesting data points on that front.
Oh, and FWIW, when I was doing the original testing, a hand-modified assembly version just for mainGTU() got almost all of the gains.
As the discussion got somewhat fragmented between this bug and the review thread for Jim's patch, I just wanted to record what I think the correct path forward is. We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, plus probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation. Sadly, I don't have time to look at this right now, but I wanted the steps I had in mind documented somewhere. -Chandler
Thanks, Jim and Chandler. Good to know that I can repro what you're seeing without resorting to SPEC. :) I filed bug 22532 before I saw the last comment here. I think that's separate from, but related to, the issue here? I didn't dig into how the movz in that case got generated.
This has been fixed since LLVM 3.9. I've added a test case at rL355712.
Extended Description
Small reproduction:
bool f(unsigned char* a, unsigned char b) { return a[0] == b; }
LLVM translates this to
movzbl (%rdi), %eax
cmpl %esi, %eax
GCC instead does
cmpb %sil, (%rdi)
which, according to IACA, is better.