Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLVM not using cmpb for unsigned char comparison #22847

Closed
llvmbot opened this issue Feb 5, 2015 · 11 comments
Closed

LLVM not using cmpb for unsigned char comparison #22847

llvmbot opened this issue Feb 5, 2015 · 11 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Copy link
Collaborator

llvmbot commented Feb 5, 2015

Bugzilla Link 22473
Resolution FIXED
Resolved on Mar 08, 2019 11:18
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @adibiagio,@d0k,@chandlerc,@RKSimon,@qcolombet,@rotateright

Extended Description

Small reproduction:
bool f(unsigned char* a, unsigned char b) { return a[0] == b; }

LLVM translates this to
movzbl (%rdi), %eax
cmpl %esi, %eax

GCC instead does
cmpb %dil, (%rsi)

Which according to IACA is better.

@d0k
Copy link
Member

d0k commented Feb 5, 2015

I think this is intentional fallout from r195496.

@chandlerc
Copy link
Member

Yep. Here is what IACA says specifically:

Haswell:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - b.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 0.50 Cycles Throughput Bottleneck: FrontEnd, PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |

| Cycles | 0.2 0.0 | 0.2 | 0.5 0.5 | 0.5 0.5 | 0.0 | 0.2 | 0.2 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

    • instruction micro-ops not bound to a port
      ^ - Micro Fusion happened

- ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |

| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movzx eax, byte ptr [rdi]
| 1 | 0.2 | 0.2 | | | | 0.2 | 0.2 | | | cmp eax, esi
Total Num Of Uops: 2

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - a.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 0.50 Cycles Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |

| Cycles | 0.2 0.0 | 0.2 | 0.5 0.5 | 0.5 0.5 | 0.0 | 0.2 | 0.2 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

    • instruction micro-ops not bound to a port
      ^ - Micro Fusion happened

- ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |

| 2^ | 0.2 | 0.2 | 0.5 0.5 | 0.5 0.5 | | 0.2 | 0.2 | | CP | cmp byte ptr [rdi], sil
Total Num Of Uops: 2

So, same number of microops, but one fewer instruction, and we're no longer frontend-stalled.

Ivybridge and Sandybridge show the same.

The interesting thing is that while these have the same throughput, the two instruction form puts more pressure on the frontend, and we are seeing lots of frontend stalls in our benchmarking.

Back when Jim and I looked into this, we were trying to kill mystery stalls around cmpb. I don't think we ever adequately explained what we were seeing in bzip2, and I think Quentin made a later change to CodeGenPrep that more likely addressed the underlying issue.

So I think we should probably just back out r195496. While there was some problem with bzip2, this seemed even then to be an overly heavy hammer and now we have some more concrete evidence that its hurting other cases. I'll point the review thread here and loop everyone in and see if there is agreement.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Feb 5, 2015

Works for me. As long as we can do so in a way that doesn't regress bzip2 (which no one is AFAIK proposing), I have no issues solving this differently.

@rotateright
Copy link
Contributor

I see that there was good discussion about the trade-offs in the commit thread of r195496:
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/167221

Summary:
Based on Agner / Intel / AMD, we'd expect that (some large subset of?) P6-based Intel cores will be hurt by using byte comparisons. On the other hand, most recent AMD cores should be helped slightly by the reduction in instructions and code size. This is because Intel tracks subregisters for renaming while AMD tracks whole registers.

I wish I had access to SPEC, but I don't. I wish LLVM test-suite was actually usable for perf testing, but it's not (despite some recent efforts towards that goal). So ad hoc perf testing ensued...for reproducability's sake:

  1. I grabbed the bzip2 1.06 source from:
    http://www.bzip.org/downloads.html

  2. I happened to have the Mac gcc 4.9 tarball around from:
    http://hpc.sourceforge.net/

  3. I gunzip'd the gcc tarball to use as my test file for compression.

  4. I built bzip2 using its default settings (-O2 as of this writing) with relatively recent clang trunk built from -r227025.

  5. I removed the ext'ifying code that was added with r195496 and compared bzip2 perf changes on Intel Sandy Bridge and AMD Jaguar (btver2).

This is the relative perf for byte-size comparisons with today's (ext'ing) codegen as the baseline using 8 runs for each case:

Intel SNB: -4.09% (0.44% relative std dev)
AMD Jaguar: -0.26% (but with 1.21% relative std dev...so in the noise)

@rotateright
Copy link
Contributor

Based on profiles, it looks like bzip2 is just about the perfect macro test case for testing this codegen change. It has byte comparisons all over the hotspots in BZ2_compressBlock, mainGTU, and BZ2_blocksort.

FWIW, gcc 4.9 is generating the byte comparisons by default for this code.

@rotateright
Copy link
Contributor

Forgot one testing tidbit for the record: I didn't alter bzip2 to avoid any I/O (as I assume the SPEC test harness would) or in any other way, and I didn't add any fancy high-res timers. Just used this:

$ time ./bzip2 < gcc-4.9-bin.tar > /dev/null

@llvmbot
Copy link
Collaborator Author

llvmbot commented Feb 10, 2015

That's very in line with the results I've seen. On my Ivy Bridge iMac, it was about 5% or so. I ran a test very similar to yours (I used the x86_64 shared cache file as test input).

The SPEC harness results, both 2k and 2k6, were always also in line with what I observed looking at the vanilla 1.0.6 package, so I don't think you're missing any interesting datapoints on that front.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Feb 10, 2015

Oh, and FWIW, when I was doing the original testing, a hand-modified assembly version just for mainGTU() got almost all of the gains.

@chandlerc
Copy link
Member

As the discussion got somewhat fragmented between this bug and the review thread for Jim's patch, I just wanted to relay what I think the correct path forward is.

We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.

Sadly, I don't have time to look at this right now, but I wanted the steps that I at least had in mind documented somewhere.

-Chandler

@rotateright
Copy link
Contributor

Thanks, Jim and Chandler. Good to know that I can repro what you're seeing without resorting to SPEC. :)

I filed bug 22532 before I saw the last comment here. I think that's a separate but related issue to what we have here? I didn't dig to see how the movz in that case got generated.

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 8, 2019

This has been fixed since 3.9:

https://godbolt.org/z/mG2uo2

I've added a test case at rL355712

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

5 participants