LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 26815 - Clang really shouldn't have ASM tests in it
Summary: Clang really shouldn't have ASM tests in it
Status: RESOLVED FIXED
Alias: None
Product: clang
Classification: Unclassified
Component: -New Bugs
Version: unspecified
Hardware: PC Linux
Importance: P normal
Assignee: Unassigned Clang Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-02 10:44 PST by Renato Golin
Modified: 2016-03-09 12:56 PST
CC List: 14 users

See Also:
Fixed By Commit(s):


Description Renato Golin 2016-03-02 10:44:25 PST
After enough discussion on the mailing list, and what I believe is now the consensus, we should move all ASM tests to LLVM and transform all Clang tests into IR tests.

The few cases that people were arguing about:

1. NEON intrinsics need to lower to NEON instructions one-to-one.

That's not true. I love that the back-end combines vmul+vadd into vmla, and it should keep doing so. Testing for a sequence of IR instructions (or a builtin) instead is no worse.

2. Inline ASM needs to appear as-is in the asm output.

No, it doesn't. There are a number of cases where we (and GAS) apply slight transformations to the asm and output different code (e.g. add r0, -1 -> sub r0, 1). Furthermore, inline asm has a nice IR representation, which we can also test.

3. Specific code-gen issues.

If C code gets translated into IR and then assembly, we can obviously always find IR that will produce the same assembly. To get it, simply use -emit-llvm.
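
As a concrete sketch of the conversion (a hypothetical test, not one from the tree; the triple and function name are illustrative assumptions), an asm check becomes an IR check by swapping the RUN line:

// RUN: %clang_cc1 -triple armv7-none-linux-gnueabi -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>

// Previously this would be compiled with -S and checked for "vadd.f32";
// now we only check the IR Clang itself emits.
float32x2_t test_vadd_f32(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}
// CHECK-LABEL: @test_vadd_f32
// CHECK: fadd <2 x float>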

--

Having said that, the move itself will be *very* tedious, but it is necessary.

Given that most people were complaining about the ARM and AArch64 tests, I think it's only fair that we, the ARM folks, share the load in moving things. I'll copy as many people as I can into this bug so we can coordinate our efforts, but I'd like to have a much cleaner Clang test suite by 3.9.0.
Comment 1 Sanjay Patel 2016-03-02 11:20:04 PST
(In reply to comment #0)
> 1. NEON intrinsics need to lower to NEON instructions one-to-one.

We had a variation on this argument for x86's obscene pile of SSE/AVX intrinsics: in *unoptimized* code, we thought it was important that the one-to-one intrinsic -> asm mapping should be true - because debugging those things is tough enough already. This is bug 24580.

However, after:
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20151123/143974.html

...we split those up. I don't think we ever answered if there was a designated place for end-to-end tests of this sort, but based on this recent llvm-dev thread:
http://lists.llvm.org/pipermail/llvm-dev/2016-February/095275.html

maybe test-suite is the right spot?
Comment 2 Renato Golin 2016-03-02 11:42:44 PST
(In reply to comment #1)
> We had a variation on this argument for x86's obscene pile of SSE/AVX
> intrinsics: in *unoptimized* code, we thought it was important that the
> one-to-one intrinsic -> asm mapping should be true - because debugging those
> things is tough enough already. This is bug 24580.

I think this goal is counter-productive.

Imagine the case where the intrinsic is lowered to IR instructions, not builtins, and O0 generates so much wrapper code that the patterns don't match in the back-end.

You have a few not-ideal options:

1. Fudge the match to recognise more variations, which could reduce the certainty of the semantics.

2. Change Clang to reduce the wrapper code for intrinsics-to-IR, which would create special cases and increase maintenance.

3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.

All for the sake of an identity that is not really necessary.

If, for any target, this identity is mandated by the ABI, then Clang *must* generate builtins for ALL intrinsics, and it's up to the back-end (or a target-specific middle-end pass) to convert them to IR early at -O1 and above.

In that case, just checking that the IR has the builtin you want is now enough for a Clang test.
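
For example, such a test could be as small as this sketch (a hypothetical saturating-add case; the intrinsic name follows the AArch64 naming scheme and is an assumption, not taken from this bug):

// RUN: %clang_cc1 -triple arm64-none-linux-gnu -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>

// vqadd must saturate, so Clang keeps it as a target builtin rather
// than lowering it to plain IR arithmetic.
int32x2_t test_vqadd_s32(int32x2_t a, int32x2_t b) {
  return vqadd_s32(a, b);
}
// CHECK-LABEL: @test_vqadd_s32
// CHECK: call <2 x i32> @llvm.aarch64.neon.sqadd.v2i32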
Comment 3 Sanjay Patel 2016-03-02 12:22:09 PST
(In reply to comment #2)
> 3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.
> 
> All for the sake of an identity that is not really necessary.

For x86 at least, I don't think there's any concern about preventing optimizations.

We've mostly settled on doing these transforms in InstCombineCalls.cpp. That lets us write readable/debuggable C++ code to do the transform, ensures that we're doing the transform to IR very early, and doesn't interfere with the programmer's intrinsic-based intent when building at -O0.

There are cases where we transform to native IR instructions in the header files. In addition to the vector programmer's debugging concerns, consider that we have macros in headers that look like this:

/* Vector shuffle */
#define _mm256_shuffle_ps(a, b, mask) __extension__ ({ \
        (__m256)__builtin_shufflevector((__v8sf)(__m256)(a), \
                                        (__v8sf)(__m256)(b), \
                                        (mask) & 0x3, \
                                        ((mask) & 0xc) >> 2, \
                                        (((mask) & 0x30) >> 4) + 8, \
                                        (((mask) & 0xc0) >> 6) + 8, \
                                        ((mask) & 0x3) + 4, \
                                        (((mask) & 0xc) >> 2) + 4, \
                                        (((mask) & 0x30) >> 4) + 12, \
                                        (((mask) & 0xc0) >> 6) + 12); })

Granted, not having an x86-specific builtin for this case is nice... the proliferation of x86 intrinsics is unstoppable. Sane instruction set architectures may not have this problem.

Some more details about the optimization options for intrinsics are here:
http://reviews.llvm.org/D10555
Comment 4 Tim Northover 2016-03-02 17:13:01 PST
The complaint isn't really about asm tests, but about running LLVM optimisations. And unfortunately, I think that's the more burdensome one to fix. The IR we produce for 

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

is:

define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %__p0.addr.i = alloca <2 x float>, align 8
  %__p1.addr.i = alloca <2 x float>, align 8
  %__ret.i = alloca <2 x float>, align 8
  %l.addr = alloca <2 x float>, align 8
  %r.addr = alloca <2 x float>, align 8
  store <2 x float> %l, <2 x float>* %l.addr, align 8
  store <2 x float> %r, <2 x float>* %r.addr, align 8
  %0 = load <2 x float>, <2 x float>* %l.addr, align 8
  %1 = load <2 x float>, <2 x float>* %r.addr, align 8
  store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
  store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
  %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
  %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
  %add.i = fadd <2 x float> %2, %3
  store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
  %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
  ret <2 x float> %4
}

Testing that code, or even working out whether it's correct by eye, is nontrivial.

On the other hand, breaking clang tests with an LLVM change can't be fun. I wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd expect it to clean up the generated NEON code massively and it's a very stable pass with a well-defined job.
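
For reference, here is roughly what -mem2reg would leave of the function above (hand-derived, not actual tool output): every alloca is promotable, so the whole load/store dance collapses to the one interesting instruction:

define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %add.i = fadd <2 x float> %l, %r
  ret <2 x float> %add.i
}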
Comment 5 Tim Northover 2016-03-02 21:44:59 PST
Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with what clang's CodeGen actually produces, using FileCheck [[VARS]]: not ideal, but certainly no worse than what we've got now, given that it already passes).

Other than that, a related question we might want to consider is just how thorough we want these NEON tests to be. I suspect they currently have large but not complete coverage of arm_neon.h, and they take a fairly ridiculous amount of time to execute.

Would a more orthogonal subset be better, especially now that we have the end-to-end tests in the test-suite? Not sure. And not sure how we'd decide which variants are worth testing either.

We can probably delay answering those questions to later though.
Comment 6 Tim Northover 2016-03-02 22:23:03 PST
Oh, and I also thoroughly agree with Renato that we make absolutely no guarantees about intrinsics mapping to instructions in the ARM world. arm_neon.h describes semantics, not the assembly code output.
Comment 7 Renato Golin 2016-03-03 03:29:27 PST
There's also the question of how much we can really do on LLVM's side. Whenever Clang changes to support new intrinsics, so does LLVM.

Meaning that the same Clang+LLVM revision is known to work and emit the expected assembly, but not necessarily an old Clang (or an LLVM test produced with an old Clang) with a new LLVM revision.

By removing the tests from Clang, we may inadvertently cause the same effect on LLVM's side, by having old IR patterns not match because of some new optimisation pass.

In order to solve this in a reasonable fashion, the best approach is to auto-generate LLVM tests from Clang's "current" output. This can be done with a mix of the original C code, a conversion to IR, and some RUN/CHECK lines concatenated at the beginning/end.

Even though this is only valid for a small class of problems, it's probably the class that will generate most grief in LLVM testing. The rest will hopefully be more stable.

I recommend that the first step be to duplicate the tests on the LLVM side and see how they turn out. As we grow confident in them, we start removing the tests on Clang's side, and we'll be left with only the hairy cases in Clang.
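
A sketch of what such a generated LLVM-side test might look like (the RUN line and CHECK pattern are illustrative assumptions, not output from any actual script):

; Auto-generated from the IR Clang currently emits for vadd_f32.
; RUN: llc -mtriple=armv7-none-linux-gnueabi -mattr=+neon %s -o - | FileCheck %s
define <2 x float> @res(<2 x float> %l, <2 x float> %r) {
entry:
  %add.i = fadd <2 x float> %l, %r
  ret <2 x float> %add.i
}
; CHECK-LABEL: res:
; CHECK: vadd.f32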
Comment 8 Jon Roelofs 2016-03-03 09:58:37 PST
> Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with what clang's CodeGen actually produces, using FileCheck [[VARS]]: not ideal, but certainly no worse than what we've got now, given that it already passes).

Half-baked idea: this seems like something that'd be more generally useful than a one-off script. Maybe it's worth teaching FileCheck more about llvm's textual IR to make it do this transformation semi-automagically (under some new flag, of course):

<hand waving>

// RUN: %cc1 <whatever> %s -emit-llvm -o - | FileCheck --auto-ir-vars %s

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

// CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
// CHECK-NEXT: entry:
// CHECK-DAG:  %__p0.addr.i = alloca <2 x float>, align 8
// CHECK-DAG:  %__p1.addr.i = alloca <2 x float>, align 8
// CHECK-DAG:  %__ret.i = alloca <2 x float>, align 8

</hand waving>

such that the %vars get automatically turned into the appropriate name-agnostic `[[VAR:%[a-zA-Z_][a-zA-Z_0-9]+]]` pattern for defs, and `[[VAR]]` for uses.
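
Under that scheme (entirely hypothetical, since no --auto-ir-vars flag exists), the checks above would effectively match as if they had been written:

// CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
// CHECK-NEXT: entry:
// CHECK-DAG:  [[P0:%[a-zA-Z_][a-zA-Z_0-9]+]] = alloca <2 x float>, align 8
// CHECK-DAG:  [[P1:%[a-zA-Z_][a-zA-Z_0-9]+]] = alloca <2 x float>, align 8
// CHECK-DAG:  [[RET:%[a-zA-Z_][a-zA-Z_0-9]+]] = alloca <2 x float>, align 8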
Comment 9 Sanjay Patel 2016-03-03 10:21:12 PST
(In reply to comment #4)
> The complaint isn't really about asm tests, but about running LLVM
> optimisations. And unfortunately, I think that's the more burdensome one to
> fix. The IR we produce for 
> 
> float32x2_t res(float32x2_t l, float32x2_t r) {
>   return vadd_f32(l, r);
> }
> 
> is:
> 
> define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
> entry:
>   %__p0.addr.i = alloca <2 x float>, align 8
>   %__p1.addr.i = alloca <2 x float>, align 8
>   %__ret.i = alloca <2 x float>, align 8
>   %l.addr = alloca <2 x float>, align 8
>   %r.addr = alloca <2 x float>, align 8
>   store <2 x float> %l, <2 x float>* %l.addr, align 8
>   store <2 x float> %r, <2 x float>* %r.addr, align 8
>   %0 = load <2 x float>, <2 x float>* %l.addr, align 8
>   %1 = load <2 x float>, <2 x float>* %r.addr, align 8
>   store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
>   store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
>   %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
>   %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
>   %add.i = fadd <2 x float> %2, %3
>   store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
>   %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
>   ret <2 x float> %4
> }
> 
> Testing that code, or even working out whether it's correct by eye, is
> nontrivial.
> 
> On the other hand, breaking clang tests with an LLVM change can't be fun. I
> wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd
> expect it to clean up the generated NEON code massively and it's a very
> stable pass with a well-defined job.

Is it necessary to have complete checking for this kind of case? I think we just need one label and one check line to know that clang has done the intended job of translating the intrinsic to IR:

; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
; CHECK:       fadd <2 x float>

We can assume that the rest of the IR is correct and covered by tests elsewhere? (Surprisingly, I don't see any x86 equivalents for these kinds of tests...)
Comment 10 Renato Golin 2016-03-03 10:27:23 PST
(In reply to comment #9)
> Is it necessary to have complete checking for this kind of case? I think we
> just need one label and one check line to know that clang has done the
> intended job of translating the intrinsic to IR:
> 
> ; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
> ; CHECK:       fadd <2 x float>

Unfortunately, that doesn't cover everything.

For trivial semantics (vadd_f32 -> fadd <2 x float>) that should be enough, but others, like zip, vldN, and the ones that rely on shuffles, casts, promotions, and truncations, will all need a strict sequence of IR instructions with the correct dependencies between them.

However, having CHECK lines for just the sequence of instructions will be "enough" for most purposes, so we may only need full-blown variable usage for the few complex cases...
Comment 11 Sanjay Patel 2016-03-03 10:34:31 PST
(In reply to comment #10)
> For silly semantics (vadd_f32 -> fadd float) should be enough, but others,
> like zip, vldN and the ones that rely on shuffles, casts, promotions and
> truncations, will all need a strict sequence of IR instructions with the
> correct dependency between them.
> 
> However, having CHECK for just the sequence of instructions will be "enough"
> for most purposes, though, so we may only need full-blown variable usage for
> the few complex cases...

Agreed. It looks like Altivec is tested quite efficiently this way:
clang/test/CodeGen/builtins-ppc-altivec.c
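
For reference, that file checks one or two IR lines per builtin, roughly in this style (a paraphrase from memory, not lines copied from the file):

// RUN: %clang_cc1 -triple powerpc-unknown-unknown -target-feature +altivec -emit-llvm %s -o - | FileCheck %s
#include <altivec.h>

vector signed char vsc = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
vector signed char res_vsc;

void test_add(void) {
  res_vsc = vec_add(vsc, vsc); // CHECK: add <16 x i8>
}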
Comment 12 Ahmed Bougacha 2016-03-03 10:42:35 PST
What's wrong with very strict matching of the entire (unoptimized) IR sequence (à la update_llc_test_checks.py)?

We assume they're correct today so in theory we don't even need to look at the current IR.

We don't expect these to change often, and when they do, updating the output and git-diff ought to be enough.  I think I'd want to know if some unrelated IRGen commit changed the output (it's not unrelated anymore, is it?)

Yes, it's some extra work when something does change. But if we provide an easy one-line script to update the checks, the change author can either ignore it (the current status quo) or make sure they didn't break anything (strictly better than the status quo).
Comment 13 Tim Northover 2016-03-03 10:55:31 PST
I'd much prefer to check dataflow in all cases. It's very easy to end up with "fadd %a, %a" or get the subtraction the wrong way round (we did that originally, in fact, and it took us 2 years to fix properly because people had started relying on the obviously-wrong operand order).
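
A hypothetical vsub_f32 test (names and triple are assumptions for illustration) shows what dataflow checking buys: with FileCheck variables, swapped operands fail the test, whereas a bare "fsub <2 x float>" check would pass either way:

// RUN: %clang_cc1 -triple armv7-none-linux-gnueabi -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>

float32x2_t test_vsub_f32(float32x2_t l, float32x2_t r) {
  return vsub_f32(l, r);
}
// CHECK-LABEL: @test_vsub_f32
// CHECK: [[A:%.*]] = load <2 x float>, <2 x float>* %__p0.addr.i
// CHECK: [[B:%.*]] = load <2 x float>, <2 x float>* %__p1.addr.i
// Operand order matters: l - r, never r - l.
// CHECK: fsub <2 x float> [[A]], [[B]]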
Comment 14 Tim Northover 2016-03-03 10:59:03 PST
I still think having readable tests is more important than wanting to completely isolate Clang from LLVM's mid-end.
Comment 15 Sanjay Patel 2016-03-03 11:03:54 PST
(In reply to comment #12)
> What's wrong with a very strict matching of the entire (unoptimized) IR
> sequence, (à la update_llc_test_checks.py)?
> 
> We assume they're correct today so in theory we don't even need to look at
> the current IR.
> 
> We don't expect these to change often, and when they do, updating the output
> and git-diff ought to be enough.  I think I'd want to know if some unrelated
> IRGen commit changed the output (it's not unrelated anymore, is it?)
> 
> Yes, it's some extra work when something does changes. But if we provide an
> easy one-line script to update the checks, the change author can either
> ignore it (the current status quo), or make sure they didn't break anything
> (strictly better than the status quo).

Tim mentioned in comment 5 that these tests may be taking more time than they're worth. But an "update_clang_test_checks.py" and "update_opt_test_checks.py" would be extremely useful in general, and hopefully that's where we're headed. :)
Comment 16 Renato Golin 2016-03-03 11:10:13 PST
(In reply to comment #13)
> I'd much prefer to check dataflow in all cases. It's very easy to end up
> with "fadd %a, %a" or get the subtraction the wrong way round (we did that
> originally, in fact, and it took us 2 years to fix properly because people
> had started relying on the obviously-wrong operand order).

I'm not against it, just trying to be pragmatic. 

We don't have perfect tests today, so we don't need perfect tests to replace them. But if people are happy to make them better at the same time as moving, I'm game.
Comment 17 Saleem Abdulrasool 2016-03-03 11:14:58 PST
(In reply to comment #14)
> I still think having readable tests is more important than wanting to
> completely isolate Clang from LLVM's mid-end.

I fully agree with you here. The thing is that builtins are slightly more involved for this exact reason: they expect to be able to bore through the frontend and hook into the mid or back ends. I believe a similar argument has been made in favor of intrinsics over builtins in the past. As already pointed out (by you!), the data flow analysis here is relevant and important, since it is possible to break things subtly otherwise. I think that the readability of the tests directly impacts that.
Comment 18 Tim Northover 2016-03-09 12:14:30 PST
I've put a patch up for review at http://reviews.llvm.org/D17999.
Comment 19 Tim Northover 2016-03-09 12:56:46 PST
I think this has been addressed by r263048.