Add support for jumping between MSVC inline asm blob labels #24903

ehsan · 2015-08-21T02:38:49Z


Bugzilla Link	24529
Resolution	WONTFIX
Resolved on	Mar 26, 2017 10:54
Version	trunk
OS	All
Depends On	#9667
CC	@majnemer,@froydnj,@zmodem,@jrmuizel,@jyknight,@rnk

Extended Description

Apologies for the crazy test case. Pay attention to the comments, and ignore the computations (they're just designed to use many registers).

$ cat test.cpp
void g(int);
void f(int x) {
int k = x + x * (x-1);
g(k/x);
int l = (x+1) * x * (x-1);
int l2 = (x+10) * x * (x-15);
int l3 = (x+100) * x * (x-150);
int m = (l2+3) * x * (x-3);
// at this point, ebx has been written to.
__asm {
mov esi, k
L2:
inc eax
}
// this expression reads ebx.
int foo = m * l2 / (x -4) * (m+ k/l* l3) * x;
g(foo);
if (m == 479594508 && k == 10000) {
__asm {
// clobber ebx.
mov ebx, m
inc k
jmp L2
}
}
g(foo);
}
$ cat driver.cpp
#include <stdio.h>
void g(int x) {
printf("%d\n", x);
}
int main() {
void f(int);
f(100);
}
$ ./bin/clang -fasm-blocks -m32 test.cpp driver.cpp -o test.out && ./test.out
100
454085520
454085520
454085520
$ ./bin/clang -O2 -fasm-blocks -m32 test.cpp driver.cpp -o test.out && ./test.out
100
454085520
748924496
748924496

The problem here is that the jmp between the asm blocks invalidates the clobber registers that we compute for each individual asm block. Basically, the compiler thinks that the first asm block doesn't clobber ebx so it will try to reuse ebx afterwards, which doesn't work the second time that leave the first block (after having jmp'ed to it from the second block.)

The generated code makes the problem obvious. I'm including it below with comments inline:

$ ./bin/clang -O2 -fasm-blocks -m32 -c test.cpp && ./bin/llvm-objdump -disassemble -x86-asm-syntax=intel test.o

test.o: file format Mach-O 32-bit i386

Disassembly of section __TEXT,__text:
__Z1fi:
0: 55 push ebp
1: 89 e5 mov ebp, esp
3: 53 push ebx
4: 57 push edi
5: 56 push esi
6: 83 ec 1c sub esp, 28
9: 8b 7d 08 mov edi, dword ptr [ebp + 8]
c: 8d 77 ff lea esi, [edi - 1]
f: 89 f8 mov eax, edi
11: 0f af c0 imul eax, eax
14: 89 45 f0 mov dword ptr [ebp - 16], eax
17: 99 cdq
18: f7 ff idiv edi
1a: 89 04 24 mov dword ptr [esp], eax
1d: e8 de ff ff ff call -34 <__Z1fi>
22: 8d 47 01 lea eax, [edi + 1]
25: 0f af f7 imul esi, edi
28: 0f af f0 imul esi, eax
2b: 89 75 e8 mov dword ptr [ebp - 24], esi
2e: 8d 47 0a lea eax, [edi + 10]
31: 0f af c7 imul eax, edi
34: 8d 4f f1 lea ecx, [edi - 15]
37: 0f af c8 imul ecx, eax
3a: 8d 41 03 lea eax, [ecx + 3]
3d: 8d 5f fd lea ebx, [edi - 3]
40: 0f af df imul ebx, edi

Write to ebx:

  43:	0f af d8 	imul	ebx, eax
  46:	89 5d ec 	mov	dword ptr [ebp - 20], ebx
  49:	8b 75 f0 	mov	esi, dword ptr [ebp - 16]

JMP target:

  4c:	40 	inc	eax

Read from ebx (oops!):

  4d:	0f af cb 	imul	ecx, ebx
  50:	8d 77 fc 	lea	esi, [edi - 4]
  53:	89 c8 	mov	eax, ecx
  55:	99 	cdq
  56:	f7 fe 	idiv	esi
  58:	89 c1 	mov	ecx, eax
  5a:	8b 45 f0 	mov	eax, dword ptr [ebp - 16]
  5d:	99 	cdq
  5e:	f7 7d e8 	idiv	dword ptr [ebp - 24]
  61:	89 c6 	mov	esi, eax
  63:	8d 87 6a ff ff ff 	lea	eax, [edi - 150]
  69:	0f af c7 	imul	eax, edi
  6c:	0f af f0 	imul	esi, eax
  6f:	8d 47 64 	lea	eax, [edi + 100]
  72:	0f af f0 	imul	esi, eax
  75:	01 de 	add	esi, ebx
  77:	0f af f7 	imul	esi, edi
  7a:	0f af f1 	imul	esi, ecx
  7d:	89 34 24 	mov	dword ptr [esp], esi
  80:	e8 7b ff ff ff 	call	-133 <__Z1fi>
  85:	81 fb 0c 08 96 1c 	cmp	ebx, 479594508
  8b:	75 11 	jne	17 <__Z1fi+9E>
  8d:	81 7d f0 10 27 00 00 	cmp	dword ptr [ebp - 16], 10000
  94:	75 08 	jne	8 <__Z1fi+9E>

Clobber ebx in the second block:

  96:	8b 5d ec 	mov	ebx, dword ptr [ebp - 20]
  99:	ff 45 f0 	inc	dword ptr [ebp - 16]
  9c:	eb ae 	jmp	-82 <__Z1fi+4C>
  9e:	89 34 24 	mov	dword ptr [esp], esi
  a1:	e8 5a ff ff ff 	call	-166 <__Z1fi>
  a6:	83 c4 1c 	add	esp, 28
  a9:	5e 	pop	esi
  aa:	5f 	pop	edi
  ab:	5b 	pop	ebx
  ac:	5d 	pop	ebp
  ad:	c3 	ret

MSVC deals with this problem by turning off all optimizations on the function as soon as it encounters a cross asm block jump. This can be observed by comparing the output of MSVC with the jmp instruction commented out vs with it:

============================
With the jmp instruction:

$ cl -O2 -c test.cpp && llvm-objdump -disassemble -x86-asm-syntax=intel test.obj
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.40629 for x86
Copyright (C) Microsoft Corporation. All rights reserved.

test.cpp

test.obj: file format COFF-i386

Disassembly of section .text$mn:
?f@@Yaxh@Z:
0: 55 push ebp
1: 8b ec mov ebp, esp
3: 83 ec 18 sub esp, 24
6: 53 push ebx
7: 56 push esi
8: 8b 45 08 mov eax, dword ptr [ebp + 8]
b: 83 e8 01 sub eax, 1
e: 0f af 45 08 imul eax, dword ptr [ebp + 8]
12: 03 45 08 add eax, dword ptr [ebp + 8]
15: 89 45 fc mov dword ptr [ebp - 4], eax
18: 8b 45 fc mov eax, dword ptr [ebp - 4]
1b: 99 cdq
1c: f7 7d 08 idiv dword ptr [ebp + 8]
1f: 50 push eax
20: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+25>
25: 83 c4 04 add esp, 4
28: 8b 4d 08 mov ecx, dword ptr [ebp + 8]
2b: 83 c1 01 add ecx, 1
2e: 0f af 4d 08 imul ecx, dword ptr [ebp + 8]
32: 8b 55 08 mov edx, dword ptr [ebp + 8]
35: 83 ea 01 sub edx, 1
38: 0f af ca imul ecx, edx
3b: 89 4d ec mov dword ptr [ebp - 20], ecx
3e: 8b 45 08 mov eax, dword ptr [ebp + 8]
41: 83 c0 0a add eax, 10
44: 0f af 45 08 imul eax, dword ptr [ebp + 8]
48: 8b 4d 08 mov ecx, dword ptr [ebp + 8]
4b: 83 e9 0f sub ecx, 15
4e: 0f af c1 imul eax, ecx
51: 89 45 f4 mov dword ptr [ebp - 12], eax
54: 8b 55 08 mov edx, dword ptr [ebp + 8]
57: 83 c2 64 add edx, 100
5a: 0f af 55 08 imul edx, dword ptr [ebp + 8]
5e: 8b 45 08 mov eax, dword ptr [ebp + 8]
61: 2d 96 00 00 00 sub eax, 150
66: 0f af d0 imul edx, eax
69: 89 55 e8 mov dword ptr [ebp - 24], edx
6c: 8b 4d f4 mov ecx, dword ptr [ebp - 12]
6f: 83 c1 03 add ecx, 3
72: 0f af 4d 08 imul ecx, dword ptr [ebp + 8]
76: 8b 55 08 mov edx, dword ptr [ebp + 8]
79: 83 ea 03 sub edx, 3
7c: 0f af ca imul ecx, edx
7f: 89 4d f8 mov dword ptr [ebp - 8], ecx
82: 8b 75 fc mov esi, dword ptr [ebp - 4]
85: 40 inc eax
86: 8b 45 f8 mov eax, dword ptr [ebp - 8]
89: 0f af 45 f4 imul eax, dword ptr [ebp - 12]
8d: 8b 4d 08 mov ecx, dword ptr [ebp + 8]
90: 83 e9 04 sub ecx, 4
93: 99 cdq
94: f7 f9 idiv ecx
96: 8b c8 mov ecx, eax
98: 8b 45 fc mov eax, dword ptr [ebp - 4]
9b: 99 cdq
9c: f7 7d ec idiv dword ptr [ebp - 20]
9f: 0f af 45 e8 imul eax, dword ptr [ebp - 24]
a3: 03 45 f8 add eax, dword ptr [ebp - 8]
a6: 0f af c8 imul ecx, eax
a9: 0f af 4d 08 imul ecx, dword ptr [ebp + 8]
ad: 89 4d f0 mov dword ptr [ebp - 16], ecx
b0: 8b 55 f0 mov edx, dword ptr [ebp - 16]
b3: 52 push edx
b4: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+B9>
b9: 83 c4 04 add esp, 4
bc: 81 7d f8 0c 08 96 1c cmp dword ptr [ebp - 8], 479594508
c3: 75 11 jne 17 <?f@@Yaxh@Z+D6>
c5: 81 7d fc 10 27 00 00 cmp dword ptr [ebp - 4], 10000
cc: 75 08 jne 8 <?f@@Yaxh@Z+D6>
ce: 8b 5d f8 mov ebx, dword ptr [ebp - 8]
d1: ff 45 fc inc dword ptr [ebp - 4]
d4: eb af jmp -81 <?f@@Yaxh@Z+85>
d6: 8b 45 f0 mov eax, dword ptr [ebp - 16]
d9: 50 push eax
da: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+DF>
df: 83 c4 04 add esp, 4
e2: 5e pop esi
e3: 5b pop ebx
e4: 8b e5 mov esp, ebp
e6: 5d pop ebp
e7: c3 ret

============================
Without the jmp instruction:

$ cl -O2 -c test.cpp && llvm-objdump -disassemble -x86-asm-syntax=intel test.obj
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.40629 for x86
Copyright (C) Microsoft Corporation. All rights reserved.

test.cpp

test.obj: file format COFF-i386

Disassembly of section .text$mn:
?f@@Yaxh@Z:
0: 51 push ecx
1: 53 push ebx
2: 55 push ebp
3: 8b 6c 24 10 mov ebp, dword ptr [esp + 16]
7: 8b c5 mov eax, ebp
9: 0f af c5 imul eax, ebp
c: 56 push esi
d: 57 push edi
e: 89 44 24 18 mov dword ptr [esp + 24], eax
12: 99 cdq
13: f7 fd idiv ebp
15: 50 push eax
16: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+1B>
1b: 8d 45 0a lea eax, [ebp + 10]
1e: 83 c4 04 add esp, 4
21: 8d 7d f1 lea edi, [ebp - 15]
24: 0f af f8 imul edi, eax
27: 8d 45 fd lea eax, [ebp - 3]
2a: 0f af fd imul edi, ebp
2d: 8d 5f 03 lea ebx, [edi + 3]
30: 0f af d8 imul ebx, eax
33: 0f af dd imul ebx, ebp
36: 89 5c 24 10 mov dword ptr [esp + 16], ebx
3a: 8b 74 24 18 mov esi, dword ptr [esp + 24]
3e: 40 inc eax
3f: 8d 45 01 lea eax, [ebp + 1]
42: 8d 4d ff lea ecx, [ebp - 1]
45: 0f af c8 imul ecx, eax
48: 8b 44 24 18 mov eax, dword ptr [esp + 24]
4c: 99 cdq
4d: 0f af cd imul ecx, ebp
50: f7 f9 idiv ecx
52: 8d 8d 6a ff ff ff lea ecx, [ebp - 150]
58: 8b f0 mov esi, eax
5a: 8b c3 mov eax, ebx
5c: 0f af f1 imul esi, ecx
5f: 8d 4d 64 lea ecx, [ebp + 100]
62: 0f af c7 imul eax, edi
65: 0f af f1 imul esi, ecx
68: 8d 4d fc lea ecx, [ebp - 4]
6b: 99 cdq
6c: f7 f9 idiv ecx
6e: 0f af f5 imul esi, ebp
71: 03 f3 add esi, ebx
73: 0f af f0 imul esi, eax
76: 0f af f5 imul esi, ebp
79: 56 push esi
7a: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+7F>
7f: 83 c4 04 add esp, 4
82: 81 fb 0c 08 96 1c cmp ebx, 479594508
88: 75 12 jne 18 <?f@@Yaxh@Z+9C>
8a: 81 7c 24 18 10 27 00 00 cmp dword ptr [esp + 24], 10000
92: 75 08 jne 8 <?f@@Yaxh@Z+9C>
94: 8b 5c 24 10 mov ebx, dword ptr [esp + 16]
98: ff 44 24 18 inc dword ptr [esp + 24]
9c: 56 push esi
9d: e8 00 00 00 00 call 0 <?f@@Yaxh@Z+A2>
a2: 83 c4 04 add esp, 4
a5: 5f pop edi
a6: 5e pop esi
a7: 5d pop ebp
a8: 5b pop ebx
a9: 59 pop ecx
aa: c3 ret

The text was updated successfully, but these errors were encountered:

ehsan · 2015-08-21T02:40:54Z

I suggest fixing this bug by clobbering all registers at the end of each inline asm block as soon as we detect a label being used inside a block that doesn't define it.

What do you think?

rnk · 2015-08-24T21:58:42Z

I suggest fixing this bug by clobbering all registers at the end of each
inline asm block as soon as we detect a label being used inside a block that
doesn't define it.

What do you think?

It's not enough. If we want to optimize, the CFG has to be correct, it can't be missing any edges. SSA requires it. This is the same problem that setjmp has, and why users have to add volatile near setjmp.

Here's a different example involving unreachable code:

int f() {
__asm L1: jmp L2
return 42;
// unreachable code?
__asm L2: jmp L1
}

LLVM will see this:

define i32 @f() {
entry:
call sideffect asm "..."()
ret i32 42

after_ret:
call sideffect asm "..."()
unreachable
}

And CFG simplification will delete after_ret, causing errors at assembly time.

jyknight · 2017-02-22T06:20:58Z

ISTM that implementing the GCC "asm goto" (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html), bug 9295, is effectively a prerequisite for this.

Both seem to call for a block-terminator variant of inline asm, with a variable number of target labels tacked on. That'd allow jumping OUT of inline asm safely, and seems like it ought to be implementable without too much complication.

The first complication MSVC style asm adds to that is that you need to parse out the target labels from the assembly instructions, rather than having them listed out for you in one place. But that's just the usual complication for msvc inline-asm.

What seems like the real complication here is that MSVC lets you jump INTO the middle of an asm block (which is not allowed in GCC-style "asm goto").

It also allows you to use C labels and asm labels interchangeably: "Both assembly instructions and goto statements can jump to labels inside or outside the __asm block." (per https://msdn.microsoft.com/en-us/library/78cxesy1.aspx).

So, if you do have some (asm or C) code jumping into the middle of an asm statement, how could you annotate that in LLVM? You could create a new llvm block right before the target asm statement and pretend you'll be jumping to the beginning of the block (even though you're jumping to an asm label in the middle) -- after all, the content of the asm is opaque to LLVM, it doesn't care if you jump to the beginning or the middle of it.

But then what if some other instructions are added to the head of that block before the target asm statement? Skipping those would be bad...

So, do you split asm with a label in the middle into multiple LLVM blocks? But that seems like a bad idea too...presumably you're allowed to depend on the machine keeping the same state when crossing a label within an asm block. But if an asm block got translated into two separate llvm asm statements, in separate basic blocks, then that would not be guaranteed.

rnk · 2017-03-01T23:21:34Z

I agree with everything in c#3. I'm pretty close to just marking this WONTFIX.

We could hypothetically implement this the way we implemented WinEH, where we have an inline asm instrution that is both a terminator and a "leader", meaning must be the first non-phi instruction like a landingpad or catchswitch. Such a block has no insertion point: passes must non insert any non-phi instructions into it. In theory, the existence of catchswitch requires that all passes handle these blocks gracefully. Any phis appearing in a leader-terminator block would be demoted to memory. This would ensure that no code is generated between two asm blobs that should be consistent. We would also need to communicate that the second block must be the layout successor, which is also pretty weird.

However, that's all quite challenging and probably not worth anyone's time. Let's go with wontfix.

llvmbot · 2021-11-27T06:25:40Z

mentioned in issue #9667

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

Quuxplusone added the wontfix Issue is real, but we can't or won't fix it. Not invalid label Jan 20, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for jumping between MSVC inline asm blob labels #24903

Add support for jumping between MSVC inline asm blob labels #24903

ehsan commented Aug 21, 2015

ehsan commented Aug 21, 2015

rnk commented Aug 24, 2015

jyknight commented Feb 22, 2017

rnk commented Mar 1, 2017

llvmbot commented Nov 27, 2021

Add support for jumping between MSVC inline asm blob labels #24903

Add support for jumping between MSVC inline asm blob labels #24903

Comments

ehsan commented Aug 21, 2015

Extended Description

============================ With the jmp instruction:

============================ Without the jmp instruction:

ehsan commented Aug 21, 2015

rnk commented Aug 24, 2015

jyknight commented Feb 22, 2017

rnk commented Mar 1, 2017

llvmbot commented Nov 27, 2021

============================
With the jmp instruction:

============================
Without the jmp instruction: