15086 – Tailcall optimization in PIC mode on x86 uses GOT, making lazy dynamic linking impossible

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 15086 - Tailcall optimization in PIC mode on x86 uses GOT, making lazy dynamic linking impossible

Summary: Tailcall optimization in PIC mode on x86 uses GOT, making lazy dynamic linkin...

Status:	RESOLVED FIXED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86 (show other bugs)
Version:	trunk
Hardware:	PC All

Importance:	P normal
Assignee:	Chih-Hung Hsieh

URL:
Keywords:

Depends on:
Blocks:	21420
	Show dependency tree

Reported:	2013-01-27 16:20 PST by Dimitry Andric
Modified:	2015-05-28 16:03 PDT (History)
CC List:	14 users (show)

See Also:
Fixed By Commit(s):

Attachments
Testcase for tailcall optimization problem (1.05 KB, application/x-gzip) 2013-01-27 16:20 PST, Dimitry Andric	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dimitry Andric 2013-01-27 16:20:15 PST

Created attachment 9933 [details]
Testcase for tailcall optimization problem

I had several reports about certain X.org drivers not loading properly
in FreeBSD 10.0-CURRENT, after the default compiler was switched to
clang 3.2.

This occurred only on i386, and after some debugging, it turned out to
be due to the way clang optimizes tail calls in PIC mode, in combination
with the typical way X.org loads its driver modules.

As an example, X.org loads the VMware display driver, vmware_drv.so,
which is a perfectly normal shared library.  This driver needs the
function vgaHWSaveScreen(), from the support library libvgahw.so, which
is a default component of the X.org server itself.

The loading should proceed as follows:
1) xorg-server calls dlopen(..., RTLD_LAZY) to open vmware_drv.so.  Note
   vmware_drv.so does *not* contain libvgahw.so in its dynamic section,
   to libvgahw.so is not automatically loaded here.
2) xorg-server resolves and calls the driver initialization function in
   vmware_drv.so.
3) The driver initialization function calls dlopen(..., RTLD_LAZY) to
   open libvgahw.so.  Then it resolves several functions, one of which
   is the above mentioned vgaHWSaveScreen().
4) The driver initialization function calls vgaHWSaveScreen().

However, when the driver is compiled by clang at -O2, with -fPIC, the
driver will fail to load at step 1), with:

  [  9820.637] (II) Loading /usr/local/lib/xorg/modules/drivers/vmware_drv.so
  [  9820.644] (EE) Failed to load
                    /usr/local/lib/xorg/modules/drivers/vmware_drv.so:
                    /usr/local/lib/xorg/modules/drivers/vmware_drv.so:
                    Undefined symbol "vgaHWSaveScreen"

In this particular case, the VMware driver contains a function which
tail calls vgaHWSaveScreen():

  static Bool
  VMWARESaveScreen(ScreenPtr pScreen, int mode)
  {
      VmwareLog(("VMWareSaveScreen() mode = %d\n", mode));

      /*
       * This thoroughly fails to do anything useful to svga mode.  I doubt
       * we care; who wants to idle-blank their VM's screen anyway?
       */
      return vgaHWSaveScreen(pScreen, mode);
  }

clang with -fPIC and -O2 turns this into the following:

  VMWARESaveScreen:                       # @VMWARESaveScreen
  # BB#0:                                 # %entry
          calll   .L26$pb
  .L26$pb:
          popl    %eax
  .Ltmp49:
          addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp49-.L26$pb), %eax
          movl    vgaHWSaveScreen@GOT(%eax), %eax
          jmpl    *%eax                   # TAILCALL

So it loads the address of vgaHWSaveScreen() via the GOT.  In contrast,
gcc still uses the PLT, even for a tail call:

  VMWARESaveScreen:
          pushl   %ebp
          movl    %esp, %ebp
          pushl   %ebx
          call    __i686.get_pc_thunk.bx
          addl    $_GLOBAL_OFFSET_TABLE_, %ebx
          subl    $20, %esp
          movl    12(%ebp), %eax
          movl    %eax, 4(%esp)
          movl    8(%ebp), %eax
          movl    %eax, (%esp)
          call    vgaHWSaveScreen@PLT
          addl    $20, %esp
          popl    %ebx
          leave
          ret

The problem here is that GOT relocations can only be resolved directly
at dlopen() time, and cannot be done lazily, like PLT ones.  Usually,
the PLT entries point to stubs that check if the actual function is
already resolved, and if not, call the dynamic linker guts to do
just-in-time resolving.

I am convinced clang should not use GOT relocations here, but mimic
gcc's behaviour, and use PLT relocations instead.  Though I am not sure
if a real tail call (using a jump) is still possible then.

In any case, this happens with any X.org driver or module that happens
to use a tail call to call one of these support library functions.  For
example, a similar error is generated when X.org tries to load the vesa
driver:

  [  9820.645] (II) Loading /usr/local/lib/xorg/modules/drivers/vesa_drv.so
  [  9820.660] (EE) Failed to load
                    /usr/local/lib/xorg/modules/drivers/vesa_drv.so:
                    /usr/local/lib/xorg/modules/drivers/vesa_drv.so:
                    Undefined symbol "shadowUpdatePacked"

Furthermore, these errors at loading time disappear when you add
-fno-optimize-sibling-calls to the optimization flags.  Then clang
generates the following for the VMWARESaveScreen() function:

  VMWARESaveScreen:                       # @VMWARESaveScreen
  # BB#0:                                 # %entry
	  pushl   %ebp
	  movl    %esp, %ebp
	  pushl   %ebx
	  subl    $8, %esp
	  calll   .L26$pb
  .L26$pb:
	  popl    %ebx
  .Ltmp49:
	  addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp49-.L26$pb), %ebx
	  movl    12(%ebp), %eax
	  movl    %eax, 4(%esp)
	  movl    8(%ebp), %eax
	  movl    %eax, (%esp)
	  calll   vgaHWSaveScreen@PLT
	  addl    $8, %esp
	  popl    %ebx
	  popl    %ebp
	  ret

I have attached a self-contained testcase that mimics the way X.org
loads this driver.  It can be run by unpacking the tarball, and typing
"make", assuming that is GNU make, but it should also work with BSD
make.

There is a main program that uses dlopen() to load driver.so, which in
turn uses dlopen() to load submod.so.  The submod.so library contains a
vgaHWSaveScreen() function, which is tail called from a
VMWARESaveScreen() function in driver.so.

When clang optimizes the tail call into a GOT relocation, the driver.so
library will fail to load.

Here is a transcript of a test run with gcc on a recent Ubuntu:

  $ ls -l
  total 24
  -rw-rw-r-- 1 dim dim 578 Jan 27 21:30 driver.c
  -rw-rw-r-- 1 dim dim 121 Jan 27 21:30 driver.h
  -rw-rw-r-- 1 dim dim 868 Jan 27 21:30 main.c
  -rw-rw-r-- 1 dim dim 386 Jan 27 22:22 Makefile
  -rw-rw-r-- 1 dim dim 143 Jan 27 21:30 submod.c
  -rw-rw-r-- 1 dim dim 126 Jan 27 21:30 submod.h
  $ make CC=gcc
  gcc -m32 -O2 -fPIC -DPIC -Wall -Wextra  main.c -o main -ldl
  gcc -m32 -O2 -fPIC -DPIC -Wall -Wextra  -shared driver.c -o driver.so
  gcc -m32 -O2 -fPIC -DPIC -Wall -Wextra  -shared submod.c -o submod.so
  LD_LIBRARY_PATH=. ./main
  main: Loading ./driver.so...
  main: Searching function driver_init...
  main: Calling function driver_init...
  driver_init: Loading ./submod.so...
  vgaHWSaveScreen((nil), 42)...
  main: Result 84, unloading ./driver.so...

This is what is expected.  Then a run with clang 3.2 release:

  $ make clean
  rm -f main driver.so submod.so
  $ make CC=~/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra  main.c -o main -ldl
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra  -shared driver.c -o driver.so
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra  -shared submod.c -o submod.so
  LD_LIBRARY_PATH=. ./main
  main: Loading ./driver.so...
  main: Unable to dlopen ./driver.so: ./driver.so: undefined symbol: vgaHWSaveScreen
  make: *** [all] Error 1

Now the dynamic loader complains about the missing symbol.  Next, we
turn off tail call optimization:

  $ make clean
  rm -f main driver.so submod.so
  $ make CC=~/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang CFLAGS_EXTRA=-fno-optimize-sibling-calls
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra -fno-optimize-sibling-calls main.c -o main -ldl
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra -fno-optimize-sibling-calls -shared driver.c -o driver.so
  /home/dim/Downloads/clang+llvm-3.2-x86_64-linux-ubuntu-12.04/bin/clang -m32 -O2 -fPIC -DPIC -Wall -Wextra -fno-optimize-sibling-calls -shared submod.c -o submod.so
  LD_LIBRARY_PATH=. ./main
  main: Loading ./driver.so...
  main: Searching function driver_init...
  main: Calling function driver_init...
  driver_init: Loading ./submod.so...
  vgaHWSaveScreen((nil), 42)...
  main: Result 84, unloading ./driver.so...

and then the driver loading actually works.

Comment 1 Nick Lewycky 2013-01-28 22:06:55 PST

The obvious testcase shows the different behaviour between gcc and clang:

  void vgaHWSaveScreen(int);
  void test(int x) {
    return vgaHWSaveScreen(x);
  }

Comment 2 Tijl Coosemans 2013-10-15 15:16:48 PDT

It looks like this optimisation was introduced by http://llvm.org/viewvc/llvm-project?view=revision&revision=47594

I think this is a valid bug.  When a GOT entry is accessed indirectly through a PLT entry it can be resolved when the PLT entry is first called.  When a GOT entry is accessed directly it needs to be resolved when the module is loaded.

This means that changing a PLT call into a direct GOT access like r47594 does breaks code that depends on the lazy resolving of function references.

The best you can achieve is changing "call func@PLT ; ret" into "jmp func@PLT", but the i386 PLT entries require that %ebx is set correctly as a base pointer and %ebx is callee-saved so this only works if the current function's caller is known to have already set %ebx correctly.  I think this is only the case if the current function has internal visibility which is rarely used.

Comment 3 Bill Wendling 2013-11-21 01:06:26 PST

Fixed in r195318. Patch by Dimitry Andric!

Comment 4 Dimitry Andric 2013-11-22 05:03:50 PST

Unfortunately, miscompilations because of r195318 were reported in bug 18029, so it was reverted in r195439.  So this bug must be reopened, as the fix was not entirely correct.

Bill, can you please undo the merge to the 3.4 branch?

Comment 5 Bill Wendling 2013-11-22 12:06:25 PST

(In reply to comment #4)
> Unfortunately, miscompilations because of r195318 were reported in bug
> 18029, so it was reverted in r195439.  So this bug must be reopened, as the
> fix was not entirely correct.
> 
> Bill, can you please undo the merge to the 3.4 branch?

Done. :-(

Comment 6 Chih-Hung Hsieh 2015-03-18 13:47:03 PDT

This bug also affect Android x86 targets.
The current workaround is the -fno-optimize-sibling-calls flag.
Is there a better workaround or can we fix this bug or revert http://llvm.org/viewvc/llvm-project?view=revision&revision=47594?

Comment 7 Chih-Hung Hsieh 2015-03-30 18:44:39 PDT

I tried to fix this bug but have not found a solution.
The problem is that ebx needs to be setup for the jmp through PLT to work, but ebx needs to be restored after the tail call before returning to the upper caller.

I think llvm should not optimize x86 tail calls for position independent code.
A patch like https://android-review.googlesource.com/#/c/142792
could work to build Android, so we don't need to use the -fno-optimize-sibling-calls flag.

Comment 8 Chih-Hung Hsieh 2015-05-15 15:59:20 PDT

I have a patch for this bug at http://reviews.llvm.org/D9799.

Comment 9 Reid Kleckner 2015-05-22 13:18:27 PDT

Can someone explain why the GOT reference is an issue? I'm missing something here.

Comment 10 Dimitry Andric 2015-05-22 15:02:20 PDT

(In reply to comment #9)
> Can someone explain why the GOT reference is an issue? I'm missing something
> here.

See the description.  tl;dr: This causes problems for some types of dynamically loaded objects, and it currently requires compilation with -fno-optimize-sibling-calls to make it work correctly.

Comment 11 Reid Kleckner 2015-05-22 15:04:57 PDT

(In reply to comment #10)
> (In reply to comment #9)
> > Can someone explain why the GOT reference is an issue? I'm missing something
> > here.
> 
> See the description.  tl;dr: This causes problems for some types of
> dynamically loaded objects, and it currently requires compilation with
> -fno-optimize-sibling-calls to make it work correctly.

Ah, sorry. I glossed over that and thought there was a codegen bug that got fixed by using the GOT and ECX, and we were trying to fix an issue with that. Woops. :(

Comment 12 David (Xinliang) Li 2015-05-23 12:40:55 PDT

(In reply to comment #10)
> (In reply to comment #9)
> > Can someone explain why the GOT reference is an issue? I'm missing something
> > here.
> 
> See the description.  tl;dr: This causes problems for some types of
> dynamically loaded objects, and it currently requires compilation with
> -fno-optimize-sibling-calls to make it work correctly.


The real problem is that LAZY binding only applies to function references:

RTLD_LAZY
Perform lazy binding. Only resolve symbols as the code that references them is executed. If the symbol is never referenced, then it is never resolved. (Lazy binding is only performed for function references; references to variables are always immediately bound when the library is loaded.

With tail call optimization, the generated dynamic relocation is R_386_GLOB_DAT which is bound on loading of driver.so. It will obviously fail as the library defining vgaHWSaveScreen has not been loaded.


It seems to me this is a user error to me:

driver.so explicitly calls vgaHWSaveScreen defined in submod.so, but does not depend on submod.so which seems wrong. Changing the call to use dlsym or make driver.so depend on submod.so, it will run correctly with tail call optimization.

Comment 13 Dimitry Andric 2015-05-23 13:53:01 PDT

(In reply to comment #12)
...
> It seems to me this is a user error to me:
> 
> driver.so explicitly calls vgaHWSaveScreen defined in submod.so, but does
> not depend on submod.so which seems wrong. Changing the call to use dlsym or
> make driver.so depend on submod.so, it will run correctly with tail call
> optimization.

Even if this could be considered a user error (but it is debatable, really), the problem is that my example has been derived from real-world sources, in this case X.org.  It seems infeasible to update all the copies of X.org out there. :)

That said, this bug is now more than two years old, and most users (such as FreeBSD ports, and apparently Android too) have been using the -fno-optimize-sibling-calls workaround for all that time.  So it does not seem terribly important anymore to fix.  Personally, I have given up hope a long time ago.

Comment 14 Paul Pluzhnikov 2015-05-23 23:07:08 PDT

(In reply to comment #12)

> It seems to me this is a user error to me:

User error or not, it does change the semantics of dlopen() with RTLD_LAZY, and it changes them depending on level of optimization, and depending on whether Clang was able to tail-call optimize something or not, which is kind of the worst kind of unpredictability.

I agree with Dimitry: changing PLT reference (which can use lazy resolution) into GOT reference (that must be resolved at load time) when building PIC code seems wrong.

From "man dlopen"

  RTLD_LAZY
    Perform lazy binding.  Only resolve symbols as the code that
    references them is executed. If the symbol is never referenced,
    then it is never resolved.

How would you re-write this to match current Clang behavior?

  If the symbol is never referenced, and is not recursive or is not tail-call
  optimized, then it is never resolved.

Comment 15 David (Xinliang) Li 2015-05-24 00:34:55 PDT

(In reply to comment #14)
> (In reply to comment #12)
> 
> > It seems to me this is a user error to me:
> 
> User error or not, it does change the semantics of dlopen() with RTLD_LAZY,
> and it changes them depending on level of optimization, and depending on
> whether Clang was able to tail-call optimize something or not, which is kind
> of the worst kind of unpredictability.
> 
> I agree with Dimitry: changing PLT reference (which can use lazy resolution)
> into GOT reference (that must be resolved at load time) when building PIC
> code seems wrong.
> 
> From "man dlopen"
> 
>   RTLD_LAZY
>     Perform lazy binding.  Only resolve symbols as the code that
>     references them is executed. If the symbol is never referenced,
>     then it is never resolved.
> 
> How would you re-write this to match current Clang behavior?
> 
>   If the symbol is never referenced, and is not recursive or is not tail-call
>   optimized, then it is never resolved.


Ok -- I am biased towards optimization :) 

You are right that changing the behavior of dlopen is not desirable. To make it this optimization safe, there are two ways to do:

1) introduce a new relocation type such foo@GOTPLT for GOT data references. With this relocation, the linker still produces a PLT entry for it and emits a JUMP_SLOT dynamic relocation (so that GOT entry can be initialized to point to the stub) instead of the regular GOT_DAT relocation.

2) enable the optimization in LTO mode only (where dlopen call is detected and analyzed).

For 1), there is discussion in GCC upstream about it (to support the no-PLT optimization with lazy binding).  This issue will be revisited when that is ready.

Comment 16 Tijl Coosemans 2015-05-24 06:10:23 PDT

(In reply to comment #15)
> Ok -- I am biased towards optimization :) 
> 
> You are right that changing the behavior of dlopen is not desirable. To make
> it this optimization safe, there are two ways to do:
> 
> 1) introduce a new relocation type such foo@GOTPLT for GOT data references.
> With this relocation, the linker still produces a PLT entry for it and emits
> a JUMP_SLOT dynamic relocation (so that GOT entry can be initialized to
> point to the stub) instead of the regular GOT_DAT relocation.

I suspect this won't work on x86.  Before calling a PLT entry %ebx has to be setup as a base pointer and %ebx is also callee saved so it needs to be restored after the call.  In other words, you cannot change the call into a jmp unless you know that %ebx doesn't have to be restored.

Comment 17 David (Xinliang) Li 2015-05-24 12:00:19 PDT

(In reply to comment #14)
> (In reply to comment #12)
> 
> > It seems to me this is a user error to me:
> 
> User error or not, it does change the semantics of dlopen() with RTLD_LAZY,
> and it changes them depending on level of optimization, and depending on
> whether Clang was able to tail-call optimize something or not, which is kind
> of the worst kind of unpredictability.
> 
> I agree with Dimitry: changing PLT reference (which can use lazy resolution)
> into GOT reference (that must be resolved at load time) when building PIC
> code seems wrong.
> 
> From "man dlopen"
> 
>   RTLD_LAZY
>     Perform lazy binding.  Only resolve symbols as the code that
>     references them is executed. If the symbol is never referenced,
>     then it is never resolved.
> 
> How would you re-write this to match current Clang behavior?
> 
>   If the symbol is never referenced, and is not recursive or is not tail-call
>   optimized, then it is never resolved.


Ok -- I am biased towards optimization :) 

You are right that changing the behavior of dlopen is not desirable. To make it this optimization safe, there are two ways to do:

1) introduce a new relocation type such foo@GOTPLT for GOT data references. With this relocation, the linker still produces a PLT entry for it and emits a JUMP_SLOT dynamic relocation (so that GOT entry can be initialized to point to the stub) instead of the regular GOT_DAT relocation.

2) enable the optimization in LTO mode only (where dlopen call is detected and analyzed).

For 1), there is discussion in GCC upstream about it (to support the no-PLT optimization with lazy binding).  This issue will be revisited when that is ready.
(In reply to comment #16)
> (In reply to comment #15)
> > Ok -- I am biased towards optimization :) 
> > 
> > You are right that changing the behavior of dlopen is not desirable. To make
> > it this optimization safe, there are two ways to do:
> > 
> > 1) introduce a new relocation type such foo@GOTPLT for GOT data references.
> > With this relocation, the linker still produces a PLT entry for it and emits
> > a JUMP_SLOT dynamic relocation (so that GOT entry can be initialized to
> > point to the stub) instead of the regular GOT_DAT relocation.
> 
> I suspect this won't work on x86.  Before calling a PLT entry %ebx has to be
> setup as a base pointer and %ebx is also callee saved so it needs to be
> restored after the call.  In other words, you cannot change the call into a
> jmp unless you know that %ebx doesn't have to be restored.

What you suspected is true unfortunately.

Comment 18 Reid Kleckner 2015-05-28 16:03:47 PDT

After some back and forth, we went ahead and disabled most instances of tail calls through the GOT in r238487. You can still get to it with the two supported guaranteed TCO mechanisms, -tailcallopt and musttail.

We could surface a flag for people to get back the extra performance, but I'd rather not take the maintenance burden. If they really want better perf, they should go 64-bit already.