We currently have support for 'inreg' as a parameter attribute on calls. There are situations in which front ends would benefit from having an 'outreg' parameter attribute available. For example, in Clang we occasionally would like to lower structs which should be passed on the stack into scalar LLVM values (in order to avoid byval, which tends to pessimize codeine). On x86_64, for example, we currently are unable to (easily) do this in situations in which the argument to be passed on the stack occurs before all integer registers are exhausted, because turning it into a scalar value would result in the argument occupying an integer register. We could reorder the arguments to deal with this, but that would be much more complicated for the frontend.
Maybe 'onstack' is a more reasonable name.
It would be really nice if the caller could see a value and the callee could see memory, as that is what will happen in practice. Consider for example ------------------------------------- struct foo { long a; long b; char c; }; void h(struct foo *x); __attribute__((noinline)) void f(struct foo x) { h(&x); } void g(long a, long b, char c) { struct foo x = {a, b, c}; f(x); } ------------------------------------ for the function f getting a byval argument is perfect. It knows the value is in memory and is compiled to ------------------------------------ define void @f(%struct.foo* byval align 8 %x) nounwind uwtable optsize noinline ssp { entry: tail call void @h(%struct.foo* %x) nounwind optsize ret void } ------------------------------------ For the function g byval is bad because it thinks it has to create a stack location, but codegen will copy it anyway. We produce: -------------------------------------------- define void @g(i64 %a, i64 %b, i8 signext %c) nounwind uwtable optsize ssp { entry: %x = alloca %struct.foo, align 8 %a1 = getelementptr inbounds %struct.foo* %x, i64 0, i32 0 store i64 %a, i64* %a1, align 8, !tbaa !0 %b2 = getelementptr inbounds %struct.foo* %x, i64 0, i32 1 store i64 %b, i64* %b2, align 8, !tbaa !0 %c3 = getelementptr inbounds %struct.foo* %x, i64 0, i32 2 store i8 %c, i8* %c3, align 8, !tbaa !1 call void @f(%struct.foo* byval align 8 %x) optsize ret void } --------------------------------------------- The IL for f could remain something like ----------------------------------------------- define void @f(%struct.foo* onstack align 8 %x) nounwind uwtable optsize noinline ssp { entry: tail call void @h(%struct.foo* %x) nounwind optsize ret void } ----------------------------------------------- and g become --------------------------------------- define void @g(i64 %a, i64 %b, i8 signext %c) nounwind uwtable optsize ssp { entry: call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ret void } -------------------------------------- That is, for onstack arguments, a pointer in the calle is matched by a value in the caller and codegen is the one responsible for creating the address when it produces the call frame. This should let it do a better job than what we currently do for g: _g: ## @g subq $56, %rsp movq %rdi, 32(%rsp) movq %rsi, 40(%rsp) movb %dl, 48(%rsp) movq 48(%rsp), %rax movq %rax, 16(%rsp) movq 32(%rsp), %rax movq 40(%rsp), %rcx movq %rcx, 8(%rsp) movq %rax, (%rsp) callq _f addq $56, %rsp ret
Hi Rafael, suppose the example was this: void g(long a, long b, char c) { struct foo x = {a, b, c}; f(x); do_something_with(x); <- or address of x, doesn't matter } Then making a copy of x is obligatory when calling f, because f is allowed to write to the memory passed to it, but such changes are not allowed show up in x. So in order to do your proposed optimization, you first have to check that it is safe to pass the address of x directly to f, by analyzing the code following the call to check that it doesn't read from x. If that check passes, then it is OK to pass the address of x even if there were loads of complicated accesses to x before the call to f. However your syntax call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize would only handle simple cases in which it is clear what is in x at the moment of the call. Instead, couldn't you just drop the 'byval' attribute on the call, and then codegen should just pass the address of x straight through?
PS: The analysis would also have to prove that &x doesn't escape before the call to f, otherwise f might read from x (such reads aren't allowed to see any writes that f makes to x).
> Then making a copy of x is obligatory when calling f, because f is allowed > to write to the memory passed to it, but such changes are not allowed show up > in x. Yes, Codegen would still do the copy, like it does todays for simple types like i64. The problem is just the representation at the IL level. And yes, if we known that the function never modifies its argument, we can optimize the code to just drop the byval/onstack attribute. In the example I wrote that optimization was *not* done, there is still an implicit copy done by the codegen. Cheers, Rafael
Just to be clear, you don't need to know that *f* doesn't modify the argument. It is OK if it modifies the argument as long x isn't used after the call to f. So your testcase function g is fine because x isn't used after the call to f. I'm not sure why you want a new IR construct when just dropping the byval on the call should do the right thing in your testcase. Maybe you have some more complicated testcases in mind?
(In reply to comment #6) > Just to be clear, you don't need to know that *f* doesn't modify the argument. > It is OK if it modifies the argument as long x isn't used after the call to f. > So your testcase function g is fine because x isn't used after the call to f. No, it is fine because I am defining outreg as causing codegen to create a copy! > I'm not sure why you want a new IR construct when just dropping the byval on > the call should do the right thing in your testcase. Maybe you have some more > complicated testcases in mind? No, I am trying to handle the general case. As I noted, the example I gave is *not* optimized on any assumptions about f.
Are we talking at cross-purposes? As far as I can see all that is missing for your testcase is an IR level optimization that drops the 'byval' in the call for examples like the call to f in your function g. I don't see why outreg is needed.
(In reply to comment #8) > Are we talking at cross-purposes? > > As far as I can see all that is missing for your testcase is an IR level > optimization that drops the 'byval' in the call for examples like the call > to f in your function g. I don't see why outreg is needed. Because I used a simple example. I am not interested on what is the best codgen for this small testcase given knowledge of what f does, but on a more general problem. Let me give a more complex one --------------------------------------- struct foo { long a; long b; char c; }; int should_modify(void); void h(struct foo *x); __attribute__((noinline)) void f(struct foo x) { if (should_modify()) x.b = 42; h(&x); } void g(long a, long b, char c) { struct foo x = {a, b, c}; f(x); f(x); } -------------------------------------- In here the codegen for f would stays very similar to what it is today: ------------------------------------------- define void @f(%struct.foo* onstack align 8 %x) nounwind uwtable optsize noinline ssp { entry: %call = tail call i32 @should_modify() nounwind optsize %tobool = icmp eq i32 %call, 0 br i1 %tobool, label %if.end, label %if.then if.then: ; preds = %entry %b = getelementptr inbounds %struct.foo* %x, i64 0, i32 1 store i64 42, i64* %b, align 8, !tbaa !0 br label %if.end if.end: ; preds = %entry, %if.then tail call void @h(%struct.foo* %x) nounwind optsize ret void } ------------------------------------------------ but g would become ---------------------------- define void @g(i64 %a, i64 %b, i8 signext %c) nounwind uwtable optsize ssp { entry: call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does a copy in here call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does a copy in here ret void } ----------------------- I am precisely interested in the cases where we don't know what f does. In fact, I included f's body just to show that the IL for it should stays similar to what it is today. The Il for g would stay the same even if you replaced f's definition with a declaration.
I think you only avoid one copy (the first one): define void @g(i64 %a, i64 %b, i8 signext %c) nounwind uwtable optsize ssp { entry: call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does a copy in here ^ I agree you would avoid a copy here. But the call to f may modify the part of the stack you put %a, %b, %c in. Thus you will need to set the stack up again. call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does a copy in here ^ I.e. you will need to store %a, %b, %c to stack again before doing this call. ret void } My proposed optimization would also save a copy in this case (the second one).
(In reply to comment #10) > I think you only avoid one copy (the first one): > > define void @g(i64 %a, i64 %b, i8 signext %c) nounwind uwtable optsize ssp { > entry: > call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does > a copy in here > > ^ I agree you would avoid a copy here. But the call to f may modify the part > of > the stack you put %a, %b, %c in. Thus you will need to set the stack up again. I would *not* avoid a copy. Codegen is doing the copy as it does today with byval, this avoids us having two copies, on at the IL level and one at codegen. > call void @f(%struct.foo onstack align 8 {%a, %b, %c }) optsize ;codegen does > a copy in here > > ^ I.e. you will need to store %a, %b, %c to stack again before doing this call. > > ret void > } > > > My proposed optimization would also save a copy in this case (the second one). The problem is not a specific case. Add 500 calls to f in g if you want. Only the last one will be a special case if you do not know what f does. Lets keep this bug about the general case, not particular optimizations.
> The problem is not a specific case. Add 500 calls to f in g if you want. Only > the last one will be a special case if you do not know what f does. Lets keep > this bug about the general case, not particular optimizations. This is probably what I'm not getting. As far as I can see you are only dealing with a special case: the case in which you know at compile time what is in the memory at the moment of the call. Also, as far as I can see, you are only ever saving one copy: right now there is a copy of %a, %b, %c into an alloca; then codegen generates a bunch of copies from the alloca, one for each byval call. You save the copy into the alloca. If you are actually saving more copies, please explain how, since I don't see this. I'm also saving only one copy, but a different one: no matter how many 500 calls to f and g etc, codegen doesn't have to do the last copy. So if I'm correct that you are only saving one copy, and only doing it in a special case, is it worth introducing this new IR construct when using my approach you can probably in the same situations also save one copy without any new IR construct?
> This is probably what I'm not getting. As far as I can see you are only > dealing > with a special case: the case in which you know at compile time what is in the > memory at the moment of the call. no. I am just letting the caller uses values and know that the codegen will produce the copy to a stack space. A possible assembly for g is ------------------------------ _g: subq $56, %rsp # move values to callee saved regs movq %rdx, %rbx movq %rsi, %r12 movq %rdi, %r13 # first copy, set up by codegen movq %rbx, 16(%rsp) movq %r12, 8(%rsp) movq %r13, (%rsp) callq _f # second copy, set up by codegen because we don't know if f modified the stack # or not the the caller (_g) has no copy available for us to use (its copy is # in regs). movq %rbx, 16(%rsp) movq %r12, 8(%rsp) movq %r13, (%rsp) callq _f addq $56, %rsp ret ------------------------------------------ > Also, as far as I can see, you are only ever saving one copy: right now there > is a copy of %a, %b, %c into an alloca; then codegen generates a bunch of > copies > from the alloca, one for each byval call. You save the copy into the alloca. > If you are actually saving more copies, please explain how, since I don't see > this. It saves having the two copies on the stack. Currently when we pass a struct by value we always get one because of the alloca and one because of codegen. > I'm also saving only one copy, but a different one: no matter how many 500 > calls to f and g etc, codegen doesn't have to do the last copy. Changing only codegen, how would you avoid codegen doing the copy? The callers copy is in an alloca. If I complicate f's signature just a bit (say "int f(int a, struct foo x)" codegen would have to be very clever to decide where that alloca would live for it to be in a valid frame according to the abi. > So if I'm correct that you are only saving one copy, and only doing it in a > special case, is it worth introducing this new IR construct when using my > approach you can probably in the same situations also save one copy without > any new IR construct? It is not a special case. The above codegen avoids the copy for any f. It also saves stack space and lets the caller uses only values, wich are easier to handle in analysis than addresses.
> no. I am just letting the caller uses values and know that the codegen will > produce the copy to a stack space. A possible assembly for g is > > ------------------------------ > _g: > subq $56, %rsp > # move values to callee saved regs > movq %rdx, %rbx > movq %rsi, %r12 > movq %rdi, %r13 How do you know what the values are? Here it is easy because you are just passing through %a, %b, %c. What about this: void g() { struct foo x; init_x(&x); f(x); } ? As far as I can see you can't handle it: you need to know what "the values" are. That's why I'm saying that you only seem to be handling a special case. > Changing only codegen, how would you avoid codegen doing the copy? The callers > copy is in an alloca. If I complicate f's signature just a bit (say "int f(int > a, struct foo x)" codegen would have to be very clever to decide where that > alloca would live for it to be in a valid frame according to the abi. OK, I can buy this.
> How do you know what the values are? Here it is easy because you are > just passing through %a, %b, %c. What about this: > > void g() { > struct foo x; > init_x(&x); > f(x); > } > > ? > > As far as I can see you can't handle it: you need to know what "the values" > are. > That's why I'm saying that you only seem to be handling a special case. That is not one of the cases we use byval. As far as I can tell the Il can handle that perfectly today, both the caller and callee need a pointer and that is what they get. In fact, even for void g() { int x; init_x(&x); f(x); } we get (and should get) an alloca. Being more specific, the cases I find a onstack attribute interesting are the ones we currently need to use a byval. It might also be handy for cases where we don't currently use a byval but the ABI puts the values in memory anyway, but that is a smaller concern I think.
struct foo { long a; long b; char c; }; extern void init_x(struct foo *); extern void f(struct foo); void g() { struct foo x; init_x(&x); f(x); } -> define void @g() nounwind uwtable { entry: %x = alloca %struct.foo, align 8 call void @init_x(%struct.foo* %x) nounwind call void @f(%struct.foo* byval align 8 %x) nounwind ret void } Notice the byval on the call to @f. In theory it should be possible to set aside the stack frame for the call to @f before calling @init_x, and have @init_x initialize it. Then there would be no need for any copying.
Created attachment 8144 [details] Related patch adding support for not using byval on small structs for x86_64. Added a patch to Clang which does a simple optimization to avoid using byval when trying to pass some small structs on the stack. The patch includes comments about how we would like to use 'onstack' to be able to make the patch more general.
> define void @g() nounwind uwtable { > entry: > %x = alloca %struct.foo, align 8 > call void @init_x(%struct.foo* %x) nounwind > call void @f(%struct.foo* byval align 8 %x) nounwind > ret void > } > > Notice the byval on the call to @f. In theory it should be possible to set > aside the stack frame for the call to @f before calling @init_x, and have > @init_x initialize it. Then there would be no need for any copying. Yes, sorry, I misunderstood your example. It is a case the IL would get uglier. I would propose -------------------------------- define void @g() nounwind uwtable optsize ssp { entry: %x = alloca %struct.foo, align 8 call void @init_x(%struct.foo* %x) nounwind optsize %0 = load %struct.foo* %x, align 8 call void @f(%0) nounwind optsize ret void } declare void @init_x(%struct.foo*) optsize declare void @f(%struct.foo* onstack) optsize -------------------------------- I should still be possible to do the optimization you describe, but I agree the matching now has to account for the load. On the other side, if g itself has uses of x they could be done via the value %0. The example makes me realize that it is more precise to say codegen would be responsible for storing the value in the call frame, not doing a copy.